Confluence Cluster Issues
This article applies to clustered Confluence 5.4 or earlier.
Please familiarize yourself with the documentation:
- A good table of contents can be found at Confluence Clustering Overview. Come back to this page if you need to find something related to Confluence clustering.
- Start with the Technical Overview of Clustering in Confluence. This holds most of what you need to know.
- Consider the Cluster Checklist. This will give you a good idea of what you need to prepare, if you are serious about choosing clustered.
- Lastly make sure you are familiar with Cluster Troubleshooting, which covers the most common scenarios.
Cluster Safety Check
The Cluster Safety Check is scheduled to run once every 30 seconds. In essence, it:
- Fetches the cluster safety number from the database and compares it to the value it has cached.
- If they are the same, generates a new number, caches it, and updates the value in the database.
- Waits 30 seconds and repeats from step 1.
If the database number and the cached value are different, it throws a cluster panic and effectively prevents users from accessing the instance.
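The loop above can be sketched as follows. This is a minimal illustration only: the shared database row, the `ClusterPanic` exception, and the method names are hypothetical stand-ins for Confluence's internals.

```python
import random

class ClusterPanic(Exception):
    """Raised when the cached safety number no longer matches the database."""

class ClusterSafetyJob:
    def __init__(self, db):
        self.db = db                       # stand-in for the shared database row
        self.cached = db["safety_number"]  # value cached on this node

    def run_once(self):
        current = self.db["safety_number"]     # step 1: fetch from the database
        if current != self.cached:             # mismatch: another instance wrote it
            raise ClusterPanic("cluster safety number changed outside this cluster")
        new_number = random.getrandbits(32)    # step 2: generate a new number,
        self.cached = new_number               # cache it,
        self.db["safety_number"] = new_number  # and update the database
        # step 3: the scheduler waits 30 seconds and calls run_once() again

# Normal operation: repeated runs succeed.
db = {"safety_number": 0}
job = ClusterSafetyJob(db)
job.run_once()
job.run_once()

# A rogue instance updates the database behind our back: panic on the next run.
db["safety_number"] = -1
try:
    job.run_once()
except ClusterPanic:
    print("cluster panic")
```

Note that the node only detects the intruder when it next compares values, so a foreign write can go unnoticed for up to one 30-second cycle.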
It does this because it believes that some other instance (one that is not part of its cluster) is accessing the database and updating the cluster safety number. If two instances update the database without each other's knowledge, that is of course dangerous: the caches in both instances will get out of sync with the database, and users can potentially overwrite each other's changes.
There are a few well-known situations in which this can happen. See Confluence will not start due to fatal error in Confluence cluster:
- Another instance (for example a test instance) has been started accidentally and points to the same database as your production instance.
- You have accidentally deployed the application twice (so that it starts up twice in the same application server). For example, you have referenced Confluence in both your server.xml and a confluence.xml file within the Tomcat configuration (thus starting up Confluence twice).
- Your database is taking a long time to commit (e.g. it is running a backup that lasts over 30 seconds and freezes commits). In this case the cluster safety job sends a new value to the database, but when it checks again 30 seconds later, the previous commit still hasn't occurred, so it fetches the previous value, which is out of sync with its cached value.
- You are running a Confluence cluster and one of the nodes leaves the cluster due to problems communicating with the other nodes.
This last one is the least common, but the hardest to debug.
The cluster safety check should only run once per cluster: if one node runs the cluster safety job, the other nodes will not (due to job synchronisation). However, if a node leaves the cluster, it no longer communicates with the other nodes, nor does it synchronise its jobs. Thus both the remaining cluster and the node that left will run the cluster safety job independently, and since they don't share a cache, their cached values and the value in the database will soon be out of sync, triggering a cluster panic.
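The split scenario can be simulated with a small self-contained sketch (the class, the shared dictionary, and the deterministic counter standing in for the random safety number are all illustrative, not Confluence's actual code): two jobs that no longer share a cache both write the shared safety number, and whichever runs second sees a mismatch.

```python
import itertools

counter = itertools.count(1)  # deterministic stand-in for the random safety number

class SafetyJob:
    def __init__(self, db):
        self.db = db
        self.cached = db["safety_number"]

    def run_once(self):
        if self.db["safety_number"] != self.cached:
            return "panic"          # cluster panic: someone else wrote the value
        self.cached = next(counter)
        self.db["safety_number"] = self.cached
        return "ok"

db = {"safety_number": 0}
node_a = SafetyJob(db)   # node still in the cluster
node_b = SafetyJob(db)   # node that left; it no longer shares node_a's cache

print(node_a.run_once())  # "ok"    - writes 1 to the database, caches 1
print(node_b.run_once())  # "panic" - node_b still has 0 cached, the database holds 1
```

While the nodes are in the same cluster, job synchronisation ensures only one of them runs the job, so this race never arises; the divergence is purely a symptom of the split.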
Nodes communicate via UDP using unicast (they first use multicast to discover a new node; once discovered, they communicate via unicast). If there are network problems, the nodes may experience communication delays:
2010-02-12 02:48:48,811 WARN [Logger@9247854 3.3.1/389] [Coherence] log 2010-02-12 02:48:48.811 Oracle Coherence GE 3.3.1/389 <Warning> (thread=PacketPublisher, member=1): Experienced a 13801 ms communication delay (probable remote GC) with Member(Id=2, Timestamp=2010-02-11 22:00:14.65, Address=184.108.40.206:8088, MachineId=23897, Location=process:21524@RTPCPAPWKI02); 88 packets rescheduled, PauseRate=0.0030, Threshold=387
and eventually timeout:
2010-02-12 03:03:58,705 WARN [Logger@9247854 3.3.1/389] [Coherence] log 2010-02-12 03:03:58.705 Oracle Coherence GE 3.3.1/389 <Warning> (thread=PacketPublisher, member=1): Timeout while delivering a packet; removing Member(Id=2, Timestamp=2010-02-11 22:00:14.65, Address=220.127.116.11:8088, MachineId=23897, Location=process:21524@RTPCPAPWKI02)
2010-02-12 03:03:58,734 INFO [Cluster:EventDispatcher] [confluence.cluster.coherence.TangosolClusterManager] memberLeft Member has left cluster: Member(Id=2, Timestamp=2010-02-12 03:03:58.705, Address=18.104.22.168:8088, MachineId=23897, Location=process:21524@RTPCPAPWKI02)
As you can see, the timeout is quickly followed by the removal of the node from the cluster (as it is assumed that the node is down).
The timeout value is 60 seconds by default in production installations. The heartbeat runs once every second, so a node has to fail to respond to heartbeats for 60 seconds before the cluster determines that it is dead. More details can be found in Coherence's packet delivery documentation.
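The detection logic amounts to counting missed one-second heartbeats against the timeout. The one-second interval and 60-second production default come from the text above; the function itself is a simplified sketch, not Coherence's actual implementation:

```python
HEARTBEAT_INTERVAL_SECS = 1   # a heartbeat is sent once per second
TIMEOUT_SECS = 60             # default packet-delivery timeout in production

def is_member_dead(seconds_since_last_response):
    """A member is declared dead once it has not responded for the full timeout."""
    missed_heartbeats = seconds_since_last_response // HEARTBEAT_INTERVAL_SECS
    return missed_heartbeats * HEARTBEAT_INTERVAL_SECS >= TIMEOUT_SECS

print(is_member_dead(13))  # False - a 13s delay (e.g. a long GC pause) only logs a warning
print(is_member_dead(60))  # True  - after 60 missed heartbeats the member is removed
```

This is why the first log excerpt above is only a warning about a 13801 ms delay, while the second, after the full timeout has elapsed, actually removes the member.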
Once a node is removed from the cluster (but is still up), one of the instances (i.e. the remaining cluster or the node that left) will eventually (within 30 seconds) trigger a cluster panic. Once an instance triggers a cluster panic, it updates the database value to ensure that all other nodes also panic.
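In this scheme, forcing the other nodes to panic is simple: the panicking instance writes a value that no other node can have cached, so every node's next comparison fails. The sketch below is illustrative only; the sentinel value and variable names are hypothetical, not what Confluence actually writes.

```python
def trigger_cluster_panic(db):
    # The panicking instance overwrites the shared safety number with a value
    # no node has cached, so every other node's next comparison fails too.
    db["safety_number"] = -1  # sentinel value; illustrative only

db = {"safety_number": 42}
surviving_node_cache = 42     # what another, still-running node has cached
trigger_cluster_panic(db)
print(db["safety_number"] != surviving_node_cache)  # True - that node panics on its next check
```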
In Confluence clusters there is no notion of a master or slave node, so there is no way of knowing which node should be left running; to be safe, all nodes panic. Even if only one node of a four-node cluster leaves, all four nodes will panic.
Coherence War Stories
There are a number of reasons why communication between two nodes can fail. These are well documented in Coherence's war stories PDF.
Certain scheduled jobs are run by only one node in the cluster; others are run by all nodes. Examples of jobs that run only once per cluster include:
- Daily Report Mail
- Cluster Safety Check
Examples of jobs that run on every node:
- Incremental Indexing
- Index Optimisation