Introduction

A mechanism was added in Confluence 2.3 and above to ensure database consistency when running multiple cluster nodes against the same database. This is called the cluster safety mechanism, and is designed to ensure that your wiki cannot become inconsistent because updates by one user are not visible to another. A failure of this mechanism is a fatal error in Confluence and is called cluster panic.

Because the cluster safety mechanism helps prevent data inconsistency whenever two copies of Confluence are running against the same database, it is enabled in all instances of Confluence, not just clusters.

How cluster safety works

A scheduled task, ClusterSafetyJob, runs every 30 seconds in Confluence. In a cluster, this job is run only on one of the nodes. The scheduled task operates on a safety number – a randomly generated number that is stored both in the database and in the distributed cache used across a cluster. It does the following:

  1. Generate a new random number
  2. If a safety number already exists in both the database and the cache, compare the two existing safety numbers.
  3. If the numbers differ, publish a ClusterPanicEvent. Currently in Confluence, this causes the following to happen:
    • disable all access to the application
    • disable all scheduled tasks
    • update the database safety number to a new value, which will cause all nodes accessing the database to fail.
  4. If the numbers are the same or aren't set yet, update the safety numbers:
    • update the database safety number to the new random number
    • set the cache safety number to the new random number
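
The steps above can be sketched as a small simulation. This is a minimal illustration in Python, not Confluence's actual implementation: `SafetyStore`, `cluster_safety_check`, and `ClusterPanic` are hypothetical names, and plain in-memory objects stand in for the database and the distributed cache.

```python
import random


class ClusterPanic(Exception):
    """Hypothetical stand-in for publishing a ClusterPanicEvent."""


class SafetyStore:
    """Hypothetical in-memory stand-in for the database or the cache."""

    def __init__(self):
        self.safety_number = None  # None means "not set yet"


def cluster_safety_check(database, cache, rng=random):
    """Run one pass of the check described above (illustrative only)."""
    # Step 1: generate a new random number.
    new_number = rng.randrange(1, 2**31)

    # Step 2: compare only when both sides already hold a number.
    both_set = (
        database.safety_number is not None and cache.safety_number is not None
    )

    # Step 3: a mismatch means another instance has written to the database.
    if both_set and database.safety_number != cache.safety_number:
        # Changing the database number makes the next check on other nodes fail too.
        database.safety_number = new_number
        raise ClusterPanic("database and cache safety numbers differ")

    # Step 4: the numbers match or are not set yet, so store the new one in both.
    database.safety_number = new_number
    cache.safety_number = new_number
    return new_number
```

To see why two unclustered instances panic, create one `SafetyStore` for the shared database and a separate one for each instance's cache. The first instance's check sets both numbers, the second instance's check overwrites the database number because its own cache is still empty, and the first instance's next check then raises `ClusterPanic`.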

How to fix it

If cluster panic occurs, two or more instances of Confluence were updating the same database without being in the same cluster. Possible reasons (with associated solutions) for this are:

  • (most common) Two instances of Confluence have been started in your application server.
    • Solution: Check your application server configuration to make sure that two copies don't start up.
  • Two copies of your application server are running. Sometimes starting an application server twice will result in two processes running, even though only one can be accessed over the network.
    • Solution: Check a list of running processes (for example, with ps) and make sure your application server is only running once.
  • In a cluster, there is a networking failure between nodes in the cluster.
    • Solution: Check that multicast traffic is being transmitted successfully, and that the network between your nodes has low latency (under 100 ms).

In all cases, when starting Confluence after a cluster panic, you must ensure all cluster nodes have been shut down completely. If necessary, use tools like ps and kill to get a list of Java processes and terminate them manually.
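
As a rough aid for that check, the sketch below lists running Java processes by parsing the output of `ps`. It assumes a Unix-like system where `ps -eo pid,args` is available. The function names `find_java_processes` and `list_java_processes` are hypothetical, and matching on an executable named `java` is a simplifying assumption that may not fit every installation.

```python
import os
import subprocess


def find_java_processes(ps_output):
    """Return (pid, command line) pairs for lines whose executable is named java."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row printed by ps
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, command = int(parts[0]), parts[1]
        executable = command.split()[0]
        if os.path.basename(executable) == "java":
            matches.append((pid, command))
    return matches


def list_java_processes():
    """Ask ps for every process and keep the Java ones (Unix-like systems only)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return find_java_processes(result.stdout)
```

Calling `list_java_processes()` returns the process IDs together with their full command lines, so you can confirm which entries belong to Confluence before terminating anything with kill.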

Technical details

The cluster safety number in the database is stored in the CLUSTERSAFETY table. This table has just one row: the current safety number.
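
A minimal sketch of inspecting that table follows. It uses an in-memory SQLite database purely as a stand-in for the real Confluence database. The column name `SAFETYNUMBER`, the sample value, and the helper `read_safety_rows` are assumptions for illustration; only the table name comes from the description above.

```python
import sqlite3


def read_safety_rows(connection):
    """Return every row in CLUSTERSAFETY; the table is expected to hold one."""
    return connection.execute("SELECT * FROM CLUSTERSAFETY").fetchall()


# In-memory SQLite stand-in; the column name and value are assumptions.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE CLUSTERSAFETY (SAFETYNUMBER INTEGER)")
connection.execute(
    "INSERT INTO CLUSTERSAFETY (SAFETYNUMBER) VALUES (?)", (123456,)
)
rows = read_safety_rows(connection)
```

Against a real database, the same `SELECT * FROM CLUSTERSAFETY` query would be run through whichever client or driver matches your database, with no need to know the column names in advance.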