Recovering from a Bitbucket Data Center cluster split-brain


Platform notice: Data Center only - This article only applies to Atlassian products on the Data Center platform.

This KB was created for the Data Center version of the product. Data Center KBs for features that are not Data Center-specific may also work for Server versions of the product, but they have not been tested. Support for Server* products ended on February 15, 2024. If you are using a Server product, review your migration options on Atlassian's Server end-of-support announcement page.

*Except Fisheye and Crucible

Summary

When running a Data Center cluster, you can encounter a split-brain issue when communication between the nodes breaks down. When this happens, the nodes that can still communicate will report that they've lost connection to the missing node via failed "heartbeats":

2023-01-19 10:14:48,808 WARN  [hz.hazelcast.cached.thread-128]  c.h.i.cluster.impl.MembershipManager [127.0.0.2]:5701 [bitbucket-cluster-name] [3.12.12] Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f is suspected to be dead for reason: Suspecting Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f because it has not sent any heartbeats since 2023-01-19 10:13:48.443. Now: 2023-01-19 10:14:48.806, heartbeat timeout: 60000 ms, suspicion level: 1.00

2023-01-19 10:14:50,501 WARN  [hz.hazelcast.cached.thread-1]  c.h.i.c.impl.ClusterHeartbeatManager [127.0.0.3]:5701 [bitbucket-cluster-name] [3.12.12] Suspecting Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f because it has not sent any heartbeats since 2023-01-19 10:13:48.410. Now: 2023-01-19 10:14:50.499, heartbeat timeout: 60000 ms, suspicion level: 1.00

2023-01-19 10:14:50,231 WARN  [hz.hazelcast.cached.thread-18]  c.h.i.c.impl.ClusterHeartbeatManager [127.0.0.4]:5701 [bitbucket-cluster-name] [3.12.12] Suspecting Member [127.0.0.1]:5701 - 52252540-6226-4e19-8e28-73f880aae99f because it has not sent any heartbeats since 2023-01-19 10:13:48.450. Now: 2023-01-19 10:14:50.229, heartbeat timeout: 60000 ms, suspicion level: 1.00

The result is that the node(s) are ejected from the cluster and form their own sub-cluster, which continues to serve user requests.

This is known as cluster split-brain and can happen on any node (for example, if you restart a node you may see the heartbeat message above on the same node or on a different node). 
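If you want to quantify how often each member was suspected, a minimal sketch along the following lines can tally the Hazelcast suspicion warnings in the application log. The log path is an assumption; point it at $BITBUCKET_HOME/log/atlassian-bitbucket.log on each node.

  #!/usr/bin/env python3
  # Minimal sketch: tally Hazelcast heartbeat-suspicion warnings in the
  # Bitbucket application log. The log path is an assumption; adjust it to
  # your installation's $BITBUCKET_HOME/log directory.
  import re
  from collections import Counter
  from pathlib import Path

  LOG_FILE = Path("/var/atlassian/application-data/bitbucket/log/atlassian-bitbucket.log")

  # Matches the "Suspecting Member [host]:port" fragment from the warnings above.
  PATTERN = re.compile(r"Suspecting Member \[(?P<host>[^\]]+)\]:(?P<port>\d+)")

  suspected = Counter()
  with LOG_FILE.open(encoding="utf-8", errors="replace") as log:
      for line in log:
          match = PATTERN.search(line)
          if match:
              suspected[f"{match['host']}:{match['port']}"] += 1

  for member, count in suspected.most_common():
      print(f"{member} suspected {count} time(s) - check connectivity to this node")

Running this on each node shows which members that node stopped hearing from, which helps map out the two sides of the partition.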

Diagnosis

Cluster split-brain can have a number of causes; it cannot be wholly prevented or immediately detected. Here are a few ways to diagnose whether you are experiencing a split-brain scenario, ordered from the easiest identifier to broader instance behavior.

Node missing from Clustering page

If you suspect you have a split-brain scenario, navigate to Administration > Clustering and take note of the nodes listed.

If a node is missing, check for the following:

  • Connect to the node via command line and ensure the node is up and the Bitbucket service is running.
  • If your configuration allows you to bypass the proxy/load balancer, connect to the node via UI and navigate to Administration > Clustering to identify what nodes are listed.

If the missing node is up and running and its clustering page shows only itself or a subset of nodes, you are likely in a split-brain situation.
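To automate that comparison, a sketch like the one below asks every node directly (bypassing the load balancer) which members it can see and flags mismatched views. It assumes the admin/cluster REST resource is reachable by a sysadmin account; the node URLs, credentials, and exact response field names are placeholders to adjust for your environment and Bitbucket version.

  #!/usr/bin/env python3
  # Minimal sketch: compare each node's view of the cluster by calling the
  # admin/cluster REST resource on every node directly. Node URLs, the
  # credentials, and the response field names are assumptions.
  import base64
  import json
  import urllib.request

  NODES = ["http://node1.internal:7990", "http://node2.internal:7990"]  # placeholders
  USER, PASSWORD = "admin", "admin-password"  # placeholder sysadmin credentials

  def cluster_view(base_url):
      """Return the set of member addresses this node believes are in the cluster."""
      request = urllib.request.Request(f"{base_url}/rest/api/latest/admin/cluster")
      token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
      request.add_header("Authorization", f"Basic {token}")
      with urllib.request.urlopen(request, timeout=10) as response:
          payload = json.load(response)
      return {str(node.get("address")) for node in payload.get("nodes", [])}

  views = {}
  for node in NODES:
      try:
          views[node] = cluster_view(node)
      except Exception as exc:  # node down, unreachable, or auth failure
          print(f"{node}: unreachable ({exc})")

  if views and len(set(map(frozenset, views.values()))) > 1:
      print("Nodes report different member lists - likely split-brain:")
  for node, members in views.items():
      print(f"  {node} sees {len(members)} member(s): {sorted(members)}")

If every reachable node reports the same full member list, the cluster view is consistent; disjoint or partial lists confirm the split.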

Inconsistent state / stale caches

The cluster will become inconsistent, as your main cluster and the split-brain sub-cluster are now running independently:

  • Logging changes may only apply to the sub-cluster
  • Pull request rescopes may execute only on the sub-cluster
    • Depending on when the split happened, rescopes may actually run multiple times
  • Crowd/LDAP sync may run simultaneously on both clusters, causing inconsistency in users and groups
  • User/group renames will become stale because cache syncing is broken
  • Deleted projects and repositories will still be visible on the other cluster

These are just a few of the symptoms you may experience. Overall, the cluster will be in an inconsistent state, and the cache synchronization that keeps all nodes updated will break down.

Cause

The cause is network partitioning: the network is split in a way that leaves one set of nodes unable to see the other. This network failure can have many underlying reasons, so your network team will need to investigate why the nodes lost communication.

Solution

To recover from a cluster split-brain, do the following:

  1. Verify that network connectivity between the nodes is in a good state.
    • Double-check the hazelcast.network.tcpip.members parameter (for tcp_ip node discovery) in bitbucket.properties and confirm that all node IPs are listed (see the example after this list).
    • If using hazelcast.network.multicast=true (for multicast node discovery), verify that the same multicast address is being used by all nodes. Investigate with your networking team's multicast expert.
  2. Restart the nodes that left the cluster one at a time, and ensure that each one rejoins the cluster (go to Administration > Clustering) before starting the next node.
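For reference, a TCP/IP discovery configuration in bitbucket.properties might look like the following sketch. The IP addresses are placeholders, and the same member list (including each node's own address) must appear on every node:

  # bitbucket.properties - Hazelcast node discovery (addresses are placeholders)
  hazelcast.network.tcpip=true
  hazelcast.network.tcpip.members=192.168.0.10:5701,192.168.0.11:5701,192.168.0.12:5701

  # Alternative: multicast discovery, where every node must use the same
  # multicast address.
  # hazelcast.network.multicast=true

After correcting the configuration, proceed with step 2 and restart the disconnected nodes one at a time.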
