The Instance Health Checks are complaining that at least 1 node in the Jira Data Center cluster is not replicating, even though it is actually replicating successfully
Platform notice: Data Center only - This article only applies to Atlassian products on the Data Center platform.
This KB was created for the Data Center version of the product. Data Center KBs for features that are not Data Center-specific may also work for Server versions of the product, but they have not been tested. Support for Server* products ended on February 15, 2024. If you are running a Server product, see the Atlassian Server end-of-support announcement page for your migration options.
*Except Fisheye and Crucible
Summary
The Instance Health Checks are complaining that at least 1 node in the Jira Data Center cluster is not replicating, even though it is actually replicating successfully.
Environment
Any Jira 8.x version
Data Center only
Diagnosis
When checking the Jira application logs on one of the healthy nodes, we can see errors reporting that one particular Jira node (or more) is not replicating:
grep -h 'is not replicating' atlassian-jira.log* | sort
2021-09-07 19:34:52,751+0000 Caesium-1-1 ERROR ServiceRunner [c.a.t.healthcheck.concurrent.SupportHealthCheckProcess] Health check 'Cluster Cache Replication' failed with severity 'critical': 'The node problematic-node-ID is not replicating'
2021-09-07 20:34:52,706+0000 Caesium-1-3 ERROR ServiceRunner [c.a.t.healthcheck.concurrent.SupportHealthCheckProcess] Health check 'Cluster Cache Replication' failed with severity 'critical': 'The node problematic-node-ID is not replicating'
2021-09-07 21:34:52,768+0000 Caesium-1-1 ERROR ServiceRunner [c.a.t.healthcheck.concurrent.SupportHealthCheckProcess] Health check 'Cluster Cache Replication' failed with severity 'critical': 'The node problematic-node-ID is not replicating'
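In a cluster with many nodes, it can help to see at a glance which node IDs the health check is flagging, and how often. The following is a hypothetical one-liner built on the grep above; it assumes the default atlassian-jira.log* file naming and the exact error message format shown in the excerpt:

```shell
# Count 'is not replicating' health-check failures per reported node ID.
# Assumes the atlassian-jira.log* files are in the current directory and
# that the error message matches the format shown above.
grep -h "is not replicating" atlassian-jira.log* \
  | sed -n "s/.*The node \(.*\) is not replicating.*/\1/p" \
  | sort | uniq -c | sort -rn
```

The output is a count per node ID, most frequently flagged node first, which makes it easy to confirm whether the health check is complaining about a single node or several.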
However, in these same logs, when checking the replication process related to the "problematic node" (the one that the health checks are complaining about), we can see that the cache replication completes successfully:
2021-09-08 08:21:32,287+0000 localq-stats-0 INFO [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] [scheduled] Running cache replication queue stats for: 20 queues...
2021-09-08 08:00:27,926+0000 localq-stats-0 INFO [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] [VIA-INVALIDATION] Cache replication queue stats per node: problematic-node-ID snapshot stats: ...
2021-09-08 08:00:27,927+0000 localq-stats-0 INFO [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] [VIA-COPY] Cache replication replicatePutsViaCopy-queue stats per node: problematic-node-ID snapshot stats: ...
2021-09-08 08:21:32,289+0000 localq-stats-0 INFO [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] [scheduled] ... done running cache replication queue stats for: 20 queues.
- When creating a new Jira ticket while logged in directly to the "problematic node", we can see that this ticket can be found and accessed when logging in directly to any other "healthy" node, which is another indication that the replication is actually working properly
- When running a telnet command between all the Jira nodes of the cluster using their hostname/IP address and the ehcache ports (configured in the <JIRA_HOME>/cluster.properties file of each node), we can confirm that all the nodes are able to communicate with each other
- When checking the Clustering page in ⚙ > System, the application status of the "problematic node" might be empty
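For the connectivity check above, the hostname and ehcache port that a node advertises can be read straight out of its cluster.properties file. The following is a hedged sketch that assumes the standard ehcache.listener.* property names; it prints the check command (telnet works equally well), which should then be run from every other node in the cluster:

```shell
# Print a connectivity check command for the node described by the local
# cluster.properties. Assumes the standard ehcache.listener.hostName /
# ehcache.listener.port property names; if no port is set, 40001 (the
# default ehcache listener port) is assumed.
awk -F= '
  $1 == "ehcache.listener.hostName" { host = $2 }
  $1 == "ehcache.listener.port"     { port = $2 }
  END { if (port == "") port = 40001; print "nc -zv " host " " port }
' cluster.properties
```

If the printed nc (or telnet) command succeeds from every other node, network connectivity on the ehcache port can be ruled out as the cause.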
Cause
We have seen situations where the health check reports false positives about cluster cache replication for some nodes. Unfortunately, the exact root cause of this issue is currently unknown.
Solution
Schedule a maintenance window and restart the "problematic" Jira node. After the restart, the health checks should stop complaining about this node.
Providing data to Atlassian Support
If restarting the node did not resolve the issue, please reach out to Atlassian Support via this link. To help the Atlassian support team investigate the issue faster, please attach a support zip from each node of the cluster to the support ticket.