Cluster Cache Replication health check fails because it is unable to complete within the timeout

Platform notice: Server and Data Center only. This article only applies to Atlassian products on the Server and Data Center platforms.



Problem

Jira Data Center cluster replication relies on nodes being recorded in the database and on nodes sending and receiving updates. The Cluster Cache Replication health check confirms that replication is working across the entire cluster. If an active node is not responding, the other nodes report warnings and the failing node reports a critical result. For more details, see Cluster Cache Replication health check fails in Jira Data Center.

In some cases, because the health check takes a long time to execute, it might fail with the following error: The health check was unable to complete within the timeout of 20000


The following errors appear in atlassian-jira.log:

2018-02-12 05:47:56,248  WARN  HealthCheckWatchdog:thread-6  ServiceRunner          [support.healthcheck.concurrent.SupportHealthCheckTask]  Health check Cluster Cache Replication was unable to complete within the timeout of 20000.  
2018-02-12 05:47:56,249  ERROR  HealthCheck:thread-3  ServiceRunner          [plugins.healthcheck.service.ClusterHeartbeatService]  Failed to wait until cluster node appear in the cache  
java.lang.InterruptedException: sleep interrupted
	at java.lang.Thread.sleep(Native Method) [?:1.8.0_102]
	at com.atlassian.jira.plugins.healthcheck.service.SleepTimeoutFactory$SleepTimeout.sleep(SleepTimeoutFactory.java:32) [?:?]
	at com.atlassian.jira.plugins.healthcheck.service.ClusterHeartbeatService.getClusterNodesReplicationInfo(ClusterHeartbeatService.java:71) [?:?]
	at com.atlassian.jira.plugins.healthcheck.cluster.ClusterReplicationHealthCheck.doCheck(ClusterReplicationHealthCheck.java:41) [?:?]
	at com.atlassian.jira.plugins.healthcheck.cluster.AbstractClusterHealthCheck.check(AbstractClusterHealthCheck.java:52) [?:?]
	at com.atlassian.support.healthcheck.impl.PluginSuppliedSupportHealthCheck.check(PluginSuppliedSupportHealthCheck.java:51) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_102]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_102]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_102]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
2018-02-12 05:47:56,278  WARN  HealthCheck:thread-3  ServiceRunner          [plugins.healthcheck.cluster.ClusterReplicationHealthCheck]  Node jiranode-2 does not seem to replicate its cache  

Please note that this is a different problem from a misconfigured cluster in which a node is not replicating due to a network condition. See: Cluster Cache Replication health check fails in Jira Data Center


Diagnosis

Environment

  • Jira Data Center
  • A large number of scheduled threads

Cause 1 (Jira 7.4.2 and earlier)

In Jira 7.4.2 and earlier, the Jira Instance Health (JIH) plugin sets a 20-second timeout for the check run, while the check itself has a 60-second timeout. JIH was replaced in Jira 7.4.3 by the Atlassian Troubleshooting & Support Tools plugin (ATST). The bug is fixed in ATST 1.6.1, as detailed in ATST-864: that release increases the timeout for the check, and later versions of Jira also send the heartbeats the check verifies in a different way.

Workaround

Add the JVM argument below, as described in Setting properties and options on startup:

-Datlassian.healthcheck.timeout-ms=60000

If this doesn't resolve the problem, you may be affected by one of the additional causes below.

Cause 2

The underlying cause is that some or all nodes don't schedule the cluster replication heartbeat because the scheduler (Caesium) is busy, so the health check can't read the value within the timeout. The scheduler has a limited number of threads, which can cause contention for the check: the heartbeat job may wait for an available thread for longer than 20 s (or 60 s after increasing the timeout).

The cluster replication health check uses its own cache to store the node IDs:

  • Each node periodically puts a heartbeat value into that cache via a job scheduled by the scheduler. Example of the scheduled task HealthCheckSchedulerImpl triggered during start-up:

    2017-03-31 13:12:15,241 localhost-startStop-1 INFO      [c.a.j.p.h.scheduler.impl.HealthCheckSchedulerImpl] Scheduling job with : JobConfig[jobRunnerKey=com.atlassian.jira.plugins.healthcheck.scheduler.impl.HealthCheckSchedulerImpl,runMode=RUN_LOCALLY,schedule=Schedule[type=INTERVAL,intervalScheduleInfo=IntervalScheduleInfo[firstRunTime=Fri Mar 31 13:12:30 CEST 2017,intervalInMillis=10000]],parameters={}]
  • During execution, the health check then verifies the status:
    • It checks all non-replicating nodes (reading the isReplicating status from the cache) and waits for a heartbeat from each live node.
  • Normally the heartbeat job runs periodically, so the data is already in the cache and the check returns immediately.
  • In this case, the check waits in a loop until it is interrupted after 20000 ms, as shown in the error above.
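The poll-with-timeout pattern described above can be sketched as follows. This is a simplified, hypothetical illustration and not Atlassian's actual code; the class name `HeartbeatPoller`, the plain map standing in for the replicated cache, and the node IDs are all assumptions for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the health check's wait loop: each node's scheduled
// job writes a heartbeat into a replicated cache, and the check polls that
// cache for every live node until the value appears or the timeout elapses.
public class HeartbeatPoller {

    // Stand-in for the replicated heartbeat cache, keyed by node ID.
    static final Map<String, Long> heartbeatCache = new ConcurrentHashMap<>();

    /** Polls until nodeId has a heartbeat or timeoutMs elapses. */
    static boolean waitForHeartbeat(String nodeId, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (heartbeatCache.containsKey(nodeId)) {
                return true;           // heartbeat job ran: check passes immediately
            }
            try {
                Thread.sleep(pollMs);  // the sleep that is interrupted at the timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;                  // the scheduler never ran the heartbeat job
    }

    public static void main(String[] args) {
        heartbeatCache.put("jiranode-1", System.currentTimeMillis());
        System.out.println(waitForHeartbeat("jiranode-1", 200, 10)); // true
        System.out.println(waitForHeartbeat("jiranode-2", 200, 10)); // false
    }
}
```

When the heartbeat job is starved of scheduler threads, the loop never sees a value and runs until the watchdog interrupts it, which matches the InterruptedException in the stack trace above.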

Workaround

Unfortunately, there is no reliable workaround for this problem:

  • You can try running the health check during off-peak hours, which might decrease contention on the scheduler.
  • Reduce the number of scheduled jobs (for example, if you have a large number of Mail Handlers) or spread them across the day.

Solution

Unfortunately, there is no resolution. The best approach would be to increase the number of scheduler threads, but this value is hard-coded and set to 4. See JRASERVER-65809.

Cause 3

There is trailing white space in the node's ID, and Jira doesn't handle this consistently: the value is read from the cluster.properties file with the trailing space and saved to the database's clusternode table. However, the health check mechanism trims trailing white space, so the node is not found. We have raised JRASERVER-67243 to address this.

Workaround

  1. Make sure the jira.node.id property has no trailing space in its value in the cluster.properties file.
  2. Ensure the NODE_ID column of the clusternode database table has no trailing space after the value.
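The mismatch can be demonstrated with a small, hypothetical example (the class name, helper method, and node ID are inventions for illustration): the stored ID keeps its trailing space while the health check compares against a trimmed value, so an exact string comparison fails.

```java
// Hypothetical illustration of the trailing-space mismatch described in Cause 3.
public class NodeIdCheck {

    /** Exact comparison, as used when looking a node up by its stored ID. */
    static boolean idsMatch(String storedNodeId, String checkedNodeId) {
        return storedNodeId.equals(checkedNodeId);
    }

    public static void main(String[] args) {
        String stored = "jiranode-2 ";      // raw value with a trailing space
        String checked = stored.trim();     // the trimmed value the check uses
        System.out.println(idsMatch(stored, checked));        // false: "node not found"
        System.out.println(idsMatch(stored.trim(), checked)); // true once the space is removed
    }
}
```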

Cause 4

The Java argument -Djava.rmi.server.hostname= is set to the wrong server hostname. RMI is used by Ehcache, so a wrong setting there affects the replication configuration and workflow.

Solution

  1. Check the Java argument -Djava.rmi.server.hostname= and either remove it (if not required) or ensure it's set to the correct value.
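One way to compare the configured value against what the JVM actually resolves is a small helper like the sketch below. This is not an Atlassian tool, just an illustrative check; the class name and the sample hostnames are assumptions.

```java
import java.net.InetAddress;

// Hypothetical helper: compare the configured RMI hostname against the
// hostname this JVM resolves for itself.
public class RmiHostnameCheck {

    /** The setting is fine if it is absent or matches the real hostname. */
    static boolean matches(String rmiHost, String actualHost) {
        return rmiHost == null || rmiHost.equals(actualHost);
    }

    public static void main(String[] args) throws Exception {
        String rmiHost = System.getProperty("java.rmi.server.hostname");
        String actualHost = InetAddress.getLocalHost().getHostName();
        System.out.println("java.rmi.server.hostname = " + rmiHost);
        System.out.println("resolved local hostname  = " + actualHost);
        if (!matches(rmiHost, actualHost)) {
            System.out.println("WARNING: RMI hostname does not match this host");
        }
    }
}
```

Run this on each node with the same JVM arguments as Jira to spot a node whose configured RMI hostname points elsewhere.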


Last updated June 8, 2022
