Jira Data Center node state is showing as MAINTENANCE while the node is actually running and not re-indexing
About the platforms: Server and Data Center only. This article only applies to Atlassian products on the Server and Data Center platforms.
Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
Normally, a node on a Jira Data Center cluster will show the "maintenance" status when the node is being reindexed and cannot currently serve users, as explained in Jira cluster monitoring:
{"state":"MAINTENANCE"}
However, there can be some unexpected situations where the node is showing the "maintenance" status even though it is running and it's not performing a re-indexing operation.
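To see which state each node is actually reporting, you can poll the /status endpoint of every node directly, bypassing the load balancer. The sketch below simulates the HTTP response so it runs standalone; node names are placeholders, and in practice you would replace the simulated line with a curl call against each node.

```shell
#!/bin/sh
# Poll each cluster node's /status endpoint directly (bypassing the load balancer).
# Node names are placeholders; the response is simulated here so the sketch runs
# standalone. In practice use: response=$(curl -s "http://$node:8080/status")
for node in node1 node2 node3; do
  response='{"state":"MAINTENANCE"}'   # simulated response; replace with the curl call above
  state=$(printf '%s' "$response" | sed -n 's/.*"state" *: *"\([A-Z_]*\)".*/\1/p')
  echo "$node reports state: $state"
done
```

A node that is healthy and serving traffic reports `{"state":"RUNNING"}` instead.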
Environment
Any Jira Data Center version on 7.x or 8.x.
Cause
Root cause 1
Some of the nodes are configured with an incorrect value in the JVM startup parameter -Djava.rmi.server.hostname. For example, the hostname might be set to a non-resolvable domain, or to an incorrect IP address.
Since this JVM parameter overrides the hostname configured in the <JIRA_HOME>/cluster.properties file, the node will end up using the incorrect hostname upon node startup.
As a result, the following two symptoms might occur:
- the cluster cache replication will fail since it relies on the hostname value to be correct (for more detailed information about this symptom, refer to the KB article JIRA Data Center Asynchronous Cache replication failing health check)
- some nodes might show the "maintenance" state when accessing the URL <node-address>/status.
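The override described above can be illustrated with a small sketch that compares the hostname in the JVM flag against the one in cluster.properties (assuming the standard ehcache.listener.hostName key). All values here are simulated so the script runs standalone; in practice you would read the flag from the node's `ps` output and the real <JIRA_HOME>/cluster.properties file.

```shell
#!/bin/sh
# Sketch: detect a mismatch between the -Djava.rmi.server.hostname JVM flag and
# cluster.properties. The JVM flag wins at startup, so a wrong value here breaks
# RMI cache replication. All values below are simulated/illustrative.
jvm_args="-Xmx4g -Djava.rmi.server.hostname=10.0.0.99"          # simulated startup args
props_file=$(mktemp)
printf 'ehcache.listener.hostName=10.0.0.5\n' > "$props_file"   # simulated cluster.properties

jvm_host=$(printf '%s' "$jvm_args" | sed -n 's/.*java\.rmi\.server\.hostname=\([^ ]*\).*/\1/p')
props_host=$(sed -n 's/^ehcache\.listener\.hostName=//p' "$props_file")

if [ -n "$jvm_host" ] && [ "$jvm_host" != "$props_host" ]; then
  echo "MISMATCH: JVM uses $jvm_host but cluster.properties has $props_host"
fi
rm -f "$props_file"
```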
Root cause 2
The database is configured with an unsupported database collation, and we are hitting the bug JRASERVER-65708.
Root cause 3
Jira is on version 8.19.1 or higher, and the node which is showing in the MAINTENANCE status has indexes which are in an inconsistent state (out of sync with the Jira database). As explained in the feature request JRASERVER-66970, from Jira 8.19.1, if a Jira node has inconsistent indexes, the node state will enter the MAINTENANCE mode.
Please note that in this case, the "MAINTENANCE" status is expected by design, since it is meant to tell the load balancer not to route traffic to that node (because of the inconsistent index state).
Diagnosis
Diagnosis for Root cause 1
Check the Jira application log to see if you can find any trace of cache replication failure:
example of error 1
2021-11-10 08:59:35,390-0800 localq-reader-16 ERROR [c.a.j.c.distribution.localq.LocalQCacheOpReader] [LOCALQ] [VIA-COPY] Abandoning sending: LocalQCacheOp{cacheName='com.atlassian.jira.crowd.embedded.ofbiz.EagerOfBizUserCache.userCache', action=PUT, key={10100,brian_campbell}, value == null ? false, replicatePutsViaCopy=true, creationTimeInMillis=1636563571212} from cache replication queue: [queueId=queue_node2_5_78882aaeb08e9a4c81687b5de2add74f_put, queuePath=/vxxxx/atlassian/application-data/jira/localq/queue_node2_5_78882aaeb08e9a4c81687b5de2add74f_put], failuresCount: 1/1. Removing from queue. Error: java.rmi.NoSuchObjectException: no such object in table
com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpSender$UnrecoverableFailure: java.rmi.NoSuchObjectException: no such object in table
    at com.atlassian.jira.cluster.distribution.localq.rmi.LocalQCacheOpRMISender.send(LocalQCacheOpRMISender.java:90)
    at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpReader.run(LocalQCacheOpReader.java:96)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.rmi.NoSuchObjectException: no such object in table
example of error 2
2021-11-10 18:34:18,323-0800 HealthCheck:thread-6 WxxxxN ServiceRunner [c.a.t.j.healthcheck.cluster.ClusterReplicationHealthCheck] Node node3 does not seem to replicate its cache
2021-11-10 18:34:18,324-0800 HealthCheck:thread-6 WxxxxN ServiceRunner [c.a.t.j.healthcheck.cluster.ClusterReplicationHealthCheck] Node node1 does not seem to replicate its cache
2021-11-10 18:34:18,328-0800 support-zip ERROR [c.a.t.healthcheck.concurrent.SupportHealthCheckProcess] Health check 'Cluster Cache Replication' failed with severity 'critical': '["The node node3 is not replicating","The node node1 is not replicating"]'
For each Jira node, check whether the -Djava.rmi.server.hostname JVM startup parameter is in use. If it is, check whether it is set to a correct IP address or a resolvable hostname. If the IP address is incorrect or the hostname is not resolvable, then this root cause applies.
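As a quick resolvability check, the value found in the -Djava.rmi.server.hostname flag can be fed to a helper like the one below (a sketch; `localhost` stands in for the hostname configured on your node):

```shell
#!/bin/sh
# Check whether a configured RMI hostname actually resolves on this node.
# Run this with the value found in the -Djava.rmi.server.hostname startup flag.
check_host() {
  if getent hosts "$1" >/dev/null 2>&1; then
    echo "$1: resolvable"
  else
    echo "$1: NOT resolvable - this root cause likely applies"
  fi
}
check_host localhost   # replace with the hostname from the JVM flag
```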
Diagnosis for Root cause 2
Go to the page ⚙ > System > Troubleshooting and support tools > Instance health > Database, and check whether the health check is complaining about an unsupported collation.
Diagnosis for Root cause 3
- Jira is running on version 8.19.1 or higher
- The following WARN/INFO messages can be found in the Jira application logs:
2021-11-22 14:29:38,069+0000 http-nio-8080-exec-10 url: /status WARN anonymous XXXxXXXxX - XX.XXX.X.XXX /status [c.a.j.issue.index.IndexConsistencyUtils] Index consistency check failed for index 'Issue': expectedCount=875155; actualCount=713032
2021-11-22 14:29:38,070+0000 http-nio-8080-exec-10 url: /status INFO anonymous XXXxXXXxX - XX.XXX.X.XXX /status [c.a.jira.servlet.ApplicationStateResolverImpl] Checking index consistency. Time taken: 160.9 ms
2021-11-22 14:29:38,070+0000 http-nio-8080-exec-10 url: /status WARN anonymous XXXxXXXxX - XX.XXX.X.XXX /status [c.a.jira.servlet.ApplicationStateResolverImpl] The issue index is inconsistent. This node will report its status as MAINTENANCE. You will find information on how to resolve this problem here: https://jira.atlassian.com/browse/JRASERVER-66970
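Searching the application log for the first of these messages is a quick way to confirm this root cause. The sketch below uses a temporary file to stand in for the real log so it runs standalone; in practice, grep the node's `<JIRA_HOME>/log/atlassian-jira.log` directly.

```shell
#!/bin/sh
# Sketch: confirm the index-consistency warning is present in the application log.
# A temp file stands in for the real log here; in practice run:
#   grep 'Index consistency check failed' <JIRA_HOME>/log/atlassian-jira.log
log=$(mktemp)
cat > "$log" <<'EOF'
2021-11-22 14:29:38,069+0000 WARN [c.a.j.issue.index.IndexConsistencyUtils] Index consistency check failed for index 'Issue': expectedCount=875155; actualCount=713032
EOF
grep -c 'Index consistency check failed' "$log"   # a non-zero count confirms root cause 3
rm -f "$log"
```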
Solution
Solution for Root cause 1
For each affected node:
- either remove the parameter -Djava.rmi.server.hostname from the JVM startup parameter, if a correct hostname value is already set up in the <JIRA_HOME>/cluster.properties file, and re-start the node
- or change the value of this parameter to a correct IP address or resolvable hostname
For more detailed information, refer to the KB article JIRA Data Center Asynchronous Cache replication failing health check.
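As an illustration, the relevant fragment of the node's `<JIRA_INSTALL>/bin/setenv.sh` might look like the sketch below (the IP address is illustrative; use an address the other cluster nodes can actually reach, or omit the flag entirely if cluster.properties already holds the correct hostname):

```shell
# Illustrative setenv.sh fragment for option (b): point the RMI hostname at a
# correct, reachable address. For option (a), simply remove the
# -Djava.rmi.server.hostname=... token from this variable instead.
JVM_SUPPORT_RECOMMENDED_ARGS="${JVM_SUPPORT_RECOMMENDED_ARGS} -Djava.rmi.server.hostname=10.0.0.5"
```

Remember to restart the node after either change so the JVM picks up the new arguments.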
Solution for Root cause 2
Refer to the workaround mentioned in the bug JRASERVER-65708.
Solution for Root cause 3
The solution consists of fixing the index inconsistency on the problematic node:
- Access the problematic node using its IP address via a browser
- Go to ⚙ > System > Indexing
- Select Full re-index and click Re-index
- Wait until the re-indexing completes and confirm that the status of this node changes to RUNNING
Note that it is possible to prevent the node from going into MAINTENANCE mode when the indexes are out of sync, as explained in the Current status section of the feature request JRASERVER-66970. If you want the node to remain in the RUNNING state even while its indexes are inconsistent (which was the expected behavior prior to Jira 8.19.1), add the following JVM startup parameter to each Jira node and restart each node:
-Dcom.atlassian.jira.status.index.check=false
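For example, in each node's `<JIRA_INSTALL>/bin/setenv.sh` the flag could be appended to Jira's standard JVM_SUPPORT_RECOMMENDED_ARGS variable (a sketch; adjust to how your startup arguments are managed):

```shell
# Illustrative setenv.sh fragment: disable the index-consistency check behind
# /status, restoring the pre-8.19.1 behavior (node stays RUNNING even when
# its index is out of sync with the database).
JVM_SUPPORT_RECOMMENDED_ARGS="${JVM_SUPPORT_RECOMMENDED_ARGS} -Dcom.atlassian.jira.status.index.check=false"
```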
Note: Even if a Jira node is in MAINTENANCE mode, that specific node will still be accessible when browsing Jira directly via its IP address or hostname. Only when browsing Jira through the base URL bound to the load balancer will this node be inaccessible, since the load balancer will not route any requests to it.