Troubleshooting disk I/O bottlenecks in Jira Data Center

Problem

Troubleshooting disk performance bottlenecks is difficult and can be tricky in large, complex environments. In certain scenarios the problem is not easily noticeable, and several side effects may appear before the actual root cause can be identified. Some metrics related to cache replication have been available since Jira 7.6, and indexing-specific metrics (index-writer-stats) were added in Jira 8.10. Starting with Jira 8.12, additional metrics were added to the logs to help troubleshoot these kinds of problems; they are part of the new Document-based replication in Jira Data Center functionality.

We have observed severe disk I/O problems at customers using SSDs in a RAID-1 array with Dell controllers (in this case, PERC 6/i). Using the usual tools to evaluate disk performance, such as Test disk access speed, we didn't identify any performance problems; in fact, the results were fairly good, averaging 500 MB/s.

However, nasty problems were observed in Jira, such as:

  • Cluster index replication synchronization problems
  • A Service Desk-related custom field change, such as an SLA update, locking the node's index for several minutes because of the poor throughput while processing the reindex operations
  • The same index-locking situation during peak hours; when bulk operations such as releases, project imports, or bulk issue creation happened, the situation usually led to a crash requiring the cluster to be restarted

In the Jira atlassian-jira.log file, errors like the one below were seen when bulk updates were triggered:


2020-11-11 10:46:44,929-0300 http-nio-8080-exec-177 url:/jira/servicedesk/portal/12/create/22 username:atlassian [c.a.j.issue.index.DefaultIndexManager] Wait attempt timed out - waited 90000 milliseconds
com.atlassian.jira.issue.index.IndexException: Wait attempt timed out - waited 90000 milliseconds


The index across the cluster nodes was locked for ~30 minutes due to the reindex operation requested by bulk updates.

This problem became more noticeable after an upgrade to Jira 8.11 and was aggravated after a further update to Jira 8.12. Even before that, though, searching the index was known to be slow, and cluster index synchronization was constantly falling behind by several minutes.

The background is that in Jira 8.10 and below (down to Jira 8.0), adding or updating the index was an in-memory operation: documents were added to the Lucene buffer, which was flushed asynchronously. As of Jira 8.11, Jira performs a conditional update before every write, which consists of a search against the index to check whether it already contains a document with a higher version; such searches may trigger an index flush, which is a disk operation. This guarantees index consistency when there are concurrent index updates, and although it might slightly decrease indexing performance, it resolved many indexing inconsistency bugs frequently seen in large Jira Data Center environments.

DBR, introduced in Jira 8.12, added more indexing events (adding documents to the index) across the cluster. This aggravated any pre-existing performance problems in the cluster, because the disks couldn't handle the additional load.

Troubleshooting

For all the metrics below, Jira records snapshot and total statistics. The snapshot stats cover roughly the last 5 minutes, and the total stats cover the time since the node started. Check the last entry for each metric, because new metrics are logged every 5 minutes.

If you're troubleshooting an ongoing Jira performance problem, it's best to look at the total stats, as a snapshot might be skewed by a single heavy operation that Jira happened to be processing at the time.
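For example, to pull the most recent total-stats entry for the cache replication metrics, you can reuse the same grep pipeline shown in the jsonpp example near the end of this article:

# Most recent LOCALQ cache-replication "total stats" entry in the current log
grep "LOCALQ" atlassian-jira.log | grep "INVALIDATION" | grep "total" | tail -1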

Cache Replication via Invalidation Stats Logs (available since Jira 7.6)

These metrics relate to cache replication via invalidation, but they can also be used as real disk performance statistics.

Look at the timeToAddMillis metric, which is the time to store a cache replication message in a local store before sending it to the other node. In this case it really measures disk performance, as no OS buffer is involved. In healthy clusters, we expect the average to be less than 1 millisecond. In the distributionCounter histograms below, each key appears to be a bucket's upper bound in milliseconds and each value the number of samples that fell into that bucket; the 9223372036854775807 key (Long.MAX_VALUE) collects the outliers above the largest explicit bound.


See an example of a cache replication log
["LOCALQ"]["VIA-INVALIDATION"]
"Cache replication queue stats per node":atl-node1 total stats:{
   "timestampMillis":1605104041350,
   "nodeId":"atl-node2",
   "queueSize":0,
   "startQueueSize":0,
   "startTimestampMillis":1604096513275,
   "startMillisAgo":1007528075,
   "closeCounter":0,
   "addCounter":803366,
   "droppedOnAddCounter":0,
   "criticalAddCounter":0,
   "criticalPeekCounter":0,
   "criticalRemoveCounter":0,
   "peekCounter":0,
   "peekOrBlockCounter":803369,
   "removeCounter":803366,
   "backupQueueCounter":0,
   "closeErrorsCounter":0,
   "addErrorsCounter":0,
   "peekErrorsCounter":0,
   "peekOrBlockErrorsCounter":0,
   "removeErrorsCounter":0,
   "backupQueueErrorsCounter":0,
   "lastAddTimestampMillis":1605103205923,
   "lastAddMillisAgo":835427,
   "lastPeekTimestampMillis":0,
   "lastPeekMillisAgo":0,
   "lastPeekOrBlockTimestampMillis":1605103205924,
   "lastPeekOrBlockMillisAgo":835426,
   "lastRemoveTimestampMillis":1605103205937,
   "lastRemoveMillisAgo":835413,
   "lastBackupQueueTimestampMillis":0,
   "lastBackupQueueMillisAgo":0,
   "timeToAddMillis":{
      "count":803366,
      "min":2,
      "max":783,
      "sum":11871332,
      "avg":14,
      "distributionCounter":{
         "10":8996,
         "20":770471,
         "50":23344,
         "100":279,
         "9223372036854775807":276
      }
   },
   "timeToPeekMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0,
      "distributionCounter":{
         
      }
   },
   "timeToPeekOrBlockMillis":{
      "count":803369,
      "min":0,
      "max":38161261,
      "sum":10047103930,
      "avg":12506,
      "distributionCounter":{
         
      }
   },
   "timeToRemoveMillis":{
      "count":803366,
      "min":0,
      "max":518,
      "sum":10561931,
      "avg":13,
      "distributionCounter":{
         "10":21124,
         "20":771005,
         "50":10772,
         "100":251,
         "9223372036854775807":214
      }
   },
   "timeToBackupQueueMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0,
      "distributionCounter":{
         
      }
   },
   "staleCounter":0,
   "sendCounter":803364,
   "droppedOnSendCounter":0,
   "timeToSendMillis":{
      "count":803364,
      "min":0,
      "max":1345,
      "sum":355865,
      "avg":0,
      "distributionCounter":{
         "10":802038,
         "20":251,
         "50":363,
         "100":447,
         "200":207,
         "500":51,
         "1000":6,
         "5000":1
      }
   },
   "sendRuntimeExceptionCounter":1,
   "sendCheckedExceptionCounter":0,
   "sendNotBoundExceptionCounter":4
}
Check a grep command to get the averages from the log
grep -o -E 'VIA-INVALIDATION.*snapshot stats.*timeToAddMillis":{"count":[0-9]+,"min":[0-9]+,"max":[0-9]+,"sum":[0-9]+,"avg":[0-9]+' atlassian-jira.log | grep -oE "[0-9]+$"
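
To reduce the extracted snapshot averages to a single number, you can pipe the same output through a small awk aggregation (a generic convenience one-liner, not an Atlassian-provided command):

grep -o -E 'VIA-INVALIDATION.*snapshot stats.*timeToAddMillis":{"count":[0-9]+,"min":[0-9]+,"max":[0-9]+,"sum":[0-9]+,"avg":[0-9]+' atlassian-jira.log \
  | grep -oE "[0-9]+$" \
  | awk '{ sum += $1; n++ } END { if (n) printf "entries: %d, mean snapshot average: %.1f ms\n", n, sum / n }'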


Index Writer Stats Logs (issue stats) (available since Jira 8.10)

These metrics are related to indexer performance. Each index is updated by a single thread, and these metrics describe the statistics of all direct index write operations for the given index.

For the ISSUE index, look at the updateDocumentsWithVersionMillis metric, which measures the time to update the Lucene index with a single issue (a single document). We expect the average to be less than 10 milliseconds, and 99% of samples to be less than 100 milliseconds. In a nutshell, this number bounds how many index updates the cluster can perform. For example, an average of 10 milliseconds means Jira can update the issue index at most 100 times per second. An average of 110 milliseconds, however, allows only about 9 issue reindex operations per second in the whole cluster. Considering that a single issue create can reindex the issue 3 to 6 times (the higher figure for JSD/JSM issues), Jira would only be able to create roughly 1 issue per second without affecting the index queue (i.e. delaying other index updates arriving from user requests). This also impacts other issue update operations, such as bulk updates, which can trigger far more than 1 issue index update per second.
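As a quick sanity check of that arithmetic, using the hypothetical 110-millisecond average from the example above:

# Back-of-the-envelope capacity maths for an average index write time of 110 ms
awk 'BEGIN {
  avg_ms = 110                     # avg updateDocumentsWithVersionMillis
  ops = 1000 / avg_ms              # ~9 index writes per second, cluster-wide
  printf "max index writes/sec:                   %.1f\n", ops
  printf "max issue creates/sec (6 reindexes each): %.1f\n", ops / 6
  printf "time to index a 2000-issue bulk update:  %.1f min\n", 2000 / ops / 60
}'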

Suppose an update affecting 2,000 issues is requested and gets replicated across the nodes. If our time to write to the index is 110 milliseconds, we can only update the index for about 9 issues per second, so draining the queue for that single replication operation would take close to 4 minutes (see the calculation above). Meanwhile, users would still be triggering operations in the UI, which usually led to two scenarios:

  1. Duplicated issues were created, as users hit the Create button multiple times while Jira was hanging on the index update
  2. Users reported slowness, as Jira was unable to acquire the index read lock, with several errors in the logs about failing to acquire the read lock for more than 30 seconds


See an example of an index writer stats log
INFO["c.a.jira.index.WriterWithStats"]["index-writer-stats"]
"ISSUE":"total stats":{
   "addDocumentsMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0,
      "distributionCounter":{
         
      }
   },
   "deleteDocumentsMillis":{
      "count":502,
      "min":0,
      "max":5,
      "sum":62,
      "avg":0,
      "distributionCounter":{
         
      }
   },
   "updateDocumentsMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0,
      "distributionCounter":{
         
      }
   },
   "updateDocumentConditionallyMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0,
      "distributionCounter":{
         
      }
   },
   "updateDocumentsWithVersionMillis":{
      "count":2125,
      "min":0,
      "max":923,
      "sum":238331,
      "avg":112,
      "distributionCounter":{
         "10":85268,
         "100":132996,
         "500":29132,
         "1000":4652,
         "5000":10,
         "10000":0,
         "30000":0,
         "60000":0
      }
   },
   "updateDocumentsWithVersionSize":{
      "count":252058,
      "min":1,
      "max":1,
      "sum":252058,
      "avg":1,
      "distributionCounter":{
         "1":252058,
         "10":0,
         "100":0,
         "1000":0,
         "10000":0
      }
   },
   "replaceDocumentsWithVersionMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0,
      "distributionCounter":{
         "10":0,
         "100":0,
         "500":0,
         "1000":0,
         "5000":0,
         "10000":0,
         "30000":0,
         "60000":0
      }
   },
   "replaceDocumentsWithVersionSize":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0,
      "distributionCounter":{
         "1":0,
         "10":0,
         "100":0,
         "1000":0,
         "10000":0
      }
   },
   "optimizeMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0,
      "distributionCounter":{
         
      }
   },
   "closeMillis":{
      "count":0,
      "min":0,
      "max":0,
      "sum":0,
      "avg":0,
      "distributionCounter":{
         
      }
   },
   "commitMillis":{
      "count":15110,
      "min":0,
      "max":1227,
      "sum":1727819,
      "avg":114,
      "distributionCounter":{
         
      }
   }
},
"index writer version cache stats":{
   "put":251520,
   "get":252058,
   "getFound":191259,
   "clear":502
}
Check a grep command to get the averages from the log
grep -o -E 'updateDocumentsWithVersionMillis":{"count":[0-9]+,"min":[0-9]+,"max":[0-9]+,"sum":[0-9]+,"avg":[0-9]+' atlassian-jira.log | grep -oE "[0-9]+$"
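
To turn the most recent extracted average into the rough ceiling on issue index updates per second discussed above (again just a convenience one-liner):

grep -o -E 'updateDocumentsWithVersionMillis":{"count":[0-9]+,"min":[0-9]+,"max":[0-9]+,"sum":[0-9]+,"avg":[0-9]+' atlassian-jira.log \
  | grep -oE "[0-9]+$" | tail -1 \
  | awk '{ if ($1 > 0) printf "max issue index updates/sec: %.1f\n", 1000 / $1; else print "avg < 1 ms - index writes look healthy" }'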


DBR Receiver Stats Logs (available since Jira 8.12)

Starting with Jira 8.12, there are new metrics related to the DBR functionality that can help with this troubleshooting. The DBR receiver logs describe the component responsible for receiving DBR messages containing the Lucene document(s) from other nodes. For more information about how DBR works, see Document-based replication in Jira Data Center.

The snapshot stats cover roughly the last 5 minutes, and the total stats cover the time since the Jira application started on that node.

For these stats, look at the processDBRMessageUpdateWithRelatedIndexInMillis metric, which measures the partial time to process the DBR message locally, that is, the time to perform a conditional update of the local index with the documents from the DBR message. In healthy clusters, we expect the average to be less than 10 milliseconds.


See an example of a DBR receiver log
["DBR"]["RECEIVER"]
"total stats period":PT282H15M27.996S,
"data":{
   "receiveDBRMessage":195939,
   "receiveDBRMessageUpdate":18937,
   "receiveDBRMessageUpdateWithRelated":177002,
   "receiveDBRMessageDelayedInMillis":{
      "count":195939,
      "min":2,
      "max":10225,
      "sum":38425076,
      "avg":196,
      "distributionCounter":{
         "500":189797,
         "1000":824,
         "1500":360,
         "2000":261,
         "3000":381,
         "4000":347,
         "5000":361,
         "9223372036854775807":3608
      }
   },
   "skipDBRMessageWhenIndexNotAvailable":0,
   "skipDBRMessageWhenIndexReplicationPaused":0,
   "processDBRMessageUpdateIssueIndex":0,
   "processDBRMessageUpdateCommentIndex":18260,
   "processDBRMessageUpdateWorklogIndex":677,
   "processDBRMessageUpdateBytes":{
      "count":18937,
      "min":809,
      "max":52332,
      "sum":28856470,
      "avg":1523,
      "distributionCounter":{
         "0":0,
         "1000":598,
         "50000":18336,
         "100000":3,
         "500000":0,
         "1000000":0,
         "2000000":0,
         "5000000":0,
         "10000000":0
      }
   },
   "processDBRMessageUpdateSerializeInMillis":{
      "count":18937,
      "min":0,
      "max":127,
      "sum":163,
      "avg":0,
      "distributionCounter":{
         "0":18923,
         "1":7,
         "5":5,
         "10":0,
         "50":1,
         "100":0,
         "500":1,
         "1000":0
      }
   },
   "processDBRMessageUpdateIndexInMillis":{
      "count":18937,
      "min":0,
      "max":1101,
      "sum":136421,
      "avg":7,
      "distributionCounter":{
         "0":1516,
         "1":4250,
         "5":2789,
         "10":8290,
         "50":1996,
         "100":25,
         "500":48,
         "1000":21,
         "9223372036854775807":2
      }
   },
   "processDBRMessageUpdateErrors":0,
   "processDBRMessageUpdateWithRelatedIssueIndex":{
      "count":177001,
      "min":1,
      "max":1,
      "sum":177001,
      "avg":1,
      "distributionCounter":{
         "0":0,
         "1":177001
      }
   },
   "processDBRMessageUpdateWithRelatedCommentIndex":{
      "count":177001,
      "min":0,
      "max":1090,
      "sum":1127869,
      "avg":6,
      "distributionCounter":{
         "0":99000,
         "1":30338,
         "10":43573,
         "100":2883,
         "9223372036854775807":1207
      }
   },
   "processDBRMessageUpdateWithRelatedWorklogIndex":{
      "count":177001,
      "min":0,
      "max":18,
      "sum":2798,
      "avg":0,
      "distributionCounter":{
         "0":175336,
         "1":1105,
         "10":554,
         "100":6
      }
   },
   "processDBRMessageUpdateWithRelatedChangesIndex":{
      "count":177001,
      "min":1,
      "max":707,
      "sum":802138,
      "avg":4,
      "distributionCounter":{
         "0":0,
         "1":62899,
         "10":100910,
         "100":13059,
         "9223372036854775807":133
      }
   },
   "processDBRMessageUpdateWithRelatedBytes":{
      "count":177001,
      "min":148202,
      "max":1476974,
      "sum":38540709056,
      "avg":217742,
      "distributionCounter":{
         "0":0,
         "1000":0,
         "50000":0,
         "100000":0,
         "500000":175898,
         "1000000":425,
         "2000000":678,
         "5000000":0,
         "10000000":0
      }
   },
   "processDBRMessageUpdateWithRelatedSerializeInMillis":{
      "count":177001,
      "min":2,
      "max":367,
      "sum":1117486,
      "avg":6,
      "distributionCounter":{
         "0":0,
         "1":0,
         "5":44638,
         "10":127280,
         "50":4862,
         "100":150,
         "500":71,
         "1000":0
      }
   },
   "processDBRMessageUpdateWithRelatedIndexInMillis":{
      "count":177001,
      "min":0,
      "max":94874,
      "sum":344333681,
      "avg":1945,
      "distributionCounter":{
         "0":158,
         "1":9,
         "5":41213,
         "10":6664,
         "50":63143,
         "100":6424,
         "500":38848,
         "1000":11506,
         "9223372036854775807":9036
      }
   },
   "processDBRMessageUpdateWithRelatedErrors":0,
   "maxErrorsSample":10,
   "processDBRMessageUpdateErrorsSample":{
      
   },
   "processDBRMessageUpdateWithRelatedErrorsSample":{
      
   }
}
Check a grep command to get the averages from the log
grep -o -E 'processDBRMessageUpdateWithRelatedIndexInMillis":{"count":[0-9]+,"min":[0-9]+,"max":[0-9]+,"sum":[0-9]+,"avg":[0-9]+' atlassian-jira.log | grep -oE "[0-9]+$"


As a baseline: at one customer experiencing the problem detailed here, we saw the following statistics:

  • timeToAddMillis - 15 milliseconds average

  • updateDocumentsWithVersionMillis - 165 milliseconds average

After fixing the I/O problems, the statistics above dropped to:

  • timeToAddMillis - 0 milliseconds average

  • updateDocumentsWithVersionMillis - 5 milliseconds average


You can use a JSON formatter to get a more human-readable output. If you have Homebrew installed, you can install jsonpp and format the JSON directly in your terminal, as in the example below:

cat atlassian-jira.log | grep "LOCALQ" | grep INVALIDATION | grep total| tail -1|cut -d' ' -f 22|jsonpp 
{
  "timestampMillis": 1605104041350,
  "nodeId": "atl-node2",
  "queueSize": 0,
  "startQueueSize": 0,
  "startTimestampMillis": 1604096513275,
  "startMillisAgo": 1007528075,
  "closeCounter": 0,
.....
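
If jsonpp isn't available, any JSON pretty-printer works; for example, with Python's built-in json.tool module (assuming, as in the command above, that the stats JSON is the 22nd space-separated field of the log line):

grep "LOCALQ" atlassian-jira.log | grep INVALIDATION | grep total | tail -1 \
  | cut -d' ' -f 22 | python -m json.tool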


Solution


The problem was resolved by improving the overall disk performance of the cluster. For Jira Data Center nodes, disk performance is crucial to the overall health of the cluster. The data on a node's local disk can be rebuilt in case of failure: high availability is guaranteed by the cluster, and critical data is stored in the database or in the shared home.

Therefore, you're not required to use RAID arrays, which might impact the node's performance. Also, if old RAID controllers are used, even very fast SSDs can suffer from disk performance problems depending on the size and number of files being written during high-load hours: files can be buffered in the controller's memory, but flushing that data to disk takes time.

We recommend reviewing the logs above when assessing cluster replication problems as well, because, as seen here, the Disk Performance test doesn't always reveal the bottleneck in complex environments.
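A sequential-throughput test can look healthy while small random writes (which is what index updates mostly are) remain slow. To measure that directly, a tool such as fio can benchmark random-write latency against the Jira local home; the path, file size, and runtime below are illustrative and should be adapted to your environment:

# Random 4 KB synchronous writes on the Jira local home (adjust the directory);
# watch the reported completion-latency percentiles rather than MB/s
fio --name=jira-randwrite --directory=/var/atlassian/application-data/jira \
    --rw=randwrite --bs=4k --size=1g --direct=1 --fsync=1 \
    --numjobs=1 --runtime=60 --time_based --group_reporting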

This problem is also mentioned in Background reindex is slow after upgrading to Jira 8.10 and later.
