Monitoring OpenSearch
Monitoring approaches
Several approaches are available for monitoring OpenSearch clusters. The best choice depends on your deployment model and operational preferences.
AWS-hosted OpenSearch
For clusters hosted on Amazon OpenSearch Service, AWS CloudWatch is the primary monitoring solution. CloudWatch automatically collects a wide range of metrics from your OpenSearch domain, such as cluster health, resource utilization, and search performance. You can create dashboards and set up configurable alarms to stay informed about your cluster’s status. More about monitoring AWS-hosted OpenSearch Service
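For example, such an alarm can be created programmatically. The following is a minimal sketch using boto3's put_metric_alarm to alert on a red cluster status; the region, domain name, account ID, and SNS topic ARN are placeholders, so substitute your own values and follow AWS's recommended thresholds for your domain.

```python
# Minimal sketch (assumptions: region, domain name, account ID, and SNS topic
# are placeholders): alert when the cluster status turns red.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="opensearch-cluster-status-red",
    Namespace="AWS/ES",  # namespace used by Amazon OpenSearch Service domain metrics
    MetricName="ClusterStatus.red",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-jira-opensearch-domain"},  # placeholder
        {"Name": "ClientId", "Value": "123456789012"},                 # placeholder account ID
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:opensearch-alerts"],  # placeholder SNS topic
)
```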
Self-hosted OpenSearch
For self-hosted OpenSearch clusters, open-source monitoring tools such as Prometheus and Grafana are commonly used. The OpenSearch Prometheus Exporter plugin collects cluster metrics, which you can visualize and analyze with Grafana dashboards. This approach offers flexibility and customization, allowing you to tailor monitoring to your needs. Prometheus also supports alerting rules, so you can receive proactive notifications based on custom thresholds. More about monitoring Self-hosted OpenSearch
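Before wiring Prometheus to the cluster, it can help to confirm the exporter plugin is responding. This is a minimal sketch that assumes the plugin serves metrics at /_prometheus/metrics (the default path for the community prometheus-exporter-plugin-for-opensearch) and uses placeholder host and credentials.

```python
# Minimal sketch: confirm the Prometheus Exporter plugin is serving metrics.
# Assumptions: host, port, and credentials are placeholders; /_prometheus/metrics
# is the plugin's default path -- adjust if your deployment differs.
import requests

response = requests.get(
    "https://opensearch.example.com:9200/_prometheus/metrics",
    auth=("admin", "changeme"),  # placeholder credentials
    timeout=10,
)
response.raise_for_status()

# Print the first few exported metric lines as a smoke test of the scrape target.
for line in response.text.splitlines()[:10]:
    print(line)
```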
Key metrics and alerts
Regardless of whether your OpenSearch cluster is hosted on AWS or self-managed, monitoring the following core metrics is important for maintaining stability, performance, and reliability.
| Category | Metric | Why it matters |
|---|---|---|
| Cluster health and availability | Cluster status (green, yellow, red): overall health and shard allocation | Detects cluster issues early |
| | Node availability: join and leave events | Identifies unexpected node changes |
| Resource utilization | Disk usage and free storage space | Prevents outages from full disks |
| | CPU utilization | Highlights resource bottlenecks |
| | JVM memory pressure, heap usage | Prevents performance degradation |
| Performance metrics | Search latency and indexing latency: response time for search and indexing | Ensures fast user experience |
| | Thread pool queues: size of search/write queues | Identifies backlogs or slowdowns |
| Error rates and failures | 5xx error rate frequency | Detects instability or misconfiguration |
| | Automated snapshot failures, backup completion status | Ensures data protection |
| Jira-specific metrics | Point-in-time (PIT) contexts: usage of PIT searches | Important for Jira search reliability |
| | Scroll contexts: usage of scroll APIs | Important for bulk data operations |
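On AWS-hosted domains, these metrics are available in CloudWatch and can also be pulled ad hoc for spot checks or custom reporting. The following is a minimal sketch using boto3's get_metric_statistics to read recent free-storage datapoints; the region, domain name, and account ID are placeholders.

```python
# Minimal sketch (placeholders: region, domain name, account ID): read the last
# hour of FreeStorageSpace datapoints for an OpenSearch Service domain.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="FreeStorageSpace",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-jira-opensearch-domain"},  # placeholder
        {"Name": "ClientId", "Value": "123456789012"},                 # placeholder account ID
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Minimum"],
)

# FreeStorageSpace is reported in megabytes.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```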
Troubleshooting and best practices
For general OpenSearch monitoring, use AWS CloudWatch alarms to track cluster health, resource usage, performance, and error rates. Each recommended alarm comes with troubleshooting steps and best practices. Explore recommended CloudWatch alarms for Amazon OpenSearch Service
Jira-specific metrics
Point-in-Time (PIT) metrics
Alarms on CurrentPointInTime (the number of open PIT contexts) or AvgPointInTimeAliveTime (the average lifetime of PIT contexts) can indicate that PIT searches aren’t being closed promptly, or that the number of concurrent PIT contexts is approaching or exceeding cluster limits.
To address these alarms, you can:
Configure PIT Keepalive duration
Set the opensearch.pointintime.keepalive.seconds property in the jira-config.properties file to control how long a PIT remains active. Lowering this value helps ensure PIT contexts are closed sooner, minimizing resource usage. However, setting it too low can cause searches to fail, because PIT contexts may expire before queries complete. The default is 120 seconds; adjust it carefully based on your workload and monitoring data.
Monitor for unusual patterns
If you notice a sudden increase in open PIT contexts, check for recent changes in Jira usage, such as new plugins, integrations, or bulk operations that could generate excessive PIT searches.
Increase PIT Limits
If your workload requires more concurrent PIT contexts, raise the limit by updating the search.max_open_point_in_time_context node setting using the OpenSearch REST API:

PUT _cluster/settings
{
  "persistent": {
    "search.max_open_point_in_time_context": <desired_limit>
  }
}

Increasing this limit consumes more cluster resources, so monitor cluster health and resource usage after making changes.
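Before or after changing the limit, you can check how many PIT contexts are actually open and how long they have been alive. This sketch assumes OpenSearch's list-all-PITs endpoint (GET /_search/point_in_time/_all, available in recent OpenSearch releases) and uses placeholder connection details.

```python
# Minimal sketch: list open PIT contexts to correlate CurrentPointInTime alarms
# with real usage. Assumptions: the "_all" PIT endpoint exists on your OpenSearch
# version; host and credentials are placeholders.
import requests

response = requests.get(
    "https://opensearch.example.com:9200/_search/point_in_time/_all",
    auth=("admin", "changeme"),  # placeholder credentials
    timeout=10,
)
response.raise_for_status()

pits = response.json().get("pits", [])
print(f"Open PIT contexts: {len(pits)}")
for pit in pits:
    # creation_time and keep_alive are reported in milliseconds in the PIT API response.
    print(pit.get("creation_time"), pit.get("keep_alive"))
```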
Scroll Metrics
Alarms on ScrollCurrent (number of open scroll contexts) might indicate that scrolls aren't being cleaned up, leading to resource leaks and potential cluster instability.
To address these alarms, you can:
Check permissions
Make sure Jira has the required permissions to delete or clear scroll contexts. Without proper permissions, scrolls may accumulate and not be cleaned up.
Monitor usage patterns
If scroll usage remains high, review your bulk operations and long-running queries. Consider optimizing or batching them differently to reduce the number of open scroll contexts.
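To confirm whether scroll contexts are actually accumulating, you can read the per-node search statistics, which include an open-scroll counter. This is a minimal sketch that sums scroll_current from the node stats API; the host and credentials are placeholders.

```python
# Minimal sketch: sum open scroll contexts across nodes via the node stats API
# (indices.search.scroll_current). Host and credentials are placeholders.
import requests

response = requests.get(
    "https://opensearch.example.com:9200/_nodes/stats/indices/search",
    auth=("admin", "changeme"),  # placeholder credentials
    timeout=10,
)
response.raise_for_status()

total_open_scrolls = 0
for node_id, node in response.json()["nodes"].items():
    scroll_current = node["indices"]["search"]["scroll_current"]
    total_open_scrolls += scroll_current
    print(f"{node.get('name', node_id)}: {scroll_current} open scroll contexts")

print(f"Total open scroll contexts: {total_open_scrolls}")
```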
Enabling logging for debugging
Enabling detailed logging can help with troubleshooting and performance analysis. OpenSearch provides several logging options to help you identify issues such as slow queries or indexing bottlenecks. Enable these logs temporarily during troubleshooting to minimize performance impact:
Request-level slow query logs: Capture queries that exceed a set execution time. Use these logs to find inefficient or problematic queries.
Shard-level slow indexing logs: Record indexing operations that are slower than expected at the shard level.
Shard-level slow search logs: Record search operations that are slow at the shard level.
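Slow search and slow indexing logs are controlled by dynamic, per-index threshold settings. The following is a minimal sketch that applies warn-level thresholds to one index via the index settings API; the index name, threshold values, host, and credentials are placeholders, and the thresholds can be set back to -1 to disable the logs once troubleshooting is complete.

```python
# Minimal sketch: temporarily enable shard-level slow search and slow indexing
# logs on one index. Placeholders: index name, thresholds, host, credentials.
# Set the thresholds to "-1" to disable the logs again after troubleshooting.
import requests

settings = {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.indexing.slowlog.threshold.index.warn": "10s",
}

response = requests.put(
    "https://opensearch.example.com:9200/my-jira-index/_settings",  # placeholder index name
    json=settings,
    auth=("admin", "changeme"),  # placeholder credentials
    timeout=10,
)
response.raise_for_status()
print(response.json())  # expect {'acknowledged': True}
```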
For more information, check:
Additional resources
Identifying slow JQL queries