Monitoring OpenSearch

Monitoring is essential for maintaining the health and performance of your OpenSearch clusters, whether they're hosted on AWS or self-managed. Proactive monitoring helps you identify potential issues, such as resource constraints or cluster instability, before they impact your applications or users.

This guide provides practical recommendations for setting up monitoring and alerting for both AWS OpenSearch Service and self-hosted OpenSearch environments. It covers:

  • Commonly tracked metrics.

  • Approaches for configuring dashboards and alerts using AWS CloudWatch or open-source tools like Prometheus and Grafana.

  • Best practices for interpreting and responding to alerts.


Monitoring approaches

Several approaches are available for monitoring OpenSearch clusters. The best choice depends on your deployment model and operational preferences.

AWS-hosted OpenSearch

For clusters managed with AWS OpenSearch Service, AWS CloudWatch is the primary monitoring solution. CloudWatch automatically collects a wide range of metrics from your OpenSearch domain, such as cluster health, resource utilization, and search performance. You can create dashboards and set up configurable alarms to stay informed about your cluster’s status. More about monitoring AWS-hosted OpenSearch Service

Self-hosted OpenSearch

For self-hosted OpenSearch clusters, open-source monitoring tools such as Prometheus and Grafana are commonly used. The OpenSearch Prometheus Exporter plugin collects cluster metrics, which you can visualize and analyze with Grafana dashboards. This approach offers flexibility and customization, allowing you to tailor monitoring to your needs. Prometheus also supports alerting rules, so you can receive proactive notifications based on custom thresholds. More about monitoring Self-hosted OpenSearch
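
If you use the Prometheus Exporter plugin (for example, the community prometheus-exporter-plugin-for-opensearch), it commonly exposes cluster metrics over a dedicated REST endpoint. As a quick sketch, assuming that plugin's default endpoint path, you can confirm metrics are being served before pointing a Prometheus scrape job and alerting rules at the cluster:

    GET _prometheus/metrics

The response is Prometheus text format, so the same path is typically what you configure as the scrape target for the cluster.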

Key metrics and alerts

Regardless of whether your OpenSearch cluster is hosted on AWS or self-managed, monitoring the following core metrics is important for maintaining stability, performance, and reliability.

Cluster health and availability

  • Cluster status (green, yellow, red): overall health and shard allocation. Detects cluster issues early.

  • Node availability: join and leave events. Identifies unexpected node changes.

Resource utilization

  • Disk usage and free storage space. Prevents outages from full disks.

  • CPU utilization. Highlights resource bottlenecks.

  • JVM memory pressure and heap usage. Prevents performance degradation.

Performance metrics

  • Search latency and indexing latency: response time for search and indexing requests. Ensures a fast user experience.

  • Thread pool queues: size of the search and write queues. Identifies backlogs or slowdowns.

Error rates and failures

  • 5xx error rate. Detects instability or misconfiguration.

  • Automated snapshot failures and backup completion status. Ensures data protection.

Jira-specific metrics

  • Point-in-time (PIT) contexts: usage of PIT searches. Important for Jira search reliability.

  • Scroll contexts: usage of scroll APIs. Important for bulk data operations.
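
Most of these metrics can also be read directly from the OpenSearch APIs, which is useful for quick spot checks alongside CloudWatch or Prometheus dashboards. A minimal sketch using standard endpoints (adjust hosts and authentication for your cluster):

    GET _cluster/health

    GET _nodes/stats/jvm,fs

    GET _cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected

The health response reports the green, yellow, or red status and shard allocation, the node stats include JVM heap usage and free disk space, and the thread pool view shows queued or rejected search and write tasks.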

Troubleshooting and best practices

For general OpenSearch monitoring, use AWS CloudWatch alarms to track cluster health, resource usage, performance, and error rates. The recommended alarms documentation includes troubleshooting steps and best practices for each alarm. Explore recommended CloudWatch alarms for Amazon OpenSearch Service

Jira-specific metrics

Point-in-Time (PIT) metrics

Alarms on CurrentPointInTime (number of open PIT contexts) or AvgPointInTimeAliveTime (average lifetime of PIT contexts) might indicate that PIT searches aren’t closing promptly, or that the number of concurrent PIT contexts is approaching or exceeding cluster limits.
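
Before changing configuration, it can help to confirm how many PIT contexts are actually open. As a sketch, on OpenSearch versions that support the PIT API you can list the active contexts and, as a last resort, close them all; note that deleting all PITs will break any searches still using them:

    GET /_search/point_in_time/_all

    DELETE /_search/point_in_time/_all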

To address these alarms, you can:

  • Configure PIT keepalive duration
    Set the opensearch.pointintime.keepalive.seconds property in the jira-config.properties file to control how long a PIT remains active. Lowering this value can help ensure PIT contexts are closed sooner, minimizing resource usage. However, setting it too low might cause searches to fail, as PIT contexts could expire before queries complete. The default is 120 seconds; adjust it carefully based on your workload and monitoring data.

  • Monitor for unusual patterns
    If you notice a sudden increase in open PIT contexts, check for recent changes in Jira usage, such as new plugins, integrations, or bulk operations that could generate excessive PIT searches.

  • Increase PIT limits
    If your workload requires more concurrent PIT contexts, raise the limit by updating the search.max_open_point_in_time_context node setting using the OpenSearch REST API:

    PUT _cluster/settings
    {
      "persistent": {
        "search.max_open_point_in_time_context": <desired_limit>
      }
    }

    Increasing this limit will use more resources. Monitor cluster health and resource usage after making changes.
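
    After applying the change, you can confirm the value the cluster is actually using; the setting appears under the persistent settings once overridden, or under the defaults otherwise:

    GET _cluster/settings?include_defaults=true&flat_settings=true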

Scroll metrics

Alarms on ScrollCurrent (number of open scroll contexts) might indicate that scrolls aren't being cleaned up, leading to resource leaks and potential cluster instability.
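
As a sketch, you can check how many scroll contexts are open on each node and, if stale contexts have accumulated, clear them; clearing all scrolls interrupts any operations still reading from them, so use it carefully:

    GET _nodes/stats/indices/search

    DELETE /_search/scroll/_all

In the node stats response, scroll_current shows the number of open scroll contexts per node.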

To address these alarms, you can:

  • Check permissions
    Make sure Jira has the required permissions to delete or clear scroll contexts. Without proper permissions, scrolls may accumulate and not be cleaned up.

  • Monitor usage patterns
    If scroll usage remains high, review your bulk operations or long-running queries. Consider optimizing or batching them differently to reduce the number of open scroll contexts.

Enabling logging for debugging

Enabling detailed logging can help with troubleshooting and performance analysis. OpenSearch provides several logging options to help you identify issues such as slow queries or indexing bottlenecks. Enable these logs temporarily during troubleshooting to minimize performance impact (a configuration sketch for the shard-level logs follows this list):

  • Request-level slow query logs: Capture queries that exceed a set execution time. Use these logs to find inefficient or problematic queries.

  • Shard-level slow indexing logs: Record indexing operations that are slower than expected at the shard level.

  • Shard-level slow search logs: Record search operations that are slow at the shard level.
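
Shard-level slow logs are enabled per index through dynamic index settings. The sketch below uses an illustrative index name (my-index) and thresholds; tune the thresholds to your workload and reset the settings to null once troubleshooting is complete:

    PUT /my-index/_settings
    {
      "index.search.slowlog.threshold.query.warn": "5s",
      "index.search.slowlog.threshold.fetch.warn": "1s",
      "index.indexing.slowlog.threshold.index.warn": "10s"
    }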

For more information, check:

Additional resources


