Web Crawler Bots and Confluence: How Public Access Can Lead to Performance Issues
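Platform notice: Data Center - This article applies to Atlassian products on the Data Center platform.
This knowledge base article was created for the Data Center version of the product. Data Center knowledge base articles for features that are not specific to Data Center may also work for Server versions of the product, but they have not been tested. Support for Server* products ended on February 15, 2024. If you are running a Server product, see the Atlassian Server end of support announcement page to review your migration options.
*Except Fisheye and Crucible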
Summary
When Confluence instances are exposed to the internet, they become vulnerable to web crawler bots that can access and scrape the instance's endpoints to retrieve public data. These crawler bots often ignore the directives specified in a domain's robots.txt file, leading to excessive and aggressive requests.
This behavior can significantly increase network traffic to the Confluence instance, potentially overwhelming the application. As a result, this can lead to performance degradation and even outages.
Such issues are frequently misinterpreted as problems within the Confluence application itself or in the servers' hosting environment. It is therefore important for administrators to recognize the impact of these bots and implement appropriate measures to mitigate their effects on system performance and availability.
Environment
All Confluence versions that are publicly accessible through the internet.
Diagnosis
When diagnosing the impact of web crawler bots on your Confluence instance, it's important to recognize the symptoms that may indicate their presence. These symptoms often mimic those of typical performance issues, making them challenging to identify. They can occur at regular intervals, during specific timeframes, or persist continuously.
Key indicators of potential bot-related performance degradation include:
- Inconsistent Performance: You may notice fluctuations in performance that do not align with normal usage patterns. This can manifest as random slowdowns or brief periods of unresponsiveness.
- Increased Server Load: A sudden or unexplained spike in server CPU usage could suggest an influx of automated requests from bots.
- High Memory Pressure: The server hosting the application is using an unusually high amount of memory, and the heap usage of Confluence's JVM is consistently reaching or nearing its maximum limit (Xmx value).
- Network Traffic Anomalies: Higher-than-normal network traffic, particularly from a small number of IP addresses or user agents, might indicate bot activity.
- Absence of Typical Performance Bottlenecks: In some cases, unlike common performance issues, your Confluence instance may show none of the traditional signs, such as stuck threads, database latency, or heap pressure. This absence makes it difficult to pinpoint the cause of the degradation.
A reliable way to confirm whether your Confluence instance is being overloaded by web crawler bots is to analyze the Tomcat access logs.
By default, these logs are stored at <confluence-installation-folder>/logs/conf_access_log.YYYY-MM-DD.log and provide a detailed record of all incoming requests to your Confluence server, including identifiers that can help you distinguish requests originating from bots.
From Confluence 7.11 onward, access logging is enabled by default. If your instance is not recording these logs, follow the Configure access logs documentation to enable them.
As an example, the following short extract from the Tomcat access logs shows multiple requests from crawler bots reaching the instance.
Note that the bot name appears in the user-agent string at the end of each request.
[13/Jun/2024:21:21:17 +0200] - http-nio-8090-exec-2 110.41.65.179 GET /label/force%2Bmeeting%2Bmethod%2Bon%2Bpet-lab%2Btask%2Bun HTTP/1.1 200 58ms 9490 https://<instance-base-URL>/label/force%2Bmeeting%2Bmethod%2Bon%2Bpet-lab%2Bsuv%2Btask%2Bun Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PanguBot;pangubot@huawei.com)
[13/Jun/2024:21:21:17 +0200] - http-nio-8090-exec-27 52.167.144.206 GET /label/aggregate+coverage+database_management+estimation+eu+intra_regional_trade+qa+territory+world HTTP/1.1 200 63ms 9080 - Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
[13/Jun/2024:21:21:18 +0200] - http-nio-8090-exec-13 114.119.132.146 GET /label/chapter1%2Bchapter20%2Bcompiler%2Bconcepts%2Bconstruction%2Bcosta_rica%2Bdata_sources%2Bestablishment_survey HTTP/1.1 200 103ms 9615 https://<instance-base-URL>/label/chapter1%2Bchapter20%2Bcompiler%2Bconcepts Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
[13/Jun/2024:21:21:22 +0200] - http-nio-8090-exec-24 52.14.134.113 GET /label/border_trade+data_compilation+hs_system+inward_processing+quality_assurance+ships_and_crafts+temporary_admission HTTP/1.1 200 79ms 9016 - Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
In just a few seconds, this log snippet reveals activity from four distinct bots accessing a publicly available instance. If left unchecked, these bots can generate up to 2 million requests in a single day, severely impacting performance and potentially causing outages.
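To get a quick overview of which bots and IP addresses dominate an access log, the requests can be counted per user agent and per client IP. The following is a minimal sketch, not an official tool: it assumes each log line follows the same field order as the extract above, and the names `LINE_RE`, `BOT_RE`, and `summarize` are hypothetical. The bot-name match (first token ending in "bot") is only a heuristic and will miss crawlers that do not identify themselves.

```python
import re
from collections import Counter

# Assumed layout, inferred from the extract above:
# [timestamp] user thread ip method url protocol status <n>ms bytes referer user-agent
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] (?P<user>\S+) (?P<thread>\S+) (?P<ip>\S+) '
    r'(?P<method>\S+) (?P<url>\S+) (?P<proto>\S+) (?P<status>\d{3}) '
    r'(?P<ms>\d+)ms (?P<size>\S+) (?P<referer>\S+) (?P<agent>.*)$'
)

# Heuristic: the first token ending in "bot", e.g. "bingbot" or "ClaudeBot".
BOT_RE = re.compile(r'([A-Za-z0-9_-]*bot)\b', re.IGNORECASE)


def summarize(lines):
    """Return (requests per bot name, requests per client IP) as Counters."""
    bots, ips = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        ips[m['ip']] += 1
        bot = BOT_RE.search(m['agent'])
        bots[bot.group(1).lower() if bot else '(no bot marker)'] += 1
    return bots, ips
```

To use it, open one day's conf_access_log file, pass the file object to `summarize`, and print `bots.most_common(10)` and `ips.most_common(10)`. If your access log pattern has been customized, the regular expression needs to be adjusted accordingly.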
Beyond log analysis, it's also advisable to examine access patterns at a network level. This can be done by reviewing the instance's access history through your firewall, Web Application Firewall (WAF) or proxy. Look for unusual patterns or spikes in traffic that may indicate bot activity, such as rapid, repeated requests from specific IP addresses.
Solution
To protect your Confluence instance from performance issues caused by excessive requests from web crawler bots, consider implementing the following strategies. Combined, these measures can help mitigate the risk of outages:
Configure the robots.txt file
Use the robots.txt file to provide instructions to web crawlers about which parts of your site should not be accessed or indexed. This file needs to be published at the root of your Confluence instance's internet domain (e.g. confluence.mycompany.com/robots.txt).
- Disallow unnecessary paths: Add directives to disallow crawlers from accessing certain paths, especially those that are resource-intensive. For example:
    User-agent: *
    Disallow: /label
    Disallow: /download
    Disallow: /rest
- Allow essential paths: If there are paths that should remain accessible to certain bots, specify them with Allow directives.
Before making any changes, consult with your networking team to ensure correct configuration of this file.
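Before publishing the file, it can help to check offline that the directives match the paths you expect. The sketch below uses Python's standard `urllib.robotparser` module with the example directives from this section. The helper name `blocked_paths`, the user agent `ExampleBot`, and the sample paths are hypothetical and only serve as an illustration. It shows how a compliant crawler would interpret the rules; it does nothing against bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The example directives from this section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /label
Disallow: /download
Disallow: /rest
"""


def blocked_paths(robots_txt, user_agent, paths):
    """Return the paths a compliant crawler would not be allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# Hypothetical sample paths; replace them with paths from your own access logs.
sample = [
    "/label/foo",
    "/download/attachments/1/a.pdf",
    "/rest/api/content",
    "/display/DOC/Home",
]
print(blocked_paths(ROBOTS_TXT, "ExampleBot", sample))
# prints ['/label/foo', '/download/attachments/1/a.pdf', '/rest/api/content']
```

Disallow rules are prefix matches, so `Disallow: /label` also covers every URL below /label/.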
Block Malicious IP addresses
Since not all bots adhere to robots.txt directives, it is crucial to map and block the IP addresses associated with problematic web crawlers. Ask your networking team to use security tools and/or firewall settings to block these IPs so that they can no longer affect your network's performance.
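When preparing a list for the networking team, it can be useful to group the offending addresses into network ranges, since crawlers often rotate through neighbouring IPs. The following sketch assumes you already have request counts per IP (for example from the access logs). The function name `block_candidates`, the threshold, and the /24 and /64 prefix lengths are arbitrary assumptions for illustration, not Atlassian recommendations.

```python
import ipaddress
from collections import Counter


def block_candidates(ip_counts, threshold, v4_prefix=24, v6_prefix=64):
    """Group per-IP request counts by network and return busy networks.

    ip_counts maps an IP address string to its request count. The result is
    a list of (network, total_requests) pairs with total >= threshold,
    ordered from most to least requests.
    """
    per_network = Counter()
    for ip, count in ip_counts.items():
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue  # not an IP address (e.g. a hostname); skip it
        prefix = v4_prefix if addr.version == 4 else v6_prefix
        network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        per_network[network] += count
    return [(str(net), total)
            for net, total in per_network.most_common()
            if total >= threshold]
```

Treat the output as candidates to review rather than a ready-made block list: a whole range may also contain legitimate users, so confirm each entry with your networking team before blocking it.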
Monitor and Analyze Traffic
- Regularly monitor the traffic to your Confluence instance to identify patterns of bot activity. Use this data to adjust your defense strategies as needed.
- Implement analytics tools to help differentiate between human users and bots, which can inform further tuning of your access policies.
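One simple pattern to monitor is the number of requests a single IP address sends per minute. The sketch below is an assumption-laden illustration rather than a replacement for a proper analytics or WAF tool: it expects (timestamp, IP) pairs with timestamps in the format shown in the access log extract (for example `13/Jun/2024:21:21:17 +0200`), and the name `bursts` and the per-minute limit are hypothetical. The `%b` month parsing depends on the Python process running with an English locale.

```python
from collections import Counter
from datetime import datetime

# Timestamp format assumed from the access log extract,
# e.g. "13/Jun/2024:21:21:17 +0200".
TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def bursts(events, max_per_minute):
    """Return (minute, ip, count) tuples where an IP exceeded the limit.

    events is an iterable of (timestamp_string, ip_string) pairs.
    """
    per_minute = Counter()
    for ts, ip in events:
        # Truncate the timestamp to the start of its minute.
        minute = datetime.strptime(ts, TS_FORMAT).replace(second=0)
        per_minute[(minute, ip)] += 1
    return sorted(
        (minute.isoformat(), ip, count)
        for (minute, ip), count in per_minute.items()
        if count > max_per_minute
    )
```

A sensible limit depends on your normal usage, so compare the results against a period of known-good traffic before using them to tune access policies.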
By applying these strategies, you can significantly reduce the risk of your Confluence instance being overwhelmed by bot traffic, ensuring smoother performance and reducing the likelihood of outages.