Web Crawler Bots and Confluence: How Public Access Can Lead to Performance Issues


About the platform: Data Center - This article applies to Atlassian products on the Data Center platform.

This knowledge base article was created for the Data Center version of the product. Knowledge base content for features that are not Data Center-specific may also work for the Server version of the product, but it has not been tested. Support for Server* products ended on February 15, 2024. If you are running a Server product, you can review your migration options on Atlassian's Server end-of-support announcement page.

*Except Fisheye and Crucible

Summary

When Confluence instances are exposed to the internet, they become vulnerable to web crawler bots that can access and scrape the instance's endpoints to retrieve public data. These crawler bots often ignore the directives specified in a domain's robots.txt file, leading to excessive and aggressive requests.

This behavior can significantly increase network traffic to the Confluence instance, potentially overwhelming the application. As a result, this can lead to performance degradation and even outages.

Such issues are frequently misinterpreted as problems within the Confluence application itself or the servers' hosting environment, so it's important for administrators to recognize the impact of these bots and implement appropriate measures to mitigate their effects on system performance and availability.

Environment

All Confluence versions that are publicly accessible through the internet.

Diagnosis

When diagnosing the impact of web crawler bots on your Confluence instance, it's important to recognize the symptoms that may indicate their presence. These symptoms often mimic those of typical performance issues, making them challenging to identify. They can occur at regular intervals, during specific timeframes, or persist continuously.

Key indicators of potential bot-related performance degradation include:

  1. Inconsistent Performance: You may notice fluctuations in performance that do not align with normal usage patterns. This can manifest as random slowdowns or brief periods of unresponsiveness.

  2. Increased Server Load: A sudden or unexplained spike in server CPU usage could suggest an influx of automated requests from bots.

  3. High Memory Pressure: The server hosting the application is using an unusually high amount of memory, and the heap usage of Confluence's JVM is consistently reaching or nearing its maximum limit (Xmx value).

  4. Network Traffic Anomalies: Higher-than-normal network traffic, particularly from a small number of IP addresses or user agents, might indicate bot activity.

  5. Absence of Typical Performance Bottlenecks: Unlike common performance issues, your Confluence instance may not exhibit traditional signs such as stuck threads, database latency, or heap pressure. This absence makes it difficult to pinpoint the cause of degradation.

A reliable way to confirm whether your Confluence instance is being overloaded by web crawler bots is to analyze the Tomcat access logs.

These logs are stored in <confluence-installation-folder>/logs/conf_access_log.YYYY-MM-DD.log by default and provide a detailed record of all incoming requests to your Confluence server, including identifiers that can help you distinguish requests originating from bots.

From Confluence 7.11 onward, access logging is enabled by default. If your instance is not recording these logs, review the Configure access logs documentation and configure it accordingly.

As an example, below is a small extract from the Tomcat access logs showing multiple requests originating from crawler bots reaching the instance:

Note that the bot names appear in the user agent string at the end of each log entry.

[13/Jun/2024:21:21:17 +0200] - http-nio-8090-exec-2 110.41.65.179 GET /label/force%2Bmeeting%2Bmethod%2Bon%2Bpet-lab%2Btask%2Bun HTTP/1.1 200 58ms 9490 https://<instance-base-URL>/label/force%2Bmeeting%2Bmethod%2Bon%2Bpet-lab%2Bsuv%2Btask%2Bun Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PanguBot;pangubot@huawei.com)

[13/Jun/2024:21:21:17 +0200] - http-nio-8090-exec-27 52.167.144.206 GET /label/aggregate+coverage+database_management+estimation+eu+intra_regional_trade+qa+territory+world HTTP/1.1 200 63ms 9080 - Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36

[13/Jun/2024:21:21:18 +0200] - http-nio-8090-exec-13 114.119.132.146 GET /label/chapter1%2Bchapter20%2Bcompiler%2Bconcepts%2Bconstruction%2Bcosta_rica%2Bdata_sources%2Bestablishment_survey HTTP/1.1 200 103ms 9615 https://<instance-base-URL>/label/chapter1%2Bchapter20%2Bcompiler%2Bconcepts Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)

[13/Jun/2024:21:21:22 +0200] - http-nio-8090-exec-24 52.14.134.113 GET /label/border_trade+data_compilation+hs_system+inward_processing+quality_assurance+ships_and_crafts+temporary_admission HTTP/1.1 200 79ms 9016 - Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)

In just a few seconds, this log snippet reveals activity from four distinct bots accessing a publicly available instance. If left unchecked, these bots can generate up to 2 million requests in a single day, severely impacting performance and potentially causing outages.
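
To quantify this activity, you can tally requests per crawler directly from the access logs. Below is a minimal Python sketch, assuming the log format shown above (user agent string at the end of each line); the log path is hypothetical, so adjust it to match your environment.

    #!/usr/bin/env python3
    import glob
    import re
    from collections import Counter

    # Hypothetical path - point this at your actual Confluence logs folder.
    LOG_GLOB = "/opt/atlassian/confluence/logs/conf_access_log.*.log"

    # Matches crawler names such as PanguBot, bingbot, PetalBot, or ClaudeBot
    # in the user agent string at the end of each log line.
    BOT_PATTERN = re.compile(r"\b(\w*bot)\b", re.IGNORECASE)

    counts = Counter()
    for path in glob.glob(LOG_GLOB):
        with open(path, errors="replace") as log:
            for line in log:
                match = BOT_PATTERN.search(line)
                if match:
                    counts[match.group(1).lower()] += 1

    # Print the busiest crawlers first.
    for bot, hits in counts.most_common(20):
        print(f"{hits:>8}  {bot}")

If a handful of user agents dominates the output, that is a strong signal that crawler traffic, rather than human usage, is driving the load.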

Beyond log analysis, it's also advisable to examine access patterns at a network level. This can be done by reviewing the instance's access history through your firewall, Web Application Firewall (WAF) or proxy. Look for unusual patterns or spikes in traffic that may indicate bot activity, such as rapid, repeated requests from specific IP addresses.

Solution

To protect your Confluence instance from performance issues caused by excessive requests from web crawler bots, consider implementing the following strategies. Combined, these measures can help mitigate the risk of outages:

Configure the robots.txt file

Use the robots.txt file to provide instructions to web crawlers about which parts of your site should not be accessed or indexed. This file needs to be published at the root of your Confluence instance's internet domain (e.g. confluence.mycompany.com/robots.txt).

  1. Disallow unnecessary paths: Add directives to disallow crawlers from accessing certain paths, especially those that are resource-intensive. For example:

    User-agent: *
    Disallow: /label
    Disallow: /download
    Disallow: /rest
  2. Allow essential paths: If there are paths that should remain accessible to certain bots, specify them with Allow directives, as in the example below.
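
For example, a hypothetical robots.txt combining both directive types might look like the snippet below. It blocks resource-intensive paths for all crawlers while letting one specific crawler (bingbot is used purely as an illustration) index regular pages under /display, and asks it to slow down via Crawl-delay, a directive that some crawlers honor but others ignore:

    User-agent: bingbot
    Crawl-delay: 10
    Allow: /display
    Disallow: /label
    Disallow: /download
    Disallow: /rest

    User-agent: *
    Disallow: /label
    Disallow: /download
    Disallow: /rest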

Before making any changes, consult with your networking team to ensure correct configuration of this file.

Block Malicious IP addresses

Since not all bots adhere to robots.txt directives, it is crucial to map and block the IP addresses associated with problematic web crawlers. Reach out to your networking team so they can use security tools and/or firewall settings to block these IPs, preventing them from impacting your network's performance.
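
To give your networking team a concrete starting point, you can extract the source IP addresses behind a specific crawler from the same access logs. Below is a minimal Python sketch; the log path is hypothetical, the bot name is an example, and the regex assumes the IPv4 source address appears before the user agent on each line, as in the extract shown earlier:

    #!/usr/bin/env python3
    import glob
    import re
    from collections import Counter

    LOG_GLOB = "/opt/atlassian/confluence/logs/conf_access_log.*.log"  # hypothetical path
    BOT_NAME = "PetalBot"  # example: the crawler whose IPs you want to map

    # First IPv4-looking token on each line; extend this if your instance logs IPv6.
    IP_PATTERN = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")

    hits = Counter()
    for path in glob.glob(LOG_GLOB):
        with open(path, errors="replace") as log:
            for line in log:
                if BOT_NAME in line:
                    match = IP_PATTERN.search(line)
                    if match:
                        hits[match.group(1)] += 1

    # Review this list with your networking team before blocking anything,
    # as aggressive crawlers often rotate across many addresses.
    for ip, count in hits.most_common():
        print(f"{count:>8}  {ip}")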

Monitor and Analyze Traffic

  • Regularly monitor the traffic to your Confluence instance to identify patterns of bot activity. Use this data to adjust your defense strategies as needed.
  • Implement analytics tools to help differentiate between human users and bots, which can inform further tuning of your access policies.

By applying these strategies, you can significantly reduce the risk of your Confluence instance being overwhelmed by bot traffic, ensuring smoother performance and reducing the likelihood of outages.

Last modified on October 7, 2024
