Build resiliency in Bamboo Data Center
In Bamboo versions earlier than 8.0, builds failed when the server’s work was interrupted or the server went down for more than 5 minutes, because the building agent lost its connection to the server. Bamboo agents were designed to shut down when they couldn't connect to a server for longer than 5 minutes.
With Bamboo Data Center, the agent continues its work and finishes the build even if the connection to the server is lost. Once the build is done, the agent tries to connect to the server. If the server is back online, the agent sends the build results, logs, and artifacts to the server and picks up its next tasks. If the server is still down, the agent tries to reconnect after some time.
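The finish-then-reconnect behaviour described above can be sketched as a simple retry loop. This is only an illustration and not Bamboo's actual agent code: the names BuildResult, deliver_when_server_returns, and FlakyServer, as well as the 30-second retry delay, are assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BuildResult:
    """Hypothetical container for what the agent hands back to the server."""
    status: str
    logs: List[str] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)


def deliver_when_server_returns(
    result: BuildResult,
    send: Callable[[BuildResult], None],
    retry_delay_seconds: float = 30.0,  # assumed value, not a Bamboo default
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry `send` until it succeeds and return the number of attempts.

    `send` is assumed to raise ConnectionError while the server is down.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            send(result)
            return attempts
        except ConnectionError:
            # Server still unreachable: keep the finished result and retry later.
            sleep(retry_delay_seconds)


class FlakyServer:
    """Fake server that rejects the first `failures` delivery attempts."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.received: List[BuildResult] = []

    def receive(self, result: BuildResult) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("server is down")
        self.received.append(result)


if __name__ == "__main__":
    server = FlakyServer(failures=2)
    finished = BuildResult(status="SUCCESS", logs=["compiled"], artifacts=["app.jar"])
    # The sleep is stubbed out so the demonstration finishes immediately.
    attempts = deliver_when_server_returns(finished, server.receive, sleep=lambda _: None)
    print(attempts, server.received[0].status)
```

The key point of the sketch is that the build result is produced before any delivery attempt, so a server outage delays reporting but does not discard the finished work.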
If the transmission problems are caused by a network failure, the effective timeout is considerably shorter, because in that case the server recognizes that the agent is offline and terminates the build on its end. This behaviour is configured by heartbeat timeouts. For more information, see Changing the remote agent heartbeat interval.
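As a rough illustration of how a heartbeat timeout lets a server decide that an agent is offline, consider the sketch below. It is a generic model, not Bamboo's implementation: the HeartbeatMonitor class, its method names, and the 60-second timeout are assumptions chosen for the example.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per agent and reports agents that timed out."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def record_heartbeat(self, agent_id: str, now: float) -> None:
        # Remember when this agent last reported in.
        self._last_seen[agent_id] = now

    def offline_agents(self, now: float) -> List[str]:
        # An agent counts as offline once its last heartbeat is older than the timeout.
        return sorted(
            agent_id
            for agent_id, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=60)  # assumed timeout
    monitor.record_heartbeat("agent-a", now=0)
    monitor.record_heartbeat("agent-b", now=50)
    # At t=100, agent-a has been silent for 100 s and agent-b for 50 s.
    print(monitor.offline_agents(now=100))
```

In this model, a shorter timeout means the server gives up on a silent agent sooner, which is why a network failure can end a build earlier than a server outage would.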
It is important to understand that this improved build resiliency to server failures will work only if the build process can be finished. Bamboo will not be able to finish the build if:
a child process is failing or stopped
an agent process is stopped while the build is running
a resource required for the build process is unavailable (this includes resources provided by the Bamboo server, like REST endpoints and artifacts from other builds)
a build is failing because of intermittent infrastructure problems
Build resiliency with elastic agents
The same logic applies to agents started in the EC2 environment. To achieve this, the Bamboo agent is started using the Tanuki wrapper, which is also used by the remote agent. The wrapper allows the Bamboo agent to be restarted when the Java process is interrupted by a connection timeout error.
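The restart behaviour that a process wrapper provides can be pictured as a small supervisor loop. The sketch below is conceptual only: the real Tanuki wrapper is a separate tool with its own configuration, and the supervise function, its max_restarts parameter, and the stand-in commands are assumptions made for illustration.

```python
import subprocess
import sys
from typing import List


def supervise(command: List[str], max_restarts: int) -> int:
    """Run `command`, relaunching it while it exits with a non-zero code.

    Returns the total number of launches. Stops after a clean exit or once
    `max_restarts` relaunches have been used up.
    """
    launches = 0
    while True:
        launches += 1
        exit_code = subprocess.run(command).returncode
        if exit_code == 0 or launches > max_restarts:
            return launches


if __name__ == "__main__":
    # Stand-in for an agent process that always dies with an error.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    # Stand-in for an agent process that exits cleanly.
    healthy = [sys.executable, "-c", "pass"]
    print(supervise(failing, max_restarts=2))
    print(supervise(healthy, max_restarts=2))
```

Keeping the supervisor outside the supervised process is the design point: the agent can fail on a connection timeout and still be brought back without manual intervention.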
If you’re using elastic images provided by Bamboo 8.0 (or based on them), elastic agents use the agent wrapper and can fully benefit from improved build resiliency. Old images are still functional but will work with the ‘short’ timeout only.
After a server restart, elastic agents that use the agent wrapper can fully resume their operation. Agents without the wrapper are allowed to return the result they were working on, but they then terminate.
Disabling the elastic tunnel is no longer a prerequisite for seamless restarts and improved build resiliency.