Stash is now known as Bitbucket Server.
See the Bitbucket Server version of this page, or visit the Bitbucket Server documentation home page.


On this page

This page describes how to set up a single Stash server in a highly available configuration.

For production installs, we strongly recommend that you first read Using Stash in the enterprise.

For Active/Active HA with Stash, see Bitbucket Data Center resources instead.

 

If Stash is a critical part of your development workflow, maximizing application availability becomes an important consideration. There are many possible configurations for setting up an HA environment for Stash, depending on the infrastructure components and software (SAN, clustered databases, etc.) you have at your disposal. This guide provides a high-level overview and the background information you need to be able to set up a single Stash server in a highly available configuration. The guide also describes one possible configuration in more detail.

Note that Atlassian's Bitbucket Data Center resources product uses a cluster of Stash nodes to provide Active/Active failover. It is the deployment option of choice for larger enterprises that require high availability and performance at scale. Read about Failover for Bitbucket Data Center.

 

Please note that your feedback and comments are welcome! We very much value additional lessons learned from your experience with alternative scenarios!

 

High availability

High availability describes a set of practices aimed at delivering a specific level of "availability" by eliminating and/or mitigating failure through redundancy. Failure can result from unscheduled down-time due to network errors, hardware failures or application failures, but can also result from failed application upgrades. Setting up a highly available system involves:

Proactive concerns

    • Change management (including staging and production instances for rolling out changes)
    • Redundancy of the network, the application, storage and the database
    • Monitoring of the system at both the network and the application level

Reactive concerns

    • Technical failover mechanisms, either automatic or scripted semi-automatic with manual switchover
    • Standard operating procedures to guide actions in crisis situations

This guide assumes that processes such as change management are already covered, and focuses on redundancy/replication and failover procedures. When setting up the infrastructure to recover quickly from system or application failure, there are a number of options. These options vary in the level of uptime they provide. In general, as the required uptime increases, so do the complexity of the infrastructure and the knowledge required to administer the environment (and with it, the cost).

Understanding the availability requirements for Stash

Subversion, CVS, ClearCase and many other central version control systems require a central server to be available for just about any operation of the version control system. Committing code, fetching the latest changes from the repository, switching branches and retrieving a diff all require access to the central version control system. If that server goes down, developers are severely limited in what they can do. They can continue coding until they are ready to commit, but then they are blocked.

Git is a distributed version control system and developers have a full clone of the repository on their machines. As a result, most operations that involve the version control system don't require access to the central repository. When Stash is unavailable developers are not blocked to the same extent as with a central version control system.

As a result, the availability requirements for Stash may be less strict than the requirements for, say, Subversion.

Consequences of Stash unavailability

(tick) Not affected

Developers

  • Committing code
  • Creating branches
  • Switching branches
  • Viewing commit and file diffs
  • ...
  • Fetching changes from colleagues

(error) Affected

Developers

  • Cloning repositories
  • Fetching changes from the central repository
  • Pushing changes to the central repository
  • Accessing the Stash UI - creating and working with pull requests, browsing code

Build server

  • Cloning repositories
  • Polling for changes

Continuous deployment

  • Cloning repositories

Failover options

High availability and recovery solutions can be categorised as follows.

Failover option: Automated correction / restart
Recovery time: 2-10 minutes (application failure); hours to days (system failure)
Description:
  • Single node; no secondary server is available
  • The application and the server are monitored
  • When a failure occurs in the production system, a restart is performed via a script
  • In the event of a disk or hardware failure, the server may need to be re-provisioned and the application data restored from a backup
Possible with Stash: (tick)

Failover option: Cold standby
Recovery time: 2-10 minutes
Description:
  • A secondary server is available
  • Stash is NOT running on the secondary server
  • The file system and (optionally) the database are replicated between the 'active' and the 'standby' server
  • All requests are routed to the 'active' server
  • On failure, Stash is started on the 'standby' server and shut down on the 'active' server. All requests are now routed to the 'standby' server, which becomes 'active'.
Possible with Stash: (tick)

Failover option: Warm standby
Recovery time: 0-30 seconds
Description:
  • A secondary server is available
  • Stash is running on both the 'active' server and the 'standby' server, but all requests are routed to the 'active' server
  • The file system and database data are replicated between the 'active' and the 'standby' server
  • All requests are routed to the 'active' server
  • On failure, all requests are routed to the 'standby' server, which becomes 'active'
  • (error) This configuration is currently not supported by Stash, because Stash uses in-memory caches and locking mechanisms. At this time, Stash only supports a single application instance writing to the Stash home directory at a time.
Possible with Stash: (error)

Failover option: Active/Active
Recovery time: < 5 seconds
Description:
  • Provided by Bitbucket Data Center resources, using a cluster of Stash nodes and a load balancer.
  • Stash is running, and serving requests, on all cluster nodes.
  • The file system and database data are shared by all cluster nodes. (Clustering of the database itself is not yet supported.)
  • All requests are routed to the load balancer, which distributes them across the available cluster nodes. If a cluster node goes down, the load balancer immediately detects the failure and automatically redirects requests to the other nodes within seconds.
  • Stash Data Center is the deployment option of choice for larger enterprises that require high availability and performance at scale.
Possible with Stash: (tick)

 

Automated correction

Before implementing failover solutions for your Stash instance, consider evaluating and leveraging automatic correction measures. These can be implemented through a monitoring service that watches your application and runs scripts to start, stop, kill or restart services. A typical sequence is outlined below, followed by a minimal script sketch.

  1. The monitoring service detects that the system has failed.
  2. A correction script attempts to gracefully shut down the failed system.
    1. If the system does not shut down properly within a defined period of time, the correction script kills the process.
  3. After it is confirmed that the process is no longer running, the system is started again.
  4. If this restart solved the failure, the mechanism ends.
    1. If any part of the correction fails, the failover mechanism (if implemented) should be triggered.
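A minimal sketch of such a correction script, assuming Stash runs as a service named stash and exposes the /status health check described later on this page, is shown below. The service name, URL, process pattern and timings are placeholders, and a real script would typically be invoked by your monitoring service rather than run ad hoc.

#!/bin/bash
# Hypothetical auto-correction sketch for a single Stash node.
STATUS_URL="http://localhost:7990/status"

# 1. Detect a failure: anything other than HTTP 200 is treated as unhealthy.
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$STATUS_URL")
if [ "$code" = "200" ]; then
    exit 0                          # healthy, nothing to do
fi

# 2. Attempt a graceful shutdown; kill the process if it doesn't stop in time.
service stash stop
sleep 60
pkill -9 -f 'atlassian-stash' 2>/dev/null   # force-kill any remaining Stash process

# 3. Start Stash again now that no process should be running.
service stash start

# 4. If Stash is still unhealthy after this, the failover mechanism
#    (if implemented) should take over from here.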

Cold standby

The cold standby (also called Active/Passive) configuration consists of two identical Stash servers, where only one server is ever running at a time. The Stash home directory on each of the servers is either a shared (and preferably highly available) network file system or is replicated from the active to the standby Stash server. When a system failure is detected, Stash is restarted on the active server. If the system failure persists, a failover mechanism is started that shuts down Stash on the active server and starts Stash on the standby server, which is promoted to 'active'. At this time, all requests should be routed to the newly active server.

For each component in the chain of high availability measures, there are various implementation alternatives. Although Atlassian does not recommend any particular technology or product, this guide gives examples and options for each step. In the sections that follow, each component in the system is described, and an example configuration is used to illustrate the descriptions.

System setup

This section describes one possible configuration for how to set up a single instance of Stash for high availability.


[Diagram: System setup]

 

Component: Request router
Description: Forwards traffic from users to the active Stash instance.

Component: High availability manager
Description:
  • Tracks the health of the application servers, decides when to fail over to a standby server and designates it as active.
  • Manages the failover mechanisms and sends notifications on system failure.

Component: Stash server
Description:
  • Each server hosts an identical Stash installation (identical versions).
  • Only one server is ever running a Stash instance at any one time (known as the active server); all others are considered standbys.
  • The Stash home directory resides on a replicated or shared file system visible to all application servers (described in more detail below).
  • The Stash home directory must never be modified when the server is in standby mode.

Component: Stash DB
Description: The production database, which should be highly available. How this is achieved is not explored in this document; see your database vendor's documentation for the HA options available to you.

Example HA implementation 

This particular implementation is provided to illustrate the concepts, but hasn't been tested in production. We strongly recommend that you devise a solution that best fits your organisation's existing best practices and standards and is thoroughly tested for production readiness.

The example configuration that we'll use to illustrate the concepts consists of a Linux cluster of two nodes. Each node is a CentOS server with Java, Git and Stash installed. Stash's home directory is replicated between the nodes using DRBD, a block-level disk replication mechanism. The cluster is managed by CMAN. Pacemaker, a high availability resource manager, is used to manage two HA resources: Stash and a virtual IP. Pacemaker runs on each machine, elects the 'primary' node for Stash and starts Stash on this node. Pacemaker monitors Stash and, when it detects a failure, tries to restart Stash on the primary node. If the restart fails, or does not resolve the issue, it fails over to the secondary node. The virtual IP resource is configured to run on the same node as the Stash resource, removing the need for a separate 'request router' component, and is moved to the secondary node along with Stash on failover.

Scripts to create a virtual network based on this example configuration using packer, vagrant and VirtualBox can be found in the stash-ha-example repository. Specifically, the scripts for installing the required software components can be found in the packer/scripts directory. The scripts for configuring the cluster can be found in the vagrant/scripts directory.

[Diagram: Example Stash HA implementation]

Request router

All high availability solutions are based on redundancy, monitoring and failover. In the cold standby approach, only one server is running Stash at a time. It is the request router's responsibility to route all incoming requests to the node that is currently the primary node. For full high availability, the request router should be highly available itself, meaning that the component is monitored by the HA manager and can be failed over to a redundant copy in the network.

要件 (Requirements)

  • Routes all incoming requests to the node that is currently the 'primary' node
  • Should be highly available itself

Options

Solution in example HA implementation

The example HA implementation does not include a separate Request Router server. Instead it includes a virtual IP HA resource that is co-located with the Stash resource. The virtual IP resource is managed by Pacemaker and will be moved to the standby node when the Stash resource fails over to the standby node.
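For illustration, a virtual IP resource of this kind could be defined with Pacemaker's crm shell roughly as follows. The IP address, netmask and resource names are placeholders, and 'stash' is assumed to be the name of the Pacemaker resource that manages the Stash application; this is a sketch, not the exact configuration used in the example repository.

# Define a virtual IP managed by the standard IPaddr2 resource agent
crm configure primitive stash-vip ocf:heartbeat:IPaddr2 \
    params ip=192.168.56.200 cidr_netmask=24 \
    op monitor interval=10s

# Keep the virtual IP on the same node as the Stash resource,
# and only start it once Stash has started
crm configure colocation vip-with-stash inf: stash-vip stash
crm configure order stash-before-vip inf: stash stash-vip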

Data replication

Stash stores its data in two places: the Stash home directory and the database that you have configured. The Stash home directory contains, among other things, the Git repositories being managed by Stash (with some additional Stash-specific files and directories), installed plugins, caches and log files. The database contains, among other things, your project and repository information and metadata, pull requests data and the data for your installed plugins.

Data in Stash's home directory and in the database are very tightly coupled. For instance, repository pull requests have their metadata, participants and comments stored in the database but certain Git-oriented information around merging and conflicts (which are used to display the diffs in the user interface) are stored in the managed Git repositories. If the two were to fall out of sync you might see an incorrect pull request diff, you might be left unable to merge the pull request, or Stash may simply refuse to display the pull request at all. Similarly, Stash plugins are installed from jar files in the Stash home directory but their state is stored in the database. If the two were to fall out of sync then plugins may malfunction or not appear installed at all, thus degrading your Stash experience.

When designing a high availability solution for Stash based on a replicated file system and database, it's important that the file system replication is atomic. The replicated file system must be a consistent snapshot of the 'active' filesystem. This is important because changes to a Git repository happen in predictable ways: first the objects (files, trees and commits) are written to disk, followed by updates of the refs (branches and tags). Some synchronisation tools such as rsync perform file-by-file syncing, which can result in an inconsistent Git repository if the repository is modified while the sync is happening (for example, if object files have not been synced, but the updated refs have been).

Furthermore, the tight coupling between the Stash home directory and database makes it essential that the Stash home directory and database are always consistent and in sync (see here for more information). By extension, this means that any high availability solution based on a replicated file system and database needs to ensure that the replicated file system and database are in sync. For example, if the replication is based on hourly synchronisation to a standby node, care must be taken to ensure that the synchronisation of the database and filesystem happen at the same time.
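To illustrate the atomicity point, one way to take a consistent point-in-time copy of the home directory, assuming it lives on an LVM logical volume (the volume names, mount point and standby host below are placeholders), is to snapshot the volume and rsync from the snapshot rather than from the live file system. Note that this sketch only addresses the file system side; database replication still needs to be coordinated so that both remain in sync.

# Create a point-in-time snapshot of the volume holding the Stash home directory
lvcreate --snapshot --name stash-home-snap --size 5G /dev/vg0/stash-home

# Mount the snapshot read-only and replicate from it, not from the live
# file system, so the standby receives a consistent view of the repositories
mkdir -p /mnt/stash-home-snap
mount -o ro /dev/vg0/stash-home-snap /mnt/stash-home-snap
rsync -a --delete /mnt/stash-home-snap/ standby-node:/var/atlassian/application-data/stash/

# Clean up the snapshot
umount /mnt/stash-home-snap
lvremove -f /dev/vg0/stash-home-snap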

要件 (Requirements)

  • File system replication must replicate a consistent snapshot of Stash's home directory.
  • The database and the file system must be replicated at the same time.

Options

Solution in example HA implementation

The example HA implementation uses a DRBD managed block device for its Stash home directory. By default, DRBD runs in a Primary/Secondary configuration in which only a single node can mount the DRBD managed volume at a time. In this configuration, DRBD should be managed by Pacemaker to ensure that the DRBD volume is co-located with the Stash resource.

In preparation for experimentation with an Active/Active configuration, the example HA implementation has configured DRBD in a dual-primary configuration, which allows both nodes to mount the DRBD managed volume at the same time. 
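As a rough illustration of the default Primary/Secondary arrangement described above, the DRBD device can be managed by Pacemaker as a master/slave resource, with the file system mounted only on the node where DRBD is primary and Stash co-located with that mount. Resource and device names, the mount point and the file system type are placeholders, and the exact syntax varies between crmsh versions.

# DRBD device managed as a master/slave (primary/secondary) resource
crm configure primitive stash-drbd ocf:linbit:drbd \
    params drbd_resource=stash \
    op monitor interval=15s role=Slave \
    op monitor interval=10s role=Master
crm configure ms stash-drbd-ms stash-drbd \
    meta master-max=1 clone-max=2 notify=true

# Mount the replicated volume only where DRBD is primary, and keep Stash there
crm configure primitive stash-fs ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/var/atlassian/application-data/stash fstype=ext4
crm configure colocation fs-on-drbd inf: stash-fs stash-drbd-ms:Master
crm configure order fs-after-drbd inf: stash-drbd-ms:promote stash-fs:start
crm configure colocation stash-with-fs inf: stash stash-fs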

Monitoring

To allow for monitoring in a high availability environment, Stash (since version 2.10) supports a REST-based health check endpoint at /status that describes the current health of the instance. This endpoint supports only the GET verb and requires no authentication, XSRF protection header values, or mime-type headers. The /status endpoint has been designed to return sane output even when Stash is unavailable as a result of database migration or backup. Please note that other URLs such as /login or /rest/api/latest/application-properties will redirect to the maintenance page while Stash is performing database migration or backup; using those URLs to monitor the health of the system may therefore unintentionally trigger failover.

Usage example

> curl -i -u user -X GET http://localhost:7990/stash/status
Enter host password for user 'user':
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-AREQUESTID: 1040x7x0
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
Content-Type: application/json;charset=ISO-8859-1
Content-Length: 19
Date: Tue, 07 Jan 2014 17:20:04 GMT
{"state":"RUNNING"}

The following is a list of the responses the /status health check endpoint will return:

HTTP status code | Response entity | Description
200 | {"state":"RUNNING"} | Stash is running normally
200 | {"state":"MAINTENANCE"} | Stash is in maintenance mode
503 | {"state":"STARTING"} | Stash is starting
503 | {"state":"STOPPING"} | Stash is stopping
200 | {"state":"FIRST_RUN"} | Stash is running for the first time and has not yet been configured
404 | (none) | Stash failed to start up in an unexpected way (the web application failed to deploy)
500 | {"state":"ERROR"} | Stash is in an error state

If a connection error occurs when trying to connect to the endpoint (but the server is reachable) then Tomcat has failed to start.

Monitoring frequency

Stash's health check is simple and not resource intensive. You should feel free to check as often as is deemed necessary to maximise continuity of Stash in your organisation. We do recommend, however, not to check more frequently than every 15 seconds, so that the HA resource manager / cluster does not mistake a transitory slowdown, such as a stop-the-world garbage collection in Stash's JVM, for a failure. We recommend a monitor timeout of 30 seconds because the first check after startup can be fairly slow; after startup completes, the check should take only a few milliseconds.

要件 (Requirements)

  • Monitoring scripts must use the /status URL. Any other URL may redirect to the maintenance page when a backup is being performed, unintentionally triggering failover.
  • When a request to /status returns anything other than a 200 status code, Stash should be considered to be in an error state and should be failed over to the standby node.

Solution in example HA implementation

The example HA implementation includes an OCF compliant script that's used for monitoring Stash's health. The script can be found here.
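The essence of such a monitor action can be sketched roughly as follows. This is not the script from the example repository, just an illustration of how the /status endpoint maps onto the standard OCF exit codes; the URL and timeout values are assumptions.

#!/bin/bash
# Sketch of an OCF-style monitor action for Stash based on the /status endpoint.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

STATUS_URL="http://localhost:7990/status"

stash_monitor() {
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$STATUS_URL")

    if [ "$code" = "200" ]; then
        return $OCF_SUCCESS          # RUNNING, MAINTENANCE or FIRST_RUN
    elif [ "$code" = "000" ]; then
        return $OCF_NOT_RUNNING      # connection refused: Tomcat is not running
    else
        return $OCF_ERR_GENERIC      # 404, 500, 503 or any other unexpected answer
    fi
}

stash_monitor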

Failover

The following table outlines how we recommend that your HA resource manager responds to failure events:

Event: The network connection from the request router to Stash is lost
Response: Failover to a secondary node.

Event: Server failure
Response: Failover to a secondary node.

Event: Stash crashes completely
Response: Restart Stash on the active node.

Event: Stash reaches its memory limits (OOME)
Response: Restart Stash on the active node.

Event: Stash loses its connection to the database
Response: Nothing. Stash will recover when the database comes back online. Stash on another node would also fail to start if the database is unavailable.

Event: The database is reported down
Response: Nothing. Stash will recover when the database recovers. Stash on another node would also fail to start.

Event: Stash fails to start up (e.g. wrong Git binary version)
Response: Nothing. Manual intervention is required. Stash on another node would also fail to start.

Split brain

A split-brain condition results when a cluster of nodes encounters a network partition and multiple nodes believe the others are dead and proceed to take over the cluster resources. In the context of a Stash HA installation this would involve multiple Stash instances running concurrently and making filesystem and database changes, potentially causing the filesystem and database to fall out of sync. As previously noted this must not be permitted to happen. There are several ways to address this:

Network redundancy

This involves configuring redundant and independent communications paths between nodes in the cluster. If you maximise the connectivity between nodes you minimise the likelihood of a network partition and a split brain. This is a preventative measure but it is still sometimes possible for the network to partition.

Resource fencing

This involves ensuring that the first node that believes the others are dead 'fences off' access to the resource that other nodes (which appear dead but may still be alive) may try to access. The losing nodes are prevented from making modifications, therefore maintaining consistency. In a Stash HA, the resources that would need to be fenced are the database and the replicated file system.

Node fencing or STONITH

This is a more aggressive tactic and again involves the first node that believes the others are dead, but instead of fencing off access to particular resources, it denies all resource access to them. This is most commonly achieved by power-cycling the losing nodes (aka "Shoot The Other Node In The Head" or STONITH). In a Stash HA, this would involve power-cycling the losing Stash servers.

要件 (Requirements)

  • The application should fail over to a secondary node when a server failure is detected by the cluster manager (that is, the whole node is down or unreachable).
  • When an application failure is detected, the application should be restarted. If restarting does not resolve the issue, the application should be failed over to a secondary node.

Solution in example HA implementation

The example implementation uses Pacemaker to manage failover. Pacemaker in turn uses the provided OCF script to properly shut down the failing Stash and start Stash on the secondary node.

Please note that the vagrant provisioning script in the example implementation contains a simplified configuration that is aimed at testing failover. It configures Stash to immediately fail over to a secondary node, without attempting to restart the application, and it disables the STONITH feature for ease of testing. In a production system, at least one restart should be attempted before failing over, and STONITH should be enabled to handle 'split brain' occurrences.
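For a production-style setup along those lines, the relevant Pacemaker knobs would look roughly like this: a monitor operation at the recommended interval and timeout, a migration threshold that allows one local restart before failing over, and STONITH enabled. The resource agent name and values below are only examples, not the configuration shipped in the example repository.

# Monitor Stash every 15 seconds with a 30 second timeout (see 'Monitoring frequency')
crm configure primitive stash ocf:custom:stash \
    op monitor interval=15s timeout=30s \
    meta migration-threshold=2
# migration-threshold=2 allows one restart on the active node before
# the resource is moved to the standby node

# Enable STONITH so that a node on the losing side of a network partition
# is powered off rather than left running with the shared resources.
# A real deployment also needs a fencing (STONITH) device configured per node.
crm configure property stonith-enabled=true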

Licensing

Developer licenses can be used for non-production installations of Stash deployed on a cold standby server. For more information see developer licenses.