Tuning load balancing for Hipchat Data Center

You want your Hipchat Data Center deployment to be highly available, and a lot depends on the configuration of your load balancer. The load balancer is the gateway to your Hipchat cluster, and how you choose to tune and configure it can have a huge impact on cluster performance and on your users' experience in the event of a failure in the cluster.

This article talks about configuration requirements for your load balancer to ensure the best possible performance, and provides a good example configuration for HAProxy. You can use this configuration with minimal changes in your environment, or use the recommendations in it to construct a configuration that works for the load balancing solution you choose for your deployment.

Good to know...

This article is only about load balancing failover to the Hipchat application nodes. It doesn't cover the load balancer(s) you might use to improve availability for the external data stores (Postgres, Redis, and the NFS share).

Basic requirements

In order to work with Hipchat Data Center, your load balancer must use HTTP persistent connections (also known as "HTTP connection reuse" or http-keepalive) to minimize the overhead of creating new TCP connections on the load balancer. HAProxy, the example load balancer in this article, uses http-keepalive by default; other load balancers may require that you explicitly turn that feature on.
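
For example, in HAProxy you can state this behavior explicitly in a backend, as the reference configuration later in this article does. This is a minimal sketch; the backend name, server name, and address are placeholders:

backend b_keepalive_example
     option http-keep-alive              # reuse connections to the node instead of opening a new one per request
     server node1 10.0.0.1:80 check      # placeholder node address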

The load balancer must also use cookie-based "sticky sessions". This ensures minimal disruption to end users' chat sessions by consistently routing each session to the same Hipchat worker node.

We recommend using round-robin routing, which means that the load balancer routes new requests to each node in sequence. We use this method to keep the load as evenly distributed as possible without overwhelming (or "flooding") a single node when it rejoins the cluster.
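
In HAProxy terms, sticky sessions and round-robin distribution can be combined in a single backend, which is what the reference configuration later in this article does with a stick table keyed on the session cookie. This is a minimal sketch; the backend name, server names, and addresses are placeholders, and the cookie name "s" matches the reference configuration:

backend b_sticky_example
     balance roundrobin                         # new sessions rotate across the nodes
     stick-table type string len 128 size 10M   # remembers which node owns each session cookie
     stick store-request cookie(s)              # learn the session cookie on the way in
     stick match cookie(s)                      # route returning sessions to the node that owns them
     server node1 10.0.0.1:80 check
     server node2 10.0.0.2:80 check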

The load balancer must also have a method for determining the node's health, or running "health checks". This is how the load balancer detects if a node is working, or if it's down.

What's a Health Check?

In a health check, the load balancer contacts each node once per second. That may seem like a lot, but in network terms it's leisurely. In our sample HAProxy configuration, we use httpchk.

  • If a node responds to a health check within 3 seconds (3000ms), it passes the health check.
  • If the node doesn't respond within 3 seconds, or if the load balancer receives a network error, the node fails the check.

Once a node has failed five consecutive health checks (which could take up to 8 seconds), it's considered "failed" and the load balancer switches into "degraded mode" (node outage) operation. 
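
In HAProxy these timings map onto the health check parameters of the backend. This is a minimal sketch; the backend name, server name, and address are placeholders, and the fall value of 5 simply mirrors the five consecutive failures described above (the reference configuration later in this article uses its own values):

backend b_check_example
     option httpchk GET /                    # probe each node with an HTTP request
     timeout check 3s                        # a node must answer within 3 seconds
     default-server inter 1s fall 5 rise 3   # check every second; 5 failures mark a node down, 3 successes bring it back
     server node1 10.0.0.1:80 check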

Operational Scenarios

These scenarios explain what our HAProxy configuration does. If you're developing your own configuration, these are the parameters you'll need to replicate for best performance.

Normal operation - all nodes operational

All three application nodes pass the health checks.

  • Continuing client chat requests (which are POSTs to /http-bind) are sent to a specific Hipchat application node, based on the web session affinity or "sticky session" cookie.
  • When routing new sessions, the load balancer uses round-robin load distribution to assign the new session to the next node.
  • All Web and API requests are also distributed among the application nodes in a round-robin fashion.
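
This split between chat traffic and web/API traffic is handled by the frontend in the reference configuration later in this article: requests whose path begins with /http-bind go to the sticky backend, and everything else goes to the stateless round-robin backend. A minimal sketch (the frontend name and certificate path are placeholders; the backend names match the reference configuration):

frontend f_example
     bind 0.0.0.0:443 ssl crt /etc/ssl/private/example.pem    # placeholder certificate path
     acl bosh_request path_beg /http-bind                      # chat (BOSH) requests
     use_backend b_hcdc_sticky if bosh_request                 # chat traffic goes to the sticky backend
     default_backend b_hcdc_stateless                          # web and API traffic goes to the round-robin backend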

Degraded mode - one node fails, two nodes operational

Two of the application nodes pass the health check, but one node has failed.

Within one second of a node's health check failing:

  • Any client chat requests that would otherwise be routed to the failed application node are distributed between the remaining application nodes, with best effort for even distribution. 
    Only the clients whose sessions were originally on the failed node have to reconnect.
  • All incoming web and API requests are distributed between the remaining application nodes in a round-robin fashion.
    API calls are effectively stateless and do not bind to a specific node; failed requests may be retried and routed to one of the remaining nodes.
  • Client chat requests to the unaffected nodes are still routed to those same nodes; clients assigned to them are not impacted and remain on the same nodes.
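
In the reference configuration later in this article, much of this failover behavior comes from the per-server options on the sticky backend. A minimal sketch of the relevant line, with the server name and address as they appear in the reference configuration:

server s1 one:80 check observe layer4 on-marked-down shutdown-sessions tcp-ut 2s
# observe layer4                   : also treat failed TCP connections as evidence that the node is unhealthy
# on-marked-down shutdown-sessions : close existing sessions to a node the moment it is marked down,
#                                    forcing the affected clients to reconnect and be re-routed
# tcp-ut 2s                        : give up quickly on TCP connections the failed node stops acknowledging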

If more than one node fails, and the remaining node is unable to accommodate the number of users attempting to use the service, we can't guarantee that the Hipchat services will remain available. See the explanation below.

Recovery mode - all nodes operational

All three Hipchat nodes pass their health checks again. The cluster is returning to normal operation.

Within five seconds of node recovery:

  • Client chat requests that were redirected from the failed node to one of the remaining nodes during Degraded mode begin returning to the original, recovered node. 
    Only the clients with sessions assigned to the recovered node must reconnect.
  • All incoming web and API requests return to being distributed between all three application nodes in a round-robin fashion. 
    Long-running requests may take more time to rebalance, depending on how long they take to time out on the server.
  • Ongoing client chat sessions on the unaffected nodes are still routed to the same nodes.
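
How quickly sessions move back is governed by the rise parameter on the health checks. In the reference configuration later in this article, a recovering node must pass three consecutive checks, one second apart, before the load balancer routes traffic to it again. A minimal sketch (the backend name, server name, and address are placeholders):

backend b_recovery_example
     default-server inter 1s fall 1 rise 3   # rise 3: three consecutive successful checks bring a node back into rotation
     server node1 10.0.0.1:80 check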

The good stuff - Reference HAProxy load balancer configuration

Check out our configuration file for HAProxy 1.6.3 below. (This was tested on Ubuntu 16.04 LTS.)

You can use this config file as-is, or use it as a reference configuration when creating your own load balancer config file.

haproxy.cfg
# HAProxy Config 1.0.2 - last updated May 28, 2018
# there are three TODOs for you before you can use this file

global
     stats socket /var/run/haproxy.sock mode 600 level admin

defaults
     log global
     mode http
     option dontlognull
     timeout connect 3000
     timeout client 5000
     timeout server 60000
     option forwardfor

# TODO: replace the path following `crt` with the path to your deployment's pem certificate on the LB instance
frontend f_hcdc
     bind 0.0.0.0:443 ssl crt /etc/ssl/private/hipchatdc.example.com.pem
     default_backend b_hcdc_stateless
     acl bosh_request path_beg /http-bind
     use_backend b_hcdc_sticky if bosh_request

# TODO: replace <hipchatdc.example.com> with the DNS endpoint used to access the deployment
# TODO: replace "one", "two" and "three" with the IPs or DNS entries for your Hipchat nodes
backend b_hcdc_stateless
     balance roundrobin
     option httpchk GET / HTTP/1.0\r\nHost:\ <hipchatdc.example.com>
     http-check expect status 302
     option http-keep-alive
     default-server inter 1s fall 1 rise 3
     server s1 one:80 check observe layer4
     server s2 two:80 check observe layer4
     server s3 three:80 check observe layer4

backend b_hcdc_sticky
     balance roundrobin
     stick-table type string len 128 size 10M
     acl has_sticky_session cookie(s),in_table(b_hcdc_sticky)
     stick store-request cookie(s) unless has_sticky_session
     stick match cookie(s)
     option httpchk GET /http-bind HTTP/1.0 
     http-check expect status 200
     option http-keep-alive
     default-server inter 1s fall 1 rise 3
     server s1 one:80 check observe layer4 on-marked-down shutdown-sessions tcp-ut 2s
     server s2 two:80 check observe layer4 on-marked-down shutdown-sessions tcp-ut 2s
     server s3 three:80 check observe layer4 on-marked-down shutdown-sessions tcp-ut 2s


Performance impact

During a node failure, if the total capacity of the remaining nodes in the cluster is less than the number of active users, we can't guarantee that the Hipchat services will remain available. The closer your Hipchat cluster is to running at maximum capacity during normal operation, the less likely it is that the services will remain available during a failure.

For example, a three-node Hipchat Data Center cluster might have a hardware configuration that gives it a total capacity of 6000 users, and it normally serves 3000 daily active users. If the cluster loses one of its nodes, the remaining capacity is 2/3 of the total, or 4000 users. Since the remaining capacity is greater than the 3000 active users, the degraded cluster remains up.

However, a three-node Hipchat Data Center cluster with a total capacity of 9000 users and 8500 active users has less of a safety margin. If this cluster loses one of its nodes, the remaining capacity is reduced by 1/3, to about 6000 users. That's less than 8500, so in this case the degraded cluster cannot seamlessly support the number of active users. Cluster administrators should expect performance degradation, unpredictable behavior, and, in the most severe cases, complete loss of chat services until the failed node recovers.
