
Back-end Load Balancing and Health Checks

OLD: Affects version(s): 
4.2.4 or newer

 
Load Balancing

Airlock has the built-in capability to distribute browser requests to several back-end systems. In many situations, Airlock will be able to replace more expensive dedicated load balancers. This article shows how to configure this feature, and how it will affect Airlock's runtime behavior.

Configuration

Back-end Load Balancing is enabled by connecting a Back-end Group containing two or more Back-end Hosts to one specific mapping.

Each Back-end Host can be assigned a Weight value, or it can be marked as a Spare. Spare Back-end hosts do not carry a weight value. Airlock automatically assigns a weight to a Back-end Host when its Spare checkbox is un-ticked. For more configuration information see the following manual pages:
Back-end Group
Back-end Load Balancing

Distribution Algorithm

Airlock distributes sessions randomly over the configured Back-end Hosts. The Weights assigned to all non-spare and enabled Back-end Hosts of a Back-end Group are used to calculate what percentage of the current number of sessions each Back-end Host should receive. The calculated percentage is used as a starting point when distributing load over a Back-end Group's Back-end Hosts. Depending on the operational situation, these values can change dynamically.
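The weight-to-percentage calculation can be illustrated with a minimal Python sketch. It is not Airlock's implementation: the host names, the weight values, the dictionary layout and the function names are assumptions made for illustration only.

```python
import random

def session_shares(hosts):
    """Return each eligible host's share of new sessions as a fraction of 1.

    Only enabled, non-spare hosts take part; spare hosts carry no weight.
    """
    eligible = {name: h["weight"] for name, h in hosts.items()
                if h["enabled"] and not h["spare"]}
    total = sum(eligible.values())
    if total == 0:
        return {}
    return {name: weight / total for name, weight in eligible.items()}

def pick_host(hosts, rng=random):
    """Pick the host for a NEW session at random, proportionally to its share."""
    shares = session_shares(hosts)
    if not shares:
        raise LookupError("no enabled, non-spare back-end host")
    names = list(shares)
    return rng.choices(names, weights=[shares[n] for n in names], k=1)[0]

# Hypothetical Back-end Group used only for this example.
hosts = {
    "app1": {"weight": 100, "enabled": True, "spare": False},
    "app2": {"weight": 50, "enabled": True, "spare": False},
    "app3": {"weight": 50, "enabled": False, "spare": False},
    "app4": {"weight": None, "enabled": True, "spare": True},
}
```

In this example only app1 and app2 are eligible, so their starting shares are two thirds and one third of the new sessions. The dynamic adjustment mentioned above is not modeled.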

Note that what is distributed over the Back-end Hosts according to the set weights (and calculated percentages) are the individual sessions, not the requests. Once a specific user's session is tied to a specific Back-end Host, it will be served by that particular Back-end Host for the remainder of the session, or until that Back-end Host becomes unresponsive. Thus session stickiness (see below) is guaranteed.

The following illustration shows the load balancing process:

 

 

Session Stickiness

When session handling is enabled (see screenshot below), session stickiness is guaranteed by the Airlock session cookie. All requests of one user's session are sent to the same Back-end Host as long as this system is available. If the Back-end Host is set to "bad", Airlock chooses another host from the configured Back-end Group. The session information will be updated so that session stickiness remains guaranteed.

If a TCP or SSL connection has been established but a back-end timeout occurred, the request will not be resent to another failover back-end, regardless of GET or POST. This is to avoid performing the same transaction twice, because Airlock does not know how far the request has been processed by the first back-end.
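The sticky routing with failover described above can be sketched as follows. This is a simplified illustration, not Airlock's code: the class name, the in-memory session table and the "first available host" selection are assumptions (the real selection is weighted and random).

```python
class StickyRouter:
    """Keeps each session on one back-end host until that host is marked bad."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.bad = set()          # hosts currently marked as "bad"
        self.session_host = {}    # session id -> assigned host

    def route(self, session_id):
        host = self.session_host.get(session_id)
        if host is None or host in self.bad:
            candidates = [h for h in self.hosts if h not in self.bad]
            if not candidates:
                raise LookupError("no available back-end host")
            # Assumption: take the first available host to keep the sketch
            # deterministic; a real balancer would choose by weight.
            host = candidates[0]
            # Update the session information so stickiness holds from now on.
            self.session_host[session_id] = host
        return host
```

A session that was moved to another host stays on that host afterwards, which matches the updated session information mentioned above.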

If Airlock session handling is disabled, it is possible to enable the load balancing cookie to handle session stickiness. This cookie stores information about which Back-end Host handles the requests of one specific user. It is sent to the browser to guarantee session stickiness.

Example of an encrypted load balancing cookie:

AL_LB=$xc/1OyIgo!xxFxsjg_TOa46gESbVeA=

In both cases this “back-end stickiness” applies to all Mappings connected to the same group of Back-end Hosts.

Health Checks

Airlock provides two different health check mechanisms to check the availability of the Back-end Hosts. The health checks can either be in-band (using regular end-user requests) or out-of-band (requests for a pre-defined page). A system of a Back-end Group counts as unavailable if no TCP connection is possible or if a back-end failure is detected by HTTP timeout, error code or error message. As soon as this happens, Airlock marks this system as failed ("bad") and tries to reach one of the other "Spare" or "Enabled" systems in the same Back-end Group. For a detailed description of the two health check types, see below.

Configuration

The health check configuration has to be done in the corresponding Back-end Group in the panel "Health Checks". By default, both In-band Checks and Out-of-band Checks are disabled.

In-band Checks

In-band Health Checks are performed with each user request. Failing checks are tracked with a sliding average over the last 20 seconds, which yields the current failure rate. A single request can change the failure rate by at most 5%. If more than 10% of the requests within that 20-second time window fail, Airlock marks the corresponding back-end server as "bad". The values for the sliding average calculation can be customized; see section "Expert Settings".
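The exact formula Airlock uses for the sliding average is not documented here. The following Python sketch shows one plausible way to combine a time window, a per-request impact cap and a threshold; the class name and the weighting rule are assumptions, only the three default values come from this article.

```python
class FailureRate:
    """Time-weighted running failure rate with a capped per-request impact."""

    def __init__(self, time_window=20.0, threshold=0.10, max_impact=0.05):
        self.time_window = time_window    # seconds
        self.threshold = threshold        # fraction, 0.10 = 10 percent
        self.max_impact = max_impact      # fraction, 0.05 = 5 percent
        self.rate = 0.0
        self.last = None                  # timestamp of the previous request

    def record(self, now, failed):
        """Fold one request outcome into the rate and return the new rate."""
        if self.last is None:
            weight = self.max_impact
        else:
            # Assumption: a request weighs as much as the share of the window
            # that has elapsed since the previous one, capped at max_impact.
            weight = min(self.max_impact, (now - self.last) / self.time_window)
        self.last = now
        outcome = 1.0 if failed else 0.0
        self.rate += weight * (outcome - self.rate)
        return self.rate

    def is_bad(self):
        return self.rate > self.threshold
```

Under these assumptions, with one failing request per second the rate moves from 5% to 9.75% and then to about 14.3%, so the host would be marked "bad" on the third consecutive failure.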

When a system fails and no Out-of-band Health Checks are configured, Airlock will still try to deliver requests to this system. But since it is marked as "bad", no more than one probe request is sent to such a system at any time (this value can be changed, see section "Expert Settings"). This way no more than one request and user is stuck per failed system. As soon as one probe request is successfully sent to and returned by the system, Airlock removes the "bad" status and returns to delivering regular load to this system.
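The probe mechanism can be pictured as a small gate in front of each "bad" host. The sketch below is single-threaded and purely illustrative; the class and method names are hypothetical, and only the default of one concurrent probe request is taken from this article.

```python
class BadHostGate:
    """Limits how many requests may probe a host that is marked as bad."""

    def __init__(self, max_probes=1):
        if max_probes < 1:
            # With 0 probes the host could never be marked as good again.
            raise ValueError("max_probes must be at least 1")
        self.max_probes = max_probes
        self.in_flight = 0
        self.bad = True

    def try_acquire(self):
        """Return True if this request may be used as a probe request."""
        if self.in_flight >= self.max_probes:
            return False
        self.in_flight += 1
        return True

    def release(self, success):
        """Finish a probe; one successful probe removes the bad status."""
        self.in_flight -= 1
        if success:
            self.bad = False
```

A request that does not get the probe slot would be routed to another host of the Back-end Group, so at most `max_probes` users wait on the failed system at any time.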

The following sections describe the cases in which a back-end server will be marked as "bad".

Back-end HTTP timeout

If a back-end request times out, Airlock does not re-send the request to any of the other systems. This way, the transaction on the back-end system will not be triggered again. The back-end system which did not respond within the expected time will be marked as "bad".
The Back-end HTTP timeout is set to 120 seconds by default. It is possible to change it on a Mapping in the panel "Basic".

TCP connect fail or connect timeout

As opposed to the behavior when reaching the Back-end HTTP timeout, in this case Airlock tries to connect to one of the other configured Back-end Hosts. Each back-end system which is not available will be marked as "bad". Airlock will abort the request only if all back-end systems in the Back-end Group fail to connect. The TCP back-end connect timeout is set to 2 seconds by default. With the following Expert Setting, the value can be overridden.

SecurityGateway * BackendConnectTimeout "<value>"
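The difference between the two failure cases (connect failure versus HTTP timeout) can be summarized in a short Python sketch. The function and its `connect` and `send` callables are hypothetical stand-ins for the real network operations, not Airlock APIs.

```python
def send_with_failover(hosts, connect, send, bad):
    """Try hosts in order; fail over on connect errors, never on HTTP timeouts.

    connect(host) opens a connection and raises OSError on failure or timeout.
    send(conn) performs the request and raises TimeoutError on an HTTP timeout.
    bad is a set collecting the hosts that have been marked as bad.
    """
    for host in hosts:
        try:
            conn = connect(host)
        except OSError:
            # Nothing was sent yet, so it is safe to try the next host.
            bad.add(host)
            continue
        try:
            return send(conn)
        except TimeoutError:
            # The back-end may have processed part of the request:
            # mark the host as bad, but do not resend elsewhere.
            bad.add(host)
            raise
    raise ConnectionError("all back-end hosts failed to connect")
```

The request is aborted with an error only after every host in the list has failed to connect, mirroring the behavior described above.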

Additional response-based In-band Checks

Additionally, it is possible to enable response-based In-band Checks. In this case the health check decision is based on the response from the back-end system, specifically the HTTP response status code and/or the response content. The settings can be enabled on the tab "Health Checks" > "In-band Checks" in the corresponding Back-end Group. The screenshot below shows the default settings.

Note: Connection level errors such as connection timeouts and handshake failures will still mark back-end systems as "bad" even if "In-band Checks" are disabled.

For more In-band Checks configuration information, see the following manual page: Health Checks
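As a rough illustration of a response-based check, the following Python sketch classifies a response by status code and content. The function name, the status range and the content pattern are assumptions chosen for the example; they are not Airlock's default settings.

```python
import re

def response_is_healthy(status, body, bad_status=range(500, 600), bad_content=None):
    """Return False if the status code or the body indicates a back-end failure.

    bad_status:  status codes treated as failures (assumed: all 5xx codes).
    bad_content: optional regular expression that marks an error page.
    """
    if status in bad_status:
        return False
    if bad_content is not None and re.search(bad_content, body):
        return False
    return True
```

For example, `response_is_healthy(200, "Fatal error", bad_content=r"Fatal error")` reports a failure even though the status code is 200, which is the case a content check is meant to catch.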

Expert Settings

There are a few properties only accessible through the Airlock Configuration Center "Expert Settings" > "Security Gate", where resource entries can be added to override the system's default settings.

It is possible to change the default value of concurrent fail-check requests by setting the resource in the Expert Settings. This value must never be "0" because then a Back-end Host marked as "bad" can never be marked as "good" again due to the lack of a request that could reset its state. By default this resource is set to "1". The higher the number, the faster a host is marked as "good" again, but also the more requests suffer from being used to check the host’s availability. Reasonable values are "1" or "2".

SecurityGateway * BackendHostManagerMaxConcurrentRequestsToBadHosts  "<value>"

The maximum number of concurrent connections to a normal Back-end Host (a "good" host) can be changed with the following resource. By default this value is set to 80% of the number of "security_gate" child processes, which is determined by the system’s hardware, such as RAM and CPU.

SecurityGateway * BackendHostManagerMaxConcurrentRequests "<value>"

By default, the sliding average time window is set to 20 seconds, the threshold to mark a system as "bad" is set to 10 percent, and the maximum impact per request is 5 percent. The following settings can be used to override these defaults.

SecurityGateway * BackendFailureDetection.InBand.TimeWindow           "<value>" # seconds
SecurityGateway * BackendFailureDetection.InBand.ThresholdToBad       "<value>" # percent
SecurityGateway * BackendFailureDetection.InBand.MaxImpactPerRequest  "<value>" # percent

Out-of-band Checks

Out-of-band Checks allow constant monitoring of a specific page independent of any end-user traffic. This page is requested at a defined interval, and the response is analyzed to decide whether a back-end server is in a "good" or a "bad" state. If Out-of-band Checks are configured, no end-user requests are sent as probe requests to "bad" servers (see In-band Checks).
For both states, "good" and "bad", it is possible to configure how often the test page should be called, and how many subsequent failed or successful attempts are necessary to initiate a state change from "good" to "bad" or vice versa. For these settings, see the next section.
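The state change logic based on subsequent failed or successful attempts can be sketched as a small state machine. The class name and the threshold values are hypothetical; the check intervals per state are not modeled.

```python
class OutOfBandState:
    """Flips between "good" and "bad" after enough consecutive opposite results."""

    def __init__(self, failures_to_bad=3, successes_to_good=2):
        self.failures_to_bad = failures_to_bad
        self.successes_to_good = successes_to_good
        self.state = "good"
        self.streak = 0   # consecutive results contradicting the current state

    def record(self, ok):
        """Record one check result and return the resulting state."""
        contradicts = (self.state == "good") != ok
        self.streak = self.streak + 1 if contradicts else 0
        needed = (self.failures_to_bad if self.state == "good"
                  else self.successes_to_good)
        if self.streak >= needed:
            self.state = "bad" if self.state == "good" else "good"
            self.streak = 0
        return self.state
```

A single result that agrees with the current state resets the streak, so only uninterrupted runs of failures or successes trigger a state change.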

Configuration of Out-of-band Checks

Out-of-band Checks can be enabled in the panel "Health Checks" - "Out-of-band Checks" in the corresponding Back-end Group. It is important to configure the correct "Check URL path", otherwise the checks will always fail. As for In-band Checks, the health decision depends on the TCP connection, on the HTTP request/response time and, if required, on the HTTP response status code and/or the response content. For Out-of-band Checks, the HTTP request timeout has to be defined separately. The screenshot below shows the default settings.

Note: The TCP connection behavior is globally defined and applies to Out-of-band Checks as well as to In-band Checks.

For more Out-of-band Checks configuration information, see the following manual page: Health Checks
