Application capacity optimizations with global load balancing
Most load balancers use a round-robin or flow-based hashing approach to distribute traffic. However, load balancers that use this approach can have difficulty adapting when demand spikes beyond available serving capacity. This article explains how using Cloud Load Balancing can address these issues and optimize your global application capacity. This often results in a better user experience and lower costs compared to traditional load-balancing implementations.
This article is part of a best practices series focused on Google's Cloud Load Balancing products. For a tutorial that accompanies this article, see Capacity Management with Load Balancing. For a deep dive on latency, see Optimizing Application Latency with Load Balancing.
Capacity challenges in global applications
Scaling global applications can be challenging, especially if you have limited IT budgets and unpredictable, bursty workloads. In public cloud environments such as Google Cloud, the flexibility provided by features like autoscaling and load balancing can help. However, autoscalers have some limitations, as explained in this section.
Latency in starting new instances
The most common issue with autoscaling is that the requested application isn't ready to serve traffic quickly enough. Depending on your VM instance images, scripts typically must be run and information loaded before VM instances are ready. It often takes a few minutes before load balancing is able to direct users to new VM instances. During that time, traffic is distributed to existing VM instances, which might already be over capacity.
Applications limited by backend capacity
Some applications can't be autoscaled at all. For example, databases often have limited backend capacity. Only a specific number of frontends can access a database that doesn't scale horizontally. If your application relies on external APIs that support only a limited number of requests per second, the application also can't be autoscaled.
Non-elastic licenses
When you use licensed software, your license often limits you to a preset maximum capacity. Your ability to autoscale might therefore be restricted because you can't add licenses on the fly.
Too little VM instance headroom
To account for sudden bursts of traffic, an autoscaler should include ample headroom (for example, the autoscaler is triggered at 70% of CPU capacity). To save costs, you might be tempted to set this target higher, such as 90% of CPU capacity. However, higher trigger values might result in scaling bottlenecks when confronted with bursts of traffic, such as an advertising campaign that suddenly increases demand. You need to balance headroom size based on how spiky your traffic is and how long your new VM instances take to get ready.
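The headroom trade-off above can be estimated with simple arithmetic. The following Python sketch is an illustration only, not a Google-provided formula; the function name and the growth-rate figures are invented for this example:

```python
def required_headroom(growth_rate_per_min, boot_minutes, safety=1.0):
    """Estimate the fraction of spare capacity needed to absorb traffic
    growth while new VM instances boot.

    growth_rate_per_min: expected traffic growth as a fraction of current
        load per minute during a burst (e.g. 0.05 = 5% per minute).
    boot_minutes: time until a new VM instance is ready to serve.
    safety: optional multiplier for extra margin.
    """
    return growth_rate_per_min * boot_minutes * safety

# If traffic can grow 5% per minute and instances take 4 minutes to become
# ready, you need roughly 20% headroom, i.e. an autoscaler trigger of
# about 80% CPU utilization rather than 90%.
headroom = required_headroom(0.05, 4)
print(f"headroom: {headroom:.0%}, CPU trigger: {1 - headroom:.0%}")
```

The faster your instances boot, the higher you can safely set the autoscaler trigger.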
Regional quotas
If you have unexpected bursts in a region, your existing resource quotas might cap the number of instances you can create below the level required to support the current burst. Processing an increase to your resource quota can take hours or days.
Addressing these challenges with global load balancing
The external Application Load Balancers and external proxy Network Load Balancers are global load balancing products proxied through globally synchronized Google Front End (GFE) servers, making it easier to mitigate these types of load balancing challenges. These products offer a solution to the challenges because traffic is distributed to backends differently than in most regional load balancing solutions.
These differences are described in the following sections.
Algorithms used by other load balancers
Most load balancers use the same algorithms to distribute traffic between backends:
- Round-robin. Packets are equally distributed between all backends regardless of the packets' source and destination.
- Hashing. Packet flows are identified based on hashes of traffic information, including source IP, destination IP, port, and protocol. All traffic that produces the same hash value flows to the same backend.
Hashing load balancing is the algorithm currently available for external passthrough Network Load Balancers. This load balancer supports 2-tuple hashing (based on source and destination IP), 3-tuple hashing (based on source IP, destination IP, and protocol), and 5-tuple hashing (based on source IP, destination IP, source port, destination port, and protocol).
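The flow-hashing idea can be sketched in a few lines of Python. This is a generic illustration of 5-tuple hashing, not the actual passthrough Network Load Balancer implementation; the backend addresses and hash choice (SHA-256) are invented for this example:

```python
import hashlib

def pick_backend(backends, src_ip, dst_ip, proto, src_port, dst_port):
    """Pick a backend by hashing the 5-tuple that identifies a flow.

    Because the hash depends only on the flow tuple, every packet of the
    same flow lands on the same backend, regardless of how loaded that
    backend currently is. Dropping the ports (and protocol) from the key
    would give 3-tuple (or 2-tuple) hashing instead.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return backends[digest % len(backends)]

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
# The same flow always maps to the same backend:
first = pick_backend(backends, "1.2.3.4", "8.8.8.8", "tcp", 5000, 443)
again = pick_backend(backends, "1.2.3.4", "8.8.8.8", "tcp", 5000, 443)
assert first == again
```

Note what is missing from the function signature: backend load. That absence is exactly the limitation the next paragraphs describe.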
With both of these algorithms, unhealthy instances are taken out of the distribution. However, the current load on the backends is rarely a factor in load distribution.
Some hardware or software load balancers use algorithms that forward traffic based on other metrics, such as weighted round-robin, lowest load, fastest response time, or number of active connections. However, if load increases beyond the expected level due to sudden traffic bursts, traffic is still distributed to backend instances that are over capacity, leading to drastic increases in latency.
Some load balancers allow advanced rules in which traffic that exceeds the backend's capacity is forwarded to another pool or redirected to a static website. This enables you to effectively reject this traffic and send a "service unavailable, please try again later" message. Some load balancers give you the option to put requests in a queue.
Global load balancing solutions are often implemented with a DNS-based algorithm, serving different regional load balancing IPs based on the user's location and backend load. These solutions offer failover to another region for all or part of a regional deployment's traffic. However, with any DNS-based solution, failover usually takes minutes, depending on the time-to-live (TTL) value of the DNS entries. In general, a small amount of traffic continues to be directed to the old servers well past the time that the TTL should have expired everywhere. DNS-based global load balancing is therefore not the optimal solution for dealing with bursty traffic.
How external Application Load Balancers work
The external Application Load Balancer uses a different approach. Traffic is proxied through GFE servers deployed throughout most of Google's global network edge locations. This currently constitutes over 80 locations around the world. The load balancing algorithm is applied at the GFE servers.
The external Application Load Balancer is available through a single stable IP address that is announced globally at the edge nodes, and connections are terminated by any of the GFEs.
Note: The algorithm described here is equivalent for all GFE-based load balancers on Google Cloud, including the external Application Load Balancer and external proxy Network Load Balancer.

The GFEs are interconnected through Google's global network. Data that describes available backends and the available serving capacity for each load-balanced resource is continually distributed to all GFEs using a global control plane.

Traffic to load-balanced IP addresses is proxied to backend instances that are defined in the external Application Load Balancer configuration using a special load balancing algorithm called Waterfall by Region. This algorithm determines the optimal backend for servicing the request by taking into account the proximity of the instances to the users, the incoming load, and the available capacity of backends in each zone and region. Finally, worldwide load and capacity are also taken into account.
The external Application Load Balancer distributes traffic based on available instances. To add new instances based on load, the algorithm works in conjunction with autoscaling instance groups.
Traffic flow within a region
Under normal circumstances, all traffic is sent to the region closest to the user. Load balancing is then performed according to these guidelines:
- Within each region, traffic is distributed across instance groups, which can be in multiple zones, according to each group's capacity.
- If capacity is unequal between zones, zones are loaded in proportion to their available serving capacity.
- Within zones, requests are spread evenly over the instances in each instance group.
- Sessions are persisted based on client IP address or on a cookie value, depending on the session affinity setting.
- Unless the backend becomes unavailable, existing TCP connections never move to a different backend.
The following diagram shows load distribution in this case, where each region is under capacity and can handle the load from the users closest to that region.

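The proportional-to-capacity rule from the guidelines above can be sketched numerically. The following Python snippet is an illustration, not Google's implementation; the zone names and capacity figures are invented:

```python
def split_by_capacity(requests, zone_capacity):
    """Distribute requests across zones in proportion to each zone's
    available serving capacity.

    zone_capacity: dict of zone name -> available serving capacity
        (for example, requests per second the zone can absorb).
    """
    total = sum(zone_capacity.values())
    return {zone: requests * cap / total
            for zone, cap in zone_capacity.items()}

# A zone with twice the serving capacity receives twice the traffic:
share = split_by_capacity(900, {"us-east1-b": 200, "us-east1-c": 100})
print(share)  # {'us-east1-b': 600.0, 'us-east1-c': 300.0}
```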
Traffic overflow to other regions
If an entire region reaches capacity, as determined by the serving capacity set in the backend services, the Waterfall by Region algorithm is triggered and traffic overflows to the closest region that has available capacity. As each region reaches capacity, traffic spills to the next closest region, and so on. A region's proximity to the user is defined by the network round-trip time from the GFE to the instance backends.
The following diagram shows the overflow to the next closest region when one region receives more traffic than it can handle regionally.

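The spill-over behavior can be sketched as a greedy fill ordered by round-trip time. This is a deliberately simplified illustration, not the actual GFE algorithm (which also weighs per-zone capacity, in-flight load, and global capacity); the region names, RTTs, and capacities below are invented:

```python
def waterfall_by_region(demand, regions):
    """Simplified Waterfall by Region: fill the closest region first,
    spill the remainder to the next closest, and so on.

    regions: list of (name, rtt_ms, capacity) tuples, where rtt_ms is
        the round-trip time from the user's GFE to that region's backends.
    Returns (assignments, leftover demand if every region is full).
    """
    assigned = {}
    for name, _rtt, capacity in sorted(regions, key=lambda r: r[1]):
        take = min(demand, capacity)
        assigned[name] = take
        demand -= take
        if demand <= 0:
            break
    return assigned, demand

regions = [("europe-west1", 15, 100),
           ("us-east1", 90, 150),
           ("asia-east1", 250, 80)]
assigned, leftover = waterfall_by_region(220, regions)
print(assigned)  # {'europe-west1': 100, 'us-east1': 120}
print(leftover)  # 0
```

In the example, the closest region absorbs its full capacity of 100 and the remaining 120 requests spill to the next closest region, leaving the farthest region untouched.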
Cross-regional overflow due to unhealthy backends
If health checks discover that more than half of the backends in a region are unhealthy, the GFEs preemptively overflow some traffic to the next closest region. This avoids traffic failing completely as the region becomes unhealthy. This overflow occurs even if the remaining capacity in the region with the unhealthy backends is sufficient.
The following diagram shows the overflow mechanism in effect, because the majority of backends in one region are unhealthy.

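The trigger condition is simple enough to state as a predicate. This is an illustrative restatement of the "more than half unhealthy" rule described above, not GFE source code; the function name is invented:

```python
def should_preemptively_overflow(healthy, total):
    """True when more than half of a region's backends are unhealthy.

    At that point the GFEs start shifting some traffic to the next
    closest region, even if the healthy remainder still has spare
    serving capacity.
    """
    return healthy < total / 2  # i.e. unhealthy backends > total / 2

assert should_preemptively_overflow(healthy=3, total=8)      # 5 of 8 unhealthy
assert not should_preemptively_overflow(healthy=4, total=8)  # exactly half
```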
Note: If requests fail with failed_to_pick_backend or failed_to_pick_backend_by_hash despite there being one or more healthy instance groups, drain the unhealthy instance groups by setting their capacity to zero so that the load balancer starts forwarding traffic to the healthy instance groups.

All regions above capacity
When traffic to all regions is at or above capacity, traffic is balanced so that every region is at the same relative level of overflow compared to its capacity. For example, if global demand exceeds global capacity by 20%, traffic is distributed in a way that all regions receive requests at 20% over their regional capacity, while keeping traffic as local as possible.
The following diagram shows this global overflow rule in effect. In this case, a single region receives so much traffic that it cannot all be served even with the available serving capacity globally.

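The equal-relative-overflow rule can be checked with a small calculation. This Python sketch illustrates the proportional split only; it deliberately ignores the "keep traffic as local as possible" aspect, and the region names and numbers are invented:

```python
def distribute_global_overflow(regional_demand, regional_capacity):
    """When global demand exceeds global capacity, load every region at
    the same overflow ratio relative to its own capacity.

    Returns the number of requests each region serves.
    """
    ratio = sum(regional_demand.values()) / sum(regional_capacity.values())
    return {region: cap * ratio
            for region, cap in regional_capacity.items()}

demand = {"us": 900, "eu": 200, "asia": 100}    # 1200 total
capacity = {"us": 500, "eu": 300, "asia": 200}  # 1000 total -> 20% excess
served = distribute_global_overflow(demand, capacity)
# Every region ends up at 120% of its capacity (600, 360, 240), even
# though almost all of the excess demand originated in one region.
print(served)
```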
Temporary overflow during autoscaling
Autoscaling is based on the capacity limits configured on each backend service and brings up new instances when traffic gets close to the configured capacity limits. Depending on how quickly request levels rise and how fast new instances come online, overflow to other regions might be unnecessary. In other cases, overflow can act as a temporary buffer until new local instances are online and ready to serve live traffic. When the capacity expanded by autoscaling is sufficient, all new sessions are distributed to the closest region.
Latency effects of overflow
Under the Waterfall by Region algorithm, the external Application Load Balancer can overflow some traffic to other regions. However, TCP sessions and SSL traffic are still terminated by the GFE closest to the user. This is beneficial to application latency; for details, see Optimizing Application Latency with Load Balancing.
Hands-on: Measuring effects of capacity management
To understand how overflow occurs and how you can manage it using the HTTP load balancer, see the Capacity Management with Load Balancing tutorial that accompanies this article.
Using an external Application Load Balancer to address capacity challenges
To help address the challenges discussed earlier, external Application Load Balancers and external proxy Network Load Balancers can overflow capacity to other regions. For global applications, responding to users with slightly higher overall latency results in a better experience than using a regional backend. Applications that use a regional backend have nominally lower latency, but they can become overloaded.
Let's revisit how an external Application Load Balancer can help address the scenarios mentioned at the beginning of the article:
- Latency in starting new instances. If the autoscaler can't add capacity fast enough during local traffic bursts, the external Application Load Balancer temporarily overflows connections to the next closest region. This ensures that existing user sessions in the original region are handled at optimal speed because they remain on existing backends, while new user sessions experience only a slight latency bump. As soon as additional backend instances are scaled up in the original region, new traffic is again routed to the region closest to the users.
- Applications limited by backend capacity. Applications that can't be autoscaled but that are available in multiple regions can still overflow to the next closest region when demand in one region exceeds the capacity deployed for the usual traffic needs.
- Non-elastic licenses. If the number of software licenses is limited, and the license pool in the current region has been exhausted, the external Application Load Balancer can move traffic to a region where licenses are available. For this to work, the autoscaler's maximum number of instances is set to the maximum number of licenses.
- Too little VM headroom. The possibility of regional overflow helps to save money, because you can set up autoscaling with a high CPU usage trigger. You can also configure available backend capacity below each regional peak, because overflow to other regions ensures that global capacity is always sufficient.
- Regional quotas. If Compute Engine resource quotas don't match demand, the external Application Load Balancer overflow automatically redirects part of the traffic to a region that can still scale within its regional quota.
What's next
The following pages provide more information and background on Google's load balancing options:
- Capacity Management with Load Balancing Tutorial
- Optimizing Application Latency with Load Balancing
- Networking 101 Codelab
- External passthrough Network Load Balancer
- External Application Load Balancer
- External proxy Network Load Balancer
Last updated 2018-01-18 UTC.