Manage traffic and load for your workloads in Google Cloud

When you run an application stack on distributed resources in the cloud, network traffic must be routed efficiently to the available resources across multiple locations. This part of the Google Cloud infrastructure reliability guide describes traffic- and load-management techniques that you can use to help improve the reliability of your cloud workloads.

Capacity planning

To ensure that your application deployed in Google Cloud has adequate infrastructure resources, you must estimate the required capacity and manage the deployed capacity. This section provides guidelines to help you plan and manage capacity.

Forecast the application load

When you forecast the load, consider factors like the number of users and the rate at which the application might receive requests. In your forecasts, consider historical load trends, seasonal variations, load spikes during special events, and growth driven by business changes like expansion to new geographies.

Estimate capacity requirements

Based on your deployment architecture, and considering the performance and reliability objectives of your application, estimate the quantity of Google Cloud resources that are necessary to handle the expected load. For example, if you plan to use Compute Engine managed instance groups (MIGs), decide the size of each MIG, the VM machine type, and the number, type, and size of persistent disks. You can use the Google Cloud Pricing Calculator to estimate the cost of the Google Cloud resources.
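A capacity estimate often reduces to simple arithmetic over your load forecast and benchmarking results. The following sketch uses hypothetical numbers (a forecast peak of 1,200 requests per second, and a measured per-VM throughput of 150 requests per second at the target utilization):

```shell
# Hypothetical sizing inputs: forecast peak load and measured per-VM throughput.
peak_rps=1200        # forecast peak requests per second
rps_per_vm=150       # requests per second that one VM sustains at target utilization

# Round up: the number of VMs needed to serve the peak load.
vms_needed=$(( (peak_rps + rps_per_vm - 1) / rps_per_vm ))
echo "$vms_needed"   # VMs to provision for the peak load
```

Repeat the same calculation for each resource dimension (CPU, memory, disk IOPS) and take the largest result, because the first dimension to saturate determines the capacity you need.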

Plan adequate redundancy

When you estimate the capacity requirements, provide adequate redundancy for every component of the application stack. For example, to achieve N+1 redundancy, every component in the application stack must have at least one redundant component beyond the minimum that's necessary to handle the forecast load.
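For multi-zone deployments, a related sizing rule is to keep enough capacity that the surviving zones can carry the full peak load if any one zone fails. A minimal sketch, assuming a hypothetical deployment of 8 VMs spread across 3 zones:

```shell
# Hypothetical inputs: VMs needed at peak, and the number of zones used.
vms_at_peak=8
zones=3

# To survive the loss of any one zone, the remaining (zones - 1) zones must
# still carry the full peak load: size each zone at ceil(vms_at_peak / (zones - 1)).
per_zone=$(( (vms_at_peak + zones - 2) / (zones - 1) ))
total=$(( per_zone * zones ))
echo "$per_zone VMs per zone, $total VMs total"
```

In this example, the deployment carries 12 VMs in total rather than 8; the extra 4 VMs are the redundancy that keeps the application at full capacity during a zone outage.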

Benchmark the application

Run load tests to determine the resource efficiency of your application. Resource efficiency is the relationship between the load on the application and the resources, such as CPU and memory, that the application consumes. The resource efficiency of an application can deteriorate when the load is exceptionally high, and the efficiency might change over time. Conduct the load tests for both normal and peak load conditions, and repeat the benchmarking tests at regular intervals.
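As one illustration, a simple HTTP load test can be run with a general-purpose tool like ApacheBench (`ab`); the URL and request counts below are placeholders, and this assumes `ab` is installed on the test machine:

```shell
# Hypothetical load test: 10,000 requests at a concurrency of 100.
# Replace the URL with your application's endpoint.
ab -n 10000 -c 100 https://app.example.com/checkout
```

While the test runs, record CPU and memory utilization on the backends. Dividing the sustained request rate by the resources consumed (for example, requests per CPU-second) gives a simple resource-efficiency metric that you can track across benchmark runs.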

Manage quotas

Google Cloud service quotas are per-project limits that help you control the consumption of cloud resources. Quotas are of two types:

  • Resource quotas are the maximum resources that you can create, such as the number of regional Google Kubernetes Engine (GKE) clusters in a region.
  • Rate quotas limit the number of API requests that can be sent to a service in a specific period.

Quotas can be zonal, regional, or global. Review the current resource quotas and API rate quotas for the services that you plan to use in your projects. Ensure that the quotas are sufficient for the capacity that you need. When required, you can request a higher quota value.
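For Compute Engine, you can review current quota limits and usage with the gcloud CLI; the project ID and region below are placeholders:

```shell
# Project-wide Compute Engine quotas (limits and current usage).
gcloud compute project-info describe --project my-project \
    --format="yaml(quotas)"

# Regional quotas, such as CPUs or persistent disk capacity per region.
gcloud compute regions describe us-central1 --project my-project \
    --format="yaml(quotas)"
```

Comparing the reported usage against the limits for your target regions shows whether you need to request quota increases before a planned capacity expansion.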

Reserve compute capacity

To make sure that capacity for Compute Engine resources is available when necessary, you can create reservations. A reservation provides assured capacity in a specific zone for a specified number of VMs of a machine type that you choose. A reservation can be specific to a project, or shared across multiple projects. For more information about reservations, see Choose a reservation type.
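As a sketch, the following gcloud command reserves capacity for 8 VMs of a given machine type in one zone; the reservation name, zone, and machine type are hypothetical:

```shell
# Reserve zonal capacity for 8 n2-standard-4 VMs (names are placeholders).
gcloud compute reservations create my-reservation \
    --zone=us-central1-a \
    --vm-count=8 \
    --machine-type=n2-standard-4
```

VMs whose zone and machine type match the reservation consume it automatically, so the reserved capacity is used rather than sitting alongside your running VMs.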

Monitor utilization, and reassess requirements periodically

After you deploy the required resources, monitor the capacity utilization. You might find opportunities to optimize cost by removing idle resources. Periodically reassess the capacity requirements, and consider any changes in the application behavior, performance and reliability objectives, user load, and your IT budget.

Autoscaling

When you run an application on resources that are distributed across multiple locations, the application remains available during outages at one of the locations. In addition, redundancy helps ensure that users experience consistent application behavior. For example, when there's a spike in the load, the redundant resources ensure that the application continues to perform at a predictable level. But when the load on the application is low, redundancy can result in inefficient utilization of cloud resources.

For example, the shopping cart component of an ecommerce application might need to process payments for 99.9% of orders within 200 milliseconds after order confirmation. To meet this requirement during periods of high load, you might provision redundant compute and storage capacity. But when the load on the application is low, a portion of the provisioned capacity might remain unused or under-utilized. To remove the unused resources, you would need to monitor the utilization and adjust the capacity.

Autoscaling helps you manage cloud capacity and maintain the required level of availability without the operational overhead of managing redundant resources. When the load on your application increases, autoscaling helps to improve the availability of the application by provisioning additional resources automatically. During periods of low load, autoscaling removes unused resources and helps to reduce cost.

Certain Google Cloud services, like Compute Engine, let you configure autoscaling for the resources that you provision. Managed services like Cloud Run can scale capacity automatically without you having to configure anything. The following are examples of Google Cloud services that support autoscaling. This list is not exhaustive.

  • Compute Engine: MIGs let you automatically scale stateless applications that are deployed on Compute Engine VMs to match the capacity with the current load. For more information, see Autoscaling groups of instances.
  • GKE: You can configure GKE clusters to automatically resize the node pools to match the current load. For more information, see Cluster autoscaler. For GKE clusters that you provision in the Autopilot mode, GKE automatically scales the nodes and workloads based on the traffic.
  • Cloud Run: Services that you provision in Cloud Run scale out automatically to the number of container instances that are necessary to handle the current load. When the application has no load, the service automatically scales in the number of container instances to zero. For more information, see About container instance autoscaling.
  • Cloud Run functions: Each request to a function is assigned to an instance of the function. If the volume of inbound requests exceeds the number of existing function instances, Cloud Run functions automatically starts new instances of the function. For more information, see Cloud Run functions execution environment.
  • Bigtable: When you create a cluster in a Bigtable instance, you can configure the cluster to scale automatically. Bigtable monitors the CPU and storage load, and adjusts the number of nodes in the cluster to maintain the target utilization rates that you specify. For more information, see Bigtable autoscaling.
  • Google Cloud Serverless for Apache Spark: When you submit an Apache Spark batch workload, Google Cloud Serverless for Apache Spark dynamically scales the workload resources, such as the number of executors, to run the workload efficiently. For more information, see Google Cloud Serverless for Apache Spark autoscaling.
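For the Compute Engine case above, autoscaling is configured on the MIG itself. The following sketch enables CPU-based autoscaling on a hypothetical existing MIG; the group name, zone, and thresholds are placeholders to adjust for your workload:

```shell
# Enable autoscaling on an existing MIG: keep average CPU utilization
# near 60%, with between 3 and 10 VMs (names and values are hypothetical).
gcloud compute instance-groups managed set-autoscaling my-mig \
    --zone=us-central1-a \
    --min-num-replicas=3 \
    --max-num-replicas=10 \
    --target-cpu-utilization=0.60 \
    --cool-down-period=90
```

The minimum replica count doubles as your redundancy floor: setting it to the N+1 value from your capacity plan keeps the group resilient even when the load is low.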

Load balancing

Load balancing helps to improve application reliability by routing traffic only to the available resources and by ensuring that individual resources aren't overloaded.

Consider the following reliability-related design recommendations when choosing and configuring load balancers for your cloud deployment.

Load-balance internal traffic

Configure load balancing for the traffic between the tiers of the application stack as well, not just for the traffic between the external clients and the application. For example, in a 3-tier web application stack, you can use an internal load balancer for reliable communication between the web and app tiers.
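A condensed sketch of an internal passthrough load balancer in front of an app tier follows; all names are hypothetical, and it assumes a health check (`app-tier-hc`) and a backend MIG (`app-tier-mig`) already exist:

```shell
# Backend service for internal traffic, tied to an existing health check.
gcloud compute backend-services create app-tier-backend \
    --load-balancing-scheme=INTERNAL \
    --protocol=TCP \
    --region=us-central1 \
    --health-checks=app-tier-hc

# Attach the app-tier MIG as a backend.
gcloud compute backend-services add-backend app-tier-backend \
    --region=us-central1 \
    --instance-group=app-tier-mig \
    --instance-group-zone=us-central1-a

# Internal forwarding rule that the web tier uses to reach the app tier.
gcloud compute forwarding-rules create app-tier-ilb \
    --load-balancing-scheme=INTERNAL \
    --region=us-central1 \
    --ports=8080 \
    --backend-service=app-tier-backend \
    --network=default
```

The web tier then addresses the forwarding rule's IP address instead of individual app-tier VMs, so failures and scaling events in the app tier stay invisible to the tier above it.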

Choose an appropriate load balancer type

To load-balance external traffic to an application that's distributed across multiple regions, you can use a global load balancer or multiple regional load balancers. For more information, see Benefits and risks of global load balancing for multi-region deployments.

If the backends are in a single region and you don't need the features of global load balancing, you can use a regional load balancer, which is resilient to zone outages.

When you choose the load balancer type, consider other factors besides availability, such as geographic control over TLS termination, performance, cost, and the traffic type. For more information, see Choose a load balancer.

Configure health checks

Autoscaling helps to ensure that your applications have adequate infrastructure resources to handle the current load. But even when sufficient infrastructure resources exist, an application or parts of it might not be responsive. For example, all the VMs that host your application might be in the RUNNING state, but the application software that's deployed on some of the VMs might have crashed. Load-balancing health checks ensure that the load balancers route application traffic only to the backends that are responsive. If your backends are MIGs, then consider configuring an extra layer of health checks to autoheal the VMs that aren't available. When autohealing is configured for a MIG, the unavailable VMs are proactively deleted, and new VMs are created.
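Both layers can be set up with a few gcloud commands. The sketch below creates an HTTP health check and attaches it to a hypothetical MIG for autohealing; it assumes the application serves a `/healthz` endpoint on port 80, and all names are placeholders:

```shell
# Application-level health check: probe /healthz every 10 seconds and
# mark a backend unhealthy after 3 consecutive failures.
gcloud compute health-checks create http app-hc \
    --port=80 \
    --request-path=/healthz \
    --check-interval=10s \
    --unhealthy-threshold=3

# Attach the health check to the MIG for autohealing. The initial delay
# gives new VMs time to boot before health checking begins.
gcloud compute instance-groups managed update my-mig \
    --zone=us-central1-a \
    --health-check=app-hc \
    --initial-delay=300
```

Choosing an initial delay shorter than the application's actual startup time causes a recreation loop, so measure the boot-to-ready time before setting it.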

Rate limiting

At times, your application might experience a rapid or sustained increase in the load. If the application isn't designed to handle the increased load, the application or the resources that it uses might fail, making the application unavailable. The increased load might be caused by malicious requests, such as network-based distributed denial-of-service (DDoS) attacks. A sudden spike in the load can also occur due to other reasons, such as configuration errors in the client software. To ensure that your application can handle excessive load, consider applying suitable rate-limiting mechanisms. For example, you can set quotas for the number of API requests that a Google Cloud service can receive.
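One way to enforce a per-client rate limit at the edge is a Google Cloud Armor throttle rule on a security policy attached to your external load balancer. The policy name, priority, and thresholds below are hypothetical:

```shell
# Throttle each client IP to 100 requests per minute; requests beyond the
# threshold receive HTTP 429 (names and limits are placeholders).
gcloud compute security-policies rules create 1000 \
    --security-policy=my-policy \
    --src-ip-ranges="*" \
    --action=throttle \
    --rate-limit-threshold-count=100 \
    --rate-limit-threshold-interval-sec=60 \
    --conform-action=allow \
    --exceed-action=deny-429 \
    --enforce-on-key=IP
```

Because the limit is keyed on the client IP, a single misconfigured or abusive client is contained without affecting the request budget of other clients.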

Rate-limiting techniques can also help optimize the cost of your cloud infrastructure. For example, by setting project-level quotas for specific resources, you can limit the billing that the project can incur for those resources.

Network Service Tiers

Google Cloud Network Service Tiers let you optimize connectivity between systems on the internet and your Google Cloud workloads. For applications that serve users globally and have backends in more than one region, choose Premium Tier. Traffic from the internet enters the high-performance Google network at the point of presence (PoP) that's closest to the sending system. Within the Google network, traffic is routed from the entry PoP to the appropriate Google Cloud resource, such as a Compute Engine VM. Outbound traffic is sent through the Google network, exiting at the PoP that's closest to the destination. This routing method helps to improve the availability that users perceive by reducing the number of network hops between the users and the PoPs closest to them.
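The tier can be set as a project-wide default or chosen per resource. Both commands below use hypothetical names:

```shell
# Set Premium Tier as the project-wide default for external traffic
# (project ID is a placeholder).
gcloud compute project-info update --project=my-project \
    --default-network-tier=PREMIUM

# Or choose the tier per resource, for example for an external IP address.
gcloud compute addresses create my-address \
    --region=us-central1 \
    --network-tier=PREMIUM
```

Setting the project default keeps new resources consistent, while the per-resource flag lets you keep Standard Tier for workloads whose traffic stays within a single region.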


Last updated 2024-11-20 UTC.