Scaling based on load balancing serving capacity

This document describes how to scale a managed instance group (MIG) based on the serving capacity of an external Application Load Balancer or an internal Application Load Balancer. This means that autoscaling adds or removes VM instances in the group when the load balancer indicates that the group has reached a configurable fraction of its fullness, where fullness is defined by the target capacity of the selected balancing mode of the backend instance group.

You can also scale a MIG based on its CPU utilization or on Monitoring metrics.

Limitations

You can autoscale a managed instance group based on the serving capacity of an external Application Load Balancer or an internal Application Load Balancer. Other types of load balancers are not supported.

Before you begin

Scaling based on HTTP(S) load balancing serving capacity

Compute Engine provides support for load balancing within your instance groups. You can use autoscaling in conjunction with load balancing by setting up an autoscaler that scales based on the load of your instances.

An external or internal HTTP(S) load balancer distributes requests to backend services according to its URL map. The load balancer can have one or more backend services, each supporting instance group or network endpoint group (NEG) backends. When backends are instance groups, the HTTP(S) load balancer offers two balancing modes: UTILIZATION and RATE. With UTILIZATION, you can specify a maximum target for average backend utilization of instances in the instance group. With RATE, you must specify a target number of requests per second on a per-instance basis or a per-group basis. (Only zonal instance groups support specifying a maximum rate for the whole group. Regional managed instance groups don't support defining a maximum rate per group.)

The balancing mode and the target capacity that you specify define the conditions under which Google Cloud determines when a backend VM is at full capacity. Google Cloud attempts to send traffic to healthy VMs that have remaining capacity. If all VMs are already at capacity, the target utilization or rate is exceeded.

When you attach an autoscaler to an instance group backend of an HTTP(S) load balancer, the autoscaler scales the managed instance group to maintain a fraction of the load balancing serving capacity.

For example, assume the load balancing serving capacity of a managed instance group is defined as 100 RPS per instance. If you create an autoscaler with the HTTP(S) load balancing policy and set it to maintain a target utilization level of 0.8 or 80%, the autoscaler adds or removes instances from the managed instance group to maintain 80% of the serving capacity, or 80 RPS per instance.
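The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names, the `total_rps` input, and the default instance bounds are hypothetical, and the real autoscaler's decision logic is not described in this document and may differ (for example, in how it smooths or delays changes).

```python
import math


def per_instance_target_rps(max_rate_per_instance: float,
                            utilization_target: float) -> float:
    """RPS per instance that the autoscaler tries to maintain."""
    return max_rate_per_instance * utilization_target


def estimated_instances(total_rps: float,
                        max_rate_per_instance: float,
                        utilization_target: float,
                        min_instances: int = 1,
                        max_instances: int = 20) -> int:
    """Rough estimate of the group size needed to hold the target.

    Assumption: traffic is spread evenly across instances, and the
    result is clamped to the configured minimum and maximum sizes.
    """
    target = per_instance_target_rps(max_rate_per_instance, utilization_target)
    needed = math.ceil(total_rps / target)
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # 100 RPS per instance at a 0.8 target gives 80 RPS per instance.
    print(per_instance_target_rps(100, 0.8))
    # 1000 RPS of total traffic needs ceil(1000 / 80) = 13 instances.
    print(estimated_instances(1000, 100, 0.8))
```

Under these assumptions, total traffic of 1000 RPS would call for roughly 13 instances, and any estimate above the configured maximum is capped at that maximum.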

The following diagram shows how the autoscaler interacts with a managed instance group and backend service:

The relationships between the autoscaler, managed instance groups, and load balancing backend services.
The autoscaler watches the serving capacity of the managed instance group, which is defined in the backend service, and scales based on the target utilization. In this example, the serving capacity is measured in the maxRatePerInstance value.

Applicable load balancing configurations

You can set one of three options for your load balancing serving capacity. When you first create the backend, you can choose among maximum backend utilization, maximum requests per second per instance, or maximum requests per second of the whole group. Autoscaling only works with maximum backend utilization and maximum requests per second per instance because the value of these settings can be controlled by adding or removing instances. For example, if you set a backend to handle 10 requests per second per instance, and the autoscaler is configured to maintain 80% of that rate, then the autoscaler can add or remove instances when the requests per second per instance changes.

Autoscaling does not work with maximum requests per group because this setting is independent of the number of instances in the instance group. The load balancer continuously sends the maximum number of requests per group to the instance group, regardless of how many instances are in the group.

For example, if you set the backend to handle 100 maximum requests per group per second, the load balancer sends 100 requests per second to the group, whether the group has two instances or 100 instances. Because this value cannot be adjusted, autoscaling does not work with a load balancing configuration that uses the maximum number of requests per second per group.
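The difference between the two rate settings can be illustrated with a simplified model of fullness. The functions below are hypothetical and are not how the load balancer computes capacity internally. They only show that a per-instance limit yields a signal that changes with group size, while a per-group limit does not.

```python
def fullness_per_instance_mode(total_rps: float,
                               instances: int,
                               max_rate_per_instance: float) -> float:
    """Fraction of capacity in use when the limit is set per instance.

    Capacity grows with the number of instances, so adding instances
    lowers the fraction and removing instances raises it.
    """
    return total_rps / (instances * max_rate_per_instance)


def fullness_per_group_mode(total_rps: float,
                            instances: int,
                            max_rate: float) -> float:
    """Fraction of capacity in use when the limit is set per group.

    The instance count is deliberately unused: the group-wide limit
    stays the same no matter how many instances serve the traffic.
    """
    return total_rps / max_rate


if __name__ == "__main__":
    # Per-instance limit of 100 RPS: doubling the group halves fullness.
    print(fullness_per_instance_mode(100, 2, 100))  # 0.5
    print(fullness_per_instance_mode(100, 4, 100))  # 0.25
    # Per-group limit of 100 RPS: group size makes no difference.
    print(fullness_per_group_mode(100, 2, 100))     # 1.0
    print(fullness_per_group_mode(100, 100, 100))   # 1.0
```

In this model, resizing the group cannot move the per-group signal toward any target, which is the reason that setting cannot drive autoscaling.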

Enable autoscaling based on load balancing serving capacity

Permissions required for this task

To perform this task, you must have the following permissions:

  • compute.autoscalers.create on the project
  • compute.instanceGroupManagers.use on the project

Console

  1. Go to the Instance groups page in the Google Cloud console.


  2. If you have an instance group, select it, and then click Edit. If you don't have an instance group, click Create instance group.
  3. Click Group size & autoscaling to expand the section.
  4. In the Autoscaling mode list, make sure that On: add and remove instances to the group is selected.
  5. Specify the minimum and maximum numbers of instances that you want the autoscaler to create in this group.
  6. In the Autoscaling signals section, click Add a signal.
  7. Set the Signal type to HTTP load balancing utilization.
  8. Enter the Target HTTP load balancing utilization value as a percentage. For example, for 60% HTTP load balancing utilization, enter 60.

  9. You can use the Initialization period field to set the initialization period, which tells the autoscaler how long it takes for your application to initialize. Specifying an accurate initialization period improves autoscaler decisions. For example, when scaling out, the autoscaler ignores data from VMs that are still initializing because those VMs might not yet represent normal usage of your application. The default initialization period is 60 seconds.

  10. Save your changes.

gcloud

To enable an autoscaler that scales on serving capacity, use the set-autoscaling sub-command. For example, the following command creates an autoscaler that scales the target managed instance group to maintain 60% of the serving capacity. Along with the --target-load-balancing-utilization parameter, the --max-num-replicas parameter is also required when creating an autoscaler:

gcloud compute instance-groups managed set-autoscaling example-managed-instance-group \
    --max-num-replicas 20 \
    --target-load-balancing-utilization 0.6 \
    --cool-down-period 90

You can use the --cool-down-period flag to set the initialization period, which tells the autoscaler how long it takes for your application to initialize. Specifying an accurate initialization period improves autoscaler decisions. For example, when scaling out, the autoscaler ignores data from VMs that are still initializing because those VMs might not yet represent normal usage of your application. The default initialization period is 60 seconds.
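If you generate this command from a script, a small helper can assemble the argument list before handing it to a process runner. The following Python sketch is an assumption-laden illustration: `build_set_autoscaling_command` is a hypothetical helper, it uses only the flags shown in this section, and it only builds and prints the command rather than running it. The positive-value check is a basic sanity check, not a statement of the full validation that gcloud performs.

```python
import shlex
from typing import List, Optional


def build_set_autoscaling_command(
        group_name: str,
        max_num_replicas: int,
        target_lb_utilization: float,
        cool_down_period_sec: Optional[int] = None) -> List[str]:
    """Assemble the set-autoscaling command as an argument list."""
    if target_lb_utilization <= 0:
        # Sanity check only (assumption): the target is a positive fraction.
        raise ValueError("target_lb_utilization must be positive")
    command = [
        "gcloud", "compute", "instance-groups", "managed",
        "set-autoscaling", group_name,
        "--max-num-replicas", str(max_num_replicas),
        "--target-load-balancing-utilization", str(target_lb_utilization),
    ]
    if cool_down_period_sec is not None:
        # Optional: omit the flag to keep the default initialization period.
        command += ["--cool-down-period", str(cool_down_period_sec)]
    return command


if __name__ == "__main__":
    cmd = build_set_autoscaling_command(
        "example-managed-instance-group", 20, 0.6, cool_down_period_sec=90)
    # Print a shell-quoted form for review before running it yourself.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if the list is later passed to something like `subprocess.run`.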

You can verify that your autoscaler was successfully created by using the instance-groups managed describe sub-command:

gcloud compute instance-groups managed describe example-managed-instance-group

For a list of available gcloud commands and flags, see the gcloud reference.

Note: If autoscaling is already enabled for a managed instance group, the set-autoscaling command updates the existing autoscaler to the new specifications.

REST

Note: Although autoscaling is a feature of managed instance groups, it is a separate API resource. Keep that in mind when you construct API requests for autoscaling.

To create an autoscaler, use the autoscalers.insert method for a zonal MIG or the regionAutoscalers.insert method for a regional MIG.

The following example creates an autoscaler for a zonal MIG:

POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/autoscalers/

Your request body must contain the name, target, and autoscalingPolicy fields. autoscalingPolicy must define loadBalancingUtilization.

You can use the coolDownPeriodSec field to set the initialization period, which tells the autoscaler how long it takes for your application to initialize. Specifying an accurate initialization period improves autoscaler decisions. For example, when scaling out, the autoscaler ignores data from VMs that are still initializing because those VMs might not yet represent normal usage of your application. The default initialization period is 60 seconds.

{
  "name": "example-autoscaler",
  "target": "zones/us-central1-f/instanceGroupManagers/example-managed-instance-group",
  "autoscalingPolicy": {
    "maxNumReplicas": 20,
    "loadBalancingUtilization": {
      "utilizationTarget": 0.8
    },
    "coolDownPeriodSec": 90
  }
}
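As a rough sketch, the request URL and body for a zonal MIG can be assembled programmatically as follows. The helper names `autoscaler_insert_url` and `autoscaler_body` are hypothetical, the project ID `my-project` is a placeholder, and the sketch only constructs and prints the request. Sending it would additionally require an authenticated HTTP client, which is outside the scope of this example.

```python
import json


def autoscaler_insert_url(project_id: str, zone: str) -> str:
    """URL of the autoscalers.insert method for a zonal MIG."""
    return ("https://compute.googleapis.com/compute/v1/"
            f"projects/{project_id}/zones/{zone}/autoscalers")


def autoscaler_body(name: str,
                    zone: str,
                    group_name: str,
                    max_num_replicas: int,
                    utilization_target: float,
                    cool_down_period_sec: int = 60) -> dict:
    """Request body with the name, target, and autoscalingPolicy fields."""
    return {
        "name": name,
        "target": f"zones/{zone}/instanceGroupManagers/{group_name}",
        "autoscalingPolicy": {
            "maxNumReplicas": max_num_replicas,
            # Scale on load balancing serving capacity.
            "loadBalancingUtilization": {
                "utilizationTarget": utilization_target,
            },
            # Initialization period, in seconds.
            "coolDownPeriodSec": cool_down_period_sec,
        },
    }


if __name__ == "__main__":
    body = autoscaler_body("example-autoscaler", "us-central1-f",
                           "example-managed-instance-group", 20, 0.8, 90)
    print("POST", autoscaler_insert_url("my-project", "us-central1-f"))
    print(json.dumps(body, indent=2))
```

Building the body as a dictionary and serializing it with `json.dumps` keeps the nesting of `loadBalancingUtilization` inside `autoscalingPolicy` explicit and avoids hand-written JSON errors.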

For more information about enabling autoscaling based on load balancing serving capacity, complete the tutorial, Globally autoscaling a web service on Compute Engine.

What's next


Last updated 2026-02-18 UTC.