About instance autoscaling in Cloud Run services

This page describes the Cloud Run default autoscaling behavior. If you need more control over your scaling behavior, learn about the alternative scaling option, manual scaling.

By default, each Cloud Run revision is automatically scaled to the number of instances needed to handle all incoming requests, events, or CPU utilization.

When a revision does not receive any traffic, by default, it is scaled to zero instances. However, if needed, you can change this default to specify an instance to be kept idle or "warm" using the minimum instances setting. If your service is using CPU even when it's not processing requests, you should set minimum instances equal to 1.

In addition to the rate of incoming requests, events, or CPU utilization, the number of instances scheduled is impacted by:

The average CPU utilization of existing instances over a one-minute window, with a target of keeping scheduled instances at 60% CPU utilization.
The current request concurrency, with a target of keeping instance concurrency at 60% of the maximum concurrency over a one-minute window.
The maximum number of instances setting
The minimum number of instances setting

The Cloud Run autoscaler evaluates these periodically.
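The way these factors interact can be illustrated with a generic target-tracking calculation. The following is a minimal sketch, not Cloud Run's actual algorithm, which is not published in this page. The function name `recommend_instances` and its parameters are hypothetical, and the sketch assumes the more loaded of the two signals (CPU or concurrency) drives the decision. Utilization values are whole percentages so that the arithmetic stays exact.

```python
def recommend_instances(current_instances, cpu_percent, concurrency_percent,
                        target_percent=60, min_instances=0, max_instances=100):
    """Estimate an instance count that would bring utilization back to the target.

    Illustrative only: assumes load spreads evenly across instances and that
    the higher of the two utilization signals determines the result.
    """
    # Use whichever signal is further above (or closer to) the target.
    signal = max(cpu_percent, concurrency_percent)
    # Integer ceiling of current_instances * signal / target_percent.
    raw = (current_instances * signal + target_percent - 1) // target_percent
    # The minimum and maximum instances settings bound the result.
    return max(min_instances, min(max_instances, raw))


# 10 instances averaging 90% CPU would need about 15 to return to 60%.
print(recommend_instances(10, cpu_percent=90, concurrency_percent=30))
# The maximum instances setting caps the result.
print(recommend_instances(10, cpu_percent=90, concurrency_percent=30, max_instances=12))
```

The sketch ignores the case where thresholds differ at lower instance counts, and it treats the one-minute averages as already computed inputs.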

Note: The 60% utilization threshold is a target threshold. Under certain conditions, such as at lower instance counts, the utilization thresholds for scaling are higher.

Important: When Cloud Run scales based on CPU utilization, it considers the average CPU utilization across all CPUs allocated to an instance. If your application is single-threaded but deployed on a multi-CPU instance, this can lead to a low average utilization reading, potentially impacting how CPU-based scaling decisions are made. For more details on optimizing CPU configuration for your application's architecture, see Configure CPU limits and Concurrency settings.
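To see why a single-threaded application can read as lightly loaded, consider the arithmetic of averaging across CPUs. This is a small illustration with made-up per-CPU figures; the helper `average_cpu_utilization` is hypothetical and not part of any Cloud Run API.

```python
def average_cpu_utilization(per_cpu_percent):
    """Return the mean utilization across all CPUs allocated to an instance."""
    return sum(per_cpu_percent) / len(per_cpu_percent)


# A single-threaded process saturating one of four CPUs.
four_cpu_instance = [100, 0, 0, 0]
print(average_cpu_utilization(four_cpu_instance))  # 25.0

# The same process on a one-CPU instance.
one_cpu_instance = [100]
print(average_cpu_utilization(one_cpu_instance))  # 100.0
```

In the four-CPU case the instance is fully busy from the application's point of view, yet the averaged reading sits well below the 60% target.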

Instance-based billing and autoscaling

If you configure instance-based billing for your Cloud Run service, you should be aware of scaling to and from zero behavior.

Scaling from zero. Scaling from zero can only be triggered by a request, so a service that is not processing requests cannot scale from zero. For these workloads, you can either set minimum instances > 0, or include a "wake-up request" in your design to restart processing after scaling to zero.

Scaling to zero. Given that no instance is ever at 0% CPU, looking at all CPU usage would result in never scaling to zero. This means the decision to scale from one to zero can only be made by checking to see if the instance is processing a request.

About maximum instances for services

In some cases you may want to limit the total number of instances that can be started, for cost control reasons, or for better compatibility with other resources used by your service. For example, your Cloud Run service might interact with a database that can only handle a certain number of concurrent open connections.

All services are assigned a maximum instances limit by default, even if you don't specify your own limit. Set and monitor this limit to determine the scaling behavior and the costs associated with your service. For more information, see Maximum instances limits.

You can use the maximum instances setting to limit the total number of instances that can be started in parallel, as documented in Setting a maximum number of instances.

Exceeding maximum instances

Under normal circumstances, your revision scales out by creating new instances to handle incoming traffic load. But when you set a maximum instances limit, in some scenarios there will be insufficient instances to meet that traffic load. In that case, incoming requests are queued (pending) as follows:

Requests pend for up to 3.5 times the average startup time of this service's container instances, or 10 seconds, whichever is greater.

During this time window, if an instance finishes processing requests, it becomes available to process the queued pending requests. If no instances become available during the window, the request fails with a 429 error code.
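The pending window described above is a simple maximum of two values. As a minimal sketch, with a hypothetical helper name and the average startup time supplied by you rather than read from any API:

```python
def max_pending_seconds(average_startup_seconds):
    """Upper bound on how long a request can pend before failing with a 429.

    Implements the stated rule: 3.5 times the average instance startup time,
    or 10 seconds, whichever is greater.
    """
    return max(3.5 * average_startup_seconds, 10.0)


print(max_pending_seconds(2))  # 10.0, because 3.5 * 2 = 7 is below the 10 second floor
print(max_pending_seconds(4))  # 14.0
```

A service with fast startup therefore still gets the full 10 seconds of queueing, while a slow-starting service gets a proportionally longer window.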

Scaling guarantees

The maximum instances limit is an upper limit per revision and it means that the number of instances for this revision shouldn't exceed the maximum.

Under normal circumstances, Cloud Run is able to scale out to the maximum instances limit very fast to handle all incoming requests or events. However, setting a high limit does not mean that your revision will be able to scale out to the specified number of instances at any given moment. In exceptional circumstances, Cloud Run can throttle scaling to ensure good service for all customers.

Exceeding maximum instances due to traffic spikes

In some cases, such as rapid traffic surges or system maintenance, Cloud Run might, for a short period of time, create more instances than are specified in the maximum instances setting. New instances can be started in excess of the maximum instances setting to replace existing instances and to provide a grace period for inflight requests to finish processing.

The maximum instances limit can be exceeded under normal operation a few times per week. The grace period usually lasts up to 15 minutes, or up to the value specified in the request timeout setting. These extra instances are destroyed within 15 minutes after they become idle.

If many replacements are needed, the updates are usually spread out over many minutes or hours, but each replacement has an excess instance for just the grace period. Instances in excess of the maximum instances value normally number fewer than twice the configured maximum instances limit, but can be much more numerous for sudden large traffic spikes.

Load tests tend to see more instances in excess of the maximum instances setting, because the system may change where traffic spikes are served in order to preserve capacity for existing workloads that have sustained load patterns.

If your service cannot tolerate this temporary behavior, you may want to factor in a safety margin and set a lower maximum instances value.

Traffic splits

Because the maximum instances limit is a limit for each revision, if the service splits traffic across multiple revisions, the total number of instances for the service can exceed the maximum instances per revision. This can be observed in the Instance Count metrics.
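Because the limit applies per revision, a rough ceiling for the whole service is the sum of the limits of the revisions that are running instances. The following sketch uses hypothetical revision names and limits, and ignores the temporary excess instances described earlier.

```python
def service_instance_ceiling(max_instances_by_revision):
    """Sum the per-revision maximum instances limits for revisions in use."""
    return sum(max_instances_by_revision.values())


# Hypothetical service splitting traffic across two revisions,
# each configured with a maximum of 20 instances.
revisions = {"my-service-00001": 20, "my-service-00002": 20}
print(service_instance_ceiling(revisions))  # 40
```

If a backing resource can only tolerate 20 instances in total, the per-revision limit needs to be lower than 20 whenever more than one revision is active.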

Deployments

When you deploy a new revision to serve 100% of the traffic, Cloud Run starts enough instances of the new revision before directing traffic to it. This reduces the impact of new revision deployments on request latencies, notably when serving high levels of traffic.

Because the maximum instances limit is a limit for each revision, during a deployment, the total number of instances for the service can exceed the maximum instances per revision. This can be observed in the Instance Count metrics.

Idle instances and minimizing cold starts

Cloud Run does not immediately shut down instances once they have handled all requests. To minimize the impact of cold starts, Cloud Run may keep some instances idle for a maximum of 15 minutes. Cloud Run resources that have GPUs enabled may keep some instances idle for a maximum of 10 minutes. These instances are ready to handle requests in case of a sudden traffic spike.

For example, when an instance has finished handling requests, it may remain idle for a period of time in case another request needs to be handled. An idle instance may keep resources alive, such as open database connections. Note that the default billing setting is request-based billing unless you explicitly configure your service to have instance-based billing.

To keep idle instances permanently available, use the min-instance setting. Note that using this feature will incur cost even when the service is not actively serving requests.

Autoscaling and pending requests

Requests pend for up to 3.5 times the average startup time of this service's container instances, or 10 seconds, whichever is greater.

Autoscaling impact on backing services

As the number of instances automatically increases, your Cloud Run service might encounter limits with its backing services. For example, Cloud SQL has an API quota limit. Make sure these backing services have enough quota and can handle connections from all instances of your Cloud Run service. Consider setting a maximum number of instances to avoid overloading backing services.
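One way to choose a maximum instances value is to work backwards from the backing service's connection limit. The sketch below is an assumption-laden illustration: the function name, the per-instance pool size, and the reserved-connection count are all hypothetical inputs that you would replace with figures for your own database and application.

```python
def max_instances_for_connection_limit(connection_limit, connections_per_instance,
                                       reserved_connections=0):
    """Largest instance count whose combined connections fit under the limit.

    connection_limit: connections the backing service accepts in total.
    connections_per_instance: pool size each instance may open.
    reserved_connections: headroom kept for admin tools or other clients.
    """
    if connections_per_instance <= 0:
        raise ValueError("connections_per_instance must be positive")
    usable = connection_limit - reserved_connections
    return max(usable // connections_per_instance, 0)


# A database accepting 100 connections, 10 kept in reserve,
# and each instance opening a pool of 5.
print(max_instances_for_connection_limit(100, 5, reserved_connections=10))  # 18
```

Because the instance count can briefly exceed the configured maximum, you may want to set the limit somewhat below the computed value.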

Autoscaling and Pub/Sub

Google recommends using push subscriptions to consume messages from a Pub/Sub topic on Cloud Run. Pushed messages are received like HTTP requests by the container, thus triggering the same autoscaling behavior.

Autoscaling and multiple containers (sidecars)

Cloud Run considers the CPU utilization of instances for autoscaling, where the CPU utilization of an instance is the percentage of allocated CPU in use.

Note that you allocate CPU when you set CPU limits at the container level. If you use multiple containers per instance, the actual CPU allocation for that instance is the sum of the CPU limits you set on each container.
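The relationship between per-container limits and instance-level utilization can be worked through as follows. This is a sketch with hypothetical container names and figures; it assumes utilization is total CPU in use divided by the summed limits, which matches the description above but is not taken from any published formula.

```python
def instance_cpu_utilization(cpu_limits, cpu_in_use):
    """Percentage of an instance's allocated CPU that is in use.

    cpu_limits: CPU limit per container, in CPUs.
    cpu_in_use: CPU currently consumed per container, in CPUs.
    """
    allocated = sum(cpu_limits.values())  # instance allocation = sum of limits
    used = sum(cpu_in_use.values())
    return 100.0 * used / allocated


# Hypothetical instance with an ingress container and one sidecar.
limits = {"app": 2.0, "proxy": 1.0}
usage = {"app": 1.0, "proxy": 0.5}
print(instance_cpu_utilization(limits, usage))  # 50.0
```

Adding a sidecar with a generous CPU limit raises the allocation, so the same application load produces a lower utilization reading.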

What's next

To learn about other scaling options, see manual scaling.
To manage the maximum number of instances of your Cloud Run services, see Setting a maximum number of instances.
To manage the maximum number of simultaneous requests handled by each instance, see Setting concurrency.
To optimize your concurrency setting, see development tips for tuning concurrency.
To specify an idle instance to keep running to minimize latency or cold starts on first requests, see Using min-instance to enable idle instances.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-19 UTC.

