Use Provisioned Throughput

This page explains how Provisioned Throughput works, how to control overages or bypass Provisioned Throughput, and how to monitor usage.

How Provisioned Throughput works

This section explains how Provisioned Throughput works through quota checking over the quota enforcement period.

Provisioned Throughput quota checking

Your Provisioned Throughput maximum quota is the product of the number of generative AI scale units (GSUs) purchased and the throughput per GSU. It's checked each time you make a request within your quota enforcement period, which determines how frequently the maximum Provisioned Throughput quota is enforced.

At the time a request is received, the true response size is unknown. Because speed of response is prioritized for real-time applications, Provisioned Throughput estimates the output token size and compares the initial estimate to your available Provisioned Throughput maximum quota. If the estimate exceeds the available quota, the request is processed as pay-as-you-go. Otherwise, it's processed as Provisioned Throughput.

When the response is generated and the true output token size is known, actual usage and quota are reconciled by adding the difference between the estimate and the actual usage back to your available Provisioned Throughput quota amount.
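The following minimal sketch illustrates this estimate-then-reconcile flow. It isn't the service's implementation; the quota figure and function names are hypothetical and only mirror the behavior described above.

# Illustrative sketch only: not the Vertex AI implementation.
# The quota value is hypothetical (1 GSU of gemini-2.5-flash, using the
# per-GSU figure from the examples later on this page).
available_quota = 2_690  # tokens


def admit(estimated_output_tokens: int) -> str:
    """Compare the initial estimate to the available quota at admission time."""
    global available_quota
    if estimated_output_tokens > available_quota:
        return "pay-as-you-go"  # processed as spillover
    available_quota -= estimated_output_tokens
    return "provisioned-throughput"


def reconcile(estimated_output_tokens: int, actual_output_tokens: int) -> None:
    """Once the true output size is known, return the over-estimate to quota."""
    global available_quota
    available_quota += estimated_output_tokens - actual_output_tokens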

Provisioned Throughput quota enforcement windows

Vertex AI applies a dynamic window while enforcing Provisioned Throughput quota for Gemini models. This provides optimal stability for traffic prone to spikes. Instead of a fixed window, Vertex AI enforces the quota over a flexible window that automatically adjusts depending on the model type and the number of GSUs that you've provisioned. As a result, you might temporarily see prioritized traffic that exceeds your quota amount on a per-second basis. However, your total usage can't exceed your quota over the window duration. These periods are based on the Vertex AI internal clock time and are independent of when requests are made.

How the quota enforcement window works

The enforcement window determines how much you can exceed, or "burst" above, your per-second limit before you're throttled. This window is applied automatically. Note that these windows are subject to change to optimize for performance and reliability. A short sketch of the window arithmetic follows the examples below.

  • Small GSU allocations (3 GSUs or less): The window can range from 40 to 120 seconds to allow for larger individual requests to process without interruption.

    For example, if you buy 1 GSU of gemini-2.5-flash, you get an average of 2,690 tokens per second of continuous throughput. Your total usage over any 120-second window can't exceed 322,800 tokens (2,690 tokens per second * 120 seconds). Therefore, if you send a request that uses 70,000 tokens per second, but the total usage over 120 seconds remains below 322,800 tokens, the 70,000-token-per-second burst still counts as Provisioned Throughput, because the average usage doesn't exceed 2,690 tokens per second.

  • Standard (medium-sized) GSU allocations (more than 3 GSUs): For medium-sized GSU deployments (for example, fewer than 50 GSUs), the window can range from 5 seconds to 30 seconds. The GSU thresholds and enforcement windows vary based on the model.

    For example, if you buy 25 GSUs of gemini-2.5-flash, you get an average of 67,250 tokens per second (2,690 tokens per second * 25) of continuous throughput. Your total usage over any 30-second window can't exceed 2,017,500 tokens (67,250 tokens per second * 30 seconds). Therefore, if you send a request that uses 1,000,000 tokens per second, but the total usage over 30 seconds remains within 2,017,500 tokens, the 1,000,000-token-per-second burst still counts as Provisioned Throughput, because the average usage doesn't exceed 67,250 tokens per second.

  • High-precision (large-scale) GSU allocations: For large-scale GSU deployments (for example, 50 GSUs or more), the window can range from 1 to 5 seconds to ensure that high-frequency requests are processed with maximum accuracy across the infrastructure.

    For example, if you buy 250 GSUs of gemini-2.5-flash, you get an average of 672,500 tokens per second (2,690 tokens per second * 250) of continuous throughput. Your total usage over any 5-second window can't exceed 3,362,500 tokens (672,500 tokens per second * 5 seconds). Therefore, if you send a request that uses 5,000,000 tokens per second, it won't be processed as Provisioned Throughput, because the total usage of 5,000,000 tokens exceeds the 3,362,500-token limit over a 5-second window. On the other hand, a request that uses 1,000,000 tokens per second can be processed as Provisioned Throughput if the average usage over the 5-second window doesn't exceed 672,500 tokens per second.
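Putting the three examples together, the window cap is always the per-GSU throughput, times the number of GSUs, times the window length. Here's a minimal Python sketch under the assumptions above: 2,690 tokens per second per GSU for gemini-2.5-flash, and the upper end of each example window range, both of which are subject to change.

# Illustrative only: window lengths mirror the example ranges above and
# are subject to change; the per-GSU figure is the gemini-2.5-flash
# value used in the examples.
PER_GSU_TOKENS_PER_SEC = 2_690


def window_seconds(gsus: int) -> int:
    """Pick the upper end of the enforcement-window range for a GSU count."""
    if gsus <= 3:
        return 120  # small allocations: 40-120 s
    if gsus < 50:
        return 30   # standard allocations: 5-30 s
    return 5        # large-scale allocations: 1-5 s


def window_token_cap(gsus: int) -> int:
    """Maximum tokens allowed over one enforcement window."""
    return PER_GSU_TOKENS_PER_SEC * gsus * window_seconds(gsus)


print(window_token_cap(1))    # 322,800   (matches the 1-GSU example)
print(window_token_cap(25))   # 2,017,500 (matches the 25-GSU example)
print(window_token_cap(250))  # 3,362,500 (matches the 250-GSU example)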

Control overages or bypass Provisioned Throughput

Use the API to control overages when you exceed your purchased throughput or to bypass Provisioned Throughput on a per-request basis.

Review each option to determine which approach meets your use case.

Default behavior

If a request exceeds the remaining Provisioned Throughput quota, the entire request is processed as an on-demand request by default and is billed at the pay-as-you-go rate. When this occurs, the traffic appears as spillover on the monitoring dashboards. For more information about monitoring Provisioned Throughput usage, see Monitor Provisioned Throughput.

After your Provisioned Throughput order is active, the default behavior takes place automatically. You don't have to change your code to begin consuming your order, as long as you send requests in the region where you provisioned it.

Use only Provisioned Throughput

If you are managing costs by avoiding on-demand charges, use only Provisioned Throughput. Requests that exceed the Provisioned Throughput order amount return a 429 error.

When sending requests to the API, set the X-Vertex-AI-LLM-Request-Type HTTP header to dedicated.
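For example, using the Gen AI SDK for Python (the same client shown in the samples later on this page), setting the header looks like this:

from google import genai
from google.genai.types import HttpOptions

# Same client pattern as the SDK samples below, with the header set to
# "dedicated": requests that exceed your order amount return a 429 error
# instead of spilling over to pay-as-you-go.
client = genai.Client(
    http_options=HttpOptions(
        api_version="v1",
        headers={"X-Vertex-AI-LLM-Request-Type": "dedicated"},
    )
)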

Use only pay-as-you-go

This is also referred to as using on-demand. Requests bypass the Provisioned Throughput order and are sent directly to pay-as-you-go. This might be useful for experiments or applications that are in development.

When sending requests to the API, set the X-Vertex-AI-LLM-Request-Type HTTP header to shared.

Example

Python

Install

pip install --upgrade google-genai

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

from google import genai
from google.genai.types import HttpOptions

client = genai.Client(
    http_options=HttpOptions(
        api_version="v1",
        headers={
            # Options:
            # - "dedicated": Use Provisioned Throughput
            # - "shared": Use pay-as-you-go
            # https://cloud.google.com/vertex-ai/generative-ai/docs/use-provisioned-throughput
            "X-Vertex-AI-LLM-Request-Type": "shared"
        },
    )
)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="How does AI work?",
)
print(response.text)
# Example response:
# Okay, let's break down how AI works. It's a broad field, so I'll focus on the ...
#
# Here's a simplified overview:
# ...

Go

Learn how to install or update the Go SDK.

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

import("context""fmt""io""net/http""google.golang.org/genai")//generateTextshowshowtogeneratetextProvisionedThroughput.funcgenerateText(wio.Writer)error{ctx:=context.Background()client,err:=genai.NewClient(ctx, &genai.ClientConfig{HTTPOptions:genai.HTTPOptions{APIVersion:"v1",Headers:http.Header{//Options://-"dedicated":UseProvisionedThroughput//-"shared":Usepay-as-you-go//https://cloud.google.com/vertex-ai/generative-ai/docs/use-provisioned-throughput"X-Vertex-AI-LLM-Request-Type":[]string{"shared"},},},})iferr!=nil{returnfmt.Errorf("failed to create genai client: %w",err)}modelName:="gemini-2.5-flash"contents:=genai.Text("How does AI work?")resp,err:=client.Models.GenerateContent(ctx,modelName,contents,nil)iferr!=nil{returnfmt.Errorf("failed to generate content: %w",err)}respText:=resp.Text()fmt.Fprintln(w,respText)//Exampleresponse://ArtificialIntelligence(AI)isn't magic, nor is it a single "thing." Instead, it'sabroadfieldofcomputersciencefocusedoncreatingmachinesthatcanperformtasksthattypicallyrequirehumanintelligence.//.....//InSummary://...returnnil}

Node.js

Install

npm install @google/genai

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

const {GoogleGenAI} = require('@google/genai');

const GOOGLE_CLOUD_PROJECT = process.env.GOOGLE_CLOUD_PROJECT;
const GOOGLE_CLOUD_LOCATION = process.env.GOOGLE_CLOUD_LOCATION || 'global';

async function generateWithProvisionedThroughput(
  projectId = GOOGLE_CLOUD_PROJECT,
  location = GOOGLE_CLOUD_LOCATION
) {
  const client = new GoogleGenAI({
    vertexai: true,
    project: projectId,
    location: location,
    httpOptions: {
      apiVersion: 'v1',
      headers: {
        // Options:
        // - "dedicated": Use Provisioned Throughput
        // - "shared": Use pay-as-you-go
        // https://cloud.google.com/vertex-ai/generative-ai/docs/use-provisioned-throughput
        'X-Vertex-AI-LLM-Request-Type': 'shared',
      },
    },
  });

  const response = await client.models.generateContent({
    model: 'gemini-2.5-flash',
    contents: 'How does AI work?',
  });

  console.log(response.text);
  // Example response:
  // Okay, let's break down how AI works. It's a broad field, so I'll focus on the ...
  // Here's a simplified overview:
  // ...

  return response.text;
}

Java

Learn how to install or update the Java SDK.

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

import com.google.genai.Client;
import com.google.genai.types.GenerateContentConfig;
import com.google.genai.types.GenerateContentResponse;
import com.google.genai.types.HttpOptions;
import java.util.Map;

public class ProvisionedThroughputWithTxt {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String modelId = "gemini-2.5-flash";
    generateContent(modelId);
  }

  // Generates content with Provisioned Throughput.
  public static String generateContent(String modelId) {
    // Client Initialization. Once created, it can be reused for multiple requests.
    try (Client client =
        Client.builder()
            .location("us-central1")
            .vertexAI(true)
            .httpOptions(
                HttpOptions.builder()
                    .apiVersion("v1")
                    .headers(
                        // Options:
                        // - "dedicated": Use Provisioned Throughput
                        // - "shared": Use pay-as-you-go
                        // https://cloud.google.com/vertex-ai/generative-ai/docs/use-provisioned-throughput
                        Map.of("X-Vertex-AI-LLM-Request-Type", "shared"))
                    .build())
            .build()) {

      GenerateContentResponse response =
          client.models.generateContent(
              modelId, "How does AI work?", GenerateContentConfig.builder().build());

      System.out.println(response.text());
      // Example response:
      // At its core, **AI (Artificial Intelligence) works by enabling machines to learn,
      // reason, and make decisions in ways that simulate human intelligence.** Instead of being
      // explicitly programmed for every single task...
      return response.text();
    }
  }
}

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

# Options for the X-Vertex-AI-LLM-Request-Type header: dedicated, shared
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Vertex-AI-LLM-Request-Type: dedicated" \
  $URL \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'

Use Provisioned Throughput with an API Key

If you've purchased Provisioned Throughput for a specific project, Google model, and region, and want to use it to send a request with an API key, then you must include the project ID, model, location, and API key as parameters in your request.

For information about how to create a Google Cloud API key bound to a service account, see Get a Google Cloud API key. To learn how to send requests to the Gemini API using an API key, see the Gemini API in Vertex AI quickstart.

For example, the following sample shows how to submit a request with an API key while using Provisioned Throughput:

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

curl \
  -X POST \
  -H "Content-Type: application/json" \
  "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/MODEL_ID:generateContent?key=YOUR_API_KEY" \
  -d $'{
    "contents": [
      {
        "role": "user",
        "parts": [
          {
            "text": "Explain how AI works in a few words"
          }
        ]
      }
    ]
  }'

Monitor Provisioned Throughput

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

You can self-monitor your Provisioned Throughput usage using a set of metrics that are measured on the aiplatform.googleapis.com/PublisherModel resource type.

Provisioned Throughput traffic monitoring is a public Preview feature.

Dimensions

You can filter on metrics using the following dimensions:

  • type: input, output

  • request_type:

    • dedicated: Traffic is processed using Provisioned Throughput.

    • spillover: Traffic is processed as pay-as-you-go quota after you exceed your Provisioned Throughput quota. Note that the spillover metric isn't supported for Provisioned Throughput for Gemini 2.0 models if explicit caching is enabled, because these models don't support explicit caching. In this case, the traffic appears as shared.

    • shared: If Provisioned Throughput is active, then traffic is processed as pay-as-you-go quota using the shared HTTP header. If Provisioned Throughput isn't active, then traffic is processed as pay-as-you-go by default.

Path prefix

The path prefix for a metric is aiplatform.googleapis.com/publisher/online_serving.

For example, the full path for the /consumed_throughput metric is aiplatform.googleapis.com/publisher/online_serving/consumed_throughput.
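As an illustration, the following sketch queries that metric with the Cloud Monitoring API, filtered to dedicated traffic. It uses the google-cloud-monitoring Python client; the project ID is a placeholder, and the request_type label comes from the Dimensions table above.

import time

from google.cloud import monitoring_v3

# Hypothetical project ID; replace with your own.
project_name = "projects/PROJECT_ID"

client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},  # the last hour
    }
)

# Filter for throughput consumed as Provisioned Throughput (request_type=dedicated).
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type = '
            '"aiplatform.googleapis.com/publisher/online_serving/consumed_throughput" '
            'AND metric.labels.request_type = "dedicated"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value)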

Metrics

The following Cloud Monitoring metrics are available on the aiplatform.googleapis.com/PublisherModel resource for the Gemini models. Use the dedicated request type to filter for Provisioned Throughput usage.

  • /dedicated_gsu_limit (Limit (GSU)): Dedicated limit in GSUs. Use this metric to understand your Provisioned Throughput maximum quota in GSUs.

  • /tokens (Tokens): Input and output token count distribution.

  • /token_count (Token count): Accumulated input and output token count.

  • /consumed_token_throughput (Token throughput): Throughput usage, which accounts for the burndown rate in tokens and incorporates quota reconciliation. See Provisioned Throughput quota checking. Use this metric to understand how your Provisioned Throughput quota was used.

  • /dedicated_token_limit (Limit (tokens per second)): Dedicated limit in tokens per second. Use this metric to understand your Provisioned Throughput maximum quota for token-based models.

  • /characters (Characters): Input and output character count distribution.

  • /character_count (Character count): Accumulated input and output character count.

  • /consumed_throughput (Character throughput): Throughput usage, which accounts for the burndown rate in characters and incorporates quota reconciliation. See Provisioned Throughput quota checking. Use this metric to understand how your Provisioned Throughput quota was used. For token-based models, this metric is equivalent to the throughput consumed in tokens multiplied by 4.

  • /dedicated_character_limit (Limit (characters per second)): Dedicated limit in characters per second. Use this metric to understand your Provisioned Throughput maximum quota for character-based models.

  • /model_invocation_count (Model invocation count): Number of model invocations (prediction requests).

  • /model_invocation_latencies (Model invocation latencies): Model invocation latencies (prediction latencies).

  • /first_token_latencies (First token latencies): Duration from request received to first token returned.

Anthropic models also have a filter for Provisioned Throughput, but only for the tokens and token_count metrics.

Dashboards

Default monitoring dashboards for Provisioned Throughput provide metrics that let you better understand your usage and Provisioned Throughput utilization. To access the dashboards, do the following:

  1. In the Google Cloud console, go to the Provisioned Throughput page.

    Go to Provisioned Throughput

  2. To view the Provisioned Throughput utilization of each model across your orders, select the Utilization summary tab.

    In the Provisioned Throughput utilization by model table, you can view the following for the selected time range:

    • Total number of GSUs you had.

    • Peak throughput usage in terms of GSUs.

    • The average GSU utilization.

    • The number of times you reached your Provisioned Throughput limit.

  3. Select a model from the Provisioned Throughput utilization by model table to see more metrics specific to the selected model.

How to interpret monitoring dashboards

Provisioned Throughput checks available quota in real time at the millisecond level for requests as they are made, but compares this data against a rolling quota enforcement period, based on the Vertex AI internal clock time. This comparison is independent of the time when the requests are made. The monitoring dashboards report usage metrics after quota reconciliation takes place. However, these metrics are aggregated to provide averages for dashboard alignment periods, based on the selected time range. The lowest granularity that the monitoring dashboards support is the minute level. Moreover, the clock time for the monitoring dashboards is different from that of Vertex AI.

These differences in timing might occasionally result in discrepancies between the data in the monitoring dashboards and real-time performance. These discrepancies can result from any of the following reasons:

  • Quota is enforced in real time, but the monitoring charts aggregate data into 1-minute or higher average dashboard alignment periods, depending on the time range specified in the monitoring dashboards.

  • Vertex AI and the monitoring dashboards run on different system clocks.

  • Over a period of one second, if a burst of traffic exceeds your Provisioned Throughput quota based on the enforcement window, the entire request is processed as spillover traffic. However, the overall Provisioned Throughput utilization might appear low when the monitoring data for that second is averaged within the 1-minute alignment period, because the average utilization across the entire alignment period might not exceed 100% (see the sketch after this list). If you see spillover traffic, it confirms that your Provisioned Throughput quota was fully utilized during the quota enforcement period when those specific requests were made, regardless of the average utilization shown on the monitoring dashboards.
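The following sketch makes that averaging effect concrete with hypothetical numbers: one second of traffic at 150% of quota spills over, yet the 1-minute average utilization stays far below 100%.

# Hypothetical numbers, for illustration only.
quota_tokens_per_sec = 67_250  # e.g., 25 GSUs of gemini-2.5-flash

# One second bursting at 150% of quota (spillover), then 59 seconds at 30%.
per_second_usage = [1.5 * quota_tokens_per_sec] + [0.3 * quota_tokens_per_sec] * 59

avg_utilization = sum(per_second_usage) / (60 * quota_tokens_per_sec)
print(f"{avg_utilization:.0%}")  # 32% -- the dashboard average hides the burst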

Example of potential discrepancy in monitoring data

This example illustrates some of the discrepancies resulting from window misalignment. Figure 1 represents throughput usage over a specific time period. In this figure:

  • The blue bars represent the traffic admitted as Provisioned Throughput.

  • The orange bar represents traffic that pushes the usage beyond the GSU limit and is processed as spillover.

Figure 1. Throughput usage over time periods

Based on the throughput usage, figure 2 represents possible visual discrepancies, owing to windowing misalignment. In this figure:

  • The blue line represents Provisioned Throughput traffic.

  • The orange line represents spillover traffic.

Figure 2. Possible visual discrepancies in monitoring dashboards

In this case, the monitoring data might show Provisioned Throughput usage with no spillover in one monitoring aggregation timeframe, while also showing Provisioned Throughput usage below the GSU limit coinciding with spillover in another monitoring aggregation timeframe.

Troubleshoot monitoring dashboards

You can troubleshoot unexpected spillover in your dashboards or 429 errors by performing the following steps:

  1. Zoom in: Set your dashboard time range to 12 hours or less to get the most granular alignment period of 1 minute. Large time ranges smooth out spikes that cause throttling and increase the alignment period averages.

  2. Check total traffic: Your model-specific dashboards show dedicated and spillover traffic as two separate lines, which might lead to the incorrect conclusion that Provisioned Throughput quota isn't fully utilized and is spilling over prematurely. If your traffic exceeds available quota, the entire request is processed as spillover. For another helpful visualization, add a query to the dashboard using the Metrics Explorer and include token throughput for the specific model and region. Don't include any additional aggregations or filters, so that you view the total traffic across all traffic types (dedicated, spillover, and shared); an example filter is sketched below.
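For instance, a total-traffic query might use a filter like the one below. The filter syntax is standard Cloud Monitoring; the model_user_id and location resource-label names are assumptions to verify against the metric's schema in Metrics Explorer.

# Assumed resource-label names (model_user_id, location); verify them in
# Metrics Explorer before using this filter.
total_traffic_filter = (
    'metric.type = '
    '"aiplatform.googleapis.com/publisher/online_serving/consumed_token_throughput"'
    ' AND resource.type = "aiplatform.googleapis.com/PublisherModel"'
    ' AND resource.labels.model_user_id = "gemini-2.5-flash"'
    ' AND resource.labels.location = "us-central1"'
)
# No request_type filter: dedicated, spillover, and shared are all included.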

Monitor Genmedia models

Provisioned Throughput monitoring isn't available for Veo 3 and Imagen models.

Alerting

After you enable alerting, set default alerts to help you manage your traffic usage.

Enable alerts

To enable alerts in the dashboard, do the following:

  1. In the Google Cloud console, go to the Provisioned Throughput page.

    Go to Provisioned Throughput

  2. To view the Provisioned Throughput utilization of each model across your orders, select the Utilization summary tab.

  3. Select Recommended alerts. The following alerts are displayed:

    • Provisioned Throughput Usage Reached Limit
    • Provisioned Throughput Utilization Exceeded 80%
    • Provisioned Throughput Utilization Exceeded 90%
  4. Check the alerts that help you manage your traffic.

View more alert details

To view more information about alerts, do the following:

  1. Go to the Integrations page.

    Go to Integrations

  2. Enter vertex into the Filter field and press Enter. Google Vertex AI appears.

  3. To view more information, click View details. The Google Vertex AI details pane is displayed.

  4. Select the Alerts tab, where you can select an Alert Policy template.

