Supported models

The following tables show the models that support Provisioned Throughput, the throughput for each generative AI scale unit (GSU), and the burndown rates for each model.

Google models

Provisioned Throughput only supports models that you call directly from your project. To use Provisioned Throughput to make API calls to a model, you must use the specific model version ID (for example, gemini-2.0-flash-001) and not a model version alias.

Additionally, Provisioned Throughput doesn't support models that are called by other Vertex AI products, such as Vertex AI Agents and Vertex AI Search. For example, if you make API calls to Gemini 2.0 Flash while using Vertex AI Search, your Provisioned Throughput order for Gemini 2.0 Flash doesn't cover the calls made by Vertex AI Search.

Provisioned Throughput doesn't support batch prediction calls.

The following table shows the throughput, purchase increment, and burndown rates for Google models that support Provisioned Throughput. Your per-second throughput is defined as your prompt input and generated output across all requests per second.

To find out how many tokens your workload requires, see the SDK tokenizer or the countTokens API.
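To see how per-GSU throughput and burndown rates combine, here is a minimal sizing sketch. The workload numbers are hypothetical; the 2,690 tokens/sec per GSU and the 9x output burndown are the Gemini 2.5 Flash text values from the table below.

```python
import math

def gsus_needed(input_tps: float, output_tps: float,
                output_burndown: float, throughput_per_gsu: float) -> int:
    """Estimate GSUs for a steady workload: burndown-weighted tokens
    per second, rounded up to a whole number of GSUs."""
    burned_tps = input_tps + output_tps * output_burndown
    return math.ceil(burned_tps / throughput_per_gsu)

# Hypothetical workload on Gemini 2.5 Flash: 10,000 input text tokens/sec
# and 1,000 output text tokens/sec (1 output token = 9 burndown tokens,
# 2,690 tokens/sec per GSU).
print(gsus_needed(10_000, 1_000, 9, 2_690))  # 19,000 burned tokens/sec -> 8 GSUs
```

Note that for some models (such as Gemini 3 Pro and Gemini 2.5 Pro) the burndown rates change for prompts over 200,000 input tokens, so the same output rate can burn more throughput.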

Each model entry lists the per-second throughput per GSU, the throughput units, the minimum GSU purchase increment, and the burndown rates.

Gemini 3 Pro

Latest supported version: gemini-3-pro-preview (preview)

Per-second throughput per GSU: 500 tokens
Minimum GSU purchase increment: 1
Burndown rates:
Less than or equal to 200,000 input tokens:
1 input text token = 1 token
1 input image token = 1 token
1 input video token = 1 token
1 input audio token = 1 token
1 output response text token = 6 tokens
1 output reasoning text token = 6 tokens

Greater than 200,000 input tokens:
1 input text token = 2 tokens
1 input image token = 2 tokens
1 input video token = 2 tokens
1 input audio token = 2 tokens
1 output response text token = 9 tokens
1 output reasoning text token = 9 tokens

Gemini 3 Pro Image

Latest supported version: gemini-3-pro-image-preview (preview)

Per-second throughput per GSU: 500 tokens
Minimum GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 input image token = 1 token
1 output text token = 6 tokens
1 output thinking token = 6 tokens
1 output image token = 60 tokens

Gemini 2.5 Pro

Latest supported version: gemini-2.5-pro

Per-second throughput per GSU: 650 tokens
Minimum GSU purchase increment: 1
Burndown rates:
Less than or equal to 200,000 input tokens:
1 input text token = 1 token
1 input image token = 1 token
1 input video token = 1 token
1 input audio token = 1 token
1 output response text token = 8 tokens
1 output reasoning text token = 8 tokens

Greater than 200,000 input tokens:
1 input text token = 2 tokens
1 input image token = 2 tokens
1 input video token = 2 tokens
1 input audio token = 2 tokens
1 output response text token = 12 tokens
1 output reasoning text token = 12 tokens

Gemini 2.5 Flash Image

Latest supported version: gemini-2.5-flash-image

Per-second throughput per GSU: 2,690 tokens
Minimum GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 input image token = 1 token
1 output text token = 9 tokens
1 output image token = 100 tokens

Gemini 2.5 Flash

Latest supported version (GA): gemini-2.5-flash

Latest supported version (preview): gemini-2.5-flash-preview-09-2025

Per-second throughput per GSU: 2,690 tokens
Minimum GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 input image token = 1 token
1 input video token = 1 token
1 input audio token = 4 tokens
1 output response text token = 9 tokens
1 output reasoning text token = 9 tokens

Gemini 2.5 Flash-Lite

Latest supported version (GA): gemini-2.5-flash-lite

Latest supported version (preview): gemini-2.5-flash-lite-preview-09-2025

Per-second throughput per GSU: 8,070 tokens
Minimum GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 input image token = 1 token
1 input video token = 1 token
1 input audio token = 3 tokens
1 output response text token = 4 tokens
1 output reasoning text token = 4 tokens

Gemini 2.5 Flash with Gemini Live API native audio

Latest supported version: gemini-live-2.5-flash-native-audio

Per-second throughput per GSU: 1,620 tokens
Minimum GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 input audio token = 6 tokens
1 input video token = 6 tokens
1 input image token = 6 tokens
1 input session memory token = 1 token
1 output text token = 4 tokens
1 output audio token = 24 tokens

Gemini 2.0 Flash

Latest supported version: gemini-2.0-flash-001

Per-second throughput per GSU: 3,360 tokens
Minimum GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 input image token = 1 token
1 input video token = 1 token
1 input audio token = 7 tokens
1 output text token = 4 tokens

Gemini 2.0 Flash-Lite

Latest supported version: gemini-2.0-flash-lite-001

Per-second throughput per GSU: 6,720 tokens
Minimum GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 input image token = 1 token
1 input video token = 1 token
1 input audio token = 1 token
1 output text token = 4 tokens

Veo 3.1 preview

Latest supported version: veo-3.1-generate-001 (preview)

Per-second throughput per GSU: 0.0040 video seconds
Minimum GSU purchase increment: 1
Burndown rates:
1 output video second = 1 output video second
1 output video+audio second = 2 output video seconds

Veo 3.1 Fast preview

Latest supported version: veo-3.1-fast-generate-001 (preview)

Per-second throughput per GSU: 0.0080 video seconds
Minimum GSU purchase increment: 1
Burndown rates:
1 output video second = 1 output video second
1 output video+audio second = 1.45 output video seconds

Veo 3

Latest supported version: veo-3.0-generate-001

Per-second throughput per GSU: 0.0040 video seconds
Minimum GSU purchase increment: 1
Burndown rates:
1 output video second = 1 output video second
1 output video+audio second = 2 output video seconds

Veo 3 Fast

Latest supported version: veo-3.0-fast-generate-001

Per-second throughput per GSU: 0.0080 video seconds
Minimum GSU purchase increment: 1
Burndown rates:
1 output video second = 1 output video second
1 output video+audio second = 1.45 output video seconds

Imagen 4 Ultra Generate

imagen-4.0-ultra-generate-001

Per-second throughput per GSU: 0.015 images
Minimum GSU purchase increment: 1
Burndown rates: Only output images count toward your Provisioned Throughput quota.

Imagen 4 Generate

imagen-4.0-generate-001

Per-second throughput per GSU: 0.02 images
Minimum GSU purchase increment: 1
Burndown rates: Only output images count toward your Provisioned Throughput quota.

Imagen 4 Fast Generate

imagen-4.0-fast-generate-001

Per-second throughput per GSU: 0.04 images
Minimum GSU purchase increment: 1
Burndown rates: Only output images count toward your Provisioned Throughput quota.

Imagen 3 Generate 002

imagen-3.0-generate-002

Per-second throughput per GSU: 0.02 images
Minimum GSU purchase increment: 1
Burndown rates: Only output images count toward your Provisioned Throughput quota.

Imagen 3 Generate 001

imagen-3.0-generate-001

Per-second throughput per GSU: 0.025 images
Minimum GSU purchase increment: 1
Burndown rates: Only output images count toward your Provisioned Throughput quota.

Imagen 3 Fast

Per-second throughput per GSU: 0.05 images
Minimum GSU purchase increment: 1
Burndown rates: Only output images count toward your Provisioned Throughput quota.
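Video models follow the same sizing logic with video seconds in place of tokens. Here is a minimal sketch for Veo 3, using the 0.0040 output video seconds per second per GSU and the 2x video+audio burndown from the table above; the target rates are hypothetical, and working in milliseconds keeps the arithmetic in integers.

```python
import math

MS_PER_GSU = 4  # 0.0040 output video seconds per second per GSU = 4 ms/sec

def veo3_gsus(video_ms_per_sec: int, with_audio: bool) -> int:
    """GSUs needed to sustain a given rate of Veo 3 output."""
    burndown = 2 if with_audio else 1  # video+audio burns 2 video seconds each
    return math.ceil(video_ms_per_sec * burndown / MS_PER_GSU)

# Hypothetical target: 0.5 seconds of generated output per wall-clock second.
print(veo3_gsus(500, True))   # video+audio: 1,000 burned ms/sec -> 250 GSUs
print(veo3_gsus(500, False))  # video only: 500 burned ms/sec -> 125 GSUs
```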

For information about a model's capabilities and input or output limits, see the documentation for the model.

Request access: The model gemini-live-2.5-flash is in private GA. For information about access to this release, see the access request page.

You can upgrade to new models as they are made available. For information about model availability and discontinuation dates, see Google models.

For more information about supported locations, see Available locations.

Partner models

The following table shows the throughput, purchase increment, and burndown rates for partner models that support Provisioned Throughput. Claude models are measured in tokens per second, defined as the total of input and output tokens across all requests per second.

Each model entry lists the throughput per GSU (tokens/sec), the minimum GSU purchase, the GSU purchase increment, and the burndown rates.
Anthropic's Claude Opus 4.5

Throughput per GSU: 210 tokens/sec
Minimum GSU purchase: 35
GSU purchase increment: 1
Burndown rates:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache write 1h token = 2 tokens
1 cache hit token = 0.1 token
Anthropic's Claude Sonnet 4.5

Throughput per GSU: 350 tokens/sec
Minimum GSU purchase: 25
GSU purchase increment: 1
Burndown rates:
Less than 200,000 input tokens:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache write 1h token = 2 tokens
1 cache hit token = 0.1 token

Greater than or equal to 200,000 input tokens:
1 input token = 2 tokens
1 output token = 7.5 tokens
1 cache write 5m token = 2.5 tokens
1 cache write 1h token = 4 tokens
1 cache hit token = 0.2 token
Anthropic's Claude Opus 4.1

Throughput per GSU: 70 tokens/sec
Minimum GSU purchase: 35
GSU purchase increment: 1
Burndown rates:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache write 1h token = 2 tokens
1 cache hit token = 0.1 token
Anthropic's Claude Haiku 4.5

Throughput per GSU: 1,050 tokens/sec
Minimum GSU purchase: 8
GSU purchase increment: 1
Burndown rates:
Less than 200,000 input tokens:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache write 1h token = 2 tokens
1 cache hit token = 0.1 token
Anthropic's Claude Opus 4

Throughput per GSU: 70 tokens/sec
Minimum GSU purchase: 35
GSU purchase increment: 1
Burndown rates:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache write 1h token = 2 tokens
1 cache hit token = 0.1 token
Anthropic's Claude Sonnet 4

Throughput per GSU: 350 tokens/sec
Minimum GSU purchase: 25
GSU purchase increment: 1
Burndown rates:
Less than 200,000 input tokens:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache write 1h token = 2 tokens
1 cache hit token = 0.1 token

Greater than or equal to 200,000 input tokens:
1 input token = 2 tokens
1 output token = 7.5 tokens
1 cache write 5m token = 2.5 tokens
1 cache write 1h token = 4 tokens
1 cache hit token = 0.2 token
Anthropic's Claude 3.7 Sonnet (deprecated)

Throughput per GSU: 350 tokens/sec
Minimum GSU purchase: 25
GSU purchase increment: 1
Burndown rates:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache hit token = 0.1 token
Anthropic's Claude 3.5 Sonnet v2 (deprecated)

Throughput per GSU: 350 tokens/sec
Minimum GSU purchase: 25
GSU purchase increment: 1
Burndown rates:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache hit token = 0.1 token
Anthropic's Claude 3.5 Haiku

Throughput per GSU: 2,000 tokens/sec
Minimum GSU purchase: 10
GSU purchase increment: 1
Burndown rates:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache write 1h token = 2 tokens
1 cache hit token = 0.1 token
Anthropic's Claude 3 Opus

Throughput per GSU: 70 tokens/sec
Minimum GSU purchase: 35
GSU purchase increment: 1
Burndown rates:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache hit token = 0.1 token
Anthropic's Claude 3 Haiku

Throughput per GSU: 4,200 tokens/sec
Minimum GSU purchase: 5
GSU purchase increment: 1
Burndown rates:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache write 1h token = 2 tokens
1 cache hit token = 0.1 token
Anthropic's Claude 3.5 Sonnet (deprecated)

Throughput per GSU: 350 tokens/sec
Minimum GSU purchase: 25
GSU purchase increment: 1
Burndown rates:
1 input token = 1 token
1 output token = 5 tokens
1 cache write 5m token = 1.25 tokens
1 cache hit token = 0.1 token

For information about supported locations, see Anthropic Claude region availability. To order Provisioned Throughput for Anthropic models, contact your Google Cloud account representative.
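The same burndown arithmetic applies to Claude models, with separate rates for cache writes and cache hits. Here is a minimal sketch for Claude Sonnet 4.5 below the 200,000-input-token threshold; the rates, throughput, and 25-GSU minimum purchase come from the table above, while the workload mix is hypothetical.

```python
import math

# Claude Sonnet 4.5 burndown rates, under 200,000 input tokens.
RATES = {"input": 1.0, "output": 5.0, "cache_write_5m": 1.25,
         "cache_write_1h": 2.0, "cache_hit": 0.1}
THROUGHPUT_PER_GSU = 350  # tokens/sec
MIN_GSU_PURCHASE = 25

def sonnet_gsus(tokens_per_sec: dict) -> int:
    """GSUs for a per-second token mix, honoring the minimum purchase."""
    burned = sum(RATES[kind] * tps for kind, tps in tokens_per_sec.items())
    return max(MIN_GSU_PURCHASE, math.ceil(burned / THROUGHPUT_PER_GSU))

# Hypothetical mix: 5,000 input, 500 output, 10,000 cache-hit tokens/sec.
print(sonnet_gsus({"input": 5_000, "output": 500, "cache_hit": 10_000}))
# 8,500 burned tokens/sec -> about 24.3 GSUs, raised to the 25-GSU minimum
```

The cheap 0.1x cache-hit rate is why heavily cached workloads need far fewer GSUs than their raw token volume suggests.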

Open models

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

The following table shows the throughput, purchase increment, and burndown rates for open models that support Provisioned Throughput.

Each model entry lists the throughput per GSU (tokens/sec), the minimum GSU purchase, the GSU purchase increment, and the burndown rates.

DeepSeek-OCR

Latest supported version: deepseek-ocr-maas

Throughput per GSU: 3,360 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 input image token = 1 token
1 output text token = 4 tokens

DeepSeek-V3.2

Latest supported version: deepseek-v3.2-maas

Throughput per GSU: 1,680 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 output text token = 4 tokens

Kimi K2 Thinking

Latest supported version: kimi-k2-thinking-maas

Throughput per GSU: 1,680 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 output text token = 4 tokens

Llama 3.3 70B

Latest supported version: llama-3.3-70b-instruct-maas

Throughput per GSU: 1,400 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 output text token = 1 token

Llama 4 Maverick 17B-128E

Latest supported version: llama-4-maverick-17b-128e-instruct-maas

Throughput per GSU: 2,800 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 input image token = 1 token
1 output text token = 4 tokens

Llama 4 Scout 17B-16E

Latest supported version: llama-4-scout-17b-16e-instruct-maas

Throughput per GSU: 4,035 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 input image token = 1 token
1 output text token = 3 tokens

MiniMax M2

Latest supported version: minimax-m2-maas

Throughput per GSU: 3,360 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 output text token = 4 tokens

OpenAI gpt-oss 120B

Latest supported version: gpt-oss-120b-maas

Throughput per GSU: 11,205 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 output text token = 4 tokens

OpenAI gpt-oss 20B

Latest supported version: gpt-oss-20b-maas

Throughput per GSU: 14,405 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 output text token = 4 tokens

Qwen3 235B

Latest supported version: qwen3-235b-a22b-instruct-2507-maas

Throughput per GSU: 4,035 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 output text token = 4 tokens

Qwen3 Coder

Latest supported version: qwen3-coder-480b-a35b-instruct-maas

Throughput per GSU: 1,010 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 output text token = 4 tokens

Qwen3-Next-80B Instruct

Latest supported version: qwen3-next-80b-a3b-instruct-maas

Throughput per GSU: 6,725 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 output text token = 8 tokens

Qwen3-Next-80B Thinking

Latest supported version: qwen3-next-80b-a3b-thinking-maas

Throughput per GSU: 6,725 tokens/sec
Minimum GSU purchase: 1
GSU purchase increment: 1
Burndown rates:
1 input text token = 1 token
1 output text token = 8 tokens

Available capabilities for Google and open models

The following table lists the capabilities that are available with Provisioned Throughput for Google models and open models:

Capability: Google models / Open models (preview)

Order through Google Cloud console: Yes / Yes
Supports global endpoints: See Global endpoint model support (applies to both).
Supports supervised fine-tuned models: Yes / No
Supports API key usage: Yes / No
Integrated with implicit context caching: Yes / Not applicable
Integrated with explicit context caching: Yes / Not applicable
ML processing: Available in specific regions (for details, see Single Zone Provisioned Throughput) / Not applicable
Available order terms: 1 week, 1 month, 3 months, and 1 year / 1 month, 3 months, and 1 year
Change order from the console: Yes / No
Order statuses (pending review, approved, active, expired): Yes / Yes
Overages spill over to pay-as-you-go by default: Yes / Yes
API header control ("dedicated" to use only Provisioned Throughput, "shared" to use only pay-as-you-go): Yes / Yes
Monitoring (metrics, dashboards, and alerting): Yes / Yes

Global endpoint model support

Provisioned Throughput supports the global endpoint for Google models and open models.

Traffic that exceeds the Provisioned Throughput quota uses the global endpoint by default.

To assign Provisioned Throughput to the global endpoint of a model, select global as the region when you place a Provisioned Throughput order.

Google models with global endpoint support

The following table lists the Google models for which Provisioned Throughput supports the global endpoint:

Gemini 3 Pro (preview): gemini-3-pro-preview
Gemini 3 Pro Image (preview): gemini-3-pro-image-preview
Gemini 2.5 Pro: gemini-2.5-pro
Gemini 2.5 Flash Image: gemini-2.5-flash-image
Gemini 2.5 Flash:
  • gemini-2.5-flash-preview-09-2025 (preview)
  • gemini-2.5-flash (GA)
Gemini 2.5 Flash-Lite:
  • gemini-2.5-flash-lite-preview-09-2025 (preview)
  • gemini-2.5-flash-lite (GA)
Gemini 2.0 Flash: gemini-2.0-flash-001
Gemini 2.0 Flash-Lite: gemini-2.0-flash-lite-001

Open models with global endpoint support

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

The following table lists the open models for which Provisioned Throughput supports the global endpoint:

DeepSeek-OCR: deepseek-ocr-maas
DeepSeek-V3.2: deepseek-v3.2-maas
Kimi K2 Thinking: kimi-k2-thinking-maas
MiniMax M2: minimax-m2-maas
OpenAI gpt-oss 120B: gpt-oss-120b-maas
Qwen3-Next-80B Instruct: qwen3-next-80b-a3b-instruct-maas
Qwen3-Next-80B Thinking: qwen3-next-80b-a3b-thinking-maas

Supervised fine-tuned model support

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

The following is supported for Google models that support supervised fine-tuning:

  • Provisioned Throughput can be applied to both base models and supervised fine-tuned versions of those base models.

  • Supervised fine-tuned model endpoints and their corresponding base model count toward the same Provisioned Throughput quota.

    For example, Provisioned Throughput purchased for gemini-2.0-flash-lite-001 in a specific project prioritizes requests made from supervised fine-tuned versions of gemini-2.0-flash-lite-001 created within that project. Use the appropriate header to control traffic behavior.



Last updated 2025-12-15 UTC.