Supported models
Google models
Provisioned Throughput only supports models that you call directly from your project using a specific model version ID (for example, gemini-2.0-flash-001), not a model version alias.
Moreover, Provisioned Throughput doesn't support models that are called by other Vertex AI products, such as Vertex AI Agents and Vertex AI Search. For example, if you make API calls to Gemini 2.0 Flash while using Vertex AI Search, your Provisioned Throughput order for Gemini 2.0 Flash won't guarantee throughput for the calls made by Vertex AI Search.
Provisioned Throughput doesn't support batch prediction calls.
The following table shows the throughput, purchase increment, and burndown rates for Google models that support Provisioned Throughput. Your per-second throughput is defined as your prompt input and generated output across all requests per second.
To find out how many tokens your workload requires, refer to the SDK tokenizer or the countTokens API.
| Model | Per-second throughput per GSU | Units | Minimum GSU purchase increment | Burndown rates |
|---|---|---|---|---|
| Gemini 3 Pro (preview)<br>Latest supported version: gemini-3-pro-preview | 500 | Tokens | 1 | Less than or equal to 200,000 input tokens:<br>1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 1 token<br>1 output response text token = 6 tokens<br>1 output reasoning text token = 6 tokens<br>Greater than 200,000 input tokens:<br>1 input text token = 2 tokens<br>1 input image token = 2 tokens<br>1 input video token = 2 tokens<br>1 input audio token = 2 tokens<br>1 output response text token = 9 tokens<br>1 output reasoning text token = 9 tokens |
| Gemini 3 Pro Image (preview)<br>Latest supported version: gemini-3-pro-image-preview | 500 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 6 tokens<br>1 output thinking token = 6 tokens<br>1 output image token = 60 tokens |
| Gemini 2.5 Pro<br>Latest supported version: gemini-2.5-pro | 650 | Tokens | 1 | Less than or equal to 200,000 input tokens:<br>1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 1 token<br>1 output response text token = 8 tokens<br>1 output reasoning text token = 8 tokens<br>Greater than 200,000 input tokens:<br>1 input text token = 2 tokens<br>1 input image token = 2 tokens<br>1 input video token = 2 tokens<br>1 input audio token = 2 tokens<br>1 output response text token = 12 tokens<br>1 output reasoning text token = 12 tokens |
| Gemini 2.5 Flash Image<br>Latest supported version: gemini-2.5-flash-image | 2,690 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 9 tokens<br>1 output image token = 100 tokens |
| Gemini 2.5 Flash<br>Latest supported version (GA):<br>Latest supported version (preview): | 2,690 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 4 tokens<br>1 output response text token = 9 tokens<br>1 output reasoning text token = 9 tokens |
| Gemini 2.5 Flash-Lite<br>Latest supported version (GA):<br>Latest supported version (preview): | 8,070 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 3 tokens<br>1 output response text token = 4 tokens<br>1 output reasoning text token = 4 tokens |
| Gemini 2.5 Flash with Gemini Live API native audio<br>Latest supported version: | 1,620 | Tokens | 1 | 1 input text token = 1 token<br>1 input audio token = 6 tokens<br>1 input video token = 6 tokens<br>1 input image token = 6 tokens<br>1 input session memory token = 1 token<br>1 output text token = 4 tokens<br>1 output audio token = 24 tokens |
| Gemini 2.0 Flash<br>Latest supported version: gemini-2.0-flash-001 | 3,360 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 7 tokens<br>1 output text token = 4 tokens |
| Gemini 2.0 Flash-Lite<br>Latest supported version: gemini-2.0-flash-lite-001 | 6,720 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 0.0040 | Video seconds<br>Video+audio seconds | 1 | 1 output video second = 1 output video second<br>1 output video+audio second = 2 output video seconds |
| Latest supported version: | 0.0080 | Video seconds<br>Video+audio seconds | 1 | 1 output video second = 1 output video second<br>1 output video+audio second = 1.45 output video seconds |
| Latest supported version: | 0.0040 | Video seconds<br>Video+audio seconds | 1 | 1 output video second = 1 output video second<br>1 output video+audio second = 2 output video seconds |
| Latest supported version: | 0.0080 | Video seconds<br>Video+audio seconds | 1 | 1 output video second = 1 output video second<br>1 output video+audio second = 1.45 output video seconds |
| | 0.015 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.02 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.04 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.02 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.025 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| Imagen 3 Fast | 0.05 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
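As a worked example of the burndown math above, the sketch below estimates how many GSUs a workload needs: each raw token count is multiplied by its burndown rate, the results are summed, and the total is divided by the per-GSU throughput, rounding up to a whole GSU. The rates used here (1x input text, 4x output text, 3,360 tokens per second per GSU) come from the table; the per-second token counts are hypothetical.

```python
import math

def burned_tokens(counts: dict[str, int], rates: dict[str, float]) -> float:
    """Apply burndown rates: multiply each raw token count by its rate."""
    return sum(counts[kind] * rates[kind] for kind in counts)

def gsus_needed(burned_per_sec: float, throughput_per_gsu: float) -> int:
    """GSUs are purchased in whole increments, so round up."""
    return math.ceil(burned_per_sec / throughput_per_gsu)

# Hypothetical workload: 5,000 input text tokens and 800 output text
# tokens per second, with 1x input and 4x output burndown rates.
rates = {"input_text": 1, "output_text": 4}
counts = {"input_text": 5_000, "output_text": 800}

burn = burned_tokens(counts, rates)  # 5,000 * 1 + 800 * 4 = 8,200
print(gsus_needed(burn, 3_360))      # 8,200 / 3,360 rounds up to 3 GSUs
```

Note that output tokens dominate the burn even at a modest volume, which is why output-heavy workloads need noticeably more GSUs than the raw token counts suggest.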
For information about a model's capabilities and input or output limits, see the documentation for the model.
Request access: The model gemini-live-2.5-flash is in private GA. For information about access to this release, see the access request page. You can upgrade to new models as they are made available. For information about model availability and discontinuation dates, see Google models.
For more information about supported locations, see Available locations.
Partner models
The following table shows the throughput, purchase increment, and burndown rates for partner models that support Provisioned Throughput. Claude models are measured in tokens per second, which is defined as the total of input and output tokens across all requests per second.
| Model | Throughput per GSU (tokens/sec) | Minimum GSU purchase | GSU purchase increment | Burndown rates |
|---|---|---|---|---|
| Anthropic's Claude Opus 4.5 | 210 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Sonnet 4.5 | 350 | 25 | 1 | Less than 200,000 input tokens:<br>1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens<br>Greater than or equal to 200,000 input tokens:<br>1 input token = 2 tokens<br>1 output token = 7.5 tokens<br>1 cache write 5m token = 2.5 tokens<br>1 cache write 1h token = 4 tokens<br>1 cache hit token = 0.2 tokens |
| Anthropic's Claude Opus 4.1 | 70 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Haiku 4.5 | 1,050 | 8 | 1 | Less than 200,000 input tokens:<br>1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Opus 4 | 70 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Sonnet 4 | 350 | 25 | 1 | Less than 200,000 input tokens:<br>1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens<br>Greater than or equal to 200,000 input tokens:<br>1 input token = 2 tokens<br>1 output token = 7.5 tokens<br>1 cache write 5m token = 2.5 tokens<br>1 cache write 1h token = 4 tokens<br>1 cache hit token = 0.2 tokens |
| Anthropic's Claude 3.7 Sonnet (deprecated) | 350 | 25 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3.5 Sonnet v2 (deprecated) | 350 | 25 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3.5 Haiku | 2,000 | 10 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3 Opus | 70 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3 Haiku | 4,200 | 5 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3.5 Sonnet (deprecated) | 350 | 25 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
For information about supported locations, see Anthropic Claude region availability. To order Provisioned Throughput for Anthropic models, contact your Google Cloud account representative.
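The Claude burndown rates can be applied the same way as the Google model rates. The sketch below uses the rates shared by most rows above (1x input, 5x output, 1.25x five-minute cache writes, 2x one-hour cache writes, 0.1x cache hits) with a hypothetical per-second workload, and rounds up to the 25-GSU minimum purchase that the table lists for the Sonnet models.

```python
import math

# Burndown rates common to most Claude rows in the table above.
RATES = {
    "input": 1.0,
    "output": 5.0,
    "cache_write_5m": 1.25,
    "cache_write_1h": 2.0,
    "cache_hit": 0.1,
}

def claude_burn(tokens: dict[str, int]) -> float:
    """Total burned tokens per second after applying burndown rates."""
    return sum(count * RATES[kind] for kind, count in tokens.items())

# Hypothetical workload: a large cached prefix means most input tokens
# burn at only 0.1x, so cache hits are cheap relative to output tokens.
workload = {"input": 1_000, "cache_hit": 20_000, "output": 500}
burn = claude_burn(workload)           # 1,000 + 2,000 + 2,500 = 5,500
gsus = max(25, math.ceil(burn / 350))  # Sonnet-class minimum purchase is 25 GSUs
print(gsus)
```

In this example the raw demand only needs 16 GSUs, but the 25-GSU minimum purchase for Sonnet-class models is the binding constraint.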
Open models
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
The following table shows the throughput, purchase increment, and burndown rates for open models that support Provisioned Throughput.
| Model | Throughput per GSU (tokens/sec) | Minimum GSU purchase | GSU purchase increment | Burndown rates |
|---|---|---|---|---|
| Latest supported version: | 3,360 | 1 | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 1,680 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 1,680 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 1,400 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 1 token |
| Latest supported version: | 2,800 | 1 | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 4,035 | 1 | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 3 tokens |
| Latest supported version: | 3,360 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 11,205 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 14,405 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 4,035 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 1,010 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 6,725 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 8 tokens |
| Latest supported version: | 6,725 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 8 tokens |
Available capabilities for Google and open models
The following table lists the capabilities that are available with Provisioned Throughput for Google models and open models:
| Capability | Google models | Open models (preview) |
|---|---|---|
| Order through Google Cloud console | Yes | Yes |
| Supports global endpoints | See Global endpoint model support. | See Global endpoint model support. |
| Supports supervised fine-tuned models | Yes | No |
| Supports API key usage | Yes | No |
| Integrated with implicit context caching | Yes | Not applicable |
| Integrated with explicit context caching | Yes | Not applicable |
| ML processing | Available in specific regions. For details, see Single Zone Provisioned Throughput. | Not applicable |
| Available order terms | 1 week, 1 month, 3 months, and 1 year | 1 month, 3 months, and 1 year |
| Change order from the console | Yes | No |
| Order statuses: pending review, approved, active, expired | Yes | Yes |
| Overages spill over to pay-as-you-go by default | Yes | Yes |
| API header control: use "dedicated" to use only Provisioned Throughput, or "shared" to use only pay-as-you-go | Yes | Yes |
| Monitoring: metrics, dashboards, and alerting | Yes | Yes |
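The API header control can be sketched as follows: each request either consumes only Provisioned Throughput ("dedicated") or only pay-as-you-go quota ("shared"). The header name X-Vertex-AI-LLM-Request-Type below is an assumption based on the Provisioned Throughput documentation; verify it against the docs for your API version before relying on it.

```python
# Minimal sketch: build request headers for a generateContent call with
# explicit traffic-type control. The bearer token value is a placeholder.
def request_headers(access_token: str, traffic_type: str) -> dict[str, str]:
    # "dedicated": fail rather than spill over to pay-as-you-go.
    # "shared": bypass Provisioned Throughput entirely.
    if traffic_type not in ("dedicated", "shared"):
        raise ValueError(f"unknown traffic type: {traffic_type!r}")
    return {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
        # Assumed header name; omit the header for default spillover behavior.
        "X-Vertex-AI-LLM-Request-Type": traffic_type,
    }

headers = request_headers("ya29.placeholder-token", "dedicated")
```

Omitting the header gives the default behavior from the table: Provisioned Throughput is used first, and overages spill over to pay-as-you-go.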
Global endpoint model support
Provisioned Throughput supports the global endpoint for Google models and open models.
Traffic that exceeds the Provisioned Throughput quota uses the global endpoint by default.
To assign Provisioned Throughput to the global endpoint of a model, select global as the region when you place a Provisioned Throughput order.
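To make the endpoint choice concrete, the sketch below builds a generateContent URL for a regional endpoint versus the global endpoint, following the Vertex AI REST URL pattern: regional hosts are prefixed with the region, while the global endpoint uses the bare aiplatform.googleapis.com host with global as the location. The project and model values are placeholders.

```python
# Sketch of the Vertex AI generateContent URL pattern for regional
# versus global endpoints (values here are placeholders).
def generate_content_url(project: str, model: str, location: str = "global") -> str:
    host = (
        "aiplatform.googleapis.com"            # global endpoint: no region prefix
        if location == "global"
        else f"{location}-aiplatform.googleapis.com"  # regional endpoint
    )
    return (
        f"https://{host}/v1/projects/{project}/locations/{location}"
        f"/publishers/google/models/{model}:generateContent"
    )

print(generate_content_url("my-project", "gemini-2.0-flash-001"))
print(generate_content_url("my-project", "gemini-2.0-flash-001", "us-central1"))
```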
Google models with global endpoint support
The following table lists the Google models for which Provisioned Throughput supports the global endpoint:
| Model | Latest supported model version |
|---|---|
| Gemini 3 Pro (preview) | gemini-3-pro-preview |
| Gemini 3 Pro Image (preview) | gemini-3-pro-image-preview |
| Gemini 2.5 Pro | gemini-2.5-pro |
| Gemini 2.5 Flash Image | gemini-2.5-flash-image |
| Gemini 2.5 Flash | |
| Gemini 2.5 Flash-Lite | |
| Gemini 2.0 Flash | gemini-2.0-flash-001 |
| Gemini 2.0 Flash-Lite | gemini-2.0-flash-lite-001 |
Open models with global endpoint support
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
The following table lists the open models for which Provisioned Throughput supports the global endpoint:
| Model | Latest supported model version |
|---|---|
| DeepSeek-OCR | deepseek-ocr-maas |
| DeepSeek-V3.2 | deepseek-v3.2-maas |
| Kimi K2 Thinking | kimi-k2-thinking-maas |
| MiniMax M2 | minimax-m2-maas |
| OpenAI gpt-oss 120B | gpt-oss-120b-maas |
| Qwen3-Next-80B Instruct | qwen3-next-80b-a3b-instruct-maas |
| Qwen3-Next-80B Thinking | qwen3-next-80b-a3b-thinking-maas |
Supervised fine-tuned model support
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
The following is supported for Google models that support supervised fine-tuning:

- Provisioned Throughput can be applied to both base models and supervised fine-tuned versions of those base models.
- Supervised fine-tuned model endpoints and their corresponding base model count towards the same Provisioned Throughput quota. For example, Provisioned Throughput purchased for gemini-2.0-flash-lite-001 in a specific project prioritizes requests made from supervised fine-tuned versions of gemini-2.0-flash-lite-001 created within that project. Use the appropriate header to control traffic behavior.
What's next
Last updated 2025-12-15 UTC.