Supported models
Google models
Provisioned Throughput only supports models that you call directly from your project using a specific model version ID (for example, gemini-2.0-flash-001), not a model version alias.
Moreover, Provisioned Throughput doesn't support models that are called by other Vertex AI products, such as Vertex AI Agents and Vertex AI Search. For example, if you make API calls to Gemini 2.0 Flash while using Vertex AI Search, your Provisioned Throughput order for Gemini 2.0 Flash won't guarantee throughput for the calls made by Vertex AI Search.
Provisioned Throughput doesn't support batch prediction calls.
The following table shows the throughput, purchase increment, and burndown rates for Google models that support Provisioned Throughput. Your per-second throughput is defined as your prompt input and generated output across all requests per second.
To find out how many tokens your workload requires, refer to the SDK tokenizer or the countTokens API.
| Model | Per-second throughput per GSU | Units | Minimum GSU purchase increment | Burndown rates |
|---|---|---|---|---|
| Gemini 3 Pro (preview)<br>Latest supported version: gemini-3-pro-preview | 500 | Tokens | 1 | Less than or equal to 200,000 input tokens:<br>1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 1 token<br>1 output response text token = 6 tokens<br>1 output reasoning text token = 6 tokens<br>Greater than 200,000 input tokens:<br>1 input text token = 2 tokens<br>1 input image token = 2 tokens<br>1 input video token = 2 tokens<br>1 input audio token = 2 tokens<br>1 output response text token = 9 tokens<br>1 output reasoning text token = 9 tokens |
| Gemini 3 Pro Image (preview)<br>Latest supported version: gemini-3-pro-image-preview | 500 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 6 tokens<br>1 output thinking token = 6 tokens<br>1 output image token = 60 tokens |
| Gemini 2.5 Pro<br>Latest supported version: gemini-2.5-pro | 650 | Tokens | 1 | Less than or equal to 200,000 input tokens:<br>1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 1 token<br>1 output response text token = 8 tokens<br>1 output reasoning text token = 8 tokens<br>Greater than 200,000 input tokens:<br>1 input text token = 2 tokens<br>1 input image token = 2 tokens<br>1 input video token = 2 tokens<br>1 input audio token = 2 tokens<br>1 output response text token = 12 tokens<br>1 output reasoning text token = 12 tokens |
| Gemini 2.5 Flash Image<br>Latest supported version: gemini-2.5-flash-image | 2,690 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 9 tokens<br>1 output image token = 100 tokens |
| Gemini 2.5 Flash<br>Latest supported version (GA):<br>Latest supported version (preview): | 2,690 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 4 tokens<br>1 output response text token = 9 tokens<br>1 output reasoning text token = 9 tokens |
| Gemini 2.5 Flash-Lite<br>Latest supported version (GA):<br>Latest supported version (preview): | 8,070 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 3 tokens<br>1 output response text token = 4 tokens<br>1 output reasoning text token = 4 tokens |
| Gemini 2.5 Flash with Gemini Live API native audio<br>Latest supported version: | 1,620 | Tokens | 1 | 1 input text token = 1 token<br>1 input audio token = 6 tokens<br>1 input video token = 6 tokens<br>1 input image token = 6 tokens<br>1 input session memory token = 1 token<br>1 output text token = 4 tokens<br>1 output audio token = 24 tokens |
| Gemini 2.0 Flash<br>Latest supported version: gemini-2.0-flash-001 | 3,360 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 7 tokens<br>1 output text token = 4 tokens |
| Gemini 2.0 Flash-Lite<br>Latest supported version: gemini-2.0-flash-lite-001 | 6,720 | Tokens | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 input video token = 1 token<br>1 input audio token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 0.0040 | Video seconds<br>Video+audio seconds | 1 | 1 output video second = 1 output video second<br>1 output video+audio second = 2 output video seconds |
| Latest supported version: | 0.0080 | Video seconds<br>Video+audio seconds | 1 | 1 output video second = 1 output video second<br>1 output video+audio second = 1.45 output video seconds |
| Latest supported version: | 0.0040 | Video seconds<br>Video+audio seconds | 1 | 1 output video second = 1 output video second<br>1 output video+audio second = 2 output video seconds |
| Latest supported version: | 0.0080 | Video seconds<br>Video+audio seconds | 1 | 1 output video second = 1 output video second<br>1 output video+audio second = 1.45 output video seconds |
| | 0.015 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.02 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.04 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.02 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| | 0.025 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
| Imagen 3 Fast | 0.05 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
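As a worked example of the burndown math above, the sketch below estimates how many GSUs a workload needs: each raw token count is multiplied by its burndown rate, the results are summed, and the total is divided by the per-GSU throughput, rounding up to a whole GSU. The rates used here (1x input text, 4x output text, 3,360 tokens per second per GSU) come from the table; the per-second token counts are hypothetical.

```python
import math

def burned_tokens(counts: dict[str, int], rates: dict[str, float]) -> float:
    """Apply burndown rates: multiply each raw token count by its rate."""
    return sum(counts[kind] * rates[kind] for kind in counts)

def gsus_needed(burned_per_sec: float, throughput_per_gsu: float) -> int:
    """GSUs are purchased in whole increments, so round up."""
    return math.ceil(burned_per_sec / throughput_per_gsu)

# Hypothetical workload: 5,000 input text tokens and 800 output text
# tokens per second, with 1x input and 4x output burndown rates.
rates = {"input_text": 1, "output_text": 4}
counts = {"input_text": 5_000, "output_text": 800}

burn = burned_tokens(counts, rates)  # 5,000 * 1 + 800 * 4 = 8,200
print(gsus_needed(burn, 3_360))      # 8,200 / 3,360 rounds up to 3 GSUs
```

Note that output tokens dominate the burn even at a modest volume, which is why output-heavy workloads need noticeably more GSUs than the raw token counts suggest.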
For information about a model's capabilities and input or output limits, see the documentation for the model.
Request access: The model gemini-live-2.5-flash is in private GA. For information about access to this release, see the access request page. You can upgrade to new models as they are made available. For information about model availability and discontinuation dates, see Google models.
For more information about supported locations, see Available locations.
Partner models
The following table shows the throughput, purchase increment, and burndown rates for partner models that support Provisioned Throughput. Claude models are measured in tokens per second, which is defined as the total of input and output tokens across all requests per second.
| Model | Throughput per GSU (tokens/sec) | Minimum GSU purchase | GSU purchase increment | Burndown rates |
|---|---|---|---|---|
| Anthropic's Claude Opus 4.5 | 210 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Sonnet 4.5 | 350 | 25 | 1 | Less than 200,000 input tokens:<br>1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens<br>Greater than or equal to 200,000 input tokens:<br>1 input token = 2 tokens<br>1 output token = 7.5 tokens<br>1 cache write 5m token = 2.5 tokens<br>1 cache write 1h token = 4 tokens<br>1 cache hit token = 0.2 tokens |
| Anthropic's Claude Opus 4.1 | 70 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Haiku 4.5 | 1,050 | 8 | 1 | Less than 200,000 input tokens:<br>1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Opus 4 | 70 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude Sonnet 4 | 350 | 25 | 1 | Less than 200,000 input tokens:<br>1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens<br>Greater than or equal to 200,000 input tokens:<br>1 input token = 2 tokens<br>1 output token = 7.5 tokens<br>1 cache write 5m token = 2.5 tokens<br>1 cache write 1h token = 4 tokens<br>1 cache hit token = 0.2 tokens |
| Anthropic's Claude 3.7 Sonnet (deprecated) | 350 | 25 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3.5 Sonnet v2 (deprecated) | 350 | 25 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3.5 Haiku | 2,000 | 10 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3 Opus | 70 | 35 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3 Haiku | 4,200 | 5 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache write 1h token = 2 tokens<br>1 cache hit token = 0.1 tokens |
| Anthropic's Claude 3.5 Sonnet (deprecated) | 350 | 25 | 1 | 1 input token = 1 token<br>1 output token = 5 tokens<br>1 cache write 5m token = 1.25 tokens<br>1 cache hit token = 0.1 tokens |
For information about supported locations, see Anthropic Claude region availability. To order Provisioned Throughput for Anthropic models, contact your Google Cloud account representative.
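The Claude burndown rates can be applied the same way as the Google model rates. The sketch below uses the rates shared by most rows above (1x input, 5x output, 1.25x five-minute cache writes, 2x one-hour cache writes, 0.1x cache hits) with a hypothetical per-second workload, and rounds up to the 25-GSU minimum purchase that the table lists for the Sonnet models.

```python
import math

# Burndown rates common to most Claude rows in the table above.
RATES = {
    "input": 1.0,
    "output": 5.0,
    "cache_write_5m": 1.25,
    "cache_write_1h": 2.0,
    "cache_hit": 0.1,
}

def claude_burn(tokens: dict[str, int]) -> float:
    """Total burned tokens per second after applying burndown rates."""
    return sum(count * RATES[kind] for kind, count in tokens.items())

# Hypothetical workload: a large cached prefix means most input tokens
# burn at only 0.1x, so cache hits are cheap relative to output tokens.
workload = {"input": 1_000, "cache_hit": 20_000, "output": 500}
burn = claude_burn(workload)           # 1,000 + 2,000 + 2,500 = 5,500
gsus = max(25, math.ceil(burn / 350))  # Sonnet-class minimum purchase is 25 GSUs
print(gsus)
```

In this example the raw demand only needs 16 GSUs, but the 25-GSU minimum purchase for Sonnet-class models is the binding constraint.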
Open models
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
The following table shows the throughput, purchase increment, and burndown rates for open models that support Provisioned Throughput.
| Model | Throughput per GSU (tokens/sec) | Minimum GSU purchase | GSU purchase increment | Burndown rates |
|---|---|---|---|---|
| Latest supported version: | 3,360 | 1 | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 1,680 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 1,680 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 1,400 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 1 token |
| Latest supported version: | 2,800 | 1 | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 4,035 | 1 | 1 | 1 input text token = 1 token<br>1 input image token = 1 token<br>1 output text token = 3 tokens |
| Latest supported version: | 3,360 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 11,205 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 14,405 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 4,035 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 1,010 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 4 tokens |
| Latest supported version: | 6,725 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 8 tokens |
| Latest supported version: | 6,725 | 1 | 1 | 1 input text token = 1 token<br>1 output text token = 8 tokens |
Available capabilities for Google and open models
The following table lists the capabilities that are available with Provisioned Throughput for Google models and open models:
| Capability | Google models | Open models (preview) |
|---|---|---|
| Order through Google Cloud console | Yes | Yes |
| Supports global endpoints | See Global endpoint model support. | See Global endpoint model support. |
| Supports supervised fine-tuned models | Yes | No |
| Supports API key usage | Yes | No |
| Integrated with implicit context caching | Yes | Not applicable |
| Integrated with explicit context caching | Yes | Not applicable |
| ML processing | Available in specific regions. For details, see Single Zone Provisioned Throughput. | Not applicable |
| Available order terms | 1 week, 1 month, 3 months, and 1 year | 1 month, 3 months, and 1 year |
| Change order from the console | Yes | No |
| Order statuses: pending review, approved, active, expired | Yes | Yes |
| Overages spill over to pay-as-you-go by default | Yes | Yes |
| API header control: use "dedicated" to use only Provisioned Throughput, or "shared" to use only pay-as-you-go | Yes | Yes |
| Monitoring: metrics, dashboards, and alerting | Yes | Yes |
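The API header control can be sketched as follows: each request either consumes only Provisioned Throughput ("dedicated") or only pay-as-you-go quota ("shared"). The header name X-Vertex-AI-LLM-Request-Type below is an assumption based on the Provisioned Throughput documentation; verify it against the docs for your API version before relying on it.

```python
# Minimal sketch: build request headers for a generateContent call with
# explicit traffic-type control. The bearer token value is a placeholder.
def request_headers(access_token: str, traffic_type: str) -> dict[str, str]:
    # "dedicated": fail rather than spill over to pay-as-you-go.
    # "shared": bypass Provisioned Throughput entirely.
    if traffic_type not in ("dedicated", "shared"):
        raise ValueError(f"unknown traffic type: {traffic_type!r}")
    return {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
        # Assumed header name; omit the header for default spillover behavior.
        "X-Vertex-AI-LLM-Request-Type": traffic_type,
    }

headers = request_headers("ya29.placeholder-token", "dedicated")
```

Omitting the header gives the default behavior from the table: Provisioned Throughput is used first, and overages spill over to pay-as-you-go.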
Global endpoint model support
Provisioned Throughput supports the global endpoint for Google models and open models.
Traffic that exceeds the Provisioned Throughput quota uses the global endpoint by default.
To assign Provisioned Throughput to the global endpoint of a model, select global as the region when you place a Provisioned Throughput order.
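To make the endpoint choice concrete, the sketch below builds a generateContent URL for a regional endpoint versus the global endpoint, following the Vertex AI REST URL pattern: regional hosts are prefixed with the region, while the global endpoint uses the bare aiplatform.googleapis.com host with global as the location. The project and model values are placeholders.

```python
# Sketch of the Vertex AI generateContent URL pattern for regional
# versus global endpoints (values here are placeholders).
def generate_content_url(project: str, model: str, location: str = "global") -> str:
    host = (
        "aiplatform.googleapis.com"            # global endpoint: no region prefix
        if location == "global"
        else f"{location}-aiplatform.googleapis.com"  # regional endpoint
    )
    return (
        f"https://{host}/v1/projects/{project}/locations/{location}"
        f"/publishers/google/models/{model}:generateContent"
    )

print(generate_content_url("my-project", "gemini-2.0-flash-001"))
print(generate_content_url("my-project", "gemini-2.0-flash-001", "us-central1"))
```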
Google models with global endpoint support
The following table lists the Google models for which Provisioned Throughput supports the global endpoint:
| Model | Latest supported model version |
|---|---|
| Gemini 3 Pro (preview) | gemini-3-pro-preview |
| Gemini 3 Pro Image (preview) | gemini-3-pro-image-preview |
| Gemini 2.5 Pro | gemini-2.5-pro |
| Gemini 2.5 Flash Image | gemini-2.5-flash-image |
| Gemini 2.5 Flash | |
| Gemini 2.5 Flash-Lite | |
| Gemini 2.0 Flash | gemini-2.0-flash-001 |
| Gemini 2.0 Flash-Lite | gemini-2.0-flash-lite-001 |
Open models with global endpoint support
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
The following table lists the open models for which Provisioned Throughput supports the global endpoint:
| Model | Latest supported model version |
|---|---|
| DeepSeek-OCR | deepseek-ocr-maas |
| DeepSeek-V3.2 | deepseek-v3.2-maas |
| Kimi K2 Thinking | kimi-k2-thinking-maas |
| MiniMax M2 | minimax-m2-maas |
| OpenAI gpt-oss 120B | gpt-oss-120b-maas |
| Qwen3-Next-80B Instruct | qwen3-next-80b-a3b-instruct-maas |
| Qwen3-Next-80B Thinking | qwen3-next-80b-a3b-thinking-maas |
Supervised fine-tuned model support
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
The following is supported for Google models that support supervised fine-tuning:

- Provisioned Throughput can be applied to both base models and supervised fine-tuned versions of those base models.
- Supervised fine-tuned model endpoints and their corresponding base model count towards the same Provisioned Throughput quota. For example, Provisioned Throughput purchased for gemini-2.0-flash-lite-001 in a specific project prioritizes requests made from supervised fine-tuned versions of gemini-2.0-flash-lite-001 created within that project. Use the appropriate header to control traffic behavior.
What's next
Last updated 2025-12-15 UTC.