Vertex AI open models for MaaS

Vertex AI supports a curated list of open models as managed models. These open models can be used with Vertex AI as a model as a service (MaaS) and are offered as a managed API. When you use a managed open model, you continue to send your requests to Vertex AI endpoints. Managed open models are serverless, so there's no need to provision or manage infrastructure.

You can discover managed open models in Model Garden. You can also deploy models from Model Garden. For more information, see Explore AI models in Model Garden.

Before you can use open models, you need to grant user access to open models.
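
Because managed open models sit behind standard Vertex AI endpoints, one common way to call them is through Vertex AI's OpenAI-compatible Chat Completions surface. The sketch below is illustrative rather than authoritative: it assumes the `openai` and `google-auth` Python packages, a project that has already been granted access, and a placeholder model ID (check each model's card for the exact ID string and supported regions).

```python
import openai
from google.auth import default
from google.auth.transport.requests import Request

PROJECT_ID = "your-project-id"  # assumption: replace with your project
REGION = "us-central1"          # assumption: use a region that serves the model

# Mint a short-lived access token from Application Default Credentials.
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

# Point the OpenAI client at the Vertex AI OpenAI-compatible endpoint.
client = openai.OpenAI(
    base_url=(
        f"https://{REGION}-aiplatform.googleapis.com/v1/projects/"
        f"{PROJECT_ID}/locations/{REGION}/endpoints/openapi"
    ),
    api_key=credentials.token,  # tokens expire; refresh them in long-lived processes
)

response = client.chat.completions.create(
    model="deepseek-ai/deepseek-v3.1-maas",  # illustrative ID; see the model card
    messages=[{"role": "user", "content": "Summarize what MaaS means on Vertex AI."}],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the Chat Completions protocol, existing OpenAI-client code typically needs only the `base_url` and credential changes shown above.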

Open models

The following open models are offered as managed APIs on Vertex AI Model Garden (MaaS):

| Model name | Modality | Description | Quickstart |
| --- | --- | --- | --- |
| DeepSeek-OCR | Language, Vision | A comprehensive Optical Character Recognition (OCR) model that analyzes and understands complex documents. It excels at challenging OCR tasks. | Model card |
| DeepSeek R1 (0528) | Language | DeepSeek's latest version of the DeepSeek R1 model. | Model card |
| DeepSeek-V3.1 | Language | DeepSeek's hybrid model that supports both thinking mode and non-thinking mode. | Model card |
| DeepSeek-V3.2 | Language | DeepSeek's model that harmonizes high computational efficiency with superior reasoning and agent performance. | Model card |
| GLM 4.7 | Language, Code | GLM's model designed for core or vibe coding, tool use, and complex reasoning. | Model card |
| GLM 5 | Language, Code | GLM's model targeting complex systems engineering and long-horizon agentic tasks. | Model card |
| gpt-oss 120B | Language | A 120B model that offers high performance on reasoning tasks. | Model card |
| gpt-oss 20B | Language | A 20B model optimized for efficiency and deployment on consumer and edge hardware. | Model card |
| Kimi K2 Thinking | Language | An open-source thinking agent model that reasons step-by-step and uses tools to solve complex problems. | Model card |
| Llama 3.3 | Language | Llama 3.3 is a text-only 70B instruction-tuned model that provides enhanced performance relative to Llama 3.1 70B and to Llama 3.2 90B when used for text-only applications. Moreover, for some applications, Llama 3.3 70B approaches the performance of Llama 3.1 405B. | Model card |
| Llama 4 Maverick 17B-128E | Language, Vision | The largest and most capable Llama 4 model, with coding, reasoning, and image capabilities. Llama 4 Maverick 17B-128E is a multimodal model that uses the Mixture-of-Experts (MoE) architecture and early fusion. | Model card |
| Llama 4 Scout 17B-16E | Language, Vision | Llama 4 Scout 17B-16E delivers state-of-the-art results for its size class, outperforming previous Llama generations and other open and proprietary models on several benchmarks. It is a multimodal model that uses the Mixture-of-Experts (MoE) architecture and early fusion. | Model card |
| MiniMax M2 | Language, Code | Designed for agentic and code-related tasks, with strong capabilities in planning and executing complex tool-calling tasks. | Model card |
| Qwen3 235B | Language | An open-weight model with a "hybrid thinking" capability to switch between methodical reasoning and rapid conversation. | Model card |
| Qwen3 Coder | Language, Code | An open-weight model developed for advanced software development tasks. | Model card |
| Qwen3-Next-80B Instruct | Language, Code | A model from the Qwen3-Next family of models, specialized for following specific commands. | Model card |
| Qwen3-Next-80B Thinking | Language, Code | A model from the Qwen3-Next family of models, specialized for complex problem-solving and deep reasoning. | Model card |

The following open embedding models are offered as managed APIs on Vertex AI Model Garden (MaaS):

| Model name | Description | Output dimensions | Max sequence length | Supported text languages | Quickstart |
| --- | --- | --- | --- | --- | --- |
| multilingual-e5-small | Part of the E5 family of text embedding models. The small variant contains 12 layers. | Up to 384 | 512 tokens | Supported languages | Model card |
| multilingual-e5-large | Part of the E5 family of text embedding models. The large variant contains 24 layers. | Up to 1024 | 512 tokens | Supported languages | Model card |
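
If the managed embedding models are exposed through the same OpenAI-compatible surface, an embeddings call could look like the following sketch. This is an assumption rather than a documented invocation, and the model ID `multilingual-e5-large-maas` is hypothetical; treat each model card as the authority on how to call these models.

```python
# Reuses the `client` object constructed in the earlier Chat Completions sketch.
# The route and model ID are assumptions; consult the model card before relying on them.
result = client.embeddings.create(
    model="multilingual-e5-large-maas",  # hypothetical ID
    input=["Bonjour le monde", "Hello world"],
)
vector = result.data[0].embedding
print(len(vector))  # the large variant produces up to 1024 dimensions
```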

Open model regulatory compliance

The certifications for Generative AI on Vertex AI continue to apply when open models are used as a managed API on Vertex AI. If you need details about the models themselves, you can find additional information in the respective model card, or you can contact the respective model publisher.

Your data is stored at rest within the selected region or multi-region for open models on Vertex AI, but the regionalization of data processing may vary. For a detailed list of open models' data processing commitments, see Data residency for open models.

Customer prompts and model responses are not shared with third parties when using the Vertex AI API, including open models. Google only processes customer data as instructed by the customer, which is further described in our Cloud Data Processing Addendum.

Context caching

Preview

This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

Context caching helps reduce the cost and latency of requests to open models that contain repeated content. Context caching is available with pay-as-you-go traffic only, and doesn't support other traffic types, such as Provisioned Throughput and Batch.

The supported type of caching is implicit caching, which is automatic caching that's enabled by default in all Google Cloud projects. It provides a 90% discount on cached tokens compared to standard input tokens when cache hits occur. With this type of caching, you don't define or call the caches explicitly. Instead, the backend pulls from these caches once repeated context is detected.

Supported models

  • qwen3-coder-480b-a35b-instruct-maas
  • kimi-k2-thinking-maas
  • minimax-m2-maas
  • gpt-oss-20b-maas
  • deepseek-v3.1-maas
  • deepseek-v3.2-maas

The `cachedContentTokenCount` field in your response's metadata indicates the number of tokens in the cached part of your input. Caching requests must contain a minimum of 4096 tokens (this minimum is subject to change during Preview).

When enabled, implicit cache hit cost savings are automatically passed on to you. Cache hits aren't guaranteed and depend on the requests you send, among other factors. To increase the chances of an implicit cache hit, try the following (a code sketch follows the list):

  • Place large and common content at the beginning of your prompt.
  • Send requests with a similar prefix in a short amount of time.
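
To make those tips concrete, the following sketch reuses the Chat Completions client from the earlier sketch and sends two requests that share one large common prefix, then prints each response's usage metadata. The model ID is illustrative (one of the supported models listed above), and the exact field that carries `cachedContentTokenCount` can differ by API surface, so the sketch prints the whole usage object rather than assuming a path.

```python
# Assumes the `client` from the earlier sketch and a caching-supported model.
# The shared prefix should be large (at least 4096 tokens during Preview) and
# must sit at the beginning of the prompt for an implicit cache hit to occur.
with open("contract.txt") as f:  # hypothetical large document
    shared_prefix = f.read()

def ask(question: str):
    return client.chat.completions.create(
        model="deepseek-ai/deepseek-v3.2-maas",  # illustrative caching-supported ID
        messages=[{"role": "user", "content": f"{shared_prefix}\n\n{question}"}],
    )

first = ask("List the parties to this agreement.")
second = ask("Summarize the termination clause.")  # same prefix, sent soon after

# Cache hits aren't guaranteed; inspect the usage metadata on each response.
# Depending on the API surface, cached tokens may appear as cachedContentTokenCount
# in the response metadata or under prompt token details in the usage object.
print(first.usage)
print(second.usage)
```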
