Vertex AI open models for MaaS
Vertex AI supports a curated list of open models as managed models. These open models can be used with Vertex AI as a model as a service (MaaS) and are offered as a managed API. When you use a managed open model, you continue to send your requests to Vertex AI endpoints. Managed open models are serverless, so there's no need to provision or manage infrastructure.
You can discover managed open models by using Model Garden, which you can also use to deploy models. For more information, see Explore AI models in Model Garden.
Before you can use open models, you need to grant user access to open models.
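Because managed open models are served from standard Vertex AI endpoints, requests follow the same pattern regardless of publisher. The following is a minimal sketch using the OpenAI-compatible Chat Completions surface; the project ID, region, and model ID shown are placeholder assumptions, so check the model card for the exact values:

```python
import openai
from google.auth import default
from google.auth.transport.requests import Request

PROJECT = "your-project-id"  # assumption: replace with your project ID
LOCATION = "us-central1"     # assumption: use a region where the model is available

# Authenticate with Application Default Credentials and mint an access token.
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

# Point the OpenAI client at the Vertex AI OpenAI-compatible endpoint.
client = openai.OpenAI(
    base_url=(
        f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{PROJECT}/locations/{LOCATION}/endpoints/openapi"
    ),
    api_key=credentials.token,
)

response = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct-maas",  # assumption: confirm the ID in the model card
    messages=[{"role": "user", "content": "What does MaaS mean on Vertex AI?"}],
)
print(response.choices[0].message.content)
```

Because the models are serverless, there is no deployment step before this call.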
Open models
The following open models are offered as managed APIs on Vertex AI Model Garden (MaaS):
| Model name | Modality | Description | Quickstart |
|---|---|---|---|
| DeepSeek-OCR | Language, Vision | A comprehensive Optical Character Recognition (OCR) model that analyzes and understands complex documents. It excels at challenging OCR tasks. | Model card |
| DeepSeek R1 (0528) | Language | DeepSeek's latest version of the DeepSeek R1 model. | Model card |
| DeepSeek-V3.1 | Language | DeepSeek's hybrid model that supports both thinking mode and non-thinking mode. | Model card |
| DeepSeek-V3.2 | Language | DeepSeek's model that harmonizes high computational efficiency with superior reasoning and agent performance. | Model card |
| GLM 4.7 | Language, Code | GLM's model designed for core or vibe coding, tool use, and complex reasoning. | Model card |
| GLM 5 | Language, Code | GLM's model targeting complex systems engineering and long-horizon agentic tasks. | Model card |
| gpt-oss 120B | Language | A 120B model that offers high performance on reasoning tasks. | Model card |
| gpt-oss 20B | Language | A 20B model optimized for efficiency and deployment on consumer and edge hardware. | Model card |
| Kimi K2 Thinking | Language | An open-source thinking agent model that reasons step-by-step and uses tools to solve complex problems. | Model card |
| Llama 3.3 | Language | Llama 3.3 is a text-only 70B instruction-tuned model that provides enhanced performance relative to Llama 3.1 70B and to Llama 3.2 90B when used for text-only applications. Moreover, for some applications, Llama 3.3 70B approaches the performance of Llama 3.1 405B. | Model card |
| Llama 4 Maverick 17B-128E | Language, Vision | The largest and most capable Llama 4 model that has coding, reasoning, and image capabilities. Llama 4 Maverick 17B-128E is a multimodal model that uses the Mixture-of-Experts (MoE) architecture and early fusion. | Model card |
| Llama 4 Scout 17B-16E | Language, Vision | Llama 4 Scout 17B-16E delivers state-of-the-art results for its size class, outperforming previous Llama generations and other open and proprietary models on several benchmarks. Llama 4 Scout 17B-16E is a multimodal model that uses the Mixture-of-Experts (MoE) architecture and early fusion. | Model card |
| MiniMax M2 | Language, Code | Designed for agentic and code-related tasks with strong capabilities in planning and executing complex tool-calling tasks. | Model card |
| Qwen3 235B | Language | An open-weight model with a "hybrid thinking" capability to switch between methodical reasoning and rapid conversation. | Model card |
| Qwen3 Coder | Language, Code | An open-weight model developed for advanced software development tasks. | Model card |
| Qwen3-Next-80B Instruct | Language, Code | A model from the Qwen3-Next family of models, specialized for following specific commands. | Model card |
| Qwen3-Next-80B Thinking | Language, Code | A model from the Qwen3-Next family of models, specialized for complex problem-solving and deep reasoning. | Model card |
The following open embedding models are offered as managed APIs on Vertex AI Model Garden (MaaS):
| Model name | Description | Output dimensions | Max sequence length | Supported text languages | Quickstart |
|---|---|---|---|---|---|
| multilingual-e5-small | Part of the E5 family of text embedding models. Small variant contains 12 layers. | Up to 384 | 512 tokens | Supported languages | Model card |
| multilingual-e5-large | Part of the E5 family of text embedding models. Large variant contains 24 layers. | Up to 1024 | 512 tokens | Supported languages | Model card |
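As a rough sketch only: assuming the embedding models are reachable through the same OpenAI-compatible surface as the chat models (the route and model ID below are assumptions; confirm both in the model card), an embedding request could look like the following. Note that E5-family models expect a `query: ` or `passage: ` prefix on input text.

```python
import openai
from google.auth import default
from google.auth.transport.requests import Request

PROJECT = "your-project-id"  # assumption: replace with your project ID
LOCATION = "us-central1"     # assumption: use a region where the model is available

credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

client = openai.OpenAI(
    base_url=(
        f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{PROJECT}/locations/{LOCATION}/endpoints/openapi"
    ),
    api_key=credentials.token,
)

# E5 models are trained with "query: "/"passage: " prefixes on their inputs.
result = client.embeddings.create(
    model="intfloat/multilingual-e5-small",  # assumption: confirm the ID in the model card
    input="query: how are managed open models billed?",
)
print(len(result.data[0].embedding))  # up to 384 dimensions for the small variant
```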
Open model regulatory compliance
The certifications for Generative AI on Vertex AI continue to apply when open models are used as a managed API on Vertex AI. If you need details about the models themselves, you can find additional information in the respective model card, or you can contact the respective model publisher.
Your data is stored at rest within the selected region or multi-region for open models on Vertex AI, but the regionalization of data processing may vary. For a detailed list of open models' data processing commitments, see Data residency for open models.
Customer prompts and model responses are not shared with third parties when using the Vertex AI API, including open models. Google only processes customer data as instructed by the customer, which is further described in our Cloud Data Processing Addendum.
Context caching
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
Context caching helps reduce the cost and latency of requests to open models that contain repeated content. Context caching is available for pay-as-you-go traffic only and doesn't support other traffic types, such as Provisioned Throughput and Batch.
The supported type of caching is implicit caching, which is automatic caching that's enabled in all Google Cloud projects by default and provides a 90% discount on cached tokens compared to standard input tokens when cache hits occur. With this type of caching, you don't define and call the caches explicitly. Instead, our backend pulls from these caches once repeated context is detected.
Supported models
- qwen3-coder-480b-a35b-instruct-maas
- kimi-k2-thinking-maas
- minimax-m2-maas
- gpt-oss-20b-maas
- deepseek-v3.1-maas
- deepseek-v3.2-maas
The `cachedContentTokenCount` field in your response's metadata indicates the number of tokens in the cached part of your input. Caching requests must contain a minimum of 4096 tokens (this minimum is subject to change during Preview).
When enabled, implicit cache hit cost savings are automatically passed on to you. Cache hits aren't guaranteed and depend on the requests you send and other factors. To increase the chances of an implicit cache hit, try the following (illustrated in the sketch after this list):
- Place large and common contents at the beginning of your prompt.
- Send requests with a similar prefix in a short amount of time.
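The sketch below applies both tips: it places one large shared document at the start of every prompt, sends similar requests back to back, and then reads `cachedContentTokenCount` to see how many input tokens were served from cache. The REST route, publisher path, and input file are assumptions for illustration; check the model card for the exact endpoint.

```python
import requests
from google.auth import default
from google.auth.transport.requests import Request

PROJECT = "your-project-id"  # assumption: replace with your project ID
LOCATION = "us-central1"     # assumption: use a region where the model is available

credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

# Route and publisher path are assumptions -- check the model card.
url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT}"
    f"/locations/{LOCATION}/publishers/deepseek-ai/models/deepseek-v3.1-maas:generateContent"
)

# A large shared prefix (at least 4096 tokens) that every request repeats.
shared_context = open("contract.txt").read()  # assumption: your own long document

for question in ("Summarize section 2.", "List the termination clauses."):
    body = {
        "contents": [{
            "role": "user",
            # Shared content first, varying question last, so back-to-back
            # requests share the longest possible prefix.
            "parts": [{"text": shared_context}, {"text": question}],
        }]
    }
    resp = requests.post(
        url, headers={"Authorization": f"Bearer {credentials.token}"}, json=body
    ).json()
    usage = resp.get("usageMetadata", {})
    # cachedContentTokenCount reports how much of the input came from cache.
    print(question, "->", usage.get("cachedContentTokenCount", 0), "cached tokens")
```

If the second request hits the cache, its cached-token count should be nonzero; cache hits remain best-effort, so a count of zero doesn't indicate an error.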
What's next
- Before using open models, grant user access to open models.
- Learn how to call open model APIs.