Vertex AI open models for MaaS

Vertex AI supports a curated list of open models as managed models. These open models can be used with Vertex AI as a model as a service (MaaS) and are offered as a managed API. When you use a managed open model, you continue to send your requests to Vertex AI endpoints. Managed open models are serverless, so there's no need to provision or manage infrastructure.

You can discover managed open models in Model Garden. You can also deploy models from Model Garden. For more information, see Explore AI models in Model Garden.

Before you can use open models, you need to grant user access to open models.
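
Because managed open models sit behind standard Vertex AI endpoints, one common way to call them is through Vertex AI's OpenAI-compatible Chat Completions surface. The sketch below is illustrative rather than authoritative: it assumes the `openai` and `google-auth` Python packages, a project that has already been granted access, and a placeholder model ID (check each model's card for the exact ID string and supported regions).

```python
import openai
from google.auth import default
from google.auth.transport.requests import Request

PROJECT_ID = "your-project-id"  # assumption: replace with your project
REGION = "us-central1"          # assumption: use a region that serves the model

# Mint a short-lived access token from Application Default Credentials.
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

# Point the OpenAI client at the Vertex AI OpenAI-compatible endpoint.
client = openai.OpenAI(
    base_url=(
        f"https://{REGION}-aiplatform.googleapis.com/v1/projects/"
        f"{PROJECT_ID}/locations/{REGION}/endpoints/openapi"
    ),
    api_key=credentials.token,  # tokens expire; refresh them in long-lived processes
)

response = client.chat.completions.create(
    model="deepseek-ai/deepseek-v3.1-maas",  # illustrative ID; see the model card
    messages=[{"role": "user", "content": "Summarize what MaaS means on Vertex AI."}],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the Chat Completions protocol, existing OpenAI-client code typically needs only the `base_url` and credential changes shown above.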

Open models

The following open models are offered as managed APIs on Vertex AI Model Garden (MaaS):

| Model name | Modality | Description | Quickstart |
| --- | --- | --- | --- |
| DeepSeek-OCR | Language, Vision | A comprehensive Optical Character Recognition (OCR) model that analyzes and understands complex documents. It excels at challenging OCR tasks. | Model card |
| DeepSeek R1 (0528) | Language | DeepSeek's latest version of the DeepSeek R1 model. | Model card |
| DeepSeek-V3.1 | Language | DeepSeek's hybrid model that supports both thinking mode and non-thinking mode. | Model card |
| DeepSeek-V3.2 | Language | DeepSeek's model that harmonizes high computational efficiency with superior reasoning and agent performance. | Model card |
| GLM 4.7 | Language, Code | GLM's model designed for core or vibe coding, tool use, and complex reasoning. | Model card |
| GLM 5 | Language, Code | GLM's model targeting complex systems engineering and long-horizon agentic tasks. | Model card |
| gpt-oss 120B | Language | A 120B model that offers high performance on reasoning tasks. | Model card |
| gpt-oss 20B | Language | A 20B model optimized for efficiency and deployment on consumer and edge hardware. | Model card |
| Kimi K2 Thinking | Language | An open-source thinking agent model that reasons step-by-step and uses tools to solve complex problems. | Model card |
| Llama 3.3 | Language | Llama 3.3 is a text-only 70B instruction-tuned model that provides enhanced performance relative to Llama 3.1 70B and to Llama 3.2 90B when used for text-only applications. Moreover, for some applications, Llama 3.3 70B approaches the performance of Llama 3.1 405B. | Model card |
| Llama 4 Maverick 17B-128E | Language, Vision | The largest and most capable Llama 4 model, with coding, reasoning, and image capabilities. Llama 4 Maverick 17B-128E is a multimodal model that uses the Mixture-of-Experts (MoE) architecture and early fusion. | Model card |
| Llama 4 Scout 17B-16E | Language, Vision | Llama 4 Scout 17B-16E delivers state-of-the-art results for its size class, outperforming previous Llama generations and other open and proprietary models on several benchmarks. It is a multimodal model that uses the Mixture-of-Experts (MoE) architecture and early fusion. | Model card |
| MiniMax M2 | Language, Code | Designed for agentic and code-related tasks, with strong capabilities in planning and executing complex tool-calling tasks. | Model card |
| Qwen3 235B | Language | An open-weight model with a "hybrid thinking" capability to switch between methodical reasoning and rapid conversation. | Model card |
| Qwen3 Coder | Language, Code | An open-weight model developed for advanced software development tasks. | Model card |
| Qwen3-Next-80B Instruct | Language, Code | A model from the Qwen3-Next family of models, specialized for following specific commands. | Model card |
| Qwen3-Next-80B Thinking | Language, Code | A model from the Qwen3-Next family of models, specialized for complex problem-solving and deep reasoning. | Model card |

The following open embedding models are offered as managed APIs on Vertex AI Model Garden (MaaS):

| Model name | Description | Output dimensions | Max sequence length | Supported text languages | Quickstart |
| --- | --- | --- | --- | --- | --- |
| multilingual-e5-small | Part of the E5 family of text embedding models. The small variant contains 12 layers. | Up to 384 | 512 tokens | Supported languages | Model card |
| multilingual-e5-large | Part of the E5 family of text embedding models. The large variant contains 24 layers. | Up to 1024 | 512 tokens | Supported languages | Model card |
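
If the managed embedding models are exposed through the same OpenAI-compatible surface, an embeddings call could look like the following sketch. This is an assumption rather than a documented invocation, and the model ID `multilingual-e5-large-maas` is hypothetical; treat each model card as the authority on how to call these models.

```python
# Reuses the `client` object constructed in the earlier Chat Completions sketch.
# The route and model ID are assumptions; consult the model card before relying on them.
result = client.embeddings.create(
    model="multilingual-e5-large-maas",  # hypothetical ID
    input=["Bonjour le monde", "Hello world"],
)
vector = result.data[0].embedding
print(len(vector))  # the large variant produces up to 1024 dimensions
```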

Open model regulatory compliance

The certifications for Generative AI on Vertex AI continue to apply when open models are used as a managed API on Vertex AI. If you need details about the models themselves, you can find additional information in the respective model card, or you can contact the respective model publisher.

Your data is stored at rest within the selected region or multi-region for open models on Vertex AI, but the regionalization of data processing may vary. For a detailed list of open models' data processing commitments, see Data residency for open models.

Customer prompts and model responses are not shared with third parties when using the Vertex AI API, including open models. Google only processes customer data as instructed by the customer, which is further described in our Cloud Data Processing Addendum.

Context caching

Preview

This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

Context caching helps reduce the cost and latency of requests to open models that contain repeated content. Context caching is available with pay-as-you-go traffic only, and doesn't support other traffic types, such as Provisioned Throughput and Batch.

The supported type of caching is implicit caching, which is automatic caching that's enabled by default in all Google Cloud projects. It provides a 90% discount on cached tokens compared to standard input tokens when cache hits occur. With this type of caching, you don't define or call the caches explicitly. Instead, the backend pulls from these caches once repeated context is detected.

Supported models

  • qwen3-coder-480b-a35b-instruct-maas
  • kimi-k2-thinking-maas
  • minimax-m2-maas
  • gpt-oss-20b-maas
  • deepseek-v3.1-maas
  • deepseek-v3.2-maas

The `cachedContentTokenCount` field in your response's metadata indicates the number of tokens in the cached part of your input. Caching requests must contain a minimum of 4096 tokens (this minimum is subject to change during Preview).

When enabled, implicit cache hit cost savings are automatically passed on to you. Cache hits aren't guaranteed and depend on the requests you send, among other factors. To increase the chances of an implicit cache hit, try the following (a code sketch follows the list):

  • Place large and common content at the beginning of your prompt.
  • Send requests with a similar prefix in a short amount of time.
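
To make those tips concrete, the following sketch reuses the Chat Completions client from the earlier sketch and sends two requests that share one large common prefix, then prints each response's usage metadata. The model ID is illustrative (one of the supported models listed above), and the exact field that carries `cachedContentTokenCount` can differ by API surface, so the sketch prints the whole usage object rather than assuming a path.

```python
# Assumes the `client` from the earlier sketch and a caching-supported model.
# The shared prefix should be large (at least 4096 tokens during Preview) and
# must sit at the beginning of the prompt for an implicit cache hit to occur.
with open("contract.txt") as f:  # hypothetical large document
    shared_prefix = f.read()

def ask(question: str):
    return client.chat.completions.create(
        model="deepseek-ai/deepseek-v3.2-maas",  # illustrative caching-supported ID
        messages=[{"role": "user", "content": f"{shared_prefix}\n\n{question}"}],
    )

first = ask("List the parties to this agreement.")
second = ask("Summarize the termination clause.")  # same prefix, sent soon after

# Cache hits aren't guaranteed; inspect the usage metadata on each response.
# Depending on the API surface, cached tokens may appear as cachedContentTokenCount
# in the response metadata or under prompt token details in the usage object.
print(first.usage)
print(second.usage)
```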
