Mistral AI models
Mistral AI models on Vertex AI offer fully managed and serverless models as APIs. To use a Mistral AI model on Vertex AI, send a request directly to the Vertex AI API endpoint. Because Mistral AI models use a managed API, there's no need to provision or manage infrastructure.
You can stream your responses to reduce the end-user latency perception. A streamed response uses server-sent events (SSE) to incrementally stream the response.
You pay for Mistral AI models as you use them (pay as you go). For pay-as-you-go pricing, see Mistral AI model pricing on the Vertex AI pricing page.
To see an example of getting started with Mistral AI models on Vertex AI, run the "Getting Started with Mistral AI Models" notebook in one of the following environments:
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub
Available Mistral AI models
The following models are available from Mistral AI to use in Vertex AI. To access a Mistral AI model, go to its Model Garden model card.
Mistral Medium 3
Mistral Medium 3 is a versatile model designed for a wide range of tasks, including programming, mathematical reasoning, understanding long documents, summarization, and dialogue. It excels at complex tasks requiring advanced reasoning abilities, visual understanding, or a high level of specialization (for example, creative writing, agentic workflows, or code generation).
It boasts multimodal capabilities, enabling it to process visual inputs, and supports dozens of languages, including over 80 coding languages. Additionally, it features function calling and agentic workflows.
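Because the model supports function calling, you can declare tools in the request body. The following is a minimal sketch, assuming the rawPredict body (described later on this page) accepts the same tools schema as Mistral's chat completions API; the get_weather function is hypothetical:

# Sketch: declare a hypothetical get_weather tool in the request body.
# Assumption: rawPredict passes through Mistral's standard tools schema.
cat > request.json <<'EOF'
{
  "model": "mistral-medium-3",
  "messages": [
    { "role": "user", "content": "What's the weather in Paris?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }
  ],
  "max_tokens": 512,
  "stream": false
}
EOF

You can send this body with the curl commands shown in the Use Mistral AI models section later on this page.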
Mistral Medium 3 is optimized for single-node inference, particularly for long-context applications. Its size allows it to achieve high throughput on a single node.
Go to the Mistral Medium 3 model card
Mistral OCR (25.05)
Mistral OCR (25.05) is an Optical Character Recognition API for document understanding. Mistral OCR (25.05) excels at understanding complex document elements, including interleaved imagery, mathematical expressions, tables, and advanced layouts such as LaTeX formatting. The model enables deeper understanding of rich documents such as scientific papers with charts, graphs, equations, and figures.
Mistral OCR (25.05) is an ideal model to use in combination with a RAG system that takes multimodal documents (such as slides or complex PDFs) as input.
You can couple Mistral OCR (25.05) with other Mistral models to reformat the results. This combination ensures that the extracted content is not only accurate but also presented in a structured and coherent manner, making it suitable for various downstream applications and analyses.
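As a sketch of what an OCR request can look like, the following assumes the request body mirrors Mistral's publicly documented OCR API; the document URL is illustrative, and the exact body shape on Vertex AI may differ:

# Sketch: OCR a PDF by URL. Assumption: the body mirrors Mistral's OCR API.
cat > request.json <<'EOF'
{
  "model": "mistral-ocr-2505",
  "document": {
    "type": "document_url",
    "document_url": "https://example.com/paper.pdf"
  }
}
EOF

# Send the request to the publisher endpoint (see Use Mistral AI models below).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request.json \
  "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/mistral-ocr-2505:rawPredict"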
Go to the Mistral OCR (25.05) model card
Mistral Small 3.1 (25.03)
Mistral Small 3.1 (25.03) features multimodal capabilities and a context window of up to 128,000 tokens. The model can process and understand visual inputs and long documents, further expanding its range of applications compared to the previous Mistral AI Small model. Mistral Small 3.1 (25.03) is a versatile model designed for various tasks such as programming, mathematical reasoning, document understanding, and dialogue. Mistral Small 3.1 (25.03) is designed for low-latency applications and delivers best-in-class efficiency compared to models of the same quality.
Mistral Small 3.1 (25.03) has undergone a full post-training process to align the model with human preferences and needs, making it usable out of the box for applications that require chat or precise instruction following.
Go to the Mistral Small 3.1 (25.03) model card
Codestral 2
Codestral 2 is Mistral's specialized code generation model, built specifically for high-precision fill-in-the-middle (FIM) completion. It helps developers write and interact with code through a shared instruction and completion API endpoint. Because it masters code and can also converse in a variety of languages, it can be used to design advanced AI applications for software developers.
The latest release of Codestral 2 delivers measurable upgrades over the prior version, Codestral (25.01):
- 30% increase in accepted completions.
- 10% more retained code after suggestion.
- 50% fewer runaway generations, improving confidence in longer edits.
- Improved performance on academic benchmarks for short- and long-context FIM completion.
Common use cases include:
- Code generation: code completion, suggestions, translation.
- Code understanding and documentation: code summarization and explanation.
- Code quality: code review, refactoring, bug fixing and test case generation.
- Code fill-in-the-middle: users can define the starting point of the code using a prompt, and the ending point of the code using an optional suffix and an optional stop. The Codestral model then generates the code that fits in between, making it ideal for tasks that require a specific piece of code to be generated (see the sketch after this list).
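The following is a minimal FIM sketch. It assumes the rawPredict body accepts the prompt and suffix fields from Mistral's FIM API; the code fragments are illustrative:

# Sketch: fill-in-the-middle completion. Assumption: the rawPredict body
# accepts the prompt/suffix fields from Mistral's FIM API.
cat > request.json <<'EOF'
{
  "model": "codestral-2",
  "prompt": "def fibonacci(n):\n",
  "suffix": "\nprint(fibonacci(10))",
  "max_tokens": 128,
  "stream": false
}
EOF

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request.json \
  "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/codestral-2:rawPredict"

The model generates only the code that fits between the prompt and the suffix.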
Go to the Codestral 2 model card
Use Mistral AI models
You can use curl commands to send requests to the Vertex AI endpoint using the following model names:
- For Mistral Medium 3, use mistral-medium-3
- For Mistral OCR (25.05), use mistral-ocr-2505
- For Mistral Small 3.1 (25.03), use mistral-small-2503
- For Codestral 2, use codestral-2
For more information about using the Mistral AI SDK, see the Mistral AI Vertex AI documentation.
Before you begin
To use Mistral AI models with Vertex AI, you must perform the following steps. The Vertex AI API (aiplatform.googleapis.com) must be enabled to use Vertex AI. If you already have an existing project with the Vertex AI API enabled, you can use that project instead of creating a new project.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Vertex AI API. (You can also enable it with the gcloud command shown after this list.)
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
- Go to one of the Model Garden model cards listed in the Available Mistral AI models section, then click Enable.
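To enable the Vertex AI API from the command line instead of the console, run the following gcloud command:

# Enables the Vertex AI API (aiplatform.googleapis.com) for your project.
gcloud services enable aiplatform.googleapis.com --project=PROJECT_ID

Replace PROJECT_ID with the ID of your Google Cloud project.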
Make a streaming call to a Mistral AI model
The following sample makes a streaming call to a Mistral AI model.
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.
Before using any of the request data, make the following replacements:
- LOCATION: A region that supports Mistral AI models.
- MODEL: The model name you want to use. In the request body, exclude the @ model version number.
- ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
- STREAM: A boolean that specifies whether the response is streamed or not. Stream your response to reduce the end-user latency perception. Set to true to stream the response and false to return the response all at once.
- CONTENT: The content, such as text, of the user or assistant message.
- MAX_OUTPUT_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters. 100 tokens correspond to roughly 60-80 words. Specify a lower value for shorter responses and a higher value for potentially longer responses.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:streamRawPredict
Request JSON body:
{"model":MODEL, "messages": [ { "role": "ROLE", "content": "CONTENT" }], "max_tokens":MAX_TOKENS, "stream": true}To send your request, choose one of these options:
curl
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list. Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:streamRawPredict"
PowerShell
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list. Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:streamRawPredict" | Select-Object -Expand Content
You should receive a JSON response similar to the following.
Response
data: { "id": "0e9c8e69e5924f729b39bc60bac9e0be", "object": "chat.completion.chunk", "created": 1720807292, "model": "MODEL", "choices": [ { "index": 0, "delta": { "content": "OUTPUT" }, "finish_reason": null, "logprobs": null } ]}data: { "id": "0e9c8e69e5924f729b39bc60bac9e0be", "object": "chat.completion.chunk", "created": 1720807292, "model": "MODEL", "choices": [ { "index": 0, "delta": { "content": "OUTPUT" }, "finish_reason": null, "logprobs": null } ]}...Make a unary call to a Mistral AI model
Make a unary call to a Mistral AI model
The following sample makes a unary call to a Mistral AI model.
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.
Before using any of the request data, make the following replacements:
- LOCATION: A region that supports Mistral AI models.
- MODEL: The model name you want to use. In the request body, exclude the @ model version number.
- ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
- STREAM: A boolean that specifies whether the response is streamed or not. Stream your response to reduce the end-user latency perception. Set to true to stream the response and false to return the response all at once.
- CONTENT: The content, such as text, of the user or assistant message.
- MAX_OUTPUT_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters. 100 tokens correspond to roughly 60-80 words. Specify a lower value for shorter responses and a higher value for potentially longer responses.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:rawPredict
Request JSON body:
{"model":MODEL, "messages": [ { "role": "ROLE", "content": "CONTENT" }], "max_tokens":MAX_TOKENS, "stream": false}To send your request, choose one of these options:
curl
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list. Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:rawPredict"
PowerShell
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list. Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:rawPredict" | Select-Object -Expand Content
You should receive a JSON response similar to the following.
Response
{ "id": "e71d13ffb77344a08e34e0a22ea84458", "object": "chat.completion", "created": 1720806624, "model": "MODEL", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "OUTPUT", "tool_calls": null }, "finish_reason": "stop", "logprobs": null } ], "usage": { "prompt_tokens": 17, "total_tokens": 295, "completion_tokens": 278 }}Mistral AI model region availability and quotas
Mistral AI model region availability and quotas
For Mistral AI models, a quota applies for each region where the model is available. The quota is specified in queries per minute (QPM) and tokens per minute (TPM). TPM includes both input and output tokens.
Important: Machine learning (ML) processing for all available Mistral AI models occurs within the US when requests are made to regionally available APIs in the US, or within the EU when requests are made to regionally available APIs in Europe.

| Model | Region | Quotas | Context length |
|---|---|---|---|
| Mistral Medium 3 | us-central1 | | 128,000 tokens |
| Mistral Medium 3 | europe-west4 | | 128,000 tokens |
| Mistral OCR (25.05) | us-central1 | | 30 pages |
| Mistral OCR (25.05) | europe-west4 | | 30 pages |
| Mistral Small 3.1 (25.03) | us-central1 | | 128,000 tokens |
| Mistral Small 3.1 (25.03) | europe-west4 | | 128,000 tokens |
| Codestral 2 | us-central1 | | 128,000 tokens |
| Codestral 2 | europe-west4 | | 128,000 tokens |
If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see the Cloud Quotas overview.