Deploy a model to an endpoint
Before you can get online inferences from a trained model, you must deploy the model to an endpoint. This can be done by using the Google Cloud console, the Google Cloud CLI, or the Vertex AI API.
This document describes the process for deploying models to endpoints.
What happens when you deploy a model
Deploying a model associates physical resources with the model so that it can serve online inferences with low latency.
You can deploy multiple models to an endpoint, or you can deploy the same model to multiple endpoints. For more information, see Reasons to deploy more than one model to the same endpoint.
Prepare to deploy a model to an endpoint
During model deployment, you make the following important decisions about how to run online inference:
| Resource created | Setting specified at resource creation |
|---|---|
| Endpoint | Location in which to run inferences |
| Model | Container to use (ModelContainerSpec) |
| DeployedModel | Compute resources to use for online inference |
After the model is deployed to the endpoint, these deployment settings can't be changed. To change them, you must redeploy your model.
The first step in the deployment process is to decide which type of endpoint to use. For more information, see Choose an endpoint type.
Next, make sure that the model is visible in Vertex AI Model Registry. This is required for the model to be deployable. For information about Model Registry, including how to import model artifacts or create them directly in Model Registry, see Introduction to Vertex AI Model Registry.
The next decision to make is which compute resources to use for serving the model. The model's training type (AutoML or custom) and (AutoML) data type determine the kinds of physical resources available to the model. After model deployment, you can mutate some of those resources without creating a new deployment.
The endpoint resource provides the service endpoint (URL) you use to request the inference. For example:
https://us-central1-aiplatform.googleapis.com/v1/projects/{project}/locations/{location}/endpoints/{endpoint}:predict
Deploy a model to an endpoint
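A service URL of this form can be assembled from the project, location, and endpoint ID. A minimal sketch (the project and endpoint IDs below are placeholder values, not real resources):

```python
# Build the :predict service URL for a Vertex AI regional public endpoint.
# The project and endpoint IDs passed in below are placeholders.

def predict_url(project: str, location: str, endpoint: str) -> str:
    """Return the :predict service URL for a regional endpoint."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/endpoints/{endpoint}:predict"
    )

url = predict_url("my-project", "us-central1", "1234567890")
print(url)
```

Requests sent to this URL must be authenticated and carry a JSON body whose shape depends on the deployed model's container.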
You can deploy a model to an endpoint by using the Google Cloud console or by using the gcloud CLI or Vertex AI API.
Deploy a model to a public endpoint by using the Google Cloud console
In the Google Cloud console, you can deploy a model to an existing dedicated or shared public endpoint, or you can create a new endpoint during the deployment process. For details, see Deploy a model by using the Google Cloud console.
Deploy a model to a public endpoint by using the gcloud CLI or Vertex AI API
When you deploy a model by using the gcloud CLI or Vertex AI API, you must first create a dedicated or shared endpoint and then deploy the model to it. For details, see:
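As a rough sketch of this two-step flow with the gcloud CLI, the commands are assembled here as strings rather than executed. All IDs, names, and the machine type are placeholders, and flag names should be verified against `gcloud ai endpoints --help`:

```python
# Sketch of the two-step gcloud flow: create an endpoint, then deploy a model
# to it. All IDs and the machine type are placeholder values; verify flags
# against `gcloud ai endpoints create --help` and
# `gcloud ai endpoints deploy-model --help`.

def endpoint_create_cmd(region: str, display_name: str) -> str:
    return f"gcloud ai endpoints create --region={region} --display-name={display_name}"

def deploy_model_cmd(endpoint_id: str, region: str, model_id: str,
                     machine_type: str = "n1-standard-4",
                     min_replicas: int = 1, max_replicas: int = 2) -> str:
    return (
        f"gcloud ai endpoints deploy-model {endpoint_id} "
        f"--region={region} --model={model_id} --display-name=my-deployed-model "
        f"--machine-type={machine_type} "
        f"--min-replica-count={min_replicas} --max-replica-count={max_replicas} "
        f"--traffic-split=0=100"
    )

print(endpoint_create_cmd("us-central1", "my-endpoint"))
print(deploy_model_cmd("1234567890", "us-central1", "9876543210"))
```

Because the first command returns the new endpoint's ID, the second command can only run after the first completes.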
Deploy a model to a Private Service Connect endpoint
For details, see Use Private Service Connect endpoints for online inference.
Use a rolling deployment to update a deployed model
You can use a rolling deployment to replace a deployed model with a new version of the same model. The new model reuses the compute resources from the previous one. For details, see Use a rolling deployment to replace a deployed model.
Undeploy a model and delete the endpoint
You can undeploy a model and delete the endpoint. For details, see Undeploy a model and delete the endpoint.
Reasons to deploy more than one model to the same endpoint
Deploying two models to the same endpoint lets you gradually replace one model with the other. For example, suppose you're using a model, and find a way to increase the accuracy of that model with new training data. However, you don't want to update your application to point to a new endpoint URL, and you don't want to create sudden changes in your application. You can add the new model to the same endpoint, serving a small percentage of traffic, and gradually increase the traffic split for the new model until it is serving 100% of the traffic.
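One way to picture this gradual cutover is as a sequence of traffic splits that always sum to 100. A hypothetical helper (the deployed-model IDs and step sizes are illustrative, not a Vertex AI API; each split would be applied by updating the endpoint's traffic split):

```python
# Illustrative rollout schedule for shifting traffic from an old deployed model
# to a new one on the same endpoint. The IDs and step sizes are hypothetical;
# in Vertex AI, each stage corresponds to updating the endpoint's traffic split.

def rollout_schedule(old_id: str, new_id: str, steps=(10, 25, 50, 100)):
    """Yield traffic splits that move traffic to new_id in stages."""
    for pct in steps:
        yield {old_id: 100 - pct, new_id: pct}

for split in rollout_schedule("model-v1", "model-v2"):
    assert sum(split.values()) == 100  # every split must cover all traffic
    print(split)
```

In the final stage the old model receives 0% of traffic and can be undeployed without affecting the application.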
Because the resources are associated with the model rather than the endpoint, you could deploy models of different types to the same endpoint. However, the best practice is to deploy models of a specific type (for example, AutoML tabular or custom-trained) to an endpoint. This configuration is easier to manage.
Reasons to deploy a model to more than one endpoint
You might want to deploy your models with different resources for different application environments, such as testing and production. You might also want to support different SLOs for your inference requests. Perhaps one of your applications has much higher performance needs than the others. In this case, you can deploy that model to a higher-performance endpoint with more machine resources. To optimize costs, you can also deploy the model to a lower-performance endpoint with fewer machine resources.
Scaling behavior
Vertex AI Inference autoscaling scales the number of inference nodes based on the number of concurrent requests. This lets you dynamically adjust to changing request loads while managing costs. For more information, see Scale inference nodes for Vertex AI Inference.
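A back-of-the-envelope view of this behavior, assuming a target number of concurrent requests per node: the target value and the formula below are illustrative only, not Vertex AI's actual scaling algorithm, and the replica bounds correspond to the deployment's min/max replica counts:

```python
import math

# Illustrative autoscaling calculation: size the node count from concurrent
# requests, clamped to the deployment's min/max replica bounds. The target of
# 10 concurrent requests per node is a made-up value, not Vertex AI's actual
# scaling algorithm.

def target_nodes(concurrent_requests: int, per_node_target: int = 10,
                 min_replicas: int = 1, max_replicas: int = 5) -> int:
    needed = math.ceil(concurrent_requests / per_node_target)
    return max(min_replicas, min(max_replicas, needed))

print(target_nodes(0))    # idle load still keeps min_replicas nodes running
print(target_nodes(34))   # scales up with concurrent load
print(target_nodes(999))  # capped at max_replicas
```

The min replica count keeps latency low for the first request after an idle period, while the max replica count bounds cost under load spikes.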
What's next
- Choose an endpoint type.
- Deploy a model by using the Google Cloud console.
- Learn about Inference request-response logging for dedicated endpoints and Private Service Connect endpoints.
- Learn how to get an online inference.
- Learn how to change the default settings for inference logging.
Last updated 2025-11-24 UTC.