gcloud ai endpoints deploy-model
- NAME
- gcloud ai endpoints deploy-model - deploy a model to an existing Vertex AI endpoint
- SYNOPSIS
gcloud ai endpoints deploy-model (ENDPOINT : --region=REGION) --display-name=DISPLAY_NAME --model=MODEL [--accelerator=[count=COUNT],[type=TYPE]] [--autoscaling-metric-specs=[METRIC-NAME=TARGET,…]] [--deployed-model-id=DEPLOYED_MODEL_ID] [--disable-container-logging] [--enable-access-logging] [--gpu-partition-size=GPU_PARTITION_SIZE] [--machine-type=MACHINE_TYPE] [--max-replica-count=MAX_REPLICA_COUNT] [--min-replica-count=MIN_REPLICA_COUNT] [--required-replica-count=REQUIRED_REPLICA_COUNT] [--reservation-affinity=[key=KEY],[reservation-affinity-type=RESERVATION-AFFINITY-TYPE],[values=VALUES]] [--service-account=SERVICE_ACCOUNT] [--spot] [--traffic-split=[DEPLOYED_MODEL_ID=VALUE,…]] [GCLOUD_WIDE_FLAG …]
- EXAMPLES
- To deploy a model 456 to an endpoint 123 under project example in region us-central1, run:

gcloud ai endpoints deploy-model 123 --project=example --region=us-central1 --model=456 --display-name=my_deployed_model

- POSITIONAL ARGUMENTS
- Endpoint resource - The endpoint to deploy a model to. The arguments in this group can be used to specify the attributes of this resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways.

To set the project attribute:
- provide the argument endpoint on the command line with a fully specified name;
- provide the argument --project on the command line;
- set the property core/project.

This must be specified.
ENDPOINT - ID of the endpoint or fully qualified identifier for the endpoint.

To set the name attribute:
- provide the argument endpoint on the command line.

This positional argument must be specified if any of the other arguments in this group are specified.
--region=REGION - Cloud region for the endpoint.

To set the region attribute:
- provide the argument endpoint on the command line with a fully specified name;
- provide the argument --region on the command line;
- set the property ai/region;
- choose one from the prompted list of available regions.
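For example, both properties can be set once per gcloud configuration with the standard property commands (the project ID example and region us-central1 below are placeholders):

gcloud config set project example
gcloud config set ai/region us-central1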
- REQUIRED FLAGS
--display-name=DISPLAY_NAME - Display name of the deployed model.
--model=MODEL - ID of the uploaded model.
- OPTIONAL FLAGS
--accelerator=[count=COUNT],[type=TYPE] - Manage the accelerator config for GPU serving. When deploying a model with Compute Engine Machine Types, a GPU accelerator may also be selected.
type - The type of the accelerator. Choices are 'nvidia-a100-80gb', 'nvidia-b200', 'nvidia-gb200', 'nvidia-h100-80gb', 'nvidia-h100-mega-80gb', 'nvidia-h200-141gb', 'nvidia-l4', 'nvidia-rtx-pro-6000', 'nvidia-tesla-a100', 'nvidia-tesla-k80', 'nvidia-tesla-p100', 'nvidia-tesla-p4', 'nvidia-tesla-t4', 'nvidia-tesla-v100'.
count - The number of accelerators to attach to each machine running the job. This is usually 1. If not specified, the default value is 1.
For example:
--accelerator=type=nvidia-tesla-k80,count=1
--autoscaling-metric-specs=[METRIC-NAME=TARGET,…] - Metric specifications that control autoscaling behavior. At most one entry is allowed per metric.
METRIC-NAME - Resource metric name. Choices are 'cpu-usage', 'gpu-duty-cycle', 'request-counts-per-minute'.
TARGET - Target value for the given metric. For cpu-usage and gpu-duty-cycle, the target is the target resource utilization in percentage (1% - 100%). For request-counts-per-minute, the target is the number of requests per minute per replica.
For example, to set target CPU usage to 70% and target requests to 600 per minute per replica:
--autoscaling-metric-specs=cpu-usage=70,request-counts-per-minute=600
--deployed-model-id=DEPLOYED_MODEL_ID - User-specified ID of the deployed model.
--disable-container-logging - For custom-trained Models and AutoML Tabular Models, the container of the deployed model instances sends stderr and stdout streams to Cloud Logging by default. Note that these logs incur cost, which is subject to Cloud Logging pricing. Set this flag to disable container logging.
--enable-access-logging - If true, online prediction access logs are sent to Cloud Logging.
These logs are standard server access logs, containing information like timestamp and latency for each prediction request.
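As an illustrative sketch, both logging behaviors can be set at deploy time; the endpoint ID 123 and model ID 456 below are placeholders:

gcloud ai endpoints deploy-model 123 --region=us-central1 --model=456 --display-name=my_deployed_model --disable-container-logging --enable-access-logging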
--gpu-partition-size=GPU_PARTITION_SIZE - The partition size of the GPU accelerator. This can be used to partition a single GPU into multiple smaller GPU instances. See https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi#multi-instance_gpu_partitions for more details.
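For example, assuming an NVIDIA A100 accelerator, 1g.5gb is one of its multi-instance GPU partition sizes (consult the page linked above for the sizes valid for your accelerator):

--gpu-partition-size=1g.5gb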
--machine-type=MACHINE_TYPE - The machine resources to be used for each node of this deployment. For available machine types, see https://cloud.google.com/ai-platform-unified/docs/predictions/machine-types.
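For example, assuming n1-standard-4 is among the machine types supported in your region:

--machine-type=n1-standard-4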
--max-replica-count=MAX_REPLICA_COUNT - Maximum number of machine replicas for the deployment resources the model will be deployed on.
--min-replica-count=MIN_REPLICA_COUNT - Minimum number of machine replicas for the deployment resources the model will be deployed on. For normal deployments, the value must be equal to or larger than 1. If the value is 0, the deployment will be enrolled in the scale-to-zero feature. If not specified and the uploaded model uses dedicated resources, the default value is 1.
NOTE: DeploymentResourcePools (model co-hosting) are currently not supported for scale-to-zero deployments.
--required-replica-count=REQUIRED_REPLICA_COUNT - Required number of machine replicas at which the model is considered successfully deployed. This value must be greater than or equal to 1 and less than or equal to min-replica-count.
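A minimal sketch combining the replica flags (the endpoint ID 123, model ID 456, and counts are placeholders; note that required-replica-count may not exceed min-replica-count):

gcloud ai endpoints deploy-model 123 --region=us-central1 --model=456 --display-name=my_deployed_model --min-replica-count=2 --max-replica-count=10 --required-replica-count=1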
--reservation-affinity=[key=KEY],[reservation-affinity-type=RESERVATION-AFFINITY-TYPE],[values=VALUES] - A ReservationAffinity can be used to configure a Vertex AI resource (e.g., a DeployedModel) to draw its Compute Engine resources from a Shared Reservation, or exclusively from on-demand capacity.
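A hedged sketch targeting a specific shared reservation. The reservation path is a placeholder, the key compute.googleapis.com/reservation-name follows the Compute Engine convention for named reservations, and the exact spelling accepted for reservation-affinity-type should be confirmed with gcloud ai endpoints deploy-model --help:

--reservation-affinity=reservation-affinity-type=SPECIFIC_RESERVATION,key=compute.googleapis.com/reservation-name,values=projects/example/zones/us-central1-a/reservations/my-reservation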
--service-account=SERVICE_ACCOUNT - Service account that the deployed model's container runs as. Specify the email address of the service account. If this service account is not specified, the container runs as a service account that doesn't have access to the resource project.
--spot - If true, schedule the deployment workload on Spot VMs.
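For example, to run the serving container as a dedicated service account on Spot VMs (the service account email, endpoint ID, and model ID are placeholders):

gcloud ai endpoints deploy-model 123 --region=us-central1 --model=456 --display-name=my_deployed_model --service-account=serving-sa@example.iam.gserviceaccount.com --spot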
--traffic-split=[DEPLOYED_MODEL_ID=VALUE,…] - List of pairs of deployed model id and value to set as traffic split.
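For example, to route 80% of traffic to one deployed model and 20% to another (both numeric IDs below are placeholder deployed-model IDs):

--traffic-split=1234567890=80,0987654321=20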
- GCLOUD WIDE FLAGS
- These flags are available to all commands:
--access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity. Run $ gcloud help for details.
- NOTES
- These variants are also available:
gcloud alpha ai endpoints deploy-model
gcloud beta ai endpoints deploy-model