gcloud alpha ai endpoints deploy-model

NAME
gcloud alpha ai endpoints deploy-model - deploy a model to an existing Vertex AI endpoint
SYNOPSIS
gcloud alpha ai endpoints deploy-model (ENDPOINT : --region=REGION) --display-name=DISPLAY_NAME --model=MODEL [--accelerator=[count=COUNT],[type=TYPE]] [--autoscaling-metric-specs=[METRIC-NAME=TARGET,…]] [--deployed-model-id=DEPLOYED_MODEL_ID] [--enable-access-logging] [--enable-container-logging] [--gpu-partition-size=GPU_PARTITION_SIZE] [--idle-scaledown-period=IDLE_SCALEDOWN_PERIOD] [--initial-replica-count=INITIAL_REPLICA_COUNT] [--machine-type=MACHINE_TYPE] [--max-replica-count=MAX_REPLICA_COUNT] [--min-replica-count=MIN_REPLICA_COUNT] [--min-scaleup-period=MIN_SCALEUP_PERIOD] [--multihost-gpu-node-count=MULTIHOST_GPU_NODE_COUNT] [--required-replica-count=REQUIRED_REPLICA_COUNT] [--reservation-affinity=[key=KEY],[reservation-affinity-type=RESERVATION-AFFINITY-TYPE],[values=VALUES]] [--service-account=SERVICE_ACCOUNT] [--spot] [--tpu-topology=TPU_TOPOLOGY] [--traffic-split=[DEPLOYED_MODEL_ID=VALUE,…]] [--shared-resources=SHARED_RESOURCES : --shared-resources-region=SHARED_RESOURCES_REGION] [GCLOUD_WIDE_FLAG …]
EXAMPLES
To deploy a model with ID 456 to an endpoint with ID 123 under project example in region us-central1, run:
gcloud alpha ai endpoints deploy-model 123 --project=example --region=us-central1 --model=456 --display-name=my_deployed_model
POSITIONAL ARGUMENTS
Endpoint resource - The endpoint to deploy a model to. The arguments in this group can be used to specify the attributes of this resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways.

To set the project attribute:

  • provide the argument endpoint on the command line with a fully specified name;
  • provide the argument --project on the command line;
  • set the property core/project.

This must be specified.

ENDPOINT
ID of the endpoint or fully qualified identifier for the endpoint.

To set the name attribute:

  • provide the argument endpoint on the command line.

This positional argument must be specified if any of the other arguments in this group are specified.

--region=REGION
Cloud region for the endpoint.

To set the region attribute:

  • provide the argument endpoint on the command line with a fully specified name;
  • provide the argument --region on the command line;
  • set the property ai/region;
  • choose one from the prompted list of available regions.
REQUIRED FLAGS
--display-name=DISPLAY_NAME
Display name of the deployed model.
--model=MODEL
ID of the uploaded model. The alpha and beta tracks also support GDC connected models.
OPTIONAL FLAGS
--accelerator=[count=COUNT],[type=TYPE]
Manage the accelerator config for GPU serving. When deploying a model with Compute Engine Machine Types, a GPU accelerator may also be selected.
type
The type of the accelerator. Choices are 'nvidia-a100-80gb', 'nvidia-b200', 'nvidia-gb200', 'nvidia-h100-80gb', 'nvidia-h100-mega-80gb', 'nvidia-h200-141gb', 'nvidia-l4', 'nvidia-rtx-pro-6000', 'nvidia-tesla-a100', 'nvidia-tesla-k80', 'nvidia-tesla-p100', 'nvidia-tesla-p4', 'nvidia-tesla-t4', 'nvidia-tesla-v100'.
count
The number of accelerators to attach to each machine running the job. This is usually 1. If not specified, the default value is 1.

For example: --accelerator=type=nvidia-tesla-k80,count=1

--autoscaling-metric-specs=[METRIC-NAME=TARGET,…]
Metric specifications that control autoscaling behavior. At most one entry is allowed per metric.
METRIC-NAME
Resource metric name. Choices are 'cpu-usage', 'gpu-duty-cycle', 'request-counts-per-minute'.
TARGET
Target value for the given metric. For cpu-usage and gpu-duty-cycle, the target is the target resource utilization in percentage (1% - 100%). For request-counts-per-minute, the target is the number of requests per minute per replica.

For example, to set target CPU usage to 70% and target requests to 600 per minute per replica: --autoscaling-metric-specs=cpu-usage=70,request-counts-per-minute=600

--deployed-model-id=DEPLOYED_MODEL_ID
User-specified ID of the deployed-model.
--enable-access-logging
If true, online prediction access logs are sent to Cloud Logging.

These logs are standard server access logs, containing information like timestamp and latency for each prediction request.

--enable-container-logging
If true, the container of the deployed model instances will send stderr and stdout streams to Cloud Logging.

Currently, only supported for custom-trained Models and AutoML Tabular Models.

--gpu-partition-size=GPU_PARTITION_SIZE
The partition size of the GPU accelerator. This can be used to partition a single GPU into multiple smaller GPU instances. See https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi#multi-instance_gpu_partitions for more details.
--idle-scaledown-period=IDLE_SCALEDOWN_PERIOD
Duration (in seconds) without traffic before a deployment is scaled down to zero replicas. Defaults to 1 hour if min replica count is 0.
--initial-replica-count=INITIAL_REPLICA_COUNT
Initial number of replicas for the deployment resources the model will be scaled up to. Cannot be smaller than min replica count or larger than max replica count.
--machine-type=MACHINE_TYPE
The machine resources to be used for each node of this deployment. For available machine types, see https://cloud.google.com/ai-platform-unified/docs/predictions/machine-types.
--max-replica-count=MAX_REPLICA_COUNT
Maximum number of machine replicas for the deployment resources the model will be deployed on.
--min-replica-count=MIN_REPLICA_COUNT
Minimum number of machine replicas for the deployment resources the model will be deployed on. For normal deployments, the value must be equal to or larger than 1. If the value is 0, the deployment will be enrolled in the scale-to-zero feature. If not specified and the uploaded models use dedicated resources, the default value is 1.

NOTE: DeploymentResourcePools (model-cohosting) is currently not supported for scale-to-zero deployments.
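For instance, a scale-to-zero deployment might be sketched as follows (the endpoint ID 123, model ID 456, project example, and machine type are hypothetical placeholders):

```shell
# --min-replica-count=0 enrolls the deployment in scale-to-zero;
# --idle-scaledown-period=1800 scales down after 30 minutes without
# traffic instead of the default 1 hour.
gcloud alpha ai endpoints deploy-model 123 \
  --project=example \
  --region=us-central1 \
  --model=456 \
  --display-name=my_deployed_model \
  --machine-type=n1-standard-4 \
  --min-replica-count=0 \
  --max-replica-count=2 \
  --idle-scaledown-period=1800
```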

--min-scaleup-period=MIN_SCALEUP_PERIOD
Minimum duration (in seconds) that a deployment will be scaled up before traffic is evaluated for potential scale-down. Defaults to 1 hour if min replica count is 0.
--multihost-gpu-node-count=MULTIHOST_GPU_NODE_COUNT
The number of nodes per replica for multihost GPU deployments. Required for multihost GPU deployments.
--required-replica-count=REQUIRED_REPLICA_COUNT
Number of machine replicas that must be ready for the model to be considered successfully deployed. This value must be greater than or equal to 1 and less than or equal to min-replica-count.
--reservation-affinity=[key=KEY],[reservation-affinity-type=RESERVATION-AFFINITY-TYPE],[values=VALUES]
A ReservationAffinity can be used to configure a Vertex AI resource (e.g., a DeployedModel) to draw its Compute Engine resources from a Shared Reservation, or exclusively from on-demand capacity.
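For example, a sketch that pins the deployment to a specific shared reservation. The reservation name my-reservation and the other IDs are assumptions, and the key compute.googleapis.com/reservation-name follows the Compute Engine reservation-affinity convention:

```shell
# Draw capacity exclusively from the named Compute Engine reservation.
gcloud alpha ai endpoints deploy-model 123 \
  --project=example \
  --region=us-central1 \
  --model=456 \
  --display-name=my_deployed_model \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1 \
  --reservation-affinity=reservation-affinity-type=specific-reservation,key=compute.googleapis.com/reservation-name,values=my-reservation
```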
--service-account=SERVICE_ACCOUNT
Service account that the deployed model's container runs as. Specify the email address of the service account. If this service account is not specified, the container runs as a service account that doesn't have access to the resource project.
--spot
If true, schedule the deployment workload on Spot VMs.
--tpu-topology=TPU_TOPOLOGY
Cloud TPU topology to use for this deployment. Required for multihost Cloud TPU deployments: https://cloud.google.com/kubernetes-engine/docs/concepts/tpus#topology.
--traffic-split=[DEPLOYED_MODEL_ID=VALUE,…]
List of pairs of deployed model id and value to set as traffic split.
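For example, assuming two deployed models with hypothetical IDs 111 and 222 on the endpoint, the following sends 80% of traffic to the first and 20% to the second:

```shell
# Values are percentages and should sum to 100 across all entries.
gcloud alpha ai endpoints deploy-model 123 \
  --project=example \
  --region=us-central1 \
  --model=456 \
  --display-name=my_deployed_model \
  --traffic-split=111=80,222=20
```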
Deployment resource pool resource - The deployment resource pool to co-host a model on. The arguments in this group can be used to specify the attributes of this resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways.

To set the project attribute:

  • provide the argument --shared-resources on the command line with a fully specified name;
  • provide the argument --project on the command line;
  • set the property core/project.
--shared-resources=SHARED_RESOURCES
ID of the deployment_resource_pool or fully qualified identifier for the deployment_resource_pool.

To set the name attribute:

  • provide the argument --shared-resources on the command line.

This flag argument must be specified if any of the other arguments in this group are specified.

--shared-resources-region=SHARED_RESOURCES_REGION
Cloud region for the deployment_resource_pool.

To set the region attribute:

  • provide the argument --shared-resources on the command line with a fully specified name;
  • provide the argument --shared-resources-region on the command line;
  • provide the argument --region on the command line;
  • set the property ai/region;
  • choose one from the prompted list of available regions.
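Putting this group together, a co-hosting sketch might look as follows (the pool ID my-resource-pool is hypothetical and the deployment resource pool is assumed to already exist):

```shell
# Deploy onto an existing DeploymentResourcePool instead of
# dedicated per-deployment machine resources.
gcloud alpha ai endpoints deploy-model 123 \
  --project=example \
  --region=us-central1 \
  --model=456 \
  --display-name=my_deployed_model \
  --shared-resources=my-resource-pool \
  --shared-resources-region=us-central1
```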
GCLOUD WIDE FLAGS
These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.

Run $ gcloud help for details.

NOTES
This command is currently in alpha and might change without notice. If this command fails with API permission errors despite specifying the correct project, you might be trying to access an API with an invitation-only early access allowlist. These variants are also available:
gcloud ai endpoints deploy-model
gcloud beta ai endpoints deploy-model

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-09 UTC.