Configure compute resources for inference
Vertex AI allocates nodes to handle online and batch inferences. When you deploy a custom-trained model or AutoML model to an `Endpoint` resource to serve online inferences, or when you request batch inferences, you can customize the type of virtual machine that the inference service uses for these nodes. You can optionally configure inference nodes to use GPUs.
Machine types differ in a few ways:
- Number of virtual CPUs (vCPUs) per node
- Amount of memory per node
- Pricing
By selecting a machine type with more computing resources, you can serve inferences with lower latency or handle more inference requests at the same time.
Manage cost and availability
To help manage costs or ensure availability of VM resources, Vertex AI provides the following:

- To help ensure that you pay only for the computing resources that you need, you can use Vertex AI Inference autoscaling (see the sketch after this list). For more information, see Scale inference nodes for Vertex AI Inference.
- To make sure that VM resources are available when your inference jobs need them, you can use Compute Engine reservations. Reservations provide a high level of assurance in obtaining capacity for Compute Engine resources. For more information, see Use reservations with inference.
- To reduce the cost of running your inference jobs, you can use Spot VMs. Spot VMs are virtual machine (VM) instances that are excess Compute Engine capacity. Spot VMs have significant discounts, but Compute Engine might preemptively stop or delete Spot VMs to reclaim the capacity at any time. For more information, see Use Spot VMs with inference.
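As a minimal sketch of the autoscaling option, the Vertex AI SDK for Python lets you set replica bounds and a target utilization at deploy time; the project, model ID, and threshold below are hypothetical placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical project

model = aiplatform.Model(model_name="1234567890")  # hypothetical model ID

# Replica count scales between min and max based on the CPU utilization target.
endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=5,
    autoscaling_target_cpu_utilization=60,  # hypothetical target
)
```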
Where to specify compute resources
Online inference
If you want to use a custom-trained model or an AutoML tabular model to serve online inferences, you must specify a machine type when you deploy the `Model` resource as a `DeployedModel` to an `Endpoint`. For other types of AutoML models, Vertex AI configures the machine types automatically.

Specify the machine type (and, optionally, GPU configuration) in the `dedicatedResources.machineSpec` field of your `DeployedModel`.
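For illustration, here is a minimal sketch of how the `dedicatedResources.machineSpec` field can be expressed with the `google-cloud-aiplatform` protobuf types; the specific machine and GPU choices are placeholders, not recommendations.

```python
from google.cloud.aiplatform_v1 import types

# machineSpec selects the VM shape and any optional accelerators.
# Python field names are snake_case versions of the JSON fields.
machine_spec = types.MachineSpec(
    machine_type="n1-standard-8",
    accelerator_type=types.AcceleratorType.NVIDIA_TESLA_T4,
    accelerator_count=1,
)

# dedicatedResources wraps machineSpec with replica bounds for the DeployedModel.
dedicated_resources = types.DedicatedResources(
    machine_spec=machine_spec,
    min_replica_count=1,
    max_replica_count=2,
)
```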
Learn how to deploy each model type:
- Deploy an AutoML tabular model in Google Cloud console
- Deploy a custom-trained model in Google Cloud console
- Deploy a custom-trained model using client libraries
Batch inference
If you want to get batch inferences from a custom-trained model or an AutoML tabular model, you must specify a machine type when you create a `BatchPredictionJob` resource. Specify the machine type (and, optionally, GPU configuration) in the `dedicatedResources.machineSpec` field of your `BatchPredictionJob`.
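As a hedged example, the Vertex AI SDK for Python exposes these fields through `Model.batch_predict`; the bucket paths, model ID, and machine choice below are hypothetical.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical project
model = aiplatform.Model(model_name="1234567890")  # hypothetical model ID

# machine_type, accelerator_type, and accelerator_count populate
# dedicatedResources.machineSpec on the resulting BatchPredictionJob.
batch_job = model.batch_predict(
    job_display_name="demo-batch-job",
    gcs_source="gs://my-bucket/input.jsonl",          # hypothetical input
    gcs_destination_prefix="gs://my-bucket/output/",  # hypothetical output
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    starting_replica_count=1,
    max_replica_count=4,
)
```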
Machine types
The following tables compare the available machine types for serving inferences from custom-trained models and AutoML tabular models.
For information about TPU accelerator types, see Deploy a model to Cloud TPU VMs.
Machine types: CPU
E2 Series
| Name | vCPUs | Memory (GB) |
|---|---|---|
e2-standard-2 | 2 | 8 |
e2-standard-4 | 4 | 16 |
e2-standard-8 | 8 | 32 |
e2-standard-16 | 16 | 64 |
e2-standard-32 | 32 | 128 |
e2-highmem-2 | 2 | 16 |
e2-highmem-4 | 4 | 32 |
e2-highmem-8 | 8 | 64 |
e2-highmem-16 | 16 | 128 |
e2-highcpu-2 | 2 | 2 |
e2-highcpu-4 | 4 | 4 |
e2-highcpu-8 | 8 | 8 |
e2-highcpu-16 | 16 | 16 |
e2-highcpu-32 | 32 | 32 |
N1 Series
| Name | vCPUs | Memory (GB) |
|---|---|---|
n1-standard-2 | 2 | 7.5 |
n1-standard-4 | 4 | 15 |
n1-standard-8 | 8 | 30 |
n1-standard-16 | 16 | 60 |
n1-standard-32 | 32 | 120 |
n1-highmem-2 | 2 | 13 |
n1-highmem-4 | 4 | 26 |
n1-highmem-8 | 8 | 52 |
n1-highmem-16 | 16 | 104 |
n1-highmem-32 | 32 | 208 |
n1-highcpu-4 | 4 | 3.6 |
n1-highcpu-8 | 8 | 7.2 |
n1-highcpu-16 | 16 | 14.4 |
n1-highcpu-32 | 32 | 28.8 |
N2 Series
| Name | vCPUs | Memory (GB) |
|---|---|---|
n2-standard-2 | 2 | 8 |
n2-standard-4 | 4 | 16 |
n2-standard-8 | 8 | 32 |
n2-standard-16 | 16 | 64 |
n2-standard-32 | 32 | 128 |
n2-standard-48 | 48 | 192 |
n2-standard-64 | 64 | 256 |
n2-standard-80 | 80 | 320 |
n2-standard-96 | 96 | 384 |
n2-standard-128 | 128 | 512 |
n2-highmem-2 | 2 | 16 |
n2-highmem-4 | 4 | 32 |
n2-highmem-8 | 8 | 64 |
n2-highmem-16 | 16 | 128 |
n2-highmem-32 | 32 | 256 |
n2-highmem-48 | 48 | 384 |
n2-highmem-64 | 64 | 512 |
n2-highmem-80 | 80 | 640 |
n2-highmem-96 | 96 | 768 |
n2-highmem-128 | 128 | 864 |
n2-highcpu-2 | 2 | 2 |
n2-highcpu-4 | 4 | 4 |
n2-highcpu-8 | 8 | 8 |
n2-highcpu-16 | 16 | 16 |
n2-highcpu-32 | 32 | 32 |
n2-highcpu-48 | 48 | 48 |
n2-highcpu-64 | 64 | 64 |
n2-highcpu-80 | 80 | 80 |
n2-highcpu-96 | 96 | 96 |
N2D Series
| Name | vCPUs | Memory (GB) |
|---|---|---|
n2d-standard-2 | 2 | 8 |
n2d-standard-4 | 4 | 16 |
n2d-standard-8 | 8 | 32 |
n2d-standard-16 | 16 | 64 |
n2d-standard-32 | 32 | 128 |
n2d-standard-48 | 48 | 192 |
n2d-standard-64 | 64 | 256 |
n2d-standard-80 | 80 | 320 |
n2d-standard-96 | 96 | 384 |
n2d-standard-128 | 128 | 512 |
n2d-standard-224 | 224 | 896 |
n2d-highmem-2 | 2 | 16 |
n2d-highmem-4 | 4 | 32 |
n2d-highmem-8 | 8 | 64 |
n2d-highmem-16 | 16 | 128 |
n2d-highmem-32 | 32 | 256 |
n2d-highmem-48 | 48 | 384 |
n2d-highmem-64 | 64 | 512 |
n2d-highmem-80 | 80 | 640 |
n2d-highmem-96 | 96 | 768 |
n2d-highcpu-2 | 2 | 2 |
n2d-highcpu-4 | 4 | 4 |
n2d-highcpu-8 | 8 | 8 |
n2d-highcpu-16 | 16 | 16 |
n2d-highcpu-32 | 32 | 32 |
n2d-highcpu-48 | 48 | 48 |
n2d-highcpu-64 | 64 | 64 |
n2d-highcpu-80 | 80 | 80 |
n2d-highcpu-96 | 96 | 96 |
n2d-highcpu-128 | 128 | 128 |
n2d-highcpu-224 | 224 | 224 |
C2 Series
| Name | vCPUs | Memory (GB) |
|---|---|---|
c2-standard-4 | 4 | 16 |
c2-standard-8 | 8 | 32 |
c2-standard-16 | 16 | 64 |
c2-standard-30 | 30 | 120 |
c2-standard-60 | 60 | 240 |
C2D Series
| Name | vCPUs | Memory (GB) |
|---|---|---|
c2d-standard-2 | 2 | 8 |
c2d-standard-4 | 4 | 16 |
c2d-standard-8 | 8 | 32 |
c2d-standard-16 | 16 | 64 |
c2d-standard-32 | 32 | 128 |
c2d-standard-56 | 56 | 224 |
c2d-standard-112 | 112 | 448 |
c2d-highcpu-2 | 2 | 4 |
c2d-highcpu-4 | 4 | 8 |
c2d-highcpu-8 | 8 | 16 |
c2d-highcpu-16 | 16 | 32 |
c2d-highcpu-32 | 32 | 64 |
c2d-highcpu-56 | 56 | 112 |
c2d-highcpu-112 | 112 | 224 |
c2d-highmem-2 | 2 | 16 |
c2d-highmem-4 | 4 | 32 |
c2d-highmem-8 | 8 | 64 |
c2d-highmem-16 | 16 | 128 |
c2d-highmem-32 | 32 | 256 |
c2d-highmem-56 | 56 | 448 |
c2d-highmem-112 | 112 | 896 |
C3 Series
| Name | vCPUs | Memory (GB) |
|---|---|---|
c3-highcpu-4 | 4 | 8 |
c3-highcpu-8 | 8 | 16 |
c3-highcpu-22 | 22 | 44 |
c3-highcpu-44 | 44 | 88 |
c3-highcpu-88 | 88 | 176 |
c3-highcpu-176 | 176 | 352 |
Machine types: GPU
A2 Series
| Name | vCPUs | Memory (GB) | GPUs (NVIDIA A100) |
|---|---|---|---|
a2-highgpu-1g | 12 | 85 | 1 (A100 40GB) |
a2-highgpu-2g | 24 | 170 | 2 (A100 40GB) |
a2-highgpu-4g | 48 | 340 | 4 (A100 40GB) |
a2-highgpu-8g | 96 | 680 | 8 (A100 40GB) |
a2-megagpu-16g | 96 | 1360 | 16 (A100 40GB) |
a2-ultragpu-1g | 12 | 170 | 1 (A100 80GB) |
a2-ultragpu-2g | 24 | 340 | 2 (A100 80GB) |
a2-ultragpu-4g | 48 | 680 | 4 (A100 80GB) |
a2-ultragpu-8g | 96 | 1360 | 8 (A100 80GB) |
A3 Series
| Name | vCPUs | Memory (GB) | GPUs (NVIDIA H100 or H200) |
|---|---|---|---|
a3-highgpu-1g | 26 | 234 | 1 (H100 80GB) |
a3-highgpu-2g | 52 | 468 | 2 (H100 80GB) |
a3-highgpu-4g | 104 | 936 | 4 (H100 80GB) |
a3-highgpu-8g | 208 | 1872 | 8 (H100 80GB) |
a3-edgegpu-8g | 208 | 1872 | 8 (H100 80GB) |
a3-ultragpu-8g | 224 | 2952 | 8 (H200 141GB) |
A4 Series
| Name | vCPUs | Memory (GB) | GPUs (NVIDIA B200) |
|---|---|---|---|
a4-highgpu-8g | 224 | 3,968 | 8 |
A4X Series
| Name | vCPUs | Memory (GB) | GPUs (NVIDIA GB200) |
|---|---|---|---|
a4x-highgpu-4g | 140 | 884 | 4 |
G2 Series
| Name | vCPUs | Memory (GB) | GPUs (NVIDIA L4) |
|---|---|---|---|
g2-standard-4 | 4 | 16 | 1 |
g2-standard-8 | 8 | 32 | 1 |
g2-standard-12 | 12 | 48 | 1 |
g2-standard-16 | 16 | 64 | 1 |
g2-standard-24 | 24 | 96 | 2 |
g2-standard-32 | 32 | 128 | 1 |
g2-standard-48 | 48 | 192 | 4 |
g2-standard-96 | 96 | 384 | 8 |
G4 Series
| Name | vCPUs | Memory (GB) | GPUs (NVIDIA RTX PRO 6000) |
|---|---|---|---|
g4-standard-48 | 48 | 180 | 1 |
g4-standard-96 | 96 | 360 | 2 |
g4-standard-192 | 192 | 720 | 4 |
g4-standard-384 | 384 | 1440 | 8 |
Learn about pricing for each machine type. Read more about the detailed specifications of these machine types in the Compute Engine documentation about machine types.
Find the ideal machine type
Online inference
To find the ideal machine type for your use case, we recommend loading your model on multiple machine types and measuring characteristics such as latency, cost, concurrency, and throughput.
One way to do this is to run this notebook on multiple machine types and compare the results to find the one that works best for you.
Vertex AI reserves approximately 1 vCPU on each replica for running system processes. This means that running the notebook on a single-core machine type would be comparable to using a 2-core machine type for serving inferences.
When considering inference costs, remember that although larger machines cost more, they can lower overall cost because fewer replicas are required to serve the same workload. This is particularly evident for GPUs, which tend to cost more per hour, but can both provide lower latency and cost less overall.
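To make that comparison concrete, here is a small back-of-the-envelope calculation; every price and throughput figure in it is hypothetical and only illustrates why a pricier machine can be cheaper per request.

```python
# Hypothetical figures: a CPU machine at $0.20/hr serving 50 requests/s per
# replica versus a GPU machine at $1.00/hr serving 400 requests/s per replica.
import math

workload_rps = 1000  # hypothetical steady load


def hourly_cost(price_per_hour: float, rps_per_replica: float) -> float:
    # Round replicas up: you can't provision a fraction of a node.
    replicas = math.ceil(workload_rps / rps_per_replica)
    return replicas * price_per_hour


print(hourly_cost(0.20, 50))   # 20 replicas -> $4.00/hr
print(hourly_cost(1.00, 400))  # 3 replicas  -> $3.00/hr
```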
Batch inference
For more information, see Choose machine type and replica count.
Optional GPU accelerators
Some configurations, such as the A2 series and G2 series, have a fixed number of GPUs built in.
The A4X (a4x-highgpu-4g) series requires a minimum replica count of 18. This machine is purchased per rack and has a minimum of 18 VMs.
Other configurations, such as the N1 series, let you optionally add GPUs to accelerate each inference node.
Note: GPUs are not recommended for use with AutoML tabular models. For this type of model, GPUs don't provide a worthwhile performance benefit. Specifying GPUs during AutoML model deployment isn't supported in the Google Cloud console.

To add optional GPU accelerators, you must account for several requirements:
- You can only use GPUs when your `Model` resource is based on a TensorFlow SavedModel, or when you use a custom container that has been designed to take advantage of GPUs. You can't use GPUs for scikit-learn or XGBoost models.
- The availability of each type of GPU varies depending on which region you use for your model. Learn which types of GPUs are available in which regions.
- You can only use one type of GPU for your `DeployedModel` resource or `BatchPredictionJob`, and there are limitations on the number of GPUs you can add depending on which machine type you are using. The following table describes these limitations.
The following table shows the optional GPUs that are available for online inference and how many of each type of GPU you can use with each Compute Engine machine type:
| Machine type | NVIDIA Tesla P100 | NVIDIA Tesla V100 | NVIDIA Tesla P4 | NVIDIA Tesla T4 |
|---|---|---|---|---|
| n1-standard-2 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-standard-4 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-standard-8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-standard-16 | 1, 2, 4 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-standard-32 | 2, 4 | 4, 8 | 2, 4 | 2, 4 |
| n1-highmem-2 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-highmem-4 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-highmem-8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-highmem-16 | 1, 2, 4 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-highmem-32 | 2, 4 | 4, 8 | 2, 4 | 2, 4 |
| n1-highcpu-2 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-highcpu-4 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-highcpu-8 | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-highcpu-16 | 1, 2, 4 | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 |
| n1-highcpu-32 | 2, 4 | 4, 8 | 2, 4 | 2, 4 |
Optional GPUs incur additional costs.
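As a sketch of attaching an optional GPU within the limits in the table above, the high-level SDK accepts the accelerator as deploy arguments; the T4-on-n1-standard-16 pairing here is one valid combination from the table, not a recommendation, and the project and model IDs are hypothetical.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical project
model = aiplatform.Model(model_name="1234567890")  # hypothetical model ID

# Per the table above, n1-standard-16 allows 1, 2, or 4 NVIDIA Tesla T4 GPUs.
endpoint = model.deploy(
    machine_type="n1-standard-16",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=2,
    min_replica_count=1,
    max_replica_count=2,
)
```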
Coschedule multiple replicas on a single VM
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
To optimize the cost of your deployment, you can deploy multiple replicas of the same model onto a single VM equipped with multiple GPU hardware accelerators, such as the a3-highgpu-8g VM, which has eight NVIDIA H100 GPUs. Each model replica can be assigned to one or more GPUs.
For smaller workloads, you can also partition a single GPU into multiple smaller instances using NVIDIA multi-instance GPUs (MIG). This lets you assign resources at a sub-GPU level, maximizing the utilization of each accelerator. For more information on multi-instance GPUs, see the NVIDIA multi-instance GPU user guide.
Both of these capabilities are designed to provide more efficient resource utilization and greater cost-effectiveness for your serving workloads.
Limitations
This feature is subject to the following limitations:
- All of the coscheduled model replicas must be the same model version.
- Using deployment resource pools to share resources across deployments isn't supported.
Supported machine types
The following machine types are supported. Note that, for machine types that only have one GPU, no coscheduling is needed.
| Machine type | Coschedule | Coschedule + MIG |
|---|---|---|
| a2-highgpu-1g | N/A | Yes |
| a2-highgpu-2g | Yes | Yes |
| a2-highgpu-4g | Yes | Yes |
| a2-highgpu-8g | Yes | Yes |
| a2-megagpu-16g | Yes | Yes |
| a2-ultragpu-1g | N/A | Yes |
| a2-ultragpu-2g | Yes | Yes |
| a2-ultragpu-4g | Yes | Yes |
| a2-ultragpu-8g | Yes | Yes |
| a3-edgegpu-8g | Yes | Yes |
| a3-highgpu-1g | N/A | Yes |
| a3-highgpu-2g | Yes | Yes |
| a3-highgpu-4g | Yes | Yes |
| a3-highgpu-8g | Yes | Yes |
| a3-megagpu-8g | Yes | Yes |
| a3-ultragpu-8g | Yes | Yes |
| a4-highgpu-8g | Yes | Yes |
| a4x-highgpu-4g | Yes | Yes |
| g4-standard-48 | N/A | Yes |
| g4-standard-96 | Yes | Yes |
| g4-standard-192 | Yes | Yes |
| g4-standard-384 | Yes | Yes |
Prerequisites
Before using this feature, read Deploy a model by using the gcloud CLI or Vertex AI API.
Deploying the model replicas
The following samples demonstrate how to deploy coscheduled model replicas.
Note: In this preview, NVIDIA MIG is supported for the Vertex AI API and REST API, but not for the Google Cloud CLI.

Note: When MIG is enabled, you can't use GPU sharing, because each replica is limited to consuming MIG in a single GPU instance. Therefore, the `accelerator_count` must be set to 1 when a `gpu_partition_size` is specified.

gcloud
Use the following gcloud command to deploy coscheduled model replicas on a VM:
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --machine-type=MACHINE_TYPE \
  --accelerator=type=ACC_TYPE,count=ACC_COUNT \
  --traffic-split=0=100

Replace the following:
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
- MODEL_ID: The model ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the `DeployedModel`. You can use the display name of the `Model` for the `DeployedModel` as well.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to the maximum number of nodes and never fewer than this number of nodes.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to this number of nodes and never fewer than the minimum number of nodes. One VM is required for every 2 replicas to be deployed.
- MACHINE_TYPE: The type of VM to use for this deployment. Must be from the accelerator-optimized family.
- ACC_TYPE: The GPU accelerator type. Should correspond to the MACHINE_TYPE. For `a3-highgpu-8g`, use `nvidia-h100-80gb`.
- ACC_COUNT: The number of GPUs that each replica can use. Must be at least 1 and no more than the total number of GPUs in the machine.
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: The project ID.
- PROJECT_NUMBER: The project number.
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the `DeployedModel`. You can use the display name of the `Model` for the `DeployedModel` as well.
- MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is `n1-standard-2`. Learn more about machine types.
- ACC_TYPE: The GPU accelerator type. Should correspond to the `GPU_PARTITION_SIZE`.
- GPU_PARTITION_SIZE: The GPU partition size. For example, "1g.10gb".
- ACC_COUNT: The number of GPUs that each replica can use. Must be at least 1 and no more than the total number of GPUs in the machine.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment.The node count can be increased or decreased as required by the inference load,up to the maximum number of nodes and never fewer than this number of nodes.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment.The node count can be increased or decreased as required by the inference load,up to this number of nodes and never fewer than the minimum number of nodes.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{ "deployedModel": { "model": "projects/PROJECT_NUMBER/locations/LOCATION_ID/models/MODEL_ID", "displayName": "DEPLOYED_MODEL_NAME", "dedicatedResources": { "machineSpec": { "machineType": "MACHINE_TYPE", "acceleratorType": "ACC_TYPE", "gpuPartitionSize": "GPU_PARTITION_SIZE", "acceleratorCount": "ACC_COUNT"" }, "minReplicaCount":MIN_REPLICA_COUNT, "maxReplicaCount":MAX_REPLICA_COUNT, "autoscalingMetricSpecs": [ { "metricName": "aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle", "target": 70 } ] } }}To send your request, expand one of these options:
curl (Linux, macOS, or Cloud Shell)
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"
PowerShell (Windows)
Save the request body in a file named request.json, and execute the following command:
$headers = @{ }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel" | Select-Object -Expand ContentYou should receive a successful status code (2xx) and an empty response.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
Use the following Python command to deploy coscheduled model replicas on a VM.
endpoint.deploy(
    model=MODEL,
    machine_type=MACHINE_TYPE,
    min_replica_count=MIN_REPLICA_COUNT,
    max_replica_count=MAX_REPLICA_COUNT,
    accelerator_type=ACC_TYPE,
    gpu_partition_size=GPU_PARTITION_SIZE,
    accelerator_count=ACC_COUNT,
)

Replace the following:
- MODEL: The model object returned by the following API call: `model = aiplatform.Model(model_name=model_name)`
- MACHINE_TYPE: The type of VM to use for this deployment. Must be from the accelerator-optimized family. In the preview, only `a3-highgpu-8g` is supported.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to the maximum number of nodes and never fewer than this number of nodes.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to this number of nodes and never fewer than the minimum number of nodes.
- ACC_TYPE: The GPU accelerator type. Should correspond to the GPU_PARTITION_SIZE.
- GPU_PARTITION_SIZE: The GPU partition size. For example, `"1g.10gb"`. For a comprehensive list of supported partition sizes for each GPU type, see Multi-instance GPU partitions.
- ACC_COUNT: The number of GPUs that each replica can use. Must be at least 1 and no more than the total number of GPUs in the machine. For `a3-highgpu-8g`, specify between 1 and 8.
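Putting the pieces together, a hedged end-to-end version of this preview call might look like the following; the project, model, and endpoint identifiers are hypothetical, and per the note above, `accelerator_count` stays at 1 because a partition size is set.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical project

model = aiplatform.Model(model_name="1234567890")  # hypothetical model ID
endpoint = aiplatform.Endpoint("9876543210")       # hypothetical endpoint ID

# Each replica consumes one 1g.10gb MIG slice, so multiple replicas can be
# coscheduled onto a single a3-highgpu-8g VM.
endpoint.deploy(
    model=model,
    machine_type="a3-highgpu-8g",
    min_replica_count=2,
    max_replica_count=2,
    accelerator_type="NVIDIA_H100_80GB",
    gpu_partition_size="1g.10gb",
    accelerator_count=1,  # must be 1 when gpu_partition_size is specified
)
```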
Monitor VM usage
Use the following instructions to monitor the actual machine count for your deployed replicas in the Metrics Explorer.

1. In the Google Cloud console, go to the Metrics Explorer page.
2. Select the project you want to view metrics for.
3. From the Metric drop-down menu, click Select a metric.
4. In the Filter by resource or metric name search bar, enter Vertex AI Endpoint.
5. Select the Vertex AI Endpoint > Prediction metric category. Under Active metrics, select Machine count.
6. Click Apply.
Billing
Billing is based on the number of VMs that are used, not the number of GPUs. You can monitor your VM usage by using Metrics Explorer.
High availability
Because more than one replica is coscheduled on the same VM, Vertex AI Inference can't spread your deployment across multiple VMs, and therefore multiple zones, until your replica count exceeds what a single VM can hold. For high availability purposes, Google recommends deploying on at least two nodes (VMs).
What's next
- Deploy an AutoML tabular model in Google Cloud console
- Deploy a custom-trained model in Google Cloud console
- Deploy a custom-trained model using client libraries
- Get batch inferences