Configure compute resources for Vertex AI serverless training
When you perform serverless training, your training code runs on one or more virtual machine (VM) instances. You can configure what types of VM to use for training: using VMs with more compute resources can speed up training and let you work with larger datasets, but they can also incur greater training costs.
In some cases, you can additionally use GPUs to accelerate training. GPUs incur additional costs.
You can also optionally customize the type and size of your training VMs' boot disks.
This document describes the different compute resources that you can use for serverless training and how to configure them.
Manage cost and availability
To help manage costs or ensure availability of VM resources, Vertex AI provides the following:
- To ensure that VM resources are available when your training jobs need them, you can use Compute Engine reservations. Reservations provide a high level of assurance in obtaining capacity for Compute Engine resources. For more information, see Use reservations with training.
- To reduce the cost of running your training jobs, you can use Spot VMs. Spot VMs are virtual machine (VM) instances that are excess Compute Engine capacity. Spot VMs have significant discounts, but Compute Engine might preemptively stop or delete Spot VMs to reclaim the capacity at any time. For more information, see Use Spot VMs with training.
- For serverless training jobs that request GPU resources, Dynamic Workload Scheduler lets you schedule the jobs based on when the requested GPU resources become available. For more information, see Schedule training jobs based on resource availability.
Where to specify compute resources
Specify configuration details within a `WorkerPoolSpec`. Depending on how you perform serverless training, put this `WorkerPoolSpec` in one of the following API fields:
- If you are creating a `CustomJob` resource, specify the `WorkerPoolSpec` in `CustomJob.jobSpec.workerPoolSpecs`. If you are using the Google Cloud CLI, then you can use the `--worker-pool-spec` flag or the `--config` flag on the `gcloud ai custom-jobs create` command to specify worker pool options. Learn more about creating a `CustomJob`.
- If you are creating a `HyperparameterTuningJob` resource, specify the `WorkerPoolSpec` in `HyperparameterTuningJob.trialJobSpec.workerPoolSpecs`. If you are using the gcloud CLI, then you can use the `--config` flag on the `gcloud ai hp-tuning-jobs create` command to specify worker pool options. Learn more about creating a `HyperparameterTuningJob`.
- If you are creating a `TrainingPipeline` resource without hyperparameter tuning, specify the `WorkerPoolSpec` in `TrainingPipeline.trainingTaskInputs.workerPoolSpecs`. Learn more about creating a custom `TrainingPipeline`.
- If you are creating a `TrainingPipeline` with hyperparameter tuning, specify the `WorkerPoolSpec` in `TrainingPipeline.trainingTaskInputs.trialJobSpec.workerPoolSpecs`.
If you are performing distributed training, you can use different settings for each worker pool.
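For example, here is a minimal sketch of a job spec with two differently configured worker pools, using the dict-based `worker_pool_specs` format from the Python samples later on this page (the machine types and replica counts are illustrative):

```python
# Illustrative only: worker pool 0 runs the primary replica; worker pool 1
# runs the remaining workers on a different machine type. Field names match
# the worker_pool_specs samples later on this page.
job_spec = {
    "worker_pool_specs": [
        {
            "machine_spec": {"machine_type": "n1-standard-8"},
            "replica_count": 1,  # the primary replica
            "container_spec": {"image_uri": "CUSTOM_CONTAINER_IMAGE_URI"},
        },
        {
            "machine_spec": {"machine_type": "n1-highmem-16"},
            "replica_count": 3,  # additional workers
            "container_spec": {"image_uri": "CUSTOM_CONTAINER_IMAGE_URI"},
        },
    ]
}
```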
Machine types
In your `WorkerPoolSpec`, you must specify one of the following machine types in the `machineSpec.machineType` field. Each replica in the worker pool runs on a separate VM that has the specified machine type.
- `a4x-highgpu-4g`*, `a4-highgpu-8g`*
- `a3-ultragpu-8g`*, `a3-megagpu-8g`*, `a3-highgpu-1g`*, `a3-highgpu-2g`*, `a3-highgpu-4g`*, `a3-highgpu-8g`*
- `a2-ultragpu-1g`*, `a2-ultragpu-2g`*, `a2-ultragpu-4g`*, `a2-ultragpu-8g`*
- `a2-highgpu-1g`*, `a2-highgpu-2g`*, `a2-highgpu-4g`*, `a2-highgpu-8g`*, `a2-megagpu-16g`*
- `e2-standard-4`, `e2-standard-8`, `e2-standard-16`, `e2-standard-32`
- `e2-highmem-2`, `e2-highmem-4`, `e2-highmem-8`, `e2-highmem-16`
- `e2-highcpu-16`, `e2-highcpu-32`
- `n2-standard-4`, `n2-standard-8`, `n2-standard-16`, `n2-standard-32`, `n2-standard-48`, `n2-standard-64`, `n2-standard-80`
- `n2-highmem-2`, `n2-highmem-4`, `n2-highmem-8`, `n2-highmem-16`, `n2-highmem-32`, `n2-highmem-48`, `n2-highmem-64`, `n2-highmem-80`
- `n2-highcpu-16`, `n2-highcpu-32`, `n2-highcpu-48`, `n2-highcpu-64`, `n2-highcpu-80`
- `n1-standard-4`, `n1-standard-8`, `n1-standard-16`, `n1-standard-32`, `n1-standard-64`, `n1-standard-96`
- `n1-highmem-2`, `n1-highmem-4`, `n1-highmem-8`, `n1-highmem-16`, `n1-highmem-32`, `n1-highmem-64`, `n1-highmem-96`
- `n1-highcpu-16`, `n1-highcpu-32`, `n1-highcpu-64`, `n1-highcpu-96`
- `c2-standard-4`, `c2-standard-8`, `c2-standard-16`, `c2-standard-30`, `c2-standard-60`
- `ct5lp-hightpu-1t`*, `ct5lp-hightpu-4t`*, `ct5lp-hightpu-8t`*
- `m1-ultramem-40`, `m1-ultramem-80`, `m1-ultramem-160`, `m1-megamem-96`
- `g2-standard-4`*, `g2-standard-8`*, `g2-standard-12`*, `g2-standard-16`*, `g2-standard-24`*, `g2-standard-32`*, `g2-standard-48`*, `g2-standard-96`*
- `g4-standard-48`*, `g4-standard-96`*, `g4-standard-192`*, `g4-standard-384`*
- `cloud-tpu`*
* Machine types marked with asterisks in the preceding list must be used with certain GPUs or TPUs. See the following sections of this guide.
To learn about the technical specifications of each machine type, read the Compute Engine documentation about machine types. To learn about the cost of using each machine type for serverless training, read Pricing.
The following examples highlight where you specify a machine type when you create a `CustomJob`:
Console
In the Google Cloud console, you can't create a `CustomJob` directly. However, you can create a `TrainingPipeline` that creates a `CustomJob`. When you create a `TrainingPipeline` in the Google Cloud console, specify a machine type for each worker pool on the Compute and pricing step, in the Machine type field.
gcloud
```sh
gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI
```

Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
```java
import com.google.cloud.aiplatform.v1.AcceleratorType;
import com.google.cloud.aiplatform.v1.ContainerSpec;
import com.google.cloud.aiplatform.v1.CustomJob;
import com.google.cloud.aiplatform.v1.CustomJobSpec;
import com.google.cloud.aiplatform.v1.JobServiceClient;
import com.google.cloud.aiplatform.v1.JobServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.WorkerPoolSpec;
import java.io.IOException;

// Create a custom job to run machine learning training code in Vertex AI
public class CreateCustomJobSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String displayName = "DISPLAY_NAME";
    // Vertex AI runs your training application in a Docker container image. A Docker container
    // image is a self-contained software package that includes code and all dependencies. Learn
    // more about preparing your training application at
    // https://cloud.google.com/vertex-ai/docs/training/overview#prepare_your_training_application
    String containerImageUri = "CONTAINER_IMAGE_URI";
    createCustomJobSample(project, displayName, containerImageUri);
  }

  static void createCustomJobSample(String project, String displayName, String containerImageUri)
      throws IOException {
    JobServiceSettings settings =
        JobServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (JobServiceClient client = JobServiceClient.create(settings)) {
      MachineSpec machineSpec =
          MachineSpec.newBuilder()
              .setMachineType("n1-standard-4")
              .setAcceleratorType(AcceleratorType.NVIDIA_TESLA_T4)
              .setAcceleratorCount(1)
              .build();

      ContainerSpec containerSpec =
          ContainerSpec.newBuilder().setImageUri(containerImageUri).build();

      WorkerPoolSpec workerPoolSpec =
          WorkerPoolSpec.newBuilder()
              .setMachineSpec(machineSpec)
              .setReplicaCount(1)
              .setContainerSpec(containerSpec)
              .build();

      CustomJobSpec customJobSpecJobSpec =
          CustomJobSpec.newBuilder().addWorkerPoolSpecs(workerPoolSpec).build();

      CustomJob customJob =
          CustomJob.newBuilder()
              .setDisplayName(displayName)
              .setJobSpec(customJobSpecJobSpec)
              .build();
      LocationName parent = LocationName.of(project, location);
      CustomJob response = client.createCustomJob(parent, customJob);
      System.out.format("response: %s\n", response);
      System.out.format("Name: %s\n", response.getName());
    }
  }
}
```

Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
```javascript
/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */
// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_T4',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response:\n', JSON.stringify(response));
}
createCustomJob();
```

Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
```python
from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)
```

For more context, read the guide to creating a `CustomJob`.
GPUs
If you have written your training code to use GPUs, then you can configure your worker pool to use one or more GPUs on each VM. To use GPUs, you must use a machine type that supports them, such as the A2, A3, A4, A4X, G2, G4, and N1 series shown in the compatibility table later in this section. Additionally, using smaller machine types like `n1-highmem-2` with GPUs might cause logging to fail for some workloads because of CPU constraints. If your training job stops returning logs, consider selecting a larger machine type.
Vertex AI supports the following types of GPU for serverless training:
- `NVIDIA_GB200`+ (includes GPUDirect-RDMA)
- `NVIDIA_B200`* (includes GPUDirect-RDMA)
- `NVIDIA_H100_MEGA_80GB`* (includes GPUDirect-TCPXO)
- `NVIDIA_H100_80GB`
- `NVIDIA_H200_141GB`* (includes GPUDirect-RDMA)
- `NVIDIA_A100_80GB`
- `NVIDIA_TESLA_A100` (NVIDIA A100 40GB)
- `NVIDIA_TESLA_P4`
- `NVIDIA_TESLA_P100`
- `NVIDIA_TESLA_T4`
- `NVIDIA_TESLA_V100`
- `NVIDIA_L4`
- `NVIDIA_RTX_PRO_6000`
+ Requires obtaining capacity using shared reservations.
To learn more about the technical specifications of each type of GPU, read the Compute Engine documentation about GPUs for compute workloads. To learn about the cost of using each machine type for serverless training, read Pricing.
In your `WorkerPoolSpec`, specify the type of GPU that you want to use in the `machineSpec.acceleratorType` field and the number of GPUs that you want each VM in the worker pool to use in the `machineSpec.acceleratorCount` field. However, your choices for these fields must meet the following restrictions:
- The type of GPU that you choose must be available in the location where you are performing serverless training. Not all types of GPU are available in all regions. Learn about regional availability.
- You can only use certain numbers of GPUs in your configuration. For example, you can use 2 or 4 `NVIDIA_TESLA_T4` GPUs on a VM, but not 3. To see what `acceleratorCount` values are valid for each type of GPU, see the following compatibility table.
- You must make sure that your GPU configuration provides sufficient virtual CPUs and memory to the machine type that you use it with. For example, if you use the `n1-standard-32` machine type in your worker pool, then each VM has 32 virtual CPUs and 120 GB of memory. Since each `NVIDIA_TESLA_V100` GPU can support up to 12 virtual CPUs and 76 GB of memory, you must use at least 4 GPUs for each `n1-standard-32` VM to support its requirements. (2 GPUs provide insufficient resources, and you can't specify 3 GPUs.) The following compatibility table accounts for this requirement.
Note the following additional limitation on using GPUs for custom training that differs from using GPUs with Compute Engine:
- A configuration with 4 `NVIDIA_TESLA_P100` GPUs only provides up to 64 virtual CPUs and up to 208 GB of memory in all regions and zones.
For jobs that use Dynamic Workload Scheduler or Spot VMs, update the `scheduling.strategy` field of the `CustomJob` to the chosen strategy.
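As a sketch, here is how that field looks in the dict-based format of the Python samples on this page; the `SPOT` and `FLEX_START` enum values are assumptions to verify against the `Scheduling.Strategy` reference for your SDK version:

```python
from google.cloud import aiplatform

# Hedged sketch: request Spot VMs for the job. For Dynamic Workload
# Scheduler, the FLEX_START strategy would be used instead; verify both
# enum values against your SDK version's Scheduling.Strategy reference.
custom_job = {
    "display_name": "spot-training-job",
    "job_spec": {
        "worker_pool_specs": [
            {
                "machine_spec": {"machine_type": "n1-standard-8"},
                "replica_count": 1,
                "container_spec": {"image_uri": "CUSTOM_CONTAINER_IMAGE_URI"},
            }
        ],
        "scheduling": {"strategy": aiplatform.gapic.Scheduling.Strategy.SPOT},
    },
}
```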
The following compatibility table lists the valid values for `machineSpec.acceleratorCount` depending on your choices for `machineSpec.machineType` and `machineSpec.acceleratorType`:
Valid numbers of GPUs for each machine type:

| Machine type | NVIDIA_H100_MEGA_80GB | NVIDIA_H100_80GB | NVIDIA_A100_80GB | NVIDIA_TESLA_A100 | NVIDIA_TESLA_P4 | NVIDIA_TESLA_P100 | NVIDIA_TESLA_T4 | NVIDIA_TESLA_V100 | NVIDIA_L4 | NVIDIA_H200_141GB | NVIDIA_B200 | NVIDIA_GB200 | NVIDIA_RTX_PRO_6000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `a3-megagpu-8g` | 8 |  |  |  |  |  |  |  |  |  |  |  |  |
| `a3-highgpu-1g` |  | 1* |  |  |  |  |  |  |  |  |  |  |  |
| `a3-highgpu-2g` |  | 2* |  |  |  |  |  |  |  |  |  |  |  |
| `a3-highgpu-4g` |  | 4* |  |  |  |  |  |  |  |  |  |  |  |
| `a3-highgpu-8g` |  | 8 |  |  |  |  |  |  |  |  |  |  |  |
| `a3-ultragpu-8g` |  |  |  |  |  |  |  |  |  | 8 |  |  |  |
| `a4-highgpu-8g` |  |  |  |  |  |  |  |  |  |  | 8 |  |  |
| `a4x-highgpu-4g` |  |  |  |  |  |  |  |  |  |  |  | 4 |  |
| `a2-ultragpu-1g` |  |  | 1 |  |  |  |  |  |  |  |  |  |  |
| `a2-ultragpu-2g` |  |  | 2 |  |  |  |  |  |  |  |  |  |  |
| `a2-ultragpu-4g` |  |  | 4 |  |  |  |  |  |  |  |  |  |  |
| `a2-ultragpu-8g` |  |  | 8 |  |  |  |  |  |  |  |  |  |  |
| `a2-highgpu-1g` |  |  |  | 1 |  |  |  |  |  |  |  |  |  |
| `a2-highgpu-2g` |  |  |  | 2 |  |  |  |  |  |  |  |  |  |
| `a2-highgpu-4g` |  |  |  | 4 |  |  |  |  |  |  |  |  |  |
| `a2-highgpu-8g` |  |  |  | 8 |  |  |  |  |  |  |  |  |  |
| `a2-megagpu-16g` |  |  |  | 16 |  |  |  |  |  |  |  |  |  |
| `n1-standard-4` |  |  |  |  | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |  |  |  |  |  |
| `n1-standard-8` |  |  |  |  | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |  |  |  |  |  |
| `n1-standard-16` |  |  |  |  | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |  |  |  |  |  |
| `n1-standard-32` |  |  |  |  | 2, 4 | 2, 4 | 2, 4 | 4, 8 |  |  |  |  |  |
| `n1-standard-64` |  |  |  |  | 4 |  | 4 | 8 |  |  |  |  |  |
| `n1-standard-96` |  |  |  |  | 4 |  | 4 | 8 |  |  |  |  |  |
| `n1-highmem-2` |  |  |  |  | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |  |  |  |  |  |
| `n1-highmem-4` |  |  |  |  | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |  |  |  |  |  |
| `n1-highmem-8` |  |  |  |  | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |  |  |  |  |  |
| `n1-highmem-16` |  |  |  |  | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |  |  |  |  |  |
| `n1-highmem-32` |  |  |  |  | 2, 4 | 2, 4 | 2, 4 | 4, 8 |  |  |  |  |  |
| `n1-highmem-64` |  |  |  |  | 4 |  | 4 | 8 |  |  |  |  |  |
| `n1-highmem-96` |  |  |  |  | 4 |  | 4 | 8 |  |  |  |  |  |
| `n1-highcpu-16` |  |  |  |  | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |  |  |  |  |  |
| `n1-highcpu-32` |  |  |  |  | 2, 4 | 2, 4 | 2, 4 | 4, 8 |  |  |  |  |  |
| `n1-highcpu-64` |  |  |  |  | 4 | 4 | 4 | 8 |  |  |  |  |  |
| `n1-highcpu-96` |  |  |  |  | 4 |  | 4 | 8 |  |  |  |  |  |
| `g2-standard-4` |  |  |  |  |  |  |  |  | 1 |  |  |  |  |
| `g2-standard-8` |  |  |  |  |  |  |  |  | 1 |  |  |  |  |
| `g2-standard-12` |  |  |  |  |  |  |  |  | 1 |  |  |  |  |
| `g2-standard-16` |  |  |  |  |  |  |  |  | 1 |  |  |  |  |
| `g2-standard-24` |  |  |  |  |  |  |  |  | 2 |  |  |  |  |
| `g2-standard-32` |  |  |  |  |  |  |  |  | 1 |  |  |  |  |
| `g2-standard-48` |  |  |  |  |  |  |  |  | 4 |  |  |  |  |
| `g2-standard-96` |  |  |  |  |  |  |  |  | 8 |  |  |  |  |
| `g4-standard-48` |  |  |  |  |  |  |  |  |  |  |  |  | 1 |
| `g4-standard-96` |  |  |  |  |  |  |  |  |  |  |  |  | 2 |
| `g4-standard-192` |  |  |  |  |  |  |  |  |  |  |  |  | 4 |
| `g4-standard-384` |  |  |  |  |  |  |  |  |  |  |  |  | 8 |
* The specified machine type is only available when using Dynamic Workload Scheduler or Spot VMs.
The following examples highlight where you can specify GPUs when you create a `CustomJob`:
Console
In the Google Cloud console, you can't create a `CustomJob` directly. However, you can create a `TrainingPipeline` that creates a `CustomJob`. When you create a `TrainingPipeline` in the Google Cloud console, you can specify GPUs for each worker pool on the Compute and pricing step. First specify a Machine type. Then, you can specify GPU details in the Accelerator type and Accelerator count fields.
gcloud
To specify GPUs using the Google Cloud CLI tool, you must use a `config.yaml` file. For example:
config.yaml
```yaml
workerPoolSpecs:
  machineSpec:
    machineType: MACHINE_TYPE
    acceleratorType: ACCELERATOR_TYPE
    acceleratorCount: ACCELERATOR_COUNT
  replicaCount: REPLICA_COUNT
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI
```

Then run a command like the following:
```sh
gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml
```

Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
```javascript
/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */
// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_T4',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response:\n', JSON.stringify(response));
}
createCustomJob();
```

Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
```python
from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)
```

For more context, read the guide to creating a `CustomJob`.
GPUDirect Networking
In Vertex AI training, some H100, H200, B200, and GB200 series machines come preconfigured with the GPUDirect networking stack. GPUDirect can increase inter-GPU networking speed by up to 2x compared to GPUs without GPUDirect.
GPUDirect does this by reducing the overhead required to transfer packet payloads between GPUs, significantly improving throughput at scale.
GPUDirect-TCPXO
The `a3-megagpu-8g` machine type has:
- 8 NVIDIA H100 GPUs per machine
- Up to 200 Gbps bandwidth on the primary NIC
- 8 secondary NICs each supporting up to 200 Gbps for GPU data transfer
- GPUDirect-TCPXO, which further improves GPU to VM communication
GPUs with GPUDirect are especially equipped for distributed training of largemodels.
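For troubleshooting (see the note that follows), you can set `NCCL_DEBUG=INFO` in `containerSpec.env`. Here is a sketch in the dict format of the Python samples on this page; the machine and accelerator values are illustrative:

```python
# Illustrative worker pool spec for an a3-megagpu-8g node with NCCL debug
# logging enabled through the container's environment variables.
worker_pool_spec = {
    "machine_spec": {
        "machine_type": "a3-megagpu-8g",
        "accelerator_type": "NVIDIA_H100_MEGA_80GB",
        "accelerator_count": 8,
    },
    "replica_count": 1,
    "container_spec": {
        "image_uri": "CUSTOM_CONTAINER_IMAGE_URI",
        "env": [{"name": "NCCL_DEBUG", "value": "INFO"}],
    },
}
```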
Note: For troubleshooting, set `NCCL_DEBUG=INFO` within the `containerSpec.env` list of all GPU nodes to get additional logs.

GPUDirect-RDMA
The `a4x-highgpu-4g` machine type has:
- 4 GB200 GPUs per machine
- 2 host NICs providing a bandwidth of 400 Gbps
- 6 NICs offering up to 2400 Gbps for GPU data transfer
- GPUDirect-RDMA, which enables higher network performance for GPU communication for large-scale ML training workloads through RoCE (RDMA over Converged Ethernet)
The `a3-ultragpu-8g` and `a4-highgpu-8g` machine types have:
- 8 NVIDIA H200/B200 GPUs per machine
- 2 host NICs providing a bandwidth of 400 Gbps
- 8 NICs offering up to 3200 Gbps for GPU data transfer
- GPUDirect-RDMA, which enables higher network performance for GPU communication for large-scale ML training workloads through RoCE (RDMA over Converged Ethernet)
Note: Add `source /usr/local/gib/scripts/set_nccl_env.sh` to the beginning of the training command to set all the required environment variables to configure NCCL.

TPUs
To use Tensor Processing Units (TPUs) for custom training on Vertex AI, you can configure a worker pool to use a TPU VM.
When you use a TPU VM in Vertex AI, you must only use a single worker pool for custom training, and you must configure this worker pool to use only one replica.
TPU v2 and v3
To use TPU v2 or v3 VMs in your worker pool, you must use one of the following configurations:
- To configure a TPU VM with TPU v2, specify the following fields in the `WorkerPoolSpec`:
  - Set `machineSpec.machineType` to `cloud-tpu`.
  - Set `machineSpec.acceleratorType` to `TPU_V2`.
  - Set `machineSpec.acceleratorCount` to `8` for a single TPU, or to 32 or a multiple of 32 for TPU Pods.
  - Set `replicaCount` to `1`.
- To configure a TPU VM with TPU v3, specify the following fields in the `WorkerPoolSpec`:
  - Set `machineSpec.machineType` to `cloud-tpu`.
  - Set `machineSpec.acceleratorType` to `TPU_V3`.
  - Set `machineSpec.acceleratorCount` to `8` for a single TPU, or to `32+` for TPU Pods.
  - Set `replicaCount` to `1`.
For information about the regional availability of TPUs, see Using accelerators.
TPU v5e
TPU v5e requires JAX 0.4.6+, TensorFlow 2.15+, or PyTorch 2.1+. To configure a TPU VM with TPU v5e, specify the following fields in the `WorkerPoolSpec`:
- Set `machineSpec.machineType` to `ct5lp-hightpu-1t`, `ct5lp-hightpu-4t`, or `ct5lp-hightpu-8t`.
- Set `machineSpec.tpuTopology` to a supported topology for the machine type. For details, see the following table.
- Set `replicaCount` to `1`.
The following table shows the TPU v5e machine types and topologies that are supported for custom training:
| Machine type | Topology | Number of TPU chips | Number of VMs | Recommended use case |
|---|---|---|---|---|
| `ct5lp-hightpu-1t` | 1x1 | 1 | 1 | Small to medium scale training |
| `ct5lp-hightpu-4t` | 2x2 | 4 | 1 | Small to medium scale training |
| `ct5lp-hightpu-8t` | 2x4 | 8 | 1 | Small to medium scale training |
| `ct5lp-hightpu-4t` | 2x4 | 8 | 2 | Small to medium scale training |
| `ct5lp-hightpu-4t` | 4x4 | 16 | 4 | Large-scale training |
| `ct5lp-hightpu-4t` | 4x8 | 32 | 8 | Large-scale training |
| `ct5lp-hightpu-4t` | 8x8 | 64 | 16 | Large-scale training |
| `ct5lp-hightpu-4t` | 8x16 | 128 | 32 | Large-scale training |
| `ct5lp-hightpu-4t` | 16x16 | 256 | 64 | Large-scale training |
Custom training jobs running on TPU v5e VMs are optimized for throughput and availability. For more information, see v5e Training accelerator types.
For information about the regional availability of TPUs, see Using accelerators. For more information about TPU v5e, see Cloud TPU v5e training.
Machine type comparison:
| Machine Type | ct5lp-hightpu-1t | ct5lp-hightpu-4t | ct5lp-hightpu-8t |
|---|---|---|---|
| Number of v5e chips | 1 | 4 | 8 |
| Number of vCPUs | 24 | 112 | 224 |
| RAM (GB) | 48 | 192 | 384 |
| Number of NUMA nodes | 1 | 1 | 2 |
| Likelihood of preemption | High | Medium | Low |
TPU v6e
TPU v6e requires Python 3.10+, JAX 0.4.37+, PyTorch 2.1+ using PJRT as the default runtime, or TensorFlow using only the tf-nightly runtime version 2.18+. To configure a TPU VM with TPU v6e, specify the following fields in the `WorkerPoolSpec`:
- Set `machineSpec.machineType` to `ct6e-standard-1t`, `ct6e-standard-4t`, or `ct6e-standard-8t`.
- Set `machineSpec.tpuTopology` to a supported topology for the machine type. For details, see the following table.
- Set `replicaCount` to `1`.
The following table shows the TPU v6e machine types and topologies that are supported for custom training:
| Machine type | Topology | Number of TPU chips | Number of VMs | Recommended use case |
|---|---|---|---|---|
| `ct6e-standard-1t` | 1x1 | 1 | 1 | Small to medium scale training |
| `ct6e-standard-8t` | 2x4 | 8 | 1 | Small to medium scale training |
| `ct6e-standard-4t` | 2x2 | 4 | 1 | Small to medium scale training |
| `ct6e-standard-4t` | 2x4 | 8 | 2 | Small to medium scale training |
| `ct6e-standard-4t` | 4x4 | 16 | 4 | Large-scale training |
| `ct6e-standard-4t` | 4x8 | 32 | 8 | Large-scale training |
| `ct6e-standard-4t` | 8x8 | 64 | 16 | Large-scale training |
| `ct6e-standard-4t` | 8x16 | 128 | 32 | Large-scale training |
| `ct6e-standard-4t` | 16x16 | 256 | 64 | Large-scale training |
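For example, here is a sketch of a `WorkerPoolSpec` in the dict format of the Python samples on this page, requesting the single-VM 2x4 configuration from the preceding table:

```python
# Illustrative: one ct6e-standard-8t VM with 8 TPU v6e chips (2x4 topology,
# per the preceding table). replica_count must stay 1.
worker_pool_spec = {
    "machine_spec": {
        "machine_type": "ct6e-standard-8t",
        "tpu_topology": "2x4",
    },
    "replica_count": 1,
    "container_spec": {"image_uri": "CUSTOM_CONTAINER_IMAGE_URI"},
}
```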
For information about the regional availability of TPUs, see Using accelerators. For more information about TPU v6e, see Cloud TPU v6e training.
Machine type comparison:
| Machine Type | ct6e-standard-1t | ct6e-standard-4t | ct6e-standard-8t |
|---|---|---|---|
| Number of v6e chips | 1 | 4 | 8 |
| Number of vCPUs | 44 | 180 | 180 |
| RAM (GB) | 48 | 720 | 1440 |
| Number of NUMA nodes | 2 | 1 | 2 |
| Likelihood of preemption | High | Medium | Low |
TPU 7x (Preview)
TPU 7x requires Python 3.12+.
We recommend the following stable combination for functionality tests and workload migration:
- JAX + JAX Lib: `jax-0.8.1.dev20251104`, `jaxlib-0.8.1.dev2025104`
- Stable libtpu: `libtpu-0.0.27`
To configure a TPU VM with TPU 7x, specify the following fields in the `WorkerPoolSpec`:
- Set `machineSpec.machineType` to `tpu7x-standard-4t`.
- Set `machineSpec.tpuTopology` to a supported topology for the machine type. For details, see the following table.
- Set `replicaCount` to `1`.
The following table shows the TPU 7x topologies that are supported for custom training. All topologies use the `tpu7x-standard-4t` machine type.
| Topology | Number of TPU chips | Number of VMs | Scope |
|---|---|---|---|
| 2x2x1 | 4 | 1 | Single-host |
| 2x2x2 | 8 | 2 | Multi-host |
| 2x2x4 | 16 | 4 | Multi-host |
| 2x4x4 | 32 | 8 | Multi-host |
| 4x4x4 | 64 | 16 | Multi-host |
| 4x4x8 | 128 | 32 | Multi-host |
| 4x8x8 | 256 | 64 | Multi-host |
| 8x8x8 | 512 | 128 | Multi-host |
| 8x8x16 | 1024 | 256 | Multi-host |
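As a sketch (TPU 7x is in Preview, so verify field support in your SDK version), here is a `WorkerPoolSpec` in the same dict format as the Python samples on this page, requesting the multi-host 2x2x2 topology from the preceding table:

```python
# Illustrative: a 2x2x2 TPU 7x slice (8 chips across 2 VMs, per the
# preceding table). replica_count stays 1 even for multi-host topologies.
worker_pool_spec = {
    "machine_spec": {
        "machine_type": "tpu7x-standard-4t",
        "tpu_topology": "2x2x2",
    },
    "replica_count": 1,
    "container_spec": {"image_uri": "CUSTOM_CONTAINER_IMAGE_URI"},
}
```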
For information about the regional availability of TPUs, see Using accelerators. For more information about TPU 7x, see Cloud TPU 7x training.
Example `CustomJob` specifying a TPU VM
The following example highlights how to specify a TPU VM when you create a `CustomJob`:
gcloud
To specify a TPU VM using the gcloud CLI tool, you must use a `config.yaml` file. Select one of the following tabs to see an example:
TPU v2/v3
```yaml
workerPoolSpecs:
  machineSpec:
    machineType: cloud-tpu
    acceleratorType: TPU_V2
    acceleratorCount: 8
  replicaCount: 1
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI
```

TPU v5e
```yaml
workerPoolSpecs:
  machineSpec:
    machineType: ct5lp-hightpu-4t
    tpuTopology: 4x4
  replicaCount: 1
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI
```

Then run a command like the following:
```sh
gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml
```

Python
Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
To specify a TPU VM using the Vertex AI SDK for Python, see the following example:
```python
from google.cloud import aiplatform

job = aiplatform.CustomContainerTrainingJob(
    display_name='DISPLAY_NAME',
    location='us-west1',
    project='PROJECT_ID',
    staging_bucket="gs://CLOUD_STORAGE_URI",
    container_uri='CONTAINER_URI',
)

job.run(machine_type='ct5lp-hightpu-4t', tpu_topology='2x2')
```
For more information about creating a custom training job, seeCreate custom training jobs.
Boot disk options
You can optionally customize the boot disks for your training VMs. All VMs in aworker pool use the same type and size of boot disk.
To customize the type of boot disk that each training VM uses, specify the
diskSpec.bootDiskTypefield in yourWorkerPoolSpec.You can set this field to one of the following:
pd-standardto use a standard persistent disk backed by astandard hard drivepd-ssdto use an SSD persistent disk backed by a solid-state drivehyperdisk-balancedfor higher IOPS andthroughput rates.
The default value is
pd-ssd(hyperdisk-balancedis thedefault fora3-ultragpu-8ganda4-highgpu-8g).Using
pd-ssdorhyperdisk-balancedmight improve performance if your training code reads and writes to disk.Learn aboutdisk types. Also seehyperdisk supported machines.To customize the size (in GB) of the boot disk that each training VMuses, specify the
diskSpec.bootDiskSizeGbfield in yourWorkerPoolSpec.You can set this field to an integer between 100 and 64,000, inclusive. Thedefault value is
100.You might want to increase the boot disk size if your training code writesa lot of temporary data to disk. Note that any data you write to the boot diskis temporary, and you can't retrieve it after training completes.
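For example, here is a sketch of a `WorkerPoolSpec` in the dict format of the Python samples on this page, with a larger SSD boot disk (the disk values are illustrative):

```python
# Illustrative: VMs in this worker pool boot from a 500 GB pd-ssd disk
# instead of the 100 GB pd-ssd default.
worker_pool_spec = {
    "machine_spec": {"machine_type": "n1-standard-8"},
    "replica_count": 1,
    "disk_spec": {
        "boot_disk_type": "pd-ssd",
        "boot_disk_size_gb": 500,
    },
    "container_spec": {"image_uri": "CUSTOM_CONTAINER_IMAGE_URI"},
}
```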
Changing the type and size of your boot disks affects custom training pricing.
The following examples highlight where you can specify boot disk options when you create a `CustomJob`:
Console
In the Google Cloud console, you can't create a `CustomJob` directly. However, you can create a `TrainingPipeline` that creates a `CustomJob`. When you create a `TrainingPipeline` in the Google Cloud console, you can specify boot disk options for each worker pool on the Compute and pricing step, in the Disk type drop-down list and the Disk size (GB) field.
gcloud
To specify boot disk options using the Google Cloud CLI tool, you must use a `config.yaml` file. For example:
config.yaml
```yaml
workerPoolSpecs:
  machineSpec:
    machineType: MACHINE_TYPE
  diskSpec:
    bootDiskType: DISK_TYPE
    bootDiskSizeGb: DISK_SIZE
  replicaCount: REPLICA_COUNT
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI
```

Then run a command like the following:
```sh
gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml
```

For more context, read the guide to creating a `CustomJob`.
What's next
- Learn how to create a persistent resource to run custom training jobs.
- Learn how to perform custom training by creating a `CustomJob`.