Compute resources

If you're interested in Vertex AI training clusters, contact your sales representative for access.

Vertex AI training clusters support a variety of machine types to accommodate different workloads. You can choose from the following options when configuring your cluster node pools:

  • a4-highgpu-8g
  • a3-ultragpu-8g
  • a3-megagpu-8g
  • n2 CPU family

Capacity provisioning

Choosing the right provisioning model is critical for balancing cost, speed, and resource availability. See the following provisioning options:

  • RESERVATION: Allocates nodes from a specific Compute Engine reservation that you've created in advance (see the example after this list). This model ensures capacity and is the recommended choice for high-demand resources.

  • FLEX_START: Uses the Dynamic Workload Scheduler to queue your job. The job begins automatically as soon as the requested compute resources become available, offering a flexible start time without requiring a reservation.

  • SPOT: Provisions the node pool using Spot VMs. This is the most cost-effective option, but it should only be used for workloads that are fault-tolerant and can handle interruptions, as the VMs may be preempted at any time.

  • ON_DEMAND: This is the default option for CPU-only node pools and is best suited for machine types that are not scarce. It provides standard VM instances with predictable, pay-as-you-go pricing.
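If you plan to use the RESERVATION model, create the Compute Engine reservation before you create the cluster. The following is a minimal sketch, assuming a reservation of two a3-ultragpu-8g VMs; RESERVATION_NAME, PROJECT_ID, and ZONE are placeholders, and your setup might require additional flags (for example, share settings if the reservation is shared):

    # Minimal sketch: reserve two a3-ultragpu-8g VMs in a single zone.
    # RESERVATION_NAME, PROJECT_ID, and ZONE are placeholders.
    gcloud compute reservations create RESERVATION_NAME \
        --project=PROJECT_ID \
        --zone=ZONE \
        --machine-type=a3-ultragpu-8g \
        --vm-count=2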

Use the following guidance to make your selection:

  • For high-demand GPU resources (like A3 and A4): The RESERVATION model is strongly recommended. It ensures you have dedicated access to the capacity you need for critical training jobs.

  • For bursty or flexible workloads: Consider FLEX_START or SPOT. FLEX_START queues your job until resources are available, while SPOT offers significant cost savings for fault-tolerant jobs that can handle preemption.

  • For abundant machine types: The ON_DEMAND model is the preferred choice. Use it for machine types that are not scarce and where immediate availability isn't a concern.

Using a shared reservation (optional)

If you'd like to use a shared reservation, rather than a local reservation, there are additional steps to take before you can create a cluster.

Before using a shared reservation with Vertex AI training clusters, complete the following steps:

  1. Verify that the shared reservation works by manually creating a VM that uses it. If the VM creation succeeds, move on to the next step.

  2. In the cluster creation configuration, use the reservation name in the following format: projects/RESERVATION_HOST_PROJECT_ID/zones/RESERVATION_ZONE/reservations/RESERVATION_NAME
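For the verification step, the following is a minimal sketch, assuming an a3-ultragpu-8g shared reservation; CONSUMER_PROJECT_ID (the project in which you create the cluster) and the test VM name are placeholders, and depending on your setup you might also need image, network, or maintenance-policy flags:

    # Minimal sketch: create a test VM that consumes the shared reservation.
    # The machine type must match the reservation's machine type.
    gcloud compute instances create test-shared-reservation-vm \
        --project=CONSUMER_PROJECT_ID \
        --zone=RESERVATION_ZONE \
        --machine-type=a3-ultragpu-8g \
        --reservation-affinity=specific \
        --reservation=projects/RESERVATION_HOST_PROJECT_ID/reservations/RESERVATION_NAME

    # If the VM is created successfully, delete it to free the reserved capacity.
    gcloud compute instances delete test-shared-reservation-vm \
        --project=CONSUMER_PROJECT_ID \
        --zone=RESERVATION_ZONE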

What's next

After selecting the compute and provisioning options for your training cluster, you are ready to create the cluster and run a workload on it.

  • Create a Compute Engine reservation: The RESERVATION model is used for allocating high-demand resources like GPUs. Learn how to create a new reservation in Compute Engine to get dedicated access to your required resources.
  • Create your training cluster: Apply the configurations you've learned about by following the step-by-step guide to create your first persistent training cluster using the Vertex AI API or gcloud.
  • Submit a training job to your cluster: Once your cluster is active, the next step is to run a workload. Submit a CustomJob that targets your persistent cluster for execution (see the submission sketch after this list).
  • Adapt your code for distributed training: To take full advantage of a multi-node cluster, adapt your training code for a distributed environment (see the launch sketch after this list).
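For job submission, the following is a hedged sketch based on the standard gcloud ai custom-jobs create command; the --persistent-resource-id flag is an assumption about how the job is pointed at your cluster, and REGION, CLUSTER_ID, and IMAGE_URI are placeholders. Confirm the exact flags in the cluster documentation before use:

    # Hedged sketch: submit a CustomJob that runs a training container on two
    # a3-ultragpu-8g nodes of an existing cluster.
    gcloud ai custom-jobs create \
        --region=REGION \
        --display-name=my-training-job \
        --persistent-resource-id=CLUSTER_ID \
        --worker-pool-spec=machine-type=a3-ultragpu-8g,replica-count=2,container-image-uri=IMAGE_URI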
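For the distributed-training step, each node's container typically launches one worker process per GPU. The following torchrun sketch assumes a PyTorch workload on 8-GPU nodes; NUM_NODES, NODE_RANK, and MASTER_ADDR are placeholders that your launcher or cluster environment must supply, not values guaranteed by Vertex AI:

    # Hedged sketch: per-node launch command for a multi-node PyTorch job.
    torchrun \
        --nnodes="${NUM_NODES}" \
        --nproc_per_node=8 \
        --node_rank="${NODE_RANK}" \
        --master_addr="${MASTER_ADDR}" \
        --master_port=29500 \
        train.py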
