Get started with training clusters

If you're interested in Vertex AI training clusters, contact your sales representative for access.

Before you can deploy your first cluster on Vertex AI training clusters, you must configure your Google Cloud project and environment. This guide covers all the necessary prerequisites, which fall into three main categories:

  • Project Access: Gaining access to the service, which is by invitation only.

  • Resource Configuration: Enabling APIs and setting up the required VPC network and storage services.

  • User Permissions: Granting the necessary IAM roles for cluster management and resource access.

Completing these steps prepares your project for a successful deployment.

Prerequisites

To use training clusters, you must:

  1. Allowlist your project by contacting your sales representative for access.
  2. Obtain capacity for GPU clusters in supported regions.
  3. Enable the necessary APIs, including the Compute Engine, Filestore, Cloud Storage, Managed Lustre (optional), Hypercomputer Configuration Service, and Vertex AI APIs.
  4. Configure networking by ensuring an existing network meets specific conditions (for example, Private Google Access, firewall rules) or by creating a new VPC network and subnetwork.
  5. Configure storage by creating a zonal or regional Filestore instance to serve as the /home directory and optionally configuring a Google Cloud Managed Lustre instance.
  6. Grant IAM permissions to users for cluster management, storage access, and SSH access to cluster nodes, as described in the IAM permissions section.
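
Steps 4 and 5 can be sketched with standard gcloud commands. This is a minimal sketch, not the service's required configuration: every name, the IP range, the zone, the Filestore tier, the share name, and the capacity are placeholder assumptions, and the exact network conditions, firewall rules, and Filestore connect mode that training clusters expect are not stated on this page. The `run` helper is a hypothetical wrapper that prints each command by default instead of executing it.

```shell
# All values below are placeholders chosen for illustration only.
PROJECT_ID="${PROJECT_ID:-my-project}"
REGION="${REGION:-us-central1}"
ZONE="${ZONE:-us-central1-a}"
NETWORK="${NETWORK:-training-net}"
SUBNET="${SUBNET:-training-subnet}"
SUBNET_RANGE="${SUBNET_RANGE:-10.0.0.0/20}"
FILESTORE_NAME="${FILESTORE_NAME:-training-home}"

# Hypothetical helper: with DRY_RUN=1 (the default in this sketch),
# commands are printed rather than executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

setup_network_and_storage() {
  # Custom-mode VPC network.
  run gcloud compute networks create "$NETWORK" \
    --project="$PROJECT_ID" --subnet-mode=custom

  # Subnetwork with Private Google Access enabled.
  run gcloud compute networks subnets create "$SUBNET" \
    --project="$PROJECT_ID" --network="$NETWORK" --region="$REGION" \
    --range="$SUBNET_RANGE" --enable-private-ip-google-access

  # Zonal Filestore instance; the share name and capacity are assumptions.
  run gcloud filestore instances create "$FILESTORE_NAME" \
    --project="$PROJECT_ID" --zone="$ZONE" --tier=ZONAL \
    --file-share=name=home,capacity=1TiB \
    --network=name="$NETWORK"
}

setup_network_and_storage
```

Review the printed commands first, then rerun with `DRY_RUN=0` once the values match your project.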

Supported regions

Note: A request to any region that isn't on this list causes an API error.
  • us-central1
  • us-east1
  • us-east4
  • us-east5
  • us-south1
  • us-west1
  • us-west4
  • asia-southeast1
  • europe-west1
  • europe-west4
  • europe-north1
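
Because requests to other regions fail at the API, it can help to check a region locally before issuing any request. The following is a small sketch; `is_supported_region` is a hypothetical helper (not part of gcloud), and the region list is copied from the list above, so it needs updating if that list changes.

```shell
# Regions copied from the supported regions list above.
SUPPORTED_REGIONS="us-central1 us-east1 us-east4 us-east5 us-south1 us-west1 us-west4 asia-southeast1 europe-west1 europe-west4 europe-north1"

# Hypothetical helper: returns 0 if $1 is in the list, 1 otherwise.
is_supported_region() {
  for r in $SUPPORTED_REGIONS; do
    if [ "$r" = "$1" ]; then
      return 0
    fi
  done
  return 1
}

# REGION is a placeholder; override it in your environment.
REGION="${REGION:-us-central1}"
if is_supported_region "$REGION"; then
  echo "$REGION is supported"
else
  echo "$REGION is not supported" >&2
fi
```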

IAM permissions

  1. Grant the roles/aiplatform.admin role to users who will manage your training clusters.
  2. Grant the roles/aiplatform.viewer role to users who only need to view clusters and their configurations.
  3. Grant the following IAM roles to the user or service account that will manage (create, delete, and update) Managed Training clusters:

    Role Name                    Role ID
    Compute Instance Admin (v1)  roles/compute.instanceAdmin.v1
    Logs Writer                  roles/logging.logWriter
    Monitoring Metric Writer     roles/monitoring.metricWriter
    Service Account User         roles/iam.serviceAccountUser
    Service Networking Admin     roles/servicenetworking.networksAdmin

  4. To allow the cluster's nodes to read from and write to Cloud Storage buckets using Google Cloud Storage FUSE, grant the Storage Object User role (roles/storage.objectUser) to the service account used by the VMs.

  5. For SSH access to the Slurm login nodes, grant the following permissions:

    Permission               Description                                               Purpose
    Compute OS Login         Sign in to a VM as a standard (non-administrator) user.   SSH to the deployed login node
                             If sudo is needed, use Compute OS Admin Login instead.
    IAP-secured Tunnel User  Access Tunnel resources that use Identity-Aware Proxy.    SSH to the deployed login node
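
The grants above can be scripted with `gcloud projects add-iam-policy-binding`. The sketch below makes several assumptions: the project ID and all member addresses are placeholders, a single principal receives every cluster-management role, and the SSH role IDs (roles/compute.osLogin and roles/iap.tunnelResourceAccessor) are not listed on this page, so verify them against the IAM role reference before use. The `run` and `grant` helpers are hypothetical and print commands by default.

```shell
# Placeholder values; replace them with your own project and principals.
PROJECT_ID="${PROJECT_ID:-my-project}"
ADMIN_MEMBER="${ADMIN_MEMBER:-user:cluster-admin@example.com}"
VM_SA_MEMBER="${VM_SA_MEMBER:-serviceAccount:cluster-vm@my-project.iam.gserviceaccount.com}"
SSH_MEMBER="${SSH_MEMBER:-user:researcher@example.com}"

# Hypothetical helper: with DRY_RUN=1 (the default in this sketch),
# commands are printed rather than executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: bind one role to one member at the project level.
grant() {
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="$1" --role="$2"
}

# Cluster management roles (steps 1 and 3).
for role in \
  roles/aiplatform.admin \
  roles/compute.instanceAdmin.v1 \
  roles/logging.logWriter \
  roles/monitoring.metricWriter \
  roles/iam.serviceAccountUser \
  roles/servicenetworking.networksAdmin
do
  grant "$ADMIN_MEMBER" "$role"
done

# Cloud Storage FUSE access for the VM service account (step 4).
grant "$VM_SA_MEMBER" roles/storage.objectUser

# SSH access to the Slurm login nodes (step 5); role IDs are assumptions.
for role in roles/compute.osLogin roles/iap.tunnelResourceAccessor; do
  grant "$SSH_MEMBER" "$role"
done
```

Users who only need read access would instead receive roles/aiplatform.viewer through the same `grant` helper.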

Enable APIs

  1. Enable the Compute Engine API:

       gcloud services enable compute.googleapis.com
  2. Enable the Service Networking API, because Filestore must be deployed before creating the cluster:

       gcloud services enable servicenetworking.googleapis.com
  3. Enable the Cloud Storage API:

       gcloud services enable storage.googleapis.com
  4. Enable the Lustre API (if using Lustre):

       gcloud services enable lustre.googleapis.com
  5. Enable the Hypercomputer Configuration Service (HCS) API:

       gcloud services enable hypercomputecluster.googleapis.com
  6. Enable the Vertex AI API:

       gcloud services enable aiplatform.googleapis.com
  7. Enable the Cloud Resource Manager API:

       gcloud services enable cloudresourcemanager.googleapis.com
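
Because `gcloud services enable` accepts several service names at once, the steps above can also be combined into a single call. This is a sketch under a few assumptions: `USE_LUSTRE` is a hypothetical toggle for the optional Lustre API, the `run` helper is a hypothetical wrapper that prints the command by default, and the command targets whichever project is active in your gcloud configuration.

```shell
# Service names copied from the steps above.
SERVICES="compute.googleapis.com servicenetworking.googleapis.com storage.googleapis.com hypercomputecluster.googleapis.com aiplatform.googleapis.com cloudresourcemanager.googleapis.com"

# Lustre is optional; USE_LUSTRE is a hypothetical toggle for this sketch.
if [ "${USE_LUSTRE:-0}" = "1" ]; then
  SERVICES="$SERVICES lustre.googleapis.com"
fi

# Hypothetical helper: with DRY_RUN=1 (the default in this sketch),
# commands are printed rather than executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

enable_required_apis() {
  # SERVICES is intentionally unquoted so each name becomes its own argument.
  run gcloud services enable $SERVICES
}

enable_required_apis
```

After running with `DRY_RUN=0`, `gcloud services list --enabled` shows which services are active in the project.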

What's next

For a detailed guide on creating a training cluster and running your AI/ML workloads, contact your sales representative.


Last updated 2025-12-15 UTC.