Get started with training clusters
Before you can deploy your first cluster on Vertex AI training clusters, you must configure your Google Cloud project and environment. This guide covers all the necessary prerequisites, which fall into three main categories:
- Project Access: Gaining access to the service, which is by invitation only.
- Resource Configuration: Enabling APIs and setting up the required VPC network and storage services.
- User Permissions: Granting the necessary IAM roles for cluster management and resource access.
Completing these steps prepares your project for a successful deployment.
Prerequisites
To use training clusters, you must:
- Get your project allowlisted by contacting your sales representative for access.
- Obtain capacity for GPU clusters in supported regions.
- Enable the necessary APIs, including the Compute Engine, Filestore, Cloud Storage, Managed Lustre (optional), Hypercomputer Configuration Service, and Vertex AI APIs.
- Configure networking by ensuring an existing network meets specific conditions (for example, Private Google Access and firewall rules) or by creating a new VPC network and subnetwork.
- Configure storage by creating a zonal or regional Filestore instance to serve as the /home directory and optionally configuring a Google Cloud Managed Lustre instance.
- Grant IAM permissions to users for cluster management, storage access, and SSH access to cluster nodes, as described in the IAM permissions section.
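The networking and storage items above can be sketched with the gcloud CLI as follows. This is a minimal illustration, not the required configuration: the project, network, subnet, and instance names, the IP range, the zone, the share name, the tier, and the 1TB capacity are all hypothetical placeholders, and this guide does not state the exact firewall rules or Filestore settings that training clusters expect. The script prints each command by default (DRY_RUN=1) so that nothing is created until you have reviewed it.

```shell
#!/bin/sh
# Sketch only: every name and value below is a hypothetical placeholder.
PROJECT_ID="${PROJECT_ID:-my-project}"
REGION="${REGION:-us-central1}"
ZONE="${ZONE:-us-central1-a}"
NETWORK="${NETWORK:-training-net}"
SUBNET="${SUBNET:-training-subnet}"
SUBNET_RANGE="${SUBNET_RANGE:-10.0.0.0/20}"
FILESTORE_NAME="${FILESTORE_NAME:-training-home}"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create a custom-mode VPC network and one subnet with
# Private Google Access enabled. Firewall rules are not covered here.
create_network() {
  run gcloud compute networks create "$NETWORK" \
    --project="$PROJECT_ID" --subnet-mode=custom
  run gcloud compute networks subnets create "$SUBNET" \
    --project="$PROJECT_ID" --network="$NETWORK" --region="$REGION" \
    --range="$SUBNET_RANGE" --enable-private-ip-google-access
}

# Create a zonal Filestore instance intended to back /home.
# Tier, share name, and capacity are assumptions for illustration.
create_home_filestore() {
  run gcloud filestore instances create "$FILESTORE_NAME" \
    --project="$PROJECT_ID" --zone="$ZONE" --tier=ZONAL \
    --file-share=name=home,capacity=1TB \
    --network=name="$NETWORK"
}

create_network
create_home_filestore
```

Run the script once as is to inspect the printed commands, then rerun it with DRY_RUN=0 after replacing the placeholders with values agreed with your sales representative or network administrator.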
Supported regions
Note: Issuing a request to any regions that aren't on this list causes an API error.
- us-central1
- us-east1
- us-east4
- us-east5
- us-south1
- us-west1
- us-west4
- asia-southeast1
- europe-west1
- europe-west4
- europe-north1
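Because a request to an unlisted region fails at the API, it can help to check the region locally before issuing any request. The following is a minimal sketch: the region list is copied from this page, while the function name is_supported_region and the REGION variable are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Supported regions, copied from the list on this page.
SUPPORTED_REGIONS="us-central1 us-east1 us-east4 us-east5 us-south1 us-west1 us-west4 asia-southeast1 europe-west1 europe-west4 europe-north1"

# is_supported_region REGION
# Returns 0 if REGION is in the supported list, 1 otherwise.
is_supported_region() {
  for supported in $SUPPORTED_REGIONS; do
    if [ "$supported" = "$1" ]; then
      return 0
    fi
  done
  return 1
}

# REGION is a hypothetical variable; the default is only an example.
REGION="${REGION:-us-central1}"
if is_supported_region "$REGION"; then
  echo "Region $REGION is supported."
else
  echo "Region $REGION is not in the supported list." >&2
fi
```

The list is hard-coded, so it needs to be updated by hand whenever the supported regions on this page change.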
IAM permissions
- Grant the roles/aiplatform.admin role to users who will manage your training clusters.
- Grant the roles/aiplatform.viewer role to users who only need to view clusters and their configurations.

Grant the following IAM roles to the user or service account that will manage (create, delete, and update) Managed Training clusters:

| Role Name | Role ID |
| --- | --- |
| Compute Instance Admin (v1) | roles/compute.instanceAdmin.v1 |
| Logs Writer | roles/logging.logWriter |
| Monitoring Metric Writer | roles/monitoring.metricWriter |
| Service Account User | roles/iam.serviceAccountUser |
| Service Networking Admin | roles/servicenetworking.networksAdmin |

To allow the cluster's nodes to read from and write to Cloud Storage buckets using Cloud Storage FUSE, grant the Storage Object User role (roles/storage.objectUser) to the service account used by the VMs.

For SSH access to the Slurm login nodes, grant the following permissions:

| Permissions | Descriptions | Purpose |
| --- | --- | --- |
| Compute OS Login | Sign in to a VM as a standard (non-administrator) user. If sudo is needed, use Compute OS Admin Login instead. | SSH to the deployed login node |
| IAP-secured Tunnel User | Access Tunnel resources which use Identity-Aware Proxy. | SSH to the deployed login node |
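The role grants in this section could be scripted roughly as shown below. This is a sketch under several assumptions: the project ID and the three principals are hypothetical placeholders, all bindings are made at the project level (this guide does not say whether a narrower scope, such as a single bucket, is preferred), and the role IDs roles/compute.osLogin and roles/iap.tunnelResourceAccessor are the IDs that correspond to the Compute OS Login and IAP-secured Tunnel User roles named in the table. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: the project and principals below are hypothetical placeholders.
PROJECT_ID="${PROJECT_ID:-my-project}"
CLUSTER_ADMIN="${CLUSTER_ADMIN:-user:cluster-admin@example.com}"
VM_SERVICE_ACCOUNT="${VM_SERVICE_ACCOUNT:-serviceAccount:cluster-vm@my-project.iam.gserviceaccount.com}"
SSH_USER="${SSH_USER:-user:researcher@example.com}"

# Roles for the principal that creates, updates, and deletes clusters.
CLUSTER_MANAGER_ROLES="roles/aiplatform.admin roles/compute.instanceAdmin.v1 roles/logging.logWriter roles/monitoring.metricWriter roles/iam.serviceAccountUser roles/servicenetworking.networksAdmin"
# Roles for SSH access to the login nodes through IAP.
SSH_ROLES="roles/compute.osLogin roles/iap.tunnelResourceAccessor"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# grant_roles MEMBER ROLE...
# Adds one project-level IAM binding per role for MEMBER.
grant_roles() {
  member="$1"
  shift
  for role in "$@"; do
    run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="$member" --role="$role"
  done
}

# Word splitting of the role lists is intentional: one argument per role.
grant_roles "$CLUSTER_ADMIN" $CLUSTER_MANAGER_ROLES
grant_roles "$VM_SERVICE_ACCOUNT" roles/storage.objectUser
grant_roles "$SSH_USER" $SSH_ROLES
```

If a user needs sudo on the login node, replace roles/compute.osLogin with roles/compute.osAdminLogin in SSH_ROLES. Once the roles are in place, a connection through IAP would typically use gcloud compute ssh with the --tunnel-through-iap flag; the login node's name and zone depend on your cluster.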
Enable APIs
Enable the Compute Engine API:
    gcloud services enable compute.googleapis.com
Enable the Service Networking API, because Filestore must be deployed before you create the cluster:
    gcloud services enable servicenetworking.googleapis.com
Enable the Cloud Storage API:
    gcloud services enable storage.googleapis.com
Enable the Lustre API (if using Lustre):
    gcloud services enable lustre.googleapis.com
Enable the Hypercomputer Configuration Service (HCS) API:
    gcloud services enable hypercomputecluster.googleapis.com
Enable the Vertex AI API:
    gcloud services enable aiplatform.googleapis.com
Enable the Cloud Resource Manager API:
    gcloud services enable cloudresourcemanager.googleapis.com
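Since gcloud services enable accepts several service names in one call, the individual commands above can be combined and then verified. The following is a minimal sketch: the service names are taken from the commands above, while PROJECT_ID, USE_LUSTRE, DRY_RUN, and the helper functions are hypothetical names introduced for illustration. By default the script only prints the enable command; with DRY_RUN=0 it runs the command and then lists any required API that is still not enabled.

```shell
#!/bin/sh
# Sketch only: PROJECT_ID is a hypothetical placeholder.
PROJECT_ID="${PROJECT_ID:-my-project}"

# Service names taken from the individual commands on this page.
REQUIRED_APIS="compute.googleapis.com servicenetworking.googleapis.com storage.googleapis.com hypercomputecluster.googleapis.com aiplatform.googleapis.com cloudresourcemanager.googleapis.com"
OPTIONAL_APIS="lustre.googleapis.com"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Enable all required APIs in one call; add Lustre when USE_LUSTRE=1.
enable_apis() {
  apis="$REQUIRED_APIS"
  if [ "${USE_LUSTRE:-0}" = "1" ]; then
    apis="$apis $OPTIONAL_APIS"
  fi
  # Word splitting of $apis is intentional: one argument per service.
  run gcloud services enable $apis --project="$PROJECT_ID"
}

# missing_apis ENABLED_LIST
# Prints each required API that does not appear in the
# whitespace-separated ENABLED_LIST.
missing_apis() {
  for api in $REQUIRED_APIS; do
    # $1 is intentionally unquoted so each service lands on its own line.
    if ! printf '%s\n' $1 | grep -qxF "$api"; then
      echo "$api"
    fi
  done
}

enable_apis
if [ "${DRY_RUN:-1}" != "1" ]; then
  enabled="$(gcloud services list --enabled \
    --format='value(config.name)' --project="$PROJECT_ID")"
  missing_apis "$enabled"
fi
```

An empty result from missing_apis means every required API reported by gcloud services list is enabled for the project.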
What's next
For a detailed guide on creating a training cluster and running your AI/ML workloads, contact your sales representative.
Last updated 2025-12-15 UTC.