Vertex AI training clusters overview

If you're interested in Vertex AI training clusters, contact your sales representative for access.

Vertex AI training clusters is a service from Google Cloud designed to simplify and accelerate the largest and most complex AI/ML workloads. It's specifically built to address challenges in large-scale training, such as complex cluster configuration, framework optimization, handling hardware failures, and integrating disparate toolsets.

Key value proposition and features

Vertex AI training clusters offers several core benefits:

  • Open-source Slurm UX and cluster transparency: Vertex AI training clusters provides familiar, flexible tools to launch and manage jobs through an open-source Slurm user experience. Slurm is an industry standard known for optimized GPU scheduling, automated fault tolerance, and simplified parallel job launch.

  • Automated cluster setup and configuration: Vertex AI training clusters automates the setup and configuration of clusters, aiming to transition from reservation to production training in hours. Users can create clusters using the Google Cloud console (using reference architectures or step-by-step configuration) or through API calls with JSON files (see the sketch after this list).

  • Preconfigured data science recipes and workflows: Vertex AI training clusters includes purpose-built tooling and optimized training recipes to jumpstart training for popular use cases like Llama and Gemma models, covering pre-training, supervised fine-tuning (SFT), and reinforcement learning (RL). These recipes are preconfigured for state-of-the-art (SOTA) performance on Google Cloud infrastructure, demonstrating significant performance gains.

  • Hardware resiliency and high uptime: Vertex AI training clusters is designed with hardware resiliency to boost cluster uptime. It automatically resolves hardware issues; detects and triages various failure modes (for example, correctness checks, speed checks, Error-Correcting Code (ECC) errors, NVIDIA Data Center GPU Manager (DCGM) checks, and disk space capacity); and triggers remediation actions such as restarting, reimaging, or replacing faulty nodes and resuming from checkpoints. This helps mitigate the significant cost increases and delays caused by job interruptions and hardware failures in large-scale training.

  • Architecture and components: Vertex AI training clusters runs on Compute Engine infrastructure supporting GPUs and CPUs. It leverages a managed Slurm orchestrator for deploying and managing compute nodes, including login and worker nodes. The service integrates with other Google Cloud services such as networking and storage.

  • MLOps and observability: Integrates with Vertex AI MLOps tools like Vertex AI Model Registry for automatic registration, tracking, and versioning of trained models, and Vertex AI Inference for deployment with autoscaling and automated metrics. The service also features automatic observability integration with Vertex AI TensorBoard to visualize training processes, track metrics, and identify issues early.
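
To illustrate the JSON-based API path mentioned in the setup bullet above, here is a minimal, hypothetical sketch of a cluster-creation request. The endpoint path and JSON fields are illustrative placeholders, not the service's documented schema; consult the Vertex AI training clusters API reference for the real request format.

```bash
# Hypothetical cluster-creation call. PROJECT_ID, REGION, and
# CLUSTER_RESOURCE_PATH are placeholders, and the JSON fields are
# illustrative rather than the service's documented schema.
cat > cluster.json <<'EOF'
{
  "displayName": "my-training-cluster",
  "network": "projects/PROJECT_ID/global/networks/my-vpc",
  "nodePools": [
    { "machineType": "a3-highgpu-8g", "nodeCount": 16 }
  ]
}
EOF

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @cluster.json \
  "https://REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/CLUSTER_RESOURCE_PATH"
```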

Training clusters can be created, retrieved, listed, updated, and deleted using the Vertex AI training clusters API. After cluster creation, users can validate its functionality by logging into nodes, running basic Slurm commands (for example, sinfo, sbatch), and executing GPU-related workloads (for example, nvidia-smi). The Cluster Health Scanner (CHS) tool is pre-installed for running diagnostics like DCGM and NCCL tests to verify cluster readiness.
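
For example, a quick validation pass from the login node might look like the following. These are standard Slurm and NVIDIA commands; the exact partitions and node names depend on your cluster.

```bash
# Basic cluster validation from the login node.
sinfo                             # list partitions and node states
srun -N 1 --gres=gpu:1 nvidia-smi # run a GPU check on one worker node
sbatch --wrap="hostname"          # submit a trivial batch job
squeue                            # confirm the job is queued or running

# The pre-installed Cluster Health Scanner (CHS) can additionally run DCGM
# and NCCL diagnostics; see its documentation for the exact invocation.
```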

Vertex AI training clusters provides an API for launching prebuilt LLM jobs using optimized recipes for models like Llama and Gemma, supporting pre-training and continuous pre-training from checkpoints. Job monitoring is possible by logging into the login node, examining output files, and using Slurm commands like squeue.
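
As a sketch of that monitoring flow (JOBID is a placeholder; by default, Slurm writes batch output to slurm-<jobid>.out):

```bash
# Monitor a recipe job from the login node.
squeue -u "$USER"        # list your pending and running jobs
sacct -j JOBID           # show accounting details for a job
tail -f slurm-JOBID.out  # follow the job's output file as it runs
```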

Terminology

This section provides definitions for key terms and concepts essential to understanding and effectively utilizing Vertex AI training clusters. These terms span core service components, architectural considerations, integrated storage technologies, and fundamental machine learning (ML) and MLOps concepts that underpin your training environment.

Core service concepts

node
  • A single virtual machine (Compute Engine instance) that serves as one unit of computation within your cluster. In the context of Managed Training on reserved clusters, think of a node as one of the dedicated worker machines that runs a portion of your overall training job. Each node is equipped with specific resources like CPU, memory, and accelerators (for example, A3 or A4 GPUs), and all nodes work together in a coordinated way to handle large-scale, distributed training tasks.
login node
  • The node you connect to (for example, over SSH) to interact with the cluster: submitting jobs with Slurm commands, monitoring their output, and managing files. Login nodes coordinate access to the cluster but don't run the training workloads themselves.
partition
  • In Slurm, a logical grouping of nodes, often used to separate nodes with different hardware configurations (see the sinfo example after this list).
recipe
  • In the context of Managed Training, a recipe is a comprehensive and reusable package that contains everything needed to execute a specific large-scale training workload.
Slurm cluster
  • A collection of Compute Engine instances, managed by Slurm, that includes a login node and multiple worker nodes configured for running training jobs. For more information, see Slurm workload manager.
worker node
  • A worker node refers to an individual machine or computational instance within a cluster that's responsible for executing tasks or performing work. In systems like Kubernetes or Ray clusters, nodes are the fundamental units of compute. For more information, see What is high performance computing (HPC)?
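
To make the partition, login node, and worker node concepts concrete, here is what sinfo output might look like on a cluster with two hardware pools. The partition names, node counts, and node names are purely illustrative.

```bash
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
a3        up     infinite      16   idle  a3-node-[0-15]
a4        up     infinite       8   idle  a4-node-[0-7]
```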

Architecture and networking

consumer VPC network
  • A consumer VPC network is a Google Cloud Virtual Private Cloud (VPC) that privately accesses a service hosted in another VPC (known as the producer VPC). For more information, see Private Service Connect.
maximum transmission unit (MTU)
  • The largest size of a data packet that a network-connected device can transmit. Larger MTU sizes (jumbo frames) can improve network performance for certain workloads (see the example after this list). For more information, see Maximum transmission unit.
private services access
  • Private services access is a private connection between your Virtual Private Cloud (VPC) network and networks owned by Google or third-party service providers. It allows virtual machine (VM) instances in your VPC network to communicate with these services using internal IP addresses, avoiding exposure to the public internet. For more information, see Private services access.
VPC Network Peering
  • A networking connection that allows two VPC networks to communicate privately. In the context of Managed Training on reserved clusters, VPC Network Peering is a critical component for integrating essential services. For instance, it is the required method for connecting your cluster's VPC to a Filestore instance, which provides the necessary shared `/home` directory for all the nodes in your cluster.
zone
  • A specific deployment area within a Google Cloud region. In the context of Managed Training on reserved clusters, for best performance, all components of the service (the cluster, Filestore, and Managed Lustre instances) should be created in the same zone.
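
As an illustration of the jumbo-frames point in the MTU entry above, the following creates a VPC network with a larger MTU. The network name is a placeholder; --mtu is a standard gcloud flag, with values up to 8896 supported.

```bash
# Create a custom-mode VPC network with a jumbo-frame MTU, which can improve
# throughput for large-scale training traffic. "training-vpc" is a placeholder.
gcloud compute networks create training-vpc \
  --subnet-mode=custom \
  --mtu=8896
```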

Integrated storage technologies

Cloud Storage FUSE
  • An open-source FUSE adapter that lets you mount Cloud Storage buckets as a file system on Linux or macOS systems (see the mount example after this list). For more information, see Cloud Storage FUSE.
Filestore
  • A fully managed, high-performance file storage service from Google Cloud, often used for applications that require a shared file system. For more information, see Filestore overview.
Managed Lustre
  • A parallel, distributed file system designed for high-performance computing. Google Cloud's Managed Lustre provides a high-throughput file system for demanding workloads. For more information, see Managed Lustre overview.
performance tier
  • A configuration setting for a Managed Lustre instance that defines its throughput speed (in MBps per TiB) and affects its minimum and maximum capacity.
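
For example, mounting a bucket with the gcsfuse CLI looks like the following. The bucket name and mount point are placeholders.

```bash
# Mount a Cloud Storage bucket as a local file system with Cloud Storage FUSE.
mkdir -p "$HOME/my-bucket"
gcsfuse my-bucket "$HOME/my-bucket"

# Unmount when finished.
fusermount -u "$HOME/my-bucket"
```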

Key ML and MLOps concepts

checkpoint
  • Data that captures the state of a model's parameters either during training or after training is completed. For example, during training, you can:
    1. Stop training, perhaps intentionally or perhaps as the result of certain errors.
    2. Capture the checkpoint.
    3. Later, reload the checkpoint, possibly on different hardware.
    4. Restart training.
    Within Gemini, a checkpoint refers to a specific version of a Gemini model trained on a specific dataset.
supervised fine-tuning (SFT)
  • A machine learning technique where a pre-trained model is further trained on a smaller, labeled dataset to adapt it to a specific task.
Vertex AI Inference
  • A Vertex AI service that lets you use a trained machine learning (ML) model to make inferences from new, unseen data. Vertex AI provides services to deploy models for inference. For more information, see Get inferences from a custom trained model.
Vertex AI Model Registry
  • The Vertex AI Model Registry is a central repository where you can manage the lifecycle of your ML models. From the Vertex AI Model Registry, you have an overview of your models so you can better organize, track, and train new versions. When you have a model version you would like to deploy, you can assign it to an endpoint directly from the registry, or deploy models to an endpoint using aliases. For more information, see Introduction to the Vertex AI Model Registry.
Vertex AI TensorBoard
  • Vertex AI TensorBoard is a managed, scalable service on Google Cloud that enables data scientists and ML engineers to visualize their machine learning experiments, debug model training, and track performance metrics using the familiar open-source TensorBoard interface. It integrates seamlessly with Vertex AI Training and other services, providing persistent storage for experiment data and allowing collaborative analysis of model development. For more information, see Introduction to Vertex AI TensorBoard.
