Cluster resiliency

If you're interested in Vertex AI training clusters, contact your sales representative for access.

Vertex AI training clusters integrates a comprehensive health check system to ensure the reliability of compute nodes and the stability of your Slurm jobs. This system features both automated and manual recovery options. An automated process runs during job execution to monitor critical components like GPU health and disk usage, automatically replacing nodes that fail. For situations requiring user intervention, your training cluster provides a reportFaultyNodes API, letting you manually delete a specific faulty node or report a suspected hardware failure on its underlying host.

Run a test workload to verify GPU functionality

Step 1: Connect to the cluster nodes using SSH

From Cloud Shell or the Google Cloud console, connect to the login node using IAP. The following example shows the command for Cloud Shell:

gcloud compute ssh --zone $ZONE "LOGIN_NODE_NAME" --tunnel-through-iap --project $PROJECT_ID
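If you haven't set these environment variables yet, define them first. The following is an illustrative example; the project ID, zone, and login node name are placeholder values, so substitute your own.

# Placeholder values for illustration; replace with your own.
export PROJECT_ID=my-project
export ZONE=us-central1-a
gcloud compute ssh --zone $ZONE "my-cluster-login-001" --tunnel-through-iap --project $PROJECT_ID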

Step 2: Run a standard Slurm command

After connecting to a login node, run a few standard Slurm commands to verifythat the cluster is functioning correctly.

~$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
partition1*    up   infinite      2   idle hcsa3m1236-a3mnodeset-[0-1]

Next, submit a batch job.

~$ sbatch --qos normal --wrap "echo start! && sleep 10s && echo done!"

You should see that a slurm-job-id.out file is created in your home directory.
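To confirm that the job ran, you can check the queue and print the job's output file. This is a quick illustrative check; replace JOB_ID with the job ID that sbatch printed.

~$ squeue -u $USER        # the job leaves the queue once it completes
~$ cat slurm-JOB_ID.out
start!
done!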

Step 3: Run a GPU workload

Save the following content as a script file named test.sh in your home directory.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:8
#SBATCH --job-name=nvidia_smi
srun nvidia-smi -L

Set the script's permissions to 755 (for example, chmod 755 ./test.sh) to make it executable, then submit the Slurm job:

~$ sbatch ./test.sh

Slurm saves the script's output to a file named slurm-job-id.out.

Expected output:

GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-f75045e8-4d87-49d1-2eb9-39ec2baddf9b)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-b91556d8-5215-d0ed-50b8-a88720e5b29c)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-7600155a-0036-35f5-9489-a7b4ed0ce887)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-a402e125-7841-033f-f08b-7921526c121f)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-20eef8f8-b2c7-1716-5ce7-7f64475bd2c0)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-65463286-e587-b52f-4d5b-8880eecbf0e7)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-d5ff75e7-dd54-edf6-a684-33c26fc365e1)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-26e81ae2-11fd-9d7e-95b6-c186e5173007)
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-e66a185a-b40c-81d9-d35d-19cab811df34)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-d23e5cf7-afd8-bec2-1487-9e27eeb6aae0)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-4dde1b05-ea5e-01e9-5c1e-e1c0d3b4b113)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-3a0d734a-6fb8-d841-a97f-d6846553ea7f)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-76fe0d37-08b2-a3a6-8ddf-55501426bc7c)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-9e0a41e1-b399-8934-01af-6198b749c02a)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-dddd09ee-c944-1098-9c4e-d96f8762ecb1)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-df52c109-0ac1-30cc-226b-85b1a8a6bc16)

Cluster health verification

This section shows how to test your training cluster using the Cluster Health Scanner (CHS) tool, which is pre-installed on the training cluster image. The CHS tool checks the health of the cluster, running tests such as DCGM diagnostics and NCCL tests to verify that the cluster is ready to run your workloads.

Important: The compute nodes must have outbound internet access for the DCGM diagnostic and cluster validation tests to succeed. Ensure that either public IP addresses are enabled for the compute nodes ("enable_public_ips": "true") or a Cloud NAT gateway is deployed in the cluster's network.
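If the compute nodes don't have public IP addresses, one option is to deploy Cloud NAT in the cluster's VPC network. The following is a minimal sketch; the router and NAT gateway names are illustrative, and NETWORK_NAME, REGION, and PROJECT_ID are placeholders for your own network, region, and project.

gcloud compute routers create cluster-nat-router \
    --network=NETWORK_NAME --region=REGION --project=PROJECT_ID
gcloud compute routers nats create cluster-nat-gateway \
    --router=cluster-nat-router --region=REGION --project=PROJECT_ID \
    --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges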

From the login node of the cluster, you can run the following script to run tests using the CHS tool.

Note: This script is written for a cluster with two A3 Ultra nodes, but it can be modified to fit your cluster setup.
export CLUSTER_ID=<your_cluster_id>
export PARTITION=a3u
export MACHINE_TYPE=a3-ultragpu-8g
cd ~
/opt/cluster-health-scanner/deploy/slurm/cluster-validation.sh \
    --nodelist=${CLUSTER_ID}-${PARTITION}-[0-1] \
    --nodes=2 \
    --partition=${PARTITION} \
    --machine-type=${MACHINE_TYPE} \
    --relative-exec-path=../../opt/cluster-health-scanner/deploy/slurm \
    --results-dir=results

A successful test run provides two sets of results:

  • Summary Output: A brief summary is printed to the console, which should resemble the following example.
  • Detailed Logs: For a complete report, see the detailed logs saved in the ~/results directory.
Starting DCGM Diagnostics...
DCGM diagnostics passing on all nodes!
Starting NCCL all_reduce_perf...
CURR_NODES: cluster-id-0 cluster-id-1
NCCL test passing on all nodes!
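To browse the detailed logs, list the results directory from your home directory; the exact file names depend on which tests ran.

~$ ls ~/results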

Automated health checks and recovery

To ensure node reliability, training clusters continuously monitors node health using the following suite of automated checks. Training clusters runs health checks during the Slurm prolog (before a job starts) and epilog (after a job completes).

Health check suite

  • GPU Health: Performs detailed, individual GPU diagnostics including nvidia-smi, dcgmi, and XID code monitoring.
  • Disk Usage: Checks for high disk usage on critical partitions (/, /mnt/localssd, /mnt/localdisk) to prevent jobs from failing due to lack of space.
  • Network Health: Verifies that primary network interfaces have an IPv4 address. If an issue is found, it attempts to self-heal by resetting the interface.
  • CPU Load: Monitors the system's load average and logs a warning if it exceeds a predefined threshold.
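If you want to spot-check a node yourself, most of these signals can be inspected with standard tools from a shell on the compute node. The following is an illustrative sketch only, not the managed health check itself; it assumes a typical node image with the NVIDIA and DCGM tools installed.

# Illustrative manual spot checks; not the managed health check.
nvidia-smi -L                               # confirm that all GPUs enumerate
dcgmi diag -r 1                             # quick DCGM diagnostic (run level 1)
sudo dmesg --level=err,warn | grep -i xid   # look for recent NVIDIA XID errors
df -h / /mnt/localssd /mnt/localdisk        # check disk usage on critical partitions
uptime                                      # inspect the load average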

Failure recovery process

If a check detects a severe, unrecoverable error, Vertex AI training clusters automatically initiates a failure recovery process. The standard process involves draining the faulty nodes, requeuing the affected Slurm job, and then deleting and recreating the drained nodes to restore them to a healthy state.

This automated recovery is subject to the following conditions:

  • Restart Limit: The recovery process is skipped if the affected Slurm job has already been restarted a set number of times.

  • GPU Utilization: Node deletion and recreation is also skipped if the job running on the node doesn't use all of the available GPUs. In this case, the node is only drained to prevent new jobs from being scheduled on it.
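You can observe this behavior from the login node with standard Slurm commands. A minimal sketch; NODE_NAME and JOB_ID are placeholders for your own node and job.

sinfo -R                                      # list drained or down nodes and the recorded reason
scontrol show node NODE_NAME                  # inspect a node's state in detail
scontrol show job JOB_ID | grep -i restarts   # see how many times a job has been restarted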

Manually managing faulty compute nodes

Training clusters provides APIs for manually reporting and managing faulty compute nodes, which is particularly useful if automated health checks don't resolve an issue. You can only run these operations on one node at a time.

  • Delete Node: Removes a specified faulty node from the cluster. This is the default action. Best for general errors or when a node is unresponsive and needs to be recycled.
  • Report Host as Faulty: Reports the underlying physical host as faulty, triggering a repair or migration process. Best for suspected hardware failures on the physical machine hosting the GPU node.

Action 1: Delete a faulty node

This action deletes the specified node. The outcome of this operation depends on whether the node is classified as "static" or "dynamic" by Slurm:

  • Static Nodes: If a deleted node's index is less than the minimum node count of the node pool, a new compute node is recreated with the same name and specifications.

  • Dynamic Nodes: If a deleted node's index is greater than the minimum node count, it's only recreated if there is a pending workload scheduled for it. Otherwise, it is removed.

These examples use a gcurl alias, which is a convenient, authenticated shortcut for interacting with the API endpoints. The following command creates an alias for curl that includes the required authorization headers.

alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'

API request to delete a node

To delete a faulty node, execute the following POST request. The NODE_ID should be in the format CLUSTER_ID-NODEPOOL_ID-INDEX.

  gcurl -X POST \
    https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/CLUSTER_ID:reportFaultyNodes \
    -d '{"nodeActions": [{"nodeId": "NODE_ID"}]}'
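For example, with hypothetical values (project my-project, region us-central1, cluster my-cluster, node pool a3u, node index 0), the request would look like the following. These values are placeholders only; substitute your own.

  gcurl -X POST \
    https://us-central1-aiplatform.googleapis.com/v1beta1/projects/my-project/locations/us-central1/modelDevelopmentClusters/my-cluster:reportFaultyNodes \
    -d '{"nodeActions": [{"nodeId": "my-cluster-a3u-0"}]}'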

Check operation status
You can monitor the result of the reportFaultyNodes action by checking the operation status. The OPERATION_ID is part of the operation name returned in the response to the reportFaultyNodes request.

  gcurl https://REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/operations/OPERATION_ID

Action 2: Report a host as faulty

Caution: This is a pre-GA feature. Before reporting a host, thoroughly investigate the issue to identify the root cause. Only use this action if you have no other alternatives. Consult the report faulty host documentation and the Google Cloud support team before proceeding.

You can report the physical host of a GPU node as faulty if you suspect a hardware failure.

  • Supported VMs: A3 Ultra and A4 High-GPU

  • Node State: The target node must be in a RUNNING state before you call the API. It will transition to REPAIRING upon a successful call and return to RUNNING after the host is repaired or the node is recreated on a new host. This is a best-effort operation.

Prerequisite: Grant IAM role

To use this feature, you must grant the Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1) role to the Vertex AI Service Agent.

PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com" \
    --role="roles/compute.instanceAdmin.v1"

API request to report a host

Execute the following POST request to report the underlying host as faulty. When doing so, you must provide one or more observed behaviors and descriptions for the faultReasons.
For the behavior field, use one of the following values:

  • PERFORMANCE: The GPUs attached to the VM have performance issues compared to other GPUs in the cluster, you see no XID errors in the logs, and Compute Engine detects no other usual failure patterns such as silent data corruption.
  • SILENT_DATA_CORRUPTION: You see data corruption in your VM, but the VM keeps running. This can be due to issues like vCPU defects, software bugs, or kernel issues.
  • UNRECOVERABLE_GPU_ERROR: You have identified an unrecoverable GPU error with an XID.
  • BEHAVIOR_UNSPECIFIED: You are not sure what the issue with your VM is.
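Before selecting UNRECOVERABLE_GPU_ERROR, it can help to confirm the XID in the node's kernel log. The following is an illustrative check run from the affected node over SSH, not a required step.

# NVIDIA driver XID errors are logged by the kernel as "NVRM: Xid" messages.
sudo dmesg -T | grep -i "NVRM: Xid"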

Here is an example of the API request.

gcurl -X POST \  https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/CLUSTER_ID:reportFaultyNodes \  -d '{"nodeActions": [{"nodeId": "NODE_ID", "reportFaultyHost": {"faultReasons": [{"behavior": "BEHAVIOR_1", "description": "DESCRIPTION_1"}, {"behavior": "BEHAVIOR_2", "description": "DESCRIPTION_2"}]}}]}'

Putting it all together

By leveraging both automated health checks and the manual controls detailed on this page, you can maintain a highly resilient training environment. Proactively managing the health of your cluster by deleting faulty nodes or reporting hardware issues helps ensure maximum uptime and the successful completion of your training jobs. For persistent or complex issues, consider consulting the Google Cloud support team for in-depth diagnostics and assistance.

What's next

Configuring your training cluster for fault tolerance is a key step in building a complete, production-ready MLOps workflow.

  • Monitor and debug your training jobs: Track the progress, resource utilization, and health of your training jobs, including how to identify when a node has been recovered or a job has been restarted due to a failure.
  • Orchestrate your resilient jobs with Vertex AI Pipelines: For production environments, use Vertex AI Pipelines to create an automated, repeatable workflow that submits your resilient training jobs to your cluster.
  • Manage and deploy your model: Once your resilient training job is complete, use Vertex AI Model Registry to version your model artifact before deploying the model to an endpoint to serve online inference requests.
