Deploy an A3 Mega Slurm cluster for ML training
This document outlines the deployment steps for provisioning an A3 Mega (a3-megagpu-8g) Slurm cluster that is ideal for running large-scale artificial intelligence (AI) and machine learning (ML) training workloads.
Before you begin
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
To identify the regions and zones where the a3-megagpu-8g machine type is available, run the following command:
gcloud compute machine-types list --filter="name=a3-megagpu-8g"
Verify that you have enough GPU quota. Each a3-megagpu-8g machine has eight H100 80GB GPUs attached, so you need at least eight NVIDIA H100 80GB GPUs in your selected region. You can also check your current quota from the CLI, as shown in the sketch after this list.
- To view quotas, see View the quotas for your project. In the Filter field, select Dimensions (e.g. location) and specify gpu_family:NVIDIA_H100_MEGA.
- If you don't have enough quota, request a higher quota.
Verify that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity. If you don't have enough quota, request a quota increase.
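If you prefer to check GPU quota from the command line, the following sketch lists a region's GPU-related quotas by flattening its quotas field. The NVIDIA_H100 filter value is an assumption; confirm the exact quota metric name for A3 Mega GPUs in the Google Cloud console.
# List GPU-related quotas for a region. The metric filter is an assumption;
# adjust it to match the metric name shown in the console.
gcloud compute regions describe REGION \
    --project=PROJECT_ID \
    --flatten="quotas[]" \
    --filter="quotas.metric~NVIDIA_H100" \
    --format="table(quotas.metric, quotas.limit, quotas.usage)"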
Required roles
To ensure that the Compute Engine default service account has the necessary permissions to deploy a Slurm cluster, ask your administrator to grant it the following IAM roles:
Important: You must grant these roles to the Compute Engine default service account, not to your user account. Failure to grant the roles to the correct principal might result in permission errors.
- Storage Object Viewer (roles/storage.objectViewer) on your project
- Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1) on your project
- Service Account User (roles/iam.serviceAccountUser) on the service account itself
For more information about granting roles, see Manage access to projects, folders, and organizations.
Your administrator might also be able to give the Compute Engine default service account the required permissions through custom roles or other predefined roles.
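For reference, the following sketch shows how an administrator could grant these roles from the CLI. It assumes the standard address format for the Compute Engine default service account, PROJECT_NUMBER-compute@developer.gserviceaccount.com; replace PROJECT_ID and PROJECT_NUMBER with your own values.
# Grant the project-level roles to the Compute Engine default service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/storage.objectViewer"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/compute.instanceAdmin.v1"

# Grant Service Account User on the service account itself.
gcloud iam service-accounts add-iam-policy-binding \
    PROJECT_NUMBER-compute@developer.gserviceaccount.com \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"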
Install Cluster Toolkit
From the CLI, complete the following steps:
Install dependencies.
To provision Slurm clusters, we recommend that you use Cluster Toolkit version v1.51.1 or later. To install Cluster Toolkit, see Set up Cluster Toolkit.
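For reference, a typical installation follows the pattern below, assuming the prerequisites (such as Go, Terraform, Packer, and Git) are already installed; see Set up Cluster Toolkit for the authoritative steps.
# Clone the Cluster Toolkit repository and build the gcluster binary.
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
make

# Confirm that the binary was built.
./gcluster --version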
Switch to the Cluster Toolkit directory
After you have installed the Cluster Toolkit, check that you are in the Cluster Toolkit directory.
To go to the main Cluster Toolkit working directory, run the following command.
cd cluster-toolkit
Set up Cloud Storage bucket
Cluster blueprints use Terraform modules to provision Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in a version-enabled file. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.
To create this bucket and enable versioning from the CLI, run the following commands:
gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD \
    --location=BUCKET_REGION \
    --uniform-bucket-level-access

gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following:
- BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
- PROJECT_ID: your project ID.
- BUCKET_REGION: any available location.
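To confirm that the bucket exists and that versioning is enabled, you can describe it and check the versioning information in the output:
gcloud storage buckets describe gs://BUCKET_NAME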
Reserve capacity
Reservations help ensure that you have the available resources to create A3 Mega VMs with the specified hardware (memory, vCPUs, and GPUs) and attached Local SSD disks whenever you need them. To review the different methods to reserve resources for creating VMs, see Choose a reservation type.
For example, to create an on-demand, specifically targeted reservation for A3 Mega VMs, run the gcloud compute reservations create command with the --require-specific-reservation flag:
gcloud compute reservations create RESERVATION_NAME \
    --require-specific-reservation \
    --project=PROJECT_ID \
    --machine-type=a3-megagpu-8g \
    --vm-count=NUMBER_OF_VMS \
    --zone=ZONE
Replace the following:
- RESERVATION_NAME: the name of the single-project reservation that you want to use.
- PROJECT_ID: the ID of your project.
- NUMBER_OF_VMS: the number of VMs needed for the cluster.
- ZONE: a zone that has a3-megagpu-8g machine types. To review the zones where you can create A3 Mega VMs, see Accelerator availability.
After you destroy your Slurm cluster, you can delete the reservation if you don't need it anymore. For information, see Delete reservations.
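For example, to confirm that the reservation was created, and later to delete it after the cluster is destroyed, you can run the following commands:
# Check the reservation's status, size, and in-use count.
gcloud compute reservations describe RESERVATION_NAME --zone=ZONE

# Delete the reservation when you no longer need it.
gcloud compute reservations delete RESERVATION_NAME --zone=ZONE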
Update the deployment file
Using a text editor, open the examples/machine-learning/a3-megagpu-8g/a3mega-slurm-deployment.yaml file. In the deployment file, specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as project ID, region, and zone.
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: a3mega-base
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  network_name_system: NETWORK_NAME
  subnetwork_name_system: SUBNETWORK_NAME
  enable_ops_agent: true
  enable_nvidia_dcgm: true
  enable_nvidia_persistenced: true
  disk_size_gb: 200
  final_image_family: slurm-a3mega
  slurm_cluster_name: a3mega
  a3mega_reservation_name: RESERVATION_NAME
  a3mega_cluster_size: NUMBER_OF_VMS
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, created in the previous section.
- PROJECT_ID: your project ID.
- REGION: a region that has a3-megagpu-8g machine types.
- ZONE: a zone that has a3-megagpu-8g machine types.
- NETWORK_NAME: a name for your network. For example, a3mega-sys-net.
- SUBNETWORK_NAME: a name for your subnetwork. For example, a3mega-sys-subnet.
- RESERVATION_NAME: the name of the single-project reservation that you want to use.
- NUMBER_OF_VMS: the number of VMs needed for the cluster.
Make additional updates
If you have multiple reservations, you can update the deployment file to specify the additional reservations. To do this, see Scale A3 Mega clusters across multiple reservations.
Provision a Slurm cluster
Cluster Toolkit provisions the cluster based on the deployment file you created in the previous step and the default cluster blueprint.
To provision the cluster, run the command for your machine type from the Cluster Toolkit directory. This step takes approximately 30-40 minutes.
./gcluster deploy -d examples/machine-learning/a3-megagpu-8g/a3mega-slurm-deployment.yaml \
    examples/machine-learning/a3-megagpu-8g/a3mega-slurm-blueprint.yaml \
    --auto-approve
Connect to the A3 Mega Slurm cluster
To enable optimized NCCL communication tuning on your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.
Console
Go to the Compute Engine > VM instances page.
Locate the login node. It should have a name similar to a3mega-login-001.
From the Connect column of the login node, click SSH.
gcloud
To connect to the login node, use the gcloud compute ssh command.
gcloud compute ssh $(gcloud compute instances list --filter "name ~ login" --format "value(name)") \
    --tunnel-through-iap \
    --zone ZONE
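After the SSH session opens, you can confirm that the cluster is healthy by using standard Slurm commands, for example:
# Show the partitions and node states reported by the Slurm controller.
sinfo

# Show queued and running jobs.
squeue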
Run a NCCL test
After you connect to the login node, you can then Enable GPUDirect-TCPXO optimized NCCL communication.
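That page describes the optimized GPUDirect-TCPXO setup. As a generic illustration only, a NCCL all-reduce benchmark on a Slurm cluster is typically launched as shown below; this sketch assumes the nccl-tests binaries are built and available on shared storage, and it does not include the TCPXO environment setup from the linked page.
# Hypothetical example: run an all-reduce bandwidth test across two A3 Mega
# nodes (16 GPUs total), assuming nccl-tests is built at ./nccl-tests.
srun --nodes=2 --ntasks-per-node=8 --gpus-per-node=8 \
    ./nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1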
Redeploy the cluster
If you need to increase the number of compute nodes or add new partitions to your cluster, you might need to update the configuration of your Slurm cluster by redeploying. Redeployment can be sped up by using an existing image from a previous deployment. To avoid creating new images during a redeploy, specify the --only flag. To redeploy the cluster using an existing image, run the following command from the main Cluster Toolkit directory:
./gcluster deploy -d \
    examples/machine-learning/a3-megagpu-8g/a3mega-slurm-deployment.yaml \
    examples/machine-learning/a3-megagpu-8g/a3mega-slurm-blueprint.yaml \
    --only primary,cluster --auto-approve -w
This command is only for redeployments where an image already exists, because it only redeploys the cluster and its infrastructure.
Destroy the Slurm cluster
By default, the A3 Mega blueprints enable deletion protection on the Filestore instance. For the Filestore instance to be deleted when you destroy the Slurm cluster, see Set or remove deletion protection on an existing instance to disable deletion protection before running the destroy command.
Disconnect from the cluster if you haven't already.
Before running the destroy command, navigate to the root of the Cluster Toolkit directory. By default, DEPLOYMENT_FOLDER is located at the root of the Cluster Toolkit directory.
To destroy the cluster, run:
./gcluster destroy DEPLOYMENT_FOLDER --auto-approve
Replace DEPLOYMENT_FOLDER with the name of the deployment folder. It's typically the same as DEPLOYMENT_NAME.
When destruction is complete, you should see a message similar to the following:
Destroy complete! Resources: xx destroyed.
To learn how to cleanly destroy infrastructure and for advanced manual deployment instructions, see the deployment folder located at the root of the Cluster Toolkit directory: DEPLOYMENT_FOLDER/instructions.txt
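To confirm that the cluster VMs were removed, you can list any remaining instances whose names start with the slurm_cluster_name value from the deployment file (a3mega in this example):
gcloud compute instances list --filter="name~^a3mega"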