Create an AI-optimized Slurm cluster with an A4 machine type

This page describes how to quickly create and deploy an AI-optimized Slurm cluster using A4 accelerator-optimized machine types with the gcloud CLI and Cluster Toolkit.

A4 accelerator-optimized machine types come with NVIDIA B200 GPUs attached and are specifically engineered for intensive AI computation to help your Slurm cluster efficiently handle large-scale model training and inference. For more information on A4 accelerator-optimized machine types on Google Cloud, see GPU machine types.

Important: To complete this tutorial, you must first contact your Google Technical Account Manager (TAM) to reserve a capacity block for the A4 machine type. Once approved, this capacity is added to your Google Cloud project. The capacity approval process can take several days.

Additionally, running this tutorial can incur costs to your Google Cloud project.

Tutorial overview

This tutorial describes the steps to set up an AI-optimized Slurm cluster using A4 accelerator-optimized machine types. Specifically, you set up a cluster with Compute Engine virtual machines, create a Cloud Storage bucket to store the necessary Terraform modules, and set up a Filestore instance to provision your Slurm cluster. To complete the steps in this tutorial, you follow this process:

Set up your Google Cloud project with the required permissions and environment variables.
Set up a Cloud Storage bucket.
Set up Cluster Toolkit.
Switch to the Cluster Toolkit directory.
Create a Slurm deployment YAML file.
Provision a Slurm cluster using a blueprint.
Connect to the Slurm cluster.

Before you begin

Reserve a capacity block for one a4-highgpu-8g machine. These machines are required for this tutorial.
Ensure that you have enough Filestore quota to provision the Slurm cluster. You need a minimum of 10,240 GiB of zonal capacity (also known as high scale SSD capacity).
To check your Filestore quota, view Quotas & System limits in the Google Cloud console and filter the table to only show Filestore resources.
- For detailed instructions on checking Filestore quotas, see View API-specific quota.
- If you don't have enough quota, request a quota increase.
Make sure that billing is enabled for your Google Cloud project.
Enable the Compute Engine, Filestore, Cloud Storage, Service Usage, and Cloud Resource Manager APIs:
Enable the APIs
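
As an alternative to the console button, the five APIs listed above can also be enabled from the command line. The following is a minimal sketch, assuming that the gcloud CLI is installed and that the active gcloud project is the one you want to use. The function name `enable_required_apis` is hypothetical; the service names are the standard identifiers for these APIs.

```shell
# Hypothetical helper: enables the five APIs this tutorial relies on
# in the currently active gcloud project.
enable_required_apis() {
  gcloud services enable \
    compute.googleapis.com \
    file.googleapis.com \
    storage.googleapis.com \
    serviceusage.googleapis.com \
    cloudresourcemanager.googleapis.com
}
```

Run `enable_required_apis` once after selecting your project. Enabling an API that is already enabled is harmless.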

Required roles

To ensure that the Compute Engine default service account has the necessary permissions to deploy a Slurm cluster, ask your administrator to grant the Compute Engine default service account the following IAM roles:

Important: You must grant these roles to the Compute Engine default service account, not to your user account. Failure to grant the roles to the correct principal might result in permission errors.

Storage Object Viewer (roles/storage.objectViewer) on your project
Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1) on your project
Service Account User (roles/iam.serviceAccountUser) on the service account itself

For more information about granting roles, see Manage access to projects, folders, and organizations.

Your administrator might also be able to give the Compute Engine default service account the required permissions through custom roles or other predefined roles.
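
For administrators who prefer the command line, the role grants above might look like the following sketch. It assumes that the caller is allowed to change IAM policy on the project and that the Compute Engine default service account uses its usual `PROJECT_NUMBER-compute@developer.gserviceaccount.com` address. The function name `grant_default_sa_roles` and its `PROJECT_ID` argument are hypothetical.

```shell
# Hypothetical helper: grants the three roles listed above to the
# Compute Engine default service account of the given project.
# Usage: grant_default_sa_roles PROJECT_ID
grant_default_sa_roles() {
  project_id="$1"

  # The default service account address is built from the project number.
  project_number="$(gcloud projects describe "${project_id}" \
    --format='value(projectNumber)')"
  sa_email="${project_number}-compute@developer.gserviceaccount.com"

  # Project-level roles.
  for role in roles/storage.objectViewer roles/compute.instanceAdmin.v1; do
    gcloud projects add-iam-policy-binding "${project_id}" \
      --member="serviceAccount:${sa_email}" \
      --role="${role}"
  done

  # Service Account User is granted on the service account itself.
  gcloud iam service-accounts add-iam-policy-binding "${sa_email}" \
    --project="${project_id}" \
    --member="serviceAccount:${sa_email}" \
    --role="roles/iam.serviceAccountUser"
}
```

The first two bindings are set on the project, while the third is set on the service account resource, which matches the scopes named in the list above.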

Costs

The cost of running this tutorial varies by each section you complete, such as setting up the tutorial or running jobs. You can calculate the cost by using the pricing calculator.

To estimate the cost for setting up this tutorial, use the following specifications:
- Filestore (standard) capacity per region: 10,240 GiB.
- Standard persistent disk: 50 GB pd-standard for the Slurm login node.
- Performance (SSD) persistent disks: 50 GB pd-ssd for the Slurm controller.
- VM instance: 1 a4-highgpu-8g.

Launch Cloud Shell

In this tutorial, you use Cloud Shell, which is a shell environment for managing resources hosted on Google Cloud.

Cloud Shell comes preinstalled with the Google Cloud CLI. The gcloud CLI provides the primary command-line interface for Google Cloud.

Launch Cloud Shell:

Go to the Google Cloud console.
Google Cloud console
From the upper-right corner of the console, click the Activate Cloud Shell button:

A Cloud Shell session starts and displays a command-line prompt. You use this shell to run gcloud and Cluster Toolkit commands.

Note: You need enough Cloud Shell storage to run this tutorial successfully. We recommend checking that you have enough space available before you start. If you don't, you can reset Cloud Shell to a clean slate. For more information, see Reset Cloud Shell.
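
One way to check the available space is with the standard `df` utility. The following is a small sketch, not a Cloud Shell feature; the function name `free_kib` is hypothetical, and the tutorial doesn't state an exact amount of space that is required.

```shell
# Prints the free space, in KiB, on the file system that holds the
# given directory (defaults to the home directory).
free_kib() {
  # -P forces one output line per file system; column 4 is "Available".
  df -Pk "${1:-$HOME}" | awk 'NR == 2 { print $4 }'
}

echo "Free space in home directory: $(free_kib) KiB"
```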

Set environment variables

In Cloud Shell, set the following environment variables to use for the remainder of the tutorial. These environment variables set placeholder values for the following tasks:

Configure your project with the relevant values to access your reserved a4-highgpu-8g machine.
Set up a Cloud Storage bucket to store Cluster Toolkit modules.

Reservation capacity variables

Note: These values must match the reserved capacity block details provided by your Technical Account Manager (TAM) when the capacity was delivered.

export A4_RESERVATION_PROJECT_ID=A4_RESERVATION_PROJECT_ID
export A4_RESERVATION_NAME=A4_RESERVATION_NAME
export A4_DEPLOYMENT_NAME=A4_DEPLOYMENT_NAME
export A4_REGION=A4_REGION
export A4_ZONE=A4_ZONE
export A4_DEPLOYMENT_FILE_NAME=A4_DEPLOYMENT_FILE_NAME

Replace the following:

A4_RESERVATION_PROJECT_ID: the Google Cloud project ID that was granted the A4 machine type reservation block.
A4_RESERVATION_NAME: the name of the GPU reservation that's used in your project. For example, a4high-exr.
A4_DEPLOYMENT_NAME: a unique name for your Slurm cluster deployment. For example, my-slurm-cluster-deployment.
A4_REGION: the region that is running the reserved A4 machine reservation block. For example, us-central1.
A4_ZONE: the zone that contains the reserved machines. This string must contain both the region and zone. For example, us-central1-a.
A4_DEPLOYMENT_FILE_NAME: a unique name for your Slurm blueprint YAML file. If you run through this tutorial more than once, choose a unique deployment name each time.
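
Because later commands silently expand an unset variable to an empty string, it can help to check the variables before continuing. The following sketch reports any reservation variable that is unset or empty and confirms that the zone begins with the region, as in `us-central1` and `us-central1-a`. The function name `check_a4_vars` is hypothetical and isn't part of the gcloud CLI or Cluster Toolkit.

```shell
# Hypothetical helper: reports which reservation variables are unset or
# empty, and checks that the zone belongs to the region.
check_a4_vars() {
  status=0
  for name in A4_RESERVATION_PROJECT_ID A4_RESERVATION_NAME \
      A4_DEPLOYMENT_NAME A4_REGION A4_ZONE A4_DEPLOYMENT_FILE_NAME; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      status=1
    fi
  done

  # A zone such as us-central1-a must start with its region, us-central1.
  case "${A4_ZONE:-}" in
    "${A4_REGION:-}"-?*) ;;
    *) echo "A4_ZONE does not start with A4_REGION" >&2; status=1 ;;
  esac

  return "${status}"
}
```

For example, `check_a4_vars && echo "variables look complete"` prints the confirmation only when every variable is set and the zone matches the region.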

Storage capacity variables

Create the environment variables for your Cloud Storage bucket.

Cluster Toolkit uses blueprints to define and deploy clusters of VMs. A blueprint defines one or more Terraform modules to provision Cloud infrastructure. This bucket is used to store these blueprints.

export GOOGLE_CLOUD_BUCKET_NAME=GOOGLE_CLOUD_BUCKET_NAME
export GOOGLE_CLOUD_BUCKET_LOCATION=GOOGLE_CLOUD_BUCKET_LOCATION

Replace the following:

GOOGLE_CLOUD_BUCKET_NAME: the name that you want to use for your Cloud Storage bucket that meets the bucket naming requirements.
GOOGLE_CLOUD_BUCKET_LOCATION: any Google Cloud region of your choice, where the bucket will be hosted. For example, us-central1.
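
A quick local check can catch the most common bucket naming mistakes before you try to create the bucket. The sketch below covers only the basic rules for names without the extended dotted form: 3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots only; and a letter or digit at each end. It is a partial check, it doesn't cover rules such as reserved prefixes, and the function name `is_plausible_bucket_name` is hypothetical.

```shell
# Hypothetical helper: partial check of a bucket name against the basic
# naming rules (length, allowed characters, first and last character).
# Passing this check doesn't guarantee that the name is accepted.
is_plausible_bucket_name() {
  # LC_ALL=C keeps the a-z range limited to ASCII lowercase letters.
  printf '%s\n' "$1" |
    LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}
```

For example, `is_plausible_bucket_name "${GOOGLE_CLOUD_BUCKET_NAME}" || echo "check the bucket name"` prints a warning for a name that contains uppercase letters.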

Switch to your A4-approved project

Run the following command to ensure that you are in the Google Cloud project that has the approved reservation block for the A4 machine type.

gcloud config set project ${A4_RESERVATION_PROJECT_ID}

Create a Cloud Storage bucket

A best practice when working with Terraform is to store the state remotely in a version-enabled file. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.

Create the bucket to store your Terraform modules. From Cloud Shell, using your environment variables, run the following commands. The first command creates the bucket and the second enables versioning on it:

gcloud storage buckets create gs://${GOOGLE_CLOUD_BUCKET_NAME} \
    --project=${A4_RESERVATION_PROJECT_ID} \
    --default-storage-class=STANDARD \
    --location=${GOOGLE_CLOUD_BUCKET_LOCATION} \
    --uniform-bucket-level-access
gcloud storage buckets update gs://${GOOGLE_CLOUD_BUCKET_NAME} --versioning

Set up the Cluster Toolkit

To create a Slurm cluster in a Google Cloud project, you can use Cluster Toolkit to handle deploying and provisioning the cluster. Cluster Toolkit is open-source software offered by Google Cloud to simplify the process of deploying workloads on Google Cloud.

Use the following steps to set up Cluster Toolkit.

Clone the Cluster Toolkit GitHub repository

In Cloud Shell, clone the GitHub repository:

git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git

Go to the main working directory:
```
cd cluster-toolkit/
```

Build the Cluster Toolkit binary

In Cloud Shell, build the Cluster Toolkit binary from source by running the following command:
```
make
```
To deploy an A4 High accelerator-optimized machine Slurm cluster, you must use version v1.51.1 or later of Cluster Toolkit.
To verify the build and check its version, run the following command:
```
./gcluster --version
```
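
To compare the reported version against the v1.51.1 minimum without reading the numbers by eye, a small version comparison can help. This is a sketch under two assumptions: `sort -V` (version sort from GNU coreutils) is available, and you copy the version string from the command output yourself, because the exact output format of `./gcluster --version` isn't shown here. The function name `version_at_least` is hypothetical.

```shell
# Returns success when version $1 is greater than or equal to version $2.
# A leading "v" is ignored. Relies on `sort -V` from GNU coreutils.
version_at_least() {
  have="${1#v}"
  want="${2#v}"
  # After a version sort, the first line is the lower of the two versions.
  lowest="$(printf '%s\n%s\n' "${have}" "${want}" | sort -V | head -n 1)"
  [ "${lowest}" = "${want}" ]
}
```

For example, `version_at_least v1.52.0 v1.51.1` succeeds, while `version_at_least v1.9.0 v1.51.1` fails because 9 is lower than 51 in a version-aware comparison.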
After building the binary, you are ready to deploy clusters to run your jobs or workloads.

Create a deployment file

In the Cluster Toolkit directory, create your Slurm deployment YAML file.
```
nano ${A4_DEPLOYMENT_FILE_NAME}.yaml
```

Paste the following content into the YAML file.

---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: GOOGLE_CLOUD_BUCKET_NAME

vars:
  deployment_name: A4_DEPLOYMENT_FILE_NAME
  project_id: A4_RESERVATION_PROJECT_ID
  region: A4_REGION
  zone: A4_ZONE
  a4h_reservation_name: A4_RESERVATION_NAME
  a4h_cluster_size: 1

To save and exit the file, press Ctrl+O > Enter > Ctrl+X.
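
Instead of pasting the content and replacing each placeholder by hand, you could generate the file from the environment variables that you set earlier. The following sketch mirrors the content shown above, including its use of `A4_DEPLOYMENT_FILE_NAME` for `deployment_name`. The function name `write_deployment_file` is hypothetical and isn't part of Cluster Toolkit.

```shell
# Hypothetical helper: writes the deployment file from the environment
# variables set earlier in this tutorial.
# Usage: write_deployment_file OUTPUT_PATH
write_deployment_file() {
  # The unquoted EOF delimiter lets the shell expand each ${VARIABLE}.
  cat > "$1" <<EOF
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: ${GOOGLE_CLOUD_BUCKET_NAME}

vars:
  deployment_name: ${A4_DEPLOYMENT_FILE_NAME}
  project_id: ${A4_RESERVATION_PROJECT_ID}
  region: ${A4_REGION}
  zone: ${A4_ZONE}
  a4h_reservation_name: ${A4_RESERVATION_NAME}
  a4h_cluster_size: 1
EOF
}
```

For example, `write_deployment_file "${A4_DEPLOYMENT_FILE_NAME}.yaml"` creates the file in the current directory. Review the result with `cat` before deploying, because an unset variable produces an empty value.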

Provision the Slurm cluster

To provision the Slurm cluster, run the following deployment command. This command provisions the Slurm cluster with the examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml blueprint file.

Note: Provisioning the cluster can take up to an hour.

Cloud Shell has an inactivity timeout that stops shells from running any processes after 40 minutes. If a timeout occurs, a dialog appears that asks if you want to reauthorize your session. To continue the deployment after a timeout, click Reauthorize.

In Cloud Shell, start the cluster creation.

./gcluster deploy -d ${A4_DEPLOYMENT_FILE_NAME}.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml --auto-approve

Connect to the cluster

After deploying, connect to the Google Cloud console to view your cluster.

Go to the Compute Engine > VM instances page in the Google Cloud console.
Go to VM instances
Locate the login node (a4high-login-001 or similar).
Click SSH to connect.
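
If you prefer to stay in Cloud Shell, the same connection can be made with the gcloud CLI. The following sketch assumes that the login node's name contains "login", as in the a4high-login-001 example above. The function names `list_login_nodes` and `ssh_to_node` are hypothetical wrappers around standard gcloud commands.

```shell
# Hypothetical helper: lists VMs in the given zone whose name contains
# "login". Usage: list_login_nodes ZONE
list_login_nodes() {
  gcloud compute instances list \
    --zones="$1" \
    --filter="name~login" \
    --format="value(name)"
}

# Hypothetical helper: opens an SSH session to the named VM.
# Usage: ssh_to_node NODE_NAME ZONE
ssh_to_node() {
  gcloud compute ssh "$1" --zone="$2"
}
```

For example, run `list_login_nodes "${A4_ZONE}"` to find the node name, and then `ssh_to_node NODE_NAME "${A4_ZONE}"` to connect. Once connected, the Slurm `sinfo` command lists the cluster's partitions and node states.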

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Destroy the Slurm cluster

We recommend that you clean up your resources when they are no longer needed.

By default, the A4 High blueprints enable deletion protection on the Filestore instance. When destroying the Slurm cluster, you must disable deletion protection before running the destroy command.

Disable deletion protection

To disable deletion protection when you update an instance, use a command similar to the following:

gcloud filestore instances update INSTANCE_NAME \
    --no-deletion-protection

Replace INSTANCE_NAME with the name of the instance you want to edit. For example, my-genomics-instance.

To find the INSTANCE_NAME, you can run gcloud filestore instances list. This command lists all the Filestore instances in your current Google Cloud project, including their names, locations (zones), tiers, capacity, and status.

After running the command, find the Filestore instance that matches the a4-highgpu-8g machine that's running in this tutorial.

Destroy the Slurm cluster

Before running the destroy command, navigate to the root of the Cluster Toolkit directory. By default, DEPLOYMENT_FOLDER is located at the root of the Cluster Toolkit directory.
To destroy the cluster, run:
```
./gcluster destroy DEPLOYMENT_FOLDER --auto-approve
```
Replace DEPLOYMENT_FOLDER with the name of the deployment folder. It's typically the same as DEPLOYMENT_NAME.

When destruction is complete, you see a message similar to the following:

Destroy complete! Resources: xx destroyed.

Delete the storage bucket

Delete the Cloud Storage bucket after you make sure that the previous command ended without errors:

gcloud storage buckets delete gs://${GOOGLE_CLOUD_BUCKET_NAME}

Troubleshooting

Error: Cloud Shell can't provision the cluster because there is no storage left.
You might see this error if you are a frequent user of Cloud Shell and you have run out of storage space.
To resolve this issue, see Disable or reset Cloud Shell.
Error: Cluster or blueprint name already exists.
You might see this error if you are using a project that has already used the exact file names used in this tutorial, for example, if someone else in your organization has already run through this tutorial end-to-end.
To resolve this issue, run through the tutorial again, choose a unique name for the deployment file, and rerun the command in Provision the Slurm cluster with the new deployment file.

What's next

Advanced Slurm tasks:
- Learn how to Redeploy the Slurm cluster
- Learn how to Test network performance on the Slurm cluster
Learn how to manage host events:
- Manage host events across VMs
- Manage host events across reservations
View VMs topology
Monitor VMs in your Slurm cluster
Report a faulty host

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-17 UTC.
