Create an H4D Slurm cluster with enhanced management capabilities
This page describes how to create a High Performance Computing (HPC) Slurm cluster of H4D VMs that uses remote direct memory access (RDMA) and has enhanced cluster management capabilities. You use the gcloud CLI and Cluster Toolkit to configure the cluster.
The H4D machine series is specifically designed to meet the needs of demanding HPC workloads. H4D offers instances with improved workload scalability through Cloud RDMA networking with 200 Gbps throughput. For more information on H4D compute-optimized machine types on Google Cloud, see H4D machine series.
Important: To complete this tutorial, you must first contact your Google Technical Account Manager (TAM) to request a reserved capacity block for the H4D machine type. Once approved, this capacity is added to your Google Cloud project. The capacity approval process can take several days.
Tip: To walk through a quick start tutorial that deploys an H4D machine type on Slurm, see the quickstart for creating an RDMA enabled Slurm cluster with H4D instances.
Before you begin
Before creating a Slurm cluster, complete the following steps if you haven't already done so:
- Choose a consumption option: the option that you pick determines how you obtain and use vCPU resources.
- Obtain capacity: obtain capacity for the selected consumption option.
- Ensure that you have enough Filestore quota: you need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity.
  - To check quota, see View API-specific quota.
  - If you don't have enough quota, request a quota increase.
- Install Cluster Toolkit: to provision Slurm clusters, you must use Cluster Toolkit version v1.62.0 or later. To install Cluster Toolkit, see Set up Cluster Toolkit.
To learn more, see Choose a consumption option and obtain capacity.
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Set up a storage bucket
Cluster blueprints use Terraform modules to provision Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in a version-enabled file. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.
To create this bucket and enable versioning from the CLI, run the following commands:
gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD --location=BUCKET_REGION \
    --uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following:
- BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
- PROJECT_ID: your project ID.
- BUCKET_REGION: any available location.
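Before running the bucket commands, it can help to catch an obviously invalid bucket name locally. The following is a minimal sketch under stated assumptions: is_plausible_bucket_name is a hypothetical helper, not a gcloud feature, and it covers only a simplified subset of the bucket naming requirements (3 to 63 characters, lowercase letters, digits, dashes, underscores, and dots, starting and ending with a letter or digit, no "goog" prefix, no "google" substring). It rejects the longer dotted names that Cloud Storage otherwise allows, so treat a failure as a hint rather than a verdict.

```shell
#!/usr/bin/env bash
# Hypothetical helper: a partial, local sanity check for a bucket name.
# It does not replace the checks that Cloud Storage performs on creation.
is_plausible_bucket_name() {
  # Reject the reserved "goog" prefix and any name that contains "google".
  case "$1" in
    goog*|*google*) return 1 ;;
  esac
  # 3-63 characters from [a-z0-9._-], starting and ending with a letter or digit.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Example with made-up names.
for candidate in "h4d-tfstate-demo" "My-Bucket" "ab"; do
  if is_plausible_bucket_name "$candidate"; then
    echo "$candidate: looks plausible"
  else
    echo "$candidate: fails the simplified check"
  fi
done
```

The check is deliberately conservative: it only runs string tests, so it works offline and never creates or modifies any resource.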
Open the Cluster Toolkit directory
Ensure that you are in the Cluster Toolkit directory by running the following command:
cd cluster-toolkit
This cluster deployment requires Cluster Toolkit v1.70.0 or later. To check your version, run the following command:
./gcluster --version
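If you want to script the comparison against the v1.70.0 minimum, a minimal sketch follows. It is not part of Cluster Toolkit: version_ge is a hypothetical helper, it assumes GNU sort with the -V option (available in Cloud Shell), and the installed value is a placeholder that you would replace with the version string that ./gcluster --version reports.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 is greater than or equal to
# version $2. A leading "v" is ignored. Assumes GNU sort (-V).
version_ge() {
  have="${1#v}"
  need="${2#v}"
  # After a version sort, the first line is the lower of the two versions.
  lowest="$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n 1)"
  [ "$lowest" = "$need" ]
}

# Placeholder value: substitute the version that ./gcluster --version reports.
installed="v1.71.1"
required="v1.70.0"

if version_ge "$installed" "$required"; then
  echo "Cluster Toolkit $installed meets the $required minimum"
else
  echo "Cluster Toolkit $installed is older than $required; update before deploying"
fi
```

A version sort is used instead of a plain string comparison because, as strings, v1.100.0 would sort before v1.70.0.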
Create a deployment file
Create a deployment file to specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as project ID, region, and zone.
To create a deployment file, follow the steps for the H4D machine type:
The parameters that you need to add to your deployment file depend on the consumption option that you're using for your deployment. Follow the section below that corresponds to the consumption option that you want to use.
Reservation-bound
To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content. Alternatively, copy examples/hpc-slurm-h4d/hpc-slurm-h4d-deployment.yaml to your workspace and edit it.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_reservation_name: RESERVATION_NAME
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region that has the reserved machines.
- ZONE: the zone where you want to provision the cluster. If you're using a reservation-based consumption option, the region and zone information was provided by your account team when the capacity was delivered.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
- RESERVATION_NAME: the name of your reservation.
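If you prefer to generate the reservation-bound deployment file from shell variables instead of editing it by hand, the following sketch shows one way to do it. The write_deployment_file function is a hypothetical helper, not a Cluster Toolkit command. It assumes that you have already set shell variables named after the placeholders above, and it writes the same keys as the reservation-bound example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes a reservation-bound deployment file to the path
# given as $1, using shell variables named after the documentation placeholders.
write_deployment_file() {
  # Stop with an error if any required variable is unset or empty.
  : "${BUCKET_NAME:?}" "${DEPLOYMENT_NAME:?}" "${PROJECT_ID:?}" \
    "${REGION:?}" "${ZONE:?}" "${NUMBER_OF_VMS:?}" "${RESERVATION_NAME:?}"

  cat > "$1" <<EOF
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: ${BUCKET_NAME}

vars:
  deployment_name: ${DEPLOYMENT_NAME}
  project_id: ${PROJECT_ID}
  region: ${REGION}
  zone: ${ZONE}
  h4d_cluster_size: ${NUMBER_OF_VMS}
  h4d_reservation_name: ${RESERVATION_NAME}
EOF
}
```

After setting the variables, you would call write_deployment_file h4d-slurm-deployment.yaml from the Cluster Toolkit directory. For the other consumption options, the last line of the vars block would change to the setting shown in the corresponding section.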
Flex-start
To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content. Alternatively, copy examples/hpc-slurm-h4d/hpc-slurm-h4d-deployment.yaml to your workspace and edit it.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_dws_flex_enabled: true
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region where you want to provision your cluster.
- ZONE: the zone where you want to provision your cluster.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
This deployment provisions static compute nodes, which means that the cluster has a set number of nodes at all times. If you want to enable your cluster to autoscale instead, use the examples/hpc-slurm-h4d/hpc-slurm-h4d.yaml file and edit the values of node_count_static and node_count_dynamic_max to match the following:
node_count_static: 0
node_count_dynamic_max: $(vars.h4d_cluster_size)
Spot
To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content. Alternatively, copy examples/hpc-slurm-h4d/hpc-slurm-h4d-deployment.yaml to your workspace and edit it.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_enable_spot_vm: true
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
- DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
- PROJECT_ID: your project ID.
- REGION: the region where you want to provision your cluster.
- ZONE: the zone where you want to provision your cluster.
- NUMBER_OF_VMS: the number of VMs that you want for the cluster.
Provision an H4D Slurm cluster
Cluster Toolkit provisions the cluster based on the deployment file you created in the previous step and the default cluster blueprint. For more information about the software that is installed by the blueprint, see learn more about Slurm custom images.
Using Cloud Shell, from the directory where you installed Cluster Toolkit and created the deployment file, provision the cluster with the following command, which uses the H4D Slurm blueprint file. This step takes approximately 20-30 minutes.
Note: Cloud Shell has an inactivity timeout that stops shells from running any processes after 40 minutes. If a timeout occurs, a dialog appears that asks if you want to reauthorize your session. To continue the deployment after a timeout, click Reauthorize.
./gcluster deploy -d h4d-slurm-deployment.yaml examples/hpc-slurm-h4d/hpc-slurm-h4d.yaml --auto-approve
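Because the deployment takes a while, it can be worth confirming that no documentation placeholder was left in the deployment file before you start. The following is a minimal sketch: find_placeholders and preflight are hypothetical helpers, not gcluster features, and the placeholder list is taken from the deployment file examples on this page. It assumes a grep that supports the -w option, such as GNU grep.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints every line of file $1 that still contains one of
# the documentation placeholders as a whole word. Succeeds if any are found.
find_placeholders() {
  grep -nwE 'BUCKET_NAME|DEPLOYMENT_NAME|PROJECT_ID|REGION|ZONE|NUMBER_OF_VMS|RESERVATION_NAME' "$1"
}

# Hypothetical helper: succeeds only when the deployment file exists and
# contains none of the placeholders.
preflight() {
  if [ ! -f "$1" ]; then
    echo "Deployment file not found: $1" >&2
    return 1
  fi
  if find_placeholders "$1"; then
    echo "Replace the placeholders listed above before deploying." >&2
    return 1
  fi
  return 0
}
```

One way to use it is to run preflight h4d-slurm-deployment.yaml first and start the deploy command only if it succeeds. The match is case-sensitive, so lowercase keys such as project_id and region are not reported.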
Connect to the Slurm cluster
To access your cluster, you must sign in to the Slurm login node. To sign in, you can use either the Google Cloud console or the Google Cloud CLI.
Console
Go to the Compute Engine > VM instances page.
Locate the login node. Its name should follow the pattern DEPLOYMENT_NAME+login-001.
From the Connect column of the login node, click SSH.
gcloud
To connect to the login node, complete the following steps:
Identify the login node by using the gcloud compute instances list command.
gcloud compute instances list \
    --zones=ZONE \
    --filter="name ~ login" --format "value(name)"
If the output lists multiple Slurm clusters, you can identify your login node by the DEPLOYMENT_NAME that you specified.
Use the gcloud compute ssh command to connect to the login node.
gcloud compute ssh LOGIN_NODE \
    --zone=ZONE --tunnel-through-iap
Replace the following:
- ZONE: the zone where the VMs for your cluster are located.
- LOGIN_NODE: the name of the login node, which you identified in the previous step.
Redeploy the Slurm cluster
If you need to increase the number of compute nodes or add new partitions to your cluster, you might need to update configurations for your Slurm cluster by redeploying.
To redeploy the cluster using an existing image, run the following command:
./gcluster deploy -d h4d-slurm-deployment.yaml examples/hpc-slurm-h4d/hpc-slurm-h4d.yaml --only cluster-env,cluster --auto-approve -w
This command is only for redeployments where an image already exists; it only redeploys the cluster and its infrastructure.
Destroy the Slurm cluster
To remove the Slurm cluster and the instances within it, complete the following steps:
Disconnect from the cluster if you haven't already.
Before running the destroy command, navigate to the root of the Cluster Toolkit directory. By default, DEPLOYMENT_FOLDER is located at the root of the Cluster Toolkit directory.
To destroy the cluster, run:
./gcluster destroy DEPLOYMENT_FOLDER --auto-approve
Replace the following:
- DEPLOYMENT_FOLDER: the name of the deployment folder. It's typically the same as DEPLOYMENT_NAME.
When the cluster removal is complete, you should see a message similar to the following:
Destroy complete! Resources: xx destroyed.
To learn how to cleanly destroy infrastructure and for advanced manual deployment instructions, see the deployment folder located at the root of the Cluster Toolkit directory: DEPLOYMENT_FOLDER/instructions.txt
What's next
- Verify reservation consumption
- View VMs topology
- Manage host events across VMs
- Manage host events across reservations
- Monitor VMs in your Slurm cluster
- Report faulty host
Last updated 2025-12-15 UTC.