Create an H4D Slurm cluster with enhanced management capabilities

This page describes how to create a High Performance Computing (HPC) Slurm cluster that uses remote direct memory access (RDMA) with H4D VMs and enhanced cluster management capabilities. You use the gcloud CLI and Cluster Toolkit to configure the cluster.

The H4D machine series is specifically designed to meet the needs of demanding HPC workloads. H4D offers instances with improved workload scalability through Cloud RDMA networking with 200 Gbps throughput. For more information about H4D compute-optimized machine types on Google Cloud, see H4D machine series.

Important: To complete this tutorial, you must first contact your Google Technical Account Manager (TAM) to request a reserved capacity block for the H4D machine type. Once approved, this capacity is added to your Google Cloud project. The capacity approval process can take several days.

Tip: To walk through a quickstart tutorial that deploys an H4D machine type on Slurm, see the quickstart for creating an RDMA-enabled Slurm cluster with H4D instances.

Before you begin

Before you create a Slurm cluster, complete the following steps if you haven't already done so:

  1. Choose a consumption option: the option that you pick determines how you obtain and use vCPU resources.

  2. Obtain capacity: obtain capacity for the selected consumption option. To learn more, see Choose a consumption option and obtain capacity.

  3. Ensure that you have enough Filestore quota: you need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity.

  4. Install Cluster Toolkit: to provision Slurm clusters, you must use Cluster Toolkit version v1.62.0 or later. To install Cluster Toolkit, see Set up Cluster Toolkit.

In the Google Cloud console, activate Cloud Shell.


At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Set up a storage bucket

Cluster blueprints use Terraform modules to provision Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in a version-enabled file. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.

To create this bucket and enable versioning from the CLI, run the following commands:

gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD \
    --location=BUCKET_REGION \
    --uniform-bucket-level-access

gcloud storage buckets update gs://BUCKET_NAME --versioning

Replace the following:

  • BUCKET_NAME: a globally unique name for your Cloud Storage bucket.
  • PROJECT_ID: your project ID.
  • BUCKET_REGION: the region where you want to create the bucket.
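
For example, with hypothetical values filled in (my-hpc-tfstate, my-hpc-project, and us-central1 are placeholders, not values required by this guide):

gcloud storage buckets create gs://my-hpc-tfstate \
    --project=my-hpc-project \
    --default-storage-class=STANDARD \
    --location=us-central1 \
    --uniform-bucket-level-access

gcloud storage buckets update gs://my-hpc-tfstate --versioning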

Open the Cluster Toolkit directory

Ensure that you are in the Cluster Toolkit directory by running the following command:

cd cluster-toolkit

This cluster deployment requires Cluster Toolkit v1.70.0 or later. To check your version, you can run the following command:

./gcluster --version
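
If your installed version is older, update it before continuing. As a sketch, assuming you installed Cluster Toolkit by cloning its git repository (the release tag shown is illustrative):

# Fetch the latest release tags, check out a release that meets the
# minimum version, and rebuild the gcluster binary.
git fetch --tags
git checkout v1.70.0
make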

Create a deployment file

Create a deployment file to specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as project ID, region, and zone.

To create a deployment file, follow the steps for the H4D machine type:

The parameters that you need to add to your deployment file depend on the consumption option that you're using for your deployment. Select the tab that corresponds to the consumption option that you want to use.

Reservation-bound

To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content.

Tip: Alternatively, for an example of a more detailed deployment file, you can copy examples/hpc-slurm-h4d/hpc-slurm-h4d-deployment.yaml to your workspace and edit it.

terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_reservation_name: RESERVATION_NAME

Replace the following:

  • BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
  • DEPLOYMENT_NAME: a name for your deployment. If you create multiple clusters, ensure that you select a unique name for each one.
  • PROJECT_ID: your project ID.
  • REGION: the region that has the reserved machines.
  • ZONE: the zone where you want to provision the cluster. If you're using a reservation-based consumption option, your account team provided the region and zone information when the capacity was delivered.
  • NUMBER_OF_VMS: the number of VMs that you want for the cluster.
  • RESERVATION_NAME: the name of your reservation.
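
For illustration, a completed reservation-bound deployment file might look like the following. Every value shown is hypothetical:

terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: my-hpc-tfstate

vars:
  deployment_name: h4d-cluster-01
  project_id: my-hpc-project
  region: us-central1
  zone: us-central1-a
  h4d_cluster_size: 4
  h4d_reservation_name: my-h4d-reservation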

Flex-start

To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content.

Tip: Alternatively, for an example of a more detailed deployment file, you can copy examples/hpc-slurm-h4d/hpc-slurm-h4d-deployment.yaml to your workspace and edit it.

terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_dws_flex_enabled: true

Replace the following:

  • BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
  • DEPLOYMENT_NAME: a name for your deployment. If you create multiple clusters, ensure that you select a unique name for each one.
  • PROJECT_ID: your project ID.
  • REGION: the region where you want to provision your cluster.
  • ZONE: the zone where you want to provision your cluster.
  • NUMBER_OF_VMS: the number of VMs that you want for the cluster.

This deployment provisions static compute nodes, which means that the cluster has a set number of nodes at all times. If you want to enable your cluster to autoscale instead, use the examples/hpc-slurm-h4d/hpc-slurm-h4d.yaml file and edit the values of node_count_static and node_count_dynamic_max to match the following:

      node_count_static: 0
      node_count_dynamic_max: $(vars.h4d_cluster_size)
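
For context, in Slurm-based Cluster Toolkit blueprints these settings typically belong to a nodeset module. The following is a rough sketch of the surrounding block; the module id is illustrative, and the exact layout in the blueprint may differ:

      - id: h4d_nodeset   # illustrative id; check the blueprint for the actual one
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        settings:
          node_count_static: 0
          node_count_dynamic_max: $(vars.h4d_cluster_size)

With node_count_static set to 0, Slurm creates compute VMs on demand as jobs arrive and removes them when they go idle, up to the configured maximum.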

Spot

To create your deployment file, use a text editor to create a YAML file named h4d-slurm-deployment.yaml and add the following content.

Tip: Alternatively, for an example of a more detailed deployment file, you can copy examples/hpc-slurm-h4d/hpc-slurm-h4d-deployment.yaml to your workspace and edit it.

terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  h4d_cluster_size: NUMBER_OF_VMS
  h4d_enable_spot_vm: true

Replace the following:

  • BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
  • DEPLOYMENT_NAME: a name for your deployment. If you create multiple clusters, ensure that you select a unique name for each one.
  • PROJECT_ID: your project ID.
  • REGION: the region where you want to provision your cluster.
  • ZONE: the zone where you want to provision your cluster.
  • NUMBER_OF_VMS: the number of VMs that you want for the cluster.

Provision an H4D Slurm cluster

Cluster Toolkit provisions the cluster based on the deployment file that you created in the previous step and the default cluster blueprint. For more information about the software that the blueprint installs, see Learn more about Slurm custom images.

Using Cloud Shell, from the directory where you installed Cluster Toolkit and created the deployment file, you can provision the cluster with the following command, which uses the H4D Slurm blueprint file. This step takes approximately 20-30 minutes.

Note: Cloud Shell has an inactivity timeout that stops shells from running any processes after 40 minutes. If a timeout occurs, a dialog appears that asks if you want to reauthorize your session. To continue the deployment after a timeout, click Reauthorize.
./gcluster deploy -d h4d-slurm-deployment.yaml examples/hpc-slurm-h4d/hpc-slurm-h4d.yaml --auto-approve
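
After the deployment finishes, you can optionally confirm that the cluster VMs were created. This is just a quick sanity check, not part of the official flow; replace DEPLOYMENT_NAME with the value from your deployment file:

gcloud compute instances list \
    --filter="name ~ DEPLOYMENT_NAME" \
    --format="table(name, zone.basename(), status)"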

Connect to the Slurm cluster

To access your cluster, you must sign in to the Slurm login node. To sign in, you can use either the Google Cloud console or the Google Cloud CLI.

Console

  1. Go to the Compute Engine > VM instances page.


  2. Locate the login node. It should have a name with the pattern DEPLOYMENT_NAME-login-001.

  3. From the Connect column of the login node, click SSH.

gcloud

To connect to the login node, complete the following steps:

  1. Identify the login node by using the gcloud compute instances list command.

    gcloud compute instances list \
        --zones=ZONE \
        --filter="name ~ login" \
        --format="value(name)"

    If the output lists multiple Slurm clusters, you can identify your login node by the DEPLOYMENT_NAME that you specified.

  2. Use the gcloud compute ssh command to connect to the login node.

    gcloud compute ssh LOGIN_NODE \
        --zone=ZONE \
        --tunnel-through-iap

    Replace the following:

    • ZONE: the zone where the VMs for your cluster are located.
    • LOGIN_NODE: the name of the login node, which you identified in the previous step.
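
After you're signed in to the login node, you can optionally verify that Slurm sees the compute nodes by using standard Slurm commands (these aren't specific to this blueprint):

    # List partitions and the state of their nodes.
    sinfo

    # Run a trivial two-node job to confirm that scheduling works;
    # adjust -N to the size of your cluster.
    srun -N 2 hostname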

Redeploy the Slurm cluster

If you need to increase the number of compute nodes or add new partitions to your cluster, you might need to update the configuration of your Slurm cluster by redeploying it.

To redeploy the cluster using an existing image, do the following:

  1. Connect to the cluster.

  2. Run the following command:

    ./gcluster deploy -d h4d-slurm-deployment.yaml examples/hpc-slurm-h4d/hpc-slurm-h4d.yaml --only cluster-env,cluster --auto-approve -w

    This command is only for redeployments where an image already exists; it redeploys only the cluster and its infrastructure.

Destroy the Slurm cluster

To remove the Slurm cluster and the instances within it, complete the following steps:

  1. Disconnect from the cluster if you haven't already.

  2. Before running the destroy command, navigate to the root of the Cluster Toolkit directory. By default, DEPLOYMENT_FOLDER is located at the root of the Cluster Toolkit directory.

  3. To destroy the cluster, run:

    ./gcluster destroy DEPLOYMENT_FOLDER --auto-approve

    Replace the following:

    • DEPLOYMENT_FOLDER: the name of the deployment folder. It's typically the same as DEPLOYMENT_NAME.
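
    For example, using the hypothetical deployment name from the earlier sample file:

    ./gcluster destroy h4d-cluster-01 --auto-approve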

When the cluster removal is complete, you should see a message similar to the following:

  Destroy complete! Resources: xx destroyed.

To learn how to cleanly destroy infrastructure and for advanced manual deployment instructions, see the deployment folder located at the root of the Cluster Toolkit directory: DEPLOYMENT_FOLDER/instructions.txt
