Overview of HPC clusters with enhanced cluster management capabilities
To create the infrastructure for tightly-coupled applications that scale across multiple nodes, you can create a cluster of virtual machine (VM) instances. This guide provides a high-level overview of the key considerations and steps to configure a cluster of VM instances for high performance computing (HPC) workloads using dense resource allocation.
With H4D, Compute Engine adds support for running massive HPC workloads by treating an entire cluster of VM instances as a single computer. Topology-aware placement of VMs lets you access many instances within a single networking superblock and minimizes network latency. You can also configure Cloud RDMA on these instances to maximize inter-node communication performance, which is crucial for tightly-coupled HPC workloads.
Note: This type of configuration relies on features and concepts similar to those documented in the AI Hypercomputer documentation for accelerator-optimized VMs with GPUs.
You create these HPC VM clusters with H4D by reserving blocks of capacity instead of individual resources. Using blocks of capacity for your cluster enables enhanced cluster management capabilities.
HPC clusters with H4D instances can be created either with or without enhanced cluster management capabilities. If you don't require enhanced cluster management capabilities with your H4D HPC cluster, or if you want to create HPC clusters using a machine series other than H4D, then use the instructions for creating HPC instances or clusters instead of this guide.
Cluster terminology
When working with blocks of capacity, the following terms are used:
Overview of cluster creation process with H4D VMs
To create HPC clusters on reserved blocks of capacity, you must complete the following steps:
- Review available provisioning models
- Choose a consumption option and obtain capacity
- Choose a deployment option and orchestrator
- Choose the operating system or cluster image
- Create your cluster
Provisioning models for VM and cluster creation
When creating VM instances, you can use the provisioning models described in Compute Engine instances provisioning models.
To create tightly-coupled H4D instances, you must use one of the following provisioning models to obtain the necessary resources for creating compute instances:
- Reservation-bound: you can reserve resources at a discounted price for a future date and duration. At the start of your reservation period, you can use the reserved resources to create VMs or clusters. You have exclusive access to your reserved resources for the reservation period.
- Flex-start: you can request discounted resources for up to seven days. Compute Engine makes best-effort attempts to schedule the provisioning of your requested resources as soon as they're available. You have exclusive access to your obtained resources for your requested period.
- Spot: based on availability, you can immediately obtain deeply discounted resources. However, Compute Engine might stop or delete the VM instances at any time to reclaim capacity.
Reservation-bound provisioning model
The reservation-bound provisioning model links your created VM instances to the capacity that you previously reserved. When you reserve capacity, Compute Engine creates an empty reservation. Then, at the reservation start time, the following occurs:
- Compute Engine adds your reserved resources to the reservation. You have exclusive access to the reserved capacity until the reservation end time.
- Google Cloud charges you for the reserved capacity until the end of your reservation period, whether you use the capacity or not.
- You can then use the reserved resources to create VMs without additional charges. You only pay for resources that aren't included in the reservation, such as disks or IP addresses.
You can reserve resources for as many VMs as you like, for as long as you like, starting at a future date. Then, you can use the reserved resources to create and run VMs until the end of the reservation period. If you reserve resources for one year or longer, then you must purchase and attach a resource-based commitment.
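The billing rules described above can be sketched in a few lines of Python. This is an illustrative model only: the `Reservation` class, its fields, and the hourly price are hypothetical and are not part of any Compute Engine API or price list. The sketch assumes that "one year" can be approximated as 365 days. It shows that the charge depends on the reservation window rather than on usage, and that a window of one year or longer triggers the commitment requirement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Reservation:
    """Hypothetical model of a reservation; not a Compute Engine API type."""

    start: datetime
    end: datetime
    vm_count: int
    hourly_price_per_vm: float  # assumed, illustrative price

    @property
    def duration(self) -> timedelta:
        return self.end - self.start

    def reserved_capacity_charge(self) -> float:
        """Charge for the whole window, whether or not any VM runs."""
        hours = self.duration.total_seconds() / 3600
        return hours * self.vm_count * self.hourly_price_per_vm

    def requires_commitment(self) -> bool:
        """One year or longer needs a resource-based commitment.

        Assumption: one year is approximated as 365 days.
        """
        return self.duration >= timedelta(days=365)


# A 30-day reservation for 16 VMs at an assumed price of 2.0 per VM-hour.
short = Reservation(datetime(2026, 1, 1), datetime(2026, 1, 31), 16, 2.0)
print(short.reserved_capacity_charge())
print(short.requires_commitment())
```

Resources outside the reservation, such as disks or IP addresses, are deliberately left out of this model because they are billed separately.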
To provision resources using the reservation-bound provisioning model, see:
- For long-running, large-scale distributed workloads with densely allocated resources: Reserve capacity through your account team
- For short-running (up to 90 days) distributed workloads with densely allocated resources: Future reservation requests in calendar mode
You can use reservation-bound provisioning with H4D instances by specifying the reservation-bound provisioning model when creating individual VMs, an HPC cluster, or a group of VMs.
Flex-start provisioning model
To run short-duration workloads that require densely allocated resources, you can request compute resources for up to seven days by using Flex-start. Whenever resources are available, Compute Engine creates your requested number of VMs. You can stop standalone Flex-start VMs, but you can't stop Flex-start VMs that a managed instance group (MIG) creates through resize requests. The Flex-start VMs exist until you delete them, or until Compute Engine deletes the VMs at the end of their run duration.
Flex-start is ideal for workloads that can start at any time. The Flex-start provisioning model provisions resources from a secure capacity pool, so the resources are densely allocated to minimize network latency.
When you add Flex-start VMs to a managed instance group (MIG) by using resize requests, the MIG creates the VMs all at once. This approach helps you avoid unnecessary charges for partial capacity that Compute Engine might deliver while you wait for the full capacity needed to start your workload.
You can use Flex-start provisioning with H4D instances, using any available deployment model.
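As a minimal sketch of the Flex-start constraints described in this section, the following Python helpers check a requested run duration against the seven-day limit and encode the rule about which VMs can be stopped. The function names and parameters are hypothetical and exist only for illustration; they don't correspond to any Compute Engine API.

```python
from datetime import timedelta

# Limit stated in this guide: Flex-start requests run for up to seven days.
MAX_RUN_DURATION = timedelta(days=7)


def validate_run_duration(requested: timedelta) -> timedelta:
    """Return the requested duration if it fits the Flex-start limit."""
    if requested <= timedelta(0):
        raise ValueError("run duration must be positive")
    if requested > MAX_RUN_DURATION:
        raise ValueError("Flex-start requests are limited to seven days")
    return requested


def can_stop(created_by_mig_resize_request: bool) -> bool:
    """Standalone Flex-start VMs can be stopped; VMs that a MIG creates
    through resize requests can't."""
    return not created_by_mig_resize_request


print(validate_run_duration(timedelta(days=3)))
print(can_stop(created_by_mig_resize_request=True))
```

A check like this could run before a request is submitted, so that a workload that needs more than seven days is routed to a reservation instead.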
Spot provisioning model
To run fault-tolerant workloads, you can obtain compute resources immediately based on availability. You get resources at the lowest price possible. However, Compute Engine might stop or delete the created Spot VMs at any time to reclaim capacity. This process is called preemption.
Spot VMs are ideal for workloads where interruptions are acceptable, such as:
- Batch processing
- High performance computing (HPC)
- Data analytics
- Continuous integration and continuous deployment (CI/CD)
- Media encoding
You can use Spot VMs with any machine type, except A4X, X4, and bare metal machine types. Dense allocation depends on resource availability. To help ensure a closer allocation, you can apply a compact placement policy to the Spot VMs.
Note: Spot VMs are not covered by any Service Level Agreement and are excluded from the Compute Engine SLA.
You can use Spot VMs with the following dense deployment options:
- Create an HPC Slurm cluster with H4D
- Bulk create HPC-optimized instances with H4D
- Create an HPC MIG with H4D machine series
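Because a Spot VM can be preempted at any time, a workload that runs on Spot VMs typically needs to resume from saved progress rather than start over. The following Python sketch illustrates one common way to do that with a checkpoint file. It is a local simulation under stated assumptions: the `Preempted` exception, the `run_steps` function, and the `preempt_at` parameter are hypothetical stand-ins for a real preemption event, and no Compute Engine API is called.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class Preempted(Exception):
    """Stands in for the VM being stopped or deleted mid-run."""


def run_steps(total_steps: int, checkpoint: Path,
              preempt_at: Optional[int] = None) -> int:
    """Run work from the last checkpoint; return the completed step count."""
    done = 0
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text())["done"]
    for step in range(done, total_steps):
        if preempt_at is not None and step == preempt_at:
            raise Preempted(f"preempted before step {step}")
        # One unit of real work would go here.
        checkpoint.write_text(json.dumps({"done": step + 1}))
    return total_steps


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "progress.json"
    try:
        run_steps(10, ckpt, preempt_at=4)  # simulated preemption
    except Preempted:
        pass  # a replacement VM would pick up from here
    print(json.loads(ckpt.read_text())["done"])  # steps saved before preemption
    print(run_steps(10, ckpt))  # resumes and finishes the remaining steps
```

In a real cluster the checkpoint would need to live on storage that outlives the VM, which is an assumption this local sketch doesn't model.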
Choose a consumption option and obtain capacity
Consumption options determine how resources are obtained for your cluster. To create a cluster that uses enhanced cluster management capabilities, you must request blocks of capacity for a dense deployment.
The following table summarizes the key differences between the consumption options for blocks of capacity:
Note: You can also request a future reservation for more than 90 days. If you need to reserve this capacity, see Reserve capacity through your account team.

| Consumption option | Future reservations for capacity blocks | Future reservations for up to 90 days (in calendar mode) | Flex-start | Spot |
|---|---|---|---|---|
| Workload characteristics | Long-running, large-scale distributed workloads that require densely allocated resources | Short-duration workloads that require densely allocated resources | Short-duration workloads that require densely allocated resources | Fault-tolerant workloads |
| Lifespan | Any time | Up to 90 days | Up to 7 days | Any time, but subject to preemption |
| Preemptible | No | No | No | Yes |
| Capacity assurance | Very high | Very high | Best effort | Best effort |
| Quota | Check that you have enough quota before creating instances. | No quota is charged | Preemptible quota is charged. | Preemptible quota is charged. |
| Pricing | | | | |
| Resource allocation | Dense | Dense | Dense | Standard (Compact placement policy optional) |
| Provisioning model | Reservation-bound | Reservation-bound | Flex-start | Spot |
| Creation method | To create HPC clusters and VMs, you must do the following: | To create HPC clusters and VMs, you must do the following: | To create VMs, select one of the following options: When your requested capacity becomes available, Compute Engine provisions it. | You can immediately create VMs. See Choose a deployment option. |
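The table can be read as a rough decision procedure. The following Python sketch encodes one possible reading of it. The `suggest_option` function and its three inputs are hypothetical simplifications introduced for illustration; the thresholds (7 and 90 days) come from the Lifespan row, while the order of the checks is an assumption rather than official guidance.

```python
from enum import Enum


class ConsumptionOption(Enum):
    FUTURE_RESERVATION = "Future reservations for capacity blocks"
    CALENDAR_MODE = "Future reservations for up to 90 days (in calendar mode)"
    FLEX_START = "Flex-start"
    SPOT = "Spot"


def suggest_option(duration_days: float, fault_tolerant: bool,
                   needs_assured_capacity: bool) -> ConsumptionOption:
    """Suggest a consumption option from simplified workload traits.

    Assumption: the order of these checks is illustrative only.
    """
    if fault_tolerant and not needs_assured_capacity:
        return ConsumptionOption.SPOT  # preemptible, best-effort capacity
    if duration_days <= 7 and not needs_assured_capacity:
        return ConsumptionOption.FLEX_START  # up to 7 days, best effort
    if duration_days <= 90:
        return ConsumptionOption.CALENDAR_MODE  # up to 90 days, very high assurance
    return ConsumptionOption.FUTURE_RESERVATION  # longer-running workloads


print(suggest_option(3, fault_tolerant=False, needs_assured_capacity=False).value)
print(suggest_option(180, fault_tolerant=False, needs_assured_capacity=True).value)
```

Quota and pricing also differ between the options and might change the outcome, so treat the result as a starting point for the comparison in the table.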
Choose a deployment option
High performance computing (HPC) workloads aggregate computing resources to gain performance greater than that of a single workstation, server, or computer. HPC is used to solve problems in academic research, science, design, simulation, and business intelligence.
For HPC clusters with enhanced cluster management capabilities, choose the H4D machine series. If you plan to use a different machine series, follow the documentation at Create an HPC-ready VM instance instead of using the deployment methods listed on this page.
Some of the available deployment options include the installation and configuration of an orchestrator for enhanced management of the HPC cluster.
To create your VMs or clusters, choose the option that best fits your use case:
| Option | Use case |
|---|---|
| Cluster Toolkit | You want to use open-source software that simplifies the process for you to deploy both Slurm and Google Kubernetes Engine (GKE) clusters. Cluster Toolkit is designed to be highly customizable and extensible. To learn more, see the following: |
| GKE | You want maximum flexibility in configuring your Google Kubernetes Engine cluster based on the needs of your workload. To learn more, see Run HPC workloads with H4D. |
| Use Compute Engine | You want full control of the infrastructure layer so that you can set up your own orchestrator. To learn more, see the following: |
Choose the operating system image
The operating system (OS) image you choose depends on the service you use to deploy your cluster.
For clusters on GKE: Use a GKE node image, such as Container-Optimized OS. If you use Cluster Toolkit to deploy your GKE cluster, a Container-Optimized OS image is used by default. For more information about node images, see Node images in the GKE documentation.
For clusters on Compute Engine: You can use one of the following images:
- HPC VM image: A Rocky Linux 8 image that is optimized for tightly-coupled HPC workloads.
- OS image provided by Google Cloud: OS images that support H4D. You will need to configure these for your HPC workloads.
- Custom images: You can create and use your own custom images. To include HPC-specific optimizations, we recommend that you create a custom image using the HPC VM image.
For Slurm clusters: Cluster Toolkit deploys the Slurm cluster with an HPC VM image based on Rocky Linux 8 that is optimized for tightly-coupled HPC workloads.
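The image choices described in this section can be summarized as a lookup from deployment service to candidate images. The following Python sketch is only a restatement of that text: the keys `"gke"`, `"compute-engine"`, and `"slurm"` and the `candidate_images` function are hypothetical labels chosen for this example, and the strings are descriptions, not real image names or image families.

```python
from typing import Dict, List

# Descriptive labels taken from this section; not actual image identifiers.
IMAGES_BY_SERVICE: Dict[str, List[str]] = {
    "gke": ["GKE node image, such as Container-Optimized OS"],
    "compute-engine": [
        "HPC VM image (Rocky Linux 8)",
        "OS image provided by Google Cloud that supports H4D",
        "Custom image, ideally built from the HPC VM image",
    ],
    "slurm": ["HPC VM image (Rocky Linux 8), deployed by Cluster Toolkit"],
}


def candidate_images(service: str) -> List[str]:
    """Return the image choices this guide lists for a deployment service."""
    try:
        return list(IMAGES_BY_SERVICE[service])
    except KeyError:
        known = ", ".join(sorted(IMAGES_BY_SERVICE))
        raise ValueError(f"unknown service {service!r}; expected one of: {known}")


for name in candidate_images("compute-engine"):
    print(name)
```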
Create your HPC cluster
After you review the cluster creation process and make preliminary decisions for your workload, create your cluster by using any of the deployment options.
Enhanced cluster management capabilities for your HPC cluster
When you create H4D instances with densely allocated resources using the deployment methods mentioned in Choose a deployment option, you can use enhanced HPC cluster management capabilities with your instances.
For more information about these capabilities, see Enhanced HPC cluster management with H4D instances.
What's next
- Learn more about Cluster Toolkit.
- Try the Quickstart tutorial Deploy an HPC cluster with Slurm.
- Review best practices for running HPC workloads.
Last updated 2025-12-15 UTC.