Enhanced HPC cluster management with H4D instances
Enhanced HPC cluster management lets you run large-scale, densely deployed HPC clusters. It provides the following capabilities:
- HPC cluster resources colocation
- Cluster topology-aware placement
- Cluster operational mode
- Cluster maintenance scheduling and controls
- Cluster monitoring and diagnostic tooling
HPC infrastructure resources colocation
When you use H4D instances with enhanced management capabilities, you can request Compute Engine to provision your instances as close together as possible. These machines offer the following features:
- Compute Engine provisions the machines as blocks of resources.
- Improved workload scalability through Cloud RDMA-enabled 200 Gbps networking.
This resource arrangement minimizes network hops and optimizes for the lowest network latency. To learn more about how to obtain capacity to deploy densely allocated blocks of machines, see Create an HPC cluster with enhanced management capabilities.
Cluster topology-aware placement
After you create VMs or clusters of H4D VMs, you can get topology information at the node and cluster levels. This information helps you do the following:
- Adjust your application or workload design to further minimize network latency.
- Understand and troubleshoot network latency and performance issues for VMs that communicate frequently with each other. These issues can occur if the VMs are unexpectedly located far apart.
For more information, see View VMs topology.
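As a rough illustration of how topology information can help spot VMs that are unexpectedly far apart, the following Python sketch compares placements level by level. The record layout (cluster, block, sub-block, host), the VM names, and the identifiers are all hypothetical and do not represent actual API output; adapt them to whatever topology data you retrieve.

```python
from itertools import combinations

# Hypothetical topology records: VM name -> (cluster, block, sub-block, host).
# The names and values are illustrative only, not actual API output.
TOPOLOGY = {
    "vm-a": ("c1", "b1", "s1", "h1"),
    "vm-b": ("c1", "b1", "s1", "h2"),
    "vm-c": ("c1", "b1", "s2", "h7"),
    "vm-d": ("c1", "b2", "s5", "h3"),
}


def topology_distance(a, b):
    """Return how many levels, counted from the host upward, separate two placements.

    0 means the same host, 1 the same sub-block, 2 the same block,
    3 the same cluster, and 4 different clusters.
    """
    for level, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return len(a) - level
    return 0


def distant_pairs(topology, max_distance):
    """List (vm1, vm2, distance) for pairs farther apart than max_distance."""
    pairs = []
    for (n1, t1), (n2, t2) in combinations(sorted(topology.items()), 2):
        d = topology_distance(t1, t2)
        if d > max_distance:
            pairs.append((n1, n2, d))
    return pairs


# Flag pairs that do not share a block (distance greater than 2).
print(distant_pairs(TOPOLOGY, max_distance=2))
```

With this sample data, vm-d is flagged against each of the other three VMs because it sits in a different block. A real workload might use such a check to decide which ranks to place on which VMs.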
Managed maintenance and recovery of your H4D VMs
When you reserve capacity to create H4D VMs or clusters, Google Cloud automatically manages the maintenance and recovery process of your VMs after host errors or faulty host reports. This approach, referred to as the managed mode, is ideal when your workload requires high stability and needs an automated process to minimize downtime.
The managed mode has the following features:
- Only use reserved capacity for recovery: Compute Engine only uses your reserved capacity to restart VMs. If there's no available capacity in your reservations, then Compute Engine only restarts VMs after you obtain more capacity.
- Automated VM restarts: Google Cloud handles the entire recovery process for a VM. When host maintenance is required, Compute Engine automatically migrates your VMs to other available machines within your reservation and restarts the VMs.
- Block management and visibility: you can view the topology, health, and maintenance status of individual reservations and reservation blocks. You can also receive maintenance notifications, and optionally start maintenance before the scheduled maintenance time, for these resources.
- Potential API rate limits: calls to the report faulty host API may be rate-limited per reservation.
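Because calls to the report faulty host API may be rate-limited, automation that reports hosts may benefit from retrying with backoff. The following Python sketch shows one common pattern, capped exponential backoff with full jitter. It is not tied to any Google Cloud client library: `RateLimitedError` and `request_fn` are hypothetical stand-ins for whatever exception and call your own client code uses.

```python
import random
import time


class RateLimitedError(Exception):
    """Hypothetical error that the caller's client raises when rate-limited."""


def call_with_backoff(request_fn, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep, rng=random.random):
    """Call request_fn, retrying on RateLimitedError with capped backoff.

    request_fn is any zero-argument callable (an assumption of this sketch).
    sleep and rng are injectable so that the function can be tested
    without real waiting or randomness.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # The cap doubles on each attempt, up to max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the current cap.
            sleep(delay * rng())
```

Jitter spreads retries from multiple reporters over time, which can help when the limit applies per reservation and several tools report hosts in the same reservation.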
Cluster maintenance scheduling and controls
You control maintenance of H4D instances by using topology-aware scheduling in a block of resources. This capability helps synchronize upgrades so that your workloads are more resilient to host events and experience fewer disruptions.
To facilitate full control of maintenance events, you can use the following features:
Maintenance scheduling type
When you reserve capacity to create VMs or clusters of H4D VM instances, you can define how Compute Engine maintains the infrastructure that your VMs run on. You can specify whether to group VMs and have synchronized maintenance scheduling (grouped), or the VMs can be loosely coupled and have independent maintenance scheduling (independent).
Grouped maintenance scheduling
The grouped maintenance scheduling type helps ensure that, no matter when Compute Engine provisions a VM, all VMs running the same workload have the same planned maintenance frequency. This tightly coupled maintenance lets you optimize your job's performance by giving you complete control over your used and unused capacity.
The grouped maintenance scheduling type is useful in the following cases:
- Your environment uses a job scheduler, such as Slurm or Google Kubernetes Engine.
- You want to run highly parallelized computing workloads.
Independent maintenance scheduling
The independent maintenance scheduling type gives VMs different maintenance schedules. This configuration is ideal if you have workloads that run more efficiently when the VMs have separate maintenance schedules.
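The trade-off between the two scheduling types can be sketched with a toy model in Python. The model is an assumption made for illustration, not a description of how Compute Engine schedules maintenance: each VM has one planned maintenance day, a tightly coupled job stops whenever any of its VMs is down, and a loosely coupled workload cares about how much capacity is unavailable at once. The VM names and day numbers are arbitrary.

```python
from collections import Counter


def job_interruptions(maintenance_days):
    """Count distinct days on which a job that needs every VM is interrupted."""
    return len(set(maintenance_days.values()))


def peak_unavailable_fraction(maintenance_days):
    """Return the largest fraction of VMs under maintenance on any single day."""
    counts = Counter(maintenance_days.values())
    return max(counts.values()) / len(maintenance_days)


vms = ["vm-0", "vm-1", "vm-2", "vm-3"]

# Grouped: every VM shares one maintenance day (day 10 is arbitrary).
grouped = {vm: 10 for vm in vms}

# Independent: each VM gets its own day (arbitrary, illustrative values).
independent = {vm: 10 + i for i, vm in enumerate(vms)}

print(job_interruptions(grouped))              # 1
print(job_interruptions(independent))          # 4
print(peak_unavailable_fraction(grouped))      # 1.0
print(peak_unavailable_fraction(independent))  # 0.25
```

In this model, grouped scheduling interrupts a tightly coupled job once instead of four times, while independent scheduling keeps at most a quarter of the capacity offline at any time.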
Manage host events
After you create H4D VMs and start your workload, you can set up alerts and receive notifications when maintenance for your VMs or reserved blocks is scheduled, starts, or is completed. You can also view and, if needed, manually start maintenance on a VM or reserved block before its scheduled time. These options help you proactively control and minimize downtime for your workloads.
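One way to act on maintenance notifications is to check, before launching a long job, which VMs have maintenance scheduled during the job's expected run time. Those VMs are candidates for starting maintenance early. The following Python sketch assumes you have already collected the scheduled start times into a dictionary. The function name, VM names, and dates are hypothetical, and the sketch does not model the notification payloads themselves.

```python
from datetime import datetime, timedelta, timezone


def vms_to_maintain_early(scheduled, job_start, job_duration):
    """Return VMs whose scheduled maintenance would begin while the job runs.

    scheduled maps a VM name to the timezone-aware start of its
    maintenance window (a layout assumed by this sketch).
    """
    job_end = job_start + job_duration
    return sorted(
        vm for vm, start in scheduled.items() if job_start <= start < job_end
    )


# Illustrative values only.
scheduled = {
    "vm-0": datetime(2026, 1, 10, 2, 0, tzinfo=timezone.utc),
    "vm-1": datetime(2026, 1, 20, 2, 0, tzinfo=timezone.utc),
}
job_start = datetime(2026, 1, 9, 0, 0, tzinfo=timezone.utc)
job_duration = timedelta(days=3)

print(vms_to_maintain_early(scheduled, job_start, job_duration))  # ['vm-0']
```

Here only vm-0 is flagged, because its window opens during the three-day job. You could then start maintenance on that VM manually before submitting the job.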
For more information, see the following:
Cluster monitoring and diagnostic tooling
For monitoring and troubleshooting, H4D instances include a faulty host reporting service, which you can use to flag issues with individual host machines.
What's next?
Create an HPC cluster with enhanced cluster management capabilities by using one of the following methods:
Observe and monitor VMs in your Slurm cluster
Last updated 2025-12-15 UTC.