Handle GPU host maintenance events
This document discusses how you can minimize disruptions to your GPU workloads during a maintenance event.
To learn how to monitor, plan for, and perform scheduled maintenance on virtual machine (VM) instances with Cluster Director, see instead Manage host events across VMs.
When Compute Engine performs maintenance on a virtual machine (VM) with attached graphics processing units (GPUs), the VM must be stopped. This is because VMs with attached GPUs can't be live migrated.
You must set these VMs to stop for host maintenance events. You can set your stopped VMs to automatically restart after the maintenance event completes.
Warning: For VMs with GPUs, data on any Local SSD disks attached to the VM is unrecoverable whenever Compute Engine stops the VM for host maintenance events. For more information, see Migrate your temporary data off of Local SSD disks in this document.
Host maintenance events typically occur once every two weeks, but might occasionally run more frequently.
Note: VMs with attached GPUs can take up to one hour to terminate after failures or host errors.
Receive advance notice before maintenance events
You can monitor the maintenance schedule for your virtual machine (VM) instance, and prepare your workloads to transition through the system restart.
To receive advance notice of host events, monitor the /computeMetadata/v1/instance/maintenance-event metadata value. If the request to the metadata server returns NONE, then the VM isn't scheduled to stop. For example, run the following command from within a VM:
curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"
NONE
If the metadata server returns TERMINATE_ON_HOST_MAINTENANCE, then your VM is scheduled for stopping. Compute Engine gives GPU VMs a 1-hour stopping notice, while normal VMs receive only a 60-second notice.
Use these notices to configure your application to transition through host maintenance events. For example, see Migrate your temporary data off of Local SSD disks in this document.
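As an illustration, the check performed by the curl command can be repeated on a timer from inside the application. The following Python code is a minimal sketch, not an official client: the URL, the Metadata-Flavor header, and the NONE and TERMINATE_ON_HOST_MAINTENANCE values come from this document, while the function names, the 30-second polling interval, and the on_notice callback are assumptions made for this example.

```python
import time
import urllib.request

# URL and header are taken from the curl command in this document.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)
STOP_SCHEDULED = "TERMINATE_ON_HOST_MAINTENANCE"


def fetch_maintenance_event(timeout=5.0):
    """Return the current maintenance-event value, for example "NONE"."""
    request = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def is_stop_scheduled(value):
    """Return True when the metadata value says the VM is scheduled to stop."""
    return value.strip() == STOP_SCHEDULED


def wait_for_stop_notice(on_notice, fetch=fetch_maintenance_event,
                         poll_seconds=30.0, sleep=time.sleep):
    """Poll until a stop is scheduled, then call on_notice() once.

    on_notice, fetch, poll_seconds, and sleep are hypothetical parameters
    for this sketch; fetch and sleep can be swapped out in tests.
    """
    while not is_stop_scheduled(fetch()):
        sleep(poll_seconds)
    on_notice()
```

An application might run wait_for_stop_notice in a background thread and pass a callback that saves work in progress. The sketch does not retry failed requests, so a production version would likely need error handling around the metadata call.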
Migrate your temporary data off of Local SSD disks
Due to Local SSD data persistence, data on any Local SSD disks attached to a VM is unrecoverable whenever Compute Engine stops the VM for host maintenance events. If you want to help prevent data loss, configure your workload to migrate data off of the Local SSD disks before the VM is stopped. For example, you might use one of the following techniques:
- Configure your application to temporarily move work in progress to a Cloud Storage bucket, then retrieve that data after the VM restarts.
- Write data to a secondary Persistent Disk. When the VM automatically restarts, the Persistent Disk can be reattached and your application can resume work.
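The second technique can be sketched as a directory copy from the Local SSD mount point to a directory on the Persistent Disk. This is a minimal sketch under stated assumptions: the function name and both directory arguments are hypothetical, and it assumes that both disks are already formatted and mounted inside the VM.

```python
import shutil
from pathlib import Path


def copy_scratch_data(local_ssd_dir, persistent_dir):
    """Copy everything under local_ssd_dir into persistent_dir.

    Both paths are hypothetical mount points chosen for this sketch.
    Returns the number of files found under local_ssd_dir.
    """
    source = Path(local_ssd_dir)
    target = Path(persistent_dir)
    # dirs_exist_ok (Python 3.8+) lets a repeated run overwrite an earlier copy.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return sum(1 for path in source.rglob("*") if path.is_file())
```

After the VM restarts, calling the same function with the arguments swapped would copy the saved data back onto the new, empty Local SSD disk. A copy of a large data set might not finish within the notice period, so writing checkpoints to the Persistent Disk continuously is likely the safer design.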
What's next?
- Learn more about GPU platforms.
- To learn more about managing and scaling groups of VMs, see Set the group's target size.
- To monitor GPU performance, see Monitor GPU performance.
- To improve network performance, see Use higher network bandwidth.
- Learn how to troubleshoot VM shutdowns and reboots.
Last updated 2025-12-16 UTC.