Handle GPU host maintenance events

This document describes how you can minimize disruptions to your GPU workloads during a maintenance event.

When Compute Engine performs maintenance on a Compute Engine instance with attached graphics processing units (GPUs), the compute instance must be stopped. This is because compute instances with attached GPUs can't be live migrated.

You must set these compute instances to stop for host maintenance events. You can set your stopped compute instances to automatically restart after the maintenance event completes.
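As an illustration, the two settings described above correspond to the `onHostMaintenance` and `automaticRestart` scheduling fields in the Compute Engine API. The following Python sketch only builds that request body. The function name is hypothetical, and sending the body (for example, through the `instances.setScheduling` method) needs authentication and project details that are omitted here.

```python
import json


def gpu_scheduling_body(automatic_restart: bool = True) -> dict:
    """Build a scheduling body that stops the instance for host maintenance.

    Hypothetical helper for illustration; it does not call any API.
    """
    return {
        # GPU instances can't be live migrated, so they must be stopped.
        "onHostMaintenance": "TERMINATE",
        # Restart the instance after the maintenance event completes.
        "automaticRestart": automatic_restart,
    }


if __name__ == "__main__":
    print(json.dumps(gpu_scheduling_body(), indent=2))
```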

Warning: For compute instances with GPUs, data on any Local SSD disks attached to the compute instance is unrecoverable whenever Compute Engine stops the instance for host maintenance events. For more information, see Migrate your temporary data off of Local SSD disks in this document.

Host maintenance events typically occur once every two weeks, but might occasionally run more frequently. Compute instances with attached GPUs can take up to one hour to terminate after failures or host errors.

Receive advance notice before maintenance events

Note: To learn how to monitor, plan for, and perform scheduled maintenance on A4X, A4, or A3 Ultra instances, see Manage host events across compute instances in the AI Hypercomputer documentation.

You can monitor the maintenance schedule for your Compute Engine instance, and prepare your workloads to transition through the system restart.

To receive advance notice of host events, monitor the /computeMetadata/v1/instance/maintenance-event metadata value. If the request to the metadata server returns NONE, then the compute instance isn't scheduled to stop. For example, run the following command from within a compute instance:

curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"
NONE

If the metadata server returns TERMINATE_ON_HOST_MAINTENANCE, then your compute instance is scheduled to stop. For compute instances that have attached GPUs, Compute Engine provides this notice 1 hour before the compute instance stops.
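The check can also be automated from inside the compute instance. The following is a minimal Python sketch, not an official client: it polls the metadata path shown earlier using only the standard library. The function names, the 5-second request timeout, the 30-second polling interval, and the `on_notice` callback are illustrative assumptions.

```python
import time
import urllib.request

# Same endpoint as the curl example; the host name resolves only from
# inside a compute instance.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)


def is_stop_scheduled(value: str) -> bool:
    """Return True when the metadata value announces a stop for maintenance."""
    return value.strip() == "TERMINATE_ON_HOST_MAINTENANCE"


def read_maintenance_event(timeout: float = 5.0) -> str:
    """Fetch the current maintenance-event value from the metadata server."""
    request = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def watch(on_notice, poll_seconds: float = 30.0) -> None:
    """Poll until a stop is scheduled, then call on_notice() once and return."""
    while True:
        if is_stop_scheduled(read_maintenance_event()):
            on_notice()
            return
        time.sleep(poll_seconds)
```

For example, a workload might call `watch(save_state)`, where `save_state` is whatever hypothetical routine the application uses to persist its work before the instance stops.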

For some GPU machine series, such as A3, Compute Engine provides notice of upcoming maintenance more than an hour in advance through the upcoming-maintenance metadata attribute. To learn more, see Monitor and plan for a host maintenance event.

Use these notices to configure your application to transition through host maintenance events. For example, see Migrate your temporary data off of Local SSD disks in this document.

Migrate your temporary data off of Local SSD disks

Due to Local SSD data persistence, data on any Local SSD disks attached to a compute instance is unrecoverable whenever Compute Engine stops the compute instance for host maintenance events. To help prevent data loss, configure your workload to migrate data off of the Local SSD disks before the compute instance is stopped. For example, you might use one of the following techniques:

Configure your application to temporarily move work in progress to a Cloud Storage bucket, then retrieve that data after the compute instance restarts.
Write data to a secondary Persistent Disk. When the compute instance automatically restarts, the Persistent Disk can be reattached and your application can resume work.
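
As a rough illustration of the Persistent Disk technique, the following Python sketch copies a working directory from Local SSD to a directory on a mounted Persistent Disk, and copies it back after the restart. The function names and the example paths in the usage note are hypothetical. The sketch assumes that the Persistent Disk is already formatted and mounted.

```python
import os
import shutil
import tempfile
from pathlib import Path


def save_checkpoint(local_ssd_dir: str, durable_dir: str,
                    name: str = "checkpoint") -> Path:
    """Copy work in progress from Local SSD to durable storage.

    The copy is staged under a temporary name and renamed into place only
    after it finishes, so an interrupted copy never replaces the previous
    complete checkpoint.
    """
    durable = Path(durable_dir)
    durable.mkdir(parents=True, exist_ok=True)
    final = durable / name
    previous = durable / (name + ".old")

    # Stage the new copy next to the final location (same file system).
    staging = Path(tempfile.mkdtemp(prefix=name + ".tmp-", dir=durable))
    shutil.copytree(local_ssd_dir, staging, dirs_exist_ok=True)

    # Swap: keep the old checkpoint until the new one is in place.
    if previous.exists():
        shutil.rmtree(previous)
    if final.exists():
        os.replace(final, previous)
    os.replace(staging, final)
    shutil.rmtree(previous, ignore_errors=True)
    return final


def restore_checkpoint(durable_dir: str, local_ssd_dir: str,
                       name: str = "checkpoint") -> bool:
    """Copy the saved checkpoint back to Local SSD; return False if none exists."""
    source = Path(durable_dir) / name
    if not source.is_dir():
        return False
    shutil.copytree(source, local_ssd_dir, dirs_exist_ok=True)
    return True
```

A workload might call `save_checkpoint("/mnt/localssd/work", "/mnt/pd/checkpoints")` when it receives the maintenance notice, and `restore_checkpoint("/mnt/pd/checkpoints", "/mnt/localssd/work")` at startup. Both paths are placeholders for wherever the disks are mounted.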

What's next?

Learn more about GPU platforms.
To learn more about managing and scaling groups of compute instances, see Set the group's target size.
To monitor GPU performance, see Monitor GPU performance.
To improve network performance, see Use higher network bandwidth.
Learn how to troubleshoot VM shutdowns and reboots.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-18 UTC.