Report faulty host
If you notice an issue on an A4X, A4, or A3 Ultra VM that you can't resolve otherwise, such as slower performance within a cluster or consistently high GPU temperatures, then you can report its host as faulty. When you report a host as faulty, Compute Engine automatically repairs the VM by running host maintenance. For A4 and A3 Ultra VMs, Compute Engine attempts to migrate the VM to a different host when maintenance starts, if you have unused reserved capacity or capacity is available in the VM's zone. Reporting a host as faulty helps you minimize downtime for your workload.
This document explains how to report and repair faulty hosts for virtual machine (VM) instances that are part of a Slurm cluster or other VM-based clusters. To report faulty hosts in a Google Kubernetes Engine (GKE) cluster, see Report faulty hosts through GKE.
Caution: Reporting a faulty host is a disruptive action that stops your VM. Before you report a host, thoroughly investigate your environment by using tools like the cluster health scanner (CHS) to identify the root cause of the issue. Only report the host as faulty if you have no alternatives to resolve your issue.
Limitations
When you report a faulty host, the following limitations apply:
You can only report a faulty host if the VM that runs on the host meets all of the following conditions:
The VM is running.
The VM uses an A4X, A4, or A3 Ultra machine type.
The VM uses the reservation-bound provisioning model.
Note: If a running A4X, A4, or A3 Ultra VM uses a different provisioning model, but you still want to report its host as faulty, then contact your account team.
Google Cloud makes best-effort attempts to fulfill all your report faulty host requests. However, due to capacity constraints or rate limits, a request might not always be fulfilled.
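The three conditions above can be checked before you send a report. The following is a minimal sketch, not an official client: it assumes you already have the VM's description as a dictionary (for example, parsed from JSON output), and it assumes that the description exposes the fields status, machineType, and scheduling.provisioningModel, that the reservation-bound model is reported as RESERVATION_BOUND, and that eligible machine type names start with a4x-, a4-, or a3-ultragpu-. The function name can_report_host is hypothetical. Verify these assumptions against your own VM descriptions.

```python
# Assumed machine type name prefixes for A4X, A4, and A3 Ultra VMs.
ELIGIBLE_PREFIXES = ("a4x-", "a4-", "a3-ultragpu-")


def can_report_host(instance: dict) -> tuple[bool, list[str]]:
    """Return (eligible, problems) for a VM description dictionary.

    The field names read here are assumptions about the VM description,
    not something this page documents.
    """
    problems = []

    # Condition 1: the VM is running.
    if instance.get("status") != "RUNNING":
        problems.append("VM is not running")

    # Condition 2: the VM uses an A4X, A4, or A3 Ultra machine type.
    # machineType may be a full URL, so keep only the last path segment.
    machine_type = instance.get("machineType", "").rsplit("/", 1)[-1]
    if not machine_type.startswith(ELIGIBLE_PREFIXES):
        problems.append(f"unsupported machine type: {machine_type or 'unknown'}")

    # Condition 3: the VM uses the reservation-bound provisioning model.
    model = instance.get("scheduling", {}).get("provisioningModel")
    if model != "RESERVATION_BOUND":
        problems.append(f"provisioning model is {model}, not RESERVATION_BOUND")

    return (not problems, problems)
```

A check like this only mirrors the documented limitations; a request that passes it can still go unfulfilled because of capacity constraints or rate limits.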
Before you begin
Select the tab for how you plan to use the samples on this page:
Console
When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.
gcloud
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
REST
To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.
Install the Google Cloud CLI. After installation, initialize the Google Cloud CLI by running the following command:

    gcloud init

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
For more information, see Authenticate for using REST in the Google Cloud authentication documentation.
Required roles
To get the permissions that you need to report a faulty host, ask your administrator to grant you the following IAM roles:
- Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1) on the VM or the project
- To view the state of a faulty host report operation by using Cloud Logging: Logs Viewer (roles/logging.viewer) on the project
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to report a faulty host. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to report a faulty host:
- To create a faulty host report: compute.instances.update on the VM
- To view a list of operations by using Logging: logging.operations.list on the project
- To view the details of an operation by using Logging: logging.operations.get on the project
- To view a list of operations in Compute Engine: compute.zoneOperations.list on the project
- To view the details of an operation in Compute Engine: compute.zoneOperations.describe on the project
You might also be able to get these permissions with custom roles or other predefined roles.
Understand the faulty host report process
After you report a faulty host for a VM, the time when the VM restarts varies based on the reservation operational mode that is specified in the reservation that the VM uses. To verify the reservation operational mode for a reservation, view the reservationOperationalMode field in the reservation. The following table summarizes the faulty host process for the two available reservation operational modes: all capacity mode and managed mode.

| | All capacity mode (ALL_CAPACITY) | Managed mode (HIGHLY_AVAILABLE_CAPACITY) |
|---|---|---|
| Supported machine types | A4X | A4 and A3 Ultra |
| Faulty host report API rate limiting | No rate limits apply. | Calls to the API may be rate-limited. |
| Faulty host report process | When you report a faulty host for a VM that runs in the all capacity mode, the following occurs: | When you report a faulty host for a VM that runs in the managed mode, the following occurs: |
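As a rough illustration of reading the reservationOperationalMode field, the sketch below maps a reservation description to the machine types and rate-limit rows of the table above. It assumes the reservation is available as a dictionary that contains that field; the names MODE_SUMMARY and summarize_mode are hypothetical, and the lookup encodes only what the table states.

```python
# Rows of the table above, keyed by the reservationOperationalMode value.
MODE_SUMMARY = {
    "ALL_CAPACITY": {
        "name": "all capacity mode",
        "machine_types": ("A4X",),
        "may_be_rate_limited": False,
    },
    "HIGHLY_AVAILABLE_CAPACITY": {
        "name": "managed mode",
        "machine_types": ("A4", "A3 Ultra"),
        "may_be_rate_limited": True,
    },
}


def summarize_mode(reservation: dict) -> dict:
    """Look up the table row for a reservation description dictionary."""
    mode = reservation.get("reservationOperationalMode")
    if mode not in MODE_SUMMARY:
        # Fail loudly instead of guessing how an unknown mode behaves.
        raise ValueError(f"unrecognized reservationOperationalMode: {mode!r}")
    return MODE_SUMMARY[mode]
```

For example, summarize_mode({"reservationOperationalMode": "HIGHLY_AVAILABLE_CAPACITY"}) returns the managed mode row, which signals that report calls may be rate-limited.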
Report a faulty host
To report a faulty host, complete the following steps:
1. Review the host on which your VM runs.
   For instructions, see View VMs topology.
2. Optional: Back up Local SSD data. When the VM stops, Compute Engine automatically discards the data of any Local SSD disks that are attached to the VM. You can't recover Local SSD data after Compute Engine discards it.
   For instructions on how to preserve Local SSD data, see Local SSD data backup.
3. Report the faulty host. To report a faulty host, select one of the following options. The host repair operation starts within a minute after the report faulty host operation completes. If the VM becomes unresponsive after you start the faulty host report operation, then wait at least 15 minutes before you restart the VM.
gcloud
To report a faulty host, use the following gcloud compute instances report-host-as-faulty command:

    gcloud compute instances report-host-as-faulty VM_NAME \
        --async \
        --disruption-schedule=IMMEDIATE \
        --fault-reasons=behavior=FAULT_REASON,description=DESCRIPTION \
        --zone=ZONE

Replace the following:
- VM_NAME: the name of the VM.
- FAULT_REASON: a list of host issues that your VM encountered, separated by commas (for example, ISSUE_1,ISSUE_2). You can specify the following values:
  - PERFORMANCE: the GPUs that are attached to the VM have performance issues compared to other GPUs in the cluster, you see no XID errors in the logs, and Compute Engine detects no other usual failure patterns such as silent data corruption.
  - SILENT_DATA_CORRUPTION: you see data corruption in your VM, but the VM keeps running. Silent data corruption can be due to issues like vCPU defects, software bugs, or kernel issues.
  - UNRECOVERABLE_GPU_ERROR: you identified an unrecoverable GPU error with an XID.
  - BEHAVIOR_UNSPECIFIED: you aren't sure what the issue with your VM is.
- DESCRIPTION: a description of the issue that is affecting your VM, such as XID information or suspected performance problems.
- ZONE: the zone where the VM exists.
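If you script this command, it can help to validate the fault reason before you call the gcloud CLI. The following sketch builds the argument list for a single fault reason and prints it without running anything. The helper name build_report_command is hypothetical, and rejecting commas in the description is a conservative assumption made here because the flag value uses commas as separators; it is not a documented restriction.

```python
import shlex

# Fault reason values listed on this page.
FAULT_REASONS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_command(vm_name: str, zone: str,
                         fault_reason: str, description: str) -> list[str]:
    """Build the report-host-as-faulty argument list for one fault reason."""
    if fault_reason not in FAULT_REASONS:
        raise ValueError(f"unknown fault reason: {fault_reason!r}")
    # Assumption: keep commas out of the description so that the
    # behavior=...,description=... value stays unambiguous.
    if "," in description:
        raise ValueError("description must not contain commas")
    return [
        "gcloud", "compute", "instances", "report-host-as-faulty", vm_name,
        "--async",
        "--disruption-schedule=IMMEDIATE",
        f"--fault-reasons=behavior={fault_reason},description={description}",
        f"--zone={zone}",
    ]


# Print a shell-quoted command for review instead of executing it.
command = build_report_command(
    "example-vm", "us-central1-a", "PERFORMANCE", "GPU 3 slower than peers")
print(shlex.join(command))
```

Returning a list rather than a single string keeps the description as one argument, so spaces in it need no manual quoting if you later pass the list to subprocess.run.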
REST
To report a faulty host, make a POST request to the instances.reportHostAsFaulty method. When you report a faulty host, you can specify multiple fault reasons at once. For example, to specify two fault reasons, make a request as follows:

    POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty

    {
      "disruptionSchedule": "IMMEDIATE",
      "faultReasons": [
        {
          "behavior": "FAULT_REASON_1",
          "description": "DESCRIPTION_1"
        },
        {
          "behavior": "FAULT_REASON_2",
          "description": "DESCRIPTION_2"
        }
      ]
    }

Replace the following:
- PROJECT_ID: the ID of the project where the VM exists.
- ZONE: the zone where the VM exists.
- VM_NAME: the name of the VM.
- FAULT_REASON_1 and FAULT_REASON_2: each host issue that your VM encountered. You can specify the following values:
  - PERFORMANCE: the GPUs that are attached to the VM have performance issues compared to other GPUs in the cluster, you see no XID errors in the logs, and Compute Engine detects no other usual failure patterns such as silent data corruption.
  - SILENT_DATA_CORRUPTION: you see data corruption in your VM, but the VM keeps running. Silent data corruption can be due to issues like vCPU defects, software bugs, or kernel issues.
  - UNRECOVERABLE_GPU_ERROR: you identified an unrecoverable GPU error with an XID.
  - BEHAVIOR_UNSPECIFIED: you aren't sure what the issue with your VM is.
- DESCRIPTION_1 and DESCRIPTION_2: a description for each host issue that you specified, such as XID information or suspected performance problems.
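The request above can be assembled programmatically. The following sketch builds the URL and JSON body for any number of fault reasons and leaves sending the request, including authentication, to the HTTP client of your choice. The function name build_report_request and the (behavior, description) tuple input are illustrative choices, not part of the API.

```python
import json

COMPUTE_BASE = "https://compute.googleapis.com/compute/v1"

# Fault reason values listed on this page.
FAULT_REASONS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_request(project_id: str, zone: str, vm_name: str,
                         faults: list[tuple[str, str]]) -> tuple[str, str]:
    """Return (url, json_body) for a reportHostAsFaulty POST request.

    faults is a list of (behavior, description) pairs.
    """
    if not faults:
        raise ValueError("specify at least one fault reason")
    for behavior, _ in faults:
        if behavior not in FAULT_REASONS:
            raise ValueError(f"unknown fault reason: {behavior!r}")

    url = (f"{COMPUTE_BASE}/projects/{project_id}/zones/{zone}"
           f"/instances/{vm_name}/reportHostAsFaulty")
    body = {
        "disruptionSchedule": "IMMEDIATE",
        "faultReasons": [
            {"behavior": behavior, "description": description}
            for behavior, description in faults
        ],
    }
    return url, json.dumps(body, indent=2)


url, body = build_report_request(
    "example-project", "us-central1-a", "example-vm",
    [("PERFORMANCE", "Training step time doubled on this VM"),
     ("UNRECOVERABLE_GPU_ERROR", "XID reported in kernel log")])
print(url)
print(body)
```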
Review report faulty host operations
After you report a faulty host, Compute Engine starts a series of operations to mark the host as faulty and prepare the host for repair. Specifically, during a report faulty host operation, the following process happens:
Mark the host as faulty. Compute Engine creates the report faulty host operation. The report faulty host operation then creates a sequence of sub-operations. These sub-operations mark the underlying host as faulty.
Prepare the host for repairs. After all sub-operations complete, the report faulty host operation starts. Compute Engine stops the VM and starts the repair faulty host operation. Based on the reservation operational mode that is specified in the reservation that the VM uses, and if healthy hosts are available, Compute Engine either keeps the VM stopped or attempts to automatically migrate and restart the VM.
Report completion and repair the host. Compute Engine completes the report faulty host operation, and the host repair operation runs.
To track the status of the report faulty host (compute.instances.reportHostAsFaulty) operations in your project, select one of the following options. For more information about other operations that you can use to track repairs, migration, and automatic restart, see Maintenance and restart behaviors and Monitor and plan for a host maintenance event in the Compute Engine documentation.
Console (VM operations)
In the Google Cloud console, go to the Operations page.
In the table that appears, locate the VM that you reported.
In the row that contains the VM, in the Status column, you can see the status of the report faulty host operation. When the operation completes, the value is Done.
Optional: To verify if Compute Engine has restarted the VM, view the details of the VM.
Console (VM logs)
In the Google Cloud console, go to the Logs Explorer page.
Verify that the Show query toggle is set to the on position.
In the query editor, enter the following query:

    resource.type="gce_instance" AND protoPayload.methodName=~"compute\.instances\.reportHostAsFaulty"

Click Run query. The Query results pane displays the query results.
gcloud
To view the status of the report faulty host operations in your project, use the gcloud compute operations list command with the --filter flag set to operationType:reportHostAsFaulty:

    gcloud compute operations list --filter="operationType:reportHostAsFaulty"

If you want to view the details of a specific faulty host operation, then use the gcloud compute operations describe command:

    gcloud compute operations describe OPERATION_NAME \
        --zone="ZONE"

Replace the following:
- OPERATION_NAME: the name of the operation.
- ZONE: the zone where the operation exists.
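To follow several reports at once, you can post-process the operations list. The sketch below assumes that you captured the list as JSON (for example, by adding the gcloud CLI's --format=json flag to the list command) and that each entry carries name, operationType, and status fields, with DONE marking completion. The function name unfinished_reports is hypothetical.

```python
import json


def unfinished_reports(operations_json: str) -> list[str]:
    """Return names of report faulty host operations that aren't DONE.

    operations_json is assumed to be a JSON array of operation objects.
    """
    operations = json.loads(operations_json)
    return [
        op["name"]
        for op in operations
        # Substring match mirrors the operationType:reportHostAsFaulty filter.
        if "reportHostAsFaulty" in op.get("operationType", "")
        and op.get("status") != "DONE"
    ]


# Sample input with made-up operation names.
sample = json.dumps([
    {"name": "operation-aaa", "operationType": "reportHostAsFaulty",
     "status": "DONE"},
    {"name": "operation-bbb", "operationType": "reportHostAsFaulty",
     "status": "RUNNING"},
])
print(unfinished_reports(sample))
```

You can then pass each returned name to the describe command to inspect that operation in detail.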
REST
To view the status of the report faulty host operations in your project, make a GET request to the zoneOperations.list method. In the request URL, include the filter query parameter set to items.operationType:reportHostAsFaulty.

    GET https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/operations?filter=items.operationType:reportHostAsFaulty

Replace the following:
- PROJECT_ID: the ID of the project where the operations exist.
- ZONE: the zone where the operations exist.
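When you build this URL in code, let a library encode the filter value rather than concatenating strings by hand. The following sketch uses Python's standard urllib.parse.urlencode and attaches the query string with a ? separator; the function name build_operations_list_url is hypothetical, and the filter expression is the one given on this page.

```python
from urllib.parse import urlencode

COMPUTE_BASE = "https://compute.googleapis.com/compute/v1"


def build_operations_list_url(project_id: str, zone: str) -> str:
    """Build the zoneOperations.list URL with a percent-encoded filter."""
    base = f"{COMPUTE_BASE}/projects/{project_id}/zones/{zone}/operations"
    # urlencode percent-encodes the colon in the filter expression.
    query = urlencode({"filter": "items.operationType:reportHostAsFaulty"})
    return f"{base}?{query}"


print(build_operations_list_url("example-project", "us-central1-a"))
```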
What's next?
- If you encounter issues when reporting a faulty host, then see Troubleshoot faulty host API.
Last updated 2025-12-15 UTC.