Report faulty host

If you notice an issue on an A4X, A4, or A3 Ultra VM that you can't resolve otherwise, such as slower performance within a cluster or consistently high GPU temperatures, then you can report its host as faulty. When you report a host as faulty, Compute Engine automatically repairs the VM by running host maintenance. For A4 and A3 Ultra VMs, Compute Engine attempts to migrate the VM to a different host when maintenance starts, if you have unused reserved capacity or capacity is available in the VM's zone. Reporting a host as faulty helps you minimize downtime for your workload.

This document explains how to report and repair faulty hosts for virtual machine (VM) instances that are part of a Slurm cluster or other VM-based clusters. To report faulty hosts in a Google Kubernetes Engine (GKE) cluster, see Report faulty hosts through GKE.

Caution: Reporting a faulty host is a disruptive action that stops your VM. Before you report a host, thoroughly investigate your environment by using tools like the cluster health scanner (CHS) to identify the root cause of the issue. Only report the host as faulty if you have no alternatives to resolve your issue.

Limitations

When you report a faulty host, the following limitations apply:

Before you begin

Select the tab for how you plan to use the samples on this page:

Console

When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.

gcloud

In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

REST

To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.

    Install the Google Cloud CLI. After installation, initialize the Google Cloud CLI by running the following command:

    gcloud init

    If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

For more information, see Authenticate for using REST in the Google Cloud authentication documentation.

Required roles

To get the permissions that you need to report a faulty host, ask your administrator to grant you the following IAM roles:

  • Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1) on the VM or the project
  • To view the state of a faulty host report operation by using Cloud Logging: Logs Viewer (roles/logging.viewer) on the project

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to report a faulty host. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to report a faulty host:

  • To create a faulty host report: compute.instances.update on the VM
  • To view a list of operations by using Logging: logging.operations.list on the project
  • To view the details of an operation by using Logging: logging.operations.get on the project
  • To view a list of operations in Compute Engine: compute.zoneOperations.list on the project
  • To view the details of an operation in Compute Engine: compute.zoneOperations.describe on the project

You might also be able to get these permissions with custom roles or other predefined roles.

Understand the faulty host report process

After you report a faulty host for a VM, the time when the VM restarts varies based on the reservation operational mode that is specified in the reservation that the VM uses. To verify the reservation operational mode for a reservation, view the reservationOperationalMode field in the reservation. The following table summarizes the faulty host process for the two available reservation operational modes: all capacity mode and managed mode.

                                      All capacity mode (ALL_CAPACITY)  Managed mode (HIGHLY_AVAILABLE_CAPACITY)
Supported machine types               A4X                               A4 and A3 Ultra
Faulty host report API rate limiting  No rate limits apply.             Calls to the API may be rate-limited.
Faulty host report process

When you report a faulty host for a VM that runs in the all capacity mode, the following occurs:

  1. Report the faulty host: The VM remains in the RUNNING state throughout the report faulty host operation, which usually takes 10-12 minutes to complete. To review the operation state, see Review report faulty host operations in this document.
  2. Repair the host: After the report faulty host operation completes, the host repair operation starts within a minute.

    When the repair host operation starts, the VM stops and its state changes depending on the automatic restart (automaticRestart) setting that is specified for the VM:

    • If automatic restart is enabled for the VM, the VM state changes to REPAIRING. The VM automatically restarts when its host is healthy unless you stop the VM before then.
    • If automatic restart is disabled for the VM, the VM state changes to TERMINATED. You need to manually restart the VM after its host is healthy.

    Repairing the faulty host can take 3-14 days, or even longer at times.

  3. Restart the VM: After the host repair operation completes (usually 3-14 days), one of the following occurs:

    • If the VM is in the REPAIRING state and the resources are available when the repair completes, then Compute Engine automatically restarts the VM on the repaired host.
    • Otherwise, if the VM is in the TERMINATED state or if resources aren't available when the repair completes, then the VM state stays in or changes to TERMINATED. You must manually restart the VM when you want it to run. However, restarting the VM might fail if resources aren't available when you restart the VM; for example, this can happen if other VMs are already using the repaired host.

When you report a faulty host for a VM that runs in the managed mode, the following occurs:

  1. Report the faulty host: The VM remains in the RUNNING state throughout the report faulty host operation, which usually takes 10-12 minutes to complete. To review the operation state, see Review report faulty host operations in this document.
  2. Start repairing the host: After the report faulty host operation completes, the host repair operation starts within a minute.

    When the repair host operation starts, the VM stops and its state changes depending on the automatic restart (automaticRestart) setting that is specified for the VM:

    • If automatic restart is enabled for the VM, the VM state changes to REPAIRING. The VM automatically restarts when its host is healthy unless you stop the VM before then.
    • If automatic restart is disabled for the VM, the VM state changes to TERMINATED. You need to manually restart the VM after its host is healthy.

    Repairing the faulty host can take 3-14 days, or even longer at times.

  3. Migrate and restart the VM: After the host repair operation starts (usually 10-12 minutes after you report the faulty host), Compute Engine attempts to reserve one more host to replace your reported faulty host in your reserved capacity. If Compute Engine finds a healthy host, either because it successfully replaces the faulty host or because it otherwise finds a matching healthy host in your reserved capacity, then Compute Engine migrates the VM to that host. The VM then restarts in one of the following ways:

    • If the VM is in the REPAIRING state and resources are available before or when the repair completes, then Compute Engine automatically restarts the VM on a healthy host.
    • Otherwise, if the VM is in the TERMINATED state or if resources aren't available before or when the repair completes, then the VM state stays in or changes to TERMINATED. You must manually restart the VM when you want it to run. However, restarting the VM might fail if resources aren't available when you restart the VM; for example, this can happen if other VMs are already using the repaired host.
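The state transitions described for both modes can be summarized in a short Python sketch. This is only an illustrative model of the rules in this section: the function names are hypothetical, nothing here calls a Compute Engine API, and the RUNNING result assumes that a successful automatic restart returns the VM to its normal running state.

```python
def state_during_repair(automatic_restart: bool) -> str:
    """VM state once the host repair operation starts and the VM stops."""
    return "REPAIRING" if automatic_restart else "TERMINATED"


def state_after_restart_decision(automatic_restart: bool,
                                 resources_available: bool) -> str:
    """VM state after Compute Engine decides whether to restart the VM.

    The rule is the same in both modes. In all capacity mode, resources are
    checked when the repair completes; in managed mode, a healthy host can
    become available before the repair completes.
    """
    if state_during_repair(automatic_restart) == "REPAIRING" and resources_available:
        return "RUNNING"  # assumption: automatic restart succeeded
    # The VM stays in, or changes to, TERMINATED; a manual restart is needed.
    return "TERMINATED"


def needs_manual_restart(automatic_restart: bool,
                         resources_available: bool) -> bool:
    """True when you must restart the VM yourself."""
    return state_after_restart_decision(
        automatic_restart, resources_available) == "TERMINATED"
```

For example, a VM with automatic restart enabled still ends up in TERMINATED if no resources are available, so `needs_manual_restart(True, False)` is true.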

Report a faulty host

To report a faulty host, complete the following steps:

  1. Review the host on which your VM runs.

    For instructions, see View VMs topology.

  2. Optional: Back up Local SSD data. When the VM stops, Compute Engine automatically discards the data of any Local SSD disks that are attached to the VM. You can't recover Local SSD data after Compute Engine discards it.

    For instructions on how to preserve Local SSD data, see Local SSD data backup.

  3. Report the faulty host. To report a faulty host, select one of the following options. The host repair operation starts immediately, within a minute after the report faulty host operation completes. If the VM becomes unresponsive after you start the faulty host report operation, then, after you wait for at least 15 minutes, we recommend that you restart the VM.

    gcloud

    To report a faulty host, use the following gcloud compute instances report-host-as-faulty command:

    gcloud compute instances report-host-as-faulty VM_NAME \
        --async \
        --disruption-schedule=IMMEDIATE \
        --fault-reasons=behavior=FAULT_REASON,description=DESCRIPTION \
        --zone=ZONE

    Replace the following:

    • VM_NAME: the name of the VM.

    • FAULT_REASON: a list of host issues that your VM encountered, separated by commas, for example, ISSUE_1,ISSUE_2. You can specify the following values:

      • PERFORMANCE: the GPUs that are attached to the VM have performance issues compared to other GPUs in the cluster, you see no XID errors in the logs, and Compute Engine detects no other usual failure patterns such as silent data corruption.

      • SILENT_DATA_CORRUPTION: you see data corruption in your VM, but the VM keeps running. Silent data corruption can be due to issues like vCPU defects, software bugs, or kernel issues.

      • UNRECOVERABLE_GPU_ERROR: you identified an unrecoverable GPU error with an XID.

      • BEHAVIOR_UNSPECIFIED: you aren't sure what the issue with your VM is.

    • DESCRIPTION: a description of the issue that is affecting your VM, such as XID information or suspected performance problems.

    • ZONE: the zone where the VM exists.

    REST

    To report a faulty host, make the following POST request to the instances.reportHostAsFaulty method.

    When you report a faulty host, you can specify multiple fault reasons at once. For example, to specify two fault reasons, make a request as follows:

    POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty

    {
      "disruptionSchedule": "IMMEDIATE",
      "faultReasons": [
        {
          "behavior": "FAULT_REASON_1",
          "description": "DESCRIPTION_1"
        },
        {
          "behavior": "FAULT_REASON_2",
          "description": "DESCRIPTION_2"
        }
      ]
    }

    Replace the following:

    • PROJECT_ID: the ID of the project where the VM exists.

    • ZONE: the zone where the VM exists.

    • VM_NAME: the name of the VM.

    • FAULT_REASON_1 and FAULT_REASON_2: each host issue that your VM encountered. You can specify the following values:

      • PERFORMANCE: the GPUs that are attached to the VM have performance issues compared to other GPUs in the cluster, you see no XID errors in the logs, and Compute Engine detects no other usual failure patterns such as silent data corruption.

      • SILENT_DATA_CORRUPTION: you see data corruption in your VM, but the VM keeps running. Silent data corruption can be due to issues like vCPU defects, software bugs, or kernel issues.

      • UNRECOVERABLE_GPU_ERROR: you identified an unrecoverable GPU error with an XID.

      • BEHAVIOR_UNSPECIFIED: you aren't sure what the issue with your VM is.

    • DESCRIPTION_1 and DESCRIPTION_2: a description for each host issue that you specified, such as XID information or suspected performance problems.
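If you assemble the REST request in a script, it can help to validate the fault reasons before sending anything, because reporting a host is disruptive. The following Python sketch only builds the URL and JSON body shown in the REST sample; it doesn't send the request or handle authentication. The function name `build_report_request` and the example project, zone, and VM values are hypothetical.

```python
import json

# Fault reason values listed in this document.
ALLOWED_BEHAVIORS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_request(project_id, zone, vm_name, fault_reasons):
    """Return (url, body) for a reportHostAsFaulty POST request.

    fault_reasons is a list of (behavior, description) pairs.
    """
    if not fault_reasons:
        raise ValueError("Specify at least one fault reason.")
    for behavior, _ in fault_reasons:
        if behavior not in ALLOWED_BEHAVIORS:
            raise ValueError(f"Unknown fault reason: {behavior}")
    url = (
        "https://compute.googleapis.com/compute/v1/"
        f"projects/{project_id}/zones/{zone}/instances/{vm_name}"
        "/reportHostAsFaulty"
    )
    body = {
        "disruptionSchedule": "IMMEDIATE",
        "faultReasons": [
            {"behavior": behavior, "description": description}
            for behavior, description in fault_reasons
        ],
    }
    return url, body


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    url, body = build_report_request(
        "example-project", "us-central1-a", "example-vm",
        [("PERFORMANCE", "Slower than other GPUs in the cluster")],
    )
    print(url)
    print(json.dumps(body, indent=2))
```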

Review report faulty host operations

After you report a faulty host, Compute Engine starts a series of operations to mark the host as faulty and prepare the host for repair. Specifically, during a report faulty host operation, the following process happens:

  1. Mark the host as faulty. Compute Engine creates the report faulty host operation. The report faulty host operation then creates a sequence of sub-operations. These sub-operations mark the underlying host as faulty.

  2. Prepare the host for repairs. After all sub-operations complete, the report faulty host operation starts. Compute Engine stops the VM and starts the repair faulty host operation. Based on the reservation operational mode that is specified in the reservation that the VM uses, and if healthy hosts are available, Compute Engine either keeps the VM stopped or attempts to automatically migrate and restart the VM.

  3. Report completion and repair the host. Compute Engine completes the report faulty host operation, and the host repair operation runs.

To track the status of the report faulty host (compute.instances.reportHostAsFaulty) operations in your project, select one of the following options. For more information about other operations that you can use to track repairs, migration, and automatic restart, see Maintenance and restart behaviors and Monitor and plan for a host maintenance event in the Compute Engine documentation.

Console (VM operations)

  1. In the Google Cloud console, go to the Operations page.

  2. In the table that appears, locate the VM that you reported.

  3. In the row that contains the VM, in the Status column, you can see the status of the report faulty host operation. When the operation completes, the value is Done.

  4. Optional: To verify if Compute Engine has restarted the VM, view the details of the VM.

Console (VM logs)

  1. In the Google Cloud console, go to the Logs Explorer page.

  2. Verify that the Show query toggle is set to the on position.

  3. In the query editor, enter the following query:

    resource.type="gce_instance" AND protoPayload.methodName=~"compute\.instances\.reportHostAsFaulty"

  4. Click Run query. The Query results pane displays the query results.
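If a project has many log entries, you might want to narrow the query to a time window. The following Python sketch appends a timestamp comparison to the query from step 3, using the Logging query language. The function name `build_log_query` is hypothetical, and the sketch only produces the query text; paste the result into the query editor.

```python
from datetime import datetime, timezone
from typing import Optional

# The query from step 3 of this procedure.
BASE_QUERY = (
    'resource.type="gce_instance" AND '
    'protoPayload.methodName=~"compute\\.instances\\.reportHostAsFaulty"'
)


def build_log_query(since: Optional[datetime] = None) -> str:
    """Return the base query, optionally limited to entries at or after `since`.

    `since` must be a timezone-aware datetime; it is converted to UTC.
    """
    if since is None:
        return BASE_QUERY
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f'{BASE_QUERY} AND timestamp>="{stamp}"'


if __name__ == "__main__":
    print(build_log_query(datetime(2025, 12, 1, tzinfo=timezone.utc)))
```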

gcloud

  1. To view the status of the report faulty host operations in your project, use the gcloud compute operations list command with the --filter flag set to operationType:reportHostAsFaulty:

    gcloud compute operations list --filter="operationType:reportHostAsFaulty"

  2. If you want to view the details of a specific faulty host operation, then use the gcloud compute operations describe command:

    gcloud compute operations describe OPERATION_NAME \
        --zone="ZONE"

    Replace the following:

    • OPERATION_NAME: the name of the operation.

    • ZONE: the zone where the operation exists.
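To script the check instead of reading the list by eye, you can ask the gcloud CLI for JSON output by adding the --format=json flag and then filter the result. The following Python sketch assumes the JSON is a list of operation resources with name, operationType, and status fields, where a finished operation has the status DONE. The function name `unfinished_reports` and the sample data are hypothetical.

```python
import json
from typing import List


def unfinished_reports(operations_json: str) -> List[str]:
    """Return names of report faulty host operations that aren't DONE yet.

    operations_json is the text printed by, for example:
      gcloud compute operations list \
          --filter="operationType:reportHostAsFaulty" --format=json
    """
    operations = json.loads(operations_json)
    return [
        op["name"]
        for op in operations
        if "reportHostAsFaulty" in op.get("operationType", "")
        and op.get("status") != "DONE"
    ]


if __name__ == "__main__":
    # Made-up sample output, for illustration only.
    sample = json.dumps([
        {"name": "operation-aaa", "operationType": "reportHostAsFaulty",
         "status": "RUNNING"},
        {"name": "operation-bbb", "operationType": "reportHostAsFaulty",
         "status": "DONE"},
    ])
    print(unfinished_reports(sample))
```

You can then pass each returned name to the gcloud compute operations describe command to inspect that operation.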

REST

To view the status of the report faulty host operations in your project, make a GET request to the zoneOperations.list method. In the request URL, include the filter query parameter set to items.operationType:reportHostAsFaulty.

GET https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/operations?filter=items.operationType:reportHostAsFaulty

Replace the following:

  • PROJECT_ID: the ID of the project where the operations exist.

  • ZONE: the zone where the operations exist.
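When you build this URL in code, let a library append and encode the query string so that the filter parameter starts after a `?` and special characters are escaped. The following Python sketch uses only the standard library; the function name `operations_list_url` is hypothetical, and the sketch doesn't send the request or handle authentication.

```python
from urllib.parse import urlencode


def operations_list_url(
    project_id: str,
    zone: str,
    filter_expr: str = "items.operationType:reportHostAsFaulty",
) -> str:
    """Return the zoneOperations.list URL with an encoded filter parameter."""
    base = (
        "https://compute.googleapis.com/compute/v1/"
        f"projects/{project_id}/zones/{zone}/operations"
    )
    # urlencode percent-encodes characters such as ":" in the filter value.
    return f"{base}?{urlencode({'filter': filter_expr})}"


if __name__ == "__main__":
    # Hypothetical project and zone, for illustration only.
    print(operations_list_url("example-project", "us-central1-a"))
```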

What's next?

Last updated 2025-12-15 UTC.