Troubleshoot GPU VMs

This page shows you how to resolve issues for VMs running on Compute Engine that have attached GPUs.

If you are trying to create a VM with attached GPUs and are getting errors, review Troubleshooting resource availability errors and Troubleshooting creating and updating VMs.

Troubleshoot GPU VMs by using NVIDIA DCGM

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments.

If you want to use DCGM to troubleshoot issues in your GPU environment, complete the following:

  • Ensure that you are using the latest recommended NVIDIA driver for the GPU model that is attached to your VM. To review driver versions, see Recommended NVIDIA driver versions.
  • Ensure that you installed the latest version of DCGM. To install the latest version, see DCGM installation. A quick way to check both versions is shown in the sketch after this list.
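The following is a minimal sketch for verifying both prerequisites from the VM itself. It assumes that the NVIDIA driver and DCGM are already installed and that nvidia-smi and dcgmi are on your PATH.

    # Print the installed NVIDIA driver version and GPU model so you can
    # compare them against the recommended driver versions.
    nvidia-smi --query-gpu=driver_version,name --format=csv

    # Print the installed DCGM version.
    dcgmi --version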

Diagnose issues

When you run a dcgmi diagnostic command, the issues reported by the diagnostic tool include next steps for taking action on the issue. The following example shows the actionable output from the dcgmi diag -r memory -j command.

{
  ........
  "category":"Hardware",
  "tests":[
     {
        "name":"GPU Memory",
        "results":[
           {
              "gpu_id":"0",
              "info":"GPU 0 Allocated 23376170169 bytes (98.3%)",
              "status":"Fail",
              "warnings":[
                 {
                    "warning":"Pending page retirements together with a DBE were detected on GPU 0. Drain the GPU and reset it or reboot the node to resolve this issue.",
                    "error_id":83,
                    "error_category":10,
                    "error_severity":6
                 }
              ]
           }
  .........

From the preceding output snippet, you can see that GPU 0 has pending page retirements that are caused by a non-recoverable error. The output provides the unique error_id and advice on debugging the issue. For this example output, it is recommended that you drain the GPU and reboot the VM. In most cases, following the instructions in this section of the output can help to resolve the issue.

Pro Tip: Take note of the error severity. In the example output, a value of "error_severity":6 corresponds to DCGM_ERROR_RESET, which means that a reset resolves issues with this severity value.

For a full list of error_severity values, review the dcgmErrorSeverity_enum section in the dcgm_errors GitHub file.
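If you want to extract only the actionable fields from the JSON output, you can filter it with a tool such as jq. The following is a minimal sketch, assuming the output structure matches the preceding snippet; the exact nesting can vary between DCGM versions, so adjust the filter as needed.

    # Run the memory diagnostic with JSON output and keep only failed results,
    # printing the GPU ID, the informational message, and any warnings
    # (including error_id and error_severity).
    dcgmi diag -r memory -j \
      | jq '.tests[].results[] | select(.status == "Fail") | {gpu_id, info, warnings}'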

Open a support case

If you are unable to resolve the issues by using the guidance provided by the output of your dcgmi diagnostic run, you can open a support case. When you open a support case, you need to provide the following information:

  1. The command that was run and the output returned.
  2. Relevant log files, such as host engine and diagnostic logs. To gather the required log files, you can run the gather-dcgm-logs.sh script (see the sketch after this list).

    For a default installation on Debian and RPM-based systems, this script is located in /usr/local/dcgm/scripts.

  3. For dcgmi diag failures, provide the stats files for the plugins that failed. The stats files use the following naming convention: stats_PLUGIN_NAME.json.

    For example, if the pcie plugin failed, include the file named stats_pcie.json.

  4. NVIDIA system information and driver state. To gather this information, you can run the nvidia-bug-report.sh script. If you are using an instance with Blackwell GPUs, follow Generate NVIDIA Bug Report for Blackwell GPUs to obtain a comprehensive bug report.

    Running this script also helps with additional debugging if the problem is caused by other NVIDIA dependencies and not a bug in DCGM itself.

  5. Details about any recent changes that were made to your environment preceding the failure.
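As an illustration, the following sketch runs the scripts from steps 2 and 4 on the VM before you attach the resulting files to the case. It assumes a default DCGM installation on a Debian or RPM-based system; adjust the paths for your environment.

    # Step 2: gather host engine and diagnostic logs.
    # The path below is the default install location on Debian and RPM-based systems.
    sudo /usr/local/dcgm/scripts/gather-dcgm-logs.sh

    # Step 4: gather NVIDIA system information and driver state.
    # By default, this writes nvidia-bug-report.log.gz to the current directory.
    sudo nvidia-bug-report.sh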

Xid messages

After you create a VM that has attached GPUs, you must install NVIDIA device drivers on your GPU VMs so that your applications can access the GPUs. However, sometimes these drivers return error messages.

An Xid message is an error report from the NVIDIA driver that is printed to the operating system's kernel log or event log for your Linux VM. These messages are placed in the /var/log/messages file.
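To check whether your VM has logged any Xid messages, you can search the kernel log. The following is a minimal sketch; depending on your Linux distribution, the messages might appear in /var/log/messages, in /var/log/syslog, or only in the dmesg buffer.

    # Search the kernel ring buffer for Xid reports from the NVIDIA driver.
    sudo dmesg -T | grep -i xid

    # Alternatively, search the system log file if your distribution writes one.
    sudo grep -i xid /var/log/messages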

For more information about Xid messages, including potential causes, see the NVIDIA documentation.

The following section provides guidance on handling some Xid messages grouped by the most common types: GPU memory errors, GPU System Processor (GSP) errors, and illegal memory access errors.

GPU memory errors

GPU memory is the memory that is available on a GPU that can be used for temporary storage of data. GPU memory is protected with Error Correction Code (ECC), which detects and corrects single bit errors (SBE) and detects and reports double bit errors (DBE).

Prior to the release of the NVIDIA A100 GPUs, dynamic page retirement was supported. For NVIDIA A100 and later GPU releases (such as NVIDIA H100), row remap error recovery was introduced. ECC is enabled by default. Google highly recommends keeping ECC enabled.
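To inspect the ECC, page retirement, and row remapping state of your GPUs, for example before and after acting on one of the following Xid messages, you can query it with nvidia-smi. This is a minimal sketch; the ROW_REMAPPER section is reported only on NVIDIA A100 and later GPUs.

    # Show ECC error counts, retired pages, and row remapping state for all GPUs.
    nvidia-smi -q -d ECC,PAGE_RETIREMENT,ROW_REMAPPER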

The following are common GPU memory errors and their suggested resolutions.

Xid 48: Double Bit ECC

  1. Stop your workloads.
  2. Delete and recreate the VM. If the error persists, file a case with Cloud Customer Care.

Xid 63: ECC page retirement or row remapping recording event

  1. Stop your workloads.
  2. Reset the GPUs.

Xid 64: ECC page retirement or row remapper recording failure, where the message contains the following information:

Xid 64: All reserved rows for bank are remapped

  1. Stop your workloads.
  2. Delete and recreate the VM. If the error persists, file a case with Cloud Customer Care.

If you get at least two of the following Xid messages together:

  • Xid 48
  • Xid 63
  • Xid 64

and the messages contain the following information:

Xid XX: row remap pending

then complete the following steps:

  1. Stop your workloads.
  2. Reset the GPUs. Resetting the GPU allows the row remap and page retirement process to complete and heal the GPU.

Xid 92: High single-bit ECC error rate

This Xid message is returned after the GPU driver corrects a correctable error, and it shouldn't affect your workloads. This Xid message is informational only; no action is needed.

Xid 94: Contained ECC error

  1. Stop your workloads.
  2. Reset the GPUs.

Xid 95: Uncontained ECC error

  1. Stop your workloads.
  2. Reset the GPUs.

GSP errors

A GPU System Processor (GSP) is a microcontroller that runs on GPUs and handles some of the low-level hardware management functions.

For both of the following Xid messages:

  • Xid 119: GSP RPC timeout
  • Xid 120: GSP error

Complete the following steps:

  1. Stop your workloads.
  2. Delete and recreate the VM. If the error persists, collect the NVIDIA bug report and file a case with Cloud Customer Care.

Illegal memory access errors

The following Xids are returned when applications have illegal memory access issues:

  • Xid 13: Graphics Engine Exception
  • Xid 31: GPU memory page fault

Illegal memory access errors are typically caused by your workloads trying to access memory that is already freed or is out of bounds. This can be caused by issues such as dereferencing an invalid pointer or accessing an array out of bounds.

To resolve this issue, you need to debug your application. To debug your application, you can use cuda-memcheck and CUDA-GDB.

In some very rare cases, hardware degradation might cause illegal memory access errors to be returned. To identify if the issue is with your hardware, use NVIDIA Data Center GPU Manager (DCGM). You can run dcgmi diag -r 3 or dcgmi diag -r 4 to run different levels of test coverage and duration. If you identify that the issue is with the hardware, file a case with Cloud Customer Care.
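The following is a minimal sketch of that debugging workflow. The binary name ./my_app is a placeholder for your own application, and on newer CUDA toolkits compute-sanitizer replaces cuda-memcheck.

    # Check the application for illegal memory accesses.
    cuda-memcheck ./my_app          # older CUDA toolkits
    compute-sanitizer ./my_app      # newer CUDA toolkits

    # If the application looks correct, check the hardware with DCGM.
    # -r 3 and -r 4 run progressively longer and more thorough diagnostics.
    dcgmi diag -r 3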

Other common Xid error messages

Xid 74: NVLINK error

  1. Stop your workloads.
  2. Reset the GPUs.

Xid 79: GPU has fallen off the bus

This means that the driver is not able to communicate with the GPU. Reboot the VM.

Xid 149 that mentions 0x02a, such as the following example:

Xid (PCI:0000:c0:00):149,NETIR_LINK_EVT Fatal XC0 i0 Link 04 (0x02a485c6 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000)

This indicates a known issue affecting firmware for NVIDIA B200 GPUs.

  1. Stop your workloads.
  2. Reset the GPUs.

Reset GPUs

Some issues might require you to reset your GPUs. To reset GPUs, complete the following steps:

  • For N1, G2, and A2 VMs, reboot the VM.
  • For A3 and A4 VMs, run sudo nvidia-smi --gpu-reset (see the example after this list).
    • For most Linux VMs, the nvidia-smi executable is located in the /var/lib/nvidia/bin directory.
    • For GKE nodes, the nvidia-smi executable is located in the /home/kubernetes/bin/nvidia directory.
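For example, on an A3 VM where nvidia-smi is in the default location described in the preceding list, the reset looks like the following sketch. Stop your workloads before you run it.

    # Reset the attached GPUs (add -i INDEX to target a single GPU).
    sudo /var/lib/nvidia/bin/nvidia-smi --gpu-reset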

Alternatively, GPUs are also reset whenever you reset a VM or restart a VM.

If errors persist after resetting the GPU, you need to delete and recreate the VM.

If the error persists after a delete and recreate, file a case with Cloud Customer Care to move the VM into the repair stage.

What's next

Review GPU machine types.
