Troubleshoot GPU VMs
This page shows you how to resolve issues for VMs running on Compute Engine that have attached GPUs.
If you are trying to create a VM with attached GPUs and are getting errors, review Troubleshooting resource availability errors and Troubleshooting creating and updating VMs.
Troubleshoot GPU VMs by using NVIDIA DCGM
NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments.
If you want to use DCGM to troubleshoot issues in your GPU environment, complete the following:
- Ensure that you are using the latest recommended NVIDIA driver for the GPU model that is attached to your VM. To review driver versions, see Recommended NVIDIA driver versions.
- Ensure that you installed the latest version of DCGM. To install the latest version, see DCGM installation.
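For example, you can confirm the installed driver and DCGM versions directly on the VM. The following is a minimal sketch; it assumes a standard driver and DCGM installation with nvidia-smi and dcgmi on your PATH:

```bash
# List the attached GPUs and the installed driver version.
nvidia-smi --query-gpu=name,driver_version --format=csv

# Print the installed DCGM version (assumes dcgmi is on your PATH).
dcgmi --version

# Verify that DCGM can discover the GPUs on this VM.
dcgmi discovery -l
```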
Diagnose issues
When you run a dcgmi diagnostic command, the issues reported by the diagnostic tool include next steps for taking action on the issue. The following example shows the actionable output from the dcgmi diag -r memory -j command.
{ ........ "category":"Hardware", "tests":[ { "name":"GPU Memory", "results":[ { "gpu_id":"0", "info":"GPU 0 Allocated 23376170169bytes (98.3%)", "status":"Fail", ""warnings":[ {"warning":"Pending pageretirements together with a DBE were detected on GPU 0. Drain the GPU and reset it or reboot the node to resolve this issue.", "error_id":83, "error_category":10,"error_severity":6 } ] } .........From the preceding output snippet, you can see thatGPU 0 has pending pageretirements that are caused by a non-recoverable error.The output provided the uniqueerror_id and advice on debugging the issue.For this example output, it is recommended that you drain the GPU and rebootthe VM. In most cases, following the instructions in this section of the outputcan help to resolve the issue.
Pro Tip: Take note of the error severity. In the example output, a value of "error_severity":6 corresponds to a DCGM_ERROR_RESET, which means that a reset resolves issues with this severity value.
For a full list of error_severity values, review the dcgmErrorSeverity_enum section in the dcgm_errors GitHub file.
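If you script your diagnostics, you can pull the warning objects out of the JSON output and inspect their error_severity values. The following is a minimal sketch; it assumes that jq is installed and that the output nests its warnings as shown in the preceding example:

```bash
# Run the memory diagnostic and print every warning object found anywhere
# in the JSON tree, along with its error_id and error_severity.
dcgmi diag -r memory -j | jq '.. | objects | select(has("warning")) | {warning, error_id, error_severity}'
```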
Open a support case
If you are unable to resolve the issues by using the guidance provided by the output of your dcgmi diagnostic run, you can open a support case. When you open a support case, you need to provide the following information:
- The command that was run and the output returned.
- Relevant log files such as host engine and diagnostic logs. To gather the required log files, you can run the gather-dcgm-logs.sh script. For a default installation on Debian and RPM-based systems, this script is located in /usr/local/dcgm/scripts.
- For dcgmi diag failures, provide the stats files for the plugins that failed. The stats file uses the following naming convention: stats_PLUGIN_NAME.json. For example, if the pcie plugin failed, include the file named stats_pcie.json.
- NVIDIA system information and driver state. To gather this information, you can run the nvidia-bug-report.sh script. If you are using an instance with Blackwell GPUs, follow Generate NVIDIA Bug Report for Blackwell GPUs to obtain a comprehensive bug report. Running this script also helps with additional debugging if the problem is caused by other NVIDIA dependencies and not a bug in DCGM itself.
- Details about any recent changes that were made to your environment preceding the failure.
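As a rough sketch, collecting this information on a default Debian or RPM-based installation might look like the following; the paths can differ if DCGM or the driver was installed in a non-default location:

```bash
# Collect DCGM host engine and diagnostic logs (default install path).
sudo /usr/local/dcgm/scripts/gather-dcgm-logs.sh

# Collect NVIDIA system information and driver state; this typically writes
# a compressed report such as nvidia-bug-report.log.gz to the current directory.
sudo nvidia-bug-report.sh
```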
Xid messages
After you create a VM that has attached GPUs, you must install NVIDIA device drivers on your GPU VMs so that your applications can access the GPUs. However, sometimes these drivers return error messages.
An Xid message is an error report from the NVIDIA driver that is printed to the operating system's kernel log or event log for your Linux VM. These messages are placed in the /var/log/messages file.
For more information about Xid messages, including potential causes, see the NVIDIA documentation.
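To check whether the driver has reported any Xid messages on a VM, you can search the kernel log. This is a minimal sketch; it assumes the /var/log/messages path mentioned above, which can differ by distribution:

```bash
# Search the kernel log for Xid messages (some distributions log to
# /var/log/syslog or the systemd journal instead).
sudo grep -i "xid" /var/log/messages

# Alternatively, search the in-memory kernel ring buffer.
sudo dmesg | grep -i "xid"
```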
The following sections provide guidance on handling some Xid messages, grouped by the most common types: GPU memory errors, GPU System Processor (GSP) errors, and illegal memory access errors.
GPU memory errors
GPU memory is the memory that is available on a GPU and can be used for temporary storage of data. GPU memory is protected with Error Correction Code (ECC), which detects and corrects single-bit errors (SBE) and detects and reports double-bit errors (DBE).
Prior to the release of the NVIDIA A100 GPUs, dynamic page retirement was supported. For NVIDIA A100 and later GPU releases (such as NVIDIA H100), row remap error recovery was introduced. ECC is enabled by default. Google highly recommends keeping ECC enabled.
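To check the current ECC state and any recorded memory errors on a VM, you can query the driver directly. The following is a minimal sketch; which of the last two queries applies is an assumption based on your GPU generation and driver version:

```bash
# Show the current ECC mode and aggregate ECC error counters per GPU.
nvidia-smi -q -d ECC

# Pre-A100 GPUs: show retired memory pages (dynamic page retirement).
nvidia-smi -q -d PAGE_RETIREMENT

# A100 and later GPUs: show row remapping state, including pending and
# failed remaps (requires a reasonably recent driver).
nvidia-smi -q -d ROW_REMAPPER
```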
The following are common GPU memory errors and their suggested resolutions.
| Xid error message | Resolution |
|---|---|
| Xid 48: Double Bit ECC | |
| Xid 63: ECC page retirement or row remapping recording event | |
| Xid 64: ECC page retirement or row remapper recording failure. The message contains the following information: Xid 64: All reserved rows for bank are remapped | |
| If you get at least two of the following Xid messages together, and the message contains the following information: Xid XX: row remap pending | |
| Xid 92: High single-bit ECC error rate | This Xid message is returned after the GPU driver corrects a correctable error, and it shouldn't affect your workloads. This Xid message is informational only. No action is needed. |
| Xid 94: Contained ECC error | |
| Xid 95: Uncontained ECC error | |
GSP errors
A GPU System Processor (GSP) is a microcontroller that runs on GPUs and handles some of the low-level hardware management functions.

| Xid error message | Resolution |
|---|---|
| Xid 119: GSP RPC timeout | |
| Xid 120: GSP error | |
Illegal memory access errors
The following Xids are returned when applications have illegal memory access issues:

- Xid 13: Graphics Engine Exception
- Xid 31: GPU memory page fault

Illegal memory access errors are typically caused by your workloads trying to access memory that is already freed or is out of bounds. This can be caused by issues such as the dereferencing of an invalid pointer or an out-of-bounds array access.
To resolve this issue, you need to debug your application. To debug your application, you can use cuda-memcheck and CUDA-GDB.
In some very rare cases, hardware degradation might cause illegal memory access errors to be returned. To identify if the issue is with your hardware, use NVIDIA Data Center GPU Manager (DCGM). You can run dcgmi diag -r 3 or dcgmi diag -r 4 to run different levels of test coverage and duration. If you identify that the issue is with the hardware, file a case with Cloud Customer Care.
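For example, a debugging pass might look like the following sketch, where ./my_cuda_app is a hypothetical placeholder for your own binary; newer CUDA toolkits ship compute-sanitizer as the successor to cuda-memcheck:

```bash
# Check the application for invalid or out-of-bounds device memory
# accesses (./my_cuda_app is a placeholder for your own binary).
cuda-memcheck ./my_cuda_app

# If the application looks clean, check the hardware with DCGM.
# -r 3 and -r 4 run progressively longer and more thorough test suites.
dcgmi diag -r 3
dcgmi diag -r 4
```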
Other common Xid error messages
| Xid error message | Resolution |
|---|---|
| Xid 74: NVLINK error | |
| Xid 79: GPU has fallen off the bus. This means the driver is not able to communicate with the GPU. | Reboot the VM. |
| Xid 149 that mentions 0x02a. This indicates a known issue affecting firmware for NVIDIA B200 GPUs. | |
Reset GPUs
Some issues might require you to reset your GPUs. To reset GPUs, complete the following steps:

- For N1, G2, and A2 VMs, reboot the VM.
- For A3 and A4 VMs, run sudo nvidia-smi --gpu-reset (a sketch of the full sequence follows this list).
  - For most Linux VMs, the nvidia-smi executable is located in the /var/lib/nvidia/bin directory.
  - For GKE nodes, the nvidia-smi executable is located in the /home/kubernetes/bin/nvidia directory.
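The following is a minimal sketch of that sequence on an A3 or A4 VM; it assumes nvidia-smi is on your PATH (see the directories listed above) and that no processes are currently using the GPU:

```bash
# Confirm that no processes are still using the GPUs; a reset can fail
# if the GPU is busy.
nvidia-smi

# Reset a single GPU by index (GPU 0 here), or run without -i as shown
# in the list above.
sudo nvidia-smi --gpu-reset -i 0

# Verify the GPU state after the reset, including ECC error counters.
nvidia-smi -q -d ECC
```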
Alternatively, GPUs are also reset whenever you reset a VM or restart a VM.
If errors persist after resetting the GPU, you need to delete and recreate the VM.
If the error persists after a delete and recreate, file a case with Cloud Customer Care to move the VM into the repair stage.
What's next
Review GPU machine types.