Troubleshoot GPUs in GKE
If your Google Kubernetes Engine (GKE) Pods are stuck in a Pending state while requesting nvidia.com/gpu resources, or if your nodes fail to register their available GPUs, you might have an issue with the NVIDIA driver installation or your node pool configuration. These problems prevent your workloads from accessing the GPU hardware that they need.
This document shows you how to diagnose and resolve common problems that prevent GKE from scheduling or running GPU-accelerated workloads. Learn how to verify the GPU driver installation, inspect Pod and node logs for errors, and confirm that your configurations are correct.
This information is for Platform admins and operators who manage GPU-enabled node pools and need to resolve NVIDIA driver issues, and for Application developers who need to debug GPU workloads that are stuck or failing to start. For more information about the common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
GPU driver installation
This section provides troubleshooting information for automatic NVIDIA device driver installation in GKE.
Driver installation fails in Ubuntu nodes
If you use Ubuntu nodes that have attached L4, RTX PRO 6000, H100, or H200 GPUs, the default GPU driver that GKE installs might not be at or later than the required version for those GPUs. As a result, the GPU device plugin Pod remains stuck in the Pending state and your GPU workloads on those nodes might experience issues.
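To check the status of the GPU device plugin Pods on the affected nodes, you can run a command similar to the following sketch. The label selector is an assumption and might differ in your cluster; adjust it to match your device plugin DaemonSet:

kubectl get pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin -o wide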
To resolve this issue, see the instructions for the respective GPU:
L4 and H100
To resolve this issue for L4 and H100 GPUs, we recommend upgrading to the following GKE versions, which install GPU driver version 535 as the default driver:
- 1.26.15-gke.1483000 and later
- 1.27.15-gke.1039000 and later
- 1.28.11-gke.1044000 and later
- 1.29.6-gke.1073000 and later
- 1.30.2-gke.1124000 and later
Alternatively, you can manually install driver version 535 or later by running the following command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R535.yaml

RTX PRO 6000
To resolve this issue for RTX PRO 6000 GPUs, upgrade to one of the following GKE versions. These versions install GPU driver version 580 as the default driver:
- 1.32.8-gke.1170000 and later
- 1.33.4-gke.1245000 and later
- 1.34.0-gke.1662000 and later
H200
To resolve this issue for H200 GPUs, you must manually install driver version 550 or later by running the following command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R550.yaml

GPU device plugins fail with CrashLoopBackOff errors
The following issue occurs if you used the manual driver installation method in your node pool prior to January 25, 2023 and later upgraded your node pool to a GKE version that supports automatic driver installation. Both installation workloads exist at the same time and try to install conflicting driver versions on your nodes.
The GPU device plugin init container fails with the Init:CrashLoopBackOff status. The logs for the container are similar to the following:
failed to verify installation: failed to verify GPU driver installation: exit status 18

To resolve this issue, try the following methods:
Remove the manual driver installation DaemonSet from your cluster. This deletes the conflicting installation workload and lets GKE automatically install a driver to your nodes.
Note: Ensure that all of your node pools use automatic installation before you delete the DaemonSet.

kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Re-apply the manual driver installation DaemonSet manifest to your cluster. On January 25, 2023, we updated the manifest to ignore nodes that use automatic driver installation.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Disable automatic driver installation for your node pool. The existing driver installation DaemonSet should work as expected after the update operation completes.
gcloud container node-pools update POOL_NAME \
    --accelerator=type=GPU_TYPE,count=GPU_COUNT,gpu-driver-version=disabled \
    --cluster=CLUSTER_NAME \
    --location=LOCATION

Replace the following:
- POOL_NAME: the name of the node pool.
- GPU_TYPE: the GPU type that the node pool already uses.
- GPU_COUNT: the number of GPUs that are already attached to the node pool.
- CLUSTER_NAME: the name of the GKE cluster that contains the node pool.
- LOCATION: the Compute Engine location of the cluster.
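For example, with hypothetical values (a node pool named gpu-pool that has two nvidia-l4 GPUs, in a cluster named my-cluster in us-central1-a), the command might look like the following sketch:

gcloud container node-pools update gpu-pool \
    --accelerator=type=nvidia-l4,count=2,gpu-driver-version=disabled \
    --cluster=my-cluster \
    --location=us-central1-a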
For more information about mapping the GPU driver version to the GKE version, see Map the GKE version and Container-Optimized OS node image version to the GPU driver version.
Error: "Container image cos-nvidia-installer:fixed is not present with pull policy of Never." or "Container image ubuntu-nvidia-installer:fixed is not present with pull policy of Never."
This issue occurs when the nvidia-driver-installer Pods are in the PodInitializing state and the GPU device plugin or the GPU driver installer Pods report the following error. The specific error message depends on the operating system running on your node:
COS
Container image "cos-nvidia-installer:fixed" is not present with pull policy of Never.Ubuntu
Container image "gke-nvidia-installer:fixed" is not present with pull policy of Never.This issue can occur when the garbage collector removes thepreloaded NVIDIA driver image to free space on a node. When the driver Pod is recreated or its container is restarted, GKE won't be able to locate the preloaded image.
To mitigate the garbage collection issue when you are running COS, upgrade your GKE nodes to one of these versions that contain the fix:
- 1.25.15-gke.1040000 and later
- 1.26.10-gke.1030000 and later
- 1.27.6-gke.1513000 and later
- 1.28.3-gke.1061000 and later
For more information about mapping the GPU driver version to the GKE version, see Map the GKE version and Container-Optimized OS node image version to the GPU driver version.
If your nodes are running Ubuntu, no fix is available yet for this garbage collection issue. To mitigate this issue on Ubuntu, you can run a privileged container that interacts with the host to ensure the correct setup of NVIDIA GPU drivers. To do so, run sudo /usr/local/bin/nvidia-container-first-boot from your node or apply the following manifest:
apiVersion: v1
kind: Pod
metadata:
  name: gke-nvidia-installer-fixup
spec:
  nodeSelector:
    cloud.google.com/gke-os-distribution: ubuntu
  hostPID: true
  containers:
  - name: installer
    image: ubuntu
    securityContext:
      privileged: true
    command:
      - nsenter
      - -at
      - '1'
      - --
      - sh
      - -c
      - "/usr/local/bin/nvidia-container-first-boot"
  restartPolicy: Never

Another potential cause of the issue is when the NVIDIA driver images are lost after node reboot or host maintenance. This may occur on confidential nodes, or nodes with GPUs, that use ephemeral local SSD storage. In this situation, GKE preloads the nvidia-installer-driver container images on nodes and moves them from the boot disk to the local SSD on first boot.
To confirm there was a host maintenance event, use the following log filter:
resource.type="gce_instance"protoPayload.serviceName="compute.googleapis.com"log_id("cloudaudit.googleapis.com/system_event")To mitigate the host maintenance issue, upgrade yourGKE version to one of these versions:
- 1.27.13-gke.1166000 and later
- 1.28.8-gke.1171000 and later
- 1.29.3-gke.1227000 and later
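The following is a minimal sketch of querying the host maintenance filter with the gcloud CLI. The PROJECT_ID placeholder and the --freshness window are assumptions; adjust them for your environment:

gcloud logging read \
    'resource.type="gce_instance" AND protoPayload.serviceName="compute.googleapis.com" AND log_id("cloudaudit.googleapis.com/system_event")' \
    --project=PROJECT_ID \
    --freshness=7d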
Error: failed to configure GPU driver installation dirs: failed to create lib64 overlay: failed to create dir /usr/local/nvidia/lib64: mkdir /usr/local/nvidia/lib64: not a directory.
You encounter this error from the GPU driver installer container inside the GPU device plugin when NCCL fastsocket is enabled:
failed to configure GPU driver installation dirs: failed to create lib64 overlay: failed to create dir /usr/local/nvidia/lib64: mkdir /usr/local/nvidia/lib64: not a directory.

This issue only happens on clusters and nodes running GKE 1.28 and 1.29.
The issue is caused by a race condition between NCCL fastsocket and the GPU driver installer.
To mitigate this issue, upgrade your GKE version to one of these versions:
- 1.28.8-gke.1206000 and later
- 1.29.3-gke.1344000 and later
For more information, read theGPUDirect-TCPXO Release Notes.
Error: Failed to get device for nvidia0: device nvidia0 not found.
The following error indicates an XID 62 error and that RmInitAdapter failed for the GPU with minor number 0:
Failed to get device for nvidia0: device nvidia0 not found.

NVIDIA driver version 525.105.17 has a bug that can cause communication errors (XID) and prevent the GPU from initializing properly.
To fix this issue, upgrade the NVIDIA driver to driver version 525.110.11 or later.
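To confirm which driver version a node is currently running, you can query nvidia-smi from the node. This sketch uses the binary path that appears in the GPU reset steps later in this document; the path might differ on your node image:

/home/kubernetes/bin/nvidia/bin/nvidia-smi --query-gpu=driver_version --format=csv,noheader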
Map the GKE version and Container-Optimized OS node image version to the GPU driver version
To find the GPU driver versions that are mapped with GKE versions and Container-Optimized OS node image versions, do the following steps:
- Map Container-Optimized OS node image versions to GKE patch versions for the specific GKE version where you want to find the GPU driver version. For example, 1.33.0-gke.1552000 uses cos-121-18867-90-4.
- Choose the milestone of the Container-Optimized OS node image version in the Container-Optimized OS release notes. For example, choose Milestone 121 for cos-121-18867-90-4.
- In the release notes page for the specific milestone, find the release note corresponding with the specific Container-Optimized OS node image version. For example, in Container-Optimized OS Release Notes: Milestone 121, see cos-121-18867-90-4. In the table, in the GPU Drivers column, click See List to see the GPU driver version information.
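To find the GKE patch version that your nodes run, which is the starting point for this mapping, you can read the kubelet version from the node status. This is a minimal sketch; on GKE nodes, the reported kubelet version typically matches the node's GKE patch version:

kubectl get nodes -o custom-columns='NAME:.metadata.name,GKE_VERSION:.status.nodeInfo.kubeletVersion'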
The nvidia-smi command fails
When you use GPU VMs to run workloads in GKE, commands like nvidia-smi might fail within the container with any of the following errors:
- bash: nvidia-smi: command not found
- Errors indicating that libnvidia-ml.so or other NVIDIA libraries cannot be found.
GKE mounts the necessary NVIDIA drivers and tools from the host node in your containers, typically under the /usr/local/nvidia/ path. However, the container's default environment variables (PATH and LD_LIBRARY_PATH) might not include the paths to these NVIDIA binaries and libraries.
To solve these errors, update your Pod or Deployment manifest to include the necessary NVIDIA paths in the PATH and LD_LIBRARY_PATH environment variables for the container.
For example, add the following env block to your spec.template.spec.containers spec:
spec:
  containers:
  - name: gpu-container
    image: gpu-image
    env:
    - name: LD_LIBRARY_PATH
      # Prepend NVIDIA lib64 directory to existing LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
    - name: PATH
      # Prepend NVIDIA bin directory to existing PATH
      value: /usr/local/nvidia/bin:$PATH
    # ... other container settings

Reset GPUs on A3 VMs
Some issues might require you to reset the GPU on an A3 VM.
To reset the GPU, follow these steps:
Remove Pods that request GPU resources from the node where you need to reset the GPU.
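To see which Pods run on that node so that you can remove the ones that request GPUs, you can use a field selector; this is a minimal sketch:

kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE_NAME

Replace NODE_NAME with the name of the node.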
Disable the GPU device plugin on the node:
kubectl get nodes \
    --selector=kubernetes.io/hostname=NODE_NAME \
    --no-headers | awk '{print $1}' \
    | xargs -I{} kubectl label node {} gke-no-default-nvidia-gpu-device-plugin=true

Replace NODE_NAME with the name of the node.

Connect to the VM backing the node.
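For example, you can connect over SSH with a command similar to the following sketch. The zone flag is an assumption; use the zone of the node's VM:

gcloud compute ssh NODE_NAME --zone=ZONE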
In the SSH session, reset the GPU:
/home/kubernetes/bin/nvidia/bin/nvidia-smi --gpu-reset

Re-enable the GPU device plugin:
kubectl get nodes --selector=kubernetes.io/hostname=NODE_NAME \
    --no-headers \
    | awk '{print $1}' \
    | xargs -I{} kubectl label node {} gke-no-default-nvidia-gpu-device-plugin=false \
    --overwrite
GPUs on Confidential GKE Nodes
The following sections show you how to identify and fix issues with GPUs that run on Confidential GKE Nodes.
GPU workloads not scheduling on Confidential GKE Nodes
Confidential GKE Nodes requires that you manually install a GPU driver that corresponds to your selected GPU type and your GKE version. If your GPU Pods aren't scheduling on Confidential GKE Nodes and remain in the Pending state, describe the driver installation DaemonSet:
kubectl --namespace=kube-system get daemonset nvidia-driver-installer -o yaml

If the output returns a NotFound error, install the driver.
If the DaemonSet is running, the output is similar to the following:
apiVersion: apps/v1
kind: DaemonSet
# lines omitted for clarity
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: nvidia-driver-installer
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: nvidia-driver-installer
        name: nvidia-driver-installer
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
              - key: cloud.google.com/gke-gpu-driver-version
                operator: DoesNotExist
              - key: cloud.google.com/gke-confidential-nodes-instance-type
                operator: In
                values:
                - TDX

In this output, verify that the nodeAffinity field contains the cloud.google.com/gke-confidential-nodes-instance-type key. If the output doesn't contain this key, the driver installation DaemonSet doesn't support Confidential GKE Nodes.
Deploy the DaemonSet that supports GPUs on Confidential GKE Nodes:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/cos/daemonset-confidential.yaml

After you install the drivers, check whether your GPU workloads start successfully.
Error: Failed to allocate device vector
The following error message in your GPU container logs indicates that the GPU was detached from the node VM instance:
Failed to allocate device vector A (error code unknown error)!

This detachment might happen because of a hardware error or because of an issue with the encryption keys.
To resolve this issue, reboot the node instance. This operation is disruptive and affects all of the workloads on that node. To reboot the instance, do the following steps:
Get the name of the node that runs the GPU Pod:
kubectl get pod POD_NAME -o yaml | grep "nodeName"

Replace POD_NAME with the name of the failing Pod.

The output is similar to the following:
nodeName: gke-cluster-1-default-pool-b7asdfbt-fd3e

Reset the Compute Engine instance:
gcloud compute instances reset NODE_NAME

Replace NODE_NAME with the node name from the output of the previous step.

The gcloud CLI looks for VMs with that name in your active project. If you see a prompt to select a zone, specify Y.

Check whether your GPU workloads run without errors.
Error: Decryption failed with error -74
The following error message in your node logs indicates that the encryption keys for the GPU were lost:
Decryption failed with error -74

This error happens when the NVIDIA persistence daemon, which runs on the node VM instance, fails. If you see this error message, reset the instance:
gcloud compute instances reset NODE_NAME

Replace NODE_NAME with the name of the failing node.
The gcloud CLI looks for VMs with that name in your active project. If you see a prompt to select a zone, specify Y.
If resetting the instance doesn't fix this issue, contact Cloud Customer Care or submit a product bug. For more information, see Get support.
Finding XID errors
The gpu-device-plugin DaemonSet runs within the kube-system namespace and is responsible for the following:
- GPU workload scheduling: allocating GPU resources to Pods.
- GPU health checking: monitoring the health of your GPUs.
- GPU metrics gathering: collecting GPU-related metrics, such as duty cycle and memory usage.
The gpu-device-plugin uses the NVIDIA Management Library (NVML) to detect XID errors. When an XID error occurs, the gpu-device-plugin Pod running on the affected node logs the error. There are two types of XID error logs:
- Non-critical XID errors:
  - Log format: Skip error Xid=%d as it is not Xid Critical
  - Meaning: These errors are considered non-critical. They can be caused by various software or hardware issues.
  - Action: GKE takes no automated action for non-critical XID errors.
- Critical XID errors:
  - Log format: XidCriticalError: Xid=%d, All devices will go unhealthy
  - Meaning: These errors indicate a GPU hardware issue.
  - Action:
    - GKE marks the node's GPU resources as unhealthy.
    - GKE prevents GPU workloads from being scheduled on the node.
    - If node auto-repair is enabled, GKE will recreate the node.
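To search for these XID log entries, you can read the logs of the device plugin Pods. This is a minimal sketch; the label selector is an assumption that might differ in your cluster, and on large clusters you might need to inspect the Pods individually:

kubectl logs -n kube-system -l k8s-app=nvidia-gpu-device-plugin --prefix | grep -i xid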
GPUDirect-TCPX(O) issues
This section provides troubleshooting information for GPUDirect-TCPX(O) issues. If you are using GKE version 1.34 or later, also see the GKE Known issues page.
Release note and upgrade instructions
For new users, Maximize GPU network bandwidth in Standard mode clusters provides guidance on using GPUDirect-TCPX(O). For existing users, read the GPUDirect-TCPXO Release Notes for release information and upgrade instructions, because new versions are continuously released.
tcpx-daemon and tcpxo-daemon failure after GKE version upgrade
You might encounter the following error if your tcpx-daemon or tcpxo-daemon Pods fail:
cuda error detected! name: CUDA_ERROR_NO_DEVICE; string: no CUDA-capable device is detected

Also, check if the device-injector Pods in the kube-system namespace are in the CrashLoopBackOff or Error state.
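For example, you can check the status of those Pods with a command like the following sketch. The name filter assumes the DaemonSet is named device-injector, matching the deletion command later in this section:

kubectl get pods -n kube-system | grep device-injector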
This issue occurs if you upgraded your GKE node pools to one of the following versions or later:
- For GKE 1.31: 1.31.14-gke.1033000
- For GKE 1.32: 1.32.9-gke.1575000
- For GKE 1.33: 1.33.5-gke.1862000
- For GKE 1.34: 1.34.1-gke.3225000
- For GKE 1.35 and later
The issue is caused by using an older version of the NRI device injector manifest that includes an initContainer named enable-nri. In these affected GKE versions, NRI is enabled by default in the node configuration. The obsolete enable-nri container attempts to modify the configuration and restart the container runtime, which conflicts with the system defaults and causes the nri-device-injector Pods to crash. This failure prevents GPU devices from being exposed to the tcpx-daemon or tcpxo-daemon container.
To resolve this issue, update the nri-device-injector DaemonSet to the latest version, which removes the conflicting initContainer.
Delete the existing DaemonSet:
kubectl delete daemonset device-injector -n kube-system

Apply the latest manifest:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nri_device_injector/nri-device-injector.yaml

Restart your workload Pods.
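If your workload runs as a Deployment, you can restart its Pods with a rolling restart. This is a sketch; the workload name and namespace are placeholders, and other workload types (such as Jobs) might need to be deleted and recreated instead:

kubectl rollout restart deployment WORKLOAD_NAME -n NAMESPACE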
Debug with NCCL logs
If you can't resolve an issue with NCCL, collect NCCL logs with debugging information. These logs contain valuable information about NCCL operations and can help you find the source of your problem. If you still can't resolve the issue, collect these logs before you open a case with Cloud Customer Care; they can help Cloud Customer Care resolve your issue more quickly.
To generate and collect the logs, complete the following steps:
Set the following environment variables inside your Pod or application manifest:
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,NET,ENV,COLL,GRAPH
NCCL_DEBUG_FILE=/DIRECTORY/FILE_NAME.%h.%p

For more information about these environment variables, read collect NCCL debugging logs.
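For example, in a Pod or Deployment manifest, you might set these variables in the container's env block. This is a minimal sketch; the container name, image, and log path are placeholders:

spec:
  containers:
  - name: nccl-workload
    image: WORKLOAD_IMAGE
    env:
    - name: NCCL_DEBUG
      value: "INFO"
    - name: NCCL_DEBUG_SUBSYS
      value: "INIT,NET,ENV,COLL,GRAPH"
    - name: NCCL_DEBUG_FILE
      value: "/tmp/nccl_log.%h.%p"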
To generate data for your logs, run an NCCL test. The way to run this test depends on the type of cluster that you use. For GKE clusters, you can deploy and run NCCL test with Topology Aware Scheduling (TAS). After you run the NCCL test, NCCL automatically generates the logs on all participating nodes.
Collect the logs from all nodes. Confirm that you've collected NCCL logs from all nodes by checking that the logs contain the following information:
- The hostnames of all VMs that are involved in a workload.
- The PIDs of all relevant processes on the VM.
- The ranks of all GPUs that are used by the workload on each VM.
If you're not sure where the log files are located, the following example shows you where NCCL creates the log files when the NCCL_DEBUG_FILE variable is set to /tmp/nccl_log.%h.%p. You have two VMs named a3plus-vm-1 and a3plus-vm-2, and each VM runs eight processes within the workload container. In this scenario, NCCL creates the following log files under the /tmp directory within the workload container on each VM:
- On a3plus-vm-1: eight log files named nccl_log.a3plus-vm-1.PID, where PID is the process ID.
- On a3plus-vm-2: eight log files named nccl_log.a3plus-vm-2.PID.
Review the logs. NCCL log entries have the following format:
HOSTNAME:PID:TID [RANK] NCCL_MESSAGE

These log entries contain the following values:
- HOSTNAME: the hostname of the VM. This value identifies which VM was being used when NCCL generated the log entry.
- PID: the PID. This value identifies which process generated the log entry.
- TID: the thread ID. This value identifies which thread within the process was being used when NCCL generated the log entry.
- RANK: the local rank ID. This value identifies which GPU was being used when NCCL generated the log entry. Ranks are numbered from 0-N, where N is the total number of GPUs that are involved in the process. For example, if your workload runs with eight GPUs per VM, then each VM should have eight different rank values (0-7).
- NCCL_MESSAGE: a descriptive message that provides more information about the event and includes the timestamp of when NCCL created the log.
For example:
gke-a3plus-mega-np-2-aa33fe53-7wvq:1581:1634 [1] NCCL INFO 00:09:24.631392: NET/FasTrak plugin initialized.

This example has the following values:
- gke-a3plus-mega-np-2-aa33fe53-7wvq: the hostname.
- 1581: the process ID.
- 1634: the thread ID.
- 1: the local rank ID.
- NCCL INFO 00:09:24.631392: NET/FasTrak plugin initialized.: the message explaining what happened.
If you're opening a support case, package the logs that you collected, along with the output of the NCCL test, into a zip file. Include the zip file when you submit a support case to Cloud Customer Care.
To stop collecting NCCL debugging logs, remove the variables that you added in step 1.
What's next
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on StackOverflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.