Troubleshoot GPUs in GKE
If your Google Kubernetes Engine (GKE) Pods are stuck in a Pending state while requesting nvidia.com/gpu resources, or if your nodes fail to register their available GPUs, you might have an issue with the NVIDIA driver installation or your node pool configuration. These problems prevent your workloads from accessing the GPU hardware that they need.
This document shows you how to diagnose and resolve common problems that prevent GKE from scheduling or running GPU-accelerated workloads. Learn how to verify the GPU driver installation, inspect Pod and node logs for errors, and confirm that your configurations are correct.
This information is for Platform admins and operators who manage GPU-enabled node pools and need to resolve NVIDIA driver issues, and for Application developers who need to debug GPU workloads that are stuck or failing to start. For more information about the common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
GPU driver installation
This section provides troubleshooting information for automatic NVIDIA device driver installation in GKE.
Driver installation fails in Ubuntu nodes
If you use Ubuntu nodes that have attached L4, RTX PRO 6000, H100, or H200 GPUs, the default GPU driver that GKE installs might not be at or later than the required version for those GPUs. As a result, the GPU device plugin Pod remains stuck in the Pending state and your GPU workloads on those nodes might experience issues.
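To check the status of the GPU device plugin Pods on the affected nodes, you can run a command similar to the following sketch. The label selector is an assumption and might differ in your cluster; adjust it to match your device plugin DaemonSet:

kubectl get pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin -o wide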
To resolve this issue, see the instructions for the respective GPU:
L4 and H100
To resolve this issue for L4 and H100 GPUs, we recommend upgrading to the following GKE versions, which install GPU driver version 535 as the default driver:
- 1.26.15-gke.1483000 and later
- 1.27.15-gke.1039000 and later
- 1.28.11-gke.1044000 and later
- 1.29.6-gke.1073000 and later
- 1.30.2-gke.1124000 and later
Alternatively, you can manually install driver version 535 or later by running the following command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R535.yaml

RTX PRO 6000
To resolve this issue for RTX PRO 6000 GPUs, upgrade to one of the following GKE versions. These versions install GPU driver version 580 as the default driver:
- 1.32.8-gke.1170000 and later
- 1.33.4-gke.1245000 and later
- 1.34.0-gke.1662000 and later
H200
To resolve this issue for H200 GPUs, you must manually install driver version 550 or later by running the following command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R550.yaml

GPU device plugins fail with CrashLoopBackOff errors
The following issue occurs if you used the manual driver installation method in your node pool prior to January 25, 2023 and later upgraded your node pool to a GKE version that supports automatic driver installation. Both installation workloads exist at the same time and try to install conflicting driver versions on your nodes.
The GPU device plugin init container fails with the Init:CrashLoopBackOff status. The logs for the container are similar to the following:
failed to verify installation: failed to verify GPU driver installation: exit status 18

To resolve this issue, try the following methods:
Remove the manual driver installation DaemonSet from your cluster. This deletes the conflicting installation workload and lets GKE automatically install a driver to your nodes.
Note: Ensure that all of your node pools use automatic installation before you delete the DaemonSet.

kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Re-apply the manual driver installation DaemonSet manifest to your cluster. On January 25, 2023, we updated the manifest to ignore nodes that use automatic driver installation.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Disable automatic driver installation for your node pool. The existing driver installation DaemonSet should work as expected after the update operation completes.
gcloud container node-pools update POOL_NAME \
    --accelerator=type=GPU_TYPE,count=GPU_COUNT,gpu-driver-version=disabled \
    --cluster=CLUSTER_NAME \
    --location=LOCATION

Replace the following:
- POOL_NAME: the name of the node pool.
- GPU_TYPE: the GPU type that the node pool already uses.
- GPU_COUNT: the number of GPUs that are already attached to the node pool.
- CLUSTER_NAME: the name of the GKE cluster that contains the node pool.
- LOCATION: the Compute Engine location of the cluster.
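For example, with hypothetical values (a node pool named gpu-pool that has two nvidia-l4 GPUs, in a cluster named my-cluster in us-central1-a), the command might look like the following sketch:

gcloud container node-pools update gpu-pool \
    --accelerator=type=nvidia-l4,count=2,gpu-driver-version=disabled \
    --cluster=my-cluster \
    --location=us-central1-a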
For more information about mapping the GPU driver version to the GKE version, see Map the GKE version and Container-Optimized OS node image version to the GPU driver version.
Error: "Container image cos-nvidia-installer:fixed is not present with pull policy of Never." or "Container image ubuntu-nvidia-installer:fixed is not present with pull policy of Never."
This issue occurs when the nvidia-driver-installer Pods are in the PodInitializing state and the GPU device plugin or the GPU driver installer Pods report the following error. The specific error message depends on the operating system running on your node:
COS
Container image "cos-nvidia-installer:fixed" is not present with pull policy of Never.Ubuntu
Container image "gke-nvidia-installer:fixed" is not present with pull policy of Never.This issue can occur when the garbage collector removes thepreloaded NVIDIA driver image to free space on a node. When the driver Pod is recreated or its container is restarted, GKE won't be able to locate the preloaded image.
To mitigate the garbage collection issue when you are running COS, upgrade your GKE nodes to one of these versions that contain the fix:
- 1.25.15-gke.1040000 and later
- 1.26.10-gke.1030000 and later
- 1.27.6-gke.1513000 and later
- 1.28.3-gke.1061000 and later
For more information about mapping the GPU driver version to the GKE version, see Map the GKE version and Container-Optimized OS node image version to the GPU driver version.
If your nodes are running Ubuntu, no fix is available yet for this garbage collection issue. To mitigate this issue on Ubuntu, you can run a privileged container that interacts with the host to ensure the correct setup of NVIDIA GPU drivers. To do so, run sudo /usr/local/bin/nvidia-container-first-boot from your node or apply the following manifest:
apiVersion: v1
kind: Pod
metadata:
  name: gke-nvidia-installer-fixup
spec:
  nodeSelector:
    cloud.google.com/gke-os-distribution: ubuntu
  hostPID: true
  containers:
  - name: installer
    image: ubuntu
    securityContext:
      privileged: true
    command:
      - nsenter
      - -at
      - '1'
      - --
      - sh
      - -c
      - "/usr/local/bin/nvidia-container-first-boot"
  restartPolicy: Never

Another potential cause of the issue is when the NVIDIA driver images are lost after node reboot or host maintenance. This may occur on confidential nodes, or nodes with GPUs, that use ephemeral local SSD storage. In this situation, GKE preloads the nvidia-installer-driver container images on nodes and moves them from the boot disk to the local SSD on first boot.
To confirm there was a host maintenance event, use the following log filter:
resource.type="gce_instance"protoPayload.serviceName="compute.googleapis.com"log_id("cloudaudit.googleapis.com/system_event")To mitigate the host maintenance issue, upgrade yourGKE version to one of these versions:
- 1.27.13-gke.1166000 and later
- 1.28.8-gke.1171000 and later
- 1.29.3-gke.1227000 and later
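The following is a minimal sketch of querying the host maintenance filter with the gcloud CLI. The PROJECT_ID placeholder and the --freshness window are assumptions; adjust them for your environment:

gcloud logging read \
    'resource.type="gce_instance" AND protoPayload.serviceName="compute.googleapis.com" AND log_id("cloudaudit.googleapis.com/system_event")' \
    --project=PROJECT_ID \
    --freshness=7d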
Error: failed to configure GPU driver installation dirs: failed to create lib64 overlay: failed to create dir /usr/local/nvidia/lib64: mkdir /usr/local/nvidia/lib64: not a directory.
You encounter this error from the GPU driver installer container inside the GPU device plugin when NCCL fastsocket is enabled:
failed to configure GPU driver installation dirs: failed to create lib64 overlay: failed to create dir /usr/local/nvidia/lib64: mkdir /usr/local/nvidia/lib64: not a directory.

This issue only happens on clusters and nodes running GKE 1.28 and 1.29.
The issue is caused by a race condition between NCCL fastsocket and the GPU driver installer.
To mitigate this issue, upgrade your GKE version to one of these versions:
- 1.28.8-gke.1206000 and later
- 1.29.3-gke.1344000 and later
For more information, read theGPUDirect-TCPXO Release Notes.
Error: Failed to get device for nvidia0: device nvidia0 not found.
The following error indicates an XID 62 error and that RmInitAdapter failed for the GPU with minor number 0:
Failed to get device for nvidia0: device nvidia0 not found.

NVIDIA driver version 525.105.17 has a bug that can cause communication errors (XID) and prevent the GPU from initializing properly.
To fix this issue, upgrade the NVIDIA driver to driver version 525.110.11 or later.
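To confirm which driver version a node is currently running, you can query nvidia-smi from the node. This sketch uses the binary path that appears in the GPU reset steps later in this document; the path might differ on your node image:

/home/kubernetes/bin/nvidia/bin/nvidia-smi --query-gpu=driver_version --format=csv,noheader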
Map the GKE version and Container-Optimized OS node image version to the GPU driver version
To find the GPU driver versions that are mapped with GKE versions and Container-Optimized OS node image versions, do the following steps:
- Map Container-Optimized OS node image versions to GKE patch versions for the specific GKE version where you want to find the GPU driver version. For example, 1.33.0-gke.1552000 uses cos-121-18867-90-4.
- Choose the milestone of the Container-Optimized OS node image version in the Container-Optimized OS release notes. For example, choose Milestone 121 for cos-121-18867-90-4.
- In the release notes page for the specific milestone, find the release note corresponding with the specific Container-Optimized OS node image version. For example, in Container-Optimized OS Release Notes: Milestone 121, see cos-121-18867-90-4. In the table, in the GPU Drivers column, click See List to see the GPU driver version information.
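To find the GKE patch version that your nodes run, which is the starting point for this mapping, you can read the kubelet version from the node status. This is a minimal sketch; on GKE nodes, the reported kubelet version typically matches the node's GKE patch version:

kubectl get nodes -o custom-columns='NAME:.metadata.name,GKE_VERSION:.status.nodeInfo.kubeletVersion'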
The nvidia-smi command fails
When you use GPU VMs to run workloads in GKE, commands like nvidia-smi might fail within the container with any of the following errors:
- bash: nvidia-smi: command not found
- Errors indicating that libnvidia-ml.so or other NVIDIA libraries cannot be found.
GKE mounts the necessary NVIDIA drivers and tools from the host node in your containers, typically under the /usr/local/nvidia/ path. However, the container's default environment variables (PATH and LD_LIBRARY_PATH) might not include the paths to these NVIDIA binaries and libraries.
To solve these errors, update your Pod or Deployment manifest to include the necessary NVIDIA paths in the PATH and LD_LIBRARY_PATH environment variables for the container.
For example, add the following env block to your spec.template.spec.containers spec:
spec:
  containers:
  - name: gpu-container
    image: gpu-image
    env:
    - name: LD_LIBRARY_PATH
      # Prepend NVIDIA lib64 directory to existing LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
    - name: PATH
      # Prepend NVIDIA bin directory to existing PATH
      value: /usr/local/nvidia/bin:$PATH
    # ... other container settings

Reset GPUs on A3 VMs
Some issues might require you to reset the GPU on an A3 VM.
To reset the GPU, follow these steps:
Remove Pods that request GPU resources from the node where you need to reset the GPU.
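To see which Pods run on that node so that you can remove the ones that request GPUs, you can use a field selector; this is a minimal sketch:

kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE_NAME

Replace NODE_NAME with the name of the node.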
Disable the GPU device plugin on the node:
kubectl get nodes \
    --selector=kubernetes.io/hostname=NODE_NAME \
    --no-headers | awk '{print $1}' \
    | xargs -I{} kubectl label node {} gke-no-default-nvidia-gpu-device-plugin=true

Replace NODE_NAME with the name of the node.

Connect to the VM backing the node.
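For example, you can connect over SSH with a command similar to the following sketch. The zone flag is an assumption; use the zone of the node's VM:

gcloud compute ssh NODE_NAME --zone=ZONE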
In the SSH session, reset the GPU:
/home/kubernetes/bin/nvidia/bin/nvidia-smi --gpu-reset

Re-enable the GPU device plugin:
kubectl get nodes --selector=kubernetes.io/hostname=NODE_NAME \
    --no-headers \
    | awk '{print $1}' \
    | xargs -I{} kubectl label node {} gke-no-default-nvidia-gpu-device-plugin=false \
    --overwrite
GPUs on Confidential GKE Nodes
The following sections show you how to identify and fix issues with GPUs that run on Confidential GKE Nodes.
GPU workloads not scheduling on Confidential GKE Nodes
Confidential GKE Nodes requires that you manually install a GPU driver that corresponds to your selected GPU type and your GKE version. If your GPU Pods aren't scheduling on Confidential GKE Nodes and remain in the Pending state, describe the driver installation DaemonSet:
kubectl --namespace=kube-system get daemonset nvidia-driver-installer -o yaml

If the output returns a NotFound error, install the driver.
If the DaemonSet is running, the output is similar to the following:
apiVersion: apps/v1
kind: DaemonSet
# lines omitted for clarity
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: nvidia-driver-installer
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: nvidia-driver-installer
        name: nvidia-driver-installer
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
              - key: cloud.google.com/gke-gpu-driver-version
                operator: DoesNotExist
              - key: cloud.google.com/gke-confidential-nodes-instance-type
                operator: In
                values:
                - TDX

In this output, verify that the nodeAffinity field contains the cloud.google.com/gke-confidential-nodes-instance-type key. If the output doesn't contain this key, the driver installation DaemonSet doesn't support Confidential GKE Nodes.
Deploy the DaemonSet that supports GPUs on Confidential GKE Nodes:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/cos/daemonset-confidential.yaml

After you install the drivers, check whether your GPU workloads start successfully.
Error: Failed to allocate device vector
The following error message in your GPU container logs indicates that the GPU was detached from the node VM instance:
Failed to allocate device vector A (error code unknown error)!

This detachment might happen because of a hardware error or because of an issue with the encryption keys.
To resolve this issue, reboot the node instance. This operation is disruptive and affects all of the workloads on that node. To reboot the instance, do the following steps:
Get the name of the node that runs the GPU Pod:
kubectl get pod POD_NAME -o yaml | grep "nodeName"

Replace POD_NAME with the name of the failing Pod.

The output is similar to the following:
nodeName: gke-cluster-1-default-pool-b7asdfbt-fd3e

Reset the Compute Engine instance:
gcloud compute instances reset NODE_NAME

Replace NODE_NAME with the node name from the output of the previous step.

The gcloud CLI looks for VMs with that name in your active project. If you see a prompt to select a zone, specify Y.

Check whether your GPU workloads run without errors.
Error: Decryption failed with error -74
The following error message in your node logs indicates that the encryption keys for the GPU were lost:
Decryption failed with error -74

This error happens when the NVIDIA persistence daemon, which runs on the node VM instance, fails. If you see this error message, reset the instance:
gcloud compute instances reset NODE_NAME

Replace NODE_NAME with the name of the failing node.
The gcloud CLI looks for VMs with that name in your active project. If you see a prompt to select a zone, specify Y.
If resetting the instance doesn't fix this issue, contact Cloud Customer Care or submit a product bug. For more information, see Get support.
Finding XID errors
The gpu-device-plugin DaemonSet runs within the kube-system namespace and is responsible for the following:
- GPU workload scheduling: allocating GPU resources to Pods.
- GPU health checking: monitoring the health of your GPUs.
- GPU metrics gathering: collecting GPU-related metrics, such as duty cycle and memory usage.
The gpu-device-plugin uses the NVIDIA Management Library (NVML) to detect XID errors. When an XID error occurs, the gpu-device-plugin Pod running on the affected node logs the error. There are two types of XID error logs:
- Non-critical XID errors:
  - Log format: Skip error Xid=%d as it is not Xid Critical
  - Meaning: These errors are considered non-critical. They can be caused by various software or hardware issues.
  - Action: GKE takes no automated action for non-critical XID errors.
- Critical XID errors:
  - Log format: XidCriticalError: Xid=%d, All devices will go unhealthy
  - Meaning: These errors indicate a GPU hardware issue.
  - Action:
    - GKE marks the node's GPU resources as unhealthy.
    - GKE prevents GPU workloads from being scheduled on the node.
    - If node auto-repair is enabled, GKE will recreate the node.
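To search for these XID log entries, you can read the logs of the device plugin Pods. This is a minimal sketch; the label selector is an assumption that might differ in your cluster, and on large clusters you might need to inspect the Pods individually:

kubectl logs -n kube-system -l k8s-app=nvidia-gpu-device-plugin --prefix | grep -i xid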
GPUDirect-TCPX(O) issues
This section provides troubleshooting information for GPUDirect-TCPX(O) issues. If you are using GKE version 1.34 or later, also see the GKE Known issues page.
Release note and upgrade instructions
For new users, Maximize GPU network bandwidth in Standard mode clusters provides guidance on using GPUDirect-TCPX(O). For existing users, read the GPUDirect-TCPXO Release Notes for release information and upgrade instructions, because new versions are continuously released.
tcpx-daemon and tcpxo-daemon failure after GKE version upgrade
You might encounter the following error if your tcpx-daemon or tcpxo-daemon Pods fail:
cuda error detected! name: CUDA_ERROR_NO_DEVICE; string: no CUDA-capable device is detected

Also, check if the device-injector Pods in the kube-system namespace are in the CrashLoopBackOff or Error state.
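For example, you can check the status of those Pods with a command like the following sketch. The name filter assumes the DaemonSet is named device-injector, matching the deletion command later in this section:

kubectl get pods -n kube-system | grep device-injector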
This issue occurs if you upgraded your GKE node pools to one of the following versions or later:
- For GKE 1.31: 1.31.14-gke.1033000
- For GKE 1.32: 1.32.9-gke.1575000
- For GKE 1.33: 1.33.5-gke.1862000
- For GKE 1.34: 1.34.1-gke.3225000
- For GKE 1.35 and later
The issue is caused by using an older version of the NRI device injector manifest that includes an initContainer named enable-nri. In these affected GKE versions, NRI is enabled by default in the node configuration. The obsolete enable-nri container attempts to modify the configuration and restart the container runtime, which conflicts with the system defaults and causes the nri-device-injector Pods to crash. This failure prevents GPU devices from being exposed to the tcpx-daemon or tcpxo-daemon container.
To resolve this issue, update the nri-device-injector DaemonSet to the latest version, which removes the conflicting initContainer.
Delete the existing DaemonSet:
kubectl delete daemonset device-injector -n kube-system

Apply the latest manifest:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nri_device_injector/nri-device-injector.yaml

Restart your workload Pods.
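If your workload runs as a Deployment, you can restart its Pods with a rolling restart. This is a sketch; the workload name and namespace are placeholders, and other workload types (such as Jobs) might need to be deleted and recreated instead:

kubectl rollout restart deployment WORKLOAD_NAME -n NAMESPACE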
Debug with NCCL logs
If you can't resolve an issue with NCCL, collect NCCL logs with debugging information. These logs contain valuable information about NCCL operations and can help you find the source of your problem. If you still can't resolve the issue, collect these logs before you open a case with Cloud Customer Care; they can help Cloud Customer Care resolve your issue more quickly.
To generate and collect the logs, complete the following steps:
Set the following environment variables inside your Pod or application manifest:
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,NET,ENV,COLL,GRAPH
NCCL_DEBUG_FILE=/DIRECTORY/FILE_NAME.%h.%p

For more information about these environment variables, read collect NCCL debugging logs.
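For example, in a Pod or Deployment manifest, you might set these variables in the container's env block. This is a minimal sketch; the container name, image, and log path are placeholders:

spec:
  containers:
  - name: nccl-workload
    image: WORKLOAD_IMAGE
    env:
    - name: NCCL_DEBUG
      value: "INFO"
    - name: NCCL_DEBUG_SUBSYS
      value: "INIT,NET,ENV,COLL,GRAPH"
    - name: NCCL_DEBUG_FILE
      value: "/tmp/nccl_log.%h.%p"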
To generate data for your logs, run an NCCL test. The way to run this test depends on the type of cluster that you use. For GKE clusters, you can deploy and run NCCL test with Topology Aware Scheduling (TAS). After you run the NCCL test, NCCL automatically generates the logs on all participating nodes.
Collect the logs from all nodes. Confirm that you've collected NCCL logs from all nodes by checking that the logs contain the following information:
- The hostnames of all VMs that are involved in a workload.
- The PIDs of all relevant processes on the VM.
- The ranks of all GPUs that are used by the workload on each VM.
If you're not sure where the log files are located, the following example shows you where NCCL creates the log files when the NCCL_DEBUG_FILE variable is set to /tmp/nccl_log.%h.%p. You have two VMs named a3plus-vm-1 and a3plus-vm-2, and each VM runs eight processes within the workload container. In this scenario, NCCL creates the following log files under the /tmp directory within the workload container on each VM:
- On a3plus-vm-1: eight log files named nccl_log.a3plus-vm-1.PID, where PID is the process ID.
- On a3plus-vm-2: eight log files named nccl_log.a3plus-vm-2.PID.
Review the logs. NCCL log entries have the following format:
HOSTNAME:PID:TID [RANK] NCCL_MESSAGE

These log entries contain the following values:
- HOSTNAME: the hostname of the VM. This value identifies which VM was being used when NCCL generated the log entry.
- PID: the PID. This value identifies which process generated the log entry.
- TID: the thread ID. This value identifies which thread within the process was being used when NCCL generated the log entry.
- RANK: the local rank ID. This value identifies which GPU was being used when NCCL generated the log entry. Ranks are numbered from 0-N, where N is the total number of GPUs that are involved in the process. For example, if your workload runs with eight GPUs per VM, then each VM should have eight different rank values (0-7).
- NCCL_MESSAGE: a descriptive message that provides more information about the event and includes the timestamp of when NCCL created the log.
For example:
gke-a3plus-mega-np-2-aa33fe53-7wvq:1581:1634 [1] NCCL INFO 00:09:24.631392: NET/FasTrak plugin initialized.

This example has the following values:
- gke-a3plus-mega-np-2-aa33fe53-7wvq: the hostname.
- 1581: the process ID.
- 1634: the thread ID.
- 1: the local rank ID.
- NCCL INFO 00:09:24.631392: NET/FasTrak plugin initialized.: the message explaining what happened.
If you're opening a support case, package the logs that you collected, along with the output of the NCCL test, into a zip file. Include the zip file when you submit a support case to Cloud Customer Care.
To stop collecting NCCL debugging logs, remove the variables that you added in step 1.
What's next
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on StackOverflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.