Troubleshoot GKE

This document lists troubleshooting documents for common issues that you might encounter when using Google Kubernetes Engine (GKE). Whether you're diagnosing workload errors like ImagePullBackOff and CrashLoopBackOff, debugging cluster autoscaling behavior, resolving PersistentVolume issues, or troubleshooting node registration problems, the documents listed here can help.

If you're new to troubleshooting in GKE, start with Introduction to troubleshooting.

To diagnose and resolve issues you encounter, see the documents in the following sections.

To troubleshoot GKE networking, see Troubleshoot GKE networking in the GKE networking documentation.

This document is for Admins and architects, Security specialists, Networking specialists, or Storage specialists who troubleshoot GKE configurations. To learn more about GKE roles, see Common GKE user roles and tasks.

Introduction to troubleshooting

Topic | Description
Introduction to GKE troubleshooting | Get started troubleshooting GKE by learning about the overall process and fundamental concepts.
Review service health and incidents | Learn how to check the health of GKE and related Google Cloud services to exclude platform issues.
Assess cluster and workload health in the Google Cloud console | Learn how to use the Google Cloud console to investigate and resolve GKE issues.
Investigate a cluster's state with kubectl | Explore common kubectl commands and techniques for diagnosing problems in your clusters and workloads.
Conduct historical analysis with Cloud Logging | Understand how to effectively use Cloud Logging to find root causes of issues in GKE.
Perform proactive monitoring with Cloud Monitoring | Use Cloud Monitoring dashboards and metrics to identify, diagnose, and resolve GKE issues.
Accelerate diagnosis with Gemini Cloud Assist | Discover how Gemini can assist in diagnosing and resolving GKE problems.
Put it all together: Example troubleshooting scenario | Follow a step-by-step example of troubleshooting a common scenario in GKE.

Cluster setup

Topic | Description
Cluster creation | Resolve issues with creating clusters.
Autopilot clusters | Diagnose and troubleshoot GKE Autopilot clusters, including cluster creation, namespace deletion, scaling, and workload issues.
Kubectl command-line tool | Troubleshoot the kubectl command-line tool in GKE, including issues with authentication and authorization. This page also includes advice on how to troubleshoot the Konnectivity proxy to check if it's causing the kubectl logs, attach, exec, or port-forward commands to stop responding.
Standard node pools | Troubleshoot GKE Standard node pools, including issues with node pool creation, best-effort provisioning, corrupted instance metadata, and migrating workloads to new node pools.
Node NotReady status | Learn how to diagnose and resolve the node NotReady status in GKE by troubleshooting common causes such as resource shortages, network issues, and component failures.
Node registration | Troubleshoot issues that occur when adding nodes to your GKE Standard cluster, such as node registration failures and missing prerequisites for successful node registration.
Container runtime | Troubleshoot container runtimes in GKE, including issues with containerd, dockershim, and private registries.

Autoscaling

Topic | Description
Cluster autoscaler not scaling down | Diagnose and resolve common reasons your cluster isn't removing underutilized nodes. Learn how to check for issues like restrictive PodDisruptionBudgets, Pods with local storage, or specific annotations (for example, "cluster-autoscaler.kubernetes.io/safe-to-evict": "false") that prevent node eviction.
Cluster autoscaler not scaling up | Learn why the cluster autoscaler isn't adding new nodes to meet demand. Check for unschedulable Pods, verify that you haven't hit cluster or node pool size limits, and identify potential resource quota or regional VM availability issues.
Horizontal Pod autoscaling | Troubleshoot problems with the Horizontal Pod Autoscaler not scaling your application's Pod replicas. Resolve common issues, such as misconfigured HorizontalPodAutoscaler objects or problems with the metrics pipeline.
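
The scale-down blockers named in this table can be illustrated with a small check over a Pod manifest. This is a minimal sketch, not the cluster autoscaler's actual logic: the function name `scale_down_blockers` and the sample manifest are hypothetical, and treating `emptyDir` and `hostPath` volumes as "local storage" is a simplifying assumption. It doesn't evaluate PodDisruptionBudgets.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def scale_down_blockers(pod):
    """Return reasons why this Pod might keep its node from being removed.

    `pod` is a Pod manifest as a dict, for example parsed from
    `kubectl get pod NAME -o json`. Hypothetical helper for illustration.
    """
    reasons = []

    # The annotation from the table above: an explicit opt-out of eviction.
    annotations = pod.get("metadata", {}).get("annotations") or {}
    if annotations.get(SAFE_TO_EVICT) == "false":
        reasons.append(f'annotation {SAFE_TO_EVICT} is "false"')

    # Assumption: emptyDir and hostPath volumes stand in for "local storage".
    for volume in pod.get("spec", {}).get("volumes") or []:
        if "emptyDir" in volume or "hostPath" in volume:
            reasons.append(f'volume {volume.get("name", "?")} uses local storage')

    return reasons


# Hypothetical sample manifest, trimmed to the fields the check reads.
example_pod = {
    "metadata": {
        "name": "cache-0",
        "annotations": {SAFE_TO_EVICT: "false"},
    },
    "spec": {
        "volumes": [
            {"name": "scratch", "emptyDir": {}},
            {"name": "data", "persistentVolumeClaim": {"claimName": "data-pvc"}},
        ]
    },
}

for reason in scale_down_blockers(example_pod):
    print(reason)
```

A Pod that returns an empty list here can still block scale-down for other reasons, so treat the result as a starting point for the linked troubleshooting page rather than a verdict.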

Storage

Topic | Description
Storage | Troubleshoot storage, including issues with regional persistent disks, disk performance, and volume expansion.

Cluster security

Topic | Description
Authentication | Troubleshoot authentication in GKE, including issues with RBAC, Workload Identity Federation for GKE, and the GKE metadata server.
Service accounts | Troubleshoot service accounts, including restoring the default service account and enabling the Compute Engine default service account.
Application-layer secrets | Troubleshoot issues that can occur when configuring application-layer secrets encryption, including failed updates and errors where you're unable to use a Cloud KMS key or where the Cloud KMS key version was destroyed.

Cluster's root Certificate Authority expiring soon

Topic | Description
Root Certificate Authority (CA) expiring | If your cluster's root Certificate Authority (CA) is expiring soon, learn how to perform a credential rotation to prevent normal cluster operations from being interrupted.

Workloads

Topic | Description
Deployed workloads | Troubleshoot errors for workloads running in a GKE cluster, including PodUnschedulable. Read the PodUnschedulable section for advice on errors like MatchNodeSelector and Does not have minimum availability.
Image pulls | Troubleshoot image pulls. Learn what causes statuses like ImagePullBackOff and ErrImagePull and how to resolve these statuses by fixing common issues like authentication and network connectivity.
CrashLoopBackOff events | Troubleshoot CrashLoopBackOff events in GKE. Diagnose issues like resource exhaustion, app misconfigurations, and liveness probe failures.
OOM events | Troubleshoot Kubernetes Out of Memory (OOM) events. Identify causes, distinguish event types, and apply effective solutions for both container- and node-level OOM kills.
Arm workloads | Troubleshoot issues with Arm workloads, including Pods on Arm nodes crashing.
TPUs | Troubleshoot TPUs, including issues with quota, node auto-provisioning, workload configuration, and scheduling.
GPUs | Troubleshoot GPUs, including issues with GPU driver installation, device plugin errors, and container images.
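
Several of the statuses in this table surface in a Pod's `status.containerStatuses` field. As a rough sketch, the following function reads that field from a Pod object (for example, parsed from `kubectl get pod NAME -o json`) and suggests which topics in the table are relevant. The function name `suggest_topics`, the mapping, and the sample data are hypothetical; the field names and reason strings follow the Kubernetes Pod status API.

```python
# Map container status reasons to the topics listed in the table above.
TOPIC_BY_REASON = {
    "ImagePullBackOff": "Image pulls",
    "ErrImagePull": "Image pulls",
    "CrashLoopBackOff": "CrashLoopBackOff events",
    "OOMKilled": "OOM events",
}


def suggest_topics(pod):
    """Return troubleshooting topics suggested by a Pod's container statuses.

    Hypothetical helper for illustration. It looks at two fields per
    container: lastState.terminated.reason (why the previous run ended)
    and state.waiting.reason (why the container isn't running now).
    """
    topics = []
    for status in pod.get("status", {}).get("containerStatuses") or []:
        terminated = (status.get("lastState") or {}).get("terminated") or {}
        waiting = (status.get("state") or {}).get("waiting") or {}
        # Check the last termination first, so that an underlying cause such
        # as OOMKilled is listed before the CrashLoopBackOff it produced.
        for reason in (terminated.get("reason"), waiting.get("reason")):
            topic = TOPIC_BY_REASON.get(reason)
            if topic and topic not in topics:
                topics.append(topic)
    return topics


# Hypothetical sample: a container that was OOM-killed and is now backing off.
example_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}

print(suggest_topics(example_pod))
```

Reading `lastState` alongside `state` matters because CrashLoopBackOff only says that a container keeps restarting; the previous termination reason often points to the more specific page.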

Cluster management

Topic | Description
Cluster upgrades | Troubleshoot and resolve GKE cluster and node upgrade issues, including long or incomplete upgrades, unexpected auto-upgrades, failures, and post-upgrade problems.
Webhooks | Understand how to troubleshoot and ensure the stability of your cluster control plane when using admission webhooks.
Namespace stuck in the Terminating state | Troubleshoot issues with namespaces stuck in the Terminating state by identifying and removing the unhealthy components that are blocking deletion.
Concurrent operations | Troubleshoot concurrent operations by learning how to identify these errors and resolve them by waiting for operations to complete.

Monitoring

Topic | Description
System metrics | Troubleshoot system metrics not appearing in Cloud Monitoring.
Monitoring dashboards | Troubleshoot monitoring dashboards, including issues with enabling monitoring, missing Kubernetes resources, and permissions.
Troubleshoot missing logs | Troubleshoot missing GKE logs. Learn how to check API status, cluster settings, permissions, quotas, filters, and application behavior.

4xx errors

Topic | Description
4xx errors | Troubleshoot some of the 400, 401, 403, and 404 errors that you might encounter when using GKE. This page also includes information on how to troubleshoot missing edit permissions on account errors.

Known issues

Topic | Description
Known issues | Identify and resolve known issues that might affect your use of GKE.



Last updated 2026-02-18 UTC.