Troubleshoot CrashLoopBackOff events
If your Pods in Google Kubernetes Engine (GKE) are stuck in a CrashLoopBackOff state, it means that one or more containers are repeatedly starting and then exiting. This behavior is likely making your apps unstable or completely unavailable.
Use this page to diagnose and resolve the underlying causes, which often fall into categories such as resource limitations, issues with liveness probes, app errors, or configuration mistakes. Troubleshooting these issues helps ensure that your apps run reliably and remain available to your users.
This information is important for Application developers who want to identify and fix app-level problems, such as coding errors, incorrect entry points, configuration file issues, or problems connecting to dependencies. Platform admins and operators can identify and address platform-related issues like resource exhaustion (OOMKilled), node disruptions, or misconfigured liveness probes. For more information about the common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
Understand a CrashLoopBackOff event
When your Pod is stuck in a CrashLoopBackOff state, a container within it is repeatedly starting and crashing or exiting. This CrashLoop triggers Kubernetes to attempt restarting the container by adhering to its restartPolicy. With each failed restart, the BackOff delay before the next attempt increases exponentially (for example, 10s, 20s, 40s), up to a maximum of five minutes.
Although this event indicates a problem within your container, it's also a valuable diagnostic signal. A CrashLoopBackOff event confirms that many foundational steps of Pod creation, such as assignment to a node and pulling the container image, have already completed. This knowledge lets you focus your investigation on the container's app or configuration, rather than the cluster infrastructure.
The CrashLoopBackOff state occurs because of how Kubernetes, specifically the kubelet, handles container termination based on the Pod's restart policy. The cycle typically follows this pattern:
- The container starts.
- The container exits.
- The kubelet observes the stopped container and restarts it according to the Pod's restartPolicy.
- This cycle repeats, with the container restarted after an increasing exponential back-off delay.
The Pod's restartPolicy is the key to this behavior. The default policy, Always, is the most common cause of this loop because it restarts a container if it exits for any reason, even after a successful exit. The OnFailure policy is less likely to cause a loop because it only restarts on non-zero exit codes, and the Never policy avoids a restart entirely.
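For reference, here's a minimal Pod manifest sketch (the name, image, and command are placeholders) that sets restartPolicy explicitly. With OnFailure, a container that exits with code 0 isn't restarted, so a successful exit can't cause a loop:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-worker              # hypothetical name
spec:
  restartPolicy: OnFailure          # default is Always; OnFailure restarts only on non-zero exit codes
  containers:
  - name: worker
    image: registry.example.com/worker:v1.2.3   # placeholder image
    command: ["/app/run-task"]                  # placeholder command
```

Keep in mind that restartPolicy is a Pod-level field, and Pods managed by a Deployment, StatefulSet, or ReplicaSet only support Always, so for those workloads the fix is usually in the app or its configuration rather than the restart policy.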
Identify symptoms of a CrashLoopBackOff event
A Pod with the CrashLoopBackOff status is the primary indication of a CrashLoopBackOff event.
However, you might experience some less obvious symptoms of a CrashLoopBackOff event:
- Zero healthy replicas for a workload.
- A sharp decrease in healthy replicas.
- Workloads with horizontal Pod autoscaling enabled are scaling slowly or failing to scale.
If a system workload (for example, a logging or metrics agent) has the CrashLoopBackOff status, you might also notice the following symptoms:
- Some GKE metrics aren't reported.
- Some GKE dashboards and graphs have gaps.
- Connectivity issues on Pod-level networking.
If you observe any of these less obvious symptoms, your next step should be to confirm if a CrashLoopBackOff event occurred.
Confirm a CrashLoopBackOff event
To confirm and investigate a CrashLoopBackOff event, gather evidence from Kubernetes events and the container's app logs. These two sources provide different but complementary views of the problem:
- Kubernetes events confirm that a Pod is crashing.
- The container's app logs can show you why the process inside the container is failing.
To view this information, select one of the following options:
Console
To view Kubernetes events and app logs, do the following:
In the Google Cloud console, go to the Workloads page.
Select the workload that you want to investigate. The Overview or Details tab displays more information about the status of the workload.
From the Managed Pods section, click the name of the problematic Pod.
On the Pod details page, investigate the following:
- To see details about Kubernetes events, go to the Events tab.
- To view the container's app logs, go to the Logs tab. This page is where you find app-specific error messages or stack traces.
kubectl
To view Kubernetes events and app logs, do the following:
View the status of all Pods running in your cluster:
```
kubectl get pods
```

The output is similar to the following:

```
NAME       READY   STATUS             RESTARTS   AGE
POD_NAME   0/1     CrashLoopBackOff   23         8d
```

In the output, review the following columns:

- Ready: review how many containers are ready. In this example, 0/1 indicates that zero out of one expected container is in a ready state. This value is a clear sign of a problem.
- Status: look for Pods with a status of CrashLoopBackOff.
- Restarts: a high value indicates that Kubernetes is repeatedly trying and failing to start the container.
After you identify a failing Pod, describe it to see cluster-level events that are related to the Pod's state:

```
kubectl describe pod POD_NAME -n NAMESPACE_NAME
```

Replace the following:

- POD_NAME: the name of the Pod that you identified in the output of the kubectl get command.
- NAMESPACE_NAME: the namespace of the Pod.
The output is similar to the following:
```
Containers:
  container-name:
    ...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: context deadline exceeded: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 00:00:00 +0000
      Finished:     Fri, 27 Jun 2025 16:20:03 +0000
    Ready:          False
    Restart Count:  3459
...
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
...
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  Failed   12m (x216 over 25h)     kubelet  Error: context deadline exceeded
  Warning  Failed   8m34s (x216 over 25h)   kubelet  Error: context deadline exceeded
  Warning  BackOff  4m24s (x3134 over 25h)  kubelet  Back-off restarting failed container container-name in pod failing-pod(11111111-2222-3333-4444-555555555555)
```

In the output, review the following fields for signs of a CrashLoopBackOff event:

- State: the state of the container likely shows Waiting with the reason CrashLoopBackOff.
- Last State: the state of the previously terminated container. Look for a Terminated status and review the exit code to see if there was a crash (non-zero exit code) or an unexpected successful exit (zero exit code).
- Events: actions taken by the cluster itself. Look for messages about the container being started, followed by liveness probe failures or back-off warnings like Back-off restarting failed container.
To learn more about why the Pod failed, view its app logs:
```
kubectl logs POD_NAME --previous
```

The --previous flag retrieves logs from the prior, terminated container, which is where you can find the specific stack trace or error message that reveals the cause of the crash. The current container might be too new to have recorded any logs.

In the output, look for app-specific errors that would cause the process to exit. If you use a custom-made app, the developers who wrote it are best equipped to interpret these error messages. If you use a prebuilt app, these apps often provide their own debugging instructions.
Use the Crashlooping Pods interactive playbook
After you confirm a CrashLoopBackOff event, begin troubleshooting with the interactive playbook:
In the Google Cloud console, go to the GKE Interactive Playbook - Crashlooping Pods page.
In the Cluster list, select the cluster that you want to troubleshoot. If you can't find your cluster, enter the name of the cluster in the Filter field.
In the Namespace list, select the namespace that you want to troubleshoot. If you can't find your namespace, enter the namespace in the Filter field.
Work through each section to help you answer the following questions:
- Identify App Errors: which containers are restarting?
- Investigate Out Of Memory Issues: is there a misconfiguration or an error related to the app?
- Investigate Node Disruptions: are disruptions on the node resource causing container restarts?
- Investigate Liveness Probe Failures: are liveness probes stopping your containers?
- Correlate Change Events: what happened around the time the containers started crashing?
Optional: To get notifications about future CrashLoopBackOff events, in the Future Mitigation Tips section, select Create an Alert.
If your problem persists after using the playbook, read the rest of the guide for more information about resolving CrashLoopBackOff events.
Resolve a CrashLoopBackOff event
The following sections help you resolve the most common causes of CrashLoopBackOff events:
Resolve resource exhaustion
A CrashLoopBackOff event is often caused by an Out of Memory (OOM) issue. You can confirm if this is the cause if the kubectl describe output shows the following:

```
Last State: Terminated
  Reason:   OOMKilled
```

For information about how to diagnose and resolve OOM events, see Troubleshoot OOM events.
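Raising the container's memory limit (or lowering the app's memory use) is a common fix when the limit itself is the constraint. The following is a minimal sketch with placeholder names and values, not a recommendation for specific sizes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1.2.3      # placeholder image
    resources:
      requests:
        memory: "256Mi"             # amount used for scheduling decisions
      limits:
        memory: "512Mi"             # the container is OOM-killed if it exceeds this limit
```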
Resolve liveness probe failures
A liveness probe is a periodic health check performed by the kubelet. If the probe fails a specified number of times (the default number is three), the kubelet restarts the container, potentially causing a CrashLoopBackOff event if the probe failures continue.
Confirm if a liveness probe is the cause
To confirm if liveness probe failures are triggering the CrashLoopBackOff event, query your kubelet logs. These logs often contain explicit messages indicating probe failures and subsequent restarts.
In the Google Cloud console, go to the Logs Explorer page.
In the query pane, filter for any liveness-probe-related restarts by entering the following query:

```
resource.type="k8s_node"
log_id("kubelet")
jsonPayload.MESSAGE:"failed liveness probe, will be restarted"
resource.labels.cluster_name="CLUSTER_NAME"
```

Replace CLUSTER_NAME with the name of your cluster.

Review the output. If a liveness probe failure is the cause of your CrashLoopBackOff events, the query returns log messages similar to the following:

```
Container probe failed liveness probe, will be restarted
```

After you confirm that liveness probes are the cause of the CrashLoopBackOff event, proceed to troubleshoot common causes:
- Review liveness probe configuration.
- Inspect CPU and disk I/O utilization.
- Address large deployments.
- Address transient errors.
- Address probe resource consumption.
Review liveness probe configuration
Misconfigured probes are a frequent cause of CrashLoopBackOff events. Check the following settings in the manifest of your probe:

- Verify probe type: your probe's configuration must match how your app reports its health. For example, if your app has a health check URL (like /healthz), use the httpGet probe type. If its health is determined by running a command, use the exec probe type. To check if a network port is open and listening, use the tcpSocket probe type.
- Check probe parameters:
  - Path (for httpGet probe type): make sure the HTTP path is correct and that your app serves health checks on it.
  - Port: verify that the port configured in the probe is actually used and exposed by the app.
  - Command (for exec probe type): make sure the command exists within the container, returns an exit code of 0 for success, and completes within the configured timeoutSeconds period.
  - Timeout: make sure that the timeoutSeconds value is sufficient for the app to respond, especially during startup or under load.
  - Initial delay (initialDelaySeconds): check if the initial delay is sufficient for the app to start before probes begin.

For more information, see Configure Liveness, Readiness and Startup Probes in the Kubernetes documentation.
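For reference, the following is a minimal sketch of a liveness probe that uses the httpGet probe type. The path, port, and timing values are placeholders that you would adapt to how your app actually reports its health:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1.2.3      # placeholder image
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz              # must match the path your app serves health checks on
        port: 8080                  # must match a port the app listens on
      initialDelaySeconds: 15       # give the app time to start before the first probe
      periodSeconds: 10             # how often the kubelet probes
      timeoutSeconds: 5             # how long the kubelet waits for a response
      failureThreshold: 3           # the kubelet restarts the container after this many consecutive failures
```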
Inspect CPU and disk I/O utilization
Resource contention results in probe timeouts, which is a major cause of liveness probe failures. To see if resource usage is the cause of the liveness probe failure, try the following solutions:

- Analyze CPU usage: monitor the CPU utilization of the affected container and the node it's running on during the probe intervals. A key metric to track is kubernetes.io/container/cpu/core_usage_time. High CPU usage on the container or the node can prevent the app from responding to the probe in time.
- Monitor disk I/O: check disk I/O metrics for the node. You can use the compute.googleapis.com/guest/disk/operation_time metric to assess the amount of time spent on disk operations, which are categorized by reads and writes. High disk I/O can significantly slow down container startup, app initialization, or overall app performance, leading to probe timeouts.
Address large deployments
In scenarios where a large number of Pods are deployed simultaneously (for example, by a CI/CD tool like ArgoCD), a sudden surge of new Pods can overwhelm cluster resources, leading to control plane resource exhaustion. This lack of resources delays app startup and can cause liveness probes to fail repeatedly before the apps are ready.
To resolve this issue, try the following solutions:
- Implement staggered deployments: implement strategies to deploy Pods in batches or over a longer period to avoid overwhelming node resources, as shown in the sketch after this list.
- Reconfigure or scale nodes: if staggered deployments aren't feasible, consider upgrading nodes with faster or larger disks, or Persistent Volume Claims, to better handle increased I/O demand. Ensure your cluster autoscaling is configured appropriately.
- Wait and observe: in some cases, if the cluster is not severely under-resourced, workloads might eventually deploy after a significant delay (sometimes 30 minutes or more).
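If the Pods are managed by a Deployment, one way to stagger updates is to tighten the rolling update strategy so only a small batch of Pods is replaced at a time. This sketch uses illustrative values and only controls rollouts of an existing Deployment; batching the initial creation of many workloads is usually configured in the CI/CD tool itself (for example, sync waves in ArgoCD):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                 # hypothetical name
spec:
  replicas: 50
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 5                   # create at most 5 extra Pods at a time
      maxUnavailable: 0             # keep existing Pods until their replacements are ready
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: registry.example.com/app:v1.2.3   # placeholder image
```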
Address transient errors
The app might experience temporary errors or slowdowns during startup or initialization that cause the probe to fail initially. If the app eventually recovers, consider increasing the values defined in the initialDelaySeconds or failureThreshold fields in the manifest of your liveness probe.
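For example, the following fragment (placeholder values) gives a slow or briefly flaky app more headroom before the kubelet restarts it:

```yaml
livenessProbe:
  httpGet:
    path: /healthz                  # placeholder path
    port: 8080                      # placeholder port
  initialDelaySeconds: 60           # increased to cover slow startup
  failureThreshold: 5               # tolerate more consecutive failures before a restart
```

If the app only struggles during startup, Kubernetes also provides a startupProbe, which holds off liveness checks until the app has started; see the Kubernetes probe documentation linked earlier.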
Address probe resource consumption
In rare cases, the liveness probe's execution itself might consume significant resources, which could trigger resource constraints that potentially lead to the container being terminated due to an OOM kill. Ensure your probe commands are lightweight. A lightweight probe is more likely to execute quickly and reliably, giving it higher fidelity in accurately reporting your app's true health.
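As an illustration, a lightweight exec probe checks something simple and fast, such as the presence of a file that the app maintains, rather than running a heavy script (the path is a placeholder):

```yaml
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]   # hypothetical file the app touches while healthy
  periodSeconds: 10
  timeoutSeconds: 3
```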
Resolve app misconfigurations
App misconfigurations cause many CrashLoopBackOff events. To understand why your app is stopping, the first step is to examine its exit code. This code determines your troubleshooting path:

- Exit code 0 indicates a successful exit, which is unexpected for a long-running service and points to issues with the container's entry point or app design.
- A non-zero exit code signals an app crash, directing your focus toward configuration errors, dependency issues, or bugs in the code.
Find the exit code
To find the exit code of your app, do the following:
Describe the Pod:
```
kubectl describe pod POD_NAME -n NAMESPACE_NAME
```

Replace the following:

- POD_NAME: the name of the problematic Pod.
- NAMESPACE_NAME: the namespace of the Pod.

In the output, review the Exit Code field located under the Last State section for the relevant container. If the exit code is 0, see Troubleshoot successful exits (exit code 0). If the exit code is a number other than 0, see Troubleshoot app crashes (non-zero exit code).
Troubleshoot successful exits (exit code 0)
An exit code of 0 typically means the container's process finished successfully. Although this is the outcome that you want for a task-based Job, it can signal a problem for a long-running controller like a Deployment, StatefulSet, or ReplicaSet.
These controllers work to ensure a Pod is always running, so they treat any exit as a failure to be corrected. The kubelet enforces this behavior by adhering to the Pod's restartPolicy (which defaults to Always), restarting the container even after a successful exit. This action creates a loop, which ultimately triggers the CrashLoopBackOff status.
The most common reasons for unexpected successful exits are the following:
Container command doesn't start a persistent process: a container remains running only as long as its initial process (command or entrypoint) does. If this process isn't a long-running service, the container exits as soon as the command completes. For example, a command like ["/bin/bash"] exits immediately because it has no script to run. To resolve this issue, ensure your container's initial process starts a process that runs continuously, as shown in the example at the end of this section.

Worker app exits when a work queue is empty: many worker apps are designed to check a queue for a task and exit cleanly if the queue is empty. To resolve this, you can either use a Job controller (which is designed for tasks that run to completion) or modify the app's logic to run as a persistent service.

App exits due to missing or invalid configuration: your app might exit immediately if it's missing required startup instructions, such as command-line arguments, environment variables, or a critical configuration file.

To resolve this issue, first inspect your app's logs for specific error messages related to configuration loading or missing parameters. Then, verify the following:

- App arguments or environment: ensure that all necessary command-line arguments and environment variables are correctly passed to the container as expected by your app.
- Configuration file presence: confirm that any required configuration files are present at the expected paths within the container.
- Configuration file content: validate the content and format of your configuration files for syntax errors, missing mandatory fields, or incorrect values.

A common example of this issue is when an app is configured to read from a file mounted with a ConfigMap volume. If the ConfigMap isn't attached, is empty, or has misnamed keys, an app designed to exit when its configuration is missing might stop with an exit code of 0. In such cases, verify the following settings:

- The ConfigMap name in your Pod's volume definition matches its actual name.
- The keys within the ConfigMap match what your app expects to find as file names in the mounted volume.

Note: Although a well-designed app should exit with a non-zero error code when a mandatory configuration is missing, some apps are written to exit successfully (with code 0) in this situation.
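To tie these checks together, the following is a minimal Pod spec sketch (the names, image, command, and paths are hypothetical) whose command runs a persistent process and whose configuration file is mounted from a ConfigMap:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1.2.3                        # placeholder image
    command: ["/app/server", "--config", "/etc/app/config.yaml"]  # long-running process, not a shell that exits
    volumeMounts:
    - name: app-config
      mountPath: /etc/app           # the app expects its configuration file here
  volumes:
  - name: app-config
    configMap:
      name: example-app-config      # must match the actual ConfigMap name
      # keys in the ConfigMap become file names under /etc/app (for example, config.yaml)
```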
Troubleshoot app crashes (non-zero exit code)
When a container exits with a non-zero code, Kubernetes restarts it. If the underlying issue that caused the error is persistent, the app crashes again and the cycle repeats, culminating in a CrashLoopBackOff state.
The non-zero exit code is a clear signal that an error occurred within the app itself, which directs your debugging efforts toward its internal workings and environment. The following issues often cause this termination:

Configuration errors: a non-zero exit code often points to problems with the app's configuration or the environment it's running in. Check your app for these common issues:

- Missing configuration file: the app might not be able to locate or access a required configuration file.
- Invalid configuration: the configuration file might contain syntax errors, incorrect values, or incompatible settings, causing the app to crash.
- Permissions issues: the app could lack the necessary permissions to read or write the configuration file.
- Environment variables: incorrect or missing environment variables can cause the app to malfunction or fail to start.
- Invalid entrypoint or command: the command specified in the container's entrypoint or command field might be incorrect. This issue can happen with newly deployed images where the path to the executable is wrong or the file itself is not present in the container image. This misconfiguration often results in the 128 exit code.
- Uncontrolled image updates (:latest tag): if your workload images use the :latest tag, new Pods might pull an updated image version that introduces breaking changes. To help ensure consistency and reproducibility, always use specific, immutable image tags (for example, v1.2.3) or SHA digests (for example, sha256:45b23dee08...) in production environments. This practice helps ensure that the exact same image content is pulled every time. For an illustration, see the image reference sketch after this list of issues.
Dependency issues: your app might crash if it can't connect to the other services it depends on, or if it fails to authenticate or has insufficient permissions to access them.

External service unavailable: the app might depend on external services (for example, databases or APIs) that are unreachable due to network connectivity problems or service outages. To troubleshoot this issue, connect to the Pod. For more information, see Debug Running Pods in the Kubernetes documentation.

After you connect to the Pod, you can run commands to check for access to files, databases, or to test the network. For example, you can use a tool like curl to try and reach a service's URL. This action helps you determine if a problem is caused by network policies, DNS, or the service itself.

Authentication failures: the app might be unable to authenticate with external services due to incorrect credentials. Inspect the container's logs for messages like 401 Unauthorized (bad credentials) or 403 Forbidden (insufficient permissions), which often indicate that the service account for the Pod lacks the necessary IAM roles to make external Google Cloud service calls.

If you use GKE Workload Identity Federation, verify that the principal identifier has the permissions required for the task. For more information about granting IAM roles to principals by using GKE Workload Identity Federation, see Configure authorization and principals. You should also verify that the resource usage of GKE Metadata Server hasn't exceeded its limits.

Timeouts: the app might experience timeouts when waiting for responses from external services, leading to crashes.
App-specific errors: if configuration and external dependencies seem correct, the error might be within the app's code. Inspect the app logs for these common internal errors:

- Unhandled exceptions: the app logs might contain stack traces or error messages indicating unhandled exceptions or other code-related bugs.
- Deadlocks or livelocks: the app might be stuck in a deadlock, where multiple processes are waiting for each other to complete. In this scenario, the app might not exit, but it stops responding indefinitely.
- Port conflicts: the app might fail to start if it attempts to bind to a port that is already in use by another process.
- Incompatible libraries: the app might depend on libraries or dependencies that are missing or incompatible with the runtime environment.
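As referenced in the list of issues, the following is a brief sketch (with a placeholder repository) of pinning a container image to an immutable tag or digest instead of :latest:

```yaml
spec:
  containers:
  - name: app
    # Pin to a specific, immutable tag:
    image: us-docker.pkg.dev/example-project/example-repo/app:v1.2.3
    # Or, stricter still, pin to the image digest (the digest shown is truncated and illustrative):
    # image: us-docker.pkg.dev/example-project/example-repo/app@sha256:45b23dee08...
```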
To find the root cause, inspect the container's logs for a specific error message or stack trace. This information helps you decide whether to fix the app code, adjust resource limits, or correct the environment's configuration. For more information about logs, see About GKE logs.
What's next
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:

- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on StackOverflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.