Troubleshoot CrashLoopBackOff events
If your Pods in Google Kubernetes Engine (GKE) are stuck in a CrashLoopBackOff state, it means that one or more containers are repeatedly starting and then exiting. This behavior is likely making your apps unstable or completely unavailable.
Use this page to diagnose and resolve the underlying causes, which often fall into categories such as resource limitations, issues with liveness probes, app errors, or configuration mistakes. Troubleshooting these issues helps ensure that your apps run reliably and remain available to your users.
This information is important for Application developers who want to identify and fix app-level problems, such as coding errors, incorrect entry points, configuration file issues, or problems connecting to dependencies. Platform admins and operators can identify and address platform-related issues like resource exhaustion (OOMKilled), node disruptions, or misconfigured liveness probes. For more information about the common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
Understand a CrashLoopBackOff event
When your Pod is stuck in a CrashLoopBackOff state, a container within it is repeatedly starting and crashing or exiting. This CrashLoop triggers Kubernetes to attempt restarting the container by adhering to its restartPolicy. With each failed restart, the BackOff delay before the next attempt increases exponentially (for example, 10s, 20s, 40s), up to a maximum of five minutes.
Although this event indicates a problem within your container, it's also a valuable diagnostic signal. A CrashLoopBackOff event confirms that many foundational steps of Pod creation, such as assignment to a node and pulling the container image, have already completed. This knowledge lets you focus your investigation on the container's app or configuration, rather than the cluster infrastructure.
The CrashLoopBackOff state occurs because of how Kubernetes, specifically the kubelet, handles container termination based on the Pod's restart policy. The cycle typically follows this pattern:
- The container starts.
- The container exits.
- The kubelet observes the stopped container and restarts it according to the Pod's restartPolicy.
- This cycle repeats, with the container restarted after an increasing exponential back-off delay.
The Pod's restartPolicy is the key to this behavior. The default policy, Always, is the most common cause of this loop because it restarts a container if it exits for any reason, even after a successful exit. The OnFailure policy is less likely to cause a loop because it only restarts on non-zero exit codes, and the Never policy avoids a restart entirely.
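For reference, here's a minimal Pod manifest sketch (the name, image, and command are placeholders) that sets restartPolicy explicitly. With OnFailure, a container that exits with code 0 isn't restarted, so a successful exit can't cause a loop:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-worker              # hypothetical name
spec:
  restartPolicy: OnFailure          # default is Always; OnFailure restarts only on non-zero exit codes
  containers:
  - name: worker
    image: registry.example.com/worker:v1.2.3   # placeholder image
    command: ["/app/run-task"]                  # placeholder command
```

Keep in mind that restartPolicy is a Pod-level field, and Pods managed by a Deployment, StatefulSet, or ReplicaSet only support Always, so for those workloads the fix is usually in the app or its configuration rather than the restart policy.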
Identify symptoms of a CrashLoopBackOff event
A Pod with the CrashLoopBackOff status is the primary indication of a CrashLoopBackOff event.
However, you might experience some less obvious symptoms of a CrashLoopBackOff event:
- Zero healthy replicas for a workload.
- A sharp decrease in healthy replicas.
- Workloads with horizontal Pod autoscaling enabled are scaling slowly or failing to scale.
If a system workload (for example, a logging or metrics agent) has the CrashLoopBackOff status, you might also notice the following symptoms:
- Some GKE metrics aren't reported.
- Some GKE dashboards and graphs have gaps.
- Connectivity issues on Pod-level networking.
If you observe any of these less obvious symptoms, your next step should be to confirm if a CrashLoopBackOff event occurred.
Confirm a CrashLoopBackOff event
To confirm and investigate a CrashLoopBackOff event, gather evidence from Kubernetes events and the container's app logs. These two sources provide different but complementary views of the problem:
- Kubernetes events confirm that a Pod is crashing.
- The container's app logs can show you why the process inside the container is failing.
To view this information, select one of the following options:
Console
To view Kubernetes events and app logs, do the following:
In the Google Cloud console, go to the Workloads page.
Select the workload that you want to investigate. The Overview or Details tab displays more information about the status of the workload.
From the Managed Pods section, click the name of the problematic Pod.
On the Pod details page, investigate the following:
- To see details about Kubernetes events, go to the Events tab.
- To view the container's app logs, go to the Logs tab. This page is where you find app-specific error messages or stack traces.
kubectl
To view Kubernetes events and app logs, do the following:
View the status of all Pods running in your cluster:
```
kubectl get pods
```

The output is similar to the following:

```
NAME       READY   STATUS             RESTARTS   AGE
POD_NAME   0/1     CrashLoopBackOff   23         8d
```

In the output, review the following columns:

- Ready: review how many containers are ready. In this example, 0/1 indicates that zero out of one expected container is in a ready state. This value is a clear sign of a problem.
- Status: look for Pods with a status of CrashLoopBackOff.
- Restarts: a high value indicates that Kubernetes is repeatedly trying and failing to start the container.
After you identify a failing Pod, describe it to see cluster-level events that are related to the Pod's state:

```
kubectl describe pod POD_NAME -n NAMESPACE_NAME
```

Replace the following:

- POD_NAME: the name of the Pod that you identified in the output of the kubectl get command.
- NAMESPACE_NAME: the namespace of the Pod.
The output is similar to the following:
```
Containers:
  container-name:
    ...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: context deadline exceeded: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 00:00:00 +0000
      Finished:     Fri, 27 Jun 2025 16:20:03 +0000
    Ready:          False
    Restart Count:  3459
...
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
...
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  Failed   12m (x216 over 25h)     kubelet  Error: context deadline exceeded
  Warning  Failed   8m34s (x216 over 25h)   kubelet  Error: context deadline exceeded
  Warning  BackOff  4m24s (x3134 over 25h)  kubelet  Back-off restarting failed container container-name in pod failing-pod(11111111-2222-3333-4444-555555555555)
```

In the output, review the following fields for signs of a CrashLoopBackOff event:

- State: the state of the container likely shows Waiting with the reason CrashLoopBackOff.
- Last State: the state of the previously terminated container. Look for a Terminated status and review the exit code to see if there was a crash (non-zero exit code) or an unexpected successful exit (zero exit code).
- Events: actions taken by the cluster itself. Look for messages about the container being started, followed by liveness probe failures or back-off warnings like Back-off restarting failed container.
To learn more about why the Pod failed, view its app logs:
```
kubectl logs POD_NAME --previous
```

The --previous flag retrieves logs from the prior, terminated container, which is where you can find the specific stack trace or error message that reveals the cause of the crash. The current container might be too new to have recorded any logs.

In the output, look for app-specific errors that would cause the process to exit. If you use a custom-made app, the developers who wrote it are best equipped to interpret these error messages. If you use a prebuilt app, these apps often provide their own debugging instructions.
Use the Crashlooping Pods interactive playbook
After you confirm a CrashLoopBackOff event, begin troubleshooting with the interactive playbook:
In the Google Cloud console, go to the GKE Interactive Playbook - Crashlooping Pods page.
In the Cluster list, select the cluster that you want to troubleshoot. If you can't find your cluster, enter the name of the cluster in the Filter field.
In the Namespace list, select the namespace that you want to troubleshoot. If you can't find your namespace, enter the namespace in the Filter field.
Work through each section to help you answer the following questions:
- Identify App Errors: which containers are restarting?
- Investigate Out Of Memory Issues: is there a misconfiguration or an error related to the app?
- Investigate Node Disruptions: are disruptions on the node resource causing container restarts?
- Investigate Liveness Probe Failures: are liveness probes stopping your containers?
- Correlate Change Events: what happened around the time the containers started crashing?
Optional: To get notifications about future CrashLoopBackOff events, in the Future Mitigation Tips section, select Create an Alert.
If your problem persists after using the playbook, read the rest of the guide for more information about resolving CrashLoopBackOff events.
Resolve a CrashLoopBackOff event
The following sections help you resolve the most common causes of CrashLoopBackOff events:
Resolve resource exhaustion
A CrashLoopBackOff event is often caused by an Out of Memory (OOM) issue. You can confirm if this is the cause if the kubectl describe output shows the following:

```
Last State: Terminated
  Reason:   OOMKilled
```

For information about how to diagnose and resolve OOM events, see Troubleshoot OOM events.
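Raising the container's memory limit (or lowering the app's memory use) is a common fix when the limit itself is the constraint. The following is a minimal sketch with placeholder names and values, not a recommendation for specific sizes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1.2.3      # placeholder image
    resources:
      requests:
        memory: "256Mi"             # amount used for scheduling decisions
      limits:
        memory: "512Mi"             # the container is OOM-killed if it exceeds this limit
```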
Resolve liveness probe failures
A liveness probe is a periodic health check performed by the kubelet. If the probe fails a specified number of times (the default number is three), the kubelet restarts the container, potentially causing a CrashLoopBackOff event if the probe failures continue.
Confirm if a liveness probe is the cause
To confirm if liveness probe failures are triggering the CrashLoopBackOff event, query your kubelet logs. These logs often contain explicit messages indicating probe failures and subsequent restarts.
In the Google Cloud console, go to the Logs Explorer page.
In the query pane, filter for any liveness-probe-related restarts by entering the following query:

```
resource.type="k8s_node"
log_id("kubelet")
jsonPayload.MESSAGE:"failed liveness probe, will be restarted"
resource.labels.cluster_name="CLUSTER_NAME"
```

Replace CLUSTER_NAME with the name of your cluster.

Review the output. If a liveness probe failure is the cause of your CrashLoopBackOff events, the query returns log messages similar to the following:

```
Container probe failed liveness probe, will be restarted
```

After you confirm that liveness probes are the cause of the CrashLoopBackOff event, proceed to troubleshoot common causes:
- Review liveness probe configuration.
- Inspect CPU and disk I/O utilization.
- Address large deployments.
- Address transient errors.
- Address probe resource consumption.
Review liveness probe configuration
Misconfigured probes are a frequent cause of CrashLoopBackOff events. Check the following settings in the manifest of your probe:

- Verify probe type: your probe's configuration must match how your app reports its health. For example, if your app has a health check URL (like /healthz), use the httpGet probe type. If its health is determined by running a command, use the exec probe type. To check if a network port is open and listening, use the tcpSocket probe type.
- Check probe parameters:
  - Path (for httpGet probe type): make sure the HTTP path is correct and that your app serves health checks on it.
  - Port: verify that the port configured in the probe is actually used and exposed by the app.
  - Command (for exec probe type): make sure the command exists within the container, returns an exit code of 0 for success, and completes within the configured timeoutSeconds period.
  - Timeout: make sure that the timeoutSeconds value is sufficient for the app to respond, especially during startup or under load.
  - Initial delay (initialDelaySeconds): check if the initial delay is sufficient for the app to start before probes begin.

For more information, see Configure Liveness, Readiness and Startup Probes in the Kubernetes documentation.
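For reference, the following is a minimal sketch of a liveness probe that uses the httpGet probe type. The path, port, and timing values are placeholders that you would adapt to how your app actually reports its health:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1.2.3      # placeholder image
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz              # must match the path your app serves health checks on
        port: 8080                  # must match a port the app listens on
      initialDelaySeconds: 15       # give the app time to start before the first probe
      periodSeconds: 10             # how often the kubelet probes
      timeoutSeconds: 5             # how long the kubelet waits for a response
      failureThreshold: 3           # the kubelet restarts the container after this many consecutive failures
```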
Inspect CPU and disk I/O utilization
Resource contention results in probe timeouts, which is a major cause of liveness probe failures. To see if resource usage is the cause of the liveness probe failure, try the following solutions:

- Analyze CPU usage: monitor the CPU utilization of the affected container and the node it's running on during the probe intervals. A key metric to track is kubernetes.io/container/cpu/core_usage_time. High CPU usage on the container or the node can prevent the app from responding to the probe in time.
- Monitor disk I/O: check disk I/O metrics for the node. You can use the compute.googleapis.com/guest/disk/operation_time metric to assess the amount of time spent on disk operations, which are categorized by reads and writes. High disk I/O can significantly slow down container startup, app initialization, or overall app performance, leading to probe timeouts.
Address large deployments
In scenarios where a large number of Pods are deployed simultaneously (for example, by a CI/CD tool like ArgoCD), a sudden surge of new Pods can overwhelm cluster resources, leading to control plane resource exhaustion. This lack of resources delays app startup and can cause liveness probes to fail repeatedly before the apps are ready.
To resolve this issue, try the following solutions:
- Implement staggered deployments: implement strategies to deploy Pods in batches or over a longer period to avoid overwhelming node resources, as shown in the sketch after this list.
- Reconfigure or scale nodes: if staggered deployments aren't feasible, consider upgrading nodes with faster or larger disks, or Persistent Volume Claims, to better handle increased I/O demand. Ensure your cluster autoscaling is configured appropriately.
- Wait and observe: in some cases, if the cluster is not severely under-resourced, workloads might eventually deploy after a significant delay (sometimes 30 minutes or more).
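If the Pods are managed by a Deployment, one way to stagger updates is to tighten the rolling update strategy so only a small batch of Pods is replaced at a time. This sketch uses illustrative values and only controls rollouts of an existing Deployment; batching the initial creation of many workloads is usually configured in the CI/CD tool itself (for example, sync waves in ArgoCD):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                 # hypothetical name
spec:
  replicas: 50
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 5                   # create at most 5 extra Pods at a time
      maxUnavailable: 0             # keep existing Pods until their replacements are ready
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: registry.example.com/app:v1.2.3   # placeholder image
```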
Address transient errors
The app might experience temporary errors or slowdowns during startup or initialization that cause the probe to fail initially. If the app eventually recovers, consider increasing the values defined in the initialDelaySeconds or failureThreshold fields in the manifest of your liveness probe.
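For example, the following fragment (placeholder values) gives a slow or briefly flaky app more headroom before the kubelet restarts it:

```yaml
livenessProbe:
  httpGet:
    path: /healthz                  # placeholder path
    port: 8080                      # placeholder port
  initialDelaySeconds: 60           # increased to cover slow startup
  failureThreshold: 5               # tolerate more consecutive failures before a restart
```

If the app only struggles during startup, Kubernetes also provides a startupProbe, which holds off liveness checks until the app has started; see the Kubernetes probe documentation linked earlier.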
Address probe resource consumption
In rare cases, the liveness probe's execution itself might consume significant resources, which could trigger resource constraints that potentially lead to the container being terminated due to an OOM kill. Ensure your probe commands are lightweight. A lightweight probe is more likely to execute quickly and reliably, giving it higher fidelity in accurately reporting your app's true health.
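As an illustration, a lightweight exec probe checks something simple and fast, such as the presence of a file that the app maintains, rather than running a heavy script (the path is a placeholder):

```yaml
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]   # hypothetical file the app touches while healthy
  periodSeconds: 10
  timeoutSeconds: 3
```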
Resolve app misconfigurations
App misconfigurations cause many CrashLoopBackOff events. To understand why your app is stopping, the first step is to examine its exit code. This code determines your troubleshooting path:

- Exit code 0 indicates a successful exit, which is unexpected for a long-running service and points to issues with the container's entry point or app design.
- A non-zero exit code signals an app crash, directing your focus toward configuration errors, dependency issues, or bugs in the code.
Find the exit code
To find the exit code of your app, do the following:
Describe the Pod:
```
kubectl describe pod POD_NAME -n NAMESPACE_NAME
```

Replace the following:

- POD_NAME: the name of the problematic Pod.
- NAMESPACE_NAME: the namespace of the Pod.

In the output, review the Exit Code field located under the Last State section for the relevant container. If the exit code is 0, see Troubleshoot successful exits (exit code 0). If the exit code is a number other than 0, see Troubleshoot app crashes (non-zero exit code).
Troubleshoot successful exits (exit code 0)
An exit code of 0 typically means the container's process finished successfully. Although this is the outcome that you want for a task-based Job, it can signal a problem for a long-running controller like a Deployment, StatefulSet, or ReplicaSet.
These controllers work to ensure a Pod is always running, so they treat any exit as a failure to be corrected. The kubelet enforces this behavior by adhering to the Pod's restartPolicy (which defaults to Always), restarting the container even after a successful exit. This action creates a loop, which ultimately triggers the CrashLoopBackOff status.
The most common reasons for unexpected successful exits are the following:
Container command doesn't start a persistent process: a container remains running only as long as its initial process (command or entrypoint) does. If this process isn't a long-running service, the container exits as soon as the command completes. For example, a command like ["/bin/bash"] exits immediately because it has no script to run. To resolve this issue, ensure your container's initial process starts a process that runs continuously, as shown in the example at the end of this section.

Worker app exits when a work queue is empty: many worker apps are designed to check a queue for a task and exit cleanly if the queue is empty. To resolve this, you can either use a Job controller (which is designed for tasks that run to completion) or modify the app's logic to run as a persistent service.

App exits due to missing or invalid configuration: your app might exit immediately if it's missing required startup instructions, such as command-line arguments, environment variables, or a critical configuration file.

To resolve this issue, first inspect your app's logs for specific error messages related to configuration loading or missing parameters. Then, verify the following:

- App arguments or environment: ensure that all necessary command-line arguments and environment variables are correctly passed to the container as expected by your app.
- Configuration file presence: confirm that any required configuration files are present at the expected paths within the container.
- Configuration file content: validate the content and format of your configuration files for syntax errors, missing mandatory fields, or incorrect values.

A common example of this issue is when an app is configured to read from a file mounted with a ConfigMap volume. If the ConfigMap isn't attached, is empty, or has misnamed keys, an app designed to exit when its configuration is missing might stop with an exit code of 0. In such cases, verify the following settings:

- The ConfigMap name in your Pod's volume definition matches its actual name.
- The keys within the ConfigMap match what your app expects to find as file names in the mounted volume.

Note: Although a well-designed app should exit with a non-zero error code when a mandatory configuration is missing, some apps are written to exit successfully (with code 0) in this situation.
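To tie these checks together, the following is a minimal Pod spec sketch (the names, image, command, and paths are hypothetical) whose command runs a persistent process and whose configuration file is mounted from a ConfigMap:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1.2.3                        # placeholder image
    command: ["/app/server", "--config", "/etc/app/config.yaml"]  # long-running process, not a shell that exits
    volumeMounts:
    - name: app-config
      mountPath: /etc/app           # the app expects its configuration file here
  volumes:
  - name: app-config
    configMap:
      name: example-app-config      # must match the actual ConfigMap name
      # keys in the ConfigMap become file names under /etc/app (for example, config.yaml)
```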
Troubleshoot app crashes (non-zero exit code)
When a container exits with a non-zero code, Kubernetes restarts it. If the underlying issue that caused the error is persistent, the app crashes again and the cycle repeats, culminating in a CrashLoopBackOff state.
The non-zero exit code is a clear signal that an error occurred within the app itself, which directs your debugging efforts toward its internal workings and environment. The following issues often cause this termination:

Configuration errors: a non-zero exit code often points to problems with the app's configuration or the environment it's running in. Check your app for these common issues:

- Missing configuration file: the app might not be able to locate or access a required configuration file.
- Invalid configuration: the configuration file might contain syntax errors, incorrect values, or incompatible settings, causing the app to crash.
- Permissions issues: the app could lack the necessary permissions to read or write the configuration file.
- Environment variables: incorrect or missing environment variables can cause the app to malfunction or fail to start.
- Invalid entrypoint or command: the command specified in the container's entrypoint or command field might be incorrect. This issue can happen with newly deployed images where the path to the executable is wrong or the file itself is not present in the container image. This misconfiguration often results in the 128 exit code.
- Uncontrolled image updates (:latest tag): if your workload images use the :latest tag, new Pods might pull an updated image version that introduces breaking changes. To help ensure consistency and reproducibility, always use specific, immutable image tags (for example, v1.2.3) or SHA digests (for example, sha256:45b23dee08...) in production environments. This practice helps ensure that the exact same image content is pulled every time. For an illustration, see the image reference sketch after this list of issues.
Dependency issues: your app might crash if it can't connect to the other services it depends on, or if it fails to authenticate or has insufficient permissions to access them.

External service unavailable: the app might depend on external services (for example, databases or APIs) that are unreachable due to network connectivity problems or service outages. To troubleshoot this issue, connect to the Pod. For more information, see Debug Running Pods in the Kubernetes documentation.

After you connect to the Pod, you can run commands to check for access to files, databases, or to test the network. For example, you can use a tool like curl to try and reach a service's URL. This action helps you determine if a problem is caused by network policies, DNS, or the service itself.

Authentication failures: the app might be unable to authenticate with external services due to incorrect credentials. Inspect the container's logs for messages like 401 Unauthorized (bad credentials) or 403 Forbidden (insufficient permissions), which often indicate that the service account for the Pod lacks the necessary IAM roles to make external Google Cloud service calls.

If you use GKE Workload Identity Federation, verify that the principal identifier has the permissions required for the task. For more information about granting IAM roles to principals by using GKE Workload Identity Federation, see Configure authorization and principals. You should also verify that the resource usage of GKE Metadata Server hasn't exceeded its limits.

Timeouts: the app might experience timeouts when waiting for responses from external services, leading to crashes.
App-specific errors: if configuration and external dependencies seem correct, the error might be within the app's code. Inspect the app logs for these common internal errors:

- Unhandled exceptions: the app logs might contain stack traces or error messages indicating unhandled exceptions or other code-related bugs.
- Deadlocks or livelocks: the app might be stuck in a deadlock, where multiple processes are waiting for each other to complete. In this scenario, the app might not exit, but it stops responding indefinitely.
- Port conflicts: the app might fail to start if it attempts to bind to a port that is already in use by another process.
- Incompatible libraries: the app might depend on libraries or dependencies that are missing or incompatible with the runtime environment.
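As referenced in the list of issues, the following is a brief sketch (with a placeholder repository) of pinning a container image to an immutable tag or digest instead of :latest:

```yaml
spec:
  containers:
  - name: app
    # Pin to a specific, immutable tag:
    image: us-docker.pkg.dev/example-project/example-repo/app:v1.2.3
    # Or, stricter still, pin to the image digest (the digest shown is truncated and illustrative):
    # image: us-docker.pkg.dev/example-project/example-repo/app@sha256:45b23dee08...
```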
To find the root cause, inspect the container's logs for a specific error message or stack trace. This information helps you decide whether to fix the app code, adjust resource limits, or correct the environment's configuration. For more information about logs, see About GKE logs.
What's next
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:

- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on StackOverflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.