Auto-repair nodes

This page explains how node auto-repair works and how to use the feature for Standard Google Kubernetes Engine (GKE) clusters.

Node auto-repair helps keep the nodes in your GKE cluster in a healthy, running state. When enabled, GKE makes periodic checks on the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period, GKE initiates a repair process for that node.

Settings for Autopilot and Standard

Autopilot clusters always automatically repair nodes. You can't disable this setting.

In Standard clusters, node auto-repair is enabled by default for new node pools. You can disable auto-repair for an existing node pool; however, we recommend keeping the default configuration.

Repair criteria

GKE uses the node's health status to determine if a node needs to be repaired. A node reporting a Ready status is considered healthy. GKE triggers a repair action if a node reports consecutive unhealthy status reports for a given time threshold. An unhealthy status can mean:

  • A node reports a NotReady status on consecutive checks over the given time threshold (approximately 10 minutes).
  • A node does not report any status at all over the given time threshold (approximately 10 minutes).
  • A node's boot disk is out of disk space for an extended time period (approximately 30 minutes).
  • A node in an Autopilot cluster is cordoned for longer than the given time threshold (approximately 10 minutes).

You can manually check your node's health signals at any time by using the kubectl get nodes command.
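
For example, a quick manual check looks like the following; the node names and versions in the sample output are illustrative only:

kubectl get nodes

# Example output (illustrative):
# NAME                             STATUS     ROLES    AGE   VERSION
# gke-example-pool-1a2b3c4d-abcd   Ready      <none>   12d   v1.30.5-gke.1014001
# gke-example-pool-1a2b3c4d-efgh   NotReady   <none>   12d   v1.30.5-gke.1014001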

Node repair process

If GKE detects that a node requires repair, the node is drained and re-created. This process preserves the original name of the node. GKE waits one hour for the drain to complete. If the drain doesn't complete, the node is shut down and a new node is created.

If multiple nodes require repair, GKE might repair nodes in parallel. GKE balances the number of repairs depending on the size of the cluster and the number of broken nodes. GKE repairs more nodes in parallel on a larger cluster, but fewer nodes as the number of unhealthy nodes grows.

If you disable node auto-repair at any time during the repair process, in-progress repairs are not canceled and continue for any node under repair.

Note: Modifications on the boot disk of a node VM don't persist across node re-creations. To preserve modifications across node re-creation, use a DaemonSet.

Note: Node auto-repair uses a set of signals, including signals from the Node Problem Detector. The Node Problem Detector is enabled by default on nodes that use Container-Optimized OS and Ubuntu images.
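
To inspect the conditions that these signals are based on for a specific node, you can describe the node; NODE_NAME is a placeholder for a name from kubectl get nodes:

kubectl describe node NODE_NAME

# The Conditions section of the output includes signals such as Ready and
# DiskPressure, plus any conditions surfaced by the Node Problem Detector.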

Node auto-repair in TPU slice nodes

If a TPU slice node in a multi-host TPU slice node pool is unhealthy and requires auto-repair, the entire node pool is recreated. To learn more about the TPU slice node conditions, see TPU slice node auto-repair.

Enable auto-repair for an existing Standard node pool

You enable node auto-repair on a per-node pool basis.

If auto-repair is disabled on an existing node pool in a Standard cluster, use the following instructions to enable it:

Console

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click the Nodes tab.

  4. Under Node Pools, click the name of the node pool you want to modify.

  5. On the Node pool details page, click Edit.

  6. Under Management, select the Enable auto-repair checkbox.

  7. Click Save.

gcloud

gcloud container node-pools update POOL_NAME \
    --cluster CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --enable-autorepair

Replace the following:

  • POOL_NAME: the name of your node pool.
  • CLUSTER_NAME: the name of your Standard cluster.
  • CONTROL_PLANE_LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
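
For example, to enable auto-repair on a hypothetical node pool named example-pool in a cluster named example-cluster whose control plane is in the us-central1-a zone:

gcloud container node-pools update example-pool \
    --cluster example-cluster \
    --location=us-central1-a \
    --enable-autorepair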

Verify node auto-repair is enabled for a Standard node pool

Node auto-repair is enabled on a per-node pool basis. You can verify that a node pool in your cluster has node auto-repair enabled with the Google Cloud CLI or the Google Cloud console.

Console

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. On the Google Kubernetes Engine page, click the name of the cluster of the node pool you want to inspect.

  3. Click the Nodes tab.

  4. Under Node Pools, click the name of the node pool you want to inspect.

  5. Under Management, in the Auto-repair field, verify that auto-repair is enabled.

gcloud

Describe the node pool:

gcloud container node-pools describe NODE_POOL_NAME \
    --cluster=CLUSTER_NAME

If node auto-repair is enabled, the output of the command includes these lines:

management:
  ...
  autoRepair: true
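
If you only need that one value, a shorter check is to use gcloud's --format flag; the field path below assumes the output shown above:

gcloud container node-pools describe NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --format="value(management.autoRepair)"

# Prints True when auto-repair is enabled for the node pool.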

Disable node auto-repair

You can disable node auto-repair for an existing node pool in a Standard cluster by using the gcloud CLI or the Google Cloud console.

Note: You can only disable auto-repair with the gcloud CLI for a node pool in a Standard cluster enrolled in a release channel.

Console

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click the Nodes tab.

  4. Under Node Pools, click the name of the node pool you want to modify.

  5. On the Node pool details page, click Edit.

  6. Under Management, clear the Enable auto-repair checkbox.

  7. Click Save.

gcloud

gcloud container node-pools update POOL_NAME \
    --cluster CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --no-enable-autorepair

Replace the following:

  • POOL_NAME: the name of your node pool.
  • CLUSTER_NAME: the name of your Standard cluster.
  • CONTROL_PLANE_LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.

Get information about recent automated repair events

GKE generates a log entry for automated repair events. You can check the logs by running the following commands:

  1. List the operations:

    gcloud container operations list --location=CONTROL_PLANE_LOCATION

    Replace CONTROL_PLANE_LOCATION with the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.

  2. Find the reason why the node auto-repair operation was triggered by running the following command:

    gcloud container operations describe OPERATION_NAME --location=CONTROL_PLANE_LOCATION

    Replace OPERATION_NAME with the name of an operation listed in the output from the previous command.

In the output from the command, check the operationReason for the reason why the repair operation was triggered. For example, AUTO_REPAIR_LONG_UNHEALTHY means that the node auto-repair was triggered because the node was unhealthy for 10 minutes.
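
To narrow the list to repair events only, one option is gcloud's --filter flag. The operation type value below is an assumption; adjust it to match the operationType values you see in the unfiltered output:

gcloud container operations list \
    --location=CONTROL_PLANE_LOCATION \
    --filter="operationType=AUTO_REPAIR_NODES"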

