StatefulSets

A StatefulSet runs a group of Pods, and maintains a sticky identity for each of those Pods. This is useful for managing applications that need persistent storage or a stable, unique network identity.

StatefulSet is the workload API object used to manage stateful applications.

Manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.

Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These Pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.

If you want to use storage volumes to provide persistence for your workload, you can use a StatefulSet as part of the solution. Although individual Pods in a StatefulSet are susceptible to failure, the persistent Pod identifiers make it easier to match existing volumes to the new Pods that replace any that have failed.

Using StatefulSets

StatefulSets are valuable for applications that require one or more of the following:

  • Stable, unique network identifiers.
  • Stable, persistent storage.
  • Ordered, graceful deployment and scaling.
  • Ordered, automated rolling updates.

In the above, stable is synonymous with persistence across Pod (re)scheduling. If an application doesn't require any stable identifiers or ordered deployment, deletion, or scaling, you should deploy your application using a workload object that provides a set of stateless replicas. Deployment or ReplicaSet may be better suited to your stateless needs.

Limitations

  • The storage for a given Pod must either be provisioned by a PersistentVolume Provisioner based on the requested storage class, or pre-provisioned by an admin.
  • Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the StatefulSet. This is done to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
  • StatefulSets currently require a Headless Service to be responsible for the network identity of the Pods. You are responsible for creating this Service.
  • StatefulSets do not provide any guarantees on the termination of pods when a StatefulSet is deleted. To achieve ordered and graceful termination of the pods in the StatefulSet, it is possible to scale the StatefulSet down to 0 prior to deletion.
  • When using Rolling Updates with the default Pod Management Policy (OrderedReady), it's possible to get into a broken state that requires manual intervention to repair.

Components

The example below demonstrates the components of a StatefulSet.

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  minReadySeconds: 10 # by default is 0
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.24
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi

Note:

This example uses the ReadWriteOnce access mode, for simplicity. For production use, the Kubernetes project recommends using the ReadWriteOncePod access mode instead.

In the above example:

  • A Headless Service, named nginx, is used to control the network domain.
  • The StatefulSet, named web, has a Spec that indicates that 3 replicas of the nginx container will be launched in unique Pods.
  • The volumeClaimTemplates will provide stable storage using PersistentVolumes provisioned by a PersistentVolume Provisioner.

The name of a StatefulSet object must be a valid DNS label.

Pod Selector

You must set the .spec.selector field of a StatefulSet to match the labels of its .spec.template.metadata.labels. Failing to specify a matching Pod Selector will result in a validation error during StatefulSet creation.

Volume Claim Templates

You can set the .spec.volumeClaimTemplates field to create a PersistentVolumeClaim. This will provide stable storage to the StatefulSet if either:

  • The StorageClass specified for the volume claim is set up to use dynamic provisioning.
  • The cluster already contains a PersistentVolume with the correct StorageClass and sufficient available storage space.

Minimum ready seconds

FEATURE STATE: Kubernetes v1.25 [stable]

.spec.minReadySeconds is an optional field that specifies the minimum number of seconds for which a newly created Pod should be running and ready without any of its containers crashing, for it to be considered available. This is used to check progression of a rollout when using a Rolling Update strategy. This field defaults to 0 (the Pod will be considered available as soon as it is ready). To learn more about when a Pod is considered ready, see Container Probes.

Pod Identity

StatefulSet Pods have a unique identity that consists of an ordinal, a stable network identity, and stable storage. The identity sticks to the Pod, regardless of which node it's (re)scheduled on.

Ordinal Index

For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer ordinal, that is unique over the Set. By default, Pods will be assigned ordinals from 0 up through N-1. The StatefulSet controller will also add a pod label with this index: apps.kubernetes.io/pod-index.

Start ordinal

FEATURE STATE: Kubernetes v1.31 [stable] (enabled by default)

.spec.ordinals is an optional field that allows you to configure the integer ordinals assigned to each Pod. It defaults to nil. Within the field, you can configure the following options:

  • .spec.ordinals.start: If the .spec.ordinals.start field is set, Pods will be assigned ordinals from .spec.ordinals.start up through .spec.ordinals.start + .spec.replicas - 1, as shown in the sketch below.
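For example, a minimal sketch (reusing the web StatefulSet from above, with other fields elided) that starts numbering at ordinal 3; with replicas: 3, the Pods would be named web-3, web-4, and web-5:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  ordinals:
    start: 3 # Pods get ordinals start .. start + replicas - 1
  replicas: 3
  ...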

Stable Network ID

Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod. The pattern for the constructed hostname is $(statefulset name)-$(ordinal). The example above will create three Pods named web-0, web-1, web-2.

A StatefulSet can use a Headless Service to control the domain of its Pods. The domain managed by this Service takes the form: $(service name).$(namespace).svc.cluster.local, where "cluster.local" is the cluster domain. As each Pod is created, it gets a matching DNS subdomain, taking the form: $(podname).$(governing service domain), where the governing service is defined by the serviceName field on the StatefulSet.

Depending on how DNS is configured in your cluster, you may not be able to look up the DNS name for a newly-run Pod immediately. This behavior can occur when other clients in the cluster have already sent queries for the hostname of the Pod before it was created. Negative caching (normal in DNS) means that the results of previous failed lookups are remembered and reused, even after the Pod is running, for at least a few seconds.

If you need to discover Pods promptly after they are created, you have a few options:

  • Query the Kubernetes API directly (for example, using a watch) rather than relying on DNS lookups.
  • Decrease the time of caching in your Kubernetes DNS provider (typically this means editing the config map for CoreDNS, which currently caches for 30 seconds), as shown in the sketch below.
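For example, with CoreDNS you could lower the TTL of its cache plugin by editing the coredns ConfigMap in the kube-system namespace. The fragment below is a sketch of only the relevant Corefile lines, not a complete configuration:

# kubectl -n kube-system edit configmap coredns
.:53 {
    ...
    cache 5 # lowered from the default of 30 seconds
    ...
}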

As mentioned in the limitations section, you are responsible for creating the Headless Service responsible for the network identity of the pods.

Here are some examples of choices for Cluster Domain, Service name,StatefulSet name, and how that affects the DNS names for the StatefulSet's Pods.

Cluster Domain | Service (ns/name) | StatefulSet (ns/name) | StatefulSet Domain | Pod DNS | Pod Hostname
cluster.local | default/nginx | default/web | nginx.default.svc.cluster.local | web-{0..N-1}.nginx.default.svc.cluster.local | web-{0..N-1}
cluster.local | foo/nginx | foo/web | nginx.foo.svc.cluster.local | web-{0..N-1}.nginx.foo.svc.cluster.local | web-{0..N-1}
kube.local | foo/nginx | foo/web | nginx.foo.svc.kube.local | web-{0..N-1}.nginx.foo.svc.kube.local | web-{0..N-1}

Note:

Cluster Domain will be set to cluster.local unless otherwise configured.

Stable Storage

For each VolumeClaimTemplate entry defined in a StatefulSet, each Pod receives one PersistentVolumeClaim. In the nginx example above, each Pod receives a single PersistentVolume with a StorageClass of my-storage-class and 1 GiB of provisioned storage. If no StorageClass is specified, then the default StorageClass will be used. When a Pod is (re)scheduled onto a node, its volumeMounts mount the PersistentVolumes associated with its PersistentVolumeClaims. Note that the PersistentVolumes associated with the Pods' PersistentVolumeClaims are not deleted when the Pods or StatefulSet are deleted. This must be done manually.
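Each claim is named after the template and the Pod: $(volumeClaimTemplate name)-$(pod name). For the nginx example above, listing the claims might look like the following (output illustrative, columns abbreviated):

kubectl get pvc -l app=nginx
# NAME        STATUS   CAPACITY   ACCESS MODES   STORAGECLASS
# www-web-0   Bound    1Gi        RWO            my-storage-class
# www-web-1   Bound    1Gi        RWO            my-storage-class
# www-web-2   Bound    1Gi        RWO            my-storage-class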

Pod Name Label

When the StatefulSet controller creates a Pod, it adds a label, statefulset.kubernetes.io/pod-name, that is set to the name of the Pod. This label allows you to attach a Service to a specific Pod in the StatefulSet.
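For example, here is a sketch of a Service that selects only the Pod web-0 from the example StatefulSet (the Service name is hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: web-0-direct # hypothetical name
spec:
  selector:
    statefulset.kubernetes.io/pod-name: web-0 # label set by the StatefulSet controller
  ports:
  - port: 80
    targetPort: 80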

Pod index label

FEATURE STATE: Kubernetes v1.32 [stable] (enabled by default)

When the StatefulSet controller creates a Pod, the new Pod is labelled with apps.kubernetes.io/pod-index. The value of this label is the ordinal index of the Pod. This label allows you to route traffic to a particular pod index, filter logs/metrics using the pod index label, and more. Note that the feature gate PodIndexLabel is enabled and locked by default for this feature; to disable it, users will have to use server emulated version v1.31.

Deployment and Scaling Guarantees

  • For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
  • When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
  • Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready. If .spec.minReadySeconds is set, predecessors must be available (Ready for at least minReadySeconds).
  • Before a Pod is terminated, all of its successors must be completely shut down.

The StatefulSet should not specify a pod.Spec.TerminationGracePeriodSeconds of 0. This practice is unsafe and strongly discouraged. For further explanation, please refer to force deleting StatefulSet Pods.

When the nginx example above is created, three Pods will be deployed in the order web-0, web-1, web-2. web-1 will not be deployed before web-0 is Running and Ready, and web-2 will not be deployed until web-1 is Running and Ready. If web-0 should fail, after web-1 is Running and Ready, but before web-2 is launched, web-2 will not be launched until web-0 is successfully relaunched and becomes Running and Ready.

If a user were to scale the deployed example by patching the StatefulSet such that replicas=1, web-2 would be terminated first. web-1 would not be terminated until web-2 is fully shut down and deleted. If web-0 were to fail after web-2 has been terminated and is completely shut down, but prior to web-1's termination, web-1 would not be terminated until web-0 is Running and Ready.
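You can observe these guarantees on the example above by watching Pod transitions while scaling (the commands assume the web StatefulSet and its app: nginx label):

# In one terminal, watch the Pods
kubectl get pods -w -l app=nginx

# In another terminal, scale down and then back up;
# Pods terminate from web-2 downwards and are recreated from web-0 upwards
kubectl scale statefulset web --replicas=1
kubectl scale statefulset web --replicas=3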

Pod Management Policies

StatefulSet allows you to relax its ordering guarantees while preserving its uniqueness and identity guarantees via its .spec.podManagementPolicy field.

OrderedReady Pod Management

OrderedReady pod management is the default for StatefulSets. It implements the behavior described in Deployment and Scaling Guarantees.

Parallel Pod Management

Parallel pod management tells the StatefulSet controller to launch or terminate all Pods in parallel, and to not wait for Pods to become Running and Ready or completely terminated prior to launching or terminating another Pod.

For scaling operations, this means all Pods are created or terminated simultaneously.

For rolling updates when .spec.updateStrategy.rollingUpdate.maxUnavailable is greater than 1, the StatefulSet controller terminates and creates up to maxUnavailable Pods simultaneously (also known as "bursting"). This can speed up updates but may result in Pods becoming ready out of order, which might not be suitable for applications requiring strict ordering.
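A minimal sketch of enabling this policy on the example StatefulSet (other fields elided):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  podManagementPolicy: Parallel # default is OrderedReady
  replicas: 3
  ...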

Update strategies

A StatefulSet's .spec.updateStrategy field allows you to configure and disable automated rolling updates for containers, labels, resource request/limits, and annotations for the Pods in a StatefulSet. There are two possible values:

OnDelete
When a StatefulSet's .spec.updateStrategy.type is set to OnDelete, the StatefulSet controller will not automatically update the Pods in a StatefulSet. Users must manually delete Pods to cause the controller to create new Pods that reflect modifications made to a StatefulSet's .spec.template.
RollingUpdate
The RollingUpdate update strategy implements automated, rolling updates for the Pods in a StatefulSet. This is the default update strategy.

Rolling Updates

When a StatefulSet's .spec.updateStrategy.type is set to RollingUpdate, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time.

The Kubernetes control plane waits until an updated Pod is Running and Ready prior to updating its predecessor. If you have set .spec.minReadySeconds (see Minimum Ready Seconds), the control plane additionally waits that amount of time after the Pod turns ready, before moving on.

Partitioned rolling updates

The RollingUpdate update strategy can be partitioned, by specifying a .spec.updateStrategy.rollingUpdate.partition. If a partition is specified, all Pods with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet's .spec.template is updated. All Pods with an ordinal that is less than the partition will not be updated, and, even if they are deleted, they will be recreated at the previous version. If a StatefulSet's .spec.updateStrategy.rollingUpdate.partition is greater than its .spec.replicas, updates to its .spec.template will not be propagated to its Pods. In most cases you will not need to use a partition, but they are useful if you want to stage an update, roll out a canary, or perform a phased roll out.
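For example, a sketch of a partitioned rolling update on the web StatefulSet; with partition: 2 and replicas: 3, only web-2 is updated when .spec.template changes, while web-0 and web-1 stay on the previous revision:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2 # only Pods with ordinal >= 2 get the new template
  ...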

Maximum unavailable Pods

FEATURE STATE: Kubernetes v1.35 [beta]

You can control the maximum number of Pods that can be unavailable during an update by specifying the .spec.updateStrategy.rollingUpdate.maxUnavailable field. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from the percentage value by rounding it up. This field cannot be 0. The default setting is 1.

This field applies to all Pods in the range 0 to replicas - 1. If there is any unavailable Pod in the range 0 to replicas - 1, it will be counted towards maxUnavailable.
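A sketch of setting the field (assuming your cluster supports this beta field):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2 # or a percentage such as "25%"; default is 1
  ...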

Note:

The maxUnavailable field is in Beta stage and it is enabled by default.

Forced rollback

When using Rolling Updates with the default Pod Management Policy (OrderedReady), it's possible to get into a broken state that requires manual intervention to repair.

If you update the Pod template to a configuration that never becomes Running and Ready (for example, due to a bad binary or application-level configuration error), StatefulSet will stop the rollout and wait.

In this state, it's not enough to revert the Pod template to a good configuration. Due to a known issue, StatefulSet will continue to wait for the broken Pod to become Ready (which never happens) before it will attempt to revert it back to the working configuration.

After reverting the template, you must also delete any Pods that StatefulSet had already attempted to run with the bad configuration. StatefulSet will then begin to recreate the Pods using the reverted template.
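Concretely, the recovery might look like this (assuming web-2 is the Pod stuck with the bad configuration, and statefulset-good.yaml is a hypothetical manifest containing the known-good template):

# Revert the Pod template to the working configuration
kubectl apply -f statefulset-good.yaml # hypothetical manifest file

# Delete the Pod created from the bad template;
# the controller recreates it from the reverted template
kubectl delete pod web-2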

Revision history

ControllerRevision is a Kubernetes API resource used by controllers, such as the StatefulSet controller, to track historical configuration changes.

StatefulSets use ControllerRevisions to maintain a revision history, enabling rollbacks and version tracking.

How StatefulSets track changes using ControllerRevisions

When you update a StatefulSet's Pod template (spec.template), the StatefulSet controller:

  1. Prepares a new ControllerRevision object
  2. Stores a snapshot of the Pod template and metadata
  3. Assigns an incremental revision number

Key Properties

See ControllerRevision to learn more about key properties and other details.


Managing Revision History

Control retained revisions with .spec.revisionHistoryLimit:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: webapp
spec:
  revisionHistoryLimit: 5 # Keep last 5 revisions
  # ... other spec fields ...
  • Default: 10 revisions retained if unspecified
  • Cleanup: Oldest revisions are garbage-collected when exceeding the limit

Performing Rollbacks

You can revert to a previous configuration using:

# View revision history
kubectl rollout history statefulset/webapp

# Rollback to a specific revision
kubectl rollout undo statefulset/webapp --to-revision=3

This will:

  • Apply the Pod template from revision 3
  • Create a new ControllerRevision with an updated revision number

Inspecting ControllerRevisions

To view associated ControllerRevisions:

# List all revisions for the StatefulSet
kubectl get controllerrevisions -l app.kubernetes.io/name=webapp

# View detailed configuration of a specific revision
kubectl get controllerrevision/webapp-3 -o yaml

Best Practices

Retention Policy
  • Set revisionHistoryLimit between 5–10 for most workloads.
  • Increase only if deep rollback history is required.
Monitoring
  • Regularly check revisions with:

    kubectl get controllerrevisions
  • Alert on rapid revision count growth.
Avoid
  • Manual edits to ControllerRevision objects.
  • Using revisions as a backup mechanism (use actual backup tools).
  • Setting revisionHistoryLimit: 0 (disables rollback capability).

PersistentVolumeClaim retention

FEATURE STATE: Kubernetes v1.32 [stable] (enabled by default)

The optional .spec.persistentVolumeClaimRetentionPolicy field controls if and how PVCs are deleted during the lifecycle of a StatefulSet. You must enable the StatefulSetAutoDeletePVC feature gate on the API server and the controller manager to use this field. Once enabled, there are two policies you can configure for each StatefulSet:

whenDeleted
Configures the volume retention behavior that applies when the StatefulSet is deleted.
whenScaled
Configures the volume retention behavior that applies when the replica count ofthe StatefulSet is reduced; for example, when scaling down the set.

For each policy that you can configure, you can set the value to either Delete or Retain.

Delete
The PVCs created from the StatefulSet volumeClaimTemplate are deleted for each Pod affected by the policy. With the whenDeleted policy all PVCs from the volumeClaimTemplate are deleted after their Pods have been deleted. With the whenScaled policy, only PVCs corresponding to Pod replicas being scaled down are deleted, after their Pods have been deleted.
Retain (default)
PVCs from the volumeClaimTemplate are not affected when their Pod is deleted. This is the behavior before this new feature.

Bear in mind that these policies only apply when Pods are being removed due to the StatefulSet being deleted or scaled down. For example, if a Pod associated with a StatefulSet fails due to node failure, and the control plane creates a replacement Pod, the StatefulSet retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to the node where the new Pod is about to launch.

The default for policies is Retain, matching the StatefulSet behavior before this new feature.

Here is an example policy:

apiVersion: apps/v1
kind: StatefulSet
...
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Delete
...

The StatefulSet controller adds owner references to its PVCs, which are then deleted by the garbage collector after the Pod is terminated. This enables the Pod to cleanly unmount all volumes before the PVCs are deleted (and before the backing PV and volume are deleted, depending on the retain policy). When you set the whenDeleted policy to Delete, an owner reference to the StatefulSet instance is placed on all PVCs associated with that StatefulSet.
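As a sketch, a PVC owned this way carries an owner reference similar to the following (fields abbreviated; the uid is illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: www-web-0
  ownerReferences:
  - apiVersion: apps/v1
    kind: StatefulSet
    name: web
    uid: 1234abcd-0000-0000-0000-000000000000 # illustrative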

The whenScaled policy must delete PVCs only when a Pod is scaled down, and not when a Pod is deleted for another reason. When reconciling, the StatefulSet controller compares its desired replica count to the actual Pods present on the cluster. Any StatefulSet Pod whose ordinal is greater than or equal to the replica count is condemned and marked for deletion. If the whenScaled policy is Delete, the condemned Pods are first set as owners to the associated StatefulSet template PVCs, before the Pod is deleted. This causes the PVCs to be garbage collected only after the condemned Pods have terminated.

This means that if the controller crashes and restarts, no Pod will be deleted before its owner reference has been updated appropriately for the policy. If a condemned Pod is force-deleted while the controller is down, the owner reference may or may not have been set up, depending on when the controller crashed. It may take several reconcile loops to update the owner references, so some condemned Pods may have set up owner references and others may not. For this reason we recommend waiting for the controller to come back up, which will verify owner references before terminating Pods. If that is not possible, the operator should verify the owner references on PVCs to ensure the expected objects are deleted when Pods are force-deleted.

Replicas

.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.

If you manually scale a StatefulSet, via kubectl scale statefulset statefulset --replicas=X, and then update that StatefulSet based on a manifest (for example, by running kubectl apply -f statefulset.yaml), then applying that manifest overwrites the manual scaling that you previously did.
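For example (assuming a manifest file statefulset.yaml that declares replicas: 3 for the web StatefulSet):

kubectl scale statefulset web --replicas=5 # manual scale up to 5
kubectl apply -f statefulset.yaml          # re-applies replicas: 3, undoing the manual scale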

If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for a StatefulSet, don't set .spec.replicas. Instead, allow the Kubernetes control plane to manage the .spec.replicas field automatically.
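A sketch of a HorizontalPodAutoscaler targeting the example StatefulSet; in this setup the web manifest would omit .spec.replicas:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: web
  minReplicas: 3
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80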
