Pod Priority and Preemption
Kubernetes v1.14 [stable]
Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the pending Pod possible.
Warning:
In a cluster where not all users are trusted, a malicious user could create Pods at the highest possible priorities, causing other Pods to be evicted or not get scheduled. An administrator can use ResourceQuota to prevent users from creating pods at high priorities.
See "limit Priority Class consumption by default" for details.
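As a rough illustration of the ResourceQuota approach mentioned in the warning, the manifest below sketches a quota that caps how many Pods in one namespace may use a given PriorityClass. The namespace `team-a`, the quota name `pods-high-priority`, and the limit of 10 are hypothetical; `high-priority` is assumed to be an existing PriorityClass.

```yaml
# Sketch only: names and limits are assumptions, not values from this page.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-high-priority    # hypothetical name
  namespace: team-a           # hypothetical namespace
spec:
  hard:
    pods: "10"                # at most 10 Pods matching the scope below
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["high-priority"]
```

Under this quota, only Pods in `team-a` that request the `high-priority` class count against the limit; Pods using other classes are not affected by it.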
How to use priority and preemption
To use priority and preemption:
- Add one or more PriorityClasses.
- Create Pods with priorityClassName set to one of the added PriorityClasses. Of course you do not need to create the Pods directly; normally you would add priorityClassName to the Pod template of a collection object like a Deployment.
Keep reading for more information about these steps.
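To make the second step concrete, here is a minimal sketch of a Deployment whose Pod template sets priorityClassName. The Deployment name, labels, and replica count are hypothetical, and the manifest assumes a PriorityClass named `high-priority` (like the one in the "Example PriorityClass" section) already exists.

```yaml
# Sketch only: the Deployment name, labels, and replica count are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                   # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Every Pod created from this template gets this priority class.
      priorityClassName: high-priority
      containers:
      - name: nginx
        image: nginx
```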
Note:
Kubernetes already ships with two PriorityClasses: system-cluster-critical and system-node-critical. These are common classes and are used to ensure that critical components are always scheduled first.
PriorityClass
A PriorityClass is a non-namespaced object that defines a mapping from a priority class name to the integer value of the priority. The name is specified in the name field of the PriorityClass object's metadata. The value is specified in the required value field. The higher the value, the higher the priority. The name of a PriorityClass object must be a valid DNS subdomain name, and it cannot be prefixed with system-.
A PriorityClass object can have any 32-bit integer value smaller than or equal to 1 billion. This means that the range of values for a PriorityClass object is from -2147483648 to 1000000000 inclusive. Larger numbers are reserved for built-in PriorityClasses that represent critical system Pods. A cluster admin should create one PriorityClass object for each such mapping that they want.
PriorityClass also has two optional fields: globalDefault and description. The globalDefault field indicates that the value of this PriorityClass should be used for Pods without a priorityClassName. Only one PriorityClass with globalDefault set to true can exist in the system. If there is no PriorityClass with globalDefault set, the priority of Pods with no priorityClassName is zero.
The description field is an arbitrary string. It is meant to tell users of the cluster when they should use this PriorityClass.
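The optional fields can be sketched as follows. This is an illustrative PriorityClass that acts as the cluster-wide default; the name `default-priority` and the value 1000 are hypothetical choices, not recommendations.

```yaml
# Sketch only: the name and value are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority      # hypothetical; must not start with "system-"
value: 1000                   # any 32-bit integer up to 1000000000
globalDefault: true           # applies to Pods that set no priorityClassName
description: "Default priority for Pods that do not name a priority class."
```

Because only one PriorityClass may have globalDefault set to true, a second class of this kind cannot coexist with it in the same cluster.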
Notes about PodPriority and existing clusters
- If you upgrade an existing cluster without this feature, the priority of your existing Pods is effectively zero.
- Addition of a PriorityClass with globalDefault set to true does not change the priorities of existing Pods. The value of such a PriorityClass is used only for Pods created after the PriorityClass is added.
- If you delete a PriorityClass, existing Pods that use the name of the deleted PriorityClass remain unchanged, but you cannot create more Pods that use the name of the deleted PriorityClass.
Example PriorityClass
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: high-priority
    value: 1000000
    globalDefault: false
    description: "This priority class should be used for XYZ service pods only."
Non-preempting PriorityClass
Kubernetes v1.24 [stable]
Pods with preemptionPolicy: Never will be placed in the scheduling queue ahead of lower-priority pods, but they cannot preempt other pods. A non-preempting pod waiting to be scheduled will stay in the scheduling queue until sufficient resources are free and it can be scheduled. Non-preempting pods, like other pods, are subject to scheduler back-off. This means that if the scheduler tries these pods and they cannot be scheduled, they will be retried with lower frequency, allowing other pods with lower priority to be scheduled before them.
Non-preempting pods may still be preempted by other, high-priority pods.
preemptionPolicy defaults to PreemptLowerPriority, which allows pods of that PriorityClass to preempt lower-priority pods (as is the existing default behavior). If preemptionPolicy is set to Never, pods in that PriorityClass will be non-preempting.
An example use case is data science workloads. A user may submit a job that they want to be prioritized above other workloads, but do not wish to discard existing work by preempting running pods. The high priority job with preemptionPolicy: Never will be scheduled ahead of other queued pods, as soon as sufficient cluster resources "naturally" become free.
Example Non-preempting PriorityClass
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: high-priority-nonpreempting
    value: 1000000
    preemptionPolicy: Never
    globalDefault: false
    description: "This priority class will not cause other pods to be preempted."
Pod priority
After you have one or more PriorityClasses, you can create Pods that specify one of those PriorityClass names in their specifications. The priority admission controller uses the priorityClassName field and populates the integer value of the priority. If the priority class is not found, the Pod is rejected.
The following YAML is an example of a Pod configuration that uses the PriorityClass created in the preceding example. The priority admission controller checks the specification and resolves the priority of the Pod to 1000000.
    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
      labels:
        env: test
    spec:
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
      priorityClassName: high-priority
Effect of Pod priority on scheduling order
When Pod priority is enabled, the scheduler orders pending Pods by their priority, and a pending Pod is placed ahead of other pending Pods with lower priority in the scheduling queue. As a result, the higher priority Pod may be scheduled sooner than Pods with lower priority if its scheduling requirements are met. If such a Pod cannot be scheduled, the scheduler will continue and try to schedule other lower priority Pods.
Preemption
When Pods are created, they go to a queue and wait to be scheduled. The scheduler picks a Pod from the queue and tries to schedule it on a Node. If no Node is found that satisfies all the specified requirements of the Pod, preemption logic is triggered for the pending Pod. Let's call the pending Pod P. Preemption logic tries to find a Node where removal of one or more Pods with lower priority than P would enable P to be scheduled on that Node. If such a Node is found, one or more lower priority Pods get evicted from the Node. After the Pods are gone, P can be scheduled on the Node.
User exposed information
When Pod P preempts one or more Pods on Node N, the nominatedNodeName field of Pod P's status is set to the name of Node N. This field helps the scheduler track resources reserved for Pod P and also gives users information about preemptions in their clusters.
Please note that Pod P is not necessarily scheduled to the "nominated Node". The scheduler always tries the "nominated Node" before iterating over any other nodes. After victim Pods are preempted, they get their graceful termination period. If another node becomes available while the scheduler is waiting for the victim Pods to terminate, the scheduler may use the other node to schedule Pod P. As a result, nominatedNodeName in the Pod's status and nodeName in the Pod's spec are not always the same. Also, if the scheduler preempts Pods on Node N, but then a higher priority Pod than Pod P arrives, the scheduler may give Node N to the new higher priority Pod. In such a case, the scheduler clears nominatedNodeName of Pod P. By doing this, the scheduler makes Pod P eligible to preempt Pods on another Node.
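For orientation, the fragment below sketches roughly what the relevant fields of a pending preemptor Pod might look like when inspected (for example with `kubectl get pod <name> -o yaml`). It is a hand-written, abbreviated illustration rather than captured output; the node name `node-n` is hypothetical.

```yaml
# Abbreviated, illustrative fragment of a Pod object; not real captured output.
spec:
  priorityClassName: high-priority
  priority: 1000000             # resolved by the priority admission controller
  # nodeName is not set yet: the Pod has not been bound to a Node.
status:
  phase: Pending
  nominatedNodeName: node-n     # hypothetical Node where victims were preempted
```

If the scheduler later places the Pod elsewhere, spec.nodeName will differ from the value shown in status.nominatedNodeName.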
Limitations of preemption
Graceful termination of preemption victims
When Pods are preempted, the victims get their graceful termination period. They have that much time to finish their work and exit. If they don't, they are killed. This graceful termination period creates a time gap between the point that the scheduler preempts Pods and the time when the pending Pod (P) can be scheduled on the Node (N). In the meantime, the scheduler keeps scheduling other pending Pods. As victims exit or get terminated, the scheduler tries to schedule Pods in the pending queue. Therefore, there is usually a time gap between the point that the scheduler preempts victims and the time that Pod P is scheduled. In order to minimize this gap, one can set the graceful termination period of lower priority Pods to zero or a small number.
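A short termination period on likely victims can be sketched like this. The Pod name, the `low-priority` PriorityClass, the image, and the 5-second value are all hypothetical; whether such a short period is safe depends on how the workload handles shutdown.

```yaml
# Sketch only: names, image, and the 5-second value are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                  # hypothetical name
spec:
  priorityClassName: low-priority     # hypothetical low-value PriorityClass
  terminationGracePeriodSeconds: 5    # victim exits quickly when preempted
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
```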
PodDisruptionBudget is supported, but not guaranteed
A PodDisruptionBudget (PDB) allows application owners to limit the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. Kubernetes supports PDB when preempting Pods, but respecting PDB is best effort. The scheduler tries to find victims whose PDB are not violated by preemption, but if no such victims are found, preemption will still happen, and lower priority Pods will be removed despite their PDBs being violated.
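For reference, a PDB of the kind discussed above might look like the following sketch. The name `web-pdb`, the `app: web` label, and the minAvailable value are hypothetical.

```yaml
# Sketch only: the name, label, and threshold are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb               # hypothetical name
spec:
  minAvailable: 2             # keep at least 2 matching Pods running
  selector:
    matchLabels:
      app: web
```

The scheduler prefers victims whose removal keeps this budget intact, but as described above, it may still preempt the matching Pods when no other victims are available.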
Inter-Pod affinity on lower-priority Pods
A Node is considered for preemption only when the answer to this question is yes: "If all the Pods with lower priority than the pending Pod are removed from the Node, can the pending Pod be scheduled on the Node?"
Note:
Preemption does not necessarily remove all lower-priority Pods. If the pending Pod can be scheduled by removing fewer than all lower-priority Pods, then only a portion of the lower-priority Pods are removed. Even so, the answer to the preceding question must be yes. If the answer is no, the Node is not considered for preemption.
If a pending Pod has inter-pod affinity to one or more of the lower-priority Pods on the Node, the inter-Pod affinity rule cannot be satisfied in the absence of those lower-priority Pods. In this case, the scheduler does not preempt any Pods on the Node. Instead, it looks for another Node. The scheduler might find a suitable Node or it might not. There is no guarantee that the pending Pod can be scheduled.
Our recommended solution for this problem is to create inter-Pod affinity only towards equal or higher priority Pods.
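The situation above arises with a required Pod affinity rule such as the one sketched below. The Pod name, the `app: cache` label, and the image are hypothetical. Affinity selectors match labels, not priorities, so following the recommendation means making sure the Pods carrying the targeted label run at an equal or higher priority than this Pod.

```yaml
# Sketch only: names and labels are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: frontend              # hypothetical name
spec:
  priorityClassName: high-priority
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache        # target Pods should not have lower priority
        topologyKey: kubernetes.io/hostname
  containers:
  - name: frontend
    image: nginx
```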
Cross node preemption
Suppose a Node N is being considered for preemption so that a pending Pod P can be scheduled on N. P might become feasible on N only if a Pod on another Node is preempted. Here's an example:
- Pod P is being considered for Node N.
- Pod Q is running on another Node in the same Zone as Node N.
- Pod P has Zone-wide anti-affinity with Pod Q (topologyKey: topology.kubernetes.io/zone).
- There are no other cases of anti-affinity between Pod P and other Pods in the Zone.
- In order to schedule Pod P on Node N, Pod Q can be preempted, but the scheduler does not perform cross-node preemption. So, Pod P will be deemed unschedulable on Node N.
If Pod Q were removed from its Node, the Pod anti-affinity violation would be gone, and Pod P could possibly be scheduled on Node N.
We may consider adding cross Node preemption in future versions if there is enough demand and if we find an algorithm with reasonable performance.
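The Zone-wide anti-affinity in the example above could be expressed roughly as follows. The Pod name `pod-p` and the label `app: q` (assumed to be carried by Pod Q) are hypothetical.

```yaml
# Sketch only: names and labels are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: pod-p                 # stands in for Pod P
spec:
  priorityClassName: high-priority
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: q            # assumed label on Pod Q
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: app
    image: nginx
```

With this rule, Pod P cannot land on any Node in a Zone where a Pod labeled `app: q` is running, and because preemption only considers victims on the candidate Node itself, Pod Q on a different Node is never chosen as a victim.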
Troubleshooting
Pod priority and preemption can have unwanted side effects. Here are some examples of potential problems and ways to deal with them.
Pods are preempted unnecessarily
Preemption removes existing Pods from a cluster under resource pressure to make room for higher priority pending Pods. If you give high priorities to certain Pods by mistake, these unintentionally high priority Pods may cause preemption in your cluster. Pod priority is specified by setting the priorityClassName field in the Pod's specification. The integer value for priority is then resolved and populated to the priority field of podSpec.
To address the problem, you can change the priorityClassName for those Pods to use lower priority classes, or leave that field empty. An empty priorityClassName is resolved to zero by default.
When a Pod is preempted, there will be events recorded for the preempted Pod. Preemption should happen only when a cluster does not have enough resources for a Pod. In such cases, preemption happens only when the priority of the pending Pod (preemptor) is higher than that of the victim Pods. Preemption must not happen when there is no pending Pod, or when the pending Pods have equal or lower priority than the victims. If preemption happens in such scenarios, please file an issue.
Pods are preempted, but the preemptor is not scheduled
When pods are preempted, they receive their requested graceful termination period, which is by default 30 seconds. If the victim Pods do not terminate within this period, they are forcibly terminated. Once all the victims go away, the preemptor Pod can be scheduled.
While the preemptor Pod is waiting for the victims to go away, a higher priority Pod may be created that fits on the same Node. In this case, the scheduler will schedule the higher priority Pod instead of the preemptor.
This is expected behavior: the Pod with the higher priority should take the place of a Pod with a lower priority.
Higher priority Pods are preempted before lower priority pods
The scheduler tries to find nodes that can run a pending Pod. If no node is found, the scheduler tries to remove Pods with lower priority from an arbitrary node in order to make room for the pending pod. If a node with low priority Pods is not feasible to run the pending Pod, the scheduler may choose another node with higher priority Pods (compared to the Pods on the other node) for preemption. The victims must still have lower priority than the preemptor Pod.
When there are multiple nodes available for preemption, the scheduler tries to choose the node with a set of Pods with lowest priority. However, if such Pods have a PodDisruptionBudget that would be violated if they are preempted, then the scheduler may choose another node with higher priority Pods.
When multiple nodes exist for preemption and none of the above scenarios apply, the scheduler chooses a node with the lowest priority.
Interactions between Pod priority and quality of service
Pod priority and QoS class are two orthogonal features with few interactions and no default restrictions on setting the priority of a Pod based on its QoS classes. The scheduler's preemption logic does not consider QoS when choosing preemption targets. Preemption considers Pod priority and attempts to choose a set of targets with the lowest priority. Higher-priority Pods are considered for preemption only if the removal of the lowest priority Pods is not sufficient to allow the scheduler to schedule the preemptor Pod, or if the lowest priority Pods are protected by PodDisruptionBudget.
The kubelet uses Priority to determine pod order for node-pressure eviction. You can use the QoS class to estimate the order in which pods are most likely to get evicted. The kubelet ranks pods for eviction based on the following factors:
- Whether the starved resource usage exceeds requests
- Pod Priority
- Amount of resource usage relative to requests
See "Pod selection for kubelet eviction" for more details.
kubelet node-pressure eviction does not evict Pods when their usage does not exceed their requests. If a Pod with lower priority is not exceeding its requests, it won't be evicted. Another Pod with higher priority that exceeds its requests may be evicted.
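Since the first ranking factor is whether usage exceeds requests, setting requests that reflect real usage matters even for low priority Pods. The sketch below shows where those requests are declared; the Pod name, the `low-priority` PriorityClass, the image, and the request sizes are hypothetical.

```yaml
# Sketch only: names, image, and request sizes are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: low-priority-worker           # hypothetical name
spec:
  priorityClassName: low-priority     # hypothetical low-value PriorityClass
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        memory: "256Mi"   # usage is compared against this under memory pressure
        cpu: "250m"
```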
What's next
- Read about using ResourceQuotas in connection with PriorityClasses: limit Priority Class consumption by default
- Learn about Pod Disruption
- Learn about API-initiated Eviction
- Learn about Node-pressure Eviction