Scheduling Framework
Kubernetes v1.19 [stable]
The scheduling framework is a pluggable architecture for the Kubernetes scheduler. It consists of a set of "plugin" APIs that are compiled directly into the scheduler. These APIs allow most scheduling features to be implemented as plugins, while keeping the scheduling "core" lightweight and maintainable. Refer to the design proposal of the scheduling framework for more technical information on the design of the framework.
Framework workflow
The Scheduling Framework defines a few extension points. Scheduler plugins register to be invoked at one or more extension points. Some of these plugins can change the scheduling decisions and some are informational only.
Each attempt to schedule one Pod is split into two phases, the scheduling cycle and the binding cycle.
Scheduling cycle & binding cycle
The scheduling cycle selects a node for the Pod, and the binding cycle applies that decision to the cluster. Together, a scheduling cycle and binding cycle are referred to as a "scheduling context".
Scheduling cycles are run serially, while binding cycles may run concurrently.
A scheduling or binding cycle can be aborted if the Pod is determined to be unschedulable or if there is an internal error. The Pod will be returned to the queue and retried.
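The split between serial scheduling cycles and concurrent binding cycles can be sketched roughly as follows. This is only an illustration: `pickNode` and `scheduleAll` are hypothetical names, the round-robin node choice stands in for the real scheduling logic, and none of this is the scheduler's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// pickNode is a hypothetical stand-in for one scheduling cycle.
// It picks a node round-robin; a real scheduling cycle would filter and score.
func pickNode(pod string, nodes []string, attempt int) string {
	return nodes[attempt%len(nodes)]
}

// scheduleAll runs the scheduling cycles one after another and starts each
// binding cycle in its own goroutine, so bindings may overlap with later
// scheduling cycles.
func scheduleAll(pods, nodes []string) map[string]string {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		bindings = make(map[string]string)
	)
	for i, pod := range pods {
		node := pickNode(pod, nodes, i) // scheduling cycle: serial

		wg.Add(1)
		go func(pod, node string) { // binding cycle: concurrent
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			bindings[pod] = node // stand-in for applying the decision to the cluster
		}(pod, node)
	}
	wg.Wait()
	return bindings
}

func main() {
	b := scheduleAll([]string{"pod-a", "pod-b", "pod-c"}, []string{"node-1", "node-2"})
	fmt.Println(len(b), b["pod-a"], b["pod-b"], b["pod-c"])
}
```

The mutex is only needed because the sketch collects results in a shared map; the point is that node selection never overlaps, while the binding work may.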
Interfaces
The following picture shows the scheduling context of a Pod and the interfaces that the scheduling framework exposes.
One plugin may implement multiple interfaces to perform more complex or stateful tasks.
Some interfaces match the scheduler extension points which can be configured through Scheduler Configuration.

Scheduling framework extension points
PreEnqueue
These plugins are called prior to adding Pods to the internal active queue, where Pods are marked as ready for scheduling.
A Pod is allowed to enter the active queue only when all PreEnqueue plugins return Success. Otherwise, it's placed in the internal unschedulable Pods list, and doesn't get an Unschedulable condition.
For more details about how internal scheduler queues work, read Scheduling queue in kube-scheduler.
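The PreEnqueue gate described above could be modeled like this. The `Pod` type, the `Gated` flag, and the `preEnqueueFunc` signature are assumptions made for the sketch and do not mirror the real plugin interface.

```go
package main

import "fmt"

// Pod is a hypothetical, minimal stand-in for a real Pod object.
type Pod struct {
	Name  string
	Gated bool // hypothetical flag: something still blocks this Pod
}

// preEnqueueFunc is an assumed plugin shape: a nil error means Success.
type preEnqueueFunc func(p Pod) error

// notGated rejects Pods that are still gated.
func notGated(p Pod) error {
	if p.Gated {
		return fmt.Errorf("pod %q is gated", p.Name)
	}
	return nil
}

// enqueue puts a Pod in the active queue only if every plugin returns
// Success; otherwise the Pod goes to the unschedulable list.
func enqueue(pods []Pod, plugins []preEnqueueFunc) (active, unschedulable []string) {
	for _, p := range pods {
		admitted := true
		for _, plugin := range plugins {
			if err := plugin(p); err != nil {
				admitted = false
				break
			}
		}
		if admitted {
			active = append(active, p.Name)
		} else {
			unschedulable = append(unschedulable, p.Name)
		}
	}
	return active, unschedulable
}

func main() {
	pods := []Pod{{Name: "web"}, {Name: "batch", Gated: true}}
	active, unschedulable := enqueue(pods, []preEnqueueFunc{notGated})
	fmt.Println(active, unschedulable)
}
```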
EnqueueExtension
EnqueueExtension is the interface where the plugin can control whether to retry scheduling of Pods rejected by this plugin, based on changes in the cluster. Plugins that implement PreEnqueue, PreFilter, Filter, Reserve or Permit should implement this interface.
QueueingHint
Kubernetes v1.34 [stable] (enabled by default)
QueueingHint is a callback function for deciding whether a Pod can be requeued to the active queue or backoff queue. It's executed every time a certain kind of event or change happens in the cluster. When the QueueingHint finds that the event might make the Pod schedulable, the Pod is put into the active queue or the backoff queue so that the scheduler will retry the scheduling of the Pod.
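As a rough illustration of the idea, a hint callback for a Pod that was rejected for lack of CPU might look like the sketch below. The `Hint`, `Pod`, and `NodeEvent` types and the `hintOnNodeUpdate` function are hypothetical simplifications, not the framework's real signatures.

```go
package main

import "fmt"

// Hint is a hypothetical result type for the callback.
type Hint int

const (
	QueueSkip Hint = iota // the event cannot help this Pod; leave it where it is
	Queue                 // the event might make the Pod schedulable; requeue it
)

// Pod and NodeEvent are minimal stand-ins invented for this sketch.
type Pod struct {
	Name       string
	CPURequest int
}

type NodeEvent struct {
	Node    string
	FreeCPU int // free CPU on the node after the change
}

// hintOnNodeUpdate decides whether a node change is worth a retry for a Pod
// that was previously rejected because no node had enough CPU.
func hintOnNodeUpdate(p Pod, ev NodeEvent) Hint {
	if ev.FreeCPU >= p.CPURequest {
		return Queue
	}
	return QueueSkip
}

func main() {
	p := Pod{Name: "api", CPURequest: 4}
	fmt.Println(hintOnNodeUpdate(p, NodeEvent{Node: "node-1", FreeCPU: 8}) == Queue)
	fmt.Println(hintOnNodeUpdate(p, NodeEvent{Node: "node-2", FreeCPU: 2}) == Queue)
}
```

Returning the skip value for irrelevant events is what keeps the Pod from being retried when nothing that matters to it has changed.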
QueueSort
These plugins are used to sort Pods in the scheduling queue. A queue sort plugin essentially provides a Less(Pod1, Pod2) function. Only one queue sort plugin may be enabled at a time.
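One plausible ordering is "higher priority first, then earlier arrival first". The sketch below assumes that ordering and uses a made-up `QueuedPod` type with an integer arrival counter; it shows how a single `less` function fully determines the queue order.

```go
package main

import (
	"fmt"
	"sort"
)

// QueuedPod is a hypothetical queue entry used only in this sketch.
type QueuedPod struct {
	Name     string
	Priority int32
	Arrival  int64 // assumed arrival sequence number; lower means earlier
}

// less reports whether a should be scheduled before b:
// higher priority wins, and earlier arrival breaks ties.
func less(a, b QueuedPod) bool {
	if a.Priority != b.Priority {
		return a.Priority > b.Priority
	}
	return a.Arrival < b.Arrival
}

// sortQueue orders the queue in place using less.
func sortQueue(q []QueuedPod) {
	sort.SliceStable(q, func(i, j int) bool { return less(q[i], q[j]) })
}

func main() {
	q := []QueuedPod{
		{Name: "low-early", Priority: 0, Arrival: 1},
		{Name: "high-late", Priority: 10, Arrival: 3},
		{Name: "high-early", Priority: 10, Arrival: 2},
	}
	sortQueue(q)
	for _, p := range q {
		fmt.Println(p.Name)
	}
}
```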
PreFilter
These plugins are used to pre-process info about the Pod, or to check certain conditions that the cluster or the Pod must meet. If a PreFilter plugin returns an error, the scheduling cycle is aborted.
Filter
These plugins are used to filter out nodes that cannot run the Pod. For each node, the scheduler will call filter plugins in their configured order. If any filter plugin marks the node as infeasible, the remaining plugins will not be called for that node. Nodes may be evaluated concurrently.
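The short-circuit behavior can be seen in a small sketch. The `Pod`, `Node`, and `filterFunc` types and the two example filters are hypothetical, and the sketch evaluates nodes one by one for simplicity even though nodes may be evaluated concurrently.

```go
package main

import "fmt"

// Node and Pod are minimal stand-ins invented for this sketch.
type Node struct {
	Name    string
	FreeCPU int
	Zone    string
}

type Pod struct {
	Name       string
	CPURequest int
	Zone       string // required zone; empty means any zone
}

// filterFunc is an assumed plugin shape: true means the node is feasible.
type filterFunc func(p Pod, n Node) bool

func fitsCPU(p Pod, n Node) bool     { return n.FreeCPU >= p.CPURequest }
func matchesZone(p Pod, n Node) bool { return p.Zone == "" || p.Zone == n.Zone }

// feasibleNodes calls the filters in order for each node and stops at the
// first filter that rejects the node. It also counts filter invocations.
func feasibleNodes(p Pod, nodes []Node, filters []filterFunc) (feasible []string, calls int) {
	for _, n := range nodes {
		ok := true
		for _, f := range filters {
			calls++
			if !f(p, n) {
				ok = false
				break // remaining filters are skipped for this node
			}
		}
		if ok {
			feasible = append(feasible, n.Name)
		}
	}
	return feasible, calls
}

func main() {
	pod := Pod{Name: "api", CPURequest: 2, Zone: "a"}
	nodes := []Node{
		{Name: "n1", FreeCPU: 1, Zone: "a"},
		{Name: "n2", FreeCPU: 4, Zone: "b"},
		{Name: "n3", FreeCPU: 4, Zone: "a"},
	}
	feasible, calls := feasibleNodes(pod, nodes, []filterFunc{fitsCPU, matchesZone})
	fmt.Println(feasible, calls)
}
```

Here `n1` fails the first filter, so the zone filter never runs for it, which is why five filter calls are made instead of six.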
PostFilter
These plugins are called after the Filter phase, but only when no feasible nodes were found for the Pod. Plugins are called in their configured order. If any PostFilter plugin marks the node as Schedulable, the remaining plugins will not be called. A typical PostFilter implementation is preemption, which tries to make the Pod schedulable by preempting other Pods.
PreScore
These plugins are used to perform "pre-scoring" work, which generates a sharable state for Score plugins to use. If a PreScore plugin returns an error, the scheduling cycle is aborted.
Score
These plugins are used to rank nodes that have passed the filtering phase. The scheduler will call each scoring plugin for each node. There will be a well-defined range of integers representing the minimum and maximum scores. After the NormalizeScore phase, the scheduler will combine node scores from all plugins according to the configured plugin weights.
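Combining scores by weight can be pictured as a weighted sum per node. The sketch below assumes exactly that (a plain sum of score times weight) and uses invented names (`pluginResult`, `combine`, `best`); the real scheduler's bookkeeping is more involved.

```go
package main

import "fmt"

// pluginResult holds one plugin's already-normalized scores (hypothetical type).
type pluginResult struct {
	Name   string
	Weight int64
	Scores map[string]int64 // node name -> score
}

// combine returns, for each node, the sum of score*weight over all plugins.
func combine(results []pluginResult) map[string]int64 {
	total := make(map[string]int64)
	for _, r := range results {
		for node, s := range r.Scores {
			total[node] += s * r.Weight
		}
	}
	return total
}

// best returns the node with the highest total; ties go to the
// lexicographically smaller name so the result is deterministic.
func best(total map[string]int64) string {
	bestNode := ""
	var bestScore int64 = -1
	for node, s := range total {
		if s > bestScore || (s == bestScore && node < bestNode) {
			bestNode, bestScore = node, s
		}
	}
	return bestNode
}

func main() {
	total := combine([]pluginResult{
		{Name: "spread", Weight: 1, Scores: map[string]int64{"node-a": 10, "node-b": 100}},
		{Name: "affinity", Weight: 2, Scores: map[string]int64{"node-a": 50, "node-b": 0}},
	})
	fmt.Println(total["node-a"], total["node-b"], best(total))
}
```

With these numbers `node-b` wins on the first plugin alone, but the doubled weight of the second plugin tips the result to `node-a` (110 against 100).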
Capacity scoring
Kubernetes v1.33 [alpha] (disabled by default)
The feature gate VolumeCapacityPriority was used in v1.32 to support storage that is statically provisioned. Starting from v1.33, the new feature gate StorageCapacityScoring replaces the old VolumeCapacityPriority gate and adds support for dynamically provisioned storage. When StorageCapacityScoring is enabled, the VolumeBinding plugin in the kube-scheduler is extended to score Nodes based on the storage capacity on each of them. This feature is applicable to CSI volumes that support Storage Capacity, including local storage backed by a CSI driver.
NormalizeScore
These plugins are used to modify scores before the scheduler computes a final ranking of Nodes. A plugin that registers for this extension point will be called with the Score results from the same plugin. This is called once per plugin per scheduling cycle.
For example, suppose a plugin BlinkingLightScorer ranks Nodes based on how many blinking lights they have.
    func ScoreNode(_ *v1.Pod, n *v1.Node) (int, error) {
        return getBlinkingLightCount(n)
    }

However, the maximum count of blinking lights may be small compared to NodeScoreMax. To fix this, BlinkingLightScorer should also register for this extension point.

    func NormalizeScores(scores map[string]int) {
        // Find the highest raw score across all nodes.
        highest := 0
        for _, score := range scores {
            highest = max(highest, score)
        }
        if highest == 0 {
            return // every score is zero; nothing to rescale, and this avoids dividing by zero
        }
        // Rescale so that the best node receives NodeScoreMax.
        for node, score := range scores {
            scores[node] = score * NodeScoreMax / highest
        }
    }

If any NormalizeScore plugin returns an error, the scheduling cycle is aborted.
Note:
Plugins wishing to perform "pre-reserve" work should use the NormalizeScore extension point.
Reserve
A plugin that implements the Reserve interface has two methods, namely Reserve and Unreserve, that back two informational scheduling phases called Reserve and Unreserve, respectively. Plugins which maintain runtime state (aka "stateful plugins") should use these phases to be notified by the scheduler when resources on a node are being reserved and unreserved for a given Pod.
The Reserve phase happens before the scheduler actually binds a Pod to its designated node. It exists to prevent race conditions while the scheduler waits for the bind to succeed. The Reserve method of each Reserve plugin may succeed or fail; if one Reserve method call fails, subsequent plugins are not executed and the Reserve phase is considered to have failed. If the Reserve method of all plugins succeeds, the Reserve phase is considered to be successful and the rest of the scheduling cycle and the binding cycle are executed.
The Unreserve phase is triggered if the Reserve phase or a later phase fails. When this happens, the Unreserve method of all Reserve plugins will be executed in the reverse order of Reserve method calls. This phase exists to clean up the state associated with the reserved Pod.
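The ordering of Reserve and Unreserve calls can be traced with a small sketch. It reads "all Reserve plugins" literally and therefore also calls Unreserve on plugins whose Reserve never ran; that reading, along with the `plugin` type and `runReserve` function, is an assumption of the sketch.

```go
package main

import "fmt"

// plugin is a hypothetical Reserve plugin; ReserveOK controls whether its
// Reserve call succeeds in this simulation.
type plugin struct {
	Name      string
	ReserveOK bool
}

// runReserve calls Reserve on each plugin in order and records a trace.
// On the first failure it stops and runs Unreserve for every plugin in
// reverse order, including plugins whose Reserve was never called.
func runReserve(plugins []plugin) (ok bool, trace []string) {
	for _, p := range plugins {
		trace = append(trace, "reserve:"+p.Name)
		if !p.ReserveOK {
			for i := len(plugins) - 1; i >= 0; i-- {
				trace = append(trace, "unreserve:"+plugins[i].Name)
			}
			return false, trace
		}
	}
	return true, trace
}

func main() {
	ok, trace := runReserve([]plugin{
		{Name: "A", ReserveOK: true},
		{Name: "B", ReserveOK: false},
		{Name: "C", ReserveOK: true},
	})
	fmt.Println(ok, trace)
}
```

In this run plugin C is unreserved without ever having reserved anything, which illustrates why an Unreserve implementation needs to tolerate being called when there is nothing to clean up.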
Caution:
The implementation of the Unreserve method in Reserve plugins must be idempotent and may not fail.
Permit
Permit plugins are invoked at the end of the scheduling cycle for each Pod, to prevent or delay the binding to the candidate node. A permit plugin can do one of three things:
approve
Once all Permit plugins approve a Pod, it is sent for binding.
deny
If any Permit plugin denies a Pod, it is returned to the scheduling queue. This will trigger the Unreserve phase in Reserve plugins.
wait (with a timeout)
If a Permit plugin returns "wait", then the Pod is kept in an internal "waiting" Pods list, and the binding cycle of this Pod starts but directly blocks until it gets approved. If a timeout occurs, wait becomes deny and the Pod is returned to the scheduling queue, triggering the Unreserve phase in Reserve plugins.
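The "wait becomes deny on timeout" rule maps naturally onto a Go `select` with a timer. The sketch below assumes a channel carries the eventual decision for a waiting Pod; `decision` and `waitForPermit` are hypothetical names, not framework API.

```go
package main

import (
	"fmt"
	"time"
)

// decision is a hypothetical outcome type for a waiting Pod.
type decision string

const (
	approve decision = "approve"
	deny    decision = "deny"
)

// waitForPermit blocks until a decision arrives on signal or the timeout
// elapses. If the timeout fires first, the wait turns into a deny.
func waitForPermit(signal <-chan decision, timeout time.Duration) decision {
	select {
	case d := <-signal:
		return d
	case <-time.After(timeout):
		return deny
	}
}

func main() {
	// A decision is already available, so the wait ends with that decision.
	approved := make(chan decision, 1)
	approved <- approve
	fmt.Println(waitForPermit(approved, time.Second))

	// Nobody ever sends on this channel, so the timeout decides.
	fmt.Println(waitForPermit(make(chan decision), 10*time.Millisecond))
}
```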
Note:
While any plugin can access the list of "waiting" Pods and approve them (see FrameworkHandle), we expect only the permit plugins to approve binding of reserved Pods that are in "waiting" state. Once a Pod is approved, it is sent to the PreBind phase.
PreBind
These plugins are used to perform any work required before a Pod is bound. For example, a pre-bind plugin may provision a network volume and mount it on the target node before allowing the Pod to run there.
If any PreBind plugin returns an error, the Pod is rejected and returned to the scheduling queue.
Bind
These plugins are used to bind a Pod to a Node. Bind plugins will not be called until all PreBind plugins have completed. Each bind plugin is called in the configured order. A bind plugin may choose whether or not to handle the given Pod. If a bind plugin chooses to handle a Pod, the remaining bind plugins are skipped.
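The "first plugin that handles the Pod wins" behavior could be sketched as follows. The `Pod` and `bindPlugin` types, the `CustomBinder` flag, and the helper functions are invented for illustration and are not the framework's interface.

```go
package main

import "fmt"

// Pod is a minimal stand-in; CustomBinder is a hypothetical marker that a
// special bind plugin looks for.
type Pod struct {
	Name         string
	CustomBinder bool
}

// bindPlugin is an assumed plugin shape: Bind returns true when the plugin
// chose to handle the Pod and false when it skipped it.
type bindPlugin struct {
	Name string
	Bind func(p Pod, node string) bool
}

// demoPlugins returns an example chain in its configured order.
func demoPlugins() []bindPlugin {
	return []bindPlugin{
		{Name: "custom", Bind: func(p Pod, node string) bool { return p.CustomBinder }},
		{Name: "default", Bind: func(p Pod, node string) bool { return true }},
		{Name: "fallback", Bind: func(p Pod, node string) bool { return true }},
	}
}

// runBind calls the plugins in order and stops at the first one that
// handles the Pod. It reports that plugin's name and how many were called.
func runBind(p Pod, node string, plugins []bindPlugin) (handledBy string, called int) {
	for _, pl := range plugins {
		called++
		if pl.Bind(p, node) {
			return pl.Name, called
		}
	}
	return "", called
}

func main() {
	fmt.Println(runBind(Pod{Name: "special", CustomBinder: true}, "node-1", demoPlugins()))
	fmt.Println(runBind(Pod{Name: "plain"}, "node-1", demoPlugins()))
}
```

In both runs the `fallback` plugin is never reached, because an earlier plugin already handled the Pod.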
PostBind
This is an informational interface. Post-bind plugins are called after a Pod is successfully bound. This is the end of a binding cycle, and can be used to clean up associated resources.
Plugin API
There are two steps to the plugin API. First, plugins must register and get configured, then they use the extension point interfaces. Extension point interfaces have the following form.

    type Plugin interface {
        Name() string
    }

    type QueueSortPlugin interface {
        Plugin
        Less(*v1.Pod, *v1.Pod) bool
    }

    type PreFilterPlugin interface {
        Plugin
        PreFilter(context.Context, *framework.CycleState, *v1.Pod) error
    }

    // ...

Plugin configuration
You can enable or disable plugins in the scheduler configuration. If you are using Kubernetes v1.18 or later, most scheduling plugins are in use and enabled by default.
In addition to default plugins, you can also implement your own scheduling plugins and get them configured along with default plugins. You can visit scheduler-plugins for more details.
If you are using Kubernetes v1.18 or later, you can configure a set of plugins as a scheduler profile and then define multiple profiles to fit various kinds of workload. Learn more at multiple profiles.