Extensible Scheduler Class¶
sched_ext is a scheduler class whose behavior can be defined by a set of BPFprograms - the BPF scheduler.
sched_ext exports a full scheduling interface so that any schedulingalgorithm can be implemented on top.
The BPF scheduler can group CPUs however it sees fit and schedule themtogether, as tasks aren’t tied to specific CPUs at the time of wakeup.
The BPF scheduler can be turned on and off dynamically anytime.
The system integrity is maintained no matter what the BPF scheduler does.The default scheduling behavior is restored anytime an error is detected,a runnable task stalls, or on invoking the SysRq key sequenceSysRq-S.
When the BPF scheduler triggers an error, debug information is dumped toaid debugging. The debug dump is passed to and printed out by thescheduler binary. The debug dump can also be accessed through thesched_ext_dump tracepoint. The SysRq key sequenceSysRq-Dtriggers a debug dump. This doesn’t terminate the BPF scheduler and canonly be read through the tracepoint.
Switching to and from sched_ext¶
CONFIG_SCHED_CLASS_EXT is the config option to enable sched_ext andtools/sched_ext contains the example schedulers. The following configoptions should be enabled to use sched_ext:
CONFIG_BPF=yCONFIG_SCHED_CLASS_EXT=yCONFIG_BPF_SYSCALL=yCONFIG_BPF_JIT=yCONFIG_DEBUG_INFO_BTF=yCONFIG_BPF_JIT_ALWAYS_ON=yCONFIG_BPF_JIT_DEFAULT_ON=yCONFIG_PAHOLE_HAS_SPLIT_BTF=yCONFIG_PAHOLE_HAS_BTF_TAG=y
sched_ext is used only when the BPF scheduler is loaded and running.
If a task explicitly sets its scheduling policy toSCHED_EXT, it will betreated asSCHED_NORMAL and scheduled by the fair-class scheduler until theBPF scheduler is loaded.
When the BPF scheduler is loaded andSCX_OPS_SWITCH_PARTIAL is not setinops->flags, allSCHED_NORMAL,SCHED_BATCH,SCHED_IDLE, andSCHED_EXT tasks are scheduled by sched_ext.
However, when the BPF scheduler is loaded andSCX_OPS_SWITCH_PARTIAL isset inops->flags, only tasks with theSCHED_EXT policy are scheduledby sched_ext, while tasks withSCHED_NORMAL,SCHED_BATCH andSCHED_IDLE policies are scheduled by the fair-class scheduler.
Terminating the sched_ext scheduler program, triggeringSysRq-S, ordetection of any internal error including stalled runnable tasks aborts theBPF scheduler and reverts all tasks back to the fair-class scheduler.
# make -j16 -C tools/sched_ext# tools/sched_ext/build/bin/scx_simplelocal=0 global=3local=5 global=24local=9 global=44local=13 global=56local=17 global=72^CEXIT: BPF scheduler unregistered
The current status of the BPF scheduler can be determined as follows:
# cat /sys/kernel/sched_ext/stateenabled# cat /sys/kernel/sched_ext/root/opssimple
You can check if any BPF scheduler has ever been loaded since boot by examiningthis monotonically incrementing counter (a value of zero indicates that no BPFscheduler has been loaded):
# cat /sys/kernel/sched_ext/enable_seq1
tools/sched_ext/scx_show_state.py is a drgn script which shows moredetailed information:
# tools/sched_ext/scx_show_state.pyops : simpleenabled : 1switching_all : 1switched_all : 1enable_state : enabled (2)bypass_depth : 0nr_rejected : 0enable_seq : 1
Whether a given task is on sched_ext can be determined as follows:
# grep ext /proc/self/schedext.enabled : 1
The Basics¶
Userspace can implement an arbitrary BPF scheduler by loading a set of BPFprograms that implementstructsched_ext_ops. The only mandatory fieldisops.name which must be a valid BPF object name. All operations areoptional. The following modified excerpt is fromtools/sched_ext/scx_simple.bpf.c showing a minimal global FIFO scheduler.
/* * Decide which CPU a task should be migrated to before being * enqueued (either at wakeup, fork time, or exec time). If an * idle core is found by the default ops.select_cpu() implementation, * then insert the task directly into SCX_DSQ_LOCAL and skip the * ops.enqueue() callback. * * Note that this implementation has exactly the same behavior as the * default ops.select_cpu implementation. The behavior of the scheduler * would be exactly same if the implementation just didn't define the * simple_select_cpu() struct_ops prog. */s32BPF_STRUCT_OPS(simple_select_cpu,structtask_struct*p,s32prev_cpu,u64wake_flags){s32cpu;/* Need to initialize or the BPF verifier will reject the program */booldirect=false;cpu=scx_bpf_select_cpu_dfl(p,prev_cpu,wake_flags,&direct);if(direct)scx_bpf_dsq_insert(p,SCX_DSQ_LOCAL,SCX_SLICE_DFL,0);returncpu;}/* * Do a direct insertion of a task to the global DSQ. This ops.enqueue() * callback will only be invoked if we failed to find a core to insert * into in ops.select_cpu() above. * * Note that this implementation has exactly the same behavior as the * default ops.enqueue implementation, which just dispatches the task * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same * if the implementation just didn't define the simple_enqueue struct_ops * prog. */voidBPF_STRUCT_OPS(simple_enqueue,structtask_struct*p,u64enq_flags){scx_bpf_dsq_insert(p,SCX_DSQ_GLOBAL,SCX_SLICE_DFL,enq_flags);}s32BPF_STRUCT_OPS_SLEEPABLE(simple_init){/* * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and * SCHED_BATCH tasks should use sched_ext. */return0;}voidBPF_STRUCT_OPS(simple_exit,structscx_exit_info*ei){exit_type=ei->type;}SEC(".struct_ops")structsched_ext_opssimple_ops={.select_cpu=(void*)simple_select_cpu,.enqueue=(void*)simple_enqueue,.init=(void*)simple_init,.exit=(void*)simple_exit,.name="simple",};
Dispatch Queues¶
To match the impedance between the scheduler core and the BPF scheduler,sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and apriority queue. By default, there is one global FIFO (SCX_DSQ_GLOBAL),and one local DSQ per CPU (SCX_DSQ_LOCAL). The BPF scheduler can managean arbitrary number of DSQs usingscx_bpf_create_dsq() andscx_bpf_destroy_dsq().
A CPU always executes a task from its local DSQ. A task is “inserted” into aDSQ. A task in a non-local DSQ is “move”d into the target CPU’s local DSQ.
When a CPU is looking for the next task to run, if the local DSQ is notempty, the first task is picked. Otherwise, the CPU tries to move a taskfrom the global DSQ. If that doesn’t yield a runnable task either,ops.dispatch() is invoked.
Scheduling Cycle¶
The following briefly shows how a waking task is scheduled and executed.
When a task is waking up,
ops.select_cpu()is the first operationinvoked. This serves two purposes. First, CPU selection optimizationhint. Second, waking up the selected CPU if idle.The CPU selected by
ops.select_cpu()is an optimization hint and notbinding. The actual decision is made at the last step of scheduling.However, there is a small performance gain if the CPUops.select_cpu()returns matches the CPU the task eventually runs on.A side-effect of selecting a CPU is waking it up from idle. While a BPFscheduler can wake up any cpu using the
scx_bpf_kick_cpu()helper,usingops.select_cpu()judiciously can be simpler and more efficient.A task can be immediately inserted into a DSQ from
ops.select_cpu()by callingscx_bpf_dsq_insert(). If the task is inserted intoSCX_DSQ_LOCALfromops.select_cpu(), it will be inserted into thelocal DSQ of whichever CPU is returned fromops.select_cpu().Additionally, inserting directly fromops.select_cpu()will cause theops.enqueue()callback to be skipped.Note that the scheduler core will ignore an invalid CPU selection, forexample, if it’s outside the allowed cpumask of the task.
Once the target CPU is selected,
ops.enqueue()is invoked (unless thetask was inserted directly fromops.select_cpu()).ops.enqueue()can make one of the following decisions:Immediately insert the task into either the global or a local DSQ bycalling
scx_bpf_dsq_insert()with one of the following options:SCX_DSQ_GLOBAL,SCX_DSQ_LOCAL, orSCX_DSQ_LOCAL_ON|cpu.Immediately insert the task into a custom DSQ by calling
scx_bpf_dsq_insert()with a DSQ ID which is smaller than 2^63.Queue the task on the BPF side.
When a CPU is ready to schedule, it first looks at its local DSQ. Ifempty, it then looks at the global DSQ. If there still isn’t a task torun,
ops.dispatch()is invoked which can use the following twofunctions to populate the local DSQ.scx_bpf_dsq_insert()inserts a task to a DSQ. Any target DSQ can beused -SCX_DSQ_LOCAL,SCX_DSQ_LOCAL_ON|cpu,SCX_DSQ_GLOBALor a custom DSQ. Whilescx_bpf_dsq_insert()currently can’t be called with BPF locks held, this is being worked onand will be supported.scx_bpf_dsq_insert()schedules insertionrather than performing them immediately. There can be up toops.dispatch_max_batchpending tasks.scx_bpf_move_to_local()moves a task from the specified non-localDSQ to the dispatching DSQ. This function cannot be called with any BPFlocks held.scx_bpf_move_to_local()flushes the pending insertionstasks before trying to move from the specified DSQ.
After
ops.dispatch()returns, if there are tasks in the local DSQ,the CPU runs the first one. If empty, the following steps are taken:Try to move from the global DSQ. If successful, run the task.
If
ops.dispatch()has dispatched any tasks, retry #3.If the previous task is an SCX task and still runnable, keep executingit (see
SCX_OPS_ENQ_LAST).Go idle.
Note that the BPF scheduler can always choose to dispatch tasks immediatelyinops.enqueue() as illustrated in the above simple example. If only thebuilt-in DSQs are used, there is no need to implementops.dispatch() asa task is never queued on the BPF scheduler and both the local and globalDSQs are executed automatically.
scx_bpf_dsq_insert() inserts the task on the FIFO of the target DSQ. Usescx_bpf_dsq_insert_vtime() for the priority queue. Internal DSQs such asSCX_DSQ_LOCAL andSCX_DSQ_GLOBAL do not support priority-queuedispatching, and must be dispatched to withscx_bpf_dsq_insert(). Seethe function documentation and usage intools/sched_ext/scx_simple.bpf.cfor more information.
Task Lifecycle¶
The following pseudo-code summarizes the entire lifecycle of a task managedby a sched_ext scheduler:
ops.init_task();/* A new task is created */ops.enable();/* Enable BPF scheduling for the task */while(taskinSCHED_EXT){if(taskcanmigrate)ops.select_cpu();/* Called on wakeup (optimization) */ops.runnable();/* Task becomes ready to run */while(taskisrunnable){if(taskisnotinaDSQ&&task->scx.slice==0){ops.enqueue();/* Task can be added to a DSQ *//* Any usable CPU becomes available */ops.dispatch();/* Task is moved to a local DSQ */}ops.running();/* Task starts running on its assigned CPU */while(task->scx.slice>0&&taskisrunnable)ops.tick();/* Called every 1/HZ seconds */ops.stopping();/* Task stops running (time slice expires or wait) *//* Task's CPU becomes available */ops.dispatch();/* task->scx.slice can be refilled */}ops.quiescent();/* Task releases its assigned CPU (wait) */}ops.disable();/* Disable BPF scheduling for the task */ops.exit_task();/* Task is destroyed */
Where to Look¶
include/linux/sched/ext.hdefines the core data structures, ops tableand constants.kernel/sched/ext.ccontains sched_ext core implementation and helpers.The functions prefixed withscx_bpf_can be called from the BPFscheduler.tools/sched_ext/hosts example BPF scheduler implementations.scx_simple[.bpf].c: Minimal global FIFO scheduler example using acustom DSQ.scx_qmap[.bpf].c: A multi-level FIFO scheduler supporting fivelevels of priority implemented withBPF_MAP_TYPE_QUEUE.
ABI Instability¶
The APIs provided by sched_ext to BPF schedulers programs have no stabilityguarantees. This includes the ops table callbacks and constants defined ininclude/linux/sched/ext.h, as well as thescx_bpf_ kfuncs defined inkernel/sched/ext.c.
While we will attempt to provide a relatively stable API surface whenpossible, they are subject to change without warning between kernelversions.