Schedutil¶
Note
All this assumes a linear relation between frequency and work capacity,we know this is flawed, but it is the best workable approximation.
PELT (Per Entity Load Tracking)¶
With PELT we track some metrics across the various scheduler entities, fromindividual tasks to task-group slices to CPU runqueues. As the basis for thiswe use an Exponentially Weighted Moving Average (EWMA), each period (1024us)is decayed such that y^32 = 0.5. That is, the most recent 32ms contributehalf, while the rest of history contribute the other half.
Specifically:
ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
ewma(u) = ewma_sum(u) / ewma_sum(1)
Since this is essentially a progression of an infinite geometric series, theresults are composable, that is ewma(A) + ewma(B) = ewma(A+B). This propertyis key, since it gives the ability to recompose the averages when tasks movearound.
Note that blocked tasks still contribute to the aggregates (task-group slicesand CPU runqueues), which reflects their expected contribution when theyresume running.
Using this we track 2 key metrics: ‘running’ and ‘runnable’. ‘Running’reflects the time an entity spends on the CPU, while ‘runnable’ reflects thetime an entity spends on the runqueue. When there is only a single task thesetwo metrics are the same, but once there is contention for the CPU ‘running’will decrease to reflect the fraction of time each task spends on the CPUwhile ‘runnable’ will increase to reflect the amount of contention.
For more detail see: kernel/sched/pelt.c
Frequency / CPU Invariance¶
Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPUfor 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% ona big CPU, we allow architectures to scale the time delta with two ratios, oneDynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
For simple DVFS architectures (where software is in full control) we triviallycompute the ratio as:
f_curr_dvfs := ----- f_max
For more dynamic systems where the hardware is in control of DVFS we usehardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.For Intel specifically, we use:
APERFf_cur := ----- * P0 MPERF 4C-turbo; if available and turbo enabledf_max := { 1C-turbo; if turbo enabled P0; otherwise f_curr_dvfs := min( 1, ----- ) f_maxWe pick 4C turbo over 1C turbo to make it slightly more sustainable.
r_cpu is determined as the ratio of highest performance level of the currentCPU vs the highest performance level of any other CPU in the system.
r_tot = r_dvfs * r_cpu
The result is that the above ‘running’ and ‘runnable’ metrics become invariantof DVFS and CPU type. IOW. we can transfer and compare them between CPUs.
For more detail see:
kernel/sched/pelt.h:
update_rq_clock_pelt()arch/x86/kernel/smpboot.c:”APERF/MPERF frequency ratio computation.”
Capacity Aware Scheduling:”1. CPU Capacity + 2. Task utilization”
UTIL_EST¶
Because periodic tasks have their averages decayed while they sleep, eventhough when running their expected utilization will be the same, they suffer a(DVFS) ramp-up after they are running again.
To alleviate this (a default enabled option) UTIL_EST drives an InfiniteImpulse Response (IIR) EWMA with the ‘running’ value on dequeue -- when it ishighest. UTIL_EST filters to instantly increase and only decay on decrease.
A further runqueue wide sum (of runnable tasks) is maintained of:
util_est := Sum_t max( t_running, t_util_est_ewma )
For more detail see: kernel/sched/fair.c:util_est_dequeue()
UCLAMP¶
It is possible to set effective u_min and u_max clamps on each CFS or RT task;the runqueue keeps an max aggregate of these clamps for all running tasks.
For more detail see: include/uapi/linux/sched/types.h
Schedutil / DVFS¶
Every time the scheduler load tracking is updated (task wakeup, taskmigration, time progression) we call out to schedutil to update the hardwareDVFS state.
The basis is the CPU runqueue’s ‘running’ metric, which per the above it isthe frequency invariant utilization estimate of the CPU. From this we computea desired frequency like:
max( running, util_est ); if UTIL_ESTu_cfs := { running; otherwise clamp( u_cfs + u_rt , u_min, u_max ); if UCLAMP_TASKu_clamp := { u_cfs + u_rt; otherwiseu := u_clamp + u_irq + u_dl; [approx. see source for more detail]f_des := min( f_max, 1.25 u * f_max )XXX IO-wait: when the update is due to a task wakeup from IO-completion weboost ‘u’ above.
This frequency is then used to select a P-state/OPP or directly munged into aCPPC style request to the hardware.
XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_minrequired to satisfy the workload.
Because these callbacks are directly from the scheduler, the DVFS hardwareinteraction should be ‘fast’ and non-blocking. Schedutil supportsrate-limiting DVFS requests for when hardware interaction is slow andexpensive, this reduces effectiveness.
For more information see: kernel/sched/cpufreq_schedutil.c
NOTES¶
On low-load scenarios, where DVFS is most relevant, the ‘running’ numberswill closely reflect utilization.
In saturated scenarios task movement will cause some transient dips,suppose we have a CPU saturated with 4 tasks, then when we migrate a taskto an idle CPU, the old CPU will have a ‘running’ value of 0.75 while thenew CPU will gain 0.25. This is inevitable and time progression willcorrect this. XXX do we still guarantee f_max due to no idle-time?
Much of the above is about avoiding DVFS dips, and independent DVFS domainshaving to re-learn / ramp-up when load shifts.