One of the key design goals of CFS was, as might be understood from its name, fairness — ensuring that every process in the system gets its fair share of CPU time. This goal is achieved by tracking how much time each process has received and running those that have gotten less CPU time than the others, with each process's run time scaled by its "nice" priority. CFS is, in other words, a weighted fair queuing scheduler at its core.
Fairness, it turns out, is enough to solve many CPU-scheduling problems. There are, however, many constraints beyond the fair allocation of CPU time that are placed on the scheduler. It should, for example, maximize the benefit of the system's memory caches, which requires minimizing the movement of processes between CPUs. At the same time, though, it should keep all CPUs busy if there is work for them to do. Power management is a complication as well; sometimes the optimal decisions for system throughput must take a back seat to preserving battery life. Hybrid systems (where not all CPUs are the same) add more complications. And so on.
One place where there is a desire for improvement is in the handling of latency requirements. Some processes may not need a lot of CPU time but, when they do need that time, they need it quickly. Others might need more CPU time but can wait for it if need be. CFS does not give processes a way to express their latency requirements; nice values (priorities) can be used to give a process more CPU time, but that is not the same thing. The realtime scheduling classes can be used for latency-sensitive work, but running in a realtime class is a privileged operation, and realtime processes can badly affect the operation of the rest of the system.
What is lacking is a way to ensure that some processes can get access to a CPU quickly without necessarily giving those processes the ability to obtain more than their fair share of CPU time. The latency nice patches have been circulating for some time as an attempt to solve this problem; they allow CFS processes with tight latency requirements to jump the queue for the CPU when they want to run. These patches appear to work, but Zijlstra thinks that there might be a better approach to the problem.
The "Earliest Eligible Virtual Deadline First" (EEVDF) scheduling algorithmis not new; it was described inthis1995 paper by Ion Stoica and Hussein Abdel-Wahab. Its name suggestssomething similar to the Earliest Deadline First algorithm used by thekernel'sdeadline scheduler but, unlikethat scheduler, EEVDF is not a realtime scheduler, so it works in differentways. Understanding EEVDF requires getting a handle on a few (relatively)simple concepts.
Like CFS, EEVDF tries to divide the available CPU time fairly among the processes that are contending for it. If, for example, there are five processes trying to run on a single CPU, each of those processes should get 20% of the available time. A given process's nice value can be used to adjust the calculation of what its fair time is; a process with a lower nice value (and thus a higher priority) is entitled to more CPU time at the expense of those with higher nice values. To this point, there is nothing new here.
Imagine a time period of one second; during that time, in our five-process scenario, each process should have gotten 200ms of CPU time. For a number of reasons, things never turn out exactly that way; some processes will have gotten too much time, while others will have been shortchanged. For each process, EEVDF calculates the difference between the time that process should have gotten and how much it actually got; that difference is called "lag". A process with a positive lag value has not received its fair share and should be scheduled sooner than one with a negative lag value.
In fact, a process is deemed to be "eligible" if — and only if — its calculated lag is greater than or equal to zero; any process with a negative lag will not be eligible to run. For any ineligible process, there will be a time in the future where the time it is entitled to catches up to the time it has actually gotten and it will become eligible again; that time is deemed the "eligible time".
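To make those two ideas concrete, here is a small userspace sketch of the accounting; the struct, field names, and wall-clock millisecond units are invented for illustration, and the actual scheduler tracks these values per scheduling entity in weighted virtual time.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-task accounting; the names and units are made up. */
    struct task {
        const char *name;
        double weight;      /* derived from the nice value */
        double received;    /* CPU time actually received (ms) */
    };

    /* Ideal service: this task's weighted share of the elapsed time. */
    static double ideal_service(const struct task *t, double total_weight,
                                double elapsed)
    {
        return elapsed * t->weight / total_weight;
    }

    /* Lag: how far behind (positive) or ahead (negative) of its fair
     * share the task is. */
    static double lag(const struct task *t, double total_weight,
                      double elapsed)
    {
        return ideal_service(t, total_weight, elapsed) - t->received;
    }

    /* A task is eligible only while its lag is at least zero. */
    static bool eligible(const struct task *t, double total_weight,
                         double elapsed)
    {
        return lag(t, total_weight, elapsed) >= 0.0;
    }

    int main(void)
    {
        /* Two of the five equal-weight tasks from the example above;
         * over one second, each is owed 200ms. */
        struct task tasks[] = {
            { "A", 1.0, 180.0 },    /* shortchanged: lag +20ms */
            { "B", 1.0, 230.0 },    /* over-served: lag -30ms  */
        };
        double total_weight = 5.0, elapsed = 1000.0;

        for (int i = 0; i < 2; i++)
            printf("%s: lag %+.0fms, %seligible\n", tasks[i].name,
                   lag(&tasks[i], total_weight, elapsed),
                   eligible(&tasks[i], total_weight, elapsed) ? "" : "not ");
        return 0;
    }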
The calculation of lag is, thus, a key part of the EEVDF scheduler, and much of the patch set is dedicated to finding this value correctly. Even in the absence of the full EEVDF algorithm, a process's lag can be used to place it fairly in the run queue; processes with higher lag should be run first in an attempt to even out lag values across the system.
The other factor that comes into play is the "virtual deadline", which is the earliest time by which a process should have received its due CPU time. This deadline is calculated by adding a process's allocated time slice to its eligible time. A process with a 10ms time slice, and whose eligible time is 20ms in the future, will have a virtual deadline that is 30ms in the future.
The core of EEVDF, as can be seen in its name, is that it will run the process with the earliest virtual deadline first. The scheduling choice is thus driven by a combination of fairness (the lag value that is used to calculate the eligible time) and the amount of time that each process currently has due to it.
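A rough sketch of that selection rule might look like the following; the names are illustrative, and the real implementation keeps runnable entities in the run queue's rbtree rather than scanning an array.

    #include <stddef.h>
    #include <stdio.h>

    struct entity {
        long long eligible_time;    /* virtual time at which lag reaches zero */
        long long slice;            /* allocated time slice */
    };

    /* Virtual deadline: the eligible time plus the allocated slice. */
    static long long virtual_deadline(const struct entity *se)
    {
        return se->eligible_time + se->slice;
    }

    /* Pick the eligible entity (eligible_time <= now) with the earliest
     * virtual deadline; return NULL if nothing is eligible yet. */
    static struct entity *pick_eevdf(struct entity *rq, size_t nr,
                                     long long now)
    {
        struct entity *best = NULL;

        for (size_t i = 0; i < nr; i++) {
            if (rq[i].eligible_time > now)
                continue;   /* negative lag: not eligible yet */
            if (!best || virtual_deadline(&rq[i]) < virtual_deadline(best))
                best = &rq[i];
        }
        return best;
    }

    int main(void)
    {
        /* At virtual time 0, the first entity (deadline 30) is not yet
         * eligible, so the already-eligible entity with deadline 40 is
         * picked despite its later deadline. */
        struct entity rq[] = {
            { .eligible_time = 20, .slice = 10 },
            { .eligible_time =  0, .slice = 40 },
        };
        struct entity *se = pick_eevdf(rq, 2, 0);

        printf("picked entity with deadline %lld\n", virtual_deadline(se));
        return 0;
    }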
With this framework in place, the implementation of quicker access for latency-sensitive processes happens naturally. When the scheduler is calculating the time slice for each process, it factors in that process's assigned latency-nice value; a process with a lower latency-nice setting (and, thus, tighter latency requirements) will get a shorter time slice. Processes that are relatively indifferent to latency will receive longer slices. Note that the amount of CPU time given to any two processes (with the same nice value) will be the same, but the low-latency process will get it in a larger number of shorter slices.
Remember that the virtual deadline is calculated by adding the time slice to the eligible time. That will cause processes with shorter time slices to have closer virtual deadlines and, as a result, to be executed first. Latency-sensitive processes, which normally don't need large amounts of CPU time, will be able to respond quickly to events, while processes without latency requirements will be given longer time slices, which can help to improve throughput. No tricky scheduler heuristics are needed to get this result.
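The arithmetic is simple enough to show directly; the mapping from latency-nice to slice length below is invented purely for illustration, since the exact formula is up to the patch set.

    #include <stdio.h>

    int main(void)
    {
        long long eligible = 20;        /* both tasks become eligible at t=20 */
        long long base_slice = 12;      /* default slice, in ms */

        /* Invented mapping: a tighter latency-nice value shrinks the
         * slice; the real mapping is whatever the patches define. */
        long long latency_sensitive_slice = base_slice / 4;    /* 3ms */
        long long batch_slice = base_slice;                    /* 12ms */

        /* Same eligible time, different slices, so different deadlines:
         * the latency-sensitive task (23) runs before the batch task
         * (32), but both still receive their full fair share over time. */
        printf("latency-sensitive deadline: %lld\n",
               eligible + latency_sensitive_slice);
        printf("batch deadline: %lld\n", eligible + batch_slice);
        return 0;
    }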
There is a big distance, though, between an academic paper and an implementation that can perform well in the Linux kernel. Zijlstra has only begun to run benchmarks on his EEVDF scheduler; his initial conclusion is that "there's a bunch of wins and losses, but nothing that indicates a total fail". Some of the results, he said, "seem to indicate EEVDF schedules a lot more consistently than CFS and has a bunch of latency wins".
While this is clearly a reasonable starting point, Zijlstra acknowledges that there is still quite a bit of work to be done. But, he said, "if we can pull this off we can delete a whole [bunch] of icky heuristics code", replacing it with a better-defined policy. This is not a small change, he added: "It completely reworks the base scheduler, placement, preemption, picking -- everything. The only thing they have in common is that they're both a virtual time based scheduler."
Needless to say, such a fundamental change is unlikely to be rushed into the kernel. Helpfully, the current patches implement EEVDF as an option alongside CFS, which will enable wider testing without actually replacing the current scheduler. The CPU scheduler has to do the right thing for almost any conceivable workload on the wide range of systems supported by the kernel; that leaves a lot of room for unwelcome regressions resulting from even small changes — which this is not. So a lot of that testing will have to happen before consideration might be given to replacing CFS with EEVDF; there is no deadline, virtual or otherwise, for when that might happen.
Index entries for this article:
  Kernel: Releases/6.6
  Kernel: Scheduler/Completely fair scheduler
  Kernel: Scheduler/EEVDF
Posted Mar 9, 2023 15:55 UTC (Thu) by dullfire (guest, #111432) [Link] (9 responses)
Posted Mar 9, 2023 16:06 UTC (Thu) by Wol (subscriber, #4433) [Link]
So if you've got a process that needs large chunks of CPU at random intervals, you might let it accumulate entitlement while sleeping, but then it'll hog the CPU when it wakes; or, more sensibly, processes can only accumulate entitlement while they're actually waiting to run. Once they drop off the run queue of their own accord (sleep or whatever), they should be deleted from the scheduling algorithm until they wake up again.
Cheers,
Wol
Posted Mar 9, 2023 16:12 UTC (Thu) by corbet (editor, #1) [Link] (3 responses)
Posted Mar 9, 2023 18:32 UTC (Thu) by dullfire (guest, #111432) [Link]
Posted Apr 2, 2024 8:44 UTC (Tue) by Ayxan (guest, #169409) [Link] (1 responses)
Posted Apr 2, 2024 9:38 UTC (Tue) by farnz (subscriber, #17727) [Link]
The core of virtual time is that it's OK to skip virtual time in which the system is idle; when you're about to go idle, you first look to see if there are any tasks blocked because their lag is negative, and if there are, you can skip virtual time until you reach the nearest "eligible time". You thus never go idle when you have tasks waiting to run; if you would go idle, you skip the virtual time to the nearest eligible time, and now at least one task is ready to run.
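A minimal sketch of that idle-time skip (illustrative only, with made-up names, not kernel code):

    #include <stddef.h>
    #include <stdio.h>

    struct waiter {
        long long eligible_time;    /* virtual time at which lag reaches zero */
        int runnable;
    };

    /* If every runnable task is still ineligible, jump virtual time
     * forward to the nearest eligible time instead of going idle;
     * otherwise leave virtual time alone. */
    static long long maybe_skip_vtime(const struct waiter *w, size_t nr,
                                      long long vtime)
    {
        long long nearest = -1;

        for (size_t i = 0; i < nr; i++) {
            if (!w[i].runnable)
                continue;
            if (w[i].eligible_time <= vtime)
                return vtime;       /* something can run now: no skip */
            if (nearest < 0 || w[i].eligible_time < nearest)
                nearest = w[i].eligible_time;
        }
        return nearest < 0 ? vtime : nearest;   /* nothing runnable: stay put */
    }

    int main(void)
    {
        /* Both tasks are runnable but ineligible at virtual time 100, so
         * the clock skips ahead to 130, the nearest eligible time. */
        struct waiter w[] = { { 130, 1 }, { 150, 1 } };

        printf("vtime -> %lld\n", maybe_skip_vtime(w, 2, 100));
        return 0;
    }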
Posted Mar 9, 2023 19:04 UTC (Thu) by geofft (subscriber, #59789) [Link] (3 responses)
Unfortunately, the CFS quota mechanism tends to result in a lot of weird runtime behavior; the high-level problem, I think, is that you can use a lot of CPU right after being scheduled without the scheduler stopping you, especially in a multi-threaded process, and once you get descheduled, the scheduler will realize that you're so deeply in the red for quota that it won't reschedule you for seconds or even minutes afterwards, which isn't really what users - or the TCP services they talk to - expect. So even though Kubernetes turns it on by default, lots of operators turn it off in practice. It's gotten better recently (see https://lwn.net/Articles/844976/, which also does a better job of explaining the overall mechanism than I'm doing) but I haven't gotten a chance to try it yet.
I'd be curious to know whether EEVDF can implement the quota concept in a way that is less jittery.
Posted Mar 9, 2023 22:33 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)
I'd be surprised if CPU was actually allocated as "so much each second"; I'd allocate it per cycle. So effectively you should allocate a bunch of CPU to all your processes, and then when it runs out you go round the cycle and allocate again. That way your CPU is not idle if you have processes waiting.
Of course, there's plenty of other things to take into account - what happens if a process fork-bombs or something? And this idea of smaller chunks increasing your likelihood of scheduling might not be so easy to implement this way.
Actually, it's given me an idea for a fork-bomb-proof scheduler :-) If, instead of forking processes each getting a fresh new slice of CPU, you set a limit of how much is available each cycle. Let's say for example that we want a cycle to last a maximum of 2 secs, and the default allocation is 100ms. That gives us a maximum of 20 processes getting a full timeslice allocation. So when process 20 forks, the parent loses half its allocation to its child, giving them 50ms each. Effectively, as it forks the "nice" value goes up. Then as processes die their slice gets given to other processes.
So a fork bomb can't DoS processes that are already running. And if say you had a terminal session already running it at least stands a chance of getting going and letting you kill the fork-bomb ...
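A toy sketch of that splitting rule, entirely hypothetical, just to make the arithmetic concrete:

    #include <stdio.h>

    struct budget {
        long long slice_us;     /* this task's share of the fixed cycle */
    };

    /* On fork, the child gets half of the parent's remaining allocation;
     * the cycle's total budget never grows. */
    static struct budget fork_split(struct budget *parent)
    {
        struct budget child = { .slice_us = parent->slice_us / 2 };

        parent->slice_us -= child.slice_us;
        return child;
    }

    int main(void)
    {
        struct budget parent = { .slice_us = 100000 };  /* 100ms of a 2s cycle */
        struct budget child = fork_split(&parent);

        /* A fork bomb only ever subdivides its own allocation; tasks
         * already running keep their slices. */
        printf("parent %lldus, child %lldus\n", parent.slice_us, child.slice_us);
        return 0;
    }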
Cheers,
Wol
Posted Mar 11, 2023 19:30 UTC (Sat) by NYKevin (subscriber, #129325) [Link]
I imagine this is resolved by the "virtual time" that corbet mentioned in another comment. If you have too many processes, your virtual clock runs fast, so now there are more than 1000 (virtual) ms in a (real) second, and everybody gets scheduled as allocated. It's just that the 100 (virtual) ms that they get is significantly less than 100 real milliseconds. This is mathematically equivalent to multiplying everyone's quota by 10/11, but you don't have to actually go around doing all those multiplies and divides, nor do you have to deal with the rounding errors failing to line up with each other.
Of course, that wouldn't work for realtime scheduling (where you have actually given each process a contractual guarantee of 100 ms/s, and the process likely expects that to be 100 *real* milliseconds), but we're not talking about that. If you try to configure SCHED_DEADLINE in such a way, it will simply refuse the request as impossible to fulfill.
Posted Mar 17, 2023 20:49 UTC (Fri) by prauld (subscriber, #39414) [Link]
Posted Mar 9, 2023 17:16 UTC (Thu) by flussence (guest, #85566) [Link] (3 responses)
Posted Mar 10, 2023 7:13 UTC (Fri) by jason.rahman (subscriber, #120510) [Link]
Posted Mar 10, 2023 9:40 UTC (Fri) by fredrik (subscriber, #232) [Link] (1 responses)
> "It completely reworks the base scheduler, placement, preemption, picking -- everything."
A BPF scheduler API would have to expose all those concepts and either allow modification of structs in the API or limit what is possible to express. Other scheduler experiments would potentially require other components and concepts in the BPF API.
And then there's the question of maintenance. Are kernel BPF APIs part of the kernel's public APIs, and hence expected to be maintained more or less forever?
Of course, one counter argument is that anything related to software is possible in theory.
It would be interesting to get Peter Zijlstra's and other scheduler developers take on this topic.
Posted Mar 11, 2023 4:57 UTC (Sat) by flussence (guest, #85566) [Link]
That's only *more* reason to make this work via BPF, where people can try it out in a sandbox without risking bringing the entire system down. Otherwise something this intrusive ought to spend a good 10-15 years stewing in an external tree like every other attempt to overthrow CFS.
Posted Mar 10, 2023 0:31 UTC (Fri) by josh (subscriber, #17465) [Link] (2 responses)
Posted Mar 12, 2023 21:17 UTC (Sun) by dullfire (guest, #111432) [Link]
I suspect that most cases of large code base build times are going to be most fruitfully improved by fixing any lingering bugs in their build system (like abuse of recursive make).
Furthermore, it is also well out of scope of the areas the EEVDF scheduler is attempting to improve upon.
Posted Mar 26, 2023 6:36 UTC (Sun) by dottedmag (subscriber, #18590) [Link]
Posted Mar 10, 2023 14:46 UTC (Fri) by HenrikH (subscriber, #31152) [Link] (5 responses)
Which is a topic where I feel CFS fails quite a bit. On the old O(1) scheduler you could launch a CPU-heavy thread and it would more or less remain running on the same core, while that same thread moves around a lot under CFS (not as bad as the Windows scheduler, though, which seems to move such a thread around constantly, so that a thread taking 100% CPU on e.g. a 4-core system results in all four cores sitting at 25%).
So I hope that EEVDF will somehow be better in this regard. And then we have the added complexity of the E-core/P-core mix from Intel, CCDs with and without the 3D cache from AMD, and big.LITTLE from ARM.
Posted Mar 15, 2023 0:01 UTC (Wed) by intgr (subscriber, #39733) [Link] (4 responses)
Note that this is no longer optimal for modern CPUs. There are lots of thermal sensors, one or more per core, and boost clocks for the entire CPU are usually limited by the hottest reading. So keeping a heavy task on the same core leads to it developing a hotspot and limiting boost -- whereas boost is most useful when most cores are idle.
So some bouncing can actually lead to better single thread performance. Not sure what the ideal bounce interval is, I doubt CFS is particularly tuned for it.
Also it's common that some cores are better cooled due to their positioning on the chip, uneven cooler contact etc. It would be even better if the scheduler had such thermal information and would prefer the coolest cores.
Posted Mar 15, 2023 14:52 UTC (Wed) by farnz (subscriber, #17727) [Link]
In theory, at least, AMD offers the information you want via amd-pstate. It tells you what the highest performance is of a given CPU thread if all other threads are idle and cooling is adequate, what performance you can get if the cooling subsystem meets AMD specs for that CPU, and where you transition from "maximum perf/joule" to "fewer joules but lower perf".
Posted Mar 24, 2023 12:59 UTC (Fri) by nim (subscriber, #102653) [Link] (2 responses)
OTOH, cold caches are a real phenomenon. Sadly, I cannot point to a figure about how much that affects execution speed, especially if it can pull the data from the shared L3 cache instead of going to RAM, but it seems an easier nut to crack. It would be nice to at least have the option to say "don't move threads to other cores unless their current one is needed for sth else."
Posted Mar 24, 2023 15:19 UTC (Fri) by intgr (subscriber, #39733) [Link] (1 responses)
This is two sides of the same coin -- modern CPUs dynamically adjust their clocks to always operate at thermal limits (or sometimes power limits or other constraints) -- but CPU manufacturers call it "boost" rather than "throttling".
For each CPU, AMD individually lists "Base Clock" and "Max. Boost Clock", e.g. 5950X can range from 3.4 to 4.9GHz, the latter being the clock under ideal conditions for single-threaded workload. But that's usually not achievable for an extended steady workload. This article has some graphs showing clock behavior during extended single-thread workloads: https://www.tomshardware.com/news/amd-ryzen-3000-boost-cl...
> modern CPUs from both Intel and AMD expose the concept of best core, which is meant to be used exclusively when you need the highest clock. Doesn't that contradict your claim?
This is true as well, there may be only one or a few cores on a package capable of reaching the max boost clocks.
But as stated above, under longer workloads, it's common for modern CPUs to be thermally limited. If due to thermals, the "best" core's clock can no longer boost higher than the 2nd best core, you don't lose anything by running on the 2nd best core.
> OTOH, cold caches are a real phenomenon.
Agreed. It all depends on how often this bouncing happens. If it's just once every few seconds, it's likely that the performance penalty due to cold caches is small enough to be unmeasurable.
But I admit my analysis is hypothetical at this point: I have no information whether possible performance gains from core bouncing are any higher than the losses.
Posted Mar 24, 2023 15:28 UTC (Fri) by Wol (subscriber, #4433) [Link]
"Given adequate cooling" - THIS IS THE PROBLEM. In order for cores to run faster, they need to get smaller. If they get smaller, cooling gets harder, and "adequate cooling" becomes impossible.
So as your core heats up, you have a choice - reduce the VA going in (and hence the heat that needs to be dissipated), or move the task to a different core to give the first one a chance to cool down. Do neither, and you'll kill the core. If you could easily kill older, larger chips (I had problems with a 686 ages ago with inadequate cooling), it's going to be a LOT easier to kill today's faster, more powerful cores.
Cheers,
Wol