BFQ (Budget Fair Queueing)¶
BFQ is a proportional-share I/O scheduler, with some extralow-latency capabilities. In addition to cgroups support (blkio or iocontrollers), BFQ’s main features are:
BFQ guarantees a high system and application responsiveness, and alow latency for time-sensitive applications, such as audio or videoplayers;
BFQ distributes bandwidth, not just time, among processes orgroups (switching back to time distribution when needed to keepthroughput high).
In its default configuration, BFQ privileges latency overthroughput. So, when needed for achieving a lower latency, BFQ buildsschedules that may lead to a lower throughput. If your main or onlygoal, for a given device, is to achieve the maximum-possiblethroughput at all times, then do switch off all low-latency heuristicsfor that device, by setting low_latency to 0. See Section 3 fordetails on how to configure BFQ for the desired tradeoff betweenlatency and throughput, or on how to maximize throughput.
As every I/O scheduler, BFQ adds some overhead to per-I/O-requestprocessing. To give an idea of this overhead, the total,single-lock-protected, per-request processing time of BFQ---i.e., thesum of the execution times of the request insertion, dispatch andcompletion hooks---is, e.g., 1.9 us on an Intel Corei7-2760QM@2.40GHz(dated CPU for notebooks; time measured with simple codeinstrumentation, and using the throughput-sync.sh script of the Ssuite [1], in performance-profiling mode). To put this result intocontext, the total, single-lock-protected, per-request execution timeof the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).
Scheduling overhead further limits the maximum IOPS that a CPU canprocess (already limited by the execution of the rest of the I/Ostack). To give an idea of the limits with BFQ, on slow or averageCPUs, here are, first, the limits of BFQ for three different CPUs, on,respectively, an average laptop, an old desktop, and a cheap embeddedsystem, in case full hierarchical support is enabled (i.e.,CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is notset (Section 4-2):- Intel i7-4850HQ: 400 KIOPS- AMD A8-3850: 250 KIOPS- ARM CortexTM-A53 Octa-core: 80 KIOPS
If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchicalsupport is enabled), then the sustainable throughput with BFQdecreases, because all blkio.bfq* statistics are created and updated(Section 4-2). For BFQ, this leads to the following maximumsustainable throughputs, on the same systems as above:- Intel i7-4850HQ: 310 KIOPS- AMD A8-3850: 200 KIOPS- ARM CortexTM-A53 Octa-core: 56 KIOPS
BFQ works for multi-queue devices too.
1. When may BFQ be useful?¶
BFQ provides the following benefits on personal and server systems.
1-1 Personal systems¶
Low latency for interactive applications¶
Regardless of the actual background workload, BFQ guarantees that, forinteractive tasks, the storage device is virtually as responsive as ifit was idle. For example, even if one or more of the followingbackground workloads are being executed:
one or more large files are being read, written or copied,
a tree of source files is being compiled,
one or more virtual machines are performing I/O,
a software update is in progress,
indexing daemons are scanning filesystems and updating theirdatabases,
starting an application or loading a file from within an applicationtakes about the same time as if the storage device was idle. As acomparison, with CFQ, NOOP or DEADLINE, and in the same conditions,applications experience high latencies, or even become unresponsiveuntil the background workload terminates (also on SSDs).
Low latency for soft real-time applications¶
Also soft real-time applications, such as audio and videoplayers/streamers, enjoy a low latency and a low drop rate, regardlessof the background I/O workload. As a consequence, these applicationsdo not suffer from almost any glitch due to the background workload.
Higher speed for code-development tasks¶
If some additional workload happens to be executed in parallel, thenBFQ executes the I/O-related components of typical code-developmenttasks (compilation, checkout, merge, etc.) much more quickly than CFQ,NOOP or DEADLINE.
High throughput¶
On hard disks, BFQ achieves up to 30% higher throughput than CFQ, andup to 150% higher throughput than DEADLINE and NOOP, with all thesequential workloads considered in our tests. With random workloads,and with all the workloads on flash-based devices, BFQ achieves,instead, about the same throughput as the other schedulers.
Strong fairness, bandwidth and delay guarantees¶
BFQ distributes the device throughput, and not just the device time,among I/O-bound applications in proportion to their weights, with anyworkload and regardless of the device parameters. From these bandwidthguarantees, it is possible to compute a tight per-I/O-request delayguarantees by a simple formula. If not configured for strict serviceguarantees, BFQ switches to time-based resource sharing (only) forapplications that would otherwise cause a throughput loss.
1-2 Server systems¶
Most benefits for server systems follow from the same serviceproperties as above. In particular, regardless of whether additional,possibly heavy workloads are being served, BFQ guarantees:
audio and video-streaming with zero or very low jitter and droprate;
fast retrieval of WEB pages and embedded objects;
real-time recording of data in live-dumping applications (e.g.,packet logging);
responsiveness in local and remote access to a server.
2. How does BFQ work?¶
BFQ is a proportional-share I/O scheduler, whose general structure,plus a lot of code, are borrowed from CFQ.
Each process doing I/O on a device is associated with a weight and a(bfq_)queue.
BFQ grants exclusive access to the device, for a while, to one queue(process) at a time, and implements this service model byassociating every queue with a budget, measured in number ofsectors.
After a queue is granted access to the device, the budget of thequeue is decremented, on each request dispatch, by the size of therequest.
The in-service queue is expired, i.e., its service is suspended,only if one of the following events occurs: 1) the queue finishesits budget, 2) the queue empties, 3) a “budget timeout” fires.
The budget timeout prevents processes doing random I/O fromholding the device for too long and dramatically reducingthroughput.
Actually, as in CFQ, a queue associated with a process issuingsync requests may not be expired immediately when it empties. Incontrast, BFQ may idle the device for a short time interval,giving the process the chance to go on being served if it issuesa new request in time. Device idling typically boosts thethroughput on rotational devices and on non-queueing flash-baseddevices, if processes do synchronous and sequential I/O. Inaddition, under BFQ, device idling is also instrumental inguaranteeing the desired throughput fraction to processesissuing sync requests (see the description of the slice_idletunable in this document, or [1, 2], for more details).
With respect to idling for service guarantees, if severalprocesses are competing for the device at the same time, butall processes and groups have the same weight, then BFQguarantees the expected throughput distribution without everidling the device. Throughput is thus as high as possible inthis common scenario.
On flash-based storage with internal queueing of commands(typically NCQ), device idling happens to be always detrimentalto throughput. So, with these devices, BFQ performs idlingonly when strictly needed for service guarantees, i.e., forguaranteeing low latency or fairness. In these cases, overallthroughput may be sub-optimal. No solution currently exists toprovide both strong service guarantees and optimal throughputon devices with internal queueing.
If low-latency mode is enabled (default configuration), BFQexecutes some special heuristics to detect interactive and softreal-time applications (e.g., video or audio players/streamers),and to reduce their latency. The most important action taken toachieve this goal is to give to the queues associated with theseapplications more than their fair share of the devicethroughput. For brevity, we call it just “weight-raising” the wholesets of actions taken by BFQ to privilege these queues. Inparticular, BFQ provides a milder form of weight-raising forinteractive applications, and a stronger form for soft real-timeapplications.
BFQ automatically deactivates idling for queues born in a burst ofqueue creations. In fact, these queues are usually associated withthe processes of applications and services that benefit mostlyfrom a high throughput. Examples are systemd during boot, or gitgrep.
As CFQ, BFQ merges queues performing interleaved I/O, i.e.,performing random I/O that becomes mostly sequential ifmerged. Differently from CFQ, BFQ achieves this goal with a morereactive mechanism, called Early Queue Merge (EQM). EQM is soresponsive in detecting interleaved I/O (cooperating processes),that it enables BFQ to achieve a high throughput, by queuemerging, even for queues for which CFQ needs a differentmechanism, preemption, to get a high throughput. As such, EQM is aunified mechanism to achieve a high throughput with interleavedI/O.
Queues are scheduled according to a variant of WF2Q+, namedB-WF2Q+, and implemented using an augmented rb-tree to preserve anO(log N) overall complexity. See [2] for more details. B-WF2Q+ isalso ready for hierarchical scheduling, details in Section 4.
B-WF2Q+ guarantees a tight deviation with respect to an ideal,perfectly fair, and smooth service. In particular, B-WF2Q+guarantees that each queue receives a fraction of the devicethroughput proportional to its weight, even if the throughputfluctuates, and regardless of: the device parameters, the currentworkload and the budgets assigned to the queue.
The last, budget-independence, property (although probablycounterintuitive in the first place) is definitely beneficial, forthe following reasons:
First, with any proportional-share scheduler, the maximumdeviation with respect to an ideal service is proportional tothe maximum budget (slice) assigned to queues. As a consequence,BFQ can keep this deviation tight, not only because of theaccurate service of B-WF2Q+, but also because BFQdoes notneed to assign a larger budget to a queue to let the queuereceive a higher fraction of the device throughput.
Second, BFQ is free to choose, for every process (queue), thebudget that best fits the needs of the process, or bestleverages the I/O pattern of the process. In particular, BFQupdates queue budgets with a simple feedback-loop algorithm thatallows a high throughput to be achieved, while still providingtight latency guarantees to time-sensitive applications. Whenthe in-service queue expires, this algorithm computes the nextbudget of the queue so as to:
Let large budgets be eventually assigned to the queuesassociated with I/O-bound applications performing sequentialI/O: in fact, the longer these applications are served oncegot access to the device, the higher the throughput is.
Let small budgets be eventually assigned to the queuesassociated with time-sensitive applications (which typicallyperform sporadic and short I/O), because, the smaller thebudget assigned to a queue waiting for service is, the soonerB-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
If several processes are competing for the device at the same time,but all processes and groups have the same weight, then BFQguarantees the expected throughput distribution without ever idlingthe device. It uses preemption instead. Throughput is then muchhigher in this common scenario.
ioprio classes are served in strict priority order, i.e.,lower-priority queues are not served as long as there arehigher-priority queues. Among queues in the same class, thebandwidth is distributed in proportion to the weight of eachqueue. A very thin extra bandwidth is however guaranteed tothe Idle class, to prevent it from starving.
3. What are BFQ’s tunables and how to properly configure BFQ?¶
Most BFQ tunables affect service guarantees (basically latency andfairness) and throughput. For full details on how to choose thedesired tradeoff between service guarantees and throughput, see theparameters slice_idle, strict_guarantees and low_latency. For detailson how to maximise throughput, see slice_idle, timeout_sync andmax_budget. The other performance-related parameters have beeninherited from, and have been preserved mostly for compatibility withCFQ. So far, no performance improvement has been reported afterchanging the latter parameters in BFQ.
In particular, the tunables back_seek-max, back_seek_penalty,fifo_expire_async and fifo_expire_sync below are the same as inCFQ. Their description is just copied from that for CFQ. Someconsiderations in the description of slice_idle are copied from CFQtoo.
per-process ioprio and weight¶
Unless the cgroups interface is used (see “4. BFQ group scheduling”),weights can be assigned to processes only indirectly, through I/Opriorities, and according to the relation:weight = (IOPRIO_BE_NR - ioprio) * 10.
Beware that, if low-latency is set, then BFQ automatically raises theweight of the queues associated with interactive and soft real-timeapplications. Unset this tunable if you need/want to control weights.
slice_idle¶
This parameter specifies how long BFQ should idle for the next I/Orequest, when certain sync BFQ queues become empty. By defaultslice_idle is a non-zero value. Idling has a double purpose: boostingthroughput and making sure that the desired throughput distribution isrespected (see the description of how BFQ works, and, if needed, thepapers referred there).
As for throughput, idling can be very helpful on highly seeky medialike single spindle SATA/SAS disks where we can cut down on overallnumber of seeks and see improved throughput.
Setting slice_idle to 0 will remove all the idling on queues and oneshould see an overall improved throughput on faster storage deviceslike multiple SATA/SAS disks in hardware RAID configuration, as wellas flash-based storage with internal command queueing (andparallelism).
So depending on storage and workload, it might be useful to setslice_idle=0. In general for SATA/SAS disks and software RAID ofSATA/SAS disks keeping slice_idle enabled should be useful. For anyconfigurations where there are multiple spindles behind single LUN(Host based hardware RAID controller or for storage arrays), or withflash-based fast storage, setting slice_idle=0 might end up in betterthroughput and acceptable latencies.
Idling is however necessary to have service guarantees enforced incase of differentiated weights or differentiated I/O-request lengths.To see why, suppose that a given BFQ queue A must get several I/Orequests served for each request served for another queue B. Idlingensures that, if A makes a new I/O request slightly after becomingempty, then no request of B is dispatched in the middle, and thus Adoes not lose the possibility to get more than one request dispatchedbefore the next request of B is dispatched. Note that idlingguarantees the desired differentiated treatment of queues only interms of I/O-request dispatches. To guarantee that the actual serviceorder then corresponds to the dispatch order, the strict_guaranteestunable must be set too.
There is an important flip side to idling: apart from the above caseswhere it is beneficial also for throughput, idling can severely impactthroughput. One important case is random workload. Because of thisissue, BFQ tends to avoid idling as much as possible, when it is notbeneficial also for throughput (as detailed in Section 2). As aconsequence of this behavior, and of further issues described for thestrict_guarantees tunable, short-term service guarantees may beoccasionally violated. And, in some cases, these guarantees may bemore important than guaranteeing maximum throughput. For example, invideo playing/streaming, a very low drop rate may be more importantthan maximum throughput. In these cases, consider setting thestrict_guarantees parameter.
slice_idle_us¶
Controls the same tuning parameter as slice_idle, but in microseconds.Either tunable can be used to set idling behavior. Afterwards, theother tunable will reflect the newly set value in sysfs.
strict_guarantees¶
If this parameter is set (default: unset), then BFQ
always performs idling when the in-service queue becomes empty;
forces the device to serve one I/O request at a time, by dispatching anew request only if there is no outstanding request.
In the presence of differentiated weights or I/O-request sizes, boththe above conditions are needed to guarantee that every BFQ queuereceives its allotted share of the bandwidth. The first condition isneeded for the reasons explained in the description of the slice_idletunable. The second condition is needed because all modern storagedevices reorder internally-queued requests, which may trivially breakthe service guarantees enforced by the I/O scheduler.
Setting strict_guarantees may evidently affect throughput.
back_seek_max¶
This specifies, given in Kbytes, the maximum “distance” for backward seeking.The distance is the amount of space from the current head location to thesectors that are backward in terms of distance.
This parameter allows the scheduler to anticipate requests in the “backward”direction and consider them as being the “next” if they are within thisdistance from the current head location.
back_seek_penalty¶
This parameter is used to compute the cost of backward seeking. If thebackward distance of request is just 1/back_seek_penalty from a “front”request, then the seeking cost of two requests is considered equivalent.
So scheduler will not bias toward one or the other request (otherwise schedulerwill bias toward front request). Default value of back_seek_penalty is 2.
fifo_expire_async¶
This parameter is used to set the timeout of asynchronous requests. Defaultvalue of this is 250ms.
fifo_expire_sync¶
This parameter is used to set the timeout of synchronous requests. Defaultvalue of this is 125ms. In case to favor synchronous requests over asynchronousone, this value should be decreased relative to fifo_expire_async.
low_latency¶
This parameter is used to enable/disable BFQ’s low latency mode. Bydefault, low latency mode is enabled. If enabled, interactive and softreal-time applications are privileged and experience a lower latency,as explained in more detail in the description of how BFQ works.
DISABLE this mode if you need full control on bandwidthdistribution. In fact, if it is enabled, then BFQ automaticallyincreases the bandwidth share of privileged applications, as the mainmeans to guarantee a lower latency to them.
In addition, as already highlighted at the beginning of this document,DISABLE this mode if your only goal is to achieve a high throughput.In fact, privileging the I/O of some application over the rest mayentail a lower throughput. To achieve the highest-possible throughputon a non-rotational device, setting slice_idle to 0 may be needed too(at the cost of giving up any strong guarantee on fairness and lowlatency).
timeout_sync¶
Maximum amount of device time that can be given to a task (queue) onceit has been selected for service. On devices with costly seeks,increasing this time usually increases maximum throughput. On theopposite end, increasing this time coarsens the granularity of theshort-term bandwidth and latency guarantees, especially if thefollowing parameter is set to zero.
max_budget¶
Maximum amount of service, measured in sectors, that can be providedto a BFQ queue once it is set in service (of course within the limitsof the above timeout). According to what was said in the description ofthe algorithm, larger values increase the throughput in proportion tothe percentage of sequential I/O requests issued. The price of largervalues is that they coarsen the granularity of short-term bandwidthand latency guarantees.
The default value is 0, which enables auto-tuning: BFQ sets max_budgetto the maximum number of sectors that can be served duringtimeout_sync, according to the estimated peak rate.
For specific devices, some users have occasionally reported to havereached a higher throughput by setting max_budget explicitly, i.e., bysetting max_budget to a higher value than 0. In particular, they haveset max_budget to higher values than those to which BFQ would have setit with auto-tuning. An alternative way to achieve this goal is tojust increase the value of timeout_sync, leaving max_budget equal to 0.
4. Group scheduling with BFQ¶
BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namelyblkio and io. In particular, BFQ supports weight-based proportionalshare. To activate cgroups support, set BFQ_GROUP_IOSCHED.
4-1 Service guarantees provided¶
With BFQ, proportional share means true proportional share of thedevice bandwidth, according to group weights. For example, a groupwith weight 200 gets twice the bandwidth, and not just twice the time,of a group with weight 100.
BFQ supports hierarchies (group trees) of any depth. Bandwidth isdistributed among groups and processes in the expected way: for eachgroup, the children of the group share the whole bandwidth of thegroup in proportion to their weights. In particular, this impliesthat, for each leaf group, every process of the group receives thesame share of the whole group bandwidth, unless the ioprio of theprocess is modified.
The resource-sharing guarantee for a group may partially or totallyswitch from bandwidth to time, if providing bandwidth guarantees tothe group lowers the throughput too much. This switch occurs on aper-process basis: if a process of a leaf group causes throughput lossif served in such a way to receive its share of the bandwidth, thenBFQ switches back to just time-based proportional share for thatprocess.
4-2 Interface¶
To get proportional sharing of bandwidth with BFQ for a given device,BFQ must of course be the active scheduler for that device.
Within each group directory, the names of the files associated withBFQ-specific cgroup parameters and stats begin with the “bfq.”prefix. So, with cgroups-v1 or cgroups-v2, the full prefix forBFQ-specific files is “blkio.bfq.” or “io.bfq.” For example, the groupparameter to set the weight of a group with BFQ is blkio.bfq.weightor io.bfq.weight.
As for cgroups-v1 (blkio controller), the exact set of stat filescreated, and kept up-to-date by bfq, depends on whetherCONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates allthe stat files documented inBlock IO Controller. If, instead,CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files:
blkio.bfq.io_service_bytesblkio.bfq.io_service_bytes_recursiveblkio.bfq.io_servicedblkio.bfq.io_serviced_recursive
The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximumthroughput sustainable with bfq, because updating the blkio.bfq.*stats is rather costly, especially for some of the stats enabled byCONFIG_BFQ_CGROUP_DEBUG.
Parameters¶
For each group, the following parameters can be set:
- weight
This specifies the default weight for the cgroup inside its parent.Available values: 1..1000 (default: 100).
For cgroup v1, it is set by writing the value toblkio.bfq.weight.
For cgroup v2, it is set by writing the value toio.bfq.weight.(with an optional prefix ofdefault and a space).
The linear mapping between ioprio and weights, described at the beginningof the tunable section, is still valid, but all weights higher thanIOPRIO_BE_NR*10 are mapped to ioprio 0.
Recall that, if low-latency is set, then BFQ automatically raises theweight of the queues associated with interactive and soft real-timeapplications. Unset this tunable if you need/want to control weights.
- weight_device
This specifies a per-device weight for the cgroup. The syntax isminor:major weight. A weight of0 may be used to reset to the defaultweight.
For cgroup v1, it is set by writing the value toblkio.bfq.weight_device.
For cgroup v2, the file name isio.bfq.weight.
- [1]
P. Valente, A. Avanzini, “Evolution of the BFQ Storage I/OScheduler”, Proceedings of the First Workshop on Mobile SystemTechnologies (MST-2015), May 2015.
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
- [2]
P. Valente and M. Andreolini, “Improving ApplicationResponsiveness with the BFQ Disk I/O Scheduler”, Proceedings ofthe 5th Annual International Systems and Storage Conference(SYSTOR ‘12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf
- [3]