Multi-Queue Block IO Queueing Mechanism (blk-mq)

The Multi-Queue Block IO Queueing Mechanism is an API to enable fast storage devices to achieve a huge number of input/output operations per second (IOPS) through queueing and submitting IO requests to block devices simultaneously, benefiting from the parallelism offered by modern storage devices.

Introduction

Background

Magnetic hard disks have been the de facto standard from the beginning of the development of the kernel. The Block IO subsystem aimed to achieve the best performance possible for those devices with a high penalty when doing random access, and the bottleneck was the mechanical moving parts, a lot slower than any layer on the storage stack. One example of such an optimization technique involves ordering read/write requests according to the current position of the hard disk head.

However, with the development of Solid State Drives and Non-Volatile Memories without mechanical parts or random access penalty, and capable of performing highly parallel access, the bottleneck of the stack moved from the storage device to the operating system. In order to take advantage of the parallelism in those devices' design, the multi-queue mechanism was introduced.

The former design had a single queue to store block IO requests with a single lock. That did not scale well in SMP systems due to dirty data in cache and the bottleneck of having a single lock for multiple processors. This setup also suffered from congestion when different processes (or the same process, moving to different CPUs) wanted to perform block IO. Instead of this, the blk-mq API spawns multiple queues with individual entry points local to the CPU, removing the need for a lock. A deeper explanation of how this works is covered in the following section (Operation).

Operation

When userspace performs IO to a block device (reading or writing a file, for instance), blk-mq takes action: it will store and manage IO requests to the block device, acting as middleware between the userspace (and a filesystem, if present) and the block device driver.

blk-mq has two groups of queues: software staging queues and hardware dispatch queues. When the request arrives at the block layer, it will try the shortest path possible: send it directly to the hardware queue. However, there are two cases in which it might not do that: if there's an IO scheduler attached to the layer, or if we want to try to merge requests. In both cases, requests will be sent to the software queue.

Then, after the requests are processed by software queues, they will be placed at the hardware queue, a second stage queue where the hardware has direct access to process those requests. However, if the hardware does not have enough resources to accept more requests, blk-mq will place requests on a temporary queue, to be sent in the future, when the hardware is able.

Software staging queues

The block IO subsystem adds requests in the software staging queues (represented by struct blk_mq_ctx) in case they weren't sent directly to the driver. A request is one or more BIOs. They arrive at the block layer through the data structure struct bio. The block layer will then build a new structure from it, the struct request, that will be used to communicate with the device driver. Each queue has its own lock and the number of queues is defined on a per-CPU or per-node basis.

The staging queue can be used to merge requests for adjacent sectors. For instance, requests for sectors 3-6, 6-7 and 7-9 can become one request for 3-9. Even if random access to SSDs and NVMs has the same response time as sequential access, grouped requests for sequential access decrease the number of individual requests. This technique of merging requests is called plugging.
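The toy sketch below (plain userspace C; the struct and function names are made up and this is not the block layer's plugging code) only illustrates how the three adjacent ranges above collapse into one request:

/*
 * Toy illustration of the merging idea described above.  Adjacent sector
 * ranges that touch each other are folded into a single larger range.
 */
#include <stdio.h>

struct io_range {
        unsigned long start;    /* first sector */
        unsigned long end;      /* one past the last sector */
};

/* Merge b into a if the two ranges touch; returns 1 on success. */
static int try_merge(struct io_range *a, const struct io_range *b)
{
        if (a->end != b->start)
                return 0;
        a->end = b->end;
        return 1;
}

int main(void)
{
        struct io_range reqs[] = { {3, 6}, {6, 7}, {7, 9} };
        struct io_range merged = reqs[0];

        for (int i = 1; i < 3; i++)
                if (!try_merge(&merged, &reqs[i]))
                        printf("could not merge request %d\n", i);

        /* Prints "one request for sectors 3-9" */
        printf("one request for sectors %lu-%lu\n", merged.start, merged.end);
        return 0;
}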

Along with that, the requests can be reordered to ensure fairness of system resources (e.g. to ensure that no application suffers from starvation) and/or to improve IO performance, by an IO scheduler.

IO Schedulers

There are several schedulers implemented by the block layer, each one following a heuristic to improve the IO performance. They are "pluggable" (as in plug and play), in the sense that they can be selected at run time using sysfs. You can read more about Linux's IO schedulers here. Scheduling happens only between requests in the same queue, so it is not possible to merge requests from different queues, otherwise there would be cache thrashing and a need to have a lock for each queue. After the scheduling, the requests are eligible to be sent to the hardware. One of the possible schedulers to be selected is the NONE scheduler, the most straightforward one. It will just place requests on whatever software queue the process is running on, without any reordering. When the device starts processing requests in the hardware queue (a.k.a. run the hardware queue), the software queues mapped to that hardware queue will be drained in sequence according to their mapping.
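As a small illustration of the run-time selection through sysfs mentioned above, the userspace snippet below switches a block device's scheduler to "none". The device name /dev/sda is only an assumption, and writing this file requires root privileges:

/*
 * Userspace sketch: select the "none" scheduler for sda through sysfs.
 * Reading /sys/block/sda/queue/scheduler first shows which schedulers
 * are available on a given system.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/sys/block/sda/queue/scheduler";
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, "none", strlen("none")) < 0)
                perror("write");
        close(fd);
        return 0;
}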

Hardware dispatch queues

The hardware queue (represented by struct blk_mq_hw_ctx) is a struct used by device drivers to map the device submission queues (or device DMA ring buffer), and is the last step of the block layer submission code before the low level device driver takes ownership of the request. To run this queue, the block layer removes requests from the associated software queues and tries to dispatch to the hardware.

If it's not possible to send the requests directly to hardware, they will be added to a linked list (hctx->dispatch) of requests. Then, next time the block layer runs a queue, it will send the requests lying at the dispatch list first, to ensure a fair dispatch with those requests that were ready to be sent first. The number of hardware queues depends on the number of hardware contexts supported by the hardware and its device driver, but it will not be more than the number of cores of the system. There is no reordering at this stage, and each software queue has a set of hardware queues to send requests to.

Note

Neither the block layer nor the device protocols guarantee the order of completion of requests. This must be handled by higher layers, like the filesystem.

Tag-based completion

In order to indicate which request has been completed, every request is identified by an integer, ranging from 0 to the dispatch queue size. This tag is generated by the block layer and later reused by the device driver, removing the need to create a redundant identifier. When a request is completed in the drive, the tag is sent back to the block layer to notify it of the finalization. This removes the need to do a linear search to find out which IO has been completed.
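A hedged sketch of a tag-driven completion path is shown below. my_read_completed_tag() and my_irq_handler() are hypothetical names, while blk_mq_tag_to_rq() and blk_mq_complete_request() are the real block layer helpers (the latter is documented further below):

/*
 * Sketch: the tag read back from the device indexes straight to the
 * request, so the driver never needs a linear search.
 */
#include <linux/blk-mq.h>
#include <linux/interrupt.h>

unsigned int my_read_completed_tag(void *driver_data);  /* hypothetical hardware access */

static irqreturn_t my_irq_handler(int irq, void *data)
{
        struct blk_mq_hw_ctx *hctx = data;
        unsigned int tag = my_read_completed_tag(hctx->driver_data);
        struct request *rq = blk_mq_tag_to_rq(hctx->tags, tag);

        if (rq)
                blk_mq_complete_request(rq);
        return IRQ_HANDLED;
}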

Source code documentation

struct blk_mq_hw_ctx

State for a hardware queue facing the hardware block device

Definition

struct blk_mq_hw_ctx {
  struct {
    spinlock_t lock;
    struct list_head        dispatch;
    unsigned long           state;
  };
  struct delayed_work     run_work;
  cpumask_var_t cpumask;
  int next_cpu;
  int next_cpu_batch;
  unsigned long           flags;
  void *sched_data;
  struct request_queue    *queue;
  struct blk_flush_queue  *fq;
  void *driver_data;
  struct sbitmap          ctx_map;
  struct blk_mq_ctx       *dispatch_from;
  unsigned int            dispatch_busy;
  unsigned short          type;
  unsigned short          nr_ctx;
  struct blk_mq_ctx       **ctxs;
  spinlock_t dispatch_wait_lock;
  wait_queue_entry_t dispatch_wait;
  atomic_t wait_index;
  struct blk_mq_tags      *tags;
  struct blk_mq_tags      *sched_tags;
  unsigned long           queued;
  unsigned long           run;
#define BLK_MQ_MAX_DISPATCH_ORDER       7
  unsigned long           dispatched[BLK_MQ_MAX_DISPATCH_ORDER];
  unsigned int            numa_node;
  unsigned int            queue_num;
  atomic_t nr_active;
  struct hlist_node       cpuhp_online;
  struct hlist_node       cpuhp_dead;
  struct kobject          kobj;
  unsigned long           poll_considered;
  unsigned long           poll_invoked;
  unsigned long           poll_success;
#ifdef CONFIG_BLK_DEBUG_FS
  struct dentry           *debugfs_dir;
  struct dentry           *sched_debugfs_dir;
#endif
  struct list_head        hctx_list;
  struct srcu_struct      srcu[];
};

Members

{unnamed_struct}
anonymous
lock
Protects the dispatch list.
dispatch
Used for requests that are ready to be dispatched to the hardware but for some reason (e.g. lack of resources) could not be sent to the hardware. As soon as the driver can send new requests, requests at this list will be sent first for a fairer dispatch.
state
BLK_MQ_S_* flags. Defines the state of the hw queue (active, scheduled to restart, stopped).
run_work
Used for scheduling a hardware queue run at a later time.
cpumask
Map of available CPUs where this hctx can run.
next_cpu
Used by blk_mq_hctx_next_cpu() for round-robin CPU selection from cpumask.
next_cpu_batch
Counter of how many works are left in the batch before changing to the next CPU.
flags
BLK_MQ_F_* flags. Defines the behaviour of the queue.
sched_data
Pointer owned by the IO scheduler attached to a request queue. It's up to the IO scheduler how to use this pointer.
queue
Pointer to the request queue that owns this hardware context.
fq
Queue of requests that need to perform a flush operation.
driver_data
Pointer to data owned by the block driver that created this hctx.
ctx_map
Bitmap for each software queue. If bit is on, there is a pending request in that software queue.
dispatch_from
Software queue to be used when no scheduler was selected.
dispatch_busy
Number used by blk_mq_update_dispatch_busy() to decide if the hw queue is busy, using an Exponential Weighted Moving Average algorithm.
type
HCTX_TYPE_* flags. Type of hardware queue.
nr_ctx
Number of software queues.
ctxs
Array of software queues.
dispatch_wait_lock
Lock for dispatch_wait queue.
dispatch_wait
Waitqueue to put requests when there is no tag available at the moment, to wait for another try in the future.
wait_index
Index of next available dispatch_wait queue to insert requests.
tags
Tags owned by the block driver. A tag at this set is only assigned when a request is dispatched from a hardware queue.
sched_tags
Tags owned by I/O scheduler. If there is an I/O scheduler associated with a request queue, a tag is assigned when that request is allocated. Else, this member is not used.
queued
Number of queued requests.
run
Number of dispatched requests.
dispatched
Number of dispatch requests by queue.
numa_node
NUMA node the storage adapter has been connected to.
queue_num
Index of this hardware queue.
nr_active
Number of active requests. Only used when a tag set is shared across request queues.
cpuhp_online
List to store requests when a CPU is going offline.
cpuhp_dead
List to store requests when some CPU dies.
kobj
Kernel object for sysfs.
poll_considered
Count how many times blk_poll() was called.
poll_invoked
Count how many requests blk_poll() polled.
poll_success
Count how many polled requests were completed.
debugfs_dir
debugfs directory for this hardware queue. Named as cpu<cpu_number>.
sched_debugfs_dir
debugfs directory for the scheduler.
hctx_list
If this hctx is not in use, this is an entry in q->unused_hctx_list.
srcu
Sleepable RCU. Used as a lock when the type of the hardware queue is blocking (BLK_MQ_F_BLOCKING). Must be the last member - see also blk_mq_hw_ctx_size().
struct blk_mq_queue_map

Map software queues to hardware queues

Definition

struct blk_mq_queue_map {
  unsigned int *mq_map;
  unsigned int nr_queues;
  unsigned int queue_offset;
};

Members

mq_map
CPU ID to hardware queue index map. This is an array with nr_cpu_ids elements. Each element has a value in the range [queue_offset, queue_offset + nr_queues).
nr_queues
Number of hardware queues to map CPU IDs onto.
queue_offset
First hardware queue to map onto. Used by the PCIe NVMe driver to map each hardware queue type (enum hctx_type) onto a distinct set of hardware queues.
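A hedged sketch of a driver-side .map_queues callback (a member of struct blk_mq_ops, documented below) that simply falls back to the block layer's default CPU-to-hardware-queue spreading; my_map_queues is a made-up name, while blk_mq_map_queues() is the real helper:

/*
 * Sketch of a .map_queues implementation.  It reuses the block layer's
 * default spreading for the default hardware queue type; real drivers may
 * instead derive the map from interrupt affinity (e.g. PCI MSI-X vectors).
 */
#include <linux/blk-mq.h>

static int my_map_queues(struct blk_mq_tag_set *set)
{
        /* Fill set->map[HCTX_TYPE_DEFAULT].mq_map for every possible CPU. */
        return blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
}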
enum hctx_type

Type of hardware queue

Constants

HCTX_TYPE_DEFAULT
All I/O not otherwise accounted for.
HCTX_TYPE_READ
Just for READ I/O.
HCTX_TYPE_POLL
Polled I/O of any kind.
HCTX_MAX_TYPES
Number of types of hctx.
struct blk_mq_tag_set

tag set that can be shared between request queues

Definition

struct blk_mq_tag_set {
  struct blk_mq_queue_map map[HCTX_MAX_TYPES];
  unsigned int            nr_maps;
  const struct blk_mq_ops *ops;
  unsigned int            nr_hw_queues;
  unsigned int            queue_depth;
  unsigned int            reserved_tags;
  unsigned int            cmd_size;
  int numa_node;
  unsigned int            timeout;
  unsigned int            flags;
  void *driver_data;
  struct blk_mq_tags      **tags;
  struct mutex            tag_list_lock;
  struct list_head        tag_list;
};

Members

map
One or more ctx -> hctx mappings. One map exists for each hardware queue type (enum hctx_type) that the driver wishes to support. There are no restrictions on maps being of the same size, and it's perfectly legal to share maps between types.
nr_maps
Number of elements in the map array. A number in the range [1, HCTX_MAX_TYPES].
ops
Pointers to functions that implement block driver behavior.
nr_hw_queues
Number of hardware queues supported by the block driver that owns this data structure.
queue_depth
Number of tags per hardware queue, reserved tags included.
reserved_tags
Number of tags to set aside for BLK_MQ_REQ_RESERVED tag allocations.
cmd_size
Number of additional bytes to allocate per request. The block driver owns these additional bytes.
numa_node
NUMA node the storage adapter has been connected to.
timeout
Request processing timeout in jiffies.
flags
Zero or more BLK_MQ_F_* flags.
driver_data
Pointer to data owned by the block driver that created this tag set.
tags
Tag sets. One tag set per hardware queue. Has nr_hw_queues elements.
tag_list_lock
Serializes tag_list accesses.
tag_list
List of the request queues that use this tag set. See also request_queue.tag_set_list.
struct blk_mq_queue_data

Data about a request inserted in a queue

Definition

struct blk_mq_queue_data {
  struct request *rq;
  bool last;
};

Members

rq
Request pointer.
last
If it is the last request in the queue.
struct blk_mq_ops

Callback functions that implement block driver behaviour.

Definition

struct blk_mq_ops {
  blk_status_t (*queue_rq)(struct blk_mq_hw_ctx *, const struct blk_mq_queue_data *);
  void (*commit_rqs)(struct blk_mq_hw_ctx *);
  bool (*get_budget)(struct request_queue *);
  void (*put_budget)(struct request_queue *);
  enum blk_eh_timer_return (*timeout)(struct request *, bool);
  int (*poll)(struct blk_mq_hw_ctx *);
  void (*complete)(struct request *);
  int (*init_hctx)(struct blk_mq_hw_ctx *, void *, unsigned int);
  void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
  int (*init_request)(struct blk_mq_tag_set *set, struct request *, unsigned int, unsigned int);
  void (*exit_request)(struct blk_mq_tag_set *set, struct request *, unsigned int);
  void (*initialize_rq_fn)(struct request *rq);
  void (*cleanup_rq)(struct request *);
  bool (*busy)(struct request_queue *);
  int (*map_queues)(struct blk_mq_tag_set *set);
#ifdef CONFIG_BLK_DEBUG_FS
  void (*show_rq)(struct seq_file *m, struct request *rq);
#endif
};

Members

queue_rq
Queue a new request from block IO.
commit_rqs
If a driver uses bd->last to judge when to submit requests to hardware, it must define this function. In case of errors that make us stop issuing further requests, this hook serves the purpose of kicking the hardware (which the last request otherwise would have done).
get_budget
Reserve budget before queuing a request. Once .queue_rq is run, it is the driver's responsibility to release the reserved budget. Also we have to handle the failure case of .get_budget to avoid I/O deadlock.
put_budget
Release the reserved budget.
timeout
Called on request timeout.
poll
Called to poll for completion of a specific tag.
complete
Mark the request as complete.
init_hctx
Called when the block layer side of a hardware queue has been set up, allowing the driver to allocate/init matching structures.
exit_hctx
Ditto for exit/teardown.
init_request

Called for every command allocated by the block layer to allow the driver to set up driver specific data.

A tag greater than or equal to queue_depth is for setting up a flush request.

exit_request
Ditto for exit/teardown.
initialize_rq_fn
Called from inside blk_get_request().
cleanup_rq
Called before freeing one request which isn't completed yet, and usually for freeing the driver private data.
busy
If set, returns whether or not this queue currently is busy.
map_queues
This allows drivers to specify their own queue mapping by overriding the setup-time function that builds the mq_map.
show_rq
Used by the debugfs implementation to show driver-specific information about a request.
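To tie struct blk_mq_tag_set and struct blk_mq_ops together, here is a heavily hedged sketch of the driver-side setup, loosely modeled on simple drivers such as null_blk. Every my_* identifier is invented for illustration, error handling and disk registration are omitted, and the block layer calls used (blk_mq_alloc_tag_set(), blk_mq_init_queue(), blk_mq_start_request(), blk_mq_end_request()) are the real entry points:

/*
 * Minimal, hypothetical blk-mq driver skeleton.  Real drivers need far
 * more: gendisk registration, DMA setup, error handling and teardown.
 */
#include <linux/blk-mq.h>
#include <linux/module.h>

struct my_cmd {                 /* per-request driver data, see cmd_size */
        u32 flags;
};

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        blk_mq_start_request(rq);       /* tell the block layer we own it now */

        /* A real driver would build a hardware command and ring a doorbell
         * here; this sketch completes the request immediately. */
        blk_mq_end_request(rq, BLK_STS_OK);
        return BLK_STS_OK;
}

static const struct blk_mq_ops my_mq_ops = {
        .queue_rq = my_queue_rq,
};

static struct blk_mq_tag_set my_tag_set;
static struct request_queue *my_queue;

static int __init my_init(void)
{
        int ret;

        my_tag_set.ops          = &my_mq_ops;
        my_tag_set.nr_hw_queues = 1;            /* one hardware dispatch queue */
        my_tag_set.queue_depth  = 128;          /* tags per hardware queue */
        my_tag_set.numa_node    = NUMA_NO_NODE;
        my_tag_set.cmd_size     = sizeof(struct my_cmd);
        my_tag_set.flags        = BLK_MQ_F_SHOULD_MERGE;

        ret = blk_mq_alloc_tag_set(&my_tag_set);
        if (ret)
                return ret;

        my_queue = blk_mq_init_queue(&my_tag_set);
        if (IS_ERR(my_queue)) {
                blk_mq_free_tag_set(&my_tag_set);
                return PTR_ERR(my_queue);
        }
        return 0;
}

static void __exit my_exit(void)
{
        blk_cleanup_queue(my_queue);
        blk_mq_free_tag_set(&my_tag_set);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");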
enum mq_rq_state blk_mq_rq_state(struct request * rq)

read the current MQ_RQ_* state of a request

Parameters

struct request *rq
target request.
struct request *blk_mq_rq_from_pdu(void * pdu)

cast a PDU to a request

Parameters

void *pdu
the PDU (Protocol Data Unit) to be cast

Return

request

Description

Driver command data is immediately after the request. So subtract request size to get back to the original request.

void *blk_mq_rq_to_pdu(struct request * rq)

cast a request to a PDU

Parameters

struct request *rq
the request to be cast

Return

pointer to the PDU

Description

Driver command data is immediately after the request. So add the request size to get the PDU.
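A hedged sketch of both conversions, assuming a driver that set cmd_size = sizeof(struct my_cmd) in its tag set (struct my_cmd and the my_* functions are made up):

/*
 * Sketch of the request <-> PDU helpers.  The PDU is the per-request
 * driver data allocated right after struct request via cmd_size.
 */
#include <linux/blk-mq.h>

struct my_cmd {
        u64 lba;
        u32 opcode;
};

/* Submission side: get the driver data that follows the request. */
static void my_prepare(struct request *rq)
{
        struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);

        cmd->lba = blk_rq_pos(rq);      /* starting sector of the request */
}

/* Completion side: go back from the driver data to the request. */
static void my_complete(struct my_cmd *cmd)
{
        struct request *rq = blk_mq_rq_from_pdu(cmd);

        blk_mq_end_request(rq, BLK_STS_OK);
}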

void blk_mq_quiesce_queue(struct request_queue * q)

wait until all ongoing dispatches have finished

Parameters

struct request_queue *q
request queue.

Note

this function does not prevent that the struct request end_io() callback function is invoked. Once this function returns, we make sure no dispatch can happen until the queue is unquiesced via blk_mq_unquiesce_queue().
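A hedged sketch of the usual pairing with blk_mq_unquiesce_queue(); the reconfiguration step in the middle is purely illustrative:

/*
 * Sketch of the quiesce/unquiesce pattern.  The "reconfigure" step is a
 * placeholder for whatever driver state must not race with ->queue_rq.
 */
#include <linux/blk-mq.h>

static void my_reconfigure(struct request_queue *q)
{
        blk_mq_quiesce_queue(q);        /* wait for in-flight dispatches */

        /* ... safely update driver/device state here: no new ->queue_rq
         * calls can happen until the queue is unquiesced ... */

        blk_mq_unquiesce_queue(q);      /* let dispatching resume */
}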

void blk_mq_complete_request(struct request * rq)

end I/O on a request

Parameters

struct request *rq
the request being processed

Description

Complete a request by scheduling the ->complete_rq operation.

void blk_mq_start_request(struct request * rq)

Start processing a request

Parameters

struct request *rq
Pointer to request to be started

Description

Function used by device drivers to notify the block layer that a request is going to be processed now, so blk layer can do proper initializations such as starting the timeout timer.

void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx * hctx)

Run a hardware queue.

Parameters

struct blk_mq_hw_ctx *hctx
Pointer to the hardware queue to run.

Description

Send pending requests to the hardware.

void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx * hctx, bool async, unsigned long msecs)

Run (or schedule to run) a hardware queue.

Parameters

struct blk_mq_hw_ctx *hctx
Pointer to the hardware queue to run.
bool async
If we want to run the queue asynchronously.
unsigned long msecs
Milliseconds of delay to wait before running the queue.

Description

If !async, try to run the queue now. Else, run the queue asynchronously and with a delay of msecs.

void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx * hctx, unsigned long msecs)

Run a hardware queue asynchronously.

Parameters

struct blk_mq_hw_ctx *hctx
Pointer to the hardware queue to run.
unsigned long msecs
Milliseconds of delay to wait before running the queue.

Description

Run a hardware queue asynchronously with a delay of msecs.

void blk_mq_run_hw_queue(struct blk_mq_hw_ctx * hctx, bool async)

Start to run a hardware queue.

Parameters

struct blk_mq_hw_ctx *hctx
Pointer to the hardware queue to run.
bool async
If we want to run the queue asynchronously.

Description

Check if the request queue is not in a quiesced state and if there are pending requests to be sent. If this is true, run the queue to send requests to hardware.
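A hedged example of when a driver typically calls this: once device resources free up (for instance after ->queue_rq had to return BLK_STS_RESOURCE), kick the queue so parked requests get another chance. struct my_device and my_resources_freed() are hypothetical:

/*
 * Sketch of a driver path that re-runs a hardware queue once resources
 * become available again.
 */
#include <linux/blk-mq.h>

struct my_device {
        struct blk_mq_hw_ctx *hctx;     /* saved when the hctx was initialized */
};

static void my_resources_freed(struct my_device *dev)
{
        /* async = true: run from a kblockd worker, not from this context. */
        blk_mq_run_hw_queue(dev->hctx, true);
}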

void blk_mq_run_hw_queues(struct request_queue * q, bool async)

Run all hardware queues in a request queue.

Parameters

struct request_queue *q
Pointer to the request queue to run.
bool async
If we want to run the queue asynchronously.

void blk_mq_delay_run_hw_queues(struct request_queue * q, unsigned long msecs)

Run all hardware queues asynchronously.

Parameters

struct request_queue *q
Pointer to the request queue to run.
unsigned long msecs
Milliseconds of delay to wait before running the queues.

bool blk_mq_queue_stopped(struct request_queue * q)

check whether one or more hctxs have been stopped

Parameters

struct request_queue *q
request queue.

Description

The caller is responsible for serializing this function against blk_mq_{start,stop}_hw_queue().

void blk_mq_request_bypass_insert(struct request * rq, bool at_head, bool run_queue)

Insert a request at dispatch list.

Parameters

struct request *rq
Pointer to request to be inserted.
bool at_head
true if the request should be inserted at the head of the list.
bool run_queue
If we should run the hardware queue after inserting the request.

Description

Should only be used carefully, when the caller knows we want to bypass a potential IO scheduler on the target device.

void blk_mq_try_issue_directly(struct blk_mq_hw_ctx * hctx, struct request * rq, blk_qc_t * cookie)

Try to send a request directly to device driver.

Parameters

struct blk_mq_hw_ctx *hctx
Pointer of the associated hardware queue.
struct request *rq
Pointer to request to be sent.
blk_qc_t *cookie
Request queue cookie.

Description

If the device has enough resources to accept a new request now, send the request directly to device driver. Else, insert at hctx->dispatch queue, so we can try to send it another time in the future. Requests inserted at this queue have higher priority.

blk_qc_t blk_mq_submit_bio(struct bio * bio)

Create and send a request to block device.

Parameters

struct bio *bio
Bio pointer.

Description

Builds up a request structure from q and bio and send to the device. The request may not be queued directly to hardware if:

* This request can be merged with another one
* We want to place request at plug queue for possible future merging
* There is an IO scheduler active at this queue

It will not queue the request if there is an error with the bio, or at the request creation.

Return

Request queue cookie.

int blk_poll(struct request_queue * q, blk_qc_t cookie, bool spin)

poll for IO completions

Parameters

struct request_queue *q
the queue
blk_qc_t cookie
cookie passed back at IO submission time
bool spin
whether to spin for completions

Description

Poll for completions on the passed in queue. Returns number of completed entries found. If spin is true, then blk_poll will continue looping until at least one completion is found, unless the task is otherwise marked running (or we need to reschedule).