Multi-Queue Block IO Queueing Mechanism (blk-mq)
The Multi-Queue Block IO Queueing Mechanism is an API to enable fast storage devices to achieve a huge number of input/output operations per second (IOPS) through queueing and submitting IO requests to block devices simultaneously, benefiting from the parallelism offered by modern storage devices.
Introduction
Background
Magnetic hard disks have been the de facto standard from the beginning of the development of the kernel. The Block IO subsystem aimed to achieve the best performance possible for those devices with a high penalty when doing random access, and the bottleneck was the mechanical moving parts, a lot slower than any layer on the storage stack. One example of such optimization technique involves ordering read/write requests according to the current position of the hard disk head.
However, with the development of Solid State Drives and Non-Volatile Memories without mechanical parts nor random access penalty and capable of performing high parallel access, the bottleneck of the stack had moved from the storage device to the operating system. In order to take advantage of the parallelism in those devices' design, the multi-queue mechanism was introduced.
The former design had a single queue to store block IO requests with a single lock. That did not scale well in SMP systems due to dirty data in cache and the bottleneck of having a single lock for multiple processors. This setup also suffered with congestion when different processes (or the same process, moving to different CPUs) wanted to perform block IO. Instead of this, the blk-mq API spawns multiple queues with individual entry points local to the CPU, removing the need for a lock. A deeper explanation on how this works is covered in the following section (Operation).
Operation
When the userspace performs IO to a block device (reading or writing a file, for instance), blk-mq takes action: it will store and manage IO requests to the block device, acting as middleware between the userspace (and a filesystem, if present) and the block device driver.
blk-mq has two groups of queues: software staging queues and hardware dispatch queues. When the request arrives at the block layer, it will try the shortest path possible: send it directly to the hardware queue. However, there are two cases that it might not do that: if there's an IO scheduler attached at the layer or if we want to try to merge requests. In both cases, requests will be sent to the software queue.
Then, after the requests are processed by software queues, they will be placed at the hardware queue, a second stage queue where the hardware has direct access to process those requests. However, if the hardware does not have enough resources to accept more requests, blk-mq will place requests on a temporary queue, to be sent in the future, when the hardware is able.
Software staging queues
The block IO subsystem adds requests in the software staging queues (represented by struct blk_mq_ctx) in case that they weren't sent directly to the driver. A request is one or more BIOs. They arrive at the block layer through the data structure struct bio. The block layer will then build a new structure from it, the struct request, that will be used to communicate with the device driver. Each queue has its own lock and the number of queues is defined on a per-CPU or per-node basis.
The staging queue can be used to merge requests for adjacent sectors. For instance, requests for sectors 3-6, 6-7, 7-9 can become one request for 3-9. Even though random access to SSDs and NVMs has the same response time as sequential access, grouped requests for sequential access decrease the number of individual requests. This technique of merging requests is called plugging.
Along with that, the requests can be reordered to ensure fairness of system resources (e.g. to ensure that no application suffers from starvation) and/or to improve IO performance, by an IO scheduler.
IO Schedulers
There are several schedulers implemented by the block layer, each one following a heuristic to improve the IO performance. They are "pluggable" (as in plug and play), in the sense that they can be selected at run time using sysfs. You can read more about Linux's IO schedulers here. The scheduling happens only between requests in the same queue, so it is not possible to merge requests from different queues, otherwise there would be cache thrashing and a need to have a lock for each queue. After the scheduling, the requests are eligible to be sent to the hardware. One of the possible schedulers to be selected is the NONE scheduler, the most straightforward one. It will just place requests on whatever software queue the process is running on, without any reordering. When the device starts processing requests in the hardware queue (a.k.a. run the hardware queue), the software queues mapped to that hardware queue will be drained in sequence according to their mapping.
Hardware dispatch queues
The hardware queue (represented by struct blk_mq_hw_ctx) is a struct used by device drivers to map the device submission queues (or device DMA ring buffer), and is the last step of the block layer submission code before the low level device driver takes ownership of the request. To run this queue, the block layer removes requests from the associated software queues and tries to dispatch to the hardware.
If it's not possible to send the requests directly to hardware, they will be added to a linked list (hctx->dispatch) of requests. Then, next time the block layer runs a queue, it will send the requests lying at the dispatch list first, to ensure a fair dispatch for those requests that were ready to be sent first. The number of hardware queues depends on the number of hardware contexts supported by the hardware and its device driver, but it will not be more than the number of cores of the system. There is no reordering at this stage, and each software queue has a set of hardware queues to send requests to.
Note
Neither the block layer nor the device protocols guarantee the order of completion of requests. This must be handled by higher layers, like the filesystem.
Tag-based completion
In order to indicate which request has been completed, every request is identified by an integer, ranging from 0 to the dispatch queue size. This tag is generated by the block layer and later reused by the device driver, removing the need to create a redundant identifier. When a request is completed in the driver, the tag is sent back to the block layer to notify it of the finalization. This removes the need to do a linear search to find out which IO has been completed.
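The following is a minimal sketch (not taken from any in-tree driver) of how a driver interrupt handler might use the tag to find and complete the corresponding request. my_read_completed_tag() is a hypothetical helper that pops a completed tag from the device, and the hardware queue pointer is assumed to have been registered as the interrupt's dev_id:

#include <linux/blk-mq.h>
#include <linux/interrupt.h>

static irqreturn_t my_irq_handler(int irq, void *data)
{
    struct blk_mq_hw_ctx *hctx = data;
    unsigned int tag;

    while (my_read_completed_tag(hctx->driver_data, &tag)) {
        /* Map the device-reported tag back to the request. */
        struct request *rq = blk_mq_tag_to_rq(hctx->tags, tag);

        if (rq)
            blk_mq_complete_request(rq);
    }
    return IRQ_HANDLED;
}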
Further reading
Source code documentation
- enum blk_eh_timer_return
How the timeout handler should proceed
Constants
BLK_EH_DONE: The block driver completed the command or will complete it at a later time.
BLK_EH_RESET_TIMER: Reset the request timer and continue waiting for the request to complete.
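A hedged sketch of a timeout handler built on these constants; my_device_abort() and struct my_dev are hypothetical driver pieces:

#include <linux/blk-mq.h>

static enum blk_eh_timer_return my_timeout(struct request *rq)
{
    struct my_dev *dev = rq->q->queuedata;    /* hypothetical per-device data */

    /* If the command can be aborted, finish the request ourselves. */
    if (my_device_abort(dev, blk_mq_rq_to_pdu(rq)) == 0) {
        blk_mq_end_request(rq, BLK_STS_TIMEOUT);
        return BLK_EH_DONE;
    }

    /* The device is still working on it; wait another timeout period. */
    return BLK_EH_RESET_TIMER;
}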
- struct blk_mq_hw_ctx
State for a hardware queue facing the hardware block device
Definition:
struct blk_mq_hw_ctx {
    struct {
        spinlock_t lock;
        struct list_head dispatch;
        unsigned long state;
    };
    struct delayed_work run_work;
    cpumask_var_t cpumask;
    int next_cpu;
    int next_cpu_batch;
    unsigned long flags;
    void *sched_data;
    struct request_queue *queue;
    struct blk_flush_queue *fq;
    void *driver_data;
    struct sbitmap ctx_map;
    struct blk_mq_ctx *dispatch_from;
    unsigned int dispatch_busy;
    unsigned short type;
    unsigned short nr_ctx;
    struct blk_mq_ctx **ctxs;
    spinlock_t dispatch_wait_lock;
    wait_queue_entry_t dispatch_wait;
    atomic_t wait_index;
    struct blk_mq_tags *tags;
    struct blk_mq_tags *sched_tags;
    unsigned int numa_node;
    unsigned int queue_num;
    atomic_t nr_active;
    struct hlist_node cpuhp_online;
    struct hlist_node cpuhp_dead;
    struct kobject kobj;
#ifdef CONFIG_BLK_DEBUG_FS
    struct dentry *debugfs_dir;
    struct dentry *sched_debugfs_dir;
#endif
    struct list_head hctx_list;
};

Members
{unnamed_struct}: anonymous
lock: Protects the dispatch list.
dispatch: Used for requests that are ready to be dispatched to the hardware but for some reason (e.g. lack of resources) could not be sent to the hardware. As soon as the driver can send new requests, requests at this list will be sent first for a fairer dispatch.
state: BLK_MQ_S_* flags. Defines the state of the hw queue (active, scheduled to restart, stopped).
run_work: Used for scheduling a hardware queue run at a later time.
cpumask: Map of available CPUs where this hctx can run.
next_cpu: Used by blk_mq_hctx_next_cpu() for round-robin CPU selection from cpumask.
next_cpu_batch: Counter of how many works left in the batch before changing to the next CPU.
flags: BLK_MQ_F_* flags. Defines the behaviour of the queue.
sched_data: Pointer owned by the IO scheduler attached to a request queue. It's up to the IO scheduler how to use this pointer.
queue: Pointer to the request queue that owns this hardware context.
fq: Queue of requests that need to perform a flush operation.
driver_data: Pointer to data owned by the block driver that created this hctx.
ctx_map: Bitmap for each software queue. If bit is on, there is a pending request in that software queue.
dispatch_from: Software queue to be used when no scheduler was selected.
dispatch_busy: Number used by blk_mq_update_dispatch_busy() to decide if the hw_queue is busy using Exponential Weighted Moving Average algorithm.
type: HCTX_TYPE_* flags. Type of hardware queue.
nr_ctx: Number of software queues.
ctxs: Array of software queues.
dispatch_wait_lock: Lock for dispatch_wait queue.
dispatch_wait: Waitqueue to put requests when there is no tag available at the moment, to wait for another try in the future.
wait_index: Index of next available dispatch_wait queue to insert requests.
tags: Tags owned by the block driver. A tag at this set is only assigned when a request is dispatched from a hardware queue.
sched_tags: Tags owned by I/O scheduler. If there is an I/O scheduler associated with a request queue, a tag is assigned when that request is allocated. Else, this member is not used.
numa_node: NUMA node the storage adapter has been connected to.
queue_num: Index of this hardware queue.
nr_active: Number of active requests. Only used when a tag set is shared across request queues.
cpuhp_online: List to store requests if the CPU is going to die.
cpuhp_dead: List to store requests if some CPU dies.
kobj: Kernel object for sysfs.
debugfs_dir: debugfs directory for this hardware queue. Named as cpu<cpu_number>.
sched_debugfs_dir: debugfs directory for the scheduler.
hctx_list: If this hctx is not in use, this is an entry in q->unused_hctx_list.
- struct blk_mq_queue_map
Map software queues to hardware queues
Definition:
struct blk_mq_queue_map {
    unsigned int *mq_map;
    unsigned int nr_queues;
    unsigned int queue_offset;
};

Members
mq_map: CPU ID to hardware queue index map. This is an array with nr_cpu_ids elements. Each element has a value in the range [queue_offset, queue_offset + nr_queues).
nr_queues: Number of hardware queues to map CPU IDs onto.
queue_offset: First hardware queue to map onto. Used by the PCIe NVMe driver to map each hardware queue type (enum hctx_type) onto a distinct set of hardware queues.
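As an illustration, a driver with a single hardware queue type might fill in its map in a ->map_queues callback like the sketch below (patterned after simple in-tree drivers, not copied from one); blk_mq_map_queues() then spreads the CPU IDs over the configured range:

#include <linux/blk-mq.h>

static void my_map_queues(struct blk_mq_tag_set *set)
{
    struct blk_mq_queue_map *qmap = &set->map[HCTX_TYPE_DEFAULT];

    qmap->nr_queues = set->nr_hw_queues;
    qmap->queue_offset = 0;     /* first (and only) group of hardware queues */
    blk_mq_map_queues(qmap);    /* default CPU-to-hctx spreading */
}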
- enum hctx_type
Type of hardware queue
Constants
HCTX_TYPE_DEFAULT: All I/O not otherwise accounted for.
HCTX_TYPE_READ: Just for READ I/O.
HCTX_TYPE_POLL: Polled I/O of any kind.
HCTX_MAX_TYPES: Number of types of hctx.
- struct blk_mq_tag_set
tag set that can be shared between request queues
Definition:
struct blk_mq_tag_set {
    const struct blk_mq_ops *ops;
    struct blk_mq_queue_map map[HCTX_MAX_TYPES];
    unsigned int nr_maps;
    unsigned int nr_hw_queues;
    unsigned int queue_depth;
    unsigned int reserved_tags;
    unsigned int cmd_size;
    int numa_node;
    unsigned int timeout;
    unsigned int flags;
    void *driver_data;
    struct blk_mq_tags **tags;
    struct blk_mq_tags *shared_tags;
    struct mutex tag_list_lock;
    struct list_head tag_list;
    struct srcu_struct *srcu;
    struct srcu_struct tags_srcu;
    struct rw_semaphore update_nr_hwq_lock;
};

Members
ops: Pointers to functions that implement block driver behavior.
map: One or more ctx -> hctx mappings. One map exists for each hardware queue type (enum hctx_type) that the driver wishes to support. There are no restrictions on maps being of the same size, and it's perfectly legal to share maps between types.
nr_maps: Number of elements in the map array. A number in the range [1, HCTX_MAX_TYPES].
nr_hw_queues: Number of hardware queues supported by the block driver that owns this data structure.
queue_depth: Number of tags per hardware queue, reserved tags included.
reserved_tags: Number of tags to set aside for BLK_MQ_REQ_RESERVED tag allocations.
cmd_size: Number of additional bytes to allocate per request. The block driver owns these additional bytes.
numa_node: NUMA node the storage adapter has been connected to.
timeout: Request processing timeout in jiffies.
flags: Zero or more BLK_MQ_F_* flags.
driver_data: Pointer to data owned by the block driver that created this tag set.
tags: Tag sets. One tag set per hardware queue. Has nr_hw_queues elements.
shared_tags: Shared set of tags. Has nr_hw_queues elements. If set, shared by all tags.
tag_list_lock: Serializes tag_list accesses.
tag_list: List of the request queues that use this tag set. See also request_queue.tag_set_list.
srcu: Use as lock when type of the request queue is blocking (BLK_MQ_F_BLOCKING).
tags_srcu: SRCU used to defer freeing of tags page_list to prevent use-after-free when iterating tags.
update_nr_hwq_lock: Synchronize updating nr_hw_queues with add/del disk & switching elevator.
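A minimal sketch of filling in and allocating a tag set for a single-queue driver; struct my_dev, struct my_cmd and my_mq_ops are hypothetical, the rest is the usual blk_mq_alloc_tag_set() pattern:

#include <linux/blk-mq.h>

static int my_init_tag_set(struct my_dev *dev)
{
    struct blk_mq_tag_set *set = &dev->tag_set;

    memset(set, 0, sizeof(*set));
    set->ops = &my_mq_ops;                  /* struct blk_mq_ops, see below */
    set->nr_hw_queues = 1;
    set->queue_depth = 128;
    set->numa_node = NUMA_NO_NODE;
    set->cmd_size = sizeof(struct my_cmd);  /* per-request driver PDU */

    /* Validates the set and allocates tags; undone with blk_mq_free_tag_set(). */
    return blk_mq_alloc_tag_set(set);
}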
- struct blk_mq_queue_data
Data about a request inserted in a queue
Definition:
struct blk_mq_queue_data {
    struct request *rq;
    bool last;
};

Members
rq: Request pointer.
last: If it is the last request in the queue.
- struct blk_mq_ops
Callback functions that implement block driver behaviour.
Definition:
struct blk_mq_ops {
    blk_status_t (*queue_rq)(struct blk_mq_hw_ctx *, const struct blk_mq_queue_data *);
    void (*commit_rqs)(struct blk_mq_hw_ctx *);
    void (*queue_rqs)(struct rq_list *rqlist);
    int (*get_budget)(struct request_queue *);
    void (*put_budget)(struct request_queue *, int);
    void (*set_rq_budget_token)(struct request *, int);
    int (*get_rq_budget_token)(struct request *);
    enum blk_eh_timer_return (*timeout)(struct request *);
    int (*poll)(struct blk_mq_hw_ctx *, struct io_comp_batch *);
    void (*complete)(struct request *);
    int (*init_hctx)(struct blk_mq_hw_ctx *, void *, unsigned int);
    void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
    int (*init_request)(struct blk_mq_tag_set *set, struct request *, unsigned int, unsigned int);
    void (*exit_request)(struct blk_mq_tag_set *set, struct request *, unsigned int);
    void (*cleanup_rq)(struct request *);
    bool (*busy)(struct request_queue *);
    void (*map_queues)(struct blk_mq_tag_set *set);
#ifdef CONFIG_BLK_DEBUG_FS
    void (*show_rq)(struct seq_file *m, struct request *rq);
#endif
};

Members
queue_rq: Queue a new request from block IO.
commit_rqs: If a driver uses bd->last to judge when to submit requests to hardware, it must define this function. In case of errors that make us stop issuing further requests, this hook serves the purpose of kicking the hardware (which the last request otherwise would have done).
queue_rqs: Queue a list of new requests. Driver is guaranteed that each request belongs to the same queue. If the driver doesn't empty the rqlist completely, then the rest will be queued individually by the block layer upon return.
get_budget: Reserve budget before queueing a request; once .queue_rq is run, it is the driver's responsibility to release the reserved budget. The failure case of .get_budget also has to be handled to avoid I/O deadlock.
put_budget: Release the reserved budget.
set_rq_budget_token: Store rq's budget token.
get_rq_budget_token: Retrieve rq's budget token.
timeout: Called on request timeout.
poll: Called to poll for completion of a specific tag.
complete: Mark the request as complete.
init_hctx: Called when the block layer side of a hardware queue has been set up, allowing the driver to allocate/init matching structures.
exit_hctx: Ditto for exit/teardown.
init_request: Called for every command allocated by the block layer to allow the driver to set up driver specific data. A tag greater than or equal to queue_depth is for setting up a flush request.
exit_request: Ditto for exit/teardown.
cleanup_rq: Called before freeing one request which isn't completed yet, and usually for freeing the driver private data.
busy: If set, returns whether or not this queue currently is busy.
map_queues: This allows drivers to specify their own queue mapping by overriding the setup-time function that builds the mq_map.
show_rq: Used by the debugfs implementation to show driver-specific information about a request.
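A hedged sketch of a minimal ops table and queue_rq implementation; my_dev_can_queue(), my_dev_submit(), my_complete_rq and my_timeout are hypothetical driver helpers (the timeout sketch appears earlier in this document):

#include <linux/blk-mq.h>

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                const struct blk_mq_queue_data *bd)
{
    struct request *rq = bd->rq;

    if (!my_dev_can_queue(hctx->driver_data))
        return BLK_STS_RESOURCE;    /* hardware busy: blk-mq will retry later */

    blk_mq_start_request(rq);       /* mark in-flight, arm the timeout timer */
    my_dev_submit(hctx->driver_data, rq, bd->last);
    return BLK_STS_OK;              /* completion will be reported later, by tag */
}

static const struct blk_mq_ops my_mq_ops = {
    .queue_rq = my_queue_rq,
    .complete = my_complete_rq,     /* hypothetical completion callback */
    .timeout  = my_timeout,
};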
- enum mq_rq_state blk_mq_rq_state(struct request *rq)
read the current MQ_RQ_* state of a request
Parameters
struct request *rq: target request.
- bool blk_mq_add_to_batch(struct request *req, struct io_comp_batch *iob, bool is_error, void (*complete)(struct io_comp_batch *))
add a request to the completion batch
Parameters
struct request *req: The request to add to batch
struct io_comp_batch *iob: The batch to add the request
bool is_error: Specify true if the request failed with an error
void (*complete)(struct io_comp_batch *): The completion handler for the request
Description
Batched completions only work when there is no I/O error and no special ->end_io handler.
Return
true when the request was added to the batch, otherwise false
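A sketch of how a polling driver might batch completions; my_poll_one() and my_rq_failed() are hypothetical helpers, and the batch is finished with blk_mq_end_request_batch():

#include <linux/blk-mq.h>

static void my_complete_batch(struct io_comp_batch *iob)
{
    blk_mq_end_request_batch(iob);  /* end every request gathered in the batch */
}

static int my_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
{
    struct request *rq;
    int found = 0;

    while ((rq = my_poll_one(hctx->driver_data))) {
        found++;
        /* Fall back to individual completion when batching is not possible. */
        if (!blk_mq_add_to_batch(rq, iob, my_rq_failed(rq), my_complete_batch))
            blk_mq_complete_request(rq);
    }
    return found;
}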
- struct request *blk_mq_rq_from_pdu(void *pdu)
cast a PDU to a request
Parameters
void *pdu: the PDU (Protocol Data Unit) to be cast
Return
request
Description
Driver command data is immediately after the request. So subtract request size to get back to the original request.
- void *blk_mq_rq_to_pdu(struct request *rq)
cast a request to a PDU
Parameters
struct request *rq: the request to be cast
Return
pointer to the PDU
Description
Driver command data is immediately after the request. So add the request size to get the PDU.
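A short sketch of the request/PDU relationship; struct my_cmd is the hypothetical per-request data whose size is given in blk_mq_tag_set.cmd_size:

#include <linux/blk-mq.h>

struct my_cmd {                 /* hypothetical driver PDU */
    u32 opcode;
    u64 lba;
};

static void my_prepare(struct request *rq)
{
    struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);      /* request -> driver data */

    cmd->opcode = req_op(rq);
    cmd->lba = blk_rq_pos(rq);
}

static void my_finish(struct my_cmd *cmd)
{
    struct request *rq = blk_mq_rq_from_pdu(cmd);   /* driver data -> request */

    blk_mq_end_request(rq, BLK_STS_OK);
}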
- unsigned int blk_rq_nr_bvec(struct request *rq)
return number of bvecs in a request
Parameters
struct request *rq: request to calculate bvecs for
Description
Returns the number of bvecs.
- void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set)
wait until in-progress quiesce is done
Parameters
struct blk_mq_tag_set *set: tag_set to wait on
Note
it is the driver's responsibility for making sure that quiesce has been started on one or more of the request_queues of the tag_set. This function only waits for the quiesce on those request_queues that had the quiesce flag set using blk_mq_quiesce_queue_nowait.
- void blk_mq_quiesce_queue(struct request_queue *q)
wait until all ongoing dispatches have finished
Parameters
struct request_queue *q: request queue.
Note
this function does not prevent that the struct request end_io() callback function is invoked. Once this function is returned, we make sure no dispatch can happen until the queue is unquiesced via blk_mq_unquiesce_queue().
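A typical (sketched) use is to pause dispatch around device reconfiguration; my_reconfigure() and struct my_dev are hypothetical:

#include <linux/blk-mq.h>

static void my_pause_and_reconfigure(struct my_dev *dev)
{
    blk_mq_quiesce_queue(dev->queue);   /* wait for ongoing dispatches to finish */
    my_reconfigure(dev);                /* no ->queue_rq calls can happen here */
    blk_mq_unquiesce_queue(dev->queue); /* dispatching may resume */
}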
- bool blk_update_request(struct request *req, blk_status_t error, unsigned int nr_bytes)
Complete multiple bytes without completing the request
Parameters
struct request *req: the request being processed
blk_status_t error: block status code
unsigned int nr_bytes: number of bytes to complete for req
Description
Ends I/O on a number of bytes attached to req, but doesn't complete the request structure even if req doesn't have leftover. If req has leftover, sets it up for the next range of segments.

Passing the result of blk_rq_bytes() as nr_bytes guarantees a false return from this function.
Note
The RQF_SPECIAL_PAYLOAD flag is ignored on purpose in this function except in the consistency check at the end of this function.
Return
false - this request doesn't have any more data
true - this request has more data
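A hedged sketch of the partial-completion pattern: complete a number of bytes and end the request only when nothing is left. __blk_mq_end_request() is used because blk_update_request() has already accounted the completed bytes; my_complete_bytes() itself is hypothetical:

#include <linux/blk-mq.h>

static void my_complete_bytes(struct request *rq, blk_status_t error,
                              unsigned int bytes)
{
    if (blk_update_request(rq, error, bytes))
        return;     /* more segments remain; keep the request alive */

    __blk_mq_end_request(rq, error);
}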
- void blk_mq_complete_request(struct request *rq)
end I/O on a request
Parameters
struct request *rq: the request being processed
Description
Complete a request by scheduling the ->complete_rq operation.
- void blk_mq_start_request(struct request *rq)
Start processing a request
Parameters
struct request *rq: Pointer to request to be started
Description
Function used by device drivers to notify the block layer that a request is going to be processed now, so blk layer can do proper initializations such as starting the timeout timer.
- void blk_execute_rq_nowait(struct request *rq, bool at_head)
insert a request to I/O scheduler for execution
Parameters
struct request *rq: request to insert
bool at_head: insert request at head or tail of queue
Description
Insert a fully prepared request at the back of the I/O scheduler queue for execution. Don't wait for completion.
Note
This function will invoke done directly if the queue is dead.
- blk_status_t blk_execute_rq(struct request *rq, bool at_head)
insert a request into queue for execution
Parameters
struct request *rq: request to insert
bool at_head: insert request at head or tail of queue
Description
Insert a fully prepared request at the back of the I/O scheduler queue for execution and wait for completion.
Return
The blk_status_t result provided to blk_mq_end_request().
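A sketch of the common pattern for issuing a driver-internal command and waiting for it; my_fill_internal_cmd() is hypothetical, the rest follows the blk_mq_alloc_request()/blk_execute_rq()/blk_mq_free_request() sequence:

#include <linux/blk-mq.h>

static blk_status_t my_send_internal_cmd(struct request_queue *q)
{
    struct request *rq;
    blk_status_t status;

    rq = blk_mq_alloc_request(q, REQ_OP_DRV_IN, 0);
    if (IS_ERR(rq))
        return errno_to_blk_status(PTR_ERR(rq));

    my_fill_internal_cmd(blk_mq_rq_to_pdu(rq));     /* set up the driver PDU */

    status = blk_execute_rq(rq, false);             /* insert at tail and wait */
    blk_mq_free_request(rq);
    return status;
}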
- void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
Run a hardware queue asynchronously.
Parameters
struct blk_mq_hw_ctx *hctx: Pointer to the hardware queue to run.
unsigned long msecs: Milliseconds of delay to wait before running the queue.
Description
Run a hardware queue asynchronously with a delay of msecs.
- void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
Start to run a hardware queue.
Parameters
struct blk_mq_hw_ctx *hctx: Pointer to the hardware queue to run.
bool async: If we want to run the queue asynchronously.
Description
Check if the request queue is not in a quiesced state and if there are pending requests to be sent. If this is true, run the queue to send requests to hardware.
- void blk_mq_run_hw_queues(struct request_queue *q, bool async)
Run all hardware queues in a request queue.
Parameters
struct request_queue *q: Pointer to the request queue to run.
bool async: If we want to run the queue asynchronously.
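For example (a sketch, with a hypothetical helper and device struct), a driver that had to return BLK_STS_RESOURCE from ->queue_rq typically re-runs the queues once the hardware frees resources, so requests parked on hctx->dispatch get another chance:

#include <linux/blk-mq.h>

static void my_resources_available(struct my_dev *dev)
{
    blk_mq_run_hw_queues(dev->queue, true);     /* async: do not run in this context */
}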
- void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
Run all hardware queues asynchronously.
Parameters
struct request_queue *q: Pointer to the request queue to run.
unsigned long msecs: Milliseconds of delay to wait before running the queues.
- void blk_mq_request_bypass_insert(struct request *rq, blk_insert_t flags)
Insert a request at dispatch list.
Parameters
struct request *rq: Pointer to request to be inserted.
blk_insert_t flags: BLK_MQ_INSERT_*
Description
Should only be used carefully, when the caller knows we want to bypass a potential IO scheduler on the target device.
- void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, struct request *rq)
Try to send a request directly to device driver.
Parameters
struct blk_mq_hw_ctx *hctx: Pointer of the associated hardware queue.
struct request *rq: Pointer to request to be sent.
Description
If the device has enough resources to accept a new request now, send the request directly to device driver. Else, insert at hctx->dispatch queue, so we can try send it another time in the future. Requests inserted at this queue have higher priority.
- void blk_mq_submit_bio(struct bio *bio)
Create and send a request to block device.
Parameters
struct bio *bio: Bio pointer.
Description
Builds up a request structure from q and bio and send to the device. The request may not be queued directly to hardware if:
* This request can be merged with another one
* We want to place request at plug queue for possible future merging
* There is an IO scheduler active at this queue
It will not queue the request if there is an error with the bio, or at the request creation.
- blk_status_t blk_insert_cloned_request(struct request *rq)
Helper for stacking drivers to submit a request
Parameters
struct request *rq: the request being queued
- void blk_rq_unprep_clone(struct request *rq)
Helper function to free all bios in a cloned request
Parameters
struct request *rq: the clone request to be cleaned up
Description
Free all bios in rq for a cloned request.
- int blk_rq_prep_clone(struct request *rq, struct request *rq_src, struct bio_set *bs, gfp_t gfp_mask, int (*bio_ctr)(struct bio *, struct bio *, void *), void *data)
Helper function to setup clone request
Parameters
struct request *rq: the request to be setup
struct request *rq_src: original request to be cloned
struct bio_set *bs: bio_set that bios for clone are allocated from
gfp_t gfp_mask: memory allocation mask for bio
int (*bio_ctr)(struct bio *, struct bio *, void *): setup function to be called for each clone bio. Returns 0 for success, non-0 for failure.
void *data: private data to be passed to bio_ctr
Description
Clones bios in rq_src to rq, and copies attributes of rq_src to rq. Also, pages which the original bios are pointing to are not copied and the cloned bios just point to the same pages. So cloned bios must be completed before original bios, which means the caller must complete rq before rq_src.
- void blk_mq_destroy_queue(struct request_queue *q)
shutdown a request queue
Parameters
struct request_queue *q: request queue to shutdown
Description
This shuts down a request queue allocated by blk_mq_alloc_queue(). All future requests will be failed with -ENODEV. The caller is responsible for dropping the reference from blk_mq_alloc_queue() by calling blk_put_queue().
Context
can sleep