DMA and swiotlb
swiotlb is a memory buffer allocator used by the Linux kernel DMA layer. It is typically used when a device doing DMA can’t directly access the target memory buffer because of hardware limitations or other requirements. In such a case, the DMA layer calls swiotlb to allocate a temporary memory buffer that conforms to the limitations. The DMA is done to/from this temporary memory buffer, and the CPU copies the data between the temporary buffer and the original target memory buffer. This approach is generically called “bounce buffering”, and the temporary memory buffer is called a “bounce buffer”.
Device drivers don’t interact directly with swiotlb. Instead, drivers inform the DMA layer of the DMA attributes of the devices they are managing, and use the normal DMA map, unmap, and sync APIs when programming a device to do DMA. These APIs use the device DMA attributes and kernel-wide settings to determine if bounce buffering is necessary. If so, the DMA layer manages the allocation, freeing, and sync’ing of bounce buffers. Since the DMA attributes are per device, some devices in a system may use bounce buffering while others do not.
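For illustration, here is a minimal sketch of the driver’s side of this contract, with a hypothetical device and receive buffer; dma_map_single() and dma_unmap_single() are the standard DMA APIs, and the driver neither knows nor cares whether swiotlb bounce buffering happens underneath:

    #include <linux/dma-mapping.h>

    /* Hypothetical driver fragment: map a buffer for a device-to-memory
     * transfer. If the device's DMA attributes require bounce buffering,
     * the DMA layer handles it transparently via swiotlb.
     */
    static int example_start_rx(struct device *dev, void *buf, size_t len)
    {
            dma_addr_t dma_addr;

            dma_addr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
            if (dma_mapping_error(dev, dma_addr))
                    return -ENOMEM;

            /* ... program dma_addr into the device and start the I/O ... */

            /* On completion, unmap; if a bounce buffer was used, this
             * copies the received data back into buf.
             */
            dma_unmap_single(dev, dma_addr, len, DMA_FROM_DEVICE);
            return 0;
    }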
Because the CPU copies data between the bounce buffer and the original target memory buffer, doing bounce buffering is slower than doing DMA directly to the original memory buffer, and it consumes more CPU resources. So it is used only when necessary for providing DMA functionality.
Usage Scenarios
swiotlb was originally created to handle DMA for devices with addressing limitations. As physical memory sizes grew beyond 4 GiB, some devices could only provide 32-bit DMA addresses. By allocating bounce buffer memory below the 4 GiB line, these devices with addressing limitations could still work and do DMA.
More recently, Confidential Computing (CoCo) VMs have the guest VM’s memory encrypted by default, and the memory is not accessible by the host hypervisor and VMM. For the host to do I/O on behalf of the guest, the I/O must be directed to guest memory that is unencrypted. CoCo VMs set a kernel-wide option to force all DMA I/O to use bounce buffers, and the bounce buffer memory is set up as unencrypted. The host does DMA I/O to/from the bounce buffer memory, and the Linux kernel DMA layer does “sync” operations to cause the CPU to copy the data to/from the original target memory buffer. The CPU copying bridges between the unencrypted and the encrypted memory. This use of bounce buffers allows device drivers to “just work” in a CoCo VM, with no modifications needed to handle the memory encryption complexity.
Other edge case scenarios arise for bounce buffers. For example, when IOMMU mappings are set up for a DMA operation to/from a device that is considered “untrusted”, the device should be given access only to the memory containing the data being transferred. But if that memory occupies only part of an IOMMU granule, other parts of the granule may contain unrelated kernel data. Since IOMMU access control is per-granule, the untrusted device can gain access to the unrelated kernel data. This problem is solved by bounce buffering the DMA operation and ensuring that unused portions of the bounce buffers do not contain any unrelated kernel data.
Core Functionality
The primary swiotlb APIs are swiotlb_tbl_map_single() and swiotlb_tbl_unmap_single(). The “map” API allocates a bounce buffer of a specified size in bytes and returns the physical address of the buffer. The buffer memory is physically contiguous. The expectation is that the DMA layer maps the physical memory address to a DMA address, and returns the DMA address to the driver for programming into the device. If a DMA operation specifies multiple memory buffer segments, a separate bounce buffer must be allocated for each segment. swiotlb_tbl_map_single() always does a “sync” operation (i.e., a CPU copy) to initialize the bounce buffer to match the contents of the original buffer.
swiotlb_tbl_unmap_single() does the reverse. If the DMA operation might have updated the bounce buffer memory and DMA_ATTR_SKIP_CPU_SYNC is not set, the unmap does a “sync” operation to cause a CPU copy of the data from the bounce buffer back to the original buffer. Then the bounce buffer memory is freed.
swiotlb also provides “sync” APIs that correspond to the dma_sync_*() APIs that a driver may use when control of a buffer transitions between the CPU and the device. The swiotlb “sync” APIs cause a CPU copy of the data between the original buffer and the bounce buffer. Like the dma_sync_*() APIs, the swiotlb “sync” APIs support doing a partial sync, where only a subset of the bounce buffer is copied to/from the original buffer.
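As a sketch of how this looks from a driver (again with hypothetical names), a long-lived mapping can be handed back and forth with the dma_sync_*() APIs; when bounce buffering is in effect, each call below becomes a CPU copy of only the synced range:

    #include <linux/dma-mapping.h>

    /* Hypothetical fragment: 'dma_addr' and 'buf' come from an earlier
     * dma_map_single() call that is kept across many transfers.
     */
    static void example_rx_complete(struct device *dev, dma_addr_t dma_addr,
                                    void *buf, size_t pkt_len)
    {
            /* Pass ownership to the CPU: with swiotlb, this copies
             * pkt_len bytes from the bounce buffer to buf (a partial
             * sync if pkt_len is smaller than the mapping).
             */
            dma_sync_single_for_cpu(dev, dma_addr, pkt_len, DMA_FROM_DEVICE);

            /* ... examine the received data in buf ... */

            /* Return ownership to the device for the next transfer. */
            dma_sync_single_for_device(dev, dma_addr, pkt_len, DMA_FROM_DEVICE);
    }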
Core Functionality Constraints
The swiotlb map/unmap/sync APIs must operate without blocking, as they are called by the corresponding DMA APIs which may run in contexts that cannot block. Hence the default memory pool for swiotlb allocations must be pre-allocated at boot time (but see Dynamic swiotlb below). Because swiotlb allocations must be physically contiguous, the entire default memory pool is allocated as a single contiguous block.
The need to pre-allocate the default swiotlb pool creates a boot-time tradeoff. The pool should be large enough to ensure that bounce buffer requests can always be satisfied, as the non-blocking requirement means requests can’t wait for space to become available. But a large pool potentially wastes memory, as this pre-allocated memory is not available for other uses in the system. The tradeoff is particularly acute in CoCo VMs that use bounce buffers for all DMA I/O. These VMs use a heuristic to set the default pool size to ~6% of memory, with a max of 1 GiB, which has the potential to be very wasteful of memory. Conversely, the heuristic might produce a size that is insufficient, depending on the I/O patterns of the workload in the VM. The dynamic swiotlb feature described below can help, but has limitations. Better management of the swiotlb default memory pool size remains an open issue.
A single allocation from swiotlb is limited to IO_TLB_SIZE * IO_TLB_SEGSIZE bytes, which is 256 KiB with current definitions. When a device’s DMA settings are such that the device might use swiotlb, the maximum size of a DMA segment must be limited to that 256 KiB. This value is communicated to higher-level kernel code via dma_max_mapping_size() and swiotlb_max_mapping_size(). If the higher-level code fails to account for this limit, it may make requests that are too large for swiotlb, and get a “swiotlb full” error.
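The 256 KiB figure follows directly from the current constant definitions in include/linux/swiotlb.h:

    #define IO_TLB_SHIFT    11                      /* slot size: 1 << 11 = 2 KiB */
    #define IO_TLB_SIZE     (1 << IO_TLB_SHIFT)
    #define IO_TLB_SEGSIZE  128                     /* slots per slot set */

    /* Maximum single allocation: 2 KiB * 128 = 256 KiB */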
A key device DMA setting is “min_align_mask”, which is a power of 2 minus 1 so that some number of low order bits are set, or it may be zero. swiotlb allocations ensure these min_align_mask bits of the physical address of the bounce buffer match the same bits in the address of the original buffer. When min_align_mask is non-zero, it may produce an “alignment offset” in the address of the bounce buffer that slightly reduces the maximum size of an allocation. This potential alignment offset is reflected in the value returned by swiotlb_max_mapping_size(), which can show up in places like /sys/block/<device>/queue/max_sectors_kb. For example, if a device does not use swiotlb, max_sectors_kb might be 512 KiB or larger. If a device might use swiotlb, max_sectors_kb will be 256 KiB. When min_align_mask is non-zero, max_sectors_kb might be even smaller, such as 252 KiB.
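To make the 252 KiB example concrete, here is a sketch of the arithmetic behind swiotlb_max_mapping_size(), assuming a device that sets min_align_mask to 4 KiB - 1 (e.g., via dma_set_min_align_mask(dev, 4096 - 1)):

    unsigned int min_align_mask = 4096 - 1;         /* 0xFFF */

    /* The possible alignment offset consumes whole slots, so round up
     * to a slot boundary: roundup(0xFFF, 2 KiB) = 4 KiB.
     */
    size_t min_align = roundup(min_align_mask, IO_TLB_SIZE);

    /* 256 KiB - 4 KiB = 252 KiB maximum mapping size, which block
     * layer code then reports as max_sectors_kb = 252.
     */
    size_t max_size = (size_t)IO_TLB_SIZE * IO_TLB_SEGSIZE - min_align;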
swiotlb_tbl_map_single() also takes an “alloc_align_mask” parameter. This parameter specifies that the allocation of bounce buffer space must start at a physical address with the alloc_align_mask bits set to zero. But the actual bounce buffer might start at a larger address if min_align_mask is non-zero. Hence there may be pre-padding space that is allocated prior to the start of the bounce buffer. Similarly, the end of the bounce buffer is rounded up to an alloc_align_mask boundary, potentially resulting in post-padding space. Any pre-padding or post-padding space is not initialized by swiotlb code. The “alloc_align_mask” parameter is used by IOMMU code when mapping for untrusted devices. It is set to the granule size - 1 so that the bounce buffer is allocated entirely from granules that are not used for any other purpose.
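An assumed example layout: with a 4 KiB IOMMU granule (alloc_align_mask = 0xFFF) and an original buffer whose min_align_mask bits are 0x800, the allocated space would look like this:

    /*
     * allocation start: granule-aligned (low 12 bits == 0x000)
     *
     * |<- pre-padding ->|<----- bounce buffer ----->|<- post-padding ->|
     * 0x...000          0x...800                    end rounded up to a
     *                   (low bits match the         granule boundary
     *                   original buffer's
     *                   min_align_mask bits)
     *
     * The padding lies in granules used only for this bounce buffer,
     * so an untrusted device is never given access to unrelated data.
     */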
Data structure concepts
Memory used for swiotlb bounce buffers is allocated from overall system memory as one or more “pools”. The default pool is allocated during system boot with a default size of 64 MiB. The default pool size may be modified with the “swiotlb=” kernel boot line parameter. The default size may also be adjusted due to other conditions, such as running in a CoCo VM, as described above. If CONFIG_SWIOTLB_DYNAMIC is enabled, additional pools may be allocated later in the life of the system. Each pool must be a contiguous range of physical memory. The default pool is allocated below the 4 GiB physical address line so it works for devices that can only address 32 bits of physical memory (unless architecture-specific code provides the SWIOTLB_ANY flag). In a CoCo VM, the pool memory must be decrypted before swiotlb is used.
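For example, an illustrative boot line (the first integer is the number of slots, the second the number of areas, both described below):

    swiotlb=65536,4

This requests 65536 slots × 2 KiB = 128 MiB of pool space, divided into 4 areas.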
Each pool is divided into “slots” of size IO_TLB_SIZE, which is 2 KiB with current definitions. IO_TLB_SEGSIZE contiguous slots (128 slots) constitute what might be called a “slot set”. When a bounce buffer is allocated, it occupies one or more contiguous slots. A slot is never shared by multiple bounce buffers. Furthermore, a bounce buffer must be allocated from a single slot set, which leads to the maximum bounce buffer size being IO_TLB_SIZE * IO_TLB_SEGSIZE. Multiple smaller bounce buffers may co-exist in a single slot set if the alignment and size constraints can be met.
Slots are also grouped into “areas”, with the constraint that a slot set exists entirely in a single area. Each area has its own spin lock that must be held to manipulate the slots in that area. The division into areas avoids contending for a single global spin lock when swiotlb is heavily used, such as in a CoCo VM. The number of areas defaults to the number of CPUs in the system for maximum parallelism, but since an area can’t be smaller than IO_TLB_SEGSIZE slots, it might be necessary to assign multiple CPUs to the same area. The number of areas can also be set via the “swiotlb=” kernel boot parameter.
When allocating a bounce buffer, if the area associated with the calling CPU does not have enough free space, areas associated with other CPUs are tried sequentially. For each area tried, the area’s spin lock must be obtained before trying an allocation, so contention may occur if swiotlb is relatively busy overall. But an allocation request fails only if no area has enough free space.
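In outline, the search logic looks roughly like the following sketch (simplified from kernel/dma/swiotlb.c; swiotlb_search_pool_area() stands for the per-area slot search done under that area’s lock):

    /* Try the area for the current CPU first, then the other areas
     * round-robin. Returns a slot index, or -1 if every area is full.
     * Relies on pool->nareas being a power of 2 (see below).
     */
    static int find_slots_sketch(struct device *dev, struct io_tlb_pool *pool,
                                 phys_addr_t orig_addr, size_t alloc_size,
                                 unsigned int alloc_align_mask)
    {
            int start = raw_smp_processor_id() & (pool->nareas - 1);
            int i = start;

            do {
                    int index = swiotlb_search_pool_area(dev, pool, i,
                                            orig_addr, alloc_size,
                                            alloc_align_mask);
                    if (index >= 0)
                            return index;
                    if (++i >= pool->nareas)
                            i = 0;
            } while (i != start);

            return -1;      /* no area had enough free space */
    }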
IO_TLB_SIZE, IO_TLB_SEGSIZE, and the number of areas must all be powers of 2, as the code uses shifting and bit masking to do many of the calculations. The number of areas is rounded up to a power of 2 if necessary to meet this requirement.
The default pool is allocated with PAGE_SIZE alignment. If an alloc_align_mask argument to swiotlb_tbl_map_single() specifies a larger alignment, one or more initial slots in each slot set might not meet the alloc_align_mask criterion. Because a bounce buffer allocation can’t cross a slot set boundary, eliminating those initial slots effectively reduces the max size of a bounce buffer. Currently, there’s no problem because alloc_align_mask is set based on IOMMU granule size, and granules cannot be larger than PAGE_SIZE. But if that were to change in the future, the initial pool allocation might need to be done with alignment larger than PAGE_SIZE.
Dynamic swiotlb
When CONFIG_SWIOTLB_DYNAMIC is enabled, swiotlb can do on-demand expansion of the amount of memory available for allocation as bounce buffers. If a bounce buffer request fails due to lack of available space, an asynchronous background task is kicked off to allocate memory from general system memory and turn it into an swiotlb pool. Creating an additional pool must be done asynchronously because the memory allocation may block, and as noted above, swiotlb requests are not allowed to block. Once the background task is kicked off, the bounce buffer request creates a “transient pool” to avoid returning an “swiotlb full” error. A transient pool has the size of the bounce buffer request, and is deleted when the bounce buffer is freed. Memory for this transient pool comes from the general system memory atomic pool so that creation does not block. Creating a transient pool has relatively high cost, particularly in a CoCo VM where the memory must be decrypted, so it is done only as a stopgap until the background task can add another non-transient pool.
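In outline, the fallback path looks something like this sketch (the helper names are hypothetical, not the kernel’s exact functions; schedule_work() is the standard workqueue API, and the pool-creation work item is assumed to be the allocator’s dyn_alloc field):

    /* Illustrative sketch of the CONFIG_SWIOTLB_DYNAMIC fallback. */
    index = search_all_pools(mem, orig_addr, alloc_size, alloc_align_mask);
    if (index < 0) {
            /* Kick the background worker to add a regular pool later;
             * that allocation may block, so it can't be done here.
             */
            schedule_work(&mem->dyn_alloc);

            /* Stopgap: build a transient pool sized for just this
             * request, from memory that can be allocated atomically.
             */
            pool = make_transient_pool(alloc_size);   /* hypothetical */
            if (pool)
                    index = search_pool(pool, orig_addr, alloc_size,
                                        alloc_align_mask);
    }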
Adding a dynamic pool has limitations. Like with the default pool, the memory must be physically contiguous, so the size is limited to MAX_PAGE_ORDER pages (e.g., 4 MiB on a typical x86 system). Due to memory fragmentation, a max size allocation may not be available. The dynamic pool allocator tries smaller sizes until it succeeds, but with a minimum size of 1 MiB. Given sufficient system memory fragmentation, dynamically adding a pool might not succeed at all.
The number of areas in a dynamic pool may be different from the number of areas in the default pool. Because the new pool size is typically a few MiB at most, the number of areas will likely be smaller. For example, with a new pool size of 4 MiB and the 256 KiB minimum area size, only 16 areas can be created. If the system has more than 16 CPUs, multiple CPUs must share an area, creating more lock contention.
New pools added via dynamic swiotlb are linked together in a linear list. swiotlb code frequently must search for the pool containing a particular swiotlb physical address, so that search is linear and not performant with a large number of dynamic pools. The data structures could be improved for faster searches.
Overall, dynamic swiotlb works best for small configurations with relatively few CPUs. It allows the default swiotlb pool to be smaller so that memory is not wasted, with dynamic pools making more space available if needed (as long as fragmentation isn’t an obstacle). It is less useful for large CoCo VMs.
Data Structure Details
swiotlb is managed with four primary data structures: io_tlb_mem, io_tlb_pool, io_tlb_area, and io_tlb_slot. io_tlb_mem describes a swiotlb memory allocator, which includes the default memory pool and any dynamic or transient pools linked to it. Limited statistics on swiotlb usage are kept per memory allocator and are stored in this data structure. These statistics are available under /sys/kernel/debug/swiotlb when CONFIG_DEBUG_FS is set.
io_tlb_pool describes a memory pool, either the default pool, a dynamic pool, or a transient pool. The description includes the start and end addresses of the memory in the pool, a pointer to an array of io_tlb_area structures, and a pointer to an array of io_tlb_slot structures that are associated with the pool.
io_tlb_area describes an area. The primary field is the spin lock used to serialize access to slots in the area. The io_tlb_area array for a pool has an entry for each area, and is accessed using a 0-based area index derived from the calling processor ID. Areas exist solely to allow parallel access to swiotlb from multiple CPUs.
io_tlb_slot describes an individual memory slot in the pool, with size IO_TLB_SIZE (2 KiB currently). The io_tlb_slot array is indexed by the slot index computed from the bounce buffer address relative to the starting memory address of the pool. The size of struct io_tlb_slot is 24 bytes, so the overhead is about 1% of the slot size.
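For reference, the fields described in the rest of this section look roughly like this (as in kernel/dma/swiotlb.c at the time of writing; 8 + 8 + 2 + 2 = 20 bytes of fields, padded to 24):

    struct io_tlb_slot {
            phys_addr_t orig_addr;          /* original buffer address */
            size_t alloc_size;              /* for sync sanity checks */
            unsigned short list;            /* contiguous free slots here */
            unsigned short pad_slots;       /* pre-padding slot count */
    };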
The io_tlb_slot array is designed to meet several requirements. First, the DMA APIs and the corresponding swiotlb APIs use the bounce buffer address as the identifier for a bounce buffer. This address is returned by swiotlb_tbl_map_single(), and then passed as an argument to swiotlb_tbl_unmap_single() and the swiotlb_sync_*() functions. The original memory buffer address obviously must be passed as an argument to swiotlb_tbl_map_single(), but it is not passed to the other APIs. Consequently, swiotlb data structures must save the original memory buffer address so that it can be used when doing sync operations. This original address is saved in the io_tlb_slot array.
Second, the io_tlb_slot array must handle partial sync requests. In such cases, the argument to swiotlb_sync_*() is not the address of the start of the bounce buffer but an address somewhere in the middle of the bounce buffer, and the address of the start of the bounce buffer isn’t known to swiotlb code. But swiotlb code must be able to calculate the corresponding original memory buffer address to do the CPU copy dictated by the “sync”. So an adjusted original memory buffer address is populated into the struct io_tlb_slot for each slot occupied by the bounce buffer. An adjusted “alloc_size” of the bounce buffer is also recorded in each struct io_tlb_slot so a sanity check can be performed on the size of the “sync” operation. The “alloc_size” field is not used except for the sanity check.
Third, the io_tlb_slot array is used to track available slots. The “list” field in struct io_tlb_slot records how many contiguous available slots exist starting at that slot. A “0” indicates that the slot is occupied. A value of “1” indicates only the current slot is available. A value of “2” indicates the current slot and the next slot are available, etc. The maximum value is IO_TLB_SEGSIZE, which can appear in the first slot in a slot set, and indicates that the entire slot set is available. These values are used when searching for available slots to use for a new bounce buffer. They are updated when allocating a new bounce buffer and when freeing a bounce buffer. At pool creation time, the “list” field is initialized to IO_TLB_SEGSIZE down to 1 for the slots in every slot set.
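A toy example may help; assume a slot set of only 8 slots rather than IO_TLB_SEGSIZE = 128:

    /*
     * At pool creation, every slot set is fully free:
     *      list: 8 7 6 5 4 3 2 1
     *
     * After a 3-slot bounce buffer is allocated at the start:
     *      list: 0 0 0 5 4 3 2 1
     *
     * After that bounce buffer is freed, merging with the free
     * slots that follow it:
     *      list: 8 7 6 5 4 3 2 1
     */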
Fourth, the io_tlb_slot array keeps track of any “padding slots” allocated to meet alloc_align_mask requirements described above. When swiotlb_tbl_map_single() allocates bounce buffer space to meet alloc_align_mask requirements, it may allocate pre-padding space across zero or more slots. But when swiotlb_tbl_unmap_single() is called with the bounce buffer address, the alloc_align_mask value that governed the allocation, and therefore the allocation of any padding slots, is not known. The “pad_slots” field records the number of padding slots so that swiotlb_tbl_unmap_single() can free them. The “pad_slots” value is recorded only in the first non-padding slot allocated to the bounce buffer.
Restricted pools
The swiotlb machinery is also used for “restricted pools”, which are pools of memory separate from the default swiotlb pool, and that are dedicated for DMA use by a particular device. Restricted pools provide a level of DMA memory protection on systems with limited hardware protection capabilities, such as those lacking an IOMMU. Such usage is specified by DeviceTree entries and requires that CONFIG_DMA_RESTRICTED_POOL is set. Each restricted pool is based on its own io_tlb_mem data structure that is independent of the main swiotlb io_tlb_mem.
Restricted pools add swiotlb_alloc() and swiotlb_free() APIs, which are called from the dma_alloc_*() and dma_free_*() APIs. The swiotlb_alloc/free() APIs allocate/free slots from/to the restricted pool directly and do not go through swiotlb_tbl_map/unmap_single().