IOMMUFD

Author: Jason Gunthorpe
Author: Kevin Tian

Overview
IOMMUFD is the user API to control the IOMMU subsystem as it relates to managing IO page tables from userspace using file descriptors. It intends to be general and consumable by any driver that wants to expose DMA to userspace. These drivers are eventually expected to deprecate any internal IOMMU logic they may already/historically implement (e.g. vfio_iommu_type1.c).
At minimum iommufd provides universal support of managing I/O address spaces and I/O page tables for all IOMMUs, with room in the design to add non-generic features to cater to specific hardware functionality.
In this context the capital letters (IOMMUFD) refer to the subsystem while the small letters (iommufd) refer to the file descriptors created via /dev/iommu for use by userspace.
Key Concepts

User Visible Objects

The following IOMMUFD objects are exposed to userspace:
IOMMUFD_OBJ_IOAS, representing an I/O address space (IOAS), allowing map/unmap of user space memory into ranges of I/O Virtual Address (IOVA).
The IOAS is a functional replacement for the VFIO container, and like the VFIO container it copies an IOVA map to a list of iommu_domains held within it.
IOMMUFD_OBJ_DEVICE, representing a device that is bound to iommufd by an external driver.
IOMMUFD_OBJ_HWPT_PAGING, representing an actual hardware I/O page table (i.e. a single struct iommu_domain) managed by the iommu driver. "PAGING" primarily indicates this type of HWPT should be linked to an IOAS. It also indicates that it is backed by an iommu_domain with the __IOMMU_DOMAIN_PAGING feature flag. This can be either an UNMANAGED stage-1 domain for a device running in the user space, or a nesting parent stage-2 domain for mappings from guest-level physical addresses to host-level physical addresses. The IOAS has a list of HWPT_PAGINGs that share the same IOVA mapping and it will synchronize its mapping with each member HWPT_PAGING.
IOMMUFD_OBJ_HWPT_NESTED, representing an actual hardware I/O page table (i.e. a single struct iommu_domain) managed by user space (e.g. guest OS). "NESTED" indicates that this type of HWPT should be linked to an HWPT_PAGING. It also indicates that it is backed by an iommu_domain that has a type of IOMMU_DOMAIN_NESTED. This must be a stage-1 domain for a device running in the user space (e.g. in a guest VM enabling the IOMMU nested translation feature.) As such, it must be created with a given nesting parent stage-2 domain to associate to. This nested stage-1 page table managed by the user space usually has mappings from guest-level I/O virtual addresses to guest-level physical addresses.
IOMMUFD_FAULT, representing a software queue for an HWPT reporting IO page faults using the IOMMU HW's PRI (Page Request Interface). This queue object provides user space an FD to poll the page fault events and also to respond to those events. A FAULT object must be created first to get a fault_id that can then be used to allocate a fault-enabled HWPT via the IOMMU_HWPT_ALLOC command by setting the IOMMU_HWPT_FAULT_ID_VALID bit in its flags field.
IOMMUFD_OBJ_VIOMMU, representing a slice of the physical IOMMU instance, passed to or shared with a VM. It may contain some HW-accelerated virtualization features and some SW resources used by the VM. For example:
Security namespace for guest owned ID, e.g. guest-controlled cache tags
Non-device-affiliated event reporting, e.g. invalidation queue errors
Access to a shareable nesting parent pagetable across physical IOMMUs
Virtualization of various platform IDs, e.g. RIDs and others
Delivery of paravirtualized invalidation
Direct assigned invalidation queues
Direct assigned interrupts
Such a vIOMMU object generally has access to a nesting parent pagetable to support some HW-accelerated virtualization features. So, a vIOMMU object must be created given a nesting parent HWPT_PAGING object, and then it would encapsulate that HWPT_PAGING object. Therefore, a vIOMMU object can be used to allocate an HWPT_NESTED object in place of the encapsulated HWPT_PAGING.
Note
The name "vIOMMU" isn't necessarily identical to a virtualized IOMMU in a VM. A VM can have one giant virtualized IOMMU running on a machine having multiple physical IOMMUs, in which case the VMM will dispatch the requests or configurations from this single virtualized IOMMU instance to multiple vIOMMU objects created for individual slices of different physical IOMMUs. In other words, a vIOMMU object is always a representation of one physical IOMMU, not necessarily of a virtualized IOMMU. For VMMs that want the full virtualization features from physical IOMMUs, it is suggested to build the same number of virtualized IOMMUs as the number of physical IOMMUs, so the passed-through devices would be connected to their own virtualized IOMMUs backed by corresponding vIOMMU objects, in which case a guest OS would do the "dispatch" naturally instead of VMM trappings.
IOMMUFD_OBJ_VDEVICE, representing a virtual device for an IOMMUFD_OBJ_DEVICE against an IOMMUFD_OBJ_VIOMMU. This virtual device holds the device's virtual information or attributes (related to the vIOMMU) in a VM. An immediate vDATA example can be the virtual ID of the device on a vIOMMU, which is a unique ID that the VMM assigns to the device for a translation channel/port of the vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID of AMD IOMMU, and vRID of Intel VT-d to a Context Table. Potential use cases of some advanced security information can be forwarded via this object too, such as security level or realm information in a Confidential Compute Architecture. A VMM should create a vDEVICE object to forward all the device information in a VM, when it connects a device to a vIOMMU, which is a separate ioctl call from attaching the same device to an HWPT_PAGING that the vIOMMU holds.
IOMMUFD_OBJ_VEVENTQ, representing a software queue for a vIOMMU to report its events, such as translation faults that occurred on a nested stage-1 (excluding I/O page faults, which should go through IOMMUFD_OBJ_FAULT) and HW-specific events. This queue object provides user space an FD to poll/read the vIOMMU events. A vIOMMU object must be created first to get its viommu_id, which can then be used to allocate a vEVENTQ. Each vIOMMU can support multiple types of vEVENTs, but is confined to one vEVENTQ per vEVENTQ type.
IOMMUFD_OBJ_HW_QUEUE, representing a hardware accelerated queue, as a subset of the IOMMU's virtualization features, for the IOMMU HW to directly read or write the virtual queue memory owned by a guest OS. This HW-acceleration feature can allow a VM to work with the IOMMU HW directly without a VM Exit, so as to reduce overhead from the hypercalls. Along with the HW QUEUE object, iommufd provides user space an mmap interface for the VMM to mmap a physical MMIO region from the host physical address space into the guest physical address space, allowing the guest OS to directly control the allocated HW QUEUE. Thus, when allocating a HW QUEUE, the VMM must request a pair of mmap info (offset/length) and pass them exactly to an mmap syscall via its offset and length arguments.
All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
The diagrams below show relationships between user-visible objects and kernel datastructures (external to iommufd), with numbers referring to the operations creating the objects and links:
 _______________________________________________________________________
|                      iommufd (HWPT_PAGING only)                       |
|                                                                       |
|        [1]                  [3]                                [2]    |
|  ________________      _____________                        ________  |
| |                |    |             |                      |        | |
| |      IOAS      |<---| HWPT_PAGING |<---------------------| DEVICE | |
| |________________|    |_____________|                      |________| |
|         |                    |                                  |     |
|_________|____________________|__________________________________|_____|
          |                    |                                  |
          |              ______v_____                          ___v__
          | PFN storage |  (paging)  |                        |struct|
          |------------>|iommu_domain|<-----------------------|device|
                        |____________|                        |______|

 _______________________________________________________________________
|                      iommufd (with HWPT_NESTED)                       |
|                                                                       |
|        [1]                  [3]                [4]             [2]    |
|  ________________      _____________      _____________     ________  |
| |                |    |             |    |             |   |        | |
| |      IOAS      |<---| HWPT_PAGING |<---| HWPT_NESTED |<--| DEVICE | |
| |________________|    |_____________|    |_____________|   |________| |
|         |                    |                  |               |     |
|_________|____________________|__________________|_______________|_____|
          |                    |                  |               |
          |              ______v_____       ______v_____       ___v__
          | PFN storage |  (paging)  |     |  (nested)  |     |struct|
          |------------>|iommu_domain|<----|iommu_domain|<----|device|
                        |____________|     |____________|     |______|

 _______________________________________________________________________
|                     iommufd (with vIOMMU/vDEVICE)                     |
|                                                                       |
|                             [5]                  [6]                  |
|                        _____________        _____________             |
|                       |             |      |             |            |
|      |----------------|    vIOMMU   |<-----|   vDEVICE   |<----|      |
|      |                |             |      |_____________|     |      |
|      |                |             |                          |      |
|      |     [1]        |             |     [4]           [2]    |      |
|      |    ______      |             |   _____________    _|______    |
|      |   |      |     |     [3]     |  |             |  |        |   |
|      |   | IOAS |<----|(HWPT_PAGING)|<-| HWPT_NESTED |<-| DEVICE |   |
|      |   |______|     |_____________|  |_____________|  |________|   |
|      |       |               |                |              |       |
|______|_______|_______________|________________|______________|_______|
       |       |               |                |              |
 ______v_____  |         ______v_____      ______v_____      ___v__
|   struct   | |  PFN   |  (paging)  |    |  (nested)  |    |struct|
|iommu_device| |------->|iommu_domain|<---|iommu_domain|<---|device|
|____________|  storage |____________|    |____________|    |______|
IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. An iommufd can hold multiple IOAS objects. IOAS is the most generic object and does not expose interfaces that are specific to single IOMMU drivers. All operations on the IOAS must operate equally on each of the iommu_domains inside of it.
IOMMUFD_OBJ_DEVICE is created when an external driver calls the IOMMUFD kAPI to bind a device to an iommufd. The driver is expected to implement a set of ioctls to allow userspace to initiate the binding operation. Successful completion of this operation establishes the desired DMA ownership over the device. The driver must also set the driver_managed_dma flag and must not touch the device until this operation succeeds.
IOMMUFD_OBJ_HWPT_PAGING can be created in two ways:
IOMMUFD_OBJ_HWPT_PAGING is automatically created when an external driver calls the IOMMUFD kAPI to attach a bound device to an IOAS. Similarly the external driver uAPI allows userspace to initiate the attaching operation. If a compatible member HWPT_PAGING object exists in the IOAS's HWPT_PAGING list, then it will be reused. Otherwise a new HWPT_PAGING that represents an iommu_domain to userspace will be created, and then added to the list. Successful completion of this operation sets up the linkages among IOAS, device and iommu_domain. Once this completes the device could do DMA.
IOMMUFD_OBJ_HWPT_PAGING can be manually created via the IOMMU_HWPT_ALLOC uAPI, provided an ioas_id via @pt_id to associate the new HWPT_PAGING to the corresponding IOAS object. The benefit of this manual allocation is to allow allocation flags (defined in enum iommufd_hwpt_alloc_flags), e.g. it allocates a nesting parent HWPT_PAGING if the IOMMU_HWPT_ALLOC_NEST_PARENT flag is set.
IOMMUFD_OBJ_HWPT_NESTED can be only manually created via the IOMMU_HWPT_ALLOC uAPI, provided an hwpt_id or a viommu_id of a vIOMMU object encapsulating a nesting parent HWPT_PAGING via @pt_id to associate the new HWPT_NESTED object to the corresponding HWPT_PAGING object. The associating HWPT_PAGING object must be a nesting parent manually allocated via the same uAPI previously with an IOMMU_HWPT_ALLOC_NEST_PARENT flag, otherwise the allocation will fail. The allocation will be further validated by the IOMMU driver to ensure that the nesting parent domain and the nested domain being allocated are compatible. Successful completion of this operation sets up linkages among IOAS, device, and iommu_domains. Once this completes the device could do DMA via a 2-stage translation, a.k.a nested translation. Note that multiple HWPT_NESTED objects can be allocated by (and then associated to) the same nesting parent.
Note
Either a manual IOMMUFD_OBJ_HWPT_PAGING or an IOMMUFD_OBJ_HWPT_NESTED is created via the same IOMMU_HWPT_ALLOC uAPI. The difference is at the type of the object passed in via the @pt_id field of struct iommufd_hwpt_alloc.
IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC uAPI, provided a dev_id (for the device's physical IOMMU to back the vIOMMU) and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The iommufd core will link the vIOMMU object to the struct iommu_device that the struct device is behind. And an IOMMU driver can implement a viommu_alloc op to allocate its own vIOMMU data structure embedding the core-level structure iommufd_viommu and some driver-specific data. If necessary, the driver can also configure its HW virtualization feature for that vIOMMU (and thus for the VM). Successful completion of this operation sets up the linkages between the vIOMMU object and the HWPT_PAGING, then this vIOMMU object can be used as a nesting parent object to allocate an HWPT_NESTED object described above.
IOMMUFD_OBJ_VDEVICE can be only manually created via the IOMMU_VDEVICE_ALLOC uAPI, provided a viommu_id for an iommufd_viommu object and a dev_id for an iommufd_device object. The vDEVICE object will be the binding between these two parent objects. Another @virt_id will be also set via the uAPI providing the iommufd core an index to store the vDEVICE object to a vDEVICE array per vIOMMU. If necessary, the IOMMU driver may choose to implement a vdevice_alloc op to init its HW for virtualization features related to a vDEVICE. Successful completion of this operation sets up the linkages between vIOMMU and device.
A device can only bind to an iommufd due to the DMA ownership claim and attach to at most one IOAS object (no support of PASID yet).
Kernel Datastructure

User visible objects are backed by the following datastructures:
iommufd_ioas for IOMMUFD_OBJ_IOAS.
iommufd_device for IOMMUFD_OBJ_DEVICE.
iommufd_hwpt_paging for IOMMUFD_OBJ_HWPT_PAGING.
iommufd_hwpt_nested for IOMMUFD_OBJ_HWPT_NESTED.
iommufd_fault for IOMMUFD_OBJ_FAULT.
iommufd_viommu for IOMMUFD_OBJ_VIOMMU.
iommufd_vdevice for IOMMUFD_OBJ_VDEVICE.
iommufd_veventq for IOMMUFD_OBJ_VEVENTQ.
iommufd_hw_queue for IOMMUFD_OBJ_HW_QUEUE.
Several terms are used when looking at these datastructures:
Automatic domain - refers to an iommu domain created automatically when attaching a device to an IOAS object. This is compatible with the semantics of VFIO type1.
Manual domain - refers to an iommu domain designated by the user as the target pagetable to be attached to by a device. Though currently there are no uAPIs to directly create such a domain, the datastructure and algorithms are ready for handling that use case.
In-kernel user - refers to something like a VFIO mdev that is using the IOMMUFD access interface to access the IOAS. This starts by creating an iommufd_access object that is similar to the domain binding a physical device would do. The access object will then allow converting IOVA ranges into struct page * lists, or doing direct read/write to an IOVA.
iommufd_ioas serves as the metadata datastructure to manage how IOVA ranges are mapped to memory pages, composed of:
struct io_pagetable holding the IOVA map
struct iopt_area's representing populated portions of IOVA
struct iopt_pages representing the storage of PFNs
struct iommu_domain representing the IO page table in the IOMMU
struct iopt_pages_access representing in-kernel users of PFNs
struct xarray pinned_pfns holding a list of pages pinned by in-kernel users
Each iopt_pages represents a logical linear array of full PFNs. The PFNs are ultimately derived from userspace VAs via an mm_struct. Once they have been pinned the PFNs are stored in IOPTEs of an iommu_domain or inside the pinned_pfns xarray if they have been pinned through an iommufd_access.
PFNs have to be copied between all combinations of storage locations, depending on what domains are present and what kinds of in-kernel "software access" users exist. The mechanism ensures that a page is pinned only once.
An io_pagetable is composed of iopt_areas pointing at iopt_pages, along with a list of iommu_domains that mirror the IOVA to PFN map.
Multiple io_pagetable-s, through their iopt_area-s, can share a single iopt_pages which avoids multi-pinning and double accounting of page consumption.
iommufd_ioas is shareable between subsystems, e.g. VFIO and VDPA, as long as devices managed by different subsystems are bound to the same iommufd.
IOMMUFD User API

General ioctl format

The ioctl interface follows a general format to allow for extensibility. Each ioctl is passed in a structure pointer as the argument providing the size of the structure in the first u32. The kernel checks that any structure space beyond what it understands is 0. This allows userspace to use the backward compatible portion while consistently using the newer, larger, structures.
ioctls use a standard meaning for common errnos:
ENOTTY: The IOCTL number itself is not supported at all
E2BIG: The IOCTL number is supported, but the provided structure has non-zero bytes in a part the kernel does not understand.
EOPNOTSUPP: The IOCTL number is supported, and the structure is understood, however a known field has a value the kernel does not understand or support.
EINVAL: Everything about the IOCTL was understood, but a field is not correct.
ENOENT: An ID or IOVA provided does not exist.
ENOMEM: Out of memory.
EOVERFLOW: Mathematics overflowed.
Individual ioctls may return additional errnos beyond these.
struct iommu_destroy

ioctl(IOMMU_DESTROY)

Definition:

struct iommu_destroy {
        __u32 size;
        __u32 id;
};

Members

size
    sizeof(struct iommu_destroy)
id
    iommufd object ID to destroy. Can be any destroyable object type.

Description

Destroy any object held within iommufd.
struct iommu_ioas_alloc

ioctl(IOMMU_IOAS_ALLOC)

Definition:

struct iommu_ioas_alloc {
        __u32 size;
        __u32 flags;
        __u32 out_ioas_id;
};

Members

size
    sizeof(struct iommu_ioas_alloc)
flags
    Must be 0
out_ioas_id
    Output IOAS ID for the allocated object

Description

Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA) to memory mapping.
struct iommu_iova_range

ioctl(IOMMU_IOVA_RANGE)

Definition:

struct iommu_iova_range {
        __aligned_u64 start;
        __aligned_u64 last;
};

Members

start
    First IOVA
last
    Inclusive last IOVA

Description

An interval in IOVA space.
struct iommu_ioas_iova_ranges

ioctl(IOMMU_IOAS_IOVA_RANGES)

Definition:

struct iommu_ioas_iova_ranges {
        __u32 size;
        __u32 ioas_id;
        __u32 num_iovas;
        __u32 __reserved;
        __aligned_u64 allowed_iovas;
        __aligned_u64 out_iova_alignment;
};

Members

size
    sizeof(struct iommu_ioas_iova_ranges)
ioas_id
    IOAS ID to read ranges from
num_iovas
    Input/Output total number of ranges in the IOAS
__reserved
    Must be 0
allowed_iovas
    Pointer to the output array of struct iommu_iova_range
out_iova_alignment
    Minimum alignment required for mapping IOVA

Description

Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges is not allowed. num_iovas will be set to the total number of iovas and the allowed_iovas[] will be filled in as space permits.

The allowed ranges are dependent on the HW path the DMA operation takes, and can change during the lifetime of the IOAS. A fresh empty IOAS will have a full range, and each attached device will narrow the ranges based on that device's HW restrictions. Detaching a device can widen the ranges. Userspace should query ranges after every attach/detach to know what IOVAs are valid for mapping.

On input num_iovas is the length of the allowed_iovas array. On output it is the total number of iovas filled in. The ioctl will return -EMSGSIZE and set num_iovas to the required value if num_iovas is too small. In this case the caller should allocate a larger output array and re-issue the ioctl.

out_iova_alignment returns the minimum IOVA alignment that can be given to IOMMU_IOAS_MAP/COPY. IOVAs must satisfy:

    starting_iova % out_iova_alignment == 0
    (starting_iova + length) % out_iova_alignment == 0

out_iova_alignment can be 1 indicating any IOVA is allowed. It cannot be higher than the system PAGE_SIZE.
struct iommu_ioas_allow_iovas

ioctl(IOMMU_IOAS_ALLOW_IOVAS)

Definition:

struct iommu_ioas_allow_iovas {
        __u32 size;
        __u32 ioas_id;
        __u32 num_iovas;
        __u32 __reserved;
        __aligned_u64 allowed_iovas;
};

Members

size
    sizeof(struct iommu_ioas_allow_iovas)
ioas_id
    IOAS ID to allow IOVAs from
num_iovas
    Input/Output total number of ranges in the IOAS
__reserved
    Must be 0
allowed_iovas
    Pointer to array of struct iommu_iova_range

Description

Ensure a range of IOVAs are always available for allocation. If this call succeeds then IOMMU_IOAS_IOVA_RANGES will never return a list of IOVA ranges that are narrower than the ranges provided here. This call will fail if IOMMU_IOAS_IOVA_RANGES is currently narrower than the given ranges.

When an IOAS is first created the IOVA_RANGES will be maximally sized, and as devices are attached the IOVA will narrow based on the device restrictions. When an allowed range is specified any narrowing will be refused, ie device attachment can fail if the device requires limiting within the allowed range.

Automatic IOVA allocation is also impacted by this call. MAP will only allocate within the allowed IOVAs if they are present.

This call replaces the entire allowed list with the given list.
enum iommufd_ioas_map_flags

Flags for map and copy

Constants

IOMMU_IOAS_MAP_FIXED_IOVA
    If clear the kernel will compute an appropriate IOVA to place the mapping at
IOMMU_IOAS_MAP_WRITEABLE
    DMA is allowed to write to this mapping
IOMMU_IOAS_MAP_READABLE
    DMA is allowed to read from this mapping
struct iommu_ioas_map

ioctl(IOMMU_IOAS_MAP)

Definition:

struct iommu_ioas_map {
        __u32 size;
        __u32 flags;
        __u32 ioas_id;
        __u32 __reserved;
        __aligned_u64 user_va;
        __aligned_u64 length;
        __aligned_u64 iova;
};

Members

size
    sizeof(struct iommu_ioas_map)
flags
    Combination of enum iommufd_ioas_map_flags
ioas_id
    IOAS ID to change the mapping of
__reserved
    Must be 0
user_va
    Userspace pointer to start mapping from
length
    Number of bytes to map
iova
    IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set then this must be provided as input.

Description

Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the mapping will be established at iova, otherwise a suitable location based on the reserved and allowed lists will be automatically selected and returned in iova.

If IOMMU_IOAS_MAP_FIXED_IOVA is specified then the iova range must currently be unused, existing IOVA cannot be replaced.
struct iommu_ioas_map_file

ioctl(IOMMU_IOAS_MAP_FILE)

Definition:

struct iommu_ioas_map_file {
        __u32 size;
        __u32 flags;
        __u32 ioas_id;
        __s32 fd;
        __aligned_u64 start;
        __aligned_u64 length;
        __aligned_u64 iova;
};

Members

size
    sizeof(struct iommu_ioas_map_file)
flags
    same as for iommu_ioas_map
ioas_id
    same as for iommu_ioas_map
fd
    the memfd to map
start
    byte offset from start of file to map from
length
    same as for iommu_ioas_map
iova
    same as for iommu_ioas_map

Description

Set an IOVA mapping from a memfd file. All other arguments and semantics match those of IOMMU_IOAS_MAP.
struct iommu_ioas_copy

ioctl(IOMMU_IOAS_COPY)

Definition:

struct iommu_ioas_copy {
        __u32 size;
        __u32 flags;
        __u32 dst_ioas_id;
        __u32 src_ioas_id;
        __aligned_u64 length;
        __aligned_u64 dst_iova;
        __aligned_u64 src_iova;
};

Members

size
    sizeof(struct iommu_ioas_copy)
flags
    Combination of enum iommufd_ioas_map_flags
dst_ioas_id
    IOAS ID to change the mapping of
src_ioas_id
    IOAS ID to copy from
length
    Number of bytes to copy and map
dst_iova
    IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set then this must be provided as input.
src_iova
    IOVA to start the copy

Description

Copy an already existing mapping from src_ioas_id and establish it in dst_ioas_id. The src iova/length must exactly match a range used with IOMMU_IOAS_MAP.

This may be used to efficiently clone a subset of an IOAS to another, or as a kind of 'cache' to speed up mapping. Copy has an efficiency advantage over establishing equivalent new mappings, as internal resources are shared, and the kernel will pin the user memory only once.
struct iommu_ioas_unmap

ioctl(IOMMU_IOAS_UNMAP)

Definition:

struct iommu_ioas_unmap {
        __u32 size;
        __u32 ioas_id;
        __aligned_u64 iova;
        __aligned_u64 length;
};

Members

size
    sizeof(struct iommu_ioas_unmap)
ioas_id
    IOAS ID to change the mapping of
iova
    IOVA to start the unmapping at
length
    Number of bytes to unmap, and return back the bytes unmapped

Description

Unmap an IOVA range. The iova/length must be a superset of a previously mapped range used with IOMMU_IOAS_MAP or IOMMU_IOAS_COPY. Splitting or truncating ranges is not allowed. The values 0 to U64_MAX will unmap everything.
enum iommufd_option

ioctl(IOMMU_OPTION_RLIMIT_MODE) and ioctl(IOMMU_OPTION_HUGE_PAGES)

Constants

IOMMU_OPTION_RLIMIT_MODE
    Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege to invoke this. Value 0 (default) is user based accounting, 1 uses process based accounting. Global option, object_id must be 0.
IOMMU_OPTION_HUGE_PAGES
    Value 1 (default) allows contiguous pages to be combined when generating iommu mappings. Value 0 disables combining, everything is mapped to PAGE_SIZE. This can be useful for benchmarking. This is a per-IOAS option, the object_id must be the IOAS ID.
enum iommufd_option_ops

ioctl(IOMMU_OPTION_OP_SET) and ioctl(IOMMU_OPTION_OP_GET)

Constants

IOMMU_OPTION_OP_SET
    Set the option's value
IOMMU_OPTION_OP_GET
    Get the option's value
struct iommu_option

iommu option multiplexer

Definition:

struct iommu_option {
        __u32 size;
        __u32 option_id;
        __u16 op;
        __u16 __reserved;
        __u32 object_id;
        __aligned_u64 val64;
};

Members

size
    sizeof(struct iommu_option)
option_id
    One of enum iommufd_option
op
    One of enum iommufd_option_ops
__reserved
    Must be 0
object_id
    ID of the object if required
val64
    Option value to set or value returned on get

Description

Change a simple option value. This multiplexor allows controlling options on objects. IOMMU_OPTION_OP_SET will load an option and IOMMU_OPTION_OP_GET will return the current value.
enum iommufd_vfio_ioas_op

IOMMU_VFIO_IOAS_* ioctls

Constants

IOMMU_VFIO_IOAS_GET
    Get the current compatibility IOAS
IOMMU_VFIO_IOAS_SET
    Change the current compatibility IOAS
IOMMU_VFIO_IOAS_CLEAR
    Disable VFIO compatibility
struct iommu_vfio_ioas

ioctl(IOMMU_VFIO_IOAS)

Definition:

struct iommu_vfio_ioas {
        __u32 size;
        __u32 ioas_id;
        __u16 op;
        __u16 __reserved;
};

Members

size
    sizeof(struct iommu_vfio_ioas)
ioas_id
    For IOMMU_VFIO_IOAS_SET the input IOAS ID to set. For IOMMU_VFIO_IOAS_GET will output the IOAS ID.
op
    One of enum iommufd_vfio_ioas_op
__reserved
    Must be 0

Description

The VFIO compatibility support uses a single ioas because VFIO APIs do not support the ID field. Set or Get the IOAS that VFIO compatibility will use. When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the compatibility ioas, either by taking what is already set, or auto creating one. From then on VFIO will continue to use that ioas and is not affected by this ioctl. SET or CLEAR does not destroy any auto-created IOAS.
enum iommufd_hwpt_alloc_flags

Flags for HWPT allocation

Constants

IOMMU_HWPT_ALLOC_NEST_PARENT
    If set, allocate a HWPT that can serve as the parent HWPT in a nesting configuration.
IOMMU_HWPT_ALLOC_DIRTY_TRACKING
    Dirty tracking support for device IOMMU is enforced on device attachment
IOMMU_HWPT_FAULT_ID_VALID
    The fault_id field of hwpt allocation data is valid.
IOMMU_HWPT_ALLOC_PASID
    Requests a domain that can be used with PASID. The domain can be attached to any PASID on the device. Any domain attached to the non-PASID part of the device must also be flagged, otherwise attaching a PASID will be blocked. For the user that wants to attach PASID, ioas is not recommended for both the non-PASID part and PASID part of the device. If the IOMMU does not support PASID it will return an error (-EOPNOTSUPP).
enum iommu_hwpt_vtd_s1_flags

Intel VT-d stage-1 page table entry attributes

Constants

IOMMU_VTD_S1_SRE
    Supervisor request
IOMMU_VTD_S1_EAFE
    Extended access enable
IOMMU_VTD_S1_WPE
    Write protect enable
struct iommu_hwpt_vtd_s1

Intel VT-d stage-1 page table info (IOMMU_HWPT_DATA_VTD_S1)

Definition:

struct iommu_hwpt_vtd_s1 {
        __aligned_u64 flags;
        __aligned_u64 pgtbl_addr;
        __u32 addr_width;
        __u32 __reserved;
};

Members

flags
    Combination of enum iommu_hwpt_vtd_s1_flags
pgtbl_addr
    The base address of the stage-1 page table
addr_width
    The address width of the stage-1 page table
__reserved
    Must be 0
struct iommu_hwpt_arm_smmuv3

ARM SMMUv3 nested STE (IOMMU_HWPT_DATA_ARM_SMMUV3)

Definition:

struct iommu_hwpt_arm_smmuv3 {
        __aligned_le64 ste[2];
};

Members

ste
    The first two double words of the user space Stream Table Entry for the translation. Must be little-endian. Allowed fields (refer to "5.2 Stream Table Entry" in the SMMUv3 HW Spec):
    - word-0: V, Cfg, S1Fmt, S1ContextPtr, S1CDMax
    - word-1: EATS, S1DSS, S1CIR, S1COR, S1CSH, S1STALLD

Description

-EIO will be returned if ste is not legal or contains any non-allowed field. Cfg can be used to select a S1, Bypass or Abort configuration. A Bypass nested domain will translate the same as the nesting parent. The S1 will install a Context Descriptor Table pointing at userspace memory translated by the nesting parent.

It's suggested to allocate a vDEVICE object carrying the vSID and then re-attach the nested domain, as soon as the vSID is available at the VMM level:

- when Cfg=translate, a vDEVICE must be allocated prior to attaching to the allocated nested domain, as CD/ATS invalidations and vevents need a vSID.
- when Cfg=bypass/abort, a vDEVICE is not enforced during the nested domain attachment, to support a GBPA case where the VM sets CR0.SMMUEN=0. However, if the VM sets CR0.SMMUEN=1 while missing a vDEVICE object, the kernel would fail to report events to the VM, e.g. F_TRANSLATION when guest STE.Cfg=abort.
enum iommu_hwpt_data_type

IOMMU HWPT Data Type

Constants

IOMMU_HWPT_DATA_NONE
    no data
IOMMU_HWPT_DATA_VTD_S1
    Intel VT-d stage-1 page table
IOMMU_HWPT_DATA_ARM_SMMUV3
    ARM SMMUv3 Context Descriptor Table
struct iommu_hwpt_alloc

ioctl(IOMMU_HWPT_ALLOC)

Definition:

struct iommu_hwpt_alloc {
        __u32 size;
        __u32 flags;
        __u32 dev_id;
        __u32 pt_id;
        __u32 out_hwpt_id;
        __u32 __reserved;
        __u32 data_type;
        __u32 data_len;
        __aligned_u64 data_uptr;
        __u32 fault_id;
        __u32 __reserved2;
};

Members

size
    sizeof(struct iommu_hwpt_alloc)
flags
    Combination of enum iommufd_hwpt_alloc_flags
dev_id
    The device to allocate this HWPT for
pt_id
    The IOAS or HWPT or vIOMMU to connect this HWPT to
out_hwpt_id
    The ID of the new HWPT
__reserved
    Must be 0
data_type
    One of enum iommu_hwpt_data_type
data_len
    Length of the type specific data
data_uptr
    User pointer to the type specific data
fault_id
    The ID of the IOMMUFD_FAULT object. Valid only if the IOMMU_HWPT_FAULT_ID_VALID bit is set in flags.
__reserved2
    Padding to 64-bit alignment. Must be 0.

Description

Explicitly allocate a hardware page table object. This is the same object type that is returned by iommufd_device_attach() and represents the underlying iommu driver's iommu_domain kernel object.

A kernel-managed HWPT will be created with the mappings from the given IOAS via the pt_id. The data_type for this allocation must be set to IOMMU_HWPT_DATA_NONE. The HWPT can be allocated as a parent HWPT for a nesting configuration by passing IOMMU_HWPT_ALLOC_NEST_PARENT via flags.

A user-managed nested HWPT will be created from a given vIOMMU (wrapping a parent HWPT) or a parent HWPT via pt_id, in which the parent HWPT must be allocated previously via the same ioctl from a given IOAS (pt_id). In this case, the data_type must be set to a pre-defined type corresponding to an I/O page table type supported by the underlying IOMMU hardware. The device via dev_id and the vIOMMU via pt_id must be associated to the same IOMMU instance.

If the data_type is set to IOMMU_HWPT_DATA_NONE, data_len and data_uptr should be zero. Otherwise, both data_len and data_uptr must be given.
enum iommu_hw_info_vtd_flags

Flags for VT-d hw_info

Constants

IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
    If set, disallow read-only mappings on a nested_parent domain. https://www.intel.com/content/www/us/en/content-details/772415/content-details.html
struct iommu_hw_info_vtd

Intel VT-d hardware information

Definition:

struct iommu_hw_info_vtd {
        __u32 flags;
        __u32 __reserved;
        __aligned_u64 cap_reg;
        __aligned_u64 ecap_reg;
};

Members

flags
    Combination of enum iommu_hw_info_vtd_flags
__reserved
    Must be 0
cap_reg
    Value of the Intel VT-d capability register defined in VT-d spec section 11.4.2 Capability Register.
ecap_reg
    Value of the Intel VT-d extended capability register defined in VT-d spec section 11.4.3 Extended Capability Register.

Description

The user needs to understand the Intel VT-d specification to decode the register values.
- struct iommu_hw_info_arm_smmuv3¶
ARM SMMUv3 hardware information (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
Definition:
struct iommu_hw_info_arm_smmuv3 { __u32 flags; __u32 __reserved; __u32 idr[6]; __u32 iidr; __u32 aidr; };
Members
flags: Must be set to 0
__reserved: Must be 0
idr: Implemented features for the ARM SMMU Non-secure programming interface
iidr: Information about the implementation and implementer of the ARM SMMU, and the architecture version supported
aidr: ARM SMMU architecture version
Description
For the details of idr, iidr and aidr, please refer to chapters 6.3.1 to 6.3.6 in the SMMUv3 Spec.
This reports the raw HW capability, and not all bits are meaningful to be read by userspace. Only the following fields should be used:
idr[0]: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16, TTF
idr[1]: SIDSIZE, SSIDSIZE
idr[3]: BBML, RIL
idr[5]: VAX, GRAN64K, GRAN16K, GRAN4K
S1P should be assumed to be true if a NESTED HWPT can be created.
VFIO/iommufd only support platforms with COHACC, so it should be assumed to be true.
ATS is a per-device property. If the VMM describes any devices as ATS capable in ACPI/DT it should set the corresponding idr.
This list may expand in the future (e.g. E0PD, AIE, PBHA, D128, DS etc). It is important that VMMs do not read bits outside the list, to allow for compatibility with future kernels. Several features in the SMMUv3 architecture are not currently supported by the kernel for nesting: HTTU, BTM, MPAM and others.
- struct iommu_hw_info_tegra241_cmdqv¶
NVIDIA Tegra241 CMDQV Hardware Information (IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV)
Definition:
struct iommu_hw_info_tegra241_cmdqv { __u32 flags; __u8 version; __u8 log2vcmdqs; __u8 log2vsids; __u8 __reserved; };
Members
flags: Must be 0
version: Version number for the CMDQ-V HW, for PARAM bits[03:00]
log2vcmdqs: Log2 of the total number of VCMDQs, for PARAM bits[07:04]
log2vsids: Log2 of the total number of SID replacements, for PARAM bits[15:12]
__reserved: Must be 0
Description
A VMM can use these fields directly in its emulated global PARAM register. Note that only one Virtual Interface (VINTF) should be exposed to a VM, i.e. PARAM bits[11:08] (log2 of the total number of VINTFs) should be set to 0.
- enum iommu_hw_info_type¶
IOMMU Hardware Info Types
Constants
IOMMU_HW_INFO_TYPE_NONE: Output by drivers that do not report hardware info
IOMMU_HW_INFO_TYPE_DEFAULT: Input to request a default type
IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM SMMUv3) info type
- enum iommufd_hw_capabilities¶
Constants
IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking. If available, it means the following APIs are supported:
IOMMU_HWPT_GET_DIRTY_BITMAP
IOMMU_HWPT_SET_DIRTY_TRACKING
IOMMU_HW_CAP_PCI_PASID_EXEC: Execute Permission Supported; user ignores it when struct iommu_hw_info::out_max_pasid_log2 is zero.
IOMMU_HW_CAP_PCI_PASID_PRIV: Privileged Mode Supported; user ignores it when struct iommu_hw_info::out_max_pasid_log2 is zero.
- enum iommufd_hw_info_flags¶
Flags for iommu_hw_info
Constants
IOMMU_HW_INFO_FLAG_INPUT_TYPE: If set, in_data_type carries an input type for user space to request info of a specific type
- struct iommu_hw_info¶
ioctl(IOMMU_GET_HW_INFO)
Definition:
struct iommu_hw_info { __u32 size; __u32 flags; __u32 dev_id; __u32 data_len; __aligned_u64 data_uptr; union { __u32 in_data_type; __u32 out_data_type; }; __u8 out_max_pasid_log2; __u8 __reserved[3]; __aligned_u64 out_capabilities; };
Members
size: sizeof(struct iommu_hw_info)
flags: Must be 0
dev_id: The device bound to the iommufd
data_len: Input the length of a user buffer in bytes. Output the length of data that the kernel supports
data_uptr: User pointer to a user-space buffer used by the kernel to fill in the iommu type specific hardware information data
{unnamed_union}: anonymous
in_data_type: This shares the same field with out_data_type, making it a bidirectional field. When IOMMU_HW_INFO_FLAG_INPUT_TYPE is set, the input type carried via this in_data_type field is valid, requesting the info data of the given type. If IOMMU_HW_INFO_FLAG_INPUT_TYPE is unset, any input value will be treated as IOMMU_HW_INFO_TYPE_DEFAULT
out_data_type: Output the iommu hardware info type as defined in enum iommu_hw_info_type
out_max_pasid_log2: Output the width of PASIDs. 0 means no PASID support. PCI devices turn to out_capabilities to check whether a specific capability is supported or not.
__reserved: Must be 0
out_capabilities: Output the generic iommu capability info as defined in enum iommufd_hw_capabilities
Description
Query iommu type specific hardware information data from the iommu behind a given device that has been bound to iommufd. This hardware info data will be used to sync capabilities between the virtual iommu and the physical iommu, e.g. a nested translation setup needs to check the hardware info so that a guest stage-1 page table can be compatible with the physical iommu.
To capture iommu type specific hardware information data, data_uptr and its length data_len must be provided. Trailing bytes will be zeroed if the user buffer is larger than the data that the kernel has. Otherwise, the kernel only fills the buffer up to the given length in data_len. If the ioctl succeeds, data_len will be updated to the length that the kernel actually supports, and out_data_type will be filled in to decode the data in the buffer pointed to by data_uptr. Input data_len == 0 is allowed.
- struct iommu_hwpt_set_dirty_tracking¶
ioctl(IOMMU_HWPT_SET_DIRTY_TRACKING)
Definition:
struct iommu_hwpt_set_dirty_tracking { __u32 size; __u32 flags; __u32 hwpt_id; __u32 __reserved; };
Members
size: sizeof(struct iommu_hwpt_set_dirty_tracking)
flags: Combination of enum iommufd_hwpt_set_dirty_tracking_flags
hwpt_id: HW pagetable ID that represents the IOMMU domain
__reserved: Must be 0
Description
Toggle dirty tracking on an HW pagetable.
- enum iommufd_hwpt_get_dirty_bitmap_flags¶
Flags for getting dirty bits
Constants
IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR: Just read the PTEs without clearing any dirty bit metadata. This flag can be passed when the next operation is expected to be an unmap of the same IOVA range.
- struct iommu_hwpt_get_dirty_bitmap¶
ioctl(IOMMU_HWPT_GET_DIRTY_BITMAP)
Definition:
struct iommu_hwpt_get_dirty_bitmap { __u32 size; __u32 hwpt_id; __u32 flags; __u32 __reserved; __aligned_u64 iova; __aligned_u64 length; __aligned_u64 page_size; __aligned_u64 data; };
Members
size: sizeof(struct iommu_hwpt_get_dirty_bitmap)
hwpt_id: HW pagetable ID that represents the IOMMU domain
flags: Combination of enum iommufd_hwpt_get_dirty_bitmap_flags
__reserved: Must be 0
iova: base IOVA of the first bit in the bitmap
length: IOVA range size
page_size: page size granularity of each bit in the bitmap
data: bitmap where to set the dirty bits. Each bit of the bitmap represents one page_size-sized range, offset from iova.
Description
Checking a given IOVA is dirty:
data[(iova / page_size) / 64] & (1ULL << ((iova / page_size) % 64))
Walk the IOMMU pagetables for a given IOVA range to return a bitmap with the dirty IOVAs. In doing so it will also, by default, clear any dirty bit metadata set in the IOPTE.
- enum iommu_hwpt_invalidate_data_type¶
IOMMU HWPT Cache Invalidation Data Type
Constants
IOMMU_HWPT_INVALIDATE_DATA_VTD_S1: Invalidation data for VTD_S1
IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3: Invalidation data for ARM SMMUv3
- enum iommu_hwpt_vtd_s1_invalidate_flags¶
Flags for Intel VT-d stage-1 cache invalidation
Constants
IOMMU_VTD_INV_FLAGS_LEAF: Indicates whether the invalidation applies to the all-levels page structure cache or just the leaf PTE cache.
- struct iommu_hwpt_vtd_s1_invalidate¶
Intel VT-d cache invalidation (IOMMU_HWPT_INVALIDATE_DATA_VTD_S1)
Definition:
struct iommu_hwpt_vtd_s1_invalidate { __aligned_u64 addr; __aligned_u64 npages; __u32 flags; __u32 __reserved; };
Members
addr: The start address of the range to be invalidated. It needs to be 4KB aligned.
npages: Number of contiguous 4K pages to be invalidated.
flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
__reserved: Must be 0
Description
The Intel VT-d specific invalidation data for user-managed stage-1 cache invalidation in nested translation. Userspace uses this structure to tell the impacted cache scope after modifying the stage-1 page table.
Invalidate all the caches related to the page table by setting addr to 0 and npages to U64_MAX.
The device TLB will be invalidated automatically if ATS is enabled.
- struct iommu_viommu_arm_smmuv3_invalidate¶
ARM SMMUv3 cache invalidation (IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3)
Definition:
struct iommu_viommu_arm_smmuv3_invalidate { __aligned_le64 cmd[2]; };
Members
cmd: 128-bit cache invalidation command that runs in the SMMU CMDQ. Must be little-endian.
Description
Supported command list, only when passing in a vIOMMU via hwpt_id:
CMDQ_OP_TLBI_NSNH_ALL
CMDQ_OP_TLBI_NH_VA
CMDQ_OP_TLBI_NH_VAA
CMDQ_OP_TLBI_NH_ALL
CMDQ_OP_TLBI_NH_ASID
CMDQ_OP_ATC_INV
CMDQ_OP_CFGI_CD
CMDQ_OP_CFGI_CD_ALL
-EIO will be returned if the command is not supported.
- struct iommu_hwpt_invalidate¶
ioctl(IOMMU_HWPT_INVALIDATE)
Definition:
struct iommu_hwpt_invalidate { __u32 size; __u32 hwpt_id; __aligned_u64 data_uptr; __u32 data_type; __u32 entry_len; __u32 entry_num; __u32 __reserved; };
Members
size: sizeof(struct iommu_hwpt_invalidate)
hwpt_id: ID of a nested HWPT or a vIOMMU, for cache invalidation
data_uptr: User pointer to an array of driver-specific cache invalidation data.
data_type: One of enum iommu_hwpt_invalidate_data_type, defining the data type of all the entries in the invalidation request array. It should be a type supported by the hwpt pointed to by hwpt_id.
entry_len: Length (in bytes) of a request entry in the request array
entry_num: Input the number of cache invalidation requests in the array. Output the number of requests successfully handled by the kernel.
__reserved: Must be 0.
Description
Invalidate the iommu cache for a user-managed page table or vIOMMU. Modifications on a user-managed page table should be followed by this operation, if a HWPT is passed in via hwpt_id. Other caches, such as the device cache or descriptor cache, can be flushed if a vIOMMU is passed in via the hwpt_id field.
Each ioctl can support one or more cache invalidation requests in the array, with a total size of entry_len * entry_num.
An empty invalidation request array, by setting entry_num == 0, is allowed; entry_len and data_uptr are ignored in this case. This can be used to check whether the given data_type is supported or not by the kernel.
- enum iommu_hwpt_pgfault_flags¶
flags for struct iommu_hwpt_pgfault
Constants
IOMMU_PGFAULT_FLAGS_PASID_VALID: The pasid field of the fault data is valid.
IOMMU_PGFAULT_FLAGS_LAST_PAGE: It's the last fault of a fault group.
- enum iommu_hwpt_pgfault_perm¶
perm bits for struct iommu_hwpt_pgfault
Constants
IOMMU_PGFAULT_PERM_READ: request for read permission
IOMMU_PGFAULT_PERM_WRITE: request for write permission
IOMMU_PGFAULT_PERM_EXEC: (PCIe 10.4.1) request with a PASID that has the Execute Requested bit set in the PASID TLP Prefix.
IOMMU_PGFAULT_PERM_PRIV: (PCIe 10.4.1) request with a PASID that has the Privileged Mode Requested bit set in the PASID TLP Prefix.
- struct iommu_hwpt_pgfault¶
iommu page fault data
Definition:
struct iommu_hwpt_pgfault { __u32 flags; __u32 dev_id; __u32 pasid; __u32 grpid; __u32 perm; __u32 __reserved; __aligned_u64 addr; __u32 length; __u32 cookie; };
Members
flags: Combination of enum iommu_hwpt_pgfault_flags
dev_id: id of the originating device
pasid: Process Address Space ID
grpid: Page Request Group Index
perm: Combination of enum iommu_hwpt_pgfault_perm
__reserved: Must be 0.
addr: Fault address
length: a hint of how much data the requestor is expecting to fetch. For example, if the PRI initiator knows it is going to do a 10MB transfer, it could fill in 10MB and the OS could pre-fault in 10MB of IOVA. It defaults to 0 if there's no such hint.
cookie: kernel-managed cookie identifying a group of fault messages. The cookie number encoded in the last page fault of the group should be echoed back in the response message.
- enum iommufd_page_response_code¶
Return status of fault handlers
Constants
IOMMUFD_PAGE_RESP_SUCCESS: Fault has been handled and the page tables populated; retry the access. This is the "Success" defined in PCI 10.4.2.1.
IOMMUFD_PAGE_RESP_INVALID: Could not handle this fault, don't retry the access. This is the "Invalid Request" in PCI 10.4.2.1.
- struct iommu_hwpt_page_response¶
IOMMU page fault response
Definition:
struct iommu_hwpt_page_response { __u32 cookie; __u32 code; };
Members
cookie: The kernel-managed cookie reported in the fault message.
code: One of the response codes in enum iommufd_page_response_code.
- struct iommu_fault_alloc¶
ioctl(IOMMU_FAULT_QUEUE_ALLOC)
Definition:
struct iommu_fault_alloc { __u32 size; __u32 flags; __u32 out_fault_id; __u32 out_fault_fd; };
Members
size: sizeof(struct iommu_fault_alloc)
flags: Must be 0
out_fault_id: The ID of the new FAULT
out_fault_fd: The fd of the new FAULT
Description
Explicitly allocate a fault handling object.
- enum iommu_viommu_type¶
Virtual IOMMU Type
Constants
IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use
IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type
IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM SMMUv3) enabled ARM SMMUv3 type
- struct iommu_viommu_tegra241_cmdqv¶
NVIDIA Tegra241 CMDQV Virtual Interface (IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV)
Definition:
struct iommu_viommu_tegra241_cmdqv { __aligned_u64 out_vintf_mmap_offset; __aligned_u64 out_vintf_mmap_length; };
Members
out_vintf_mmap_offset: mmap offset argument for the VINTF's page0
out_vintf_mmap_length: mmap length argument for the VINTF's page0
Description
Both out_vintf_mmap_offset and out_vintf_mmap_length are reported by the kernel for user space to mmap the VINTF page0 from the host physical address space to the guest physical address space, so that a guest kernel can directly R/W access the VINTF page0 in order to control its virtual command queues.
- struct iommu_viommu_alloc¶
ioctl(IOMMU_VIOMMU_ALLOC)
Definition:
struct iommu_viommu_alloc { __u32 size; __u32 flags; __u32 type; __u32 dev_id; __u32 hwpt_id; __u32 out_viommu_id; __u32 data_len; __u32 __reserved; __aligned_u64 data_uptr; };
Members
size: sizeof(struct iommu_viommu_alloc)
flags: Must be 0
type: Type of the virtual IOMMU. Must be defined in enum iommu_viommu_type
dev_id: The device whose physical IOMMU will be used to back the virtual IOMMU
hwpt_id: ID of a nesting parent HWPT to associate to
out_viommu_id: Output virtual IOMMU ID for the allocated object
data_len: Length of the type specific data
__reserved: Must be 0
data_uptr: User pointer to driver-specific virtual IOMMU data
Description
Allocate a virtual IOMMU object, representing the underlying physical IOMMU's virtualization support that is a security-isolated slice of the real IOMMU HW that is unique to a specific VM. Operations global to the IOMMU are connected to the vIOMMU, such as:
- Security namespace for guest owned IDs, e.g. guest-controlled cache tags
- Non-device-affiliated event reporting, e.g. invalidation queue errors
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Virtualization of various platform IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Directly assigned invalidation queues
- Directly assigned interrupts
- struct iommu_vdevice_alloc¶
ioctl(IOMMU_VDEVICE_ALLOC)
Definition:
struct iommu_vdevice_alloc { __u32 size; __u32 viommu_id; __u32 dev_id; __u32 out_vdevice_id; __aligned_u64 virt_id; };
Members
size: sizeof(struct iommu_vdevice_alloc)
viommu_id: vIOMMU ID to associate with the virtual device
dev_id: The physical device to allocate a virtual instance on the vIOMMU
out_vdevice_id: Object handle for the vDevice. Pass to IOMMU_DESTROY
virt_id: Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID of AMD IOMMU, and vRID of Intel VT-d
Description
Allocate a virtual device instance (for a physical device) against a vIOMMU. This instance holds the device's information (related to its vIOMMU) in a VM. The user should use IOMMU_DESTROY to destroy the virtual device before destroying the physical device (by closing the vfio_cdev fd). Otherwise the virtual device would be forcibly destroyed on physical device destruction, and its vdevice_id would be permanently leaked (unremovable & unreusable) until the iommufd is closed.
- struct iommu_ioas_change_process¶
ioctl(IOMMU_IOAS_CHANGE_PROCESS)
Definition:
struct iommu_ioas_change_process { __u32 size; __u32 __reserved; };
Members
size: sizeof(struct iommu_ioas_change_process)
__reserved: Must be 0
Description
This transfers pinned memory counts for every memory map in every IOAS in the context to the current process. This only supports maps created with IOMMU_IOAS_MAP_FILE, and returns EINVAL if other maps are present. If the ioctl returns a failure status, then nothing is changed.
This API is useful for transferring operation of a device from one process to another, such as during a userland live update.
- enum iommu_veventq_flag¶
flag for struct iommufd_vevent_header
Constants
IOMMU_VEVENTQ_FLAG_LOST_EVENTS: The vEVENTQ has lost vEVENTs
- struct iommufd_vevent_header¶
Virtual Event Header for a vEVENTQ Status
Definition:
struct iommufd_vevent_header { __u32 flags; __u32 sequence; };
Members
flags: Combination of enum iommu_veventq_flag
sequence: The sequence index of a vEVENT in the vEVENTQ, with a range of [0, INT_MAX], where the index following INT_MAX is 0
Description
Each iommufd_vevent_header reports the sequence index of the following vEVENT:
header0 {sequence=0} | data0 | header1 {sequence=1} | data1 | ... | dataN |
This sequence index is expected to be monotonic with respect to the sequence index of the previous vEVENT. If two adjacent sequence indexes have a delta larger than 1, it means that delta - 1 vEVENTs have been lost, e.g. two lost vEVENTs:
... | header3 {sequence=3} | data3 | header6 {sequence=6} | data6 | ... |
If a vEVENT is lost at the tail of the vEVENTQ and there is no following vEVENT providing the next sequence index, an IOMMU_VEVENTQ_FLAG_LOST_EVENTS header will be added to the tail, and no data will follow this header:
header3 {sequence=3} | data3 | header4 {flags=LOST_EVENTS, sequence=4} |
- enum iommu_veventq_type¶
Virtual Event Queue Type
Constants
IOMMU_VEVENTQ_TYPE_DEFAULT: Reserved for future use
IOMMU_VEVENTQ_TYPE_ARM_SMMUV3: ARM SMMUv3 Virtual Event Queue
IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension IRQ
- struct iommu_vevent_arm_smmuv3¶
ARM SMMUv3 Virtual Event (IOMMU_VEVENTQ_TYPE_ARM_SMMUV3)
Definition:
struct iommu_vevent_arm_smmuv3 { __aligned_le64 evt[4]; };
Members
evt: 256-bit ARM SMMUv3 Event record, little-endian. Reported event records (refer to "7.3 Event records" in the SMMUv3 HW Spec):
0x04 C_BAD_STE
0x06 F_STREAM_DISABLED
0x08 C_BAD_SUBSTREAMID
0x0a C_BAD_CD
0x10 F_TRANSLATION
0x11 F_ADDR_SIZE
0x12 F_ACCESS
0x13 F_PERMISSION
Description
The StreamID field reports a virtual device ID. To receive a virtual event for a device, a vDEVICE must be allocated via IOMMU_VDEVICE_ALLOC.
- struct iommu_vevent_tegra241_cmdqv¶
Tegra241 CMDQV IRQ (IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV)
Definition:
struct iommu_vevent_tegra241_cmdqv { __aligned_le64 lvcmdq_err_map[2]; };
Members
lvcmdq_err_map: 128-bit logical vcmdq error map, little-endian. (Refer to the LVCMDQ_ERR_MAP registers per VINTF.)
Description
The 128-bit register value from HW exclusively reflects the error bits for a Virtual Interface represented by a vIOMMU object. Read and report it directly.
- struct iommu_veventq_alloc¶
ioctl(IOMMU_VEVENTQ_ALLOC)
Definition:
struct iommu_veventq_alloc { __u32 size; __u32 flags; __u32 viommu_id; __u32 type; __u32 veventq_depth; __u32 out_veventq_id; __u32 out_veventq_fd; __u32 __reserved; };
Members
size: sizeof(struct iommu_veventq_alloc)
flags: Must be 0
viommu_id: virtual IOMMU ID to associate the vEVENTQ with
type: Type of the vEVENTQ. Must be defined in enum iommu_veventq_type
veventq_depth: Maximum number of events in the vEVENTQ
out_veventq_id: The ID of the new vEVENTQ
out_veventq_fd: The fd of the new vEVENTQ. User space must close the successfully returned fd after using it
__reserved: Must be 0
Description
Explicitly allocate a virtual event queue interface for a vIOMMU. A vIOMMU can have multiple FDs for different types, but is confined to one per type. User space should open the out_veventq_fd to read vEVENTs out of a vEVENTQ, if there are vEVENTs available. A vEVENTQ will lose events due to overflow if the number of vEVENTs hits veventq_depth.
Each vEVENT in a vEVENTQ encloses a struct iommufd_vevent_header followed by a type-specific data structure, in the normal case:
header0 | data0 | header1 | data1 | ... | headerN | dataN |
unless a trailing IOMMU_VEVENTQ_FLAG_LOST_EVENTS header is logged (refer to struct iommufd_vevent_header).
- enum iommu_hw_queue_type¶
HW Queue Type
Constants
IOMMU_HW_QUEUE_TYPE_DEFAULT: Reserved for future use
IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM SMMUv3) Virtual Command Queue (VCMDQ)
- struct iommu_hw_queue_alloc¶
ioctl(IOMMU_HW_QUEUE_ALLOC)
Definition:
struct iommu_hw_queue_alloc { __u32 size; __u32 flags; __u32 viommu_id; __u32 type; __u32 index; __u32 out_hw_queue_id; __aligned_u64 nesting_parent_iova; __aligned_u64 length; };
Members
size: sizeof(struct iommu_hw_queue_alloc)
flags: Must be 0
viommu_id: Virtual IOMMU ID to associate the HW queue with
type: One of enum iommu_hw_queue_type
index: The logical index of the HW queue per virtual IOMMU, for a multi-queue model
out_hw_queue_id: The ID of the new HW queue
nesting_parent_iova: Base address of the queue memory in the guest physical address space
length: Length of the queue memory
Description
Allocate a HW queue object for a vIOMMU-specific HW-accelerated queue, which allows HW to access a guest queue memory described using nesting_parent_iova and length.
A vIOMMU can allocate multiple queues, but it must use a different index per type to separate each allocation, e.g.:
Type1 HW queue0, Type1 HW queue1, Type2 HW queue0, ...
IOMMUFD Kernel API¶
The IOMMUFD kAPI is device-centric, with group-related tricks managed behind the scenes. This allows external drivers calling such kAPIs to implement a simple device-centric uAPI for connecting their devices to an iommufd, instead of explicitly imposing the group semantics in their uAPI as VFIO does.
- struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx, struct device *dev, u32 *id)¶
Bind a physical device to an iommu fd
Parameters
struct iommufd_ctx *ictx: iommufd file descriptor
struct device *dev: Pointer to a physical device struct
u32 *id: Output ID number to return to userspace for this device
Description
A successful bind establishes ownership over the device and returns a struct iommufd_device pointer, otherwise it returns an error pointer.
A driver using this API must set driver_managed_dma and must not touch the device until this routine succeeds and establishes ownership.
Binding a PCI device places the entire RID under iommufd control.
The caller must undo this with iommufd_device_unbind().
- bool iommufd_ctx_has_group(struct iommufd_ctx *ictx, struct iommu_group *group)¶
True if any device within the group is bound to the ictx
Parameters
struct iommufd_ctx *ictx: iommufd file descriptor
struct iommu_group *group: Pointer to a physical iommu_group struct
Description
True if any device within the group has been bound to this ictx, e.g. via iommufd_device_bind(), therefore implying ictx ownership of the group.
- void iommufd_device_unbind(struct iommufd_device *idev)¶
Undo iommufd_device_bind()
Parameters
struct iommufd_device *idev: Device returned by iommufd_device_bind()
Description
Release the device from iommufd control. The DMA ownership will return back to unowned, with DMA controlled by the DMA API. This invalidates the iommufd_device pointer; other APIs that consume it must not be called concurrently.
- int iommufd_device_attach(struct iommufd_device *idev, ioasid_t pasid, u32 *pt_id)¶
Connect a device/pasid to an iommu_domain
Parameters
struct iommufd_device *idev: device to attach
ioasid_t pasid: pasid to attach
u32 *pt_id: Input an IOMMUFD_OBJ_IOAS or IOMMUFD_OBJ_HWPT_PAGING. Output the IOMMUFD_OBJ_HWPT_PAGING ID
Description
This connects the device/pasid to an iommu_domain, either automatically or manually selected. Once this completes the device could do DMA with pasid. pasid is IOMMU_NO_PASID if this attach is for no pasid usage.
The caller should return the resulting pt_id back to userspace. This function is undone by calling iommufd_device_detach().
- int iommufd_device_replace(struct iommufd_device *idev, ioasid_t pasid, u32 *pt_id)¶
Change the device/pasid's iommu_domain
Parameters
struct iommufd_device *idev: device to change
ioasid_t pasid: pasid to change
u32 *pt_id: Input an IOMMUFD_OBJ_IOAS or IOMMUFD_OBJ_HWPT_PAGING. Output the IOMMUFD_OBJ_HWPT_PAGING ID
Description
This is the same as:
iommufd_device_detach();
iommufd_device_attach();
If it fails then no change is made to the attachment. The iommu driver may implement this so there is no disruption in translation. This can only be called if iommufd_device_attach() has already succeeded. pasid is IOMMU_NO_PASID for no pasid usage.
- void iommufd_device_detach(struct iommufd_device *idev, ioasid_t pasid)¶
Disconnect a device/pasid from an iommu_domain
Parameters
struct iommufd_device *idev: device to detach
ioasid_t pasid: pasid to detach
Description
Undo iommufd_device_attach(). This disconnects the idev from the previously attached pt_id. The device returns back to a blocked DMA translation. pasid is IOMMU_NO_PASID for no pasid usage.
- struct iommufd_access *iommufd_access_create(struct iommufd_ctx *ictx, const struct iommufd_access_ops *ops, void *data, u32 *id)¶
Create an iommufd_access
Parameters
struct iommufd_ctx *ictx: iommufd file descriptor
const struct iommufd_access_ops *ops: Driver's ops to associate with the access
void *data: Opaque data to pass into ops functions
u32 *id: Output ID number to return to userspace for this access
Description
An iommufd_access allows a driver to read/write to the IOAS without using DMA. The underlying CPU memory can be accessed using the iommufd_access_pin_pages() or iommufd_access_rw() functions.
The provided ops are required to use iommufd_access_pin_pages().
- void iommufd_access_destroy(struct iommufd_access *access)¶
Destroy an iommufd_access
Parameters
struct iommufd_access *access: The access to destroy
Description
The caller must stop using the access before destroying it.
- void iommufd_access_unpin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length)¶
Undo iommufd_access_pin_pages
Parameters
struct iommufd_access *access: IOAS access to act on
unsigned long iova: Starting IOVA
unsigned long length: Number of bytes to access
Description
Return the struct page pointers. The caller must stop accessing them before calling this. The iova/length must exactly match the ones provided to access_pages.
- int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length, struct page **out_pages, unsigned int flags)¶
Return a list of pages under the iova
Parameters
struct iommufd_access *access: IOAS access to act on
unsigned long iova: Starting IOVA
unsigned long length: Number of bytes to access
struct page **out_pages: Output page list
unsigned int flags: IOMMUFD_ACCESS_RW_* flags
Description
Reads length bytes starting at iova and returns the struct page * pointers. These can be kmap'd by the caller for CPU access.
The caller must perform iommufd_access_unpin_pages() when done to balance this.
This API always requires a page aligned iova. This happens naturally if the ioas alignment is >= PAGE_SIZE and the iova is PAGE_SIZE aligned. However smaller alignments have corner cases where this API can fail on an otherwise aligned iova.
- int iommufd_access_rw(struct iommufd_access *access, unsigned long iova, void *data, size_t length, unsigned int flags)¶
Read or write data under the iova
Parameters
struct iommufd_access *access: IOAS access to act on
unsigned long iova: Starting IOVA
void *data: Kernel buffer to copy to/from
size_t length: Number of bytes to access
unsigned int flags: IOMMUFD_ACCESS_RW_* flags
Description
Copy kernel to/from data into the range given by IOVA/length. If flags indicates IOMMUFD_ACCESS_RW_KTHREAD then a large copy can be optimized by changing it into copy_to/from_user().
- void iommufd_ctx_get(struct iommufd_ctx *ictx)¶
Get a context reference
Parameters
struct iommufd_ctx *ictx: Context to get
Description
The caller must already hold a valid reference to ictx.
- struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)¶
Acquires a reference to the iommufd context
Parameters
struct file *file: File to obtain the reference from
Description
Returns a pointer to the iommufd_ctx, otherwise ERR_PTR. The struct file remains owned by the caller and the caller must still do fput. On success the caller is responsible to call iommufd_ctx_put().
- struct iommufd_ctx *iommufd_ctx_from_fd(int fd)¶
Acquires a reference to the iommufd context
Parameters
int fd: File descriptor to obtain the reference from
Description
Returns a pointer to the iommufd_ctx, otherwise ERR_PTR. On success the caller is responsible to call iommufd_ctx_put().
- void iommufd_ctx_put(struct iommufd_ctx *ictx)¶
Put back a reference
Parameters
struct iommufd_ctx *ictx: Context to put back
VFIO and IOMMUFD¶
Connecting a VFIO device to iommufd can be done in two ways.
First is a VFIO compatible way, by directly implementing the /dev/vfio/vfio container IOCTLs by mapping them into io_pagetable operations. Doing so allows the use of iommufd in legacy VFIO applications by symlinking /dev/vfio/vfio to /dev/iommu, or by extending VFIO to SET_CONTAINER using an iommufd instead of a container fd.
The second approach directly extends VFIO to support a new set of device-centric user APIs based on the aforementioned IOMMUFD kernel API. It requires userspace changes but better matches the IOMMUFD API semantics and makes it easier to support new iommufd features, when compared to the first approach.
Currently both approaches are still work-in-progress.
There are still a few gaps to be resolved to catch up with VFIO type1, as documented in iommufd_vfio_check_extension().
Future TODOs¶
Currently IOMMUFD supports only kernel-managed I/O page tables, similar to VFIO type1. New features on the radar include:
Binding iommu_domain’s to PASID/SSID
Userspace page tables, for ARM, x86 and S390
Kernel-bypassed invalidation of user page tables
Re-use of the KVM page table in the IOMMU
Dirty page tracking in the IOMMU
Runtime Increase/Decrease of IOPTE size
PRI support with faults resolved in userspace