IOMMUFD

Author:

Jason Gunthorpe

Author:

Kevin Tian

Overview

IOMMUFD is the user API to control the IOMMU subsystem as it relates to managing IO page tables from userspace using file descriptors. It intends to be general and consumable by any driver that wants to expose DMA to userspace. These drivers are eventually expected to deprecate any internal IOMMU logic they may already/historically implement (e.g. vfio_iommu_type1.c).

At minimum iommufd provides universal support of managing I/O address spaces and I/O page tables for all IOMMUs, with room in the design to add non-generic features to cater to specific hardware functionality.

In this context the capital letter (IOMMUFD) refers to the subsystem while the small letter (iommufd) refers to the file descriptors created via /dev/iommu for use by userspace.

Key Concepts

User Visible Objects

The following IOMMUFD objects are exposed to userspace:

  • IOMMUFD_OBJ_IOAS, representing an I/O address space (IOAS), allowing map/unmap of user space memory into ranges of I/O Virtual Address (IOVA).

    The IOAS is a functional replacement for the VFIO container, and like the VFIO container it copies an IOVA map to a list of iommu_domains held within it.

  • IOMMUFD_OBJ_DEVICE, representing a device that is bound to iommufd by an external driver.

  • IOMMUFD_OBJ_HWPT_PAGING, representing an actual hardware I/O page table (i.e. a single struct iommu_domain) managed by the iommu driver. “PAGING” primarily indicates this type of HWPT should be linked to an IOAS. It also indicates that it is backed by an iommu_domain with the __IOMMU_DOMAIN_PAGING feature flag. This can be either an UNMANAGED stage-1 domain for a device running in the user space, or a nesting parent stage-2 domain for mappings from guest-level physical addresses to host-level physical addresses.

    The IOAS has a list of HWPT_PAGINGs that share the same IOVA mapping and it will synchronize its mapping with each member HWPT_PAGING.

  • IOMMUFD_OBJ_HWPT_NESTED, representing an actual hardware I/O page table (i.e. a single struct iommu_domain) managed by user space (e.g. guest OS). “NESTED” indicates that this type of HWPT should be linked to an HWPT_PAGING. It also indicates that it is backed by an iommu_domain that has a type of IOMMU_DOMAIN_NESTED. This must be a stage-1 domain for a device running in the user space (e.g. in a guest VM enabling the IOMMU nested translation feature.) As such, it must be created with a given nesting parent stage-2 domain to associate to. This nested stage-1 page table managed by the user space usually has mappings from guest-level I/O virtual addresses to guest-level physical addresses.

  • IOMMUFD_OBJ_FAULT, representing a software queue for an HWPT reporting IO page faults using the IOMMU HW’s PRI (Page Request Interface). This queue object provides user space an FD to poll the page fault events and also to respond to those events. A FAULT object must be created first to get a fault_id that can then be used to allocate a fault-enabled HWPT via the IOMMU_HWPT_ALLOC command, by setting the IOMMU_HWPT_FAULT_ID_VALID bit in its flags field.

  • IOMMUFD_OBJ_VIOMMU, representing a slice of the physical IOMMU instance, passed to or shared with a VM. It may encapsulate some HW-accelerated virtualization features and some SW resources used by the VM. For example:

    • Security namespace for guest owned ID, e.g. guest-controlled cache tags

    • Non-device-affiliated event reporting, e.g. invalidation queue errors

    • Access to a shareable nesting parent pagetable across physical IOMMUs

    • Virtualization of various platform IDs, e.g. RIDs and others

    • Delivery of paravirtualized invalidation

    • Direct assigned invalidation queues

    • Direct assigned interrupts

    Such a vIOMMU object generally has access to a nesting parent pagetable to support some HW-accelerated virtualization features. So, a vIOMMU object must be created given a nesting parent HWPT_PAGING object, and then it would encapsulate that HWPT_PAGING object. Therefore, a vIOMMU object can be used to allocate an HWPT_NESTED object in place of the encapsulated HWPT_PAGING.

    Note

    The name “vIOMMU” isn’t necessarily identical to a virtualized IOMMU in a VM. A VM can have one giant virtualized IOMMU running on a machine having multiple physical IOMMUs, in which case the VMM will dispatch the requests or configurations from this single virtualized IOMMU instance to multiple vIOMMU objects created for individual slices of different physical IOMMUs. In other words, a vIOMMU object is always a representation of one physical IOMMU, not necessarily of a virtualized IOMMU. For VMMs that want the full virtualization features from physical IOMMUs, it is suggested to build the same number of virtualized IOMMUs as the number of physical IOMMUs, so the passed-through devices would be connected to their own virtualized IOMMUs backed by corresponding vIOMMU objects, in which case a guest OS would do the “dispatch” naturally instead of VMM trappings.

  • IOMMUFD_OBJ_VDEVICE, representing a virtual device for an IOMMUFD_OBJ_DEVICE against an IOMMUFD_OBJ_VIOMMU. This virtual device holds the device’s virtual information or attributes (related to the vIOMMU) in a VM. An immediate vDATA example can be the virtual ID of the device on a vIOMMU, which is a unique ID that VMM assigns to the device for a translation channel/port of the vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID of AMD IOMMU, and vRID of Intel VT-d to a Context Table. Potential use cases of some advanced security information can be forwarded via this object too, such as security level or realm information in a Confidential Compute Architecture. A VMM should create a vDEVICE object to forward all the device information in a VM, when it connects a device to a vIOMMU, which is a separate ioctl call from attaching the same device to an HWPT_PAGING that the vIOMMU holds.

  • IOMMUFD_OBJ_VEVENTQ, representing a software queue for a vIOMMU to report its events such as translation faults occurring at a nested stage-1 (excluding I/O page faults that should go through IOMMUFD_OBJ_FAULT) and HW-specific events. This queue object provides user space an FD to poll/read the vIOMMU events. A vIOMMU object must be created first to get its viommu_id, which can then be used to allocate a vEVENTQ. Each vIOMMU can support multiple types of vEVENTS, but is confined to one vEVENTQ per vEVENTQ type.

  • IOMMUFD_OBJ_HW_QUEUE, representing a hardware accelerated queue, as a subset of IOMMU’s virtualization features, for the IOMMU HW to directly read or write the virtual queue memory owned by a guest OS. This HW-acceleration feature can allow VM to work with the IOMMU HW directly without a VM Exit, so as to reduce overhead from the hypercalls. Along with the HW QUEUE object, iommufd provides user space an mmap interface for VMM to mmap a physical MMIO region from the host physical address space to the guest physical address space, allowing the guest OS to directly control the allocated HW QUEUE. Thus, when allocating a HW QUEUE, the VMM must request a pair of mmap info (offset/length) and pass them exactly to an mmap syscall via its offset and length arguments.

All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.

The diagrams below show relationships between user-visible objects and kernel datastructures (external to iommufd), with numbers referring to the operations creating the objects and links:

 _______________________________________________________________________
|                      iommufd (HWPT_PAGING only)                       |
|                                                                       |
|        [1]                  [3]                                [2]    |
|  ________________      _____________                        ________  |
| |                |    |             |                      |        | |
| |      IOAS      |<---| HWPT_PAGING |<---------------------| DEVICE | |
| |________________|    |_____________|                      |________| |
|         |                    |                                  |     |
|_________|____________________|__________________________________|_____|
          |                    |                                  |
          |              ______v_____                          ___v__
          | PFN storage |  (paging)  |                        |struct|
          |------------>|iommu_domain|<-----------------------|device|
                        |____________|                        |______|

 _______________________________________________________________________
|                      iommufd (with HWPT_NESTED)                       |
|                                                                       |
|        [1]                  [3]                [4]             [2]    |
|  ________________      _____________      _____________     ________  |
| |                |    |             |    |             |   |        | |
| |      IOAS      |<---| HWPT_PAGING |<---| HWPT_NESTED |<--| DEVICE | |
| |________________|    |_____________|    |_____________|   |________| |
|         |                    |                  |               |     |
|_________|____________________|__________________|_______________|_____|
          |                    |                  |               |
          |              ______v_____       ______v_____       ___v__
          | PFN storage |  (paging)  |     |  (nested)  |     |struct|
          |------------>|iommu_domain|<----|iommu_domain|<----|device|
                        |____________|     |____________|     |______|

 _______________________________________________________________________
|                      iommufd (with vIOMMU/vDEVICE)                    |
|                                                                       |
|                             [5]                [6]                    |
|                        _____________      _____________               |
|                       |             |    |             |              |
|      |----------------|    vIOMMU   |<---|   vDEVICE   |<----|        |
|      |                |             |    |_____________|     |        |
|      |                |             |                        |        |
|      |      [1]       |             |          [4]           | [2]    |
|      |     ______     |             |     _____________     _|______  |
|      |    |      |    |     [3]     |    |             |   |        | |
|      |    | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
|      |    |______|    |_____________|    |_____________|   |________| |
|      |        |              |                  |               |     |
|______|________|______________|__________________|_______________|_____|
       |        |              |                  |               |
 ______v_____   |        ______v_____       ______v_____       ___v__
|   struct   |  |  PFN  |  (paging)  |     |  (nested)  |     |struct|
|iommu_device|  |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________|  |storage|____________|     |____________|     |______|

  1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. An iommufd can hold multiple IOAS objects. IOAS is the most generic object and does not expose interfaces that are specific to single IOMMU drivers. All operations on the IOAS must operate equally on each of the iommu_domains inside of it.

  2. IOMMUFD_OBJ_DEVICE is created when an external driver calls the IOMMUFD kAPI to bind a device to an iommufd. The driver is expected to implement a set of ioctls to allow userspace to initiate the binding operation. Successful completion of this operation establishes the desired DMA ownership over the device. The driver must also set the driver_managed_dma flag and must not touch the device until this operation succeeds.

  3. IOMMUFD_OBJ_HWPT_PAGING can be created in two ways:

    • IOMMUFD_OBJ_HWPT_PAGING is automatically created when an external driver calls the IOMMUFD kAPI to attach a bound device to an IOAS. Similarly the external driver uAPI allows userspace to initiate the attaching operation. If a compatible member HWPT_PAGING object exists in the IOAS’s HWPT_PAGING list, then it will be reused. Otherwise a new HWPT_PAGING that represents an iommu_domain to userspace will be created, and then added to the list. Successful completion of this operation sets up the linkages among IOAS, device and iommu_domain. Once this completes the device could do DMA.

    • IOMMUFD_OBJ_HWPT_PAGING can be manually created via the IOMMU_HWPT_ALLOC uAPI, provided an ioas_id via @pt_id to associate the new HWPT_PAGING to the corresponding IOAS object. The benefit of this manual allocation is to allow allocation flags (defined in enum iommufd_hwpt_alloc_flags), e.g. it allocates a nesting parent HWPT_PAGING if the IOMMU_HWPT_ALLOC_NEST_PARENT flag is set.

  4. IOMMUFD_OBJ_HWPT_NESTED can be only manually created via the IOMMU_HWPT_ALLOC uAPI, provided an hwpt_id or a viommu_id of a vIOMMU object encapsulating a nesting parent HWPT_PAGING via @pt_id to associate the new HWPT_NESTED object to the corresponding HWPT_PAGING object. The associating HWPT_PAGING object must be a nesting parent manually allocated via the same uAPI previously with an IOMMU_HWPT_ALLOC_NEST_PARENT flag, otherwise the allocation will fail. The allocation will be further validated by the IOMMU driver to ensure that the nesting parent domain and the nested domain being allocated are compatible. Successful completion of this operation sets up linkages among IOAS, device, and iommu_domains. Once this completes the device could do DMA via a 2-stage translation, a.k.a nested translation. Note that multiple HWPT_NESTED objects can be allocated by (and then associated to) the same nesting parent.

    Note

    Either a manual IOMMUFD_OBJ_HWPT_PAGING or an IOMMUFD_OBJ_HWPT_NESTED is created via the same IOMMU_HWPT_ALLOC uAPI. The difference is the type of the object passed in via the @pt_id field of struct iommufd_hwpt_alloc.

  5. IOMMUFD_OBJ_VIOMMU can be only manually created via the IOMMU_VIOMMU_ALLOC uAPI, provided a dev_id (for the device’s physical IOMMU to back the vIOMMU) and an hwpt_id (to associate the vIOMMU to a nesting parent HWPT_PAGING). The iommufd core will link the vIOMMU object to the struct iommu_device that the struct device is behind. And an IOMMU driver can implement a viommu_alloc op to allocate its own vIOMMU data structure embedding the core-level struct iommufd_viommu and some driver-specific data. If necessary, the driver can also configure its HW virtualization feature for that vIOMMU (and thus for the VM). Successful completion of this operation sets up the linkages between the vIOMMU object and the HWPT_PAGING, then this vIOMMU object can be used as a nesting parent object to allocate an HWPT_NESTED object described above.

  6. IOMMUFD_OBJ_VDEVICE can be only manually created via the IOMMU_VDEVICE_ALLOC uAPI, provided a viommu_id for an iommufd_viommu object and a dev_id for an iommufd_device object. The vDEVICE object will be the binding between these two parent objects. Another @virt_id will be also set via the uAPI providing the iommufd core an index to store the vDEVICE object to a vDEVICE array per vIOMMU. If necessary, the IOMMU driver may choose to implement a vdevice_alloc op to init its HW for virtualization features related to a vDEVICE. Successful completion of this operation sets up the linkages between vIOMMU and device.

A device can only bind to an iommufd due to DMA ownership claim and attach to at most one IOAS object (no support of PASID yet).

Kernel Datastructure

User visible objects are backed by the following datastructures:

  • iommufd_ioas for IOMMUFD_OBJ_IOAS.

  • iommufd_device for IOMMUFD_OBJ_DEVICE.

  • iommufd_hwpt_paging for IOMMUFD_OBJ_HWPT_PAGING.

  • iommufd_hwpt_nested for IOMMUFD_OBJ_HWPT_NESTED.

  • iommufd_fault for IOMMUFD_OBJ_FAULT.

  • iommufd_viommu for IOMMUFD_OBJ_VIOMMU.

  • iommufd_vdevice for IOMMUFD_OBJ_VDEVICE.

  • iommufd_veventq for IOMMUFD_OBJ_VEVENTQ.

  • iommufd_hw_queue for IOMMUFD_OBJ_HW_QUEUE.

Several terminologies are used when looking at these datastructures:

  • Automatic domain - refers to an iommu domain created automatically when attaching a device to an IOAS object. This is compatible with the semantics of VFIO type1.

  • Manual domain - refers to an iommu domain designated by the user as the target pagetable to be attached to by a device. Though currently there are no uAPIs to directly create such a domain, the datastructure and algorithms are ready for handling that use case.

  • In-kernel user - refers to something like a VFIO mdev that is using the IOMMUFD access interface to access the IOAS. This starts by creating an iommufd_access object that is similar to the domain binding a physical device would do. The access object will then allow converting IOVA ranges into struct page * lists, or doing direct read/write to an IOVA.

iommufd_ioas serves as the metadata datastructure to manage how IOVA ranges are mapped to memory pages, composed of:

  • struct io_pagetable holding the IOVA map

  • struct iopt_area’s representing populated portions of IOVA

  • struct iopt_pages representing the storage of PFNs

  • struct iommu_domain representing the IO page table in the IOMMU

  • struct iopt_pages_access representing in-kernel users of PFNs

  • struct xarray pinned_pfns holding a list of pages pinned by in-kernel users

Each iopt_pages represents a logical linear array of full PFNs. The PFNs are ultimately derived from userspace VAs via an mm_struct. Once they have been pinned the PFNs are stored in IOPTEs of an iommu_domain or inside the pinned_pfns xarray if they have been pinned through an iommufd_access.

PFNs have to be copied between all combinations of storage locations, depending on what domains are present and what kinds of in-kernel “software access” users exist. The mechanism ensures that a page is pinned only once.

An io_pagetable is composed of iopt_areas pointing at iopt_pages, along with a list of iommu_domains that mirror the IOVA to PFN map.

Multiple io_pagetable-s, through their iopt_area-s, can share a single iopt_pages which avoids multi-pinning and double accounting of page consumption.

iommufd_ioas is shareable between subsystems, e.g. VFIO and VDPA, as long as devices managed by different subsystems are bound to the same iommufd.

IOMMUFD User API

General ioctl format

The ioctl interface follows a general format to allow for extensibility. Each ioctl is passed in a structure pointer as the argument providing the size of the structure in the first u32. The kernel checks that any structure space beyond what it understands is 0. This allows userspace to use the backward compatible portion while consistently using the newer, larger, structures.

ioctls use a standard meaning for common errnos:

  • ENOTTY: The IOCTL number itself is not supported at all

  • E2BIG: The IOCTL number is supported, but the provided structure has non-zero in a part the kernel does not understand.

  • EOPNOTSUPP: The IOCTL number is supported, and the structure is understood, however a known field has a value the kernel does not understand or support.

  • EINVAL: Everything about the IOCTL was understood, but a field is not correct.

  • ENOENT: An ID or IOVA provided does not exist.

  • ENOMEM: Out of memory.

  • EOVERFLOW: Mathematics overflowed.

Specific ioctls may return additional errnos as well.

struct iommu_destroy

ioctl(IOMMU_DESTROY)

Definition:

struct iommu_destroy {
    __u32 size;
    __u32 id;
};

Members

size

sizeof(struct iommu_destroy)

id

iommufd object ID to destroy. Can be any destroyable object type.

Description

Destroy any object held within iommufd.

struct iommu_ioas_alloc

ioctl(IOMMU_IOAS_ALLOC)

Definition:

struct iommu_ioas_alloc {
    __u32 size;
    __u32 flags;
    __u32 out_ioas_id;
};

Members

size

sizeof(struct iommu_ioas_alloc)

flags

Must be 0

out_ioas_id

Output IOAS ID for the allocated object

Description

Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA) to memory mapping.

struct iommu_iova_range

ioctl(IOMMU_IOVA_RANGE)

Definition:

struct iommu_iova_range {
    __aligned_u64 start;
    __aligned_u64 last;
};

Members

start

First IOVA

last

Inclusive last IOVA

Description

An interval in IOVA space.

struct iommu_ioas_iova_ranges

ioctl(IOMMU_IOAS_IOVA_RANGES)

Definition:

struct iommu_ioas_iova_ranges {
    __u32 size;
    __u32 ioas_id;
    __u32 num_iovas;
    __u32 __reserved;
    __aligned_u64 allowed_iovas;
    __aligned_u64 out_iova_alignment;
};

Members

size

sizeof(struct iommu_ioas_iova_ranges)

ioas_id

IOAS ID to read ranges from

num_iovas

Input/Output total number of ranges in the IOAS

__reserved

Must be 0

allowed_iovas

Pointer to the output array of struct iommu_iova_range

out_iova_alignment

Minimum alignment required for mapping IOVA

Description

Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges is not allowed. num_iovas will be set to the total number of iovas and the allowed_iovas[] will be filled in as space permits.

The allowed ranges are dependent on the HW path the DMA operation takes, and can change during the lifetime of the IOAS. A fresh empty IOAS will have a full range, and each attached device will narrow the ranges based on that device’s HW restrictions. Detaching a device can widen the ranges. Userspace should query ranges after every attach/detach to know what IOVAs are valid for mapping.

On input num_iovas is the length of the allowed_iovas array. On output it is the total number of iovas filled in. The ioctl will return -EMSGSIZE and set num_iovas to the required value if num_iovas is too small. In this case the caller should allocate a larger output array and re-issue the ioctl.

out_iova_alignment returns the minimum IOVA alignment that can be given to IOMMU_IOAS_MAP/COPY. IOVA’s must satisfy:

starting_iova % out_iova_alignment == 0
(starting_iova + length) % out_iova_alignment == 0

out_iova_alignment can be 1 indicating any IOVA is allowed. It cannot be higher than the system PAGE_SIZE.

struct iommu_ioas_allow_iovas

ioctl(IOMMU_IOAS_ALLOW_IOVAS)

Definition:

struct iommu_ioas_allow_iovas {
    __u32 size;
    __u32 ioas_id;
    __u32 num_iovas;
    __u32 __reserved;
    __aligned_u64 allowed_iovas;
};

Members

size

sizeof(struct iommu_ioas_allow_iovas)

ioas_id

IOAS ID to allow IOVAs from

num_iovas

Input/Output total number of ranges in the IOAS

__reserved

Must be 0

allowed_iovas

Pointer to array of struct iommu_iova_range

Description

Ensure a range of IOVAs are always available for allocation. If this call succeeds then IOMMU_IOAS_IOVA_RANGES will never return a list of IOVA ranges that are narrower than the ranges provided here. This call will fail if IOMMU_IOAS_IOVA_RANGES is currently narrower than the given ranges.

When an IOAS is first created the IOVA_RANGES will be maximally sized, and as devices are attached the IOVA will narrow based on the device restrictions. When an allowed range is specified any narrowing will be refused, ie device attachment can fail if the device requires limiting within the allowed range.

Automatic IOVA allocation is also impacted by this call. MAP will only allocate within the allowed IOVAs if they are present.

This call replaces the entire allowed list with the given list.

enum iommufd_ioas_map_flags

Flags for map and copy

Constants

IOMMU_IOAS_MAP_FIXED_IOVA

If clear the kernel will compute an appropriate IOVA to place the mapping at

IOMMU_IOAS_MAP_WRITEABLE

DMA is allowed to write to this mapping

IOMMU_IOAS_MAP_READABLE

DMA is allowed to read from this mapping

struct iommu_ioas_map

ioctl(IOMMU_IOAS_MAP)

Definition:

struct iommu_ioas_map {
    __u32 size;
    __u32 flags;
    __u32 ioas_id;
    __u32 __reserved;
    __aligned_u64 user_va;
    __aligned_u64 length;
    __aligned_u64 iova;
};

Members

size

sizeof(struct iommu_ioas_map)

flags

Combination of enum iommufd_ioas_map_flags

ioas_id

IOAS ID to change the mapping of

__reserved

Must be 0

user_va

Userspace pointer to start mapping from

length

Number of bytes to map

iova

IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set then this must be provided as input.

Description

Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the mapping will be established at iova, otherwise a suitable location based on the reserved and allowed lists will be automatically selected and returned in iova.

If IOMMU_IOAS_MAP_FIXED_IOVA is specified then the iova range must currently be unused, existing IOVA cannot be replaced.

struct iommu_ioas_map_file

ioctl(IOMMU_IOAS_MAP_FILE)

Definition:

struct iommu_ioas_map_file {
    __u32 size;
    __u32 flags;
    __u32 ioas_id;
    __s32 fd;
    __aligned_u64 start;
    __aligned_u64 length;
    __aligned_u64 iova;
};

Members

size

sizeof(struct iommu_ioas_map_file)

flags

same as for iommu_ioas_map

ioas_id

same as for iommu_ioas_map

fd

the memfd to map

start

byte offset from start of file to map from

length

same as for iommu_ioas_map

iova

same as for iommu_ioas_map

Description

Set an IOVA mapping from a memfd file. All other arguments and semantics match those of IOMMU_IOAS_MAP.

struct iommu_ioas_copy

ioctl(IOMMU_IOAS_COPY)

Definition:

struct iommu_ioas_copy {
    __u32 size;
    __u32 flags;
    __u32 dst_ioas_id;
    __u32 src_ioas_id;
    __aligned_u64 length;
    __aligned_u64 dst_iova;
    __aligned_u64 src_iova;
};

Members

size

sizeof(struct iommu_ioas_copy)

flags

Combination of enum iommufd_ioas_map_flags

dst_ioas_id

IOAS ID to change the mapping of

src_ioas_id

IOAS ID to copy from

length

Number of bytes to copy and map

dst_iova

IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set then this must be provided as input.

src_iova

IOVA to start the copy

Description

Copy an already existing mapping from src_ioas_id and establish it in dst_ioas_id. The src iova/length must exactly match a range used with IOMMU_IOAS_MAP.

This may be used to efficiently clone a subset of an IOAS to another, or as akind of ‘cache’ to speed up mapping. Copy has an efficiency advantage overestablishing equivalent new mappings, as internal resources are shared, andthe kernel will pin the user memory only once.

struct iommu_ioas_unmap

ioctl(IOMMU_IOAS_UNMAP)

Definition:

struct iommu_ioas_unmap {
    __u32 size;
    __u32 ioas_id;
    __aligned_u64 iova;
    __aligned_u64 length;
};

Members

size

sizeof(struct iommu_ioas_unmap)

ioas_id

IOAS ID to change the mapping of

iova

IOVA to start the unmapping at

length

Number of bytes to unmap, and return back the bytes unmapped

Description

Unmap an IOVA range. The iova/length must be a superset of a previously mapped range used with IOMMU_IOAS_MAP or IOMMU_IOAS_COPY. Splitting or truncating ranges is not allowed. The values 0 to U64_MAX will unmap everything.

enum iommufd_option

ioctl(IOMMU_OPTION_RLIMIT_MODE) and ioctl(IOMMU_OPTION_HUGE_PAGES)

Constants

IOMMU_OPTION_RLIMIT_MODE

Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege to invoke this. Value 0 (default) is user based accounting, 1 uses process based accounting. Global option, object_id must be 0

IOMMU_OPTION_HUGE_PAGES

Value 1 (default) allows contiguous pages to be combined when generating iommu mappings. Value 0 disables combining, everything is mapped to PAGE_SIZE. This can be useful for benchmarking. This is a per-IOAS option, the object_id must be the IOAS ID.

enum iommufd_option_ops

ioctl(IOMMU_OPTION_OP_SET) and ioctl(IOMMU_OPTION_OP_GET)

Constants

IOMMU_OPTION_OP_SET

Set the option’s value

IOMMU_OPTION_OP_GET

Get the option’s value

struct iommu_option

iommu option multiplexer

Definition:

struct iommu_option {
    __u32 size;
    __u32 option_id;
    __u16 op;
    __u16 __reserved;
    __u32 object_id;
    __aligned_u64 val64;
};

Members

size

sizeof(struct iommu_option)

option_id

One of enum iommufd_option

op

One of enum iommufd_option_ops

__reserved

Must be 0

object_id

ID of the object if required

val64

Option value to set or value returned on get

Description

Change a simple option value. This multiplexor allows controlling options on objects. IOMMU_OPTION_OP_SET will load an option and IOMMU_OPTION_OP_GET will return the current value.

enum iommufd_vfio_ioas_op

IOMMU_VFIO_IOAS_* ioctls

Constants

IOMMU_VFIO_IOAS_GET

Get the current compatibility IOAS

IOMMU_VFIO_IOAS_SET

Change the current compatibility IOAS

IOMMU_VFIO_IOAS_CLEAR

Disable VFIO compatibility

struct iommu_vfio_ioas

ioctl(IOMMU_VFIO_IOAS)

Definition:

struct iommu_vfio_ioas {
    __u32 size;
    __u32 ioas_id;
    __u16 op;
    __u16 __reserved;
};

Members

size

sizeof(struct iommu_vfio_ioas)

ioas_id

For IOMMU_VFIO_IOAS_SET the input IOAS ID to set; for IOMMU_VFIO_IOAS_GET will output the IOAS ID

op

One of enum iommufd_vfio_ioas_op

__reserved

Must be 0

Description

The VFIO compatibility support uses a single ioas because VFIO APIs do not support the ID field. Set or Get the IOAS that VFIO compatibility will use. When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the compatibility ioas, either by taking what is already set, or auto creating one. From then on VFIO will continue to use that ioas and is not affected by this ioctl. SET or CLEAR does not destroy any auto-created IOAS.

enum iommufd_hwpt_alloc_flags

Flags for HWPT allocation

Constants

IOMMU_HWPT_ALLOC_NEST_PARENT

If set, allocate a HWPT that can serve as the parent HWPT in a nesting configuration.

IOMMU_HWPT_ALLOC_DIRTY_TRACKING

Dirty tracking support for device IOMMU is enforced on device attachment

IOMMU_HWPT_FAULT_ID_VALID

The fault_id field of hwpt allocation data is valid.

IOMMU_HWPT_ALLOC_PASID

Requests a domain that can be used with PASID. The domain can be attached to any PASID on the device. Any domain attached to the non-PASID part of the device must also be flagged, otherwise attaching a PASID will be blocked. For the user that wants to attach PASID, ioas is not recommended for both the non-PASID part and PASID part of the device. If the IOMMU does not support PASID it will return an error (-EOPNOTSUPP).

enum iommu_hwpt_vtd_s1_flags

Intel VT-d stage-1 page table entry attributes

Constants

IOMMU_VTD_S1_SRE

Supervisor request

IOMMU_VTD_S1_EAFE

Extended access enable

IOMMU_VTD_S1_WPE

Write protect enable

struct iommu_hwpt_vtd_s1

Intel VT-d stage-1 page table info (IOMMU_HWPT_DATA_VTD_S1)

Definition:

struct iommu_hwpt_vtd_s1 {
    __aligned_u64 flags;
    __aligned_u64 pgtbl_addr;
    __u32 addr_width;
    __u32 __reserved;
};

Members

flags

Combination of enum iommu_hwpt_vtd_s1_flags

pgtbl_addr

The base address of the stage-1 page table.

addr_width

The address width of the stage-1 page table

__reserved

Must be 0

struct iommu_hwpt_arm_smmuv3

ARM SMMUv3 nested STE (IOMMU_HWPT_DATA_ARM_SMMUV3)

Definition:

struct iommu_hwpt_arm_smmuv3 {
    __aligned_le64 ste[2];
};

Members

ste

The first two double words of the user space Stream Table Entry for the translation. Must be little-endian. Allowed fields (refer to “5.2 Stream Table Entry” in the SMMUv3 HW Spec):

- word-0: V, Cfg, S1Fmt, S1ContextPtr, S1CDMax
- word-1: EATS, S1DSS, S1CIR, S1COR, S1CSH, S1STALLD

Description

-EIO will be returned if ste is not legal or contains any non-allowed field. Cfg can be used to select a S1, Bypass or Abort configuration. A Bypass nested domain will translate the same as the nesting parent. The S1 will install a Context Descriptor Table pointing at userspace memory translated by the nesting parent.

It’s suggested to allocate a vDEVICE object carrying vSID and then re-attach the nested domain, as soon as the vSID is available in the VMM level:

  • when Cfg=translate, a vDEVICE must be allocated prior to attaching to the allocated nested domain, as CD/ATS invalidations and vevents need a vSID.

  • when Cfg=bypass/abort, a vDEVICE is not enforced during the nested domain attachment, to support a GBPA case where VM sets CR0.SMMUEN=0. However, if VM sets CR0.SMMUEN=1 while missing a vDEVICE object, kernel would fail to report events to the VM. E.g. F_TRANSLATION when guest STE.Cfg=abort.

enum iommu_hwpt_data_type

IOMMU HWPT Data Type

Constants

IOMMU_HWPT_DATA_NONE

no data

IOMMU_HWPT_DATA_VTD_S1

Intel VT-d stage-1 page table

IOMMU_HWPT_DATA_ARM_SMMUV3

ARM SMMUv3 Context Descriptor Table

struct iommu_hwpt_alloc

ioctl(IOMMU_HWPT_ALLOC)

Definition:

struct iommu_hwpt_alloc {
    __u32 size;
    __u32 flags;
    __u32 dev_id;
    __u32 pt_id;
    __u32 out_hwpt_id;
    __u32 __reserved;
    __u32 data_type;
    __u32 data_len;
    __aligned_u64 data_uptr;
    __u32 fault_id;
    __u32 __reserved2;
};

Members

size

sizeof(struct iommu_hwpt_alloc)

flags

Combination of enum iommufd_hwpt_alloc_flags

dev_id

The device to allocate this HWPT for

pt_id

The IOAS or HWPT or vIOMMU to connect this HWPT to

out_hwpt_id

The ID of the new HWPT

__reserved

Must be 0

data_type

One of enum iommu_hwpt_data_type

data_len

Length of the type specific data

data_uptr

User pointer to the type specific data

fault_id

The ID of the IOMMUFD_FAULT object. Valid only if IOMMU_HWPT_FAULT_ID_VALID is set in the flags field.

__reserved2

Padding to 64-bit alignment. Must be 0.

Description

Explicitly allocate a hardware page table object. This is the same object type that is returned by iommufd_device_attach() and represents the underlying iommu driver's iommu_domain kernel object.

A kernel-managed HWPT will be created with the mappings from the given IOAS via the pt_id. The data_type for this allocation must be set to IOMMU_HWPT_DATA_NONE. The HWPT can be allocated as a parent HWPT for a nesting configuration by passing IOMMU_HWPT_ALLOC_NEST_PARENT via flags.

A user-managed nested HWPT will be created from a given vIOMMU (wrapping a parent HWPT) or a parent HWPT via pt_id, in which case the parent HWPT must have been allocated previously via the same ioctl from a given IOAS (pt_id). In this case, the data_type must be set to a pre-defined type corresponding to an I/O page table type supported by the underlying IOMMU hardware. The device via dev_id and the vIOMMU via pt_id must be associated to the same IOMMU instance.

If the data_type is set to IOMMU_HWPT_DATA_NONE, data_len and data_uptr should be zero. Otherwise, both data_len and data_uptr must be given.

enum iommu_hw_info_vtd_flags

Flags for VT-d hw_info

Constants

IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17

If set, disallow read-only mappings on a nested_parent domain. https://www.intel.com/content/www/us/en/content-details/772415/content-details.html

struct iommu_hw_info_vtd

Intel VT-d hardware information

Definition:

struct iommu_hw_info_vtd {
    __u32 flags;
    __u32 __reserved;
    __aligned_u64 cap_reg;
    __aligned_u64 ecap_reg;
};

Members

flags

Combination of enum iommu_hw_info_vtd_flags

__reserved

Must be 0

cap_reg

Value of the Intel VT-d capability register defined in VT-d spec section 11.4.2 Capability Register.

ecap_reg

Value of the Intel VT-d extended capability register defined in VT-d spec section 11.4.3 Extended Capability Register.

Description

User needs to understand the Intel VT-d specification to decode theregister value.

struct iommu_hw_info_arm_smmuv3

ARM SMMUv3 hardware information (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)

Definition:

struct iommu_hw_info_arm_smmuv3 {
    __u32 flags;
    __u32 __reserved;
    __u32 idr[6];
    __u32 iidr;
    __u32 aidr;
};

Members

flags

Must be set to 0

__reserved

Must be 0

idr

Implemented features for ARM SMMU Non-secure programming interface

iidr

Information about the implementation and implementer of ARM SMMU,and architecture version supported

aidr

ARM SMMU architecture version

Description

For the details of idr, iidr and aidr, please refer to chapters 6.3.1 to 6.3.6 in the SMMUv3 Spec.

This reports the raw HW capability, and not all bits are meaningful to beread by userspace. Only the following fields should be used:

idr[0]: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16, TTF
idr[1]: SIDSIZE, SSIDSIZE
idr[3]: BBML, RIL
idr[5]: VAX, GRAN64K, GRAN16K, GRAN4K

  • S1P should be assumed to be true if a NESTED HWPT can be created

  • VFIO/iommufd only support platforms with COHACC; it should be assumed to be true.

  • ATS is a per-device property. If the VMM describes any devices as ATScapable in ACPI/DT it should set the corresponding idr.

This list may expand in the future (e.g. E0PD, AIE, PBHA, D128, DS, etc.). It is important that VMMs do not read bits outside the list to allow for compatibility with future kernels. Several features in the SMMUv3 architecture are not currently supported by the kernel for nesting: HTTU, BTM, MPAM and others.

struct iommu_hw_info_tegra241_cmdqv

NVIDIA Tegra241 CMDQV Hardware Information (IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV)

Definition:

struct iommu_hw_info_tegra241_cmdqv {
    __u32 flags;
    __u8 version;
    __u8 log2vcmdqs;
    __u8 log2vsids;
    __u8 __reserved;
};

Members

flags

Must be 0

version

Version number for the CMDQ-V HW for PARAM bits[03:00]

log2vcmdqs

Log2 of the total number of VCMDQs for PARAM bits[07:04]

log2vsids

Log2 of the total number of SID replacements for PARAM bits[15:12]

__reserved

Must be 0

Description

VMM can use these fields directly in its emulated global PARAM register. Notethat only one Virtual Interface (VINTF) should be exposed to a VM, i.e. PARAMbits[11:08] should be set to 0 for log2 of the total number of VINTFs.

enum iommu_hw_info_type

IOMMU Hardware Info Types

Constants

IOMMU_HW_INFO_TYPE_NONE

Output by the drivers that do not report hardwareinfo

IOMMU_HW_INFO_TYPE_DEFAULT

Input to request for a default type

IOMMU_HW_INFO_TYPE_INTEL_VTD

Intel VT-d iommu info type

IOMMU_HW_INFO_TYPE_ARM_SMMUV3

ARM SMMUv3 iommu info type

IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV

NVIDIA Tegra241 CMDQV (extension for ARMSMMUv3) info type

enum iommufd_hw_capabilities

Constants

IOMMU_HW_CAP_DIRTY_TRACKING

IOMMU hardware support for dirty tracking. If available, it means the following APIs are supported:

IOMMU_HW_CAP_PCI_PASID_EXEC

Execute Permission Supported; the user ignores it when struct iommu_hw_info::out_max_pasid_log2 is zero.

IOMMU_HW_CAP_PCI_PASID_PRIV

Privileged Mode Supported; the user ignores it when struct iommu_hw_info::out_max_pasid_log2 is zero.

Description

IOMMU_HWPT_GET_DIRTY_BITMAP
IOMMU_HWPT_SET_DIRTY_TRACKING

enum iommufd_hw_info_flags

Flags for iommu_hw_info

Constants

IOMMU_HW_INFO_FLAG_INPUT_TYPE

If set, in_data_type carries an input type for user space to request a specific info type

struct iommu_hw_info

ioctl(IOMMU_GET_HW_INFO)

Definition:

struct iommu_hw_info {
    __u32 size;
    __u32 flags;
    __u32 dev_id;
    __u32 data_len;
    __aligned_u64 data_uptr;
    union {
        __u32 in_data_type;
        __u32 out_data_type;
    };
    __u8 out_max_pasid_log2;
    __u8 __reserved[3];
    __aligned_u64 out_capabilities;
};

Members

size

sizeof(struct iommu_hw_info)

flags

Must be 0

dev_id

The device bound to the iommufd

data_len

Input the length of a user buffer in bytes. Output the length ofdata that kernel supports

data_uptr

User pointer to a user-space buffer used by the kernel to fillthe iommu type specific hardware information data

{unnamed_union}

anonymous

in_data_type

This shares the same field with out_data_type, making it a bidirectional field. When IOMMU_HW_INFO_FLAG_INPUT_TYPE is set, the input type carried via this in_data_type field is valid, requesting the info data of the given type. If IOMMU_HW_INFO_FLAG_INPUT_TYPE is unset, any input value will be seen as IOMMU_HW_INFO_TYPE_DEFAULT

out_data_type

Output the iommu hardware info type as defined in enum iommu_hw_info_type.

out_max_pasid_log2

Output the width of PASIDs. 0 means no PASID support. PCI devices turn to out_capabilities to check whether the specific capabilities are supported.

__reserved

Must be 0

out_capabilities

Output the generic iommu capability info as defined in enum iommufd_hw_capabilities.

Description

Query iommu type specific hardware information data from an iommu behind a given device that has been bound to iommufd. This hardware info data will be used to sync capabilities between the virtual iommu and the physical iommu, e.g. a nested translation setup needs to check the hardware info, so a guest stage-1 page table can be compatible with the physical iommu.

To capture iommu type specific hardware information data, data_uptr and its length data_len must be provided. Trailing bytes will be zeroed if the user buffer is larger than the data that the kernel has. Otherwise, the kernel only fills the buffer using the given length in data_len. If the ioctl succeeds, data_len will be updated to the length that the kernel actually supports, and out_data_type will be filled to decode the data filled in the buffer pointed to by data_uptr. Input data_len == zero is allowed.

struct iommu_hwpt_set_dirty_tracking

ioctl(IOMMU_HWPT_SET_DIRTY_TRACKING)

Definition:

struct iommu_hwpt_set_dirty_tracking {
    __u32 size;
    __u32 flags;
    __u32 hwpt_id;
    __u32 __reserved;
};

Members

size

sizeof(struct iommu_hwpt_set_dirty_tracking)

flags

Combination of enum iommufd_hwpt_set_dirty_tracking_flags

hwpt_id

HW pagetable ID that represents the IOMMU domain

__reserved

Must be 0

Description

Toggle dirty tracking on an HW pagetable.

enum iommufd_hwpt_get_dirty_bitmap_flags

Flags for getting dirty bits

Constants

IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR

Just read the PTEs without clearing any dirty bit metadata. This flag can be passed in the expectation that the next operation on the same IOVA range is an unmap.

struct iommu_hwpt_get_dirty_bitmap

ioctl(IOMMU_HWPT_GET_DIRTY_BITMAP)

Definition:

struct iommu_hwpt_get_dirty_bitmap {
    __u32 size;
    __u32 hwpt_id;
    __u32 flags;
    __u32 __reserved;
    __aligned_u64 iova;
    __aligned_u64 length;
    __aligned_u64 page_size;
    __aligned_u64 data;
};

Members

size

sizeof(struct iommu_hwpt_get_dirty_bitmap)

hwpt_id

HW pagetable ID that represents the IOMMU domain

flags

Combination of enum iommufd_hwpt_get_dirty_bitmap_flags

__reserved

Must be 0

iova

base IOVA of the bitmap first bit

length

IOVA range size

page_size

page size granularity of each bit in the bitmap

data

bitmap where the dirty bits are set. Each bit in the bitmap represents one page_size-sized range, offset from the given iova.

Description

Checking a given IOVA is dirty:

data[(iova / page_size) / 64] & (1ULL << ((iova / page_size) % 64))

Walk the IOMMU pagetables for a given IOVA range to return a bitmapwith the dirty IOVAs. In doing so it will also by default clear anydirty bit metadata set in the IOPTE.

enum iommu_hwpt_invalidate_data_type

IOMMU HWPT Cache Invalidation Data Type

Constants

IOMMU_HWPT_INVALIDATE_DATA_VTD_S1

Invalidation data for VTD_S1

IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3

Invalidation data for ARM SMMUv3

enum iommu_hwpt_vtd_s1_invalidate_flags

Flags for Intel VT-d stage-1 cache invalidation

Constants

IOMMU_VTD_INV_FLAGS_LEAF

Indicates whether the invalidation appliesto all-levels page structure cache or justthe leaf PTE cache.

struct iommu_hwpt_vtd_s1_invalidate

Intel VT-d cache invalidation (IOMMU_HWPT_INVALIDATE_DATA_VTD_S1)

Definition:

struct iommu_hwpt_vtd_s1_invalidate {
    __aligned_u64 addr;
    __aligned_u64 npages;
    __u32 flags;
    __u32 __reserved;
};

Members

addr

The start address of the range to be invalidated. It needs tobe 4KB aligned.

npages

Number of contiguous 4K pages to be invalidated.

flags

Combination of enum iommu_hwpt_vtd_s1_invalidate_flags

__reserved

Must be 0

Description

The Intel VT-d specific invalidation data for user-managed stage-1 cache invalidation in nested translation. Userspace uses this structure to tell the impacted cache scope after modifying the stage-1 page table.

Invalidate all the caches related to the page table by setting addr to 0 and npages to U64_MAX.

The device TLB will be invalidated automatically if ATS is enabled.

struct iommu_viommu_arm_smmuv3_invalidate

ARM SMMUv3 cache invalidation (IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3)

Definition:

struct iommu_viommu_arm_smmuv3_invalidate {
    __aligned_le64 cmd[2];
};

Members

cmd

128-bit cache invalidation command that runs in SMMU CMDQ.Must be little-endian.

Description

Supported command list, only when passing in a vIOMMU via hwpt_id:

CMDQ_OP_TLBI_NSNH_ALL
CMDQ_OP_TLBI_NH_VA
CMDQ_OP_TLBI_NH_VAA
CMDQ_OP_TLBI_NH_ALL
CMDQ_OP_TLBI_NH_ASID
CMDQ_OP_ATC_INV
CMDQ_OP_CFGI_CD
CMDQ_OP_CFGI_CD_ALL

-EIO will be returned if the command is not supported.

struct iommu_hwpt_invalidate

ioctl(IOMMU_HWPT_INVALIDATE)

Definition:

struct iommu_hwpt_invalidate {
    __u32 size;
    __u32 hwpt_id;
    __aligned_u64 data_uptr;
    __u32 data_type;
    __u32 entry_len;
    __u32 entry_num;
    __u32 __reserved;
};

Members

size

sizeof(struct iommu_hwpt_invalidate)

hwpt_id

ID of a nested HWPT or a vIOMMU, for cache invalidation

data_uptr

User pointer to an array of driver-specific cache invalidationdata.

data_type

One of enum iommu_hwpt_invalidate_data_type, defining the data type of all the entries in the invalidation request array. It should be a type supported by the hwpt pointed to by hwpt_id.

entry_len

Length (in bytes) of a request entry in the request array

entry_num

Input the number of cache invalidation requests in the array.Output the number of requests successfully handled by kernel.

__reserved

Must be 0.

Description

Invalidate the iommu cache for a user-managed page table or vIOMMU. Modifications on a user-managed page table should be followed by this operation, if a HWPT is passed in via hwpt_id. Other caches, such as the device cache or descriptor cache, can be flushed if a vIOMMU is passed in via the hwpt_id field.

Each ioctl can support one or more cache invalidation requests in the array that has a total size of entry_len * entry_num.

An empty invalidation request array, by setting entry_num == 0, is allowed; entry_len and data_uptr would be ignored in this case. This can be used to check whether the given data_type is supported or not by the kernel.

enum iommu_hwpt_pgfault_flags

flags for struct iommu_hwpt_pgfault

Constants

IOMMU_PGFAULT_FLAGS_PASID_VALID

The pasid field of the fault data isvalid.

IOMMU_PGFAULT_FLAGS_LAST_PAGE

It’s the last fault of a fault group.

enum iommu_hwpt_pgfault_perm

perm bits for struct iommu_hwpt_pgfault

Constants

IOMMU_PGFAULT_PERM_READ

request for read permission

IOMMU_PGFAULT_PERM_WRITE

request for write permission

IOMMU_PGFAULT_PERM_EXEC

(PCIE 10.4.1) request with a PASID that has theExecute Requested bit set in PASID TLP Prefix.

IOMMU_PGFAULT_PERM_PRIV

(PCIE 10.4.1) request with a PASID that has thePrivileged Mode Requested bit set in PASID TLPPrefix.

struct iommu_hwpt_pgfault

iommu page fault data

Definition:

struct iommu_hwpt_pgfault {
    __u32 flags;
    __u32 dev_id;
    __u32 pasid;
    __u32 grpid;
    __u32 perm;
    __u32 __reserved;
    __aligned_u64 addr;
    __u32 length;
    __u32 cookie;
};

Members

flags

Combination of enum iommu_hwpt_pgfault_flags

dev_id

id of the originated device

pasid

Process Address Space ID

grpid

Page Request Group Index

perm

Combination of enum iommu_hwpt_pgfault_perm

__reserved

Must be 0.

addr

Fault address

length

a hint of how much data the requestor is expecting to fetch. For example, if the PRI initiator knows it is going to do a 10MB transfer, it could fill in 10MB and the OS could pre-fault in 10MB of IOVA. It defaults to 0 if there is no such hint.

cookie

kernel-managed cookie identifying a group of fault messages. Thecookie number encoded in the last page fault of the group shouldbe echoed back in the response message.

enum iommufd_page_response_code

Return status of fault handlers

Constants

IOMMUFD_PAGE_RESP_SUCCESS

Fault has been handled and the page tablespopulated, retry the access. This is the“Success” defined in PCI 10.4.2.1.

IOMMUFD_PAGE_RESP_INVALID

Could not handle this fault, don’t retry theaccess. This is the “Invalid Request” in PCI10.4.2.1.

struct iommu_hwpt_page_response

IOMMU page fault response

Definition:

struct iommu_hwpt_page_response {
    __u32 cookie;
    __u32 code;
};

Members

cookie

The kernel-managed cookie reported in the fault message.

code

One of the response codes in enum iommufd_page_response_code.

struct iommu_fault_alloc

ioctl(IOMMU_FAULT_QUEUE_ALLOC)

Definition:

struct iommu_fault_alloc {
    __u32 size;
    __u32 flags;
    __u32 out_fault_id;
    __u32 out_fault_fd;
};

Members

size

sizeof(struct iommu_fault_alloc)

flags

Must be 0

out_fault_id

The ID of the new FAULT

out_fault_fd

The fd of the new FAULT

Description

Explicitly allocate a fault handling object.

enum iommu_viommu_type

Virtual IOMMU Type

Constants

IOMMU_VIOMMU_TYPE_DEFAULT

Reserved for future use

IOMMU_VIOMMU_TYPE_ARM_SMMUV3

ARM SMMUv3 driver specific type

IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV

NVIDIA Tegra241 CMDQV (extension for ARMSMMUv3) enabled ARM SMMUv3 type

struct iommu_viommu_tegra241_cmdqv

NVIDIA Tegra241 CMDQV Virtual Interface (IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV)

Definition:

struct iommu_viommu_tegra241_cmdqv {
    __aligned_u64 out_vintf_mmap_offset;
    __aligned_u64 out_vintf_mmap_length;
};

Members

out_vintf_mmap_offset

mmap offset argument for VINTF’s page0

out_vintf_mmap_length

mmap length argument for VINTF’s page0

Description

Both out_vintf_mmap_offset and out_vintf_mmap_length are reported by the kernel for user space to mmap the VINTF page0 from the host physical address space into the guest physical address space, so that a guest kernel can directly read/write the VINTF page0 in order to control its virtual command queues.

struct iommu_viommu_alloc

ioctl(IOMMU_VIOMMU_ALLOC)

Definition:

struct iommu_viommu_alloc {
    __u32 size;
    __u32 flags;
    __u32 type;
    __u32 dev_id;
    __u32 hwpt_id;
    __u32 out_viommu_id;
    __u32 data_len;
    __u32 __reserved;
    __aligned_u64 data_uptr;
};

Members

size

sizeof(struct iommu_viommu_alloc)

flags

Must be 0

type

Type of the virtual IOMMU. Must be defined in enum iommu_viommu_type

dev_id

The device’s physical IOMMU will be used to back the virtual IOMMU

hwpt_id

ID of a nesting parent HWPT to associate to

out_viommu_id

Output virtual IOMMU ID for the allocated object

data_len

Length of the type specific data

__reserved

Must be 0

data_uptr

User pointer to a driver-specific virtual IOMMU data

Description

Allocate a virtual IOMMU object, representing the underlying physical IOMMU's virtualization support that is a security-isolated slice of the real IOMMU HW that is unique to a specific VM. Operations global to the IOMMU are connected to the vIOMMU, such as:

  • Security namespace for guest owned ID, e.g. guest-controlled cache tags

  • Non-device-affiliated event reporting, e.g. invalidation queue errors

  • Access to a sharable nesting parent pagetable across physical IOMMUs

  • Virtualization of various platform IDs, e.g. RIDs and others

  • Delivery of paravirtualized invalidation

  • Directly assigned invalidation queues

  • Directly assigned interrupts

struct iommu_vdevice_alloc

ioctl(IOMMU_VDEVICE_ALLOC)

Definition:

struct iommu_vdevice_alloc {
    __u32 size;
    __u32 viommu_id;
    __u32 dev_id;
    __u32 out_vdevice_id;
    __aligned_u64 virt_id;
};

Members

size

sizeof(struct iommu_vdevice_alloc)

viommu_id

vIOMMU ID to associate with the virtual device

dev_id

The physical device to allocate a virtual instance on the vIOMMU

out_vdevice_id

Object handle for the vDevice. Pass to IOMMU_DESTROY

virt_id

Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceIDof AMD IOMMU, and vRID of Intel VT-d

Description

Allocate a virtual device instance (for a physical device) against a vIOMMU. This instance holds the device's information (related to its vIOMMU) in a VM. The user should use IOMMU_DESTROY to destroy the virtual device before destroying the physical device (by closing the vfio_cdev fd). Otherwise the virtual device would be forcibly destroyed on physical device destruction, and its vdevice_id would be permanently leaked (unremovable and unreusable) until the iommufd is closed.

struct iommu_ioas_change_process

ioctl(IOMMU_IOAS_CHANGE_PROCESS)

Definition:

struct iommu_ioas_change_process {
    __u32 size;
    __u32 __reserved;
};

Members

size

sizeof(struct iommu_ioas_change_process)

__reserved

Must be 0

Description

This transfers pinned memory counts for every memory map in every IOASin the context to the current process. This only supports maps createdwith IOMMU_IOAS_MAP_FILE, and returns EINVAL if other maps are present.If the ioctl returns a failure status, then nothing is changed.

This API is useful for transferring operation of a device from one processto another, such as during userland live update.

enum iommu_veventq_flag

flag for struct iommufd_vevent_header

Constants

IOMMU_VEVENTQ_FLAG_LOST_EVENTS

vEVENTQ has lost vEVENTs

struct iommufd_vevent_header

Virtual Event Header for a vEVENTQ Status

Definition:

struct iommufd_vevent_header {
    __u32 flags;
    __u32 sequence;
};

Members

flags

Combination of enum iommu_veventq_flag

sequence

The sequence index of a vEVENT in the vEVENTQ, with a range of [0, INT_MAX], where the index following INT_MAX is 0

Description

Each iommufd_vevent_header reports a sequence index of the following vEVENT:

header0 {sequence=0}

data0

header1 {sequence=1}

data1

...

dataN

And this sequence index is expected to be monotonic with respect to the sequence index of the previous vEVENT. If two adjacent sequence indexes have a delta larger than 1, it means that delta - 1 vEVENTs have been lost, e.g. two lost vEVENTs:

...

header3 {sequence=3}

data3

header6 {sequence=6}

data6

...

If a vEVENT is lost at the tail of the vEVENTQ and there is no following vEVENT providing the next sequence index, an IOMMU_VEVENTQ_FLAG_LOST_EVENTS header would be added to the tail, and no data would follow this header:

header3 {sequence=3}

data3

header4 {flags=LOST_EVENTS, sequence=4}

enum iommu_veventq_type

Virtual Event Queue Type

Constants

IOMMU_VEVENTQ_TYPE_DEFAULT

Reserved for future use

IOMMU_VEVENTQ_TYPE_ARM_SMMUV3

ARM SMMUv3 Virtual Event Queue

IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV

NVIDIA Tegra241 CMDQV Extension IRQ

struct iommu_vevent_arm_smmuv3

ARM SMMUv3 Virtual Event (IOMMU_VEVENTQ_TYPE_ARM_SMMUV3)

Definition:

struct iommu_vevent_arm_smmuv3 {
    __aligned_le64 evt[4];
};

Members

evt

256-bit ARM SMMUv3 Event record, little-endian. Reported event records: (Refer to "7.3 Event records" in SMMUv3 HW Spec)

  • 0x04 C_BAD_STE

  • 0x06 F_STREAM_DISABLED

  • 0x08 C_BAD_SUBSTREAMID

  • 0x0a C_BAD_CD

  • 0x10 F_TRANSLATION

  • 0x11 F_ADDR_SIZE

  • 0x12 F_ACCESS

  • 0x13 F_PERMISSION

Description

StreamID field reports a virtual device ID. To receive a virtual event for adevice, a vDEVICE must be allocated via IOMMU_VDEVICE_ALLOC.

struct iommu_vevent_tegra241_cmdqv

Tegra241 CMDQV IRQ (IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV)

Definition:

struct iommu_vevent_tegra241_cmdqv {
    __aligned_le64 lvcmdq_err_map[2];
};

Members

lvcmdq_err_map

128-bit logical vcmdq error map, little-endian. (Refer to register LVCMDQ_ERR_MAPs per VINTF)

Description

The 128-bit register value from HW exclusively reflects the error bits for a Virtual Interface represented by a vIOMMU object. Read and report directly.

struct iommu_veventq_alloc

ioctl(IOMMU_VEVENTQ_ALLOC)

Definition:

struct iommu_veventq_alloc {
    __u32 size;
    __u32 flags;
    __u32 viommu_id;
    __u32 type;
    __u32 veventq_depth;
    __u32 out_veventq_id;
    __u32 out_veventq_fd;
    __u32 __reserved;
};

Members

size

sizeof(struct iommu_veventq_alloc)

flags

Must be 0

viommu_id

virtual IOMMU ID to associate the vEVENTQ with

type

Type of the vEVENTQ. Must be defined in enum iommu_veventq_type

veventq_depth

Maximum number of events in the vEVENTQ

out_veventq_id

The ID of the new vEVENTQ

out_veventq_fd

The fd of the new vEVENTQ. User space must close thesuccessfully returned fd after using it

__reserved

Must be 0

Description

Explicitly allocate a virtual event queue interface for a vIOMMU. A vIOMMU can have multiple FDs for different types, but is confined to one per type. User space should open the out_veventq_fd to read vEVENTs out of a vEVENTQ, if there are vEVENTs available. A vEVENTQ will lose events due to overflow if the number of vEVENTs hits veventq_depth.

Each vEVENT in a vEVENTQ encloses a struct iommufd_vevent_header followed by a type-specific data structure, in the normal case:

header0

data0

header1

data1

...

headerN

dataN

unless a trailing IOMMU_VEVENTQ_FLAG_LOST_EVENTS header is logged (refer to struct iommufd_vevent_header).

enum iommu_hw_queue_type

HW Queue Type

Constants

IOMMU_HW_QUEUE_TYPE_DEFAULT

Reserved for future use

IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV

NVIDIA Tegra241 CMDQV (extension for ARMSMMUv3) Virtual Command Queue (VCMDQ)

struct iommu_hw_queue_alloc

ioctl(IOMMU_HW_QUEUE_ALLOC)

Definition:

struct iommu_hw_queue_alloc {
    __u32 size;
    __u32 flags;
    __u32 viommu_id;
    __u32 type;
    __u32 index;
    __u32 out_hw_queue_id;
    __aligned_u64 nesting_parent_iova;
    __aligned_u64 length;
};

Members

size

sizeof(struct iommu_hw_queue_alloc)

flags

Must be 0

viommu_id

Virtual IOMMU ID to associate the HW queue with

type

One of enum iommu_hw_queue_type

index

The logical index to the HW queue per virtual IOMMU for a multi-queuemodel

out_hw_queue_id

The ID of the new HW queue

nesting_parent_iova

Base address of the queue memory in the guest physicaladdress space

length

Length of the queue memory

Description

Allocate a HW queue object for a vIOMMU-specific HW-accelerated queue, which allows HW to access a guest queue memory described using nesting_parent_iova and length.

A vIOMMU can allocate multiple queues, but it must use a different index per type to separate each allocation, e.g.:

Type1 HW queue0, Type1 HW queue1, Type2 HW queue0, ...

IOMMUFD Kernel API

The IOMMUFD kAPI is device-centric with group-related tricks managed behind thescene. This allows the external drivers calling such kAPI to implement a simpledevice-centric uAPI for connecting its device to an iommufd, instead ofexplicitly imposing the group semantics in its uAPI as VFIO does.

struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx, struct device *dev, u32 *id)

Bind a physical device to an iommu fd

Parameters

struct iommufd_ctx *ictx

iommufd file descriptor

struct device *dev

Pointer to a physical device struct

u32 *id

Output ID number to return to userspace for this device

Description

A successful bind establishes an ownership over the device and returns a struct iommufd_device pointer, otherwise returns an error pointer.

A driver using this API must set driver_managed_dma and must not touchthe device until this routine succeeds and establishes ownership.

Binding a PCI device places the entire RID under iommufd control.

The caller must undo this with iommufd_device_unbind()

bool iommufd_ctx_has_group(struct iommufd_ctx *ictx, struct iommu_group *group)

True if any device within the group is bound to the ictx

Parameters

struct iommufd_ctx *ictx

iommufd file descriptor

struct iommu_group *group

Pointer to a physical iommu_group struct

Description

True if any device within the group has been bound to this ictx, e.g. via iommufd_device_bind(), therefore implying ictx ownership of the group.

void iommufd_device_unbind(struct iommufd_device *idev)

Undo iommufd_device_bind()

Parameters

struct iommufd_device *idev

Device returned by iommufd_device_bind()

Description

Release the device from iommufd control. DMA ownership returns to unowned, with DMA controlled by the DMA API. This invalidates the iommufd_device pointer; other APIs that consume it must not be called concurrently.

int iommufd_device_attach(struct iommufd_device *idev, ioasid_t pasid, u32 *pt_id)

Connect a device/pasid to an iommu_domain

Parameters

struct iommufd_device *idev

device to attach

ioasid_t pasid

pasid to attach

u32 *pt_id

Input an IOMMUFD_OBJ_IOAS or IOMMUFD_OBJ_HWPT_PAGING. Output the IOMMUFD_OBJ_HWPT_PAGING ID

Description

This connects the device/pasid to an iommu_domain, either automatically or manually selected. Once this completes the device could do DMA with pasid. pasid is IOMMU_NO_PASID if this attach is for no pasid usage.

The caller should return the resulting pt_id back to userspace. This function is undone by calling iommufd_device_detach().

int iommufd_device_replace(struct iommufd_device *idev, ioasid_t pasid, u32 *pt_id)

Change the device/pasid’s iommu_domain

Parameters

struct iommufd_device *idev

device to change

ioasid_t pasid

pasid to change

u32 *pt_id

Input an IOMMUFD_OBJ_IOAS or IOMMUFD_OBJ_HWPT_PAGING. Output the IOMMUFD_OBJ_HWPT_PAGING ID

Description

This is the same as:

iommufd_device_detach();
iommufd_device_attach();

If it fails then no change is made to the attachment. The iommu driver may implement this so there is no disruption in translation. This can only be called if iommufd_device_attach() has already succeeded. pasid is IOMMU_NO_PASID for no pasid usage.

void iommufd_device_detach(struct iommufd_device *idev, ioasid_t pasid)

Disconnect a device/pasid from an iommu_domain

Parameters

struct iommufd_device *idev

device to detach

ioasid_t pasid

pasid to detach

Description

Undo iommufd_device_attach(). This disconnects the idev from the previously attached pt_id. The device returns back to a blocked DMA translation. pasid is IOMMU_NO_PASID for no pasid usage.

struct iommufd_access *iommufd_access_create(struct iommufd_ctx *ictx, const struct iommufd_access_ops *ops, void *data, u32 *id)

Create an iommufd_access

Parameters

struct iommufd_ctx *ictx

iommufd file descriptor

const struct iommufd_access_ops *ops

Driver’s ops to associate with the access

void *data

Opaque data to pass into ops functions

u32 *id

Output ID number to return to userspace for this access

Description

An iommufd_access allows a driver to read/write to the IOAS without using DMA. The underlying CPU memory can be accessed using the iommufd_access_pin_pages() or iommufd_access_rw() functions.

The provided ops are required to use iommufd_access_pin_pages().

void iommufd_access_destroy(struct iommufd_access *access)

Destroy an iommufd_access

Parameters

struct iommufd_access *access

The access to destroy

Description

The caller must stop using the access before destroying it.

void iommufd_access_unpin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length)

Undo iommufd_access_pin_pages

Parameters

struct iommufd_access *access

IOAS access to act on

unsigned long iova

Starting IOVA

unsigned long length

Number of bytes to access

Description

Return thestructpage’s. The caller must stop accessing them before callingthis. The iova/length must exactly match the one provided to access_pages.

int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length, struct page **out_pages, unsigned int flags)

Return a list of pages under the iova

Parameters

struct iommufd_access *access

IOAS access to act on

unsigned long iova

Starting IOVA

unsigned long length

Number of bytes to access

struct page **out_pages

Output page list

unsigned int flags

IOMMUFD_ACCESS_RW_* flags

Description

Reads length bytes starting at iova and returns the struct page * pointers. These can be kmap'd by the caller for CPU access.

The caller must perform iommufd_access_unpin_pages() when done to balance this.

This API always requires a page aligned iova. This happens naturally if theioas alignment is >= PAGE_SIZE and the iova is PAGE_SIZE aligned. Howeversmaller alignments have corner cases where this API can fail on otherwisealigned iova.

int iommufd_access_rw(struct iommufd_access *access, unsigned long iova, void *data, size_t length, unsigned int flags)

Read or write data under the iova

Parameters

struct iommufd_access *access

IOAS access to act on

unsigned long iova

Starting IOVA

void *data

Kernel buffer to copy to/from

size_t length

Number of bytes to access

unsigned int flags

IOMMUFD_ACCESS_RW_* flags

Description

Copy kernel to/from data into the range given by IOVA/length. If flagsindicates IOMMUFD_ACCESS_RW_KTHREAD then a large copy can be optimizedby changing it into copy_to/from_user().

void iommufd_ctx_get(struct iommufd_ctx *ictx)

Get a context reference

Parameters

struct iommufd_ctx *ictx

Context to get

Description

The caller must already hold a valid reference to ictx.

struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)

Acquires a reference to the iommufd context

Parameters

struct file *file

File to obtain the reference from

Description

Returns a pointer to the iommufd_ctx, otherwise ERR_PTR. The struct file remains owned by the caller and the caller must still do fput. On success the caller is responsible to call iommufd_ctx_put().

struct iommufd_ctx *iommufd_ctx_from_fd(int fd)

Acquires a reference to the iommufd context

Parameters

int fd

File descriptor to obtain the reference from

Description

Returns a pointer to the iommufd_ctx, otherwise ERR_PTR. On success the caller is responsible to call iommufd_ctx_put().

void iommufd_ctx_put(struct iommufd_ctx *ictx)

Put back a reference

Parameters

struct iommufd_ctx *ictx

Context to put back
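The context helpers pair up as a simple reference-count protocol. A sketch (struct demo_dev and both function names are hypothetical) of a driver that resolves a user-supplied iommufd and holds the reference for the lifetime of its own object:

```c
/* Hypothetical driver object holding a context reference. */
struct demo_dev {
	struct iommufd_ctx *ictx;
};

static int demo_bind_iommufd(struct demo_dev *ddev, int fd)
{
	struct iommufd_ctx *ictx;

	ictx = iommufd_ctx_from_fd(fd);	/* acquires a reference */
	if (IS_ERR(ictx))
		return PTR_ERR(ictx);

	ddev->ictx = ictx;		/* keep it until teardown */
	return 0;
}

static void demo_unbind_iommufd(struct demo_dev *ddev)
{
	iommufd_ctx_put(ddev->ictx);	/* drop the acquired reference */
	ddev->ictx = NULL;
}
```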

VFIO and IOMMUFD

Connecting a VFIO device to iommufd can be done in two ways.

First is a VFIO compatible way by directly implementing the /dev/vfio/vfio container IOCTLs by mapping them into io_pagetable operations. Doing so allows the use of iommufd in legacy VFIO applications by symlinking /dev/vfio/vfio to /dev/iommufd or extending VFIO to SET_CONTAINER using an iommufd instead of a container fd.

The second approach directly extends VFIO to support a new set of device-centric user APIs based on the aforementioned IOMMUFD kernel API. It requires a userspace change but better matches the IOMMUFD API semantics and makes it easier to support new iommufd features, compared to the first approach.

Currently both approaches are still work-in-progress.

There are still a few gaps to be resolved to catch up with VFIO type1, as documented in iommufd_vfio_check_extension().

Future TODOs

Currently IOMMUFD supports only kernel-managed I/O page tables, similar to VFIO type1. New features on the radar include:

  • Binding iommu_domain’s to PASID/SSID

  • Userspace page tables, for ARM, x86 and S390

  • Kernel-bypassed invalidation of user page tables

  • Re-use of the KVM page table in the IOMMU

  • Dirty page tracking in the IOMMU

  • Runtime Increase/Decrease of IOPTE size

  • PRI support with faults resolved in userspace