32.Shared Virtual Addressing (SVA) with ENQCMD

32.1.Background

Shared Virtual Addressing (SVA) allows the processor and device to use the same virtual addresses, avoiding the need for software to translate virtual addresses to physical addresses. SVA is what PCIe calls Shared Virtual Memory (SVM).

In addition to the convenience of using application virtual addresses by the device, it also doesn’t require pinning pages for DMA. PCIe Address Translation Services (ATS) along with Page Request Interface (PRI) allow devices to function much the same way as the CPU handling application page-faults. For more information please refer to the PCIe specification Chapter 10: ATS Specification.

Use of SVA requires IOMMU support in the platform. IOMMU is also required to support the PCIe features ATS and PRI. ATS allows devices to cache translations for virtual addresses. The IOMMU driver uses the mmu_notifier() support to keep the device TLB cache and the CPU cache in sync. When an ATS lookup fails for a virtual address, the device should use the PRI in order to request the virtual address to be paged into the CPU page tables. The device must use ATS again in order to fetch the translation before use.
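
The ATS/PRI sequence above can be illustrated with a small toy model. This is only a sketch for following the order of operations: the class and method names (ToyIommu, ToyDevice, ats_translate, pri_page_request) are hypothetical and do not correspond to kernel or hardware interfaces, and 4 KiB pages are assumed.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, for illustration only


class ToyIommu:
    """Hypothetical stand-in for the IOMMU plus OS page-fault handling."""

    def __init__(self):
        self.page_table = {}      # virtual page number -> physical frame number
        self.next_frame = 0x100   # arbitrary first frame for the toy allocator

    def ats_translate(self, vpn):
        # ATS request: return the frame, or None if the page is not present.
        return self.page_table.get(vpn)

    def pri_page_request(self, vpn):
        # PRI request: the OS pages the address in and responds to the device.
        if vpn not in self.page_table:
            self.page_table[vpn] = self.next_frame
            self.next_frame += 1

    def invalidate(self, vpn, devices):
        # A page-table invalidation is propagated to every device TLB.
        self.page_table.pop(vpn, None)
        for dev in devices:
            dev.tlb.pop(vpn, None)


class ToyDevice:
    def __init__(self, iommu):
        self.iommu = iommu
        self.tlb = {}             # device TLB: cached ATS translations
        self.page_requests = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.tlb:
            frame = self.iommu.ats_translate(vpn)
            if frame is None:
                self.iommu.pri_page_request(vpn)       # ask for page-in via PRI
                self.page_requests += 1
                frame = self.iommu.ats_translate(vpn)  # ATS again before use
            self.tlb[vpn] = frame
        return (self.tlb[vpn] << PAGE_SHIFT) | offset
```

In this model a second access to the same page is served from the device TLB without a page request, and an invalidation forces the next access through ATS (and, if needed, PRI) again.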

32.2.Shared Hardware Workqueues

Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits the use of Shared Work Queues (SWQ) by both applications and Virtual Machines (VMs). This allows better hardware utilization than hard partitioning of resources, which can result in underutilization. To allow the hardware to distinguish the context for which work is being executed through the SWQ interface, SIOV uses the Process Address Space ID (PASID), which is a 20-bit number defined by the PCI-SIG.

The PASID value is encoded in all transactions from the device. This allows the IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe Requester ID (RID), which is the Bus/Device/Function.
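
As a small illustration of the two identifiers mentioned above, the sketch below packs a Bus/Device/Function triple into a 16-bit RID (8 bits of bus, 5 bits of device, 3 bits of function) and checks that a PASID fits in 20 bits. The helper names make_rid and dma_tag are hypothetical, and the tuple returned by dma_tag is only a way of picturing what the IOMMU keys on, not a hardware format.

```python
PASID_BITS = 20
PASID_MAX = (1 << PASID_BITS) - 1   # largest 20-bit value, 0xFFFFF


def make_rid(bus, device, function):
    """Pack Bus/Device/Function into a 16-bit RID (8/5/3 bits)."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    return (bus << 8) | (device << 3) | function


def dma_tag(rid, pasid):
    """Return the (RID, PASID) pair that identifies one I/O context."""
    if not 0 <= pasid <= PASID_MAX:
        raise ValueError("PASID must fit in 20 bits")
    return (rid, pasid)
```

Two processes using the same device function share the RID but differ in PASID, which is what lets the IOMMU tell their transactions apart.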

32.3.ENQCMD

ENQCMD is a new instruction on Intel platforms that atomically submits a work descriptor to a device. The descriptor includes the operation to be performed, virtual addresses of all parameters, virtual address of a completion record, and the PASID (process address space ID) of the current process.

ENQCMD works with non-posted semantics and carries a status back indicating whether the command was accepted by hardware. This lets the submitter know whether the submission needs to be retried, or whether other device-specific mechanisms should be used to implement fairness or ensure forward progress.
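
The accept-or-retry behaviour can be pictured with a toy queue. This is a simulation, not a use of the real instruction: ToySharedWorkQueue, its enqcmd() method and submit_with_retry() are hypothetical names, and the bounded retry loop is just one possible policy a submitter might choose.

```python
from collections import deque


class ToySharedWorkQueue:
    """Hypothetical model of a device shared work queue with a fixed depth."""

    def __init__(self, depth):
        self.depth = depth
        self.pending = deque()

    def enqcmd(self, descriptor):
        # Non-posted: the submitter gets a status back for every attempt.
        if len(self.pending) >= self.depth:
            return False          # not accepted, the caller should retry
        self.pending.append(descriptor)
        return True               # accepted by the device

    def complete_one(self):
        # The device finishes one descriptor, freeing a queue slot.
        return self.pending.popleft() if self.pending else None


def submit_with_retry(wq, descriptor, max_attempts, on_retry=None):
    """Try up to max_attempts times; return the attempt count or None."""
    for attempt in range(1, max_attempts + 1):
        if wq.enqcmd(descriptor):
            return attempt
        if on_retry is not None:
            on_retry()            # e.g. back off, or let the device drain
    return None
```

Because the queue depth is tracked by the (simulated) device, the submitter only reacts to the returned status and keeps no queue-depth state of its own.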

ENQCMD is the glue that ensures applications can directly submit commands to the hardware and also permits hardware to be aware of application context to perform I/O operations via use of PASID.

32.4.Process Address Space Tagging

A new thread-scoped MSR (IA32_PASID) provides the connection between user processes and the rest of the hardware. When an application first accesses an SVA-capable device, this MSR is initialized with a newly allocated PASID. The driver for the device calls an IOMMU-specific API that sets up the routing for DMA and page-requests.

For example, the Intel Data Streaming Accelerator (DSA) uses iommu_sva_bind_device(), which will do the following:

  • Allocate the PASID, and program the process page-table (%cr3 register) in the PASID context entries.

  • Register for mmu_notifier() to track any page-table invalidations to keep the device TLB in sync. For example, when a page-table entry is invalidated, the IOMMU propagates the invalidation to the device TLB. This forces any future access by the device to this virtual address to go through ATS. If the IOMMU responds that the page is not present, the device requests that the page be paged in via the PCIe PRI protocol before performing I/O.

This MSR is managed with the XSAVE feature set as “supervisor state” to ensure the MSR is updated during context switch.

32.5.PASID Management

The kernel must allocate a PASID on behalf of each process which will use ENQCMD and program it into the new MSR to communicate the process identity to platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests from this process. When a user submits a work descriptor to a device using the ENQCMD instruction, the PASID field in the descriptor is auto-filled with the value from MSR_IA32_PASID. Requests for DMA from the device are also tagged with the same PASID. The platform IOMMU uses the PASID in the transaction to perform address translation. The IOMMU APIs set up the corresponding PASID entry in the IOMMU with the process address space used by the CPU (e.g. %cr3 register in x86).

The MSR must be configured on each logical CPU before any application thread can interact with a device. Threads that belong to the same process share the same page tables, thus the same MSR value.

32.6.PASID Life Cycle Management

PASID is initialized as IOMMU_PASID_INVALID (-1) when a process is created.

Only processes that access SVA-capable devices need to have a PASID allocated. This allocation happens when a process opens/binds an SVA-capable device but finds no PASID for this process. Subsequent binds of the same or other devices share the same PASID.

Although the PASID is allocated to the process by opening a device, it is not active in any of the threads of that process. It’s loaded to the IA32_PASID MSR lazily when a thread tries to submit a work descriptor to a device using ENQCMD.

That first access will trigger a #GP fault because the IA32_PASID MSR has not been initialized with the PASID value assigned to the process when the device was opened. The Linux #GP handler notes that a PASID has been allocated for the process, and so initializes the IA32_PASID MSR and returns so that the ENQCMD instruction is re-executed.
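
The lazy initialization described above can be modelled as follows. This is a toy simulation of the control flow only, not the kernel's fault handler: ToyMm, ToyThread and GeneralProtectionFault are hypothetical names, and the MSR layout used here (PASID in the low 20 bits, a valid flag in bit 31) is an assumption made for the sketch.

```python
IOMMU_PASID_INVALID = -1          # initial per-process value, as stated above
MSR_PASID_VALID = 1 << 31         # assumption: valid flag in bit 31
MSR_PASID_MASK = (1 << 20) - 1    # PASID is a 20-bit number


class GeneralProtectionFault(Exception):
    """Stands in for a #GP raised by the simulated ENQCMD."""


class ToyMm:
    def __init__(self):
        self.pasid = IOMMU_PASID_INVALID   # set when a device is bound


class ToyThread:
    def __init__(self, mm):
        self.mm = mm
        self.ia32_pasid = 0       # cleared MSR: valid flag not set
        self.gp_faults = 0

    def _enqcmd_raw(self, payload):
        if not self.ia32_pasid & MSR_PASID_VALID:
            raise GeneralProtectionFault()
        # The PASID field of the descriptor is filled from the MSR.
        return {"pasid": self.ia32_pasid & MSR_PASID_MASK, "payload": payload}

    def _gp_fixup(self):
        # True only if the fault was caused by a PASID not yet loaded.
        if self.mm.pasid == IOMMU_PASID_INVALID:
            return False
        if self.ia32_pasid & MSR_PASID_VALID:
            return False
        self.ia32_pasid = self.mm.pasid | MSR_PASID_VALID
        return True

    def enqcmd(self, payload):
        try:
            return self._enqcmd_raw(payload)
        except GeneralProtectionFault:
            self.gp_faults += 1
            if not self._gp_fixup():
                raise                          # no PASID allocated: real fault
            return self._enqcmd_raw(payload)   # the instruction is re-executed
```

Each thread takes at most one fixup fault in this model; later submissions from the same thread find the MSR already valid.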

On fork(2) or exec(2) the PASID is removed from the process as it no longer has the same address space that it had when the device was opened.

On clone(2) the new task shares the same address space, so it will be able to use the PASID allocated to the process. The IA32_PASID MSR is not preemptively initialized, for two reasons: the PASID value might not be allocated yet, or the kernel may not know whether this thread is going to access the device; and a cleared IA32_PASID MSR reduces context switch overhead through the xstate init optimization. Since #GP faults have to be handled on any threads that were created before the PASID was assigned to the mm of the process, newly created threads might as well be treated in a consistent way.

Due to the complexity of freeing the PASID and clearing the IA32_PASID MSRs in all threads on unbind, the PASID is freed lazily, only on mm exit.

If a process does a close(2) of the device file descriptor and munmap(2) of the device MMIO portal, then the driver will unbind the device. The PASID is still marked VALID in the IA32_PASID MSR for any threads in the process that accessed the device. But this is harmless, as without the MMIO portal they cannot submit new work to the device.

32.7.Relationships

  • Each process has many threads, but only one PASID.

  • Devices have a limited number (tens to thousands) of hardware workqueues. The device driver manages allocating hardware workqueues.

  • A single mmap() maps a single hardware workqueue as a “portal” and each portal maps down to a single workqueue.

  • For each device with which a process interacts, there must be one or more mmap()’d portals.

  • Many threads within a process can share a single portal to access a single device.

  • Multiple processes can separately mmap() the same portal, in which case they still share one device hardware workqueue.

  • The single process-wide PASID is used by all threads to interact with all devices. There is not, for instance, a PASID for each thread or each thread<->device pair.
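
The relationships listed above can be summarized as a small data model. This is a sketch for orientation only: the Workqueue and Process classes and their fields are hypothetical and are not taken from any driver.

```python
from dataclasses import dataclass, field


@dataclass
class Workqueue:
    device: str     # which device the hardware workqueue belongs to
    index: int      # workqueue number within that device


@dataclass
class Process:
    pasid: int                                   # one PASID per process
    threads: list = field(default_factory=list)
    portals: dict = field(default_factory=dict)  # portal name -> Workqueue

    def mmap_portal(self, name, wq):
        # A single mmap() maps exactly one hardware workqueue as a portal.
        self.portals[name] = wq

    def submit(self, thread, portal):
        # Any thread may use any portal; the tag is always the process PASID.
        if thread not in self.threads:
            raise ValueError("thread does not belong to this process")
        wq = self.portals[portal]
        return (self.pasid, wq.device, wq.index)
```

Two Process objects that map the same Workqueue object model two processes sharing one hardware workqueue while still submitting with different PASIDs.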

32.8.FAQ

  • What is SVA/SVM?

Shared Virtual Addressing (SVA) permits I/O hardware and the processor to work in the same address space, i.e., to share it. Some call it Shared Virtual Memory (SVM), but the Linux community wanted to avoid confusing it with POSIX Shared Memory and Secure Virtual Machines, which were terms already in circulation.

  • What is a PASID?

A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet (TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS. PASID is included in all transactions between the platform and the device.

  • How are shared workqueues different?

Traditionally, in order for userspace applications to interact with hardware, there is a separate hardware instance required per process. For example, consider doorbells as a mechanism of informing hardware about work to process. Each doorbell is required to be spaced 4k (or page-size) apart for process isolation. This requires hardware to provision that space and reserve it in MMIO. This doesn’t scale as the number of threads becomes quite large. The hardware also manages the queue depth for Shared Work Queues (SWQ), and consumers don’t need to track queue depth. If there is no space to accept a command, the device will return an error indicating retry.

A user should check the Deferrable Memory Write (DMWr) capability on the device and only submit ENQCMD when the device supports it. In the new DMWr PCIe terminology, devices need to support DMWr completer capability. In addition, all switch ports must support DMWr routing, and it must be enabled by the PCIe subsystem, much like how PCIe atomic operations are managed, for instance.

SWQ allows hardware to provision just a single address in the device. When used with ENQCMD to submit work, the device can distinguish the process submitting the work since it will include the PASID assigned to that process. This helps the device scale to a large number of processes.

  • Is this the same as a user space device driver?

Communicating with the device via the shared workqueue is much simpler than a full-blown user space driver. The kernel driver does all the initialization of the hardware. User space only needs to worry about submitting work and processing completions.

  • Is this the same as SR-IOV?

Single Root I/O Virtualization (SR-IOV) focuses on providing independent hardware interfaces for virtualizing hardware. Hence, it’s required to be an almost fully functional interface to software, supporting the traditional BARs, space for interrupts via MSI-X, and its own register layout. Virtual Functions (VFs) are assisted by the Physical Function (PF) driver.

Scalable I/O Virtualization builds on the PASID concept to create device instances for virtualization. SIOV requires host software to assist in creating virtual devices; each virtual device is represented by a PASID along with the bus/device/function of the device. This allows device hardware to optimize device resource creation and can grow dynamically on demand. SR-IOV creation and management is very static in nature. Consult references below for more details.

  • Why not just create a virtual function for each app?

Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require duplicated hardware for PCI config space and interrupts such as MSI-X. Resources such as interrupts have to be hard partitioned between VFs at creation time, and cannot scale dynamically on demand. The VFs are not completely independent from the Physical Function (PF). Most VFs require some communication and assistance from the PF driver. SIOV, in contrast, creates a software-defined device where all the configuration and control aspects are mediated via the slow path. The work submission and completion happen without any mediation.

  • Does this support virtualization?

ENQCMD can be used from within a guest VM. In these cases, the VMM helps with setting up a translation table to translate from Guest PASID to Host PASID. Please consult the ENQCMD instruction set reference for more details.

  • Does memory need to be pinned?

When devices support SVA along with platform hardware such as IOMMU supporting such devices, there is no need to pin memory for DMA purposes. Devices that support SVA also support other PCIe features that remove the pinning requirement for memory.

Device TLB support - Device requests the IOMMU to look up an address before use via Address Translation Service (ATS) requests. If the mapping exists but there is no page allocated by the OS, IOMMU hardware returns that no mapping exists.

Device requests the virtual address to be mapped via Page Request Interface (PRI). Once the OS has successfully completed the mapping, it returns the response back to the device. The device requests again for a translation and continues.

IOMMU works with the OS in managing consistency of page-tables with the device. When removing pages, it interacts with the device to remove any device TLB entry that might have been cached before removing the mappings from the OS.

32.9.References

VT-D: https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d

SIOV: https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

ENQCMD in ISE: https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

DSA spec: https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf