Asynchronous VM_BIND
Nomenclature:
VRAM: On-device memory. Sometimes referred to as device local memory.
gpu_vm: A virtual GPU address space. Typically per process, but can be shared by multiple processes.
VM_BIND: An operation or a list of operations to modify a gpu_vm using an IOCTL. The operations include mapping and unmapping system memory or VRAM.
syncobj: A container that abstracts synchronization objects. The synchronization objects can be either generic, like dma-fences, or driver specific. A syncobj typically indicates the type of the underlying synchronization object.
in-syncobj: Argument to a VM_BIND IOCTL; the VM_BIND operation waits for these before starting.
out-syncobj: Argument to a VM_BIND IOCTL; the VM_BIND operation signals these when the bind operation is complete.
dma-fence: A cross-driver synchronization object. A basic understanding of dma-fences is required to digest this document. Please refer to the DMA Fences section of the dma-buf doc.
memory fence: A synchronization object, different from a dma-fence. A memory fence uses the value of a specified memory location to determine signaled status. A memory fence can be awaited and signaled by both the GPU and CPU. Memory fences are sometimes referred to as user-fences, userspace-fences or gpu futexes and do not necessarily obey the dma-fence rule of signaling within a “reasonable amount of time”. The kernel should thus avoid waiting for memory fences with locks held.
long-running workload: A workload that may take more than the current stipulated dma-fence maximum signal delay to complete and which therefore needs to set the gpu_vm or the GPU execution context in a certain mode that disallows completion dma-fences.
exec function: An exec function is a function that revalidates all affected gpu_vmas, submits a GPU command batch and registers the dma_fence representing the GPU command’s activity with all affected dma_resvs. For completeness, although not covered by this document, it’s worth mentioning that an exec function may also be the revalidation worker that is used by some drivers in compute / long-running mode.
bind context: A context identifier used for the VM_BIND operation. VM_BIND operations that use the same bind context can be assumed, where it matters, to complete in order of submission. No such assumptions can be made for VM_BIND operations using separate bind contexts.
UMD: User-mode driver.
KMD: Kernel-mode driver.
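The memory fence entry above can be illustrated with a minimal user-space sketch. Everything here is hypothetical: `struct mem_fence` and its helpers are not part of any kernel or driver API, and the "signaled once the stored value reaches the target" comparison is an assumption, since the text only says that the value of a memory location determines signaled status.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory fence: considered signaled once *addr >= value. */
struct mem_fence {
    _Atomic uint64_t *addr;  /* location written by the GPU or the CPU */
    uint64_t value;          /* assumed target value marking completion */
};

/* Non-blocking status check: it only reads the location and never waits. */
static bool mem_fence_signaled(const struct mem_fence *f)
{
    assert(f && f->addr);
    return atomic_load_explicit(f->addr, memory_order_acquire) >= f->value;
}

/* CPU-side signal: publish the target value with release ordering. */
static void mem_fence_signal(const struct mem_fence *f)
{
    assert(f && f->addr);
    atomic_store_explicit(f->addr, f->value, memory_order_release);
}
```

Because nothing forces the writer to ever store the target value, a waiter on such a fence has no upper bound on its wait time, which is the property that separates it from a dma-fence.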
Synchronous / Asynchronous VM_BIND operation
Synchronous VM_BIND
With synchronous VM_BIND, the VM_BIND operations all complete before the IOCTL returns. A synchronous VM_BIND takes neither in-fences nor out-fences. Synchronous VM_BIND may block and wait for GPU operations, for example swap-in or clearing, or even previous binds.
Asynchronous VM_BIND
Asynchronous VM_BIND accepts both in-syncobjs and out-syncobjs. While the IOCTL may return immediately, the VM_BIND operations wait for the in-syncobjs before modifying the GPU page-tables, and signal the out-syncobjs when the modification is done, in the sense that the next exec function that waits for the out-syncobjs will see the change. Errors are reported synchronously. In low-memory situations the implementation may block, performing the VM_BIND synchronously, because there might not be enough memory immediately available for preparing the asynchronous operation.
If the VM_BIND IOCTL takes a list or an array of operations as an argument, the in-syncobjs need to signal before the first operation starts to execute, and the out-syncobjs signal after the last operation completes. Operations in the operation list can be assumed, where it matters, to complete in order.
Since asynchronous VM_BIND operations may use dma-fences embedded in out-syncobjs and internally in KMD to signal bind completion, any memory fences given as VM_BIND in-fences need to be awaited synchronously before the VM_BIND IOCTL returns, since dma-fences, required to signal in a reasonable amount of time, can never be made to depend on memory fences that don’t have such a restriction.
The purpose of an asynchronous VM_BIND operation is for user-mode drivers to be able to pipeline interleaved gpu_vm modifications and exec functions. For long-running workloads, such pipelining of a bind operation is not allowed and any in-fences need to be awaited synchronously. The reason for this is twofold. First, any memory fences gated by a long-running workload and used as in-syncobjs for the VM_BIND operation will need to be awaited synchronously anyway (see above). Second, any dma-fences used as in-syncobjs for VM_BIND operations for long-running workloads will not allow for pipelining anyway since long-running workloads don’t allow for dma-fences as out-syncobjs, so while theoretically possible the use of them is questionable and should be rejected until there is a valuable use-case. Note that this is not a limitation imposed by dma-fence rules, but rather a limitation imposed to keep KMD implementation simple. It does not affect using dma-fences as dependencies for the long-running workload itself, which is allowed by dma-fence rules, but rather for the VM_BIND operation only.
An asynchronous VM_BIND operation may take substantial time to complete and signal the out-fence, in particular if the operation is deeply pipelined behind other VM_BIND operations and workloads submitted using exec functions. In that case, UMD might want to avoid having a subsequent VM_BIND operation queued behind the first one if there are no explicit dependencies. In order to circumvent such a queue-up, a VM_BIND implementation may allow for VM_BIND contexts to be created. For each context, VM_BIND operations will be guaranteed to complete in the order they were submitted, but that is not the case for VM_BIND operations executing on separate VM_BIND contexts. Instead KMD will attempt to execute such VM_BIND operations in parallel, but with no guarantee that they will actually be executed in parallel. There may be internal implicit dependencies that only KMD knows about, for example page-table structure changes. A way to attempt to avoid such internal dependencies is to have different VM_BIND contexts use separate regions of a VM.
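One way a UMD might follow the "separate regions per VM_BIND context" advice is to derive the context index from the GPU virtual address. The sketch below is only an illustration of that idea: `pick_bind_context()` is a hypothetical helper, and the even split of the address space is an assumption, not something any driver prescribes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: split a VM of 'va_size' bytes into 'num_ctx'
 * equally sized regions and return the index of the bind context that
 * owns 'addr'. Any remainder at the top of the address space is folded
 * into the last context.
 */
static unsigned int pick_bind_context(uint64_t addr, uint64_t va_size,
                                      unsigned int num_ctx)
{
    assert(num_ctx > 0 && va_size >= num_ctx && addr < va_size);

    uint64_t region = va_size / num_ctx;
    uint64_t idx = addr / region;

    return idx >= num_ctx ? num_ctx - 1 : (unsigned int)idx;
}
```

Binds that land in the same region then share a context and keep their submission order, while binds in different regions are at least not queued behind each other by the UMD itself.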
Also, for VM_BINDs on long-running gpu_vms, the user-mode driver should typically select memory fences as out-fences, since that gives the kernel-mode driver greater flexibility to inject other operations into the bind / unbind operations, for example inserting breakpoints into batch buffers. The workload execution can then easily be pipelined behind the bind completion using the memory out-fence as the signal condition for a GPU semaphore embedded by UMD in the workload.
There is no difference in the operations supported or in multi-operation support between asynchronous VM_BIND and synchronous VM_BIND.
Multi-operation VM_BIND IOCTL error handling and interrupts
The VM_BIND operations of the IOCTL may error for various reasons, for example due to lack of resources to complete or due to interrupted waits. In these situations UMD should preferably restart the IOCTL after taking suitable action. If UMD has over-committed a memory resource, an -ENOSPC error will be returned, and UMD may then unbind resources that are not used at the moment and rerun the IOCTL. On -EINTR, UMD should simply rerun the IOCTL, and on -ENOMEM user-space may either attempt to free known system memory resources or fail. If UMD decides to fail a bind operation due to an error return, no additional action is needed to clean up the failed operation, and the VM is left in the same state as it was before the failing IOCTL. Unbind operations are guaranteed not to return any errors due to resource constraints, but may return errors due to, for example, invalid arguments or the gpu_vm being banned. If an unexpected error happens during the asynchronous bind process, the gpu_vm will be banned, and attempts to use it after banning will return -ENOENT.
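The restart policy described above can be sketched as a small user-space retry loop. This is a sketch under stated assumptions, not driver code: `struct bind_hooks`, `vm_bind_with_retry()` and the `fake_*` stand-ins are hypothetical names, and the real IOCTL call is hidden behind a callback that is assumed to return 0 or a negative errno value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical callbacks supplied by the UMD; not part of any real uAPI. */
struct bind_hooks {
    int  (*submit)(void *ctx);        /* issue the VM_BIND IOCTL: 0 or -errno */
    bool (*unbind_unused)(void *ctx); /* release GPU resources; false if none left */
    bool (*free_sysmem)(void *ctx);   /* release system memory; false if none left */
};

/* Rerun the bind until it succeeds or no corrective action remains. */
static int vm_bind_with_retry(const struct bind_hooks *h, void *ctx)
{
    assert(h && h->submit && h->unbind_unused && h->free_sysmem);

    for (;;) {
        int ret = h->submit(ctx);

        switch (ret) {
        case -EINTR:                  /* interrupted wait: simply rerun */
            continue;
        case -ENOSPC:                 /* over-committed: unbind, then rerun */
            if (h->unbind_unused(ctx))
                continue;
            return ret;
        case -ENOMEM:                 /* free system memory, or give up */
            if (h->free_sysmem(ctx))
                continue;
            return ret;
        default:                      /* success, or an error we do not retry */
            return ret;
        }
    }
}

/* Scripted stand-ins for the IOCTL, used only to exercise the loop. */
struct fake_vm { const int *script; int pos; int evictions; };

static int fake_submit(void *ctx)
{
    struct fake_vm *vm = ctx;
    return vm->script[vm->pos++];
}

static bool fake_unbind_unused(void *ctx)
{
    struct fake_vm *vm = ctx;
    vm->evictions++;
    return true;
}

static bool fake_free_sysmem(void *ctx)
{
    (void)ctx;
    return false;
}
```

Giving up needs no cleanup path in this loop, because a failed IOCTL leaves the VM in the state it had before the call.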
Example: The Xe VM_BIND uAPI
Starting with the VM_BIND operation struct, the IOCTL call can take zero, one or many such operations. A count of zero means only the synchronization part of the IOCTL is carried out: an asynchronous VM_BIND updates the syncobjs, whereas a sync VM_BIND waits for the implicit dependencies to be fulfilled.
    struct drm_xe_vm_bind_op {
        /**
         * @obj: GEM object to operate on, MBZ for MAP_USERPTR, MBZ for UNMAP
         */
        __u32 obj;

        /** @pad: MBZ */
        __u32 pad;

        union {
            /**
             * @obj_offset: Offset into the object for MAP.
             */
            __u64 obj_offset;

            /** @userptr: user virtual address for MAP_USERPTR */
            __u64 userptr;
        };

        /**
         * @range: Number of bytes from the object to bind to addr, MBZ for UNMAP_ALL
         */
        __u64 range;

        /** @addr: Address to operate on, MBZ for UNMAP_ALL */
        __u64 addr;

        /**
         * @tile_mask: Mask for which tiles to create binds for, 0 == All tiles,
         * only applies to creating new VMAs
         */
        __u64 tile_mask;

    /*
     * Map (parts of) an object into the GPU virtual address range.
     */
    #define XE_VM_BIND_OP_MAP               0x0
    /* Unmap a GPU virtual address range */
    #define XE_VM_BIND_OP_UNMAP             0x1
    /*
     * Map a CPU virtual address range into a GPU virtual
     * address range.
     */
    #define XE_VM_BIND_OP_MAP_USERPTR       0x2
    /* Unmap a gem object from the VM. */
    #define XE_VM_BIND_OP_UNMAP_ALL         0x3
    /*
     * Make the backing memory of an address range resident if
     * possible. Note that this doesn't pin backing memory.
     */
    #define XE_VM_BIND_OP_PREFETCH          0x4

    /* Make the GPU map readonly. */
    #define XE_VM_BIND_FLAG_READONLY        (0x1 << 16)
    /*
     * Valid on a faulting VM only, do the MAP operation immediately rather
     * than deferring the MAP to the page fault handler.
     */
    #define XE_VM_BIND_FLAG_IMMEDIATE       (0x1 << 17)
    /*
     * When the NULL flag is set, the page tables are setup with a special
     * bit which indicates writes are dropped and all reads return zero. In
     * the future, the NULL flags will only be valid for XE_VM_BIND_OP_MAP
     * operations, the BO handle MBZ, and the BO offset MBZ. This flag is
     * intended to implement VK sparse bindings.
     */
    #define XE_VM_BIND_FLAG_NULL            (0x1 << 18)
        /** @op: Operation to perform (lower 16 bits) and flags (upper 16 bits) */
        __u32 op;

        /** @region: Memory region to prefetch VMA to, instance not a mask */
        __u32 region;

        /** @reserved: Reserved */
        __u64 reserved[2];
    };
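Since the op field carries the operation in its lower 16 bits and the flags in its upper 16 bits, a UMD needs to combine and split the two halves. The helpers below are a hypothetical sketch of that packing; only the constant values are taken from the listing, and the `xe_bind_op_*()` names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the struct listing; redefined so the sketch stands alone. */
#define XE_VM_BIND_OP_MAP           0x0
#define XE_VM_BIND_OP_UNMAP         0x1
#define XE_VM_BIND_OP_MAP_USERPTR   0x2
#define XE_VM_BIND_OP_UNMAP_ALL     0x3
#define XE_VM_BIND_OP_PREFETCH      0x4
#define XE_VM_BIND_FLAG_READONLY    (0x1 << 16)
#define XE_VM_BIND_FLAG_IMMEDIATE   (0x1 << 17)
#define XE_VM_BIND_FLAG_NULL        (0x1 << 18)

/* Hypothetical helper: combine an operation and its flags into one op word. */
static uint32_t xe_bind_op_pack(uint32_t op, uint32_t flags)
{
    assert((op & ~0xffffu) == 0);    /* operation must fit the lower 16 bits */
    assert((flags & 0xffffu) == 0);  /* flags must live in the upper 16 bits */
    return op | flags;
}

/* Hypothetical helpers: split an op word back into its two halves. */
static uint32_t xe_bind_op_code(uint32_t packed)
{
    return packed & 0xffffu;
}

static uint32_t xe_bind_op_flags(uint32_t packed)
{
    return packed & ~0xffffu;
}
```

A read-only userptr mapping, for example, would be requested with `xe_bind_op_pack(XE_VM_BIND_OP_MAP_USERPTR, XE_VM_BIND_FLAG_READONLY)`.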
The VM_BIND IOCTL argument itself looks as follows. Note that for synchronous VM_BIND, the num_syncs and syncs fields must be zero. Here the exec_queue_id field is the VM_BIND context discussed previously that is used to facilitate out-of-order VM_BINDs.
    struct drm_xe_vm_bind {
        /** @extensions: Pointer to the first extension struct, if any */
        __u64 extensions;

        /** @vm_id: The ID of the VM to bind to */
        __u32 vm_id;

        /**
         * @exec_queue_id: exec_queue_id, must be of class DRM_XE_ENGINE_CLASS_VM_BIND
         * and exec queue must have same vm_id. If zero, the default VM bind engine
         * is used.
         */
        __u32 exec_queue_id;

        /** @num_binds: number of binds in this IOCTL */
        __u32 num_binds;

    /* If set, perform an async VM_BIND, if clear a sync VM_BIND */
    #define XE_VM_BIND_IOCTL_FLAG_ASYNC     (0x1 << 0)

        /** @flags: Flags controlling all operations in this ioctl. */
        __u32 flags;

        union {
            /** @bind: used if num_binds == 1 */
            struct drm_xe_vm_bind_op bind;

            /**
             * @vector_of_binds: userptr to array of struct
             * drm_xe_vm_bind_op if num_binds > 1
             */
            __u64 vector_of_binds;
        };

        /** @num_syncs: amount of syncs to wait for or to signal on completion. */
        __u32 num_syncs;

        /** @pad2: MBZ */
        __u32 pad2;

        /** @syncs: pointer to struct drm_xe_sync array */
        __u64 syncs;

        /** @reserved: Reserved */
        __u64 reserved[2];
    };
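Two rules stated for this argument lend themselves to a quick pre-submission check on the UMD side: a synchronous VM_BIND must leave num_syncs and syncs at zero, and num_binds decides which union member carries the operations. The sketch below uses a hypothetical `struct bind_args_subset` that mirrors only the fields involved, so it does not depend on the real uAPI header; the helper names are invented as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value copied from the struct listing. */
#define XE_VM_BIND_IOCTL_FLAG_ASYNC (0x1 << 0)

/* Hypothetical mirror of the few drm_xe_vm_bind fields checked here. */
struct bind_args_subset {
    uint32_t num_binds;
    uint32_t flags;
    uint32_t num_syncs;
    uint64_t syncs;
};

/* Which union member the kernel is expected to read. */
enum bind_payload {
    BIND_PAYLOAD_NONE,    /* num_binds == 0: synchronization only */
    BIND_PAYLOAD_INLINE,  /* num_binds == 1: the embedded 'bind' member */
    BIND_PAYLOAD_VECTOR   /* num_binds  > 1: 'vector_of_binds' pointer */
};

/* A synchronous VM_BIND takes no syncs, so both fields must be zero. */
static bool bind_args_syncs_ok(const struct bind_args_subset *a)
{
    assert(a);
    if (a->flags & XE_VM_BIND_IOCTL_FLAG_ASYNC)
        return true;
    return a->num_syncs == 0 && a->syncs == 0;
}

static enum bind_payload bind_args_payload(const struct bind_args_subset *a)
{
    assert(a);
    if (a->num_binds == 0)
        return BIND_PAYLOAD_NONE;
    return a->num_binds == 1 ? BIND_PAYLOAD_INLINE : BIND_PAYLOAD_VECTOR;
}
```

Running such a check before the IOCTL is optional; it catches argument mistakes in user space before they reach the kernel.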