User Mode Queues

Introduction

Similar to the KFD, GPU engine queues move into userspace. The idea is to let user processes manage their submissions to the GPU engines directly, bypassing IOCTL calls to the driver to submit work. This reduces overhead and also allows the GPU to submit work to itself. Applications can set up work graphs of jobs across multiple GPU engines without needing trips through the CPU.

UMDs directly interface with firmware via per-application shared memory areas. The main vehicle for this is the queue. A queue is a ring buffer with a read pointer (rptr) and a write pointer (wptr). The UMD writes IP-specific packets into the queue and the firmware processes those packets, kicking off work on the GPU engines. The application on the CPU (or another queue or device) updates the wptr to tell the firmware how far into the ring buffer to process packets, and the rptr provides feedback to the UMD on how far the firmware has progressed in executing those packets. When the wptr and the rptr are equal, the queue is idle.
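The rptr/wptr relationship above can be sketched in a few lines of C. This is an illustrative model only, assuming dword-indexed pointers and a power-of-two ring size; the structure and helper names are hypothetical, not the driver's actual types.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Illustrative model of user-queue ring state; not the real driver
 * structures. rptr is shadowed to userspace by the firmware, wptr is
 * written by the application. */
struct userq_ring {
	uint64_t rptr;   /* how far firmware has executed */
	uint64_t wptr;   /* how far the UMD has written */
	uint32_t size;   /* ring size in dwords (power of two) */
};

/* The queue is idle when firmware has consumed everything written. */
static bool userq_idle(const struct userq_ring *q)
{
	return q->rptr == q->wptr;
}

/* Dwords the UMD may still write before overwriting unread packets. */
static uint32_t userq_free_dw(const struct userq_ring *q)
{
	return q->size - (uint32_t)(q->wptr - q->rptr);
}
```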

Theory of Operation

The various engines on modern AMD GPUs support multiple queues per engine, with scheduling firmware that dynamically schedules user queues on the available hardware queue slots. When the number of user queues outnumbers the available hardware queue slots, the scheduling firmware dynamically maps and unmaps queues based on priority and time quanta. The state of each user queue is managed in the kernel driver in an MQD (Memory Queue Descriptor). This is a buffer in GPU-accessible memory that stores the state of a user queue. The scheduling firmware uses the MQD to load the queue state into an HQD (Hardware Queue Descriptor) when a user queue is mapped.

Each user queue requires a number of additional buffers which represent the ring buffer and any metadata needed by the engine for runtime operation. On most engines this consists of the ring buffer itself, a rptr buffer (where the firmware will shadow the rptr to userspace), a wptr buffer (where the application will write the wptr for the firmware to fetch), and a doorbell.

A doorbell is a piece of one of the device's MMIO BARs which can be mapped to specific user queues. When the application writes to the doorbell, it signals the firmware to take some action: writing the doorbell wakes the firmware and causes it to fetch the wptr and start processing the packets in the queue. Each 4K page of the doorbell BAR supports specific offset ranges for specific engines. The doorbell of a queue must be mapped into the aperture aligned to the IP used by the queue (e.g., GFX, VCN, SDMA, etc.). These doorbell apertures are set up via NBIO registers. Doorbells are 32-bit or 64-bit (depending on the engine) chunks of the doorbell BAR. A 4K doorbell page provides 512 64-bit doorbells for up to 512 user queues. A subset of each page is reserved for each IP type supported on the device. The user can query the doorbell ranges for each IP via the INFO IOCTL. See the IOCTL Interfaces section for more information.
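The doorbell page arithmetic above (a 4K page of 64-bit doorbells holds 512 entries) can be written down directly. This is only the size math; the index-to-offset mapping below is a sketch, and the real per-IP offset ranges must come from the INFO IOCTL.

```c
#include <stdint.h>
#include <assert.h>

/* Doorbell page math as described in the text: 4096 bytes per page,
 * 8 bytes (64 bits) per doorbell, so 512 doorbells per page. The
 * offset helper is illustrative, not the driver's actual layout. */
#define DOORBELL_PAGE_SIZE   4096u
#define DOORBELL_ENTRY_SIZE  8u
#define DOORBELLS_PER_PAGE   (DOORBELL_PAGE_SIZE / DOORBELL_ENTRY_SIZE)

/* Byte offset of doorbell `index` within doorbell page `page`. */
static uint32_t doorbell_byte_offset(uint32_t page, uint32_t index)
{
	return page * DOORBELL_PAGE_SIZE + index * DOORBELL_ENTRY_SIZE;
}
```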

When an application wants to create a user queue, it allocates the necessary buffers for the queue (ring buffer, wptr and rptr buffers, context save areas, etc.). These can be separate buffers or all part of one larger buffer. The application would map the buffer(s) into its GPUVM and use the GPU virtual addresses for the areas of memory it wants to use for the user queue. It would also allocate a doorbell page for the doorbells used by the user queues. The application would then populate the MQD in the USERQ IOCTL structure with the GPU virtual addresses and doorbell index it wants to use. The user can also specify the attributes for the user queue (priority, whether the queue is secure for protected content, etc.). The application would then call the USERQ CREATE IOCTL to create the queue using the specified MQD details in the IOCTL. The kernel driver then validates the MQD provided by the application and translates the MQD into the engine-specific MQD format for the IP. The IP-specific MQD would be allocated and the queue would be added to the run list maintained by the scheduling firmware. Once the queue has been created, the application can write packets directly into the queue, update the wptr, and write to the doorbell offset to kick off work in the user queue.
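The creation flow can be sketched from the kernel's point of view. The request structure and validation helper below are hypothetical simplifications, not the real UAPI: they only illustrate that the UMD hands over GPU virtual addresses plus a doorbell index, and that the driver sanity-checks them before building the IP-specific MQD.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Hypothetical, simplified view of what a UMD passes to the USERQ
 * CREATE IOCTL. Field names are illustrative, not the actual UAPI. */
struct userq_create_req {
	uint64_t ring_va;        /* GPU VA of the ring buffer */
	uint64_t rptr_va;        /* GPU VA firmware shadows the rptr to */
	uint64_t wptr_va;        /* GPU VA firmware fetches the wptr from */
	uint32_t doorbell_index; /* index within the doorbell page */
	uint32_t flags;          /* priority, secure queue, etc. */
};

/* Sketch of the kernel-side sanity check before the request is
 * translated into the engine-specific MQD: all VAs must be present
 * and the doorbell index must fit a 4K page of 64-bit doorbells. */
static bool userq_create_valid(const struct userq_create_req *req)
{
	if (!req->ring_va || !req->rptr_va || !req->wptr_va)
		return false;
	return req->doorbell_index < 4096 / 8;
}
```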

When the application is done with the user queue, it would call the USERQ FREE IOCTL to destroy it. The kernel driver would preempt the queue and remove it from the scheduling firmware's run list. Then the IP-specific MQD would be freed and the user queue state would be cleaned up.

Some engines may also require an aggregated doorbell if the engine does not support doorbells from unmapped queues. The aggregated doorbell is a special page of doorbell space which wakes the scheduler. In cases where the engine may be oversubscribed, some queues may not be mapped. If the doorbell is rung when the queue is not mapped, the engine firmware may miss the request. Some scheduling firmware may work around this by polling wptr shadows when the hardware is oversubscribed; other engines may support doorbell updates from unmapped queues. In the event that neither option is available, the kernel driver will map a page of aggregated doorbell space into each GPUVM space. The UMD will then update the doorbell and wptr as normal and then write to the aggregated doorbell as well.
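The submission order described above can be sketched as follows. Plain memory stands in for the MMIO doorbells, and all names are illustrative; the point is only the ordering: publish the wptr first, ring the queue doorbell, then ring the aggregated doorbell so the scheduler wakes even if the queue is currently unmapped.

```c
#include <stdint.h>
#include <assert.h>

/* Stand-ins for the wptr shadow and the two doorbells; in real
 * hardware the doorbells are MMIO, here they are plain memory for
 * illustration. */
struct userq_sub {
	uint64_t wptr_shadow;    /* fetched by firmware */
	uint64_t queue_doorbell; /* per-queue doorbell */
	uint64_t agg_doorbell;   /* wakes the scheduling firmware */
};

/* Sketch of an oversubscription-safe submission. */
static void userq_submit(struct userq_sub *s, uint64_t new_wptr)
{
	s->wptr_shadow = new_wptr;    /* 1: publish the new wptr */
	s->queue_doorbell = new_wptr; /* 2: ring the queue doorbell */
	s->agg_doorbell = new_wptr;   /* 3: ring the aggregated doorbell */
}
```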

Special Packets

In order to support legacy implicit synchronization, as well as mixed user and kernel queues, we need a synchronization mechanism that is secure. Because kernel queues and memory management tasks depend on kernel fences, we need a way for user queues to update memory that the kernel can use for a fence and that can't be tampered with by a bad actor. To support this, we've added a protected fence packet. This packet works by writing a monotonically increasing value to a memory location that only privileged clients have write access to; user queues only have read access. When this packet is executed, the memory location is updated and other queues (kernel or user) can see the results. The user application would submit this packet in its command stream. The actual packet format varies from IP to IP (GFX/Compute, SDMA, VCN, etc.), but the behavior is the same. The packet submission is handled in userspace. The kernel driver sets up the privileged memory for each user queue when the application creates the queue.
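The protected-fence semantics reduce to a monotonic sequence number. The sketch below models the behavior only: the emit side stands in for the packet executing on the engine (the privileged writer), and the check side is what any reader, kernel or user, can do. Function names are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Model of the protected fence: only the privileged packet execution
 * path can reach this store in real hardware; user queues have
 * read-only access to fence_mem. */
static void protected_fence_emit(uint64_t *fence_mem, uint64_t seq)
{
	if (seq > *fence_mem)
		*fence_mem = seq; /* value only ever increases */
}

/* Any reader can test whether a given sequence number has signaled. */
static bool protected_fence_signaled(const uint64_t *fence_mem,
				     uint64_t seq)
{
	return *fence_mem >= seq;
}
```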

Memory Management

It is assumed that all buffers mapped into the GPUVM space for the process are valid when engines on the GPU are running. The kernel driver will only allow user queues to run when all buffers are mapped. If there is a memory event that requires buffer migration, the kernel driver will preempt the user queues, migrate the buffers to where they need to be, update the GPUVM page tables, invalidate the TLB, and then resume the user queues.
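The migration sequence above can be sketched as ordered steps. The step functions are hypothetical stand-ins for the real driver work; they only record the order in which the driver performs each stage of a memory event.

```c
#include <assert.h>

enum mm_step { PREEMPT, MIGRATE, UPDATE_PT, INVALIDATE_TLB, RESUME };

/* Log of steps taken, for illustration only. */
static enum mm_step mm_log[5];
static int mm_log_n;

static void record(enum mm_step s)    { mm_log[mm_log_n++] = s; }

static void preempt_user_queues(void) { record(PREEMPT); }
static void migrate_buffers(void)     { record(MIGRATE); }
static void update_page_tables(void)  { record(UPDATE_PT); }
static void invalidate_tlb(void)      { record(INVALIDATE_TLB); }
static void resume_user_queues(void)  { record(RESUME); }

/* Handle a memory event that requires buffer migration: user queues
 * must be preempted before buffers move, and only resume once the
 * page tables and TLB reflect the new locations. */
static void handle_memory_event(void)
{
	preempt_user_queues();
	migrate_buffers();
	update_page_tables();
	invalidate_tlb();
	resume_user_queues();
}
```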

Interaction with Kernel Queues

Depending on the IP and the scheduling firmware, kernel queues and user queues can be enabled at the same time; however, you are limited by the available HQD slots. Kernel queues are always mapped, so any work that goes into kernel queues will take priority. This limits the available HQD slots for user queues.

Not all IPs will support user queues on all GPUs. As such, UMDs will need to support both user queues and kernel queues depending on the IP. For example, a GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG, and VPE; UMDs need to support both in that case. The kernel driver provides a way to determine if user queues and kernel queues are supported on a per-IP basis. UMDs can query this information via the INFO IOCTL and determine whether to use kernel queues or user queues for each IP.

Queue Resets

On most engines, queues can be reset individually; GFX, compute, and SDMA queues all support per-queue reset. When a hung queue is detected, it can be reset either via the scheduling firmware or via MMIO. Since there are no kernel fences for most user queues, hangs will usually only be detected when some other event happens, e.g., a memory event which requires migration of buffers. When the queues are preempted, the preemption will fail if a queue is hung. The driver will then look up the queues that failed to preempt, reset them, and record which queues are hung.

On the UMD side, we will add a USERQ QUERY_STATUS IOCTL to query the queue status. The UMD will provide the queue id in the IOCTL, and the kernel driver will check if it has already recorded the queue as hung (e.g., due to a failed preemption) and report back the status.

IOCTL Interfaces

GPU virtual addresses used for queues and related data (rptrs, wptrs, context save areas, etc.) should be validated by the kernel mode driver to prevent the user from specifying invalid GPU virtual addresses. If the user provides invalid GPU virtual addresses or doorbell indices, the IOCTL should return an error. These buffers should also be tracked in the kernel driver so that, if the user attempts to unmap the buffer(s) from the GPUVM, the unmap call would return an error.

INFO

There are several new INFO queries related to user queues: the size of the user queue metadata needed for a user queue (e.g., context save areas or shadow buffers), whether kernel queues, user queues, or both are supported for each IP type, and the offsets for each IP type in each doorbell page.

USERQ

The USERQ IOCTL is used for creating, freeing, and querying the status of user queues. It supports 3 opcodes:

  1. CREATE - Create a user queue. The application provides an MQD-like structure that defines the type of queue and associated metadata and flags for that queue type. Returns the queue id.

  2. FREE - Free a user queue.

  3. QUERY_STATUS - Query the status of a queue. Used to check if the queue is healthy or not, e.g., if the queue has been reset. (WIP)

USERQ_SIGNAL

The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be signaled.

USERQ_WAIT

The USERQ_WAIT IOCTL is used to provide a list of sync objects to be waited on.

Kernel and User Queues

In order to properly validate and test performance, we have a driver option to select what type of queues are enabled (kernel queues, user queues, or both). The user_queue driver parameter allows you to enable kernel queues only (0), user queues and kernel queues (1), or user queues only (2). Enabling user queues only will free up static queue assignments that would otherwise be used by kernel queues for use by the scheduling firmware. Some kernel queues are required for kernel driver operation and will always be created. When kernel queues are not enabled, they are not registered with the drm scheduler, and the CS IOCTL will reject any incoming command submissions which target those queue types. Kernel queues only mirrors the behavior of all existing GPUs. Enabling both allows for backwards compatibility with old userspace while still supporting user queues.
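The three user_queue parameter values map to queue-type availability as follows. The helper names are illustrative; this sketch only encodes the 0/1/2 mapping stated above.

```c
#include <stdbool.h>
#include <assert.h>

/* user_queue parameter mapping per the text above:
 *   0 = kernel queues only
 *   1 = user queues and kernel queues
 *   2 = user queues only
 * Helper names are hypothetical. */
static bool kernel_queues_enabled(int user_queue)
{
	return user_queue == 0 || user_queue == 1;
}

static bool user_queues_enabled(int user_queue)
{
	return user_queue == 1 || user_queue == 2;
}
```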