GPU SVM Section
Agreed upon design principles
- migrate_to_ram path
  - Rely only on core MM concepts (migration PTEs, page references, and page locking).
  - No driver-specific locks other than locks for hardware interaction in this path. These are not required, and it is generally a bad idea to invent driver-defined locks to seal core MM races.
  - An example of a driver-specific lock causing issues occurred before fixing do_swap_page to lock the faulting page. A driver-exclusive lock in migrate_to_ram produced a stable livelock if enough threads read the faulting page.
  - Partial migration is supported (i.e., a subset of pages attempting to migrate can actually migrate, with only the faulting page guaranteed to migrate).
  - The driver handles mixed migrations via retry loops rather than locking.
- Eviction
  - Eviction is defined as migrating data from the GPU back to the CPU without a virtual address, to free up GPU memory.
  - Only look at physical memory data structures and locks, as opposed to virtual memory data structures and locks.
  - No looking at mm/vma structs or relying on those being locked.
  - The rationale for the above two points is that CPU virtual addresses can change at any moment, while the physical pages remain stable.
  - GPU page table invalidation, which requires a GPU virtual address, is handled via the notifier that has access to the GPU virtual address.
- GPU fault side
  - The mmap_read lock is only used around core MM functions which require it, and the design should strive to take the mmap_read lock only in the GPU SVM layer.
  - A big retry loop handles all races with the MMU notifier under the GPU pagetable locks / MMU notifier range lock / whatever we end up calling those.
  - Races (especially against concurrent eviction or migrate_to_ram) should not be handled on the fault side by trying to hold locks; rather, they should be handled using retry loops. One possible exception is holding a BO’s dma-resv lock during the initial migration to VRAM, as this is a well-defined lock that can be taken underneath the mmap_read lock.
  - One possible issue with the above approach is if a driver has a strict migration policy requiring GPU access to occur in GPU memory. Concurrent CPU access could cause a livelock due to endless retries. While no current user (Xe) of GPU SVM has such a policy, it is likely to be added in the future. Ideally, this should be resolved on the core-MM side rather than through a driver-side lock.
- Physical memory to virtual backpointer
  - This does not work, as no pointers from physical memory to virtual memory should exist.
  - mremap() is an example of the core MM updating the virtual address without notifying the driver of the address change; rather, the driver only receives the invalidation notifier. The physical memory backpointer (page->zone_device_data) should remain stable from allocation to page free. Safely updating this against a concurrent user would be very difficult unless the page is free.
- GPU pagetable locking
  - The notifier lock only protects the range tree, the pages-valid state for a range (rather than a seqno, due to wider notifiers), pagetable entries, and MMU notifier seqno tracking; it is not a global lock to protect against races.
  - All races are handled with the big retry loop as mentioned above.
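To make the locking rule above concrete, the following is a minimal sketch of a driver commit path that detects a racing invalidation under the notifier lock and reports it back so the fault handler can retry. It mirrors the fuller driver_bind_range example later in this document; driver_program_pagetables() is a hypothetical driver helper.

// Sketch only: commit GPU pagetable state under the notifier lock and let
// the caller retry the whole fault if an invalidation raced with us.
// driver_program_pagetables() is a hypothetical driver helper.
int driver_commit_or_retry(struct drm_gpusvm *gpusvm,
                           struct drm_gpusvm_range *range)
{
        int err = 0;

        drm_gpusvm_notifier_lock(gpusvm);
        if (drm_gpusvm_range_pages_valid(range))
                driver_program_pagetables(gpusvm, range);       // no race, commit
        else
                err = -EAGAIN;          // raced with a notifier, caller retries
        drm_gpusvm_notifier_unlock(gpusvm);

        return err;
}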
Overview of baseline design
The GPU Shared Virtual Memory (GPU SVM) layer for the Direct Rendering Manager (DRM) is a component of the DRM framework designed to manage shared virtual memory between the CPU and GPU. It enables efficient data exchange and processing for GPU-accelerated applications by allowing memory sharing and synchronization between the CPU’s and GPU’s virtual address spaces.
Key GPU SVM Components:
- Notifiers:
Used for tracking memory intervals and notifying the GPU of changes, notifiers are sized based on a GPU SVM initialization parameter, with a recommendation of 512M or larger. They maintain a Red-Black tree and a list of ranges that fall within the notifier interval. Notifiers are tracked within a GPU SVM Red-Black tree and list and are dynamically inserted or removed as ranges within the interval are created or destroyed.
- Ranges:
Represent memory ranges mapped in a DRM device and managed by GPU SVM. They are sized based on an array of chunk sizes, which is a GPU SVM initialization parameter, and the CPU address space. Upon GPU fault, the largest aligned chunk that fits within the faulting CPU address space is chosen for the range size (see the sketch after this list). Ranges are expected to be dynamically allocated on GPU fault and removed on an MMU notifier UNMAP event. As mentioned above, ranges are tracked in a notifier’s Red-Black tree.
- Operations:
Define the interface for driver-specific GPU SVM operations such as range allocation, notifier allocation, and invalidations.
- Device Memory Allocations:
Embedded structure containing enough information for GPU SVM to migrate to / from device memory.
- Device Memory Operations:
Define the interface for driver-specific device memory operations: releasing memory, populating pfns, and copying to / from device memory.
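As an illustration of the range sizing rule described for Ranges above, here is a hedged sketch of choosing the largest aligned chunk that both contains the faulting address and fits within the surrounding CPU mapping. The chunk table and helper name are hypothetical driver parameters; the actual selection is performed inside GPU SVM.

// Sketch only: pick the largest aligned chunk containing fault_addr that
// still fits within [vas_start, vas_end). The chunk table below is a
// made-up example of the GPU SVM chunk-size initialization parameter.
static const unsigned long driver_chunk_sizes[] = {
        SZ_2M, SZ_64K, SZ_4K,                   // largest first
};

unsigned long driver_pick_chunk(unsigned long fault_addr,
                                unsigned long vas_start,
                                unsigned long vas_end)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(driver_chunk_sizes); ++i) {
                unsigned long size = driver_chunk_sizes[i];
                unsigned long start = ALIGN_DOWN(fault_addr, size);

                if (start >= vas_start && start + size <= vas_end)
                        return size;    // largest aligned chunk that fits
        }

        return SZ_4K;                   // fall back to a single page
}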
This layer provides interfaces for allocating, mapping, migrating, and releasing memory ranges between the CPU and GPU. It handles all core memory management interactions (DMA mapping, HMM, and migration) and provides driver-specific virtual functions (vfuncs). This infrastructure is sufficient to build the expected driver components for an SVM implementation as detailed below.
Expected Driver Components:
- GPU page fault handler:
Used to create ranges and notifiers based on the fault address, optionally migrate the range to device memory, and create GPU bindings.
- Garbage collector:
Used to unmap and destroy GPU bindings for ranges. Ranges are expected to be added to the garbage collector upon an MMU_NOTIFY_UNMAP event in the notifier callback.
- Notifier callback:
Used to invalidate and DMA unmap GPU bindings for ranges.
GPU SVM handles locking for core MM interactions, i.e., it locks/unlocks the mmap lock as needed.
GPU SVM introduces a global notifier lock, which safeguards the notifier’s range RB tree and list, as well as the range’s DMA mappings and sequence number. GPU SVM manages all necessary locking and unlocking operations, except for the recheck of the range’s pages being valid (drm_gpusvm_range_pages_valid) when the driver is committing GPU bindings. This lock corresponds to the driver->update lock mentioned in Heterogeneous Memory Management (HMM). Future revisions may transition from a GPU SVM global lock to a per-notifier lock if finer-grained locking is deemed necessary.
In addition to the locking mentioned above, the driver should implement a lock to safeguard core GPU SVM function calls that modify state, such as drm_gpusvm_range_find_or_insert and drm_gpusvm_range_remove. This lock is denoted as ‘driver_svm_lock’ in code examples. Finer-grained driver-side locking should also be possible for concurrent GPU fault processing within a single GPU SVM. The ‘driver_svm_lock’ can be set via drm_gpusvm_driver_set_lock to add annotations to GPU SVM.
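A minimal sketch of that driver-side lock is shown below, assuming drm_gpusvm_driver_set_lock() accepts the GPU SVM instance and the lock used for its annotations; the surrounding driver_svm structure and init helper are hypothetical.

// Sketch only: a coarse driver-side lock ('driver_svm_lock' in the code
// examples) guarding state-modifying GPU SVM calls such as
// drm_gpusvm_range_find_or_insert() and drm_gpusvm_range_remove().
// The driver_svm structure is hypothetical.
struct driver_svm {
        struct drm_gpusvm gpusvm;
        struct mutex lock;
};

void driver_svm_init_locking(struct driver_svm *svm)
{
        mutex_init(&svm->lock);
        // Hand the lock to GPU SVM so it can annotate its internal checks.
        drm_gpusvm_driver_set_lock(&svm->gpusvm, &svm->lock);
}

// driver_svm_lock()/driver_svm_unlock() in the examples below simply wrap
// mutex_lock(&svm->lock) / mutex_unlock(&svm->lock).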
Partial unmapping of ranges (e.g., 1M out of 2M is unmapped by the CPU, resulting in an MMU_NOTIFY_UNMAP event) presents several challenges, with the main one being that a subset of the range still has CPU and GPU mappings. If the backing store for the range is in device memory, a subset of the backing store has references. One option would be to split the range and device memory backing store, but the implementation for this would be quite complicated. Given that partial unmappings are rare and driver-defined range sizes are relatively small, GPU SVM does not support splitting of ranges.
With no support for range splitting, upon partial unmapping of a range, the driver is expected to invalidate and destroy the entire range. If the range has device memory as its backing, the driver is also expected to migrate any remaining pages back to RAM.
This section provides three examples of how to build the expected driver components: the GPU page fault handler, the garbage collector, and the notifier callback.
The generic code provided does not include logic for complex migration policies, optimized invalidations, fine-grained driver locking, or other potentially required driver locking (e.g., DMA-resv locks).
GPU page fault handler
int driver_bind_range(struct drm_gpusvm *gpusvm, struct drm_gpusvm_range *range)
{
        int err = 0;

        driver_alloc_and_setup_memory_for_bind(gpusvm, range);

        drm_gpusvm_notifier_lock(gpusvm);
        if (drm_gpusvm_range_pages_valid(range))
                driver_commit_bind(gpusvm, range);
        else
                err = -EAGAIN;
        drm_gpusvm_notifier_unlock(gpusvm);

        return err;
}

int driver_gpu_fault(struct drm_gpusvm *gpusvm, unsigned long fault_addr,
                     unsigned long gpuva_start, unsigned long gpuva_end)
{
        struct drm_gpusvm_ctx ctx = {};
        struct drm_gpusvm_range *range;
        int err;

        driver_svm_lock();
retry:
        // Always process UNMAPs first so view of GPU SVM ranges is current
        driver_garbage_collector(gpusvm);

        range = drm_gpusvm_range_find_or_insert(gpusvm, fault_addr,
                                                gpuva_start, gpuva_end,
                                                &ctx);
        if (IS_ERR(range)) {
                err = PTR_ERR(range);
                goto unlock;
        }

        if (driver_migration_policy(range)) {
                err = drm_pagemap_populate_mm(driver_choose_drm_pagemap(),
                                              gpuva_start, gpuva_end,
                                              gpusvm->mm,
                                              ctx.timeslice_ms);
                if (err)        // CPU mappings may have changed
                        goto retry;
        }

        err = drm_gpusvm_range_get_pages(gpusvm, range, &ctx);
        if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) {   // CPU mappings changed
                if (err == -EOPNOTSUPP)
                        drm_gpusvm_range_evict(gpusvm, range);
                goto retry;
        } else if (err) {
                goto unlock;
        }

        err = driver_bind_range(gpusvm, range);
        if (err == -EAGAIN)     // CPU mappings changed
                goto retry;

unlock:
        driver_svm_unlock();

        return err;
}
Garbage Collector
void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
                                struct drm_gpusvm_range *range)
{
        assert_driver_svm_locked(gpusvm);

        // Partial unmap, migrate any remaining device memory pages back to RAM
        if (range->flags.partial_unmap)
                drm_gpusvm_range_evict(gpusvm, range);

        driver_unbind_range(range);
        drm_gpusvm_range_remove(gpusvm, range);
}

void driver_garbage_collector(struct drm_gpusvm *gpusvm)
{
        assert_driver_svm_locked(gpusvm);

        for_each_range_in_garbage_collector(gpusvm, range)
                __driver_garbage_collector(gpusvm, range);
}
Notifier callback
void driver_invalidation(struct drm_gpusvm *gpusvm,
                         struct drm_gpusvm_notifier *notifier,
                         const struct mmu_notifier_range *mmu_range)
{
        struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
        struct drm_gpusvm_range *range = NULL;

        driver_invalidate_device_pages(gpusvm, mmu_range->start, mmu_range->end);

        drm_gpusvm_for_each_range(range, notifier, mmu_range->start,
                                  mmu_range->end) {
                drm_gpusvm_range_unmap_pages(gpusvm, range, &ctx);

                if (mmu_range->event != MMU_NOTIFY_UNMAP)
                        continue;

                drm_gpusvm_range_set_unmapped(range, mmu_range);
                driver_garbage_collector_add(gpusvm, range);
        }
}
Overview of drm_pagemap design
The DRM pagemap layer is intended to augment the dev_pagemap functionality by providing a way to populate a struct mm_struct virtual range with device private pages and to provide helpers to abstract device memory allocations, to migrate memory back and forth between device memory and system RAM, and to handle access (and in the future migration) between devices implementing a fast interconnect that is not necessarily visible to the rest of the system.
Typically the DRM pagemap receives requests from one or more DRM GPU SVM instances to populate struct mm_struct virtual ranges with memory; the migration is best effort only and may thus fail. The implementation should also handle device unbinding by blocking (returning an -ENODEV error) new population requests and, after that, migrating all device pages to system RAM.
Migration granularity typically follows the GPU SVM range requests, but there may be clashes, due to races or due to the fact that multiple GPU SVM instances have different views of the ranges used, so that parts of a requested range are already present in the requested device memory. In that case the implementation has a variety of options: it can fail, it can choose to populate only the part of the range that isn’t already in device memory, or it can evict the range to system memory before trying to migrate. Ideally an implementation would just try to migrate the missing part of the range and allocate just enough memory to do so.
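The options above could look roughly like the sketch below. All driver_* helpers are hypothetical, and a real implementation would sit behind whatever population entry point the drm_pagemap implementation exposes.

// Sketch only: handle a populate request that partially clashes with
// memory already resident in the requested device memory.
// All driver_* helpers here are hypothetical.
int driver_populate_devmem(struct drm_pagemap *dpagemap, struct mm_struct *mm,
                           unsigned long start, unsigned long end)
{
        if (driver_range_fully_resident(dpagemap, start, end))
                return 0;                       // nothing left to migrate

        // Preferred: migrate only the missing part of the range and
        // allocate just enough device memory to back it.
        if (!driver_migrate_missing(dpagemap, mm, start, end))
                return 0;

        // Fallback: evict the clashing pages to system RAM first, then
        // retry a full migration. Both steps are best effort and may fail.
        driver_evict_range(dpagemap, start, end);
        return driver_migrate_range(dpagemap, mm, start, end);
}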
When migrating to system memory in response to a CPU fault or a device memory eviction request, currently a full device memory allocation is migrated back to system memory. Moving forward, this might need improvement for situations where a single page needs to bounce between system memory and device memory due to, for example, atomic operations.
Key DRM pagemap components:
- Device Memory Allocations:
Embedded structure containing enough information for the drm_pagemap to migrate to / from device memory.
- Device Memory Operations:
Define the interface for driver-specific device memory operations: releasing memory, populating pfns, and copying to / from device memory.
Possible future design features
- Concurrent GPU faults
CPU faults are concurrent, so it makes sense to have concurrent GPU faults.
Should be possible with fine-grained locking in the driver GPU fault handler.
No expected GPU SVM changes required.
- Ranges with mixed system and device pages
Can be added to drm_gpusvm_get_pages fairly easily if required.
- Multi-GPU support
Work in progress and patches expected after initially landing on GPU SVM.
Ideally can be done with little to no changes to GPU SVM.
- Drop ranges in favor of radix tree
May be desirable for faster notifiers.
- Compound device pages
Nvidia, AMD, and Intel have all agreed that expensive core MM functions in the migrate device layer are a performance bottleneck; having compound device pages should help increase performance by reducing the number of these expensive calls.
- Higher order dma mapping for migration
4k dma mapping adversely affects migration performance on Intel hardware; higher order (2M) dma mapping should help here.
- Build common userptr implementation on top of GPU SVM
- Driver side madvise implementation and migration policies
- Pull in pending dma-mapping API changes from Leon / Nvidia when these land