VM_BIND locking
This document attempts to describe what’s needed to get VM_BIND locking right, including the userptr mmu_notifier locking. It also discusses some optimizations to get rid of the looping through of all userptr mappings and external / shared object mappings that is needed in the simplest implementation. In addition, there is a section describing the VM_BIND locking required for implementing recoverable pagefaults.
The DRM GPUVM set of helpers
There is a set of helpers for drivers implementing VM_BIND, and this set of helpers implements much, but not all, of the locking described in this document. In particular, it is currently lacking a userptr implementation. This document does not intend to describe the DRM GPUVM implementation in detail, but it is covered in its own documentation. It is highly recommended for any driver implementing VM_BIND to use the DRM GPUVM helpers and to extend them if common functionality is missing.
Nomenclature
* gpu_vm: Abstraction of a virtual GPU address space with meta-data. Typically one per client (DRM file-private), or one per execution context.
* gpu_vma: Abstraction of a GPU address range within a gpu_vm with associated meta-data. The backing storage of a gpu_vma can either be a GEM object or anonymous or page-cache pages mapped also into the CPU address space for the process.
* gpu_vm_bo: Abstracts the association of a GEM object and a VM. The GEM object maintains a list of gpu_vm_bos, where each gpu_vm_bo maintains a list of gpu_vmas.
* userptr gpu_vma or just userptr: A gpu_vma whose backing store is anonymous or page-cache pages as described above.
* revalidating: Revalidating a gpu_vma means making the latest version of the backing store resident and making sure the gpu_vma’s page-table entries point to that backing store.
* dma_fence: A struct dma_fence that is similar to a struct completion and which tracks GPU activity. When the GPU activity is finished, the dma_fence signals. Please refer to the DMA Fences section of the dma-buf doc.
* dma_resv: A struct dma_resv (a.k.a. reservation object) that is used to track GPU activity in the form of multiple dma_fences on a gpu_vm or a GEM object. The dma_resv contains an array / list of dma_fences and a lock that needs to be held when adding additional dma_fences to the dma_resv. The lock is of a type that allows deadlock-safe locking of multiple dma_resvs in arbitrary order. Please refer to the Reservation Objects section of the dma-buf doc.
* exec function: An exec function is a function that revalidates all affected gpu_vmas, submits a GPU command batch and registers the dma_fence representing the GPU command’s activity with all affected dma_resvs. For completeness, although not covered by this document, it’s worth mentioning that an exec function may also be the revalidation worker that is used by some drivers in compute / long-running mode.
* local object: A GEM object which is only mapped within a single VM. Local GEM objects share the gpu_vm’s dma_resv.
* external object (a.k.a. shared object): A GEM object which may be shared by multiple gpu_vms and whose backing storage may be shared with other drivers.
Locks and locking order
One of the benefits of VM_BIND is that local GEM objects share the gpu_vm’s dma_resv object and hence the dma_resv lock. So, even with a huge number of local GEM objects, only one lock is needed to make the exec sequence atomic.
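The sharing of the reservation object can be illustrated with a small user-space sketch. It is only a toy model under stated assumptions: struct toy_resv, struct toy_gpu_vm, struct toy_gem_object and the toy_* functions are hypothetical names introduced here, and the "lock" is a plain flag rather than a real dma_resv lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct dma_resv: only records whether it is locked. */
struct toy_resv {
	bool locked;
};

struct toy_gpu_vm {
	struct toy_resv resv;          /* the gpu_vm's own reservation object */
};

struct toy_gem_object {
	struct toy_resv private_resv;  /* only used by external objects */
	struct toy_resv *resv;         /* the reservation object actually in use */
};

/* A local object borrows the gpu_vm's reservation object. */
static void toy_init_local(struct toy_gem_object *obj, struct toy_gpu_vm *vm)
{
	obj->private_resv.locked = false;
	obj->resv = &vm->resv;
}

/* An external object keeps a reservation object of its own. */
static void toy_init_external(struct toy_gem_object *obj)
{
	obj->private_resv.locked = false;
	obj->resv = &obj->private_resv;
}

static void toy_resv_lock(struct toy_resv *resv)
{
	assert(!resv->locked);
	resv->locked = true;
}

static void toy_resv_unlock(struct toy_resv *resv)
{
	assert(resv->locked);
	resv->locked = false;
}

/* True if the reservation object protecting @obj is currently locked. */
static bool toy_object_is_locked(const struct toy_gem_object *obj)
{
	return obj->resv->locked;
}
```

In this model, a single toy_resv_lock(&vm.resv) call makes toy_object_is_locked() return true for every local object of that VM, while external objects still need to be locked one by one.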
The following locks and locking orders are used:
* The gpu_vm->lock (optionally an rwsem). Protects the gpu_vm’s data structure keeping track of gpu_vmas. It can also protect the gpu_vm’s list of userptr gpu_vmas. With a CPU mm analogy this would correspond to the mmap_lock. An rwsem allows several readers to walk the VM tree concurrently, but the benefit of that concurrency most likely varies from driver to driver.
* The userptr_seqlock. This lock is taken in read mode for each userptr gpu_vma on the gpu_vm’s userptr list, and in write mode during mmu notifier invalidation. This is not a real seqlock but described in mm/mmu_notifier.c as a “Collision-retry read-side/write-side ‘lock’ a lot like a seqcount. However this allows multiple write-sides to hold it at once...”. The read side critical section is enclosed by mmu_interval_read_begin() / mmu_interval_read_retry() with mmu_interval_read_begin() sleeping if the write side is held. The write side is held by the core mm while calling mmu interval invalidation notifiers.
* The gpu_vm->resv lock. Protects the gpu_vm’s list of gpu_vmas needing rebinding, as well as the residency state of all the gpu_vm’s local GEM objects. Furthermore, it typically protects the gpu_vm’s list of evicted and external GEM objects.
* The gpu_vm->userptr_notifier_lock. This is an rwsem that is taken in read mode during exec and write mode during a mmu notifier invalidation. The userptr notifier lock is per gpu_vm.
* The gem_object->gpuva_lock. This lock protects the GEM object’s list of gpu_vm_bos. This is usually the same lock as the GEM object’s dma_resv, but some drivers protect this list differently, see below.
* The gpu_vm list spinlocks. With some implementations they are needed to be able to update the gpu_vm evicted- and external object list. For those implementations, the spinlocks are grabbed when the lists are manipulated. However, to avoid locking order violations with the dma_resv locks, a special scheme is needed when iterating over the lists.
Protection and lifetime of gpu_vm_bos and gpu_vmas
The GEM object’s list of gpu_vm_bos, and the gpu_vm_bo’s list of gpu_vmas are protected by the gem_object->gpuva_lock, which is typically the same as the GEM object’s dma_resv, but if the driver needs to access these lists from within a dma_fence signalling critical section, it can instead choose to protect it with a separate lock, which can be locked from within the dma_fence signalling critical section. Such drivers then need to pay additional attention to what locks need to be taken from within the loop when iterating over the gpu_vm_bo and gpu_vma lists to avoid locking-order violations.
The DRM GPUVM set of helpers provides lockdep asserts that this lock is held in relevant situations and also provides a means of making itself aware of which lock is actually used: drm_gem_gpuva_set_lock().
Each gpu_vm_bo holds a reference counted pointer to the underlying GEM object, and each gpu_vma holds a reference counted pointer to the gpu_vm_bo. When iterating over the GEM object’s list of gpu_vm_bos and over the gpu_vm_bo’s list of gpu_vmas, the gem_object->gpuva_lock must not be dropped; otherwise, gpu_vmas attached to a gpu_vm_bo may disappear without notice since those are not reference-counted. A driver may implement its own scheme to allow this at the expense of additional complexity, but this is outside the scope of this document.
In the DRM GPUVM implementation, each gpu_vm_bo and each gpu_vma holds a reference count on the gpu_vm itself. Due to this, and to avoid circular reference counting, cleanup of the gpu_vm’s gpu_vmas must not be done from the gpu_vm’s destructor. Drivers typically implement a gpu_vm close function for this cleanup. The gpu_vm close function will abort gpu execution using this VM, unmap all gpu_vmas and release page-table memory.
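Why the cleanup cannot live in the destructor can be seen in a small user-space sketch. It is a toy model, not the DRM GPUVM code: struct toy_vm, struct toy_vma and all toy_* functions are hypothetical names, and plain integers stand in for kernel reference counts. Because every gpu_vma pins the gpu_vm, the destructor can only run once all gpu_vmas are already gone, so an explicit close step has to remove them first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_vm {
	int refcount;
	int num_vmas;
	bool *destroyed;    /* lets the caller observe that the destructor ran */
};

struct toy_vma {
	struct toy_vm *vm;  /* counted reference to the owning VM */
};

static struct toy_vm *toy_vm_create(bool *destroyed)
{
	struct toy_vm *vm = calloc(1, sizeof(*vm));

	if (!vm)
		return NULL;
	vm->refcount = 1;   /* the creator's reference */
	vm->destroyed = destroyed;
	*destroyed = false;
	return vm;
}

static void toy_vm_put(struct toy_vm *vm)
{
	if (--vm->refcount == 0) {
		/* Destructor: every vma holds a reference, so none can be left. */
		assert(vm->num_vmas == 0);
		*vm->destroyed = true;
		free(vm);
	}
}

static struct toy_vma *toy_vma_create(struct toy_vm *vm)
{
	struct toy_vma *vma = malloc(sizeof(*vma));

	if (!vma)
		return NULL;
	vm->refcount++;     /* the vma pins the VM */
	vm->num_vmas++;
	vma->vm = vm;
	return vma;
}

static void toy_vma_destroy(struct toy_vma *vma)
{
	struct toy_vm *vm = vma->vm;

	vm->num_vmas--;
	free(vma);
	toy_vm_put(vm);
}

/* Close function: tears down all vmas. The caller holds its own reference. */
static void toy_vm_close(struct toy_vm *vm, struct toy_vma **vmas, int count)
{
	(void)vm;
	for (int i = 0; i < count; i++)
		toy_vma_destroy(vmas[i]);
}
```

In this sketch, dropping the creator's reference without calling toy_vm_close() first would leave the reference count above zero forever, which is the circular reference the text warns about.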
Revalidation and eviction of local objects
Note that in all the code examples given below we use simplified pseudo-code. In particular, the dma_resv deadlock avoidance algorithm as well as reserving memory for dma_resv fences is left out.
Revalidation
With VM_BIND, all local objects need to be resident when the gpu is executing using the gpu_vm, and the objects need to have valid gpu_vmas set up pointing to them. Typically, each gpu command buffer submission is therefore preceded with a re-validation section:
    dma_resv_lock(gpu_vm->resv);

    // Validation section starts here.
    for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
        validate_gem_bo(&gpu_vm_bo->gem_bo);

        // The following list iteration needs the Gem object's
        // dma_resv to be held (it protects the gpu_vm_bo's list of
        // gpu_vmas), but since local gem objects share the gpu_vm's
        // dma_resv, it is already held at this point.
        for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
            move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
    }

    for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
        rebind_gpu_vma(&gpu_vma);
        remove_gpu_vma_from_rebind_list(&gpu_vma);
    }
    // Validation section ends here, and job submission starts.

    add_dependencies(&gpu_job, &gpu_vm->resv);
    job_dma_fence = gpu_submit(&gpu_job);

    add_dma_fence(job_dma_fence, &gpu_vm->resv);
    dma_resv_unlock(gpu_vm->resv);
The reason for having a separate gpu_vm rebind list is that there might be userptr gpu_vmas that are not mapping a buffer object that also need rebinding.
Eviction
Eviction of one of these local objects will then look similar to the following:
    obj = get_object_from_lru();

    dma_resv_lock(obj->resv);
    // Put every gpu_vm_bo of the object on its gpu_vm's evict list.
    for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
        add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);

    add_dependencies(&eviction_job, &obj->resv);
    job_dma_fence = gpu_submit(&eviction_job);
    add_dma_fence(job_dma_fence, &obj->resv);

    dma_resv_unlock(obj->resv);
    put_object(obj);
Note that since the object is local to the gpu_vm, it will share the gpu_vm’s dma_resv lock such that obj->resv == gpu_vm->resv. The gpu_vm_bos marked for eviction are put on the gpu_vm’s evict list, which is protected by gpu_vm->resv. During eviction all local objects have their dma_resv locked and, due to the above equality, also the gpu_vm’s dma_resv protecting the gpu_vm’s evict list is locked.
With VM_BIND, gpu_vmas don’t need to be unbound before eviction, since the driver must ensure that the eviction blit or copy will wait for GPU idle or depend on all previous GPU activity. Furthermore, any subsequent attempt by the GPU to access freed memory through the gpu_vma will be preceded by a new exec function, with a revalidation section which will make sure all gpu_vmas are rebound. The eviction code holding the object’s dma_resv while revalidating will ensure a new exec function may not race with the eviction.
A driver can be implemented in such a way that, on each exec function, only a subset of vmas are selected for rebind. In this case, all vmas that are not selected for rebind must be unbound before the exec function workload is submitted.
Locking with external buffer objects
Since external buffer objects may be shared by multiple gpu_vms, they can’t share their reservation object with a single gpu_vm. Instead they need to have a reservation object of their own. The external objects bound to a gpu_vm using one or many gpu_vmas are therefore put on a per-gpu_vm list which is protected by the gpu_vm’s dma_resv lock or one of the gpu_vm list spinlocks. Once the gpu_vm’s reservation object is locked, it is safe to traverse the external object list and lock the dma_resvs of all external objects. However, if instead a list spinlock is used, a more elaborate iteration scheme needs to be used.
At eviction time, the gpu_vm_bos of all the gpu_vms an external object is bound to need to be put on their gpu_vm’s evict list. However, when evicting an external object, the dma_resvs of the gpu_vms the object is bound to are typically not held. Only the object’s private dma_resv can be guaranteed to be held. If there is a ww_acquire context at hand at eviction time we could grab those dma_resvs but that could cause expensive ww_mutex rollbacks. A simple option is to just mark the gpu_vm_bos of the evicted gem object with an evicted bool that is inspected before the next time the corresponding gpu_vm evicted list needs to be traversed, for example when traversing the list of external objects and locking them. At that time, both the gpu_vm’s dma_resv and the object’s dma_resv are held, and the gpu_vm_bo marked evicted can then be added to the gpu_vm’s list of evicted gpu_vm_bos. The evicted bool is formally protected by the object’s dma_resv.
The exec function becomes:
    dma_resv_lock(gpu_vm->resv);

    // External object list is protected by the gpu_vm->resv lock.
    for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) {
        dma_resv_lock(gpu_vm_bo.gem_obj->resv);
        if (gpu_vm_bo_marked_evicted(&gpu_vm_bo))
            add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
    }

    for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
        validate_gem_bo(&gpu_vm_bo->gem_bo);

        for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
            move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
    }

    for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
        rebind_gpu_vma(&gpu_vma);
        remove_gpu_vma_from_rebind_list(&gpu_vma);
    }

    add_dependencies(&gpu_job, &gpu_vm->resv);
    job_dma_fence = gpu_submit(&gpu_job);

    add_dma_fence(job_dma_fence, &gpu_vm->resv);
    for_each_external_obj(gpu_vm, &obj)
        add_dma_fence(job_dma_fence, &obj->resv);
    dma_resv_unlock_all_resv_locks();
And the corresponding shared-object aware eviction would look like:
    obj = get_object_from_lru();

    dma_resv_lock(obj->resv);
    for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo) {
        // Local objects share the gpu_vm's resv, so the evict list
        // can be updated directly. External objects are only marked.
        if (object_is_vm_local(obj))
            add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
        else
            mark_gpu_vm_bo_evicted(&gpu_vm_bo);
    }

    add_dependencies(&eviction_job, &obj->resv);
    job_dma_fence = gpu_submit(&eviction_job);
    add_dma_fence(job_dma_fence, &obj->resv);

    dma_resv_unlock(obj->resv);
    put_object(obj);
Accessing the gpu_vm’s lists without the dma_resv lock held
Some drivers will hold the gpu_vm’s dma_resv lock when accessing the gpu_vm’s evict list and external objects lists. However, there are drivers that need to access these lists without the dma_resv lock held, for example due to asynchronous state updates from within the dma_fence signalling critical path. In such cases, a spinlock can be used to protect manipulation of the lists. However, since higher level sleeping locks need to be taken for each list item while iterating over the lists, the items already iterated over need to be temporarily moved to a private list and the spinlock released while processing each item.
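The iteration scheme described above can be sketched in plain user-space C as follows. This is a minimal illustration under stated assumptions, not the DRM GPUVM implementation: the toy_* list helpers, struct toy_spinlock (a flag with assertions standing in for a real spinlock), toy_process() and the plain integer refcount are all hypothetical names introduced for the example. The names still_in_list and list_lock mirror the ones used in the surrounding text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the spirit of struct list_head. */
struct toy_list { struct toy_list *prev, *next; };

static void toy_list_init(struct toy_list *h) { h->prev = h->next = h; }
static bool toy_list_empty(const struct toy_list *h) { return h->next == h; }

static void toy_list_del(struct toy_list *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	toy_list_init(e);
}

static void toy_list_add_tail(struct toy_list *e, struct toy_list *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

/* Move all entries of @from to the tail of @to, preserving their order. */
static void toy_list_splice_tail(struct toy_list *from, struct toy_list *to)
{
	while (!toy_list_empty(from)) {
		struct toy_list *e = from->next;

		toy_list_del(e);
		toy_list_add_tail(e, to);
	}
}

/* Toy spinlock: a flag whose misuse trips an assertion. */
struct toy_spinlock { bool held; };
static void toy_spin_lock(struct toy_spinlock *l) { assert(!l->held); l->held = true; }
static void toy_spin_unlock(struct toy_spinlock *l) { assert(l->held); l->held = false; }

struct toy_item {
	struct toy_list head;  /* must stay first: list pointers are cast to items */
	int refcount;
	int processed;
};

struct toy_vm {
	struct toy_spinlock list_lock;
	struct toy_list list;
};

/* Stand-in for work that takes sleeping locks: list_lock must be dropped. */
static void toy_process(struct toy_vm *vm, struct toy_item *item)
{
	assert(!vm->list_lock.held);
	item->processed++;
}

static void toy_iterate(struct toy_vm *vm)
{
	struct toy_list still_in_list;

	toy_list_init(&still_in_list);
	toy_spin_lock(&vm->list_lock);
	while (!toy_list_empty(&vm->list)) {
		struct toy_item *item = (struct toy_item *)vm->list.next;

		/* Park the item on the private list and pin it. */
		toy_list_del(&item->head);
		toy_list_add_tail(&item->head, &still_in_list);
		item->refcount++;
		toy_spin_unlock(&vm->list_lock);

		toy_process(vm, item);

		toy_spin_lock(&vm->list_lock);
		item->refcount--;
	}
	/* Put the processed items back on the gpu_vm's list. */
	toy_list_splice_tail(&still_in_list, &vm->list);
	toy_spin_unlock(&vm->list_lock);
}
```

The sketch is single-threaded, so it only demonstrates where the lock is held and dropped. A real driver would additionally have to cope with items whose reference count has already dropped to zero and with concurrent removal from either list.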
Due to the additional locking and atomic operations, drivers that can avoid accessing the gpu_vm’s list outside of the dma_resv lock might want to avoid this iteration scheme as well, particularly if the driver anticipates a large number of list items. For lists where the anticipated number of list items is small, where list iteration doesn’t happen very often or if there is a significant additional cost associated with each iteration, the atomic operation overhead associated with this type of iteration is, most likely, negligible. Note that if this scheme is used, it is necessary to make sure this list iteration is protected by an outer level lock or semaphore, since list items are temporarily pulled off the list while iterating. It is also worth mentioning that the local list still_in_list should also be considered protected by the gpu_vm->list_lock, and it is thus possible that items can be removed also from the local list concurrently with list iteration.
Please refer to the DRM GPUVM locking section and its internal get_next_vm_bo_from_list() function.
userptr gpu_vmas
A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a GPU virtual address range, directly maps a CPU mm range of anonymous or file page-cache pages.
A very simple approach would be to just pin the pages using pin_user_pages() at bind time and unpin them at unbind time, but this creates a Denial-Of-Service vector since a single user-space process would be able to pin down all of system memory, which is not desirable. (For special use-cases and assuming proper accounting, pinning might still be a desirable feature, though.) What we need to do in the general case is to obtain a reference to the desired pages, make sure we are notified using a MMU notifier just before the CPU mm unmaps the pages, dirty them if they are not mapped read-only to the GPU, and then drop the reference.
When we are notified by the MMU notifier that CPU mm is about to drop the pages, we need to stop GPU access to the pages by waiting for VM idle in the MMU notifier and make sure that before the next time the GPU tries to access whatever is now present in the CPU mm range, we unmap the old pages from the GPU page tables and repeat the process of obtaining new page references. (See the notifier example below.) Note that when the core mm decides to launder pages, we get such an unmap MMU notification and can mark the pages dirty again before the next GPU access. We also get similar MMU notifications for NUMA accounting which the GPU driver doesn’t really need to care about, but so far it has proven difficult to exclude certain notifications.
Using a MMU notifier for device DMA (and other methods) is described in the pin_user_pages() documentation.
Now, the method of obtaining struct page references using get_user_pages() unfortunately can’t be used under a dma_resv lock since that would violate the locking order of the dma_resv lock vs the mmap_lock that is grabbed when resolving a CPU pagefault. This means the gpu_vm’s list of userptr gpu_vmas needs to be protected by an outer lock, which in our example below is the gpu_vm->lock.
The MMU interval seqlock for a userptr gpu_vma is used in the following way:
    // Exclusive locking mode here is strictly needed only if there are
    // invalidated userptr gpu_vmas present, to avoid concurrent userptr
    // revalidations of the same userptr gpu_vma.
    down_write(&gpu_vm->lock);
    retry:

    // Note: mmu_interval_read_begin() blocks until there is no
    // invalidation notifier running anymore.
    seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
    if (seq != gpu_vma->saved_seq) {
        obtain_new_page_pointers(&gpu_vma);
        dma_resv_lock(&gpu_vm->resv);
        add_gpu_vma_to_revalidate_list(&gpu_vma, &gpu_vm);
        dma_resv_unlock(&gpu_vm->resv);
        gpu_vma->saved_seq = seq;
    }

    // The usual revalidation goes here.

    // Final userptr sequence validation may not happen before the
    // submission dma_fence is added to the gpu_vm's resv, from the POV
    // of the MMU invalidation notifier. Hence the
    // userptr_notifier_lock that will make them appear atomic.

    add_dependencies(&gpu_job, &gpu_vm->resv);
    down_read(&gpu_vm->userptr_notifier_lock);
    if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
        up_read(&gpu_vm->userptr_notifier_lock);
        goto retry;
    }

    job_dma_fence = gpu_submit(&gpu_job);

    add_dma_fence(job_dma_fence, &gpu_vm->resv);

    for_each_external_obj(gpu_vm, &obj)
        add_dma_fence(job_dma_fence, &obj->resv);

    dma_resv_unlock_all_resv_locks();
    up_read(&gpu_vm->userptr_notifier_lock);
    up_write(&gpu_vm->lock);
The code between mmu_interval_read_begin() and the mmu_interval_read_retry() marks the read side critical section of what we call the userptr_seqlock. In reality, the gpu_vm’s userptr gpu_vma list is looped through, and the check is done for all of its userptr gpu_vmas, although we only show a single one here.
The userptr gpu_vma MMU invalidation notifier might be called from reclaim context and, again, to avoid locking order violations, we can’t take any dma_resv lock nor the gpu_vm->lock from within it.
    bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
    {
        // Make sure the exec function either sees the new sequence
        // and backs off or we wait for the dma-fence:

        down_write(&gpu_vm->userptr_notifier_lock);
        mmu_interval_set_seq(userptr_interval, cur_seq);
        up_write(&gpu_vm->userptr_notifier_lock);

        // At this point, the exec function can't succeed in
        // submitting a new job, because cur_seq is an invalid
        // sequence number and will always cause a retry. When all
        // invalidation callbacks have run, the mmu notifier core will
        // flip the sequence number to a valid one. However we need to
        // stop gpu access to the old pages here.

        dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
                              false, MAX_SCHEDULE_TIMEOUT);
        return true;
    }
When this invalidation notifier returns, the GPU can no longer be accessing the old pages of the userptr gpu_vma and needs to redo the page-binding before a new GPU submission can succeed.
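The interplay between the saved sequence number, revalidation and the final retry check can be modelled in a few lines of user-space C. This is a deliberately simplified sketch: struct toy_interval, struct toy_userptr_vma and the toy_* functions are hypothetical, the sequence number is a bare counter, nothing ever sleeps, and the userptr_notifier_lock is omitted because the model is single-threaded. It should not be read as a description of how the mmu interval notifier is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the mmu interval notifier: only a sequence number. */
struct toy_interval {
	unsigned long seq;
};

struct toy_userptr_vma {
	struct toy_interval interval;
	unsigned long saved_seq;   /* sequence the current pages belong to */
	int pages_generation;      /* bumped each time pages are re-obtained */
};

/* Read-side begin: sample the current sequence number. */
static unsigned long toy_read_begin(const struct toy_interval *it)
{
	return it->seq;
}

/* Read-side end: true if an invalidation happened since @seq was sampled. */
static bool toy_read_retry(const struct toy_interval *it, unsigned long seq)
{
	return it->seq != seq;
}

/* What an invalidation does to the sequence number in this toy model. */
static void toy_invalidate(struct toy_interval *it)
{
	it->seq++;
}

/* Revalidation: re-obtain pages only if the saved sequence is stale. */
static void toy_revalidate(struct toy_userptr_vma *vma)
{
	unsigned long seq = toy_read_begin(&vma->interval);

	if (seq != vma->saved_seq) {
		vma->pages_generation++;   /* stands in for obtaining new pages */
		vma->saved_seq = seq;
	}
}

/*
 * Final check before submission. In the flow described in the text this
 * runs with the userptr_notifier_lock held in read mode.
 */
static bool toy_may_submit(const struct toy_userptr_vma *vma)
{
	return !toy_read_retry(&vma->interval, vma->saved_seq);
}
```

If toy_invalidate() runs between toy_revalidate() and toy_may_submit(), the final check fails and the exec function has to go back and revalidate, which corresponds to the goto retry path in the pseudo-code.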
Efficient userptr gpu_vma exec_function iteration
If the gpu_vm’s list of userptr gpu_vmas becomes large, it’s inefficient to iterate through the complete lists of userptrs on each exec function to check whether each userptr gpu_vma’s saved sequence number is stale. A solution to this is to put all invalidated userptr gpu_vmas on a separate gpu_vm list and only check the gpu_vmas present on this list on each exec function. This list will then lend itself very well to the spinlock locking scheme that is described in the spinlock iteration section, since in the mmu notifier, where we add the invalidated gpu_vmas to the list, it’s not possible to take any outer locks like the gpu_vm->lock or the gpu_vm->resv lock. Note that the gpu_vm->lock still needs to be taken while iterating to ensure the list is complete, as also mentioned in that section.
If using an invalidated userptr list like this, the retry check in the exec function trivially becomes a check for invalidated list empty.
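A rough user-space model of such an invalidated list is sketched below. Everything in it is an assumption made for illustration: struct toy_vm, struct toy_userptr_vma, the toy_* functions, the singly linked list and the flag used in place of a spinlock are hypothetical. For brevity the exec side simply pops items off the list instead of parking them on a private list as the spinlock iteration scheme does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_userptr_vma {
	struct toy_userptr_vma *next_invalidated;
	bool invalidated;   /* true while the vma sits on the invalidated list */
	int rebinds;        /* counts how often the vma was revalidated */
};

struct toy_vm {
	bool list_lock_held;                  /* toy stand-in for a spinlock */
	struct toy_userptr_vma *invalidated;  /* head of the invalidated list */
};

static void toy_list_lock(struct toy_vm *vm)
{
	assert(!vm->list_lock_held);
	vm->list_lock_held = true;
}

static void toy_list_unlock(struct toy_vm *vm)
{
	assert(vm->list_lock_held);
	vm->list_lock_held = false;
}

/* Notifier side: only the list lock is taken, no outer locks. */
static void toy_notifier_invalidate(struct toy_vm *vm,
				    struct toy_userptr_vma *vma)
{
	toy_list_lock(vm);
	if (!vma->invalidated) {
		vma->invalidated = true;
		vma->next_invalidated = vm->invalidated;
		vm->invalidated = vma;
	}
	toy_list_unlock(vm);
}

/* Exec side: revalidate only the vmas found on the invalidated list. */
static void toy_revalidate_invalidated(struct toy_vm *vm)
{
	for (;;) {
		struct toy_userptr_vma *vma;

		toy_list_lock(vm);
		vma = vm->invalidated;
		if (vma) {
			vm->invalidated = vma->next_invalidated;
			vma->next_invalidated = NULL;
			vma->invalidated = false;
		}
		toy_list_unlock(vm);

		if (!vma)
			break;
		vma->rebinds++;   /* stands in for obtaining pages and rebinding */
	}
}

/* The retry check reduces to "is the invalidated list empty?". */
static bool toy_may_submit(struct toy_vm *vm)
{
	bool empty;

	toy_list_lock(vm);
	empty = vm->invalidated == NULL;
	toy_list_unlock(vm);
	return empty;
}
```

With this layout the cost of the exec-time check depends on the number of invalidated userptr gpu_vmas rather than on the total number of userptr gpu_vmas in the gpu_vm.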
Locking at bind and unbind time
At bind time, assuming a GEM object backed gpu_vma, each gpu_vma needs to be associated with a gpu_vm_bo and that gpu_vm_bo in turn needs to be added to the GEM object’s gpu_vm_bo list, and possibly to the gpu_vm’s external object list. This is referred to as linking the gpu_vma, and typically requires that the gpu_vm->lock and the gem_object->gpuva_lock are held. When unlinking a gpu_vma the same locks should be held, and that ensures that when iterating over gpu_vmas, either under the gpu_vm->resv or the GEM object’s dma_resv, the gpu_vmas stay alive as long as the lock under which we iterate is not released. For userptr gpu_vmas it’s similarly required that during vma destroy, the outer gpu_vm->lock is held, since otherwise when iterating over the invalidated userptr list as described in the previous section, there is nothing keeping those userptr gpu_vmas alive.
Locking for recoverable page-fault page-table updates
There are two important things we need to ensure with locking for recoverable page-faults:
* At the time we return pages back to the system / allocator for reuse, there should be no remaining GPU mappings and any GPU TLB must have been flushed.
* The unmapping and mapping of a gpu_vma must not race.
Since the unmapping (or zapping) of GPU ptes is typically taking place where it is hard or even impossible to take any outer level locks we must either introduce a new lock that is held at both mapping and unmapping time, or look at the locks we do hold at unmapping time and make sure that they are held also at mapping time. For userptr gpu_vmas, the userptr_seqlock is held in write mode in the mmu invalidation notifier where zapping happens. Hence, if the userptr_seqlock as well as the gpu_vm->userptr_notifier_lock is held in read mode during mapping, it will not race with the zapping. For GEM object backed gpu_vmas, zapping will take place under the GEM object’s dma_resv and ensuring that the dma_resv is held also when populating the page-tables for any gpu_vma pointing to the GEM object will similarly ensure we are race-free.
If any part of the mapping is performed asynchronously under a dma-fence with these locks released, the zapping will need to wait for that dma-fence to signal under the relevant lock before starting to modify the page-table.
Since modifying the page-table structure in a way that frees up page-table memory might also require outer level locks, the zapping of GPU ptes typically focuses only on zeroing page-table or page-directory entries and flushing TLB, whereas freeing of page-table memory is deferred to unbind or rebind time.