pin_user_pages() and related calls
Overview
This document describes the following functions:
pin_user_pages()
pin_user_pages_fast()
pin_user_pages_remote()
Basic description of FOLL_PIN
FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*() (“gup”) family of functions. FOLL_PIN has significant interactions and interdependencies with FOLL_LONGTERM, so both are covered here.
FOLL_PIN is internal to gup, meaning that it should not appear at the gup call sites. This allows the associated wrapper functions (pin_user_pages*() and others) to set the correct combination of these flags, and to check for problems as well.
FOLL_LONGTERM, on the other hand, is allowed to be set at the gup call sites. This is in order to avoid creating a large number of wrapper functions to cover all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so that’s a natural dividing line, and a good point to make separate wrapper calls. In other words, use pin_user_pages*() for DMA-pinned pages, and get_user_pages*() for other cases. There are five cases described later on in this document, to further clarify that concept.
FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However, multiple threads and call sites are free to pin the same struct pages, via both FOLL_PIN and FOLL_GET. It’s just the call site that needs to choose one or the other, not the struct page(s).
The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN uses a different reference counting technique.
FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
Which flags are set by each wrapper
For these pin_user_pages*() functions, FOLL_PIN is OR’d in with whatever gup flags the caller provides. The caller is required to pass in a non-null struct pages* array, and the function then pins pages by incrementing the refcount of each page by a special value: GUP_PIN_COUNTING_BIAS.
For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead, the extra space available in the struct folio is used to store the pincount directly.
This approach for large folios avoids the counting upper limit problems that are discussed below. Those limitations would have been aggravated severely by huge pages, because each tail page adds a refcount to the head page. And in fact, testing revealed that, without a separate pincount field, refcount overflows were seen in some huge page stress tests.
This also means that huge pages and large folios do not suffer from the false positives problem that is mentioned below.
Function
--------
pin_user_pages          FOLL_PIN is always set internally by this function.
pin_user_pages_fast     FOLL_PIN is always set internally by this function.
pin_user_pages_remote   FOLL_PIN is always set internally by this function.
For these get_user_pages*() functions, FOLL_GET might not even be specified. Behavior is a little more complex than above. If FOLL_GET was not specified, but the caller passed in a non-null struct pages* array, then the function sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount of each page by +1.
Function
--------
get_user_pages          FOLL_GET is sometimes set internally by this function.
get_user_pages_fast     FOLL_GET is sometimes set internally by this function.
get_user_pages_remote   FOLL_GET is sometimes set internally by this function.
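The flag rules above can be restated as a small userspace model. This is only a sketch of the rules as this document words them, not the kernel's implementation: the TOY_* flag values, struct toy_page, and both function names are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; the kernel's real values differ. */
#define TOY_FOLL_GET      0x1u
#define TOY_FOLL_PIN      0x2u
#define TOY_FOLL_LONGTERM 0x4u

struct toy_page; /* opaque stand-in for struct page */

/*
 * Models a pin_user_pages*() wrapper. Returns true and stores the
 * effective flags in *out, or returns false if the arguments break
 * one of the rules described in the text.
 */
bool toy_pin_flags(unsigned int caller_flags, struct toy_page **pages,
                   unsigned int *out)
{
    assert(out != NULL);
    /* FOLL_PIN is internal, and FOLL_GET is mutually exclusive with it. */
    if (caller_flags & (TOY_FOLL_PIN | TOY_FOLL_GET))
        return false;
    /* The pin wrappers require a non-null pages array. */
    if (pages == NULL)
        return false;
    *out = caller_flags | TOY_FOLL_PIN;
    return true;
}

/* Models a get_user_pages*() wrapper under the same rules. */
bool toy_get_flags(unsigned int caller_flags, struct toy_page **pages,
                   unsigned int *out)
{
    assert(out != NULL);
    /* FOLL_PIN must not appear at call sites; FOLL_LONGTERM needs FOLL_PIN. */
    if (caller_flags & (TOY_FOLL_PIN | TOY_FOLL_LONGTERM))
        return false;
    /* FOLL_GET is set on the caller's behalf when a pages array is given. */
    if (pages != NULL)
        caller_flags |= TOY_FOLL_GET;
    *out = caller_flags;
    return true;
}
```

In this model, a caller that wants a long-term pin passes only TOY_FOLL_LONGTERM to toy_pin_flags() and gets back TOY_FOLL_PIN | TOY_FOLL_LONGTERM, which mirrors the "FOLL_PIN is OR’d in" behavior described above.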
Tracking dma-pinned pages
Some of the key design constraints, and solutions, for tracking dma-pinned pages:
- An actual reference count, per struct page, is required. This is because multiple processes may pin and unpin a page.
- False positives (reporting that a page is dma-pinned, when in fact it is not) are acceptable, but false negatives are not.
- struct page may not be increased in size for this, and all fields are already used.
- Given the above, we can overload the page->_refcount field by using, sort of, the upper bits in that field for a dma-pinned count. “Sort of” means that, rather than dividing page->_refcount into bit fields, we simply add a medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to page->_refcount. This provides fuzzy behavior: if a page has get_page() called on it 1024 times, then it will appear to have a single dma-pinned count. And again, that’s acceptable.
- This also leads to limitations: there are only 31-10==21 bits available for a counter that increments 10 bits at a time.
- Because of that limitation, special handling is applied to the zero pages when using FOLL_PIN. We only pretend to pin a zero page - we don’t alter its refcount or pincount at all (it is permanent, so there’s no need). The unpinning functions also don’t do anything to a zero page. This is transparent to the caller.
- Callers must specifically request “dma-pinned tracking of pages”. In other words, just calling get_user_pages() will not suffice; a new set of functions, pin_user_pages() and related, must be used.
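The counting scheme in the list above can be tried out in a small userspace model. This is a sketch under stated assumptions, not kernel code: struct toy_page, struct toy_folio, and every toy_* function are hypothetical, and the large-folio pin is assumed to take one ordinary reference in addition to bumping the separate pincount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GUP_PIN_COUNTING_BIAS 1024 /* value quoted in the text above */

/* Toy stand-ins, not the kernel's struct page or struct folio. */
struct toy_page  { int32_t refcount; };
struct toy_folio { int32_t refcount; int32_t pincount; };

/* FOLL_GET style: an ordinary reference. */
void toy_get_page(struct toy_page *p) { p->refcount += 1; }

/* FOLL_PIN style on a small page: add the bias to the same counter. */
void toy_pin_page(struct toy_page *p) { p->refcount += GUP_PIN_COUNTING_BIAS; }

void toy_unpin_page(struct toy_page *p)
{
    assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
    p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* "Maybe": 1024 plain references look exactly like one pin. */
bool toy_page_maybe_dma_pinned(const struct toy_page *p)
{
    return p->refcount >= GUP_PIN_COUNTING_BIAS;
}

/* Large folio: pins are counted in their own field. */
void toy_get_folio(struct toy_folio *f) { f->refcount += 1; }

void toy_pin_folio(struct toy_folio *f)
{
    f->refcount += 1; /* assumption: a pin also holds a plain reference */
    f->pincount += 1;
}

/* Exact answer: plain references never touch pincount. */
bool toy_folio_maybe_dma_pinned(const struct toy_folio *f)
{
    return f->pincount > 0;
}

/* Upper bound on simultaneous pins of one small page in this model. */
int32_t toy_max_pins(void) { return INT32_MAX / GUP_PIN_COUNTING_BIAS; }
```

Calling toy_get_page() 1024 times on a fresh toy_page makes toy_page_maybe_dma_pinned() return true, which is the acceptable false positive described above. The same 1024 calls to toy_get_folio() leave toy_folio_maybe_dma_pinned() false, and toy_max_pins() comes out just under 2^21, matching the 21 available bits.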
FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing these categories:
CASE 1: Direct IO (DIO)
There are GUP references to pages that are serving as DIO buffers. These buffers are needed for a relatively short time (so they are not “long term”). No special synchronization with folio_mkclean() or munmap() is provided. Therefore, flags to set at the call site are:
FOLL_PIN
...but rather than setting FOLL_PIN directly, call sites should use one of the pin_user_pages*() routines that set FOLL_PIN.
CASE 2: RDMA
There are GUP references to pages that are serving as DMA buffers. These buffers are needed for a long time (“long term”). No special synchronization with folio_mkclean() or munmap() is provided. Therefore, flags to set at the call site are:
FOLL_PIN | FOLL_LONGTERM
NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That’s because DAX pages do not have a separate page cache, and so “pinning” implies locking down file system blocks, which is not (yet) supported in that way.
CASE 3: MMU notifier registration, with or without page faulting hardware
Device drivers can pin pages via get_user_pages*(), and register for mmu notifier callbacks for the memory range. Then, upon receiving a notifier “invalidate range” callback, stop the device from using the range, and unpin the pages. There may be other possible schemes, such as for example explicitly synchronizing against pending IO, that accomplish approximately the same thing.
Or, if the hardware supports replayable page faults, then the device driver can avoid pinning entirely (this is ideal), as follows: register for mmu notifier callbacks as above, but instead of stopping the device and unpinning in the callback, simply remove the range from the device’s page tables.
Either way, as long as the driver unpins the pages upon mmu notifier callback, then there is proper synchronization with both filesystem and mm (folio_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.
CASE 4: Pinning for struct page manipulation only
If only struct page data (as opposed to the actual memory contents that a page is tracking) is affected, then normal GUP calls are sufficient, and neither flag needs to be set.
CASE 5: Pinning in order to write to the data within the page
Even though neither DMA nor Direct IO is involved, just a simple case of “pin, write to a page’s data, unpin” can cause a problem. Case 5 may be considered a superset of Case 1, plus Case 2, plus anything that invokes that pattern. In other words, if the code is neither Case 1 nor Case 2, it may still require FOLL_PIN, for patterns like this:
- Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()
- INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()
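The five cases can be summarized as a lookup from use case to API family and flag. The following is a minimal sketch of that summary only; the enum, the struct, and toy_choose() are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the five cases described in this document. */
enum toy_gup_case {
    TOY_CASE_DIO = 1,          /* Case 1: Direct IO */
    TOY_CASE_RDMA,             /* Case 2: long-term DMA buffers */
    TOY_CASE_MMU_NOTIFIER,     /* Case 3: mmu notifier registration */
    TOY_CASE_STRUCT_PAGE_ONLY, /* Case 4: struct page manipulation only */
    TOY_CASE_WRITE_DATA        /* Case 5: pin, write to data, unpin */
};

struct toy_gup_choice {
    bool use_pin_api;   /* true: pin_user_pages*(); false: get_user_pages*() */
    bool foll_longterm; /* true: the call site also passes FOLL_LONGTERM */
};

struct toy_gup_choice toy_choose(enum toy_gup_case c)
{
    struct toy_gup_choice choice = { false, false };

    switch (c) {
    case TOY_CASE_DIO:
    case TOY_CASE_WRITE_DATA:
        choice.use_pin_api = true;   /* FOLL_PIN via the wrapper */
        break;
    case TOY_CASE_RDMA:
        choice.use_pin_api = true;   /* FOLL_PIN | FOLL_LONGTERM */
        choice.foll_longterm = true;
        break;
    case TOY_CASE_MMU_NOTIFIER:
    case TOY_CASE_STRUCT_PAGE_ONLY:
        break;                       /* neither flag is needed */
    default:
        assert(0 && "unknown case");
    }
    return choice;
}
```

Note that the table never asks a call site to set FOLL_PIN itself: choosing the pin_user_pages*() family is what sets it.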
folio_maybe_dma_pinned(): the whole point of pinning
The whole point of marking folios as “DMA-pinned” or “gup-pinned” is to be able to query, “is this folio DMA-pinned?” That allows code such as folio_mkclean() (and file system writeback code in general) to make informed decisions about what to do when a folio cannot be unmapped due to such pins.
What to do in those cases is the subject of a years-long series of discussions and debates (see the References at the end of this document). It’s a TODO item here: fill in the details once that’s worked out. Meanwhile, it’s safe to say that having this available:
static inline bool folio_maybe_dma_pinned(struct folio *folio)
...is a prerequisite to solving the long-running gup+DMA problem.
Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
Another way of thinking about these flags is as a progression of restrictions: FOLL_GET is for struct page manipulation, without affecting the data that the struct page refers to. FOLL_PIN is a replacement for FOLL_GET, and is for short term pins on pages whose data will get accessed. As such, FOLL_PIN is a “more severe” form of pinning. And finally, FOLL_LONGTERM is an even more restrictive case that has FOLL_PIN as a prerequisite: this is for pages that will be pinned longterm, and whose data will be accessed.
Unit testing
This file:
tools/testing/selftests/mm/gup_test.c
has the following new calls to exercise the new pin*() wrapper functions:
PIN_FAST_BENCHMARK (./gup_test -a)
PIN_BASIC_TEST (./gup_test -b)
You can monitor how many total dma-pinned pages have been acquired and released since the system was booted, via two new /proc/vmstat entries:
/proc/vmstat/nr_foll_pin_acquired
/proc/vmstat/nr_foll_pin_released
Under normal conditions, these two values will be equal unless there are any long-term [R]DMA pins in place, or during pin/unpin transitions.
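For monitoring from userspace, note that /proc/vmstat is a single text file with one "name value" pair per line, so the two counters are read by name rather than as separate paths. The following is a minimal sketch of such a reader; vmstat_lookup() and foll_pin_outstanding() are hypothetical helper names, and the 64 KiB buffer size is an assumption that comfortably exceeds typical /proc/vmstat sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" line in vmstat-style text.
 * Returns true and stores the value in *out if the counter is found.
 */
bool vmstat_lookup(const char *text, const char *name,
                   unsigned long long *out)
{
    const char *line = text;
    size_t name_len;

    assert(text != NULL && name != NULL && out != NULL);
    name_len = strlen(name);

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');

        /* Match the whole name, then require the separating space. */
        if (strncmp(line, name, name_len) == 0 &&
            line[name_len] == ' ' &&
            sscanf(line + name_len, "%llu", out) == 1)
            return true;
        if (eol == NULL)
            break;
        line = eol + 1;
    }
    return false;
}

/* Read /proc/vmstat and compute acquired minus released. */
bool foll_pin_outstanding(long long *outstanding)
{
    static char buf[65536]; /* assumed large enough for /proc/vmstat */
    unsigned long long acquired, released;
    size_t n;
    FILE *fp = fopen("/proc/vmstat", "r");

    if (fp == NULL)
        return false;
    n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    if (!vmstat_lookup(buf, "nr_foll_pin_acquired", &acquired) ||
        !vmstat_lookup(buf, "nr_foll_pin_released", &released))
        return false;
    *outstanding = (long long)(acquired - released);
    return true;
}
```

A nonzero result from foll_pin_outstanding() is not by itself an error: as stated above, long-term pins and in-flight pin/unpin transitions both leave the counters temporarily unequal.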
- nr_foll_pin_acquired: This is the number of logical pins that have been acquired since the system was powered on. For huge pages, the head page is pinned once for each page (head page and each tail page) within the huge page. This follows the same sort of behavior that get_user_pages() uses for huge pages: the head page is refcounted once for each tail or head page in the huge page, when get_user_pages() is applied to a huge page.
- nr_foll_pin_released: The number of logical pins that have been released since the system was powered on. Note that pages are released (unpinned) on a PAGE_SIZE granularity, even if the original pin was applied to a huge page. Because of the pin count behavior described above in “nr_foll_pin_acquired”, the accounting balances out, so that after doing this:
pin_user_pages(huge_page);
for (each page in huge_page)
    unpin_user_page(page);
...the following is expected:
nr_foll_pin_released == nr_foll_pin_acquired
(...unless it was already out of balance due to a long-term RDMA pin being in place.)
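The balancing argument can be checked with a toy model of the two counters. This is a sketch, not kernel code: struct toy_vmstat and both functions are hypothetical, and the page count passed in is whatever the huge page size implies on a given system (for example 512, assuming 2 MiB huge pages with 4 KiB base pages).

```c
#include <assert.h>

/* Toy stand-in for the two /proc/vmstat counters. */
struct toy_vmstat {
    unsigned long acquired; /* models nr_foll_pin_acquired */
    unsigned long released; /* models nr_foll_pin_released */
};

/* Pinning a huge page counts one logical pin per page, head and tails. */
void toy_pin_huge_page(struct toy_vmstat *s, unsigned long nr_pages)
{
    s->acquired += nr_pages;
}

/* Unpinning always works at PAGE_SIZE granularity: one pin per call. */
void toy_unpin_user_page(struct toy_vmstat *s)
{
    assert(s->released < s->acquired);
    s->released += 1;
}
```

Pinning once with nr_pages and then calling toy_unpin_user_page() nr_pages times leaves acquired equal to released, which is the balance the text describes.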
Other diagnostics
dump_page() has been enhanced slightly to handle these new counting fields, and to better report on large folios in general. Specifically, for large folios, the exact pincount is reported.
References
John Hubbard, October, 2019