Process Addresses

Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
‘VMA’s of type struct vm_area_struct.

Each VMA describes a virtually contiguous memory range with identical
attributes, each described by a struct vm_area_struct object. Userland access
outside of VMAs is invalid except in the case where an adjacent stack VMA could
be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a struct mm_struct object which is referenced by all tasks (that is, threads)
which share the virtual address space. We refer to this as the mm.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

Note

An exception to this is the ‘gate’ VMA which is provided by architectures which
use vsyscall and is a global static object which does not belong to any
specific mm.

Locking

The kernel is designed to be highly scalable against concurrent read operations
on VMA metadata, so a complicated set of locks is required to ensure memory
corruption does not occur.

Note

Locking VMAs for their metadata does not have any impact on the memory they
describe nor the page tables that map them.

Terminology

  • mmap locks - Each MM has a read/write semaphore mmap_lock which locks at a
    process address space granularity and which can be acquired via
    mmap_read_lock(), mmap_write_lock() and variants.

  • VMA locks - The VMA lock is at VMA granularity (of course) which behaves as
    a read/write semaphore in practice. A VMA read lock is obtained via
    lock_vma_under_rcu() (and unlocked via vma_end_read()) and a write lock via
    vma_start_write() or vma_start_write_killable() (all VMA write locks are
    unlocked automatically when the mmap write lock is released). To take a VMA
    write lock you must have already acquired an mmap_write_lock().

  • rmap locks - When trying to access VMAs through the reverse mapping via a
    struct address_space or struct anon_vma object (reachable from a folio via
    folio->mapping), VMAs must be stabilised via anon_vma_[try]lock_read() or
    anon_vma_[try]lock_write() for anonymous memory and i_mmap_[try]lock_read()
    or i_mmap_[try]lock_write() for file-backed memory. We refer to these locks
    as the reverse mapping locks, or ‘rmap locks’ for brevity.

We discuss page table locks separately in the dedicated section below.

The first thing any of these locks achieve is to stabilise the VMA within the
MM tree. That is, guaranteeing that the VMA object will not be deleted from
under you nor modified (except for some specific fields described below).

Stabilising a VMA also keeps the address space described by it around.

Lock usage

If you want to read VMA metadata fields or just keep the VMA stable, you must
do one of the following:

  • Obtain an mmap read lock at the MM granularity via mmap_read_lock() (or a
    suitable variant), unlocking it with a matching mmap_read_unlock() when
    you’re done with the VMA, or

  • Try to obtain a VMA read lock via lock_vma_under_rcu(). This tries to
    acquire the lock atomically so might fail, in which case fall-back logic is
    required to instead obtain an mmap read lock if this returns NULL (see the
    sketch following this list), or

  • Acquire an rmap lock before traversing the locked interval tree (whether
    anonymous or file-backed) to obtain the required VMA.
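To illustrate the second option and its fall-back, here is a minimal sketch of
the pattern; the find_and_lock_vma() helper, its signature and its error
handling are illustrative assumptions rather than kernel code:

#include <linux/mm.h>
#include <linux/mmap_lock.h>

/*
 * Hypothetical helper: stabilise the VMA containing @addr for reading,
 * preferring the per-VMA lock and falling back to the mmap read lock.
 * On success *@vma_locked says which lock the caller must later drop -
 * vma_end_read() if true, mmap_read_unlock() if false.
 */
static struct vm_area_struct *find_and_lock_vma(struct mm_struct *mm,
                                                unsigned long addr,
                                                bool *vma_locked)
{
        struct vm_area_struct *vma;

        /* Optimistic path: per-VMA read lock, no mmap lock taken. */
        vma = lock_vma_under_rcu(mm, addr);
        if (vma) {
                *vma_locked = true;
                return vma;
        }

        /* Fall back to the coarser mmap read lock. */
        mmap_read_lock(mm);
        vma = find_vma(mm, addr);
        if (vma && vma->vm_start <= addr) {
                *vma_locked = false;
                return vma;
        }

        mmap_read_unlock(mm);
        return NULL;
}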

If you want to write VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:

  • Obtain an mmap write lock at the MM granularity via mmap_write_lock() (or a
    suitable variant), unlocking it with a matching mmap_write_unlock() when
    you’re done with the VMA, and

  • Obtain a VMA write lock via vma_start_write() for each VMA you wish to
    modify, which will be released automatically when mmap_write_unlock() is
    called.

  • If you want to be able to write to any field, you must also hide the VMA
    from the reverse mapping by obtaining an rmap write lock.

VMA locks are special in that you must obtain an mmap write lock first in order
to obtain a VMA write lock. A VMA read lock however can be obtained without any
other lock (lock_vma_under_rcu() will acquire then release an RCU lock to look
up the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.
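Putting the write-side rules together, here is a minimal sketch of the usual
pattern; the update_vma_metadata() wrapper and the placeholder modification are
illustrative assumptions, not kernel code:

#include <linux/mm.h>
#include <linux/mmap_lock.h>

/* Hypothetical example: write-lock a single VMA to update its metadata. */
static int update_vma_metadata(struct mm_struct *mm, unsigned long addr)
{
        struct vm_area_struct *vma;
        int ret = 0;

        /* The mmap write lock must be held before any VMA write lock. */
        if (mmap_write_lock_killable(mm))
                return -EINTR;

        vma = find_vma(mm, addr);
        if (!vma || vma->vm_start > addr) {
                ret = -EFAULT;
                goto out;
        }

        /* Exclude concurrent VMA readers (e.g. page fault handlers). */
        vma_start_write(vma);

        /* ... modify VMA metadata here, e.g. via vm_flags_set() ... */

out:
        /* Releasing the mmap write lock also releases all VMA write locks. */
        mmap_write_unlock(mm);
        return ret;
}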

Note

The primary users of VMA read locks are page fault handlers, which means that
without a VMA write lock, page faults will run concurrent with whatever you are
doing.

Examining all valid lock states:

mmap lock   VMA lock   rmap lock   Stable?   Read?   Write most?   Write all?
---------   --------   ---------   -------   -----   -----------   ----------
-           -          -           N         N       N             N
-           R          -           Y         Y       N             N
-           -          R/W         Y         Y       N             N
R/W         -/R        -/R/W       Y         Y       N             N
W           W          -/R         Y         Y       Y             N
W           W          W           Y         Y       Y             Y

Warning

While it’s possible to obtain a VMA lock while holding an mmap read lock,
attempting to do the reverse is invalid as it can result in deadlock - if
another task already holds an mmap write lock and attempts to acquire a VMA
write lock that will deadlock on the VMA read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

Note

Generally speaking, a read/write semaphore is a class of lock which permits
concurrent readers. However a write lock can only be obtained once all readers
have left the critical region (and pending readers made to wait).

This renders read locks on a read/write semaphore concurrent with other readers
and write locks exclusive against all others holding the semaphore.

VMA fields

We can subdivide struct vm_area_struct fields by their purpose, which makes it
easier to explore their locking characteristics:

Note

We exclude VMA lock-specific fields here to avoid confusion, as these are in
effect an internal implementation detail.

Virtual layout fields

  • vm_start - Inclusive start virtual address of range VMA describes.
    Write lock: mmap write, VMA write, rmap write.

  • vm_end - Exclusive end virtual address of range VMA describes.
    Write lock: mmap write, VMA write, rmap write.

  • vm_pgoff - Describes the page offset into the file, the original page
    offset within the virtual address space (prior to any mremap()), or PFN if
    a PFN map and the architecture does not support CONFIG_ARCH_HAS_PTE_SPECIAL.
    Write lock: mmap write, VMA write, rmap write.

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

Core fields

  • vm_mm - Containing mm_struct.
    Write lock: None - written once on initial map.

  • vm_page_prot - Architecture-specific page table protection bits determined
    from VMA flags.
    Write lock: mmap write, VMA write.

  • vm_flags - Read-only access to VMA flags describing attributes of the VMA,
    in union with private writable __vm_flags.
    Write lock: N/A.

  • __vm_flags - Private, writable access to VMA flags field, updated by
    vm_flags_*() functions.
    Write lock: mmap write, VMA write.

  • vm_file - If the VMA is file-backed, points to a struct file object
    describing the underlying file, if anonymous then NULL.
    Write lock: None - written once on initial map.

  • vm_ops - If the VMA is file-backed, then either the driver or file-system
    provides a struct vm_operations_struct object describing callbacks to be
    invoked on VMA lifetime events.
    Write lock: None - written once on initial map by f_ops->mmap().

  • vm_private_data - A void * field for driver-specific metadata.
    Write lock: Handled by driver.

These are the core fields which describe the MM the VMA belongs to and its attributes.

Config-specific fields

  • anon_name (CONFIG_ANON_VMA_NAME) - A field for storing a
    struct anon_vma_name object providing a name for anonymous mappings, or
    NULL if none is set or the VMA is file-backed. The underlying object is
    reference counted and can be shared across multiple VMAs for scalability.
    Write lock: mmap write, VMA write.

  • swap_readahead_info (CONFIG_SWAP) - Metadata used by the swap mechanism to
    perform readahead. This field is accessed atomically.
    Write lock: mmap read, swap-specific lock.

  • vm_policy (CONFIG_NUMA) - mempolicy object which describes the NUMA
    behaviour of the VMA. The underlying object is reference counted.
    Write lock: mmap write, VMA write.

  • numab_state (CONFIG_NUMA_BALANCING) - vma_numab_state object which
    describes the current state of NUMA balancing in relation to this VMA.
    Updated under mmap read lock by task_numa_work().
    Write lock: mmap read, numab-specific lock.

  • vm_userfaultfd_ctx (CONFIG_USERFAULTFD) - Userfaultfd context wrapper
    object of type vm_userfaultfd_ctx, either of zero size if userfaultfd is
    disabled, or containing a pointer to an underlying userfaultfd_ctx object
    which describes userfaultfd metadata.
    Write lock: mmap write, VMA write.

These fields are present or not depending on whether the relevant kernel
configuration option is set.

Reverse mapping fields

  • shared.rb - A red/black tree node used, if the mapping is file-backed, to
    place the VMA in the struct address_space->i_mmap red/black interval tree.
    Write lock: mmap write, VMA write, i_mmap write.

  • shared.rb_subtree_last - Metadata used for management of the interval tree
    if the VMA is file-backed.
    Write lock: mmap write, VMA write, i_mmap write.

  • anon_vma_chain - List of pointers to both forked/CoW’d anon_vma objects and
    vma->anon_vma if it is non-NULL.
    Write lock: mmap read, anon_vma write.

  • anon_vma - anon_vma object used by anonymous folios mapped exclusively to
    this VMA. Initially set by anon_vma_prepare() serialised by the
    page_table_lock. This is set as soon as any page is faulted in.
    Write lock: When NULL and setting non-NULL: mmap read, page_table_lock.
    When non-NULL and setting NULL: mmap write, VMA write, anon_vma write.

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related struct anon_vma objects
and the struct anon_vma in which folios mapped exclusively to this VMA should
reside.

Note

If a file-backed mapping is mapped with MAP_PRIVATE set then it can be in both
the anon_vma and i_mmap trees at the same time, so all of these fields might be
utilised at once.

Page tables

We won’t speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contains entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

Note

In instances where the architecture supports fewer page tables than five the
kernel cleverly ‘folds’ page table levels, that is stubbing out functions
related to the skipped levels. This allows us to conceptually act as if there
were always five levels, even if the compiler might, in practice, eliminate any
code relating to missing ones.

There are four key operations typically performed on page tables:

  1. Traversing page tables - Simply reading page tables in order to traverse
     them. This only requires that the VMA is kept stable, so a lock which
     establishes this suffices for traversal (there are also lockless variants
     which eliminate even this requirement, such as gup_fast()). There is also
     a special case of page table traversal for non-VMA regions which we
     consider separately below.

  2. Installing page table mappings - Whether creating a new mapping or
     modifying an existing one in such a way as to change its identity. This
     requires that the VMA is kept stable via an mmap or VMA lock (explicitly
     not rmap locks).

  3. Zapping/unmapping page table entries - This is what the kernel calls
     clearing page table mappings at the leaf level only, whilst leaving all
     page tables in place. This is a very common operation in the kernel
     performed on file truncation, the MADV_DONTNEED operation via madvise(),
     and others. This is performed by a number of functions including
     unmap_mapping_range() and unmap_mapping_pages() (a hedged truncation
     sketch follows this list). The VMA need only be kept stable for this
     operation.

  4. Freeing page tables - When finally the kernel removes page tables from a
     userland process (typically via free_pgtables()) extreme care must be
     taken to ensure this is done safely, as this logic finally frees all page
     tables in the specified range, ignoring existing leaf entries (it assumes
     the caller has both zapped the range and prevented any further faults or
     modifications within it).
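As a sketch of operation 3, this is roughly how a filesystem or driver might
zap userland mappings of a file beyond a new, smaller size on truncation; the
zap_mappings_beyond_size() wrapper is an illustrative assumption, though the
unmap_mapping_range() call mirrors what truncate_pagecache() does:

#include <linux/fs.h>
#include <linux/mm.h>

/*
 * Illustrative sketch: clear all userland leaf entries mapping the file past
 * new_size. Page tables themselves are left in place; only PTEs are zapped.
 */
static void zap_mappings_beyond_size(struct inode *inode, loff_t new_size)
{
        loff_t holebegin = round_up(new_size, PAGE_SIZE);

        /* holelen == 0 means "to the end of the file";
         * even_cows == 1 also zaps private COW copies of the pages. */
        unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
}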

Note

Modifying mappings for reclaim or migration is performed under rmap lock as it,
like zapping, does not fundamentally modify the identity of what is being
mapped.

Traversing and zapping ranges can be performed holding any one of the locks
described in the terminology section above - that is the mmap lock, the VMA
lock or either of the reverse mapping locks.

That is - as long as you keep the relevant VMA stable - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).

Note

We free empty PTE tables on zap under the RCU lock - this does not change the
aforementioned locking requirements around zapping.

When installing page table entries, the mmap or VMA lock must be held to keep
the VMA stable. We explore why this is in the page table locking details
section below.

Freeing page tables is an entirely internal memory management operation and has
special requirements (see the page freeing section below for more details).

Warning

When freeing page tables, it must not be possible for VMAs containing the
ranges those page tables map to be accessible via the reverse mapping.

The free_pgtables() function removes the relevant VMAs from the reverse
mappings, but no other VMAs can be permitted to be accessible and span the
specified range.

Traversing non-VMA page tables

We’ve focused above on traversal of page tables belonging to VMAs. It is also
possible to traverse page tables which are not represented by VMAs.

Kernel page table mappings themselves are generally managed by whatever part of
the kernel established them, and the aforementioned locking rules do not apply -
for instance vmalloc has its own set of locks which are utilised for
establishing and tearing down its page tables.

However, for convenience we provide the walk_kernel_page_table_range() function
which is synchronised via the mmap lock on the init_mm kernel instantiation of
the struct mm_struct metadata object.

If an operation requires exclusive access, a write lock is used, but if not, a
read lock suffices - we assert only that at least a read lock has been acquired.

Since, aside from vmalloc and memory hot plug, kernel page tables are not torn
down all that often, this usually suffices; however any caller of this
functionality must ensure that any additionally required locks are acquired in
advance.

We also permit a truly unusual case - the traversal of non-VMA ranges in
userland - as provided for by walk_page_range_debug().

This has only one user - the general page table dumping logic (implemented in
mm/ptdump.c) - which seeks to expose all mappings for debug purposes even if
they are highly unusual (possibly architecture-specific) and are not backed by
a VMA.

We must take great care in this case, as the munmap() implementation detaches
VMAs under an mmap write lock before tearing down page tables under a
downgraded mmap read lock.

This means such an operation could race with this, and thus an mmap write lock
is required.

Lock ordering

As we have multiple locks across the kernel which may or may not be taken at
the same time as explicit mm or VMA locks, we have to be wary of lock
inversion, and the order in which locks are acquired and released becomes very
important.

Note

Lock inversion occurs when two threads need to acquire multiple locks, but in
doing so inadvertently cause a mutual deadlock.

For example, consider thread 1 which holds lock A and tries to acquire lock B,
while thread 2 holds lock B and tries to acquire lock A.

Both threads are now deadlocked on each other. However, had they attempted to
acquire locks in the same order, one would have waited for the other to
complete its work and no deadlock would have occurred.
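A minimal sketch of that inversion, using two hypothetical kernel mutexes
purely for illustration:

#include <linux/mutex.h>

/* Two unrelated locks - hypothetical, for illustration only. */
static DEFINE_MUTEX(lock_a);
static DEFINE_MUTEX(lock_b);

/* Thread 1: acquires A then B. */
static void thread1_work(void)
{
        mutex_lock(&lock_a);
        mutex_lock(&lock_b);    /* May block forever if thread 2 holds B... */
        /* ... critical section ... */
        mutex_unlock(&lock_b);
        mutex_unlock(&lock_a);
}

/* Thread 2: acquires B then A - the inverted order which risks deadlock. */
static void thread2_work(void)
{
        mutex_lock(&lock_b);
        mutex_lock(&lock_a);    /* ...while thread 1 holds A and waits for B. */
        /* ... critical section ... */
        mutex_unlock(&lock_a);
        mutex_unlock(&lock_b);
}

Had thread 2 also acquired lock_a before lock_b, one thread would simply have
waited for the other and no deadlock could occur.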

The opening comment in mm/rmap.c describes in detail the required ordering of
locks within memory management code:

inode->i_rwsem        (while writing or truncating, not reading or faulting)
  mm->mmap_lock
    mapping->invalidate_lock (in filemap_fault)
      folio_lock
        hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
          vma_start_write
            mapping->i_mmap_rwsem
              anon_vma->rwsem
                mm->page_table_lock or pte_lock
                  swap_lock (in swap_duplicate, swap_info_get)
                    mmlist_lock (in mmput, drain_mmlist and others)
                    mapping->private_lock (in block_dirty_folio)
                        i_pages lock (widely used)
                          lruvec->lru_lock (in folio_lruvec_lock_irq)
                    inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                    bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                      sb_lock (within inode_lock in fs/fs-writeback.c)
                      i_pages lock (widely used, in set_page_dirty,
                                in arch-dependent flush_dcache_mmap_lock,
                                within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top
of mm/filemap.c:

->i_mmap_rwsem                        (truncate_pagecache)
  ->private_lock                      (__free_pte->block_dirty_folio)
    ->swap_lock                       (exclusive_swap_page, others)
      ->i_pages lock

->i_rwsem
  ->invalidate_lock                   (acquired by fs in truncate path)
    ->i_mmap_rwsem                    (truncate->unmap_mapping_range)

->mmap_lock
  ->i_mmap_rwsem
    ->page_table_lock or pte_lock     (various, mainly in memory.c)
      ->i_pages lock                  (arch-dependent flush_dcache_mmap_lock)

->mmap_lock
  ->invalidate_lock                   (filemap_fault)
    ->lock_page                       (filemap_fault, access_process_vm)

->i_rwsem                             (generic_perform_write)
  ->mmap_lock                         (fault_in_readable->do_page_fault)

bdi->wb.list_lock
  sb_lock                             (fs/fs-writeback.c)
  ->i_pages lock                      (__sync_single_inode)

->i_mmap_rwsem
  ->anon_vma.lock                     (vma_merge)

->anon_vma.lock
  ->page_table_lock or pte_lock       (anon_vma_prepare and various)

->page_table_lock or pte_lock
  ->swap_lock                         (try_to_unmap_one)
  ->private_lock                      (try_to_unmap_one)
  ->i_pages lock                      (try_to_unmap_one)
  ->lruvec->lru_lock                  (follow_page_mask->mark_page_accessed)
  ->lruvec->lru_lock                  (check_pte_range->folio_isolate_lru)
  ->private_lock                      (folio_remove_rmap_pte->set_page_dirty)
  ->i_pages lock                      (folio_remove_rmap_pte->set_page_dirty)
  bdi.wb->list_lock                   (folio_remove_rmap_pte->set_page_dirty)
  ->inode->i_lock                     (folio_remove_rmap_pte->set_page_dirty)
  bdi.wb->list_lock                   (zap_pte_range->set_page_dirty)
  ->inode->i_lock                     (zap_pte_range->set_page_dirty)
  ->private_lock                      (zap_pte_range->block_dirty_folio)

Please check the current state of these comments which may have changed since
the time of writing of this document.

Locking Implementation Details

Warning

Locking rules for PTE-level page tables are very different from locking rules
for page tables at other levels.

Page table locking details

Note

This section explores page table locking requirements for page tables
encompassed by a VMA. See the above section on non-VMA page table traversal for
details on how we handle that case.

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

  • Higher level page table locks - Higher level page tables, that is PGD, P4D
    and PUD, each make use of the process address space granularity
    mm->page_table_lock lock when modified.

  • Fine-grained page table locks - PMDs and PTEs each have fine-grained locks
    either kept within the folios describing the page tables or allocated
    separately and pointed at by the folios if ALLOC_SPLIT_PTLOCKS is set. The
    PMD spin lock is obtained via pmd_lock(), however PTEs are mapped into
    higher memory (if a 32-bit system) and carefully locked via
    pte_offset_map_lock().

These locks represent the minimum required to interact with each page table
level, but there are further requirements.

Importantly, note that on a traversal of page tables, sometimes no such locks
are taken. However, at the PTE level, at least concurrent page table deletion
must be prevented (using RCU) and the page table must be mapped into high
memory, see below.

Whether care is taken on reading the page table entries depends on the
architecture, see the section on atomicity below.

Locking rules

We establish basic locking rules when interacting with page tables:

  • When changing a page table entry the page table lock for that page table
    must be held, except if you can safely assume nobody can access the page
    tables concurrently (such as on invocation of free_pgtables()).

  • Reads from and writes to page table entries must be appropriately atomic.
    See the section on atomicity below for details.

  • Populating previously empty entries requires that the mmap or VMA locks are
    held (read or write), doing so with only rmap locks would be dangerous (see
    the warning below).

  • As mentioned previously, zapping can be performed while simply keeping the
    VMA stable, that is holding any one of the mmap, VMA or rmap locks.

Warning

Populating previously empty entries is dangerous as, when unmapping VMAs,
vms_clear_ptes() has a window of time between zapping (via unmap_vmas()) and
freeing page tables (via free_pgtables()), where the VMA is still visible in
the rmap tree. free_pgtables() assumes that the zap has already been performed
and removes PTEs unconditionally (along with all other page tables in the freed
range), so installing new PTE entries could leak memory and also cause other
unexpected and dangerous behaviour.

There are additional rules applicable when moving page tables, which we discuss
in the section on this topic below.

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

  • On 32-bit architectures, they may be in high memory (meaning they need to
    be mapped into kernel memory to be accessible).

  • When empty, they can be unlinked and RCU-freed while holding an mmap lock
    or rmap lock for reading in combination with the PTE and PMD page table
    locks. In particular, this happens in retract_page_tables() when handling
    MADV_COLLAPSE.

    So accessing PTE-level page tables requires at least holding an RCU read
    lock; but that only suffices for readers that can tolerate racing with
    concurrent page table updates such that an empty PTE is observed (in a page
    table that has actually already been detached and marked for RCU freeing)
    while another new page table has been installed in the same location and
    filled with entries. Writers normally need to take the PTE lock and
    revalidate that the PMD entry still refers to the same PTE-level page
    table.

    If the writer does not care whether it is the same PTE-level page table, it
    can take the PMD lock and revalidate that the contents of the pmd entry
    still meet the requirements. In particular, this also happens in
    retract_page_tables() when handling MADV_COLLAPSE.

To access PTE-level page tables, a helper like pte_offset_map_lock() or
pte_offset_map() can be used depending on stability requirements. These map the
page table into kernel memory if required, take the RCU lock, and depending on
variant, may also look up or acquire the PTE lock. See the comment on
__pte_offset_map_lock().
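For illustration, here is a hedged sketch of a PTE-level walk over part of one
page table using pte_offset_map_lock(); it assumes the caller already holds a
lock keeping the VMA stable and has located the relevant pmd_t, and what is
done per entry is a placeholder:

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Illustrative sketch: walk the PTEs covering [addr, end) within one
 * PTE-level page table, under the PTE lock.
 */
static int walk_ptes_sketch(struct mm_struct *mm, pmd_t *pmd,
                            unsigned long addr, unsigned long end)
{
        pte_t *start_pte, *pte;
        spinlock_t *ptl;

        /* Maps the PTE table (kmap on 32-bit) and takes the PTE lock;
         * returns NULL if the PMD entry changed or no PTE table exists. */
        start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -EAGAIN;

        for (; addr < end; addr += PAGE_SIZE, pte++) {
                pte_t entry = ptep_get(pte);    /* READ_ONCE()-based read */

                if (pte_none(entry))
                        continue;

                /* ... act on the (local copy of the) entry here ... */
        }

        pte_unmap_unlock(start_pte, ptl);
        return 0;
}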

Atomicity

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal operations run in parallel (though holding the VMA stable), and
functionality like GUP-fast locklessly traverses (that is, reads) page tables,
without even keeping the VMA stable at all.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
__pud_install()), special care must always be taken. In these cases we can
never assume that page table locks give us entirely exclusive access, and must
retrieve page table entries once and only once.

If we are reading page table entries, then we need only ensure that the
compiler does not rearrange our loads. This is achieved via pXXp_get()
functions - pgdp_get(), p4dp_get(), pudp_get(), pmdp_get(), and ptep_get().

Each of these uses READ_ONCE() to guarantee that the compiler reads the page
table entry only once.
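For instance, a hedged sketch of reading a PMD entry once into a local copy and
making all decisions on that copy rather than re-reading the live entry; the
helper name and the particular checks performed are illustrative assumptions:

#include <linux/pgtable.h>

/* Hypothetical helper: does this PMD entry point to a PTE-level table? */
static bool pmd_points_to_pte_table(pmd_t *pmdp)
{
        pmd_t pmdval = pmdp_get(pmdp);  /* single READ_ONCE()-based read */

        if (pmd_none(pmdval))
                return false;
        if (pmd_trans_huge(pmdval))     /* a huge-page leaf, not a table */
                return false;

        return true;
}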

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in ptep_get_and_clear().

Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see gup_fast() and its various page table level handlers like
gup_fast_pte_range()), must very carefully interact with page table entries,
using functions such as ptep_get_lockless() and equivalent for higher level
page table levels.

Writes to page table entries must also be appropriately atomic, as established
by set_pXX() functions - set_pgd(), set_p4d(), set_pud(), set_pmd(), and
set_pte().

Equally, functions which clear page table entries must be appropriately atomic,
as in pXX_clear() functions - pgd_clear(), p4d_clear(), pud_clear(),
pmd_clear(), and pte_clear().

Page table installation

Page table installation is performed with the VMA held stable explicitly by an
mmap or VMA lock in read or write mode (see the warning in the locking rules
section for details as to why).

When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the mm->page_table_lock must be held. This is acquired in
__p4d_alloc(), __pud_alloc() and __pmd_alloc() respectively.

Note

__pmd_alloc() actually invokes pud_lock() and pud_lockptr() in turn, however at
the time of writing it ultimately references the mm->page_table_lock.

Allocating a PTE will either use the mm->page_table_lock or, if
USE_SPLIT_PMD_PTLOCKS is defined, a lock embedded in the PMD physical page
metadata in the form of a struct ptdesc, acquired by pmd_ptdesc() called from
pmd_lock() and ultimately __pte_alloc().

Finally, modifying the contents of the PTE requires special treatment, as the
PTE page table lock must be acquired whenever we want stable and exclusive
access to entries contained within a PTE, especially when we wish to modify
them.

This is performed via pte_offset_map_lock() which carefully checks to ensure
that the PTE hasn’t changed from under us, ultimately invoking pte_lockptr() to
obtain a spin lock at PTE granularity contained within the struct ptdesc
associated with the physical PTE page. The lock must be released via
pte_unmap_unlock().

Note

There are some variants on this, such as pte_offset_map_rw_nolock() when we
know we hold the PTE stable but for brevity we do not explore this. See the
comment for __pte_offset_map_lock() for more details.

When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and set/clear data at the PTE level as required (for instance when page
faulting or zapping).

A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the
table above is empty; if so, only then acquiring the page table lock and
checking again to see if it was allocated underneath us.

This allows for a traversal with page table locks only being taken when
required. An example of this is __pud_alloc().
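A hedged sketch of the shape of this pattern, modelled loosely on pud_alloc()
and __pud_alloc(); the function name is illustrative and details such as page
table accounting and memory barriers are elided:

#include <linux/mm.h>
#include <asm/pgalloc.h>

/*
 * Illustrative sketch: optimistically check the P4D entry without a lock,
 * and only if it appears empty allocate a PUD table, re-checking under
 * mm->page_table_lock before populating it.
 */
static pud_t *pud_alloc_sketch(struct mm_struct *mm, p4d_t *p4d,
                               unsigned long address)
{
        /* Optimistic, lockless check: only take the lock if it looks empty. */
        if (p4d_none(*p4d)) {
                pud_t *new = pud_alloc_one(mm, address);

                if (!new)
                        return NULL;

                spin_lock(&mm->page_table_lock);
                if (!p4d_present(*p4d))         /* re-check under the lock */
                        p4d_populate(mm, p4d, new);
                else                            /* someone beat us to it */
                        pud_free(mm, new);
                spin_unlock(&mm->page_table_lock);
        }

        return pud_offset(p4d, address);
}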

At the leaf page table, that is the PTE, we can’t entirely rely on this pattern
as we have separate PMD and PTE locks and a THP collapse for instance might
have eliminated the PMD entry as well as the PTE from under us.

This is why __pte_offset_map_lock() locklessly retrieves the PMD entry for the
PTE, carefully checking it is as expected, before acquiring the PTE-specific
lock, and then again checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.

Installing entries this way ensures mutual exclusion on write.

Page table freeing

Tearing down page tables themselves is something that requires significant
care. There must be no way that page tables designated for removal can be
traversed or referenced by concurrent tasks.

It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the struct address_space->i_mmap_rwsem alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the struct anon_vma->rb_root or the struct address_space->i_mmap
interval trees) can have its page tables torn down.

The operation is typically performed via free_pgtables(), which assumes either
the mmap write lock has been taken (as specified by its mm_wr_locked
parameter), or that the VMA is already unreachable.

It carefully removes the VMA from all reverse mappings, however it’s important
that no new ones overlap these or any route remain to permit access to
addresses within the range whose page tables are being torn down.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed
between the zap and the invocation of free_pgtables().

Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the pgd_clear(), p4d_clear(), pud_clear(),
and pmd_clear() functions).

Note

It is possible for leaf page tables to be torn down independent of the page
tables above it, as is done by retract_page_tables(), which is performed under
the i_mmap read lock, PMD, and PTE page table locks, without this level of
care.

Page table moving

Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
page tables). Most notable of these is mremap(), which is capable of moving
higher level page tables.

In these instances, it is required that all locks are taken, that is the mmap
lock, the VMA lock and the relevant rmap locks.

You can observe this in the mremap() implementation in the functions
take_rmap_locks() and drop_rmap_locks() which perform the rmap side of lock
acquisition, invoked ultimately by move_page_tables().

VMA lock internals

Overview

VMA read locking is entirely optimistic - if the lock is contended or a
competing write has started, then we do not obtain a read lock.

A VMA read lock is obtained by lock_vma_under_rcu(), which first calls
rcu_read_lock() to ensure that the VMA is looked up in an RCU critical section,
then attempts to VMA lock it via vma_start_read(), before releasing the RCU
lock via rcu_read_unlock().

In cases when the user already holds the mmap read lock,
vma_start_read_locked() and vma_start_read_locked_nested() can be used. These
functions do not fail due to lock contention but the caller should still check
their return values in case they fail for other reasons.

VMA read locks increment the vma.vm_refcnt reference counter for their duration
and the caller of lock_vma_under_rcu() must drop it via vma_end_read().

VMA write locks are acquired via vma_start_write() in instances where a VMA is
about to be modified; unlike vma_start_read() the lock is always acquired. An
mmap write lock must be held for the duration of the VMA write lock; releasing
or downgrading the mmap write lock also releases the VMA write lock so there is
no vma_end_write() function.

Note that when write-locking a VMA lock, the vma.vm_refcnt is temporarily
modified so that readers can detect the presence of a writer. The reference
counter is restored once the vma sequence number used for serialisation is
updated.

This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.

Implementation details

The VMA lock mechanism is designed to be a lightweight means of avoiding the
use of the heavily contended mmap lock. It is implemented using a combination
of a reference counter and sequence numbers belonging to the containing
struct mm_struct and the VMA.

Read locks are acquired via vma_start_read(), which is an optimistic operation,
i.e. it tries to acquire a read lock but returns false if it is unable to do
so. At the end of the read operation, vma_end_read() is called to release the
VMA read lock.

Invoking vma_start_read() requires that rcu_read_lock() has been called first,
establishing that we are in an RCU critical section upon VMA read lock
acquisition. Once acquired, the RCU lock can be released as it is only required
for lookup. This is abstracted by lock_vma_under_rcu() which is the interface a
user should use.

Writing requires the mmap to be write-locked and the VMA lock to be acquired
via vma_start_write(), however the write lock is released by the termination or
downgrade of the mmap write lock so no vma_end_write() is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which
are used in order to reduce complexity, especially for operations which
write-lock multiple VMAs at once.

If the mm sequence count, mm->mm_lock_seq, is equal to the VMA sequence count,
vma->vm_lock_seq, then the VMA is write-locked. If they differ, then it is not.

Each time the mmap write lock is released in mmap_write_unlock() or
mmap_write_downgrade(), vma_end_write_all() is invoked which also increments
mm->mm_lock_seq via mm_lock_seqcount_end().

This way, we ensure that, regardless of the VMA’s sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release all VMA write locks contained within the mmap at the same
time.

Since the mmap write lock is exclusive against others who hold it, the
automatic release of any VMA locks on its release makes sense, as you would
never want to keep VMAs locked across entirely separate write operations. It
also maintains correct lock ordering.

Each time a VMA read lock is acquired, we increment the vma.vm_refcnt reference
counter and check that the sequence count of the VMA does not match that of the
mm.

If it does, the read lock fails and vma.vm_refcnt is dropped. If it does not,
we keep the reference counter raised, excluding writers, but permitting other
readers, who can also obtain this lock under RCU.

Importantly, maple tree operations performed in lock_vma_under_rcu() are also
RCU safe, so the whole read lock operation is guaranteed to function correctly.

On the write side, we set a bit in vma.vm_refcnt which can’t be modified by
readers and wait for all readers to drop their reference count. Once there are
no readers, the VMA’s sequence number is set to match that of the mm. During
this entire operation the mmap write lock is held.

This way, if any read locks are in effect, vma_start_write() will sleep until
these are finished and mutual exclusion is achieved.

After setting the VMA’s sequence number, the bit in vma.vm_refcnt indicating a
writer is cleared. From this point on, the VMA’s sequence number will indicate
the VMA’s write-locked state until the mmap write lock is dropped or
downgraded.

This clever combination of a reference counter and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.

mmap write lock downgrading

When an mmap write lock is held one has exclusive access to resources within
the mmap (with the usual caveats about requiring VMA write locks to avoid races
with tasks holding VMA read locks).

It is then possible to downgrade from a write lock to a read lock via
mmap_write_downgrade() which, similar to mmap_write_unlock(), implicitly
terminates all VMA write locks via vma_end_write_all(), but importantly does
not relinquish the mmap lock while downgrading, therefore keeping the locked
virtual address space stable.

An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).
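A minimal sketch of the downgrade pattern follows; the work performed in each
phase is a placeholder assumption:

#include <linux/mm.h>
#include <linux/mmap_lock.h>

/*
 * Illustrative sketch: perform modifications under the mmap write lock, then
 * downgrade to a read lock for a longer read-only phase, without allowing
 * another writer in between.
 */
static void downgrade_pattern_sketch(struct mm_struct *mm)
{
        mmap_write_lock(mm);

        /* ... modify VMAs, taking VMA write locks as needed ... */

        /* Drops all VMA write locks (via vma_end_write_all()) but keeps the
         * mmap lock held, now in read mode - no new writer can slip in. */
        mmap_write_downgrade(mm);

        /* ... read-only work against a stable address space ... */

        mmap_read_unlock(mm);
}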

For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another showing which locks exclude the others:

Lock exclusivity

      R   D   W
  R   N   N   Y
  D   N   Y   Y
  W   Y   Y   Y

Here a Y indicates the locks in the matching row/column are mutually exclusive,
and N indicates that they are not.

Stack expansion

Stack expansion throws up additional complexities in that we cannot permit
there to be racing page faults; as a result we invoke vma_start_write() to
prevent this in expand_downwards() or expand_upwards().

Functions and structures

int vma_start_write_killable(struct vm_area_struct *vma)

Begin writing to a VMA.

Parameters

struct vm_area_struct *vma

The VMA we are going to modify.

Description

Exclude concurrent readers under the per-VMA lock until the currently
write-locked mmap_lock is dropped or downgraded.

Context

May sleep while waiting for readers to drop the vma read lock. Caller must
already hold the mmap_lock for write.

Return

0 for a successful acquisition. -EINTR if a fatal signal was received.
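A hedged usage sketch; the surrounding wrapper and the placeholder modification
are assumptions, not kernel code:

#include <linux/mm.h>
#include <linux/mmap_lock.h>

/*
 * Illustrative sketch: write-lock a VMA while remaining killable, so that a
 * fatal signal can interrupt the wait for VMA readers to drain.
 */
static int write_lock_vma_killable_sketch(struct mm_struct *mm,
                                          struct vm_area_struct *vma)
{
        int err;

        /* Caller context: the mmap_lock must already be held for write. */
        mmap_assert_write_locked(mm);

        err = vma_start_write_killable(vma);
        if (err)
                return err;     /* -EINTR: a fatal signal was received */

        /* ... modify the VMA; the VMA write lock is released automatically
         * when the mmap write lock is unlocked or downgraded ... */
        return 0;
}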