Memory Management APIs¶
User Space Memory Access¶
- get_user¶
get_user(x,ptr)
Get a simple variable from user space.
Parameters
x - Variable to store result.
ptr - Source address, in user space.
Context
User context only. This function may sleep if pagefaults are enabled.
Description
This macro copies a single simple variable from user space to kernel space. It supports simple types like char and int, but not larger data types like structures or arrays.
ptr must have pointer-to-simple-variable type, and the result of dereferencing ptr must be assignable to x without a cast.
Return
zero on success, or -EFAULT on error. On error, the variable x is set to zero.
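For illustration, a minimal sketch of typical usage in a user-context helper; the function and parameter names are hypothetical:

    #include <linux/uaccess.h>
    #include <linux/errno.h>

    /* Hypothetical helper: fetch a single int argument from user space. */
    static int example_read_arg(int __user *uarg, int *out)
    {
        int val;

        /* May sleep on a page fault; call from user context only. */
        if (get_user(val, uarg))
            return -EFAULT;

        *out = val;
        return 0;
    }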
- __get_user¶
__get_user(x,ptr)
Get a simple variable from user space, with less checking.
Parameters
x - Variable to store result.
ptr - Source address, in user space.
Context
User context only. This function may sleep if pagefaults are enabled.
Description
This macro copies a single simple variable from user space to kernel space. It supports simple types like char and int, but not larger data types like structures or arrays.
ptr must have pointer-to-simple-variable type, and the result of dereferencing ptr must be assignable to x without a cast.
Caller must check the pointer with access_ok() before calling this function.
Return
zero on success, or -EFAULT on error. On error, the variable x is set to zero.
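A minimal sketch of the intended pattern, assuming a hypothetical helper that validates the whole user range once with access_ok() and then uses the lighter-weight __get_user() per element:

    #include <linux/uaccess.h>
    #include <linux/errno.h>

    static int example_sum_user_ints(const int __user *uarr, size_t n, long *sum)
    {
        long total = 0;
        size_t i;

        if (!access_ok(uarr, n * sizeof(*uarr)))
            return -EFAULT;

        for (i = 0; i < n; i++) {
            int v;

            if (__get_user(v, uarr + i))
                return -EFAULT;
            total += v;
        }

        *sum = total;
        return 0;
    }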
- put_user¶
put_user(x,ptr)
Write a simple value into user space.
Parameters
x - Value to copy to user space.
ptr - Destination address, in user space.
Context
User context only. This function may sleep if pagefaults are enabled.
Description
This macro copies a single simple value from kernel space to user space. It supports simple types like char and int, but not larger data types like structures or arrays.
ptr must have pointer-to-simple-variable type, and x must be assignable to the result of dereferencing ptr.
Return
zero on success, or -EFAULT on error.
- __put_user¶
__put_user(x,ptr)
Write a simple value into user space, with less checking.
Parameters
x - Value to copy to user space.
ptr - Destination address, in user space.
Context
User context only. This function may sleep if pagefaults are enabled.
Description
This macro copies a single simple value from kernel space to user space. It supports simple types like char and int, but not larger data types like structures or arrays.
ptr must have pointer-to-simple-variable type, and x must be assignable to the result of dereferencing ptr.
Caller must check the pointer with access_ok() before calling this function.
Return
zero on success, or -EFAULT on error.
- unsigned long clear_user(void __user *to, unsigned long n)¶
Zero a block of memory in user space.
Parameters
void __user *to - Destination address, in user space.
unsigned long n - Number of bytes to zero.
Description
Zero a block of memory in user space.
Return
number of bytes that could not be cleared. On success, this will be zero.
- unsigned long __clear_user(void __user *to, unsigned long n)¶
Zero a block of memory in user space, with less checking.
Parameters
void __user *to - Destination address, in user space.
unsigned long n - Number of bytes to zero.
Description
Zero a block of memory in user space. Caller must check the specified block with access_ok() before calling this function.
Return
number of bytes that could not be cleared. On success, this will be zero.
- int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages)¶
pin user pages in memory
Parameters
unsigned long start - starting user address
int nr_pages - number of pages from start to pin
unsigned int gup_flags - flags modifying pin behaviour
struct page **pages - array that receives pointers to the pages pinned. Should be at least nr_pages long.
Description
Attempt to pin user pages in memory without taking mm->mmap_lock. If not successful, it will fall back to taking the lock and calling get_user_pages().
Returns number of pages pinned. This may be fewer than the number requested. If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns -errno.
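A hedged sketch of one common calling pattern (the helper name is hypothetical): pin a user buffer for a short-lived operation, treat a partial pin as failure, and drop the page references afterwards:

    #include <linux/mm.h>
    #include <linux/errno.h>

    static int example_pin_user_buffer(unsigned long uaddr, int nr_pages,
                                       struct page **pages)
    {
        int pinned, i;

        pinned = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
        if (pinned < 0)
            return pinned;      /* -errno, nothing was pinned */
        if (pinned != nr_pages) {
            /* Partial pin: release what we got and report failure. */
            for (i = 0; i < pinned; i++)
                put_page(pages[i]);
            return -EFAULT;
        }
        return 0;       /* caller must put_page() each page when done */
    }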
Memory Allocation Controls¶
Page mobility and placement hints¶
These flags provide hints about how mobile the page is. Pages with similar mobility are placed within the same pageblocks to minimise problems due to external fragmentation.
__GFP_MOVABLE (also a zone modifier) indicates that the page can be moved by page migration during memory compaction or can be reclaimed.
__GFP_RECLAIMABLE is used for slab allocations that specify SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.
__GFP_WRITE indicates the caller intends to dirty the page. Where possible, these pages will be spread between local zones to avoid all the dirty pages being in one zone (fair zone allocation policy).
__GFP_HARDWALL enforces the cpuset memory allocation policy.
__GFP_THISNODE forces the allocation to be satisfied from the requested node with no fallbacks or placement policy enforcements.
__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
Watermark modifiers -- controls access to emergency reserves¶
__GFP_HIGH indicates that the caller is high-priority and that granting the request is necessary before the system can make forward progress. For example creating an IO context to clean pages and requests from atomic context.
__GFP_MEMALLOC allows access to all memory. This should only be used when the caller guarantees the allocation will allow more memory to be freed very shortly e.g. process exiting or swapping. Users either should be the MM or co-ordinating closely with the VM (e.g. swap over NFS). Users of this flag have to be extremely careful to not deplete the reserve completely and implement a throttling mechanism which controls the consumption of the reserve based on the amount of freed memory. Usage of a pre-allocated pool (e.g. mempool) should be always considered before using this flag.
__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves. This takes precedence over the __GFP_MEMALLOC flag if both are set.
Reclaim modifiers¶
Please note that all the following flags are only applicable to sleepable allocations (e.g. GFP_NOWAIT and GFP_ATOMIC will ignore them).
__GFP_IO can start physical IO.
__GFP_FS can call down to the low-level FS. Clearing the flag avoids the allocator recursing into the filesystem which might already be holding locks.
__GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim. This flag can be cleared to avoid unnecessary delays when a fallback option is available.
__GFP_KSWAPD_RECLAIM indicates that the caller wants to wake kswapd when the low watermark is reached and have it reclaim pages until the high watermark is reached. A caller may wish to clear this flag when fallback options are available and the reclaim is likely to disrupt the system. The canonical example is THP allocation where a fallback is cheap but reclaim/compaction may cause indirect stalls.
__GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
The default allocator behavior depends on the request size. We have a concept of so-called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER). !costly allocations are too essential to fail so they are implicitly non-failing by default (with some exceptions like OOM victims might fail so the caller still has to check for failures) while costly requests try to be not disruptive and back off even without invoking the OOM killer. The following three modifiers might be used to override some of these implicit rules. Please note that all of them must be used along with the __GFP_DIRECT_RECLAIM flag.
__GFP_NORETRY: The VM implementation will try only very lightweight memory direct reclaim to get some memory under memory pressure (thus it can sleep). It will avoid disruptive actions like the OOM killer. The caller must handle the failure which is quite likely to happen under heavy memory pressure. The flag is suitable when failure can easily be handled at small cost, such as reduced throughput.
__GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim procedures that have previously failed if there is some indication that progress has been made elsewhere. It can wait for other tasks to attempt high-level approaches to freeing memory such as compaction (which removes fragmentation) and page-out. There is still a definite limit to the number of retries, but it is a larger limit than with __GFP_NORETRY. Allocations with this flag may fail, but only when there is genuinely little unused memory. While these allocations do not directly trigger the OOM killer, their failure indicates that the system is likely to need to use the OOM killer soon. The caller must handle failure, but can reasonably do so by failing a higher-level request, or completing it only in a much less efficient manner. If the allocation does fail, and the caller is in a position to free some non-essential memory, doing so could benefit the system as a whole.
__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller cannot handle allocation failures. The allocation could block indefinitely but will never return with failure. Testing for failure is pointless. It _must_ be blockable and used together with __GFP_DIRECT_RECLAIM. It should _never_ be used in non-sleepable contexts. New users should be evaluated carefully (and the flag should be used only when there is no reasonable failure policy) but it is definitely preferable to use the flag rather than opencode an endless loop around the allocator. Allocating pages from the buddy allocator with __GFP_NOFAIL and order > 1 is not supported. Please consider using kvmalloc() instead.
Useful GFP flag combinations¶
Useful GFP flag combinations that are commonly used. It is recommended that subsystems start with one of these combinations and then set/clear __GFP_FOO flags as necessary.
GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower watermark is applied to allow access to “atomic reserves”. The current implementation doesn’t support NMI and a few other strict non-preemptive contexts (e.g. raw_spin_lock). The same applies to GFP_NOWAIT.
GFP_KERNEL is typical for kernel-internal allocations. The caller requires ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
GFP_KERNEL_ACCOUNT is the same as GFP_KERNEL, except the allocation isaccounted to kmemcg.
GFP_NOWAIT is for kernel allocations that should not stall for direct reclaim, start physical IO or use any filesystem callback. It is very likely to fail to allocate memory, even for very small allocations.
GFP_NOIO will use direct reclaim to discard clean pages or slab pages that do not require the starting of any physical IO. Please try to avoid using this flag directly and instead use memalloc_noio_{save,restore} to mark the whole scope which cannot perform any IO with a short explanation why. All allocation requests will inherit GFP_NOIO implicitly.
GFP_NOFS will use direct reclaim but will not use any filesystem interfaces. Please try to avoid using this flag directly and instead use memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn’t recurse into the FS layer with a short explanation why. All allocation requests will inherit GFP_NOFS implicitly.
GFP_USER is for userspace allocations that also need to be directly accessible by the kernel or hardware. It is typically used by hardware for buffers that are mapped to userspace (e.g. graphics) that hardware still must DMA to. cpuset limits are enforced for these allocations.
GFP_DMA exists for historical reasons and should be avoided where possible. The flag indicates that the caller requires that the lowest zone be used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but it would require careful auditing as some users really require it and others use the flag to avoid lowmem reserves in ZONE_DMA and treat the lowest zone as a type of emergency reserve.
GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit address. Note that kmalloc(..., GFP_DMA32) does not return DMA32 memory because the DMA32 kmalloc cache array is not implemented. (Reason: there is no such user in the kernel.)
GFP_HIGHUSER is for userspace allocations that may be mapped to userspace, do not need to be directly accessible by the kernel but that cannot move once in use. An example may be a hardware allocation that maps data directly into userspace but has no addressing limitations.
GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not need direct access to but can use kmap() when access is required. They are expected to be movable via page reclaim or page migration. Typically, pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
GFP_TRANSHUGE and GFP_TRANSHUGE_LIGHT are used for THP allocations. They are compound allocations that will generally fail quickly if memory is not available and will not wake kswapd/kcompactd on failure. The _LIGHT version does not attempt reclaim/compaction at all and is by default used in the page fault path, while the non-light is used by khugepaged.
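As a hedged illustration of how these combinations are typically chosen (hypothetical helper names), a sleeping kernel-internal allocation uses GFP_KERNEL while an allocation made under a spinlock or from softirq context uses GFP_ATOMIC:

    #include <linux/slab.h>
    #include <linux/gfp.h>

    static void *example_alloc_in_process_context(size_t len)
    {
        /* May sleep and enter direct reclaim. */
        return kzalloc(len, GFP_KERNEL);
    }

    static void *example_alloc_in_atomic_context(size_t len)
    {
        /* Must not sleep; may dip into atomic reserves and can fail. */
        return kzalloc(len, GFP_ATOMIC);
    }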
The Slab Cache¶
- SLAB_HWCACHE_ALIGN¶
SLAB_HWCACHE_ALIGN
Align objects on cache line boundaries.
Description
Sufficiently large objects are aligned on a cache line boundary. For object sizes smaller than half of the cache line size, the alignment is on half of the cache line size. In general, if the object size is smaller than 1/2^n of the cache line size, the alignment is adjusted to 1/2^n.
If explicit alignment is also requested by the respective struct kmem_cache_args field, the greater of both alignments is applied.
- SLAB_TYPESAFE_BY_RCU¶
SLAB_TYPESAFE_BY_RCU
WARNING READ THIS!
Description
This delays freeing the SLAB page by a grace period; it does _NOT_ delay object freeing. This means that if you do kmem_cache_free() that memory location is free to be reused at any time. Thus it may be possible to see another object there in the same RCU grace period.
This feature only ensures the memory location backing the object stays valid; the trick to using this is relying on an independent object validation pass. Something like:

    begin:
        rcu_read_lock();
        obj = lockless_lookup(key);
        if (obj) {
            if (!try_get_ref(obj)) { /* might fail for free objects */
                rcu_read_unlock();
                goto begin;
            }
            if (obj->key != key) { /* not the object we expected */
                put_ref(obj);
                rcu_read_unlock();
                goto begin;
            }
        }
        rcu_read_unlock();

This is useful if we need to approach a kernel structure obliquely, from its address obtained without the usual locking. We can lock the structure to stabilize it and check it’s still at the given address, only if we can be sure that the memory has not been meanwhile reused for some other kind of object (which our subsystem’s lock might corrupt).
rcu_read_lock() before reading the address, then rcu_read_unlock() after taking the spinlock within the structure expected at that address.
Note that the object identity check has to be done after acquiring a reference, therefore the user has to ensure proper ordering for loads. Similarly, when initializing objects allocated with SLAB_TYPESAFE_BY_RCU, the newly allocated object has to be fully initialized before its refcount gets initialized, and proper ordering for stores is required. refcount_{add|inc}_not_zero_acquire() and refcount_set_release() are designed with the proper fences required for reference counting objects allocated with SLAB_TYPESAFE_BY_RCU.
Note that it is not possible to acquire a lock within a structure allocated with SLAB_TYPESAFE_BY_RCU without first acquiring a reference as described above. The reason is that SLAB_TYPESAFE_BY_RCU pages are not zeroed before being given to the slab, which means that any locks must be initialized after each and every kmem_struct_alloc(). Alternatively, make the ctor passed to kmem_cache_create() initialize the locks at page-allocation time, as is done in __i915_request_ctor(), sighand_ctor(), and anon_vma_ctor(). Such a ctor permits readers to safely acquire those ctor-initialized locks under rcu_read_lock() protection.
Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.
- SLAB_ACCOUNT¶
SLAB_ACCOUNT
Account allocations to memcg.
Description
All object allocations from this cache will be memcg accounted, regardless of __GFP_ACCOUNT being or not being passed to individual allocations.
- SLAB_RECLAIM_ACCOUNT¶
SLAB_RECLAIM_ACCOUNT
Objects are reclaimable.
Description
Use this flag for caches that have an associated shrinker. As a result, slab pages are allocated with __GFP_RECLAIMABLE, which affects grouping pages by mobility, and are accounted in the SReclaimable counter in /proc/meminfo.
- struct kmem_cache_args¶
Less common arguments for kmem_cache_create()
Definition:
struct kmem_cache_args {
    unsigned int align;
    unsigned int useroffset;
    unsigned int usersize;
    unsigned int freeptr_offset;
    bool use_freeptr_offset;
    void (*ctor)(void *);
    unsigned int sheaf_capacity;
};
Members
align - The required alignment for the objects. 0 means no specific alignment is requested.
useroffset - Usercopy region offset. 0 is a valid offset, when usersize is non-0.
usersize - Usercopy region size. 0 means no usercopy region is specified.
freeptr_offset - Custom offset for the free pointer in SLAB_TYPESAFE_BY_RCU caches. By default SLAB_TYPESAFE_BY_RCU caches place the free pointer outside of the object. This might cause the object to grow in size. Cache creators that have a reason to avoid this can specify a custom free pointer offset in their struct where the free pointer will be placed. Note that placing the free pointer inside the object requires the caller to ensure that no fields are invalidated that are required to guard against object recycling (see SLAB_TYPESAFE_BY_RCU for details). Using 0 as a value for freeptr_offset is valid. If freeptr_offset is specified, use_freeptr_offset must be set true. Note that ctor currently isn’t supported with custom free pointers as a ctor requires an external free pointer.
use_freeptr_offset - Whether a freeptr_offset is used.
ctor - A constructor for the objects. The constructor is invoked for each object in a newly allocated slab page. It is the cache user’s responsibility to free the object in the same state as after calling the constructor, or deal appropriately with any differences between a freshly constructed and a reallocated object. NULL means no constructor.
sheaf_capacity - Enable sheaves of given capacity for the cache. With a non-zero value, allocations from the cache go through caching arrays called sheaves. Each cpu has a main sheaf that’s always present, and a spare sheaf that may be not present. When both become empty, there’s an attempt to replace an empty sheaf with a full sheaf from the per-node barn. When no full sheaf is available, and gfp flags allow blocking, a sheaf is allocated and filled from slab(s) using bulk allocation. Otherwise the allocation falls back to the normal operation allocating a single object from a slab. Analogically when freeing and both percpu sheaves are full, the barn may replace it with an empty sheaf, unless it’s over capacity. In that case a sheaf is bulk freed to slab pages. The sheaves do not enforce NUMA placement of objects, so allocations via kmem_cache_alloc_node() with a node specified other than NUMA_NO_NODE will bypass them. Bulk allocation and free operations also try to use the cpu sheaves and barn, but fall back to using slab pages directly. When slub_debug is enabled for the cache, the sheaf_capacity argument is ignored. 0 means no sheaves will be created.
Description
Any uninitialized fields of the structure are interpreted as unused. The exception is freeptr_offset, where 0 is a valid value, so use_freeptr_offset must be also set to true in order to interpret the field as used. For useroffset 0 is also valid, but only with non-0 usersize.
When NULL args is passed to kmem_cache_create(), it is equivalent to all fields unused.
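A minimal sketch of creating a cache with struct kmem_cache_args, assuming the 4-parameter kmem_cache_create() variant described later in this section; the object type and names are hypothetical:

    #include <linux/slab.h>
    #include <linux/list.h>
    #include <linux/types.h>
    #include <linux/errno.h>

    struct example_obj {
        u64 id;
        struct list_head node;
    };

    static struct kmem_cache *example_cachep;

    static int example_cache_init(void)
    {
        struct kmem_cache_args args = {
            .align = __alignof__(struct example_obj),
            /* all other fields left zero: interpreted as unused */
        };

        example_cachep = kmem_cache_create("example_obj",
                                           sizeof(struct example_obj),
                                           &args,
                                           SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT);
        return example_cachep ? 0 : -ENOMEM;
    }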
- struct kmem_cache *kmem_cache_create_usercopy(const char *name, unsigned int size, unsigned int align, slab_flags_t flags, unsigned int useroffset, unsigned int usersize, void (*ctor)(void *))¶
Create a kmem cache with a region suitable for copying to userspace.
Parameters
const char *name - A string which is used in /proc/slabinfo to identify this cache.
unsigned int size - The size of objects to be created in this cache.
unsigned int align - The required alignment for the objects.
slab_flags_t flags - SLAB flags
unsigned int useroffset - Usercopy region offset
unsigned int usersize - Usercopy region size
void (*ctor)(void *) - A constructor for the objects, or NULL.
Description
This is a legacy wrapper, new code should use either KMEM_CACHE_USERCOPY() if whitelisting a single field is sufficient, or kmem_cache_create() with the necessary parameters passed via the args parameter (see struct kmem_cache_args).
Return
a pointer to the cache on success, NULL on failure.
- kmem_cache_create¶
kmem_cache_create(__name,__object_size,__args,...)
Create a kmem cache.
Parameters
__name - A string which is used in /proc/slabinfo to identify this cache.
__object_size - The size of objects to be created in this cache.
__args - Optional arguments, see struct kmem_cache_args. Passing NULL means defaults will be used for all the arguments.
... - variable arguments
Description
This is currently implemented as a macro using _Generic() to call either the new variant of the function, or a legacy one.
The new variant has 4 parameters: kmem_cache_create(name, object_size, args, flags)
See __kmem_cache_create_args() which implements this.
The legacy variant has 5 parameters: kmem_cache_create(name, object_size, align, flags, ctor)
The align and ctor parameters map to the respective fields of struct kmem_cache_args.
Context
Cannot be called within an interrupt, but can be interrupted.
Return
a pointer to the cache on success, NULL on failure.
- size_t ksize(const void *objp)¶
Report actual allocation size of associated object
Parameters
const void *objp - Pointer returned from a prior kmalloc()-family allocation.
Description
This should not be used for writing beyond the originally requested allocation size. Either use krealloc() or round up the allocation size with kmalloc_size_roundup() prior to allocation. If this is used to access beyond the originally requested allocation size, UBSAN_BOUNDS and/or FORTIFY_SOURCE may trip, since they only know about the originally allocated size via the __alloc_size attribute.
- void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)¶
Allocate an object
Parameters
struct kmem_cache *cachep - The cache to allocate from.
gfp_t flags - See kmalloc().
Description
Allocate an object from this cache. See kmem_cache_zalloc() for a shortcut of adding __GFP_ZERO to flags.
Return
pointer to the new object or NULL in case of error
- bool kmem_cache_charge(void *objp, gfp_t gfpflags)¶
memcg charge an already allocated slab memory
Parameters
void *objp - address of the slab object to memcg charge
gfp_t gfpflags - describe the allocation context
Description
kmem_cache_charge allows charging a slab object to the current memcg, primarily in cases where charging at allocation time might not be possible because the target memcg is not known (i.e. softirq context).
The objp should be a pointer returned by the slab allocator functions like kmalloc (with __GFP_ACCOUNT in flags) or kmem_cache_alloc. The memcg charge behavior can be controlled through the gfpflags parameter, which affects how the necessary internal metadata can be allocated. Including __GFP_NOFAIL denotes that overcharging is requested instead of failure, but is not applied for the internal metadata allocation.
There are several cases where it will return true even if the charging was not done. More specifically:
For !CONFIG_MEMCG or cgroup_disable=memory systems.
Already charged slab objects.
For slab objects from KMALLOC_NORMAL caches - allocated by kmalloc() without __GFP_ACCOUNT.
Allocating internal metadata has failed.
Return
true if charge was successful otherwise false.
- void *kmalloc(size_t size, gfp_t flags)¶
allocate kernel memory
Parameters
size_t size - how many bytes of memory are required.
gfp_t flags - describe the allocation context
Description
kmalloc is the normal method of allocating memory for objects smaller than page size in the kernel.
The allocated object address is aligned to at least ARCH_KMALLOC_MINALIGN bytes. For size of power of two bytes, the alignment is also guaranteed to be at least to the size. For other sizes, the alignment is guaranteed to be at least the largest power-of-two divisor of size.
The flags argument may be one of the GFP flags defined at include/linux/gfp_types.h and described at Documentation/core-api/mm-api.rst
The recommended usage of the flags is described at Documentation/core-api/memory-allocation.rst
Below is a brief outline of the most useful GFP flags
GFP_KERNEL - Allocate normal kernel ram. May sleep.
GFP_NOWAIT - Allocation will not sleep.
GFP_ATOMIC - Allocation will not sleep. May use emergency pools.
Also it is possible to set different flags by OR’ing in one or more of the following additional flags:
__GFP_ZERO - Zero the allocated memory before returning. Also see kzalloc().
__GFP_HIGH - This allocation has high priority and may use emergency pools.
__GFP_NOFAIL - Indicate that this allocation is in no way allowed to fail (think twice before using).
__GFP_NORETRY - If memory is not immediately available, then give up at once.
__GFP_NOWARN - If allocation fails, don’t issue any warnings.
__GFP_RETRY_MAYFAIL - Try really hard to succeed the allocation but fail eventually.
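A hedged sketch of combining these modifiers (hypothetical helper): prefer a large buffer but fail quickly and quietly under pressure, then fall back to a smaller one:

    #include <linux/slab.h>
    #include <linux/mm.h>

    static void *example_alloc_buffer(size_t want, size_t *got)
    {
        void *buf;

        buf = kmalloc(want, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
        if (buf) {
            *got = want;
            return buf;
        }

        /* Fall back to a single-page buffer. */
        buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
        *got = buf ? PAGE_SIZE : 0;
        return buf;
    }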
- void *kmalloc_array(size_t n, size_t size, gfp_t flags)¶
allocate memory for an array.
Parameters
size_t n - number of elements.
size_t size - element size.
gfp_t flags - the type of memory to allocate (see kmalloc).
- void *krealloc_array(void *p, size_t new_n, size_t new_size, gfp_t flags)¶
reallocate memory for an array.
Parameters
void *p - pointer to the memory chunk to reallocate
size_t new_n - new number of elements to alloc
size_t new_size - new size of a single member of the array
gfp_t flags - the type of memory to allocate (see kmalloc)
Description
If __GFP_ZERO logic is requested, callers must ensure that, starting with the initial memory allocation, every subsequent call to this API for the same memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully honored by this API.
See krealloc_noprof() for further details.
In any case, the contents of the object pointed to are preserved up to the lesser of the new and old sizes.
- kcalloc¶
kcalloc(n,size,flags)
allocate memory for an array. The memory is set to zero.
Parameters
n - number of elements.
size - element size.
flags - the type of memory to allocate (see kmalloc).
- void *kzalloc(size_t size, gfp_t flags)¶
allocate memory. The memory is set to zero.
Parameters
size_t size - how many bytes of memory are required.
gfp_t flags - the type of memory to allocate (see kmalloc).
- size_t kmalloc_size_roundup(size_t size)¶
Report allocation bucket size for the given size
Parameters
size_t size - Number of bytes to round up from.
Description
This returns the number of bytes that would be available in a kmalloc() allocation of size bytes. For example, a 126 byte request would be rounded up to the next sized kmalloc bucket, 128 bytes. (This is strictly for the general-purpose kmalloc()-based allocations, and is not for the pre-sized kmem_cache_alloc()-based allocations.)
Use this to kmalloc() the full bucket size ahead of time instead of using ksize() to query the size after an allocation.
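A small sketch of that recommendation (hypothetical helper): round the request up to the full bucket size before allocating, so the extra room can be used without a later krealloc():

    #include <linux/slab.h>

    static void *example_alloc_rounded(size_t want, size_t *capacity)
    {
        size_t bucket = kmalloc_size_roundup(want);
        void *buf = kmalloc(bucket, GFP_KERNEL);

        *capacity = buf ? bucket : 0;
        return buf;
    }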
- void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)¶
Allocate an object on the specified node
Parameters
struct kmem_cache *s - The cache to allocate from.
gfp_t gfpflags - See kmalloc().
int node - node number of the target node.
Description
Identical to kmem_cache_alloc but it will allocate memory on the given node, which can improve the performance for cpu bound structures.
Fallback to other node is possible if __GFP_THISNODE is not set.
Return
pointer to the new object or NULL in case of error
- void *kmalloc_nolock(size_t size, gfp_t gfp_flags, int node)¶
Allocate an object of given size from any context.
Parameters
size_t size - size to allocate
gfp_t gfp_flags - GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO, __GFP_NO_OBJ_EXT allowed.
int node - node number of the target node.
Return
pointer to the new object or NULL in case of error. NULL does not mean EBUSY or EAGAIN. It means ENOMEM. There is no reason to call it again and expect !NULL.
- void kmem_cache_free(struct kmem_cache *s, void *x)¶
Deallocate an object
Parameters
struct kmem_cache *s - The cache the allocation was from.
void *x - The previously allocated object.
Description
Free an object which was previously allocated from this cache.
- void kfree(const void *object)¶
free previously allocated memory
Parameters
const void *object - pointer returned by kmalloc() or kmem_cache_alloc()
Description
If object is NULL, no operation is performed.
- void *krealloc_node_align(const void *p, size_t new_size, unsigned long align, gfp_t flags, int nid)¶
reallocate memory. The contents will remain unchanged.
Parameters
const void *p - object to reallocate memory for.
size_t new_size - how many bytes of memory are required.
unsigned long align - desired alignment.
gfp_t flags - the type of memory to allocate.
int nid - NUMA node or NUMA_NO_NODE
Description
If p is NULL, krealloc() behaves exactly like kmalloc(). If new_size is 0 and p is not a NULL pointer, the object pointed to is freed.
Only alignments up to those guaranteed by kmalloc() will be honored. Please see Memory Allocation Guide for more details.
If __GFP_ZERO logic is requested, callers must ensure that, starting with the initial memory allocation, every subsequent call to this API for the same memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully honored by this API.
When slub_debug_orig_size() is off, krealloc() only knows about the bucket size of an allocation (but not the exact size it was allocated with) and hence implements the following semantics for shrinking and growing buffers with __GFP_ZERO:

                new             bucket
    0           size            size
    |-----------|---------------|
    |   keep    |     zero      |

Otherwise, the original allocation size ‘orig_size’ could be used to precisely clear the requested size, and the new size will also be stored as the new ‘orig_size’.
In any case, the contents of the object pointed to are preserved up to the lesser of the new and old sizes.
Return
pointer to the allocated memory or NULL in case of error
- void *__kvmalloc_node(size, b, unsigned long align, gfp_t flags, int node)¶
attempt to allocate physically contiguous memory, but upon failure, fall back to non-contiguous (vmalloc) allocation.
Parameters
size - size of the request.
b - which set of kmalloc buckets to allocate from.
unsigned long align - desired alignment.
gfp_t flags - gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
int node - numa node to allocate from
Description
Only alignments up to those guaranteed by kmalloc() will be honored. Please see Memory Allocation Guide for more details.
Uses kmalloc to get the memory but if the allocation fails then falls back to the vmalloc allocator. Use kvfree for freeing the memory.
GFP_NOWAIT and GFP_ATOMIC are supported, the __GFP_NORETRY modifier is not. __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
Return
pointer to the allocated memory or NULL in case of failure
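A hedged sketch using the kvmalloc_array()/kvfree() wrappers built on this function (the array wrapper is an assumption here, not documented in this section); the table helpers are hypothetical:

    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    static u32 *example_alloc_table(size_t nr_entries)
    {
        /* Physically contiguous if possible, vmalloc fallback otherwise. */
        return kvmalloc_array(nr_entries, sizeof(u32), GFP_KERNEL | __GFP_ZERO);
    }

    static void example_free_table(u32 *table)
    {
        /* kvfree() handles both the kmalloc and vmalloc cases. */
        kvfree(table);
    }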
- void kvfree(const void *addr)¶
Free memory.
Parameters
const void *addr - Pointer to allocated memory.
Description
kvfree frees memory allocated by any of vmalloc(), kmalloc() or kvmalloc(). It is slightly more efficient to use kfree() or vfree() if you are certain that you know which one to use.
Context
Either preemptible task context or not-NMI interrupt.
- void kvfree_sensitive(const void *addr, size_t len)¶
Free a data object containing sensitive information.
Parameters
const void *addr - address of the data object to be freed.
size_t len - length of the data object.
Description
Use the special memzero_explicit() function to clear the content of a kvmalloc’ed object containing sensitive data to make sure that the compiler won’t optimize out the data clearing.
- void *kvrealloc_node_align(const void *p, size_t size, unsigned long align, gfp_t flags, int nid)¶
reallocate memory; contents remain unchanged
Parameters
const void *p - object to reallocate memory for
size_t size - the size to reallocate
unsigned long align - desired alignment
gfp_t flags - the flags for the page level allocator
int nid - NUMA node id
Description
If p is NULL, kvrealloc() behaves exactly like kvmalloc(). If size is 0 and p is not a NULL pointer, the object pointed to is freed.
Only alignments up to those guaranteed by kmalloc() will be honored. Please see Memory Allocation Guide for more details.
If __GFP_ZERO logic is requested, callers must ensure that, starting with the initial memory allocation, every subsequent call to this API for the same memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully honored by this API.
In any case, the contents of the object pointed to are preserved up to the lesser of the new and old sizes.
This function must not be called concurrently with itself or kvfree() for the same memory allocation.
Return
pointer to the allocated memory or NULL in case of error
- struct kmem_cache *__kmem_cache_create_args(const char *name, unsigned int object_size, struct kmem_cache_args *args, slab_flags_t flags)¶
Create a kmem cache.
Parameters
const char *name - A string which is used in /proc/slabinfo to identify this cache.
unsigned int object_size - The size of objects to be created in this cache.
struct kmem_cache_args *args - Additional arguments for the cache creation (see struct kmem_cache_args).
slab_flags_t flags - See the descriptions of individual flags. The common ones are listed in the description below.
Description
Not to be called directly, use the kmem_cache_create() wrapper with the same parameters.
Commonly used flags:
SLAB_ACCOUNT - Account allocations to memcg.
SLAB_HWCACHE_ALIGN - Align objects on cache line boundaries.
SLAB_RECLAIM_ACCOUNT - Objects are reclaimable.
SLAB_TYPESAFE_BY_RCU - Slab page (not individual objects) freeing delayed by a grace period - see the full description before using.
Context
Cannot be called within an interrupt, but can be interrupted.
Return
a pointer to the cache on success, NULL on failure.
- kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags, unsigned int useroffset, unsigned int usersize, void (*ctor)(void *))¶
Create a set of caches that handle dynamic sized allocations via kmem_buckets_alloc()
Parameters
const char *name - A prefix string which is used in /proc/slabinfo to identify this cache. The individual caches will have their sizes as the suffix.
slab_flags_t flags - SLAB flags (see kmem_cache_create() for details).
unsigned int useroffset - Starting offset within an allocation that may be copied to/from userspace.
unsigned int usersize - How many bytes, starting at useroffset, may be copied to/from userspace.
void (*ctor)(void *) - A constructor for the objects, run when new allocations are made.
Description
Cannot be called within an interrupt, but can be interrupted.
Return
a pointer to the cache on success, NULL on failure. When CONFIG_SLAB_BUCKETS is not enabled, ZERO_SIZE_PTR is returned, and subsequent calls to kmem_buckets_alloc() will fall back to kmalloc(). (i.e. callers only need to check for NULL on failure.)
- int kmem_cache_shrink(struct kmem_cache *cachep)¶
Shrink a cache.
Parameters
struct kmem_cache *cachep - The cache to shrink.
Description
Releases as many slabs as possible for a cache. To help debugging, a zero exit status indicates all slabs were released.
Return
0 if all slabs were released, non-zero otherwise
- bool kmem_dump_obj(void *object)¶
Print available slab provenance information
Parameters
void *object - slab object for which to find provenance information.
Description
This function uses pr_cont(), so that the caller is expected to have printed out whatever preamble is appropriate. The provenance information depends on the type of object and on how much debugging is enabled. For a slab-cache object, the fact that it is a slab object is printed, and, if available, the slab name, return address, and stack trace from the allocation and last free path of that object.
Return
true if the pointer is to a not-yet-freed object from kmalloc() or kmem_cache_alloc(), either true or false if the pointer is to an already-freed object, and false otherwise.
- void kfree_sensitive(const void *p)¶
Clear sensitive information in memory before freeing
Parameters
const void *p - object to free memory of
Description
The memory of the object p points to is zeroed before being freed. If p is NULL, kfree_sensitive() does nothing.
Note
this function zeroes the whole allocated buffer which can be a good deal bigger than the requested buffer size passed to kmalloc(). So be careful when using this function in performance sensitive code.
- void kvfree_rcu_barrier(void)¶
Wait until all in-flight kvfree_rcu() calls complete.
Parameters
void - no arguments
Description
Note that a single argument of kvfree_rcu() call has a slow path that triggers synchronize_rcu() followed by freeing a pointer. It is done before the return from the function. Therefore for any single-argument call that will result in a kfree() to a cache that is to be destroyed during module exit, it is the developer’s responsibility to ensure that all such calls have returned before the call to kmem_cache_destroy().
- void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)¶
Wait for in-flight kvfree_rcu() calls on a specific slab cache.
Parameters
struct kmem_cache *s - slab cache to wait for
Description
See the description of kvfree_rcu_barrier() for details.
- void kfree_const(const void *x)¶
conditionally free memory
Parameters
const void *x - pointer to the memory
Description
Function calls kfree only if x is not in the .rodata section.
Virtually Contiguous Mappings¶
- void vm_unmap_aliases(void)¶
unmap outstanding lazy aliases in the vmap layer
Parameters
void - no arguments
Description
The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily to amortize TLB flushing overheads. What this means is that any page you have now, may, in a former life, have been mapped into kernel virtual address by the vmap layer and so there might be some CPUs with TLB entries still referencing that page (additional to the regular 1:1 kernel mapping).
vm_unmap_aliases flushes all such lazy mappings. After it returns, we can be sure that none of the pages we have control over will have any aliases from the vmap layer.
- void vm_unmap_ram(const void *mem, unsigned int count)¶
unmap linear kernel address space set up by vm_map_ram
Parameters
const void *mem - the pointer returned by vm_map_ram
unsigned int count - the count passed to that vm_map_ram call (cannot unmap partial)
- void *vm_map_ram(struct page **pages, unsigned int count, int node)¶
map pages linearly into kernel virtual address (vmalloc space)
Parameters
struct page **pages - an array of pointers to the pages to be mapped
unsigned int count - number of pages
int node - prefer to allocate data structures on this node
Description
If you use this function for less than VMAP_MAX_ALLOC pages, it could be faster than vmap so it’s good. But if you mix long-life and short-life objects with vm_map_ram(), it could consume lots of address space through fragmentation (especially on a 32bit machine). You could see failures in the end. Please use this function for short-lived objects.
Return
a pointer to the address that has been mapped, or NULL on failure
- void vfree(const void *addr)¶
Release memory allocated by vmalloc().
Parameters
const void *addr - Memory base address
Description
Free the virtually continuous memory area starting at addr, as obtained from one of the vmalloc() family of APIs. This will usually also free the physical memory underlying the virtual allocation, but that memory is reference counted, so it will not be freed until the last user goes away.
If addr is NULL, no operation is performed.
Context
May sleep if called not from interrupt context. Must not be called in NMI context (strictly speaking, it could be if we have CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG, but making the calling conventions for vfree() arch-dependent would be a really bad idea).
- void vunmap(const void *addr)¶
Release virtual mapping obtained by vmap().
Parameters
const void *addr - memory base address
Description
Free the virtually contiguous memory area starting at addr, which was created from the page array passed to vmap().
Must not be called in interrupt context.
- void *vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot)¶
map an array of pages into virtually contiguous space
Parameters
struct page **pages - array of page pointers
unsigned int count - number of pages to map
unsigned long flags - vm_area->flags
pgprot_t prot - page protection for the mapping
Description
Maps count pages from pages into contiguous kernel virtual space. If flags contains VM_MAP_PUT_PAGES the ownership of the pages array itself (which must be kmalloc or vmalloc memory) and one reference per page in it are transferred from the caller to vmap(), and will be freed / dropped when vfree() is called on the return value.
Return
the address of the area or NULL on failure
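A minimal sketch (hypothetical helpers), assuming the caller already holds an array of page pointers; the mapping is later torn down with vunmap():

    #include <linux/vmalloc.h>
    #include <linux/mm.h>

    static void *example_map_pages(struct page **pages, unsigned int count)
    {
        return vmap(pages, count, VM_MAP, PAGE_KERNEL);
    }

    static void example_unmap(void *addr)
    {
        vunmap(addr);
    }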
- void *vmap_pfn(unsigned long *pfns, unsigned int count, pgprot_t prot)¶
map an array of PFNs into virtually contiguous space
Parameters
unsigned long *pfns - array of PFNs
unsigned int count - number of pages to map
pgprot_t prot - page protection for the mapping
Description
Maps count PFNs from pfns into contiguous kernel virtual space and returns the start address of the mapping.
- void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask, int node, const void *caller)¶
allocate virtually contiguous memory
Parameters
unsigned long size - allocation size
unsigned long align - desired alignment
gfp_t gfp_mask - flags for the page level allocator
int node - node to use for allocation or NUMA_NO_NODE
const void *caller - caller’s return address
Description
Allocate enough pages to cover size from the page level allocator with gfp_mask flags. Map them into contiguous kernel virtual space.
Semantics of gfp_mask (including reclaim/retry modifiers such as __GFP_NOFAIL) are the same as in __vmalloc_node_range_noprof().
Return
pointer to the allocated memory or NULL on error
- void *vmalloc(unsigned long size)¶
allocate virtually contiguous memory
Parameters
unsigned long size - allocation size
Description
Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space.
For tight control over page level allocator and protection flags use __vmalloc() instead.
Return
pointer to the allocated memory or NULL on error
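A minimal sketch (hypothetical helpers) of the usual pairing with vfree():

    #include <linux/vmalloc.h>

    static void *example_alloc_scratch(unsigned long nbytes)
    {
        /* Virtually contiguous only; may sleep. */
        return vmalloc(nbytes);
    }

    static void example_free_scratch(void *p)
    {
        vfree(p);       /* no-op if p is NULL */
    }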
- void *vmalloc_huge_node(unsigned long size, gfp_t gfp_mask, int node)¶
allocate virtually contiguous memory, allow huge pages
Parameters
unsigned long size - allocation size
gfp_t gfp_mask - flags for the page level allocator
int node - node to use for allocation or NUMA_NO_NODE
Description
Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. If size is greater than or equal to PMD_SIZE, allow using huge pages for the memory.
Return
pointer to the allocated memory or NULL on error
- void *vzalloc(unsigned long size)¶
allocate virtually contiguous memory with zero fill
Parameters
unsigned long size - allocation size
Description
Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. The memory allocated is set to zero.
For tight control over page level allocator and protection flags use __vmalloc() instead.
Return
pointer to the allocated memory or NULL on error
- void *vmalloc_user(unsigned long size)¶
allocate zeroed virtually contiguous memory for userspace
Parameters
unsigned long size - allocation size
Description
The resulting memory area is zeroed so it can be mapped to userspace without leaking data.
Return
pointer to the allocated memory or NULL on error
- void *vmalloc_node(unsigned long size, int node)¶
allocate memory on a specific node
Parameters
unsigned long size - allocation size
int node - numa node
Description
Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space.
For tight control over page level allocator and protection flags use __vmalloc() instead.
Return
pointer to the allocated memory or NULL on error
- void *vzalloc_node(unsigned long size, int node)¶
allocate memory on a specific node with zero fill
Parameters
unsigned long size - allocation size
int node - numa node
Description
Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. The memory allocated is set to zero.
Return
pointer to the allocated memory or NULL on error
- void *vmalloc_32(unsigned long size)¶
allocate virtually contiguous memory (32bit addressable)
Parameters
unsigned long size - allocation size
Description
Allocate enough 32bit PA addressable pages to cover size from the page level allocator and map them into contiguous kernel virtual space.
Return
pointer to the allocated memory or NULL on error
- void *vmalloc_32_user(unsigned long size)¶
allocate zeroed virtually contiguous 32bit memory
Parameters
unsigned long size - allocation size
Description
The resulting memory area is 32bit addressable and zeroed so it can be mapped to userspace without leaking data.
Return
pointer to the allocated memory or NULL on error
- int remap_vmalloc_range(struct vm_area_struct *vma, void *addr, unsigned long pgoff)¶
map vmalloc pages to userspace
Parameters
struct vm_area_struct *vma - vma to cover (map full range of vma)
void *addr - vmalloc memory
unsigned long pgoff - number of pages into addr before first page to map
Return
0 for success, -Exxx on failure
Description
This function checks that addr is a valid vmalloc’ed area, and that it is big enough to cover the vma. Will return failure if that criteria isn’t met.
Similar to remap_pfn_range() (see mm/memory.c)
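A hedged sketch of a driver mmap handler built on this function; the buffer is assumed to have been allocated with vmalloc_user() elsewhere, and all names are hypothetical:

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    /* Assumed to point at a vmalloc_user() allocation sized for the vma. */
    static void *example_shared_buf;

    static int example_mmap(struct file *file, struct vm_area_struct *vma)
    {
        return remap_vmalloc_range(vma, example_shared_buf, vma->vm_pgoff);
    }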
File Mapping and Page Cache¶
Filemap¶
- int filemap_fdatawrite_range(struct address_space *mapping, loff_t start, loff_t end)¶
start writeback on mapping dirty pages in range
Parameters
struct address_space *mapping - address space structure to write
loff_t start - offset in bytes where the range starts
loff_t end - offset in bytes where the range ends (inclusive)
Description
Start writeback against all of a mapping’s dirty pages that lie within the byte offsets <start, end> inclusive.
This is a data integrity operation that waits upon dirty or in writeback pages.
Return
0 on success, negative error code otherwise.
- int filemap_flush_range(struct address_space *mapping, loff_t start, loff_t end)¶
start writeback on a range
Parameters
struct address_space *mapping - target address_space
loff_t start - index to start writeback on
loff_t end - last (inclusive) index for writeback
Description
This is a non-integrity writeback helper, to start writing back folios for the indicated range.
Return
0 on success, negative error code otherwise.
- int filemap_flush(struct address_space *mapping)¶
mostly a non-blocking flush
Parameters
struct address_space *mapping - target address_space
Description
This is a mostly non-blocking flush. Not suitable for data-integrity purposes - I/O may not be started against all dirty pages.
Return
0 on success, negative error code otherwise.
- bool filemap_range_has_page(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶
check if a page exists in range.
Parameters
struct address_space *mapping - address space within which to check
loff_t start_byte - offset in bytes where the range starts
loff_t end_byte - offset in bytes where the range ends (inclusive)
Description
Find at least one page in the range supplied, usually used to check if direct writing in this range will trigger a writeback.
Return
true if at least one page exists in the specified range, false otherwise.
- int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶
wait for writeback to complete
Parameters
struct address_space *mapping - address space structure to wait for
loff_t start_byte - offset in bytes where the range starts
loff_t end_byte - offset in bytes where the range ends (inclusive)
Description
Walk the list of under-writeback pages of the given address space in the given range and wait for all of them. Check error status of the address space and return it.
Since the error status of the address space is cleared by this function, callers are responsible for checking the return value and handling and/or reporting the error.
Return
error status of the address space.
- int filemap_fdatawait_range_keep_errors(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶
wait for writeback to complete
Parameters
struct address_space *mapping - address space structure to wait for
loff_t start_byte - offset in bytes where the range starts
loff_t end_byte - offset in bytes where the range ends (inclusive)
Description
Walk the list of under-writeback pages of the given address space in the given range and wait for all of them. Unlike filemap_fdatawait_range(), this function does not clear the error status of the address space.
Use this function if callers don’t handle errors themselves. Expected call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), fsfreeze(8)
- int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)¶
wait for writeback to complete
Parameters
struct file *file - file pointing to address space structure to wait for
loff_t start_byte - offset in bytes where the range starts
loff_t end_byte - offset in bytes where the range ends (inclusive)
Description
Walk the list of under-writeback pages of the address space that file refers to, in the given range and wait for all of them. Check error status of the address space vs. the file->f_wb_err cursor and return it.
Since the error status of the file is advanced by this function, callers are responsible for checking the return value and handling and/or reporting the error.
Return
error status of the address space vs. the file->f_wb_err cursor.
- int filemap_fdatawait_keep_errors(struct address_space *mapping)¶
wait for writeback without clearing errors
Parameters
struct address_space *mapping - address space structure to wait for
Description
Walk the list of under-writeback pages of the given address space and wait for all of them. Unlike filemap_fdatawait(), this function does not clear the error status of the address space.
Use this function if callers don’t handle errors themselves. Expected call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), fsfreeze(8)
Return
error status of the address space.
- int filemap_write_and_wait_range(struct address_space *mapping, loff_t lstart, loff_t lend)¶
write out & wait on a file range
Parameters
struct address_space *mapping - the address_space for the pages
loff_t lstart - offset in bytes where the range starts
loff_t lend - offset in bytes where the range ends (inclusive)
Description
Write out and wait upon file offsets lstart->lend, inclusive.
Note that lend is inclusive (describes the last byte to be written) so that this function can be used to write to the very end-of-file (end = -1).
Return
error status of the address space.
- int file_check_and_advance_wb_err(struct file *file)¶
report wb error (if any) that was previously and advance wb_err to current one
Parameters
struct file *file - struct file on which the error is being reported
Description
When userland calls fsync (or something like nfsd does the equivalent), we want to report any writeback errors that occurred since the last fsync (or since the file was opened if there haven’t been any).
Grab the wb_err from the mapping. If it matches what we have in the file, then just quickly return 0. The file is all caught up.
If it doesn’t match, then take the mapping value, set the “seen” flag in it and try to swap it into place. If it works, or another task beat us to it with the new value, then update the f_wb_err and return the error portion. The error at this point must be reported via proper channels (a’la fsync, or NFS COMMIT operation, etc.).
While we handle mapping->wb_err with atomic operations, the f_wb_err value is protected by the f_lock since we must ensure that it reflects the latest value swapped in for this file descriptor.
Return
0 on success, negative error code otherwise.
- int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)¶
write out & wait on a file range
Parameters
struct file *file - file pointing to address_space with pages
loff_t lstart - offset in bytes where the range starts
loff_t lend - offset in bytes where the range ends (inclusive)
Description
Write out and wait upon file offsets lstart->lend, inclusive.
Note that lend is inclusive (describes the last byte to be written) so that this function can be used to write to the very end-of-file (end = -1).
After writing out and waiting on the data, we check and advance the f_wb_err cursor to the latest value, and return any errors detected there.
Return
0 on success, negative error code otherwise.
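A hedged sketch of where this is typically called, the data-writeback half of a simple fsync implementation (names hypothetical; a real filesystem would also flush metadata):

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    static int example_fsync(struct file *file, loff_t start, loff_t end,
                             int datasync)
    {
        int err;

        err = file_write_and_wait_range(file, start, end);
        if (err)
            return err;

        /* Metadata writeback would go here. */
        return 0;
    }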
- void replace_page_cache_folio(struct folio *old, struct folio *new)¶
replace a pagecache folio with a new one
Parameters
struct folio *old - folio to be replaced
struct folio *new - folio to replace with
Description
This function replaces a folio in the pagecache with a new one. On success it acquires the pagecache reference for the new folio and drops it for the old folio. Both the old and new folios must be locked. This function does not add the new folio to the LRU, the caller must do that.
The remove + add is atomic. This function cannot fail.
- void folio_unlock(struct folio *folio)¶
Unlock a locked folio.
Parameters
struct folio *folio - The folio.
Description
Unlocks the folio and wakes up any thread sleeping on the page lock.
Context
May be called from interrupt or process context. May not be called from NMI context.
- void folio_end_read(struct folio *folio, bool success)¶
End read on a folio.
Parameters
struct folio *folio - The folio.
bool success - True if all reads completed successfully.
Description
When all reads against a folio have completed, filesystems should call this function to let the pagecache know that no more reads are outstanding. This will unlock the folio and wake up any thread sleeping on the lock. The folio will also be marked uptodate if all reads succeeded.
Context
May be called from interrupt or process context. May not be called from NMI context.
- void folio_end_private_2(struct folio *folio)¶
Clear PG_private_2 and wake any waiters.
Parameters
struct folio *folio - The folio.
Description
Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for it. The folio reference held for PG_private_2 being set is released.
This is, for example, used when a netfs folio is being written to a local disk cache, thereby allowing writes to the cache for the same folio to be serialised.
- void folio_wait_private_2(struct folio *folio)¶
Wait for PG_private_2 to be cleared on a folio.
Parameters
struct folio *folio - The folio to wait on.
Description
Wait for PG_private_2 to be cleared on a folio.
- int folio_wait_private_2_killable(struct folio *folio)¶
Wait for PG_private_2 to be cleared on a folio, killable.
Parameters
struct folio *folio - The folio to wait on.
Description
Wait for PG_private_2 to be cleared on a folio or until a fatal signal is received by the calling task.
Return
0 if successful.
-EINTR if a fatal signal was encountered.
Parameters
struct folio *folio - The folio.
Description
The folio must actually be under writeback. This call is intended for filesystems that need to defer dropbehind.
Context
May be called from process or interrupt context.
Parameters
struct folio *folio - The folio.
Description
The folio must actually be under writeback.
Context
May be called from process or interrupt context.
Parameters
struct folio *folio - The folio to lock
- pgoff_t page_cache_next_miss(struct address_space *mapping, pgoff_t index, unsigned long max_scan)¶
Find the next gap in the page cache.
Parameters
struct address_space *mapping - Mapping.
pgoff_t index - Index.
unsigned long max_scan - Maximum range to search.
Description
Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the gap with the lowest index.
This function may be called under the rcu_read_lock. However, this will not atomically search a snapshot of the cache at a single point in time. For example, if a gap is created at index 5, then subsequently a gap is created at index 10, page_cache_next_miss covering both indices may return 10 if called under the rcu_read_lock.
Return
The index of the gap if found, otherwise an index outside the range specified (in which case ‘return - index >= max_scan’ will be true). In the rare case of index wrap-around, 0 will be returned.
- pgoff_t page_cache_prev_miss(struct address_space *mapping, pgoff_t index, unsigned long max_scan)¶
Find the previous gap in the page cache.
Parameters
struct address_space *mapping - Mapping.
pgoff_t index - Index.
unsigned long max_scan - Maximum range to search.
Description
Search the range [max(index - max_scan + 1, 0), index] for the gap with the highest index.
This function may be called under the rcu_read_lock. However, this will not atomically search a snapshot of the cache at a single point in time. For example, if a gap is created at index 10, then subsequently a gap is created at index 5, page_cache_prev_miss() covering both indices may return 5 if called under the rcu_read_lock.
Return
The index of the gap if found, otherwise an index outside the range specified (in which case ‘index - return >= max_scan’ will be true). In the rare case of wrap-around, ULONG_MAX will be returned.
- struct folio *__filemap_get_folio_mpol(struct address_space *mapping, pgoff_t index, fgf_t fgp_flags, gfp_t gfp, struct mempolicy *policy)¶
Find and get a reference to a folio.
Parameters
struct address_space *mapping - The address_space to search.
pgoff_t index - The page index.
fgf_t fgp_flags - FGP flags modify how the folio is returned.
gfp_t gfp - Memory allocation flags to use if FGP_CREAT is specified.
struct mempolicy *policy - NUMA memory allocation policy to follow.
Description
Looks up the page cache entry at mapping & index.
If FGP_LOCK or FGP_CREAT are specified then the function may sleep even if the GFP flags specified for FGP_CREAT are atomic.
If this function returns a folio, it is returned with an increased refcount.
Return
The found folio or an ERR_PTR() otherwise.
- unsignedfilemap_get_folios(structaddress_space*mapping,pgoff_t*start,pgoff_tend,structfolio_batch*fbatch)¶
Get a batch of folios
Parameters
structaddress_space*mappingThe address_space to search
pgoff_t*startThe starting page index
pgoff_tendThe final page index (inclusive)
structfolio_batch*fbatchThe batch to fill.
Description
Search for and return a batch of folios in the mapping starting atindexstart and up to indexend (inclusive). The folios are returnedinfbatch with an elevated reference count.
Return
The number of folios which were found.We also updatestart to index the next folio for the traversal.
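A hedged sketch of the usual batch loop: repeatedly fill a folio_batch, process each folio, and release the references; example_process_folio() is a hypothetical per-folio callback, not a kernel helper.

#include <linux/pagemap.h>
#include <linux/pagevec.h>

static void example_process_folio(struct folio *folio);        /* hypothetical */

static void example_walk_range(struct address_space *mapping,
                               pgoff_t start, pgoff_t end)
{
        struct folio_batch fbatch;
        unsigned int i;

        folio_batch_init(&fbatch);
        while (filemap_get_folios(mapping, &start, end, &fbatch)) {
                for (i = 0; i < folio_batch_count(&fbatch); i++)
                        example_process_folio(fbatch.folios[i]);
                /* Drop the elevated references taken by the lookup. */
                folio_batch_release(&fbatch);
        }
}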
- unsignedfilemap_get_folios_contig(structaddress_space*mapping,pgoff_t*start,pgoff_tend,structfolio_batch*fbatch)¶
Get a batch of contiguous folios
Parameters
structaddress_space*mappingThe address_space to search
pgoff_t*startThe starting page index
pgoff_tendThe final page index (inclusive)
structfolio_batch*fbatchThe batch to fill
Description
filemap_get_folios_contig() works exactly likefilemap_get_folios(),except the returned folios are guaranteed to be contiguous. This maynot return all contiguous folios if the batch gets filled up.
Return
The number of folios found.Also updatestart to be positioned for traversal of the next folio.
- unsignedfilemap_get_folios_tag(structaddress_space*mapping,pgoff_t*start,pgoff_tend,xa_mark_ttag,structfolio_batch*fbatch)¶
Get a batch of folios matchingtag
Parameters
structaddress_space*mappingThe address_space to search
pgoff_t*startThe starting page index
pgoff_tendThe final page index (inclusive)
xa_mark_ttagThe tag index
structfolio_batch*fbatchThe batch to fill
Description
The first folio may start beforestart; if it does, it will containstart. The final folio may extend beyondend; if it does, it willcontainend. The folios have ascending indices. There may be gapsbetween the folios if there are indices which have no folio in thepage cache. If folios are added to or removed from the page cachewhile this is running, they may or may not be found by this call.Only returns folios that are tagged withtag.
Return
The number of folios found.Also updatestart to index the next folio for traversal.
- ssize_tfilemap_read(structkiocb*iocb,structiov_iter*iter,ssize_talready_read)¶
Read data from the page cache.
Parameters
structkiocb*iocbThe iocb to read.
structiov_iter*iterDestination for the data.
ssize_talready_readNumber of bytes already read by the caller.
Description
Copies data from the page cache. If the data is not currently present,uses the readahead and read_folio address_space operations to fetch it.
Return
Total number of bytes copied, including those already read bythe caller. If an error happens before any bytes are copied, returnsa negative error number.
- ssize_tgeneric_file_read_iter(structkiocb*iocb,structiov_iter*iter)¶
generic filesystem read routine
Parameters
structkiocb*iocbkernel I/O control block
structiov_iter*iterdestination for the data read
Description
This is the “read_iter()” routine for all filesystemsthat can use the page cache directly.
The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shallbe returned when no data can be read without waiting for I/O requeststo complete; it doesn’t prevent readahead.
The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/Orequests shall be made for the read or for readahead. When no datacan be read, -EAGAIN shall be returned. When readahead would betriggered, a partial, possibly empty read shall be returned.
Return
number of bytes copied, even for partial reads
negative error code (or 0 if IOCB_NOIO) if nothing was read
- ssize_tfilemap_splice_read(structfile*in,loff_t*ppos,structpipe_inode_info*pipe,size_tlen,unsignedintflags)¶
Splice data from a file’s pagecache into a pipe
Parameters
structfile*inThe file to read from
loff_t*pposPointer to the file position to read from
structpipe_inode_info*pipeThe pipe to splice into
size_tlenThe amount to splice
unsignedintflagsThe SPLICE_F_* flags
Description
This function gets folios from a file’s pagecache and splices them into thepipe. Readahead will be called as necessary to fill more folios. This maybe used for blockdevs also.
Return
On success, the number of bytes read will be returned and*pposwill be updated if appropriate; 0 will be returned if there is no more datato be read; -EAGAIN will be returned if the pipe had no space, and someother negative error code will be returned on error. A short read may occurif the pipe has insufficient space, we reach the end of the data or we hit ahole.
- vm_fault_tfilemap_fault(structvm_fault*vmf)¶
read in file data for page fault handling
Parameters
structvm_fault*vmfstructvm_faultcontaining details of the fault
Description
filemap_fault() is invoked via the vma operations vector for amapped memory region to read in file data during a page fault.
The goto’s are kind of ugly, but this streamlines the normal case of havingit in the page cache, and handles the special cases reasonably withouthaving a lot of duplicated code.
vma->vm_mm->mmap_lock must be held on entry.
If our return value has VM_FAULT_RETRY set, it’s because the mmap_lockmay be dropped before doing I/O or bylock_folio_maybe_drop_mmap().
If our return value does not have VM_FAULT_RETRY set, the mmap_lockhas not been released.
We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set.
Return
bitwise-OR ofVM_FAULT_ codes.
- structfolio*read_cache_folio(structaddress_space*mapping,pgoff_tindex,filler_tfiller,structfile*file)¶
Read into page cache, fill it if needed.
Parameters
structaddress_space*mappingThe address_space to read from.
pgoff_tindexThe index to read.
filler_tfillerFunction to perform the read, or NULL to use aops->read_folio().
structfile*filePassed to filler function, may be NULL if not required.
Description
Read one page into the page cache. If it succeeds, the folio returnedwill containindex, but it may not be the first page of the folio.
If the filler function returns an error, it will be returned to thecaller.
Context
May sleep. Expects mapping->invalidate_lock to be held.
Return
An uptodate folio on success,ERR_PTR() on failure.
- structfolio*mapping_read_folio_gfp(structaddress_space*mapping,pgoff_tindex,gfp_tgfp)¶
Read into page cache, using specified allocation flags.
Parameters
structaddress_space*mappingThe address_space for the folio.
pgoff_tindexThe index that the allocated folio will contain.
gfp_tgfpThe page allocator flags to use if allocating.
Description
This is the same as “read_cache_folio(mapping, index, NULL, NULL)”, but withany new memory allocations done using the specified allocation flags.
The most likely error from this function is EIO, but ENOMEM ispossible and so is EINTR. If ->read_folio returns another error,that will be returned to the caller.
The function expects mapping->invalidate_lock to be already held.
Return
Uptodate folio on success,ERR_PTR() on failure.
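A hedged usage sketch: read one folio with GFP_NOFS to avoid filesystem recursion, then drop the reference; error handling follows the ERR_PTR() convention stated above.

#include <linux/pagemap.h>
#include <linux/err.h>

/* The caller is assumed to hold mapping->invalidate_lock, as required above. */
static int example_read_one(struct address_space *mapping, pgoff_t index)
{
        struct folio *folio;

        folio = mapping_read_folio_gfp(mapping, index, GFP_NOFS);
        if (IS_ERR(folio))
                return PTR_ERR(folio);

        /* The folio is uptodate here; inspect or copy its contents. */
        folio_put(folio);
        return 0;
}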
- structpage*read_cache_page_gfp(structaddress_space*mapping,pgoff_tindex,gfp_tgfp)¶
read into page cache, using specified page allocation flags.
Parameters
structaddress_space*mappingthe page’s address_space
pgoff_tindexthe page index
gfp_tgfpthe page allocator flags to use if allocating
Description
This is the same as “read_mapping_page(mapping, index, NULL)”, but withany new page allocations done using the specified allocation flags.
If the page does not get brought uptodate, return -EIO.
The function expects mapping->invalidate_lock to be already held.
Return
up to date page on success,ERR_PTR() on failure.
- ssize_t__generic_file_write_iter(structkiocb*iocb,structiov_iter*from)¶
write data to a file
Parameters
structkiocb*iocbIO state structure (file, offset, etc.)
structiov_iter*fromiov_iter with data to write
Description
This function does all the work needed for actually writing data to afile. It does all basic checks, removes SUID from the file, updatesmodification times and calls proper subroutines depending on whether wedo direct IO or a standard buffered write.
It expects i_rwsem to be grabbed unless we work on a block device or similarobject which does not need locking at all.
This function doesnot take care of syncing data in case of O_SYNC write.A caller has to handle it. This is mainly due to the fact that we want toavoid syncing under i_rwsem.
Return
number of bytes written, even for truncated writes
negative error code if no data has been written at all
- ssize_tgeneric_file_write_iter(structkiocb*iocb,structiov_iter*from)¶
write data to a file
Parameters
structkiocb*iocbIO state structure
structiov_iter*fromiov_iter with data to write
Description
This is a wrapper around__generic_file_write_iter() to be used by mostfilesystems. It takes care of syncing the file in case of O_SYNC fileand acquires i_rwsem as needed.
Return
negative error code if no data has been written at all or vfs_fsync_range() failed for a synchronous write
number of bytes written, even for truncated writes
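As a hedged sketch of the split described above, a filesystem that cannot use generic_file_write_iter() directly (for instance because it takes i_rwsem itself) might wrap __generic_file_write_iter() roughly like this; example_file_write_iter() is illustrative.

#include <linux/fs.h>

static ssize_t example_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        inode_lock(inode);
        ret = generic_write_checks(iocb, from);
        if (ret > 0)
                ret = __generic_file_write_iter(iocb, from);
        inode_unlock(inode);

        /* __generic_file_write_iter() leaves O_SYNC handling to the caller. */
        if (ret > 0)
                ret = generic_write_sync(iocb, ret);
        return ret;
}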
Parameters
structfolio*folioThe folio which the kernel is trying to free.
gfp_tgfpMemory allocation flags (and I/O mode).
Description
The address_space is trying to release any data attached to a folio(presumably at folio->private).
This will also be called if the private_2 flag is set on a page,indicating that the folio has other metadata associated with it.
Thegfp argument specifies whether I/O may be performed to releasethis page (__GFP_IO), and whether the call may block(__GFP_RECLAIM & __GFP_FS).
Return
true if the release was successful, otherwisefalse.
- intfilemap_invalidate_inode(structinode*inode,boolflush,loff_tstart,loff_tend)¶
Invalidate/forcibly write back a range of an inode’s pagecache
Parameters
structinode*inodeThe inode to flush
boolflushSet to write back rather than simply invalidate.
loff_tstartFirst byte in range.
loff_tendLast byte in range (inclusive), or LLONG_MAX for everything from startonwards.
Description
Invalidate all the folios on an inode that contribute to the specifiedrange, possibly writing them back first. Whilst the operation isundertaken, the invalidate lock is held to prevent new folios from beinginstalled.
Readahead¶
Readahead is used to read content into the page cache before it isexplicitly requested by the application. Readahead only everattempts to read folios that are not yet in the page cache. If afolio is present but not up-to-date, readahead will not try to readit. In that case a simple ->read_folio() will be requested.
Readahead is triggered when an application read request (whether asystem call or a page fault) finds that the requested folio is not inthe page cache, or that it is in the page cache and has thereadahead flag set. This flag indicates that the folio was readas part of a previous readahead request and now that it has beenaccessed, it is time for the next readahead.
Each readahead request is partly synchronous read, and partly asyncreadahead. This is reflected in thestructfile_ra_state whichcontains ->size being the total number of pages, and ->async_sizewhich is the number of pages in the async section. The readaheadflag will be set on the first folio in this async section to triggera subsequent readahead. Once a series of sequential reads has beenestablished, there should be no need for a synchronous component andall readahead request will be fully asynchronous.
When either of the triggers causes a readahead, three numbers needto be determined: the start of the region to read, the size of theregion, and the size of the async tail.
The start of the region is simply the first page address at or afterthe accessed address, which is not currently populated in the pagecache. This is found with a simple search in the page cache.
The size of the async tail is determined by subtracting the size thatwas explicitly requested from the determined request size, unlessthis would be less than zero - then zero is used. NOTE THISCALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSEDPAGE. ALSO THIS CALCULATION IS NOT USED CONSISTENTLY.
The size of the region is normally determined from the size of theprevious readahead which loaded the preceding pages. This may bediscovered from thestructfile_ra_state for simple sequential reads,or from examining the state of the page cache when multiplesequential reads are interleaved. Specifically: where the readaheadwas triggered by the readahead flag, the size of the previousreadahead is assumed to be the number of pages from the triggeringpage to the start of the new readahead. In these cases, the size ofthe previous readahead is scaled, often doubled, for the newreadahead, though seeget_next_ra_size() for details.
If the size of the previous read cannot be determined, the number ofpreceding pages in the page cache is used to estimate the size ofa previous read. This estimate could easily be misled by randomreads being coincidentally adjacent, so it is ignored unless it islarger than the current request, and it is not scaled up, unless itis at the start of file.
In general readahead is accelerated at the start of the file, asreads from there are often sequential. There are other minoradjustments to the readahead size in various special cases and theseare best discovered by reading the code.
The above calculation, based on the previous readahead size,determines the size of the readahead, to which any requested readsize may be added.
Readahead requests are sent to the filesystem using the ->readahead()address space operation, for whichmpage_readahead() is a canonicalimplementation. ->readahead() should normally initiate reads on allfolios, but may fail to read any or all folios without causing an I/Oerror. The page cache reading code will issue a ->read_folio() requestfor any folio which ->readahead() did not read, and only an errorfrom this will be final.
->readahead() will generally callreadahead_folio() repeatedly to geteach folio from those prepared for readahead. It may fail to read afolio by:
- not calling readahead_folio() sufficiently many times, effectively ignoring some folios, as might be appropriate if the path to storage is congested,
- failing to actually submit a read request for a given folio, possibly due to insufficient resources, or
- getting an error during subsequent processing of a request.
In the last two cases, the folio should be unlocked by the filesystemto indicate that the read attempt has failed. In the first case thefolio will be unlocked by the VFS.
Those folios not in the finalasync_size of the request should beconsidered to be important and ->readahead() should not fail them dueto congestion or temporary resource unavailability, but should waitfor necessary resources (e.g. memory or indexing information) tobecome available. Folios in the finalasync_size may beconsidered less urgent and failure to read them is more acceptable.In this case it is best to usefilemap_remove_folio() to remove thefolios from the page cache as is automatically done for folios thatwere not fetched withreadahead_folio(). This will allow asubsequent synchronous readahead request to try them again. If theyare left in the page cache, then they will be read individually using->read_folio() which may be less efficient.
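A hedged sketch of an ->readahead() implementation following the loop described above; example_submit_read() stands in for however the filesystem actually starts I/O on a folio and is not a real kernel helper.

#include <linux/pagemap.h>

static int example_submit_read(struct folio *folio);    /* hypothetical */

static void example_readahead(struct readahead_control *ractl)
{
        struct folio *folio;

        while ((folio = readahead_folio(ractl))) {
                if (example_submit_read(folio) < 0) {
                        /*
                         * Read not submitted: unlock so the caller can
                         * fall back to ->read_folio() for this folio.
                         */
                        folio_unlock(folio);
                }
        }
}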
- voidpage_cache_ra_unbounded(structreadahead_control*ractl,unsignedlongnr_to_read,unsignedlonglookahead_size)¶
Start unchecked readahead.
Parameters
structreadahead_control*ractlReadahead control.
unsignedlongnr_to_readThe number of pages to read.
unsignedlonglookahead_sizeWhere to start the next readahead.
Description
This function is for filesystems to call when they want to startreadahead beyond a file’s stated i_size. This is almost certainlynot the function you want to call. Usepage_cache_async_readahead()orpage_cache_sync_readahead() instead.
Context
File is referenced by caller. Mutexes may be held by caller.May sleep, but will not reenter filesystem to reclaim memory.
- voidreadahead_expand(structreadahead_control*ractl,loff_tnew_start,size_tnew_len)¶
Expand a readahead request
Parameters
structreadahead_control*ractlThe request to be expanded
loff_tnew_startThe revised start
size_tnew_lenThe revised size of the request
Description
Attempt to expand a readahead request outwards from the current size to thespecified size by inserting locked pages before and after the current windowto increase the size to the new window. This may involve the insertion ofTHPs, in which case the window may get expanded even beyond what wasrequested.
The algorithm will stop if it encounters a conflicting page already in thepagecache and leave a smaller expansion than requested.
The caller must check for this by examining the revisedractl object for adifferent expansion than was requested.
Writeback¶
- intbalance_dirty_pages_ratelimited_flags(structaddress_space*mapping,unsignedintflags)¶
Balance dirty memory state.
Parameters
structaddress_space*mappingaddress_space which was dirtied.
unsignedintflagsBDP flags.
Description
Processes which are dirtying memory should call in here once for each pagewhich was newly dirtied. The function will periodically check the system’sdirty state and will initiate writeback if needed.
Seebalance_dirty_pages_ratelimited() for details.
Return
Ifflags contains BDP_ASYNC, it may return -EAGAIN toindicate that memory is out of balance and the caller must waitfor I/O to complete. Otherwise, it will return 0 to indicatethat either memory was already in balance, or it was able to sleepuntil the amount of dirty memory returned to balance.
- voidbalance_dirty_pages_ratelimited(structaddress_space*mapping)¶
balance dirty memory state.
Parameters
structaddress_space*mappingaddress_space which was dirtied.
Description
Processes which are dirtying memory should call in here once for each pagewhich was newly dirtied. The function will periodically check the system’sdirty state and will initiate writeback if needed.
Once we’re over the dirty memory limit we decrease the ratelimitingby a lot, to prevent individual processes from overshooting the limitby (ratelimit_pages) each.
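A hedged sketch of the calling convention: throttle once per newly dirtied page; example_dirty_one_folio() is a hypothetical stand-in for whatever actually dirties the pagecache.

#include <linux/writeback.h>

static void example_dirty_one_folio(struct address_space *mapping,
                                    pgoff_t index);     /* hypothetical */

static void example_dirty_range(struct address_space *mapping,
                                pgoff_t first, pgoff_t last)
{
        pgoff_t index;

        for (index = first; index <= last; index++) {
                example_dirty_one_folio(mapping, index);
                balance_dirty_pages_ratelimited(mapping);
        }
}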
- voidtag_pages_for_writeback(structaddress_space*mapping,pgoff_tstart,pgoff_tend)¶
tag pages to be written by writeback
Parameters
structaddress_space*mappingaddress space structure to write
pgoff_tstartstarting page index
pgoff_tendending page index (inclusive)
Description
This function scans the page range fromstart toend (inclusive) and tagsall pages that have DIRTY tag set with a special TOWRITE tag. The callercan then use the TOWRITE tag to identify pages eligible for writeback.This mechanism is used to avoid livelocking of writeback by a processsteadily creating new dirty pages in the file (thus it is important for thisfunction to be quick so that it can tag pages faster than a dirtying processcan create them).
- structfolio*writeback_iter(structaddress_space*mapping,structwriteback_control*wbc,structfolio*folio,int*error)¶
iterate folio of a mapping for writeback
Parameters
structaddress_space*mappingaddress space structure to write
structwriteback_control*wbcwriteback context
structfolio*foliopreviously iterated folio (NULL to start)
int*errorin-out pointer for writeback errors (see below)
Description
This function returns the next folio for the writeback operation described bywbc onmapping and should be called in a while loop in the ->writepagesimplementation.
To start the writeback operation,NULL is passed in thefolio argument, andfor every subsequent iteration the folio returned previously should be passedback in.
If there was an error in the per-folio writeback inside thewriteback_iter()loop,error should be set to the error value.
Once the writeback described inwbc has finished, this function will returnNULL and if there was an error in any iteration restore it toerror.
Note
callers should not manually break out of the loop using break or gotobut must keep callingwriteback_iter() until it returnsNULL.
Return
the folio to write orNULL if the loop is done.
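A hedged sketch of the while-loop shape described above for a ->writepages implementation; example_write_folio() is a placeholder for the filesystem's own per-folio writeback.

#include <linux/writeback.h>

static int example_write_folio(struct folio *folio,
                               struct writeback_control *wbc);  /* hypothetical */

static int example_writepages(struct address_space *mapping,
                              struct writeback_control *wbc)
{
        struct folio *folio = NULL;
        int error = 0;

        while ((folio = writeback_iter(mapping, wbc, folio, &error)))
                error = example_write_folio(folio, wbc);

        /* writeback_iter() restores the first error once the loop ends. */
        return error;
}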
- boolfilemap_dirty_folio(structaddress_space*mapping,structfolio*folio)¶
Mark a folio dirty for filesystems which do not use buffer_heads.
Parameters
structaddress_space*mappingAddress space this folio belongs to.
structfolio*folioFolio to be marked as dirty.
Description
Filesystems which do not use buffer heads should call this functionfrom their dirty_folio address space operation. It ignores thecontents offolio_get_private(), so if the filesystem marks individualblocks as dirty, the filesystem should handle that itself.
This is also sometimes used by filesystems which use buffer_heads whena single buffer is being dirtied: we want to set the folio dirty inthat case, but not all the buffers. This is a “bottom-up” dirtying,whereasblock_dirty_folio() is a “top-down” dirtying.
The caller must ensure this doesn’t race with truncation. Most willsimply hold the folio lock, but e.g.zap_pte_range() calls with thefolio mapped and the pte lock held, which also locks out truncation.
- boolfolio_redirty_for_writepage(structwriteback_control*wbc,structfolio*folio)¶
Decline to write a dirty folio.
Parameters
structwriteback_control*wbcThe writeback control.
structfolio*folioThe folio.
Description
When a writepage implementation decides that it doesn’t want to writefolio for some reason, it should call this function, unlockfolio andreturn 0.
Return
True if we redirtied the folio. False if someone else dirtiedit first.
Parameters
structfolio*folioThe folio.
Description
The folio may not be truncated while this function is running.Holding the folio lock is sufficient to prevent truncation, but somecallers cannot acquire a sleeping lock. These callers instead holdthe page table lock for a page table which contains at least one pagein this folio. Truncation will block on the page table lock as itunmaps pages before removing the folio from its mapping.
Return
True if the folio was newly dirtied, false if it was already dirty.
Parameters
structfolio*folioThe folio to wait for.
Description
If the folio is currently being written back to storage, wait for theI/O to complete.
Context
Sleeps. Must be called in process context and withno spinlocks held. Caller should hold a reference on the folio.If the folio is not locked, writeback may start again after writebackhas finished.
Parameters
structfolio*folioThe folio to wait for.
Description
If the folio is currently being written back to storage, wait for theI/O to complete or a fatal signal to arrive.
Context
Sleeps. Must be called in process context and withno spinlocks held. Caller should hold a reference on the folio.If the folio is not locked, writeback may start again after writebackhas finished.
Return
0 on success, -EINTR if we get a fatal signal while waiting.
Parameters
structfolio*folioThe folio to wait on.
Description
This function determines if the given folio is related to a backingdevice that requires folio contents to be held stable during writeback.If so, then it will wait for any pending writeback to complete.
Context
Sleeps. Must be called in process context and withno spinlocks held. Caller should hold a reference on the folio.If the folio is not locked, writeback may start again after writebackhas finished.
Truncate¶
- voidfolio_invalidate(structfolio*folio,size_toffset,size_tlength)¶
Invalidate part or all of a folio.
Parameters
structfolio*folioThe folio which is affected.
size_toffsetstart of the range to invalidate
size_tlengthlength of the range to invalidate
Description
folio_invalidate() is called when all or part of the folio has becomeinvalidated by a truncate operation.
folio_invalidate() does not have to release all buffers, but it mustensure that no dirty buffer is left outsideoffset and that no I/Ois underway against any of the blocks which are outside the truncationpoint. Because the caller is about to free (and possibly reuse) thoseblocks on-disk.
- voidtruncate_inode_pages_range(structaddress_space*mapping,loff_tlstart,uoff_tlend)¶
truncate range of pages specified by start & end byte offsets
Parameters
structaddress_space*mappingmapping to truncate
loff_tlstartoffset from which to truncate
uoff_tlendoffset to which to truncate (inclusive)
Description
Truncate the page cache, removing the pages that are betweenspecified offsets (and zeroing out partial pagesif lstart or lend + 1 is not page aligned).
Truncate takes two passes - the first pass is nonblocking. It will notblock on page locks and it will not block on writeback. The second passwill wait. This is to prevent as much IO as possible in the affected region.The first pass will remove most pages, so the search cost of the second passis low.
We pass down the cache-hot hint to the page freeing code. Even if themapping is large, it is probably the case that the final pages are the mostrecently touched, and freeing happens in ascending file offset order.
Note that since ->invalidate_folio() accepts range to invalidatetruncate_inode_pages_range is able to handle cases where lend + 1 is notpage aligned properly.
- voidtruncate_inode_pages(structaddress_space*mapping,loff_tlstart)¶
truncate all the pages from an offset
Parameters
structaddress_space*mappingmapping to truncate
loff_tlstartoffset from which to truncate
Description
Called under (and serialised by) inode->i_rwsem andmapping->invalidate_lock.
Note
When this function returns, there can be a page in the process ofdeletion (inside__filemap_remove_folio()) in the specified range. Thusmapping->nrpages can be non-zero when this function returns even aftertruncation of the whole mapping.
- voidtruncate_inode_pages_final(structaddress_space*mapping)¶
truncate all pages before inode dies
Parameters
structaddress_space*mappingmapping to truncate
Description
Called under (and serialized by) inode->i_rwsem.
Filesystems have to use this in the .evict_inode path to inform theVM that this is the final truncate and the inode is going away.
- unsignedlonginvalidate_mapping_pages(structaddress_space*mapping,pgoff_tstart,pgoff_tend)¶
Invalidate all clean, unlocked cache of one inode
Parameters
structaddress_space*mappingthe address_space which holds the cache to invalidate
pgoff_tstartthe offset ‘from’ which to invalidate
pgoff_tendthe offset ‘to’ which to invalidate (inclusive)
Description
This function removes pages that are clean, unmapped and unlocked,as well as shadow entries. It will not block on IO activity.
If you want to remove all the pages of one inode, regardless oftheir use and writeback state, usetruncate_inode_pages().
Return
The number of indices that had their contents invalidated
- intinvalidate_inode_pages2_range(structaddress_space*mapping,pgoff_tstart,pgoff_tend)¶
remove range of pages from an address_space
Parameters
structaddress_space*mappingthe address_space
pgoff_tstartthe page offset ‘from’ which to invalidate
pgoff_tendthe page offset ‘to’ which to invalidate (inclusive)
Description
Any pages which are found to be mapped into pagetables are unmapped prior toinvalidation.
Return
-EBUSY if any pages could not be invalidated.
- intinvalidate_inode_pages2(structaddress_space*mapping)¶
remove all pages from an address_space
Parameters
structaddress_space*mappingthe address_space
Description
Any pages which are found to be mapped into pagetables are unmapped prior toinvalidation.
Return
-EBUSY if any pages could not be invalidated.
- voidtruncate_pagecache(structinode*inode,loff_tnewsize)¶
unmap and remove pagecache that has been truncated
Parameters
structinode*inodeinode
loff_tnewsizenew file size
Description
inode’s new i_size must already be written before truncate_pagecacheis called.
This function should typically be called before the filesystemreleases resources associated with the freed range (eg. deallocatesblocks). This way, pagecache will always stay logically coherentwith on-disk format, and the filesystem would not have to deal withsituations such as writepage being called for a page that has alreadyhad its underlying blocks deallocated.
- voidtruncate_setsize(structinode*inode,loff_tnewsize)¶
update inode and pagecache for a new file size
Parameters
structinode*inodeinode
loff_tnewsizenew file size
Description
truncate_setsize updates i_size and performs pagecache truncation (if necessary) to newsize. It will typically be called from the filesystem's setattr function when ATTR_SIZE is passed in.
Must be called with a lock serializing truncates and writes (generallyi_rwsem but e.g. xfs uses a different lock) and before all filesystemspecific block truncation has been performed.
- voidpagecache_isize_extended(structinode*inode,loff_tfrom,loff_tto)¶
update pagecache after extension of i_size
Parameters
structinode*inodeinode for which i_size was extended
loff_tfromoriginal inode size
loff_ttonew inode size
Description
Handle extension of inode size either caused by extending truncate orby write starting after current i_size. We mark the page straddlingcurrent i_size RO so thatpage_mkwrite() is called on the firstwrite access to the page. The filesystem will update its per-blockinformation before user writes to the page via mmap after the i_sizehas been changed.
The function must be called after i_size is updated so that page faultcoming after we unlock the folio will already see the new i_size.The function must be called while we still hold i_rwsem - this not onlymakes sure i_size is stable but also that userspace cannot observe newi_size value before we are prepared to store mmap writes at new inode size.
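A hedged sketch, under the locking assumptions above (i_rwsem held), of the size-change portion of a setattr handler; example_free_blocks() is a hypothetical stand-in for filesystem-specific block truncation, and real filesystems order these steps according to their own journaling needs.

#include <linux/mm.h>
#include <linux/fs.h>

static void example_free_blocks(struct inode *inode, loff_t newsize);  /* hypothetical */

static void example_setsize(struct inode *inode, loff_t newsize)
{
        loff_t oldsize = i_size_read(inode);

        truncate_setsize(inode, newsize);
        if (newsize > oldsize)
                pagecache_isize_extended(inode, oldsize, newsize);
        else
                example_free_blocks(inode, newsize);
}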
- voidtruncate_pagecache_range(structinode*inode,loff_tlstart,loff_tlend)¶
unmap and remove pagecache that is hole-punched
Parameters
structinode*inodeinode
loff_tlstartoffset of beginning of hole
loff_tlendoffset of last byte of hole
Description
This function should typically be called before the filesystemreleases resources associated with the freed range (eg. deallocatesblocks). This way, pagecache will always stay logically coherentwith on-disk format, and the filesystem would not have to deal withsituations such as writepage being called for a page that has alreadyhad its underlying blocks deallocated.
- voidfilemap_set_wb_err(structaddress_space*mapping,interr)¶
set a writeback error on an address_space
Parameters
structaddress_space*mappingmapping in which to set writeback error
interrerror to be set in mapping
Description
When writeback fails in some way, we must record that error so thatuserspace can be informed when fsync and the like are called. We endeavorto report errors on any file that was open at the time of the error. Someinternal callers also need to know when writeback errors have occurred.
When a writeback error occurs, most filesystems will want to callfilemap_set_wb_err to record the error in the mapping so that it will beautomatically reported whenever fsync is called on the file.
- intfilemap_check_wb_err(structaddress_space*mapping,errseq_tsince)¶
has an error occurred since the mark was sampled?
Parameters
structaddress_space*mappingmapping to check for writeback errors
errseq_tsincepreviously-sampled errseq_t
Description
Grab the errseq_t value from the mapping, and see if it has changed “since”the given value was sampled.
If it has then report the latest error set, otherwise return 0.
- errseq_tfilemap_sample_wb_err(structaddress_space*mapping)¶
sample the current errseq_t to test for later errors
Parameters
structaddress_space*mappingmapping to be sampled
Description
Writeback errors are always reported relative to a particular sample pointin the past. This function provides those sample points.
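A hedged sketch of the sample-then-check pattern these helpers enable: record a sample before writing back a range, then report only errors that happened after that point.

#include <linux/pagemap.h>

static int example_flush_and_check(struct address_space *mapping,
                                   loff_t start, loff_t end)
{
        errseq_t since = filemap_sample_wb_err(mapping);
        int ret;

        ret = filemap_write_and_wait_range(mapping, start, end);
        if (ret)
                return ret;
        /* Report any writeback error recorded since the sample. */
        return filemap_check_wb_err(mapping, since);
}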
Parameters
structfile*filefile pointer to be sampled
Description
Grab the most current superblock-level errseq_t value for the givenstructfile.
- voidmapping_set_error(structaddress_space*mapping,interror)¶
record a writeback error in the address_space
Parameters
structaddress_space*mappingthe mapping in which an error should be set
interrorthe error to set in the mapping
Description
When writeback fails in some way, we must record that error so thatuserspace can be informed when fsync and the like are called. We endeavorto report errors on any file that was open at the time of the error. Someinternal callers also need to know when writeback errors have occurred.
When a writeback error occurs, most filesystems will want to callmapping_set_error to record the error in the mapping so that it can bereported when the application calls fsync(2).
- voidmapping_set_large_folios(structaddress_space*mapping)¶
Indicate the file supports large folios.
Parameters
structaddress_space*mappingThe address space of the file.
Description
The filesystem should call this function in its inode constructor toindicate that the VFS can use large folios to cache the contents ofthe file.
Context
This should not be called while the inode is active as itis non-atomic.
- pgoff_tmapping_align_index(conststructaddress_space*mapping,pgoff_tindex)¶
Align index for this mapping.
Parameters
conststructaddress_space*mappingThe address_space.
pgoff_tindexThe page index.
Description
The index of a folio must be naturally aligned. If you are adding anew folio to the page cache and need to know what index to give it,call this function.
- structaddress_space*folio_flush_mapping(structfolio*folio)¶
Find the file mapping this folio belongs to.
Parameters
structfolio*folioThe folio.
Description
For folios which are in the page cache, return the mapping that thispage belongs to. Anonymous folios return NULL, even if they’re inthe swap cache. Other kinds of folio also return NULL.
This is ONLY used by architecture cache flushing code. If you aren’twriting cache flushing code, you want eitherfolio_mapping() orfolio_file_mapping().
Parameters
structfolio*folioThe folio.
Description
For folios which are in the page cache, return the inode that this foliobelongs to.
Do not call this for folios which aren’t in the page cache.
Parameters
structfolio*folioFolio to attach data to.
void*dataData to attach to folio.
Description
Attaching private data to a folio increments the page’s reference count.The data must be detached before the folio will be freed.
Parameters
structfolio*folioFolio to change the data on.
void*dataData to set on the folio.
Description
Change the private data attached to a folio and return the olddata. The page must previously have had data attached and the datamust be detached before the folio will be freed.
Return
Data that was previously attached to the folio.
Parameters
structfolio*folioFolio to detach data from.
Description
Removes the data that was previously attached to the folio and decrementsthe refcount on the page.
Return
Data that was attached to the folio.
- typefgf_t¶
Flags for getting folios from the page cache.
Description
Most users of the page cache will not need to use these flags;there are convenience functions such asfilemap_get_folio() andfilemap_lock_folio(). For users which need more control over exactlywhat is done with the folios, these flags to__filemap_get_folio()are available.
FGP_ACCESSED- The folio will be marked accessed.
FGP_LOCK- The folio is returned locked.
FGP_CREAT- If no folio is present then a new folio is allocated, added to the page cache and the VM's LRU list. The folio is returned locked.
FGP_FOR_MMAP- The caller wants to do its own locking dance if the folio is already in cache. If the folio was allocated, unlock it before returning so the caller can do the same dance.
FGP_WRITE- The folio will be written to by the caller.
FGP_NOFS- __GFP_FS will get cleared in gfp.
FGP_NOWAIT- Don't block on the folio lock.
FGP_STABLE- Wait for the folio to be stable (finished writeback).
FGP_DONTCACHE- Uncached buffered IO.
FGP_WRITEBEGIN- The flags to use in a filesystem write_begin() implementation.
Parameters
size_tsizeThe suggested size of the folio to create.
Description
The caller of__filemap_get_folio() can use this to suggest a preferredsize for the folio that is created. If there is already a folio atthe index, it will be returned, no matter what its size. If a foliois freshly created, it may be of a different size than requesteddue to alignment constraints, memory pressure, or the presence ofother folios at nearby indices.
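A hedged sketch combining several of the flags with a size hint, assuming the four-argument __filemap_get_folio() form; the helper name and flag choices are illustrative.

#include <linux/pagemap.h>

static struct folio *example_grab_for_write(struct address_space *mapping,
                                            loff_t pos, size_t len)
{
        fgf_t fgp = FGP_LOCK | FGP_CREAT | FGP_ACCESSED | fgf_set_order(len);

        /* Returns a locked folio or an ERR_PTR() on failure. */
        return __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
                                   mapping_gfp_mask(mapping));
}

For write paths specifically, the write_begin_get_folio() helper documented below wraps a similar lookup.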
- structfolio*write_begin_get_folio(conststructkiocb*iocb,structaddress_space*mapping,pgoff_tindex,size_tlen)¶
Get folio for write_begin with flags.
Parameters
conststructkiocb*iocbThe kiocb passed from write_begin (may be NULL).
structaddress_space*mappingThe address space to search.
pgoff_tindexThe page cache index.
size_tlenLength of data being written.
Description
This is a helper for filesystemwrite_begin() implementations.It wraps__filemap_get_folio(), setting appropriate flags inthe write begin context.
Return
A folio or an ERR_PTR.
- structfolio*filemap_get_folio(structaddress_space*mapping,pgoff_tindex)¶
Find and get a folio.
Parameters
structaddress_space*mappingThe address_space to search.
pgoff_tindexThe page index.
Description
Looks up the page cache entry atmapping &index. If a folio ispresent, it is returned with an increased refcount.
Return
A folio or ERR_PTR(-ENOENT) if there is no folio in the cache forthis index. Will not return a shadow, swap or DAX entry.
- structfolio*filemap_lock_folio(structaddress_space*mapping,pgoff_tindex)¶
Find and lock a folio.
Parameters
structaddress_space*mappingThe address_space to search.
pgoff_tindexThe page index.
Description
Looks up the page cache entry atmapping &index. If a folio ispresent, it is returned locked with an increased refcount.
Context
May sleep.
Return
A folio or ERR_PTR(-ENOENT) if there is no folio in the cache forthis index. Will not return a shadow, swap or DAX entry.
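A hedged sketch: lock the folio only if it is already cached, treating ERR_PTR(-ENOENT) as "nothing to do".

#include <linux/pagemap.h>
#include <linux/err.h>

static void example_touch_cached(struct address_space *mapping, pgoff_t index)
{
        struct folio *folio = filemap_lock_folio(mapping, index);

        if (IS_ERR(folio))
                return;                 /* Not in the page cache. */

        /* ... operate on the locked, referenced folio here ... */

        folio_unlock(folio);
        folio_put(folio);
}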
- structfolio*filemap_grab_folio(structaddress_space*mapping,pgoff_tindex)¶
grab a folio from the page cache
Parameters
structaddress_space*mappingThe address space to search
pgoff_tindexThe page index
Description
Looks up the page cache entry atmapping &index. If no folio is found,a new folio is created. The folio is locked, marked as accessed, andreturned.
Return
A found or created folio. ERR_PTR(-ENOMEM) if no folio is foundand failed to create a folio.
- structpage*find_get_page(structaddress_space*mapping,pgoff_toffset)¶
find and get a page reference
Parameters
structaddress_space*mappingthe address_space to search
pgoff_toffsetthe page index
Description
Looks up the page cache slot atmapping &offset. If there is apage cache page, it is returned with an increased refcount.
Otherwise,NULL is returned.
- structpage*find_lock_page(structaddress_space*mapping,pgoff_tindex)¶
locate, pin and lock a pagecache page
Parameters
structaddress_space*mappingthe address_space to search
pgoff_tindexthe page index
Description
Looks up the page cache entry atmapping &index. If there is apage cache page, it is returned locked and with an increasedrefcount.
Context
May sleep.
Return
Astructpage orNULL if there is no page in the cache for thisindex.
- structpage*find_or_create_page(structaddress_space*mapping,pgoff_tindex,gfp_tgfp_mask)¶
locate or add a pagecache page
Parameters
structaddress_space*mappingthe page’s address_space
pgoff_tindexthe page’s index into the mapping
gfp_tgfp_maskpage allocation mode
Description
Looks up the page cache slot atmapping &offset. If there is apage cache page, it is returned locked and with an increasedrefcount.
If the page is not present, a new page is allocated usinggfp_maskand added to the page cache and the VM’s LRU list. The page isreturned locked and with an increased refcount.
On memory exhaustion,NULL is returned.
find_or_create_page() may sleep, even ifgfp_flags specifies anatomic allocation!
- structpage*grab_cache_page_nowait(structaddress_space*mapping,pgoff_tindex)¶
returns locked page at given index in given cache
Parameters
structaddress_space*mappingtarget address_space
pgoff_tindexthe page index
Description
Returns locked page at given index in given cache, creating it ifneeded, but do not wait if the page is locked or to reclaim memory.This is intended for speculative data generators, where the data canbe regenerated if the page couldn’t be grabbed. This routine shouldbe safe to call while holding the lock for another page.
Clear __GFP_FS when allocating the page to avoid recursion into the fsand deadlock against the caller’s locked page.
Parameters
conststructfolio*folioThe current folio.
Return
The index of the folio which follows this folio in the file.
Parameters
conststructfolio*folioThe current folio.
Return
The position of the folio which follows this folio in the file.
Parameters
structfolio*folioThe folio which contains this index.
pgoff_tindexThe index we want to look up.
Description
Sometimes after looking up a folio in the page cache, we need toobtain the specific page for an index (eg a page fault).
Return
The page containing the file data for this index.
Parameters
conststructfolio*folioThe folio.
pgoff_tindexThe page index within the file.
Context
The caller should have the folio locked and ensuree.g., shmem did not move this folio to the swap cache.
Return
true or false.
- pgoff_tpage_pgoff(conststructfolio*folio,conststructpage*page)¶
Calculate the logical page offset of this page.
Parameters
conststructfolio*folioThe folio containing this page.
conststructpage*pageThe page which we need the offset of.
Description
For file pages, this is the offset from the beginning of the filein units of PAGE_SIZE. For anonymous pages, this is the offset fromthe beginning of the anon_vma in units of PAGE_SIZE. This willreturn nonsense for KSM pages.
Context
Caller must have a reference on the folio or otherwiseprevent it from being split or freed.
Return
The offset in units of PAGE_SIZE.
Parameters
conststructfolio*folioThe folio.
Parameters
structfolio*folioThe folio to attempt to lock.
Description
Sometimes it is undesirable to wait for a folio to be unlocked (egwhen the locks are being taken in the wrong order, or if makingprogress through a batch of folios is more important than processingthem in order). Usuallyfolio_lock() is the correct function to call.
Context
Any context.
Return
Whether the lock was successfully acquired.
Parameters
structfolio*folioThe folio to lock.
Description
The folio lock protects against many things, probably more than itshould. It is primarily held while a folio is being brought uptodate,either from its backing file or from swap. It is also held while afolio is being truncated from its address_space, so holding the lockis sufficient to keep folio->mapping stable.
The folio lock is also held while write() is modifying the page toprovide POSIX atomicity guarantees (as long as the write does notcross a page boundary). Other modifications to the data in the foliodo not hold the folio lock and can race with writes, eg DMA and storesto mapped pages.
Context
May sleep. If you need to acquire the locks of two ormore folios, they must be in order of ascending index, if they arein the same address_space. If they are in different address_spaces,acquire the lock of the folio which belongs to the address_space whichhas the lowest address in memory first.
Parameters
structpage*pageThe page to lock.
Description
Seefolio_lock() for a description of what the lock protects.This is a legacy function and new code should probably usefolio_lock()instead.
Context
May sleep. Pages in the same folio share a lock, so do notattempt to lock two pages which share a folio.
Parameters
structfolio*folioThe folio to lock.
Description
Attempts to lock the folio, likefolio_lock(), except that the sleepto acquire the lock is interruptible by a fatal signal.
Context
May sleep; seefolio_lock().
Return
0 if the lock was acquired; -EINTR if a fatal signal was received.
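A hedged sketch of choosing between the locking primitives above when walking a batch out of order; the can_sleep policy is illustrative.

#include <linux/pagemap.h>

static int example_lock_folio(struct folio *folio, bool can_sleep)
{
        if (folio_trylock(folio))
                return 0;
        if (!can_sleep)
                return -EAGAIN;         /* Revisit this folio later. */
        /* Sleeps, but can be interrupted by a fatal signal. */
        return folio_lock_killable(folio);
}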
- boolfilemap_range_needs_writeback(structaddress_space*mapping,loff_tstart_byte,loff_tend_byte)¶
check if range potentially needs writeback
Parameters
structaddress_space*mappingaddress space within which to check
loff_tstart_byteoffset in bytes where the range starts
loff_tend_byteoffset in bytes where the range ends (inclusive)
Description
Find at least one page in the range supplied, usually used to check ifdirect writing in this range will trigger a writeback. Used by O_DIRECTread/write with IOCB_NOWAIT, to see if the caller needs to dofilemap_write_and_wait_range() before proceeding.
Return
true if the caller should dofilemap_write_and_wait_range() beforedoing O_DIRECT to a page in this range,false otherwise.
- structreadahead_control¶
Describes a readahead request.
Definition:
struct readahead_control {
    struct file *file;
    struct address_space *mapping;
    struct file_ra_state *ra;
};
Members
fileThe file, used primarily by network filesystems for authentication.May be NULL if invoked internally by the filesystem.
mappingReadahead this filesystem object.
raFile readahead state. May be NULL.
Description
A readahead request is for consecutive pages. Filesystems whichimplement the ->readahead method should callreadahead_folio() or__readahead_batch() in a loop and attempt to start reads into eachfolio in the request.
Most of the fields in thisstructare private and should be accessedby the functions below.
- voidpage_cache_sync_readahead(structaddress_space*mapping,structfile_ra_state*ra,structfile*file,pgoff_tindex,unsignedlongreq_count)¶
generic file readahead
Parameters
structaddress_space*mappingaddress_space which holds the pagecache and I/O vectors
structfile_ra_state*rafile_ra_state which holds the readahead state
structfile*fileUsed by the filesystem for authentication.
pgoff_tindexIndex of first page to be read.
unsignedlongreq_countTotal number of pages being read by the caller.
Description
page_cache_sync_readahead() should be called when a cache miss happened:it will submit the read. The readahead logic may decide to piggyback morepages onto the read request if access patterns suggest it will improveperformance.
- voidpage_cache_async_readahead(structaddress_space*mapping,structfile_ra_state*ra,structfile*file,structfolio*folio,unsignedlongreq_count)¶
file readahead for marked pages
Parameters
structaddress_space*mappingaddress_space which holds the pagecache and I/O vectors
structfile_ra_state*rafile_ra_state which holds the readahead state
structfile*fileUsed by the filesystem for authentication.
structfolio*folioThe folio which triggered the readahead call.
unsignedlongreq_countTotal number of pages being read by the caller.
Description
page_cache_async_readahead() should be called when a page is used whichis marked as PageReadahead; this is a marker to suggest that the applicationhas used up enough of the readahead window that we should start pulling inmore pages.
- structfolio*readahead_folio(structreadahead_control*ractl)¶
Get the next folio to read.
Parameters
structreadahead_control*ractlThe current readahead request.
Context
The folio is locked. The caller should unlock the folio onceall I/O to that folio has completed.
Return
A pointer to the next folio, orNULL if we are done.
- loff_treadahead_pos(conststructreadahead_control*rac)¶
The byte offset into the file of this readahead request.
Parameters
conststructreadahead_control*racThe readahead request.
- size_treadahead_length(conststructreadahead_control*rac)¶
The number of bytes in this readahead request.
Parameters
conststructreadahead_control*racThe readahead request.
- pgoff_treadahead_index(conststructreadahead_control*rac)¶
The index of the first page in this readahead request.
Parameters
conststructreadahead_control*racThe readahead request.
- unsignedintreadahead_count(conststructreadahead_control*rac)¶
The number of pages in this readahead request.
Parameters
conststructreadahead_control*racThe readahead request.
- size_treadahead_batch_length(conststructreadahead_control*rac)¶
The number of bytes in the current batch.
Parameters
conststructreadahead_control*racThe readahead request.
- ssize_tfolio_mkwrite_check_truncate(conststructfolio*folio,conststructinode*inode)¶
check if folio was truncated
Parameters
conststructfolio*foliothe folio to check
conststructinode*inodethe inode to check the folio against
Return
the number of bytes in the folio up to EOF,or -EFAULT if the folio was truncated.
- unsignedinti_blocks_per_folio(conststructinode*inode,conststructfolio*folio)¶
How many blocks fit in this folio.
Parameters
conststructinode*inodeThe inode which contains the blocks.
conststructfolio*folioThe folio.
Description
If the block size is larger than the size of this folio, return zero.
Context
The caller should hold a refcount on the folio to prevent itfrom being split.
Return
The number of filesystem blocks covered by this folio.
Memory pools¶
- voidmempool_exit(structmempool*pool)¶
exit a mempool initialized with mempool_init()
Parameters
structmempool*poolpointer to the memory pool which was initialized with mempool_init().
Description
Free all reserved elements inpool andpool itself. This functiononly sleeps if thefree_fn() function sleeps.
May be called on a zeroed but uninitialized mempool (i.e. allocated withkzalloc()).
- voidmempool_destroy(structmempool*pool)¶
deallocate a memory pool
Parameters
structmempool*poolpointer to the memory pool which was allocated via mempool_create().
Description
Free all reserved elements inpool andpool itself. This functiononly sleeps if thefree_fn() function sleeps.
- intmempool_init(structmempool*pool,intmin_nr,mempool_alloc_t*alloc_fn,mempool_free_t*free_fn,void*pool_data)¶
initialize a memory pool
Parameters
structmempool*poolpointer to the memory pool that should be initialized
intmin_nrthe minimum number of elements guaranteed to beallocated for this pool.
mempool_alloc_t*alloc_fnuser-defined element-allocation function.
mempool_free_t*free_fnuser-defined element-freeing function.
void*pool_dataoptional private data available to the user-defined functions.
Description
Likemempool_create(), but initializes the pool in (i.e. embedded in anotherstructure).
Return
0 on success, negative error code otherwise.
- structmempool*mempool_create_node(intmin_nr,mempool_alloc_t*alloc_fn,mempool_free_t*free_fn,void*pool_data,gfp_tgfp_mask,intnode_id)¶
create a memory pool
Parameters
intmin_nrthe minimum number of elements guaranteed to beallocated for this pool.
mempool_alloc_t*alloc_fnuser-defined element-allocation function.
mempool_free_t*free_fnuser-defined element-freeing function.
void*pool_dataoptional private data available to the user-defined functions.
gfp_tgfp_maskmemory allocation flags
intnode_idnuma node to allocate on
Description
this function creates and allocates a guaranteed size, preallocatedmemory pool. The pool can be used from themempool_alloc() andmempool_free()functions. This function might sleep. Both thealloc_fn() and thefree_fn()functions might sleep - as long as themempool_alloc() function is not calledfrom IRQ contexts.
Return
pointer to the created memory pool object orNULL on error.
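A hedged sketch of creating a pool backed by a slab cache so that a small reserve of objects is always available; example_cache and the reserve size are illustrative, while mempool_alloc_slab()/mempool_free_slab() are the stock slab-backed callbacks.

#include <linux/mempool.h>
#include <linux/slab.h>

static struct kmem_cache *example_cache;        /* created elsewhere */
static struct mempool *example_pool;

static int example_pool_setup(void)
{
        example_pool = mempool_create_node(4, mempool_alloc_slab,
                                           mempool_free_slab, example_cache,
                                           GFP_KERNEL, NUMA_NO_NODE);
        return example_pool ? 0 : -ENOMEM;
}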
- intmempool_resize(structmempool*pool,intnew_min_nr)¶
resize an existing memory pool
Parameters
structmempool*poolpointer to the memory pool which was allocated via mempool_create().
intnew_min_nrthe new minimum number of elements guaranteed to be allocated for this pool.
Description
This function shrinks/grows the pool. In the case of growing,it cannot be guaranteed that the pool will be grown to the newsize immediately, but newmempool_free() calls will refill it.This function may sleep.
Note, the caller must guarantee that no mempool_destroy is calledwhile this function is running.mempool_alloc() &mempool_free()might be called (eg. from IRQ contexts) while this function executes.
Return
0 on success, negative error code otherwise.
- intmempool_alloc_bulk(structmempool*pool,void**elems,unsignedintcount,unsignedintallocated)¶
allocate multiple elements from a memory pool
Parameters
structmempool*poolpointer to the memory pool
void**elemspartially or fully populated elements array
unsignedintcountnumber of entries inelem that need to be allocated
unsignedintallocatednumber of entries inelem already allocated
Description
Allocate elements for each slot inelem that is non-NULL. This is done byfirst calling into the alloc_fn supplied at pool initialization time, anddipping into the reserved pool when alloc_fn fails to allocate an element.
On return allcount elements inelems will be populated.
Return
Always 0. If it wasn’t for %$#^$ alloc tags, it would return void.
- void*mempool_alloc(structmempool*pool,gfp_tgfp_mask)¶
allocate an element from a memory pool
Parameters
structmempool*poolpointer to the memory pool
gfp_tgfp_maskGFP_* flags. __GFP_ZERO is not supported.
Description
Allocate an element frompool. This is done by first calling into thealloc_fn supplied at pool initialization time, and dipping into the reservedpool when alloc_fn fails to allocate an element.
This function only sleeps if the alloc_fn callback sleeps, or when waitingfor elements to become available in the pool.
Return
pointer to the allocated element orNULL when failing to allocatean element. Allocation failure can only happen whengfp_mask does notinclude__GFP_DIRECT_RECLAIM.
- void*mempool_alloc_preallocated(structmempool*pool)¶
allocate an element from preallocated elements belonging to a memory pool
Parameters
structmempool*poolpointer to the memory pool
Description
This function is similar tomempool_alloc(), but it only attempts allocatingan element from the preallocated elements. It only takes a single spinlock_tand immediately returns if no preallocated elements are available.
Return
pointer to the allocated element orNULL if no elements areavailable.
- unsignedintmempool_free_bulk(structmempool*pool,void**elems,unsignedintcount)¶
return elements to a mempool
Parameters
structmempool*poolpointer to the memory pool
void**elemselements to return
unsignedintcountnumber of elements to return
Description
Returns a number of elements from the start ofelem topool ifpool needsreplenishing and sets their slots inelem to NULL. Other elements are leftinelem.
Return
number of elements transferred topool. Elements are alwaystransferred from the beginning ofelem, so the return value can be used asan offset intoelem for the freeing the remaining elements in the caller.
- voidmempool_free(void*element,structmempool*pool)¶
return an element to the pool.
Parameters
void*elementelement to return
structmempool*poolpointer to the memory pool
Description
Returnselement topool if it needs replenishing, else frees it usingthe free_fn callback inpool.
This function only sleeps if the free_fn callback sleeps.
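A hedged sketch of the allocate/use/free cycle on an I/O path: with a gfp mask that allows direct reclaim, mempool_alloc() waits for the reserve rather than failing.

#include <linux/mempool.h>

static void example_do_io(struct mempool *pool)
{
        void *elem;

        /* GFP_NOIO allows direct reclaim, so this cannot return NULL. */
        elem = mempool_alloc(pool, GFP_NOIO);

        /* ... use elem to describe and submit the I/O ... */

        mempool_free(elem, pool);
}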
More Memory Management Functions¶
- voidzap_vma_ptes(structvm_area_struct*vma,unsignedlongaddress,unsignedlongsize)¶
remove ptes mapping the vma
Parameters
structvm_area_struct*vmavm_area_struct holding ptes to be zapped
unsignedlongaddressstarting address of pages to zap
unsignedlongsizenumber of bytes to zap
Description
This function only unmaps ptes assigned to VM_PFNMAP vmas.
The entire address range must be fully contained within the vma.
- intvm_insert_pages(structvm_area_struct*vma,unsignedlongaddr,structpage**pages,unsignedlong*num)¶
insert multiple pages into user vma, batching the pmd lock.
Parameters
structvm_area_struct*vmauser vma to map to
unsignedlongaddrtarget start user address of these pages
structpage**pagessource kernel pages
unsignedlong*numin: number of pages to map. out: number of pages that werenotmapped. (0 means all pages were successfully mapped).
Description
Preferred overvm_insert_page() when inserting multiple pages.
In case of error, we may have mapped a subset of the providedpages. It is the caller’s responsibility to account for this case.
The same restrictions apply as invm_insert_page().
- intvm_insert_page(structvm_area_struct*vma,unsignedlongaddr,structpage*page)¶
insert single page into user vma
Parameters
structvm_area_struct*vmauser vma to map to
unsignedlongaddrtarget user address of this page
structpage*pagesource kernel page
Description
This allows drivers to insert individual pages they’ve allocatedinto a user vma. The zeropage is supported in some VMAs,seevm_mixed_zeropage_allowed().
The page has to be a nice clean _individual_ kernel allocation.If you allocate a compound page, you need to have marked it assuch (__GFP_COMP), or manually just split the page up yourself(seesplit_page()).
NOTE! Traditionally this was done with “remap_pfn_range()” whichtook an arbitrary page protection parameter. This doesn’t allowthat. Your vma protection will have to be set up correctly, whichmeans that if you want a shared writable mapping, you’d betterask for a shared writable mapping!
The page does not need to be reserved.
Usually this function is called from f_op->mmap() handlerunder mm->mmap_lock write-lock, so it can change vma->vm_flags.Caller must set VM_MIXEDMAP on vma if it wants to call thisfunction from other places, for example from page-fault handler.
Return
0 on success, negative error code otherwise.
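A hedged sketch of a driver mmap handler exposing a single preallocated kernel page, as described above; example_page is an illustrative driver-private page.

#include <linux/mm.h>
#include <linux/fs.h>

static struct page *example_page;       /* allocated elsewhere, order 0 */

static int example_mmap(struct file *file, struct vm_area_struct *vma)
{
        if (vma_pages(vma) != 1)
                return -EINVAL;
        /* Called from f_op->mmap() under mmap_lock, as noted above. */
        return vm_insert_page(vma, vma->vm_start, example_page);
}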
- intvm_map_pages(structvm_area_struct*vma,structpage**pages,unsignedlongnum)¶
maps range of kernel pages starts with non zero offset
Parameters
structvm_area_struct*vmauser vma to map to
structpage**pagespointer to array of source kernel pages
unsignedlongnumnumber of pages in page array
Description
Maps an object consisting ofnum pages, catering for the user’srequested vm_pgoff
If we fail to insert any page into the vma, the function will returnimmediately leaving any previously inserted pages present. Callersfrom the mmap handler may immediately return the error as their callerwill destroy the vma, removing any successfully inserted pages. Othercallers should make their own arrangements for callingunmap_region().
Context
Process context. Called by mmap handlers.
Return
0 on success and error code otherwise.
- intvm_map_pages_zero(structvm_area_struct*vma,structpage**pages,unsignedlongnum)¶
map range of kernel pages starts with zero offset
Parameters
structvm_area_struct*vmauser vma to map to
structpage**pagespointer to array of source kernel pages
unsignedlongnumnumber of pages in page array
Description
Similar tovm_map_pages(), except that it explicitly sets the offsetto 0. This function is intended for the drivers that did not considervm_pgoff.
Context
Process context. Called by mmap handlers.
Return
0 on success and error code otherwise.
- vm_fault_tvmf_insert_pfn_prot(structvm_area_struct*vma,unsignedlongaddr,unsignedlongpfn,pgprot_tpgprot)¶
insert single pfn into user vma with specified pgprot
Parameters
structvm_area_struct*vmauser vma to map to
unsignedlongaddrtarget user address of this page
unsignedlongpfnsource kernel pfn
pgprot_tpgprotpgprot flags for the inserted page
Description
This is exactly likevmf_insert_pfn(), except that it allows driversto override pgprot on a per-page basis.
This only makes sense for IO mappings, and it makes no sense forCOW mappings. In general, using multiple vmas is preferable;vmf_insert_pfn_prot should only be used if using multiple VMAs isimpractical.
pgprot typically only differs fromvma->vm_page_prot when drivers setcaching- and encryption bits different than those ofvma->vm_page_prot,because the caching- or encryption mode may not be known at mmap() time.
This is ok as long asvma->vm_page_prot is not used by the core vmto set caching and encryption bits for those vmas (except for COW pages).This is ensured by core vm only modifying these page table entries usingfunctions that don’t touch caching- or encryption bits, usingpte_modify()if needed. (See for examplemprotect()).
Also when new page-table entries are created, this is only done using thefault() callback, and never using the value of vma->vm_page_prot,except for page-table entries that point to anonymous pages as the resultof COW.
Context
Process context. May allocate usingGFP_KERNEL.
Return
vm_fault_t value.
- vm_fault_tvmf_insert_pfn(structvm_area_struct*vma,unsignedlongaddr,unsignedlongpfn)¶
insert single pfn into user vma
Parameters
structvm_area_struct*vmauser vma to map to
unsignedlongaddrtarget user address of this page
unsignedlongpfnsource kernel pfn
Description
Similar to vm_insert_page, this allows drivers to insert individual pagesthey’ve allocated into a user vma. Same comments apply.
This function should only be called from a vm_ops->fault handler, andin that case the handler should return the result of this function.
vma cannot be a COW mapping.
As this is called only for pages that do not currently exist, wedo not need to flush old virtual caches or the TLB.
Context
Process context. May allocate usingGFP_KERNEL.
Return
vm_fault_t value.
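Example
A hedged sketch of a ->fault handler that returns the result directly, as described above; my_fault and the PFN computation (my_dev_base_pfn) are hypothetical:
    static vm_fault_t my_fault(struct vm_fault *vmf)
    {
        /* Hypothetical: translate the faulting offset into a device PFN. */
        unsigned long pfn = my_dev_base_pfn + vmf->pgoff;

        return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
    }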
- intremap_pfn_range(structvm_area_struct*vma,unsignedlongaddr,unsignedlongpfn,unsignedlongsize,pgprot_tprot)¶
remap kernel memory to userspace
Parameters
structvm_area_struct*vmauser vma to map to
unsignedlongaddrtarget page aligned user address to start at
unsignedlongpfnpage frame number of kernel physical memory address
unsignedlongsizesize of mapping area
pgprot_tprotpage protection flags for this mapping
Note
this is only safe if the mm semaphore is held when called.
Return
0 on success, negative error code otherwise.
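Example
A common pattern from an mmap handler, mapping a physical region whose base address the driver already knows; phys_base is a hypothetical placeholder:
    static int my_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        return remap_pfn_range(vma, vma->vm_start,
                               phys_base >> PAGE_SHIFT,    /* hypothetical base */
                               size, vma->vm_page_prot);
    }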
- intvm_iomap_memory(structvm_area_struct*vma,phys_addr_tstart,unsignedlonglen)¶
remap memory to userspace
Parameters
structvm_area_struct*vmauser vma to map to
phys_addr_tstartstart of the physical memory to be mapped
unsignedlonglensize of area
Description
This is a simplifiedio_remap_pfn_range() for common driver use. Thedriver just needs to give us the physical memory range to be mapped,we’ll figure out the rest from the vma information.
NOTE! Some drivers might want to tweak vma->vm_page_prot first to getwhatever write-combining details or similar.
Return
0 on success, negative error code otherwise.
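Example
A sketch of the simplified form, letting the core work out the rest from the vma; bar_start and bar_len stand in for a device's physical MMIO window:
    static int my_mmap(struct file *file, struct vm_area_struct *vma)
    {
        /* bar_start / bar_len: hypothetical physical range of a device BAR */
        return vm_iomap_memory(vma, bar_start, bar_len);
    }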
- voidunmap_mapping_pages(structaddress_space*mapping,pgoff_tstart,pgoff_tnr,booleven_cows)¶
Unmap pages from processes.
Parameters
structaddress_space*mappingThe address space containing pages to be unmapped.
pgoff_tstartIndex of first page to be unmapped.
pgoff_tnrNumber of pages to be unmapped. 0 to unmap to end of file.
booleven_cowsWhether to unmap even private COWed pages.
Description
Unmap the pages in this address space from any userspace process whichhas them mmaped. Generally, you want to remove COWed pages as well whena file is being truncated, but not when invalidating pages from the pagecache.
- voidunmap_mapping_range(structaddress_space*mapping,loff_tconstholebegin,loff_tconstholelen,inteven_cows)¶
unmap the portion of all mmaps in the specified address_space corresponding to the specified byte range in the underlying file.
Parameters
structaddress_space*mappingthe address space containing mmaps to be unmapped.
loff_tconstholebeginbyte in first page to unmap, relative to the start of the underlying file. This will be rounded down to a PAGE_SIZE boundary. Note that this is different from truncate_pagecache(), which must keep the partial page. In contrast, we must get rid of partial pages.
loff_tconstholelensize of prospective hole in bytes. This will be rounded up to a PAGE_SIZE boundary. A holelen of zero truncates to the end of the file.
inteven_cows1 when truncating a file, unmap even private COWed pages; but 0 when invalidating pagecache, don't throw away private data.
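Example
A hedged sketch of the truncation case: unmap everything, including private COWed pages, from the new end of file onward. inode and newsize are hypothetical caller context:
    /* Truncate pattern: drop all mappings past newsize. */
    loff_t holebegin = round_up(newsize, PAGE_SIZE);

    unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);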
- intfollow_pfnmap_start(structfollow_pfnmap_args*args)¶
Look up a pfn mapping at a user virtual address
Parameters
structfollow_pfnmap_args*argsPointer to structfollow_pfnmap_args
Description
The caller needs to setup args->vma and args->address to point to thevirtual address as the target of such lookup. On a successful return,the results will be put into other output fields.
After the caller has finished using the fields, it must invoke follow_pfnmap_end() to properly release the locks and resources taken for the lookup.
During thestart() andend() calls, the results inargs will be validas proper locks will be held. After theend() is called, all the fieldsinfollow_pfnmap_args will be invalid to be further accessed. Furtheruse of such information afterend() may require proper synchronizationsby the caller with page table updates, otherwise it can create asecurity bug.
If the PTE maps a refcounted page, callers are responsible to protectagainst invalidation with MMU notifiers; otherwise access to the PFN ata later point in time can trigger use-after-free.
Only IO mappings and raw PFN mappings are allowed. The mmap semaphoreshould be taken for read, and the mmap semaphore cannot be releasedbefore theend() is invoked.
This function must not be used to modify PTE content.
Return
zero on success, negative otherwise.
- voidfollow_pfnmap_end(structfollow_pfnmap_args*args)¶
End a follow_pfnmap_start() process
Parameters
structfollow_pfnmap_args*argsPointer to structfollow_pfnmap_args
Description
Must be used in pair offollow_pfnmap_start(). See thestart() functionabove for more information.
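Example
A sketch of the required pairing. Only the documented input fields (vma, address) are set up; how the output fields are consumed is left abstract:
    struct follow_pfnmap_args args = {
        .vma = vma,
        .address = addr,
    };

    if (follow_pfnmap_start(&args))
        return -EFAULT;

    /* ... consume the output fields here, while the locks are held ... */

    follow_pfnmap_end(&args);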
- intgeneric_access_phys(structvm_area_struct*vma,unsignedlongaddr,void*buf,intlen,intwrite)¶
generic implementation for iomem mmap access
Parameters
structvm_area_struct*vmathe vma to access
unsignedlongaddruserspace address, not relative offset withinvma
void*bufbuffer to read/write
intlenlength of transfer
intwriteset to FOLL_WRITE when writing, otherwise reading
Description
This is a generic implementation forvm_operations_struct.access for aniomem mapping. This callback is used byaccess_process_vm() when thevma isnot page based.
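Example
A sketch of wiring this helper into a driver's vm_operations_struct so that access_process_vm() (and thus ptrace/gdb) can reach an iomem mapping; my_vm_ops is hypothetical:
    static const struct vm_operations_struct my_vm_ops = {
        .access = generic_access_phys,
    };

    /* In the mmap handler, after setting up the PFN mapping: */
    vma->vm_ops = &my_vm_ops;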
- intcopy_remote_vm_str(structtask_struct*tsk,unsignedlongaddr,void*buf,intlen,unsignedintgup_flags)¶
copy a string from another process’s address space.
Parameters
structtask_struct*tskthe task of the target address space
unsignedlongaddrstart address to read from
void*bufdestination buffer
intlennumber of bytes to copy
unsignedintgup_flagsflags modifying lookup behaviour
Description
The caller must hold a reference onmm.
Return
number of bytes copied from addr (source) to buf (destination), not including the trailing NUL. The buffer is always guaranteed to be NUL-terminated. On any error, returns -EFAULT.
- unsignedlong__get_pfnblock_flags_mask(conststructpage*page,unsignedlongpfn,unsignedlongmask)¶
Return the requested group of flags for a pageblock_nr_pages block of pages
Parameters
conststructpage*pageThe page within the block of interest
unsignedlongpfnThe target page frame number
unsignedlongmaskmask of bits that the caller is interested in
Return
pageblock_bits flags
- boolget_pfnblock_bit(conststructpage*page,unsignedlongpfn,enumpageblock_bitspb_bit)¶
Check if a standalone bit of a pageblock is set
Parameters
conststructpage*pageThe page within the block of interest
unsignedlongpfnThe target page frame number
enumpageblock_bitspb_bitpageblock bit to check
Return
true if the bit is set, otherwise false
- enummigratetypeget_pfnblock_migratetype(conststructpage*page,unsignedlongpfn)¶
Return the migratetype of a pageblock
Parameters
conststructpage*pageThe page within the block of interest
unsignedlongpfnThe target page frame number
Return
The migratetype of the pageblock
Description
Useget_pfnblock_migratetype() if caller already has bothpage andpfnto save a call topage_to_pfn().
- void__set_pfnblock_flags_mask(structpage*page,unsignedlongpfn,unsignedlongflags,unsignedlongmask)¶
Set the requested group of flags for a pageblock_nr_pages block of pages
Parameters
structpage*pageThe page within the block of interest
unsignedlongpfnThe target page frame number
unsignedlongflagsThe flags to set
unsignedlongmaskmask of bits that the caller is interested in
- voidset_pfnblock_bit(conststructpage*page,unsignedlongpfn,enumpageblock_bitspb_bit)¶
Set a standalone bit of a pageblock
Parameters
conststructpage*pageThe page within the block of interest
unsignedlongpfnThe target page frame number
enumpageblock_bitspb_bitpageblock bit to set
- voidclear_pfnblock_bit(conststructpage*page,unsignedlongpfn,enumpageblock_bitspb_bit)¶
Clear a standalone bit of a pageblock
Parameters
conststructpage*pageThe page within the block of interest
unsignedlongpfnThe target page frame number
enumpageblock_bitspb_bitpageblock bit to clear
- voidset_pageblock_migratetype(structpage*page,enummigratetypemigratetype)¶
Set the migratetype of a pageblock
Parameters
structpage*pageThe page within the block of interest
enummigratetypemigratetypemigratetype to set
- bool__move_freepages_block_isolate(structzone*zone,structpage*page,boolisolate)¶
move free pages in block for page isolation
Parameters
structzone*zonethe zone
structpage*pagethe pageblock page
boolisolateto isolate the given pageblock or unisolate it
Description
This is similar tomove_freepages_block(), but handles the specialcase encountered in page isolation, where the block of interestmight be part of a larger buddy spanning multiple pageblocks.
Unlike the regular page allocator path, which moves pages whilestealing buddies off the freelist, page isolation is interested inarbitrary pfn ranges that may have overlapping buddies on both ends.
This function handles that. Straddling buddies are split intoindividual pageblocks. Only the block of interest is moved.
Returnstrue if pages could be moved,false otherwise.
- void__putback_isolated_page(structpage*page,unsignedintorder,intmt)¶
Return a now-isolated page back where we got it
Parameters
structpage*pagePage that was isolated
unsignedintorderOrder of the isolated page
intmtThe page’s pageblock’s migratetype
Description
This function is meant to return a page pulled from the free lists via__isolate_free_page back to the free lists they were pulled from.
- void__free_pages(structpage*page,unsignedintorder)¶
Free pages allocated with alloc_pages().
Parameters
structpage*pageThe page pointer returned from alloc_pages().
unsignedintorderThe order of the allocation.
Description
This function can free multi-page allocations that are not compoundpages. It does not check that theorder passed in matches that ofthe allocation, so it is easy to leak memory. Freeing more memorythan was allocated will probably emit a warning.
If the last reference to this page is speculative, it will be releasedbyput_page() which only frees the first page of a non-compoundallocation. To prevent the remaining pages from being leaked, we freethe subsequent pages here. If you want to use the page’s referencecount to decide when to free the allocation, you should allocate acompound page, and useput_page() instead of__free_pages().
Context
May be called in interrupt context or while holding a normalspinlock, but not in NMI context or while holding a raw spinlock.
- voidfree_pages(unsignedlongaddr,unsignedintorder)¶
Free pages allocated with __get_free_pages().
Parameters
unsignedlongaddrThe virtual address tied to a page returned from __get_free_pages().
unsignedintorderThe order of the allocation.
Description
This function behaves the same as__free_pages(). Use this functionto free pages when you only have a valid virtual address. If you havethe page, call__free_pages() instead.
- void*alloc_pages_exact(size_tsize,gfp_tgfp_mask)¶
allocate an exact number of physically-contiguous pages.
Parameters
size_tsizethe number of bytes to allocate
gfp_tgfp_maskGFP flags for the allocation, must not contain __GFP_COMP
Description
This function is similar toalloc_pages(), except that it allocates theminimum number of pages to satisfy the request.alloc_pages() can onlyallocate memory in power-of-two pages.
This function is also limited by MAX_PAGE_ORDER.
Memory allocated by this function must be released byfree_pages_exact().
Return
pointer to the allocated area orNULL in case of error.
- void*alloc_pages_exact_nid(intnid,size_tsize,gfp_tgfp_mask)¶
allocate an exact number of physically-contiguous pages on a node.
Parameters
intnidthe preferred node ID where memory should be allocated
size_tsizethe number of bytes to allocate
gfp_tgfp_maskGFP flags for the allocation, must not contain __GFP_COMP
Description
Likealloc_pages_exact(), but try to allocate on node nid first before fallingback.
Return
pointer to the allocated area orNULL in case of error.
- voidfree_pages_exact(void*virt,size_tsize)¶
release memory allocated via alloc_pages_exact()
Parameters
void*virtthe value returned by alloc_pages_exact().
size_tsizesize of allocation, same value as passed to alloc_pages_exact().
Description
Release the memory allocated by a previous call to alloc_pages_exact.
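Example
A minimal sketch of the allocate/free pairing; the 40 KiB size is arbitrary:
    void *buf = alloc_pages_exact(40 * 1024, GFP_KERNEL);

    if (!buf)
        return -ENOMEM;

    /* ... use the physically contiguous buffer ... */

    free_pages_exact(buf, 40 * 1024);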
- unsignedlongnr_free_zone_pages(intoffset)¶
count number of pages beyond high watermark
Parameters
intoffsetThe zone index of the highest zone
Description
nr_free_zone_pages() counts the number of pages which are beyond thehigh watermark within all zones at or below a given zone index. For eachzone, the number of pages is calculated as:
nr_free_zone_pages = managed_pages - high_pages
Return
number of pages beyond high watermark.
- unsignedlongnr_free_buffer_pages(void)¶
count number of pages beyond high watermark
Parameters
voidno arguments
Description
nr_free_buffer_pages() counts the number of pages which are beyond the highwatermark within ZONE_DMA and ZONE_NORMAL.
Return
number of pages beyond high watermark within ZONE_DMA andZONE_NORMAL.
- intfind_next_best_node(intnode,nodemask_t*used_node_mask)¶
find the next node that should appear in a given node’s fallback list
Parameters
intnodenode whose fallback list we’re appending
nodemask_t*used_node_masknodemask_t of already used nodes
Description
We use a number of factors to determine which is the next node that shouldappear on a given node’s fallback list. The node should not have appearedalready innode’s fallback list, and it should be the next closest nodeaccording to the distance array (which contains arbitrary distance valuesfrom each node to each node in the system), and should also prefer nodeswith no CPUs, since presumably they’ll have very little allocation pressureon them otherwise.
Return
node id of the found node orNUMA_NO_NODE if no node is found.
- voidsetup_per_zone_wmarks(void)¶
called when min_free_kbytes changes or when memory is hot-{added|removed}
Parameters
voidno arguments
Description
Ensures that the watermark[min,low,high] values for each zone are setcorrectly with respect to min_free_kbytes.
- intalloc_contig_range(unsignedlongstart,unsignedlongend,acr_flags_talloc_flags,gfp_tgfp_mask)¶
tries to allocate given range of pages
Parameters
unsignedlongstartstart PFN to allocate
unsignedlongendone-past-the-last PFN to allocate
acr_flags_talloc_flagsallocation information
gfp_tgfp_maskGFP mask. Node/zone/placement hints are ignored; only someaction and reclaim modifiers are supported. Reclaim modifierscontrol allocation behavior during compaction/migration/reclaim.
Description
The PFN range does not have to be pageblock aligned. The PFN range mustbelong to a single zone.
The first thing this routine does is attempt to MIGRATE_ISOLATE allpageblocks in the range. Once isolated, the pageblocks should notbe modified by others.
Return
zero on success or negative error code. On success allpages which PFN is in [start, end) are allocated for the caller andneed to be freed withfree_contig_range().
- structpage*alloc_contig_pages(unsignedlongnr_pages,gfp_tgfp_mask,intnid,nodemask_t*nodemask)¶
tries to find and allocate contiguous range of pages
Parameters
unsignedlongnr_pagesNumber of contiguous pages to allocate
gfp_tgfp_maskGFP mask. Node/zone/placement hints limit the search; only someaction and reclaim modifiers are supported. Reclaim modifierscontrol allocation behavior during compaction/migration/reclaim.
intnidTarget node
nodemask_t*nodemaskMask for other possible nodes
Description
This routine is a wrapper aroundalloc_contig_range(). It scans over zoneson an applicable zonelist to find a contiguous pfn range which can then betried for allocation withalloc_contig_range(). This routine is intendedfor allocation requests which can not be fulfilled with the buddy allocator.
The allocated memory is always aligned to a page boundary. If nr_pages is apower of two, then allocated range is also guaranteed to be aligned to samenr_pages (e.g. 1GB request would be aligned to 1GB).
Allocated pages can be freed withfree_contig_range() or by manually calling__free_page() on each allocated page.
Return
pointer to contiguous pages on success, or NULL if not successful.
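Example
A sketch of allocating and releasing a contiguous run of pages; nr_pages and nid are caller-chosen values:
    struct page *page = alloc_contig_pages(nr_pages, GFP_KERNEL, nid, NULL);

    if (!page)
        return -ENOMEM;

    /* ... use the nr_pages physically contiguous pages ... */

    free_contig_range(page_to_pfn(page), nr_pages);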
- structpage*alloc_pages_nolock(gfp_tgfp_flags,intnid,unsignedintorder)¶
opportunistic reentrant allocation from any context
Parameters
gfp_tgfp_flagsGFP flags. Only __GFP_ACCOUNT allowed.
intnidnode to allocate from
unsignedintorderallocation order size
Description
Allocates pages of a given order from the given node. This is safe to call from any context (from atomic, NMI, and also reentrant allocator -> tracepoint -> alloc_pages_nolock_noprof). Allocation is best effort and is expected to fail easily, so nobody should rely on its success. Failures are not reported via warn_alloc(). See always fail conditions below.
Return
allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.It means ENOMEM. There is no reason to call it again and expect !NULL.
- intnuma_nearest_node(intnode,unsignedintstate)¶
Find nearest node by state
Parameters
intnodeNode id to start the search
unsignedintstateState to filter the search
Description
Lookup the closest node by distance ifnid is not in state.
Return
thisnode if it is in state, otherwise the closest node by distance
- intnearest_node_nodemask(intnode,nodemask_t*mask)¶
Find the node inmask at the nearest distance fromnode.
Parameters
intnodea valid node ID to start the search from.
nodemask_t*maska pointer to a nodemask representing the allowed nodes.
Description
This function iterates over all nodes inmask and calculates thedistance from the startingnode, then it returns the node ID that isthe closest tonode, or MAX_NUMNODES if no node is found.
Note thatnode must be a valid node ID usable withnode_distance(),providing an invalid node ID (e.g., NUMA_NO_NODE) may result in crashesor unexpected behavior.
- boolfolio_can_map_prot_numa(structfolio*folio,structvm_area_struct*vma,boolis_private_single_threaded)¶
check whether the folio can map prot numa
Parameters
structfolio*folioThe folio whose mapping considered for being made NUMA hintable
structvm_area_struct*vmaThe VMA that the folio belongs to.
boolis_private_single_threadedIs this a single-threaded private VMA or not
Description
This function checks whether the folio actually indicates that we need to make the mapping one which causes a NUMA hinting fault, as there are cases where it is simply unnecessary. The folio's access time is also adjusted for memory tiering if prot_numa is needed.
Return
True if the mapping of the folio needs to be changed, false otherwise.
- structpage*alloc_pages_mpol(gfp_tgfp,unsignedintorder,structmempolicy*pol,pgoff_tilx,intnid)¶
Allocate pages according to NUMA mempolicy.
Parameters
gfp_tgfpGFP flags.
unsignedintorderOrder of the page allocation.
structmempolicy*polPointer to the NUMA mempolicy.
pgoff_tilxIndex for interleave mempolicy (also distinguishes alloc_pages()).
intnidPreferred node (usually numa_node_id() but mpol may override it).
Return
The page on success or NULL if allocation fails.
- structfolio*vma_alloc_folio(gfp_tgfp,intorder,structvm_area_struct*vma,unsignedlongaddr)¶
Allocate a folio for a VMA.
Parameters
gfp_tgfpGFP flags.
intorderOrder of the folio.
structvm_area_struct*vmaPointer to VMA.
unsignedlongaddrVirtual address of the allocation. Must be insidevma.
Description
Allocate a folio for a specific address invma, using the appropriateNUMA policy. The caller must hold the mmap_lock of the mm_struct of theVMA to prevent it from going away. Should be used for all allocationsfor folios that will be mapped into user space, excepting hugetlbfs, andexcepting where direct use offolio_alloc_mpol() is more appropriate.
Return
The folio on success or NULL if allocation fails.
- structpage*alloc_pages(gfp_tgfp,unsignedintorder)¶
Allocate pages.
Parameters
gfp_tgfpGFP flags.
unsignedintorderPower of two of number of pages to allocate.
Description
Allocate 1 <<order contiguous pages. The physical address of thefirst page is naturally aligned (eg an order-3 allocation will be alignedto a multiple of 8 * PAGE_SIZE bytes). The NUMA policy of the currentprocess is honoured when in process context.
Context
Can be called from any context, providing the appropriate GFPflags are used.
Return
The page on success or NULL if allocation fails.
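Example
A sketch of an order-2 allocation (four contiguous pages) and its release:
    struct page *page = alloc_pages(GFP_KERNEL, 2);

    if (!page)
        return -ENOMEM;

    /* ... page_address(page) gives the kernel virtual address ... */

    __free_pages(page, 2);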
- intmpol_misplaced(structfolio*folio,structvm_fault*vmf,unsignedlongaddr)¶
check whether current folio node is valid in policy
Parameters
structfolio*foliofolio to be checked
structvm_fault*vmfstructure describing the fault
unsignedlongaddrvirtual address invma for shared policy lookup and interleave policy
Description
Lookup current policy node id for vma,addr and “compare to” folio’snode id. Policy determination “mimics”alloc_page_vma().Called from fault path where we know the vma and faulting address.
Return
NUMA_NO_NODE if the page is in a node that is valid for thispolicy, or a suitable node ID to allocate a replacement folio from.
- voidmpol_shared_policy_init(structshared_policy*sp,structmempolicy*mpol)¶
initialize shared policy for inode
Parameters
structshared_policy*sppointer to inode shared policy
structmempolicy*mpolstructmempolicyto install
Description
Install non-NULLmpol in inode’s shared policy rb-tree.On entry, the current task has a reference on a non-NULLmpol.This must be released on exit.This is called atget_inode() calls and we can use GFP_KERNEL.
- intmpol_parse_str(char*str,structmempolicy**mpol)¶
parse string to mempolicy, for tmpfs mpol mount option.
Parameters
char*strstring containing mempolicy to parse
structmempolicy**mpolpointer to struct mempolicy pointer, returned on success.
Description
- Format of input:
<mode>[=<flags>][:<nodelist>]
Return
0 on success, else 1
- voidmpol_to_str(char*buffer,intmaxlen,structmempolicy*pol)¶
format a mempolicy structure for printing
Parameters
char*bufferto contain formatted mempolicy string
intmaxlenlength ofbuffer
structmempolicy*polpointer to mempolicy to be formatted
Description
Convert pol into a string. If buffer is too short, truncate the string. Recommend a maxlen of at least 51 for the longest mode, "weighted interleave", plus the longest flags, "relative|balancing", and to display at least a few node ids.
- typesoftleaf_t¶
Describes a page table software leaf entry, abstracted from its architecture-specific encoding.
Description
Page table leaf entries are those which do not reference any descendant page tables, but rather either reference a data page, are empty (a 'none' entry), or contain a non-present entry.
If an entry references another page table or a data page, then the page table entry is pertinent to hardware; that is, it tells the hardware how to decode the page table entry.
Otherwise it is a software-defined leaf page table entry, which this typedescribes. See leafops.h and specificallysoftleaf_type for a list of allpossible kinds of software leaf entry.
A softleaf_t entry is abstracted from the hardware page table entry, so isnot architecture-specific.
NOTE
While we transition from the confusing swp_entry_t type used for this purpose, we simply alias this type. This will be removed once the transition is complete.
- structfolio¶
Represents a contiguous set of bytes.
Definition:
struct folio {
    memdesc_flags_t flags;
    union {
        struct list_head lru;
        unsigned int mlock_count;
        struct dev_pagemap *pgmap;
    };
    struct address_space *mapping;
    union {
        pgoff_t index;
        unsigned long share;
    };
    union {
        void *private;
        swp_entry_t swap;
    };
    atomic_t _mapcount;
    atomic_t _refcount;
#ifdef CONFIG_MEMCG
    unsigned long memcg_data;
#elif defined(CONFIG_SLAB_OBJ_EXT)
    unsigned long _unused_slab_obj_exts;
#endif
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;
#endif
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif
    atomic_t _large_mapcount;
    atomic_t _nr_pages_mapped;
#ifdef CONFIG_64BIT
    atomic_t _entire_mapcount;
    atomic_t _pincount;
#endif
    mm_id_mapcount_t _mm_id_mapcount[2];
    union {
        mm_id_t _mm_id[2];
        unsigned long _mm_ids;
    };
#ifdef NR_PAGES_IN_LARGE_FOLIO
    unsigned int _nr_pages;
#endif
    struct list_head _deferred_list;
#ifndef CONFIG_64BIT
    atomic_t _entire_mapcount;
    atomic_t _pincount;
#endif
    void *_hugetlb_subpool;
    void *_hugetlb_cgroup;
    void *_hugetlb_cgroup_rsvd;
    void *_hugetlb_hwpoison;
};
Members
flagsIdentical to the page flags.
{unnamed_union}anonymous
lruLeast Recently Used list; tracks how recently this folio was used.
mlock_countNumber of times this folio has been pinned by mlock().
pgmapMetadata for ZONE_DEVICE mappings
mappingThe file this page belongs to, or refers to the anon_vma for anonymous memory.
{unnamed_union}anonymous
indexOffset within the file, in units of pages. For anonymous memory, this is the index from the beginning of the mmap.
sharenumber of DAX mappings that reference this folio. See dax_associate_entry.
{unnamed_union}anonymous
privateFilesystem per-folio data (see folio_attach_private()).
swapUsed for swp_entry_t if folio_test_swapcache().
_mapcountDo not access this member directly. Use folio_mapcount() to find out how many times this folio is mapped by userspace.
_refcountDo not access this member directly. Use folio_ref_count() to find how many references there are to this folio.
memcg_dataMemory Control Group data.
_unused_slab_obj_extsPlaceholder to match obj_exts in struct slab.
virtualVirtual address in the kernel direct map.
_last_cpupidIDs of last CPU and last process that accessed the folio.
_large_mapcountDo not use directly, call folio_mapcount().
_nr_pages_mappedDo not use outside of rmap and debug code.
_entire_mapcountDo not use directly, call folio_entire_mapcount().
_pincountDo not use directly, call folio_maybe_dma_pinned().
_mm_id_mapcountDo not use outside of rmap code.
{unnamed_union}anonymous
_mm_idDo not use outside of rmap code.
_mm_idsDo not use outside of rmap code.
_nr_pagesDo not use directly, call folio_nr_pages().
_deferred_listFolios to be split under memory pressure.
_entire_mapcountDo not use directly, call folio_entire_mapcount().
_pincountDo not use directly, call folio_maybe_dma_pinned().
_hugetlb_subpoolDo not use directly, use accessor in hugetlb.h.
_hugetlb_cgroupDo not use directly, use accessor in hugetlb_cgroup.h.
_hugetlb_cgroup_rsvdDo not use directly, use accessor in hugetlb_cgroup.h.
_hugetlb_hwpoisonDo not use directly, call raw_hwp_list_head().
Description
A folio is a physically, virtually and logically contiguous setof bytes. It is a power-of-two in size, and it is aligned to thatsame power-of-two. It is at least as large asPAGE_SIZE. If it isin the page cache, it is at a file offset which is a multiple of thatpower-of-two. It may be mapped into userspace at an address which isat an arbitrary page offset, but its kernel virtual address is alignedto its size.
- structptdesc¶
Memory descriptor for page tables.
Definition:
struct ptdesc {
    memdesc_flags_t pt_flags;
    union {
        struct rcu_head pt_rcu_head;
        struct list_head pt_list;
        struct {
            unsigned long _pt_pad_1;
            pgtable_t pmd_huge_pte;
        };
    };
    unsigned long __page_mapping;
    union {
        pgoff_t pt_index;
        struct mm_struct *pt_mm;
        atomic_t pt_frag_refcount;
#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
        atomic_t pt_share_count;
#endif
    };
    union {
        unsigned long _pt_pad_2;
#if ALLOC_SPLIT_PTLOCKS
        spinlock_t *ptl;
#else
        spinlock_t ptl;
#endif
    };
    unsigned int __page_type;
    atomic_t __page_refcount;
#ifdef CONFIG_MEMCG
    unsigned long pt_memcg_data;
#endif
};
Members
pt_flagsenum pt_flags plus zone/node/section.
{unnamed_union}anonymous
pt_rcu_headFor freeing page table pages.
pt_listList of used page tables. Used for s390 gmap shadow pages(which are not linked into the user page tables) and x86pgds.
{unnamed_struct}anonymous
_pt_pad_1Padding that aliases with page’s compound head.
pmd_huge_pteProtected by ptdesc->ptl, used for THPs.
__page_mappingAliases with page->mapping. Unused for page tables.
{unnamed_union}anonymous
pt_indexUsed for s390 gmap.
pt_mmUsed for x86 pgds.
pt_frag_refcountFor fragmented page table tracking. Powerpc only.
pt_share_countUsed for HugeTLB PMD page table share count.
{unnamed_union}anonymous
_pt_pad_2Padding to ensure proper alignment.
ptlLock for the page table.
ptlLock for the page table.
__page_typeSame as page->page_type. Unused for page tables.
__page_refcountSame as page refcount.
pt_memcg_dataMemcg data. Tracked for page tables here.
Description
Thisstructoverlaysstructpage for now. Do not modify without a goodunderstanding of the issues.
- typevm_fault_t¶
Return type for page fault handlers.
Description
Page fault handlers return a bitmask ofVM_FAULT values.
- enumvm_fault_reason¶
Page fault handlers return a bitmask of these values to tell the core VM what happened when handling the fault. Used to decide whether a process gets delivered SIGBUS or just gets major/minor fault counters bumped up.
Constants
VM_FAULT_OOMOut Of Memory
VM_FAULT_SIGBUSBad access
VM_FAULT_MAJORPage read from storage
VM_FAULT_HWPOISONHit poisoned small page
VM_FAULT_HWPOISON_LARGEHit poisoned large page. Index encodedin upper bits
VM_FAULT_SIGSEGVsegmentation fault
VM_FAULT_NOPAGE->fault installed the pte, not return page
VM_FAULT_LOCKED->fault locked the returned page
VM_FAULT_RETRY->fault blocked, must retry
VM_FAULT_FALLBACKhuge page fault failed, fall back to small
VM_FAULT_DONE_COW->fault has fully handled COW
VM_FAULT_NEEDDSYNC->fault did not modify page tables and needs fsync() to complete (for synchronous page faults in DAX)
VM_FAULT_COMPLETED->fault completed, meanwhile mmap lock released
VM_FAULT_HINDEX_MASKmask HINDEX value
- enumfault_flag¶
Fault flag definitions.
Constants
FAULT_FLAG_WRITEFault was a write fault.
FAULT_FLAG_MKWRITEFault was mkwrite of existing PTE.
FAULT_FLAG_ALLOW_RETRYAllow to retry the fault if blocked.
FAULT_FLAG_RETRY_NOWAITDon’t drop mmap_lock and wait when retrying.
FAULT_FLAG_KILLABLEThe fault task is in SIGKILL killable region.
FAULT_FLAG_TRIEDThe fault has been tried once.
FAULT_FLAG_USERThe fault originated in userspace.
FAULT_FLAG_REMOTEThe fault is not for current task/mm.
FAULT_FLAG_INSTRUCTIONThe fault was during an instruction fetch.
FAULT_FLAG_INTERRUPTIBLEThe fault can be interrupted by non-fatal signals.
FAULT_FLAG_UNSHAREThe fault is an unsharing request to break COW in aCOW mapping, making sure that an exclusive anon page ismapped after the fault.
FAULT_FLAG_ORIG_PTE_VALIDwhether the fault has vmf->orig_pte cached.We should only access orig_pte if this flag set.
FAULT_FLAG_VMA_LOCKThe fault is handled under VMA lock.
Description
AboutFAULT_FLAG_ALLOW_RETRY andFAULT_FLAG_TRIED: we can specifywhether we would allow page faults to retry by specifying these twofault flags correctly. Currently there can be three legal combinations:
- ALLOW_RETRY and !TRIED: this means the page fault allows retry, and this is the first try
- ALLOW_RETRY and TRIED: this means the page fault allows retry, and we've already tried at least once
- !ALLOW_RETRY and !TRIED: this means the page fault does not allow retry
The unlisted combination (!ALLOW_RETRY && TRIED) is illegal and should neverbe used. Note that page faults can be allowed to retry for multiple times,in which case we’ll have an initial fault with flags (a) then later oncontinuous faults with flags (b). We should always try to detect pendingsignals before a retry to make sure the continuous page faults can still beinterrupted if necessary.
The combination FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE is illegal.FAULT_FLAG_UNSHARE is ignored and treated like an ordinary read fault whenapplied to mappings that are not COW mappings.
Parameters
conststructfolio*folioThe folio to test.
Description
We would like to get this info without a page flag, but the stateneeds to survive until the folio is last deleted from the LRU, whichcould be as far down as __page_cache_release.
Return
An integer (not a boolean!) used to sort a folio onto the right LRU list and to account folios correctly.
1 if folio is a regular filesystem backed page cache folio or a lazily freed anonymous folio (e.g. via MADV_FREE).
0 if folio is a normal anonymous folio, a tmpfs folio or otherwise ram or swap backed folio.
Parameters
structfolio*folioThe folio that was on lru and now has a zero reference.
Parameters
conststructfolio*folioThe folio to test.
Return
The LRU list a folio should be on, as an indexinto the array of LRU lists.
- size_tnum_pages_contiguous(structpage**pages,size_tnr_pages)¶
determine the number of contiguous pages that represent contiguous PFNs
Parameters
structpage**pagesan array of page pointers
size_tnr_pageslength of the array, at least 1
Description
Determine the number of contiguous pages that represent contiguous PFNsinpages, starting from the first page.
In some kernel configs, contiguous PFNs will not have contiguous struct pages. In these configurations, num_pages_contiguous() will return a number smaller than the ideal number. The caller should continue to check for pfn contiguity after each call to num_pages_contiguous().
Returns the number of contiguous pages.
- page_folio¶
page_folio(p)
Converts from page to folio.
Parameters
pThe page.
Description
Every page is part of a folio. This function cannot be called on aNULL pointer.
Context
No reference, nor lock is required onpage. If the callerdoes not hold a reference, this call may race with a folio split, soit should re-check the folio still contains this page after gaininga reference on the folio.
Return
The folio which contains this page.
- folio_page¶
folio_page(folio,n)
Return a page from a folio.
Parameters
folioThe folio.
nThe page number to return.
Description
n is relative to the start of the folio. This function does notcheck that the page number lies withinfolio; the caller is presumedto have a reference to the page.
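Example
A sketch of the page/folio round trip, assuming the caller holds a reference that keeps the folio from being split:
    struct folio *folio = page_folio(page);
    struct page *head = folio_page(folio, 0);    /* first page of the folio */
    struct page *same = folio_page(folio, folio_page_idx(folio, page));    /* back to page */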
Parameters
structfolio*folioThe folio.
unsignedlongmaskBits set in this word will be changed.
Description
This must only be used for flags which are changed with the foliolock held. For example, it is unsafe to use for PG_dirty as thatcan be set without the folio lock held. It can also only be usedon flags which are in the range 0-6 as some of the implementationsonly affect those bits.
Return
Whether there are tasks waiting on the folio.
Parameters
conststructfolio*folioThe folio.
Description
The uptodate flag is set on a folio when every byte in the folio isat least as new as the corresponding bytes on storage. Anonymousand CoW folios are always uptodate. If the folio is not uptodate,some of the bytes in it may be; see theis_partially_uptodate()address_space operation.
Parameters
conststructfolio*folioThe folio to test.
Return
True if the folio is larger than one page.
Parameters
conststructpage*pageThe page to test.
Context
Any context.
Return
True for hugetlbfs pages, false for anon pages or pagesbelonging to other filesystems.
Parameters
conststructpage*pageThe page to test.
Description
Test whether this is a movable_ops page. Such pages will stay thatway until freed.
Returns true if this is a movable_ops page, otherwise false.
Parameters
conststructfolio*folioThe folio to be checked
Description
Determine if a folio has private stuff, indicating that release routinesshould be invoked upon it.
- unsignedlongfolio_page_idx(conststructfolio*folio,conststructpage*page)¶
Return the number of a page in a folio.
Parameters
conststructfolio*folioThe folio.
conststructpage*pageThe folio page.
Description
This function expects that the page is actually part of the folio.The returned number is relative to the start of the folio.
- typevma_flag_t¶
specifies an individual VMA flag by bit number.
Description
This value is made type safe by sparse to avoid passing invalid flag valuesaround.
- boolfault_flag_allow_retry_first(enumfault_flagflags)¶
check ALLOW_RETRY the first time
Parameters
enumfault_flagflagsFault flags.
Description
This is mostly used for places where we want to try to avoid takingthe mmap_lock for too long a time when waiting for another conditionto change, in which case we can try to be polite to release themmap_lock in the first round to avoid potential starvation of otherprocesses that would also want the mmap_lock.
Return
true if the page fault allows retry and this is the firstattempt of the fault handling; false otherwise.
Parameters
conststructfolio*folioThe folio.
Description
A folio is composed of 2^order pages. Seeget_order() for the definitionof order.
Return
The order of the folio.
Parameters
structfolio*folioThe folio.
Description
Reset the order and derived _nr_pages to 0. Must only be used in theprocess of splitting large folios.
Parameters
conststructfolio*folioThe folio.
Description
The folio mapcount corresponds to the number of present user page table entries that reference any part of a folio. Each such present user page table entry must be paired with exactly one folio reference.
For ordinary folios, each user page table entry (PTE/PMD/PUD/...) counts exactly once.
For hugetlb folios, each abstracted “hugetlb” user page table entry thatreferences the entire folio counts exactly once, even when such specialpage table entries are comprised of multiple ordinary page table entries.
Will report 0 for pages which cannot be mapped into userspace, such asslab, page tables and similar.
Return
The number of times this folio is mapped.
Parameters
conststructfolio*folioThe folio.
Return
True if any page in this folio is referenced by user page tables.
Parameters
structpage*pageHead page of a transparent huge page.
Parameters
structpage*pageHead page of a transparent huge page.
Return
Number of bytes in this page.
Parameters
structfolio*folioThe folio.
Context
May be called in any context, as long as you know thatyou have a refcount on the folio. If you do not already have one,folio_try_get() may be the right interface for you to use.
Parameters
structfolio*folioThe folio.
Description
If the folio’s reference count reaches zero, the memory will bereleased back to the page allocator and may be used by anotherallocation immediately. Do not access the memory or thestructfolioafter callingfolio_put() unless you can be sure that it wasn’t thelast reference.
Context
May be called in process or interrupt context, but not in NMIcontext. May be called while holding a spinlock.
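Example
The usual pairing, shown as a sketch: take an extra reference for as long as the folio is used, then drop it:
    folio_get(folio);    /* caller must already hold a reference */
    /* ... operate on the folio ... */
    folio_put(folio);    /* may free the folio if this was the last reference */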
Parameters
structfolio*folioThe folio.
intrefsThe amount to subtract from the folio’s reference count.
Description
If the folio’s reference count reaches zero, the memory will bereleased back to the page allocator and may be used by anotherallocation immediately. Do not access the memory or thestructfolioafter callingfolio_put_refs() unless you can be sure that these weren’tthe last references.
Context
May be called in process or interrupt context, but not in NMIcontext. May be called while holding a spinlock.
- voidfolios_put(structfolio_batch*folios)¶
Decrement the reference count on an array of folios.
Parameters
structfolio_batch*foliosThe folios.
Description
Likefolio_put(), but for a batch of folios. This is more efficientthan writing the loop yourself as it will optimise the locks which needto be taken if the folios are freed. The folios batch is returnedempty and ready to be reused for another batch; there is no need toreinitialise it.
Context
May be called in process or interrupt context, but not in NMIcontext. May be called while holding a spinlock.
Parameters
conststructfolio*folioThe folio.
Description
A folio may contain multiple pages. The pages have consecutivePage Frame Numbers.
Return
The Page Frame Number of the first page in the folio.
Parameters
conststructfolio*folioThe folio to create a PTE for
pgprot_tpgprotThe page protection bits to use
Description
Create a page table entry for the first page of this folio.This is suitable for passing toset_ptes().
Return
A page table entry suitable for mapping this folio.
Parameters
conststructfolio*folioThe folio to create a PMD for
pgprot_tpgprotThe page protection bits to use
Description
Create a page table entry for the first page of this folio.This is suitable for passing toset_pmd_at().
Return
A page table entry suitable for mapping this folio.
Parameters
conststructfolio*folioThe folio to create a PUD for
pgprot_tpgprotThe page protection bits to use
Description
Create a page table entry for the first page of this folio.This is suitable for passing toset_pud_at().
Return
A page table entry suitable for mapping this folio.
Parameters
structfolio*folioThe folio.
Description
This function checks if a folio has been pinned via a call toa function in thepin_user_pages() family.
For small folios, the return value is partially fuzzy: false is not fuzzy,because it means “definitely not pinned for DMA”, but true means “probablypinned for DMA, but possibly a false positive due to having at leastGUP_PIN_COUNTING_BIAS worth of normal folio references”.
False positives are OK, because: a) it’s unlikely for a folio toget that many refcounts, and b) all the callers of this routine areexpected to be able to deal gracefully with a false positive.
For most large folios, the result will be exactly correct. That’s becausewe have more tracking data available: the _pincount field is usedinstead of the GUP_PIN_COUNTING_BIAS scheme.
For more information, please seepin_user_pages() and related calls.
Return
True, if it is likely that the folio has been “dma-pinned”.False, if the folio is definitely not dma-pinned.
Parameters
conststructpage*pageThe page to query
Description
This returns true ifpage is one of the permanent zero pages.
Parameters
conststructfolio*folioThe folio to query
Description
This returns true iffolio is one of the permanent zero pages.
Parameters
conststructfolio*folioThe folio.
Return
A positive power of two.
Parameters
structfolio*folioThe folio we’re currently operating on.
Description
If you have physically contiguous memory which may span more thanone folio (eg astructbio_vec), use this function to move from onefolio to the next. Do not use it if the memory is only virtuallycontiguous as the folios are almost certainly not adjacent to eachother. This is the folio equivalent to writingpage++.
Context
We assume that the folios are refcounted and/or locked at ahigher level and do not adjust the reference counts.
Return
The nextstructfolio.
Parameters
conststructfolio*folioThe folio.
Description
A folio represents a number of bytes which is a power-of-two in size.This function tells you which power-of-two the folio is. See alsofolio_size() andfolio_order().
Context
The caller should have a reference on the folio to preventit from being split. It is not necessary for the folio to be locked.
Return
The base-2 logarithm of the size of this folio.
Parameters
conststructfolio*folioThe folio.
Context
The caller should have a reference on the folio to preventit from being split. It is not necessary for the folio to be locked.
Return
The number of bytes in this folio.
- boolfolio_maybe_mapped_shared(structfolio*folio)¶
Whether the folio is mapped into the page tables of more than one MM
Parameters
structfolio*folioThe folio.
Description
This function checks if the folio maybe currently mapped into more than oneMM (“maybe mapped shared”), or if the folio is certainly mapped into a singleMM (“mapped exclusively”).
For KSM folios, this function also returns “mapped shared” when a folio ismapped multiple times into the same MM, because the individual page mappingsare independent.
For small anonymous folios and anonymous hugetlb folios, the returnvalue will be exactly correct: non-KSM folios can only be mapped at most onceinto an MM, and they cannot be partially mapped. KSM folios areconsidered shared even if mapped multiple times into the same MM.
- For other folios, the result can be fuzzy:
For partially-mappable large folios (THP), the return value can wronglyindicate “mapped shared” (false positive) if a folio was mapped bymore than two MMs at one point in time.
For pagecache folios (including hugetlb), the return value can wronglyindicate “mapped shared” (false positive) when two VMAs in the same MMcover the same file range.
Further, this function only considers current page table mappings thatare tracked using the folio mapcount(s).
- This function does not consider:
If the folio might get mapped in the (near) future (e.g., swapcache,pagecache, temporary unmapping for migration).
If the folio is mapped differently (VM_PFNMAP).
If hugetlb page table sharing applies. Callers might want to check hugetlb_pmd_shared().
Return
Whether the folio is estimated to be mapped into more than one MM.
Parameters
conststructfolio*foliothe folio
Description
Calculate the expected folio refcount, taking references from the pagecache,swapcache, PG_private and page table mappings into account. Useful incombination withfolio_ref_count() to detect unexpected references (e.g.,GUP or other temporary references).
Does currently not consider references from the LRU cache. If the foliowas isolated from the LRU (which is the case during migration or split),the LRU cache does not apply.
Calling this function on an unmapped folio -- !folio_mapped() -- that islocked will return a stable result.
Calling this function on a mapped folio will not result in a stable result,because nothing stops additional page table mappings from coming (e.g.,fork()) or going (e.g.,munmap()).
Calling this function without the folio lock will also not result in astable result: for example, the folio might get dropped from the swapcacheconcurrently.
However, even when called without the folio lock or on a mapped folio,this function can be used to detect unexpected references early (for example,if it makes sense to even lock the folio and unmap it).
The caller must add any reference (e.g., fromfolio_try_get()) it might beholding itself to the result.
Returns the expected folio refcount.
Parameters
conststructptdesc*ptPage table descriptor.
Return
The first byte of the page table described bypt.
Parameters
structptdesc*ptdescThe ptdesc to be marked
Description
Kernel page tables often need special handling. Set a flag so thatthe handling code knows this ptdesc will not be used for userspace.
Parameters
structptdesc*ptdescThe ptdesc to be unmarked
Description
Use when the ptdesc is no longer used to map the kernel and no longerneeds special handling.
Parameters
conststructptdesc*ptdescThe ptdesc being tested
Description
Call to tell if the ptdesc is used to map the kernel.
Parameters
gfp_tgfpGFP flags
unsignedintorderdesired pagetable order
Description
pagetable_alloc allocates memory for page tables as well as a page tabledescriptor to describe that memory.
Return
The ptdesc describing the allocated page tables.
Parameters
structptdesc*ptThe page table descriptor
Description
pagetable_free frees the memory of all page tables described by a pagetable descriptor and the memory for the descriptor itself.
- structvm_area_struct*vma_lookup(structmm_struct*mm,unsignedlongaddr)¶
Find a VMA at a specific address
Parameters
structmm_struct*mmThe process address space.
unsignedlongaddrThe user address.
Return
The vm_area_struct at the given address,NULL otherwise.
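Example
A sketch of a lookup under the mmap read lock, which should be held so the result stays stable while it is used:
    struct vm_area_struct *vma;

    mmap_read_lock(mm);
    vma = vma_lookup(mm, addr);
    if (vma) {
        /* ... addr lies within [vma->vm_start, vma->vm_end) ... */
    }
    mmap_read_unlock(mm);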
- voidmmap_action_remap(structvm_area_desc*desc,unsignedlongstart,unsignedlongstart_pfn,unsignedlongsize)¶
helper for mmap_prepare hook to specify that a pure PFN remap is required.
Parameters
structvm_area_desc*descThe VMA descriptor for the VMA requiring remap.
unsignedlongstartThe virtual address to start the remap from, must be within the VMA.
unsignedlongstart_pfnThe first PFN in the range to remap.
unsignedlongsizeThe size of the range to remap, in bytes, at most spanning to the endof the VMA.
- voidmmap_action_remap_full(structvm_area_desc*desc,unsignedlongstart_pfn)¶
helper for mmap_prepare hook to specify that the entirety of a VMA should be PFN remapped.
Parameters
structvm_area_desc*descThe VMA descriptor for the VMA requiring remap.
unsignedlongstart_pfnThe first PFN in the range to remap.
- voidmmap_action_ioremap(structvm_area_desc*desc,unsignedlongstart,unsignedlongstart_pfn,unsignedlongsize)¶
helper for mmap_prepare hook to specify that a pure PFN I/O remap is required.
Parameters
structvm_area_desc*descThe VMA descriptor for the VMA requiring remap.
unsignedlongstartThe virtual address to start the remap from, must be within the VMA.
unsignedlongstart_pfnThe first PFN in the range to remap.
unsignedlongsizeThe size of the range to remap, in bytes, at most spanning to the endof the VMA.
- voidmmap_action_ioremap_full(structvm_area_desc*desc,unsignedlongstart_pfn)¶
helper for mmap_prepare hook to specify that the entirety of a VMA should be PFN I/O remapped.
Parameters
structvm_area_desc*descThe VMA descriptor for the VMA requiring remap.
unsignedlongstart_pfnThe first PFN in the range to remap.
- boolvma_is_special_huge(conststructvm_area_struct*vma)¶
Are transhuge page-table entries considered special?
Parameters
conststructvm_area_struct*vmaPointer to the struct vm_area_struct to consider
Description
Whether transhuge page-table entries are considered “special” followingthe definition invm_normal_page().
Return
true if transhuge page-table entries should be considered special,false otherwise.
Parameters
conststructfolio*folioThe folio.
Description
The refcount is usually incremented by calls tofolio_get() anddecremented by calls tofolio_put(). Some typical users of thefolio refcount:
Each reference from a page table
The page cache
Filesystem private data
The LRU list
Pipes
Direct IO which references this page in the process address space
Return
The number of references to this folio.
Parameters
structfolio*folioThe folio.
Description
If you do not already have a reference to a folio, you can attempt toget one using this function. It may fail if, for example, the foliohas been freed since you found a pointer to it, or it is frozen forthe purposes of splitting or migration.
Return
True if the reference count was successfully incremented.
- intis_highmem(conststructzone*zone)¶
helper function to quickly check if a struct zone is a highmem zone or not. This is an attempt to keep references to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum.
Parameters
conststructzone*zonepointer to struct zone variable
Return
1 for a highmem zone, 0 otherwise
- for_each_online_pgdat¶
for_each_online_pgdat(pgdat)
helper macro to iterate over all online nodes
Parameters
pgdatpointer to a pg_data_t variable
- for_each_zone¶
for_each_zone(zone)
helper macro to iterate over all memory zones
Parameters
zonepointer to struct zone variable
Description
The user only needs to declare the zone variable, for_each_zonefills it in.
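Example
A sketch that walks every zone, skipping those with no present pages:
    struct zone *zone;

    for_each_zone(zone) {
        if (!populated_zone(zone))
            continue;
        /* ... inspect or account the zone ... */
    }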
- structzoneref*next_zones_zonelist(structzoneref*z,enumzone_typehighest_zoneidx,nodemask_t*nodes)¶
Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
Parameters
structzoneref*zThe cursor used as a starting point for the search
enumzone_typehighest_zoneidxThe zone index of the highest zone to return
nodemask_t*nodesAn optional nodemask to filter the zonelist with
Description
This function returns the next zone at or below a given zone index that iswithin the allowed nodemask using a cursor as the starting point for thesearch. The zoneref returned is a cursor that represents the current zonebeing examined. It should be advanced by one before callingnext_zones_zonelist again.
Return
the next zone at or below highest_zoneidx within the allowednodemask using a cursor within a zonelist as a starting point
- structzoneref*first_zones_zonelist(structzonelist*zonelist,enumzone_typehighest_zoneidx,nodemask_t*nodes)¶
Returns the first zone at or below highest_zoneidx within the allowed nodemask in a zonelist
Parameters
structzonelist*zonelistThe zonelist to search for a suitable zone
enumzone_typehighest_zoneidxThe zone index of the highest zone to return
nodemask_t*nodesAn optional nodemask to filter the zonelist with
Description
This function returns the first zone at or below a given zone index that iswithin the allowed nodemask. The zoneref returned is a cursor that can beused to iterate the zonelist with next_zones_zonelist by advancing it byone before calling.
When no eligible zone is found, zoneref->zone is NULL (zoneref itself isnever NULL). This may happen either genuinely, or due to concurrent nodemaskupdate due to cpuset modification.
Return
Zoneref pointer for the first suitable zone found
- for_each_zone_zonelist_nodemask¶
for_each_zone_zonelist_nodemask(zone,z,zlist,highidx,nodemask)
helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
Parameters
zoneThe current zone in the iterator
zThe current pointer within zonelist->_zonerefs being iterated
zlistThe zonelist being iterated
highidxThe zone index of the highest zone to return
nodemaskNodemask allowed by the allocator
Description
This iterator iterates through all zones at or below a given zone index and within a given nodemask.
- for_each_zone_zonelist¶
for_each_zone_zonelist(zone,z,zlist,highidx)
helper macro to iterate over valid zones in a zonelist at or below a given zone index
Parameters
zoneThe current zone in the iterator
zThe current pointer within zonelist->zones being iterated
zlistThe zonelist being iterated
highidxThe zone index of the highest zone to return
Description
This iterator iterates through all zones at or below a given zone index.
- intpfn_valid(unsignedlongpfn)¶
check if there is a valid memory map entry for a PFN
Parameters
unsignedlongpfnthe page frame number to check
Description
Check if there is a valid memory map entry akastructpage for thepfn.Note, that availability of the memory map entry does not imply thatthere is actual usable memory at thatpfn. Thestructpage mayrepresent a hole or an unusable page frame.
Return
1 for PFNs that have memory map entries and 0 otherwise
- structaddress_space*folio_mapping(conststructfolio*folio)¶
Find the mapping where this folio is stored.
Parameters
conststructfolio*folioThe folio.
Description
For folios which are in the page cache, return the mapping that this page belongs to. Folios in the swap cache return the swap mapping this page is stored in (which is different from the mapping for the swap file or swap device where the data is stored).
You can call this for folios which aren’t in the swap cache or page cache and it will return NULL.
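For instance, a caller that only cares whether a folio belongs to a particular inode’s page cache might use something like the hypothetical predicate below.

#include <linux/pagemap.h>
#include <linux/fs.h>

/* Hypothetical predicate: true if the folio is a page-cache folio of the
 * given inode. Anonymous folios outside the swap cache yield NULL from
 * folio_mapping() and therefore return false here. */
static bool folio_belongs_to_inode(const struct folio *folio,
                                   const struct inode *inode)
{
        return folio_mapping(folio) == inode->i_mapping;
}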
- int__anon_vma_prepare(structvm_area_struct*vma)¶
attach an anon_vma to a memory region
Parameters
structvm_area_struct*vmathe memory region in question
Description
This makes sure the memory mapping described by ‘vma’ has an ‘anon_vma’ attached to it, so that we can associate the anonymous pages mapped into it with that anon_vma.
The common case will be that we already have one, which is handled inline by anon_vma_prepare(). But if not we either need to find an adjacent mapping that we can re-use the anon_vma from (very common when the only reason for splitting a vma has been mprotect()), or we allocate a new one.
Anon-vma allocations are very subtle, because we may have optimistically looked up an anon_vma in folio_lock_anon_vma_read() and that may actually touch the rwsem even in the newly allocated vma (it depends on RCU to make sure that the anon_vma isn’t actually destroyed).
As a result, we need to do proper anon_vma locking even for the new allocation. At the same time, we do not want to do any locking for the common case of already having an anon_vma.
- unsignedlongpage_address_in_vma(conststructfolio*folio,conststructpage*page,conststructvm_area_struct*vma)¶
The virtual address of a page in this VMA.
Parameters
conststructfolio*folioThe folio containing the page.
conststructpage*pageThe page within the folio.
conststructvm_area_struct*vmaThe VMA we need to know the address in.
Description
Calculates the user virtual address of this page in the specified VMA. It is the caller’s responsibility to check the page is actually within the VMA. There may not currently be a PTE pointing at this page, but if a page fault occurs at this address, this is the page which will be accessed.
Context
Caller should hold a reference to the folio. Caller should hold a lock (e.g. the i_mmap_lock or the mmap_lock) which keeps the VMA from being altered.
Return
The virtual address corresponding to this page in the VMA.
- intfolio_referenced(structfolio*folio,intis_locked,structmem_cgroup*memcg,vm_flags_t*vm_flags)¶
Test if the folio was referenced.
Parameters
structfolio*folioThe folio to test.
intis_lockedCaller holds lock on the folio.
structmem_cgroup*memcgtarget memory cgroup
vm_flags_t*vm_flagsA combination of all the vma->vm_flags which referenced the folio.
Description
Quick test_and_clear_referenced for all mappings of a folio.
Return
The number of mappings which referenced the folio. Return -1 if the function bailed out due to rmap lock contention.
- intmapping_wrprotect_range(structaddress_space*mapping,pgoff_tpgoff,unsignedlongpfn,unsignedlongnr_pages)¶
Write-protect all mappings in a specified range.
Parameters
structaddress_space*mappingThe mapping whose reverse mapping should be traversed.
pgoff_tpgoffThe page offset at whichpfn is mapped withinmapping.
unsignedlongpfnThe PFN of the page mapped inmapping atpgoff.
unsignedlongnr_pagesThe number of physically contiguous base pages spanned.
Description
Traverses the reverse mapping, finding all VMAs which contain a shared mapping of the pages in the specified range in mapping, and write-protects them (that is, updates the page tables to mark the mappings read-only such that a write protection fault arises when the mappings are written to).
The pfn value need not refer to a folio, but rather can reference a kernel allocation which is mapped into userland. We therefore do not require that the page maps to a folio with a valid mapping or index field; rather, the caller specifies these in mapping and pgoff.
Return
the number of write-protected PTEs, or an error.
- intpfn_mkclean_range(unsignedlongpfn,unsignedlongnr_pages,pgoff_tpgoff,structvm_area_struct*vma)¶
Cleans the PTEs (including PMDs) mapping the range [pfn, pfn + nr_pages) at the specified offset (pgoff) within the vma of shared mappings. And since clean PTEs should also be read-only, write-protects them too.
Parameters
unsignedlongpfnstart pfn.
unsignedlongnr_pagesnumber of physically contiguous pages starting with pfn.
pgoff_tpgoffpage offset at which pfn is mapped.
structvm_area_struct*vmavma within which pfn is mapped.
Description
Returns the number of cleaned PTEs (including PMDs).
Parameters
structfolio*folioThe folio to move to our anon_vma
structvm_area_struct*vmaThe vma the folio belongs to
Description
When a folio belongs exclusively to one process after a COW event, that folio can be moved into the anon_vma that belongs to just that process, so the rmap code will not search the parent or sibling processes.
- void__folio_set_anon(structfolio*folio,structvm_area_struct*vma,unsignedlongaddress,boolexclusive)¶
set up a new anonymous rmap for a folio
Parameters
structfolio*folioThe folio to set up the new anonymous rmap for.
structvm_area_struct*vmaVM area to add the folio to.
unsignedlongaddressUser virtual address of the mapping
boolexclusiveWhether the folio is exclusive to the process.
- void__page_check_anon_rmap(conststructfolio*folio,conststructpage*page,structvm_area_struct*vma,unsignedlongaddress)¶
sanity check anonymous rmap addition
Parameters
conststructfolio*folioThe folio containingpage.
conststructpage*pagethe page to check the mapping of
structvm_area_struct*vmathe vm area in which the mapping is added
unsignedlongaddressthe user virtual address mapped
- voidfolio_add_anon_rmap_ptes(structfolio*folio,structpage*page,intnr_pages,structvm_area_struct*vma,unsignedlongaddress,rmap_tflags)¶
add PTE mappings to a page range of an anon folio
Parameters
structfolio*folioThe folio to add the mappings to
structpage*pageThe first page to add
intnr_pagesThe number of pages which will be mapped
structvm_area_struct*vmaThe vm area in which the mappings are added
unsignedlongaddressThe user virtual address of the first page to map
rmap_tflagsThe rmap flags
Description
The page range of folio is defined by [first_page, first_page + nr_pages)
The caller needs to hold the page table lock, and the page must be locked in the anon_vma case: to serialize mapping->index checking after setting, and to ensure that an anon folio is not being upgraded racily to a KSM folio (but KSM folios are never downgraded).
- voidfolio_add_anon_rmap_pmd(structfolio*folio,structpage*page,structvm_area_struct*vma,unsignedlongaddress,rmap_tflags)¶
add a PMD mapping to a page range of an anon folio
Parameters
structfolio*folioThe folio to add the mapping to
structpage*pageThe first page to add
structvm_area_struct*vmaThe vm area in which the mapping is added
unsignedlongaddressThe user virtual address of the first page to map
rmap_tflagsThe rmap flags
Description
The page range of folio is defined by [first_page, first_page + HPAGE_PMD_NR)
The caller needs to hold the page table lock, and the page must be locked in the anon_vma case: to serialize mapping->index checking after setting.
- voidfolio_add_new_anon_rmap(structfolio*folio,structvm_area_struct*vma,unsignedlongaddress,rmap_tflags)¶
Add mapping to a new anonymous folio.
Parameters
structfolio*folioThe folio to add the mapping to.
structvm_area_struct*vmathe vm area in which the mapping is added
unsignedlongaddressthe user virtual address mapped
rmap_tflagsThe rmap flags
Description
Like folio_add_anon_rmap_*() but must only be called on new folios. This means the inc-and-test can be bypassed. The folio doesn’t necessarily need to be locked while it’s exclusive unless two threads map it concurrently. However, the folio must be locked if it’s shared.
If the folio is pmd-mappable, it is accounted as a THP.
- voidfolio_add_file_rmap_ptes(structfolio*folio,structpage*page,intnr_pages,structvm_area_struct*vma)¶
add PTE mappings to a page range of a folio
Parameters
structfolio*folioThe folio to add the mappings to
structpage*pageThe first page to add
intnr_pagesThe number of pages that will be mapped using PTEs
structvm_area_struct*vmaThe vm area in which the mappings are added
Description
The page range of the folio is defined by [page, page + nr_pages)
The caller needs to hold the page table lock.
- voidfolio_add_file_rmap_pmd(structfolio*folio,structpage*page,structvm_area_struct*vma)¶
add a PMD mapping to a page range of a folio
Parameters
structfolio*folioThe folio to add the mapping to
structpage*pageThe first page to add
structvm_area_struct*vmaThe vm area in which the mapping is added
Description
The page range of the folio is defined by [page, page + HPAGE_PMD_NR)
The caller needs to hold the page table lock.
- voidfolio_add_file_rmap_pud(structfolio*folio,structpage*page,structvm_area_struct*vma)¶
add a PUD mapping to a page range of a folio
Parameters
structfolio*folioThe folio to add the mapping to
structpage*pageThe first page to add
structvm_area_struct*vmaThe vm area in which the mapping is added
Description
The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
The caller needs to hold the page table lock.
- voidfolio_remove_rmap_ptes(structfolio*folio,structpage*page,intnr_pages,structvm_area_struct*vma)¶
remove PTE mappings from a page range of a folio
Parameters
structfolio*folioThe folio to remove the mappings from
structpage*pageThe first page to remove
intnr_pagesThe number of pages that will be removed from the mapping
structvm_area_struct*vmaThe vm area from which the mappings are removed
Description
The page range of the folio is defined by [page, page + nr_pages)
The caller needs to hold the page table lock.
- voidfolio_remove_rmap_pmd(structfolio*folio,structpage*page,structvm_area_struct*vma)¶
remove a PMD mapping from a page range of a folio
Parameters
structfolio*folioThe folio to remove the mapping from
structpage*pageThe first page to remove
structvm_area_struct*vmaThe vm area from which the mapping is removed
Description
The page range of the folio is defined by [page, page + HPAGE_PMD_NR)
The caller needs to hold the page table lock.
- voidfolio_remove_rmap_pud(structfolio*folio,structpage*page,structvm_area_struct*vma)¶
remove a PUD mapping from a page range of a folio
Parameters
structfolio*folioThe folio to remove the mapping from
structpage*pageThe first page to remove
structvm_area_struct*vmaThe vm area from which the mapping is removed
Description
The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
The caller needs to hold the page table lock.
- voidtry_to_unmap(structfolio*folio,enumttu_flagsflags)¶
Try to remove all page table mappings to a folio.
Parameters
structfolio*folioThe folio to unmap.
enumttu_flagsflagsaction and flags
Description
Tries to remove all the page table entries which are mapping this folio. It is the caller’s responsibility to check if the folio is still mapped if needed (use TTU_SYNC to prevent accounting races).
Context
Caller must hold the folio lock.
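A rough caller-side sketch, assuming the folio is already locked as required above; the helper name is illustrative. TTU_SYNC is passed so that the folio_mapped() check afterwards is not subject to the accounting race mentioned above.

#include <linux/rmap.h>
#include <linux/mm.h>

/* Illustrative helper: unmap every user mapping of a locked folio and
 * report whether anything is still mapped afterwards. */
static bool unmap_folio_fully(struct folio *folio)
{
        VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
        try_to_unmap(folio, TTU_SYNC);
        return !folio_mapped(folio);
}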
- voidtry_to_migrate(structfolio*folio,enumttu_flagsflags)¶
try to replace all page table mappings with swap entries
Parameters
structfolio*foliothe folio to replace page table entries for
enumttu_flagsflagsaction and flags
Description
Tries to remove all the page table entries which are mapping this folio and replace them with special swap entries. Caller must hold the folio lock.
- structpage*make_device_exclusive(structmm_struct*mm,unsignedlongaddr,void*owner,structfolio**foliop)¶
Mark a page for exclusive use by a device
Parameters
structmm_struct*mmmm_struct of associated target process
unsignedlongaddrthe virtual address to mark for exclusive device access
void*ownerpassed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
structfolio**foliopfolio pointer will be stored here on success.
Description
This function looks up the page mapped at the given address, grabs a folio reference, locks the folio and replaces the PTE with a special device-exclusive PFN swap entry, preventing access through the process page tables. The function will return with the folio locked and referenced.
On fault, the device-exclusive entries are replaced with the original PTE under folio lock, after calling MMU notifiers.
Only anonymous non-hugetlb folios are supported and the VMA must have write permissions such that we can fault in the anonymous page writable in order to mark it exclusive. The caller must hold the mmap_lock in read mode.
A driver using this to program access from a device must use a mmu notifier critical section to hold a device-specific lock during programming. Once programming is complete it should drop the folio lock and reference, after which point CPU access to the page will revoke the exclusive access.
Notes
This function always operates on individual PTEs mapping individual pages. PMD-sized THPs are first remapped to be mapped by PTEs before the conversion happens on a single PTE corresponding to addr.
While concurrent access through the process page tables is prevented, concurrent access through other page references (e.g., earlier GUP invocation) is not handled and not supported.
Device-exclusive entries are considered “clean” and “old” by core-mm. Device drivers must update the folio state when informed by MMU notifiers.
Return
pointer to mapped page on success, otherwise a negative error.
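A hedged driver-side sketch of the sequence described above; the surrounding function and the device-programming step are hypothetical, and the caller is assumed to hold the mmap_lock in read mode and to have registered an MMU notifier keyed on owner.

#include <linux/rmap.h>
#include <linux/mm.h>
#include <linux/err.h>

static int demo_grab_exclusive(struct mm_struct *mm, unsigned long addr,
                               void *owner)
{
        struct folio *folio;
        struct page *page;

        page = make_device_exclusive(mm, addr, owner, &folio);
        if (IS_ERR(page))
                return PTR_ERR(page);

        /* ... program the device's mapping of "page" here, inside the
         * driver's MMU-notifier critical section ... */

        /* Dropping lock and reference re-allows CPU access to revoke it. */
        folio_unlock(folio);
        folio_put(folio);
        return 0;
}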
- void__rmap_walk_file(structfolio*folio,structaddress_space*mapping,pgoff_tpgoff_start,unsignedlongnr_pages,structrmap_walk_control*rwc,boollocked)¶
Traverse the reverse mapping for a file-backed mapping of a page mapped within a specified page cache object at a specified offset.
Parameters
structfolio*folioEither the folio whose mappings to traverse, or if NULL,the callbacks specified inrwc will be configured suchas to be able to look up mappings correctly.
structaddress_space*mappingThe page cache object whose mapping VMAs we intend totraverse. Iffolio is non-NULL, this should be equal tofolio_mapping(folio).
pgoff_tpgoff_startThe offset withinmapping of the page which we arelooking up. Iffolio is non-NULL, this should be equalto folio_pgoff(folio).
unsignedlongnr_pagesThe number of pages mapped by the mapping. Iffolio isnon-NULL, this should be equal to folio_nr_pages(folio).
structrmap_walk_control*rwcThe reverse mapping walk control object describing howthe traversal should proceed.
boollockedIs themapping already locked? If not, we acquire thelock.
- boolisolate_movable_ops_page(structpage*page,isolate_mode_tmode)¶
isolate a movable_ops page for migration
Parameters
structpage*pageThe page.
isolate_mode_tmodeThe isolation mode.
Description
Try to isolate a movable_ops page for migration. Will fail if the page is not a movable_ops page, if the page is already isolated for migration or if the page was just released by its owner.
Once isolated, the page cannot get freed until it is either put back or migrated.
Returns true if isolation succeeded, otherwise false.
Parameters
structpage*pageThe isolated page.
Description
Putback an isolated movable_ops page.
After the page was putback, it might get freed instantly.
- intmigrate_movable_ops_page(structpage*dst,structpage*src,enummigrate_modemode)¶
migrate an isolated movable_ops page
Parameters
structpage*dstThe destination page.
structpage*srcThe source page.
enummigrate_modemodeThe migration mode.
Description
Migrate an isolated movable_ops page.
If the src page was already released by its owner, the src page is un-isolated (putback) and migration succeeds; the migration core will be the owner of both pages.
If the src page was not released by its owner and the migration was successful, the owner of the src page and the dst page are swapped and the src page is un-isolated.
If migration fails, the ownership stays unmodified and the src page remains isolated: migration may be retried later or the page can be put back.
TODO: migration core will treat both pages as folios and lock them before this call to unlock them after this call. Further, the folio refcounts on src and dst are also released by migration core. These pages will not be folios in the future, so that must be reworked.
Returns 0 on success, otherwise a negative error code.
- intmigrate_folio(structaddress_space*mapping,structfolio*dst,structfolio*src,enummigrate_modemode)¶
Simple folio migration.
Parameters
structaddress_space*mappingThe address_space containing the folio.
structfolio*dstThe folio to migrate the data to.
structfolio*srcThe folio containing the current data.
enummigrate_modemodeHow to migrate the page.
Description
Common logic to directly migrate a single LRU folio suitable for folios that do not have private data.
Folios are locked upon entry and exit.
- intbuffer_migrate_folio(structaddress_space*mapping,structfolio*dst,structfolio*src,enummigrate_modemode)¶
Migration function for folios with buffers.
Parameters
structaddress_space*mappingThe address space containingsrc.
structfolio*dstThe folio to migrate to.
structfolio*srcThe folio to migrate from.
enummigrate_modemodeHow to migrate the folio.
Description
This function can only be used if the underlying filesystem guarantees that no other references to src exist. For example attached buffer heads are accessed only under the folio lock. If your filesystem cannot provide this guarantee, buffer_migrate_folio_norefs() may be more appropriate.
Return
0 on success or a negative errno on failure.
- intbuffer_migrate_folio_norefs(structaddress_space*mapping,structfolio*dst,structfolio*src,enummigrate_modemode)¶
Migration function for folios with buffers.
Parameters
structaddress_space*mappingThe address space containingsrc.
structfolio*dstThe folio to migrate to.
structfolio*srcThe folio to migrate from.
enummigrate_modemodeHow to migrate the folio.
Description
Like buffer_migrate_folio() except that this variant is more careful and checks that there are also no buffer head references. This function is the right one for mappings where buffer heads are directly looked up and referenced (such as block device mappings).
Return
0 on success or a negative errno on failure.
- unsignedlongdo_mmap(structfile*file,unsignedlongaddr,unsignedlonglen,unsignedlongprot,unsignedlongflags,vm_flags_tvm_flags,unsignedlongpgoff,unsignedlong*populate,structlist_head*uf)¶
Perform a userland memory mapping into the current process address space of length len with protection bits prot, mmap flags flags (from which VMA flags will be inferred), and any additional VMA flags to apply, vm_flags. If this is a file-backed mapping then the file is specified in file and the page offset into the file via pgoff.
Parameters
structfile*fileAn optional struct file pointer describing the file which is to be mapped, if a file-backed mapping.
unsignedlongaddrIf non-zero, hints at (or if flags has MAP_FIXED set, specifies) the address at which to perform this mapping. See mmap (2) for details. Must be page-aligned.
unsignedlonglenThe length of the mapping. Will be page-aligned and must be at least 1 page in size.
unsignedlongprotProtection bits describing access required to the mapping. See mmap (2) for details.
unsignedlongflagsFlags specifying how the mapping should be performed, see mmap (2) for details.
vm_flags_tvm_flagsVMA flags which should be set by default, or 0 otherwise.
unsignedlongpgoffPage offset into the file if file-backed, should be 0 otherwise.
unsignedlong*populateA pointer to a value which will be set to 0 if no population of the range is required, or the number of bytes to populate if it is. Must be non-NULL. See mmap (2) for details as to under what circumstances population of the range occurs.
structlist_head*ufAn optional pointer to a list head to track userfaultfd unmap events should unmapping events arise. If provided, it is up to the caller to manage this.
Description
This function does not perform security checks on the file and assumes, if uf is non-NULL, that the caller has provided a list head to track unmap events for userfaultfd in uf.
It also simply indicates whether memory population is required by setting populate, which must be non-NULL, expecting the caller to actually perform this task itself if appropriate.
This function will invoke architecture-specific (and if provided and relevant, file system-specific) logic to determine the most appropriate unmapped area in which to place the mapping if not MAP_FIXED.
Callers which require userland mmap() behaviour should invoke vm_mmap(), which is also exported for module use.
Callers which require this behaviour minus the security checks, userfaultfd and populate behaviour, and which handle the mmap write lock themselves, should call this function.
Note that the returned address may reside within a merged VMA if an appropriate merge were to take place, so it doesn’t necessarily specify the start of a VMA, rather only the start of a valid mapped range of length len bytes, rounded down to the nearest page size.
The caller must write-lock current->mm->mmap_lock.
Return
Either an error, or the address at which the requested mapping hasbeen performed.
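For most in-kernel users, the note above about vm_mmap() is the practical takeaway; a minimal sketch with an illustrative wrapper name might look like this.

#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/printk.h>
#include <linux/err.h>

/* Illustrative wrapper: map "file" read-only into current->mm using the
 * exported vm_mmap() helper, which takes the mmap write lock itself and
 * performs the usual userland mmap() checks. */
static unsigned long map_file_readonly(struct file *file, unsigned long len)
{
        unsigned long addr = vm_mmap(file, 0, len, PROT_READ, MAP_PRIVATE, 0);

        if (IS_ERR_VALUE(addr))
                pr_debug("mapping failed: %ld\n", (long)addr);
        return addr;    /* address, or a negative errno encoded in it */
}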
- structvm_area_struct*find_vma_intersection(structmm_struct*mm,unsignedlongstart_addr,unsignedlongend_addr)¶
Look up the first VMA which intersects the interval
Parameters
structmm_struct*mmThe process address space.
unsignedlongstart_addrThe inclusive start user address.
unsignedlongend_addrThe exclusive end user address.
Return
The first VMA within the provided range, NULL otherwise. Assumes start_addr < end_addr.
- structvm_area_struct*find_vma(structmm_struct*mm,unsignedlongaddr)¶
Find the VMA for a given address, or the next VMA.
Parameters
structmm_struct*mmThe mm_struct to check
unsignedlongaddrThe address
Return
The VMA associated with addr, or the next VMA. May return NULL in the case of no VMA at addr or above.
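Because find_vma() may return the next VMA above the address, callers that want the VMA actually containing an address must check vm_start themselves, as in the sketch below (helper name illustrative); the mmap lock must be held across the lookup.

#include <linux/mm.h>
#include <linux/mmap_lock.h>

/* Illustrative helper: report whether "addr" lies inside any VMA of "mm". */
static bool addr_is_mapped(struct mm_struct *mm, unsigned long addr)
{
        struct vm_area_struct *vma;
        bool mapped;

        mmap_read_lock(mm);
        vma = find_vma(mm, addr);
        mapped = vma && vma->vm_start <= addr;
        mmap_read_unlock(mm);

        return mapped;
}

Newer kernels also provide vma_lookup(), which performs this containment check internally.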
- structvm_area_struct*find_vma_prev(structmm_struct*mm,unsignedlongaddr,structvm_area_struct**pprev)¶
Find the VMA for a given address, or the next VMA, and set pprev to the previous VMA, if any.
Parameters
structmm_struct*mmThe mm_struct to check
unsignedlongaddrThe address
structvm_area_struct**pprevThe pointer to set to the previous VMA
Description
Note that the RCU lock is missing here since the external mmap_lock() is used instead.
Return
The VMA associated with addr, or the next VMA. May return NULL in the case of no VMA at addr or above.
- void__refkmemleak_alloc(constvoid*ptr,size_tsize,intmin_count,gfp_tgfp)¶
register a newly allocated object
Parameters
constvoid*ptrpointer to beginning of the object
size_tsizesize of the object
intmin_countminimum number of references to this object. If during memory scanning a number of references less than min_count is found, the object is reported as a memory leak. If min_count is 0, the object is never reported as a leak. If min_count is -1, the object is ignored (not scanned and not reported as a leak).
gfp_tgfpkmalloc()flags used for kmemleak internal memory allocations
Description
This function is called from the kernel allocators when a new object(memory block) is allocated (kmem_cache_alloc, kmalloc etc.).
- void__refkmemleak_alloc_percpu(constvoid__percpu*ptr,size_tsize,gfp_tgfp)¶
register a newly allocated __percpu object
Parameters
constvoid__percpu*ptr__percpu pointer to beginning of the object
size_tsizesize of the object
gfp_tgfpflags used for kmemleak internal memory allocations
Description
This function is called from the kernel percpu allocator when a new object(memory block) is allocated (alloc_percpu).
- void__refkmemleak_vmalloc(conststructvm_struct*area,size_tsize,gfp_tgfp)¶
register a newly vmalloc’ed object
Parameters
conststructvm_struct*areapointer to vm_struct
size_tsizesize of the object
gfp_tgfp__vmalloc()flags used for kmemleak internal memory allocations
Description
This function is called from the vmalloc() kernel allocator when a new object (memory block) is allocated.
- void__refkmemleak_free(constvoid*ptr)¶
unregister a previously registered object
Parameters
constvoid*ptrpointer to beginning of the object
Description
This function is called from the kernel allocators when an object (memory block) is freed (kmem_cache_free, kfree, vfree etc.).
- void__refkmemleak_free_part(constvoid*ptr,size_tsize)¶
partially unregister a previously registered object
Parameters
constvoid*ptrpointer to the beginning or inside the object. This also represents the start of the range to be freed
size_tsizesize to be unregistered
Description
This function is called when only a part of a memory block is freed(usually from the bootmem allocator).
- void__refkmemleak_free_percpu(constvoid__percpu*ptr)¶
unregister a previously registered __percpu object
Parameters
constvoid__percpu*ptr__percpu pointer to beginning of the object
Description
This function is called from the kernel percpu allocator when an object(memory block) is freed (free_percpu).
- void__refkmemleak_update_trace(constvoid*ptr)¶
update object allocation stack trace
Parameters
constvoid*ptrpointer to beginning of the object
Description
Override the object allocation stack trace for cases where the actualallocation place is not always useful.
- void__refkmemleak_not_leak(constvoid*ptr)¶
mark an allocated object as false positive
Parameters
constvoid*ptrpointer to beginning of the object
Description
Calling this function on an object will cause the memory block to no longer be reported as a leak and to always be scanned.
- void__refkmemleak_transient_leak(constvoid*ptr)¶
mark an allocated object as transient false positive
Parameters
constvoid*ptrpointer to beginning of the object
Description
Calling this function on an object will cause the memory block to not be reported as a leak temporarily. This may happen, for example, if the object is part of a singly linked list and the ->next reference to it is changed.
- void__refkmemleak_ignore_percpu(constvoid__percpu*ptr)¶
similar to kmemleak_ignore but taking a percpu address argument
Parameters
constvoid__percpu*ptrpercpu address of the object
- void__refkmemleak_ignore(constvoid*ptr)¶
ignore an allocated object
Parameters
constvoid*ptrpointer to beginning of the object
Description
Calling this function on an object will cause the memory block to be ignored (not scanned and not reported as a leak). This is usually done when it is known that the corresponding block is not a leak and does not contain any references to other allocated memory blocks.
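The kmemleak hooks above are called automatically by the allocators; driver code usually only needs the annotation helpers. A hedged example of the common pattern, with an illustrative allocation, follows.

#include <linux/slab.h>
#include <linux/kmemleak.h>

/* Illustrative allocation: a buffer that is only ever referenced by a
 * physical address handed to hardware looks like a leak to kmemleak, so
 * it is annotated with kmemleak_ignore() right after allocation. */
static void *alloc_hw_descriptor_ring(size_t size)
{
        void *ring = kmalloc(size, GFP_KERNEL);

        if (ring)
                kmemleak_ignore(ring);
        return ring;
}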
- void__refkmemleak_scan_area(constvoid*ptr,size_tsize,gfp_tgfp)¶
limit the range to be scanned in an allocated object
Parameters
constvoid*ptrpointer to beginning or inside the object. This alsorepresents the start of the scan area
size_tsizesize of the scan area
gfp_tgfpkmalloc()flags used for kmemleak internal memory allocations
Description
This function is used when it is known that only certain parts of an object contain references to other objects. Kmemleak will only scan these areas, reducing the number of false negatives.
- void__refkmemleak_no_scan(constvoid*ptr)¶
do not scan an allocated object
Parameters
constvoid*ptrpointer to beginning of the object
Description
This function notifies kmemleak not to scan the given memory block. Useful in situations where it is known that the given object does not contain any references to other objects. Kmemleak will not scan such objects, reducing the number of false negatives.
- void__refkmemleak_alloc_phys(phys_addr_tphys,size_tsize,gfp_tgfp)¶
similar to kmemleak_alloc but taking a physical address argument
Parameters
phys_addr_tphysphysical address of the object
size_tsizesize of the object
gfp_tgfpkmalloc()flags used for kmemleak internal memory allocations
- void__refkmemleak_free_part_phys(phys_addr_tphys,size_tsize)¶
similar to kmemleak_free_part but taking a physical address argument
Parameters
phys_addr_tphysphysical address of the beginning or inside an object. This also represents the start of the range to be freed
size_tsizesize to be unregistered
- void__refkmemleak_ignore_phys(phys_addr_tphys)¶
similar to kmemleak_ignore but taking a physical address argument
Parameters
phys_addr_tphysphysical address of the object
- void*devm_memremap_pages(structdevice*dev,structdev_pagemap*pgmap)¶
remap and provide memmap backing for the given resource
Parameters
structdevice*devhosting device forres
structdev_pagemap*pgmappointer to a struct dev_pagemap
Notes
- 1/ At a minimum the range and type members of pgmap must be initialized by the caller before passing it to this function.
- 2/ The altmap field may optionally be initialized, in which case PGMAP_ALTMAP_VALID must be set in pgmap->flags.
- 3/ The ref field may optionally be provided, in which case pgmap->ref must be ‘live’ on entry and will be killed and reaped at devm_memremap_pages_release() time, or if this routine fails.
- 4/ range is expected to be a host memory range that could feasibly be treated as a “System RAM” range, i.e. not a device mmio range, but this is not enforced.
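A hedged setup sketch reflecting note 1 above: only the range and type are filled in, using MEMORY_DEVICE_GENERIC for simplicity; the surrounding function and its resource parameter are illustrative.

#include <linux/memremap.h>
#include <linux/device.h>
#include <linux/ioport.h>
#include <linux/gfp.h>
#include <linux/err.h>

static void *demo_memremap_pages(struct device *dev, struct resource *res)
{
        struct dev_pagemap *pgmap;

        pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
        if (!pgmap)
                return ERR_PTR(-ENOMEM);

        /* Minimum initialization per note 1: range and type. */
        pgmap->type = MEMORY_DEVICE_GENERIC;
        pgmap->range.start = res->start;
        pgmap->range.end = res->end;
        pgmap->nr_range = 1;

        return devm_memremap_pages(dev, pgmap); /* vaddr or ERR_PTR() */
}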
- structdev_pagemap*get_dev_pagemap(unsignedlongpfn)¶
take a new live reference on the dev_pagemap forpfn
Parameters
unsignedlongpfnpage frame number to lookup page_map
- unsignedlongvma_kernel_pagesize(structvm_area_struct*vma)¶
Page size granularity for this VMA.
Parameters
structvm_area_struct*vmaThe user mapping.
Description
Folios in this VMA will be aligned to, and at least the size of, the number of bytes returned by this function.
Return
The default size of the folios allocated when backing a VMA.
- inthuge_pmd_unshare(structmmu_gather*tlb,structvm_area_struct*vma,unsignedlongaddr,pte_t*ptep)¶
Unmap a pmd table if it is shared by multiple users
Parameters
structmmu_gather*tlbthe current mmu_gather.
structvm_area_struct*vmathe vma covering the pmd table.
unsignedlongaddrthe address we are trying to unshare.
pte_t*pteppointer into the (pmd) page table.
Description
Called with the page table lock held, the i_mmap_rwsem held in write modeand the hugetlb vma lock held in write mode.
Note
The caller must call huge_pmd_unshare_flush() before dropping the i_mmap_rwsem.
Return
1 if it was a shared PMD table and it got unmapped, or 0 if it was not a shared PMD table.
- boolfolio_isolate_hugetlb(structfolio*folio,structlist_head*list)¶
try to isolate an allocated hugetlb folio
Parameters
structfolio*foliothe folio to isolate
structlist_head*listthe list to add the folio to on success
Description
Isolate an allocated (refcount > 0) hugetlb folio, marking it as isolated/non-migratable, and moving it from the active list to the given list.
Isolation will fail if folio is not an allocated hugetlb folio, or if it is already isolated/non-migratable.
On success, an additional folio reference is taken that must be dropped using folio_putback_hugetlb() to undo the isolation.
Return
True if isolation worked, otherwise False.
Parameters
structfolio*foliothe isolated hugetlb folio
Description
Putback/un-isolate the hugetlb folio that was previously isolated using folio_isolate_hugetlb(): marking it non-isolated/migratable and putting it back onto the active list.
Will drop the additional folio reference obtained through folio_isolate_hugetlb().
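A hedged sketch of how the two helpers above pair up, loosely mirroring how migration code handles hugetlb folios; the function name and the private list are illustrative.

#include <linux/hugetlb.h>
#include <linux/list.h>

/* Illustrative: collect one hugetlb folio on a private list, then undo. */
static void demo_isolate_and_putback(struct folio *hugetlb_folio)
{
        LIST_HEAD(pagelist);
        struct folio *folio, *next;

        if (!folio_isolate_hugetlb(hugetlb_folio, &pagelist))
                return;         /* not allocated, or already isolated */

        /* ... e.g. hand "pagelist" to the migration code here ... */

        /* Undo: put each folio back and drop the isolation reference. */
        list_for_each_entry_safe(folio, next, &pagelist, lru)
                folio_putback_hugetlb(folio);
}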
Parameters
structfolio*folioThe folio to mark.
Description
This function will perform one of the following transitions:
inactive,unreferenced -> inactive,referenced
inactive,referenced -> active,unreferenced
active,unreferenced -> active,referenced
When a newly allocated folio is not yet visible, so safe for non-atomic ops, __folio_set_referenced() may be substituted for folio_mark_accessed().
Parameters
structfolio*folioThe folio to be added to the LRU.
Description
Queue the folio for addition to the LRU. The decision on whether to add the page to the [in]active [file|anon] list is deferred until the folio_batch is drained. This gives a chance for the caller of folio_add_lru() to have the folio added to the active list using folio_mark_accessed().
- voidfolio_add_lru_vma(structfolio*folio,structvm_area_struct*vma)¶
Add a folio to the appropate LRU list for this VMA.
Parameters
structfolio*folioThe folio to be added to the LRU.
structvm_area_struct*vmaVMA in which the folio is mapped.
Description
If the VMA is mlocked, folio is added to the unevictable list. Otherwise, it is treated the same way as folio_add_lru().
Parameters
structfolio*folioFolio to deactivate.
Description
This function hints to the VM that folio is a good reclaim candidate, for example if its invalidation fails due to the folio being dirty or under writeback.
Context
Caller holds a reference on the folio.
Parameters
structfolio*foliofolio to deactivate
Description
folio_mark_lazyfree() moves folio to the inactive file list. This is done to accelerate the reclaim of folio.
- voidfolios_put_refs(structfolio_batch*folios,unsignedint*refs)¶
Reduce the reference count on a batch of folios.
Parameters
structfolio_batch*foliosThe folios.
unsignedint*refsThe number of refs to subtract from each folio.
Description
Like folio_put(), but for a batch of folios. This is more efficient than writing the loop yourself as it will optimise the locks which need to be taken if the folios are freed. The folios batch is returned empty and ready to be reused for another batch; there is no need to reinitialise it. If refs is NULL, we subtract one from each folio refcount.
Context
May be called in process or interrupt context, but not in NMIcontext. May be called while holding a spinlock.
- voidrelease_pages(release_pages_argarg,intnr)¶
batched put_page()
Parameters
release_pages_argargarray of pages to release
intnrnumber of pages
Description
Decrement the reference count on all the pages in arg. If it fell to zero, remove the page from the LRU and free it.
Note that the argument can be an array of pages, encoded pages, or folio pointers. We ignore any encoded bits, and turn any of them into just a folio that gets freed.
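A hedged usage sketch combining this with get_user_pages_fast(): after a successful pin with a reference per page, release the whole batch in one call instead of looping over put_page(); the wrapper name and batch size are illustrative.

#include <linux/mm.h>
#include <linux/pagemap.h>

#define DEMO_BATCH 16   /* illustrative batch size */

/* Illustrative helper: briefly pin up to DEMO_BATCH user pages, then drop
 * all references with a single release_pages() call. */
static long pin_and_release_demo(unsigned long start, int nr)
{
        struct page *pages[DEMO_BATCH];
        long got;

        if (nr > DEMO_BATCH)
                return -EINVAL;

        got = get_user_pages_fast(start, nr, 0, pages);
        if (got > 0) {
                /* ... inspect the page contents here ... */
                release_pages(pages, got);
        }
        return got;
}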
- voidfolio_batch_remove_exceptionals(structfolio_batch*fbatch)¶
Prune non-folios from a batch.
Parameters
structfolio_batch*fbatchThe batch to prune
Description
find_get_entries() fills a batch with both folios and shadow/swap/DAX entries. This function prunes all the non-folio entries from fbatch without leaving holes, so that it can be passed on to folio-only batch operations.
- structcgroup_subsys_state*mem_cgroup_css_from_folio(structfolio*folio)¶
css of the memcg associated with a folio
Parameters
structfolio*foliofolio of interest
Description
If memcg is bound to the default hierarchy, the css of the memcg associated with folio is returned. The returned css remains associated with folio until it is released.
If memcg is bound to a traditional hierarchy, the css of root_mem_cgroupis returned.
Parameters
structpage*pagethe page
Description
Look up the closest online ancestor of the memory cgroup page is charged to and return its inode number, or 0 if page is not charged to any cgroup. It is safe to call this function without holding a reference to page.
Note, this function is inherently racy, because there is nothing to prevent the cgroup inode from getting torn down and potentially reallocated a moment after page_cgroup_ino() returns, so it should only be used by callers that do not care (such as procfs interfaces).
- voidmod_memcg_state(structmem_cgroup*memcg,enummemcg_stat_itemidx,intval)¶
update cgroup memory statistics
Parameters
structmem_cgroup*memcgthe memory cgroup
enummemcg_stat_itemidxthe stat item - can be enum memcg_stat_item or enum node_stat_item
intvaldelta to add to the counter, can be negative
- voidmod_lruvec_state(structlruvec*lruvec,enumnode_stat_itemidx,intval)¶
update lruvec memory statistics
Parameters
structlruvec*lruvecthe lruvec
enumnode_stat_itemidxthe stat item
intvaldelta to add to the counter, can be negative
Description
The lruvec is the intersection of the NUMA node and a cgroup. This function updates all three counters that are affected by a change of state at this level: per-node, per-cgroup, per-lruvec.
- voidcount_memcg_events(structmem_cgroup*memcg,enumvm_event_itemidx,unsignedlongcount)¶
account VM events in a cgroup
Parameters
structmem_cgroup*memcgthe memory cgroup
enumvm_event_itemidxthe event item
unsignedlongcountthe number of events that occurred
- structmem_cgroup*get_mem_cgroup_from_mm(structmm_struct*mm)¶
Obtain a reference on given mm_struct’s memcg.
Parameters
structmm_struct*mmmm from which memcg should be extracted. It can be NULL.
Description
Obtain a reference on mm->memcg and return it if successful. If mm is NULL, the memcg is chosen as follows:
1) The active memcg, if set.
2) current->mm->memcg, if available.
3) The root memcg.
If mem_cgroup is disabled, NULL is returned.
- structmem_cgroup*get_mem_cgroup_from_current(void)¶
Obtain a reference on current task’s memcg.
Parameters
voidno arguments
- structmem_cgroup*get_mem_cgroup_from_folio(structfolio*folio)¶
Obtain a reference on a given folio’s memcg.
Parameters
structfolio*foliofolio from which memcg should be extracted.
- structmem_cgroup*mem_cgroup_iter(structmem_cgroup*root,structmem_cgroup*prev,structmem_cgroup_reclaim_cookie*reclaim)¶
iterate over memory cgroup hierarchy
Parameters
structmem_cgroup*roothierarchy root
structmem_cgroup*prevpreviously returned memcg, NULL on first invocation
structmem_cgroup_reclaim_cookie*reclaimcookie for shared reclaim walks, NULL for full walks
Description
Returns references to children of the hierarchy below root, or root itself, or NULL after a full round-trip.
Caller must pass the return value in prev on subsequent invocations for reference counting, or use mem_cgroup_iter_break() to cancel a hierarchy walk before the round-trip is complete.
Reclaimers can specify a node in reclaim to divide up the memcgs in the hierarchy among all concurrent reclaimers operating on the same node.
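A hedged sketch of the documented walk/break pattern; the counting helper is illustrative.

#include <linux/memcontrol.h>

/* Illustrative: count descendants of "root", stopping early once "max"
 * is reached. Breaking out early must go through mem_cgroup_iter_break()
 * so the reference held on the current memcg is dropped. */
static unsigned int count_memcg_descendants(struct mem_cgroup *root,
                                            unsigned int max)
{
        struct mem_cgroup *memcg;
        unsigned int n = 0;

        for (memcg = mem_cgroup_iter(root, NULL, NULL); memcg;
             memcg = mem_cgroup_iter(root, memcg, NULL)) {
                if (++n >= max) {
                        mem_cgroup_iter_break(root, memcg);
                        break;
                }
        }
        return n;
}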
- voidmem_cgroup_iter_break(structmem_cgroup*root,structmem_cgroup*prev)¶
abort a hierarchy walk prematurely
Parameters
structmem_cgroup*roothierarchy root
structmem_cgroup*prevlast visited hierarchy member as returned by mem_cgroup_iter()
- voidmem_cgroup_scan_tasks(structmem_cgroup*memcg,int(*fn)(structtask_struct*,void*),void*arg)¶
iterate over tasks of a memory cgroup hierarchy
Parameters
structmem_cgroup*memcghierarchy root
int(*fn)(structtask_struct*,void*)function to call for each task
void*argargument passed tofn
Description
This function iterates over tasks attached to memcg or to any of its descendants and calls fn for each task. If fn returns a non-zero value, the function breaks the iteration loop. Otherwise, it will iterate over all tasks and return 0.
This function must not be called for the root memory cgroup.
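A hedged sketch of a callback with the expected signature; the pid-matching logic is purely illustrative.

#include <linux/memcontrol.h>
#include <linux/sched.h>

/* Illustrative callback: stop the walk at the first task whose pid
 * matches the one passed through "arg". */
static int demo_match_pid(struct task_struct *task, void *arg)
{
        pid_t *wanted = arg;

        return task_pid_nr(task) == *wanted;    /* non-zero stops the walk */
}

/* Usage (memcg must not be the root memory cgroup):
 *      pid_t pid = 1234;
 *      mem_cgroup_scan_tasks(memcg, demo_match_pid, &pid);
 */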
Parameters
structfolio*folioPointer to the folio.
Description
These functions are safe to use under any of the following conditions:
- folio locked
- folio_test_lru false
- folio frozen (refcount of 0)
Return
The lruvec this folio is on with its lock held.
Parameters
structfolio*folioPointer to the folio.
Description
These functions are safe to use under any of the following conditions:
- folio locked
- folio_test_lru false
- folio frozen (refcount of 0)
Return
The lruvec this folio is on with its lock held and interrupts disabled.
- structlruvec*folio_lruvec_lock_irqsave(structfolio*folio,unsignedlong*flags)¶
Lock the lruvec for a folio.
Parameters
structfolio*folioPointer to the folio.
unsignedlong*flagsPointer to irqsave flags.
Description
These functions are safe to use under any of the following conditions:
- folio locked
- folio_test_lru false
- folio frozen (refcount of 0)
Return
The lruvec this folio is on with its lock held and interrupts disabled.
- voidmem_cgroup_update_lru_size(structlruvec*lruvec,enumlru_listlru,intzid,intnr_pages)¶
account for adding or removing an lru page
Parameters
structlruvec*lruvecmem_cgroup per zone lru vector
enumlru_listlruindex of lru list the page is sitting on
intzidzone id of the accounted pages
intnr_pagespositive when adding or negative when removing
Description
This function must be called under lru_lock, just before a page is addedto or just after a page is removed from an lru list.
- unsignedlongmem_cgroup_margin(structmem_cgroup*memcg)¶
calculate chargeable space of a memory cgroup
Parameters
structmem_cgroup*memcgthe memory cgroup
Description
Returns the maximum amount of memory mem can be charged with, in pages.
- voidmem_cgroup_print_oom_context(structmem_cgroup*memcg,structtask_struct*p)¶
Print OOM information relevant to memory controller.
Parameters
structmem_cgroup*memcgThe memory cgroup that went over limit
structtask_struct*pTask that is going to be killed
NOTE
memcg and p’s mem_cgroup can be different when hierarchy is enabled
- voidmem_cgroup_print_oom_meminfo(structmem_cgroup*memcg)¶
Print OOM memory information relevant to memory controller.
Parameters
structmem_cgroup*memcgThe memory cgroup that went over limit
- structmem_cgroup*mem_cgroup_get_oom_group(structtask_struct*victim,structmem_cgroup*oom_domain)¶
get a memory cgroup to clean up after OOM
Parameters
structtask_struct*victimtask to be killed by the OOM killer
structmem_cgroup*oom_domainmemcg in case of memcg OOM, NULL in case of system-wide OOM
Description
Returns a pointer to a memory cgroup, which has to be cleaned up by killing all belonging OOM-killable tasks.
Caller has to call mem_cgroup_put() on the returned non-NULL memcg.
- boolconsume_stock(structmem_cgroup*memcg,unsignedintnr_pages)¶
Try to consume stocked charge on this cpu.
Parameters
structmem_cgroup*memcgmemcg to consume from.
unsignedintnr_pageshow many pages to charge.
Description
Consume the cached charge if enough nr_pages are present, otherwise return failure. Also return failure for charge requests larger than MEMCG_CHARGE_BATCH or if the local lock is already taken.
returns true if successful, false otherwise.
- int__memcg_kmem_charge_page(structpage*page,gfp_tgfp,intorder)¶
charge a kmem page to the current memory cgroup
Parameters
structpage*pagepage to charge
gfp_tgfpreclaim mode
intorderallocation order
Description
Returns 0 on success, an error code on failure.
Parameters
structpage*pagepage to uncharge
intorderallocation order
- voidmem_cgroup_wb_stats(structbdi_writeback*wb,unsignedlong*pfilepages,unsignedlong*pheadroom,unsignedlong*pdirty,unsignedlong*pwriteback)¶
retrieve writeback related stats from its memcg
Parameters
structbdi_writeback*wbbdi_writeback in question
unsignedlong*pfilepagesout parameter for number of file pages
unsignedlong*pheadroomout parameter for number of allocatable pages according to memcg
unsignedlong*pdirtyout parameter for number of dirty pages
unsignedlong*pwritebackout parameter for number of pages under writeback
Description
Determine the numbers of file, headroom, dirty, and writeback pages in wb’s memcg. File, dirty and writeback are self-explanatory. Headroom is a bit more involved.
A memcg’s headroom is “min(max, high) - used”. In the hierarchy, the headroom is calculated as the lowest headroom of itself and the ancestors. Note that this doesn’t consider the actual amount of available memory in the system. The caller should further cap *pheadroom accordingly.
- structmem_cgroup*mem_cgroup_from_id(unsignedshortid)¶
look up a memcg from a memcg id
- voidmem_cgroup_css_reset(structcgroup_subsys_state*css)¶
reset the states of a mem_cgroup
Parameters
structcgroup_subsys_state*cssthe target css
Description
Reset the states of the mem_cgroup associated with css. This is invoked when the userland requests disabling on the default hierarchy but the memcg is pinned through dependency. The memcg should stop applying policies and should revert to the vanilla state as it may be made visible again.
The current implementation only resets the essential configurations.This needs to be expanded to cover all the visible parts.
- voidmem_cgroup_calculate_protection(structmem_cgroup*root,structmem_cgroup*memcg)¶
check if memory consumption is in the normal range
Parameters
structmem_cgroup*rootthe top ancestor of the sub-tree being checked
structmem_cgroup*memcgthe memory cgroup to check
Description
- WARNING: This function is not stateless! It can only be used as part of a top-down tree iteration, not for isolated queries.
Parameters
structfolio*foliofolio being charged
gfp_tgfpreclaim mode
Description
This function is called when allocating a huge page folio, after the page has already been obtained and charged to the appropriate hugetlb cgroup controller (if it is enabled).
Returns ENOMEM if the memcg is already full. Returns 0 if either the charge was successful, or if we skip the charging.
- intmem_cgroup_swapin_charge_folio(structfolio*folio,structmm_struct*mm,gfp_tgfp,swp_entry_tentry)¶
Charge a newly allocated folio for swapin.
Parameters
structfolio*foliofolio to charge.
structmm_struct*mmmm context of the victim
gfp_tgfpreclaim mode
swp_entry_tentryswap entry for which the folio is allocated
Description
This function charges a folio allocated for swapin. Please call this before adding the folio to the swapcache.
Returns 0 on success. Otherwise, an error code is returned.
Parameters
structfolio*oldCurrently circulating folio.
structfolio*newReplacement folio.
Description
Charge new as a replacement folio for old. old will be uncharged upon free.
Both folios must be locked, new->mapping must be set up.
- voidmem_cgroup_migrate(structfolio*old,structfolio*new)¶
Transfer the memcg data from the old to the new folio.
Parameters
structfolio*oldCurrently circulating folio.
structfolio*newReplacement folio.
Description
Transfer the memcg data from the old folio to the new folio for migration. The old folio’s data info will be cleared. Note that the memory counters will remain unchanged throughout the process.
Both folios must be locked, new->mapping must be set up.
Parameters
conststructsock*sksocket in memcg to charge
unsignedintnr_pagesnumber of pages to charge
gfp_tgfp_maskreclaim mode
Description
Charges nr_pages to memcg. Returns true if the charge fit within memcg’s configured limit, false if it doesn’t.
Parameters
conststructsock*sksocket in memcg to uncharge
unsignedintnr_pagesnumber of pages to uncharge
- int__mem_cgroup_try_charge_swap(structfolio*folio,swp_entry_tentry)¶
try charging swap space for a folio
Parameters
structfolio*foliofolio being added to swap
swp_entry_tentryswap entry to charge
Description
Try to charge folio’s memcg for the swap space at entry.
Returns 0 on success, -ENOMEM on failure.
- void__mem_cgroup_uncharge_swap(swp_entry_tentry,unsignedintnr_pages)¶
uncharge swap space
Parameters
swp_entry_tentryswap entry to uncharge
unsignedintnr_pagesthe amount of swap space to uncharge
- boolobj_cgroup_may_zswap(structobj_cgroup*objcg)¶
check if this cgroup can zswap
Parameters
structobj_cgroup*objcgthe object cgroup
Description
Check if the hierarchical zswap limit has been reached.
This doesn’t check for specific headroom, and it is not atomic either. But with zswap, the size of the allocation is only known once compression has occurred, and this optimistic pre-check avoids spending cycles on compression when there is already no room left or zswap is disabled altogether somewhere in the hierarchy.
- voidobj_cgroup_charge_zswap(structobj_cgroup*objcg,size_tsize)¶
charge compression backend memory
Parameters
structobj_cgroup*objcgthe object cgroup
size_tsizesize of compressed object
Description
This forces the charge after obj_cgroup_may_zswap() allowed compression and storage in zswap for this cgroup to go ahead.
- voidobj_cgroup_uncharge_zswap(structobj_cgroup*objcg,size_tsize)¶
uncharge compression backend memory
Parameters
structobj_cgroup*objcgthe object cgroup
size_tsizesize of compressed object
Description
Uncharges zswap memory on page in.
- boolshmem_recalc_inode(structinode*inode,longalloced,longswapped)¶
recalculate the block usage of an inode
Parameters
structinode*inodeinode to recalc
longallocedthe change in number of pages allocated to inode
longswappedthe change in number of pages swapped from inode
Description
We have to calculate the free blocks since the mm can drop undirtied hole pages behind our back.
But normally info->alloced == inode->i_mapping->nrpages + info->swapped, so mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped).
Return
true if swapped was incremented from 0, for shmem_writeout().
- intshmem_writeout(structfolio*folio,structswap_iocb**plug,structlist_head*folio_list)¶
Write the folio to swap
Parameters
structfolio*folioThe folio to write
structswap_iocb**plugswap plug
structlist_head*folio_listlist to put back folios on split
Description
Move the folio from the page cache to the swap cache.
- intshmem_get_folio(structinode*inode,pgoff_tindex,loff_twrite_end,structfolio**foliop,enumsgp_typesgp)¶
find, and lock a shmem folio.
Parameters
structinode*inodeinode to search
pgoff_tindexthe page index.
loff_twrite_endend of a write, could extend inode size
structfolio**folioppointer to the folio if found
enumsgp_typesgpSGP_* flags to control behavior
Description
Looks up the page cache entry at inode & index. If a folio is present, it is returned locked with an increased refcount.
If the caller modifies data in the folio, it must call folio_mark_dirty() before unlocking the folio to ensure that the folio is not reclaimed. There is no need to reserve space before calling folio_mark_dirty().
- When no folio is found, the behavior depends on sgp:
for SGP_READ, *foliop is NULL and 0 is returned
for SGP_NOALLOC, *foliop is NULL and -ENOENT is returned
for all other flags a new folio is allocated, inserted into the page cache and returned locked in foliop.
Context
May sleep.
Return
0 if successful, else a negative error code.
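A hedged caller-side sketch, following the rules above with SGP_CACHE (allocate when nothing is present); the helper name is illustrative.

#include <linux/shmem_fs.h>
#include <linux/pagemap.h>

/* Illustrative helper: make sure the shmem folio at "index" exists and is
 * marked dirty, then release the lock and reference that shmem_get_folio()
 * handed back. */
static int demo_touch_shmem_index(struct inode *inode, pgoff_t index)
{
        struct folio *folio;
        int err;

        err = shmem_get_folio(inode, index, 0, &folio, SGP_CACHE);
        if (err)
                return err;

        folio_mark_dirty(folio);        /* no space reservation needed */
        folio_unlock(folio);
        folio_put(folio);
        return 0;
}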
- structfile*shmem_kernel_file_setup(constchar*name,loff_tsize,unsignedlongflags)¶
get an unlinked file living in tmpfs which must be kernel internal. There will be NO LSM permission checks against the underlying inode. So users of this interface must do LSM checks at a higher layer. The users are the big_key and shm implementations. LSM checks are provided at the key or shm level rather than the inode.
Parameters
constchar*namename for dentry (to be seen in /proc/<pid>/maps)
loff_tsizesize to be set for the file
unsignedlongflagsVM_NORESERVE suppresses pre-accounting of the entire object size
- structfile*shmem_file_setup(constchar*name,loff_tsize,unsignedlongflags)¶
get an unlinked file living in tmpfs
Parameters
constchar*namename for dentry (to be seen in /proc/<pid>/maps)
loff_tsizesize to be set for the file
unsignedlongflagsVM_NORESERVE suppresses pre-accounting of the entire object size
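A hedged sketch of typical usage: the returned struct file is unlinked tmpfs backing that the caller later drops with fput(); the names here are illustrative.

#include <linux/shmem_fs.h>
#include <linux/file.h>
#include <linux/err.h>

/* Illustrative: create anonymous tmpfs backing of "size" bytes. */
static struct file *demo_make_scratch_file(loff_t size)
{
        struct file *file = shmem_file_setup("demo-scratch", size, 0);

        if (IS_ERR(file))
                return file;

        /* ... use file->f_mapping as backing storage ... */
        return file;    /* caller releases with fput(file) when done */
}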
- structfile*shmem_file_setup_with_mnt(structvfsmount*mnt,constchar*name,loff_tsize,unsignedlongflags)¶
get an unlinked file living in tmpfs
Parameters
structvfsmount*mntthe tmpfs mount where the file will be created
constchar*namename for dentry (to be seen in /proc/<pid>/maps)
loff_tsizesize to be set for the file
unsignedlongflagsVM_NORESERVE suppresses pre-accounting of the entire object size
- intshmem_zero_setup(structvm_area_struct*vma)¶
setup a shared anonymous mapping
Parameters
structvm_area_struct*vmathe vma to be mmapped is prepared by do_mmap
Return
0 on success, or error
- intshmem_zero_setup_desc(structvm_area_desc*desc)¶
same as shmem_zero_setup, but determined by VMA descriptor for convenience.
Parameters
structvm_area_desc*descDescribes VMA
Return
0 on success, or error
- structfolio*shmem_read_folio_gfp(structaddress_space*mapping,pgoff_tindex,gfp_tgfp)¶
read into page cache, using specified page allocation flags.
Parameters
structaddress_space*mappingthe folio’s address_space
pgoff_tindexthe folio index
gfp_tgfpthe page allocator flags to use if allocating
Description
This behaves as a tmpfs “read_cache_page_gfp(mapping, index, gfp)”, with any new page allocations done using the specified allocation flags. But read_cache_page_gfp() uses the ->read_folio() method: which does not suit tmpfs, since it may have pages in swapcache, and needs to find those for itself; although drivers/gpu/drm i915 and ttm rely upon this support.
i915_gem_object_get_pages_gtt() mixes __GFP_NORETRY | __GFP_NOWARN in with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily.
- intmigrate_vma_split_folio(structfolio*folio,structpage*fault_page)¶
Helper function to split a THP folio
Parameters
structfolio*foliothe folio to split
structpage*fault_pagestruct page associated with the fault, if any
Description
Returns 0 on success
- intmigrate_vma_setup(structmigrate_vma*args)¶
prepare to migrate a range of memory
Parameters
structmigrate_vma*argscontains the vma, start, and pfns arrays for the migration
Return
negative errno on failures, 0 when 0 or more pages were migrated without an error.
Description
Prepare to migrate a range of virtual addresses by collecting all the pages backing each virtual address in the range, saving them inside the src array. Then lock those pages and unmap them. Once the pages are locked and unmapped, check whether each page is pinned or not. Pages that aren’t pinned have the MIGRATE_PFN_MIGRATE flag set (by this function) in the corresponding src array entry. Then restores any pages that are pinned, by remapping and unlocking those pages.
The caller should then allocate destination memory and copy source memory to it for all those entries (ie with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the caller must update each corresponding entry in the dst array with the pfn value of the destination page and with MIGRATE_PFN_VALID. Destination pages must be locked via lock_page().
Note that the caller does not have to migrate all the pages that are marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration from device memory to system memory. If the caller cannot migrate a device page back to system memory, then it must return VM_FAULT_SIGBUS, which has severe consequences for the userspace process, so it must be avoided if at all possible.
For empty entries inside the CPU page table (pte_none() or pmd_none() is true) we do set the MIGRATE_PFN_MIGRATE flag inside the corresponding source array entry, thus allowing the caller to allocate device memory for those unbacked virtual addresses. For this the caller simply has to allocate device memory and properly set the destination entry like for regular migration. Note that this can still fail, and thus inside the device driver you must check if the migration was successful for those entries after calling migrate_vma_pages(), just like for regular migration.
After that, the callers must call migrate_vma_pages() to go over each entry in the src array that has the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set. If the corresponding entry in the dst array has the MIGRATE_PFN_VALID flag set, then migrate_vma_pages() migrates struct page information from the source struct page to the destination struct page. If it fails to migrate the struct page information, then it clears the MIGRATE_PFN_MIGRATE flag in the src array.
At this point all successfully migrated pages have an entry in the src array with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set and the dst array entry with MIGRATE_PFN_VALID flag set.
Once migrate_vma_pages() returns the caller may inspect which pages were successfully migrated, and which were not. Successfully migrated pages will have the MIGRATE_PFN_MIGRATE flag set for their src array entry.
It is safe to update the device page table after migrate_vma_pages() because both destination and source page are still locked, and the mmap_lock is held in read mode (hence no one can unmap the range being migrated).
Once the caller is done cleaning up things and updating its page table (if it chose to do so, this is not an obligation) it finally calls migrate_vma_finalize() to update the CPU page table to point to new pages for successfully migrated pages or otherwise restore the CPU page table to point to the original source pages.
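A hedged sketch of the whole sequence described above. The batch size, function name and flag choice are illustrative, ordinary system pages are used as destinations purely to keep the example self-contained (a device driver would allocate device memory here), and the caller is assumed to hold the mmap_lock in read mode.

#include <linux/migrate.h>
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/mm.h>

#define DEMO_NPAGES 16  /* illustrative: range covers at most 16 pages */

static int demo_migrate_range(struct vm_area_struct *vma,
                              unsigned long start, unsigned long end)
{
        unsigned long src[DEMO_NPAGES] = { 0 }, dst[DEMO_NPAGES] = { 0 };
        struct migrate_vma args = {
                .vma    = vma,
                .start  = start,
                .end    = end,
                .src    = src,
                .dst    = dst,
                .flags  = MIGRATE_VMA_SELECT_SYSTEM,
        };
        unsigned long i;
        int ret;

        ret = migrate_vma_setup(&args);
        if (ret)
                return ret;

        for (i = 0; i < args.npages; i++) {
                struct page *spage, *dpage;

                if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
                        continue;

                dpage = alloc_page(GFP_HIGHUSER);
                if (!dpage)
                        continue;       /* leave dst[i] empty: skip this page */
                lock_page(dpage);       /* destinations must be locked */

                spage = migrate_pfn_to_page(args.src[i]);
                if (spage)
                        copy_highpage(dpage, spage);
                else
                        clear_highpage(dpage);  /* pte_none() source entry */

                args.dst[i] = migrate_pfn(page_to_pfn(dpage));
        }

        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);    /* unlocks and releases the pages */
        return 0;
}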
- intmigrate_vma_insert_huge_pmd_page(structmigrate_vma*migrate,unsignedlongaddr,structpage*page,unsignedlong*src,pmd_t*pmdp)¶
Insert a huge folio into migrate->vma->vm_mm at addr. The folio is already allocated as a part of the migration process with a large page.
Parameters
structmigrate_vma*migratemigrate_vma arguments
unsignedlongaddraddress where the folio will be inserted
structpage*pagepage to be inserted ataddr
unsignedlong*srcsrc pfn which is being migrated
pmd_t*pmdppointer to the pmd
Description
page needs to be initialized and set up after it’s allocated. The code bits here follow closely the code in __do_huge_pmd_anonymous_page(). This API does not support THP zero pages.
- voidmigrate_device_pages(unsignedlong*src_pfns,unsignedlong*dst_pfns,unsignedlongnpages)¶
migrate meta-data from src page to dst page
Parameters
unsignedlong*src_pfnssrc_pfns returned from migrate_device_range()
unsignedlong*dst_pfnsarray of pfns allocated by the driver to migrate memory to
unsignedlongnpagesnumber of pages in the range
Description
Equivalent to migrate_vma_pages(). This is called to migrate struct page meta-data from the source struct page to the destination.
- voidmigrate_vma_pages(structmigrate_vma*migrate)¶
migrate meta-data from src page to dst page
Parameters
structmigrate_vma*migratemigrate struct containing all migration information
Description
This migrates struct page meta-data from the source struct page to the destination struct page. This effectively finishes the migration from the source page to the destination page.
- voidmigrate_vma_finalize(structmigrate_vma*migrate)¶
restore CPU page table entry
Parameters
structmigrate_vma*migratemigrate struct containing all migration information
Description
This replaces the special migration pte entry with either a mapping to the new page if migration was successful for that page, or to the original page otherwise.
This also unlocks the pages and puts them back on the lru, or drops the extra refcount, for device pages.
- intmigrate_device_range(unsignedlong*src_pfns,unsignedlongstart,unsignedlongnpages)¶
migrate device private pfns to normal memory.
Parameters
unsignedlong*src_pfnsarray large enough to hold migrating source device private pfns.
unsignedlongstartstarting pfn in the range to migrate.
unsignedlongnpagesnumber of pages to migrate.
Description
migrate_device_range() is similar in concept to migrate_vma_setup(), except that instead of looking up pages based on virtual address mappings, a range of device pfns that should be migrated to system memory is used.
This is useful when a driver needs to free device memory but doesn't know the virtual mappings of every page that may be in device memory. For example this is often the case when a driver is being unloaded or unbound from a device.
Like migrate_vma_setup() this function will take a reference and lock any migrating pages that aren't free before unmapping them. Drivers may then allocate destination pages and start copying data from the device to CPU memory before calling migrate_device_pages().
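A minimal sketch of that sequence, loosely modeled on how a driver might evict a contiguous range of device private pages at unbind time. The copy step, GFP choices and error handling are illustrative only; migrate_device_finalize() completes the operation.

    unsigned long *src, *dst;
    unsigned long i;

    src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
    dst = kvcalloc(npages, sizeof(*dst), GFP_KERNEL);
    if (!src || !dst)
            return -ENOMEM;                 /* simplified error handling */

    migrate_device_range(src, start_pfn, npages);

    for (i = 0; i < npages; i++) {
            struct page *dpage;

            if (!(src[i] & MIGRATE_PFN_MIGRATE))
                    continue;

            dpage = alloc_page(GFP_HIGHUSER_MOVABLE);
            if (!dpage)
                    continue;               /* this entry simply fails to migrate */
            lock_page(dpage);
            /* copy the device data backing migrate_pfn_to_page(src[i]) into dpage */
            dst[i] = migrate_pfn(page_to_pfn(dpage));
    }

    migrate_device_pages(src, dst, npages);
    migrate_device_finalize(src, dst, npages);

    kvfree(src);
    kvfree(dst);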
- intmigrate_device_pfns(unsignedlong*src_pfns,unsignedlongnpages)¶
migrate device private pfns to normal memory.
Parameters
unsignedlong*src_pfnspre-populated array of source device private pfns to migrate.
unsignedlongnpagesnumber of pages to migrate.
Description
Similar to migrate_device_range() but supports a non-contiguous pre-populated array of device pages to migrate.
- structwp_walk¶
Private struct for pagetable walk callbacks
Definition:
struct wp_walk {
        struct mmu_notifier_range range;
        unsigned long tlbflush_start;
        unsigned long tlbflush_end;
        unsigned long total;
};
Members
rangeRange for mmu notifiers
tlbflush_startAddress of first modified pte
tlbflush_endAddress of last modified pte + 1
totalTotal number of modified ptes
- intwp_pte(pte_t*pte,unsignedlongaddr,unsignedlongend,structmm_walk*walk)¶
Write-protect a pte
Parameters
pte_t*ptePointer to the pte
unsignedlongaddrThe start of the virtual address range to protect
unsignedlongendThe end of the virtual address range to protect
structmm_walk*walkpagetable walk callback argument
Description
The function write-protects a pte and records the range in virtual address space of touched ptes for efficient range TLB flushes.
- structclean_walk¶
Private struct for the clean_record_pte function.
Definition:
struct clean_walk {
        struct wp_walk base;
        pgoff_t bitmap_pgoff;
        unsigned long *bitmap;
        pgoff_t start;
        pgoff_t end;
};
Members
basestructwp_walk we derive from
bitmap_pgoffAddress_space page offset of the first bit inbitmap
bitmapBitmap with one bit for each page offset in the address_space range covered.
startAddress_space page offset of first modified pte relative tobitmap_pgoff
endAddress_space page offset of last modified pte relative tobitmap_pgoff
- intclean_record_pte(pte_t*pte,unsignedlongaddr,unsignedlongend,structmm_walk*walk)¶
Clean a pte and record its address space offset in a bitmap
Parameters
pte_t*ptePointer to the pte
unsignedlongaddrThe start of the virtual address range to clean
unsignedlongendThe end of the virtual address range to clean
structmm_walk*walkpagetable walk callback argument
Description
The function cleans a pte and records the range in virtual address space of touched ptes for efficient TLB flushes. It also records dirty ptes in a bitmap representing page offsets in the address_space, as well as the first and last of the bits touched.
- unsignedlongwp_shared_mapping_range(structaddress_space*mapping,pgoff_tfirst_index,pgoff_tnr)¶
Write-protect all ptes in an address space range
Parameters
structaddress_space*mappingThe address_space we want to write protect
pgoff_tfirst_indexThe first page offset in the range
pgoff_tnrNumber of incremental page offsets to cover
Note
This function currently skips transhuge page-table entries, sinceit’s intended for dirty-tracking on the PTE level. It will warn onencountering transhuge write-enabled entries, though, and can easily beextended to handle them as well.
Return
The number of ptes actually write-protected. Note thatalready write-protected ptes are not counted.
- unsignedlongclean_record_shared_mapping_range(structaddress_space*mapping,pgoff_tfirst_index,pgoff_tnr,pgoff_tbitmap_pgoff,unsignedlong*bitmap,pgoff_t*start,pgoff_t*end)¶
Clean and record all ptes in an address space range
Parameters
structaddress_space*mappingThe address_space we want to clean
pgoff_tfirst_indexThe first page offset in the range
pgoff_tnrNumber of incremental page offsets to cover
pgoff_tbitmap_pgoffThe page offset of the first bit inbitmap
unsignedlong*bitmapPointer to a bitmap of at least nr bits. The bitmap needs to cover the whole range first_index..first_index + nr.
pgoff_t*startPointer to the number of the first set bit in bitmap. The value is modified as new bits are set by the function.
pgoff_t*endPointer to the number of the last set bit in bitmap. The value is modified as new bits are set by the function.
Description
When this function returns there is no guarantee that a CPU hasnot already dirtied new ptes. However it will not clean any ptes notreported in the bitmap. The guarantees are as follows:
All ptes dirty when the function starts executing will end up recordedin the bitmap.
All ptes dirtied after that will either remain dirty, be recorded in thebitmap or both.
If a caller needs to make sure all dirty ptes are picked up and noneadditional are added, it first needs to write-protect the address-spacerange and make sure new writers are blocked inpage_mkwrite() orpfn_mkwrite(). And then after a TLB flush following the write-protectionpick up all dirty bits.
This function currently skips transhuge page-table entries, sinceit’s intended for dirty-tracking on the PTE level. It will warn onencountering transhuge dirty entries, though, and can easily be extendedto handle them as well.
Return
The number of dirty ptes actually cleaned.
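A hedged sketch of the write-protect-then-clean pattern described above; handle_dirty_page() is a hypothetical driver hook and mapping, first_index and nr are assumed to come from the caller.

    pgoff_t start = nr, end = 0;    /* span of set bits, updated by the call below */
    unsigned long *bitmap;
    pgoff_t i;

    bitmap = bitmap_zalloc(nr, GFP_KERNEL);
    if (!bitmap)
            return -ENOMEM;

    /* Block new writers so they fault through page_mkwrite()/pfn_mkwrite() ... */
    wp_shared_mapping_range(mapping, first_index, nr);

    /* ... then record which page offsets were dirtied. */
    clean_record_shared_mapping_range(mapping, first_index, nr,
                                      first_index, bitmap, &start, &end);

    for_each_set_bit(i, bitmap, nr)
            handle_dirty_page(mapping, first_index + i);    /* hypothetical */

    bitmap_free(bitmap);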
- boolpcpu_addr_in_chunk(structpcpu_chunk*chunk,void*addr)¶
check if the address is served from this chunk
Parameters
structpcpu_chunk*chunkchunk of interest
void*addrpercpu address
Return
True if the address is served from this chunk.
- boolpcpu_check_block_hint(structpcpu_block_md*block,intbits,size_talign)¶
check against the contig hint
Parameters
structpcpu_block_md*blockblock of interest
intbitssize of allocation
size_talignalignment of area (max PAGE_SIZE)
Description
Check to see if the allocation can fit in the block’s contig hint.Note, a chunk uses the same hints as a block so this can also check againstthe chunk’s contig hint.
- voidpcpu_next_md_free_region(structpcpu_chunk*chunk,int*bit_off,int*bits)¶
finds the next hint free area
Parameters
structpcpu_chunk*chunkchunk of interest
int*bit_offchunk offset
int*bitssize of free area
Description
Helper function for pcpu_for_each_md_free_region. It checksblock->contig_hint and performs aggregation across blocks to find thenext hint. It modifies bit_off and bits in-place to be consumed in theloop.
- voidpcpu_next_fit_region(structpcpu_chunk*chunk,intalloc_bits,intalign,int*bit_off,int*bits)¶
finds fit areas for a given allocation request
Parameters
structpcpu_chunk*chunkchunk of interest
intalloc_bitssize of allocation
intalignalignment of area (max PAGE_SIZE)
int*bit_offchunk offset
int*bitssize of free area
Description
Finds the next free region that is viable for use with a given size andalignment. This only returns if there is a valid area to be used for thisallocation. block->first_free is returned if the allocation request fitswithin the block to see if the request can be fulfilled prior to the contighint.
- void*pcpu_mem_zalloc(size_tsize,gfp_tgfp)¶
allocate memory
Parameters
size_tsizebytes to allocate
gfp_tgfpallocation flags
Description
Allocatesize bytes. Ifsize is smaller than PAGE_SIZE,kzalloc() is used; otherwise, the equivalent ofvzalloc() is used.This is to facilitate passing through whitelisted flags. Thereturned memory is always zeroed.
Return
Pointer to the allocated area on success, NULL on failure.
- voidpcpu_mem_free(void*ptr)¶
free memory
Parameters
void*ptrmemory to free
Description
Freeptr.ptr should have been allocated usingpcpu_mem_zalloc().
- voidpcpu_chunk_relocate(structpcpu_chunk*chunk,intoslot)¶
put chunk in the appropriate chunk slot
Parameters
structpcpu_chunk*chunkchunk of interest
intoslotthe previous slot it was on
Description
This function is called after an allocation or free changed chunk. The new slot according to the changed state is determined and chunk is moved to the slot. Note that the reserved chunk is never put on chunk slots.
Context
pcpu_lock.
- voidpcpu_block_update(structpcpu_block_md*block,intstart,intend)¶
updates a block given a free area
Parameters
structpcpu_block_md*blockblock of interest
intstartstart offset in block
intendend offset in block
Description
Updates a block given a known free area. The region [start, end) isexpected to be the entirety of the free area within a block. Choosesthe best starting offset if the contig hints are equal.
- voidpcpu_chunk_refresh_hint(structpcpu_chunk*chunk,boolfull_scan)¶
updates metadata about a chunk
Parameters
structpcpu_chunk*chunkchunk of interest
boolfull_scanif we should scan from the beginning
Description
Iterates over the metadata blocks to find the largest contig area.A full scan can be avoided on the allocation path as this is triggeredif we broke the contig_hint. In doing so, the scan_hint will be beforethe contig_hint or after if the scan_hint == contig_hint. This cannotbe prevented on freeing as we want to find the largest area possiblyspanning blocks.
- voidpcpu_block_refresh_hint(structpcpu_chunk*chunk,intindex)¶
Parameters
structpcpu_chunk*chunkchunk of interest
intindexindex of the metadata block
Description
Scans over the block beginning at first_free and updates the blockmetadata accordingly.
- voidpcpu_block_update_hint_alloc(structpcpu_chunk*chunk,intbit_off,intbits)¶
update hint on allocation path
Parameters
structpcpu_chunk*chunkchunk of interest
intbit_offchunk offset
intbitssize of request
Description
Updates metadata for the allocation path. The metadata only has to berefreshed by a full scan iff the chunk’s contig hint is broken. Block levelscans are required if the block’s contig hint is broken.
- voidpcpu_block_update_hint_free(structpcpu_chunk*chunk,intbit_off,intbits)¶
updates the block hints on the free path
Parameters
structpcpu_chunk*chunkchunk of interest
intbit_offchunk offset
intbitssize of request
Description
Updates metadata for the free path. This avoids a blind block refresh by making use of the block contig hints. If this fails, it scans forward and backward to determine the extent of the free area. This is capped at the boundary of blocks.
A chunk update is triggered if a page becomes free, a block becomes free,or the free spans across blocks. This tradeoff is to minimize iteratingover the block metadata to update chunk_md->contig_hint.chunk_md->contig_hint may be off by up to a page, but it will never be morethan the available space. If the contig hint is contained in one block, itwill be accurate.
- boolpcpu_is_populated(structpcpu_chunk*chunk,intbit_off,intbits,int*next_off)¶
determines if the region is populated
Parameters
structpcpu_chunk*chunkchunk of interest
intbit_offchunk offset
intbitssize of area
int*next_offreturn value for the next offset to start searching
Description
For atomic allocations, check if the backing pages are populated.
Return
True if the backing pages are populated. next_off is used to skip over unpopulated blocks in pcpu_find_block_fit().
- intpcpu_find_block_fit(structpcpu_chunk*chunk,intalloc_bits,size_talign,boolpop_only)¶
finds the block index to start searching
Parameters
structpcpu_chunk*chunkchunk of interest
intalloc_bitssize of request in allocation units
size_talignalignment of area (max PAGE_SIZE bytes)
boolpop_onlyuse populated regions only
Description
Given a chunk and an allocation spec, find the offset to begin searchingfor a free region. This iterates over the bitmap metadata blocks tofind an offset that will be guaranteed to fit the requirements. It isnot quite first fit as if the allocation does not fit in the contig hintof a block or chunk, it is skipped. This errs on the side of cautionto prevent excess iteration. Poor alignment can cause the allocator toskip over blocks and chunks that have valid free areas.
Return
The offset in the bitmap to begin searching.-1 if no offset is found.
- intpcpu_alloc_area(structpcpu_chunk*chunk,intalloc_bits,size_talign,intstart)¶
allocates an area from a pcpu_chunk
Parameters
structpcpu_chunk*chunkchunk of interest
intalloc_bitssize of request in allocation units
size_talignalignment of area (max PAGE_SIZE)
intstartbit_off to start searching
Description
This function takes in astart offset to begin searching to fit anallocation ofalloc_bits with alignmentalign. It needs to scanthe allocation map because if it fits within the block’s contig hint,start will be block->first_free. This is an attempt to fill theallocation prior to breaking the contig hint. The allocation andboundary maps are updated accordingly if it confirms a validfree area.
Return
Allocated addr offset inchunk on success.-1 if no matching area is found.
- intpcpu_free_area(structpcpu_chunk*chunk,intoff)¶
frees the corresponding offset
Parameters
structpcpu_chunk*chunkchunk of interest
intoffaddr offset into chunk
Description
This function determines the size of an allocation to free usingthe boundary bitmap and clears the allocation map.
Return
Number of freed bytes.
- structpcpu_chunk*pcpu_alloc_first_chunk(unsignedlongtmp_addr,intmap_size)¶
creates chunks that serve the first chunk
Parameters
unsignedlongtmp_addrthe start of the region served
intmap_sizesize of the region served
Description
This is responsible for creating the chunks that serve the first chunk. The base_addr is tmp_addr aligned down to a page boundary while the region end is aligned up. Offsets are kept track of to determine the region served. All this is done to appease the bitmap allocator in avoiding partial blocks.
Return
Chunk serving the region attmp_addr ofmap_size.
- voidpcpu_chunk_populated(structpcpu_chunk*chunk,intpage_start,intpage_end)¶
post-population bookkeeping
Parameters
structpcpu_chunk*chunkpcpu_chunk which got populated
intpage_startthe start page
intpage_endthe end page
Description
Pages in [page_start, page_end) have been populated to chunk. Update the bookkeeping information accordingly. Must be called after each successful population.
- voidpcpu_chunk_depopulated(structpcpu_chunk*chunk,intpage_start,intpage_end)¶
post-depopulation bookkeeping
Parameters
structpcpu_chunk*chunkpcpu_chunk which got depopulated
intpage_startthe start page
intpage_endthe end page
Description
Pages in [page_start, page_end) have been depopulated from chunk. Update the bookkeeping information accordingly. Must be called after each successful depopulation.
- structpcpu_chunk*pcpu_chunk_addr_search(void*addr)¶
determine chunk containing specified address
Parameters
void*addraddress for which the chunk needs to be determined.
Description
This is an internal function that handles all but static allocations.Static percpu address values should never be passed into the allocator.
Return
The address of the found chunk.
- void__percpu*pcpu_alloc(size_tsize,size_talign,boolreserved,gfp_tgfp)¶
the percpu allocator
Parameters
size_tsizesize of area to allocate in bytes
size_talignalignment of area (max PAGE_SIZE)
boolreservedallocate from the reserved chunk if available
gfp_tgfpallocation flags
Description
Allocate percpu area ofsize bytes aligned atalign. Ifgfp doesn’tcontainGFP_KERNEL, the allocation is atomic. Ifgfp has __GFP_NOWARNthen no warning will be triggered on invalid or failed allocationrequests.
Return
Percpu pointer to the allocated area on success, NULL on failure.
- voidpcpu_balance_free(boolempty_only)¶
manage the amount of free chunks
Parameters
boolempty_onlyfree chunks only if there are no populated pages
Description
If empty_only isfalse, reclaim all fully free chunks regardless of thenumber of populated pages. Otherwise, only reclaim chunks that have nopopulated pages.
Context
pcpu_lock (can be dropped temporarily)
- voidpcpu_balance_populated(void)¶
manage the amount of populated pages
Parameters
voidno arguments
Description
Maintain a certain amount of populated pages to satisfy atomic allocations.It is possible that this is called when physical memory is scarce causingOOM killer to be triggered. We should avoid doing so until an actualallocation causes the failure as it is possible that requests can beserviced from already backed regions.
Context
pcpu_lock (can be dropped temporarily)
- voidpcpu_reclaim_populated(void)¶
scan over to_depopulate chunks and free empty pages
Parameters
voidno arguments
Description
Scan over chunks in the depopulate list and try to release unused populatedpages back to the system. Depopulated chunks are sidelined to preventrepopulating these pages unless required. Fully free chunks are reintegratedand freed accordingly (1 is kept around). If we drop below the emptypopulated pages threshold, reintegrate the chunk if it has empty free pages.Each chunk is scanned in the reverse order to keep populated pages close tothe beginning of the chunk.
Context
pcpu_lock (can be dropped temporarily)
- voidpcpu_balance_workfn(structwork_struct*work)¶
manage the amount of free chunks and populated pages
Parameters
structwork_struct*workunused
Description
For each chunk type, manage the number of fully free chunks and the number ofpopulated pages. An important thing to consider is when pages are freed andhow they contribute to the global counts.
- voidfree_percpu(void__percpu*ptr)¶
free percpu area
Parameters
void__percpu*ptrpointer to area to free
Description
Free percpu areaptr.
Context
Can be called from atomic context.
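The internal routines above back the public percpu API. A minimal usage sketch follows; the struct and field names are made up for illustration.

    struct my_stats {                       /* hypothetical per-CPU statistics */
            u64 events;
    };

    struct my_stats __percpu *stats;
    unsigned int cpu;
    u64 total = 0;

    stats = alloc_percpu(struct my_stats);  /* ends up in pcpu_alloc() */
    if (!stats)
            return -ENOMEM;

    this_cpu_inc(stats->events);            /* update this CPU's copy */

    for_each_possible_cpu(cpu)              /* sum all per-CPU copies */
            total += per_cpu_ptr(stats, cpu)->events;

    free_percpu(stats);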
- boolis_kernel_percpu_address(unsignedlongaddr)¶
test whether address is from static percpu area
Parameters
unsignedlongaddraddress to test
Description
Test whetheraddr belongs to in-kernel static percpu area. Modulestatic percpu areas are not considered. For those, useis_module_percpu_address().
Return
true ifaddr is from in-kernel static percpu area,false otherwise.
- phys_addr_tper_cpu_ptr_to_phys(void*addr)¶
convert translated percpu address to physical address
Parameters
void*addrthe address to be converted to physical address
Description
Givenaddr which is dereferenceable address obtained via one ofpercpu access macros, this function translates it into its physicaladdress. The caller is responsible for ensuringaddr stays validuntil this function finishes.
percpu allocator has special setup for the first chunk, which currentlysupports either embedding in linear address space or vmalloc mapping,and, from the second one, the backing allocator (currently either vm orkm) provides translation.
The addr can be translated simply without checking if it falls into thefirst chunk. But the current code reflects better how percpu allocatoractually works, and the verification can discover both bugs in percpuallocator itself andper_cpu_ptr_to_phys() callers. So we keep currentcode.
Return
The physical address foraddr.
- structpcpu_alloc_info*pcpu_alloc_alloc_info(intnr_groups,intnr_units)¶
allocate percpu allocation info
Parameters
intnr_groupsthe number of groups
intnr_unitsthe number of units
Description
Allocate ai which is large enough fornr_groups groups containingnr_units units. The returned ai’s groups[0].cpu_map points to thecpu_map array which is long enough fornr_units and filled withNR_CPUS. It’s the caller’s responsibility to initialize cpu_mappointer of other groups.
Return
Pointer to the allocated pcpu_alloc_info on success, NULL onfailure.
- voidpcpu_free_alloc_info(structpcpu_alloc_info*ai)¶
free percpu allocation info
Parameters
structpcpu_alloc_info*aipcpu_alloc_info to free
Description
Freeai which was allocated bypcpu_alloc_alloc_info().
- voidpcpu_dump_alloc_info(constchar*lvl,conststructpcpu_alloc_info*ai)¶
print out information about pcpu_alloc_info
Parameters
constchar*lvlloglevel
conststructpcpu_alloc_info*aiallocation info to dump
Description
Print out information aboutai using loglevellvl.
- voidpcpu_setup_first_chunk(conststructpcpu_alloc_info*ai,void*base_addr)¶
initialize the first percpu chunk
Parameters
conststructpcpu_alloc_info*aipcpu_alloc_info describing how the percpu area is shaped
void*base_addrmapped address
Description
Initialize the first percpu chunk which contains the kernel staticpercpu area. This function is to be called from arch percpu areasetup path.
ai contains all information necessary to initialize the firstchunk and prime the dynamic percpu allocator.
ai->static_size is the size of static percpu area.
ai->reserved_size, if non-zero, specifies the amount of bytes toreserve after the static area in the first chunk. This reservesthe first chunk such that it’s available only through reservedpercpu allocation. This is primarily used to serve module percpustatic areas on architectures where the addressing model haslimited offset range for symbol relocations to guarantee modulepercpu symbols fall inside the relocatable range.
ai->dyn_size determines the number of bytes available for dynamicallocation in the first chunk. The area betweenai->static_size +ai->reserved_size +ai->dyn_size andai->unit_size is unused.
ai->unit_size specifies unit size and must be aligned to PAGE_SIZEand equal to or larger thanai->static_size +ai->reserved_size +ai->dyn_size.
ai->atom_size is the allocation atom size and used as alignmentfor vm areas.
ai->alloc_size is the allocation size and always multiple ofai->atom_size. This is larger thanai->atom_size ifai->unit_size is larger thanai->atom_size.
ai->nr_groups andai->groups describe virtual memory layout ofpercpu areas. Units which should be colocated are put into thesame group. Dynamic VM areas will be allocated according to thesegroupings. Ifai->nr_groups is zero, a single group containingall units is assumed.
The caller should have mapped the first chunk atbase_addr andcopied static data to each unit.
The first chunk will always contain a static and a dynamic region.However, the static region is not managed by any chunk. If the firstchunk also contains a reserved region, it is served by two chunks -one for the reserved region and one for the dynamic region. Theyshare the same vm, but use offset regions in the area allocation map.The chunk serving the dynamic region is circulated in the chunk slotsand available for dynamic allocation like any other chunk.
- structpcpu_alloc_info*pcpu_build_alloc_info(size_treserved_size,size_tdyn_size,size_tatom_size,pcpu_fc_cpu_distance_fn_tcpu_distance_fn)¶
build alloc_info considering distances between CPUs
Parameters
size_treserved_sizethe size of reserved percpu area in bytes
size_tdyn_sizeminimum free size for dynamic allocation in bytes
size_tatom_sizeallocation atom size
pcpu_fc_cpu_distance_fn_tcpu_distance_fncallback to determine distance between cpus, optional
Description
This function determines grouping of units, their mappings to cpusand other parameters considering needed percpu size, allocationatom size and distances between CPUs.
Groups are always multiples of atom size and CPUs which are ofLOCAL_DISTANCE both ways are grouped together and share space forunits in the same group. The returned configuration is guaranteedto have CPUs on different nodes on different groups and >=75% usageof allocated virtual address space.
Return
On success, pointer to the new allocation_info is returned. Onfailure, ERR_PTR value is returned.
- intpcpu_embed_first_chunk(size_treserved_size,size_tdyn_size,size_tatom_size,pcpu_fc_cpu_distance_fn_tcpu_distance_fn,pcpu_fc_cpu_to_node_fn_tcpu_to_nd_fn)¶
embed the first percpu chunk into bootmem
Parameters
size_treserved_sizethe size of reserved percpu area in bytes
size_tdyn_sizeminimum free size for dynamic allocation in bytes
size_tatom_sizeallocation atom size
pcpu_fc_cpu_distance_fn_tcpu_distance_fncallback to determine distance between cpus, optional
pcpu_fc_cpu_to_node_fn_tcpu_to_nd_fncallback to convert cpu to its node, optional
Description
This is a helper to ease setting up embedded first percpu chunk andcan be called wherepcpu_setup_first_chunk() is expected.
If this function is used to setup the first chunk, it is allocatedby calling pcpu_fc_alloc and used as-is without being mapped intovmalloc area. Allocations are always whole multiples ofatom_sizealigned toatom_size.
This enables the first chunk to piggy back on the linear physicalmapping which often uses larger page size. Please note that thiscan result in very sparse cpu->unit mapping on NUMA machines thusrequiring large vmalloc address space. Don’t use this allocator ifvmalloc space is not orders of magnitude larger than distancesbetween node memory addresses (ie. 32bit NUMA machines).
dyn_size specifies the minimum dynamic area size.
If the needed size is smaller than the minimum or specified unitsize, the leftover is returned using pcpu_fc_free.
Return
0 on success, -errno on failure.
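For reference, a sketch of how an architecture-neutral setup path might use it, modeled on the generic setup_per_cpu_areas() in mm/percpu.c; treat the exact reserve constants as illustrative.

    unsigned long delta;
    unsigned int cpu;
    int rc;

    rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE, PERCPU_DYNAMIC_RESERVE,
                                PAGE_SIZE, NULL, NULL);
    if (rc < 0)
            panic("Failed to initialize percpu areas.");

    /* record each CPU's offset from the static percpu section */
    delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
    for_each_possible_cpu(cpu)
            __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];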
- intpcpu_page_first_chunk(size_treserved_size,pcpu_fc_cpu_to_node_fn_tcpu_to_nd_fn)¶
map the first chunk using PAGE_SIZE pages
Parameters
size_treserved_sizethe size of reserved percpu area in bytes
pcpu_fc_cpu_to_node_fn_tcpu_to_nd_fncallback to convert cpu to its node, optional
Description
This is a helper to ease setting up page-remapped first percpuchunk and can be called wherepcpu_setup_first_chunk() is expected.
This is the basic allocator. Static percpu area is allocatedpage-by-page into vmalloc area.
Return
0 on success, -errno on failure.
- longcopy_from_user_nofault(void*dst,constvoid__user*src,size_tsize)¶
safely attempt to read from a user-space location
Parameters
void*dstpointer to the buffer that shall take the data
constvoid__user*srcaddress to read from. This must be a user address.
size_tsizesize of the data chunk
Description
Safely read from user addresssrc to the buffer atdst. If a kernel faulthappens, handle that and return -EFAULT.
- longcopy_to_user_nofault(void__user*dst,constvoid*src,size_tsize)¶
safely attempt to write to a user-space location
Parameters
void__user*dstaddress to write to
constvoid*srcpointer to the data that shall be written
size_tsizesize of the data chunk
Description
Safely write to addressdst from the buffer atsrc. If a kernel faulthappens, handle that and return -EFAULT.
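A small hedged example of the nofault helpers, e.g. for a tracing or profiling hook where taking a page fault is not allowed; peek_user_u64() is a made-up wrapper.

    static int peek_user_u64(const void __user *uaddr, u64 *val)
    {
            /* returns -EFAULT instead of faulting if the page is not present */
            if (copy_from_user_nofault(val, uaddr, sizeof(*val)))
                    return -EFAULT;
            return 0;
    }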
- longstrncpy_from_user_nofault(char*dst,constvoid__user*unsafe_addr,longcount)¶
Copy a NUL terminated string from unsafe user address.
Parameters
char*dstDestination address, in kernel space. This buffer must be atleastcount bytes long.
constvoid__user*unsafe_addrUnsafe user address.
longcountMaximum number of bytes to copy, including the trailing NUL.
Description
Copies a NUL-terminated string from unsafe user address to kernel buffer.
On success, returns the length of the string INCLUDING the trailing NUL.
If access fails, returns -EFAULT (some data may have been copiedand the trailing NUL added).
Ifcount is smaller than the length of the string, copiescount-1 bytes,sets the last byte ofdst buffer to NUL and returnscount.
- longstrnlen_user_nofault(constvoid__user*unsafe_addr,longcount)¶
Get the size of a user string INCLUDING final NUL.
Parameters
constvoid__user*unsafe_addrThe string to measure.
longcountMaximum count (including NUL)
Description
Get the size of a NUL-terminated string in user space without pagefault.
Returns the size of the string INCLUDING the terminating NUL.
If the string is too long, returns a number larger thancount. Userhas to check the return value against “> count”.On exception (or invalid count), returns 0.
Unlike strnlen_user, this can be used from IRQ handler etc. becauseit disables pagefaults.
- boolwriteback_throttling_sane(structscan_control*sc)¶
is the usual dirty throttling mechanism available?
Parameters
structscan_control*scscan_control in question
Description
The normal page dirty throttling mechanism inbalance_dirty_pages() iscompletely broken with the legacy memcg and direct stalling inshrink_folio_list() is used for throttling instead, which lacks all theniceties such as fairness, adaptive pausing, bandwidth proportionalallocation and configurability.
This function tests whether the vmscan currently in progress can assumethat the normal dirty throttling mechanism is operational.
- unsignedlonglruvec_lru_size(structlruvec*lruvec,enumlru_listlru,intzone_idx)¶
Returns the number of pages on the given LRU list.
Parameters
structlruvec*lruveclru vector
enumlru_listlrulru to use
intzone_idxzones to consider (use MAX_NR_ZONES - 1 for the whole LRU list)
- longremove_mapping(structaddress_space*mapping,structfolio*folio)¶
Attempt to remove a folio from its mapping.
Parameters
structaddress_space*mappingThe address space.
structfolio*folioThe folio to remove.
Description
If the folio is dirty, under writeback or if someone else has a refon it, removal will fail.
Return
The number of pages removed from the mapping. 0 if the foliocould not be removed.
Context
The caller should have a single refcount on the folio andhold its lock.
- voidfolio_putback_lru(structfolio*folio)¶
Put previously isolated folio onto appropriate LRU list.
Parameters
structfolio*folioFolio to be returned to an LRU list.
Description
Add previously isolatedfolio to appropriate LRU list.The folio may still be unevictable for other reasons.
Context
lru_lock must not be held, interrupts must be enabled.
- boolfolio_isolate_lru(structfolio*folio)¶
Try to isolate a folio from its LRU list.
Parameters
structfolio*folioFolio to isolate from its LRU list.
Description
Isolate afolio from an LRU list and adjust the vmstat statisticcorresponding to whatever LRU list the folio was on.
The folio will have its LRU flag cleared. If it was found on theactive list, it will have the Active flag set. If it was found on theunevictable list, it will have the Unevictable flag set. These flagsmay need to be cleared by the caller before letting the page go.
Context
Must be called with an elevated refcount on the folio. This is a fundamental difference from isolate_lru_folios() (which is called without a stable reference). The lru_lock must not be held. Interrupts must be enabled.
Return
true if the folio was removed from an LRU list.false if the folio was not on an LRU list.
- voidcheck_move_unevictable_folios(structfolio_batch*fbatch)¶
Move evictable folios to appropriate zone lru list
Parameters
structfolio_batch*fbatchBatch of lru folios to check.
Description
Checks folios for evictability; if an evictable folio is in the unevictable lru list, it is moved to the appropriate evictable lru list. This function should only be used for lru folios.
- void__remove_pages(unsignedlongpfn,unsignedlongnr_pages,structvmem_altmap*altmap)¶
remove sections of pages
Parameters
unsignedlongpfnstarting pageframe (must be aligned to start of a section)
unsignedlongnr_pagesnumber of pages to remove (must be multiple of section size)
structvmem_altmap*altmapalternative device page map or NULL if default memmap is used
Description
Generic helper function to remove section mappings and sysfs entries for the section of the memory we are removing. Caller needs to make sure that pages are marked reserved and zones are adjusted properly by calling offline_pages().
- voidtry_offline_node(intnid)¶
Parameters
intnidthe node ID
Description
Offline a node if all memory sections and cpus of the node are removed.
NOTE
The caller must calllock_device_hotplug() to serialize hotplugand online/offline operations before this call.
- void__remove_memory(u64start,u64size)¶
Remove memory if every memory block is offline
Parameters
u64startphysical address of the region to remove
u64sizesize of the region to remove
NOTE
The caller must calllock_device_hotplug() to serialize hotplugand online/offline operations before this call, as required bytry_offline_node().
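A minimal sketch of the required locking around the call, assuming start and size describe memory blocks that are already offline:

    lock_device_hotplug();
    __remove_memory(start, size);   /* every memory block must already be offline */
    unlock_device_hotplug();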
- unsignedlongmmu_interval_read_begin(structmmu_interval_notifier*interval_sub)¶
Begin a read side critical section against a VA range
Parameters
structmmu_interval_notifier*interval_subThe interval subscription
Description
mmu_interval_read_begin()/mmu_interval_read_retry() implement a collision-retry scheme similar to seqcount for the VA range under subscription. If the mm invokes invalidation during the critical section then mmu_interval_read_retry() will return true.
This is useful to obtain shadow PTEs where teardown or setup of the SPTEsrequire a blocking context. The critical region formed by this can sleep,and the required ‘user_lock’ can also be a sleeping lock.
The caller is required to provide a ‘user_lock’ to serialize both teardownand setup.
The return value should be passed tommu_interval_read_retry().
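A hedged sketch of the collision-retry loop built on these two calls; sub is the driver's mmu_interval_notifier and update_lock stands in for the required 'user_lock', both assumed to be driver state.

    unsigned long seq;

    again:
    seq = mmu_interval_read_begin(&sub);

    /* walk the CPU page tables / fault pages and collect PFNs; may sleep */

    mutex_lock(&update_lock);                   /* the 'user_lock' */
    if (mmu_interval_read_retry(&sub, seq)) {
            mutex_unlock(&update_lock);
            goto again;                         /* range was invalidated, redo */
    }
    /* program the device page tables (SPTEs) from the collected PFNs */
    mutex_unlock(&update_lock);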
- intmmu_notifier_register(structmmu_notifier*subscription,structmm_struct*mm)¶
Register a notifier on a mm
Parameters
structmmu_notifier*subscriptionThe notifier to attach
structmm_struct*mmThe mm to attach the notifier to
Description
Must not hold mmap_lock nor any other VM related lock when callingthis registration function. Must also ensure mm_users can’t go downto zero while this runs to avoid races with mmu_notifier_release,so mm has to be current->mm or the mm should be pinned safely suchas withget_task_mm(). If the mm is not current->mm, the mm_userspin should be released by calling mmput after mmu_notifier_registerreturns.
mmu_notifier_unregister() ormmu_notifier_put() must be always called tounregister the notifier.
While the caller has a mmu_notifier get the subscription->mm pointer will remainvalid, and can be converted to an active mm pointer viammget_not_zero().
- structmmu_notifier*mmu_notifier_get_locked(conststructmmu_notifier_ops*ops,structmm_struct*mm)¶
Return the single struct mmu_notifier for the mm & ops
Parameters
conststructmmu_notifier_ops*opsThe operations struct being subscribed with
structmm_struct*mmThe mm to attach notifiers to
Description
This function either allocates a new mmu_notifier viaops->alloc_notifier(), or returns an already existing notifier on thelist. The value of the ops pointer is used to determine when two notifiersare the same.
Each call tommu_notifier_get() must be paired with a call tommu_notifier_put(). The caller must hold the write side of mm->mmap_lock.
While the caller has a mmu_notifier get the mm pointer will remain valid,and can be converted to an active mm pointer viammget_not_zero().
- voidmmu_notifier_put(structmmu_notifier*subscription)¶
Release the reference on the notifier
Parameters
structmmu_notifier*subscriptionThe notifier to act on
Description
This function must be paired with eachmmu_notifier_get(), it releases thereference obtained by the get. If this is the last reference then processto free the notifier will be run asynchronously.
Unlikemmu_notifier_unregister() the get/put flow only calls ops->releasewhen the mm_struct is destroyed. Instead free_notifier is always called torelease any resources held by the user.
As ops->release is not guaranteed to be called, the user must ensure thatall sptes are dropped, and no new sptes can be established beforemmu_notifier_put() is called.
This function can be called from the ops->release callback, however thecaller must still ensure it is called pairwise withmmu_notifier_get().
Modules calling this function must callmmu_notifier_synchronize() intheir __exit functions to ensure the async work is completed.
- intmmu_interval_notifier_insert(structmmu_interval_notifier*interval_sub,structmm_struct*mm,unsignedlongstart,unsignedlonglength,conststructmmu_interval_notifier_ops*ops)¶
Insert an interval notifier
Parameters
structmmu_interval_notifier*interval_subInterval subscription to register
structmm_struct*mmmm_struct to attach to
unsignedlongstartStarting virtual address to monitor
unsignedlonglengthLength of the range to monitor
conststructmmu_interval_notifier_ops*opsInterval notifier operations to be called on matching events
Description
This function subscribes the interval notifier for notifications from themm. Upon return the ops related to mmu_interval_notifier will be calledwhenever an event that intersects with the given range occurs.
Upon return the range_notifier may not be present in the interval tree yet.The caller must use the normal interval notifier read flow viammu_interval_read_begin() to establish SPTEs for this range.
- voidmmu_interval_notifier_remove(structmmu_interval_notifier*interval_sub)¶
Remove an interval notifier
Parameters
structmmu_interval_notifier*interval_subInterval subscription to unregister
Description
This function must be paired withmmu_interval_notifier_insert(). It cannotbe called from any ops callback.
Once this returns ops callbacks are no longer running on other CPUs andwill not be called in future.
- voidmmu_notifier_synchronize(void)¶
Ensure all mmu_notifiers are freed
Parameters
voidno arguments
Description
This function ensures that all outstanding async SRCU work from mmu_notifier_put() is completed. After it returns any mmu_notifier_ops associated with an unused mmu_notifier will no longer be called.
Before using the caller must ensure that all of its mmu_notifiers have beenfully released viammu_notifier_put().
Modules using themmu_notifier_put() API should call this in their __exitfunction to avoid module unloading races.
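For example, a module using the get/put API would typically end with something like the following; my_driver_exit() is hypothetical.

    static void __exit my_driver_exit(void)
    {
            /* all mmu_notifier_put() calls have been made by this point */
            mmu_notifier_synchronize();     /* wait for async frees to finish */
    }
    module_exit(my_driver_exit);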
- size_tballoon_page_list_enqueue(structballoon_dev_info*b_dev_info,structlist_head*pages)¶
inserts a list of pages into the balloon page list.
Parameters
structballoon_dev_info*b_dev_infoballoon device descriptor where we will insert a new page to
structlist_head*pagespages to enqueue - allocated using balloon_page_alloc.
Description
Driver must call this function to properly enqueue balloon pages beforedefinitively removing them from the guest system.
Return
number of pages that were enqueued.
- size_tballoon_page_list_dequeue(structballoon_dev_info*b_dev_info,structlist_head*pages,size_tn_req_pages)¶
removes pages from balloon’s page list and returns a list of the pages.
Parameters
structballoon_dev_info*b_dev_infoballoon device descriptor where we will grab a page from.
structlist_head*pagespointer to the list of pages that would be returned to the caller.
size_tn_req_pagesnumber of requested pages.
Description
Driver must call this function to properly de-allocate previously enlisted balloon pages before definitively releasing them back to the guest system. This function tries to remove n_req_pages from the ballooned pages and return them to the caller in the pages list.
Note that this function may fail to dequeue some pages even if the balloonisn’t empty - since the page list can be temporarily empty due to compactionof isolated pages.
Return
number of pages that were added to thepages list.
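A hedged inflate-side sketch: allocate pages, tell the host about them (driver specific, not shown), then enqueue them on the balloon list. b_dev_info and nr are assumed to come from the driver.

    LIST_HEAD(pages);
    size_t i, enqueued;

    for (i = 0; i < nr; i++) {
            struct page *page = balloon_page_alloc();

            if (!page)
                    break;
            list_add(&page->lru, &pages);
    }

    /* notify the hypervisor about the pfns on the list here, then: */
    enqueued = balloon_page_list_enqueue(b_dev_info, &pages);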
- vm_fault_tvmf_insert_pfn_pmd(structvm_fault*vmf,unsignedlongpfn,boolwrite)¶
insert a pmd size pfn
Parameters
structvm_fault*vmfStructure describing the fault
unsignedlongpfnpfn to insert
boolwritewhether it’s a write fault
Description
Insert a pmd size pfn. Seevmf_insert_pfn() for additional info.
Return
vm_fault_t value.
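A hedged sketch of a DAX-style .huge_fault handler using it; my_dev_pfn() is hypothetical, and the order argument of the hook reflects recent kernels.

    static vm_fault_t my_huge_fault(struct vm_fault *vmf, unsigned int order)
    {
            /* hypothetical helper mapping the faulting address to a device pfn */
            unsigned long pfn = my_dev_pfn(vmf->vma, vmf->address);

            if (order == PMD_ORDER)
                    return vmf_insert_pfn_pmd(vmf, pfn,
                                              vmf->flags & FAULT_FLAG_WRITE);
            return VM_FAULT_FALLBACK;       /* let the core fall back to PTEs */
    }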
- vm_fault_tvmf_insert_pfn_pud(structvm_fault*vmf,unsignedlongpfn,boolwrite)¶
insert a pud size pfn
Parameters
structvm_fault*vmfStructure describing the fault
unsignedlongpfnpfn to insert
boolwritewhether it’s a write fault
Description
Insert a pud size pfn. Seevmf_insert_pfn() for additional info.
Return
vm_fault_t value.
- vm_fault_tvmf_insert_folio_pud(structvm_fault*vmf,structfolio*folio,boolwrite)¶
insert a pud size folio mapped by a pud entry
Parameters
structvm_fault*vmfStructure describing the fault
structfolio*foliofolio to insert
boolwritewhether it’s a write fault
Return
vm_fault_t value.
- booltouch_pmd(structvm_area_struct*vma,unsignedlongaddr,pmd_t*pmd,boolwrite)¶
Mark page table pmd entry as accessed and dirty (for write)
Parameters
structvm_area_struct*vmaThe VMA coveringaddr
unsignedlongaddrThe virtual address
pmd_t*pmdpmd pointer into the page table mappingaddr
boolwriteWhether it’s a write access
Return
whether the pmd entry is changed
- int__split_unmapped_folio(structfolio*folio,intnew_order,structpage*split_at,structxa_state*xas,structaddress_space*mapping,enumsplit_typesplit_type)¶
splits an unmappedfolio to lower order folios in two ways: uniform split or non-uniform split.
Parameters
structfolio*foliothe to-be-split folio
intnew_orderthe smallest order of the after split folios (since buddyallocator like split generates folios with orders fromfolio’sorder - 1 to new_order).
structpage*split_atin buddy allocator like split, the folio containingsplit_atwill be split until its order becomesnew_order.
structxa_state*xasxa_state pointing to folio->mapping->i_pages and locked by caller
structaddress_space*mappingfolio->mapping
enumsplit_typesplit_typeif the split is uniform or not (buddy allocator like split)
Description
uniform split: the givenfolio into multiplenew_order small folios,where all small folios have the same order. This is done whensplit_type is SPLIT_TYPE_UNIFORM.
buddy allocator like (non-uniform) split: the givenfolio is split intohalf and one of the half (containing the given page) is split into halfuntil the givenfolio’s order becomesnew_order. This is done whensplit_type is SPLIT_TYPE_NON_UNIFORM.
The high level flow for these two methods are:
uniform split: xas is split with no expectation of failure and a single __split_folio_to_order() is called to split the folio into new_order along with stats update.
non-uniform split: folio_order - new_order calls to __split_folio_to_order() are expected to be made in a for loop to split the folio to one lower order at a time. The folio containing split_at is split in each iteration. xas is split into half in each iteration and can fail. A failed xas split leaves split folios as is without merging them back.
After splitting, the caller’s folio reference will be transferred to thefolio containingsplit_at. The caller needs to unlock and/or freeafter-split folios if necessary.
Return
0 - successful, <0 - failed (if -ENOMEM is returned,folio might besplit but not tonew_order, the caller needs to check)
- intfolio_check_splittable(structfolio*folio,unsignedintnew_order,enumsplit_typesplit_type)¶
check if a folio can be split to a given order
Parameters
structfolio*foliofolio to be split
unsignedintnew_orderthe smallest order of the after split folios (since buddyallocator like split generates folios with orders fromfolio’sorder - 1 to new_order).
enumsplit_typesplit_typeuniform or non-uniform split
Description
folio_check_splittable() checks iffolio can be split tonew_order usingsplit_type method. The truncated folio check must come first.
Context
folio must be locked.
Return
0 -folio can be split tonew_order, otherwise an error number isreturned.
- int__folio_split(structfolio*folio,unsignedintnew_order,structpage*split_at,structpage*lock_at,structlist_head*list,enumsplit_typesplit_type)¶
split a folio atsplit_at to anew_order folio
Parameters
structfolio*foliofolio to split
unsignedintnew_orderthe order of the new folio
structpage*split_ata page within the new folio
structpage*lock_ata page withinfolio to be left locked to caller
structlist_head*listafter-split folios will be put on it if non NULL
enumsplit_typesplit_typeperform uniform split or not (non-uniform split)
Description
It calls__split_unmapped_folio() to perform uniform and non-uniform split.It is in charge of checking whether the split is supported or not andpreparingfolio for__split_unmapped_folio().
After splitting, the after-split folio containing lock_at remains locked and others are unlocked:
1. for uniform split, lock_at points to one of folio's subpages;
2. for buddy allocator like (non-uniform) split, lock_at points to folio.
Return
0 - successful, <0 - failed (if -ENOMEM is returned,folio might besplit but not tonew_order, the caller needs to check)
- intfolio_split_unmapped(structfolio*folio,unsignedintnew_order)¶
split a large anon folio that is already unmapped
Parameters
structfolio*foliofolio to split
unsignedintnew_orderthe order of folios after split
Description
This function is a helper for splitting folios that have already beenunmapped. The use case is that the device or the CPU can refuse to migrateTHP pages in the middle of migration, due to allocation issues on eitherside.
anon_vma_lock is not required to be held,mmap_read_lock() ormmap_write_lock() should be held.folio is expected to be locked by thecaller. device-private and non device-private folios are supported alongwith folios that are in the swapcache.folio should also be unmapped andisolated from LRU (if applicable)
Upon return, the folio is not remapped, split folios are not added to LRU,free_folio_and_swap_cache() is not called, and new folios remain locked.
Return
0 on success, -EAGAIN if the folio cannot be split (e.g., due toinsufficient reference count or extra pins).
- intfolio_split(structfolio*folio,unsignedintnew_order,structpage*split_at,structlist_head*list)¶
split a folio atsplit_at to anew_order folio
Parameters
structfolio*foliofolio to split
unsignedintnew_orderthe order of the new folio
structpage*split_ata page within the new folio
structlist_head*listafter-split folios are added tolist if not null, otherwise to LRUlist
Description
It has the same prerequisites and returns assplit_huge_page_to_list_to_order().
Split a folio atsplit_at to a new_order folio, leave theremaining subpages of the original folio as large as possible. For example,in the case of splitting an order-9 folio at its third order-3 subpages toan order-3 folio, there are 2^(9-3)=64 order-3 subpages in the order-9 folio.After the split, there will be a group of folios with different orders andthe new folio containingsplit_at is marked in bracket:[order-4, {order-3}, order-3, order-5, order-6, order-7, order-8].
After split, folio is left locked for caller.
Return
0 - successful, <0 - failed (if -ENOMEM is returned,folio might besplit but not tonew_order, the caller needs to check)
- unsignedintmin_order_for_split(structfolio*folio)¶
Find the minimum order folio can be split to.
Parameters
structfolio*foliofolio to split
Description
min_order_for_split() tells the minimum orderfolio can be split to.If a file-backed folio is truncated, 0 will be returned. Any subsequentsplit attempt should get -EBUSY from split checking code.
Return
folio’s minimum order for split