Memory Management APIs

User Space Memory Access

access_ok(addr,size)

Checks if a user space pointer is valid

Parameters

addr
User space pointer to start of block to check
size
Size of block to check

Context

User context only. This function may sleep if page faults are enabled.

Description

Checks if a pointer to a block of memory in user space is valid.

Note that, depending on architecture, this function probably just checks that the pointer is in the user space range - after calling this function, memory access functions may still return -EFAULT.

Return

true (nonzero) if the memory block may be valid, false (zero) if it is definitely invalid.

get_user(x,ptr)

Get a simple variable from user space.

Parameters

x
Variable to store result.
ptr
Source address, in user space.

Context

User context only. This function may sleep if page faults are enabled.

Description

This macro copies a single simple variable from user space to kernel space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and the result of dereferencing ptr must be assignable to x without a cast.

Return

zero on success, or -EFAULT on error. On error, the variable x is set to zero.
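As an illustrative sketch of typical usage (the helper name my_drv_set_param and the validation step are hypothetical, not from the source), a get_user() call site in a driver might look like this:

```c
/* Hypothetical driver helper: fetch one int from user space.
 * get_user() returns 0 on success and -EFAULT on a bad pointer. */
#include <linux/errno.h>
#include <linux/uaccess.h>

static int my_drv_set_param(int __user *uptr)
{
	int val;

	if (get_user(val, uptr))
		return -EFAULT;

	if (val < 0)		/* always validate user input before use */
		return -EINVAL;

	/* ... apply val ... */
	return 0;
}
```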

put_user(x,ptr)

Write a simple value into user space.

Parameters

x
Value to copy to user space.
ptr
Destination address, in user space.

Context

User context only. This function may sleep if page faults are enabled.

Description

This macro copies a single simple value from kernel space to user space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and x must be assignable to the result of dereferencing ptr.

Return

zero on success, or -EFAULT on error.
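The mirror-image pattern for put_user() might be sketched as follows (my_counter and my_drv_get_count are hypothetical names for illustration):

```c
/* Hypothetical driver helper: report a kernel value to user space. */
#include <linux/errno.h>
#include <linux/uaccess.h>

static long my_counter;

static int my_drv_get_count(long __user *uptr)
{
	/* put_user() returns 0 on success, -EFAULT on a bad pointer */
	return put_user(my_counter, uptr) ? -EFAULT : 0;
}
```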

__get_user(x,ptr)

Get a simple variable from user space, with less checking.

Parameters

x
Variable to store result.
ptr
Source address, in user space.

Context

User context only. This function may sleep if page faults are enabled.

Description

This macro copies a single simple variable from user space to kernel space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and the result of dereferencing ptr must be assignable to x without a cast.

Caller must check the pointer with access_ok() before calling this function.

Return

zero on success, or -EFAULT on error. On error, the variable x is set to zero.
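The point of the __-prefixed variants is to amortize one access_ok() range check over several accesses. A hedged sketch (read_pair is a hypothetical helper):

```c
/* One access_ok() check covering both elements, then __get_user()
 * per element. The calls may still fault, so results are checked. */
#include <linux/errno.h>
#include <linux/uaccess.h>

static int read_pair(const int __user *uarr, int *a, int *b)
{
	if (!access_ok(uarr, 2 * sizeof(*uarr)))
		return -EFAULT;

	if (__get_user(*a, &uarr[0]))
		return -EFAULT;
	if (__get_user(*b, &uarr[1]))
		return -EFAULT;
	return 0;
}
```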

__put_user(x,ptr)

Write a simple value into user space, with less checking.

Parameters

x
Value to copy to user space.
ptr
Destination address, in user space.

Context

User context only. This function may sleep if page faults are enabled.

Description

This macro copies a single simple value from kernel space to user space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and x must be assignable to the result of dereferencing ptr.

Caller must check the pointer with access_ok() before calling this function.

Return

zero on success, or -EFAULT on error.

unsigned long clear_user(void __user *to, unsigned long n)

Zero a block of memory in user space.

Parameters

void __user *to
Destination address, in user space.
unsigned long n
Number of bytes to zero.

Description

Zero a block of memory in user space.

Return

number of bytes that could not be cleared. On success, this will be zero.
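A common use is zero-filling the tail of a user buffer after a short read; the sketch below is illustrative (zero_tail is a hypothetical helper, and the return-value convention follows the description above):

```c
/* Zero the uncopied tail of a user buffer. clear_user() returns
 * the number of bytes that could NOT be cleared (0 on success). */
#include <linux/errno.h>
#include <linux/uaccess.h>

static int zero_tail(void __user *buf, size_t len, size_t copied)
{
	if (copied < len && clear_user(buf + copied, len - copied))
		return -EFAULT;
	return 0;
}
```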

unsigned long __clear_user(void __user *to, unsigned long n)

Zero a block of memory in user space, with less checking.

Parameters

void __user *to
Destination address, in user space.
unsigned long n
Number of bytes to zero.

Description

Zero a block of memory in user space. Caller must check the specified block with access_ok() before calling this function.

Return

number of bytes that could not be cleared. On success, this will be zero.

int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages)

pin user pages in memory

Parameters

unsigned long start
starting user address
int nr_pages
number of pages from start to pin
unsigned int gup_flags
flags modifying pin behaviour
struct page **pages
array that receives pointers to the pages pinned. Should be at least nr_pages long.

Description

Attempt to pin user pages in memory without taking mm->mmap_lock. If not successful, it will fall back to taking the lock and calling get_user_pages().

Returns number of pages pinned. This may be fewer than the number requested. If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns -errno.
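Because the function can pin fewer pages than requested, callers have to cope with partial success. A hedged sketch of the pin/cleanup pattern (pin_user_buf is a hypothetical helper; FOLL_WRITE requests write access):

```c
/* Pin nr_pages of a user buffer; on partial success, drop the
 * pages already pinned and report failure. */
#include <linux/mm.h>
#include <linux/slab.h>

static int pin_user_buf(unsigned long uaddr, int nr_pages,
			struct page ***pagesp)
{
	struct page **pages;
	int pinned;

	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	pinned = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
	if (pinned < nr_pages) {
		while (pinned > 0)		/* release partial pins */
			put_page(pages[--pinned]);
		kfree(pages);
		return pinned < 0 ? pinned : -EFAULT;
	}
	*pagesp = pages;
	return 0;
}
```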

Memory Allocation Controls

Functions which need to allocate memory often use GFP flags to express how that memory should be allocated. The GFP acronym stands for “get free pages”, the underlying memory allocation function. Not every GFP flag is allowed to every function which may allocate memory. Most users will want to use a plain GFP_KERNEL.

Page mobility and placement hints

These flags provide hints about how mobile the page is. Pages with similar mobility are placed within the same pageblocks to minimise problems due to external fragmentation.

__GFP_MOVABLE (also a zone modifier) indicates that the page can be moved by page migration during memory compaction or can be reclaimed.

__GFP_RECLAIMABLE is used for slab allocations that specify SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.

__GFP_WRITE indicates the caller intends to dirty the page. Where possible, these pages will be spread between local zones to avoid all the dirty pages being in one zone (fair zone allocation policy).

__GFP_HARDWALL enforces the cpuset memory allocation policy.

__GFP_THISNODE forces the allocation to be satisfied from the requested node with no fallbacks or placement policy enforcements.

__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.

Watermark modifiers – controls access to emergency reserves

__GFP_HIGH indicates that the caller is high-priority and that granting the request is necessary before the system can make forward progress. For example, creating an IO context to clean pages.

__GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is high priority. Users are typically interrupt handlers. This may be used in conjunction with __GFP_HIGH

__GFP_MEMALLOC allows access to all memory. This should only be used when the caller guarantees the allocation will allow more memory to be freed very shortly e.g. process exiting or swapping. Users either should be the MM or co-ordinating closely with the VM (e.g. swap over NFS). Users of this flag have to be extremely careful to not deplete the reserve completely and implement a throttling mechanism which controls the consumption of the reserve based on the amount of freed memory. Usage of a pre-allocated pool (e.g. mempool) should be always considered before using this flag.

__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves. This takes precedence over the __GFP_MEMALLOC flag if both are set.

Reclaim modifiers

Please note that all the following flags are only applicable to sleepable allocations (e.g. GFP_NOWAIT and GFP_ATOMIC will ignore them).

__GFP_IO can start physical IO.

__GFP_FS can call down to the low-level FS. Clearing the flag avoids the allocator recursing into the filesystem which might already be holding locks.

__GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim. This flag can be cleared to avoid unnecessary delays when a fallback option is available.

__GFP_KSWAPD_RECLAIM indicates that the caller wants to wake kswapd when the low watermark is reached and have it reclaim pages until the high watermark is reached. A caller may wish to clear this flag when fallback options are available and the reclaim is likely to disrupt the system. The canonical example is THP allocation where a fallback is cheap but reclaim/compaction may cause indirect stalls.

__GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.

The default allocator behavior depends on the request size. We have a concept of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER). !costly allocations are too essential to fail so they are implicitly non-failing by default (with some exceptions like OOM victims might fail so the caller still has to check for failures) while costly requests try to be not disruptive and back off even without invoking the OOM killer. The following three modifiers might be used to override some of these implicit rules

__GFP_NORETRY: The VM implementation will try only very lightweight memory direct reclaim to get some memory under memory pressure (thus it can sleep). It will avoid disruptive actions like OOM killer. The caller must handle the failure which is quite likely to happen under heavy memory pressure. The flag is suitable when failure can easily be handled at small cost, such as reduced throughput

__GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim procedures that have previously failed if there is some indication that progress has been made elsewhere. It can wait for other tasks to attempt high level approaches to freeing memory such as compaction (which removes fragmentation) and page-out. There is still a definite limit to the number of retries, but it is a larger limit than with __GFP_NORETRY. Allocations with this flag may fail, but only when there is genuinely little unused memory. While these allocations do not directly trigger the OOM killer, their failure indicates that the system is likely to need to use the OOM killer soon. The caller must handle failure, but can reasonably do so by failing a higher-level request, or completing it only in a much less efficient manner. If the allocation does fail, and the caller is in a position to free some non-essential memory, doing so could benefit the system as a whole.

__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller cannot handle allocation failures. The allocation could block indefinitely but will never return with failure. Testing for failure is pointless. New users should be evaluated carefully (and the flag should be used only when there is no reasonable failure policy) but it is definitely preferable to use the flag rather than opencode an endless loop around the allocator. Using this flag for costly allocations is _highly_ discouraged.

Useful GFP flag combinations

Useful GFP flag combinations that are commonly used. It is recommended that subsystems start with one of these combinations and then set/clear __GFP_FOO flags as necessary.

GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower watermark is applied to allow access to “atomic reserves”

GFP_KERNEL is typical for kernel-internal allocations. The caller requires ZONE_NORMAL or a lower zone for direct access but can direct reclaim.

GFP_KERNEL_ACCOUNT is the same as GFP_KERNEL, except the allocation isaccounted to kmemcg.

GFP_NOWAIT is for kernel allocations that should not stall for direct reclaim, start physical IO or use any filesystem callback.

GFP_NOIO will use direct reclaim to discard clean pages or slab pages that do not require the starting of any physical IO. Please try to avoid using this flag directly and instead use memalloc_noio_{save,restore} to mark the whole scope which cannot perform any IO with a short explanation why. All allocation requests will inherit GFP_NOIO implicitly.

GFP_NOFS will use direct reclaim but will not use any filesystem interfaces. Please try to avoid using this flag directly and instead use memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn’t recurse into the FS layer with a short explanation why. All allocation requests will inherit GFP_NOFS implicitly.

GFP_USER is for userspace allocations that also need to be directly accessible by the kernel or hardware. It is typically used by hardware for buffers that are mapped to userspace (e.g. graphics) that hardware still must DMA to. cpuset limits are enforced for these allocations.

GFP_DMA exists for historical reasons and should be avoided where possible. The flag indicates that the caller requires that the lowest zone be used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but it would require careful auditing as some users really require it and others use the flag to avoid lowmem reserves in ZONE_DMA and treat the lowest zone as a type of emergency reserve.

GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit address.

GFP_HIGHUSER is for userspace allocations that may be mapped to userspace, do not need to be directly accessible by the kernel but that cannot move once in use. An example may be a hardware allocation that maps data directly into userspace but has no addressing limitations.

GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not need direct access to but can use kmap() when access is required. They are expected to be movable via page reclaim or page migration. Typically, pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.

GFP_TRANSHUGE and GFP_TRANSHUGE_LIGHT are used for THP allocations. They are compound allocations that will generally fail quickly if memory is not available and will not wake kswapd/kcompactd on failure. The _LIGHT version does not attempt reclaim/compaction at all and is by default used in the page fault path, while the non-light version is used by khugepaged.

The Slab Cache

void *kmalloc(size_t size, gfp_t flags)

allocate memory

Parameters

size_t size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate.

Description

kmalloc is the normal method of allocating memory for objects smaller than page size in the kernel.

The allocated object address is aligned to at least ARCH_KMALLOC_MINALIGN bytes. For size of power of two bytes, the alignment is also guaranteed to be at least the size.

The flags argument may be one of the GFP flags defined at include/linux/gfp.h and described at Documentation/core-api/mm-api.rst

The recommended usage of the flags is described at Documentation/core-api/memory-allocation.rst

Below is a brief outline of the most useful GFP flags

GFP_KERNEL
Allocate normal kernel ram. May sleep.
GFP_NOWAIT
Allocation will not sleep.
GFP_ATOMIC
Allocation will not sleep. May use emergency pools.
GFP_HIGHUSER
Allocate memory from high memory on behalf of user.

Also it is possible to set different flags by OR’ing in one or more of the following additional flags:

__GFP_HIGH
This allocation has high priority and may use emergency pools.
__GFP_NOFAIL
Indicate that this allocation is in no way allowed to fail (think twice before using).
__GFP_NORETRY
If memory is not immediately available, then give up at once.
__GFP_NOWARN
If allocation fails, don’t issue any warnings.
__GFP_RETRY_MAYFAIL
Try really hard to succeed the allocation but fail eventually.
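A minimal sketch of the common allocate/check/free pattern with GFP_KERNEL (struct my_obj and my_obj_new are placeholder names, not from the source):

```c
/* Allocate a small object from the slab allocator and check for
 * failure; the object is later released with kfree(). */
#include <linux/slab.h>

struct my_obj {
	int id;
	char name[16];
};

static struct my_obj *my_obj_new(void)
{
	struct my_obj *obj = kmalloc(sizeof(*obj), GFP_KERNEL);

	if (!obj)			/* kmalloc returns NULL on failure */
		return NULL;
	obj->id = 0;
	obj->name[0] = '\0';
	return obj;
}
```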
void *kmalloc_array(size_t n, size_t size, gfp_t flags)

allocate memory for an array.

Parameters

size_t n
number of elements.
size_t size
element size.
gfp_t flags
the type of memory to allocate (see kmalloc).
void *kcalloc(size_t n, size_t size, gfp_t flags)

allocate memory for an array. The memory is set to zero.

Parameters

size_t n
number of elements.
size_t size
element size.
gfp_t flags
the type of memory to allocate (see kmalloc).
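The value of the array variants is that they fail cleanly if n * size would overflow, unlike a hand-written kmalloc(n * size, ...). An illustrative sketch (alloc_table is a hypothetical helper):

```c
/* Zeroed array allocation with built-in multiplication overflow
 * checking; returns NULL on overflow or allocation failure. */
#include <linux/slab.h>
#include <linux/types.h>

static u32 *alloc_table(size_t nents)
{
	return kcalloc(nents, sizeof(u32), GFP_KERNEL);
}
```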
void *kzalloc(size_t size, gfp_t flags)

allocate memory. The memory is set to zero.

Parameters

size_t size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate (see kmalloc).
void *kzalloc_node(size_t size, gfp_t flags, int node)

allocate zeroed memory from a particular memory node.

Parameters

size_t size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate (see kmalloc).
int node
memory node from which to allocate
void *kmem_cache_alloc(struct kmem_cache * cachep, gfp_t flags)

Allocate an object

Parameters

struct kmem_cache *cachep
The cache to allocate from.
gfp_t flags
See kmalloc().

Description

Allocate an object from this cache. The flags are only relevant if the cache has no available objects.

Return

pointer to the new object or NULL in case of error

void *kmem_cache_alloc_node(struct kmem_cache * cachep, gfp_t flags, int nodeid)

Allocate an object on the specified node

Parameters

struct kmem_cache *cachep
The cache to allocate from.
gfp_t flags
See kmalloc().
int nodeid
node number of the target node.

Description

Identical to kmem_cache_alloc but it will allocate memory on the given node, which can improve the performance for cpu bound structures.

Fallback to other node is possible if __GFP_THISNODE is not set.

Return

pointer to the new object or NULL in case of error

void kmem_cache_free(struct kmem_cache *cachep, void *objp)

Deallocate an object

Parameters

struct kmem_cache *cachep
The cache the allocation was from.
void *objp
The previously allocated object.

Description

Free an object which was previously allocated from this cache.

void kfree(const void *objp)

free previously allocated memory

Parameters

const void *objp
pointer returned by kmalloc.

Description

If objp is NULL, no operation is performed.

Don’t free memory not originally allocated by kmalloc() or you will run into trouble.

size_t __ksize(const void *objp)

Uninstrumented ksize.

Parameters

const void *objp
pointer to the object

Description

Unlike ksize(), __ksize() is uninstrumented, and does not provide the same safety checks as ksize() with KASAN instrumentation enabled.

Return

size of the actual memory used by objp in bytes

struct kmem_cache *kmem_cache_create_usercopy(const char * name, unsigned int size, unsigned int align, slab_flags_t flags, unsigned int useroffset, unsigned int usersize, void (*ctor)(void *))

Create a cache with a region suitable for copying to userspace

Parameters

const char *name
A string which is used in /proc/slabinfo to identify this cache.
unsigned int size
The size of objects to be created in this cache.
unsigned int align
The required alignment for the objects.
slab_flags_t flags
SLAB flags
unsigned int useroffset
Usercopy region offset
unsigned int usersize
Usercopy region size
void (*ctor)(void *)
A constructor for the objects.

Description

Cannot be called within an interrupt, but can be interrupted. The ctor is run when new pages are allocated by the cache.

The flags are

SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5) to catch references to uninitialised memory.

SLAB_RED_ZONE - Insert ‘Red’ zones around the allocated memory to check for buffer overruns.

SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware cacheline. This can be beneficial if you’re counting cycles as closely as davem.

Return

a pointer to the cache on success, NULL on failure.

struct kmem_cache *kmem_cache_create(const char * name, unsigned int size, unsigned int align, slab_flags_t flags, void (*ctor)(void *))

Create a cache.

Parameters

const char *name
A string which is used in /proc/slabinfo to identify this cache.
unsigned int size
The size of objects to be created in this cache.
unsigned int align
The required alignment for the objects.
slab_flags_t flags
SLAB flags
void (*ctor)(void *)
A constructor for the objects.

Description

Cannot be called within an interrupt, but can be interrupted. The ctor is run when new pages are allocated by the cache.

The flags are

SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5) to catch references to uninitialised memory.

SLAB_RED_ZONE - Insert ‘Red’ zones around the allocated memory to check for buffer overruns.

SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware cacheline. This can be beneficial if you’re counting cycles as closely as davem.

Return

a pointer to the cache on success, NULL on failure.
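A hedged sketch of the usual cache lifecycle: create the cache once, allocate and free fixed-size objects from it, and destroy it on teardown. All names (my_item, my_init, etc.) are hypothetical:

```c
/* Lifecycle of a slab cache for fixed-size objects. */
#include <linux/errno.h>
#include <linux/slab.h>

struct my_item {
	int key;
	int val;
};

static struct kmem_cache *my_item_cache;

static int my_init(void)
{
	my_item_cache = kmem_cache_create("my_item",
					  sizeof(struct my_item),
					  0, SLAB_HWCACHE_ALIGN, NULL);
	return my_item_cache ? 0 : -ENOMEM;
}

static struct my_item *my_item_get(void)
{
	return kmem_cache_alloc(my_item_cache, GFP_KERNEL);
}

static void my_item_put(struct my_item *it)
{
	kmem_cache_free(my_item_cache, it);
}

static void my_exit(void)
{
	kmem_cache_destroy(my_item_cache);
}
```

A dedicated cache mainly pays off when many objects of one size are allocated and freed frequently; for occasional allocations, plain kmalloc() is simpler.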

int kmem_cache_shrink(struct kmem_cache *cachep)

Shrink a cache.

Parameters

struct kmem_cache *cachep
The cache to shrink.

Description

Releases as many slabs as possible for a cache. To help debugging, a zero exit status indicates all slabs were released.

Return

0 if all slabs were released, non-zero otherwise

void *krealloc(const void * p, size_t new_size, gfp_t flags)

reallocate memory. The contents will remain unchanged.

Parameters

const void *p
object to reallocate memory for.
size_t new_size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate.

Description

The contents of the object pointed to are preserved up to the lesser of the new and old sizes. If p is NULL, krealloc() behaves exactly like kmalloc(). If new_size is 0 and p is not a NULL pointer, the object pointed to is freed.

Return

pointer to the allocated memory or NULL in case of error
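As with realloc() in user space, the old pointer must be kept until krealloc() succeeds, or the buffer leaks on failure. An illustrative sketch (grow_buf is a hypothetical helper):

```c
/* Grow a kmalloc'ed buffer; on failure the original buffer is
 * left valid and unchanged. */
#include <linux/errno.h>
#include <linux/slab.h>

static int grow_buf(char **bufp, size_t new_size)
{
	char *tmp = krealloc(*bufp, new_size, GFP_KERNEL);

	if (!tmp)
		return -ENOMEM;	/* *bufp still owns the old memory */
	*bufp = tmp;
	return 0;
}
```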

void kzfree(const void *p)

like kfree but zero memory

Parameters

const void *p
object to free memory of

Description

The memory of the object p points to is zeroed before being freed. If p is NULL, kzfree() does nothing.

Note

this function zeroes the whole allocated buffer which can be a good deal bigger than the requested buffer size passed to kmalloc(). So be careful when using this function in performance sensitive code.

size_t ksize(const void *objp)

get the actual amount of memory allocated for a given object

Parameters

const void *objp
Pointer to the object

Description

kmalloc may internally round up allocations and return more memory than requested. ksize() can be used to determine the actual amount of memory allocated. The caller may use this additional memory, even though a smaller amount of memory was initially specified with the kmalloc call. The caller must guarantee that objp points to a valid object previously allocated with either kmalloc() or kmem_cache_alloc(). The object must not be freed during the duration of the call.

Return

size of the actual memory used by objp in bytes

void kfree_const(const void *x)

conditionally free memory

Parameters

const void *x
pointer to the memory

Description

Function calls kfree only if x is not in .rodata section.

void *kvmalloc_node(size_t size, gfp_t flags, int node)

attempt to allocate physically contiguous memory, but upon failure, fall back to non-contiguous (vmalloc) allocation.

Parameters

size_t size
size of the request.
gfp_t flags
gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
int node
numa node to allocate from

Description

Uses kmalloc to get the memory but if the allocation fails then falls back to the vmalloc allocator. Use kvfree for freeing the memory.

Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.

Please note that any use of gfp flags outside of GFP_KERNEL is careful to not fall back to vmalloc.

Return

pointer to the allocated memory or NULL in case of failure
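This family suits possibly-large allocations that do not need physical contiguity. A sketch using the kvmalloc() convenience wrapper (which, as an assumption here, allocates on the local node like kvmalloc_node() with NUMA_NO_NODE); the helper names are illustrative:

```c
/* Allocate a possibly-large table: try kmalloc first, fall back
 * to vmalloc. kvfree() handles either kind of memory. */
#include <linux/mm.h>
#include <linux/slab.h>

static void *alloc_big_table(size_t size)
{
	return kvmalloc(size, GFP_KERNEL);
}

static void free_big_table(void *table)
{
	kvfree(table);
}
```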

void kvfree(const void *addr)

Free memory.

Parameters

const void *addr
Pointer to allocated memory.

Description

kvfree frees memory allocated by any of vmalloc(), kmalloc() or kvmalloc(). It is slightly more efficient to use kfree() or vfree() if you are certain that you know which one to use.

Context

Either preemptible task context or not-NMI interrupt.

Virtually Contiguous Mappings

void vm_unmap_aliases(void)

unmap outstanding lazy aliases in the vmap layer

Parameters

void
no arguments

Description

The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily to amortize TLB flushing overheads. What this means is that any page you have now, may, in a former life, have been mapped into kernel virtual address by the vmap layer and so there might be some CPUs with TLB entries still referencing that page (additional to the regular 1:1 kernel mapping).

vm_unmap_aliases flushes all such lazy mappings. After it returns, we can be sure that none of the pages we have control over will have any aliases from the vmap layer.

void vm_unmap_ram(const void *mem, unsigned int count)

unmap linear kernel address space set up by vm_map_ram

Parameters

const void *mem
the pointer returned by vm_map_ram
unsigned int count
the count passed to that vm_map_ram call (cannot unmap partial)
void *vm_map_ram(struct page **pages, unsigned int count, int node)

map pages linearly into kernel virtual address (vmalloc space)

Parameters

struct page **pages
an array of pointers to the pages to be mapped
unsigned int count
number of pages
int node
prefer to allocate data structures on this node

Description

If you use this function for less than VMAP_MAX_ALLOC pages, it could be faster than vmap so it’s good. But if you mix long-life and short-life objects with vm_map_ram(), it could consume lots of address space through fragmentation (especially on a 32bit machine). You could see failures in the end. Please use this function for short-lived objects.

Return

a pointer to the address that has been mapped, or NULL on failure

void vfree(const void *addr)

release memory allocated by vmalloc()

Parameters

const void *addr
memory base address

Description

Free the virtually continuous memory area starting at addr, as obtained from vmalloc(), vmalloc_32() or __vmalloc(). If addr is NULL, no operation is performed.

Must not be called in NMI context (strictly speaking, only if we don’t have CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG, but making the calling conventions for vfree() arch-dependent would be a really bad idea)

May sleep if called not from interrupt context.

NOTE

assumes that the object at addr has a size >= sizeof(llist_node)

void vunmap(const void *addr)

release virtual mapping obtained by vmap()

Parameters

const void *addr
memory base address

Description

Free the virtually contiguous memory area starting at addr, which was created from the page array passed to vmap().

Must not be called in interrupt context.

void *vmap(struct page ** pages, unsigned int count, unsigned long flags, pgprot_t prot)

map an array of pages into virtually contiguous space

Parameters

struct page **pages
array of page pointers
unsigned int count
number of pages to map
unsigned long flags
vm_area->flags
pgprot_t prot
page protection for the mapping

Description

Maps count pages from pages into contiguous kernel virtual space.

Return

the address of the area or NULL on failure

void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask, int node, const void * caller)

allocate virtually contiguous memory

Parameters

unsigned long size
allocation size
unsigned long align
desired alignment
gfp_t gfp_mask
flags for the page level allocator
int node
node to use for allocation or NUMA_NO_NODE
const void *caller
caller’s return address

Description

Allocate enough pages to cover size from the page level allocator with gfp_mask flags. Map them into contiguous kernel virtual space.

Reclaim modifiers in gfp_mask - __GFP_NORETRY, __GFP_RETRY_MAYFAIL and __GFP_NOFAIL are not supported

Any use of gfp flags outside of GFP_KERNEL should be consulted with mm people.

Return

pointer to the allocated memory or NULL on error

void *vmalloc(unsigned long size)

allocate virtually contiguous memory

Parameters

unsigned long size
allocation size

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space.

For tight control over page level allocator and protection flags use __vmalloc() instead.

Return

pointer to the allocated memory or NULL on error
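A brief sketch of the vmalloc()/vfree() pair for a large buffer where virtual contiguity suffices (the firmware-staging scenario and the name stage_fw are hypothetical):

```c
/* Large, virtually contiguous buffer; no physical contiguity
 * needed. Released later with vfree(). */
#include <linux/vmalloc.h>

static void *stage_fw(size_t len)
{
	void *buf = vmalloc(len);

	if (!buf)
		return NULL;
	/* ... fill buf with the staged image ... */
	return buf;
}
```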

void *vzalloc(unsigned long size)

allocate virtually contiguous memory with zero fill

Parameters

unsigned long size
allocation size

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. The memory allocated is set to zero.

For tight control over page level allocator and protection flags use __vmalloc() instead.

Return

pointer to the allocated memory or NULL on error

void *vmalloc_user(unsigned long size)

allocate zeroed virtually contiguous memory for userspace

Parameters

unsigned long size
allocation size

Description

The resulting memory area is zeroed so it can be mapped to userspace without leaking data.

Return

pointer to the allocated memory or NULL on error

void *vmalloc_node(unsigned long size, int node)

allocate memory on a specific node

Parameters

unsigned long size
allocation size
int node
numa node

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space.

For tight control over page level allocator and protection flags use __vmalloc() instead.

Return

pointer to the allocated memory or NULL on error

void *vzalloc_node(unsigned long size, int node)

allocate memory on a specific node with zero fill

Parameters

unsigned long size
allocation size
int node
numa node

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. The memory allocated is set to zero.

Return

pointer to the allocated memory or NULL on error

void *vmalloc_32(unsigned long size)

allocate virtually contiguous memory (32bit addressable)

Parameters

unsigned long size
allocation size

Description

Allocate enough 32bit PA addressable pages to cover size from the page level allocator and map them into contiguous kernel virtual space.

Return

pointer to the allocated memory or NULL on error

void *vmalloc_32_user(unsigned long size)

allocate zeroed virtually contiguous 32bit memory

Parameters

unsigned long size
allocation size

Description

The resulting memory area is 32bit addressable and zeroed so it can be mapped to userspace without leaking data.

Return

pointer to the allocated memory or NULL on error

int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr, void *kaddr, unsigned long pgoff, unsigned long size)

map vmalloc pages to userspace

Parameters

struct vm_area_struct *vma
vma to cover
unsigned long uaddr
target user address to start at
void *kaddr
virtual address of vmalloc kernel memory
unsigned long pgoff
offset from kaddr to start at
unsigned long size
size of map area

Return

0 for success, -Exxx on failure

Description

This function checks that kaddr is a valid vmalloc’ed area, and that it is big enough to cover the range starting at uaddr in vma. Will return failure if that criteria isn’t met.

Similar to remap_pfn_range() (see mm/memory.c)

int remap_vmalloc_range(struct vm_area_struct *vma, void *addr, unsigned long pgoff)

map vmalloc pages to userspace

Parameters

struct vm_area_struct *vma
vma to cover (map full range of vma)
void *addr
vmalloc memory
unsigned long pgoff
number of pages into addr before first page to map

Return

0 for success, -Exxx on failure

Description

This function checks that addr is a valid vmalloc’ed area, and that it is big enough to cover the vma. Will return failure if that criteria isn’t met.

Similar to remap_pfn_range() (see mm/memory.c)
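A hedged sketch of the typical call site, a driver .mmap handler exposing a vmalloc()-backed buffer to user space (my_buf and my_mmap are hypothetical, and my_buf is assumed to be large enough to cover the vma):

```c
/* Hypothetical .mmap handler for a vmalloc'ed buffer. */
#include <linux/mm.h>
#include <linux/vmalloc.h>

static void *my_buf;	/* allocated elsewhere with vmalloc() */

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* vm_pgoff selects which pages of my_buf to map */
	return remap_vmalloc_range(vma, my_buf, vma->vm_pgoff);
}
```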

struct vm_struct *alloc_vm_area(size_t size, pte_t ** ptes)

allocate a range of kernel address space

Parameters

size_t size
size of the area
pte_t **ptes
returns the PTEs for the address space

Return

NULL on failure, vm_struct on success

Description

This function reserves a range of kernel address space, and allocates pagetables to map that range. No actual mappings are created.

If ptes is non-NULL, pointers to the PTEs (in init_mm) allocated for the VM area are returned.

File Mapping and Page Cache

int read_cache_pages(struct address_space *mapping, struct list_head *pages, int (*filler)(void *, struct page *), void *data)

populate an address space with some pages & start reads against them

Parameters

struct address_space *mapping
the address_space
struct list_head *pages
The address of a list_head which contains the target pages. These pages have their ->index populated and are otherwise uninitialised.
int (*filler)(void *, struct page *)
callback routine for filling a single page.
void *data
private data for the callback routine.

Description

Hides the details of the LRU cache etc from the filesystems.

Return

0 on success, error return by filler otherwise

voidpage_cache_readahead_unbounded(structaddress_space * mapping, struct file * file, pgoff_t index, unsigned long nr_to_read, unsigned long lookahead_size)

Start unchecked readahead.

Parameters

struct address_space *mapping
File address space.
struct file *file
This instance of the open file; used for authentication.
pgoff_t index
First page index to read.
unsigned long nr_to_read
The number of pages to read.
unsigned long lookahead_size
Where to start the next readahead.

Description

This function is for filesystems to call when they want to start readahead beyond a file's stated i_size. This is almost certainly not the function you want to call. Use page_cache_async_readahead() or page_cache_sync_readahead() instead.

Context

File is referenced by caller. Mutexes may be held by caller. May sleep, but will not reenter filesystem to reclaim memory.

void page_cache_sync_readahead(struct address_space *mapping, struct file_ra_state *ra, struct file *filp, pgoff_t index, unsigned long req_count)

generic file readahead

Parameters

struct address_space *mapping
address_space which holds the pagecache and I/O vectors
struct file_ra_state *ra
file_ra_state which holds the readahead state
struct file *filp
passed on to ->readpage() and ->readpages()
pgoff_t index
Index of first page to be read.
unsigned long req_count
Total number of pages being read by the caller.

Description

page_cache_sync_readahead() should be called when a cache miss happened: it will submit the read. The readahead logic may decide to piggyback more pages onto the read request if access patterns suggest it will improve performance.

void page_cache_async_readahead(struct address_space *mapping, struct file_ra_state *ra, struct file *filp, struct page *page, pgoff_t index, unsigned long req_count)

file readahead for marked pages

Parameters

struct address_space *mapping
address_space which holds the pagecache and I/O vectors
struct file_ra_state *ra
file_ra_state which holds the readahead state
struct file *filp
passed on to ->readpage() and ->readpages()
struct page *page
The page at index which triggered the readahead call.
pgoff_t index
Index of first page to be read.
unsigned long req_count
Total number of pages being read by the caller.

Description

page_cache_async_readahead() should be called when a page is used which is marked as PageReadahead; this is a marker to suggest that the application has used up enough of the readahead window that we should start pulling in more pages.

void delete_from_page_cache(struct page *page)

delete page from page cache

Parameters

struct page *page
the page which the kernel is trying to remove from page cache

Description

This must be called only on pages that have been verified to be in the page cache and locked. It will never put the page into the free list, the caller has a reference on the page.

int filemap_flush(struct address_space *mapping)

mostly a non-blocking flush

Parameters

struct address_space *mapping
target address_space

Description

This is a mostly non-blocking flush. Not suitable for data-integrity purposes - I/O may not be started against all dirty pages.

Return

0 on success, negative error code otherwise.

bool filemap_range_has_page(struct address_space *mapping, loff_t start_byte, loff_t end_byte)

check if a page exists in range.

Parameters

struct address_space *mapping
address space within which to check
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)

Description

Find at least one page in the range supplied, usually used to check if direct writing in this range will trigger a writeback.

Return

true if at least one page exists in the specified range, false otherwise.

int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte)

wait for writeback to complete

Parameters

struct address_space *mapping
address space structure to wait for
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)

Description

Walk the list of under-writeback pages of the given address space in the given range and wait for all of them. Check error status of the address space and return it.

Since the error status of the address space is cleared by this function, callers are responsible for checking the return value and handling and/or reporting the error.

Return

error status of the address space.

int filemap_fdatawait_range_keep_errors(struct address_space *mapping, loff_t start_byte, loff_t end_byte)

wait for writeback to complete

Parameters

struct address_space *mapping
address space structure to wait for
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)

Description

Walk the list of under-writeback pages of the given address space in the given range and wait for all of them. Unlike filemap_fdatawait_range(), this function does not clear error status of the address space.

Use this function if callers don't handle errors themselves. Expected call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), fsfreeze(8)

int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)

wait for writeback to complete

Parameters

struct file *file
file pointing to address space structure to wait for
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)

Description

Walk the list of under-writeback pages of the address space that file refers to, in the given range and wait for all of them. Check error status of the address space vs. the file->f_wb_err cursor and return it.

Since the error status of the file is advanced by this function, callers are responsible for checking the return value and handling and/or reporting the error.

Return

error status of the address space vs. the file->f_wb_err cursor.

int filemap_fdatawait_keep_errors(struct address_space *mapping)

wait for writeback without clearing errors

Parameters

struct address_space *mapping
address space structure to wait for

Description

Walk the list of under-writeback pages of the given address space and wait for all of them. Unlike filemap_fdatawait(), this function does not clear error status of the address space.

Use this function if callers don't handle errors themselves. Expected call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), fsfreeze(8)

Return

error status of the address space.

int filemap_write_and_wait_range(struct address_space *mapping, loff_t lstart, loff_t lend)

write out & wait on a file range

Parameters

struct address_space *mapping
the address_space for the pages
loff_t lstart
offset in bytes where the range starts
loff_t lend
offset in bytes where the range ends (inclusive)

Description

Write out and wait upon file offsets lstart->lend, inclusive.

Note that lend is inclusive (describes the last byte to be written) so that this function can be used to write to the very end-of-file (end = -1).

Return

error status of the address space.

int file_check_and_advance_wb_err(struct file *file)

report wb error (if any) that was previously reported and advance wb_err to the current one

Parameters

struct file *file
struct file on which the error is being reported

Description

When userland calls fsync (or something like nfsd does the equivalent), we want to report any writeback errors that occurred since the last fsync (or since the file was opened if there haven't been any).

Grab the wb_err from the mapping. If it matches what we have in the file, then just quickly return 0. The file is all caught up.

If it doesn't match, then take the mapping value, set the "seen" flag in it and try to swap it into place. If it works, or another task beat us to it with the new value, then update the f_wb_err and return the error portion. The error at this point must be reported via proper channels (a'la fsync, or NFS COMMIT operation, etc.).

While we handle mapping->wb_err with atomic operations, the f_wb_err value is protected by the f_lock since we must ensure that it reflects the latest value swapped in for this file descriptor.

Return

0 on success, negative error code otherwise.

int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)

write out & wait on a file range

Parameters

struct file *file
file pointing to address_space with pages
loff_t lstart
offset in bytes where the range starts
loff_t lend
offset in bytes where the range ends (inclusive)

Description

Write out and wait upon file offsets lstart->lend, inclusive.

Note that lend is inclusive (describes the last byte to be written) so that this function can be used to write to the very end-of-file (end = -1).

After writing out and waiting on the data, we check and advance the f_wb_err cursor to the latest value, and return any errors detected there.

Return

0 on success, negative error code otherwise.
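A hedged sketch of the typical call site: a filesystem's ->fsync() that writes back and waits on the requested range via file_write_and_wait_range() before flushing its own metadata. The filesystem name and the metadata step are illustrative.

```c
static int myfs_fsync(struct file *file, loff_t start, loff_t end,
		      int datasync)
{
	struct inode *inode = file_inode(file);
	int ret;

	/* Write back the range and collect any wb_err for this file. */
	ret = file_write_and_wait_range(file, start, end);
	if (ret)
		return ret;

	inode_lock(inode);
	/* ... flush filesystem-private metadata here (illustrative) ... */
	inode_unlock(inode);
	return 0;
}
```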

int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)

replace a pagecache page with a new one

Parameters

struct page *old
page to be replaced
struct page *new
page to replace with
gfp_t gfp_mask
allocation mode

Description

This function replaces a page in the pagecache with a new one. On success it acquires the pagecache reference for the new page and drops it for the old page. Both the old and new pages must be locked. This function does not add the new page to the LRU, the caller must do that.

The remove + add is atomic. This function cannot fail.

Return

0

int add_to_page_cache_locked(struct page *page, struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)

add a locked page to the pagecache

Parameters

struct page *page
page to add
struct address_space *mapping
the page's address_space
pgoff_t offset
page index
gfp_t gfp_mask
page allocation mode

Description

This function is used to add a page to the pagecache. It must be locked. This function does not add the page to the LRU. The caller must do that.

Return

0 on success, negative error code otherwise.

void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter)

Add an arbitrary waiter to a page’s wait queue

Parameters

struct page *page
Page defining the wait queue of interest
wait_queue_entry_t *waiter
Waiter to add to the queue

Description

Add an arbitrary waiter to the wait queue for the nominated page.

void unlock_page(struct page *page)

unlock a locked page

Parameters

struct page *page
the page

Description

Unlocks the page and wakes up sleepers in ___wait_on_page_locked(). Also wakes sleepers in wait_on_page_writeback() because the wakeup mechanism between PageLocked pages and PageWriteback pages is shared. But that's OK - sleepers in wait_on_page_writeback() just go back to sleep.

Note that this depends on PG_waiters being the sign bit in the byte that contains PG_locked - thus the BUILD_BUG_ON(). That allows us to clear the PG_locked bit and test PG_waiters at the same time fairly portably (architectures that do LL/SC can test any bit, while x86 can test the sign bit).

void end_page_writeback(struct page *page)

end writeback against a page

Parameters

struct page *page
the page

void __lock_page(struct page *__page)

get a lock on the page, assuming we need to sleep to get it

Parameters

struct page *__page
the page to lock

pgoff_t page_cache_next_miss(struct address_space *mapping, pgoff_t index, unsigned long max_scan)

Find the next gap in the page cache.

Parameters

struct address_space *mapping
Mapping.
pgoff_t index
Index.
unsigned long max_scan
Maximum range to search.

Description

Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the gap with the lowest index.

This function may be called under the rcu_read_lock. However, this will not atomically search a snapshot of the cache at a single point in time. For example, if a gap is created at index 5, then subsequently a gap is created at index 10, page_cache_next_miss() covering both indices may return 10 if called under the rcu_read_lock.

Return

The index of the gap if found, otherwise an index outside the range specified (in which case 'return - index >= max_scan' will be true). In the rare case of index wrap-around, 0 will be returned.

pgoff_t page_cache_prev_miss(struct address_space *mapping, pgoff_t index, unsigned long max_scan)

Find the previous gap in the page cache.

Parameters

struct address_space *mapping
Mapping.
pgoff_t index
Index.
unsigned long max_scan
Maximum range to search.

Description

Search the range [max(index - max_scan + 1, 0), index] for the gap with the highest index.

This function may be called under the rcu_read_lock. However, this will not atomically search a snapshot of the cache at a single point in time. For example, if a gap is created at index 10, then subsequently a gap is created at index 5, page_cache_prev_miss() covering both indices may return 5 if called under the rcu_read_lock.

Return

The index of the gap if found, otherwise an index outside the range specified (in which case 'index - return >= max_scan' will be true). In the rare case of wrap-around, ULONG_MAX will be returned.

struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)

locate, pin and lock a page cache entry

Parameters

struct address_space *mapping
the address_space to search
pgoff_t offset
the page cache index

Description

Looks up the page cache slot at mapping & offset. If there is a page cache page, it is returned locked and with an increased refcount.

If the slot holds a shadow entry of a previously evicted page, or a swap entry from shmem/tmpfs, it is returned.

find_lock_entry() may sleep.

Return

the found page or shadow entry, NULL if nothing is found.

struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index, int fgp_flags, gfp_t gfp_mask)

Find and get a reference to a page.

Parameters

struct address_space *mapping
The address_space to search.
pgoff_t index
The page index.
int fgp_flags
FGP flags modify how the page is returned.
gfp_t gfp_mask
Memory allocation flags to use ifFGP_CREAT is specified.

Description

Looks up the page cache entry at mapping & index.

fgp_flags can be zero or more of these flags:

  • FGP_ACCESSED - The page will be marked accessed.
  • FGP_LOCK - The page is returned locked.
  • FGP_CREAT - If no page is present then a new page is allocated using gfp_mask and added to the page cache and the VM's LRU list. The page is returned locked and with an increased refcount.
  • FGP_FOR_MMAP - The caller wants to do its own locking dance if the page is already in cache. If the page was allocated, unlock it before returning so the caller can do the same dance.

If FGP_LOCK or FGP_CREAT are specified then the function may sleep even if the GFP flags specified for FGP_CREAT are atomic.

If there is a page cache page, it is returned with an increased refcount.

Return

The found page or NULL otherwise.
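An illustrative find-or-create use of the flags above; the FGP_LOCK | FGP_ACCESSED | FGP_CREAT combination mirrors what helpers such as find_or_create_page() pass, and the error handling is sketched only.

```c
struct page *page;

page = pagecache_get_page(mapping, index,
			  FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
			  mapping_gfp_mask(mapping));
if (!page)
	return -ENOMEM;		/* FGP_CREAT allocation failed */

/* ... fill or consult the locked page ... */

unlock_page(page);
put_page(page);		/* drop the reference pagecache_get_page() took */
```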

unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index, unsigned int nr_pages, struct page **pages)

gang contiguous pagecache lookup

Parameters

struct address_space *mapping
The address_space to search
pgoff_t index
The starting page index
unsigned int nr_pages
The maximum number of pages
struct page **pages
Where the resulting pages are placed

Description

find_get_pages_contig() works exactly like find_get_pages(), except that the returned number of pages are guaranteed to be contiguous.

Return

the number of pages which were found.

unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index, pgoff_t end, xa_mark_t tag, unsigned int nr_pages, struct page **pages)

find and return pages in given range matchingtag

Parameters

struct address_space *mapping
the address_space to search
pgoff_t *index
the starting page index
pgoff_t end
The final page index (inclusive)
xa_mark_t tag
the tag index
unsigned int nr_pages
the maximum number of pages
struct page **pages
where the resulting pages are placed

Description

Like find_get_pages(), except we only return pages which are tagged with tag. We update index to index the next page for the traversal.

Return

the number of pages which were found.

ssize_t generic_file_buffered_read(struct kiocb *iocb, struct iov_iter *iter, ssize_t written)

generic file read routine

Parameters

struct kiocb *iocb
the iocb to read
struct iov_iter *iter
data destination
ssize_t written
already copied

Description

This is a generic file read routine, and uses the mapping->a_ops->readpage() function for the actual low-level stuff.

This is really ugly. But the goto's actually try to clarify some of the logic when it comes to error handling etc.

Return

  • total number of bytes copied, including those that were already written
  • negative error code if nothing was copied

ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)

generic filesystem read routine

Parameters

struct kiocb *iocb
kernel I/O control block
struct iov_iter *iter
destination for the data read

Description

This is the "read_iter()" routine for all filesystems that can use the page cache directly.

The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall be returned when no data can be read without waiting for I/O requests to complete; it doesn't prevent readahead.

The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/O requests shall be made for the read or for readahead. When no data can be read, -EAGAIN shall be returned. When readahead would be triggered, a partial, possibly empty read shall be returned.

Return

  • number of bytes copied, even for partial reads
  • negative error code (or 0 if IOCB_NOIO) if nothing was read

vm_fault_t filemap_fault(struct vm_fault *vmf)

read in file data for page fault handling

Parameters

struct vm_fault *vmf
struct vm_fault containing details of the fault

Description

filemap_fault() is invoked via the vma operations vector for a mapped memory region to read in file data during a page fault.

The goto's are kind of ugly, but this streamlines the normal case of having it in the page cache, and handles the special cases reasonably without having a lot of duplicated code.

vma->vm_mm->mmap_lock must be held on entry.

If our return value has VM_FAULT_RETRY set, it's because the mmap_lock may be dropped before doing I/O or by lock_page_maybe_drop_mmap().

If our return value does not have VM_FAULT_RETRY set, the mmap_lock has not been released.

We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set.

Return

bitwise-OR of VM_FAULT_ codes.

struct page *read_cache_page(struct address_space *mapping, pgoff_t index, int (*filler)(void *, struct page *), void *data)

read into page cache, fill it if needed

Parameters

struct address_space *mapping
the page's address_space
pgoff_t index
the page index
int (*)(void *, struct page *) filler
function to perform the read
void *data
first arg to filler(data, page) function, often left as NULL

Description

Read into the page cache. If a page already exists, and PageUptodate() is not set, try to fill the page and wait for it to become unlocked.

If the page does not get brought uptodate, return -EIO.

Return

up to date page on success, ERR_PTR() on failure.
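A short sketch of the common call pattern: reading one page of a backing file through the page cache. The read_mapping_page() shorthand wraps read_cache_page() with the mapping's ->readpage() as the filler.

```c
struct page *page;

/* Bring page "index" of "mapping" uptodate, reading it if needed. */
page = read_mapping_page(mapping, index, NULL);
if (IS_ERR(page))
	return PTR_ERR(page);	/* -EIO if it never became uptodate */

/* ... use the uptodate page (it holds an elevated refcount) ... */

put_page(page);
```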

struct page *read_cache_page_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp)

read into page cache, using specified page allocation flags.

Parameters

struct address_space *mapping
the page's address_space
pgoff_t index
the page index
gfp_t gfp
the page allocator flags to use if allocating

Description

This is the same as "read_mapping_page(mapping, index, NULL)", but with any new page allocations done using the specified allocation flags.

If the page does not get brought uptodate, return -EIO.

Return

up to date page on success, ERR_PTR() on failure.

ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)

write data to a file

Parameters

struct kiocb *iocb
IO state structure (file, offset, etc.)
struct iov_iter *from
iov_iter with data to write

Description

This function does all the work needed for actually writing data to a file. It does all basic checks, removes SUID from the file, updates modification times and calls proper subroutines depending on whether we do direct IO or a standard buffered write.

It expects i_mutex to be grabbed unless we work on a block device or similar object which does not need locking at all.

This function does not take care of syncing data in case of O_SYNC write. A caller has to handle it. This is mainly due to the fact that we want to avoid syncing under i_mutex.

Return

  • number of bytes written, even for truncated writes
  • negative error code if no data has been written at all

ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)

write data to a file

Parameters

struct kiocb *iocb
IO state structure
struct iov_iter *from
iov_iter with data to write

Description

This is a wrapper around __generic_file_write_iter() to be used by most filesystems. It takes care of syncing the file in case of O_SYNC file and acquires i_mutex as needed.

Return

  • negative error code if no data has been written at all or vfs_fsync_range() failed for a synchronous write
  • number of bytes written, even for truncated writes

int try_to_release_page(struct page *page, gfp_t gfp_mask)

release old fs-specific metadata on a page

Parameters

struct page *page
the page which the kernel is trying to free
gfp_t gfp_mask
memory allocation flags (and I/O mode)

Description

The address_space is to try to release any data against the page (presumably at page->private).

This may also be called if PG_fscache is set on a page, indicating that the page is known to the local caching routines.

The gfp_mask argument specifies whether I/O may be performed to release this page (__GFP_IO), and whether the call may block (__GFP_RECLAIM & __GFP_FS).

Return

1 if the release was successful, otherwise return zero.

void balance_dirty_pages_ratelimited(struct address_space *mapping)

balance dirty memory state

Parameters

struct address_space *mapping
address_space which was dirtied

Description

Processes which are dirtying memory should call in here once for each page which was newly dirtied. The function will periodically check the system's dirty state and will initiate writeback if needed.

On really big machines, get_writeback_state is expensive, so try to avoid calling it too often (ratelimiting). But once we're over the dirty memory limit we decrease the ratelimiting by a lot, to prevent individual processes from overshooting the limit by (ratelimit_pages) each.
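Schematically, the call sits at the end of each iteration of a buffered-write style loop, after a page has been dirtied and released; the loop shape and helper name below are hypothetical, not a real filesystem write path.

```c
while (iov_iter_count(iter)) {
	struct page *page;
	size_t copied;

	/* Hypothetical helper: grab, lock and fill one pagecache page. */
	page = myfs_get_and_fill_page(mapping, pos, iter, &copied);

	set_page_dirty(page);
	unlock_page(page);
	put_page(page);
	pos += copied;

	/* Let the VM throttle this task once per newly dirtied page. */
	balance_dirty_pages_ratelimited(mapping);
}
```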

void tag_pages_for_writeback(struct address_space *mapping, pgoff_t start, pgoff_t end)

tag pages to be written by write_cache_pages

Parameters

struct address_space *mapping
address space structure to write
pgoff_t start
starting page index
pgoff_t end
ending page index (inclusive)

Description

This function scans the page range from start to end (inclusive) and tags all pages that have DIRTY tag set with a special TOWRITE tag. The idea is that write_cache_pages (or whoever calls this function) will then use TOWRITE tag to identify pages eligible for writeback. This mechanism is used to avoid livelocking of writeback by a process steadily creating new dirty pages in the file (thus it is important for this function to be quick so that it can tag pages faster than a dirtying process can create them).

int write_cache_pages(struct address_space *mapping, struct writeback_control *wbc, writepage_t writepage, void *data)

walk the list of dirty pages of the given address space and write all of them.

Parameters

struct address_space *mapping
address space structure to write
struct writeback_control *wbc
subtract the number of written pages from *wbc->nr_to_write
writepage_t writepage
function called for each page
void *data
data passed to writepage function

Description

If a page is already under I/O, write_cache_pages() skips it, even if it's dirty. This is desirable behaviour for memory-cleaning writeback, but it is INCORRECT for data-integrity system calls such as fsync(). fsync() and msync() need to guarantee that all the data which was dirty at the time the call was made get new I/O started against them. If wbc->sync_mode is WB_SYNC_ALL then we were called for data integrity and we must wait for existing IO to complete.

To avoid livelocks (when other process dirties new pages), we first tag pages which should be written back with TOWRITE tag and only then start writing them. For data-integrity sync we have to be careful so that we do not miss some pages (e.g., because some other process has cleared TOWRITE tag we set). The rule we follow is that TOWRITE tag can be cleared only by the process clearing the DIRTY tag (and submitting the page for IO).

To avoid deadlocks between range_cyclic writeback and callers that hold pages in PageWriteback to aggregate IO until write_cache_pages() returns, we do not loop back to the start of the file. Doing so causes a page lock/page writeback access order inversion - we should only ever lock multiple pages in ascending page->index order, and looping back to the start of the file violates that rule and causes deadlocks.

Return

0 on success, negative error code otherwise
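A hedged sketch of a ->writepages() built on write_cache_pages(), following the same pattern generic_writepages() uses: a callback that writes a single page with the mapping's own ->writepage(). The filesystem name is illustrative.

```c
static int myfs_writepage_cb(struct page *page,
			     struct writeback_control *wbc, void *data)
{
	struct address_space *mapping = data;

	/* Write one dirty, locked page via the mapping's writepage. */
	return mapping->a_ops->writepage(page, wbc);
}

static int myfs_writepages(struct address_space *mapping,
			   struct writeback_control *wbc)
{
	return write_cache_pages(mapping, wbc, myfs_writepage_cb, mapping);
}
```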

int generic_writepages(struct address_space *mapping, struct writeback_control *wbc)

walk the list of dirty pages of the given address space and writepage() all of them.

Parameters

struct address_space *mapping
address space structure to write
struct writeback_control *wbc
subtract the number of written pages from*wbc->nr_to_write

Description

This is a library function, which implements the writepages() address_space_operation.

Return

0 on success, negative error code otherwise

int write_one_page(struct page *page)

write out a single page and wait on I/O

Parameters

struct page *page
the page to write

Description

The page must be locked by the caller and will be unlocked upon return.

Note that the mapping's AS_EIO/AS_ENOSPC flags will be cleared when this function returns.

Return

0 on success, negative error code otherwise

void wait_for_stable_page(struct page *page)

wait for writeback to finish, if necessary.

Parameters

struct page *page
The page to wait on.

Description

This function determines if the given page is related to a backing device that requires page contents to be held stable during writeback. If so, then it will wait for any pending writeback to complete.

void truncate_inode_pages_range(struct address_space *mapping, loff_t lstart, loff_t lend)

truncate range of pages specified by start & end byte offsets

Parameters

struct address_space *mapping
mapping to truncate
loff_t lstart
offset from which to truncate
loff_t lend
offset to which to truncate (inclusive)

Description

Truncate the page cache, removing the pages that are between specified offsets (and zeroing out partial pages if lstart or lend + 1 is not page aligned).

Truncate takes two passes - the first pass is nonblocking. It will not block on page locks and it will not block on writeback. The second pass will wait. This is to prevent as much IO as possible in the affected region. The first pass will remove most pages, so the search cost of the second pass is low.

We pass down the cache-hot hint to the page freeing code. Even if the mapping is large, it is probably the case that the final pages are the most recently touched, and freeing happens in ascending file offset order.

Note that since ->invalidatepage() accepts a range to invalidate, truncate_inode_pages_range is able to handle cases where lend + 1 is not page aligned properly.

void truncate_inode_pages(struct address_space *mapping, loff_t lstart)

truncate all the pages from an offset

Parameters

struct address_space *mapping
mapping to truncate
loff_t lstart
offset from which to truncate

Description

Called under (and serialised by) inode->i_mutex.

Note

When this function returns, there can be a page in the process of deletion (inside __delete_from_page_cache()) in the specified range. Thus mapping->nrpages can be non-zero when this function returns even after truncation of the whole mapping.

void truncate_inode_pages_final(struct address_space *mapping)

truncate all pages before inode dies

Parameters

struct address_space *mapping
mapping to truncate

Description

Called under (and serialized by) inode->i_mutex.

Filesystems have to use this in the .evict_inode path to inform the VM that this is the final truncate and the inode is going away.

unsigned long invalidate_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t end)

Invalidate all the unlocked pages of one inode

Parameters

struct address_space *mapping
the address_space which holds the pages to invalidate
pgoff_t start
the offset 'from' which to invalidate
pgoff_t end
the offset ‘to’ which to invalidate (inclusive)

Description

This function only removes the unlocked pages, if you want to remove all the pages of one inode, you must call truncate_inode_pages.

invalidate_mapping_pages() will not block on IO activity. It will not invalidate pages which are dirty, locked, under writeback or mapped into pagetables.

Return

the number of the pages that were invalidated

int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t start, pgoff_t end)

remove range of pages from an address_space

Parameters

struct address_space *mapping
the address_space
pgoff_t start
the page offset 'from' which to invalidate
pgoff_t end
the page offset ‘to’ which to invalidate (inclusive)

Description

Any pages which are found to be mapped into pagetables are unmapped prior to invalidation.

Return

-EBUSY if any pages could not be invalidated.

int invalidate_inode_pages2(struct address_space *mapping)

remove all pages from an address_space

Parameters

struct address_space *mapping
the address_space

Description

Any pages which are found to be mapped into pagetables are unmapped prior to invalidation.

Return

-EBUSY if any pages could not be invalidated.

void truncate_pagecache(struct inode *inode, loff_t newsize)

unmap and remove pagecache that has been truncated

Parameters

struct inode *inode
inode
loff_t newsize
new file size

Description

inode's new i_size must already be written before truncate_pagecache is called.

This function should typically be called before the filesystem releases resources associated with the freed range (eg. deallocates blocks). This way, pagecache will always stay logically coherent with on-disk format, and the filesystem would not have to deal with situations such as writepage being called for a page that has already had its underlying blocks deallocated.

void truncate_setsize(struct inode *inode, loff_t newsize)

update inode and pagecache for a new file size

Parameters

struct inode *inode
inode
loff_t newsize
new file size

Description

truncate_setsize updates i_size and performs pagecache truncation (if necessary) to newsize. It will typically be called from the filesystem's setattr function when ATTR_SIZE is passed in.

Must be called with a lock serializing truncates and writes (generally i_mutex but e.g. xfs uses a different lock) and before all filesystem specific block truncation has been performed.
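A sketch of the typical ->setattr() call site described above, with hypothetical filesystem names: truncate_setsize() updates i_size and the pagecache before the filesystem frees the underlying blocks.

```c
static int myfs_setattr(struct dentry *dentry, struct iattr *attr)
{
	struct inode *inode = d_inode(dentry);

	if (attr->ia_valid & ATTR_SIZE) {
		/* Update i_size and truncate pagecache first... */
		truncate_setsize(inode, attr->ia_size);
		/* ...then free the on-disk blocks (hypothetical helper). */
		myfs_truncate_blocks(inode, attr->ia_size);
	}

	setattr_copy(inode, attr);
	mark_inode_dirty(inode);
	return 0;
}
```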

void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)

update pagecache after extension of i_size

Parameters

struct inode *inode
inode for which i_size was extended
loff_t from
original inode size
loff_t to
new inode size

Description

Handle extension of inode size either caused by extending truncate or by write starting after current i_size. We mark the page straddling current i_size RO so that page_mkwrite() is called on the nearest write access to the page. This way the filesystem can be sure that page_mkwrite() is called on the page before user writes to the page via mmap after the i_size has been changed.

The function must be called after i_size is updated so that a page fault coming after we unlock the page will already see the new i_size. The function must be called while we still hold i_mutex - this not only makes sure i_size is stable but also that userspace cannot observe the new i_size value before we are prepared to store mmap writes at the new inode size.

void truncate_pagecache_range(struct inode *inode, loff_t lstart, loff_t lend)

unmap and remove pagecache that is hole-punched

Parameters

struct inode *inode
inode
loff_t lstart
offset of beginning of hole
loff_t lend
offset of last byte of hole

Description

This function should typically be called before the filesystemreleases resources associated with the freed range (eg. deallocatesblocks). This way, pagecache will always stay logically coherentwith on-disk format, and the filesystem would not have to deal withsituations such as writepage being called for a page that has alreadyhad its underlying blocks deallocated.

void mapping_set_error(struct address_space *mapping, int error)

record a writeback error in the address_space

Parameters

struct address_space *mapping
the mapping in which an error should be set
int error
the error to set in the mapping

Description

When writeback fails in some way, we must record that error so that userspace can be informed when fsync and the like are called. We endeavor to report errors on any file that was open at the time of the error. Some internal callers also need to know when writeback errors have occurred.

When a writeback error occurs, most filesystems will want to call mapping_set_error to record the error in the mapping so that it can be reported when the application calls fsync(2).

void attach_page_private(struct page *page, void *data)

Attach private data to a page.

Parameters

struct page *page
Page to attach data to.
void *data
Data to attach to page.

Description

Attaching private data to a page increments the page's reference count. The data must be detached before the page will be freed.

void *detach_page_private(struct page *page)

Detach private data from a page.

Parameters

struct page *page
Page to detach data from.

Description

Removes the data that was previously attached to the page and decrements the refcount on the page.

Return

Data that was attached to the page.
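The two helpers above are meant to be used as a pair. A kernel-context sketch (not buildable standalone; the myfs_* names and the per-page info struct are hypothetical) of a filesystem stashing per-page bookkeeping:

```c
/* Hypothetical per-page bookkeeping for a filesystem. */
struct myfs_page_info {
        unsigned long flags;
};

static void myfs_track_page(struct page *page)
{
        struct myfs_page_info *info = kzalloc(sizeof(*info), GFP_NOFS);

        if (info)
                attach_page_private(page, info); /* takes a page reference */
}

/* Called e.g. from ->releasepage before the page can be freed. */
static void myfs_untrack_page(struct page *page)
{
        struct myfs_page_info *info = detach_page_private(page);

        kfree(info); /* detach dropped the page reference */
}
```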

struct page *find_get_page(struct address_space *mapping, pgoff_t offset)

find and get a page reference

Parameters

struct address_space *mapping
the address_space to search
pgoff_t offset
the page index

Description

Looks up the page cache slot at mapping & offset. If there is a page cache page, it is returned with an increased refcount.

Otherwise, NULL is returned.

struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)

locate, pin and lock a pagecache page

Parameters

struct address_space *mapping
the address_space to search
pgoff_t offset
the page index

Description

Looks up the page cache slot at mapping & offset. If there is a page cache page, it is returned locked and with an increased refcount.

Otherwise, NULL is returned.

find_lock_page() may sleep.

struct page *find_or_create_page(struct address_space *mapping, pgoff_t index, gfp_t gfp_mask)

locate or add a pagecache page

Parameters

struct address_space *mapping
the page's address_space
pgoff_t index
the page's index into the mapping
gfp_t gfp_mask
page allocation mode

Description

Looks up the page cache slot at mapping & index. If there is a page cache page, it is returned locked and with an increased refcount.

If the page is not present, a new page is allocated using gfp_mask and added to the page cache and the VM's LRU list. The page is returned locked and with an increased refcount.

On memory exhaustion, NULL is returned.

find_or_create_page() may sleep, even if gfp_mask specifies an atomic allocation!

struct page *grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)

returns locked page at given index in given cache

Parameters

struct address_space *mapping
target address_space
pgoff_t index
the page index

Description

Same as grab_cache_page(), but do not wait if the page is unavailable. This is intended for speculative data generators, where the data can be regenerated if the page couldn't be grabbed. This routine should be safe to call while holding the lock for another page.

Clear __GFP_FS when allocating the page to avoid recursion into the fs and deadlock against the caller's locked page.

struct readahead_control

Describes a readahead request.

Definition

struct readahead_control {
  struct file *file;
  struct address_space *mapping;
};

Members

file
The file, used primarily by network filesystems for authentication. May be NULL if invoked internally by the filesystem.
mapping
Readahead this filesystem object.

Description

A readahead request is for consecutive pages. Filesystems which implement the ->readahead method should call readahead_page() or readahead_page_batch() in a loop and attempt to start I/O against each page in the request.

Most of the fields in this struct are private and should be accessed by the functions below.

struct page *readahead_page(struct readahead_control *rac)

Get the next page to read.

Parameters

struct readahead_control *rac
The current readahead request.

Context

The page is locked and has an elevated refcount. The caller should decrease the refcount once the page has been submitted for I/O and unlock the page once all I/O to that page has completed.

Return

A pointer to the next page, or NULL if we are done.
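A sketch of the loop the description above prescribes for a ->readahead implementation. Kernel-context code, not buildable standalone; myfs_read_page is a hypothetical helper that starts I/O and unlocks the page on completion:

```c
static void myfs_readahead(struct readahead_control *rac)
{
        struct page *page;

        /* Each page comes back locked with an elevated refcount. */
        while ((page = readahead_page(rac))) {
                myfs_read_page(rac->file, page); /* start I/O; page is unlocked on completion */
                put_page(page);                  /* drop the ref once submitted */
        }
}
```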

readahead_page_batch(rac,array)

Get a batch of pages to read.

Parameters

rac
The current readahead request.
array
An array of pointers to struct page.

Context

The pages are locked and have an elevated refcount. The caller should decrease the refcount once the page has been submitted for I/O and unlock the page once all I/O to that page has completed.

Return

The number of pages placed in the array. 0 indicates the request is complete.

loff_t readahead_pos(struct readahead_control *rac)

The byte offset into the file of this readahead request.

Parameters

struct readahead_control *rac
The readahead request.

loff_t readahead_length(struct readahead_control *rac)

The number of bytes in this readahead request.

Parameters

struct readahead_control *rac
The readahead request.

pgoff_t readahead_index(struct readahead_control *rac)

The index of the first page in this readahead request.

Parameters

struct readahead_control *rac
The readahead request.

unsigned int readahead_count(struct readahead_control *rac)

The number of pages in this readahead request.

Parameters

struct readahead_control *rac
The readahead request.
int page_mkwrite_check_truncate(struct page *page, struct inode *inode)

check if page was truncated

Parameters

struct page *page
the page to check
struct inode *inode
the inode to check the page against

Description

Returns the number of bytes in the page up to EOF, or -EFAULT if the page was truncated.

Memory pools

void mempool_exit(mempool_t *pool)

exit a mempool initialized with mempool_init()

Parameters

mempool_t *pool
pointer to the memory pool which was initialized with mempool_init().

Description

Free all reserved elements in pool and pool itself. This function only sleeps if the free_fn() function sleeps.

May be called on a zeroed but uninitialized mempool (i.e. allocated with kzalloc()).

void mempool_destroy(mempool_t *pool)

deallocate a memory pool

Parameters

mempool_t *pool
pointer to the memory pool which was allocated via mempool_create().

Description

Free all reserved elements in pool and pool itself. This function only sleeps if the free_fn() function sleeps.

int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t *free_fn, void *pool_data)

initialize a memory pool

Parameters

mempool_t *pool
pointer to the memory pool that should be initialized
int min_nr
the minimum number of elements guaranteed to be allocated for this pool.
mempool_alloc_t *alloc_fn
user-defined element-allocation function.
mempool_free_t *free_fn
user-defined element-freeing function.
void *pool_data
optional private data available to the user-defined functions.

Description

Like mempool_create(), but initializes the pool in place (i.e. embedded in another structure).

Return

0 on success, negative error code otherwise.

mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t *free_fn, void *pool_data)

create a memory pool

Parameters

int min_nr
the minimum number of elements guaranteed to be allocated for this pool.
mempool_alloc_t *alloc_fn
user-defined element-allocation function.
mempool_free_t *free_fn
user-defined element-freeing function.
void *pool_data
optional private data available to the user-defined functions.

Description

this function creates and allocates a guaranteed size, preallocated memory pool. The pool can be used from the mempool_alloc() and mempool_free() functions. This function might sleep. Both the alloc_fn() and the free_fn() functions might sleep - as long as the mempool_alloc() function is not called from IRQ contexts.

Return

pointer to the created memory pool object or NULL on error.
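A sketch of the typical create/alloc/free/destroy lifecycle, here backing the pool with a slab cache via the stock mempool_create_slab_pool() helper. Kernel-context code, not buildable standalone; the myfs_* names, the request struct, and the reserve of 16 elements are illustrative:

```c
struct myfs_req { int op; };           /* illustrative payload */

static struct kmem_cache *req_cache;
static mempool_t *req_pool;

static int myfs_init_pool(void)
{
        req_cache = kmem_cache_create("myfs_req", sizeof(struct myfs_req),
                                      0, 0, NULL);
        if (!req_cache)
                return -ENOMEM;

        /* Guarantee at least 16 preallocated elements. */
        req_pool = mempool_create_slab_pool(16, req_cache);
        if (!req_pool) {
                kmem_cache_destroy(req_cache);
                return -ENOMEM;
        }
        return 0;
}

static void myfs_do_io(void)
{
        /* Never fails in process context thanks to the preallocated reserve. */
        struct myfs_req *req = mempool_alloc(req_pool, GFP_NOIO);

        /* ... use req ... */
        mempool_free(req, req_pool);
}

static void myfs_exit_pool(void)
{
        mempool_destroy(req_pool);
        kmem_cache_destroy(req_cache);
}
```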

int mempool_resize(mempool_t *pool, int new_min_nr)

resize an existing memory pool

Parameters

mempool_t *pool
pointer to the memory pool which was allocated via mempool_create().
int new_min_nr
the new minimum number of elements guaranteed to be allocated for this pool.

Description

This function shrinks/grows the pool. In the case of growing, it cannot be guaranteed that the pool will be grown to the new size immediately, but new mempool_free() calls will refill it. This function may sleep.

Note, the caller must guarantee that no mempool_destroy is called while this function is running. mempool_alloc() & mempool_free() might be called (eg. from IRQ contexts) while this function executes.

Return

0 on success, negative error code otherwise.

void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)

allocate an element from a specific memory pool

Parameters

mempool_t *pool
pointer to the memory pool which was allocated via mempool_create().
gfp_t gfp_mask
the usual allocation bitmask.

Description

this function only sleeps if the alloc_fn() function sleeps or returns NULL. Note that due to preallocation, this function never fails when called from process contexts. (it might fail if called from an IRQ context.)

Note

using __GFP_ZERO is not supported.

Return

pointer to the allocated element or NULL on error.

void mempool_free(void *element, mempool_t *pool)

return an element to the pool.

Parameters

void *element
pool element pointer.
mempool_t *pool
pointer to the memory pool which was allocated via mempool_create().

Description

this function only sleeps if the free_fn() function sleeps.

DMA pools

struct dma_pool *dma_pool_create(const char *name, struct device *dev, size_t size, size_t align, size_t boundary)

Creates a pool of consistent memory blocks, for dma.

Parameters

const char *name
name of pool, for diagnostics
struct device *dev
device that will be doing the DMA
size_t size
size of the blocks in this pool.
size_t align
alignment requirement for blocks; must be a power of two
size_t boundary
returned blocks won't cross this power of two boundary

Context

not in_interrupt()

Description

Given one of these pools, dma_pool_alloc() may be used to allocate memory. Such memory will all have "consistent" DMA mappings, accessible by the device and its driver without using cache flushing primitives. The actual size of blocks allocated may be larger than requested because of alignment.

If boundary is nonzero, objects returned from dma_pool_alloc() won't cross that size boundary. This is useful for devices which have addressing restrictions on individual DMA transfers, such as not crossing boundaries of 4KBytes.

Return

a dma allocation pool with the requested characteristics, or NULL if one can't be created.

void dma_pool_destroy(struct dma_pool *pool)

destroys a pool of dma memory blocks.

Parameters

struct dma_pool *pool
dma pool that will be destroyed

Context

!in_interrupt()

Description

Caller guarantees that no more memory from the pool is in use, and that nothing will try to use the pool after this call.

void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, dma_addr_t *handle)

get a block of consistent memory

Parameters

struct dma_pool *pool
dma pool that will produce the block
gfp_t mem_flags
GFP_* bitmask
dma_addr_t *handle
pointer to dma address of block

Return

the kernel virtual address of a currently unused block, and reports its dma address through the handle. If such a memory block can't be allocated, NULL is returned.

void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)

put block back into dma pool

Parameters

struct dma_pool *pool
the dma pool holding the block
void *vaddr
virtual address of block
dma_addr_t dma
dma address of block

Description

Caller promises neither device nor driver will again touch this block unless it is first re-allocated.
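A DMA-pool lifecycle sketch tying the four calls above together. Kernel-context code, not buildable standalone; the mydrv_* names and the 64-byte/4 KiB-boundary descriptor geometry are illustrative:

```c
/* Illustrative: 64-byte descriptors that must not cross a 4 KiB boundary. */
static struct dma_pool *desc_pool;

static int mydrv_setup(struct device *dev)
{
        desc_pool = dma_pool_create("mydrv_desc", dev, 64, 8, 4096);
        if (!desc_pool)
                return -ENOMEM;
        return 0;
}

static void mydrv_one_descriptor(void)
{
        dma_addr_t dma;
        void *desc = dma_pool_alloc(desc_pool, GFP_KERNEL, &dma);

        if (!desc)
                return;
        /* Hand 'dma' to the hardware, touch 'desc' from the CPU ... */
        dma_pool_free(desc_pool, desc, dma);
}

static void mydrv_teardown(void)
{
        dma_pool_destroy(desc_pool); /* all blocks must have been freed */
}
```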

struct dma_pool *dmam_pool_create(const char *name, struct device *dev, size_t size, size_t align, size_t allocation)

Managed dma_pool_create()

Parameters

const char *name
name of pool, for diagnostics
struct device *dev
device that will be doing the DMA
size_t size
size of the blocks in this pool.
size_t align
alignment requirement for blocks; must be a power of two
size_t allocation
returned blocks won't cross this boundary (or zero)

Description

Managed dma_pool_create(). DMA pool created with this function is automatically destroyed on driver detach.

Return

a managed dma allocation pool with the requested characteristics, or NULL if one can't be created.

void dmam_pool_destroy(struct dma_pool *pool)

Managed dma_pool_destroy()

Parameters

struct dma_pool *pool
dma pool that will be destroyed

Description

Managed dma_pool_destroy().

More Memory Management Functions

void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, unsigned long size)

remove ptes mapping the vma

Parameters

struct vm_area_struct *vma
vm_area_struct holding ptes to be zapped
unsigned long address
starting address of pages to zap
unsigned long size
number of bytes to zap

Description

This function only unmaps ptes assigned to VM_PFNMAP vmas.

The entire address range must be fully contained within the vma.

int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr, struct page **pages, unsigned long *num)

insert multiple pages into user vma, batching the pmd lock.

Parameters

struct vm_area_struct *vma
user vma to map to
unsigned long addr
target start user address of these pages
struct page **pages
source kernel pages
unsigned long *num
in: number of pages to map. out: number of pages that were not mapped. (0 means all pages were successfully mapped).

Description

Preferred over vm_insert_page() when inserting multiple pages.

In case of error, we may have mapped a subset of the provided pages. It is the caller's responsibility to account for this case.

The same restrictions apply as in vm_insert_page().

int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, struct page *page)

insert single page into user vma

Parameters

struct vm_area_struct *vma
user vma to map to
unsigned long addr
target user address of this page
struct page *page
source kernel page

Description

This allows drivers to insert individual pages they've allocated into a user vma.

The page has to be a nice clean _individual_ kernel allocation. If you allocate a compound page, you need to have marked it as such (__GFP_COMP), or manually just split the page up yourself (see split_page()).

NOTE! Traditionally this was done with "remap_pfn_range()" which took an arbitrary page protection parameter. This doesn't allow that. Your vma protection will have to be set up correctly, which means that if you want a shared writable mapping, you'd better ask for a shared writable mapping!

The page does not need to be reserved.

Usually this function is called from f_op->mmap() handler under mm->mmap_lock write-lock, so it can change vma->vm_flags. Caller must set VM_MIXEDMAP on vma if it wants to call this function from other places, for example from page-fault handler.

Return

0 on success, negative error code otherwise.
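A sketch of the usual call site - an mmap handler exposing a single kernel page. Kernel-context code, not buildable standalone; the mydrv names are hypothetical:

```c
/* Hypothetical driver that exposes one kernel page to userspace. */
static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct mydrv *drv = file->private_data;

        if (vma->vm_end - vma->vm_start != PAGE_SIZE)
                return -EINVAL;

        /* Called under mm->mmap_lock write-lock, so vm_flags may still change. */
        return vm_insert_page(vma, vma->vm_start, drv->page);
}
```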

int vm_map_pages(struct vm_area_struct *vma, struct page **pages, unsigned long num)

maps a range of kernel pages starting at a non-zero offset

Parameters

struct vm_area_struct *vma
user vma to map to
struct page **pages
pointer to array of source kernel pages
unsigned long num
number of pages in page array

Description

Maps an object consisting of num pages, catering for the user's requested vm_pgoff.

If we fail to insert any page into the vma, the function will return immediately leaving any previously inserted pages present. Callers from the mmap handler may immediately return the error as their caller will destroy the vma, removing any successfully inserted pages. Other callers should make their own arrangements for calling unmap_region().

Context

Process context. Called by mmap handlers.

Return

0 on success and error code otherwise.

int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages, unsigned long num)

maps a range of kernel pages starting at a zero offset

Parameters

struct vm_area_struct *vma
user vma to map to
struct page **pages
pointer to array of source kernel pages
unsigned long num
number of pages in page array

Description

Similar to vm_map_pages(), except that it explicitly sets the offset to 0. This function is intended for the drivers that did not consider vm_pgoff.

Context

Process context. Called by mmap handlers.

Return

0 on success and error code otherwise.

vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, pgprot_t pgprot)

insert single pfn into user vma with specified pgprot

Parameters

struct vm_area_struct *vma
user vma to map to
unsigned long addr
target user address of this page
unsigned long pfn
source kernel pfn
pgprot_t pgprot
pgprot flags for the inserted page

Description

This is exactly like vmf_insert_pfn(), except that it allows drivers to override pgprot on a per-page basis.

This only makes sense for IO mappings, and it makes no sense for COW mappings. In general, using multiple vmas is preferable; vmf_insert_pfn_prot should only be used if using multiple VMAs is impractical.

See vmf_insert_mixed_prot() for a discussion of the implication of using a value of pgprot different from that of vma->vm_page_prot.

Context

Process context. May allocate using GFP_KERNEL.

Return

vm_fault_t value.

vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)

insert single pfn into user vma

Parameters

struct vm_area_struct *vma
user vma to map to
unsigned long addr
target user address of this page
unsigned long pfn
source kernel pfn

Description

Similar to vm_insert_page, this allows drivers to insert individual pages they've allocated into a user vma. Same comments apply.

This function should only be called from a vm_ops->fault handler, and in that case the handler should return the result of this function.

vma cannot be a COW mapping.

As this is called only for pages that do not currently exist, we do not need to flush old virtual caches or the TLB.

Context

Process context. May allocate using GFP_KERNEL.

Return

vm_fault_t value.
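A sketch of the fault-handler usage described above. Kernel-context code, not buildable standalone; the mydrv names and base_pfn field are hypothetical:

```c
/* Hypothetical fault handler for a VM_PFNMAP mapping of device memory. */
static vm_fault_t mydrv_fault(struct vm_fault *vmf)
{
        struct mydrv *drv = vmf->vma->vm_private_data;
        unsigned long pfn = drv->base_pfn + vmf->pgoff;

        /* The handler returns the vm_fault_t from the insert directly. */
        return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}
```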

vm_fault_t vmf_insert_mixed_prot(struct vm_area_struct *vma, unsigned long addr, pfn_t pfn, pgprot_t pgprot)

insert single pfn into user vma with specified pgprot

Parameters

struct vm_area_struct *vma
user vma to map to
unsigned long addr
target user address of this page
pfn_t pfn
source kernel pfn
pgprot_t pgprot
pgprot flags for the inserted page

Description

This is exactly like vmf_insert_mixed(), except that it allows drivers to override pgprot on a per-page basis.

Typically this function should be used by drivers to set caching- and encryption bits different than those of vma->vm_page_prot, because the caching- or encryption mode may not be known at mmap() time. This is ok as long as vma->vm_page_prot is not used by the core vm to set caching and encryption bits for those vmas (except for COW pages). This is ensured by core vm only modifying these page table entries using functions that don't touch caching- or encryption bits, using pte_modify() if needed. (See for example mprotect()). Also when new page-table entries are created, this is only done using the fault() callback, and never using the value of vma->vm_page_prot, except for page-table entries that point to anonymous pages as the result of COW.

Context

Process context. May allocate using GFP_KERNEL.

Return

vm_fault_t value.

int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot)

remap kernel memory to userspace

Parameters

struct vm_area_struct *vma
user vma to map to
unsigned long addr
target user address to start at
unsigned long pfn
page frame number of kernel physical memory address
unsigned long size
size of mapping area
pgprot_t prot
page protection flags for this mapping
page protection flags for this mapping

Note

this is only safe if the mm semaphore is held when called.

Return

0 on success, negative error code otherwise.

int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len)

remap memory to userspace

Parameters

struct vm_area_struct *vma
user vma to map to
phys_addr_t start
start of the physical memory to be mapped
unsigned long len
size of area

Description

This is a simplified io_remap_pfn_range() for common driver use. The driver just needs to give us the physical memory range to be mapped, we'll figure out the rest from the vma information.

NOTE! Some drivers might want to tweak vma->vm_page_prot first to get whatever write-combining details or similar.

Return

0 on success, negative error code otherwise.
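A sketch of the simplified driver mmap path described above, including the vm_page_prot tweak the NOTE mentions. Kernel-context code, not buildable standalone; the mydrv names and BAR fields are hypothetical:

```c
/* Hypothetical mmap handler exposing a device's register BAR. */
static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct mydrv *drv = file->private_data;

        /* Optionally tweak caching before mapping, e.g. uncached I/O: */
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

        /* vm_iomap_memory() figures out pfn/size/offset from the vma. */
        return vm_iomap_memory(vma, drv->bar_start, drv->bar_len);
}
```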

void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows)

unmap the portion of all mmaps in the specified address_space corresponding to the specified byte range in the underlying file.

Parameters

struct address_space *mapping
the address space containing mmaps to be unmapped.
loff_t const holebegin
byte in first page to unmap, relative to the start of the underlying file. This will be rounded down to a PAGE_SIZE boundary. Note that this is different from truncate_pagecache(), which must keep the partial page. In contrast, we must get rid of partial pages.
loff_t const holelen
size of prospective hole in bytes. This will be rounded up to a PAGE_SIZE boundary. A holelen of zero truncates to the end of the file.
int even_cows
1 when truncating a file, unmap even private COWed pages; but 0 when invalidating pagecache, don't throw away private data.

int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn)

look up PFN at a user virtual address

Parameters

struct vm_area_struct *vma
memory mapping
unsigned long address
user virtual address
unsigned long *pfn
location to store found PFN

Description

Only IO mappings and raw PFN mappings are allowed.

Return

zero and the pfn at pfn on success, a negative error code otherwise.

unsigned long __get_pfnblock_flags_mask(struct page *page, unsigned long pfn, unsigned long end_bitidx, unsigned long mask)

Return the requested group of flags for the pageblock_nr_pages block of pages

Parameters

struct page *page
The page within the block of interest
unsigned long pfn
The target page frame number
unsigned long end_bitidx
The last bit of interest to retrieve
unsigned long mask
mask of bits that the caller is interested in

Return

pageblock_bits flags

void set_pfnblock_flags_mask(struct page *page, unsigned long flags, unsigned long pfn, unsigned long end_bitidx, unsigned long mask)

Set the requested group of flags for a pageblock_nr_pages block of pages

Parameters

struct page *page
The page within the block of interest
unsigned long flags
The flags to set
unsigned long pfn
The target page frame number
unsigned long end_bitidx
The last bit of interest
unsigned long mask
mask of bits that the caller is interested in

void __putback_isolated_page(struct page *page, unsigned int order, int mt)

Return a now-isolated page back where we got it

Parameters

struct page *page
Page that was isolated
unsigned int order
Order of the isolated page
int mt
The page's pageblock's migratetype

Description

This function is meant to return a page pulled from the free lists via __isolate_free_page back to the free lists they were pulled from.

void *alloc_pages_exact(size_t size, gfp_t gfp_mask)

allocate an exact number of physically-contiguous pages.

Parameters

size_t size
the number of bytes to allocate
gfp_t gfp_mask
GFP flags for the allocation, must not contain __GFP_COMP

Description

This function is similar to alloc_pages(), except that it allocates the minimum number of pages to satisfy the request. alloc_pages() can only allocate memory in power-of-two pages.

This function is also limited by MAX_ORDER.

Memory allocated by this function must be released by free_pages_exact().

Return

pointer to the allocated area or NULL in case of error.
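A minimal pairing sketch for the two calls above. Kernel-context code, not buildable standalone; the mydrv names and the 20 KiB size are illustrative:

```c
/* Illustrative: a 20 KiB buffer; plain alloc_pages() would round up to 32 KiB. */
static void *buf;

static int mydrv_alloc_buf(void)
{
        buf = alloc_pages_exact(20 * 1024, GFP_KERNEL);
        return buf ? 0 : -ENOMEM;
}

static void mydrv_free_buf(void)
{
        /* Must pass the same size back so all the pages are found. */
        free_pages_exact(buf, 20 * 1024);
}
```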

void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)

allocate an exact number of physically-contiguous pages on a node.

Parameters

int nid
the preferred node ID where memory should be allocated
size_t size
the number of bytes to allocate
gfp_t gfp_mask
GFP flags for the allocation, must not contain __GFP_COMP

Description

Like alloc_pages_exact(), but try to allocate on node nid first before falling back.

Return

pointer to the allocated area or NULL in case of error.

void free_pages_exact(void *virt, size_t size)

release memory allocated via alloc_pages_exact()

Parameters

void *virt
the value returned by alloc_pages_exact.
size_t size
size of allocation, same value as passed to alloc_pages_exact().

Description

Release the memory allocated by a previous call to alloc_pages_exact.

unsigned long nr_free_zone_pages(int offset)

count number of pages beyond high watermark

Parameters

int offset
The zone index of the highest zone

Description

nr_free_zone_pages() counts the number of pages which are beyond the high watermark within all zones at or below a given zone index. For each zone, the number of pages is calculated as:

nr_free_zone_pages = managed_pages - high_pages

Return

number of pages beyond high watermark.

unsigned long nr_free_buffer_pages(void)

count number of pages beyond high watermark

Parameters

void
no arguments

Description

nr_free_buffer_pages() counts the number of pages which are beyond the high watermark within ZONE_DMA and ZONE_NORMAL.

Return

number of pages beyond high watermark within ZONE_DMA and ZONE_NORMAL.

unsigned long nr_free_pagecache_pages(void)

count number of pages beyond high watermark

Parameters

void
no arguments

Description

nr_free_pagecache_pages() counts the number of pages which are beyond the high watermark within all zones.

Return

number of pages beyond high watermark within all zones.

int find_next_best_node(int node, nodemask_t *used_node_mask)

find the next node that should appear in a given node's fallback list

Parameters

int node
node whose fallback list we're appending
nodemask_t *used_node_mask
nodemask_t of already used nodes

Description

We use a number of factors to determine which is the next node that should appear on a given node's fallback list. The node should not have appeared already in node's fallback list, and it should be the next closest node according to the distance array (which contains arbitrary distance values from each node to each node in the system), and should also prefer nodes with no CPUs, since presumably they'll have very little allocation pressure on them otherwise.

Return

node id of the found node or NUMA_NO_NODE if no node is found.

void sparse_memory_present_with_active_regions(int nid)

Call memory_present for each active range

Parameters

int nid
The node to call memory_present for. If MAX_NUMNODES, all nodes will be used.

Description

If an architecture guarantees that all ranges registered contain no holes and may be freed, this function may be used instead of calling memory_present() manually.

void get_pfn_range_for_nid(unsigned int nid, unsigned long *start_pfn, unsigned long *end_pfn)

Return the start and end page frames for a node

Parameters

unsigned int nid
The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned.
unsigned long *start_pfn
Passed by reference. On return, it will have the node start_pfn.
unsigned long *end_pfn
Passed by reference. On return, it will have the node end_pfn.

Description

It returns the start and end page frame of a node based on information provided by memblock_set_node(). If called for a node with no available memory, a warning is printed and the start and end PFNs will be 0.

unsigned long absent_pages_in_range(unsigned long start_pfn, unsigned long end_pfn)

Return number of page frames in holes within a range

Parameters

unsigned long start_pfn
The start PFN to start searching for holes
unsigned long end_pfn
The end PFN to stop searching for holes

Return

the number of page frames in memory holes within a range.

unsigned long node_map_pfn_alignment(void)

determine the maximum internode alignment

Parameters

void
no arguments

Description

This function should be called after node map is populated and sorted. It calculates the maximum power of two alignment which can distinguish all the nodes.

For example, if all nodes are 1GiB and aligned to 1GiB, the return value would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the nodes are shifted by 256MiB, 256MiB. Note that if only the last node is shifted, 1GiB is enough and this function will indicate so.

This is used to test whether pfn -> nid mapping of the chosen memory model has fine enough granularity to avoid incorrect mapping for the populated node map.

Return

the determined alignment in pfn's. 0 if there is no alignment requirement (single node).
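The 1GiB/256MiB example above can be modeled in ordinary userspace C. This is a brute-force model of the idea, not the kernel's implementation: the alignment is the largest power-of-two block size such that no block of that size spans two different nodes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned long start, end; int nid; } range_t;

/* Largest power-of-two block size (in pfns) such that no block of that
 * size contains pfns from two different nodes.  Ranges must be sorted
 * and contiguous, as in a populated node map. */
unsigned long max_distinguishing_alignment(const range_t *r, size_t n)
{
        unsigned long align;

        for (align = 1UL << 62; align > 1; align >>= 1) {
                size_t i;
                int ok = 1;

                /* Every boundary between two different nodes must fall
                 * on a multiple of 'align'. */
                for (i = 0; i + 1 < n; i++)
                        if (r[i].nid != r[i + 1].nid && (r[i].end % align))
                                ok = 0;
                if (ok)
                        break;
        }
        return align;
}
```

With 4 KiB pages, 1 GiB is 0x40000 pfns: two 1 GiB-aligned nodes yield an alignment of 0x40000 pfns, and shifting the boundary by 256 MiB (0x10000 pfns) drops it to 0x10000, matching the example in the description.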

unsigned long find_min_pfn_with_active_regions(void)

Find the minimum PFN registered

Parameters

void
no arguments

Return

the minimum PFN based on information provided via memblock_set_node().

void free_area_init(unsigned long *max_zone_pfn)

Initialise all pg_data_t and zone data

Parameters

unsigned long *max_zone_pfn
an array of max PFNs for each zone

Description

This will call free_area_init_node() for each active node in the system. Using the page ranges provided by memblock_set_node(), the size of each zone in each node and their holes is calculated. If the maximum PFN between two adjacent zones match, it is assumed that the zone is empty. For example, if arch_max_dma_pfn == arch_max_dma32_pfn, it is assumed that arch_max_dma32_pfn has no pages. It is also assumed that a zone starts where the previous one ended. For example, ZONE_DMA32 starts at arch_max_dma_pfn.

void set_dma_reserve(unsigned long new_dma_reserve)

set the specified number of pages reserved in the first zone

Parameters

unsigned long new_dma_reserve
The number of pages to mark reserved

Description

The per-cpu batchsize and zone watermarks are determined by managed_pages. In the DMA zone, a significant percentage may be consumed by kernel image and other unfreeable allocations which can skew the watermarks badly. This function may optionally be used to account for unfreeable pages in the first zone (e.g., ZONE_DMA). The effect will be lower watermarks and smaller per-cpu batchsize.

void setup_per_zone_wmarks(void)

called when min_free_kbytes changes or when memory is hot-{added|removed}

Parameters

void
no arguments

Description

Ensures that the watermark[min,low,high] values for each zone are set correctly with respect to min_free_kbytes.

int alloc_contig_range(unsigned long start, unsigned long end, unsigned migratetype, gfp_t gfp_mask)

tries to allocate given range of pages

Parameters

unsigned long start
start PFN to allocate
unsigned long end
one-past-the-last PFN to allocate
unsigned migratetype
migratetype of the underlying pageblocks (either #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks in range must have the same migratetype and it must be either of the two.
gfp_t gfp_mask
GFP mask to use during compaction

Description

The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES aligned. The PFN range must belong to a single zone.

The first thing this routine does is attempt to MIGRATE_ISOLATE all pageblocks in the range. Once isolated, the pageblocks should not be modified by others.

Return

zero on success or negative error code. On success all pages whose PFN is in [start, end) are allocated for the caller and need to be freed with free_contig_range().

struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask, int nid, nodemask_t *nodemask)

tries to find and allocate contiguous range of pages

Parameters

unsigned long nr_pages
Number of contiguous pages to allocate
gfp_t gfp_mask
GFP mask to limit search and used during compaction
int nid
Target node
nodemask_t *nodemask
Mask for other possible nodes

Description

This routine is a wrapper around alloc_contig_range(). It scans over zones on an applicable zonelist to find a contiguous pfn range which can then be tried for allocation with alloc_contig_range(). This routine is intended for allocation requests which cannot be fulfilled with the buddy allocator.

The allocated memory is always aligned to a page boundary. If nr_pages is a power of two then the alignment is guaranteed to be to the given nr_pages (e.g. 1GB request would be aligned to 1GB).

Allocated pages can be freed with free_contig_range() or by manually calling __free_page() on each allocated page.

Return

pointer to contiguous pages on success, or NULL if not successful.