Physical Memory

Linux is available for a wide range of architectures so there is a need for an architecture-independent abstraction to represent the physical memory. This chapter describes the structures used to manage physical memory in a running system.

The first principal concept prevalent in the memory management is Non-Uniform Memory Access (NUMA). With multi-core and multi-socket machines, memory may be arranged into banks that incur a different cost to access depending on the “distance” from the processor. For example, there might be a bank of memory assigned to each CPU or a bank of memory very suitable for DMA near peripheral devices.

Each bank is called a node and the concept is represented under Linux by a struct pglist_data even if the architecture is UMA. This structure is always referenced by its typedef pg_data_t. A pg_data_t structure for a particular node can be referenced by the NODE_DATA(nid) macro, where nid is the ID of that node.

For NUMA architectures, the node structures are allocated by the architecture specific code early during boot. Usually, these structures are allocated locally on the memory bank they represent. For UMA architectures, only one static pg_data_t structure called contig_page_data is used. Nodes will be discussed further in Section Nodes.

The entire physical address space is partitioned into one or more blocks called zones which represent ranges within memory. These ranges are usually determined by architectural constraints for accessing the physical memory. The memory range within a node that corresponds to a particular zone is described by a struct zone. Each zone has one of the types described below.

  • ZONE_DMA and ZONE_DMA32 historically represented memory suitable for DMA by peripheral devices that cannot access all of the addressable memory. For many years there have been better and more robust interfaces to get memory with DMA specific requirements (Dynamic DMA mapping using the generic device), but ZONE_DMA and ZONE_DMA32 still represent memory ranges that have restrictions on how they can be accessed. Depending on the architecture, either or both of these zone types can be disabled at build time using the CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 configuration options. Some 64-bit platforms may need both zones as they support peripherals with different DMA addressing limitations.

  • ZONE_NORMAL is for normal memory that can be accessed by the kernel all the time. DMA operations can be performed on pages in this zone if the DMA devices support transfers to all addressable memory. ZONE_NORMAL is always enabled.

  • ZONE_HIGHMEM is the part of the physical memory that is not covered by a permanent mapping in the kernel page tables. The memory in this zone is only accessible to the kernel using temporary mappings. This zone is available only on some 32-bit architectures and is enabled with CONFIG_HIGHMEM.

  • ZONE_MOVABLE is for normal accessible memory, just like ZONE_NORMAL. The difference is that the contents of most pages in ZONE_MOVABLE are movable. That means that while virtual addresses of these pages do not change, their content may move between different physical pages. Often ZONE_MOVABLE is populated during memory hotplug, but it may also be populated on boot using one of the kernelcore, movablecore and movable_node kernel command line parameters. See Page migration and Memory Hot(Un)Plug for additional details.

  • ZONE_DEVICE represents memory residing on devices such as PMEM and GPU. It has different characteristics than RAM zone types and it exists to provide struct page and memory map services for device driver identified physical address ranges. ZONE_DEVICE is enabled with the configuration option CONFIG_ZONE_DEVICE.

It is important to note that many kernel operations can only take place using ZONE_NORMAL so it is the most performance critical zone. Zones are discussed further in Section Zones.

The relation between node and zone extents is determined by the physical memory map reported by the firmware, architectural constraints for memory addressing and certain parameters in the kernel command line.

For example, with a 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the entire memory will be on node 0 and there will be three zones: ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM:

0                                                            2G
+-------------------------------------------------------------+
|                            node 0                           |
+-------------------------------------------------------------+

0         16M                    896M                        2G
+----------+-----------------------+--------------------------+
| ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
+----------+-----------------------+--------------------------+

With a kernel built with ZONE_DMA disabled and ZONE_DMA32 enabled and booted with the movablecore=80% parameter on an arm64 machine with 16 Gbytes of RAM equally split between two nodes, there will be ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE on node 0, and ZONE_NORMAL and ZONE_MOVABLE on node 1:

1G                                9G                         17G
+--------------------------------+ +--------------------------+
|              node 0            | |          node 1          |
+--------------------------------+ +--------------------------+

1G       4G        4200M          9G          9320M          17G
+---------+----------+-----------+ +------------+-------------+
|  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
+---------+----------+-----------+ +------------+-------------+

Memory banks may belong to interleaving nodes. In the example below an x86 machine has 16 Gbytes of RAM in 4 memory banks; even banks belong to node 0 and odd banks belong to node 1:

0              4G              8G             12G            16G
+-------------+ +-------------+ +-------------+ +-------------+
|    node 0   | |    node 1   | |    node 0   | |    node 1   |
+-------------+ +-------------+ +-------------+ +-------------+

0   16M      4G
+-----+-------+ +-------------+ +-------------+ +-------------+
| DMA | DMA32 | |    NORMAL   | |    NORMAL   | |    NORMAL   |
+-----+-------+ +-------------+ +-------------+ +-------------+

In this case node 0 will span from 0 to 12 Gbytes and node 1 will span from 4 to 16 Gbytes.
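The spans in the interleaved example can be reproduced with a small userspace sketch. It is not kernel code: struct bank, struct node_extent and node_extent_of() are hypothetical names introduced only for illustration. The sketch assumes that a node's span runs from the lowest start to the highest end of its banks, so memory that belongs to another node inside that range counts as a hole.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one memory bank; not a kernel structure. */
struct bank {
    unsigned long long start; /* first byte of the bank */
    unsigned long long end;   /* one past the last byte of the bank */
    int nid;                  /* node the bank belongs to */
};

/* Hypothetical summary of the extent of one node. */
struct node_extent {
    unsigned long long start;   /* lowest address owned by the node */
    unsigned long long end;     /* one past the highest address owned */
    unsigned long long present; /* bytes actually backed by the node's banks */
};

/*
 * Compute the span of node nid from the lowest start to the highest end
 * of its banks, and sum the sizes of the banks as the present memory.
 */
static struct node_extent node_extent_of(const struct bank *banks, size_t n,
                                         int nid)
{
    struct node_extent ext = { 0, 0, 0 };
    int seen = 0;

    assert(banks != NULL);
    for (size_t i = 0; i < n; i++) {
        if (banks[i].nid != nid)
            continue;
        if (!seen || banks[i].start < ext.start)
            ext.start = banks[i].start;
        if (!seen || banks[i].end > ext.end)
            ext.end = banks[i].end;
        ext.present += banks[i].end - banks[i].start;
        seen = 1;
    }
    return ext;
}
```

For the four 4 Gbyte banks above, node 0 gets a span of 12 Gbytes with only 8 Gbytes present, which is the same distinction that node_spanned_pages and node_present_pages express in pages.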

Nodes

As we have mentioned, each node in memory is described by a pg_data_t which is a typedef for a struct pglist_data. When allocating a page, by default Linux uses a node-local allocation policy to allocate memory from the node closest to the running CPU. As processes tend to run on the same CPU, it is likely the memory from the current node will be used. The allocation policy can be controlled by users as described in NUMA Memory Policy.

Most NUMA architectures maintain an array of pointers to the node structures. The actual structures are allocated early during boot when architecture specific code parses the physical memory map reported by the firmware. The bulk of the node initialization happens slightly later in the boot process by the free_area_init() function, described later in Section Initialization.

Along with the node structures, the kernel maintains an array of nodemask_t bitmasks called node_states. Each bitmask in this array represents a set of nodes with particular properties as defined by enum node_states:

N_POSSIBLE

The node could become online at some point.

N_ONLINE

The node is online.

N_NORMAL_MEMORY

The node has regular memory.

N_HIGH_MEMORY

The node has regular or high memory. When CONFIG_HIGHMEM is disabled, it is aliased to N_NORMAL_MEMORY.

N_MEMORY

The node has memory (regular, high, movable).

N_CPU

The node has one or more CPUs.

N_GENERIC_INITIATOR

The node has one or more Generic Initiators.

For each node that has a property described above, the bit corresponding to the node ID in the node_states[<property>] bitmask is set.

For example, for node 2 with normal memory and CPUs, bit 2 will be set in

node_states[N_POSSIBLE]
node_states[N_ONLINE]
node_states[N_NORMAL_MEMORY]
node_states[N_HIGH_MEMORY]
node_states[N_MEMORY]
node_states[N_CPU]
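The bit layout in this example can be modelled in userspace with one machine word per property. This is only a sketch: the M_* enumerators, model_states and the helper functions are hypothetical stand-ins for enum node_states, node_states and the nodemask helpers, and the sketch assumes that 64 node IDs are enough. It also keeps the high-memory property separate, whereas the kernel aliases it to normal memory when CONFIG_HIGHMEM is disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's enum node_states. */
enum node_state_model {
    M_POSSIBLE,
    M_ONLINE,
    M_NORMAL_MEMORY,
    M_HIGH_MEMORY,
    M_MEMORY,
    M_CPU,
    M_GENERIC_INITIATOR,
    M_NR_STATES
};

#define MODEL_MAX_NODES 64 /* assumption: one 64-bit word per property */

/* One bitmask per property; bit N describes node N. */
static unsigned long long model_states[M_NR_STATES];

static void model_set_state(int nid, enum node_state_model state)
{
    assert(nid >= 0 && nid < MODEL_MAX_NODES);
    model_states[state] |= 1ULL << nid;
}

static bool model_node_state(int nid, enum node_state_model state)
{
    assert(nid >= 0 && nid < MODEL_MAX_NODES);
    return (model_states[state] >> nid) & 1ULL;
}

/* Record an online node that has normal memory and CPUs. */
static void model_add_cpu_memory_node(int nid)
{
    model_set_state(nid, M_POSSIBLE);
    model_set_state(nid, M_ONLINE);
    model_set_state(nid, M_NORMAL_MEMORY);
    model_set_state(nid, M_HIGH_MEMORY);
    model_set_state(nid, M_MEMORY);
    model_set_state(nid, M_CPU);
}
```

Calling model_add_cpu_memory_node(2) sets bit 2 in the six masks listed above and leaves the mask for generic initiators untouched.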

For various operations possible with nodemasks please refer to include/linux/nodemask.h.

Among other things, nodemasks are used to provide macros for node traversal, namely for_each_node() and for_each_online_node().

For instance, to call a function foo() for each online node:

int nid;

/* Visit every node that is currently online. */
for_each_online_node(nid) {
        pg_data_t *pgdat = NODE_DATA(nid);

        foo(pgdat);
}

Node structure

The node structure, struct pglist_data, is declared in include/linux/mmzone.h. Here we briefly describe the fields of this structure:

General

node_zones

The zones for this node. Not all of the zones may be populated, but it is the full list. It is referenced by this node’s node_zonelists as well as other nodes’ node_zonelists.

node_zonelists

The list of all zones in all nodes. This list defines the order of zones that allocations are preferred from. The node_zonelists is set up by build_zonelists() in mm/page_alloc.c during the initialization of core memory management structures.

nr_zones

Number of populated zones in this node.

node_mem_map

For UMA systems that use the FLATMEM memory model, node 0’s node_mem_map is an array of struct pages representing each physical frame.

node_page_ext

For UMA systems that use the FLATMEM memory model, node 0’s node_page_ext is an array of extensions of struct pages. Available only in kernels built with CONFIG_PAGE_EXTENSION enabled.

node_start_pfn

The page frame number of the starting page frame in this node.

node_present_pages

Total number of physical pages present in this node.

node_spanned_pages

Total size of physical page range, including holes.

node_size_lock

A lock that protects the fields defining the node extents. Only defined when at least one of the CONFIG_MEMORY_HOTPLUG or CONFIG_DEFERRED_STRUCT_PAGE_INIT configuration options is enabled. pgdat_resize_lock() and pgdat_resize_unlock() are provided to manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG or CONFIG_DEFERRED_STRUCT_PAGE_INIT.

node_id

The Node ID (NID) of the node, starts at 0.

totalreserve_pages

This is a per-node reserve of pages that are not available to userspace allocations.

first_deferred_pfn

If memory initialization on large machines is deferred then this is the first PFN that needs to be initialized. Defined only when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.

deferred_split_queue

Per-node queue of huge pages whose split was deferred. Defined only when CONFIG_TRANSPARENT_HUGEPAGE is enabled.

__lruvec

Per-node lruvec holding LRU lists and related parameters. Used only when memory cgroups are disabled. It should not be accessed directly; use mem_cgroup_lruvec() to look up lruvecs instead.

Reclaim control

See also Page Reclaim.

kswapd

Per-node instance of kswapd kernel thread.

kswapd_wait, pfmemalloc_wait, reclaim_wait

Wait queues used to synchronize memory reclaim tasks.

nr_writeback_throttled

Number of tasks that are throttled waiting on dirty pages to clean.

nr_reclaim_start

Number of pages written while reclaim is throttled waiting for writeback.

kswapd_order

Controls the order kswapd tries to reclaim.

kswapd_highest_zoneidx

The highest zone index to be reclaimed by kswapd.

kswapd_failures

Number of runs kswapd was unable to reclaim any pages.

min_unmapped_pages

Minimal number of unmapped file backed pages that cannot be reclaimed. Determined by the vm.min_unmapped_ratio sysctl. Only defined when CONFIG_NUMA is enabled.

min_slab_pages

Minimal number of SLAB pages that cannot be reclaimed. Determined by the vm.min_slab_ratio sysctl. Only defined when CONFIG_NUMA is enabled.

flags

Flags controlling reclaim behavior.

Compaction control

kcompactd_max_order

Page order that kcompactd should try to achieve.

kcompactd_highest_zoneidx

The highest zone index to be compacted by kcompactd.

kcompactd_wait

Wait queue used to synchronize memory compaction tasks.

kcompactd

Per-node instance of kcompactd kernel thread.

proactive_compact_trigger

Determines if proactive compaction is enabled. Controlled by the vm.compaction_proactiveness sysctl.

Statistics

per_cpu_nodestats

Per-CPU VM statistics for the node.

vm_stat

VM statistics for the node.

Zones

As we have mentioned, each zone in memory is described by a struct zone which is an element of the node_zones array of the node it belongs to. struct zone is the core data structure of the page allocator. A zone represents a range of physical memory and may have holes.

The page allocator uses the GFP flags (see Memory Allocation Controls) specified by a memory allocation to determine the highest zone in a node from which the memory allocation can allocate memory. The page allocator first allocates memory from that zone. If it cannot allocate the requested amount of memory from that zone, it allocates memory from the next lower zone in the node, and the process continues down to and including the lowest zone. For example, if a node contains ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE and the highest zone of a memory allocation is ZONE_MOVABLE, the order of the zones from which the page allocator allocates memory is ZONE_MOVABLE > ZONE_NORMAL > ZONE_DMA32.
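The fallback order described above can be sketched as a loop that starts at the highest allowed zone and walks down. This is a deliberately reduced userspace model, not the allocator itself: enum model_zone_idx, struct model_node and model_alloc() are hypothetical names, and the sketch ignores watermarks, zonelists spanning several nodes and everything else the real allocator takes into account.

```c
#include <assert.h>
#include <stddef.h>

/* Zones of the example node, ordered from lowest to highest. */
enum model_zone_idx { MZ_DMA32, MZ_NORMAL, MZ_MOVABLE, MZ_NR };

/* Hypothetical node holding only a free page count per zone. */
struct model_node {
    unsigned long free_pages[MZ_NR];
};

/*
 * Try the highest allowed zone first and fall back to lower zones.
 * Returns the index of the zone that satisfied the request, or -1 if
 * no zone at or below the highest allowed one has enough free pages.
 */
static int model_alloc(struct model_node *node, enum model_zone_idx highest,
                       unsigned long nr_pages)
{
    assert(node != NULL && highest < MZ_NR);
    for (int z = (int)highest; z >= 0; z--) {
        if (node->free_pages[z] >= nr_pages) {
            node->free_pages[z] -= nr_pages;
            return z;
        }
    }
    return -1;
}
```

With highest set to MZ_MOVABLE the loop visits the zones in the order MOVABLE, NORMAL, DMA32. With highest set to MZ_NORMAL the movable zone is never considered.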

At runtime, free pages in a zone are in the Per-CPU Pagesets (PCP) or free areas of the zone. The Per-CPU Pagesets are a vital mechanism in the kernel’s memory management system. By handling most frequent allocations and frees locally on each CPU, the Per-CPU Pagesets improve performance and scalability, especially on systems with many cores. The page allocator in the kernel employs a two-step strategy for memory allocation, starting with the Per-CPU Pagesets before falling back to the buddy allocator. Pages are transferred between the Per-CPU Pagesets and the global free areas (managed by the buddy allocator) in batches. This minimizes the overhead of frequent interactions with the global buddy allocator.
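The effect of batching can be illustrated with a counting model. The sketch below is an assumption-laden simplification: MODEL_BATCH and MODEL_HIGH are made-up constants, struct model_zone and struct model_pcp track only page counts, and the lock is represented by a counter of how often the global free areas had to be touched.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BATCH 4 /* hypothetical number of pages moved per transfer */
#define MODEL_HIGH  8 /* hypothetical limit of pages cached per CPU */

struct model_zone {
    unsigned long buddy_free;  /* pages in the global free areas */
    unsigned long buddy_locks; /* times the global free areas were touched */
};

struct model_pcp {
    unsigned long count; /* pages cached for one CPU */
};

/* Step 1: serve from the per-CPU cache. Step 2: refill it in one batch. */
static int model_pcp_alloc(struct model_zone *zone, struct model_pcp *pcp)
{
    assert(zone != NULL && pcp != NULL);
    if (pcp->count == 0) {
        unsigned long n = zone->buddy_free < MODEL_BATCH ?
                          zone->buddy_free : MODEL_BATCH;

        if (n == 0)
            return -1; /* nothing left anywhere */
        zone->buddy_locks++;
        zone->buddy_free -= n;
        pcp->count += n;
    }
    pcp->count--;
    return 0;
}

/* Freed pages stay local until the cache is full, then one batch returns. */
static void model_pcp_free(struct model_zone *zone, struct model_pcp *pcp)
{
    assert(zone != NULL && pcp != NULL);
    pcp->count++;
    if (pcp->count >= MODEL_HIGH) {
        zone->buddy_locks++;
        zone->buddy_free += MODEL_BATCH;
        pcp->count -= MODEL_BATCH;
    }
}
```

In this model eight single-page allocations touch the global free areas only twice, which is the saving that batching is meant to provide.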

Architecture specific code calls free_area_init() to initialize zones.

Zone structure

The zone structure, struct zone, is defined in include/linux/mmzone.h. Here we briefly describe the fields of this structure:

General

_watermark

The watermarks for this zone. When the amount of free pages in a zone is below the min watermark, boosting is ignored and an allocation may trigger direct reclaim and direct compaction; the min watermark is also used to throttle direct reclaim. When the amount of free pages in a zone is below the low watermark, kswapd is woken up. When the amount of free pages in a zone is above the high watermark, kswapd stops reclaiming (a zone is balanced) when the NUMA_BALANCING_MEMORY_TIERING bit of sysctl_numa_balancing_mode is not set. The promo watermark is used for memory tiering and NUMA balancing. When the amount of free pages in a zone is above the promo watermark, kswapd stops reclaiming when the NUMA_BALANCING_MEMORY_TIERING bit of sysctl_numa_balancing_mode is set. The watermarks are set by __setup_per_zone_wmarks(). The min watermark is calculated according to the vm.min_free_kbytes sysctl. The other three watermarks are set according to the distance between two watermarks. The distance itself is calculated taking the vm.watermark_scale_factor sysctl into account.
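The relation between the four watermarks and the reclaim decisions can be sketched as follows. This is a userspace illustration, not the code of __setup_per_zone_wmarks(): the enum, the function names and the idea of passing the min watermark and the distance in as ready-made page counts are assumptions of the sketch, as is the equal spacing of the low, high and promo watermarks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_wmark { WM_MIN, WM_LOW, WM_HIGH, WM_PROMO, WM_NR };

/* Assumption: each watermark sits one "distance" above the previous one. */
static void model_set_wmarks(unsigned long wmark[WM_NR],
                             unsigned long min_pages, unsigned long distance)
{
    assert(wmark != NULL);
    wmark[WM_MIN] = min_pages;
    wmark[WM_LOW] = min_pages + distance;
    wmark[WM_HIGH] = min_pages + 2 * distance;
    wmark[WM_PROMO] = min_pages + 3 * distance;
}

/* Below the min watermark an allocation may have to reclaim directly. */
static bool model_may_direct_reclaim(const unsigned long wmark[WM_NR],
                                     unsigned long free_pages)
{
    return free_pages < wmark[WM_MIN];
}

/* Below the low watermark kswapd is woken up. */
static bool model_should_wake_kswapd(const unsigned long wmark[WM_NR],
                                     unsigned long free_pages)
{
    return free_pages < wmark[WM_LOW];
}

/* kswapd stops above high, or above promo when memory tiering is on. */
static bool model_kswapd_can_stop(const unsigned long wmark[WM_NR],
                                  unsigned long free_pages, bool tiering)
{
    return free_pages > wmark[tiering ? WM_PROMO : WM_HIGH];
}
```

With a min watermark of 100 pages and a distance of 50 pages, the sketch yields watermarks of 100, 150, 200 and 250 pages.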

watermark_boost

The number of pages which are used to boost watermarks to increase reclaim pressure to reduce the likelihood of future fallbacks and wake kswapd now as the node may be balanced overall and kswapd will not wake naturally.

nr_reserved_highatomic

The number of pages which are reserved for high-order atomic allocations.

nr_free_highatomic

The number of free pages in reserved highatomic pageblocks.

lowmem_reserve

The array of the amounts of the memory reserved in this zone for memory allocations. For example, if the highest zone a memory allocation can allocate memory from is ZONE_MOVABLE, the amount of memory reserved in this zone for this allocation is lowmem_reserve[ZONE_MOVABLE] when attempting to allocate memory from this zone. This is a mechanism the page allocator uses to prevent allocations which could use highmem from using too much lowmem. For some specialised workloads on highmem machines, it is dangerous for the kernel to allow process memory to be allocated from the lowmem zone. This is because that memory could then be pinned via the mlock() system call, or by unavailability of swapspace. The vm.lowmem_reserve_ratio sysctl determines how aggressive the kernel is in defending these lower zones. This array is recalculated by setup_per_zone_lowmem_reserve() at runtime if the vm.lowmem_reserve_ratio sysctl changes.
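How the reserve protects a lower zone can be sketched with a simplified check. The names below are hypothetical, and the comparison is a reduced form of what the allocator does: it assumes that a zone may serve a request only while its free pages exceed the watermark plus the reserve selected by the highest zone the request could have used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_zone_idx { MZ_DMA32, MZ_NORMAL, MZ_MOVABLE, MZ_NR };

/* Hypothetical zone with only the fields the check needs. */
struct model_zone {
    unsigned long free_pages;
    unsigned long lowmem_reserve[MZ_NR]; /* indexed by highest allowed zone */
};

/*
 * Simplified check: the more zones a request could have used, the larger
 * the reserve it must leave untouched in this lower zone.
 */
static bool model_watermark_ok(const struct model_zone *zone,
                               unsigned long watermark,
                               enum model_zone_idx highest_zoneidx)
{
    assert(zone != NULL && highest_zoneidx < MZ_NR);
    return zone->free_pages >
           watermark + zone->lowmem_reserve[highest_zoneidx];
}
```

For a lower zone with 1000 free pages, a watermark of 100 pages and reserves of 0, 500 and 950 pages, a request limited to that zone passes, while a request that could have used the movable zone is refused because 1000 is not above 1050.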

node

The index of the node this zone belongs to. Available only when CONFIG_NUMA is enabled because there is only one node in a UMA system.

zone_pgdat

Pointer to the struct pglist_data of the node this zone belongs to.

per_cpu_pageset

Pointer to the Per-CPU Pagesets (PCP) allocated and initialized by setup_zone_pageset(). By handling most frequent allocations and frees locally on each CPU, PCP improves performance and scalability on systems with many cores.

pageset_high_min

Copied to the high_min of the Per-CPU Pagesets for faster access.

pageset_high_max

Copied to the high_max of the Per-CPU Pagesets for faster access.

pageset_batch

Copied to the batch of the Per-CPU Pagesets for faster access. The batch, high_min and high_max of the Per-CPU Pagesets are used to calculate the number of elements the Per-CPU Pagesets obtain from the buddy allocator under a single hold of the lock for efficiency. They are also used to decide if the Per-CPU Pagesets return pages to the buddy allocator in the page free process.

pageblock_flags

The pointer to the flags for the pageblocks in the zone (see include/linux/pageblock-flags.h for flags list). The memory is allocated in setup_usemap(). Each pageblock occupies NR_PAGEBLOCK_BITS bits. Defined only when CONFIG_FLATMEM is enabled. The flags are stored in mem_section when CONFIG_SPARSEMEM is enabled.

zone_start_pfn

The start pfn of the zone. It is initialized by calculate_node_totalpages().

managed_pages

The present pages managed by the buddy system, which is calculated as: managed_pages = present_pages - reserved_pages, where reserved_pages includes pages allocated by the memblock allocator. It should be used by the page allocator and the vm scanner to calculate all kinds of watermarks and thresholds. It is accessed using atomic_long_xxx() functions. It is initialized in free_area_init_core() and then is reinitialized when the memblock allocator frees pages into the buddy system.

spanned_pages

The total pages spanned by the zone, including holes, which is calculated as: spanned_pages = zone_end_pfn - zone_start_pfn. It is initialized by calculate_node_totalpages().

present_pages

The physical pages existing within the zone, which is calculated as: present_pages = spanned_pages - absent_pages (pages in holes). It may be used by memory hotplug or memory power management logic to figure out unmanaged pages by checking (present_pages - managed_pages). Write access to present_pages at runtime should be protected by mem_hotplug_begin/done(). Any reader who cannot tolerate drift of present_pages should use get_online_mems() to get a stable value. It is initialized by calculate_node_totalpages().
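The three formulas given for spanned_pages, present_pages and managed_pages chain together, as the following arithmetic sketch shows. struct model_zone_extent and the model_*() helpers are hypothetical and exist only to spell the formulas out; the input numbers in the usage note are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inputs; the field names follow the formulas in the text. */
struct model_zone_extent {
    unsigned long zone_start_pfn;
    unsigned long zone_end_pfn;   /* one past the last PFN of the zone */
    unsigned long absent_pages;   /* pages in holes */
    unsigned long reserved_pages; /* e.g. pages kept by memblock */
};

/* spanned_pages = zone_end_pfn - zone_start_pfn */
static unsigned long model_spanned_pages(const struct model_zone_extent *z)
{
    assert(z != NULL && z->zone_end_pfn >= z->zone_start_pfn);
    return z->zone_end_pfn - z->zone_start_pfn;
}

/* present_pages = spanned_pages - absent_pages */
static unsigned long model_present_pages(const struct model_zone_extent *z)
{
    unsigned long spanned = model_spanned_pages(z);

    assert(z->absent_pages <= spanned);
    return spanned - z->absent_pages;
}

/* managed_pages = present_pages - reserved_pages */
static unsigned long model_managed_pages(const struct model_zone_extent *z)
{
    unsigned long present = model_present_pages(z);

    assert(z->reserved_pages <= present);
    return present - z->reserved_pages;
}
```

For a zone from PFN 4096 to PFN 229376 with 1024 pages in holes and 2048 reserved pages, the sketch gives 225280 spanned, 224256 present and 222208 managed pages, so present_pages - managed_pages recovers the 2048 unmanaged pages.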

present_early_pages

The present pages existing within the zone located on memory available since early boot, excluding hotplugged memory. Defined only when CONFIG_MEMORY_HOTPLUG is enabled and initialized by calculate_node_totalpages().

cma_pages

The pages reserved for CMA use. These pages behave like ZONE_MOVABLE when they are not used for CMA. Defined only when CONFIG_CMA is enabled.

name

The name of the zone. It is a pointer to the corresponding element of the zone_names array.

nr_isolate_pageblock

Number of isolated pageblocks. It is used to solve the incorrect free page counting problem caused by racy retrieval of the migratetype of a pageblock. Protected by zone->lock. Defined only when CONFIG_MEMORY_ISOLATION is enabled.

span_seqlock

The seqlock to protect zone_start_pfn and spanned_pages. It is a seqlock because it has to be read outside of zone->lock, and it is done in the main allocator path. However, the seqlock is written quite infrequently. Defined only when CONFIG_MEMORY_HOTPLUG is enabled.

initialized

The flag indicating if the zone is initialized. Set by init_currently_empty_zone() during boot.

free_area

The array of free areas, where each element corresponds to a specific order which is a power of two. The buddy allocator uses this structure to manage free memory efficiently. When allocating, it tries to find the smallest sufficient block. If the smallest sufficient block is larger than the requested size, it is recursively split into the next smaller blocks until the required size is reached. When a page is freed, it may be merged with its buddy to form a larger block. It is initialized by zone_init_free_lists().
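The split step can be sketched with per-order counters instead of real free lists. This is a userspace model under stated assumptions: MODEL_NR_ORDERS is an arbitrary small limit, struct model_free_area and the helper names are hypothetical, and only block counts are tracked. The buddy of a block is found by flipping the bit of its PFN that corresponds to the block order.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_ORDERS 4 /* orders 0..3 only; an arbitrary small limit */

/* Hypothetical free area: just a count of free blocks per order. */
struct model_free_area {
    unsigned long nr_free[MODEL_NR_ORDERS];
};

/*
 * Take a block of the requested order. The smallest sufficient block is
 * used; if it is larger than requested it is split repeatedly and one
 * half of every split goes back to the list of the next smaller order.
 * Returns the order of the block that was found, or -1 on failure.
 */
static int model_buddy_alloc(struct model_free_area *fa, unsigned int order)
{
    unsigned int cur;
    int found;

    assert(fa != NULL && order < MODEL_NR_ORDERS);
    for (cur = order; cur < MODEL_NR_ORDERS; cur++)
        if (fa->nr_free[cur] > 0)
            break;
    if (cur == MODEL_NR_ORDERS)
        return -1;

    fa->nr_free[cur]--;
    found = (int)cur;
    while (cur > order) {
        cur--;
        fa->nr_free[cur]++; /* the unused half stays free */
    }
    return found;
}

/* The buddy of a block differs from it only in the bit for its order. */
static unsigned long model_buddy_pfn(unsigned long pfn, unsigned int order)
{
    assert(order < 8 * sizeof(unsigned long));
    return pfn ^ (1UL << order);
}
```

Starting from a single free order-3 block, a request for order 0 splits it three times and leaves one free block each at orders 0, 1 and 2.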

unaccepted_pages

The list of pages to be accepted. All pages on the list are MAX_PAGE_ORDER. Defined only when CONFIG_UNACCEPTED_MEMORY is enabled.

flags

The zone flags. The three least significant bits are used and defined by enum zone_flags. ZONE_BOOSTED_WATERMARK (bit 0): zone recently boosted watermarks. Cleared when kswapd is woken. ZONE_RECLAIM_ACTIVE (bit 1): kswapd may be scanning the zone. ZONE_BELOW_HIGH (bit 2): zone is below high watermark.

lock

The main lock that protects the internal data structures of the page allocator specific to the zone, in particular free_area.

percpu_drift_mark

When free pages are below this point, additional steps are taken when reading the number of free pages to avoid per-cpu counter drift allowing watermarks to be breached. It is updated in refresh_zone_stat_thresholds().

Compaction control

compact_cached_free_pfn

The PFN where compaction free scanner should start in the next scan.

compact_cached_migrate_pfn

The PFNs where compaction migration scanner should start in the next scan. This array has two elements: the first one is used in MIGRATE_ASYNC mode, and the other one is used in MIGRATE_SYNC mode.

compact_init_migrate_pfn

The initial migration PFN which is initialized to 0 at boot time, and to the first pageblock with migratable pages in the zone after a full compaction finishes. It is used to check if a scan is a whole zone scan or not.

compact_init_free_pfn

The initial free PFN which is initialized to 0 at boot time and to the last pageblock with free MIGRATE_MOVABLE pages in the zone. It is used to check if it is the start of a scan.

compact_considered

The number of compactions attempted since last failure. It is reset in defer_compaction() when a compaction fails to result in a page allocation success. It is increased by 1 in compaction_deferred() when a compaction should be skipped. compaction_deferred() is called before compact_zone() is called, compaction_defer_reset() is called when compact_zone() returns COMPACT_SUCCESS, and defer_compaction() is called when compact_zone() returns COMPACT_PARTIAL_SKIPPED or COMPACT_COMPLETE.

compact_defer_shift

The number of compactions skipped before trying again is 1<<compact_defer_shift. It is increased by 1 in defer_compaction(). It is reset in compaction_defer_reset() when a direct compaction results in a page allocation success. Its maximum value is COMPACT_MAX_DEFER_SHIFT.
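The interplay of the two counters can be sketched as an exponential back-off. The model below is a simplification written for illustration: struct model_compact and the model_*() functions are hypothetical, the handling of compact_order_failed is left out, and the value 6 used for MODEL_MAX_DEFER_SHIFT is an assumption of the sketch rather than a statement about COMPACT_MAX_DEFER_SHIFT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DEFER_SHIFT 6 /* assumed cap for this sketch */

/* Hypothetical holder of the two counters described above. */
struct model_compact {
    unsigned long considered; /* attempts seen since the last failure */
    unsigned int defer_shift; /* back-off exponent */
};

/* A compaction failed: restart counting and back off further. */
static void model_defer_compaction(struct model_compact *z)
{
    assert(z != NULL);
    z->considered = 0;
    if (z->defer_shift < MODEL_MAX_DEFER_SHIFT)
        z->defer_shift++;
}

/* Returns true if this compaction attempt should be skipped. */
static bool model_compaction_deferred(struct model_compact *z)
{
    unsigned long limit;

    assert(z != NULL);
    limit = 1UL << z->defer_shift;
    if (++z->considered >= limit) {
        z->considered = limit; /* clamp so the counter cannot overflow */
        return false;
    }
    return true;
}

/* A compaction led to a successful allocation: stop backing off. */
static void model_compaction_defer_reset(struct model_compact *z)
{
    assert(z != NULL);
    z->considered = 0;
    z->defer_shift = 0;
}
```

In this sketch, after two consecutive failures the shift is 2, so the next three attempts are skipped and the fourth one is allowed to proceed.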

compact_order_failed

The minimum compaction failed order. It is set in compaction_defer_reset() when a compaction succeeds and in defer_compaction() when a compaction fails to result in a page allocation success.

compact_blockskip_flush

Set to true when compaction migration scanner and free scanner meet, which means the PB_compact_skip bits should be cleared.

contiguous

Set to true when the zone is contiguous (in other words, no hole).

Statistics

vm_stat

VM statistics for the zone. The items tracked are defined by enum zone_stat_item.

vm_numa_event

VM NUMA event statistics for the zone. The items tracked are defined by enum numa_stat_item.

per_cpu_zonestats

Per-CPU VM statistics for the zone. It records VM statistics and VM NUMA event statistics on a per-CPU basis. It reduces updates to the global vm_stat and vm_numa_event fields of the zone to improve performance.

Pages

Stub

This section is incomplete. Please list and describe the appropriate fields.

Folios

Stub

This section is incomplete. Please list and describe the appropriate fields.

Initialization

Stub

This section is incomplete. Please list and describe the appropriate fields.