NUMA Memory Policy

What is NUMA Memory Policy?
In the Linux kernel, “memory policy” determines from which node the kernel will allocate memory in a NUMA system or in an emulated NUMA system. Linux has supported platforms with Non-Uniform Memory Access architectures since 2.4.?. The current memory policy support was added to Linux 2.6 around May 2004. This document attempts to describe the concepts and APIs of the 2.6 memory policy support.
Memory policies should not be confused with cpusets (Documentation/admin-guide/cgroup-v1/cpusets.rst), which is an administrative mechanism for restricting the nodes from which memory may be allocated by a set of processes. Memory policies are a programming interface that a NUMA-aware application can take advantage of. When both cpusets and policies are applied to a task, the restrictions of the cpuset take priority. See Memory Policies and cpusets below for more details.
Memory Policy Concepts

Scope of Memory Policies
The Linux kernel supports _scopes_ of memory policy, described here from most general to most specific:
- System Default Policy
this policy is “hard coded” into the kernel. It is the policy that governs all page allocations that aren’t controlled by one of the more specific policy scopes discussed below. When the system is “up and running”, the system default policy will use “local allocation” described below. However, during boot up, the system default policy will be set to interleave allocations across all nodes with “sufficient” memory, so as not to overload the initial boot node with boot-time allocations.
- Task/Process Policy
this is an optional, per-task policy. When defined for a specific task, this policy controls all page allocations made by or on behalf of the task that aren’t controlled by a more specific scope. If a task does not define a task policy, then all page allocations that would have been controlled by the task policy “fall back” to the System Default Policy.
The task policy applies to the entire address space of a task. Thus, it is inheritable, and indeed is inherited, across both fork() [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task to establish the task policy for a child task exec()’d from an executable image that has no awareness of memory policy. See the Memory Policy APIs section, below, for an overview of the system call that a task may use to set/change its task/process policy.

In a multi-threaded task, task policies apply only to the thread [Linux kernel task] that installs the policy and any threads subsequently created by that thread. Any sibling threads existing at the time a new task policy is installed retain their current policy.
A task policy applies only to pages allocated after the policy is installed. Any pages already faulted in by the task when the task changes its task policy remain where they were allocated based on the policy at the time they were allocated.
- VMA Policy
A “VMA” or “Virtual Memory Area” refers to a range of a task’s virtual address space. A task may define a specific policy for a range of its virtual address space. See the Memory Policy APIs section, below, for an overview of the mbind() system call used to set a VMA policy.

A VMA policy will govern the allocation of pages that back this region of the address space. Any regions of the task’s address space that don’t have an explicit VMA policy will fall back to the task policy, which may itself fall back to the System Default Policy.
VMA policies have a few complicating details:
- VMA policy applies ONLY to anonymous pages. These include pages allocated for anonymous segments, such as the task stack and heap, and any regions of the address space mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is applied to a file mapping, it will be ignored if the mapping used the MAP_SHARED flag. If the file mapping used the MAP_PRIVATE flag, the VMA policy will only be applied when an anonymous page is allocated on an attempt to write to the mapping--i.e., at Copy-On-Write.
- VMA policies are shared between all tasks that share a virtual address space--a.k.a. threads--independent of when the policy is installed; and they are inherited across fork(). However, because VMA policies refer to a specific region of a task’s address space, and because the address space is discarded and recreated on exec*(), VMA policies are NOT inheritable across exec(). Thus, only NUMA-aware applications may use VMA policies.

- A task may install a new VMA policy on a sub-range of a previously mmap()ed region. When this happens, Linux splits the existing virtual memory area into 2 or 3 VMAs, each with its own policy.
- By default, VMA policy applies only to pages allocated after the policy is installed. Any pages already faulted into the VMA range remain where they were allocated based on the policy at the time they were allocated. However, since 2.6.16, Linux supports page migration via the mbind() system call, so that page contents can be moved to match a newly installed policy.
- Shared Policy
Conceptually, shared policies apply to “memory objects” mapped shared into one or more tasks’ distinct address spaces. An application installs shared policies the same way as VMA policies--using the mbind() system call specifying a range of virtual addresses that map the shared object. However, unlike VMA policies, which can be considered to be an attribute of a range of a task’s address space, shared policies apply directly to the shared object. Thus, all tasks that attach to the object share the policy, and all pages allocated for the shared object, by any task, will obey the shared policy.

As of 2.6.22, only shared memory segments, created by shmget() or mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared policy support was added to Linux, the associated data structures were added to hugetlbfs shmem segments. At the time, hugetlbfs did not support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs shmem segments were never “hooked up” to the shared policy support. Although hugetlbfs segments now support lazy allocation, their support for shared policy has not been completed.

As mentioned above in the VMA policies section, allocations of page cache pages for regular files mmap()ed with MAP_SHARED ignore any VMA policy installed on the virtual address range backed by the shared file mapping. Rather, shared page cache pages, including pages backing private mappings that have not yet been written by the task, follow task policy, if any, else System Default Policy.
The shared policy infrastructure supports different policies on subset ranges of the shared object. However, Linux still splits the VMA of the task that installs the policy for each range of distinct policy. Thus, different tasks that attach to a shared memory segment can have different VMA configurations mapping that one shared object. This can be seen by examining the /proc/<pid>/numa_maps of tasks sharing a shared memory region, when one task has installed shared policy on one or more ranges of the region.
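To make the mechanics concrete, here is a minimal sketch, assuming libnuma’s <numaif.h> wrappers (from the numactl package) and a machine with at least two memory nodes: it creates a SysV shared memory segment and installs an interleave policy on it with mbind() before any pages are faulted in. The segment size and node numbers are illustrative.

#include <numaif.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	size_t len = 16UL * 1024 * 1024;	/* illustrative size */
	int shmid = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
	if (shmid < 0) { perror("shmget"); return 1; }

	void *p = shmat(shmid, NULL, 0);
	if (p == (void *)-1) { perror("shmat"); return 1; }

	/* Install the policy before first touch: interleave the whole
	   segment across nodes 0 and 1. */
	unsigned long nodes01 = (1UL << 0) | (1UL << 1);
	if (mbind(p, len, MPOL_INTERLEAVE, &nodes01, sizeof(nodes01) * 8, 0))
		perror("mbind");

	((char *)p)[0] = 1;	/* first fault obeys the shared policy */

	shmdt(p);
	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}

Because the segment is a shared memory object, the policy attaches to the object itself, so any other task that attaches the segment allocates its pages under the same policy.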
Components of Memory Policies
A NUMA memory policy consists of a “mode”, optional mode flags, and an optional set of nodes. The mode determines the behavior of the policy, the optional mode flags determine the behavior of the mode, and the optional set of nodes can be viewed as the arguments to the policy behavior.
Internally, memory policies are implemented by a reference counted structure, struct mempolicy. Details of this structure will be discussed in context, below, as required to explain the behavior.
NUMA memory policy supports the following behavioral modes:
- Default Mode--MPOL_DEFAULT
This mode is only used in the memory policy APIs. Internally, MPOL_DEFAULT is converted to the NULL memory policy in all policy scopes. Any existing non-default policy will simply be removed when MPOL_DEFAULT is specified. As a result, MPOL_DEFAULT means “fall back to the next most specific policy scope.”
For example, a NULL or default task policy will fall back to the system default policy. A NULL or default vma policy will fall back to the task policy.
When specified in one of the memory policy APIs, the Default mode does not use the optional set of nodes.
It is an error for the set of nodes specified for this policy to be non-empty.
- MPOL_BIND
This mode specifies that memory must come from the set of nodes specified by the policy. Memory will be allocated from the node in the set with sufficient free memory that is closest to the node where the allocation takes place.
- MPOL_PREFERRED
This mode specifies that the allocation should be attempted from the single node specified in the policy. If that allocation fails, the kernel will search other nodes, in order of increasing distance from the preferred node based on information provided by the platform firmware.
Internally, the Preferred policy uses a single node--the preferred_node member of struct mempolicy. When the internal mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and the policy is interpreted as local allocation. “Local” allocation policy can be viewed as a Preferred policy that starts at the node containing the cpu where the allocation takes place.

It is possible for the user to specify that local allocation is always preferred by passing an empty nodemask with this mode (see the sketch following this list). If an empty nodemask is passed, the policy cannot use the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described below.
- MPOL_INTERLEAVED
This mode specifies that page allocations be interleaved, on a page granularity, across the nodes specified in the policy. This mode also behaves slightly differently, based on the context where it is used:
For allocation of anonymous pages and shared memory pages, Interleave mode indexes the set of nodes specified by the policy using the page offset of the faulting address into the segment [VMA] containing the address modulo the number of nodes specified by the policy. It then attempts to allocate a page, starting at the selected node, as if the node had been specified by a Preferred policy or had been selected by a local allocation. That is, allocation will follow the per node zonelist.
For allocation of page cache pages, Interleave mode indexes the set of nodes specified by the policy using a node counter maintained per task. This counter wraps around to the lowest specified node after it reaches the highest specified node. This will tend to spread the pages out over the nodes specified by the policy based on the order in which they are allocated, rather than based on any page offset into an address range or file. During system boot up, the temporary interleaved system default policy works in this mode.
- MPOL_PREFERRED_MANY
This mode specifies that the allocation should be preferably satisfied from the nodemask specified in the policy. If there is memory pressure on all nodes in the nodemask, the allocation can fall back to all existing NUMA nodes. This is effectively MPOL_PREFERRED allowed for a mask rather than a single node.
- MPOL_WEIGHTED_INTERLEAVE
This mode operates the same as MPOL_INTERLEAVE, except that interleaving behavior is executed based on weights set in /sys/kernel/mm/mempolicy/weighted_interleave/
Weighted interleave allocates pages on nodes according to a weight. For example, if nodes [0,1] are weighted [5,2], 5 pages will be allocated on node0 for every 2 pages allocated on node1.
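As a concrete illustration, the following minimal sketch installs several of the modes above as the task policy. It assumes libnuma’s <numaif.h> wrappers from the numactl package, a machine with at least two nodes, and headers/kernel recent enough to define MPOL_PREFERRED_MANY (v5.15+); node numbers are illustrative.

#include <numaif.h>
#include <stdio.h>

int main(void)
{
	/* Nodemask naming nodes 0 and 1; maxnode is the mask width in bits. */
	unsigned long nodes01 = (1UL << 0) | (1UL << 1);
	unsigned long maxnode = sizeof(nodes01) * 8;

	/* MPOL_INTERLEAVE: spread this task's future allocations, page
	   by page, across nodes 0 and 1. */
	if (set_mempolicy(MPOL_INTERLEAVE, &nodes01, maxnode))
		perror("MPOL_INTERLEAVE");

	/* MPOL_PREFERRED with an empty nodemask: local allocation. */
	if (set_mempolicy(MPOL_PREFERRED, NULL, 0))
		perror("MPOL_PREFERRED (local)");

#ifdef MPOL_PREFERRED_MANY
	/* MPOL_PREFERRED_MANY: prefer nodes 0-1, but allow fallback to
	   any node under memory pressure. */
	if (set_mempolicy(MPOL_PREFERRED_MANY, &nodes01, maxnode))
		perror("MPOL_PREFERRED_MANY");
#endif

	/* MPOL_DEFAULT: drop the task policy and fall back to the next
	   most specific scope (here, the system default policy). */
	if (set_mempolicy(MPOL_DEFAULT, NULL, 0))
		perror("MPOL_DEFAULT");
	return 0;
}

From the command line, numactl --interleave=0,1 <program> installs a similar interleave task policy before exec()ing the program.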
NUMA memory policy supports the following optional mode flags:
- MPOL_F_STATIC_NODES
This flag specifies that the nodemask passed by the user should not be remapped if the task or VMA’s set of allowed nodes changes after the memory policy has been defined.
Without this flag, any time a mempolicy is rebound because of a change in the set of allowed nodes, the preferred nodemask (Preferred Many), preferred node (Preferred) or nodemask (Bind, Interleave) is remapped to the new set of allowed nodes. This may result in nodes being used that were previously undesired.
With this flag, if the user-specified nodes overlap with the nodes allowed by the task’s cpuset, then the memory policy is applied to their intersection. If the two sets of nodes do not overlap, the Default policy is used.
For example, consider a task that is attached to a cpuset with mems 1-3 that sets an Interleave policy over the same set. If the cpuset’s mems change to 3-5, the Interleave will now occur over nodes 3, 4, and 5. With this flag, however, since only node 3 is allowed from the user’s nodemask, the “interleave” only occurs over that node. If no nodes from the user’s nodemask are now allowed, the Default behavior is used.
MPOL_F_STATIC_NODES cannot be combined with the MPOL_F_RELATIVE_NODES flag. It also cannot be used for MPOL_PREFERRED policies that were created with an empty nodemask (local allocation).
- MPOL_F_RELATIVE_NODES
This flag specifies that the nodemask passed by the user will be mapped relative to the task or VMA’s set of allowed nodes. The kernel stores the user-passed nodemask, and if the allowed nodes change, then that original nodemask will be remapped relative to the new set of allowed nodes.
Without this flag (and without MPOL_F_STATIC_NODES), any time a mempolicy is rebound because of a change in the set of allowed nodes, the node (Preferred) or nodemask (Bind, Interleave) is remapped to the new set of allowed nodes. That remap may not preserve the relative nature of the user’s passed nodemask to its set of allowed nodes upon successive rebinds: a nodemask of 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of allowed nodes is restored to its original state.
With this flag, the remap is done so that the node numbers from the user’s passed nodemask are relative to the set of allowed nodes. In other words, if nodes 0, 2, and 4 are set in the user’s nodemask, the policy will be effected over the first (and in the Bind or Interleave case, the third and fifth) nodes in the set of allowed nodes. The nodemask passed by the user represents nodes relative to task or VMA’s set of allowed nodes.
If the user’s nodemask includes nodes that are outside the range of the new set of allowed nodes (for example, node 5 is set in the user’s nodemask when the set of allowed nodes is only 0-3), then the remap wraps around to the beginning of the nodemask and, if not already set, sets the node in the mempolicy nodemask.
For example, consider a task that is attached to a cpuset with mems 2-5 that sets an Interleave policy over the same set with MPOL_F_RELATIVE_NODES. If the cpuset’s mems change to 3-7, the interleave now occurs over nodes 3,5-7. If the cpuset’s mems then change to 0,2-3,5, then the interleave occurs over nodes 0,2-3,5.
Thanks to the consistent remapping, applications preparing nodemasks to specify memory policies using this flag should disregard their current, actual cpuset imposed memory placement and prepare the nodemask as if they were always located on memory nodes 0 to N-1, where N is the number of memory nodes the policy is intended to manage. Let the kernel then remap to the set of memory nodes allowed by the task’s cpuset, as that may change over time (see the sketch below).
MPOL_F_RELATIVE_NODES cannot be combined with the MPOL_F_STATIC_NODES flag. It also cannot be used for MPOL_PREFERRED policies that were created with an empty nodemask (local allocation).
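As a minimal sketch of MPOL_F_RELATIVE_NODES (again assuming libnuma’s <numaif.h> from the numactl package), the nodemask below names the first and second *allowed* nodes, whatever physical node IDs the task’s cpuset currently permits:

#include <numaif.h>
#include <stdio.h>

int main(void)
{
	/* Bits 0 and 1 mean "the first and second node of whatever this
	   task is allowed to use", not physical nodes 0 and 1. */
	unsigned long relmask = (1UL << 0) | (1UL << 1);

	if (set_mempolicy(MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES,
			  &relmask, sizeof(relmask) * 8))
		perror("set_mempolicy");
	return 0;
}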
Memory Policy Reference Counting
To resolve use/free races, struct mempolicy contains an atomic reference count field. Internal interfaces, mpol_get()/mpol_put(), increment and decrement this reference count, respectively. mpol_put() will only free the structure back to the mempolicy kmem cache when the reference count goes to zero.
When a new memory policy is allocated, its reference count is initialized to ‘1’, representing the reference held by the task that is installing the new policy. When a pointer to a memory policy structure is stored in another structure, another reference is added, as the task’s reference will be dropped on completion of the policy installation.
During run-time “usage” of the policy, we attempt to minimize atomic operations on the reference count, as this can lead to cache lines bouncing between cpus and NUMA nodes. “Usage” here means one of the following:
- querying of the policy, either by the task itself [using the get_mempolicy() API discussed below] or by another task using the /proc/<pid>/numa_maps interface.

- examination of the policy to determine the policy mode and associated node or node lists, if any, for page allocation. This is considered a “hot path”. Note that for MPOL_BIND, the “usage” extends across the entire allocation process, which may sleep during page reclamation, because the BIND policy nodemask is used, by reference, to filter ineligible nodes.
We can avoid taking an extra reference during the usages listed above as follows:
- we never need to get/free the system default policy as this is never changed nor freed, once the system is up and running.

- for querying the policy, we do not need to take an extra reference on the target task’s task policy nor vma policies because we always acquire the task’s mm’s mmap_lock for read during the query. The set_mempolicy() and mbind() APIs [see below] always acquire the mmap_lock for write when installing or replacing task or vma policies. Thus, there is no possibility of a task or thread freeing a policy while another task or thread is querying it.

- page allocation usage of task or vma policy occurs in the fault path where we hold the mmap_lock for read. Again, because replacing the task or vma policy requires that the mmap_lock be held for write, the policy can’t be freed out from under us while we’re using it for page allocation.
Shared policies require special consideration. One task can replace a shared memory policy while another task, with a distinct mmap_lock, is querying or allocating a page based on the policy. To resolve this potential race, the shared policy infrastructure adds an extra reference to the shared policy during lookup while holding a spin lock on the shared policy management structure. This requires that we drop this extra reference when we’re finished “using” the policy. We must drop the extra reference on shared policies in the same query/allocation paths used for non-shared policies. For this reason, shared policies are marked as such, and the extra reference is dropped “conditionally”--i.e., only for shared policies.
Because of this extra reference counting, and because we must look up shared policies in a tree structure under spinlock, shared policies are more expensive to use in the page allocation path. This is especially true for shared policies on shared memory regions shared by tasks running on different NUMA nodes. This extra overhead can be avoided by always falling back to task or system default policy for shared memory regions, or by prefaulting the entire shared memory region into memory and locking it down. However, this might not be appropriate for all applications.
Memory Policy APIs
Linux supports 4 system calls for controlling memory policy. These APIs always affect only the calling task, the calling task’s address space, or some shared object mapped into the calling task’s address space.
Note
the headers that define these APIs and the parameter data types for user space applications reside in a package that is not part of the Linux kernel. The kernel system call interfaces, with the ‘sys_’ prefix, are defined in <linux/syscalls.h>; the mode and flag definitions are defined in <linux/mempolicy.h>.
Set [Task] Memory Policy:
long set_mempolicy(int mode, const unsigned long *nmask, unsigned long maxnode);
Sets the calling task’s “task/process memory policy” to the mode specified by the ‘mode’ argument and the set of nodes defined by ‘nmask’. ‘nmask’ points to a bit mask of node ids containing at least ‘maxnode’ ids. Optional mode flags may be passed by combining the ‘mode’ argument with the flag (for example: MPOL_INTERLEAVE | MPOL_F_STATIC_NODES).
See the set_mempolicy(2) man page for more details.
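For example, a minimal sketch (assuming libnuma’s <numaif.h> wrapper from the numactl package) that interleaves the task’s future anonymous allocations across nodes 0 and 1, pinning the nodemask with MPOL_F_STATIC_NODES; node numbers are illustrative:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	unsigned long mask = (1UL << 0) | (1UL << 1);	/* nodes 0 and 1 */

	if (set_mempolicy(MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
			  &mask, sizeof(mask) * 8)) {
		perror("set_mempolicy");
		return 1;
	}

	/* Pages are placed under the new policy as they are faulted in. */
	char *buf = malloc(1 << 20);
	if (buf)
		buf[0] = 1;
	free(buf);
	return 0;
}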
Get [Task] Memory Policy or Related Information:
long get_mempolicy(int *mode, const unsigned long *nmask, unsigned long maxnode, void *addr, int flags);
Queries the “task/process memory policy” of the calling task, or the policy or location of a specified virtual address, depending on the ‘flags’ argument.
See the get_mempolicy(2) man page for more details.
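A minimal sketch (assuming <numaif.h>) that first retrieves the task policy, then uses MPOL_F_NODE|MPOL_F_ADDR to ask which node currently backs a specific page:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int mode;
	unsigned long mask = 0;

	/* Query the calling task's policy mode and nodemask. */
	if (get_mempolicy(&mode, &mask, sizeof(mask) * 8, NULL, 0))
		perror("get_mempolicy (task policy)");
	else
		printf("task policy: mode=%d nodemask=0x%lx\n", mode, mask);

	/* With MPOL_F_NODE|MPOL_F_ADDR, 'mode' instead receives the ID
	   of the node backing the page at 'addr', faulting it in first
	   if necessary. */
	char *addr = malloc(4096);
	if (!addr)
		return 1;
	addr[0] = 1;
	if (get_mempolicy(&mode, NULL, 0, addr, MPOL_F_NODE | MPOL_F_ADDR))
		perror("get_mempolicy (address)");
	else
		printf("page at %p is on node %d\n", (void *)addr, mode);

	free(addr);
	return 0;
}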
Install VMA/Shared Policy for a Range of Task’s Address Space:
long mbind(void *start, unsigned long len, int mode, const unsigned long *nmask, unsigned long maxnode, unsigned flags);
mbind() installs the policy specified by (mode, nmask, maxnode) as a VMA policy for the range of the calling task’s address space specified by the ‘start’ and ‘len’ arguments. Additional actions may be requested via the ‘flags’ argument.
See the mbind(2) man page for more details.
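A minimal sketch (assuming <numaif.h>) that binds an anonymous mapping to node 0 and, via MPOL_MF_MOVE, asks the kernel to migrate any pages already faulted in; the mapping size and node are illustrative:

#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 8UL * 1024 * 1024;	/* illustrative size */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	/* Bind the range to node 0; MPOL_MF_MOVE also migrates any
	   pages that were already faulted into the range. */
	unsigned long node0 = 1UL << 0;
	if (mbind(p, len, MPOL_BIND, &node0, sizeof(node0) * 8,
		  MPOL_MF_MOVE))
		perror("mbind");

	munmap(p, len);
	return 0;
}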
Set home node for a Range of Task’s Address Space:
long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, unsigned long home_node, unsigned long flags);
sys_set_mempolicy_home_node() sets the home node for a VMA policy present in the task’s address range. The system call updates the home node only for the existing mempolicy range; other address ranges are ignored. The home node is the NUMA node from which page allocations will preferentially come. Specifying a home node overrides the default behavior of allocating memory close to the CPU on which the task is executing.
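Since this system call may lack a libc or libnuma wrapper, a minimal sketch can invoke it via syscall(2). It assumes a v5.17+ kernel (where the call was introduced); the fallback syscall number, sizes, and nodes are illustrative assumptions. The range must already carry a suitable VMA policy, so the sketch first binds it to nodes 0-1 with mbind():

#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_set_mempolicy_home_node
#define __NR_set_mempolicy_home_node 450	/* assumed; v5.17+ */
#endif

int main(void)
{
	size_t len = 4UL * 1024 * 1024;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	/* Install a VMA policy over the range: bind to nodes 0 and 1. */
	unsigned long nodes01 = (1UL << 0) | (1UL << 1);
	if (mbind(p, len, MPOL_BIND, &nodes01, sizeof(nodes01) * 8, 0))
		perror("mbind");

	/* Prefer node 1 as the "home" for allocations in [p, p+len). */
	if (syscall(__NR_set_mempolicy_home_node,
		    (unsigned long)p, len, 1UL, 0UL))
		perror("set_mempolicy_home_node");

	munmap(p, len);
	return 0;
}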
Memory Policy Command Line Interface
Although not strictly part of the Linux implementation of memory policy, a command line tool, numactl(8), exists that allows one to:
- set the task policy for a specified program via set_mempolicy(2), fork(2) and exec(2)

- set the shared policy for a shared memory segment via mbind(2)
The numactl(8) tool is packaged with the run-time version of the library containing the memory policy system call wrappers. Some distributions package the headers and compile-time libraries in a separate development package.
Memory Policies and cpusets
Memory policies work within cpusets as described above. For memory policies that require a node or set of nodes, the nodes are restricted to the set of nodes whose memories are allowed by the cpuset constraints. If the nodemask specified for the policy contains nodes that are not allowed by the cpuset and MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes specified for the policy and the set of nodes with memory is used. If the result is the empty set, the policy is considered invalid and cannot be installed. If MPOL_F_RELATIVE_NODES is used, the policy’s nodes are mapped onto and folded into the task’s set of allowed nodes as previously described.
The interaction of memory policies and cpusets can be problematic when tasks in two cpusets share access to a memory region, such as shared memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags. If any of the tasks installs shared policy on the region, only nodes whose memories are allowed in both cpusets may be used in the policies. Obtaining this information requires “stepping outside” the memory policy APIs to use the cpuset information, and requires that one know in what cpusets other tasks might be attaching to the shared region. Furthermore, if the cpusets’ allowed memory sets are disjoint, “local” allocation is the only valid policy.