Physical Memory Model

Physical memory in a system may be addressed in different ways. The simplest case is when the physical memory starts at address 0 and spans a contiguous range up to the maximal address. It could be, however, that this range contains small holes that are not accessible to the CPU. Then there could be several contiguous ranges at completely distinct addresses. And don’t forget about NUMA, where different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the three memory models: FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what memory models it supports, what the default memory model is and whether it is possible to manually override that default.

Note

At the time of this writing, DISCONTIGMEM is considered deprecated, although it is still in use by several architectures.

All the memory models track the status of physical page frames using struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one mapping between the physical page frame number (PFN) and the corresponding struct page.

Each memory model defines pfn_to_page() and page_to_pfn() helpers that allow the conversion from PFN to struct page and vice versa.

FLATMEM

The simplest memory model is FLATMEM. This model is suitable for non-NUMA systems with contiguous, or mostly contiguous, physical memory.

In the FLATMEM memory model, there is a global mem_map array that maps the entire physical memory. For most architectures, the holes have entries in the mem_map array. The struct page objects corresponding to the holes are never fully initialized.

To allocate the mem_map array, architecture specific setup code should call the free_area_init() function. Yet, the mappings array is not usable until the call to memblock_free_all() that hands all the memory to the page allocator.

If an architecture enables the CONFIG_ARCH_HAS_HOLES_MEMORYMODEL option, it may free parts of the mem_map array that do not cover the actual physical pages. In such a case, the architecture specific pfn_valid() implementation should take the holes in the mem_map into account.

With FLATMEM, the conversion between a PFN and the struct page is straightforward: PFN - ARCH_PFN_OFFSET is an index to the mem_map array.

The ARCH_PFN_OFFSET defines the first page frame number for systems with physical memory starting at an address different from 0.
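
The FLATMEM conversion described above can be sketched in user-space C. This is only an illustration, not the kernel's implementation: struct page is reduced to a stand-in with a single field, and the value of ARCH_PFN_OFFSET and the size of mem_map (NR_PAGES) are made-up assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page; the real structure holds far more. */
struct page {
    unsigned long flags;
};

/* Assumed values: memory starts at PFN 0x80000 and spans 16 page frames. */
#define ARCH_PFN_OFFSET 0x80000UL
#define NR_PAGES        16UL

/* The single global memory map of the FLATMEM model. */
static struct page mem_map[NR_PAGES];

/* PFN -> struct page: subtract the offset, then index mem_map. */
static struct page *pfn_to_page(unsigned long pfn)
{
    assert(pfn >= ARCH_PFN_OFFSET && pfn - ARCH_PFN_OFFSET < NR_PAGES);
    return &mem_map[pfn - ARCH_PFN_OFFSET];
}

/* struct page -> PFN: the array index plus the offset. */
static unsigned long page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

With these assumed values, pfn_to_page(0x80005) yields &mem_map[5], and page_to_pfn() of that pointer gives 0x80005 back.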

DISCONTIGMEM

The DISCONTIGMEM model treats the physical memory as a collection of nodes similarly to how Linux NUMA support does. For each node Linux constructs an independent memory management subsystem represented by struct pglist_data (or pg_data_t for short). Among other things, pg_data_t holds the node_mem_map array that maps physical pages belonging to that node. The node_start_pfn field of pg_data_t is the number of the first page frame belonging to that node.

The architecture setup code should call free_area_init_node() for each node in the system to initialize the pg_data_t object and its node_mem_map.

Every node_mem_map behaves exactly as FLATMEM’s mem_map: every physical page frame in a node has a struct page entry in the node_mem_map array. When DISCONTIGMEM is enabled, a portion of the flags field of the struct page encodes the node number of the node hosting that page.

The conversion between a PFN and the struct page in the DISCONTIGMEM model is slightly more complex, as it has to determine which node hosts the physical page and which pg_data_t object holds the struct page.

Architectures that support DISCONTIGMEM provide pfn_to_nid() to convert a PFN to the node number. The opposite conversion helper page_to_nid() is generic as it uses the node number encoded in page->flags.

Once the node number is known, the PFN, taken relative to node_start_pfn, can be used to index the appropriate node_mem_map array to access the struct page. In the other direction, the offset of the struct page from the node_mem_map plus node_start_pfn is the PFN of that page.
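
A minimal user-space sketch of the two-step DISCONTIGMEM lookup follows. It is an illustration under stated assumptions rather than kernel code: pg_data_t is cut down to three fields, the two nodes and their start PFNs are invented, pfn_to_nid() is a simple linear search (real architectures provide their own), and the node number is stored in the low bits of page->flags purely for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>

struct page {
    unsigned long flags;          /* sketch: low bits hold the node number */
};

/* Cut-down pg_data_t; the span field is kept only for bounds checks. */
typedef struct pglist_data {
    struct page *node_mem_map;
    unsigned long node_start_pfn;
    unsigned long node_spanned_pages;
} pg_data_t;

#define MAX_NUMNODES   2
#define NODE_FLAG_MASK 0xffUL     /* hypothetical encoding of the node number */

static struct page node0_map[8];
static struct page node1_map[8];

/* Two assumed nodes at completely distinct PFN ranges. */
static pg_data_t node_data[MAX_NUMNODES] = {
    { node0_map, 0x00000UL, 8UL },
    { node1_map, 0x40000UL, 8UL },
};

/* Record the hosting node in every page's flags. */
static void init_node_maps(void)
{
    for (int nid = 0; nid < MAX_NUMNODES; nid++)
        for (unsigned long i = 0; i < node_data[nid].node_spanned_pages; i++)
            node_data[nid].node_mem_map[i].flags = (unsigned long)nid;
}

/* Hypothetical pfn_to_nid(): find the node whose range covers the PFN. */
static int pfn_to_nid(unsigned long pfn)
{
    for (int nid = 0; nid < MAX_NUMNODES; nid++) {
        const pg_data_t *pgdat = &node_data[nid];
        if (pfn >= pgdat->node_start_pfn &&
            pfn - pgdat->node_start_pfn < pgdat->node_spanned_pages)
            return nid;
    }
    return -1;                    /* the PFN falls into a hole */
}

static int page_to_nid(const struct page *page)
{
    return (int)(page->flags & NODE_FLAG_MASK);
}

static struct page *pfn_to_page(unsigned long pfn)
{
    int nid = pfn_to_nid(pfn);
    assert(nid >= 0);
    pg_data_t *pgdat = &node_data[nid];
    return &pgdat->node_mem_map[pfn - pgdat->node_start_pfn];
}

static unsigned long page_to_pfn(const struct page *page)
{
    const pg_data_t *pgdat = &node_data[page_to_nid(page)];
    return (unsigned long)(page - pgdat->node_mem_map) + pgdat->node_start_pfn;
}
```

After init_node_maps(), PFN 0x40003 resolves to node 1 and to &node1_map[3], while a PFN between the two ranges makes pfn_to_nid() return -1.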

SPARSEMEM

SPARSEMEM is the most versatile memory model available in Linux and it is the only memory model that supports several advanced features such as hot-plug and hot-remove of the physical memory, alternative memory maps for non-volatile memory devices and deferred initialization of the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of sections. A section is represented with struct mem_section that contains section_mem_map that is, logically, a pointer to an array of struct pages. However, it is stored with some other magic that aids the sections management. The section size and maximal number of sections are specified using SECTION_SIZE_BITS and MAX_PHYSMEM_BITS constants defined by each architecture that supports SPARSEMEM. While MAX_PHYSMEM_BITS is an actual width of a physical address that an architecture supports, the SECTION_SIZE_BITS is an arbitrary value.

The maximal number of sections is denoted NR_MEM_SECTIONS and defined as

NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}
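
As a worked instance of this formula, the sketch below computes the number of sections for a given pair of constants. The function name is hypothetical, and the values 46 and 27 used in the note that follows are illustrative choices, not a statement about any particular architecture.

```c
#include <assert.h>

/*
 * Number of sections for a given physical address width and section size,
 * both expressed as bit counts: 2^(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS).
 */
static unsigned long long nr_mem_sections(unsigned int max_physmem_bits,
                                          unsigned int section_size_bits)
{
    assert(max_physmem_bits >= section_size_bits);
    assert(max_physmem_bits - section_size_bits < 64);
    return 1ULL << (max_physmem_bits - section_size_bits);
}
```

For example, with MAX_PHYSMEM_BITS = 46 and SECTION_SIZE_BITS = 27 (128 MiB sections), nr_mem_sections(46, 27) evaluates to 2^19 = 524288 sections.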

The mem_section objects are arranged in a two-dimensional array called mem_sections. The size and placement of this array depend on CONFIG_SPARSEMEM_EXTREME and the maximal possible number of sections:

  • When CONFIG_SPARSEMEM_EXTREME is disabled, the mem_sections array is static and has NR_MEM_SECTIONS rows. Each row holds a single mem_section object.
  • When CONFIG_SPARSEMEM_EXTREME is enabled, the mem_sections array is dynamically allocated. Each row contains PAGE_SIZE worth of mem_section objects and the number of rows is calculated to fit all the memory sections.
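
The two layouts can be compared with a little arithmetic. The sketch below is illustrative only: the helper names, the 4 KiB PAGE_SIZE, the 16-byte size assumed for a mem_section object, and the section count of 2^19 are all assumptions chosen to make the numbers easy to follow.

```c
#include <assert.h>

#define PAGE_SIZE        4096UL       /* assumed page size */
#define MEM_SECTION_SIZE 16UL         /* assumed sizeof(struct mem_section) */
#define NR_MEM_SECTIONS  (1UL << 19)  /* assumed maximal number of sections */

/* How many mem_section objects one row of the array holds. */
static unsigned long sections_per_root(int sparsemem_extreme)
{
    return sparsemem_extreme ? PAGE_SIZE / MEM_SECTION_SIZE : 1UL;
}

/* Number of rows needed to cover every possible section (rounded up). */
static unsigned long nr_section_roots(int sparsemem_extreme)
{
    unsigned long per_root = sections_per_root(sparsemem_extreme);
    return (NR_MEM_SECTIONS + per_root - 1) / per_root;
}

struct section_pos {
    unsigned long row;
    unsigned long col;
};

/* Row and column of a section number inside the two-dimensional array. */
static struct section_pos locate_section(unsigned long section_nr,
                                         int sparsemem_extreme)
{
    unsigned long per_root = sections_per_root(sparsemem_extreme);
    assert(section_nr < NR_MEM_SECTIONS);
    struct section_pos pos = { section_nr / per_root, section_nr % per_root };
    return pos;
}
```

Under these assumptions the static layout needs 524288 one-entry rows, whereas the dynamic layout needs 2048 rows of 256 entries each, and section 1000 lands at row 3, column 232. Only the rows that cover present memory have to be allocated in the dynamic case.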

The architecture setup code should call sparse_init() to initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the corresponding struct page: a “classic sparse” and “sparse vmemmap”. The selection is made at build time and it is determined by the value of CONFIG_SPARSEMEM_VMEMMAP.

The classic sparse encodes the section number of a page in page->flags and uses high bits of a PFN to access the section that maps that page frame. Inside a section, the PFN is the index to the array of pages.
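
The classic sparse lookup can be sketched as follows. This is a simplified user-space model with several assumptions: the geometry is tiny (16 pages per section, 4 sections), the section table is one-dimensional, section_mem_map is a plain pointer without the extra encoded state the text mentions, the section number occupies the whole of page->flags, and the low bits of the PFN are masked to index the per-section array.

```c
#include <assert.h>
#include <stddef.h>

struct page {
    unsigned long flags;              /* sketch: holds the section number */
};

/* Assumed tiny geometry so that the tables fit in a sketch. */
#define PAGE_SHIFT        12
#define SECTION_SIZE_BITS 16          /* 64 KiB sections */
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS   4UL

struct mem_section {
    struct page *section_mem_map;     /* NULL when the section is absent */
};

/* Only sections 0 and 2 are populated; 1 and 3 are holes. */
static struct page sec0_pages[PAGES_PER_SECTION];
static struct page sec2_pages[PAGES_PER_SECTION];
static struct mem_section mem_sections[NR_MEM_SECTIONS];

static void sparse_init_sketch(void)
{
    mem_sections[0].section_mem_map = sec0_pages;
    mem_sections[2].section_mem_map = sec2_pages;
    for (unsigned long i = 0; i < PAGES_PER_SECTION; i++) {
        sec0_pages[i].flags = 0UL;
        sec2_pages[i].flags = 2UL;
    }
}

/* The high bits of the PFN select the section. */
static unsigned long pfn_to_section_nr(unsigned long pfn)
{
    return pfn >> PFN_SECTION_SHIFT;
}

static struct page *pfn_to_page(unsigned long pfn)
{
    unsigned long nr = pfn_to_section_nr(pfn);
    if (nr >= NR_MEM_SECTIONS || !mem_sections[nr].section_mem_map)
        return NULL;                  /* no memory map for this PFN */
    return &mem_sections[nr].section_mem_map[pfn & (PAGES_PER_SECTION - 1)];
}

static unsigned long page_to_pfn(const struct page *page)
{
    unsigned long nr = page->flags;   /* section number recorded at init */
    assert(nr < NR_MEM_SECTIONS && mem_sections[nr].section_mem_map);
    return (nr << PFN_SECTION_SHIFT) +
           (unsigned long)(page - mem_sections[nr].section_mem_map);
}
```

After sparse_init_sketch(), PFN 0x25 maps to section 2 and to &sec2_pages[5], while PFN 0x15 belongs to the unpopulated section 1 and yields NULL.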

The sparse vmemmap uses a virtually mapped memory map to optimize pfn_to_page and page_to_pfn operations. There is a global struct page *vmemmap pointer that points to a virtually contiguous array of struct page objects. A PFN is an index to that array and the offset of the struct page from vmemmap is the PFN of that page.
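
With a virtually contiguous memory map, both conversions reduce to pointer arithmetic. The sketch below models this in user space; the backing array vmemmap_backing and its size NR_PFNS are assumptions that stand in for the reserved virtual address range.

```c
#include <assert.h>
#include <stddef.h>

struct page {
    unsigned long flags;
};

/* Assumed stand-in for the virtual range that holds the memory map. */
#define NR_PFNS 64UL
static struct page vmemmap_backing[NR_PFNS];

/* Global pointer to the start of the virtually contiguous array. */
static struct page *vmemmap = vmemmap_backing;

/* A PFN is simply an index into the array. */
static struct page *pfn_to_page(unsigned long pfn)
{
    assert(pfn < NR_PFNS);
    return vmemmap + pfn;
}

/* The offset of a struct page from vmemmap is its PFN. */
static unsigned long page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - vmemmap);
}
```

Compared with classic sparse, no section lookup is needed, which is the optimization the paragraph above refers to.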

To use vmemmap, an architecture has to reserve a range of virtual addresses that will map the physical pages containing the memory map and make sure that vmemmap points to that range. In addition, the architecture should implement the vmemmap_populate() method that will allocate the physical memory and create page tables for the virtual memory map. If an architecture does not have any special requirements for the vmemmap mappings, it can use the default vmemmap_populate_basepages() provided by the generic memory management.

The virtually mapped memory map allows storing struct page objects for persistent memory devices in pre-allocated storage on those devices. This storage is represented with struct vmem_altmap that is eventually passed to vmemmap_populate() through a long chain of function calls. The vmemmap_populate() implementation may use the vmem_altmap along with the vmemmap_alloc_block_buf() helper to allocate the memory map on the persistent memory device.

ZONE_DEVICE

The ZONE_DEVICE facility builds upon SPARSEMEM_VMEMMAP to offer struct page mem_map services for device driver identified physical address ranges. The “device” aspect of ZONE_DEVICE relates to the fact that the page objects for these address ranges are never marked online, and that a reference must be taken against the device, not just the page, to keep the memory pinned for active use. ZONE_DEVICE, via devm_memremap_pages(), performs just enough memory hotplug to turn on pfn_to_page(), page_to_pfn(), and get_user_pages() service for the given range of pfns. Since the page reference count never drops below 1 the page is never tracked as free memory and the page’s struct list_head lru space is repurposed for back referencing to the host device / driver that mapped the memory.

While SPARSEMEM presents memory as a collection of sections, optionally collected into memory blocks, ZONE_DEVICE users have a need for smaller granularity of populating the mem_map. Given that ZONE_DEVICE memory is never marked online it is subsequently never subject to its memory ranges being exposed through the sysfs memory hotplug api on memory block boundaries. The implementation relies on this lack of user-api constraint to allow sub-section sized memory ranges to be specified to arch_add_memory(), the top-half of memory hotplug. Sub-section support allows for 2MB as the cross-arch common alignment granularity for devm_memremap_pages().
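
To make the 2MB granularity concrete, the sketch below checks whether a PFN range starts and ends on sub-section boundaries. It is a hedged illustration: the helper name is hypothetical, 4 KiB pages are assumed, and the shift constants are defined locally for the example rather than taken from kernel headers.

```c
#include <assert.h>

#define PAGE_SHIFT           12       /* assumed 4 KiB pages */
#define SUBSECTION_SHIFT     21       /* 2 MiB sub-sections */
#define PFN_SUBSECTION_SHIFT (SUBSECTION_SHIFT - PAGE_SHIFT)
#define PAGES_PER_SUBSECTION (1UL << PFN_SUBSECTION_SHIFT)

/* Hypothetical helper: non-zero if the range is sub-section aligned. */
static int range_is_subsection_aligned(unsigned long start_pfn,
                                       unsigned long nr_pages)
{
    assert(start_pfn + nr_pages >= start_pfn);   /* no wrap-around */
    return nr_pages != 0 &&
           (start_pfn % PAGES_PER_SUBSECTION) == 0 &&
           (nr_pages % PAGES_PER_SUBSECTION) == 0;
}
```

With 4 KiB pages a sub-section spans 512 page frames, so a range starting at PFN 512 with 1024 pages passes the check, while a range of 100 pages does not.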

The users of ZONE_DEVICE are:

  • pmem: Map platform persistent memory to be used as a direct-I/O target via DAX mappings.
  • hmm: Extend ZONE_DEVICE with ->page_fault() and ->page_free() event callbacks to allow a device-driver to coordinate memory management events related to device-memory, typically GPU memory. See Documentation/vm/hmm.rst.
  • p2pdma: Create struct page objects to allow peer devices in a PCI/-E topology to coordinate direct-DMA operations between themselves, i.e. bypass host memory.