Multi-Gen LRU¶

The multi-gen LRU is an alternative LRU implementation that optimizespage reclaim and improves performance under memory pressure. Pagereclaim decides the kernel’s caching policy and ability to overcommitmemory. It directly impacts the kswapd CPU usage and RAM efficiency.

Design overview¶

Objectives¶

The design objectives are:

Good representation of access recency
Try to profit from spatial locality
Fast paths to make obvious choices
Simple self-correcting heuristics

The representation of access recency is at the core of all LRUimplementations. In the multi-gen LRU, each generation represents agroup of pages with similar access recency. Generations establish a(time-based) common frame of reference and therefore help make betterchoices, e.g., between different memcgs on a computer or differentcomputers in a data center (for job scheduling).

Exploiting spatial locality improves efficiency when gathering theaccessed bit. A rmap walk targets a single page and does not try toprofit from discovering a young PTE. A page table walk can sweep allthe young PTEs in an address space, but the address space can be toosparse to make a profit. The key is to optimize both methods and usethem in combination.

Fast paths reduce code complexity and runtime overhead. Unmapped pagesdo not require TLB flushes; clean pages do not require writeback.These facts are only helpful when other conditions, e.g., accessrecency, are similar. With generations as a common frame of reference,additional factors stand out. But obvious choices might not be goodchoices; thus self-correction is necessary.

The benefits of simple self-correcting heuristics are self-evident.Again, with generations as a common frame of reference, this becomesattainable. Specifically, pages in the same generation can becategorized based on additional factors, and a feedback loop canstatistically compare the refault percentages across those categoriesand infer which of them are better choices.

Assumptions¶

The protection of hot pages and the selection of cold pages are basedon page access channels and patterns. There are two access channels:

Accesses through page tables
Accesses through file descriptors

The protection of the former channel is by design stronger because:

The uncertainty in determining the access patterns of the formerchannel is higher due to the approximation of the accessed bit.
The cost of evicting the former channel is higher due to the TLBflushes required and the likelihood of encountering the dirty bit.
The penalty of underprotecting the former channel is higher becauseapplications usually do not prepare themselves for major pagefaults like they do for blocked I/O. E.g., GUI applicationscommonly use dedicated I/O threads to avoid blocking renderingthreads.

There are also two access patterns:

Accesses exhibiting temporal locality
Accesses not exhibiting temporal locality

For the reasons listed above, the former channel is assumed to followthe former pattern unlessVM_SEQ_READ orVM_RAND_READ ispresent, and the latter channel is assumed to follow the latterpattern unless outlying refaults have been observed.

Workflow overview¶

Evictable pages are divided into multiple generations for eachlruvec. The youngest generation number is stored inlrugen->max_seq for both anon and file types as they are aged onan equal footing. The oldest generation numbers are stored inlrugen->min_seq[] separately for anon and file types as clean filepages can be evicted regardless of swap constraints. These threevariables are monotonically increasing.

Generation numbers are truncated intoorder_base_2(MAX_NR_GENS+1)bits in order to fit into the gen counter infolio->flags. Eachtruncated generation number is an index tolrugen->folios[]. Thesliding window technique is used to track at leastMIN_NR_GENS andat mostMAX_NR_GENS generations. The gen counter stores a valuewithin[1,MAX_NR_GENS] while a page is on one oflrugen->folios[]; otherwise it stores zero.

Each generation is divided into multiple tiers. A page accessedNtimes through file descriptors is in tierorder_base_2(N). Unlikegenerations, tiers do not have dedicatedlrugen->folios[]. Incontrast to moving across generations, which requires the LRU lock,moving across tiers only involves atomic operations onfolio->flags and therefore has a negligible cost. A feedback loopmodeled after the PID controller monitors refaults over all the tiersfrom anon and file types and decides which tiers from which types toevict or protect. The desired effect is to balance refault percentagesbetween anon and file types proportional to the swappiness level.

There are two conceptually independent procedures: the aging and theeviction. They form a closed-loop system, i.e., the page reclaim.

Aging¶

The aging produces young generations. Given anlruvec, itincrementsmax_seq whenmax_seq-min_seq+1 approachesMIN_NR_GENS. The aging promotes hot pages to the youngestgeneration when it finds them accessed through page tables; thedemotion of cold pages happens consequently when it incrementsmax_seq. The aging uses page table walks and rmap walks to findyoung PTEs. For the former, it iterateslruvec_memcg()->mm_listand callswalk_page_range() with eachmm_struct on this listto scan PTEs, and after each iteration, it incrementsmax_seq. Forthe latter, when the eviction walks the rmap and finds a young PTE,the aging scans the adjacent PTEs. For both, on finding a young PTE,the aging clears the accessed bit and updates the gen counter of thepage mapped by this PTE to(max_seq%MAX_NR_GENS)+1.

Eviction¶

The eviction consumes old generations. Given anlruvec, itincrementsmin_seq whenlrugen->folios[] indexed bymin_seq%MAX_NR_GENS becomes empty. To select a type and a tier toevict from, it first comparesmin_seq[] to select the older type.If both types are equally old, it selects the one whose first tier hasa lower refault percentage. The first tier contains single-useunmapped clean pages, which are the best bet. The eviction sorts apage according to its gen counter if the aging has found this pageaccessed through page tables and updated its gen counter. It alsomoves a page to the next generation, i.e.,min_seq+1, if this pagewas accessed multiple times through file descriptors and the feedbackloop has detected outlying refaults from the tier this page is in. Tothis end, the feedback loop uses the first tier as the baseline, forthe reason stated earlier.

Working set protection¶

Each generation is timestamped at birth. Iflru_gen_min_ttl isset, anlruvec is protected from the eviction when its oldestgeneration was born withinlru_gen_min_ttl milliseconds. In otherwords, it prevents the working set oflru_gen_min_ttl millisecondsfrom getting evicted. The OOM killer is triggered if this working setcannot be kept in memory.

This time-based approach has the following advantages:

It is easier to configure because it is agnostic to applicationsand memory sizes.
It is more reliable because it is directly wired to the OOM killer.

`mm_struct` list¶

Anmm_struct list is maintained for each memcg, and anmm_struct follows its owner task to the new memcg when this taskis migrated.

A page table walker iterateslruvec_memcg()->mm_list and callswalk_page_range() with eachmm_struct on this list to scanPTEs. When multiple page table walkers iterate the same list, each ofthem gets a uniquemm_struct, and therefore they can run inparallel.

Page table walkers ignore any misplaced pages, e.g., if anmm_struct was migrated, pages left in the previous memcg will beignored when the current memcg is under reclaim. Similarly, page tablewalkers will ignore pages from nodes other than the one under reclaim.

This infrastructure also tracks the usage ofmm_struct betweencontext switches so that page table walkers can skip processes thathave been sleeping since the last iteration.

Rmap/PT walk feedback¶

Searching the rmap for PTEs mapping each page on an LRU list (to testand clear the accessed bit) can be expensive because pages fromdifferent VMAs (PA space) are not cache friendly to the rmap (VAspace). For workloads mostly using mapped pages, searching the rmapcan incur the highest CPU cost in the reclaim path.

lru_gen_look_around() exploits spatial locality to reduce thetrips into the rmap. It scans the adjacent PTEs of a young PTE andpromotes hot pages. If the scan was done cacheline efficiently, itadds the PMD entry pointing to the PTE table to the Bloom filter. Thisforms a feedback loop between the eviction and the aging.

Bloom filters¶

Bloom filters are a space and memory efficient data structure for setmembership test, i.e., test if an element is not in the set or may bein the set.

In the eviction path, specifically, inlru_gen_look_around(), if aPMD has a sufficient number of hot pages, its address is placed in thefilter. In the aging path, set membership means that the PTE rangewill be scanned for young pages.

Note that Bloom filters are probabilistic on set membership. If a testis false positive, the cost is an additional scan of a range of PTEs,which may yield hot pages anyway. Parameters of the filter itself cancontrol the false positive rate in the limit.

PID controller¶

A feedback loop modeled after the Proportional-Integral-Derivative(PID) controller monitors refaults over anon and file types anddecides which type to evict when both types are available from thesame generation.

The PID controller uses generations rather than the wall clock as thetime domain because a CPU can scan pages at different rates undervarying memory pressure. It calculates a moving average for each newgeneration to avoid being permanently locked in a suboptimal state.

Memcg LRU¶

An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,since each node and memcg combination has an LRU of folios (seemem_cgroup_lruvec()). Its goal is to improve the scalability ofglobal reclaim, which is critical to system-wide memory overcommit indata centers. Note that memcg LRU only applies to global reclaim.

The basic structure of an memcg LRU can be understood by an analogy tothe active/inactive LRU (of folios):

It has the young and the old (generations), i.e., the counterpartsto the active and the inactive;
The increment ofmax_seq triggers promotion, i.e., thecounterpart to activation;
Other events trigger similar operations, e.g., offlining an memcgtriggers demotion, i.e., the counterpart to deactivation.

In terms of global reclaim, it has two distinct features:

Sharding, which allows each thread to start at a random memcg (inthe old generation) and improves parallelism;
Eventual fairness, which allows direct reclaim to bail out at willand reduces latency without affecting fairness over some time.

In terms of traversing memcgs during global reclaim, it improves thebest-case complexity from O(n) to O(1) and does not affect theworst-case complexity O(n). Therefore, on average, it has a sublinearcomplexity.

Summary¶

The multi-gen LRU (of folios) can be disassembled into the followingparts:

Generations
Rmap walks
Page table walks viamm_struct list
Bloom filters for rmap/PT walk feedback
PID controller for refault feedback

The aging and the eviction form a producer-consumer model;specifically, the latter drives the former by the sliding window overgenerations. Within the aging, rmap walks drive page table walks byinserting hot densely populated page tables to the Bloom filters.Within the eviction, the PID controller uses refaults as the feedbackto select types to evict and tiers to protect.

Movatterモバイル変換