Multi-Gen LRU
The multi-gen LRU is an alternative LRU implementation that optimizes page reclaim and improves performance under memory pressure. Page reclaim decides the kernel's caching policy and ability to overcommit memory. It directly impacts the kswapd CPU usage and RAM efficiency.
Quick start
Build the kernel with the following configurations.
    CONFIG_LRU_GEN=y
    CONFIG_LRU_GEN_ENABLED=y
All set!
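To confirm that an already built kernel carries both options, one possible check is sketched below. The helper name `lru_gen_config_ok` is hypothetical, and the config file location is an assumption: many distributions install it as `/boot/config-$(uname -r)`, but that is not guaranteed.

```shell
# Hypothetical helper: succeed only if the given kernel config file sets
# both options from the quick start. The path is supplied by the caller
# because its location varies between systems (assumption).
lru_gen_config_ok() {
    config_file=$1
    grep -q '^CONFIG_LRU_GEN=y$' "$config_file" &&
        grep -q '^CONFIG_LRU_GEN_ENABLED=y$' "$config_file"
}

# Example (not run automatically):
#   lru_gen_config_ok "/boot/config-$(uname -r)" && echo "multi-gen LRU configured"
```

The function only inspects the build configuration; the runtime state is reported by the enabled file described under the kill switch.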
Runtime options
/sys/kernel/mm/lru_gen/ contains stable ABIs described in the following subsections.
Kill switch
enabled accepts different values to enable or disable the following components. Its default value depends on CONFIG_LRU_GEN_ENABLED. All the components should be enabled unless some of them have unforeseen side effects. Writing to enabled has no effect when a component is not supported by the hardware, and valid values will be accepted even when the main switch is off.
| Values | Components |
|---|---|
| 0x0001 | The main switch for the multi-gen LRU. |
| 0x0002 | Clearing the accessed bit in leaf page table entries in large batches, when MMU sets it (e.g., on x86). This behavior can theoretically worsen lock contention (mmap_lock). If it is disabled, the multi-gen LRU will suffer a minor performance degradation for workloads that contiguously map hot pages, whose accessed bits can be otherwise cleared by fewer larger batches. |
| 0x0004 | Clearing the accessed bit in non-leaf page table entries as well, when MMU sets it (e.g., on x86). This behavior was not verified on x86 varieties other than Intel and AMD. If it is disabled, the multi-gen LRU will suffer a negligible performance degradation. |
| [yYnN] | Apply to all the components above. |
E.g.,
    echo y >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0007
    echo 5 >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0005
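Since the value read back is a bitmask, a small decoder can make it easier to see which components are active. The following is a minimal sketch; `decode_lru_gen_enabled` is a hypothetical helper name and the component labels are shortened paraphrases of the table entries.

```shell
# Hypothetical helper: print one line per component whose bit is set in
# the given mask (e.g. a value read from the enabled file).
decode_lru_gen_enabled() {
    mask=$(($1))    # arithmetic expansion accepts hex such as 0x0005
    if [ $((mask & 0x0001)) -ne 0 ]; then echo "main switch"; fi
    if [ $((mask & 0x0002)) -ne 0 ]; then echo "leaf accessed-bit batching"; fi
    if [ $((mask & 0x0004)) -ne 0 ]; then echo "non-leaf accessed-bit clearing"; fi
    return 0
}

# Example (not run automatically):
#   decode_lru_gen_enabled "$(cat /sys/kernel/mm/lru_gen/enabled)"
```

For the value 0x0005 from the example, this prints the main switch and the non-leaf component and omits the leaf batching component.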
Thrashing prevention
Personal computers are more sensitive to thrashing because it can cause janks (lags when rendering UI) and negatively impact user experience. The multi-gen LRU offers thrashing prevention to the majority of laptop and desktop users who do not have oomd.
Users can write N to min_ttl_ms to prevent the working set of N milliseconds from getting evicted. The OOM killer is triggered if this working set cannot be kept in memory. In other words, this option works as an adjustable pressure relief valve, and when open, it terminates applications that are hopefully not being used.
Based on the average human detectable lag (~100ms), N=1000 usually eliminates intolerable janks due to thrashing. Larger values like N=3000 make janks less noticeable at the risk of premature OOM kills.
The default value 0 means disabled.
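As an illustration, the write to min_ttl_ms could be wrapped in a small function that rejects non-numeric input first. This is only a sketch: `set_min_ttl_ms` is a hypothetical name, the default path assumes min_ttl_ms lives in /sys/kernel/mm/lru_gen/ alongside enabled, and the optional second argument exists purely so the function can be tried against a scratch file.

```shell
# Hypothetical helper: write N (milliseconds) to min_ttl_ms.
# $1: N, a non-negative integer; 0 disables thrashing prevention.
# $2: optional target path (defaults to the assumed sysfs location).
set_min_ttl_ms() {
    ttl_ms=$1
    target=${2:-/sys/kernel/mm/lru_gen/min_ttl_ms}
    case $ttl_ms in
        '' | *[!0-9]*)
            echo "min_ttl_ms must be a non-negative integer" >&2
            return 1
            ;;
    esac
    echo "$ttl_ms" >"$target"
}

# Example (not run automatically, needs root):
#   set_min_ttl_ms 1000
```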
Experimental features
/sys/kernel/debug/lru_gen accepts commands described in the following subsections. Multiple command lines are supported, and so is concatenation with the delimiters , and ;.
/sys/kernel/debug/lru_gen_full provides additional stats for debugging. CONFIG_LRU_GEN_STATS=y keeps historical stats from evicted generations in this file.
Working set estimation
Working set estimation measures how much memory an application needs in a given time interval, and it is usually done with little impact on the performance of the application. E.g., data centers want to optimize job scheduling (bin packing) to improve memory utilization. When a new job comes in, the job scheduler needs to find out whether each server it manages can allocate a certain amount of memory for this new job before it can pick a candidate. To do so, the job scheduler needs to estimate the working sets of the existing jobs.
When it is read, lru_gen returns a histogram of numbers of pages accessed over different time intervals for each memcg and node. MAX_NR_GENS decides the number of bins for each histogram. The histograms are noncumulative.
    memcg  memcg_id  memcg_path
       node  node_id
           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
           ...
           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
Each bin contains an estimated number of pages that have been accessed within age_in_ms. E.g., min_gen_nr contains the coldest pages and max_gen_nr contains the hottest pages, since age_in_ms of the former is the largest and that of the latter is the smallest.
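A job scheduler would typically post-process this output rather than read it by eye. The sketch below flattens the histogram into one line per generation; `summarize_lru_gen` is a hypothetical helper name, and the parsing assumes whitespace-separated fields in the order given in the layout above and nothing else on the lines.

```shell
# Hypothetical helper: read lru_gen output on stdin and print, for each
# generation: memcg_id node_id gen_nr age_in_ms total_pages,
# where total_pages = nr_anon_pages + nr_file_pages.
summarize_lru_gen() {
    awk '
        $1 == "memcg" { memcg = $2; next }   # remember the current memcg
        $1 == "node"  { node = $2; next }    # remember the current node
        NF >= 4       { print memcg, node, $1, $2, $3 + $4 }
    '
}

# Example (not run automatically, needs debugfs access):
#   summarize_lru_gen </sys/kernel/debug/lru_gen
```

Because the histograms are noncumulative, summing total_pages over the oldest generations of a memcg gives an estimate of its cold memory for the chosen interval.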
Users can write the following command to lru_gen to create a new generation max_gen_nr+1:
    + memcg_id node_id max_gen_nr [can_swap [force_scan]]
can_swap defaults to the swap setting and, if it is set to 1, it forces the scan of anon pages when swap is off, and vice versa. force_scan defaults to 1 and, if it is set to 0, it employs heuristics to reduce the overhead, which is likely to reduce the coverage as well.
A typical use case is that a job scheduler runs this command at a certain time interval to create new generations, and it ranks the servers it manages based on the sizes of their cold pages defined by this time interval.
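The aging command can be assembled programmatically, for instance from a periodic job. The following sketch only builds the command string; `lru_gen_age_cmd` is a hypothetical helper name, and writing the result to /sys/kernel/debug/lru_gen is left to the caller.

```shell
# Hypothetical helper: print the command that creates generation
# max_gen_nr+1.
# $1 memcg_id, $2 node_id, $3 max_gen_nr,
# $4 optional can_swap, $5 optional force_scan (only valid with $4).
lru_gen_age_cmd() {
    cmd="+ $1 $2 $3"
    if [ -n "${4:-}" ]; then
        cmd="$cmd $4"
        if [ -n "${5:-}" ]; then
            cmd="$cmd $5"
        fi
    fi
    echo "$cmd"
}

# Example (not run automatically, needs root and debugfs):
#   lru_gen_age_cmd 1 0 8 >/sys/kernel/debug/lru_gen
```

The nesting of the optional arguments mirrors the bracket syntax of the command: force_scan can only be given when can_swap is given as well.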
Proactive reclaim
Proactive reclaim induces page reclaim when there is no memory pressure. It usually targets cold pages only. E.g., when a new job comes in, the job scheduler wants to proactively reclaim cold pages on the server it selected, to improve the chance of successfully landing this new job.
Users can write the following command to lru_gen to evict generations less than or equal to min_gen_nr.
    - memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]
min_gen_nr should be less than max_gen_nr-1, since max_gen_nr and max_gen_nr-1 are not fully aged (equivalent to the active list) and therefore cannot be evicted. swappiness overrides the default value in /proc/sys/vm/swappiness and the valid range is [0-200, max], with max being exclusively used for the reclamation of anonymous memory. nr_to_reclaim limits the number of pages to evict.
A typical use case is that a job scheduler runs this command before it tries to land a new job on a server. If it fails to materialize enough cold pages because of the overestimation, it retries on the next server according to the ranking result obtained from the working set estimation step. This less forceful approach limits the impact on the existing jobs.
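The constraint that min_gen_nr must be less than max_gen_nr-1 can be checked before the command is written. The sketch below builds the eviction command and refuses generations that are not fully aged; `lru_gen_evict_cmd` is a hypothetical helper name, and the caller is assumed to have obtained max_gen_nr by reading lru_gen beforehand.

```shell
# Hypothetical helper: print the eviction command, or fail if min_gen_nr
# is not less than max_gen_nr-1.
# $1 memcg_id, $2 node_id, $3 min_gen_nr,
# $4 current max_gen_nr (used for the check only, not sent),
# $5 optional swappiness, $6 optional nr_to_reclaim (only valid with $5).
lru_gen_evict_cmd() {
    if [ "$3" -ge $(($4 - 1)) ]; then
        echo "min_gen_nr must be less than max_gen_nr-1" >&2
        return 1
    fi
    cmd="- $1 $2 $3"
    if [ -n "${5:-}" ]; then
        cmd="$cmd $5"
        if [ -n "${6:-}" ]; then
            cmd="$cmd $6"
        fi
    fi
    echo "$cmd"
}

# Example (not run automatically, needs root and debugfs):
#   lru_gen_evict_cmd 1 0 5 8 0 1000 >/sys/kernel/debug/lru_gen
```

The swappiness argument is passed through unchanged, so it can carry either a number in the documented range or the literal max.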