Kernel Samepage Merging¶
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,added to the Linux kernel in 2.6.32. Seemm/ksm.c for its implementation,andhttp://lwn.net/Articles/306704/ andhttps://lwn.net/Articles/330589/
The userspace interface of KSM is described inDocumentation/admin-guide/mm/ksm.rst
Design¶
Overview¶
A few notes about the KSM scanning process,to make it easier to understand the data structures below:
In order to reduce excessive scanning, KSM sorts the memory pages by theircontents into a data structure that holds pointers to the pages’ locations.
Since the contents of the pages may change at any moment, KSM cannot justinsert the pages into a normal sorted tree and expect it to find anything.Therefore KSM uses two data structures - the stable and the unstable tree.
The stable tree holds pointers to all the merged pages (ksm pages), sortedby their contents. Because each such page is write-protected, searching onthis tree is fully assured to be working (except when pages are unmapped),and therefore this tree is called the stable tree.
The stable tree node includes information required for reversemapping from a KSM page to virtual addresses that map this page.
In order to avoid large latencies of the rmap walks on KSM pages,KSM maintains two types of nodes in the stable tree:
- the regular nodes that keep the reverse mapping structures in alinked list
- the “chains” that link nodes (“dups”) that represent the samewrite protected memory content, but each “dup” corresponds to adifferent KSM page copy of that content
Internally, the regular nodes, “dups” and “chains” are representedusing the samestructstable_node structure.
In addition to the stable tree, KSM uses a second data structure called theunstable tree: this tree holds pointers to pages which have been found tobe “unchanged for a period of time”. The unstable tree sorts these pagesby their contents, but since they are not write-protected, KSM cannot relyupon the unstable tree to work correctly - the unstable tree is liable tobe corrupted as its contents are modified, and so it is called unstable.
KSM solves this problem by several techniques:
- The unstable tree is flushed every time KSM completes scanning allmemory areas, and then the tree is rebuilt again from the beginning.
- KSM will only insert into the unstable tree, pages whose hash valuehas not changed since the previous scan of all memory areas.
- The unstable tree is a RedBlack Tree - so its balancing is based on thecolors of the nodes and not on their contents, assuring that even whenthe tree gets “corrupted” it won’t get out of balance, so scanning timeremains the same (also, searching and inserting nodes in an rbtree usesthe same algorithm, so we have no overhead when we flush and rebuild).
- KSM never flushes the stable tree, which means that even if it were totake 10 attempts to find a page in the unstable tree, once it is found,it is secured in the stable tree. (When we scan a new page, we firstcompare it against the stable tree, and then against the unstable tree.)
If the merge_across_nodes tunable is unset, then KSM maintains multiplestable trees and multiple unstable trees: one of each for each NUMA node.
Reverse mapping¶
KSM maintains reverse mapping information for KSM pages in the stabletree.
If a KSM page is shared between less thanmax_page_sharing VMAs,the node of the stable tree that represents such KSM page points to alist ofstructrmap_item and thepage->mapping of theKSM page points to the stable tree node.
When the sharing passes this threshold, KSM adds a second dimension tothe stable tree. The tree node becomes a “chain” that links one ormore “dups”. Each “dup” keeps reverse mapping information for a KSMpage withpage->mapping pointing to that “dup”.
Every “chain” and all “dups” linked into a “chain” enforce theinvariant that they represent the same write protected memory content,even if each “dup” will be pointed by a different KSM page copy ofthat content.
This way the stable tree lookup computational complexity is unaffectedif compared to an unlimited list of reverse mappings. It is stillenforced that there cannot be KSM page content duplicates in thestable tree itself.
The deduplication limit enforced bymax_page_sharing is requiredto avoid the virtual memory rmap lists to grow too large. The rmapwalk has O(N) complexity where N is the number of rmap_items(i.e. virtual mappings) that are sharing the page, which is in turncapped bymax_page_sharing. So this effectively spreads the linearO(N) computational complexity from rmap walk context over differentKSM pages. The ksmd walk over the stable_node “chains” is also O(N),but N is the number of stable_node “dups”, not the number ofrmap_items, so it has not a significant impact on ksmd performance. Inpractice the best stable_node “dup” candidate will be kept and foundat the head of the “dups” list.
High values ofmax_page_sharing result in faster memory merging(because there will be fewer stable_node dups queued into thestable_node chain->hlist to check for pruning) and higherdeduplication factor at the expense of slower worst case for rmapwalks for any KSM page which can happen during swapping, compaction,NUMA balancing and page migration.
Thestable_node_dups/stable_node_chains ratio is also affected by themax_page_sharing tunable, and an high ratio may indicate fragmentationin the stable_node dups, which could be solved by introducingfragmentation algorithms in ksmd which would refile rmap_items fromone stable_node dup to another stable_node dup, in order to free upstable_node “dups” with few rmap_items in them, but that may increasethe ksmd CPU usage and possibly slowdown the readonly computations onthe KSM pages of the applications.
The whole list of stable_node “dups” linked in the stable_node“chains” is scanned periodically in order to prune stale stable_nodes.The frequency of such scans is defined bystable_node_chains_prune_millisecs sysfs tunable.
Reference¶
- struct
mm_slot¶ ksm information per mm that is being scanned
Definition
struct mm_slot { struct hlist_node link; struct list_head mm_list; struct rmap_item *rmap_list; struct mm_struct *mm;};Members
link- link to the mm_slots hash list
mm_list- link into the mm_slots list, rooted in ksm_mm_head
rmap_list- head for this mm_slot’s singly-linked list of rmap_items
mm- the mm that this information is valid for
- struct
ksm_scan¶ cursor for scanning
Definition
struct ksm_scan { struct mm_slot *mm_slot; unsigned long address; struct rmap_item **rmap_list; unsigned long seqnr;};Members
mm_slot- the current mm_slot we are scanning
address- the next address inside that to be scanned
rmap_list- link to the next rmap to be scanned in the rmap_list
seqnr- count of completed full scans (needed when removing unstable node)
Description
There is only the one ksm_scan instance of this cursor structure.
- struct
stable_node¶ node of the stable rbtree
Definition
struct stable_node { union { struct rb_node node; struct { struct list_head *head; struct { struct hlist_node hlist_dup; struct list_head list; }; }; }; struct hlist_head hlist; union { unsigned long kpfn; unsigned long chain_prune_time; };#define STABLE_NODE_CHAIN -1024; int rmap_hlist_len;#ifdef CONFIG_NUMA; int nid;#endif;};Members
{unnamed_union}- anonymous
node- rb node of this ksm page in the stable tree
{unnamed_struct}- anonymous
head- (overlaying parent)
migrate_nodesindicates temporarily on that list {unnamed_struct}- anonymous
hlist_dup- linked into the stable_node->hlist with a stable_node chain
list- linked into migrate_nodes, pending placement in the proper node tree
hlist- hlist head of rmap_items using this ksm page
{unnamed_union}- anonymous
kpfn- page frame number of this ksm page (perhaps temporarily on wrong nid)
chain_prune_time- time of the last full garbage collection
rmap_hlist_len- number of rmap_item entries in hlist or STABLE_NODE_CHAIN
nid- NUMA node id of stable tree in which linked (may not match kpfn)
- struct
rmap_item¶ reverse mapping item for virtual addresses
Definition
struct rmap_item { struct rmap_item *rmap_list; union { struct anon_vma *anon_vma;#ifdef CONFIG_NUMA; int nid;#endif; }; struct mm_struct *mm; unsigned long address; unsigned int oldchecksum; union { struct rb_node node; struct { struct stable_node *head; struct hlist_node hlist; }; };};Members
rmap_list- next rmap_item in mm_slot’s singly-linked rmap_list
{unnamed_union}- anonymous
anon_vma- pointer to anon_vma for this mm,address, when in stable tree
nid- NUMA node id of unstable tree in which linked (may not match page)
mm- the memory structure this rmap_item is pointing into
address- the virtual address this rmap_item tracks (+ flags in low bits)
oldchecksum- previous checksum of the page at that virtual address
{unnamed_union}- anonymous
node- rb node of this rmap_item in the unstable tree
{unnamed_struct}- anonymous
head- pointer to stable_node heading this list in the stable tree
hlist- link into hlist of rmap_items hanging off that stable_node
–Izik Eidus,Hugh Dickins, 17 Nov 2009