11. The TLB
When the kernel unmaps or modifies the attributes of a range of memory, it has two choices:
Flush the entire TLB with a two-instruction sequence. This is a quick operation, but it causes collateral damage: TLB entries from areas other than the one we are trying to flush will be destroyed and must be refilled later, at some cost.
Use the invlpg instruction to invalidate a single page at a time. This could potentially cost many more instructions, but it is a much more precise operation, causing no collateral damage to other TLB entries.
Which method to use depends on a few things:
The size of the flush being performed. A flush of the entire address space is obviously better performed by flushing the entire TLB than by doing 2^48/PAGE_SIZE individual flushes.
The contents of the TLB. If the TLB is empty, then there will be no collateral damage caused by doing the global flush, and all of the individual flushes will have ended up being wasted work.
The size of the TLB. The larger the TLB, the more collateral damage we do with a full flush. So, the larger the TLB, the more attractive an individual flush looks. Data and instructions have separate TLBs, as do different page sizes.
The microarchitecture. The TLB has become a multi-level cache on modern CPUs, and the global flushes have become more expensive relative to single-page flushes.
There is obviously no way the kernel can know all these things, especially the contents of the TLB during a given flush. The sizes of the flush will vary greatly depending on the workload as well. There is essentially no “right” point to choose.
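To put the first point in the list above into numbers, here is a small worked calculation. It assumes 4 KiB base pages and a 48-bit virtual address space; both constants are illustrative assumptions, and the function name is hypothetical.

```python
PAGE_SIZE = 4096   # assumed 4 KiB base page size
VA_BITS = 48       # assumed virtual address width, as in 2^48 above


def single_page_flushes(va_bits=VA_BITS, page_size=PAGE_SIZE):
    """Per-page invalidations needed to cover an entire address space."""
    return (1 << va_bits) // page_size


if __name__ == "__main__":
    # 2^48 / 2^12 = 2^36 individual invalidations
    print(single_page_flushes())  # 68719476736
```

Against that many single-page invalidations, the two-instruction full flush is clearly the cheaper option, whatever the collateral damage.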
You may be doing too many individual invalidations if you see the invlpg instruction (or instructions _near_ it) show up high in profiles. If you believe that individual invalidations are being called too often, you can lower the tunable:
/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
This will cause us to do the global flush for more cases. Lowering it to 0 will disable the use of the individual flushes. Setting it to 1 is a very conservative setting and it should never need to be 0 under normal circumstances.
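The effect of the ceiling can be sketched as a simple comparison. This is a minimal model of the behaviour described above, not the kernel's code: the function names are hypothetical, 4 KiB pages are assumed, and the rule "do a full flush when the range covers strictly more pages than the ceiling" is an assumption inferred from the statements that 0 disables individual flushes and 1 is very conservative.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages


def pages_in_range(start, end, page_shift=PAGE_SHIFT):
    """Number of pages touched by the byte range [start, end)."""
    if end <= start:
        return 0
    first = start >> page_shift
    last = (end - 1) >> page_shift
    return last - first + 1


def choose_flush(start, end, ceiling):
    """Return "full" for a global flush, "single" for per-page invlpg.

    Assumed rule: more pages than the ceiling means a full flush.
    """
    if pages_in_range(start, end) > ceiling:
        return "full"
    return "single"


if __name__ == "__main__":
    print(choose_flush(0x1000, 0x2000, ceiling=1))  # single (one page)
    print(choose_flush(0x1000, 0x3000, ceiling=1))  # full (two pages)
    print(choose_flush(0x1000, 0x2000, ceiling=0))  # full (individual flushes disabled)
```

Under this model, lowering the ceiling moves more flush sizes into the "full" branch, which is what the tunable is described as doing.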
Despite the fact that a single individual flush on x86 is guaranteed to flush a full 2MB [1], hugetlbfs always uses the full flushes. THP is treated exactly the same as normal memory.
You might see invlpg inside of flush_tlb_mm_range() show up in profiles, or you can use the trace_tlb_flush() tracepoints to determine how long the flush operations are taking.
Essentially, you are balancing the cycles you spend doing invlpg with the cycles that you spend refilling the TLB later.
You can measure how expensive TLB refills are by using performance counters and ‘perf stat’, like this:
perf stat -e \
cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,\
cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,\
cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,\
cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,\
cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,\
cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
That works on an IvyBridge-era CPU (i5-3320M). Different CPUs may have differently-named counters, but they should at least be there in some form. You can use pmu-tools ‘ocperf list’ (https://github.com/andikleen/pmu-tools) to find the right counters for a given CPU.
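Once the counters have been collected, one way to turn them into a refill cost is to divide each walk-duration count by the matching walks-completed count. The helper below is a minimal sketch of that arithmetic. It assumes that the *_walk_duration events count cycles spent in page walks, it takes the counts as a plain dictionary keyed by the event names used above rather than parsing perf output, and the function name is hypothetical. The numbers in the usage example are made up for illustration, not measurements.

```python
def average_walk_cycles(counts):
    """Average cycles per completed page walk, per TLB miss type.

    counts maps the event names used in the perf command above to their
    values. Miss types with no completed walks are left out of the result.
    """
    pairs = {
        "dtlb_load": ("dtlb_load_misses_walk_duration",
                      "dtlb_load_misses_walk_completed"),
        "dtlb_store": ("dtlb_store_misses_walk_duration",
                       "dtlb_store_misses_walk_completed"),
        "itlb": ("itlb_misses_walk_duration",
                 "itlb_misses_walk_completed"),
    }
    result = {}
    for kind, (duration, completed) in pairs.items():
        walks = counts.get(completed, 0)
        if walks:
            result[kind] = counts.get(duration, 0) / walks
    return result


if __name__ == "__main__":
    # Illustrative values only.
    sample = {
        "dtlb_load_misses_walk_duration": 3000,
        "dtlb_load_misses_walk_completed": 100,
        "itlb_misses_walk_duration": 500,
        "itlb_misses_walk_completed": 20,
    }
    print(average_walk_cycles(sample))  # {'dtlb_load': 30.0, 'itlb': 25.0}
```

A per-walk figure like this gives a rough idea of what each TLB entry destroyed by a full flush costs to refill, which can then be weighed against the cycles spent in invlpg.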
[1] A footnote in Intel’s SDM “4.10.4.2 Recommended Invalidation” says: “One execution of INVLPG is sufficient even for a page with size greater than 4 KBytes.”