15. Page Table Isolation (PTI)
15.1. Overview
Page Table Isolation (pti, previously known as KAISER [1]) is a countermeasure against attacks on the shared user/kernel address space such as the “Meltdown” approach [2].
To mitigate this class of attacks, we create an independent set of page tables for use only when running userspace applications. When the kernel is entered via syscalls, interrupts or exceptions, the page tables are switched to the full “kernel” copy. When the system switches back to user mode, the user copy is used again.
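The switching rule can be illustrated with a toy state machine. This is only a sketch under stated assumptions: `struct toy_cpu`, `toy_kernel_entry()` and `toy_kernel_exit()` are hypothetical names invented for illustration, and the real entry code is assembly that works on CR3 directly. The model counts a simulated CR3 write only when the active copy actually changes, so an interrupt that arrives while the kernel is already running does not switch.

```c
#include <assert.h>
#include <stdbool.h>

/* Which page-table copy the simulated CR3 currently selects. */
enum toy_tables { TOY_USER_TABLES, TOY_KERNEL_TABLES };

struct toy_cpu {
    enum toy_tables active;   /* currently active page-table copy */
    unsigned int cr3_writes;  /* number of simulated CR3 writes */
};

/* Hypothetical model of kernel entry (syscall, interrupt or exception). */
static void toy_kernel_entry(struct toy_cpu *cpu)
{
    if (cpu->active == TOY_USER_TABLES) {
        /* Entered from user mode: switch to the full kernel copy. */
        cpu->active = TOY_KERNEL_TABLES;
        cpu->cr3_writes++;
    }
    /* Otherwise the kernel itself was interrupted; nothing to switch. */
}

/* Hypothetical model of leaving an entry path. */
static void toy_kernel_exit(struct toy_cpu *cpu, bool returning_to_user)
{
    /* Exit code always runs on the kernel copy. */
    assert(cpu->active == TOY_KERNEL_TABLES);

    if (returning_to_user) {
        /* Going back to user mode: the user copy is used again. */
        cpu->active = TOY_USER_TABLES;
        cpu->cr3_writes++;
    }
}
```

In this model a syscall that is interrupted once costs two CR3 writes in total (one on entry from user mode, one on the final return), because the nested interrupt neither switches on entry nor on its return to kernel code.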
The userspace page tables contain only a minimal amount of kernel data: only what is needed to enter/exit the kernel such as the entry/exit functions themselves and the interrupt descriptor table (IDT). There are a few strictly unnecessary things that get mapped such as the first C function when entering an interrupt (see comments in pti.c).
This approach helps to ensure that side-channel attacks leveraging the paging structures do not function when PTI is enabled. It can be enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time. Once enabled at compile-time, it can be disabled at boot with the ‘nopti’ or ‘pti=’ kernel parameters (see kernel-parameters.txt).
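As a rough illustration of the boot-time switch, the sketch below classifies a command line string. It is not the kernel's parser: `toy_parse_pti()` and `enum toy_pti_mode` are hypothetical names, the values `on`, `off` and `auto` for ‘pti=’ are taken from the kernel parameter documentation, and the "last matching token wins" rule is an assumption of this sketch rather than documented kernel behaviour.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
enum toy_pti_mode { TOY_PTI_AUTO, TOY_PTI_FORCE_ON, TOY_PTI_FORCE_OFF };

/*
 * Scan a space-separated command line for 'nopti' or 'pti=<value>'.
 * Assumption of this sketch: a later token overrides an earlier one.
 */
static enum toy_pti_mode toy_parse_pti(const char *cmdline)
{
    enum toy_pti_mode mode = TOY_PTI_AUTO;
    const char *p = cmdline;

    assert(cmdline != NULL);

    while (*p) {
        size_t len = strcspn(p, " ");   /* length of the current token */

        if (len == 5 && strncmp(p, "nopti", 5) == 0)
            mode = TOY_PTI_FORCE_OFF;
        else if (len == 7 && strncmp(p, "pti=off", 7) == 0)
            mode = TOY_PTI_FORCE_OFF;
        else if (len == 6 && strncmp(p, "pti=on", 6) == 0)
            mode = TOY_PTI_FORCE_ON;
        else if (len == 8 && strncmp(p, "pti=auto", 8) == 0)
            mode = TOY_PTI_AUTO;

        p += len;                       /* step over the token */
        p += strspn(p, " ");            /* step over separators */
    }
    return mode;
}
```

Tokens are matched by exact length, so an unrelated parameter that merely starts with `nopti` is ignored.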
15.2. Page Table Management
When PTI is enabled, the kernel manages two sets of page tables. The first set is very similar to the single set which is present in kernels without PTI. This includes a complete mapping of userspace that the kernel can use for things like copy_to_user().
Although _complete_, the user portion of the kernel page tables is crippled by setting the NX bit in the top level. This ensures that any missed kernel->user CR3 switch will immediately crash userspace upon executing its first instruction.
The userspace page tables map only the kernel data needed to enter and exit the kernel. This data is entirely contained in the ‘struct cpu_entry_area’ structure, which is placed in the fixmap. The fixmap gives each CPU’s copy of the area a compile-time-fixed virtual address.
For new userspace mappings, the kernel makes the entries in its page tables like normal. The only difference is when the kernel makes entries in the top (PGD) level. In addition to setting the entry in the main kernel PGD, a copy of the entry is made in the userspace page tables’ PGD.
This sharing at the PGD level also inherently shares all the lower layers of the page tables. This leaves a single, shared set of userspace page tables to manage. One PTE to lock, one set of accessed bits, dirty bits, etc…
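The top-level bookkeeping described in this section can be sketched with a toy model of the two PGDs. Everything named `toy_*` or `TOY_*` is hypothetical. The bit positions (Present in bit 0, User in bit 2, NX in bit 63) follow the x86-64 page-table entry format; treating the lower 256 of 512 slots as the userspace half, and setting NX only on entries that are both present and user-accessible, are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PTRS_PER_PGD   512u          /* entries in one 4 KiB top-level table */
#define TOY_USER_PGD_SLOTS 256u          /* assumption: lower half maps userspace */
#define TOY_PAGE_PRESENT   (1ULL << 0)
#define TOY_PAGE_USER      (1ULL << 2)
#define TOY_PAGE_NX        (1ULL << 63)

/* Toy stand-in for the pair of top-level tables kept per process. */
struct toy_pgd_pair {
    uint64_t kernel[TOY_PTRS_PER_PGD];   /* full kernel copy */
    uint64_t user[TOY_PTRS_PER_PGD];     /* minimal userspace copy */
};

/* Hypothetical model of a top-level (PGD) write when PTI is enabled. */
static void toy_set_pgd(struct toy_pgd_pair *p, unsigned int idx, uint64_t entry)
{
    assert(idx < TOY_PTRS_PER_PGD);

    if (idx < TOY_USER_PGD_SLOTS) {
        /* Mirror the entry so both copies share the same lower-level tables. */
        p->user[idx] = entry;

        /* Poison the kernel copy's view of userspace with NX. */
        if ((entry & (TOY_PAGE_PRESENT | TOY_PAGE_USER)) ==
            (TOY_PAGE_PRESENT | TOY_PAGE_USER))
            entry |= TOY_PAGE_NX;
    }
    p->kernel[idx] = entry;
}
```

Because only the top-level entry is duplicated, both copies point at the same lower-level tables, which is what leaves a single set of PTEs, accessed bits and dirty bits to manage. Kernel-half entries are written to the kernel copy only.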
15.3. Overhead
Protection against side-channel attacks is important. But, this protection comes at a cost:
- Increased Memory Use
- Each process now needs an order-1 PGD instead of order-0. (Consumes an additional 4k per process).
- The ‘cpu_entry_area’ structure must be 2MB in size and 2MB aligned so that it can be mapped by setting a single PMD entry. This consumes nearly 2MB of RAM once the kernel is decompressed, but no space in the kernel image itself.
- Runtime Cost
- CR3 manipulation to switch between the page table copies must be done at interrupt, syscall, and exception entry and exit (it can be skipped when the kernel is interrupted, though). Moves to CR3 are on the order of a hundred cycles, and are required at every entry and exit.
- A “trampoline” must be used for SYSCALL entry. This trampoline depends on a smaller set of resources than the non-PTI SYSCALL entry code, so requires mapping fewer things into the userspace page tables. The downside is that stacks must be switched at entry time.
- Global pages are disabled for all kernel structures not mapped into both kernel and userspace page tables. This feature of the MMU allows different processes to share TLB entries mapping the kernel. Losing the feature means more TLB misses after a context switch. The actual loss of performance is very small, however, never exceeding 1%.
- Process Context IDentifiers (PCID) is a CPU feature that allows us to skip flushing the entire TLB when switching page tables by setting a special bit in CR3 when the page tables are changed. This makes switching the page tables (at context switch, or kernel entry/exit) cheaper. But, on systems with PCID support, the context switch code must flush both the user and kernel entries out of the TLB. The user PCID TLB flush is deferred until the exit to userspace, minimizing the cost. See intel.com/sdm for the gory PCID/INVPCID details.
- The userspace page tables must be populated for each new process. Even without PTI, the shared kernel mappings are created by copying top-level (PGD) entries into each new process. But, with PTI, there are now two kernel mappings: one in the kernel page tables that maps everything and one for the entry/exit structures. At fork(), we need to copy both.
- In addition to the fork()-time copying, there must also be an update to the userspace PGD any time a set_pgd() is done on a PGD used to map userspace. This ensures that the kernel and userspace copies always map the same userspace memory.
- On systems without PCID support, each CR3 write flushes the entire TLB. That means that each syscall, interrupt or exception flushes the TLB.
- INVPCID is a TLB-flushing instruction which allows flushing of TLB entries for non-current PCIDs. Some systems support PCIDs, but do not support INVPCID. On these systems, addresses can only be flushed from the TLB for the current PCID. When flushing a kernel address, we need to flush all PCIDs, so a single kernel address flush will require a TLB-flushing CR3 write upon the next use of every PCID.
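The interaction between the order-1 PGD, PCIDs and the "special bit in CR3" from the list above can be sketched by composing CR3 values by hand. Architecturally, the PCID occupies CR3 bits 0-11 and bit 63 of the value written asks the CPU not to flush entries tagged with that PCID. The rest is an assumption of this sketch: that the user PGD is the second 4 KiB page of the order-1 allocation, and that bit 11 of the PCID distinguishes the user half of a kernel/user PCID pair. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE      4096ULL
#define TOY_CR3_PCID_MASK  0xFFFULL       /* PCID occupies CR3 bits 0-11 */
#define TOY_CR3_NOFLUSH    (1ULL << 63)   /* keep TLB entries of the loaded PCID */
#define TOY_USER_PCID_BIT  (1ULL << 11)   /* assumption: marks the user PCID */

/* Build the CR3 value selecting the kernel page tables of one address space. */
static uint64_t toy_kernel_cr3(uint64_t pgd_phys, uint64_t pcid, int noflush)
{
    /* An order-1 PGD is 8 KiB, so its base is assumed to be 8 KiB aligned. */
    assert((pgd_phys & (2 * TOY_PAGE_SIZE - 1)) == 0);
    /* Kernel PCIDs must leave the user bit clear. */
    assert(pcid < TOY_USER_PCID_BIT);

    return pgd_phys | pcid | (noflush ? TOY_CR3_NOFLUSH : 0);
}

/* Derive the matching user CR3: second page of the pair, user PCID. */
static uint64_t toy_user_cr3(uint64_t kernel_cr3)
{
    return kernel_cr3 | TOY_PAGE_SIZE | TOY_USER_PCID_BIT;
}
```

Under these assumptions, switching between the two copies only sets or clears two bits of an existing CR3 value, which is one reason a layout like this keeps the entry and exit paths short. Without PCID support the no-flush bit is unavailable and every such write flushes the TLB, as noted in the list.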
15.4. Possible Future Work
- We can be more careful about not actually writing to CR3 unless its value is actually changed.
- Allow PTI to be enabled/disabled at runtime in addition to the boot-time switching.
15.5. Testing
To test stability of PTI, the following test procedure is recommended, ideally doing all of these in parallel:
- Set CONFIG_DEBUG_ENTRY=y
- Run several copies of all of the tools/testing/selftests/x86/ tests (excluding MPX and protection_keys) in a loop on multiple CPUs for several minutes. These tests frequently uncover corner cases in the kernel entry code. In general, old kernels might cause these tests themselves to crash, but they should never crash the kernel.
- Run the ‘perf’ tool in a mode (top or record) that generates many frequent performance monitoring non-maskable interrupts (see “NMI” in /proc/interrupts). This exercises the NMI entry/exit code which is known to trigger bugs in code paths that did not expect to be interrupted, including nested NMIs. Using “-c” boosts the rate of NMIs, and using two -c with separate counters encourages nested NMIs and less deterministic behavior.
  while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
- Launch a KVM virtual machine.
- Run 32-bit binaries on systems supporting the SYSCALL instruction. This has been a lightly-tested code path and needs extra scrutiny.
15.6. Debugging
Bugs in PTI cause a few different signatures of crashes that are worth noting here.
- Failures of the selftests/x86 code. Usually a bug in one of the more obscure corners of entry_64.S
- Crashes in early boot, especially around CPU bringup. Bugs in the trampoline code or mappings cause these.
- Crashes at the first interrupt. Caused by bugs in entry_64.S, like screwing up a page table switch. Also caused by incorrectly mapping the IRQ handler entry code.
- Crashes at the first NMI. The NMI code is separate from main interrupt handlers and can have bugs that do not affect normal interrupts. Also caused by incorrectly mapping NMI code. NMIs that interrupt the entry code must be very careful and can be the cause of crashes that show up when running perf.
- Kernel crashes at the first exit to userspace. entry_64.S bugs, or failing to map some of the exit code.
- Crashes at first interrupt that interrupts userspace. The paths in entry_64.S that return to userspace are sometimes separate from the ones that return to the kernel.
- Double faults: overflowing the kernel stack because of page faults upon page faults. Caused by touching non-pti-mapped data in the entry code, or forgetting to switch to kernel CR3 before calling into C functions which are not pti-mapped.
- Userspace segfaults early in boot, sometimes manifesting as mount(8) failing to mount the rootfs. These have tended to be TLB invalidation issues. Usually invalidating the wrong PCID, or otherwise missing an invalidation.
[1] https://gruss.cc/files/kaiser.pdf
[2] https://meltdownattack.com/meltdown.pdf