Kernel Self-Protection

Kernel self-protection is the design and implementation of systems and structures within the Linux kernel to protect against security flaws in the kernel itself. This covers a wide range of issues, including removing entire classes of bugs, blocking security flaw exploitation methods, and actively detecting attack attempts. Not all topics are explored in this document, but it should serve as a reasonable starting point and answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker has arbitrary read and write access to the kernel’s memory. In many cases, bugs being exploited will not provide this level of access, but with systems in place that defend against the worst case we’ll cover the more limited cases as well. A higher bar, and one that should still be kept in mind, is protecting the kernel against a _privileged_ local attacker, since the root user has access to a vastly increased attack surface. (Especially when they have the ability to load arbitrary kernel modules.)

The goals for successful self-protection systems would be that they are effective, on by default, require no opt-in by developers, have no performance impact, do not impede kernel debugging, and have tests. It is uncommon that all these goals can be met, but it is worth explicitly mentioning them, since these aspects need to be explored, dealt with, and/or accepted.

Attack Surface Reduction

The most fundamental defense against security exploits is to reduce the areas of the kernel that can be used to redirect execution. This ranges from limiting the exposed APIs available to userspace, making in-kernel APIs hard to use incorrectly, minimizing the areas of writable kernel memory, etc.

Strict kernel memory permissions

When all of kernel memory is writable, it becomes trivial for attacks to redirect execution flow. To reduce the availability of these targets the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable

Any areas of the kernel with executable memory must not be writable. While this obviously includes the kernel text itself, we must consider all additional places too: kernel modules, JIT memory, etc. (There are temporary exceptions to this rule to support things like instruction alternatives, breakpoints, kprobes, etc. If these must exist in a kernel, they are implemented in a way where the memory is temporarily made writable during the update, and then returned to the original permissions.)

In support of this are CONFIG_STRICT_KERNEL_RWX and CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not writable, data is not executable, and read-only data is neither writable nor executable.

Most architectures have these options on by default and not user selectable. For some architectures like arm that wish to have these be selectable, the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable a Kconfig prompt. CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT determines the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.

Function pointers and sensitive variables must not be writable

Vast areas of kernel memory contain function pointers that are looked up by the kernel and used to continue execution (e.g. descriptor/vector tables, file/network/etc operation structures, etc). The number of these variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them “const” so that they live in the .rodata section instead of the .data section of the kernel, gaining the protection of the kernel’s strict memory permissions as described above.
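As a minimal sketch of the “const” approach, the following userspace C fragment declares an operations table as const. All names here (demo_ops, demo_default_ops, demo_call_open) are hypothetical and not taken from the kernel; the point is only that a const-qualified table of function pointers is something a typical toolchain can place in read-only memory.

```c
/* Hypothetical operations structure, loosely modelled on the kernel's
 * file/network operation tables. */
struct demo_ops {
	int (*open)(int id);
	int (*release)(int id);
};

static int demo_open(int id)
{
	return id;
}

static int demo_release(int id)
{
	(void)id;
	return 0;
}

/* Because the table is const, the compiler rejects writes to it and a
 * typical toolchain emits it into a read-only section rather than .data. */
const struct demo_ops demo_default_ops = {
	.open = demo_open,
	.release = demo_release,
};

/* Callers only ever see a pointer-to-const, so they cannot retarget
 * the function pointers through this interface. */
int demo_call_open(const struct demo_ops *ops, int id)
{
	return ops->open ? ops->open(id) : -1;
}
```

An assignment such as demo_default_ops.open = other_fn; fails to compile, which catches accidental writes early; protection against deliberate writes still depends on the memory permissions described above.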

For variables that are initialized once at __init time, these can be marked with the __ro_after_init attribute.

What remains are variables that are updated rarely (e.g. GDT). These will need another infrastructure (similar to the temporary exceptions made to kernel code mentioned above) that allows them to spend the rest of their lifetime read-only. (For example, when being updated, only the CPU thread performing the update would be given uninterruptible write access to the memory.)

Segregation of kernel memory from userspace memory

The kernel must never execute userspace memory. The kernel must also never access userspace memory without explicit expectation to do so. These rules can be enforced either by support of hardware-based restrictions (x86’s SMEP/SMAP, ARM’s PXN/PAN) or via emulation (ARM’s Memory Domains). By blocking userspace memory in this way, execution and data parsing cannot be passed to trivially-controlled userspace memory, forcing attacks to operate entirely in kernel memory.

Reduced access to syscalls

One trivial way to eliminate many syscalls for 64-bit systems is building without CONFIG_COMPAT. However, this is rarely a feasible scenario.

The “seccomp” system provides an opt-in feature made available to userspace, which provides a way to reduce the number of kernel entry points available to a running process. This limits the breadth of kernel code that can be reached, possibly reducing the availability of a given bug to an attack.

An area of improvement would be creating viable ways to keep access to things like compat, user namespaces, BPF creation, and perf limited only to trusted processes. This would keep the scope of kernel entry points restricted to the more regular set normally available to unprivileged userspace.

Restricting access to kernel modules

The kernel should never allow an unprivileged user the ability to load specific kernel modules, since that would provide a facility to unexpectedly extend the available attack surface. (The on-demand loading of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is considered “expected” here, though additional consideration should be given even to these.) For example, loading a filesystem module via an unprivileged socket API is nonsense: only the root or physically local user should trigger filesystem module loading. (And even this can be up for debate in some scenarios.)

To protect against even privileged users, systems may need to either disable module loading entirely (e.g. monolithic kernel builds or modules_disabled sysctl), or provide signed modules (e.g. CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having root load arbitrary kernel code via the module loader interface.

Memory integrity

There are many memory structures in the kernel that are regularly abused to gain execution control during an attack. By far the most commonly understood is that of the stack buffer overflow in which the return address stored on the stack is overwritten. Many other examples of this kind of attack exist, and protections exist to defend against them.

Stack buffer overflow

The classic stack buffer overflow involves writing past the expected end of a variable stored on the stack, ultimately writing a controlled value to the stack frame’s stored return address. The most widely used defense is the presence of a stack canary between the stack variables and the return address (CONFIG_STACKPROTECTOR), which is verified just before the function returns. Other defenses include things like shadow stacks.

Stack depth overflow

A less well understood attack is using a bug that triggers the kernel to consume stack memory with deep function calls or large stack allocations. With this attack it is possible to write beyond the end of the kernel’s preallocated stack space and into sensitive structures. Two important changes need to be made for better protections: moving the sensitive thread_info structure elsewhere, and adding a faulting memory hole at the bottom of the stack to catch these overflows.

Heap memory integrity

The structures used to track heap free lists can be sanity-checked during allocation and freeing to make sure they aren’t being used to manipulate other memory areas.
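To illustrate the kind of sanity check meant here, the sketch below uses a generic doubly linked free list in userspace C. It is not the kernel's slab allocator, and every name (demo_node, demo_list_del_checked, and so on) is hypothetical. The idea is that an entry is only unlinked if its neighbours still point back at it, so a corrupted link cannot be turned into a write to an attacker-chosen address.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical free-list node; real allocators track more state. */
struct demo_node {
	struct demo_node *prev;
	struct demo_node *next;
};

/* An empty circular list points at itself. */
void demo_list_init(struct demo_node *head)
{
	head->prev = head;
	head->next = head;
}

/* Insert n directly after head. */
void demo_list_add(struct demo_node *head, struct demo_node *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink n only if both neighbours still point back at it.
 * Returns false, leaving the list untouched, when the links look
 * corrupted or the entry was already unlinked. */
bool demo_list_del_checked(struct demo_node *n)
{
	if (n->prev == NULL || n->next == NULL)
		return false;
	if (n->prev->next != n || n->next->prev != n)
		return false;

	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = NULL;
	n->next = NULL;
	return true;
}
```

Without the two comparisons, an attacker who overwrites n->next and n->prev could use the unlink step itself to write one pointer into a location of their choosing.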

Counter integrity

Many places in the kernel use atomic counters to track object references or perform similar lifetime management. When these counters can be made to wrap (over or under) this traditionally exposes a use-after-free flaw. By trapping atomic wrapping, this class of bug vanishes.
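One way to trap wrapping is to make the counter saturate instead of wrapping. The following is a userspace sketch using C11 atomics, written in the spirit of the kernel's refcount_t but not a copy of it; the demo_ref names and the exact policy (refuse to increment from zero, pin at UINT_MAX) are assumptions made for illustration.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter that refuses to wrap. */
struct demo_ref {
	atomic_uint count;
};

/* Once the counter reaches this value it stays there forever. */
#define DEMO_REF_SATURATED UINT_MAX

void demo_ref_init(struct demo_ref *r, unsigned int n)
{
	atomic_init(&r->count, n);
}

/* Take a reference. Fails if the object is already dead (count 0)
 * or the counter is pinned at the saturation value. */
bool demo_ref_get(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == DEMO_REF_SATURATED)
			return false;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));

	return true;
}

/* Drop a reference. Returns true only for the caller that dropped the
 * last reference and should free the object. An underflow attempt or a
 * saturated counter returns false and leaves the value unchanged. */
bool demo_ref_put(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == DEMO_REF_SATURATED)
			return false;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));

	return old == 1;
}
```

The design choice is to trade a possible memory leak (a saturated object is never freed) for the removal of the use-after-free that a wrapped counter would otherwise allow.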

Size calculation overflow detection

Similar to counter overflow, integer overflows (usually size calculations) need to be detected at runtime to kill this class of bug, which traditionally leads to being able to write past the end of kernel buffers.
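A minimal userspace sketch of runtime overflow detection for a size calculation follows. It relies on the GCC/Clang __builtin_mul_overflow builtin; the helper name demo_alloc_array is hypothetical. The kernel has its own helpers for this purpose (see include/linux/overflow.h), which this sketch does not attempt to reproduce.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for n elements of elem_size bytes each.
 * Returns NULL instead of a too-small buffer when n * elem_size
 * does not fit in size_t. */
void *demo_alloc_array(size_t n, size_t elem_size)
{
	size_t bytes;

	/* The builtin stores the product in bytes and returns true
	 * if the mathematically correct result overflowed size_t. */
	if (__builtin_mul_overflow(n, elem_size, &bytes))
		return NULL;

	return malloc(bytes);
}
```

Without the check, a request such as demo_alloc_array(SIZE_MAX, 2) would wrap to a tiny allocation, and later code indexing up to n elements would write far past the end of the buffer.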

Probabilistic defenses

While many protections can be considered deterministic (e.g. read-only memory cannot be written to), some protections provide only statistical defense, in that an attack must gather enough information about a running system to overcome the defense. While not perfect, these do provide meaningful defenses.

Canaries, blinding, and other secrets

It should be noted that things like the stack canary discussed earlier are technically statistical defenses, since they rely on a secret value, and such values may become discoverable through an information exposure flaw.

Blinding literal values for things like JITs, where the executable contents may be partially under the control of userspace, needs a similar secret value.

It is critical that the secret values used must be separate (e.g. different canary per stack) and high entropy (e.g. is the RNG actually working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)

Since the location of kernel memory is almost always instrumental in mounting a successful attack, making the location non-deterministic raises the difficulty of an exploit. (Note that this in turn makes the value of information exposures higher, since they may be used to discover desired memory locations.)

Text and module base

By relocating the physical and virtual base address of the kernel at boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be frustrated. Additionally, offsetting the module loading base address means that even systems that load the same set of modules in the same order every boot will not share a common base address with the rest of the kernel text.

Stack base

If the base address of the kernel stack is not the same between processes, or even not the same between syscalls, targets on or beyond the stack become more difficult to locate.

Dynamic memory base

Much of the kernel’s dynamic memory (e.g. kmalloc, vmalloc, etc) ends up being relatively deterministic in layout due to the order of early-boot initializations. If the base address of these areas is not the same between boots, targeting them is frustrated, requiring an information exposure specific to the region.

Structure layout

By performing a per-build randomization of the layout of sensitive structures, attacks must either be tuned to known kernel builds or expose enough kernel memory to determine structure layouts before manipulating them.

Preventing Information Exposures

Since the locations of sensitive structures are the primary target for attacks, it is important to defend against exposure of both kernel memory addresses and kernel memory contents (since they may contain kernel addresses or other sensitive things like canary values).

Kernel addresses

Printing kernel addresses to userspace leaks sensitive information about the kernel memory layout. Care should be exercised when using any printk specifier that prints the raw address, currently %px, %p[ad], (and %p[sSb] in certain circumstances [*]). Any file written to using one of these specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1 addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is printed. If KALLSYMS is not enabled the raw address is printed.

Unique identifiers

Kernel memory addresses must never be used as identifiers exposed to userspace. Instead, use an atomic counter, an idr, or similar unique identifier.
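The atomic-counter option can be sketched in userspace C as follows. The names demo_object and demo_object_init are hypothetical, and C11 atomics stand in for whatever counter type the surrounding code would really use. Each object receives the next value of a global counter, so the identifier says nothing about where the object lives in memory.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Global source of identifiers; starts at 1 so that 0 can mean "unset". */
static atomic_uint_fast64_t demo_next_id = 1;

/* Hypothetical object whose identifier is exposed to its users. */
struct demo_object {
	uint64_t id;
};

/* Assign a fresh identifier. atomic_fetch_add returns the previous
 * value, so concurrent callers never receive the same id. */
void demo_object_init(struct demo_object *obj)
{
	obj->id = atomic_fetch_add(&demo_next_id, 1);
}
```

A 64-bit counter is assumed here so that wrapping is not a practical concern; a narrower counter would need its own overflow handling.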

Memory initialization

Memory copied to userspace must always be fully initialized. If not explicitly memset(), this will require changes to the compiler to make sure structure holes are cleared.
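The explicit memset() approach can be sketched as below. The structure and function names are hypothetical. On typical ABIs the compiler inserts padding between the one-byte kind field and the eight-byte value field; assigning the two members alone would leave those padding bytes holding whatever was previously in memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reply structure that would be copied out to a caller.
 * There is a hole after kind on typical ABIs. */
struct demo_reply {
	uint8_t  kind;
	uint64_t value;
};

/* Clear the whole object first, holes included, and only then
 * fill in the individual members. */
void demo_fill_reply(struct demo_reply *out, uint8_t kind, uint64_t value)
{
	memset(out, 0, sizeof(*out));
	out->kind = kind;
	out->value = value;
}
```

Note that an initializer such as struct demo_reply r = { .kind = kind }; is not an equivalent substitute, since the C standard does not require it to clear padding; this is the gap that the compiler changes mentioned above would have to close.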

Memory poisoning

When releasing memory, it is best to poison the contents, to avoid reuse attacks that rely on the old contents of memory. E.g., clear stack on a syscall return (CONFIG_KSTACK_ERASE), wipe heap memory on a free. This frustrates many uninitialized variable attacks, stack content exposures, heap content exposures, and use-after-free attacks.
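A userspace sketch of wiping heap memory on free follows. The helper names are hypothetical, and the caller is assumed to know the allocation size. The pattern byte 0x6b mirrors the kernel's POISON_FREE value, but any recognizable pattern would serve for this illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Pattern written over memory that is about to be released. */
#define DEMO_POISON_FREE 0x6b

/* Overwrite len bytes at p with the poison pattern. The volatile
 * pointer keeps the compiler from discarding the stores as dead
 * writes to memory that is about to be freed. */
void demo_poison(void *p, size_t len)
{
	volatile unsigned char *b = p;

	while (len--)
		*b++ = DEMO_POISON_FREE;
}

/* Poison the buffer, then hand it back to the allocator. */
void demo_free_poisoned(void *p, size_t len)
{
	if (p == NULL)
		return;
	demo_poison(p, len);
	free(p);
}
```

A non-zero pattern is used rather than zero so that a stale pointer read back from freed memory is unlikely to look like a valid NULL or a plausible small integer.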

Destination tracking

To help kill classes of bugs that result in kernel addresses being written to userspace, the destination of writes needs to be tracked. If the buffer is destined for userspace (e.g. seq_file backed /proc files), it should automatically censor sensitive values.