24. Microarchitectural Data Sampling (MDS) mitigation

24.1. Overview

Microarchitectural Data Sampling (MDS) is a family of side channel attacks on internal buffers in Intel CPUs. The variants are:

  • Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)

  • Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)

  • Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)

  • Microarchitectural Data Sampling Uncacheable Memory (MDSUM) (CVE-2019-11091)

MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a dependent load (store-to-load forwarding) as an optimization. The forward can also happen to a faulting or assisting load operation for a different memory address, which can be exploited under certain conditions. Store buffers are partitioned between Hyper-Threads so cross thread forwarding is not possible. But if a thread enters or exits a sleep state the store buffer is repartitioned, which can expose data from one thread to the other.

MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage L1 miss situations and to hold data which is returned or sent in response to a memory or I/O operation. Fill buffers can forward data to a load operation and also write data to the cache. When the fill buffer is deallocated it can retain the stale data of the preceding operations, which can then be forwarded to a faulting or assisting load operation and exploited under certain conditions. Fill buffers are shared between Hyper-Threads so cross thread leakage is possible.

MLPDS leaks Load Port Data. Load ports are used to perform load operations from memory or I/O. The received data is then forwarded to the register file or a subsequent operation. In some implementations the Load Port can contain stale data from a previous operation which can be forwarded to faulting or assisting loads under certain conditions, which again can be exploited. Load ports are shared between Hyper-Threads so cross thread leakage is possible.

MDSUM is a special case of MSBDS, MFBDS and MLPDS. An uncacheable load from memory that takes a fault or assist can leave data in a microarchitectural structure that may later be observed using one of the same methods used by MSBDS, MFBDS or MLPDS.

24.2. Exposure assumptions

It is assumed that attack code resides in user space or in a guest, with one exception. The rationale behind this assumption is that the code construct needed for exploiting MDS requires:

  • to control the load to trigger a fault or assist

  • to have a disclosure gadget which exposes the speculatively accessed data for consumption through a side channel

  • to control the pointer through which the disclosure gadget exposes the data

The existence of such a construct in the kernel cannot be excluded with 100% certainty, but the complexity involved makes it extremely unlikely.

There is one exception, which is untrusted BPF. The functionality of untrusted BPF is limited, but it needs to be thoroughly investigated whether it can be used to create such a construct.

24.3. Mitigation strategy

All variants have the same mitigation strategy, at least for the single CPU thread case (SMT off): force the CPU to clear the affected buffers.

This is achieved by using the otherwise unused and obsolete VERW instruction in combination with a microcode update. The microcode clears the affected CPU buffers when the VERW instruction is executed.

For virtualization there are two ways to achieve CPU buffer clearing: either via the modified VERW instruction or via the L1D Flush command. The latter is issued when L1TF mitigation is enabled so the extra VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to be issued.
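The choice between the two clearing methods on guest entry can be sketched as a small decision function. This is an illustrative model only, not the kernel's code: the enum, the function name and the `l1d_flush_issued` parameter are hypothetical names introduced here.

```c
#include <stdbool.h>

/* Hypothetical names; they only model the two methods described above. */
enum vmentry_clear_sketch {
    VMENTRY_CLEAR_VIA_L1D_FLUSH,   /* L1D Flush already clears the buffers */
    VMENTRY_CLEAR_VIA_VERW,        /* no L1D Flush, so VERW is required    */
};

/*
 * l1d_flush_issued: true when the L1TF mitigation issues an L1D Flush
 * on this guest entry, which makes the extra VERW unnecessary.
 */
static enum vmentry_clear_sketch
vmentry_clear_method_sketch(bool l1d_flush_issued)
{
    if (l1d_flush_issued)
        return VMENTRY_CLEAR_VIA_L1D_FLUSH;

    return VMENTRY_CLEAR_VIA_VERW;
}
```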

If the VERW instruction with the supplied segment selector argument is executed on a CPU without the microcode update there is no side effect other than a small number of pointlessly wasted CPU cycles.
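To make the instruction form concrete, the following is a minimal user-space sketch of issuing VERW with a segment selector held in memory, using GCC-style inline assembly. It is not the kernel's implementation: the function name `verw_sketch()` and the choice of the current stack segment selector as the operand are assumptions made for illustration. Whether any buffers are actually cleared depends entirely on the microcode; without the update the instruction only performs its legacy segment check.

```c
/*
 * Sketch only: executes VERW once to show the instruction form.
 * Returns 1 if VERW was executed, 0 when not built for x86 with a
 * GCC-compatible compiler.
 */
static int verw_sketch(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    unsigned short sel;

    /* Assumption: use the current stack segment selector as operand. */
    __asm__ volatile("mov %%ss, %0" : "=r"(sel));

    /* VERW reads the selector from memory and only modifies the ZF flag. */
    __asm__ volatile("verw %0" : : "m"(sel) : "cc");
    return 1;
#else
    return 0;
#endif
}
```

The `"cc"` clobber tells the compiler that the flags register is modified, which is the only architectural state VERW changes.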

This does not protect against cross Hyper-Thread attacks except for MSBDS, which is only exploitable cross Hyper-Thread when one of the Hyper-Threads enters a C-state.

The kernel provides a function to invoke the buffer clearing:

x86_clear_cpu_buffers()

The macro CLEAR_CPU_BUFFERS can also be used in ASM late in the exit-to-user path. Other than EFLAGS.ZF, this macro doesn’t clobber any registers.

The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state (idle) transitions.

As a special quirk to address virtualization scenarios where the host has the microcode updated, but the hypervisor does not (yet) expose the MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the hope that it might actually clear the buffers. The state is reflected accordingly.

According to current knowledge additional mitigations inside the kernel itself are not required because the necessary gadgets to expose the leaked data cannot be controlled in a way which allows exploitation from malicious user space or VM guests.

24.4. Kernel internal mitigation modes

off

Mitigation is disabled. Either the CPU is not affected or mds=off is supplied on the kernel command line.

full

Mitigation is enabled. CPU is affected and MD_CLEAR is advertised in CPUID.

vmwerv

Mitigation is enabled. CPU is affected and MD_CLEAR is not advertised in CPUID. That is mainly for virtualization scenarios where the host has the updated microcode but the hypervisor does not expose MD_CLEAR in CPUID. It’s a best effort approach without guarantee.

If the CPU is affected and mds=off is not supplied on the kernel command line then the kernel selects the appropriate mitigation mode depending on the availability of the MD_CLEAR CPUID bit.
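The selection rule can be written down as a small decision function. This is a sketch of the logic as described in this section, not the kernel's actual code: the enum, the function name and the three boolean parameters are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; they mirror the three modes described above. */
enum mds_mode_sketch {
    MDS_SKETCH_OFF,
    MDS_SKETCH_FULL,
    MDS_SKETCH_VMWERV,
};

/*
 * cpu_affected: the CPU is affected by at least one MDS variant
 * cmdline_off:  mds=off was supplied on the kernel command line
 * md_clear:     the MD_CLEAR bit is advertised in CPUID
 */
static enum mds_mode_sketch mds_select_mode_sketch(bool cpu_affected,
                                                   bool cmdline_off,
                                                   bool md_clear)
{
    enum mds_mode_sketch mode;

    if (!cpu_affected || cmdline_off)
        mode = MDS_SKETCH_OFF;
    else if (md_clear)
        mode = MDS_SKETCH_FULL;
    else
        mode = MDS_SKETCH_VMWERV;   /* best effort, no guarantee */

    /* An unaffected CPU must never end up with an active mitigation. */
    assert(cpu_affected || mode == MDS_SKETCH_OFF);
    return mode;
}
```

For example, an affected CPU in a guest whose hypervisor hides MD_CLEAR would map to the vmwerv mode in this model, while the same CPU booted with mds=off maps to off regardless of CPUID.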

24.5. Mitigation points

24.5.1. Return to user space

When transitioning from kernel to user space the CPU buffers are flushed on affected CPUs when the mitigation is not disabled on the kernel command line. The mitigation is enabled through the feature flag X86_FEATURE_CLEAR_CPU_BUF.

The mitigation is invoked just before transitioning to userspace, after user registers are restored. This is done to minimize the window in which kernel data could be accessed after VERW, e.g. via an NMI after VERW.

Corner case not handled

Interrupts returning to the kernel don’t clear CPU buffers since the exit-to-user path is expected to do that anyway. But there could be a case when an NMI is generated in the kernel after the exit-to-user path has cleared the buffers. This case is not handled, and NMIs returning to the kernel don’t clear CPU buffers because:

  1. It is rare to get an NMI after VERW, but before returning to user space.

  2. For an unprivileged user, there is no known way to make that NMI less rare or target it.

  3. It would take a large number of these precisely-timed NMIs to mount an actual attack. There’s presumably not enough bandwidth.

  4. The NMI in question occurs after a VERW, i.e. when user state is restored and most interesting data is already scrubbed. What’s left is only the data that NMI touches, and that may or may not be of any interest.

24.5.2. C-State transition

When a CPU goes idle and enters a C-State the CPU buffers need to be cleared on affected CPUs when SMT is active. This addresses the repartitioning of the store buffer when one of the Hyper-Threads enters a C-State.

When SMT is inactive, i.e. either the CPU does not support it or all sibling threads are offline, CPU buffer clearing is not required.

The idle clearing is enabled on CPUs which are only affected by MSBDS and not by any other MDS variant. The other MDS variants cannot be protected against cross Hyper-Thread attacks because the Fill Buffer and the Load Ports are shared. So on CPUs affected by other variants, the idle clearing would be a window dressing exercise and is therefore not activated.

The invocation is controlled by the static key cpu_buf_idle_clear which is switched depending on the chosen mitigation mode and the SMT state of the system.
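The conditions under which the idle clearing is switched on can be modeled as follows. This is a simplified sketch of the rules stated in this section, not the kernel's static key machinery: the struct, both function names and all parameters are hypothetical, and a plain boolean stands in for the static key.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the static key described above. */
struct idle_clear_sketch {
    bool enabled;
};

/*
 * Re-evaluate the key, e.g. after the mitigation mode is chosen or
 * the SMT state changes.
 *
 * mitigation_enabled: the mitigation mode is not "off"
 * smt_active:         SMT is supported and a sibling thread is online
 * msbds_only:         the CPU is affected by MSBDS and no other variant
 */
static void idle_clear_update_sketch(struct idle_clear_sketch *key,
                                     bool mitigation_enabled,
                                     bool smt_active,
                                     bool msbds_only)
{
    key->enabled = mitigation_enabled && smt_active && msbds_only;
}

/* Called before entering a C-State; counts how often a clear would run. */
static void idle_enter_sketch(const struct idle_clear_sketch *key,
                              unsigned int *clears)
{
    if (key->enabled)
        (*clears)++;    /* the buffer clear would be invoked here */
}
```

Modeling the key as a value that is recomputed on every state change reflects the text above: the decision is made once per mode or SMT change, so the idle path itself only tests a single flag.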

The buffer clear is only invoked before entering the C-State to prevent stale data from the idling CPU from spilling to the Hyper-Thread sibling after the store buffer got repartitioned and all entries are available to the non idle sibling.

When coming out of idle the store buffer is partitioned again so each sibling has half of it available. The CPU coming back from idle could then be speculatively exposed to contents of the sibling. The buffers are flushed either on exit to user space or on VMENTER so malicious code in user space or the guest cannot speculatively access them.

The mitigation is hooked into all variants of halt()/mwait(), but does not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver was superseded by the intel_idle driver around 2010, and intel_idle is preferred on all affected CPUs which are expected to gain the MD_CLEAR functionality in microcode. Aside from that, the IO-Port mechanism is a legacy interface which is only used on older systems which are either not affected or do not receive microcode updates anymore.