Lightweight PI-futexes

We are calling them lightweight for 3 reasons:

  • in the user-space fastpath a PI-enabled futex involves no kernel work(or any other PI complexity) at all. No registration, no extra kernelcalls - just pure fast atomic ops in userspace.
  • even in the slowpath, the system call and scheduling pattern is verysimilar to normal futexes.
  • the in-kernel PI implementation is streamlined around the mutexabstraction, with strict rules that keep the implementationrelatively simple: only a single owner may own a lock (i.e. noread-write lock support), only the owner may unlock a lock, norecursive locking, etc.

Priority Inheritance - why?

The short reply: user-space PI helps achieving/improving determinism foruser-space applications. In the best-case, it can help achievedeterminism and well-bound latencies. Even in the worst-case, PI willimprove the statistical distribution of locking related applicationdelays.

The longer reply

Firstly, sharing locks between multiple tasks is a common programmingtechnique that often cannot be replaced with lockless algorithms. As wecan see it in the kernel [which is a quite complex program in itself],lockless structures are rather the exception than the norm - the currentratio of lockless vs. locky code for shared data structures is somewherebetween 1:10 and 1:100. Lockless is hard, and the complexity of locklessalgorithms often endangers to ability to do robust reviews of said code.I.e. critical RT apps often choose lock structures to protect criticaldata structures, instead of lockless algorithms. Furthermore, there arecases (like shared hardware, or other resource limits) where locklessaccess is mathematically impossible.

Media players (such as Jack) are an example of reasonable applicationdesign with multiple tasks (with multiple priority levels) sharingshort-held locks: for example, a highprio audio playback thread iscombined with medium-prio construct-audio-data threads and low-priodisplay-colory-stuff threads. Add video and decoding to the mix andwe’ve got even more priority levels.

So once we accept that synchronization objects (locks) are anunavoidable fact of life, and once we accept that multi-task userspaceapps have a very fair expectation of being able to use locks, we’ve gotto think about how to offer the option of a deterministic lockingimplementation to user-space.

Most of the technical counter-arguments against doing priorityinheritance only apply to kernel-space locks. But user-space locks aredifferent, there we cannot disable interrupts or make the tasknon-preemptible in a critical section, so the ‘use spinlocks’ argumentdoes not apply (user-space spinlocks have the same priority inversionproblems as other user-space locking constructs). Fact is, pretty muchthe only technique that currently enables good determinism for userspacelocks (such as futex-based pthread mutexes) is priority inheritance:

Currently (without PI), if a high-prio and a low-prio task shares a lock[this is a quite common scenario for most non-trivial RT applications],even if all critical sections are coded carefully to be deterministic(i.e. all critical sections are short in duration and only execute alimited number of instructions), the kernel cannot guarantee anydeterministic execution of the high-prio task: any medium-priority taskcould preempt the low-prio task while it holds the shared lock andexecutes the critical section, and could delay it indefinitely.

Implementation

As mentioned before, the userspace fastpath of PI-enabled pthreadmutexes involves no kernel work at all - they behave quite similarly tonormal futex-based locks: a 0 value means unlocked, and a value==TIDmeans locked. (This is the same method as used by list-based robustfutexes.) Userspace uses atomic ops to lock/unlock these mutexes withoutentering the kernel.

To handle the slowpath, we have added two new futex ops:

  • FUTEX_LOCK_PI
  • FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 toTID fails], then FUTEX_LOCK_PI is called. The kernel does all theremaining work: if there is no futex-queue attached to the futex addressyet then the code looks up the task that owns the futex [it has put itsown TID into the futex value], and attaches a ‘PI state’ structure tothe futex-queue. The pi_state includes an rt-mutex, which is a PI-aware,kernel-based synchronization object. The ‘other’ task is made the ownerof the rt-mutex, and the FUTEX_WAITERS bit is atomically set in thefutex value. Then this task tries to lock the rt-mutex, on which itblocks. Once it returns, it has the mutex acquired, and it sets thefutex value to its own TID and returns. Userspace has no other work toperform - it now owns the lock, and futex value containsFUTEX_WAITERS|TID.

If the unlock side fastpath succeeds, [i.e. userspace manages to do aTID -> 0 atomic transition of the futex value], then no kernel work istriggered.

If the unlock fastpath fails (because the FUTEX_WAITERS bit is set),then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on thebehalf of userspace - and it also unlocks the attachedpi_state->rt_mutex and thus wakes up any potential waiters.

Note that under this approach, contrary to previous PI-futex approaches,there is no prior ‘registration’ of a PI-futex. [which is not quitepossible anyway, due to existing ABI properties of pthread mutexes.]

Also, under this scheme, ‘robustness’ and ‘PI’ are two orthogonalproperties of futexes, and all four combinations are possible: futex,robust-futex, PI-futex, robust+PI-futex.

More details about priority inheritance can be found inDocumentation/locking/rt-mutex.rst.