Futex Requeue PI

Requeueing of tasks from a non-PI futex to a PI futex requiresspecial handling in order to ensure the underlying rt_mutex is neverleft without an owner if it has waiters; doing so would break the PIboosting logic [see rt-mutex-desgin.txt] For the purposes ofbrevity, this action will be referred to as “requeue_pi” throughoutthis document. Priority inheritance is abbreviated throughout as“PI”.

Motivation

Without requeue_pi, the glibc implementation ofpthread_cond_broadcast() must resort to waking all the tasks waitingon a pthread_condvar and letting them try to sort out which taskgets to run first in classic thundering-herd formation. An idealimplementation would wake the highest-priority waiter, and leave therest to the natural wakeup inherent in unlocking the mutexassociated with the condvar.

Consider the simplified glibc calls:

/* caller must lock mutex */pthread_cond_wait(cond, mutex){        lock(cond->__data.__lock);        unlock(mutex);        do {        unlock(cond->__data.__lock);        futex_wait(cond->__data.__futex);        lock(cond->__data.__lock);        } while(...)        unlock(cond->__data.__lock);        lock(mutex);}pthread_cond_broadcast(cond){        lock(cond->__data.__lock);        unlock(cond->__data.__lock);        futex_requeue(cond->data.__futex, cond->mutex);}

Once pthread_cond_broadcast() requeues the tasks, the cond->mutexhas waiters. Note that pthread_cond_wait() attempts to lock themutex only after it has returned to user space. This will leave theunderlying rt_mutex with waiters, and no owner, breaking thepreviously mentioned PI-boosting algorithms.

In order to support PI-aware pthread_condvar’s, the kernel needs tobe able to requeue tasks to PI futexes. This support implies thatupon a successful futex_wait system call, the caller would return touser space already holding the PI futex. The glibc implementationwould be modified as follows:

/* caller must lock mutex */pthread_cond_wait_pi(cond, mutex){        lock(cond->__data.__lock);        unlock(mutex);        do {        unlock(cond->__data.__lock);        futex_wait_requeue_pi(cond->__data.__futex);        lock(cond->__data.__lock);        } while(...)        unlock(cond->__data.__lock);        /* the kernel acquired the mutex for us */}pthread_cond_broadcast_pi(cond){        lock(cond->__data.__lock);        unlock(cond->__data.__lock);        futex_requeue_pi(cond->data.__futex, cond->mutex);}

The actual glibc implementation will likely test for PI and make thenecessary changes inside the existing calls rather than creating newcalls for the PI cases. Similar changes are needed forpthread_cond_timedwait() and pthread_cond_signal().

Implementation

In order to ensure the rt_mutex has an owner if it has waiters, itis necessary for both the requeue code, as well as the waiting code,to be able to acquire the rt_mutex before returning to user space.The requeue code cannot simply wake the waiter and leave it toacquire the rt_mutex as it would open a race window between therequeue call returning to user space and the waiter waking andstarting to run. This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines,rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), whichallow the requeue code to acquire an uncontended rt_mutex on behalfof the waiter and to enqueue the waiter on a contended rt_mutex.Two new system calls provide the kernel<->user interface torequeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()and pthread_cond_timedwait()) to block on the initial futex and waitto be requeued to a PI-aware futex. The implementation is theresult of a high-speed collision between futex_wait() andfutex_lock_pi(), with some extra logic to check for the additionalwake-up scenarios.

FUTEX_CMP_REQUEUE_PI is called by the waker(pthread_cond_broadcast() and pthread_cond_signal()) to requeue andpossibly wake the waiting tasks. Internally, this system call isstill handled by futex_requeue (by passing requeue_pi=1). Beforerequeueing,futex_requeue() attempts to acquire the requeue targetPI futex on behalf of the top waiter. If it can, this waiter iswoken.futex_requeue() then proceeds to requeue the remainingnr_wake+nr_requeue tasks to the PI futex, callingrt_mutex_start_proxy_lock() prior to each requeue to prepare thetask as a waiter on the underlying rt_mutex. It is possible thatthe lock can be acquired at this stage as well, if so, the nextwaiter is woken to finish the acquisition of the lock.

FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, buttheir sum is all that really matters.futex_requeue() will wake orrequeue up to nr_wake + nr_requeue tasks. It will wake only as manytasks as it can acquire the lock for, which in the majority of casesshould be 0 as good programming practice dictates that the caller ofeither pthread_cond_broadcast() or pthread_cond_signal() acquire themutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires thatnr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 forsignal.