The robust futex ABI
- Author:
Started by Paul Jackson <pj@sgi.com>
Robust_futexes provide a mechanism, used in addition to normal futexes, by which the kernel assists in cleaning up held locks on task exit.
The interesting data as to what futexes a thread is holding is kept on a linked list in user space, where it can be updated efficiently as locks are taken and dropped, without kernel intervention. The only additional kernel intervention required for robust_futexes above and beyond what is required for futexes is:
1) a one time call, per thread, to tell the kernel where its list of held robust_futexes begins, and
2) internal kernel code at exit, to handle any listed locks held by the exiting thread.
The existing normal futexes already provide a “Fast Userspace Locking” mechanism, which handles uncontested locking without needing a system call, and handles contested locking by maintaining a list of waiting threads in the kernel. Options on the sys_futex(2) system call support waiting on a particular futex, and waking up the next waiter on a particular futex.
For robust_futexes to work, the user code (typically in a library such as glibc linked with the application) has to manage and place the necessary list elements exactly as the kernel expects them. If it fails to do so, then improperly listed locks will not be cleaned up on exit, probably causing deadlock or other such failure of the other threads waiting on the same locks.
A thread that anticipates possibly using robust_futexes should first issue the system call:
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head, size_t len);
The pointer ‘head’ points to a structure in the thread’s address space consisting of three words. Each word is 32 bits on 32 bit architectures, or 64 bits on 64 bit architectures, in local byte order. Each thread should have its own thread private ‘head’.
If a thread is running in 32 bit compatibility mode on a 64 bit native arch kernel, then it can actually have two such structures: one using 32 bit words for 32 bit compatibility mode, and one using 64 bit words for 64 bit native mode. The kernel, if it is a 64 bit kernel supporting 32 bit compatibility mode, will attempt to process both lists on each task exit, if the corresponding sys_set_robust_list() call has been made to set up that list.
The first word in the memory structure at ‘head’ contains a pointer to a singly linked list of ‘lock entries’, one per lock, as described below. If the list is empty, the pointer will point to itself, ‘head’. The last ‘lock entry’ points back to the ‘head’.
The second word, called ‘offset’, specifies the offset, plus or minus, from the address of a ‘lock entry’ to its associated ‘lock word’. The ‘lock word’ is always a 32 bit word, unlike the other words above. The ‘lock word’ holds 2 flag bits in the upper 2 bits, and the thread id (TID) of the thread holding the lock in the bottom 30 bits. See further below for a description of the flag bits.
The third word, called ‘list_op_pending’, contains a transient copy of the address of the ‘lock entry’ during list insertion and removal, and is needed to correctly resolve races should a thread exit while in the middle of a locking or unlocking operation.
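The three-word layout described above can be sketched in C as follows. The kernel's UAPI header `<linux/futex.h>` declares `struct robust_list` and `struct robust_list_head` for this purpose; to stay self-contained, the sketch declares equivalent types under hypothetical names (`sketch_entry`, `sketch_head`, `sketch_head_init`, `sketch_head_is_empty`) and assumes that `long` is the native word size, as it is on Linux. Treat it as an illustration of the layout, not as the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

/* One 'lock entry': a single word pointing to the next entry or back to head. */
struct sketch_entry {
    struct sketch_entry *next;
};

/* The per-thread 'head': three native-sized words (names are hypothetical). */
struct sketch_head {
    struct sketch_entry list;              /* word 1: first lock entry, or itself when empty */
    long offset;                           /* word 2: signed offset from lock entry to lock word */
    struct sketch_entry *list_op_pending;  /* word 3: entry being inserted or removed, else NULL */
};

/* Initialise an empty list: the first word points back at the head itself. */
void sketch_head_init(struct sketch_head *head, long offset)
{
    assert(head != NULL);
    head->list.next = &head->list;
    head->offset = offset;
    head->list_op_pending = NULL;
}

/* An empty list is one whose first word points to the head. */
int sketch_head_is_empty(const struct sketch_head *head)
{
    assert(head != NULL);
    return head->list.next == &head->list;
}
```

A thread would call `sketch_head_init()` once on its private head, before telling the kernel where that head lives.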
Each ‘lock entry’ on the singly linked list starting at ‘head’ consists of just a single word, pointing to the next ‘lock entry’, or back to ‘head’ if there are no more entries. In addition, near each ‘lock entry’, at the distance from the ‘lock entry’ specified by the ‘offset’ word, is one ‘lock word’.
The ‘lock word’ is always 32 bits, and is intended to be the same 32 bit lock variable used by the futex mechanism, in conjunction with robust_futexes. The kernel will only be able to wake up the next thread waiting for a lock on a thread’s exit if that next thread used the futex mechanism to register the address of that ‘lock word’ with the kernel.
For each futex lock currently held by a thread, if it wants this robust_futex support for exit cleanup of that lock, it should have one ‘lock entry’ on this list, with its associated ‘lock word’ at the specified ‘offset’. Should a thread die while holding any such locks, the kernel will walk this list, mark any such locks with a bit indicating their holder died, and wake up the next thread waiting for that lock using the futex mechanism.
When a thread has invoked the above system call to indicate it anticipates using robust_futexes, the kernel stores the passed in ‘head’ pointer for that task. The task may retrieve that value later on by using the system call:
asmlinkage long sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr, size_t __user *len_ptr);
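A minimal sketch of the two calls described above, assuming a Linux system whose C library provides `syscall()` together with the `SYS_set_robust_list` and `SYS_get_robust_list` numbers, and using the `struct robust_list_head` declared in `<linux/futex.h>`. The function name `register_and_query` and the use of thread-local storage for the head are assumptions of this sketch. Threading libraries such as glibc normally register a list of their own for each thread, so replacing it as done here is only reasonable in a program that does not also rely on the library's robust mutexes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>   /* struct robust_list_head */
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_set_robust_list, SYS_get_robust_list */
#include <unistd.h>        /* syscall() */

/* One head per thread; it must stay valid until the thread exits. */
static _Thread_local struct robust_list_head sketch_head;

/* Register an empty list for the calling thread, then read the pointer back.
 * Returns 0 if the kernel reports the same head, -1 otherwise. */
int register_and_query(void)
{
    struct robust_list_head *seen = NULL;
    size_t seen_len = 0;

    sketch_head.list.next = &sketch_head.list;  /* empty list points to itself */
    sketch_head.futex_offset = 0;               /* placeholder: no locks are listed here */
    sketch_head.list_op_pending = NULL;

    if (syscall(SYS_set_robust_list, &sketch_head, sizeof(sketch_head)) != 0)
        return -1;

    /* A pid of 0 asks for the calling thread's own head. */
    if (syscall(SYS_get_robust_list, 0, &seen, &seen_len) != 0)
        return -1;

    assert(seen_len == sizeof(sketch_head));
    return seen == &sketch_head ? 0 : -1;
}
```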
It is anticipated that threads will use robust_futexes embedded in larger, user level locking structures, one per lock. The kernel robust_futex mechanism doesn’t care what else is in that structure, so long as the ‘offset’ to the ‘lock word’ is the same for all robust_futexes used by that thread. The thread should link those locks it currently holds using the ‘lock entry’ pointers. It may also have other links between the locks, such as the reverse side of a doubly linked list, but that doesn’t matter to the kernel.
By keeping its locks linked this way, on a list starting with a ‘head’ pointer known to the kernel, the kernel can provide to a thread the essential service available for robust_futexes, which is to help clean up locks held at the time of a (perhaps unexpected) exit.
Actual locking and unlocking, during normal operations, is handled entirely by user level code in the contending threads, and by the existing futex mechanism to wait for locks and to wake up their waiters. The kernel’s only essential involvement in robust_futexes is to remember where the list ‘head’ is, and to walk the list on thread exit, handling locks still held by the departing thread, as described below.
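As an illustration of such an embedding, the sketch below places a ‘lock entry’ and a ‘lock word’ inside a larger structure and derives the single ‘offset’ value with `offsetof`. The structure `sketch_lock`, its extra fields, the macro `SKETCH_OFFSET` and the helper `sketch_lock_word` are all hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_entry {
    struct sketch_entry *next;
};

/* A hypothetical user-level lock: only 'entry' and 'word' matter to the kernel. */
struct sketch_lock {
    const char *name;            /* anything else may live in the structure */
    struct sketch_entry entry;   /* 'lock entry' linked on the robust list */
    struct sketch_lock *prev;    /* e.g. reverse link of a doubly linked list */
    uint32_t word;               /* 32 bit 'lock word' used as the futex */
};

/* Signed distance from the lock entry to the lock word; identical for every lock. */
#define SKETCH_OFFSET \
    ((long)offsetof(struct sketch_lock, word) - (long)offsetof(struct sketch_lock, entry))

/* Locate the lock word from a lock entry: address of the entry plus 'offset'. */
uint32_t *sketch_lock_word(struct sketch_entry *entry, long offset)
{
    assert(entry != NULL);
    return (uint32_t *)((char *)entry + offset);
}
```

Because every lock of this type has the same layout, one ‘offset’ stored in the head is enough to find the ‘lock word’ of any listed entry.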
There may exist thousands of futex lock structures in a thread’s shared memory, on various data structures, at a given point in time. Only those lock structures for locks currently held by that thread should be on that thread’s robust_futex linked lock list at a given time.
A given futex lock structure in a user shared memory region may be held at different times by any of the threads with access to that region. The thread currently holding such a lock, if any, is identified by its TID in the lower 30 bits of the ‘lock word’.
When adding or removing a lock from its list of held locks, in order for the kernel to correctly handle lock cleanup regardless of when the task exits (perhaps it gets an unexpected signal 9 in the middle of manipulating this list), the user code must observe the following protocol on ‘lock entry’ insertion and removal:
On insertion:
1) set the ‘list_op_pending’ word to the address of the ‘lock entry’ to be inserted,
2) acquire the futex lock,
3) add the lock entry, with its thread id (TID) in the bottom 30 bits of the ‘lock word’, to the linked list starting at ‘head’, and
4) clear the ‘list_op_pending’ word.
On removal:
1) set the ‘list_op_pending’ word to the address of the ‘lock entry’ to be removed,
2) remove the lock entry for this lock from the ‘head’ list,
3) release the futex lock, and
4) clear the ‘list_op_pending’ word.
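The two protocols can be sketched with C11 atomics as below. This is an uncontended-path illustration only: it never sleeps with FUTEX_WAIT and only reports whether a wakeup would be needed. All names (`sketch_trylock`, `sketch_unlock`, `SKETCH_BARRIER` and the structures) are hypothetical, and the use of `atomic_signal_fence` as a compiler barrier to keep the stores in program order is an assumption of the sketch, not something the text prescribes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_WAITERS 0x80000000u   /* bit 31, set by waiters */

struct sketch_entry { struct sketch_entry *next; };
struct sketch_head {
    struct sketch_entry list;
    long offset;
    struct sketch_entry *list_op_pending;
};
struct sketch_lock {
    struct sketch_entry entry;
    _Atomic uint32_t word;
};

/* Compiler barrier so the stores below are emitted in program order. */
#define SKETCH_BARRIER() atomic_signal_fence(memory_order_seq_cst)

/* Insertion protocol. Returns 0 on success, -1 if the lock is already held. */
int sketch_trylock(struct sketch_head *h, struct sketch_lock *l, uint32_t tid)
{
    uint32_t expected = 0;

    assert(h != NULL && l != NULL);
    h->list_op_pending = &l->entry;               /* step 1: announce the entry */
    SKETCH_BARRIER();
    if (!atomic_compare_exchange_strong(&l->word, &expected, tid)) {
        h->list_op_pending = NULL;                /* not acquired: nothing is pending */
        return -1;
    }                                             /* step 2: lock word now holds our TID */
    l->entry.next = h->list.next;                 /* step 3: link at the front of the list */
    h->list.next = &l->entry;
    SKETCH_BARRIER();
    h->list_op_pending = NULL;                    /* step 4: operation complete */
    return 0;
}

/* Removal protocol. Returns 1 if waiters need a futex wake, 0 if not,
 * and -1 if the lock was not on the list. */
int sketch_unlock(struct sketch_head *h, struct sketch_lock *l)
{
    struct sketch_entry *prev;
    uint32_t old;

    assert(h != NULL && l != NULL);
    prev = &h->list;
    h->list_op_pending = &l->entry;               /* step 1: announce the entry */
    SKETCH_BARRIER();
    while (prev->next != &l->entry) {             /* find the predecessor */
        prev = prev->next;
        if (prev == &h->list) {                   /* wrapped around: not listed */
            h->list_op_pending = NULL;
            return -1;
        }
    }
    prev->next = l->entry.next;                   /* step 2: unlink from the list */
    SKETCH_BARRIER();
    old = atomic_exchange(&l->word, 0);           /* step 3: release the lock */
    SKETCH_BARRIER();
    h->list_op_pending = NULL;                    /* step 4: operation complete */
    return (old & SKETCH_WAITERS) ? 1 : 0;
}
```

In both functions ‘list_op_pending’ covers the window in which the lock is held but not listed, which is the window the exit-time cleanup would otherwise miss.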
On exit, the kernel will consider the address stored in ‘list_op_pending’ and the address of each ‘lock entry’ found by walking the list starting at ‘head’. For each such address, if the bottom 30 bits of the ‘lock word’ at offset ‘offset’ from that address equal the exiting thread’s TID, then the kernel will do two things:
1) if bit 31 (0x80000000) is set in that word, then attempt a futex wakeup on that address, which will waken the next thread that has used the futex mechanism to wait on that address, and
2) atomically set bit 30 (0x40000000) in the ‘lock word’.
In the above, bit 31 was set by futex waiters on that lock to indicate they were waiting, and bit 30 is set by the kernel to indicate that the lock owner died holding the lock.
The kernel exit code will silently stop scanning the list further if at any point:
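The bit layout of the ‘lock word’ can be captured in a few helpers. The kernel's UAPI header `<linux/futex.h>` provides `FUTEX_WAITERS`, `FUTEX_OWNER_DIED` and `FUTEX_TID_MASK` for these values; the `SKETCH_*` macros and `sketch_*` functions below are hypothetical stand-ins so that the example is self-contained.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_WAITERS     0x80000000u  /* bit 31: set by waiters on the lock */
#define SKETCH_OWNER_DIED  0x40000000u  /* bit 30: set by the kernel at owner exit */
#define SKETCH_TID_MASK    0x3fffffffu  /* bottom 30 bits: TID of the owner */

/* Compose a lock word from a TID and flag bits; the TID must fit in 30 bits. */
uint32_t sketch_make_word(uint32_t tid, uint32_t flags)
{
    assert((tid & ~SKETCH_TID_MASK) == 0);
    assert((flags & SKETCH_TID_MASK) == 0);
    return tid | flags;
}

/* TID of the thread holding the lock, or 0 if the lock is free. */
uint32_t sketch_owner_tid(uint32_t word)
{
    return word & SKETCH_TID_MASK;
}

/* Non-zero if some thread has flagged that it is waiting on the lock. */
int sketch_has_waiters(uint32_t word)
{
    return (word & SKETCH_WAITERS) != 0;
}

/* Non-zero if the kernel marked the lock because its owner died holding it. */
int sketch_owner_died(uint32_t word)
{
    return (word & SKETCH_OWNER_DIED) != 0;
}
```

A thread that acquires a lock and finds `sketch_owner_died()` true knows the previous holder exited without unlocking, so the data the lock protects may need recovery.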
1) the ‘head’ pointer or a subsequent linked list pointer is not a valid address of a user space word,
2) the calculated location of the ‘lock word’ (address plus ‘offset’) is not the valid address of a 32 bit user space word, or
3) the list contains more than 1 million (subject to future kernel configuration changes) elements.
When the kernel sees a list entry whose ‘lock word’ doesn’t have the current thread’s TID in the lower 30 bits, it does nothing with that entry, and goes on to the next entry.
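The exit-time behaviour described above can be modelled in user space as follows. This is a model of the text, not the kernel's code: the names are hypothetical, a NULL check stands in for the kernel's validation of user addresses, the limit constant simply mirrors the "1 million" figure quoted above, and a wakeup is only counted instead of being issued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_WAITERS     0x80000000u
#define SKETCH_OWNER_DIED  0x40000000u
#define SKETCH_TID_MASK    0x3fffffffu
#define SKETCH_LIST_LIMIT  1000000      /* bound on list length, taken from the text */

struct sketch_entry { struct sketch_entry *next; };
struct sketch_head {
    struct sketch_entry list;
    long offset;
    struct sketch_entry *list_op_pending;
};

/* Handle one candidate entry; returns 1 if a wakeup would be attempted. */
static int sketch_handle(struct sketch_entry *e, long offset, uint32_t tid)
{
    _Atomic uint32_t *word = (_Atomic uint32_t *)((char *)e + offset);
    uint32_t val = atomic_load(word);

    if ((val & SKETCH_TID_MASK) != tid)
        return 0;                               /* not held by us: leave it alone */
    atomic_fetch_or(word, SKETCH_OWNER_DIED);   /* mark that the owner died */
    return (val & SKETCH_WAITERS) != 0;         /* waiters present: wake the next one */
}

/* Model of exit cleanup; returns the number of wakeups that would be issued. */
int sketch_exit_cleanup(struct sketch_head *h, uint32_t tid)
{
    struct sketch_entry *e;
    int wakeups = 0;
    long seen = 0;

    assert(h != NULL);
    for (e = h->list.next; e != NULL && e != &h->list; e = e->next) {
        if (++seen > SKETCH_LIST_LIMIT)
            break;                              /* stop silently on an over-long list */
        if (e != h->list_op_pending)            /* the pending entry is handled once, below */
            wakeups += sketch_handle(e, h->offset, tid);
    }
    if (h->list_op_pending != NULL)
        wakeups += sketch_handle(h->list_op_pending, h->offset, tid);
    return wakeups;
}
```

Entries whose ‘lock word’ carries another thread's TID pass through `sketch_handle()` unchanged, matching the rule that the kernel skips them and moves on to the next entry.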