Unreliable Guide To Locking

Author:Rusty Russell

Introduction

Welcome, to Rusty’s Remarkably Unreliable Guide to Kernel Lockingissues. This document describes the locking systems in the Linux Kernelin 2.6.

With the wide availability of HyperThreading, and preemption in theLinux Kernel, everyone hacking on the kernel needs to know thefundamentals of concurrency and locking for SMP.

The Problem With Concurrency

(Skip this if you know what a Race Condition is).

In a normal program, you can increment a counter like so:

very_important_count++;

This is what they would expect to happen:

Expected Results
Instance 1Instance 2
read very_important_count (5) 
add 1 (6) 
write very_important_count (6) 
 read very_important_count (6)
 add 1 (7)
 write very_important_count (7)

This is what might happen:

Possible Results
Instance 1Instance 2
read very_important_count (5) 
 read very_important_count (5)
add 1 (6) 
 add 1 (6)
write very_important_count (6) 
 write very_important_count (6)

Race Conditions and Critical Regions

This overlap, where the result depends on the relative timing ofmultiple tasks, is called a race condition. The piece of code containingthe concurrency issue is called a critical region. And especially sinceLinux starting running on SMP machines, they became one of the majorissues in kernel design and implementation.

Preemption can have the same effect, even if there is only one CPU: bypreempting one task during the critical region, we have exactly the samerace condition. In this case the thread which preempts might run thecritical region itself.

The solution is to recognize when these simultaneous accesses occur, anduse locks to make sure that only one instance can enter the criticalregion at any time. There are many friendly primitives in the Linuxkernel to help you do this. And then there are the unfriendlyprimitives, but I’ll pretend they don’t exist.

Locking in the Linux Kernel

If I could give you one piece of advice: never sleep with anyone crazierthan yourself. But if I had to give you advice on locking:keep itsimple.

Be reluctant to introduce new locks.

Strangely enough, this last one is the exact reverse of my advice whenyouhave slept with someone crazier than yourself. And you shouldthink about getting a big dog.

Two Main Types of Kernel Locks: Spinlocks and Mutexes

There are two main types of kernel locks. The fundamental type is thespinlock (include/asm/spinlock.h), which is a very simplesingle-holder lock: if you can’t get the spinlock, you keep trying(spinning) until you can. Spinlocks are very small and fast, and can beused anywhere.

The second type is a mutex (include/linux/mutex.h): it is like aspinlock, but you may block holding a mutex. If you can’t lock a mutex,your task will suspend itself, and be woken up when the mutex isreleased. This means the CPU can do something else while you arewaiting. There are many cases when you simply can’t sleep (seeWhat Functions Are Safe To Call From Interrupts?),and so have to use a spinlock instead.

Neither type of lock is recursive: seeDeadlock: Simple and Advanced.

Locks and Uniprocessor Kernels

For kernels compiled withoutCONFIG_SMP, and withoutCONFIG_PREEMPT spinlocks do not exist at all. This is an excellentdesign decision: when no-one else can run at the same time, there is noreason to have a lock.

If the kernel is compiled withoutCONFIG_SMP, butCONFIG_PREEMPTis set, then spinlocks simply disable preemption, which is sufficient toprevent any races. For most purposes, we can think of preemption asequivalent to SMP, and not worry about it separately.

You should always test your locking code withCONFIG_SMP andCONFIG_PREEMPT enabled, even if you don’t have an SMP test box,because it will still catch some kinds of locking bugs.

Mutexes still exist, because they are required for synchronizationbetween user contexts, as we will see below.

Locking Only In User Context

If you have a data structure which is only ever accessed from usercontext, then you can use a simple mutex (include/linux/mutex.h) toprotect it. This is the most trivial case: you initialize the mutex.Then you can callmutex_lock_interruptible() to grab themutex, andmutex_unlock() to release it. There is also amutex_lock(), which should be avoided, because it willnot return if a signal is received.

Example:net/netfilter/nf_sockopt.c allows registration of newsetsockopt() and getsockopt() calls, withnf_register_sockopt(). Registration and de-registrationare only done on module load and unload (and boot time, where there isno concurrency), and the list of registrations is only consulted for anunknown setsockopt() or getsockopt() systemcall. Thenf_sockopt_mutex is perfect to protect this, especiallysince the setsockopt and getsockopt calls may well sleep.

Locking Between User Context and Softirqs

If a softirq shares data with user context, you have two problems.Firstly, the current user context can be interrupted by a softirq, andsecondly, the critical region could be entered from another CPU. This iswhere spin_lock_bh() (include/linux/spinlock.h) isused. It disables softirqs on that CPU, then grabs the lock.spin_unlock_bh() does the reverse. (The ‘_bh’ suffix isa historical reference to “Bottom Halves”, the old name for softwareinterrupts. It should really be called spin_lock_softirq()’ in aperfect world).

Note that you can also use spin_lock_irq() orspin_lock_irqsave() here, which stop hardware interruptsas well: seeHard IRQ Context.

This works perfectly for UP as well: the spin lock vanishes, and thismacro simply becomes local_bh_disable()(include/linux/interrupt.h), which protects you from the softirqbeing run.

Locking Between User Context and Tasklets

This is exactly the same as above, because tasklets are actually runfrom a softirq.

Locking Between User Context and Timers

This, too, is exactly the same as above, because timers are actually runfrom a softirq. From a locking point of view, tasklets and timers areidentical.

Locking Between Tasklets/Timers

Sometimes a tasklet or timer might want to share data with anothertasklet or timer.

The Same Tasklet/Timer

Since a tasklet is never run on two CPUs at once, you don’t need toworry about your tasklet being reentrant (running twice at once), evenon SMP.

Different Tasklets/Timers

If another tasklet/timer wants to share data with your tasklet or timer, you will both need to use spin_lock() andspin_unlock() calls. spin_lock_bh() isunnecessary here, as you are already in a tasklet, and none will be runon the same CPU.

Locking Between Softirqs

Often a softirq might want to share data with itself or a tasklet/timer.

The Same Softirq

The same softirq can run on the other CPUs: you can use a per-CPU array(seePer-CPU Data) for better performance. If you’regoing so far as to use a softirq, you probably care about scalableperformance enough to justify the extra complexity.

You’ll need to use spin_lock() andspin_unlock() for shared data.

Different Softirqs

You’ll need to use spin_lock() andspin_unlock() for shared data, whether it be a timer,tasklet, different softirq or the same or another softirq: any of themcould be running on a different CPU.

Hard IRQ Context

Hardware interrupts usually communicate with a tasklet or softirq.Frequently this involves putting work in a queue, which the softirq willtake out.

Locking Between Hard IRQ and Softirqs/Tasklets

If a hardware irq handler shares data with a softirq, you have twoconcerns. Firstly, the softirq processing can be interrupted by ahardware interrupt, and secondly, the critical region could be enteredby a hardware interrupt on another CPU. This is wherespin_lock_irq() is used. It is defined to disableinterrupts on that cpu, then grab the lock.spin_unlock_irq() does the reverse.

The irq handler does not need to use spin_lock_irq(), becausethe softirq cannot run while the irq handler is running: it can usespin_lock(), which is slightly faster. The only exceptionwould be if a different hardware irq handler uses the same lock:spin_lock_irq() will stop that from interrupting us.

This works perfectly for UP as well: the spin lock vanishes, and thismacro simply becomes local_irq_disable()(include/asm/smp.h), which protects you from the softirq/tasklet/BHbeing run.

spin_lock_irqsave() (include/linux/spinlock.h) is avariant which saves whether interrupts were on or off in a flags word,which is passed to spin_unlock_irqrestore(). This meansthat the same code can be used inside an hard irq handler (whereinterrupts are already off) and in softirqs (where the irq disabling isrequired).

Note that softirqs (and hence tasklets and timers) are run on returnfrom hardware interrupts, so spin_lock_irq() also stopsthese. In that sense, spin_lock_irqsave() is the mostgeneral and powerful locking function.

Locking Between Two Hard IRQ Handlers

It is rare to have to share data between two IRQ handlers, but if youdo, spin_lock_irqsave() should be used: it isarchitecture-specific whether all interrupts are disabled inside irqhandlers themselves.

Cheat Sheet For Locking

Pete Zaitcev gives the following summary:

  • If you are in a process context (any syscall) and want to lock otherprocess out, use a mutex. You can take a mutex and sleep(copy_from_user*( orkmalloc(x,GFP_KERNEL)).
  • Otherwise (== data can be touched in an interrupt), usespin_lock_irqsave() andspin_unlock_irqrestore().
  • Avoid holding spinlock for more than 5 lines of code and across anyfunction call (except accessors like readb()).

Table of Minimum Requirements

The following table lists theminimum locking requirements betweenvarious contexts. In some cases, the same context can only be running onone CPU at a time, so no locking is required for that context (eg. aparticular thread can only run on one CPU at a time, but if it needsshares data with another thread, locking is required).

Remember the advice above: you can always usespin_lock_irqsave(), which is a superset of all otherspinlock primitives.

.IRQ Handler AIRQ Handler BSoftirq ASoftirq BTasklet ATasklet BTimer ATimer BUser Context AUser Context B
IRQ Handler ANone         
IRQ Handler BSLISNone        
Softirq ASLISLISL       
Softirq BSLISLISLSL      
Tasklet ASLISLISLSLNone     
Tasklet BSLISLISLSLSLNone    
Timer ASLISLISLSLSLSLNone   
Timer BSLISLISLSLSLSLSLNone  
User Context ASLISLISLBHSLBHSLBHSLBHSLBHSLBHNone 
User Context BSLISLISLBHSLBHSLBHSLBHSLBHSLBHMLINone

Table: Table of Locking Requirements

SLISspin_lock_irqsave
SLIspin_lock_irq
SLspin_lock
SLBHspin_lock_bh
MLImutex_lock_interruptible

Table: Legend for Locking Requirements Table

The trylock Functions

There are functions that try to acquire a lock only once and immediatelyreturn a value telling about success or failure to acquire the lock.They can be used if you need no access to the data protected with thelock when some other thread is holding the lock. You should acquire thelock later if you then need access to the data protected with the lock.

spin_trylock() does not spin but returns non-zero if itacquires the spinlock on the first try or 0 if not. This function can beused in all contexts like spin_lock(): you must havedisabled the contexts that might interrupt you and acquire the spinlock.

mutex_trylock() does not suspend your task but returnsnon-zero if it could lock the mutex on the first try or 0 if not. Thisfunction cannot be safely used in hardware or software interruptcontexts despite not sleeping.

Common Examples

Let’s step through a simple example: a cache of number to name mappings.The cache keeps a count of how often each of the objects is used, andwhen it gets full, throws out the least used one.

All In User Context

For our first example, we assume that all operations are in user context(ie. from system calls), so we can sleep. This means we can use a mutexto protect the cache and all the objects within it. Here’s the code:

#include <linux/list.h>#include <linux/slab.h>#include <linux/string.h>#include <linux/mutex.h>#include <asm/errno.h>struct object{        struct list_head list;        int id;        char name[32];        int popularity;};/* Protects the cache, cache_num, and the objects within it */static DEFINE_MUTEX(cache_lock);static LIST_HEAD(cache);static unsigned int cache_num = 0;#define MAX_CACHE_SIZE 10/* Must be holding cache_lock */static struct object *__cache_find(int id){        struct object *i;        list_for_each_entry(i, &cache, list)                if (i->id == id) {                        i->popularity++;                        return i;                }        return NULL;}/* Must be holding cache_lock */static void __cache_delete(struct object *obj){        BUG_ON(!obj);        list_del(&obj->list);        kfree(obj);        cache_num--;}/* Must be holding cache_lock */static void __cache_add(struct object *obj){        list_add(&obj->list, &cache);        if (++cache_num > MAX_CACHE_SIZE) {                struct object *i, *outcast = NULL;                list_for_each_entry(i, &cache, list) {                        if (!outcast || i->popularity < outcast->popularity)                                outcast = i;                }                __cache_delete(outcast);        }}int cache_add(int id, const char *name){        struct object *obj;        if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)                return -ENOMEM;        strscpy(obj->name, name, sizeof(obj->name));        obj->id = id;        obj->popularity = 0;        mutex_lock(&cache_lock);        __cache_add(obj);        mutex_unlock(&cache_lock);        return 0;}void cache_delete(int id){        mutex_lock(&cache_lock);        __cache_delete(__cache_find(id));        mutex_unlock(&cache_lock);}int cache_find(int id, char *name){        struct object *obj;        int ret = -ENOENT;        mutex_lock(&cache_lock);        obj = __cache_find(id);        if (obj) {                ret = 0;                strcpy(name, obj->name);        }        mutex_unlock(&cache_lock);        return ret;}

Note that we always make sure we have the cache_lock when we add,delete, or look up the cache: both the cache infrastructure itself andthe contents of the objects are protected by the lock. In this case it’seasy, since we copy the data for the user, and never let them access theobjects directly.

There is a slight (and common) optimization here: incache_add() we set up the fields of the object beforegrabbing the lock. This is safe, as no-one else can access it until weput it in cache.

Accessing From Interrupt Context

Now consider the case where cache_find() can be calledfrom interrupt context: either a hardware interrupt or a softirq. Anexample would be a timer which deletes object from the cache.

The change is shown below, in standard patch format: the- are lineswhich are taken away, and the+ are lines which are added.

--- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100+++ cache.c.interrupt   2003-12-09 14:07:49.000000000 +1100@@ -12,7 +12,7 @@         int popularity; };-static DEFINE_MUTEX(cache_lock);+static DEFINE_SPINLOCK(cache_lock); static LIST_HEAD(cache); static unsigned int cache_num = 0; #define MAX_CACHE_SIZE 10@@ -55,6 +55,7 @@ int cache_add(int id, const char *name) {         struct object *obj;+        unsigned long flags;         if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)                 return -ENOMEM;@@ -63,30 +64,33 @@         obj->id = id;         obj->popularity = 0;-        mutex_lock(&cache_lock);+        spin_lock_irqsave(&cache_lock, flags);         __cache_add(obj);-        mutex_unlock(&cache_lock);+        spin_unlock_irqrestore(&cache_lock, flags);         return 0; } void cache_delete(int id) {-        mutex_lock(&cache_lock);+        unsigned long flags;++        spin_lock_irqsave(&cache_lock, flags);         __cache_delete(__cache_find(id));-        mutex_unlock(&cache_lock);+        spin_unlock_irqrestore(&cache_lock, flags); } int cache_find(int id, char *name) {         struct object *obj;         int ret = -ENOENT;+        unsigned long flags;-        mutex_lock(&cache_lock);+        spin_lock_irqsave(&cache_lock, flags);         obj = __cache_find(id);         if (obj) {                 ret = 0;                 strcpy(name, obj->name);         }-        mutex_unlock(&cache_lock);+        spin_unlock_irqrestore(&cache_lock, flags);         return ret; }

Note that the spin_lock_irqsave() will turn offinterrupts if they are on, otherwise does nothing (if we are already inan interrupt handler), hence these functions are safe to call from anycontext.

Unfortunately, cache_add() callskmalloc()with theGFP_KERNEL flag, which is only legal in user context. Ihave assumed that cache_add() is still only called inuser context, otherwise this should become a parameter tocache_add().

Exposing Objects Outside This File

If our objects contained more information, it might not be sufficient tocopy the information in and out: other parts of the code might want tokeep pointers to these objects, for example, rather than looking up theid every time. This produces two problems.

The first problem is that we use thecache_lock to protect objects:we’d need to make this non-static so the rest of the code can use it.This makes locking trickier, as it is no longer all in one place.

The second problem is the lifetime problem: if another structure keeps apointer to an object, it presumably expects that pointer to remainvalid. Unfortunately, this is only guaranteed while you hold the lock,otherwise someone might call cache_delete() and evenworse, add another object, re-using the same address.

As there is only one lock, you can’t hold it forever: no-one else wouldget any work done.

The solution to this problem is to use a reference count: everyone whohas a pointer to the object increases it when they first get the object,and drops the reference count when they’re finished with it. Whoeverdrops it to zero knows it is unused, and can actually delete it.

Here is the code:

--- cache.c.interrupt   2003-12-09 14:25:43.000000000 +1100+++ cache.c.refcnt  2003-12-09 14:33:05.000000000 +1100@@ -7,6 +7,7 @@ struct object {         struct list_head list;+        unsigned int refcnt;         int id;         char name[32];         int popularity;@@ -17,6 +18,35 @@ static unsigned int cache_num = 0; #define MAX_CACHE_SIZE 10+static void __object_put(struct object *obj)+{+        if (--obj->refcnt == 0)+                kfree(obj);+}++static void __object_get(struct object *obj)+{+        obj->refcnt++;+}++void object_put(struct object *obj)+{+        unsigned long flags;++        spin_lock_irqsave(&cache_lock, flags);+        __object_put(obj);+        spin_unlock_irqrestore(&cache_lock, flags);+}++void object_get(struct object *obj)+{+        unsigned long flags;++        spin_lock_irqsave(&cache_lock, flags);+        __object_get(obj);+        spin_unlock_irqrestore(&cache_lock, flags);+}+ /* Must be holding cache_lock */ static struct object *__cache_find(int id) {@@ -35,6 +65,7 @@ {         BUG_ON(!obj);         list_del(&obj->list);+        __object_put(obj);         cache_num--; }@@ -63,6 +94,7 @@         strscpy(obj->name, name, sizeof(obj->name));         obj->id = id;         obj->popularity = 0;+        obj->refcnt = 1; /* The cache holds a reference */         spin_lock_irqsave(&cache_lock, flags);         __cache_add(obj);@@ -79,18 +111,15 @@         spin_unlock_irqrestore(&cache_lock, flags); }-int cache_find(int id, char *name)+struct object *cache_find(int id) {         struct object *obj;-        int ret = -ENOENT;         unsigned long flags;         spin_lock_irqsave(&cache_lock, flags);         obj = __cache_find(id);-        if (obj) {-                ret = 0;-                strcpy(name, obj->name);-        }+        if (obj)+                __object_get(obj);         spin_unlock_irqrestore(&cache_lock, flags);-        return ret;+        return obj; }

We encapsulate the reference counting in the standard ‘get’ and ‘put’functions. Now we can return the object itself fromcache_find() which has the advantage that the user cannow sleep holding the object (eg. to copy_to_user() toname to userspace).

The other point to note is that I said a reference should be held forevery pointer to the object: thus the reference count is 1 when firstinserted into the cache. In some versions the framework does not hold areference count, but they are more complicated.

Using Atomic Operations For The Reference Count

In practice,atomic_t would usually be used for refcnt. There are anumber of atomic operations defined ininclude/asm/atomic.h: theseare guaranteed to be seen atomically from all CPUs in the system, so nolock is required. In this case, it is simpler than using spinlocks,although for anything non-trivial using spinlocks is clearer. Theatomic_inc() and atomic_dec_and_test()are used instead of the standard increment and decrement operators, andthe lock is no longer used to protect the reference count itself.

--- cache.c.refcnt  2003-12-09 15:00:35.000000000 +1100+++ cache.c.refcnt-atomic   2003-12-11 15:49:42.000000000 +1100@@ -7,7 +7,7 @@ struct object {         struct list_head list;-        unsigned int refcnt;+        atomic_t refcnt;         int id;         char name[32];         int popularity;@@ -18,33 +18,15 @@ static unsigned int cache_num = 0; #define MAX_CACHE_SIZE 10-static void __object_put(struct object *obj)-{-        if (--obj->refcnt == 0)-                kfree(obj);-}--static void __object_get(struct object *obj)-{-        obj->refcnt++;-}- void object_put(struct object *obj) {-        unsigned long flags;--        spin_lock_irqsave(&cache_lock, flags);-        __object_put(obj);-        spin_unlock_irqrestore(&cache_lock, flags);+        if (atomic_dec_and_test(&obj->refcnt))+                kfree(obj); } void object_get(struct object *obj) {-        unsigned long flags;--        spin_lock_irqsave(&cache_lock, flags);-        __object_get(obj);-        spin_unlock_irqrestore(&cache_lock, flags);+        atomic_inc(&obj->refcnt); } /* Must be holding cache_lock */@@ -65,7 +47,7 @@ {         BUG_ON(!obj);         list_del(&obj->list);-        __object_put(obj);+        object_put(obj);         cache_num--; }@@ -94,7 +76,7 @@         strscpy(obj->name, name, sizeof(obj->name));         obj->id = id;         obj->popularity = 0;-        obj->refcnt = 1; /* The cache holds a reference */+        atomic_set(&obj->refcnt, 1); /* The cache holds a reference */         spin_lock_irqsave(&cache_lock, flags);         __cache_add(obj);@@ -119,7 +101,7 @@         spin_lock_irqsave(&cache_lock, flags);         obj = __cache_find(id);         if (obj)-                __object_get(obj);+                object_get(obj);         spin_unlock_irqrestore(&cache_lock, flags);         return obj; }

Protecting The Objects Themselves

In these examples, we assumed that the objects (except the referencecounts) never changed once they are created. If we wanted to allow thename to change, there are three possibilities:

  • You can makecache_lock non-static, and tell people to grab thatlock before changing the name in any object.
  • You can provide a cache_obj_rename() which grabs thislock and changes the name for the caller, and tell everyone to usethat function.
  • You can make thecache_lock protect only the cache itself, anduse another lock to protect the name.

Theoretically, you can make the locks as fine-grained as one lock forevery field, for every object. In practice, the most common variantsare:

  • One lock which protects the infrastructure (thecache list inthis example) and all the objects. This is what we have done so far.
  • One lock which protects the infrastructure (including the listpointers inside the objects), and one lock inside the object whichprotects the rest of that object.
  • Multiple locks to protect the infrastructure (eg. one lock per hashchain), possibly with a separate per-object lock.

Here is the “lock-per-object” implementation:

--- cache.c.refcnt-atomic   2003-12-11 15:50:54.000000000 +1100+++ cache.c.perobjectlock   2003-12-11 17:15:03.000000000 +1100@@ -6,11 +6,17 @@ struct object {+        /* These two protected by cache_lock. */         struct list_head list;+        int popularity;+         atomic_t refcnt;++        /* Doesn't change once created. */         int id;++        spinlock_t lock; /* Protects the name */         char name[32];-        int popularity; }; static DEFINE_SPINLOCK(cache_lock);@@ -77,6 +84,7 @@         obj->id = id;         obj->popularity = 0;         atomic_set(&obj->refcnt, 1); /* The cache holds a reference */+        spin_lock_init(&obj->lock);         spin_lock_irqsave(&cache_lock, flags);         __cache_add(obj);

Note that I decide that the popularity count should be protected by thecache_lock rather than the per-object lock: this is because it (likethestructlist_head inside the object)is logically part of the infrastructure. This way, I don’t need to grabthe lock of every object in __cache_add() when seekingthe least popular.

I also decided that the id member is unchangeable, so I don’t need tograb each object lock in __cache_find() to examine theid: the object lock is only used by a caller who wants to read or writethe name field.

Note also that I added a comment describing what data was protected bywhich locks. This is extremely important, as it describes the runtimebehavior of the code, and can be hard to gain from just reading. And asAlan Cox says, “Lock data, not code”.

Common Problems

Deadlock: Simple and Advanced

There is a coding bug where a piece of code tries to grab a spinlocktwice: it will spin forever, waiting for the lock to be released(spinlocks, rwlocks and mutexes are not recursive in Linux). This istrivial to diagnose: not astay-up-five-nights-talk-to-fluffy-code-bunnies kind of problem.

For a slightly more complex case, imagine you have a region shared by asoftirq and user context. If you use a spin_lock() callto protect it, it is possible that the user context will be interruptedby the softirq while it holds the lock, and the softirq will then spinforever trying to get the same lock.

Both of these are called deadlock, and as shown above, it can occur evenwith a single CPU (although not on UP compiles, since spinlocks vanishon kernel compiles withCONFIG_SMP=n. You’ll still get datacorruption in the second example).

This complete lockup is easy to diagnose: on SMP boxes the watchdogtimer or compiling withDEBUG_SPINLOCK set(include/linux/spinlock.h) will show this up immediately when ithappens.

A more complex problem is the so-called ‘deadly embrace’, involving twoor more locks. Say you have a hash table: each entry in the table is aspinlock, and a chain of hashed objects. Inside a softirq handler, yousometimes want to alter an object from one place in the hash to another:you grab the spinlock of the old hash chain and the spinlock of the newhash chain, and delete the object from the old one, and insert it in thenew one.

There are two problems here. First, if your code ever tries to move theobject to the same chain, it will deadlock with itself as it tries tolock it twice. Secondly, if the same softirq on another CPU is trying tomove another object in the reverse direction, the following couldhappen:

CPU 1CPU 2
Grab lock A -> OKGrab lock B -> OK
Grab lock B -> spinGrab lock A -> spin

Table: Consequences

The two CPUs will spin forever, waiting for the other to give up theirlock. It will look, smell, and feel like a crash.

Preventing Deadlock

Textbooks will tell you that if you always lock in the same order, youwill never get this kind of deadlock. Practice will tell you that thisapproach doesn’t scale: when I create a new lock, I don’t understandenough of the kernel to figure out where in the 5000 lock hierarchy itwill fit.

The best locks are encapsulated: they never get exposed in headers, andare never held around calls to non-trivial functions outside the samefile. You can read through this code and see that it will neverdeadlock, because it never tries to grab another lock while it has thatone. People using your code don’t even need to know you are using alock.

A classic problem here is when you provide callbacks or hooks: if youcall these with the lock held, you risk simple deadlock, or a deadlyembrace (who knows what the callback will do?). Remember, the otherprogrammers are out to get you, so don’t do this.

Overzealous Prevention Of Deadlocks

Deadlocks are problematic, but not as bad as data corruption. Code whichgrabs a read lock, searches a list, fails to find what it wants, dropsthe read lock, grabs a write lock and inserts the object has a racecondition.

If you don’t see why, please stay the fuck away from my code.

Racing Timers: A Kernel Pastime

Timers can produce their own special problems with races. Consider acollection of objects (list, hash, etc) where each object has a timerwhich is due to destroy it.

If you want to destroy the entire collection (say on module removal),you might do the following:

/* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE   HUNGARIAN NOTATION */spin_lock_bh(&list_lock);while (list) {        struct foo *next = list->next;        del_timer(&list->timer);        kfree(list);        list = next;}spin_unlock_bh(&list_lock);

Sooner or later, this will crash on SMP, because a timer can have justgone off before the spin_lock_bh(), and it will only getthe lock after we spin_unlock_bh(), and then try to freethe element (which has already been freed!).

This can be avoided by checking the result ofdel_timer(): if it returns 1, the timer has been deleted.If 0, it means (in this case) that it is currently running, so we cando:

retry:        spin_lock_bh(&list_lock);        while (list) {                struct foo *next = list->next;                if (!del_timer(&list->timer)) {                        /* Give timer a chance to delete this */                        spin_unlock_bh(&list_lock);                        goto retry;                }                kfree(list);                list = next;        }        spin_unlock_bh(&list_lock);

Another common problem is deleting timers which restart themselves (bycallingadd_timer() at the end of their timer function).Because this is a fairly common case which is prone to races, you shouldusedel_timer_sync() (include/linux/timer.h) tohandle this case. It returns the number of times the timer had to bedeleted before we finally stopped it from adding itself back in.

Locking Speed

There are three main things to worry about when considering speed ofsome code which does locking. First is concurrency: how many things aregoing to be waiting while someone else is holding a lock. Second is thetime taken to actually acquire and release an uncontended lock. Third isusing fewer, or smarter locks. I’m assuming that the lock is used fairlyoften: otherwise, you wouldn’t be concerned about efficiency.

Concurrency depends on how long the lock is usually held: you shouldhold the lock for as long as needed, but no longer. In the cacheexample, we always create the object without the lock held, and thengrab the lock only when we are ready to insert it in the list.

Acquisition times depend on how much damage the lock operations do tothe pipeline (pipeline stalls) and how likely it is that this CPU wasthe last one to grab the lock (ie. is the lock cache-hot for this CPU):on a machine with more CPUs, this likelihood drops fast. Consider a700MHz Intel Pentium III: an instruction takes about 0.7ns, an atomicincrement takes about 58ns, a lock which is cache-hot on this CPU takes160ns, and a cacheline transfer from another CPU takes an additional 170to 360ns. (These figures from Paul McKenney’sLinux Journal RCUarticle).

These two aims conflict: holding a lock for a short time might be doneby splitting locks into parts (such as in our final per-object-lockexample), but this increases the number of lock acquisitions, and theresults are often slower than having a single lock. This is anotherreason to advocate locking simplicity.

The third concern is addressed below: there are some methods to reducethe amount of locking which needs to be done.

Read/Write Lock Variants

Both spinlocks and mutexes have read/write variants:rwlock_t andstructrw_semaphore. These divideusers into two classes: the readers and the writers. If you are onlyreading the data, you can get a read lock, but to write to the data youneed the write lock. Many people can hold a read lock, but a writer mustbe sole holder.

If your code divides neatly along reader/writer lines (as our cache codedoes), and the lock is held by readers for significant lengths of time,using these locks can help. They are slightly slower than the normallocks though, so in practicerwlock_t is not usually worthwhile.

Avoiding Locks: Read Copy Update

There is a special method of read/write locking called Read Copy Update.Using RCU, the readers can avoid taking a lock altogether: as we expectour cache to be read more often than updated (otherwise the cache is awaste of time), it is a candidate for this optimization.

How do we get rid of read locks? Getting rid of read locks means thatwriters may be changing the list underneath the readers. That isactually quite simple: we can read a linked list while an element isbeing added if the writer adds the element very carefully. For example,addingnew to a single linked list calledlist:

new->next = list->next;wmb();list->next = new;

The wmb() is a write memory barrier. It ensures that thefirst operation (setting the new element’snext pointer) is completeand will be seen by all CPUs, before the second operation is (puttingthe new element into the list). This is important, since moderncompilers and modern CPUs can both reorder instructions unless toldotherwise: we want a reader to either not see the new element at all, orsee the new element with thenext pointer correctly pointing at therest of the list.

Fortunately, there is a function to do this for standardstructlist_head lists:list_add_rcu() (include/linux/list.h).

Removing an element from the list is even simpler: we replace thepointer to the old element with a pointer to its successor, and readerswill either see it, or skip over it.

list->next = old->next;

There islist_del_rcu() (include/linux/list.h) whichdoes this (the normal version poisons the old object, which we don’twant).

The reader must also be careful: some CPUs can look through thenextpointer to start reading the contents of the next element early, butdon’t realize that the pre-fetched contents is wrong when thenextpointer changes underneath them. Once again, there is alist_for_each_entry_rcu() (include/linux/list.h)to help you. Of course, writers can just uselist_for_each_entry(), since there cannot be twosimultaneous writers.

Our final dilemma is this: when can we actually destroy the removedelement? Remember, a reader might be stepping through this element inthe list right now: if we free this element and thenext pointerchanges, the reader will jump off into garbage and crash. We need towait until we know that all the readers who were traversing the listwhen we deleted the element are finished. We usecall_rcu() to register a callback which will actuallydestroy the object once all pre-existing readers are finished.Alternatively,synchronize_rcu() may be used to blockuntil all pre-existing are finished.

But how does Read Copy Update know when the readers are finished? Themethod is this: firstly, the readers always traverse the list insidercu_read_lock()/rcu_read_unlock() pairs:these simply disable preemption so the reader won’t go to sleep whilereading the list.

RCU then waits until every other CPU has slept at least once: sincereaders cannot sleep, we know that any readers which were traversing thelist during the deletion are finished, and the callback is triggered.The real Read Copy Update code is a little more optimized than this, butthis is the fundamental idea.

--- cache.c.perobjectlock   2003-12-11 17:15:03.000000000 +1100+++ cache.c.rcupdate    2003-12-11 17:55:14.000000000 +1100@@ -1,15 +1,18 @@ #include <linux/list.h> #include <linux/slab.h> #include <linux/string.h>+#include <linux/rcupdate.h> #include <linux/mutex.h> #include <asm/errno.h> struct object {-        /* These two protected by cache_lock. */+        /* This is protected by RCU */         struct list_head list;         int popularity;+        struct rcu_head rcu;+         atomic_t refcnt;         /* Doesn't change once created. */@@ -40,7 +43,7 @@ {         struct object *i;-        list_for_each_entry(i, &cache, list) {+        list_for_each_entry_rcu(i, &cache, list) {                 if (i->id == id) {                         i->popularity++;                         return i;@@ -49,19 +52,25 @@         return NULL; }+/* Final discard done once we know no readers are looking. */+static void cache_delete_rcu(void *arg)+{+        object_put(arg);+}+ /* Must be holding cache_lock */ static void __cache_delete(struct object *obj) {         BUG_ON(!obj);-        list_del(&obj->list);-        object_put(obj);+        list_del_rcu(&obj->list);         cache_num--;+        call_rcu(&obj->rcu, cache_delete_rcu); } /* Must be holding cache_lock */ static void __cache_add(struct object *obj) {-        list_add(&obj->list, &cache);+        list_add_rcu(&obj->list, &cache);         if (++cache_num > MAX_CACHE_SIZE) {                 struct object *i, *outcast = NULL;                 list_for_each_entry(i, &cache, list) {@@ -104,12 +114,11 @@ struct object *cache_find(int id) {         struct object *obj;-        unsigned long flags;-        spin_lock_irqsave(&cache_lock, flags);+        rcu_read_lock();         obj = __cache_find(id);         if (obj)                 object_get(obj);-        spin_unlock_irqrestore(&cache_lock, flags);+        rcu_read_unlock();         return obj; }

Note that the reader will alter the popularity member in__cache_find(), and now it doesn’t hold a lock. Onesolution would be to make it anatomic_t, but for this usage, wedon’t really care about races: an approximate result is good enough, soI didn’t change it.

The result is that cache_find() requires nosynchronization with any other functions, so is almost as fast on SMP asit would be on UP.

There is a further optimization possible here: remember our originalcache code, where there were no reference counts and the caller simplyheld the lock whenever using the object? This is still possible: if youhold the lock, no one can delete the object, so you don’t need to getand put the reference count.

Now, because the ‘read lock’ in RCU is simply disabling preemption, acaller which always has preemption disabled between callingcache_find() and object_put() does notneed to actually get and put the reference count: we could expose__cache_find() by making it non-static, and suchcallers could simply call that.

The benefit here is that the reference count is not written to: theobject is not altered in any way, which is much faster on SMP machinesdue to caching.

Per-CPU Data

Another technique for avoiding locking which is used fairly widely is toduplicate information for each CPU. For example, if you wanted to keep acount of a common condition, you could use a spin lock and a singlecounter. Nice and simple.

If that was too slow (it’s usually not, but if you’ve got a really bigmachine to test on and can show that it is), you could instead use acounter for each CPU, then none of them need an exclusive lock. SeeDEFINE_PER_CPU(), get_cpu_var() andput_cpu_var() (include/linux/percpu.h).

Of particular use for simple per-cpu counters is thelocal_t type,and the cpu_local_inc() and related functions, which aremore efficient than simple code on some architectures(include/asm/local.h).

Note that there is no simple, reliable way of getting an exact value ofsuch a counter, without introducing more locks. This is not a problemfor some uses.

Data Which Mostly Used By An IRQ Handler

If data is always accessed from within the same IRQ handler, you don’tneed a lock at all: the kernel already guarantees that the irq handlerwill not run simultaneously on multiple CPUs.

Manfred Spraul points out that you can still do this, even if the datais very occasionally accessed in user context or softirqs/tasklets. Theirq handler doesn’t use a lock, and all other accesses are done as so:

spin_lock(&lock);disable_irq(irq);...enable_irq(irq);spin_unlock(&lock);

Thedisable_irq() prevents the irq handler from running(and waits for it to finish if it’s currently running on other CPUs).The spinlock prevents any other accesses happening at the same time.Naturally, this is slower than just a spin_lock_irq()call, so it only makes sense if this type of access happens extremelyrarely.

What Functions Are Safe To Call From Interrupts?

Many functions in the kernel sleep (ie. call schedule()) directly orindirectly: you can never call them while holding a spinlock, or withpreemption disabled. This also means you need to be in user context:calling them from an interrupt is illegal.

Some Functions Which Sleep

The most common ones are listed below, but you usually have to read thecode to find out if other calls are safe. If everyone else who calls itcan sleep, you probably need to be able to sleep, too. In particular,registration and deregistration functions usually expect to be calledfrom user context, and can sleep.

Some Functions Which Don’t Sleep

Some functions are safe to call from any context, or holding almost anylock.

Mutex API reference

mutex_init(mutex)

initialize the mutex

Parameters

mutex
the mutex to be initialized

Description

Initialize the mutex to unlocked state.

It is not allowed to initialize an already locked mutex.

boolmutex_is_locked(struct mutex * lock)

is the mutex locked

Parameters

structmutex*lock
the mutex to be queried

Description

Returns true if the mutex is locked, false if unlocked.

enum mutex_trylock_recursive_enummutex_trylock_recursive(struct mutex * lock)

trylock variant that allows recursive locking

Parameters

structmutex*lock
mutex to be locked

Description

This function should not be used, _ever_. It is purely for hysterical GEMraisins, and once those are gone this will be removed.

Return

  • MUTEX_TRYLOCK_FAILED - trylock failed,
  • MUTEX_TRYLOCK_SUCCESS - lock acquired,
  • MUTEX_TRYLOCK_RECURSIVE - we already owned the lock.
voidmutex_lock(struct mutex * lock)

acquire the mutex

Parameters

structmutex*lock
the mutex to be acquired

Description

Lock the mutex exclusively for this task. If the mutex is notavailable right now, it will sleep until it can get it.

The mutex must later on be released by the same task thatacquired it. Recursive locking is not allowed. The taskmay not exit without first unlocking the mutex. Also, kernelmemory where the mutex resides must not be freed withthe mutex still locked. The mutex must first be initialized(or statically defined) before it can be locked.memset()-ingthe mutex to 0 is not allowed.

(The CONFIG_DEBUG_MUTEXES .config option turns on debuggingchecks that will enforce the restrictions and will also dodeadlock debugging)

This function is similar to (but not equivalent to) down().

voidmutex_unlock(struct mutex * lock)

release the mutex

Parameters

structmutex*lock
the mutex to be released

Description

Unlock a mutex that has been locked by this task previously.

This function must not be used in interrupt context. Unlockingof a not locked mutex is not allowed.

This function is similar to (but not equivalent to) up().

voidww_mutex_unlock(struct ww_mutex * lock)

release the w/w mutex

Parameters

structww_mutex*lock
the mutex to be released

Description

Unlock a mutex that has been locked by this task previously with any of theww_mutex_lock* functions (with or without an acquire context). It isforbidden to release the locks after releasing the acquire context.

This function must not be used in interrupt context. Unlockingof a unlocked mutex is not allowed.

intmutex_lock_interruptible(struct mutex * lock)

Acquire the mutex, interruptible by signals.

Parameters

structmutex*lock
The mutex to be acquired.

Description

Lock the mutex likemutex_lock(). If a signal is delivered while theprocess is sleeping, this function will return without acquiring themutex.

Context

Process context.

Return

0 if the lock was successfully acquired or-EINTR if asignal arrived.

intmutex_lock_killable(struct mutex * lock)

Acquire the mutex, interruptible by fatal signals.

Parameters

structmutex*lock
The mutex to be acquired.

Description

Lock the mutex likemutex_lock(). If a signal which will be fatal tothe current process is delivered while the process is sleeping, thisfunction will return without acquiring the mutex.

Context

Process context.

Return

0 if the lock was successfully acquired or-EINTR if afatal signal arrived.

voidmutex_lock_io(struct mutex * lock)

Acquire the mutex and mark the process as waiting for I/O

Parameters

structmutex*lock
The mutex to be acquired.

Description

Lock the mutex likemutex_lock(). While the task is waiting for thismutex, it will be accounted as being in the IO wait state by thescheduler.

Context

Process context.

intmutex_trylock(struct mutex * lock)

try to acquire the mutex, without waiting

Parameters

structmutex*lock
the mutex to be acquired

Description

Try to acquire the mutex atomically. Returns 1 if the mutexhas been acquired successfully, and 0 on contention.

This function must not be used in interrupt context. Themutex must be released by the same task that acquired it.

NOTE

this function follows the spin_trylock() convention, soit is negated from the down_trylock() return values! Be carefulabout this when converting semaphore users to mutexes.

intatomic_dec_and_mutex_lock(atomic_t * cnt, struct mutex * lock)

return holding mutex if we dec to 0

Parameters

atomic_t*cnt
the atomic which we are to dec
structmutex*lock
the mutex to return holding if we dec to 0

Description

return true and hold lock if we dec to 0, return false otherwise

Futex API reference

structfutex_q

The hashed futex queue entry, one per waiting task

Definition

struct futex_q {  struct plist_node list;  struct task_struct *task;  spinlock_t *lock_ptr;  union futex_key key;  struct futex_pi_state *pi_state;  struct rt_mutex_waiter *rt_waiter;  union futex_key *requeue_pi_key;  u32 bitset;};

Members

list
priority-sorted list of tasks waiting on this futex
task
the task waiting on the futex
lock_ptr
the hash bucket lock
key
the key the futex is hashed on
pi_state
optional priority inheritance state
rt_waiter
rt_waiter storage for use with requeue_pi
requeue_pi_key
the requeue_pi target futex key
bitset
bitset for the optional bitmasked wakeup

Description

We use this hashed waitqueue, instead of a normal wait_queue_entry_t, sowe can wake only the relevant ones (hashed queues may be shared).

A futex_q has a woken state, just like tasks have TASK_RUNNING.It is considered woken when plist_node_empty(q->list) || q->lock_ptr == 0.The order of wakeup is always to make the first condition true, thenthe second.

PI futexes are typically woken before they are removed from the hash list viathe rt_mutex code. See unqueue_me_pi().

struct futex_hash_bucket *hash_futex(union futex_key * key)

Return the hash bucket in the global hash

Parameters

unionfutex_key*key
Pointer to the futex key for which the hash is calculated

Description

We hash on the keys returned from get_futex_key (see below) and return thecorresponding hash bucket in the global hash.

intmatch_futex(union futex_key * key1, union futex_key * key2)

Check whether two futex keys are equal

Parameters

unionfutex_key*key1
Pointer to key1
unionfutex_key*key2
Pointer to key2

Description

Return 1 if two futex_keys are equal, 0 otherwise.

structhrtimer_sleeper *futex_setup_timer(ktime_t * time, structhrtimer_sleeper * timeout, int flags, u64 range_ns)

set up the sleeping hrtimer.

Parameters

ktime_t*time
ptr to the given timeout value
structhrtimer_sleeper*timeout
the hrtimer_sleeper structure to be set up
intflags
futex flags
u64range_ns
optional range in ns

Return

Initialized hrtimer_sleeper structure or NULL if no timeout
value given
intget_futex_key(u32 __user * uaddr, int fshared, union futex_key * key, enum futex_access rw)

Get parameters which are the keys for a futex

Parameters

u32__user*uaddr
virtual address of the futex
intfshared
0 for a PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
unionfutex_key*key
address where result is stored.
enumfutex_accessrw
mapping needs to be read/write (values: FUTEX_READ,FUTEX_WRITE)

Return

a negative error code or 0

Description

The key words are stored inkey on success.

For shared mappings (whenfshared), the key is:

( inode->i_sequence, page->index, offset_within_page )

[ also see get_inode_sequence_number() ]

For private mappings (or when!fshared), the key is:

( current->mm, address, 0 )

This allows (cross process, where applicable) identification of the futexwithout keeping the page pinned for the duration of the FUTEX_WAIT.

lock_page() might sleep, the caller should not hold a spinlock.

intfault_in_user_writeable(u32 __user * uaddr)

Fault in user address and verify RW access

Parameters

u32__user*uaddr
pointer to faulting user space address

Description

Slow path to fixup the fault we just took in the atomic writeaccess touaddr.

We have no generic implementation of a non-destructive write to theuser address. We know that we faulted in the atomic pagefaultdisabled section so we can as well avoid the #PF overhead bycalling get_user_pages() right away.

structfutex_q *futex_top_waiter(struct futex_hash_bucket * hb, union futex_key * key)

Return the highest priority waiter on a futex

Parameters

structfutex_hash_bucket*hb
the hash bucket the futex_q’s reside in
unionfutex_key*key
the futex key (to distinguish it from other futex futex_q’s)

Description

Must be called with the hb lock held.

voidwait_for_owner_exiting(int ret, struct task_struct * exiting)

Block until the owner has exited

Parameters

intret
owner’s current futex lock status
structtask_struct*exiting
Pointer to the exiting task

Description

Caller must hold a refcount onexiting.

intfutex_lock_pi_atomic(u32 __user * uaddr, struct futex_hash_bucket * hb, union futex_key * key, struct futex_pi_state ** ps, struct task_struct * task, struct task_struct ** exiting, int set_waiters)

Atomic work required to acquire a pi aware futex

Parameters

u32__user*uaddr
the pi futex user address
structfutex_hash_bucket*hb
the pi futex hash bucket
unionfutex_key*key
the futex key associated with uaddr and hb
structfutex_pi_state**ps
the pi_state pointer where we store the result of thelookup
structtask_struct*task
the task to perform the atomic lock work for. This willbe “current” except in the case of requeue pi.
structtask_struct**exiting
Pointer to store the task pointer of the owner taskwhich is in the middle of exiting
intset_waiters
force setting the FUTEX_WAITERS bit (1) or not (0)

Return

  • 0 - ready to wait;
  • 1 - acquired the lock;
  • <0 - error

Description

The hb->lock and futex_key refs shall be held by the caller.

exiting is only set when the return value is -EBUSY. If so, this holdsa refcount on the exiting task on return and the caller needs to drop itafter waiting for the exit to complete.

void__unqueue_futex(structfutex_q * q)

Remove the futex_q from its futex_hash_bucket

Parameters

structfutex_q*q
The futex_q to unqueue

Description

The q->lock_ptr must not be NULL and must be held by the caller.

voidrequeue_futex(structfutex_q * q, struct futex_hash_bucket * hb1, struct futex_hash_bucket * hb2, union futex_key * key2)

Requeue a futex_q from one hb to another

Parameters

structfutex_q*q
the futex_q to requeue
structfutex_hash_bucket*hb1
the source hash_bucket
structfutex_hash_bucket*hb2
the target hash_bucket
unionfutex_key*key2
the new key for the requeued futex_q
voidrequeue_pi_wake_futex(structfutex_q * q, union futex_key * key, struct futex_hash_bucket * hb)

Wake a task that acquired the lock during requeue

Parameters

structfutex_q*q
the futex_q
unionfutex_key*key
the key of the requeue target futex
structfutex_hash_bucket*hb
the hash_bucket of the requeue target futex

Description

During futex_requeue, with requeue_pi=1, it is possible to acquire thetarget futex if it is uncontended or via a lock steal. Set the futex_q keyto the requeue target futex so the waiter can detect the wakeup on the rightfutex, but remove it from the hb and NULL the rt_waiter so it can detectatomic lock acquisition. Set the q->lock_ptr to the requeue target hb->lockto protect access to the pi_state to fixup the owner later. Must be calledwith both q->lock_ptr and hb->lock held.

intfutex_proxy_trylock_atomic(u32 __user * pifutex, struct futex_hash_bucket * hb1, struct futex_hash_bucket * hb2, union futex_key * key1, union futex_key * key2, struct futex_pi_state ** ps, struct task_struct ** exiting, int set_waiters)

Attempt an atomic lock for the top waiter

Parameters

u32__user*pifutex
the user address of the to futex
structfutex_hash_bucket*hb1
the from futex hash bucket, must be locked by the caller
structfutex_hash_bucket*hb2
the to futex hash bucket, must be locked by the caller
unionfutex_key*key1
the from futex key
unionfutex_key*key2
the to futex key
structfutex_pi_state**ps
address to store the pi_state pointer
structtask_struct**exiting
Pointer to store the task pointer of the owner taskwhich is in the middle of exiting
intset_waiters
force setting the FUTEX_WAITERS bit (1) or not (0)

Description

Try and get the lock on behalf of the top waiter if we can do it atomically.Wake the top waiter if we succeed. If the caller specified set_waiters,then directfutex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit.hb1 and hb2 must be held by the caller.

exiting is only set when the return value is -EBUSY. If so, this holdsa refcount on the exiting task on return and the caller needs to drop itafter waiting for the exit to complete.

Return

  • 0 - failed to acquire the lock atomically;
  • >0 - acquired the lock, return value is vpid of the top_waiter
  • <0 - error
intfutex_requeue(u32 __user * uaddr1, unsigned int flags, u32 __user * uaddr2, int nr_wake, int nr_requeue, u32 * cmpval, int requeue_pi)

Requeue waiters from uaddr1 to uaddr2

Parameters

u32__user*uaddr1
source futex user address
unsignedintflags
futex flags (FLAGS_SHARED, etc.)
u32__user*uaddr2
target futex user address
intnr_wake
number of waiters to wake (must be 1 for requeue_pi)
intnr_requeue
number of waiters to requeue (0-INT_MAX)
u32*cmpval
uaddr1 expected value (orNULL)
intrequeue_pi
if we are attempting to requeue from a non-pi futex to api futex (pi to pi requeue is not supported)

Description

Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquireuaddr2 atomically on behalf of the top waiter.

Return

  • >=0 - on success, the number of tasks requeued or woken;
  • <0 - on error
voidqueue_me(structfutex_q * q, struct futex_hash_bucket * hb)

Enqueue the futex_q on the futex_hash_bucket

Parameters

structfutex_q*q
The futex_q to enqueue
structfutex_hash_bucket*hb
The destination hash bucket

Description

The hb->lock must be held by the caller, and is released here. A call toqueue_me() is typically paired with exactly one call tounqueue_me(). Theexceptions involve the PI related operations, which may use unqueue_me_pi()or nothing if the unqueue is done as part of the wake process and the unqueuestate is implicit in the state of woken task (seefutex_wait_requeue_pi() foran example).

intunqueue_me(structfutex_q * q)

Remove the futex_q from its futex_hash_bucket

Parameters

structfutex_q*q
The futex_q to unqueue

Description

The q->lock_ptr must not be held by the caller. A call tounqueue_me() mustbe paired with exactly one earlier call toqueue_me().

Return

  • 1 - if the futex_q was still queued (and we removed unqueued it);
  • 0 - if the futex_q was already removed by the waking thread
intfixup_owner(u32 __user * uaddr, structfutex_q * q, int locked)

Post lock pi_state and corner case management

Parameters

u32__user*uaddr
user address of the futex
structfutex_q*q
futex_q (contains pi_state and access to the rt_mutex)
intlocked
if the attempt to take the rt_mutex succeeded (1) or not (0)

Description

After attempting to lock an rt_mutex, this function is called to cleanupthe pi_state owner as well as handle race conditions that may allow us toacquire the lock. Must be called with the hb lock held.

Return

  • 1 - success, lock taken;
  • 0 - success, lock not taken;
  • <0 - on error (-EFAULT)
voidfutex_wait_queue_me(struct futex_hash_bucket * hb, structfutex_q * q, structhrtimer_sleeper * timeout)

queue_me() and wait for wakeup, timeout, or signal

Parameters

structfutex_hash_bucket*hb
the futex hash bucket, must be locked by the caller
structfutex_q*q
the futex_q to queue up on
structhrtimer_sleeper*timeout
the prepared hrtimer_sleeper, or null for no timeout
intfutex_wait_setup(u32 __user * uaddr, u32 val, unsigned int flags, structfutex_q * q, struct futex_hash_bucket ** hb)

Prepare to wait on a futex

Parameters

u32__user*uaddr
the futex userspace address
u32val
the expected value
unsignedintflags
futex flags (FLAGS_SHARED, etc.)
structfutex_q*q
the associated futex_q
structfutex_hash_bucket**hb
storage for hash_bucket pointer to be returned to caller

Description

Setup the futex_q and locate the hash_bucket. Get the futex value andcompare it with the expected value. Handle atomic faults internally.Return with the hb lock held and a q.key reference on success, and unlockedwith no q.key reference on failure.

Return

  • 0 - uaddr contains val and hb has been locked;
  • <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
inthandle_early_requeue_pi_wakeup(struct futex_hash_bucket * hb, structfutex_q * q, union futex_key * key2, structhrtimer_sleeper * timeout)

Detect early wakeup on the initial futex

Parameters

structfutex_hash_bucket*hb
the hash_bucket futex_q was original enqueued on
structfutex_q*q
the futex_q woken while waiting to be requeued
unionfutex_key*key2
the futex_key of the requeue target futex
structhrtimer_sleeper*timeout
the timeout associated with the wait (NULL if none)

Description

Detect if the task was woken on the initial futex as opposed to the requeuetarget futex. If so, determine if it was a timeout or a signal that causedthe wakeup and return the appropriate error code to the caller. Must becalled with the hb lock held.

Return

  • 0 = no early wakeup detected;
  • <0 = -ETIMEDOUT or -ERESTARTNOINTR
intfutex_wait_requeue_pi(u32 __user * uaddr, unsigned int flags, u32 val, ktime_t * abs_time, u32 bitset, u32 __user * uaddr2)

Wait on uaddr and take uaddr2

Parameters

u32__user*uaddr
the futex we initially wait on (non-pi)
unsignedintflags
futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, etc.), they must bethe same type, no requeueing from private to shared, etc.
u32val
the expected value of uaddr
ktime_t*abs_time
absolute timeout
u32bitset
32 bit wakeup bitset set by userspace, defaults to all
u32__user*uaddr2
the pi futex we will take prior to returning to user-space

Description

The caller will wait on uaddr and will be requeued byfutex_requeue() touaddr2 which must be PI aware and unique from uaddr. Normal wakeup will wakeon uaddr2 and complete the acquisition of the rt_mutex prior to returning touserspace. This ensures the rt_mutex maintains an owner when it has waiters;without one, the pi logic would not know which task to boost/deboost, ifthere was a need to.

We call schedule infutex_wait_queue_me() when we enqueue and return therevia the following–1) wakeup on uaddr2 after an atomic lock acquisition byfutex_requeue()2) wakeup on uaddr2 after a requeue3) signal4) timeout

If 3, cleanup and return -ERESTARTNOINTR.

If 2, we may then block on trying to take the rt_mutex and return via:5) successful lock6) signal7) timeout8) other lock acquisition failure

If 6, return -EWOULDBLOCK (restarting the syscall would do the same).

If 4 or 7, we cleanup and return with -ETIMEDOUT.

Return

  • 0 - On success;
  • <0 - On error
longsys_set_robust_list(struct robust_list_head __user * head, size_t len)

Set the robust-futex list head of a task

Parameters

structrobust_list_head__user*head
pointer to the list-head
size_tlen
length of the list-head, as userspace expects
longsys_get_robust_list(int pid, struct robust_list_head __user *__user * head_ptr, size_t __user * len_ptr)

Get the robust-futex list head of a task

Parameters

intpid
pid of the process [zero for current task]
structrobust_list_head__user*__user*head_ptr
pointer to a list-head pointer, the kernel fills it in
size_t__user*len_ptr
pointer to a length field, the kernel fills in the header size
voidfutex_exit_recursive(struct task_struct * tsk)

Set the tasks futex state to FUTEX_STATE_DEAD

Parameters

structtask_struct*tsk
task to set the state on

Description

Set the futex exit state of the task lockless. The futex waiter codeobserves that state when a task is exiting and loops until the task hasactually finished the futex cleanup. The worst case for this is that thewaiter runs through the wait loop until the state becomes visible.

This is called from the recursive fault handling path in do_exit().

This is best effort. Either the futex exit code has run already ornot. If the OWNER_DIED bit has been set on the futex then the waiter cantake it over. If not, the problem is pushed back to user space. If thefutex exit code did not run yet, then an already queued waiter mightblock forever, but there is nothing which can be done about that.

Further reading

  • Documentation/locking/spinlocks.rst: Linus Torvalds’ spinlockingtutorial in the kernel sources.

  • Unix Systems for Modern Architectures: Symmetric Multiprocessing andCaching for Kernel Programmers:

    Curt Schimmel’s very good introduction to kernel level locking (notwritten for Linux, but nearly everything applies). The book isexpensive, but really worth every penny to understand SMP locking.[ISBN: 0201633388]

Thanks

Thanks to Telsa Gwynne for DocBooking, neatening and adding style.

Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul Mackerras,Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim Waugh, Pete Zaitcev,James Morris, Robert Love, Paul McKenney, John Ashby for proofreading,correcting, flaming, commenting.

Thanks to the cabal for having no influence on this document.

Glossary

preemption
Prior to 2.5, or whenCONFIG_PREEMPT is unset, processes in usercontext inside the kernel would not preempt each other (ie. you had thatCPU until you gave it up, except for interrupts). With the addition ofCONFIG_PREEMPT in 2.5.4, this changed: when in user context, higherpriority tasks can “cut in”: spinlocks were changed to disablepreemption, even on UP.
bh
Bottom Half: for historical reasons, functions with ‘_bh’ in them oftennow refer to any software interrupt, e.g. spin_lock_bh()blocks any software interrupt on the current CPU. Bottom halves aredeprecated, and will eventually be replaced by tasklets. Only one bottomhalf will be running at any time.
Hardware Interrupt / Hardware IRQ
Hardware interrupt request. in_irq() returns true in ahardware interrupt handler.
Interrupt Context
Not user context: processing a hardware irq or software irq. Indicatedby the in_interrupt() macro returning true.
SMP
Symmetric Multi-Processor: kernels compiled for multiple-CPU machines.(CONFIG_SMP=y).
Software Interrupt / softirq

Software interrupt handler. in_irq() returns false;in_softirq() returns true. Tasklets and softirqs bothfall into the category of ‘software interrupts’.

Strictly speaking a softirq is one of up to 32 enumerated softwareinterrupts which can run on multiple CPUs at once. Sometimes used torefer to tasklets as well (ie. all software interrupts).

tasklet
A dynamically-registrable software interrupt, which is guaranteed toonly run on one CPU at a time.
timer
A dynamically-registrable software interrupt, which is run at (or closeto) a given time. When running, it is just like a tasklet (in fact, theyare called from theTIMER_SOFTIRQ).
UP
Uni-Processor: Non-SMP. (CONFIG_SMP=n).
User Context
The kernel executing on behalf of a particular process (ie. a systemcall or trap) or kernel thread. You can tell which process with thecurrent macro.) Not to be confused with userspace. Can beinterrupted by software or hardware interrupts.
Userspace
A process executing its own code outside the kernel.