Wound/Wait Deadlock-Proof Mutex Design

Please readGeneric Mutex Subsystem first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes

GPU’s do operations that commonly involve many buffers. Those bufferscan be shared across contexts/processes, exist in different memorydomains (for example VRAM vs system memory), and so on. And withPRIME / dmabuf, they can even be shared across devices. So there area handful of situations where the driver needs to wait for buffers tobecome ready. If you think about this in terms of waiting on a buffermutex for it to become available, this presents a problem becausethere is no way to guarantee that buffers appear in a execbuf/batch inthe same order in all contexts. That is directly under control ofuserspace, and a result of the sequence of GL calls that an applicationmakes. Which results in the potential for deadlock. The problem getsmore complex when you consider that the kernel may need to migrate thebuffer(s) into VRAM before the GPU operates on the buffer(s), whichmay in turn require evicting some other buffers (and you don’t want toevict other buffers which are already queued up to the GPU), but for asimplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing withthis problem is quite simple. For each group of buffers (execbuf) that needto be locked, the caller would be assigned a unique reservation id/ticket,from a global counter. In case of deadlock while locking all the buffersassociated with a execbuf, the one with the lowest reservation ticket (i.e.the oldest task) wins, and the one with the higher reservation id (i.e. theyounger task) unlocks all of the buffers that it has already locked, and thentries again.

In the RDBMS literature, a reservation ticket is associated with a transaction.and the deadlock handling approach is called Wait-Die. The name is based onthe actions of a locking thread when it encounters an already locked mutex.If the transaction holding the lock is younger, the locking transaction waits.If the transaction holding the lock is older, the locking transaction backs offand dies. Hence Wait-Die.There is also another algorithm called Wound-Wait:If the transaction holding the lock is younger, the locking transactionwounds the transaction holding the lock, requesting it to die.If the transaction holding the lock is older, it waits for the othertransaction. Hence Wound-Wait.The two algorithms are both fair in that a transaction will eventually succeed.However, the Wound-Wait algorithm is typically stated to generate fewer backoffscompared to Wait-Die, but is, on the other hand, associated with more work thanWait-Die when recovering from a backoff. Wound-Wait is also a preemptivealgorithm in that transactions are wounded by other transactions, and thatrequires a reliable way to pick up the wounded condition and preempt therunning transaction. Note that this is not the same as process preemption. AWound-Wait transaction is considered preempted when it dies (returning-EDEADLK) following a wound.

Concepts

Compared to normal mutexes two additional concepts/objects show up in the lockinterface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a tasktrying to acquire locks doesn’t grab a new reservation id, but keeps the one itacquired when starting the lock acquisition. This ticket is stored in theacquire context. Furthermore the acquire context keeps track of debugging stateto catch w/w mutex interface abuse. An acquire context is representing atransaction.

W/w class: In contrast to normal mutexes the lock class needs to be explicit forw/w mutexes, since it is required to initialize the acquire context. The lockclass also specifies what algorithm to use, Wound-Wait or Wait-Die.

Furthermore there are three different class of w/w lock acquire functions:

  • Normal lock acquisition with a context, using ww_mutex_lock.

  • Slowpath lock acquisition on the contending lock, used by the task that justkilled its transaction after having dropped all already acquired locks.These functions have the _slow postfix.

    From a simple semantics point-of-view the _slow functions are not strictlyrequired, since simply calling the normal ww_mutex_lock functions on thecontending lock (after having dropped all other already acquired locks) willwork correctly. After all if no other ww mutex has been acquired yet there’sno deadlock potential and hence the ww_mutex_lock call will block and notprematurely return -EDEADLK. The advantage of the _slow functions is ininterface safety:

    • ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slowhas a void return type. Note that since ww mutex code needs loops/retriesanyway the __must_check doesn’t result in spurious warnings, even though thevery first lock operation can never fail.

    • When full debugging is enabled ww_mutex_lock_slow checks that all acquiredww mutex have been released (preventing deadlocks) and makes sure that weblock on the contending lock (preventing spinning through the -EDEADLKslowpath until the contended lock can be acquired).

  • Functions to only acquire a single w/w mutex, which results in the exact samesemantics as a normal mutex. This is done by calling ww_mutex_lock with a NULLcontext.

    Again this is not strictly required. But often you only want to acquire asingle lock in which case it’s pointless to set up an acquire context (and sobetter to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are alsoprovided.

Usage

The algorithm (Wait-Die vs Wound-Wait) is chosen by using eitherDEFINE_WW_CLASS() (Wound-Wait) orDEFINE_WD_CLASS() (Wait-Die)As a rough rule of thumb, use Wound-Wait iff youexpect the number of simultaneous competing transactions to be typically small,and you want to reduce the number of rollbacks.

Three different ways to acquire locks within the same w/w class. Commondefinitions for methods #1 and #2:

static DEFINE_WW_CLASS(ww_class);struct obj {      struct ww_mutex lock;      /* obj data */};struct obj_entry {      struct list_head head;      struct obj *obj;};

Method 1, using a list in execbuf->buffers that’s not allowed to be reordered.This is useful if a list of required objects is already tracked somewhere.Furthermore the lock helper can use propagate the -EALREADY return code back tothe caller as a signal that an object is twice on the list. This is useful ifthe list is constructed from userspace input and the ABI requires userspace tonot have duplicate entries (e.g. for a gpu commandbuffer submission ioctl):

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx){      struct obj *res_obj = NULL;      struct obj_entry *contended_entry = NULL;      struct obj_entry *entry;      ww_acquire_init(ctx, &ww_class);retry:      list_for_each_entry (entry, list, head) {              if (entry->obj == res_obj) {                      res_obj = NULL;                      continue;              }              ret = ww_mutex_lock(&entry->obj->lock, ctx);              if (ret < 0) {                      contended_entry = entry;                      goto err;              }      }      ww_acquire_done(ctx);      return 0;err:      list_for_each_entry_continue_reverse (entry, list, head)              ww_mutex_unlock(&entry->obj->lock);      if (res_obj)              ww_mutex_unlock(&res_obj->lock);      if (ret == -EDEADLK) {              /* we lost out in a seqno race, lock and retry.. */              ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);              res_obj = contended_entry->obj;              goto retry;      }      ww_acquire_fini(ctx);      return ret;}

Method 2, using a list in execbuf->buffers that can be reordered. Same semanticsof duplicate entry detection using -EALREADY as method 1 above. But thelist-reordering allows for a bit more idiomatic code:

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx){      struct obj_entry *entry, *entry2;      ww_acquire_init(ctx, &ww_class);      list_for_each_entry (entry, list, head) {              ret = ww_mutex_lock(&entry->obj->lock, ctx);              if (ret < 0) {                      entry2 = entry;                      list_for_each_entry_continue_reverse (entry2, list, head)                              ww_mutex_unlock(&entry2->obj->lock);                      if (ret != -EDEADLK) {                              ww_acquire_fini(ctx);                              return ret;                      }                      /* we lost out in a seqno race, lock and retry.. */                      ww_mutex_lock_slow(&entry->obj->lock, ctx);                      /*                       * Move buf to head of the list, this will point                       * buf->next to the first unlocked entry,                       * restarting the for loop.                       */                      list_del(&entry->head);                      list_add(&entry->head, list);              }      }      ww_acquire_done(ctx);      return 0;}

Unlocking works the same way for both methods #1 and #2:

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx){      struct obj_entry *entry;      list_for_each_entry (entry, list, head)              ww_mutex_unlock(&entry->obj->lock);      ww_acquire_fini(ctx);}

Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,and edges can only be changed when holding the locks of all involved nodes. w/wmutexes are a natural fit for such a case for two reasons:

  • They can handle lock-acquisition in any order which allows us to start walkinga graph from a starting point and then iteratively discovering new edges andlocking down the nodes those edges connect to.

  • Due to the -EALREADY return code signalling that a given objects is alreadyheld there’s no need for additional book-keeping to break cycles in the graphor keep track off which looks are already held (when using more than one nodeas a starting point).

Note that this approach differs in two important ways from the above methods:

  • Since the list of objects is dynamically constructed (and might very well bedifferent when retrying due to hitting the -EDEADLK die condition) there’sno need to keep any object on a persistent list when it’s not locked. We cantherefore move the list_head into the object itself.

  • On the other hand the dynamic object list construction also means that the -EALREADY returncode can’t be propagated.

Note also that methods #1 and #2 and method #3 can be combined, e.g. to first lock alist of starting nodes (passed in from userspace) using one of the abovemethods. And then lock any additional objects affected by the operations usingmethod #3 below. The backoff/retry procedure will be a bit more involved, sincewhen the dynamic locking step hits -EDEADLK we also need to unlock all theobjects acquired with the fixed list. But the w/w mutex debug checks will catchany interface misuse for these cases.

Also, method 3 can’t fail the lock acquisition step since it doesn’t return-EALREADY. Of course this would be different when using the _interruptiblevariants, but that’s outside of the scope of these examples here:

struct obj {      struct ww_mutex ww_mutex;      struct list_head locked_list;};static DEFINE_WW_CLASS(ww_class);void __unlock_objs(struct list_head *list){      struct obj *entry, *temp;      list_for_each_entry_safe (entry, temp, list, locked_list) {              /* need to do that before unlocking, since only the current lock holder is              allowed to use object */              list_del(&entry->locked_list);              ww_mutex_unlock(entry->ww_mutex)      }}void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx){      struct obj *obj;      ww_acquire_init(ctx, &ww_class);retry:      /* re-init loop start state */      loop {              /* magic code which walks over a graph and decides which objects               * to lock */              ret = ww_mutex_lock(obj->ww_mutex, ctx);              if (ret == -EALREADY) {                      /* we have that one already, get to the next object */                      continue;              }              if (ret == -EDEADLK) {                      __unlock_objs(list);                      ww_mutex_lock_slow(obj, ctx);                      list_add(&entry->locked_list, list);                      goto retry;              }              /* locked a new object, add it to the list */              list_add_tail(&entry->locked_list, list);      }      ww_acquire_done(ctx);      return 0;}void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx){      __unlock_objs(list);      ww_acquire_fini(ctx);}

Method 4: Only lock one single objects. In that case deadlock detection andprevention is obviously overkill, since with grabbing just one lock you can’tproduce a deadlock within just one class. To simplify this case the w/w mutexapi can be used with a NULL context.

Implementation Details

Design:

ww_mutex currently encapsulates astructmutex, this means no extra overhead fornormal mutex locks, which are far more common. As such there is only a smallincrease in code size if wait/wound mutexes are not used.

We maintain the following invariants for the wait list:

  1. Waiters with an acquire context are sorted by stamp order; waiterswithout an acquire context are interspersed in FIFO order.

  2. For Wait-Die, among waiters with contexts, only the first one can haveother locks acquired already (ctx->acquired > 0). Note that this waitermay come after other waiters without contexts in the list.

The Wound-Wait preemption is implemented with a lazy-preemption scheme:The wounded status of the transaction is checked only when there iscontention for a new lock and hence a true chance of deadlock. In thatsituation, if the transaction is wounded, it backs off, clears thewounded status and retries. A great benefit of implementing preemption inthis way is that the wounded transaction can identify a contending lock towait for before restarting the transaction. Just blindly restarting thetransaction would likely make the transaction end up in a situation whereit would have to back off again.

In general, not much contention is expected. The locks are typically used toserialize access to resources for devices, and optimization focus shouldtherefore be directed towards the uncontended cases.

Lockdep:

Special care has been taken to warn for as many cases of api abuseas possible. Some common api abuses will be caught withCONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

Some of the errors which will be warned about:
  • Forgetting to call ww_acquire_fini or ww_acquire_init.

  • Attempting to lock more mutexes after ww_acquire_done.

  • Attempting to lock the wrong mutex after -EDEADLK andunlocking all mutexes.

  • Attempting to lock the right mutex after -EDEADLK,before unlocking all mutexes.

  • Calling ww_mutex_lock_slow before -EDEADLK was returned.

  • Unlocking mutexes with the wrong unlock function.

  • Calling one of the ww_acquire_* twice on the same context.

  • Using a different ww_class for the mutex than for the ww_acquire_ctx.

  • Normal lockdep errors that can result in deadlocks.

Some of the lockdep errors that can result in deadlocks:
  • Calling ww_acquire_init to initialize a second ww_acquire_ctx beforehaving called ww_acquire_fini on the first.

  • ‘normal’ deadlocks that can occur.

FIXME:

Update this section once we have the TASK_DEADLOCK task state flag magicimplemented.