vlocks for Bare-Metal Mutual Exclusion¶
Voting Locks, or “vlocks” provide a simple low-level mutual exclusionmechanism, with reasonable but minimal requirements on the memorysystem.
These are intended to be used to coordinate critical activity among CPUswhich are otherwise non-coherent, in situations where the hardwareprovides no other mechanism to support this and ordinary spinlockscannot be used.
vlocks make use of the atomicity provided by the memory system forwrites to a single memory location. To arbitrate, every CPU “votes foritself”, by storing a unique number to a common memory location. Thefinal value seen in that memory location when all the votes have beencast identifies the winner.
In order to make sure that the election produces an unambiguous resultin finite time, a CPU will only enter the election in the first place ifno winner has been chosen and the election does not appear to havestarted yet.
Algorithm¶
The easiest way to explain the vlocks algorithm is with some pseudo-code:
int currently_voting[NR_CPUS] = { 0, };int last_vote = -1; /* no votes yet */bool vlock_trylock(int this_cpu){ /* signal our desire to vote */ currently_voting[this_cpu] = 1; if (last_vote != -1) { /* someone already volunteered himself */ currently_voting[this_cpu] = 0; return false; /* not ourself */ } /* let's suggest ourself */ last_vote = this_cpu; currently_voting[this_cpu] = 0; /* then wait until everyone else is done voting */ for_each_cpu(i) { while (currently_voting[i] != 0) /* wait */; } /* result */ if (last_vote == this_cpu) return true; /* we won */ return false;}bool vlock_unlock(void){ last_vote = -1;}The currently_voting[] array provides a way for the CPUs to determinewhether an election is in progress, and plays a role analogous to the“entering” array in Lamport’s bakery algorithm [1].
However, once the election has started, the underlying memory systematomicity is used to pick the winner. This avoids the need for a staticpriority rule to act as a tie-breaker, or any counters which couldoverflow.
As long as the last_vote variable is globally visible to all CPUs, itwill contain only one value that won’t change once every CPU has clearedits currently_voting flag.
Features and limitations¶
vlocks are not intended to be fair. In the contended case, it is the_last_ CPU which attempts to get the lock which will be most likelyto win.
vlocks are therefore best suited to situations where it is necessaryto pick a unique winner, but it does not matter which CPU actuallywins.
Like other similar mechanisms, vlocks will not scale well to a largenumber of CPUs.
vlocks can be cascaded in a voting hierarchy to permit better scalingif necessary, as in the following hypothetical example for 4096 CPUs:
/* first level: local election */my_town = towns[(this_cpu >> 4) & 0xf];I_won = vlock_trylock(my_town, this_cpu & 0xf);if (I_won) { /* we won the town election, let's go for the state */ my_state = states[(this_cpu >> 8) & 0xf]; I_won = vlock_lock(my_state, this_cpu & 0xf)); if (I_won) { /* and so on */ I_won = vlock_lock(the_whole_country, this_cpu & 0xf]; if (I_won) { /* ... */ } vlock_unlock(the_whole_country); } vlock_unlock(my_state);}vlock_unlock(my_town);
ARM implementation¶
The current ARM implementation [2] contains some optimisations beyondthe basic algorithm:
By packing the members of the currently_voting array close together,we can read the whole array in one transaction (providing the numberof CPUs potentially contending the lock is small enough). Thisreduces the number of round-trips required to external memory.
In the ARM implementation, this means that we can use a single loadand comparison:
LDR Rt, [Rn]CMP Rt, #0…in place of code equivalent to:
LDRB Rt, [Rn]CMP Rt, #0LDRBEQ Rt, [Rn, #1]CMPEQ Rt, #0LDRBEQ Rt, [Rn, #2]CMPEQ Rt, #0LDRBEQ Rt, [Rn, #3]CMPEQ Rt, #0This cuts down on the fast-path latency, as well as potentiallyreducing bus contention in contended cases.
The optimisation relies on the fact that the ARM memory systemguarantees coherency between overlapping memory accesses ofdifferent sizes, similarly to many other architectures. Note thatwe do not care which element of currently_voting appears in whichbits of Rt, so there is no need to worry about endianness in thisoptimisation.
If there are too many CPUs to read the currently_voting array inone transaction then multiple transations are still required. Theimplementation uses a simple loop of word-sized loads for thiscase. The number of transactions is still fewer than would berequired if bytes were loaded individually.
In principle, we could aggregate further by using LDRD or LDM, butto keep the code simple this was not attempted in the initialimplementation.
vlocks are currently only used to coordinate between CPUs which areunable to enable their caches yet. This means that theimplementation removes many of the barriers which would be requiredwhen executing the algorithm in cached memory.
packing of the currently_voting array does not work with cachedmemory unless all CPUs contending the lock are cache-coherent, dueto cache writebacks from one CPU clobbering values written by otherCPUs. (Though if all the CPUs are cache-coherent, you should beprobably be using proper spinlocks instead anyway).
The “no votes yet” value used for the last_vote variable is 0 (not-1 as in the pseudocode). This allows statically-allocated vlocksto be implicitly initialised to an unlocked state simply by puttingthem in .bss.
An offset is added to each CPU’s ID for the purpose of setting thisvariable, so that no CPU uses the value 0 for its ID.
Colophon¶
Originally created and documented by Dave Martin for Linaro Limited, foruse in ARM-based big.LITTLE platforms, with review and input gratefullyreceived from Nicolas Pitre and Achin Gupta. Thanks to Nicolas forgrabbing most of this text out of the relevant mail thread and writingup the pseudocode.
Copyright (C) 2012-2013 Linaro LimitedDistributed under the terms of Version 2 of the GNU General PublicLicense, as defined in linux/COPYING.
References¶
- [1] Lamport, L. “A New Solution of Dijkstra’s Concurrent Programming
Problem”, Communications of the ACM 17, 8 (August 1974), 453-455.
[2] linux/arch/arm/common/vlock.S, www.kernel.org.