Sequence counters and sequential locks

Introduction

Sequence counters are a reader-writer consistency mechanism with lockless readers (read-only retry loops), and no writer starvation. They are used for data that’s rarely written to (e.g. system time), where the reader wants a consistent set of information and is willing to retry if that information changes.

A data set is consistent when the sequence count at the beginning of the read side critical section is even and the same sequence count value is read again at the end of the critical section. The data in the set must be copied out inside the read side critical section. If the sequence count has changed between the start and the end of the critical section, the reader must retry.

Writers increment the sequence count at the start and the end of their critical section. After starting the critical section the sequence count is odd and indicates to the readers that an update is in progress. At the end of the write side critical section the sequence count becomes even again, which lets readers make progress.

A sequence counter write side critical section must never be preempted or interrupted by read side sections. Otherwise the reader will spin for the entire scheduler tick due to the odd sequence count value and the interrupted writer. If that reader belongs to a real-time scheduling class, it can spin forever and the kernel will livelock.

This mechanism cannot be used if the protected data contains pointers, as the writer can invalidate a pointer that the reader is following.

Sequence counters (seqcount_t)

This is the raw counting mechanism, which does not protect against multiple writers. Write side critical sections must thus be serialized by an external lock.

If the write serialization primitive is not implicitly disabling preemption, preemption must be explicitly disabled before entering the write side section. If the read section can be invoked from hardirq or softirq contexts, interrupts or bottom halves must also be respectively disabled before entering the write section.

If it’s desired to automatically handle the sequence counter requirements of writer serialization and non-preemptibility, use Sequential locks (seqlock_t) instead.

Initialization:

/* dynamic */
seqcount_t foo_seqcount;
seqcount_init(&foo_seqcount);

/* static */
static seqcount_t foo_seqcount = SEQCNT_ZERO(foo_seqcount);

/* C99 struct init */
struct {
        .seq   = SEQCNT_ZERO(foo.seq),
} foo;

Write path:

/* Serialized context with disabled preemption */

write_seqcount_begin(&foo_seqcount);

/* ... [[write-side critical section]] ... */

write_seqcount_end(&foo_seqcount);

Read path:

do {
        seq = read_seqcount_begin(&foo_seqcount);

        /* ... [[read-side critical section]] ... */

} while (read_seqcount_retry(&foo_seqcount, seq));

Sequence counters with associated locks (seqcount_LOCKTYPE_t)

As discussed at Sequence counters (seqcount_t), sequence count write side critical sections must be serialized and non-preemptible. This variant of sequence counters associates the lock used for writer serialization at initialization time, which enables lockdep to validate that the write side critical sections are properly serialized.

This lock association is a NOOP if lockdep is disabled and has neither storage nor runtime overhead. If lockdep is enabled, the lock pointer is stored in struct seqcount and lockdep’s “lock is held” assertions are injected at the beginning of the write side critical section to validate that it is properly protected.

For lock types which do not implicitly disable preemption, preemption protection is enforced in the write side function.

The following sequence counters with associated locks are defined:

  • seqcount_spinlock_t
  • seqcount_raw_spinlock_t
  • seqcount_rwlock_t
  • seqcount_mutex_t
  • seqcount_ww_mutex_t

The plain seqcount read and write APIs branch out to the specific seqcount_LOCKTYPE_t implementation at compile-time. This avoids kernel API explosion per each new seqcount LOCKTYPE.

Initialization (replace “LOCKTYPE” with one of the supported locks):

/* dynamic */
seqcount_LOCKTYPE_t foo_seqcount;
seqcount_LOCKTYPE_init(&foo_seqcount, &lock);

/* static */
static seqcount_LOCKTYPE_t foo_seqcount =
        SEQCNT_LOCKTYPE_ZERO(foo_seqcount, &lock);

/* C99 struct init */
struct {
        .seq   = SEQCNT_LOCKTYPE_ZERO(foo.seq, &lock),
} foo;

Write path: same as in Sequence counters (seqcount_t), while running from a context with the associated LOCKTYPE lock acquired.

Read path: same as in Sequence counters (seqcount_t).
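As a concrete illustration of the LOCKTYPE template above, here is a sketch using the spinlock variant; the foo_lock and foo_seqcount names are hypothetical. Since spinlock_t implicitly disables preemption when acquired, holding the associated lock satisfies the write side serialization and non-preemptibility requirements:

```c
static spinlock_t foo_lock;
static seqcount_spinlock_t foo_seqcount =
        SEQCNT_SPINLOCK_ZERO(foo_seqcount, &foo_lock);

/* Write path: the associated lock must be held */
spin_lock(&foo_lock);
write_seqcount_begin(&foo_seqcount);
/* ... [[write-side critical section]] ... */
write_seqcount_end(&foo_seqcount);
spin_unlock(&foo_lock);
```

With lockdep enabled, entering write_seqcount_begin() without holding foo_lock would trigger a “lock is held” assertion.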

Sequential locks (seqlock_t)

This contains the Sequence counters (seqcount_t) mechanism earlier discussed, plus an embedded spinlock for writer serialization and non-preemptibility.

If the read side section can be invoked from hardirq or softirq context, use the write side function variants which disable interrupts or bottom halves respectively.

Initialization:

/* dynamic */
seqlock_t foo_seqlock;
seqlock_init(&foo_seqlock);

/* static */
static DEFINE_SEQLOCK(foo_seqlock);

/* C99 struct init */
struct {
        .seql   = __SEQLOCK_UNLOCKED(foo.seql)
} foo;

Write path:

write_seqlock(&foo_seqlock);

/* ... [[write-side critical section]] ... */

write_sequnlock(&foo_seqlock);

Read path, three categories:

  1. Normal Sequence readers which never block a writer but they must retry if a writer is in progress, by detecting change in the sequence number. Writers do not wait for a sequence reader:

    do {
            seq = read_seqbegin(&foo_seqlock);

            /* ... [[read-side critical section]] ... */

    } while (read_seqretry(&foo_seqlock, seq));
  2. Locking readers which will wait if a writer or another locking reader is in progress. A locking reader in progress will also block a writer from entering its critical section. This read lock is exclusive. Unlike rwlock_t, only one locking reader can acquire it:

    read_seqlock_excl(&foo_seqlock);

    /* ... [[read-side critical section]] ... */

    read_sequnlock_excl(&foo_seqlock);
  3. Conditional lockless reader (as in 1), or locking reader (as in 2), according to a passed marker. This is used to avoid lockless readers starvation (too many retry loops) in case of a sharp spike in write activity. First, a lockless read is tried (even marker passed). If that trial fails (odd sequence counter is returned, which is used as the next iteration marker), the lockless read is transformed to a full locking read and no retry loop is necessary:

    /* marker; even initialization */
    int seq = 0;

    do {
            read_seqbegin_or_lock(&foo_seqlock, &seq);

            /* ... [[read-side critical section]] ... */

    } while (need_seqretry(&foo_seqlock, seq));
    done_seqretry(&foo_seqlock, seq);

API documentation

seqcount_init(s)

runtime initializer for seqcount_t

Parameters

s
Pointer to the seqcount_t instance
SEQCNT_ZERO(name)

static initializer for seqcount_t

Parameters

name
Name of the seqcount_t instance
typedef seqcount_LOCKTYPE_t

sequence counter with LOCKTYPE associated

Description

A plain sequence counter with external writer synchronization by a spinlock. The spinlock is associated to the sequence count in the static initializer or init function. This enables lockdep to validate that the write side critical section is properly serialized.

SEQCOUNT_LOCKTYPE_ZERO(seq_name, assoc_lock)

static initializer for seqcount_LOCKTYPE_t

Parameters

seq_name
Name of the seqcount_LOCKTYPE_t instance
assoc_lock
Pointer to the associated lock
__read_seqcount_begin(s)

begin a seqcount_t read section w/o barrier

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Description

__read_seqcount_begin is like read_seqcount_begin, but has no smp_rmb() barrier. Callers should ensure that smp_rmb() or equivalent ordering is provided before actually loading any of the variables that are to be protected in this critical section.

Use carefully, only in critical code, and comment how the barrier is provided.

Return

count to be passed to read_seqcount_retry()

raw_read_seqcount_begin(s)

begin a seqcount_t read section w/o lockdep

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Return

count to be passed to read_seqcount_retry()

read_seqcount_begin(s)

begin a seqcount_t read critical section

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Return

count to be passed to read_seqcount_retry()

raw_read_seqcount(s)

read the raw seqcount_t counter value

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Description

raw_read_seqcount opens a read critical section of the given seqcount_t, without any lockdep checking, and without checking or masking the sequence counter LSB. Calling code is responsible for handling that.

Return

count to be passed to read_seqcount_retry()

raw_seqcount_begin(s)

begin a seqcount_t read critical section w/o lockdep and w/o counter stabilization

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Description

raw_seqcount_begin opens a read critical section of the given seqcount_t. Unlike read_seqcount_begin(), this function will not wait for the count to stabilize. If a writer is active when it begins, it will fail the read_seqcount_retry() at the end of the read critical section instead of stabilizing at the beginning of it.

Use this only in special kernel hot paths where the read section is small and has a high probability of success through other external means. It will save a single branching instruction.

Return

count to be passed to read_seqcount_retry()

__read_seqcount_retry(s, start)

end a seqcount_t read section w/o barrier

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants
start
count, from read_seqcount_begin()

Description

__read_seqcount_retry is like read_seqcount_retry, but has no smp_rmb() barrier. Callers should ensure that smp_rmb() or equivalent ordering is provided before actually loading any of the variables that are to be protected in this critical section.

Use carefully, only in critical code, and comment how the barrier is provided.

Return

true if a read section retry is required, else false

read_seqcount_retry(s, start)

end a seqcount_t read critical section

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants
start
count, from read_seqcount_begin()

Description

read_seqcount_retry closes the read critical section of the given seqcount_t. If the critical section was invalid, it must be ignored (and typically retried).

Return

true if a read section retry is required, else false

raw_write_seqcount_begin(s)

start a seqcount_t write section w/o lockdep

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants
raw_write_seqcount_end(s)

end a seqcount_t write section w/o lockdep

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants
write_seqcount_begin_nested(s, subclass)

start a seqcount_t write section with custom lockdep nesting level

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants
subclass
lockdep nesting level

Description

See Documentation/locking/lockdep-design.rst

write_seqcount_begin(s)

start a seqcount_t write side critical section

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Description

write_seqcount_begin opens a write side critical section of the given seqcount_t.

Context

seqcount_t write side critical sections must be serialized and non-preemptible. If readers can be invoked from hardirq or softirq context, interrupts or bottom halves must be respectively disabled.

write_seqcount_end(s)

end a seqcount_t write side critical section

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Description

The write section must’ve been opened with write_seqcount_begin().

raw_write_seqcount_barrier(s)

do a seqcount_t write barrier

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Description

This can be used to provide an ordering guarantee instead of the usual consistency guarantee. It is one wmb cheaper, because it can collapse the two back-to-back wmb()s.

Note that writes surrounding the barrier should be declared atomic (e.g. via WRITE_ONCE): a) to ensure the writes become visible to other threads atomically, avoiding compiler optimizations; b) to document which writes are meant to propagate to the reader critical section. This is necessary because neither the writes before nor the writes after the barrier are enclosed in a seq-writer critical section that would ensure readers are aware of ongoing writes:

seqcount_t seq;
bool X = true, Y = false;

void read(void)
{
        bool x, y;

        do {
                int s = read_seqcount_begin(&seq);

                x = X; y = Y;
        } while (read_seqcount_retry(&seq, s));

        BUG_ON(!x && !y);
}

void write(void)
{
        WRITE_ONCE(Y, true);

        raw_write_seqcount_barrier(&seq);

        WRITE_ONCE(X, false);
}
write_seqcount_invalidate(s)

invalidate in-progress seqcount_t read side operations

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Description

After write_seqcount_invalidate, no seqcount_t read side operations will complete successfully and see data older than this.

raw_read_seqcount_latch(s)

pick even/odd seqcount_t latch data copy

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Description

Use seqcount_t latching to switch between two storage places protected by a sequence counter. Doing so allows having interruptible, preemptible, seqcount_t write side critical sections.

Check raw_write_seqcount_latch() for more details and a full reader and writer usage example.

Return

sequence counter raw value. Use the lowest bit as an index for picking which data copy to read. The full counter value must then be checked with read_seqcount_retry().

raw_write_seqcount_latch(s)

redirect readers to even/odd copy

Parameters

s
Pointer to seqcount_t or any of the seqcount_locktype_t variants

Description

The latch technique is a multiversion concurrency control method that allows queries during non-atomic modifications. If you can guarantee queries never interrupt the modification – e.g. the concurrency is strictly between CPUs – you most likely do not need this.

Where the traditional RCU/lockless data structures rely on atomic modifications to ensure queries observe either the old or the new state, the latch allows the same for non-atomic updates. The trade-off is doubling the cost of storage; we have to maintain two copies of the entire data structure.

Very simply put: we first modify one copy and then the other. This ensures there is always one copy in a stable state, ready to give us an answer.

The basic form is a data structure like:

struct latch_struct {
        seqcount_t              seq;
        struct data_struct      data[2];
};

Where a modification, which is assumed to be externally serialized, does the following:

void latch_modify(struct latch_struct *latch, ...)
{
        smp_wmb();      // Ensure that the last data[1] update is visible
        latch->seq++;
        smp_wmb();      // Ensure that the seqcount update is visible

        modify(latch->data[0], ...);

        smp_wmb();      // Ensure that the data[0] update is visible
        latch->seq++;
        smp_wmb();      // Ensure that the seqcount update is visible

        modify(latch->data[1], ...);
}

The query will have a form like:

struct entry *latch_query(struct latch_struct *latch, ...)
{
        struct entry *entry;
        unsigned seq, idx;

        do {
                seq = raw_read_seqcount_latch(&latch->seq);

                idx = seq & 0x01;
                entry = data_query(latch->data[idx], ...);

        // read_seqcount_retry() includes needed smp_rmb()
        } while (read_seqcount_retry(&latch->seq, seq));

        return entry;
}

So during the modification, queries are first redirected to data[1]. Then we modify data[0]. When that is complete, we redirect queries back to data[0] and we can modify data[1].

NOTE

The non-requirement for atomic modifications does _NOT_ include the publishing of new entries in the case where data is a dynamic data structure.

An iteration might start in data[0] and get suspended long enough to miss an entire modification sequence; once it resumes, it might observe the new entry.

When data is a dynamic data structure, one should use regular RCU patterns to manage the lifetimes of the objects within.

seqlock_init(sl)

dynamic initializer for seqlock_t

Parameters

sl
Pointer to the seqlock_t instance
DEFINE_SEQLOCK(sl)

Define a statically allocated seqlock_t

Parameters

sl
Name of the seqlock_t instance
unsigned read_seqbegin(const seqlock_t *sl)

start a seqlock_t read side critical section

Parameters

const seqlock_t *sl
Pointer to seqlock_t

Return

count, to be passed to read_seqretry()

unsigned read_seqretry(const seqlock_t *sl, unsigned start)

end a seqlock_t read side section

Parameters

const seqlock_t *sl
Pointer to seqlock_t
unsigned start
count, from read_seqbegin()

Description

read_seqretry closes the read side critical section of the given seqlock_t. If the critical section was invalid, it must be ignored (and typically retried).

Return

true if a read section retry is required, else false

void write_seqlock(seqlock_t *sl)

start a seqlock_t write side critical section

Parameters

seqlock_t *sl
Pointer to seqlock_t

Description

write_seqlock opens a write side critical section for the given seqlock_t. It also implicitly acquires the spinlock_t embedded inside that sequential lock. All seqlock_t write side sections are thus automatically serialized and non-preemptible.

Context

if the seqlock_t read section, or other write side critical sections, can be invoked from hardirq or softirq contexts, use the _irqsave or _bh variants of this function instead.

void write_sequnlock(seqlock_t *sl)

end a seqlock_t write side critical section

Parameters

seqlock_t *sl
Pointer to seqlock_t

Description

write_sequnlock closes the (serialized and non-preemptible) write side critical section of the given seqlock_t.

void write_seqlock_bh(seqlock_t *sl)

start a softirqs-disabled seqlock_t write section

Parameters

seqlock_t *sl
Pointer to seqlock_t

Description

_bh variant of write_seqlock(). Use only if the read side section, or other write side sections, can be invoked from softirq contexts.

void write_sequnlock_bh(seqlock_t *sl)

end a softirqs-disabled seqlock_t write section

Parameters

seqlock_t *sl
Pointer to seqlock_t

Description

write_sequnlock_bh closes the serialized, non-preemptible, and softirqs-disabled seqlock_t write side critical section opened with write_seqlock_bh().

void write_seqlock_irq(seqlock_t *sl)

start a non-interruptible seqlock_t write section

Parameters

seqlock_t *sl
Pointer to seqlock_t

Description

_irq variant of write_seqlock(). Use only if the read side section, or other write sections, can be invoked from hardirq contexts.

void write_sequnlock_irq(seqlock_t *sl)

end a non-interruptible seqlock_t write section

Parameters

seqlock_t *sl
Pointer to seqlock_t

Description

write_sequnlock_irq closes the serialized and non-interruptible seqlock_t write side section opened with write_seqlock_irq().

write_seqlock_irqsave(lock, flags)

start a non-interruptible seqlock_t write section

Parameters

lock
Pointer to seqlock_t
flags
Stack-allocated storage for saving caller’s local interrupt state, to be passed to write_sequnlock_irqrestore().

Description

_irqsave variant of write_seqlock(). Use it only if the read side section, or other write sections, can be invoked from hardirq context.
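A brief sketch of the _irqsave/_irqrestore pairing; foo_seqlock is a hypothetical seqlock_t, and flags must live on the caller’s stack:

```c
unsigned long flags;

write_seqlock_irqsave(&foo_seqlock, flags);
/* ... [[write-side critical section, hardirqs disabled]] ... */
write_sequnlock_irqrestore(&foo_seqlock, flags);
```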

void write_sequnlock_irqrestore(seqlock_t *sl, unsigned long flags)

end non-interruptible seqlock_t write section

Parameters

seqlock_t *sl
Pointer to seqlock_t
unsigned long flags
Caller’s saved interrupt state, from write_seqlock_irqsave()

Description

write_sequnlock_irqrestore closes the serialized and non-interruptible seqlock_t write section previously opened with write_seqlock_irqsave().

void read_seqlock_excl(seqlock_t *sl)

begin a seqlock_t locking reader section

Parameters

seqlock_t *sl
Pointer to seqlock_t

Description

read_seqlock_excl opens a seqlock_t locking reader critical section. A locking reader exclusively locks out both other writers and other locking readers, but it does not update the embedded sequence number.

Locking readers act like a normal spin_lock()/spin_unlock().

The opened read section must be closed with read_sequnlock_excl().

Context

if the seqlock_t write section, or other read sections, can be invoked from hardirq or softirq contexts, use the _irqsave or _bh variant of this function instead.

void read_sequnlock_excl(seqlock_t *sl)

end a seqlock_t locking reader critical section

Parameters

seqlock_t *sl
Pointer to seqlock_t
void read_seqlock_excl_bh(seqlock_t *sl)

start a seqlock_t locking reader section with softirqs disabled

Parameters

seqlock_t *sl
Pointer to seqlock_t

Description

_bh variant of read_seqlock_excl(). Use this variant only if the seqlock_t write side section, or other read sections, can be invoked from softirq contexts.

void read_sequnlock_excl_bh(seqlock_t *sl)

stop a seqlock_t softirq-disabled locking reader section

Parameters

seqlock_t *sl
Pointer to seqlock_t
void read_seqlock_excl_irq(seqlock_t *sl)

start a non-interruptible seqlock_t locking reader section

Parameters

seqlock_t *sl
Pointer to seqlock_t

Description

_irq variant of read_seqlock_excl(). Use this only if the seqlock_t write side section, or other read sections, can be invoked from a hardirq context.

void read_sequnlock_excl_irq(seqlock_t *sl)

end an interrupts-disabled seqlock_t locking reader section

Parameters

seqlock_t *sl
Pointer to seqlock_t
read_seqlock_excl_irqsave(lock, flags)

start a non-interruptible seqlock_t locking reader section

Parameters

lock
Pointer to seqlock_t
flags
Stack-allocated storage for saving caller’s local interrupt state, to be passed to read_sequnlock_excl_irqrestore().

Description

_irqsave variant of read_seqlock_excl(). Use this only if the seqlock_t write side section, or other read sections, can be invoked from a hardirq context.

void read_sequnlock_excl_irqrestore(seqlock_t *sl, unsigned long flags)

end non-interruptible seqlock_t locking reader section

Parameters

seqlock_t *sl
Pointer to seqlock_t
unsigned long flags
Caller’s saved interrupt state, from read_seqlock_excl_irqsave()
void read_seqbegin_or_lock(seqlock_t *lock, int *seq)

begin a seqlock_t lockless or locking reader

Parameters

seqlock_t *lock
Pointer to seqlock_t
int *seq
Marker and return parameter. If the passed value is even, the reader will become a lockless seqlock_t reader as in read_seqbegin(). If the passed value is odd, the reader will become a locking reader as in read_seqlock_excl(). In the first call to this function, the caller must initialize and pass an even value to seq; this way, a lockless read can be optimistically tried first.

Description

read_seqbegin_or_lock is an API designed to optimistically try a normal lockless seqlock_t read section first. If an odd counter is found, the lockless read trial has failed, and the next read iteration transforms itself into a full seqlock_t locking reader.

This is typically used to avoid seqlock_t lockless readers starvation (too many retry loops) in the case of a sharp spike in write side activity.

Check Documentation/locking/seqlock.rst for template example code.

Context

if the seqlock_t write section, or other read sections, can be invoked from hardirq or softirq contexts, use the _irqsave or _bh variant of this function instead.

Return

the encountered sequence counter value, through the seq parameter, which is overloaded as a return parameter. This returned value must be checked with need_seqretry(). If the read section needs to be retried, this returned value must also be passed as the seq parameter of the next read_seqbegin_or_lock() iteration.

int need_seqretry(seqlock_t *lock, int seq)

validate seqlock_t “locking or lockless” read section

Parameters

seqlock_t *lock
Pointer to seqlock_t
int seq
sequence count, from read_seqbegin_or_lock()

Return

true if a read section retry is required, false otherwise

void done_seqretry(seqlock_t *lock, int seq)

end seqlock_t “locking or lockless” reader section

Parameters

seqlock_t *lock
Pointer to seqlock_t
int seq
count, from read_seqbegin_or_lock()

Description

done_seqretry finishes the seqlock_t read side critical section started with read_seqbegin_or_lock() and validated by need_seqretry().

unsigned long read_seqbegin_or_lock_irqsave(seqlock_t *lock, int *seq)

begin a seqlock_t lockless reader, or a non-interruptible locking reader

Parameters

seqlock_t *lock
Pointer to seqlock_t
int *seq
Marker and return parameter. Check read_seqbegin_or_lock().

Description

This is the _irqsave variant of read_seqbegin_or_lock(). Use it only if the seqlock_t write section, or other read sections, can be invoked from hardirq context.

Note

Interrupts will be disabled only for “locking reader” mode.

Return

  1. The saved local interrupts state in case of a locking reader, to be passed to done_seqretry_irqrestore().
  2. The encountered sequence counter value, returned through seq overloaded as a return parameter. Check read_seqbegin_or_lock().
void done_seqretry_irqrestore(seqlock_t *lock, int seq, unsigned long flags)

end a seqlock_t lockless reader, or a non-interruptible locking reader section

Parameters

seqlock_t *lock
Pointer to seqlock_t
int seq
Count, from read_seqbegin_or_lock_irqsave()
unsigned long flags
Caller’s saved local interrupt state in case of a locking reader, also from read_seqbegin_or_lock_irqsave()

Description

This is the _irqrestore variant of done_seqretry(). The read section must’ve been opened with read_seqbegin_or_lock_irqsave(), and validated by need_seqretry().