The Kernel Concurrency Sanitizer (KCSAN)

The Kernel Concurrency Sanitizer (KCSAN) is a dynamic race detector, whichrelies on compile-time instrumentation, and uses a watchpoint-based samplingapproach to detect races. KCSAN’s primary purpose is to detectdata races.

Usage

KCSAN requires Clang version 11 or later.

To enable KCSAN configure the kernel with:

CONFIG_KCSAN = y

KCSAN provides several other configuration options to customize behaviour (seethe respective help text inlib/Kconfig.kcsan for more info).

Error reports

A typical data race report looks like this:

==================================================================BUG: KCSAN: data-race in generic_permission / kernfs_refresh_inodewrite to 0xffff8fee4c40700c of 4 bytes by task 175 on cpu 4: kernfs_refresh_inode+0x70/0x170 kernfs_iop_permission+0x4f/0x90 inode_permission+0x190/0x200 link_path_walk.part.0+0x503/0x8e0 path_lookupat.isra.0+0x69/0x4d0 filename_lookup+0x136/0x280 user_path_at_empty+0x47/0x60 vfs_statx+0x9b/0x130 __do_sys_newlstat+0x50/0xb0 __x64_sys_newlstat+0x37/0x50 do_syscall_64+0x85/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9read to 0xffff8fee4c40700c of 4 bytes by task 166 on cpu 6: generic_permission+0x5b/0x2a0 kernfs_iop_permission+0x66/0x90 inode_permission+0x190/0x200 link_path_walk.part.0+0x503/0x8e0 path_lookupat.isra.0+0x69/0x4d0 filename_lookup+0x136/0x280 user_path_at_empty+0x47/0x60 do_faccessat+0x11a/0x390 __x64_sys_access+0x3c/0x50 do_syscall_64+0x85/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9Reported by Kernel Concurrency Sanitizer on:CPU: 6 PID: 166 Comm: systemd-journal Not tainted 5.3.0-rc7+ #1Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014==================================================================

The header of the report provides a short summary of the functions involved inthe race. It is followed by the access types and stack traces of the 2 threadsinvolved in the data race.

The other less common type of data race report looks like this:

==================================================================BUG: KCSAN: data-race in e1000_clean_rx_irq+0x551/0xb10race at unknown origin, with read to 0xffff933db8a2ae6c of 1 bytes by interrupt on cpu 0: e1000_clean_rx_irq+0x551/0xb10 e1000_clean+0x533/0xda0 net_rx_action+0x329/0x900 __do_softirq+0xdb/0x2db irq_exit+0x9b/0xa0 do_IRQ+0x9c/0xf0 ret_from_intr+0x0/0x18 default_idle+0x3f/0x220 arch_cpu_idle+0x21/0x30 do_idle+0x1df/0x230 cpu_startup_entry+0x14/0x20 rest_init+0xc5/0xcb arch_call_rest_init+0x13/0x2b start_kernel+0x6db/0x700Reported by Kernel Concurrency Sanitizer on:CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-rc7+ #2Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014==================================================================

This report is generated where it was not possible to determine the otherracing thread, but a race was inferred due to the data value of the watchedmemory location having changed. These can occur either due to missinginstrumentation or e.g. DMA accesses. These reports will only be generated ifCONFIG_KCSAN_REPORT_RACE_UNKNOWN_ORIGIN=y (selected by default).

Selective analysis

It may be desirable to disable data race detection for specific accesses,functions, compilation units, or entire subsystems. For static blacklisting,the below options are available:

  • KCSAN understands thedata_race(expr) annotation, which tells KCSAN thatany data races due to accesses inexpr should be ignored and resultingbehaviour when encountering a data race is deemed safe.

  • Disabling data race detection for entire functions can be accomplished byusing the function attribute__no_kcsan:

    __no_kcsanvoid foo(void) {    ...

    To dynamically limit for which functions to generate reports, see theDebugFS interface blacklist/whitelist feature.

  • To disable data race detection for a particular compilation unit, add to theMakefile:

    KCSAN_SANITIZE_file.o := n
  • To disable data race detection for all compilation units listed in aMakefile, add to the respectiveMakefile:

    KCSAN_SANITIZE := n

Furthermore, it is possible to tell KCSAN to show or hide entire classes ofdata races, depending on preferences. These can be changed via the followingKconfig options:

  • CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY: If enabled and a conflicting writeis observed via a watchpoint, but the data value of the memory location wasobserved to remain unchanged, do not report the data race.
  • CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC: Assume that plain aligned writesup to word size are atomic by default. Assumes that such writes are notsubject to unsafe compiler optimizations resulting in data races. The optioncauses KCSAN to not report data races due to conflicts where the only plainaccesses are aligned writes up to word size.

DebugFS interface

The file/sys/kernel/debug/kcsan provides the following interface:

  • Reading/sys/kernel/debug/kcsan returns various runtime statistics.
  • Writingon oroff to/sys/kernel/debug/kcsan allows turning KCSANon or off, respectively.
  • Writing!some_func_name to/sys/kernel/debug/kcsan addssome_func_name to the report filter list, which (by default) blacklistsreporting data races where either one of the top stackframes are a functionin the list.
  • Writing eitherblacklist orwhitelist to/sys/kernel/debug/kcsanchanges the report filtering behaviour. For example, the blacklist featurecan be used to silence frequently occurring data races; the whitelist featurecan help with reproduction and testing of fixes.

Tuning performance

Core parameters that affect KCSAN’s overall performance and bug detectionability are exposed as kernel command-line arguments whose defaults can also bechanged via the corresponding Kconfig options.

  • kcsan.skip_watch (CONFIG_KCSAN_SKIP_WATCH): Number of per-CPU memoryoperations to skip, before another watchpoint is set up. Setting upwatchpoints more frequently will result in the likelihood of races to beobserved to increase. This parameter has the most significant impact onoverall system performance and race detection ability.
  • kcsan.udelay_task (CONFIG_KCSAN_UDELAY_TASK): For tasks, themicrosecond delay to stall execution after a watchpoint has been set up.Larger values result in the window in which we may observe a race toincrease.
  • kcsan.udelay_interrupt (CONFIG_KCSAN_UDELAY_INTERRUPT): Forinterrupts, the microsecond delay to stall execution after a watchpoint hasbeen set up. Interrupts have tighter latency requirements, and their delayshould generally be smaller than the one chosen for tasks.

They may be tweaked at runtime via/sys/module/kcsan/parameters/.

Data Races

In an execution, two memory accesses form adata race if theyconflict,they happen concurrently in different threads, and at least one of them is aplain access; theyconflict if both access the same memory location, and atleast one is a write. For a more thorough discussion and definition, see“PlainAccesses and Data Races” in the LKMM.

Relationship with the Linux-Kernel Memory Consistency Model (LKMM)

The LKMM defines the propagation and ordering rules of various memoryoperations, which gives developers the ability to reason about concurrent code.Ultimately this allows to determine the possible executions of concurrent code,and if that code is free from data races.

KCSAN is aware ofmarked atomic operations (READ_ONCE,WRITE_ONCE,atomic_*, etc.), but is oblivious of any ordering guarantees and simplyassumes that memory barriers are placed correctly. In other words, KCSANassumes that as long as a plain access is not observed to race with anotherconflicting access, memory operations are correctly ordered.

This means that KCSAN will not reportpotential data races due to missingmemory ordering. Developers should therefore carefully consider the requiredmemory ordering requirements that remain unchecked. If, however, missingmemory ordering (that is observable with a particular compiler andarchitecture) leads to an observable data race (e.g. entering a criticalsection erroneously), KCSAN would report the resulting data race.

Race Detection Beyond Data Races

For code with complex concurrency design, race-condition bugs may not alwaysmanifest as data races. Race conditions occur if concurrently executingoperations result in unexpected system behaviour. On the other hand, data racesare defined at the C-language level. The following macros can be used to checkproperties of concurrent code where bugs would not manifest as data races.

ASSERT_EXCLUSIVE_WRITER(var)

assert no concurrent writes tovar

Parameters

var
variable to assert on

Description

Assert that there are no concurrent writes tovar; other readers areallowed. This assertion can be used to specify properties of concurrent code,where violation cannot be detected as a normal data race.

For example, if we only have a single writer, but multiple concurrentreaders, to avoid data races, all these accesses must be marked; evenconcurrent marked writes racing with the single writer are bugs.Unfortunately, due to being marked, they are no longer data races. For caseslike these, we can use the macro as follows:

voidwriter(void){spin_lock(&update_foo_lock);ASSERT_EXCLUSIVE_WRITER(shared_foo);WRITE_ONCE(shared_foo,...);spin_unlock(&update_foo_lock);}voidreader(void){// update_foo_lock does not need to be held!...=READ_ONCE(shared_foo);}

Note

ASSERT_EXCLUSIVE_WRITER_SCOPED(), if applicable, performs more thoroughchecking if a clear scope where no concurrent writes are expected exists.

ASSERT_EXCLUSIVE_WRITER_SCOPED(var)

assert no concurrent writes tovar in scope

Parameters

var
variable to assert on

Description

Scoped variant ofASSERT_EXCLUSIVE_WRITER().

Assert that there are no concurrent writes tovar for the duration of thescope in which it is introduced. This provides a better way to fully coverthe enclosing scope, compared to multipleASSERT_EXCLUSIVE_WRITER(), andincreases the likelihood for KCSAN to detect racing accesses.

For example, it allows finding race-condition bugs that only occur due tostate changes within the scope itself:

voidwriter(void){spin_lock(&update_foo_lock);{ASSERT_EXCLUSIVE_WRITER_SCOPED(shared_foo);WRITE_ONCE(shared_foo,42);...// shared_foo should still be 42 here!}spin_unlock(&update_foo_lock);}voidbuggy(void){if(READ_ONCE(shared_foo)==42)WRITE_ONCE(shared_foo,1);// bug!}
ASSERT_EXCLUSIVE_ACCESS(var)

assert no concurrent accesses tovar

Parameters

var
variable to assert on

Description

Assert that there are no concurrent accesses tovar (no readers norwriters). This assertion can be used to specify properties of concurrentcode, where violation cannot be detected as a normal data race.

For example, where exclusive access is expected after determining no otherusers of an object are left, but the object is not actually freed. We cancheck that this property actually holds as follows:

if(refcount_dec_and_test(&obj->refcnt)){ASSERT_EXCLUSIVE_ACCESS(*obj);do_some_cleanup(obj);release_for_reuse(obj);}

Note

ASSERT_EXCLUSIVE_ACCESS_SCOPED(), if applicable, performs more thoroughchecking if a clear scope where no concurrent accesses are expected exists.

For cases where the object is freed,KASAN is a betterfit to detect use-after-free bugs.

ASSERT_EXCLUSIVE_ACCESS_SCOPED(var)

assert no concurrent accesses tovar in scope

Parameters

var
variable to assert on

Description

Scoped variant ofASSERT_EXCLUSIVE_ACCESS().

Assert that there are no concurrent accesses tovar (no readers nor writers)for the entire duration of the scope in which it is introduced. This providesa better way to fully cover the enclosing scope, compared to multipleASSERT_EXCLUSIVE_ACCESS(), and increases the likelihood for KCSAN to detectracing accesses.

ASSERT_EXCLUSIVE_BITS(var,mask)

assert no concurrent writes to subset of bits invar

Parameters

var
variable to assert on
mask
only check for modifications to bits set inmask

Description

Bit-granular variant ofASSERT_EXCLUSIVE_WRITER().

Assert that there are no concurrent writes to a subset of bits invar;concurrent readers are permitted. This assertion captures more detailedbit-level properties, compared to the other (word granularity) assertions.Only the bits set inmask are checked for concurrent modifications, whileignoring the remaining bits, i.e. concurrent writes (or reads) to ~mask bitsare ignored.

Use this for variables, where some bits must not be modified concurrently,yet other bits are expected to be modified concurrently.

For example, variables where, after initialization, some bits are read-only,but other bits may still be modified concurrently. A reader may wish toassert that this is true as follows:

ASSERT_EXCLUSIVE_BITS(flags,READ_ONLY_MASK);foo=(READ_ONCE(flags)&READ_ONLY_MASK)>>READ_ONLY_SHIFT;
ASSERT_EXCLUSIVE_BITS(flags,READ_ONLY_MASK);foo=(flags&READ_ONLY_MASK)>>READ_ONLY_SHIFT;

Another example, where this may be used, is when certain bits ofvar mayonly be modified when holding the appropriate lock, but other bits may stillbe modified concurrently. Writers, where other bits may change concurrently,could use the assertion as follows:

spin_lock(&foo_lock);ASSERT_EXCLUSIVE_BITS(flags,FOO_MASK);old_flags=flags;new_flags=(old_flags&~FOO_MASK)|(new_foo<<FOO_SHIFT);if(cmpxchg(&flags,old_flags,new_flags)!=old_flags){...}spin_unlock(&foo_lock);

Note

The access that immediately followsASSERT_EXCLUSIVE_BITS() is assumedto access the masked bits only, and KCSAN optimistically assumes it istherefore safe, even in the presence of data races, and marking it withREAD_ONCE() is optional from KCSAN’s point-of-view. We caution, however, thatit may still be advisable to do so, since we cannot reason about all compileroptimizations when it comes to bit manipulations (on the reader and writerside). If you are sure nothing can go wrong, we can write the above simplyas:

Implementation Details

KCSAN relies on observing that two accesses happen concurrently. Crucially, wewant to (a) increase the chances of observing races (especially for races thatmanifest rarely), and (b) be able to actually observe them. We can accomplish(a) by injecting various delays, and (b) by using address watchpoints (orbreakpoints).

If we deliberately stall a memory access, while we have a watchpoint for itsaddress set up, and then observe the watchpoint to fire, two accesses to thesame address just raced. Using hardware watchpoints, this is the approach takeninDataCollider.Unlike DataCollider, KCSAN does not use hardware watchpoints, but insteadrelies on compiler instrumentation and “soft watchpoints”.

In KCSAN, watchpoints are implemented using an efficient encoding that storesaccess type, size, and address in a long; the benefits of using “softwatchpoints” are portability and greater flexibility. KCSAN then relies on thecompiler instrumenting plain accesses. For each instrumented plain access:

  1. Check if a matching watchpoint exists; if yes, and at least one access is awrite, then we encountered a racing access.
  2. Periodically, if no matching watchpoint exists, set up a watchpoint andstall for a small randomized delay.
  3. Also check the data value before the delay, and re-check the data valueafter delay; if the values mismatch, we infer a race of unknown origin.

To detect data races between plain and marked accesses, KCSAN also annotatesmarked accesses, but only to check if a watchpoint exists; i.e. KCSAN neversets up a watchpoint on marked accesses. By never setting up watchpoints formarked operations, if all accesses to a variable that is accessed concurrentlyare properly marked, KCSAN will never trigger a watchpoint and therefore neverreport the accesses.

Key Properties

  1. Memory Overhead: The overall memory overhead is only a few MiBdepending on configuration. The current implementation uses a small array oflongs to encode watchpoint information, which is negligible.
  2. Performance Overhead: KCSAN’s runtime aims to be minimal, using anefficient watchpoint encoding that does not require acquiring any sharedlocks in the fast-path. For kernel boot on a system with 8 CPUs:
    • 5.0x slow-down with the default KCSAN config;
    • 2.8x slow-down from runtime fast-path overhead only (set very largeKCSAN_SKIP_WATCH and unsetKCSAN_SKIP_WATCH_RANDOMIZE).
  3. Annotation Overheads: Minimal annotations are required outside the KCSANruntime. As a result, maintenance overheads are minimal as the kernelevolves.
  4. Detects Racy Writes from Devices: Due to checking data values uponsetting up watchpoints, racy writes from devices can also be detected.
  5. Memory Ordering: KCSAN isnot explicitly aware of the LKMM’s orderingrules; this may result in missed data races (false negatives).
  6. Analysis Accuracy: For observed executions, due to using a samplingstrategy, the analysis isunsound (false negatives possible), but aims tobe complete (no false positives).

Alternatives Considered

An alternative data race detection approach for the kernel can be found in theKernel Thread Sanitizer (KTSAN).KTSAN is a happens-before data race detector, which explicitly establishes thehappens-before order between memory operations, which can then be used todetermine data races as defined inData Races.

To build a correct happens-before relation, KTSAN must be aware of all orderingrules of the LKMM and synchronization primitives. Unfortunately, any omissionleads to large numbers of false positives, which is especially detrimental inthe context of the kernel which includes numerous custom synchronizationmechanisms. To track the happens-before relation, KTSAN’s implementationrequires metadata for each memory location (shadow memory), which for each pagecorresponds to 4 pages of shadow memory, and can translate into overhead oftens of GiB on a large system.