False Sharing
What is False Sharing
False sharing is related to the cache mechanism that maintains the data coherence of one cache line stored in multiple CPUs' caches; the academic definition for it is in [1]. Consider a struct with a refcount and a string:
    struct foo {
            refcount_t refcount;
            ...
            char name[16];
    } ____cacheline_internodealigned_in_smp;

Members ‘refcount’ (A) and ‘name’ (B) _share_ one cache line like below:
        +-----------+                          +-----------+
        |   CPU 0   |                          |   CPU 1   |
        +-----------+                          +-----------+
              |                                      |
              V                                      V
        +----------------------+          +----------------------+
        | A       B            | Cache 0  | A       B            | Cache 1
        +----------------------+          +----------------------+
                   |                                 |
    ---------------+---------------------------------+---------------
                   |                                 |
                   +----------------+----------------+
                                    |
                        +----------------------+
        Main Memory     | A       B            |
                        +----------------------+
‘refcount’ is modified frequently, but ‘name’ is set once at object creation time and is never modified. When many CPUs access ‘foo’ at the same time, with ‘refcount’ being only bumped by one CPU frequently and ‘name’ being read by other CPUs, all those reading CPUs have to reload the whole cache line over and over due to the ‘sharing’, even though ‘name’ is never changed.
There are many real-world cases of performance regressions caused by false sharing. One of these is the rw_semaphore ‘mmap_lock’ inside the mm_struct struct, whose cache line layout change triggered a regression that Linus analyzed in [2].
There are two key factors for a harmful false sharing:
A global datum accessed (shared) by many CPUs
In the concurrent accesses to the data, there is at least one write operation: write/write or write/read cases.
The sharing could be from totally unrelated kernel components, or different code paths of the same kernel component.
False Sharing Pitfalls
Back when a platform had only one or a few CPUs, hot data members could be purposely put in the same cache line to make them cache hot and save cacheline/TLB, like a lock and the data protected by it. But for recent large systems with hundreds of CPUs, this may not work when the lock is heavily contended, as the lock owner CPU writes to the data while other CPUs are busy spinning on the lock.
Looking at past cases, there are several frequently occurring patterns for false sharing:
lock (spinlock/mutex/semaphore) and data protected by it are purposely put in one cache line.
global data being put together in one cache line. Some kernel subsystems have many global parameters of small size (4 bytes), which can easily be grouped together and put into one cache line.
data members of a big data structure randomly sitting together without being noticed (cache line is usually 64 bytes or more), like ‘mem_cgroup’ struct.
The following ‘Possible Mitigations’ section provides real-world examples.
False sharing can easily happen unless it is intentionally checked, and it is valuable to run specific tools on performance-critical workloads to detect cases where false sharing affects performance, and optimize accordingly.
How to detect and analyze False Sharing
perf record/report/stat are widely used for performance tuning, and once hotspots are detected, tools like ‘perf-c2c’ and ‘pahole’ can be further used to detect and pinpoint the possible false sharing data structures. ‘addr2line’ is also good at decoding the instruction pointer when there are multiple layers of inline functions.
perf-c2c can capture the cache lines with most false sharing hits, decoded functions (line number of the file) accessing that cache line, and in-line offset of the data. Simple commands are:
    $ perf c2c record -ag sleep 3
    $ perf c2c report --call-graph none -k vmlinux
When running the above during a test of will-it-scale’s tlb_flush1 case, perf reports something like:
    Total records                     :    1658231
    Locked Load/Store Operations      :      89439
    Load Operations                   :     623219
    Load Local HITM                   :      92117
    Load Remote HITM                  :        139

    #----------------------------------------------------------------------
        4        0     2374        0        0        0  0xff1100088366d880
    #----------------------------------------------------------------------
      0.00%   42.29%   0.00%   0.00%   0.00%    0x8   1   1  0xffffffff81373b7b   0   231   129   5312   64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
      0.00%   13.10%   0.00%   0.00%   0.00%    0x8   1   1  0xffffffff81374718   0   226    97   3551   64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
      0.00%   11.20%   0.00%   0.00%   0.00%    0x8   1   1  0xffffffff812c29bf   0   170   136    555   64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
      0.00%    7.62%   0.00%   0.00%   0.00%    0x8   1   1  0xffffffff812c3ec5   0   175   108    632   64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
      0.00%   23.29%   0.00%   0.00%   0.00%   0x10   1   1  0xffffffff81372d0a   0   234   279   1051   64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1
A nice introduction for perf-c2c is [3].
‘pahole’ decodes data structure layouts delimited at cache line granularity. Users can match the offset in perf-c2c output with pahole’s decoding to locate the exact data members. For global data, users can search the data address in System.map.
Possible Mitigations
False sharing does not always need to be mitigated. False sharing mitigations should balance performance gains with complexity and space consumption. Sometimes, lower performance is OK, and it’s unnecessary to hyper-optimize every rarely used data structure or a cold data path.
Cases of false sharing hurting performance are seen more frequently as core counts increase. Because of these detrimental effects, many patches have been proposed across a variety of subsystems (like networking and memory management) and merged. Some common mitigations (with examples) are:
Separate hot global data in its own dedicated cache line, even if it is just a ‘short’ type. The downside is more consumption of memory, cache lines and TLB entries.
Commit 91b6d3256356 (“net: cache align tcp_memory_allocated, tcp_sockets_allocated”)
Reorganize the data structure; separate the interfering members into different cache lines. One downside is that it may introduce new false sharing of other members.
Commit 802f1d522d5f (“mm: page_counter: re-layout structure to reduce false sharing”)
Replace ‘write’ with ‘read’ when possible, especially in loops. For some global variable, use compare(read)-then-write instead of an unconditional write. For example, use:
    if (!test_bit(XXX))
            set_bit(XXX);
instead of directly “set_bit(XXX);”, similarly for atomic_t data:
    if (atomic_read(XXX) == AAA)
            atomic_set(XXX, BBB);
Commit 7b1002f7cfe5 (“bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing”)
Commit 292648ac5cf1 (“mm: gup: allow FOLL_PIN to scale in SMP”)
Turn hot global data into ‘per-cpu data + global data’ when possible, or reasonably increase the threshold for syncing per-cpu data to global data, to reduce or postpone the ‘write’ to that global data.
Commit 520f897a3554 (“ext4: use percpu_counters for extent_status cache hits/misses”)
Commit 56f3547bfa4d (“mm: adjust vm_committed_as_batch according to vm overcommit policy”)
Surely, all mitigations should be carefully verified to not cause side effects. To avoid introducing false sharing when coding, it’s better to:
Be aware of cache line boundaries
Group mostly read-only fields together
Group things that are written at the same time together
Separate frequently read and frequently written fields onto different cache lines.
and better add a comment stating the false sharing consideration.
One note is, sometimes even after a severe false sharing is detectedand solved, the performance may still have no obvious improvement asthe hotspot switches to a new place.
Miscellaneous
One open issue is that the kernel has an optional data structure randomization mechanism, which also randomizes the situation of cache line sharing among data members.
[1]
[2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
[3]