December 21, 2023
This article was contributed by Julian Squires
Tooling for profiling the effects of memory usage and layout has always lagged behind that for profiling processor activity, so Namhyung Kim's patch set for data-type profiling in perf is a welcome addition. It provides aggregated breakdowns of memory accesses by data type that can inform structure layout and access-pattern changes. Existing tools have either, like heaptrack, focused on profiling allocations, or, like perf mem, on accounting memory accesses only at the address level. This new work builds on the latter, using DWARF debugging information to correlate memory operations with their source-level types.
Recent kernel history is full of examples of commits that reorder structures, pad fields, or pack them to improve performance. But how does one discover structures in need of optimization and characterize access to them to make such decisions? Pahole gives a static view of how data structures span cache lines and where padding exists, but can't reveal anything about access patterns. perf c2c is a powerful tool for identifying cache-line contention, but won't reveal anything useful for single-threaded access. To understand the access behavior of a running program, a broader picture of accesses to data structures is needed. This is where Kim's data-type profiling work comes in.
Take, for example, this recent change to perf from Ian Rogers, who described it tersely as: "Avoid 6 byte hole for padding. Place more frequently used fields first in an attempt to use just 1 cache line in the common case." This is a classic structure-reordering optimization. Rogers quotes pahole's output for the structure in question before the optimization:
    struct callchain_list {
        u64                        ip;                   /*     0     8 */
        struct map_symbol          ms;                   /*     8    24 */
        struct {
            _Bool                  unfolded;             /*    32     1 */
            _Bool                  has_children;         /*    33     1 */
        };                                               /*    32     2 */

        /* XXX 6 bytes hole, try to pack */

        u64                        branch_count;         /*    40     8 */
        u64                        from_count;           /*    48     8 */
        u64                        predicted_count;      /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        u64                        abort_count;          /*    64     8 */
        u64                        cycles_count;         /*    72     8 */
        u64                        iter_count;           /*    80     8 */
        u64                        iter_cycles;          /*    88     8 */
        struct branch_type_stat *  brtype_stat;          /*    96     8 */
        const char *               srcline;              /*   104     8 */
        struct list_head           list;                 /*   112    16 */

        /* size: 128, cachelines: 2, members: 13 */
        /* sum members: 122, holes: 1, sum holes: 6 */
    };

We can see that there is a hole, and that the whole structure spans two cache lines, but not much more than that. Rogers's patch moves the list_head structure up to fill the reported hole and, at the same time, put a heavily accessed structure into the same cache line as the other frequently used data. Making a change like that, though, requires knowledge of which fields are most often accessed. This is where perf's new data-type profiling comes in.
To use it, one starts by sampling memory operations with:
perf mem record
Intel, AMD, and Arm each have some support for recording precise memory events on their contemporary processors, but this support varies in how comprehensive it is. On processors that support separating load and store profiling (such as Arm SPE or Intel PEBS), a command like:
perf mem record -t store
can be used to find fields that are heavily written. Here, we'll use it on perf report itself with a reasonably sized call chain to evaluate the change.
Once a run has been done with the above command, it is time to use the resulting data to do the data-type profile. Kim's changes add a new command:
perf annotate --data-type
that prints structures with samples per field; it can be narrowed to a single type by providing an argument. This is what the output from:
perf annotate --data-type=callchain_list
looks like before Rogers's patch (the most active fields are those with the largest sample counts):
    Annotate type: 'struct callchain_list' in [...]/tools/perf/perf (218 samples):
    ============================================================================
        samples     offset       size  field
            218          0        128  struct callchain_list {
             18          0          8      u64    ip;
            157          8         24      struct map_symbol    ms {
              0          8          8          struct maps*     maps;
             60         16          8          struct map*      map;
             97         24          8          struct symbol*   sym;
                                           };
              0         32          2      struct {
              0         32          1          _Bool    unfolded;
              0         33          1          _Bool    has_children;
                                           };
              0         40          8      u64    branch_count;
              0         48          8      u64    from_count;
              0         56          8      u64    predicted_count;
              0         64          8      u64    abort_count;
              0         72          8      u64    cycles_count;
              0         80          8      u64    iter_count;
              0         88          8      u64    iter_cycles;
              0         96          8      struct branch_type_stat*    brtype_stat;
              0        104          8      char*    srcline;
             43        112         16      struct list_head    list {
             43        112          8          struct list_head*    next;
              0        120          8          struct list_head*    prev;
                                           };
                                       };

This makes the point of the patch clear. We can see that list is the only field on the second cache line that is accessed as part of this workload. If that field could be moved to the first cache line, the cache behavior of the application should improve. Data-type profiling lets us verify that assumption; its output after the patch looks like:
    Annotate type: 'struct callchain_list' in [...]/tools/perf/perf (154 samples):
    ============================================================================
        samples     offset       size  field
            154          0        128  struct callchain_list {
             28          0         16      struct list_head    list {
             28          0          8          struct list_head*    next;
              0          8          8          struct list_head*    prev;
                                           };
              9         16          8      u64    ip;
            116         24         24      struct map_symbol    ms {
              1         24          8          struct maps*     maps;
             60         32          8          struct map*      map;
             55         40          8          struct symbol*   sym;
                                           };
              1         48          8      char*    srcline;
              0         56          8      u64    branch_count;
              0         64          8      u64    from_count;
              0         72          8      u64    cycles_count;
              0         80          8      u64    iter_count;
              0         88          8      u64    iter_cycles;
              0         96          8      struct branch_type_stat*    brtype_stat;
              0        104          8      u64    predicted_count;
              0        112          8      u64    abort_count;
              0        120          2      struct {
              0        120          1          _Bool    unfolded;
              0        121          1          _Bool    has_children;
                                           };
                                       };

For this workload, at least, the access patterns are as advertised. Some quick perf stat benchmarking revealed that the instructions-per-cycle count had increased and the time elapsed had decreased as a consequence of the change.
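The patch itself is linked above rather than reproduced here, but the offsets in the post-patch output imply a declaration along the lines of the following sketch; the field order is inferred from that output, not copied from Rogers's change, and the u64 and bool types are assumed to match those used elsewhere in perf:

    /*
     * Sketch of the reordered layout implied by the post-patch output
     * above; inferred from the reported offsets, not taken verbatim
     * from Rogers's patch.
     */
    struct callchain_list {
            struct list_head           list;            /* hot; now in cache line 0 */
            u64                        ip;
            struct map_symbol          ms;              /* hot */
            const char                 *srcline;
            u64                        branch_count;
            u64                        from_count;
            u64                        cycles_count;
            u64                        iter_count;
            u64                        iter_cycles;
            struct branch_type_stat    *brtype_stat;
            u64                        predicted_count;
            u64                        abort_count;
            struct {
                    bool               unfolded;
                    bool               has_children;
            };
    };

Note that the six bytes of padding have not disappeared; the post-patch output still reports 122 bytes of members in a 128-byte structure, so the hole has simply moved to the tail of the structure, where it no longer pushes hot fields onto the second cache line.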
Anyone who has spent a lot of time scrutinizing pahole output, trying to shuffle structure members to balance size, cache-line access, false sharing, and so on, is likely to find this useful. (Readers who have not yet delved into this rabbit hole might want to start with Ulrich Drepper's series on LWN, "What every programmer should know about memory", specifically part 5, "What programmers can do".)
Data-type profiling obviously needs information about the program it is looking at to be able to do its job; specifically, identifying the data type associated with a load or store requires that there is DWARF debugging information for locations, variables, and types. Any language supported by perf should work. The author verified that, aside from C, Rust and Go programs produce reasonable output, though it is not always idiomatic for the language involved.
After sampling memory accesses, data-type aggregation correlates sampled instruction arguments with locations in the associated DWARF information, and then with their type. As is often the case in profiling, compiler optimizations can impede this search. This unfortunately means that there are cases where perf won't associate a memory event with a type because the DWARF information either wasn't thorough enough, or was too complex for perf to interpret.
Kim spoke about this work at the 2023 Linux Plumbers Conference (video), and noted situations involving chains of pointers as a common case that isn't supported well currently. While he has a workaround for this problem, he also pointed out that there is a proposal for inverted location lists in DWARF that would be a more general solution.
For any given program address (usually the current program counter (PC)), location lists in DWARF [large PDF] allow a debugging tool to look up how a symbol is currently stored; the result can be a location description, which may indicate that the symbol is currently in a register, or an address. What tools like perf would rather have is a mapping from an address or register to a symbol. This is effectively an inversion of location lists, but computing this inversion is much less expensive for the compiler emitting the debugging information in the first place. This has been a sore spot for perf in the past, judging from the discussion between Arnaldo Carvalho de Melo and Peter Zijlstra during the former's Linux Plumbers Conference 2022 talk (video) on profiling data structures.
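To make the inversion concrete, consider a small function that takes a callchain_list pointer (a hypothetical helper, assuming perf's own struct definitions are in scope). The comment in the sketch shows, in readelf-style notation, what a location list for its argument might look like; the register choice, PC ranges, and surrounding details are illustrative assumptions, not taken from any particular binary:

    /*
     * Illustrative only: the DWARF location list for "cl" might say
     * (readelf --debug-dump=loclists style; addresses hypothetical):
     *
     *   [fn+0x00, fn+0x1f): DW_OP_reg5 (rdi)   -- still in the argument register
     *   [fn+0x1f, fn+0x4a): DW_OP_fbreg -24    -- spilled to the stack
     *
     * A debugger starts from the variable and a PC and reads off its
     * location.  perf needs the reverse: given a sampled load through
     * rdi plus an offset, it must search these lists to learn that rdi
     * holds a struct callchain_list pointer at that PC, and only then
     * can it map the offset to a field.  An inverted table keyed by
     * register and PC range would make that lookup direct.
     */
    static u64 hot_cycles(struct callchain_list *cl)
    {
            return cl->cycles_count + cl->iter_cycles;
    }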
As of this article, Kim's work is unmerged but, since the changes are only in user space, it's possible to try them out easily by building perf from Kim's perf/data-profile-v3 branch. Given the enthusiastic reactions to the v1 patch set from perf tools maintainer Arnaldo Carvalho de Melo, Peter Zijlstra, and Ingo Molnar, it seems likely that it won't remain unmerged for long.
Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds