December 21, 2023
This article was contributed by Julian Squires
Tooling for profiling the effects of memory usage and layout has always lagged behind that for profiling processor activity, so Namhyung Kim's patch set for data-type profiling in perf is a welcome addition. It provides aggregated breakdowns of memory accesses by data type that can inform structure layout and access-pattern changes. Existing tools have either, like heaptrack, focused on profiling allocations, or, like perf mem, on accounting memory accesses only at the address level. This new work builds on the latter, using DWARF debugging information to correlate memory operations with their source-level types.
Recent kernel history is full of examples of commits that reorder structures, pad fields, or pack them to improve performance. But how does one discover structures in need of optimization and characterize access to them to make such decisions? Pahole gives a static view of how data structures span cache lines and where padding exists, but can't reveal anything about access patterns. perf c2c is a powerful tool for identifying cache-line contention, but won't reveal anything useful for single-threaded access. To understand the access behavior of a running program, a broader picture of accesses to data structures is needed. This is where Kim's data-type profiling work comes in.
Take, for example, this recent change to perf from Ian Rogers, who described it tersely as: "Avoid 6 byte hole for padding. Place more frequently used fields first in an attempt to use just 1 cache line in the common case." This is a classic structure-reordering optimization. Rogers quotes pahole's output for the structure in question before the optimization:
    struct callchain_list {
        u64                        ip;                   /*     0     8 */
        struct map_symbol          ms;                   /*     8    24 */
        struct {
            _Bool                  unfolded;             /*    32     1 */
            _Bool                  has_children;         /*    33     1 */
        };                                               /*    32     2 */

        /* XXX 6 bytes hole, try to pack */

        u64                        branch_count;         /*    40     8 */
        u64                        from_count;           /*    48     8 */
        u64                        predicted_count;      /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        u64                        abort_count;          /*    64     8 */
        u64                        cycles_count;         /*    72     8 */
        u64                        iter_count;           /*    80     8 */
        u64                        iter_cycles;          /*    88     8 */
        struct branch_type_stat *  brtype_stat;          /*    96     8 */
        const char *               srcline;              /*   104     8 */
        struct list_head           list;                 /*   112    16 */

        /* size: 128, cachelines: 2, members: 13 */
        /* sum members: 122, holes: 1, sum holes: 6 */
    };

We can see that there is a hole, and that the whole structure spans two cache lines, but not much more than that. Rogers's patch moves the list_head structure up to fill the reported hole and, at the same time, put a heavily accessed structure into the same cache line as the other frequently used data. Making a change like that, though, requires knowledge of which fields are most often accessed. This is where perf's new data-type profiling comes in.
To use it, one starts by sampling memory operations with:
perf mem record
Intel, AMD, and Arm each have some support for recording precise memory events on their contemporary processors, but this support varies in how comprehensive it is. On processors that support separating load and store profiling (such as Arm SPE or Intel PEBS), a command like:
perf mem record -t store
can be used to find fields that are heavily written. Here, we'll use it on perf report itself with a reasonably sized call chain to evaluate the change.
Once a run has been done with the above command, it is time to use the resulting data to do the data-type profile. Kim's changes add a new command:
perf annotate --data-type
that prints structures with samples per field; it can be narrowed to a single type by providing an argument. This is what the output from:
perf annotate --data-type=callchain_list
looks like before Rogers's patch (the most active fields are those with the largest sample counts):
    Annotate type: 'struct callchain_list' in [...]/tools/perf/perf (218 samples):
    ============================================================================
        samples     offset       size  field
            218          0        128  struct callchain_list {
             18          0          8      u64    ip;
            157          8         24      struct map_symbol    ms {
              0          8          8          struct maps*     maps;
             60         16          8          struct map*      map;
             97         24          8          struct symbol*   sym;
                                           };
              0         32          2      struct {
              0         32          1          _Bool    unfolded;
              0         33          1          _Bool    has_children;
                                           };
              0         40          8      u64    branch_count;
              0         48          8      u64    from_count;
              0         56          8      u64    predicted_count;
              0         64          8      u64    abort_count;
              0         72          8      u64    cycles_count;
              0         80          8      u64    iter_count;
              0         88          8      u64    iter_cycles;
              0         96          8      struct branch_type_stat*    brtype_stat;
              0        104          8      char*    srcline;
             43        112         16      struct list_head    list {
             43        112          8          struct list_head*    next;
              0        120          8          struct list_head*    prev;
                                           };
                                       };

This makes the point of the patch clear. We can see that list is the only field on the second cache line that is accessed as part of this workload. If that field could be moved to the first cache line, the cache behavior of the application should improve. Data-type profiling lets us verify that assumption; its output after the patch looks like:
    Annotate type: 'struct callchain_list' in [...]/tools/perf/perf (154 samples):
    ============================================================================
        samples     offset       size  field
            154          0        128  struct callchain_list {
             28          0         16      struct list_head    list {
             28          0          8          struct list_head*    next;
              0          8          8          struct list_head*    prev;
                                           };
              9         16          8      u64    ip;
            116         24         24      struct map_symbol    ms {
              1         24          8          struct maps*     maps;
             60         32          8          struct map*      map;
             55         40          8          struct symbol*   sym;
                                           };
              1         48          8      char*    srcline;
              0         56          8      u64    branch_count;
              0         64          8      u64    from_count;
              0         72          8      u64    cycles_count;
              0         80          8      u64    iter_count;
              0         88          8      u64    iter_cycles;
              0         96          8      struct branch_type_stat*    brtype_stat;
              0        104          8      u64    predicted_count;
              0        112          8      u64    abort_count;
              0        120          2      struct {
              0        120          1          _Bool    unfolded;
              0        121          1          _Bool    has_children;
                                           };
                                       };

For this workload, at least, the access patterns are as advertised. Some quick perf stat benchmarking revealed that the instructions-per-cycle count had increased and the time elapsed had decreased as a consequence of the change.
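The patch itself is linked above rather than reproduced here, but the offsets in the post-patch output imply a declaration along the lines of the following sketch; the field order is inferred from that output, not copied from Rogers's change, and the u64 and bool types are assumed to match those used elsewhere in perf:

    /*
     * Sketch of the reordered layout implied by the post-patch output
     * above; inferred from the reported offsets, not taken verbatim
     * from Rogers's patch.
     */
    struct callchain_list {
            struct list_head           list;            /* hot; now in cache line 0 */
            u64                        ip;
            struct map_symbol          ms;              /* hot */
            const char                 *srcline;
            u64                        branch_count;
            u64                        from_count;
            u64                        cycles_count;
            u64                        iter_count;
            u64                        iter_cycles;
            struct branch_type_stat    *brtype_stat;
            u64                        predicted_count;
            u64                        abort_count;
            struct {
                    bool               unfolded;
                    bool               has_children;
            };
    };

Note that the six bytes of padding have not disappeared; the post-patch output still reports 122 bytes of members in a 128-byte structure, so the hole has simply moved to the tail of the structure, where it no longer pushes hot fields onto the second cache line.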
Anyone who has spent a lot of time scrutinizing pahole output, trying to shuffle structure members to balance size, cache-line access, false sharing, and so on, is likely to find this useful. (Readers who have not yet delved into this rabbit hole might want to start with Ulrich Drepper's series on LWN, "What every programmer should know about memory", specifically part 5, "What programmers can do".)
Data-type profiling obviously needs information about the program it is looking at to be able to do its job; specifically, identifying the data type associated with a load or store requires that there is DWARF debugging information for locations, variables, and types. Any language supported by perf should work. The author verified that, aside from C, Rust and Go programs produce reasonable output, though it is not always idiomatic for the language involved.
After sampling memory accesses, data-type aggregation correlates sampled instruction arguments with locations in the associated DWARF information, and then with their type. As is often the case in profiling, compiler optimizations can impede this search. This unfortunately means that there are cases where perf won't associate a memory event with a type because the DWARF information either wasn't thorough enough, or was too complex for perf to interpret.
Kim spoke about this work at the 2023 Linux Plumbers Conference (video), and noted situations involving chains of pointers as a common case that isn't supported well currently. While he has a workaround for this problem, he also pointed out that there is a proposal for inverted location lists in DWARF that would be a more general solution.
For any given program address (usually the current program counter (PC)), location lists in DWARF [large PDF] allow a debugging tool to look up how a symbol is currently stored; the result can be a location description, which may indicate that the symbol is currently in a register, or an address. What tools like perf would rather have is a mapping from an address or register to a symbol. This is effectively an inversion of location lists, but computing this inversion is much less expensive for the compiler emitting the debugging information in the first place. This has been a sore spot for perf in the past, judging from the discussion between Arnaldo Carvalho de Melo and Peter Zijlstra during the former's Linux Plumbers Conference 2022 talk (video) on profiling data structures.
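To make the inversion concrete, consider a small function that takes a callchain_list pointer (a hypothetical helper, assuming perf's own struct definitions are in scope). The comment in the sketch shows, in readelf-style notation, what a location list for its argument might look like; the register choice, PC ranges, and surrounding details are illustrative assumptions, not taken from any particular binary:

    /*
     * Illustrative only: the DWARF location list for "cl" might say
     * (readelf --debug-dump=loclists style; addresses hypothetical):
     *
     *   [fn+0x00, fn+0x1f): DW_OP_reg5 (rdi)   -- still in the argument register
     *   [fn+0x1f, fn+0x4a): DW_OP_fbreg -24    -- spilled to the stack
     *
     * A debugger starts from the variable and a PC and reads off its
     * location.  perf needs the reverse: given a sampled load through
     * rdi plus an offset, it must search these lists to learn that rdi
     * holds a struct callchain_list pointer at that PC, and only then
     * can it map the offset to a field.  An inverted table keyed by
     * register and PC range would make that lookup direct.
     */
    static u64 hot_cycles(struct callchain_list *cl)
    {
            return cl->cycles_count + cl->iter_cycles;
    }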
As of this article, Kim's work is unmerged but, since the changes are only in user space, it's possible to try them out easily by building perf from Kim's perf/data-profile-v3 branch. Given the enthusiastic reactions to the v1 patch set from perf tools maintainer Arnaldo Carvalho de Melo, Peter Zijlstra, and Ingo Molnar, it seems likely that it won't remain unmerged for long.
Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds