flamegraph-rs/flamegraph

Easy flamegraphs for Rust projects and everything else, without Perl or pipes <3

[example image: colorized flamegraph output]

A Rust-powered flamegraph generator with additional support for Cargo projects! It can be used to profile anything, not just Rust projects! No Perl or pipes required <3

Built on top of @jonhoo's wonderful Inferno, an all-Rust flamegraph generation library!

Tip

You might also want to try samply, which provides a more interactive UI through seamless integration with the Firefox Profiler web UI. It is also written in Rust and has better macOS support.

Quick Start

Install it, and run

```sh
# Rust projects
cargo flamegraph

# Arbitrary binaries
flamegraph -- /path/to/binary
```

How to use flamegraphs: what's a flamegraph, and how can I use it to guide systems performance work?

Installation

[cargo-]flamegraph supports

  • Linux: relies on perf
  • macOS: relies on xctrace
  • Windows: native support with the blondie library; also works with dtrace on Windows

cargo install flamegraph will make the flamegraph and cargo-flamegraph binaries available in your cargo binary directory. On most systems this is something like ~/.cargo/bin.

Linux

Note: If you're using lld or mold on Linux, you must use the --no-rosegment flag. Otherwise perf will not be able to generate accurate stack traces (explanation). For example, for lld:

```toml
[target.x86_64-unknown-linux-gnu]
linker = "/usr/bin/clang"
rustflags = ["-Clink-arg=-fuse-ld=lld", "-Clink-arg=-Wl,--no-rosegment"]
```

and for mold:

```toml
[target.x86_64-unknown-linux-gnu]
linker = "clang"
rustflags = ["-Clink-arg=-fuse-ld=/usr/local/bin/mold", "-Clink-arg=-Wl,--no-rosegment"]
```

Debian (x86 and aarch)

Note: Debian bullseye (the current stable version as of 2022) packages an outdated version of Rust which does not meet flamegraph's requirements. You should use rustup to install an up-to-date version of Rust, or upgrade to Debian bookworm (the current testing version) or newer.

```sh
sudo apt install -y linux-perf
```

Ubuntu (x86)

```sh
sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
```

Ubuntu (aarch)

```sh
sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
```

Note: The perf binary is not packaged for all kernel versions. A check and a workaround are shown below.

```sh
# Check if the perf binary is missing:
ls -l /usr/lib/linux-tools/`uname -r`/ | grep perf

# If it is there, you can stop here.
# If it is missing, check whether you have the tools for another kernel version installed:
ls -l /usr/lib/linux-tools/

# If you do, check that that version contains perf:
# (replace <FROM_KERNEL>)
ls -l /usr/lib/linux-tools/<FROM_KERNEL>/

# If it does, symlink it. This has been tested with 6.14.0-1014-aws running perf from 6.8.0-85-generic.
# (replace <FROM_KERNEL> as before)
sudo ln -s /usr/lib/linux-tools/<FROM_KERNEL>/perf /usr/lib/linux-tools/`uname -r`/perf
```

Ubuntu/Ubuntu MATE (Raspberry Pi)

```sh
sudo apt install linux-tools-raspi
```

Pop!_OS

```sh
sudo apt install linux-tools-common linux-tools-generic
```

Windows

Blondie Backend

This is enabled by default. Windows is supported out of the box, thanks to Nicolas Abram's excellent blondie library.

DTrace on Windows

Alternatively, one can install DTrace on Windows. If found, flamegraph will always prefer using dtrace over the built-in Windows support.

Shell auto-completion

At the moment, only flamegraph supports auto-completion. Supported shells are bash, fish, zsh, powershell and elvish. cargo-flamegraph does not support auto-completion because it is not as straightforward to implement for custom cargo subcommands. See #153 for details.

How you enable auto-completion depends on your shell, e.g.

```sh
flamegraph --completions bash > $XDG_CONFIG_HOME/bash_completion   # or /etc/bash_completion.d/
```

Examples

```sh
# if you'd like to profile an arbitrary executable:
flamegraph [-o my_flamegraph.svg] -- /path/to/my/binary --my-arg 5

# or if the executable is already running, you can provide the PID via the `-p` (or `--pid`) flag:
flamegraph [-o my_flamegraph.svg] --pid 1337

# NOTE: By default, perf tries to compute which functions are
# inlined at every stack frame for every sample. This can take
# a very long time (see https://github.com/flamegraph-rs/flamegraph/issues/74).
# If you don't want this, you can pass --no-inline to flamegraph:
flamegraph --no-inline [-o my_flamegraph.svg] /path/to/my/binary --my-arg 5

# cargo support is provided through the cargo-flamegraph binary!
# defaults to profiling `cargo run --release`
cargo flamegraph

# by default, the `--release` profile is used,
# but you can override this:
cargo flamegraph --dev

# if you'd like to profile a specific binary:
cargo flamegraph --bin=stress2

# if you want to pass arguments as you would with cargo run:
cargo flamegraph -- my-command --my-arg my-value -m -f

# if you want to use interesting perf or dtrace options, use `-c`
# this is handy for correlating things like branch-misses, cache-misses,
# or anything else available via `perf list` or dtrace for your system
cargo flamegraph -c "record -e branch-misses -c 100 --call-graph lbr -g"

# Run a criterion benchmark.
# Note that the last --bench is required for `criterion 0.3` to run in benchmark mode, instead of test mode.
cargo flamegraph --bench some_benchmark --features some_features -- --bench

cargo flamegraph --example some_example --features some_features

# Profile unit tests.
# Note that a separating `--` is necessary if `--unit-test` is the last flag.
cargo flamegraph --unit-test -- test::in::package::with::single::crate
cargo flamegraph --unit-test crate_name -- test::in::package::with::multiple::crate
cargo flamegraph --unit-test --dev test::may::omit::separator::if::unit::test::flag::not::last::flag

# Profile integration tests.
cargo flamegraph --test test_name
```

Usage

flamegraph is quite simple. cargo-flamegraph is more sophisticated:

```
Usage: cargo flamegraph [OPTIONS] [-- <TRAILING_ARGUMENTS>...]

Arguments:
  [TRAILING_ARGUMENTS]...  Trailing arguments passed to the binary being profiled

Options:
      --dev                            Build with the dev profile
      --profile <PROFILE>              Build with the specified profile
  -p, --package <PACKAGE>              package with the binary to run
  -b, --bin <BIN>                      Binary to run
      --example <EXAMPLE>              Example to run
      --test <TEST>                    Test binary to run (currently profiles the test harness and all tests in the binary)
      --unit-test [<UNIT_TEST>]        Crate target to unit test, <unit-test> may be omitted if crate only has one target (currently profiles the test harness and all tests in the binary; test selection can be passed as trailing arguments after `--` as separator)
      --bench <BENCH>                  Benchmark to run
      --manifest-path <MANIFEST_PATH>  Path to Cargo.toml
  -f, --features <FEATURES>            Build features to enable
      --no-default-features            Disable default features
  -r, --release                        No-op. For compatibility with `cargo run --release`
  -v, --verbose                        Print extra output to help debug problems
  -o, --output <OUTPUT>                Output file [default: flamegraph.svg]
      --open                           Open the output .svg file with default program
      --root                           Run with root privileges (using `sudo`)
  -F, --freq <FREQUENCY>               Sampling frequency in Hz [default: 997]
  -c, --cmd <CUSTOM_CMD>               Custom command for invoking perf/dtrace
      --deterministic                  Colors are selected such that the color of a function does not change between runs
  -i, --inverted                       Plot the flame graph up-side-down
      --reverse                        Generate stack-reversed flame graph
      --notes <STRING>                 Set embedded notes in SVG
      --min-width <FLOAT>              Omit functions smaller than <FLOAT> pixels [default: 0.01]
      --image-width <IMAGE_WIDTH>      Image width in pixels
      --palette <PALETTE>              Color palette [possible values: hot, mem, io, red, green, blue, aqua, yellow, purple, orange, wakeup, java, perl, js, rust]
      --skip-after <FUNCTION>          Cut off stack frames below <FUNCTION>; may be repeated
      --flamechart                     Produce a flame chart (sort by time, do not merge stacks)
      --ignore-status                  Ignores perf's exit code
      --no-inline                      Disable inlining for perf script because of performance issues
      --post-process <POST_PROCESS>    Run a command to process the folded stacks, taking the input from stdin and outputting to stdout
  -h, --help                           Print help
  -V, --version                        Print version
```

Then open the resulting flamegraph.svg with a browser, because most image viewers do not support interactive SVG files.
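
Alternatively, the --open flag listed in the usage above hands the file to your system's default program:

```sh
cargo flamegraph --open   # generate flamegraph.svg, then open it automatically
```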

Enabling perf for use by unprivileged users

To enable perf without running as root, you may lower the perf_event_paranoid value in proc to a level appropriate for your environment. The most permissive value is -1, but it may not be acceptable for your security needs.

```sh
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
```
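
This setting does not survive a reboot. To persist it, the standard sysctl drop-in mechanism can be used (the file name below is just a convention, not something flamegraph requires):

```sh
# hypothetical drop-in file name; any name under /etc/sysctl.d/ works
echo 'kernel.perf_event_paranoid = -1' | sudo tee /etc/sysctl.d/99-perf.conf
sudo sysctl --system   # reload sysctl configuration from all drop-in directories
```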

Improving output when running with --release

Due to compiler optimizations, the quality of the information presented in the flamegraph can sometimes suffer when profiling release builds.

To counter this to some extent, you may either set the following in your Cargo.toml file:

```toml
[profile.release]
debug = true
```

Or set the environment variable CARGO_PROFILE_RELEASE_DEBUG=true.
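
For a one-off run, the variable can also be set inline:

```sh
CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph
```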

Please note that tests, unit tests and benchmarks use the bench profile in release mode (see here).

Usage with benchmarks

In order to profile existing benchmarks with perf, you need to set up a few configs. Set the following in your Cargo.toml file to run benchmarks:

```toml
[profile.bench]
debug = true
```

Use custom paths for perf and dtrace

If the PERF or DTRACE environment variable is set, it will be used as the command for the corresponding tool. For example, to use perf from ~/bin:

```sh
env PERF=~/bin/perf flamegraph /path/to/my/binary
```

Use a custom addr2line binary for perf

Several issues (#74, #199, #294) have reported that addr2line can run very slowly. One solution is to use gimli-rs/addr2line instead of the system addr2line binary. This is suggested in this comment, and you can follow the steps below to set it up:

```sh
cargo install addr2line --features=bin
```
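
A minimal sketch of verifying the swap, assuming perf resolves addr2line via your PATH and that cargo installed the binary to ~/.cargo/bin (the default):

```sh
# ~/.cargo/bin must precede the system location in PATH for the swap to take effect
which addr2line    # expect: ~/.cargo/bin/addr2line
cargo flamegraph   # perf script should now symbolize with the gimli-rs binary
```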

Systems Performance Work Guided By Flamegraphs

Flamegraphs are used to visualize where time is being spent in your program. Many times per second, the threads in a program are interrupted, and the current location in your code (based on the thread's instruction pointer) is recorded, along with the chain of functions that were called to get there. This is called stack sampling. These samples are then processed, and stacks that share common functions are added together. Then an SVG is generated, showing the call stacks that were measured, widened to the proportion of all stack samples that contained them.
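
On Linux this corresponds roughly to the classic perf pipeline sketched below. flamegraph drives all of these steps for you; the inferno-* tools come from the Inferno crate mentioned above (installable with `cargo install inferno`), and ./my-binary is a placeholder:

```sh
# sample stacks roughly 997 times per second (flamegraph's default frequency)
perf record -F 997 --call-graph dwarf ./my-binary
# symbolize the samples, merge stacks sharing common frames, and render the SVG
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg
```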

The y-axis shows the stack depth. When looking at a flamegraph, the main function of your program will be closer to the bottom, and the called functions will be stacked on top, with the functions that they call stacked on top of them, and so on.

The x-axis spans all of the samples. It does not show the passing of time from left to right; the left-to-right ordering has no meaning.

The width of each box shows the total time that that function is on the CPU or is part of the call stack. If a function's box is wider than others, that means that it either consumes more CPU per execution than other functions, or that it is simply called more often.

The color of each box isn't significant, and is chosen at random.

Flamegraphs are good for visualizing where the most expensive parts of your program are at runtime, which is wonderful because...

Humans are terrible at guessing about performance!

People who come to Rust from C and C++, in particular, will often over-optimize things in code that LLVM is able to optimize away on its own. It's always better to write Rust in a clear and obvious way before beginning micro-optimizations, allocation-minimization, and so on.

Lots of things that seem like they would have terrible performance are actually cheap or free in Rust. Closures are fast. Initialization on the stack before moving to a Box is often compiled away. Clones are often compiled away. So, clone() away instead of fighting too long to make the compiler happy about ownership!

Then make a flamegraph to see if any of that was actually expensive.

Flamegraphs Are the Beginning, Not the End

Flamegraphs show you the things that are taking up time, but they are a sampling technique to be used for high-level, initial looks at the system under measurement. They are great for finding the things to look into more closely, and often it will be obvious how to improve something based on its flamegraph, but they are really more for choosing the target of optimization than for measuring the optimization itself. They are coarse-grained and difficult to diff (although this may be supported soon). Also, because flamegraphs are based on the proportion of total time that something takes, if you accidentally make something else really slow, everything else will appear smaller on the flamegraph: even though the entire program runtime is much slower, the items you were hoping to optimize look smaller.

It is a good idea to use flamegraphs to figure out what you want to optimize, and then set up a measurement environment that allows you to determine that an improvement has actually happened.

  • use flamegraphs to find a set of optimization targets
  • create benchmarks for these optimization targets, and if appropriate use something like cachegrind and cg_diff to measure CPU instructions and diff them against the previous version (see the sketch after this list)
  • Measuring CPU instructions is often better than measuring the time it takes to run a workload, because it's possible that a background task on your machine ran and caused something to slow down in terms of physical time; if you actually made an implementation faster, that is more likely to correlate with reduced total CPU instructions.
  • Time spent on the CPU is not the full picture, as time is also spent waiting for IO to complete, which does not get accounted for by tools like perf that only measure what's consuming time on the CPU. Check out Brendan Gregg's article on Off-CPU Accounting for more information about this!
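
A sketch of the cachegrind/cg_diff workflow from the list above, where ./target/release/my-bench stands in for your benchmark binary:

```sh
# count CPU instructions for the baseline version
valgrind --tool=cachegrind --cachegrind-out-file=before.out ./target/release/my-bench
# ...apply your optimization, rebuild, and measure again:
valgrind --tool=cachegrind --cachegrind-out-file=after.out ./target/release/my-bench
# diff the two instruction-count profiles (cg_diff ships with valgrind)
cg_diff before.out after.out
```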

Performance Theory 101: Basics of Quantitative Engineering

  • Use realistic workloads on realistic hardware, or your data doesn't necessarily correspond very much with what will be happening in production
  • All of our guesses are wrong to some extent, so we have to measure the effects of our work. Often the simple code that doesn't seem like it should be fast is actually way faster than code that looks optimized. We need to measure our optimizations to make sure that we didn't make our code both harder to read AND slower.
  • Measure before you change anything, and save the results in a safe place! Many profiling tools will overwrite their old output when you run them again, so make sure you take care to save the data before you begin so that you can compare before and after.
  • Take measurements on a warmed-up machine that isn't doing anything else and has had time to cool off from the last workload. CPUs will fall asleep and drop into power-saving modes when idle, and they will also throttle if they get too hot (sometimes SIMD can cause things to run slower because it heats things up so much that the core has to throttle).

Performance Theory 202: USE Method

The USE Method is a way to very quickly locate performance problems while minimizing discovery effort. It's more about finding production issues than about flamegraphs directly, but it's a great technique to have in your toolbox if you are going to be doing performance triage, and flamegraphs can be helpful for identifying the components to then drill down into with queue analysis.

Everything in a computer can be thought of as a resource with a queue in front of it, which can serve one or more requests at a time. The various systems in our computers and programs can do a certain amount of work over time before requests start to pile up and wait in a line until they are able to be serviced.

Some resources can handle more and more work without degrading in performance until they hit their maximum utilization point. Network devices can be thought of as working in this way to a large extent. Other resources start to saturate long before they hit their maximum utilization point, like disks.

Disks (especially spinning disks, but even SSDs) will do more and more work if you allow more work to queue up, until they hit their maximum throughput for a workload, but the latency per request will go up before they hit 100% utilization, because the disk will take longer before it can begin servicing each request. Tuning disk performance often involves measuring the various IO queue depths to make sure they are high enough to get good throughput but not so high that latency becomes undesirable.

Anyway, nearly everything in our systems can be broken down and analyzed in terms of three high-level characteristics:

  • Utilization is the amount of time the system under measurement is actually doing useful work servicing a request, and can be measured as the percent of available time spent servicing requests
  • Saturation is when requests have to wait before being serviced. This can be measured as the queue depth over time
  • Errors are when things start to fail, like when queues are no longer able to accept any new requests - like when a TCP connection is rejected because the system's TCP backlog is already full of connections that have not yet been accept'ed by the userspace program.

This forms the necessary background to start applying the USE Method to locate the performance-related issue in your complex system!

The approach is:

  1. Enumerate the various resources that might be behaving poorly - maybe by creating a flamegraph and looking for functions that are taking more of the total runtime than expected
  2. Pick one of them
  3. (Errors) Check for errors like TCP connection failures, other IO failures, and bad things in logs
  4. (Utilization) Measure the utilization of the system and see if its throughput is approaching the known maximum, or the point where it is known to experience saturation
  5. (Saturation) Is saturation actually happening? Are requests waiting in line before being serviced? Is latency going up while throughput stays the same? (Example commands for these checks follow this list.)
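
As a concrete starting point, a few stock Linux commands cover these three checks for common resources; column names vary between versions, so treat this as a sketch rather than a recipe:

```sh
# Errors: scan kernel logs for recent failures
dmesg --level=err,warn | tail
# Utilization: per-device disk utilization (the %util column) and CPU usage
iostat -x 1 5
# Saturation: run-queue length (r) and processes blocked on IO (b)
vmstat 1 5
```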

These probing questions serve as a piercing flashlight for rapidly identifying the underlying issue most of the time.

If you want to learn more about this, check out Brendan Gregg's blog post on it. I tend to recommend that anyone who is becoming an SRE should make Brendan's Systems Performance book one of the first things they read to understand how to measure these things quickly in production systems.

The USE Method derives from an area of study called queue theory, which has had a huge impact on the world of computing, as well as on many other logistical endeavors that humans have undertaken.

Performance Laws

If you want to drill more into theory, know the law(s)!

  • The Universal Law of Scalability is about the relationship between concurrency gains, queuing, and coordination costs
  • Amdahl's Law is about the theoretical maximum gain that can be made for a workload by parallelization.
  • Little's Law is a deceptively simple law, with some subtle implications from queue theory, that allows us to reason about appropriate queue lengths for our systems
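
As a quick illustration of Little's Law (L = λW): a service receiving λ = 1,000 requests per second with a mean latency of W = 5 ms holds L = 1000 × 0.005 = 5 requests in flight on average. If you routinely measure far more than that in flight, requests are queuing and latency will grow.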
