Python support for the Linux perf profiler¶
- author: Pablo Galindo
The Linux perf profiler is a very powerful tool that allows you to profile and obtain information about the performance of your application. perf also has a very vibrant ecosystem of tools that aid with the analysis of the data that it produces.
The main problem with using the perf profiler with Python applications is that perf only gets information about native symbols, that is, the names of functions and procedures written in C. This means that the names and file names of Python functions in your code will not appear in the output of perf.
Since Python 3.12, the interpreter can run in a special mode that allows Python functions to appear in the output of the perf profiler. When this mode is enabled, the interpreter will interpose a small piece of code compiled on the fly before the execution of every Python function and it will teach perf the relationship between this piece of code and the associated Python function using perf map files.
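Those perf map files live in /tmp as perf-&lt;pid&gt;.map and use a simple three-field line format: a start address in hex, a size in hex, and a symbol name. A minimal sketch of a parser for that format, with an illustrative sample line (the address and size values here are made up):

```python
# Each line of a perf map file has three space-separated fields:
# start address (hex), code size (hex), and the symbol name.
def parse_perf_map_line(line):
    start, size, name = line.split(maxsplit=2)
    return int(start, 16), int(size, 16), name.strip()

# Illustrative line in the style Python emits for its trampolines:
sample = "7f3e1c0b4000 78 py::foo:/src/script.py"
start, size, name = parse_perf_map_line(sample)
print(hex(start), size, name)
```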
Note
Support for the perf profiler is currently only available for Linux on select architectures. Check the output of the configure build step or check the output of python -m sysconfig | grep HAVE_PERF_TRAMPOLINE to see if your system is supported.
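The same check can also be done from within Python; a small sketch using sysconfig (on interpreters older than 3.12 the variable is absent and get_config_var() returns None):

```python
import sysconfig

# HAVE_PERF_TRAMPOLINE is 1 on builds that support the perf
# trampoline, 0 otherwise, and None on interpreters that predate it.
supported = bool(sysconfig.get_config_var("HAVE_PERF_TRAMPOLINE"))
print("perf trampoline support:", supported)
```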
For example, consider the following script:
def foo(n):
    result = 0
    for _ in range(n):
        result += 1
    return result

def bar(n):
    foo(n)

def baz(n):
    bar(n)

if __name__ == "__main__":
    baz(1000000)
We can run perf to sample CPU stack traces at 9999 hertz:

$ perf record -F 9999 -g -o perf.data python my_script.py
Then we can use perf report to analyze the data:

$ perf report --stdio -n -g

# Children      Self       Samples  Command     Shared Object       Symbol
# ........  ........  ............  ..........  ..................  ..........................................
#
    91.08%     0.00%             0  python.exe  python.exe          [.] _start
            |
            ---_start
               |
                --90.71%--__libc_start_main
                          Py_BytesMain
                          |
                          |--56.88%--pymain_run_python.constprop.0
                          |          |
                          |          |--56.13%--_PyRun_AnyFileObject
                          |          |          _PyRun_SimpleFileObject
                          |          |          |
                          |          |          |--55.02%--run_mod
                          |          |          |          |
                          |          |          |           --54.65%--PyEval_EvalCode
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     |
                          |          |          |                     |--51.67%--_PyEval_EvalFrameDefault
                          |          |          |                     |          |
                          |          |          |                     |          |--11.52%--_PyLong_Add
                          |          |          |                     |          |          |
                          |          |          |                     |          |          |--2.97%--_PyObject_Malloc
...
As you can see, the Python functions are not shown in the output, only _PyEval_EvalFrameDefault (the function that evaluates the Python bytecode) shows up. Unfortunately that’s not very useful because all Python functions use the same C function to evaluate bytecode, so we cannot know which Python function corresponds to which bytecode-evaluating function.
Instead, if we run the same experiment with perf support enabled we get:

$ perf report --stdio -n -g

# Children      Self       Samples  Command     Shared Object       Symbol
# ........  ........  ............  ..........  ..................  .....................................................................
#
    90.58%     0.36%             1  python.exe  python.exe          [.] _start
            |
            ---_start
               |
                --89.86%--__libc_start_main
                          Py_BytesMain
                          |
                          |--55.43%--pymain_run_python.constprop.0
                          |          |
                          |          |--54.71%--_PyRun_AnyFileObject
                          |          |          _PyRun_SimpleFileObject
                          |          |          |
                          |          |          |--53.62%--run_mod
                          |          |          |          |
                          |          |          |           --53.26%--PyEval_EvalCode
                          |          |          |                     py::<module>:/src/script.py
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     py::baz:/src/script.py
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     py::bar:/src/script.py
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     py::foo:/src/script.py
                          |          |          |                     |
                          |          |          |                     |--51.81%--_PyEval_EvalFrameDefault
                          |          |          |                     |          |
                          |          |          |                     |          |--13.77%--_PyLong_Add
                          |          |          |                     |          |          |
                          |          |          |                     |          |          |--3.26%--_PyObject_Malloc
How to enable perf profiling support¶
perf profiling support can be enabled either from the start using the environment variable PYTHONPERFSUPPORT or the -X perf option, or dynamically using sys.activate_stack_trampoline() and sys.deactivate_stack_trampoline().
The sys functions take precedence over the -X option, and the -X option takes precedence over the environment variable.
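For dynamic use, the sys APIs can be wrapped so that the same code also runs on builds without trampoline support, where sys.activate_stack_trampoline() raises ValueError. A sketch, with run_profiled being a hypothetical helper name:

```python
import sys

def run_profiled(func, *args, **kwargs):
    # Activate the perf trampoline around a single call, falling back
    # to a plain call where the backend is unavailable.
    active = False
    if hasattr(sys, "activate_stack_trampoline"):  # Python >= 3.12
        try:
            sys.activate_stack_trampoline("perf")
            active = True
        except ValueError:
            pass  # build without perf trampoline support
    try:
        return func(*args, **kwargs)
    finally:
        if active:
            sys.deactivate_stack_trampoline()

print(run_profiled(sum, range(10)))  # -> 45, profiled when perf support is available
```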
Example, using the environment variable:

$ PYTHONPERFSUPPORT=1 perf record -F 9999 -g -o perf.data python my_script.py
$ perf report -g -i perf.data
Example, using the -X option:

$ perf record -F 9999 -g -o perf.data python -X perf my_script.py
$ perf report -g -i perf.data
Example, using the sys APIs in file example.py:

import sys

sys.activate_stack_trampoline("perf")
do_profiled_stuff()
sys.deactivate_stack_trampoline()

non_profiled_stuff()
…then:

$ perf record -F 9999 -g -o perf.data python ./example.py
$ perf report -g -i perf.data
How to obtain the best results¶
For best results, Python should be compiled with CFLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer", as this allows profilers to unwind using only the frame pointer and not DWARF debug information. This is because the code that is interposed to allow perf support is dynamically generated, so it doesn’t have any DWARF debugging information available.
You can check if your system has been compiled with this flag by running:
$ python -m sysconfig | grep 'no-omit-frame-pointer'
If you don’t see any output it means that your interpreter has not been compiled with frame pointers and therefore it may not be able to show Python functions in the output of perf.
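This check can also be scripted. A sketch that inspects the configured CFLAGS through sysconfig (note that CFLAGS can be None on builds that do not use autoconf, such as Windows):

```python
import sysconfig

# The flag is recorded in the build-time CFLAGS; fall back to an
# empty string where no CFLAGS were recorded.
cflags = sysconfig.get_config_var("CFLAGS") or ""
if "no-omit-frame-pointer" in cflags:
    print("compiled with frame pointers")
else:
    print("frame pointers may be omitted; perf output may lack Python names")
```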
How to work without frame pointers¶
If you are working with a Python interpreter that has been compiled without frame pointers, you can still use the perf profiler, but the overhead will be a bit higher because Python needs to generate unwinding information for every Python function call on the fly. Additionally, perf will take more time to process the data because it will need to use the DWARF debugging information to unwind the stack and this is a slow process.
To enable this mode, you can use the environment variable PYTHON_PERF_JIT_SUPPORT or the -X perf_jit option, which will enable the JIT mode for the perf profiler.
Note
Due to a bug in the perf tool, only perf versions higher than v6.8 will work with the JIT mode. The fix was also backported to the v6.7.2 version of the tool.
Note that when checking the version of the perf tool (which can be done by running perf version) you must take into account that some distros add some custom version numbers including a - character. This means that perf 6.7-3 is not necessarily perf 6.7.3.
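The note above can be turned into a mechanical check. A sketch that parses a perf version line and decides whether the JIT fix is present (perf_jit_ok is a hypothetical helper; only dot-separated components are treated as version parts, so a distro suffix like -3 is ignored):

```python
import re

def perf_jit_ok(version_line):
    # Parse "perf version X.Y[.Z]"; a distro suffix such as "-3" is
    # deliberately not treated as a patch level.
    m = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", version_line)
    if not m:
        return False
    major, minor, patch = (int(g or 0) for g in m.groups())
    # JIT mode needs v6.8+ or the v6.7.2 backport, i.e. >= 6.7.2.
    return (major, minor, patch) >= (6, 7, 2)

print(perf_jit_ok("perf version 6.8.1"))  # True
print(perf_jit_ok("perf version 6.7-3"))  # False: the suffix is not a patch level
```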
When using the perf JIT mode, you need an extra step before you can run perf report. You need to call the perf inject command to inject the JIT information into the perf.data file:
$ perf record -F 9999 -g -k 1 --call-graph dwarf -o perf.data python -X perf_jit my_script.py
$ perf inject -i perf.data --jit --output perf.jit.data
$ perf report -g -i perf.jit.data
or using the environment variable:
$ PYTHON_PERF_JIT_SUPPORT=1 perf record -F 9999 -g --call-graph dwarf -o perf.data python my_script.py
$ perf inject -i perf.data --jit --output perf.jit.data
$ perf report -g -i perf.jit.data
The perf inject --jit command will read perf.data, automatically pick up the perf dump file that Python creates (in /tmp/perf-$PID.dump), and then create perf.jit.data which merges all the JIT information together. It should also create a lot of jitted-XXXX-N.so files in the current directory which are ELF images for all the JIT trampolines that were created by Python.
Warning
When using --call-graph dwarf, the perf tool will take snapshots of the stack of the process being profiled and save the information in the perf.data file. By default, the size of the stack dump is 8192 bytes, but you can change the size by passing it after a comma like --call-graph dwarf,16384.
The size of the stack dump is important because if the size is too small perf will not be able to unwind the stack and the output will be incomplete. On the other hand, if the size is too big, then perf won’t be able to sample the process as frequently as it would like as the overhead will be higher.
The stack size is particularly important when profiling Python code compiled with low optimization levels (like -O0), as these builds tend to have larger stack frames. If you are compiling Python with -O0 and not seeing Python functions in your profiling output, try increasing the stack dump size to 65528 bytes (the maximum):
$ perf record -F 9999 -g -k 1 --call-graph dwarf,65528 -o perf.data python -X perf_jit my_script.py
Different compilation flags can significantly impact stack sizes:
- Builds with -O0 typically have much larger stack frames than those with -O1 or higher
- Adding optimizations (-O1, -O2, etc.) typically reduces stack size
- Frame pointers (-fno-omit-frame-pointer) generally provide more reliable stack unwinding