Python support for the Linux perf profiler

Author: Pablo Galindo

The Linux perf profiler is a very powerful tool that allows you to profile and obtain information about the performance of your application. perf also has a very vibrant ecosystem of tools that aid with the analysis of the data that it produces.

The main problem with using the perf profiler with Python applications is that perf only gets information about native symbols, that is, the names of functions and procedures written in C. This means that the names and file names of Python functions in your code will not appear in the output of perf.

Since Python 3.12, the interpreter can run in a special mode that allows Python functions to appear in the output of the perf profiler. When this mode is enabled, the interpreter will interpose a small piece of code compiled on the fly before the execution of every Python function, and it will teach perf the relationship between this piece of code and the associated Python function using perf map files.
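To see this mechanism in action, the following sketch activates the trampoline and prints the py:: entries that the interpreter writes to its perf map file (/tmp/perf-<pid>.map). This assumes a Linux build with trampoline support, and the helper name is made up for illustration:

import os
import sys

def show_perf_map_entries():
    # Activate the trampoline so the interpreter starts emitting
    # perf map entries for Python functions.
    sys.activate_stack_trampoline("perf")

    def sample(n):
        return sum(range(n))

    sample(100)  # force at least one trampoline to be generated
    sys.deactivate_stack_trampoline()

    # Each line has the form "<start-address> <size> <symbol>";
    # Python functions appear with a "py::" prefix.
    with open(f"/tmp/perf-{os.getpid()}.map") as f:
        for line in f:
            if "py::" in line:
                print(line.rstrip())

if __name__ == "__main__":
    show_perf_map_entries()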

Note

Support for the perf profiler is currently only available for Linux on select architectures. Check the output of the configure build step or check the output of python -m sysconfig | grep HAVE_PERF_TRAMPOLINE to see if your system is supported.
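The same check can also be done from Python itself; a minimal sketch, assuming the build records the HAVE_PERF_TRAMPOLINE configuration variable:

import sysconfig

# Truthy on builds compiled with perf trampoline support.
if sysconfig.get_config_var("HAVE_PERF_TRAMPOLINE"):
    print("perf trampoline support is available")
else:
    print("perf trampoline support is not available")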

For example, consider the following script:

def foo(n):
    result = 0
    for _ in range(n):
        result += 1
    return result


def bar(n):
    foo(n)


def baz(n):
    bar(n)


if __name__ == "__main__":
    baz(1000000)

We can run perf to sample CPU stack traces at 9999 hertz:

$ perf record -F 9999 -g -o perf.data python my_script.py

Then we can use perf report to analyze the data:

$ perf report --stdio -n -g

# Children      Self       Samples  Command     Shared Object       Symbol
# ........  ........  ............  ..........  ..................  ..........................
#
    91.08%     0.00%             0  python.exe  python.exe          [.] _start
            |
            ---_start
            |
             --90.71%--__libc_start_main
                       Py_BytesMain
                       |
                       |--56.88%--pymain_run_python.constprop.0
                       |          |
                       |          |--56.13%--_PyRun_AnyFileObject
                       |          |          _PyRun_SimpleFileObject
                       |          |          |
                       |          |          |--55.02%--run_mod
                       |          |          |          |
                       |          |          |           --54.65%--PyEval_EvalCode
                       |          |          |                     _PyEval_EvalFrameDefault
                       |          |          |                     PyObject_Vectorcall
                       |          |          |                     _PyEval_Vector
                       |          |          |                     _PyEval_EvalFrameDefault
                       |          |          |                     PyObject_Vectorcall
                       |          |          |                     _PyEval_Vector
                       |          |          |                     _PyEval_EvalFrameDefault
                       |          |          |                     PyObject_Vectorcall
                       |          |          |                     _PyEval_Vector
                       |          |          |                     |
                       |          |          |                     |--51.67%--_PyEval_EvalFrameDefault
                       |          |          |                     |          |
                       |          |          |                     |          |--11.52%--_PyLong_Add
                       |          |          |                     |          |          |
                       |          |          |                     |          |          |--2.97%--_PyObject_Malloc
...

As you can see, the Python functions are not shown in the output; only _PyEval_EvalFrameDefault (the function that evaluates the Python bytecode) shows up. Unfortunately, that's not very useful because all Python functions use the same C function to evaluate bytecode, so we cannot know which Python function corresponds to which bytecode-evaluating function.

Instead, if we run the same experiment with perf support enabled we get:

$ perf report --stdio -n -g

# Children      Self       Samples  Command     Shared Object       Symbol
# ........  ........  ............  ..........  ..................  ..........................
#
    90.58%     0.36%             1  python.exe  python.exe          [.] _start
            |
            ---_start
            |
             --89.86%--__libc_start_main
                       Py_BytesMain
                       |
                       |--55.43%--pymain_run_python.constprop.0
                       |          |
                       |          |--54.71%--_PyRun_AnyFileObject
                       |          |          _PyRun_SimpleFileObject
                       |          |          |
                       |          |          |--53.62%--run_mod
                       |          |          |          |
                       |          |          |           --53.26%--PyEval_EvalCode
                       |          |          |                     py::<module>:/src/script.py
                       |          |          |                     _PyEval_EvalFrameDefault
                       |          |          |                     PyObject_Vectorcall
                       |          |          |                     _PyEval_Vector
                       |          |          |                     py::baz:/src/script.py
                       |          |          |                     _PyEval_EvalFrameDefault
                       |          |          |                     PyObject_Vectorcall
                       |          |          |                     _PyEval_Vector
                       |          |          |                     py::bar:/src/script.py
                       |          |          |                     _PyEval_EvalFrameDefault
                       |          |          |                     PyObject_Vectorcall
                       |          |          |                     _PyEval_Vector
                       |          |          |                     py::foo:/src/script.py
                       |          |          |                     |
                       |          |          |                     |--51.81%--_PyEval_EvalFrameDefault
                       |          |          |                     |          |
                       |          |          |                     |          |--13.77%--_PyLong_Add
                       |          |          |                     |          |          |
                       |          |          |                     |          |          |--3.26%--_PyObject_Malloc
...

How to enable perf profiling support

perf profiling support can be enabled either from the start using the environment variable PYTHONPERFSUPPORT or the -X perf option, or dynamically using sys.activate_stack_trampoline() and sys.deactivate_stack_trampoline().

The sys functions take precedence over the -X option, and the -X option takes precedence over the environment variable.
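Whichever mechanism you use, you can confirm at runtime whether the trampoline ended up active with sys.is_stack_trampoline_active() (available since Python 3.12):

import sys

# True if a stack trampoline is currently installed, regardless of
# whether it was enabled via PYTHONPERFSUPPORT, -X perf, or the sys APIs.
print(sys.is_stack_trampoline_active())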

Example, using the environment variable:

$ PYTHONPERFSUPPORT=1 perf record -F 9999 -g -o perf.data python my_script.py
$ perf report -g -i perf.data

Example, using the -X option:

$ perf record -F 9999 -g -o perf.data python -X perf my_script.py
$ perf report -g -i perf.data

Example, using the sys APIs in file example.py:

import sys

sys.activate_stack_trampoline("perf")
do_profiled_stuff()
sys.deactivate_stack_trampoline()

non_profiled_stuff()

…then:

$ perf record -F 9999 -g -o perf.data python ./example.py
$ perf report -g -i perf.data
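If you toggle profiling around several blocks, a small wrapper can keep the activate/deactivate calls paired even when the profiled code raises. This is just a sketch, not a standard-library API; do_profiled_stuff and non_profiled_stuff here are stand-ins like the placeholders in the example above:

import sys
from contextlib import contextmanager

@contextmanager
def stack_trampoline(backend="perf"):
    # Hypothetical helper: activate the trampoline for the duration
    # of the block and always deactivate it on the way out.
    sys.activate_stack_trampoline(backend)
    try:
        yield
    finally:
        sys.deactivate_stack_trampoline()

def do_profiled_stuff():
    sum(range(1_000_000))  # stand-in for the code you want to profile

def non_profiled_stuff():
    pass  # stand-in for code that should not be profiled

with stack_trampoline():
    do_profiled_stuff()

non_profiled_stuff()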

How to obtain the best results

For best results, Python should be compiled with CFLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer", as this allows profilers to unwind using only the frame pointer rather than DWARF debug information. This is necessary because the code that is interposed to allow perf support is dynamically generated, so it doesn't have any DWARF debugging information available.

You can check if your system has been compiled with this flag by running:

$ python -m sysconfig | grep 'no-omit-frame-pointer'

If you don't see any output, it means that your interpreter has not been compiled with frame pointers and therefore it may not be able to show Python functions in the output of perf.
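You can perform the same check from Python; a sketch using sysconfig (CFLAGS here reflects the flags recorded when the interpreter was built):

import sysconfig

# Mirrors the grep above: look for the frame-pointer flag in the
# CFLAGS recorded at build time.
cflags = sysconfig.get_config_var("CFLAGS") or ""
if "no-omit-frame-pointer" in cflags:
    print("interpreter built with frame pointers")
else:
    print("frame pointers may have been omitted")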

How to work without frame pointers

If you are working with a Python interpreter that has been compiled without frame pointers, you can still use the perf profiler, but the overhead will be a bit higher because Python needs to generate unwinding information for every Python function call on the fly. Additionally, perf will take more time to process the data because it will need to use the DWARF debugging information to unwind the stack, and this is a slow process.

To enable this mode, you can use the environment variable PYTHON_PERF_JIT_SUPPORT or the -X perf_jit option, which will enable the JIT mode for the perf profiler.

Note

Due to a bug in the perf tool, only perf versions higher than v6.8 will work with the JIT mode. The fix was also backported to the v6.7.2 version of the tool.

Note that when checking the version of the perf tool (which can be done by running perf version) you must take into account that some distros add custom version numbers including a - character. This means that perf 6.7-3 is not necessarily perf 6.7.3.
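A heuristic sketch of such a version check, assuming perf is on PATH (it parses only the dotted upstream version and ignores any distro suffix, so it cannot detect distro backports carried under older version numbers):

import re
import subprocess

# Parse only the dotted numeric prefix of the reported version, so a
# distro suffix like "6.7-3" is read as (6, 7) rather than (6, 7, 3).
out = subprocess.run(
    ["perf", "version"], capture_output=True, text=True, check=True
).stdout
match = re.search(r"(\d+(?:\.\d+)+)", out)
version = tuple(int(part) for part in match.group(1).split("."))
print("parsed perf version:", version)
# The fix landed in v6.8 and was backported to v6.7.2.
print("new enough for JIT mode:", version >= (6, 7, 2))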

When using the perf JIT mode, you need an extra step before you can run perf report. You need to call the perf inject command to inject the JIT information into the perf.data file:

$ perf record -F 9999 -g -k 1 --call-graph dwarf -o perf.data python -X perf_jit my_script.py
$ perf inject -i perf.data --jit --output perf.jit.data
$ perf report -g -i perf.jit.data

or using the environment variable:

$ PYTHON_PERF_JIT_SUPPORT=1 perf record -F 9999 -g --call-graph dwarf -o perf.data python my_script.py
$ perf inject -i perf.data --jit --output perf.jit.data
$ perf report -g -i perf.jit.data

The perf inject --jit command will read perf.data, automatically pick up the perf dump file that Python creates (in /tmp/perf-$PID.dump), and then create perf.jit.data, which merges all the JIT information together. It should also create a lot of jitted-XXXX-N.so files in the current directory, which are ELF images for all the JIT trampolines that were created by Python.
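If you want to confirm these artifacts were produced, a quick sketch (the paths assume the defaults described above):

import glob

# The dump file(s) written by the interpreter live in /tmp, and the
# ELF images are written to the directory where "perf inject" ran.
print(glob.glob("/tmp/perf-*.dump"))
print(sorted(glob.glob("jitted-*.so")))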

Warning

When using --call-graph dwarf, the perf tool will take snapshots of the stack of the process being profiled and save the information in the perf.data file. By default, the size of the stack dump is 8192 bytes, but you can change the size by passing it after a comma, like --call-graph dwarf,16384.

The size of the stack dump is important because if the size is too small perf will not be able to unwind the stack and the output will be incomplete. On the other hand, if the size is too big, then perf won't be able to sample the process as frequently as it would like because the overhead will be higher.

The stack size is particularly important when profiling Python code compiled with low optimization levels (like -O0), as these builds tend to have larger stack frames. If you are compiling Python with -O0 and not seeing Python functions in your profiling output, try increasing the stack dump size to 65528 bytes (the maximum):

$ perf record -F 9999 -g -k 1 --call-graph dwarf,65528 -o perf.data python -X perf_jit my_script.py

Different compilation flags can significantly impact stack sizes:

  • Builds with -O0 typically have much larger stack frames than those with -O1 or higher

  • Adding optimizations (-O1, -O2, etc.) typically reduces stack size

  • Frame pointers (-fno-omit-frame-pointer) generally provide more reliable stack unwinding