Python support for the perf map compatible profilers

author: Pablo Galindo

The Linux perf profiler and samply are powerful tools that allow you to profile and obtain information about the performance of your application. Both tools have vibrant ecosystems that aid with the analysis of the data they produce.

The main problem with using these profilers with Python applications is that they only get information about native symbols, that is, the names of functions and procedures written in C. This means that the names and file names of Python functions in your code will not appear in the profiler output.

Since Python 3.12, the interpreter can run in a special mode that allows Python functions to appear in the output of compatible profilers. When this mode is enabled, the interpreter will interpose a small piece of code compiled on the fly before the execution of every Python function, and it will teach the profiler the relationship between this piece of code and the associated Python function using perf map files.

Note

Support for profiling is available on Linux and macOS on select architectures. Perf is available on Linux, while samply can be used on both Linux and macOS. samply support on macOS is available starting from Python 3.15. Check the output of the configure build step or check the output of python -m sysconfig | grep HAVE_PERF_TRAMPOLINE to see if your system is supported.
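To make the mechanism concrete, here is a minimal sketch (the helper function and numbers are made up) that enables the trampoline from Python and prints the perf map file generated for the current process. It assumes a Linux build with trampoline support, where the file is written to the conventional /tmp/perf-<pid>.map location:

import os
import sys

# Enable the trampoline for this process (requires a build with
# HAVE_PERF_TRAMPOLINE support; raises an error otherwise).
sys.activate_stack_trampoline("perf")

def busy_work(n):          # made-up helper, just to generate trampolines
    return sum(range(n))

busy_work(100_000)

# The interpreter writes one entry per executed code object to the
# conventional perf map location for this process.
map_file = f"/tmp/perf-{os.getpid()}.map"
if os.path.exists(map_file):
    with open(map_file) as f:
        # Each line maps an address range to a py::<function>:<file> symbol.
        print(f.read())

sys.deactivate_stack_trampoline()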

For example, consider the following script:

def foo(n):
    result = 0
    for _ in range(n):
        result += 1
    return result


def bar(n):
    foo(n)


def baz(n):
    bar(n)


if __name__ == "__main__":
    baz(1000000)

We can run perf to sample CPU stack traces at 9999 hertz:

$ perf record -F 9999 -g -o perf.data python my_script.py

Then we can use perf report to analyze the data:

$ perf report --stdio -n -g

# Children      Self       Samples  Command     Shared Object       Symbol
# ........  ........  ............  ..........  ..................  ..........................................
#
    91.08%     0.00%             0  python.exe  python.exe          [.] _start
            |
            ---_start
               |
                --90.71%--__libc_start_main
                          Py_BytesMain
                          |
                          |--56.88%--pymain_run_python.constprop.0
                          |          |
                          |          |--56.13%--_PyRun_AnyFileObject
                          |          |          _PyRun_SimpleFileObject
                          |          |          |
                          |          |          |--55.02%--run_mod
                          |          |          |          |
                          |          |          |           --54.65%--PyEval_EvalCode
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     |
                          |          |          |                     |--51.67%--_PyEval_EvalFrameDefault
                          |          |          |                     |          |
                          |          |          |                     |          |--11.52%--_PyCompactLong_Add
                          |          |          |                     |          |          |
                          |          |          |                     |          |          |--2.97%--_PyObject_Malloc
...

As you can see, the Python functions are not shown in the output, only _PyEval_EvalFrameDefault (the function that evaluates the Python bytecode) shows up. Unfortunately that’s not very useful because all Python functions use the same C function to evaluate bytecode so we cannot know which Python function corresponds to which bytecode-evaluating function.

Instead, if we run the same experiment with perf support enabled we get:

$ perf report --stdio -n -g

# Children      Self       Samples  Command     Shared Object       Symbol
# ........  ........  ............  ..........  ..................  ..........................................
#
    90.58%     0.36%             1  python.exe  python.exe          [.] _start
            |
            ---_start
               |
                --89.86%--__libc_start_main
                          Py_BytesMain
                          |
                          |--55.43%--pymain_run_python.constprop.0
                          |          |
                          |          |--54.71%--_PyRun_AnyFileObject
                          |          |          _PyRun_SimpleFileObject
                          |          |          |
                          |          |          |--53.62%--run_mod
                          |          |          |          |
                          |          |          |           --53.26%--PyEval_EvalCode
                          |          |          |                     py::<module>:/src/script.py
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     py::baz:/src/script.py
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     py::bar:/src/script.py
                          |          |          |                     _PyEval_EvalFrameDefault
                          |          |          |                     PyObject_Vectorcall
                          |          |          |                     _PyEval_Vector
                          |          |          |                     py::foo:/src/script.py
                          |          |          |                     |
                          |          |          |                     |--51.81%--_PyEval_EvalFrameDefault
                          |          |          |                     |          |
                          |          |          |                     |          |--13.77%--_PyCompactLong_Add
                          |          |          |                     |          |          |
                          |          |          |                     |          |          |--3.26%--_PyObject_Malloc

Using the samply profiler

samply is a modern profiler that can be used as an alternative to perf. It uses the same perf map files that Python generates, making it compatible with Python’s profiling support. samply is particularly useful on macOS where perf is not available.

To use samply with Python, first install it following the instructions at https://github.com/mstange/samply, then run:

$ samply record PYTHONPERFSUPPORT=1 python my_script.py

This will open a web interface where you can analyze the profiling data interactively. The advantage of samply is that this modern web-based interface works on both Linux and macOS.

On macOS, samply support requires Python 3.15 or later. Also on macOS, samply can’t profile signed Python executables due to restrictions by macOS. You can profile with Python binaries that you’ve compiled yourself, or which are unsigned or locally-signed (such as anything installed by Homebrew). In order to attach to running processes on macOS, run samply setup once (and every time samply is updated) to self-sign the samply binary.

How to enable perf profiling support

perf profiling support can be enabled either from the start using the environment variable PYTHONPERFSUPPORT or the -X perf option, or dynamically using sys.activate_stack_trampoline() and sys.deactivate_stack_trampoline().

The sys functions take precedence over the -X option, and the -X option takes precedence over the environment variable.

Example, using the environment variable:

$ PYTHONPERFSUPPORT=1 perf record -F 9999 -g -o perf.data python my_script.py
$ perf report -g -i perf.data

Example, using the -X option:

$ perf record -F 9999 -g -o perf.data python -X perf my_script.py
$ perf report -g -i perf.data

Example, using the sys APIs in file example.py:

import sys

sys.activate_stack_trampoline("perf")
do_profiled_stuff()
sys.deactivate_stack_trampoline()

non_profiled_stuff()

…then:

$ perf record -F 9999 -g -o perf.data python ./example.py
$ perf report -g -i perf.data
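If you toggle profiling dynamically with the sys APIs, it can also be useful to check whether a trampoline is currently active. A short sketch using sys.is_stack_trampoline_active(), which is available alongside the activation functions:

import sys

print(sys.is_stack_trampoline_active())   # False

sys.activate_stack_trampoline("perf")
print(sys.is_stack_trampoline_active())   # True

sys.deactivate_stack_trampoline()
print(sys.is_stack_trampoline_active())   # False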

How to obtain the best results

For best results, Python should be compiled with CFLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer", as this allows profilers to unwind using only the frame pointer and not DWARF debug information. This is because the code that is interposed to allow perf support is dynamically generated, so it doesn’t have any DWARF debugging information available.

You can check if your system has been compiled with this flag by running:

$ python -m sysconfig | grep 'no-omit-frame-pointer'

If you don’t see any output it means that your interpreter has not been compiled with frame pointers and therefore it may not be able to show Python functions in the output of perf.
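If you prefer to run the check from Python itself, a rough equivalent of the grep above is to search the build-time compiler flags recorded by sysconfig. The variable that carries the flag can differ between builds, so this sketch checks a few candidates:

import sysconfig

# The frame-pointer flag may be recorded in different build variables
# depending on how Python was configured, so check a few candidates.
flag = "no-omit-frame-pointer"
candidates = ("CFLAGS", "PY_CORE_CFLAGS", "CONFIG_ARGS")
found = any(flag in str(sysconfig.get_config_var(name) or "") for name in candidates)
print("frame pointers enabled" if found else "frame pointers not found")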

How to work without frame pointers

If you are working with a Python interpreter that has been compiled without frame pointers, you can still use the perf profiler, but the overhead will be a bit higher because Python needs to generate unwinding information for every Python function call on the fly. Additionally, perf will take more time to process the data because it will need to use the DWARF debugging information to unwind the stack and this is a slow process.

To enable this mode, you can use the environment variable PYTHON_PERF_JIT_SUPPORT or the -X perf_jit option, which will enable the JIT mode for the perf profiler.

Note

Due to a bug in the perf tool, only perf versions higher than v6.8 will work with the JIT mode. The fix was also backported to the v6.7.2 version of the tool.

Note that when checking the version of the perf tool (which can be done by running perf version) you must take into account that some distros add custom version numbers including a - character. This means that perf 6.7-3 is not necessarily perf 6.7.3.
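As a convenience, this version check can be scripted. The following sketch is only a heuristic: it extracts the numeric part of the perf version, ignoring distro suffixes such as -3, and compares it against the minimum versions mentioned above:

import re
import subprocess

# Extract the numeric part of the perf version, ignoring distro suffixes
# such as "-3", and compare against the minimum versions for JIT mode
# (v6.8, or v6.7.2 where the fix was backported).
output = subprocess.run(["perf", "version"], capture_output=True, text=True).stdout
match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
if match is None:
    print("could not parse perf version:", output.strip())
else:
    version = tuple(int(part or 0) for part in match.groups())
    verdict = "should support" if version >= (6, 7, 2) else "is likely too old for"
    print("perf", ".".join(map(str, version)), verdict, "the JIT mode")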

When using the perf JIT mode, you need an extra step before you can run perf report. You need to call the perf inject command to inject the JIT information into the perf.data file:

$ perf record -F 9999 -g -k 1 --call-graph dwarf -o perf.data python -X perf_jit my_script.py
$ perf inject -i perf.data --jit --output perf.jit.data
$ perf report -g -i perf.jit.data

or using the environment variable:

$ PYTHON_PERF_JIT_SUPPORT=1 perf record -F 9999 -g --call-graph dwarf -o perf.data python my_script.py
$ perf inject -i perf.data --jit --output perf.jit.data
$ perf report -g -i perf.jit.data

The perf inject --jit command will read perf.data, automatically pick up the perf dump file that Python creates (in /tmp/perf-$PID.dump), and then create perf.jit.data, which merges all the JIT information together. It should also create a lot of jitted-XXXX-N.so files in the current directory, which are ELF images for all the JIT trampolines that were created by Python.

Warning

When using --call-graph dwarf, the perf tool will take snapshots of the stack of the process being profiled and save the information in the perf.data file. By default, the size of the stack dump is 8192 bytes, but you can change the size by passing it after a comma like --call-graph dwarf,16384.

The size of the stack dump is important because if the size is too small perf will not be able to unwind the stack and the output will be incomplete. On the other hand, if the size is too big, then perf won’t be able to sample the process as frequently as it would like as the overhead will be higher.

The stack size is particularly important when profiling Python code compiled with low optimization levels (like -O0), as these builds tend to have larger stack frames. If you are compiling Python with -O0 and not seeing Python functions in your profiling output, try increasing the stack dump size to 65528 bytes (the maximum):

$ perf record -F 9999 -g -k 1 --call-graph dwarf,65528 -o perf.data python -X perf_jit my_script.py

Different compilation flags can significantly impact stack sizes:

  • Builds with -O0 typically have much larger stack frames than those with -O1 or higher

  • Adding optimizations (-O1, -O2, etc.) typically reduces stack size

  • Frame pointers (-fno-omit-frame-pointer) generally provide more reliable stack unwinding