Python support for the Linux perf profiler

Author:

Pablo Galindo

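The Linux perf profiler is a very powerful tool that lets you profile your application and obtain information about its performance. perf also has a very active ecosystem of tools that help analyze the data it produces.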

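The main problem with using the perf profiler on Python applications is that perf only gets information about native symbols, that is, the names of functions and procedures written in C. This means that the names and file names of the Python functions in your code will not appear in the output of perf.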

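Since Python 3.12, the interpreter can run in a special mode that allows Python functions to appear in the output of the perf profiler. When this mode is enabled, the interpreter interposes a small piece of code, compiled on the fly, before the execution of every Python function, and uses perf map files to tell perf how this piece of code relates to the associated Python function.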
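To make the idea of a perf map file more concrete, here is a minimal sketch of reading one entry. It assumes the conventional perf map layout, in which each line holds a hexadecimal start address, a hexadecimal size, and a symbol name, and it borrows the `py::function:filename` symbol naming visible in the perf output shown in this document. The helper name `parse_perf_map_line` and the sample address are made up for illustration.

```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    start: int   # start address of the generated code
    size: int    # size of the generated code in bytes
    symbol: str  # name that perf displays for this address range


def parse_perf_map_line(line: str) -> MapEntry:
    """Parse one "START SIZE NAME" line (assumed perf map layout)."""
    # Split only twice so that symbol names containing spaces stay intact.
    start_hex, size_hex, symbol = line.strip().split(" ", 2)
    return MapEntry(int(start_hex, 16), int(size_hex, 16), symbol)


# Hypothetical entry; real addresses and sizes will differ.
entry = parse_perf_map_line("7f3a1c000000 1c py::foo:/src/script.py")
print(entry.symbol)
```

With a mapping like this, a sampled address that falls inside the range can be reported under the Python function's name rather than as an anonymous address.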

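Note

Support for the perf profiler is currently only available for Linux on select architectures. Check the output of the configure build step, or the output of python -m sysconfig | grep HAVE_PERF_TRAMPOLINE, to see if your system is supported.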
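The same check can be approximated from within Python. The sketch below roughly mirrors the grep above by scanning the names of the build variables that `sysconfig` exposes. The exact variable name is not spelled out here, so the code matches on the substring rather than assuming a specific key, and the function name `perf_trampoline_config` is hypothetical.

```python
import sysconfig


def perf_trampoline_config() -> dict:
    """Return every build variable whose name mentions HAVE_PERF_TRAMPOLINE."""
    return {
        name: value
        for name, value in sysconfig.get_config_vars().items()
        if "HAVE_PERF_TRAMPOLINE" in name
    }


matches = perf_trampoline_config()
# An empty result, or a value of 0, suggests the trampoline is unavailable.
print(matches or "HAVE_PERF_TRAMPOLINE not found in sysconfig")
```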

For example, consider the following script:

    def foo(n):
        result = 0
        for _ in range(n):
            result += 1
        return result

    def bar(n):
        foo(n)

    def baz(n):
        bar(n)

    if __name__ == "__main__":
        baz(1000000)

We can run perf to sample CPU stack traces at 9999 hertz:

    $ perf record -F 9999 -g -o perf.data python my_script.py

Then we can use perf report to analyze the data:

    $ perf report --stdio -n -g

    # Children      Self       Samples  Command     Shared Object       Symbol
    # ........  ........  ............  ..........  ..................  ..........................................
    #
        91.08%     0.00%             0  python.exe  python.exe          [.] _start
                |
                ---_start
                |
                    --90.71%--__libc_start_main
                            Py_BytesMain
                            |
                            |--56.88%--pymain_run_python.constprop.0
                            |          |
                            |          |--56.13%--_PyRun_AnyFileObject
                            |          |          _PyRun_SimpleFileObject
                            |          |          |
                            |          |          |--55.02%--run_mod
                            |          |          |          |
                            |          |          |           --54.65%--PyEval_EvalCode
                            |          |          |                     _PyEval_EvalFrameDefault
                            |          |          |                     PyObject_Vectorcall
                            |          |          |                     _PyEval_Vector
                            |          |          |                     _PyEval_EvalFrameDefault
                            |          |          |                     PyObject_Vectorcall
                            |          |          |                     _PyEval_Vector
                            |          |          |                     _PyEval_EvalFrameDefault
                            |          |          |                     PyObject_Vectorcall
                            |          |          |                     _PyEval_Vector
                            |          |          |                     |
                            |          |          |                     |--51.67%--_PyEval_EvalFrameDefault
                            |          |          |                     |          |
                            |          |          |                     |          |--11.52%--_PyLong_Add
                            |          |          |                     |          |          |
                            |          |          |                     |          |          |--2.97%--_PyObject_Malloc
    ...

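As you can see, the Python functions are not shown in the output; only _PyEval_EvalFrameDefault (the function that evaluates Python bytecode) shows up. Unfortunately that is not very useful: every Python function is evaluated by the same C function, so we cannot tell which Python function a given bytecode-evaluation frame corresponds to.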

Instead, if we run the same experiment with perf support enabled, we get:

    $ perf report --stdio -n -g

    # Children      Self       Samples  Command     Shared Object       Symbol
    # ........  ........  ............  ..........  ..................  .....................................................................
    #
        90.58%     0.36%             1  python.exe  python.exe          [.] _start
                |
                ---_start
                |
                    --89.86%--__libc_start_main
                            Py_BytesMain
                            |
                            |--55.43%--pymain_run_python.constprop.0
                            |          |
                            |          |--54.71%--_PyRun_AnyFileObject
                            |          |          _PyRun_SimpleFileObject
                            |          |          |
                            |          |          |--53.62%--run_mod
                            |          |          |          |
                            |          |          |           --53.26%--PyEval_EvalCode
                            |          |          |                     py::<module>:/src/script.py
                            |          |          |                     _PyEval_EvalFrameDefault
                            |          |          |                     PyObject_Vectorcall
                            |          |          |                     _PyEval_Vector
                            |          |          |                     py::baz:/src/script.py
                            |          |          |                     _PyEval_EvalFrameDefault
                            |          |          |                     PyObject_Vectorcall
                            |          |          |                     _PyEval_Vector
                            |          |          |                     py::bar:/src/script.py
                            |          |          |                     _PyEval_EvalFrameDefault
                            |          |          |                     PyObject_Vectorcall
                            |          |          |                     _PyEval_Vector
                            |          |          |                     py::foo:/src/script.py
                            |          |          |                     |
                            |          |          |                     |--51.81%--_PyEval_EvalFrameDefault
                            |          |          |                     |          |
                            |          |          |                     |          |--13.77%--_PyLong_Add
                            |          |          |                     |          |          |
                            |          |          |                     |          |          |--3.26%--_PyObject_Malloc
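The Python frames in this output follow a visible `py::<function>:<file>` pattern, which makes them easy to pick out in post-processing. The sketch below is an illustration based only on the symbols shown above, not a documented format guarantee, and the helper name `split_python_symbol` is hypothetical. It separates the Python frames from the native ones in a list of symbol names.

```python
from typing import Optional, Tuple


def split_python_symbol(symbol: str) -> Optional[Tuple[str, str]]:
    """Split "py::<function>:<file>" into (function, file), else None."""
    prefix = "py::"
    if not symbol.startswith(prefix):
        return None  # native symbol such as _PyEval_EvalFrameDefault
    # Partition at the first ":" after the prefix; the rest is the file path.
    function, _, filename = symbol[len(prefix):].partition(":")
    return function, filename


frames = [
    "PyEval_EvalCode",
    "py::<module>:/src/script.py",
    "_PyEval_EvalFrameDefault",
    "py::baz:/src/script.py",
    "py::bar:/src/script.py",
    "py::foo:/src/script.py",
]
python_frames = [p for p in map(split_python_symbol, frames) if p is not None]
print(python_frames)
```

Reading only the Python frames recovers the call chain of the script: the module calls baz, which calls bar, which calls foo.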

How to enable perf profiling support

perf profiling support can be enabled either from the start, using the environment variable PYTHONPERFSUPPORT or the -X perf option, or dynamically, using sys.activate_stack_trampoline() and sys.deactivate_stack_trampoline().

The sys functions take precedence over the -X option, and the -X option takes precedence over the environment variable.
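Because three different mechanisms can turn the trampoline on or off, it can help to ask the running interpreter which state actually applies. The sketch below uses `sys.is_stack_trampoline_active()`, which was added alongside the activation functions in Python 3.12. The wrapper name `trampoline_active` is hypothetical, and the fallback for older interpreters is an assumption made for illustration.

```python
import sys


def trampoline_active() -> bool:
    """Report whether a stack trampoline is currently active."""
    # sys.is_stack_trampoline_active() was added in Python 3.12;
    # on older versions the attribute is missing, so report False.
    check = getattr(sys, "is_stack_trampoline_active", None)
    return bool(check()) if check is not None else False


print("perf trampoline active:", trampoline_active())
```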

Example, using the environment variable:

    $ PYTHONPERFSUPPORT=1 perf record -F 9999 -g -o perf.data python my_script.py
    $ perf report -g -i perf.data

Example, using the -X option:

    $ perf record -F 9999 -g -o perf.data python -X perf my_script.py
    $ perf report -g -i perf.data

Example, using the sys APIs in the file example.py:

    import sys

    sys.activate_stack_trampoline("perf")
    do_profiled_stuff()
    sys.deactivate_stack_trampoline()

    non_profiled_stuff()

...then:

    $ perf record -F 9999 -g -o perf.data python ./example.py
    $ perf report -g -i perf.data
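When only part of a program should be profiled, the activate/deactivate pair can be wrapped so that deactivation also happens if the profiled code raises. The following is one possible sketch, not part of the sys API: the name `perf_trampoline` is hypothetical, and the handling of interpreters without trampoline support (a missing attribute, or a ValueError or OSError on activation) is an assumption.

```python
import contextlib
import sys


@contextlib.contextmanager
def perf_trampoline(backend: str = "perf"):
    """Activate the stack trampoline for the duration of a with-block."""
    activate = getattr(sys, "activate_stack_trampoline", None)
    deactivate = getattr(sys, "deactivate_stack_trampoline", None)
    activated_here = False
    if activate is not None and deactivate is not None:
        try:
            activate(backend)
            activated_here = True
        except (ValueError, OSError):
            # Assumed failure modes on builds without trampoline support.
            activated_here = False
    try:
        yield activated_here
    finally:
        # Deactivate even if the profiled code raised an exception.
        if activated_here:
            deactivate()


with perf_trampoline() as active:
    total = sum(range(10))  # stands in for do_profiled_stuff()
print("trampoline was active:", active)
```

Note that leaving the block deactivates the trampoline even if it had been enabled at startup, because the sys functions take precedence over the -X option and the environment variable.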

How to obtain the best results

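For best results, Python should be compiled with CFLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer", as this allows profilers to unwind the stack using only the frame pointer rather than DWARF debug information. This matters because the code interposed to provide perf support is generated dynamically, so it has no DWARF debugging information available.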

You can check whether your interpreter was compiled with this flag by running:

    $ python -m sysconfig | grep 'no-omit-frame-pointer'

If you don't see any output, your interpreter has not been compiled with frame pointers, and it may therefore be unable to show Python functions in the output of perf.
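A rough Python equivalent of the grep is sketched below. It scans the values of the build variables that `sysconfig` exposes for the flag, instead of assuming which variable (CFLAGS or another) carries it. The function name `frame_pointer_settings` is hypothetical.

```python
import sysconfig

FLAG = "no-omit-frame-pointer"


def frame_pointer_settings() -> dict:
    """Return build variables whose value mentions no-omit-frame-pointer."""
    return {
        name: value
        for name, value in sysconfig.get_config_vars().items()
        if isinstance(value, str) and FLAG in value
    }


settings = frame_pointer_settings()
if settings:
    print("Frame pointer flags found in:", ", ".join(sorted(settings)))
else:
    print("No frame pointer flags found in the build configuration.")
```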

How to work without frame pointers

If you are working with a Python interpreter that has been compiled without frame pointers, you can still use the perf profiler, but the overhead will be a bit higher because Python needs to generate unwinding information for every Python function call on the fly. Additionally, perf will take more time to process the data because it will need to use the DWARF debugging information to unwind the stack, which is a slow process.

To enable this mode, you can use the environment variable PYTHON_PERF_JIT_SUPPORT or the -X perf_jit option, which will enable the JIT mode for the perf profiler.

Note

Due to a bug in the perf tool, only perf versions higher than v6.8 will work with the JIT mode. The fix was also backported to the v6.7.2 version of the tool.

Note that when checking the version of the perf tool (which can be done by running perf version) you must take into account that some distros add custom version numbers that include a - character. This means that perf 6.7-3 is not necessarily perf 6.7.3.
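The version rule and the distro caveat can be expressed as a small check. This is a sketch under stated assumptions rather than an official compatibility test: it assumes the version string looks like `perf version X.Y[.Z][suffix]`, that v6.8 itself already contains the fix, and that 6.7.x releases from 6.7.2 onward carry the backport. The function name `jit_mode_supported` is hypothetical. A version such as 6.7-3 is reported as unknown (None), because the number after the dash is a distro revision and not a patch level.

```python
import re
from typing import Optional

_VERSION_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?(\S*)")


def jit_mode_supported(version_output: str) -> Optional[bool]:
    """Guess whether this perf works with the JIT mode (None if unknown)."""
    match = _VERSION_RE.search(version_output)
    if match is None:
        return None
    major, minor, patch, suffix = match.groups()
    version = (int(major), int(minor), int(patch or 0))
    if version >= (6, 8, 0):
        return True  # assumption: 6.8 and later include the fix
    if version[:2] == (6, 7):
        if patch is None and suffix:
            return None  # e.g. "6.7-3": the real patch level is unknown
        return version >= (6, 7, 2)  # backport landed in 6.7.2
    return False


for text in ("perf version 6.7-3", "perf version 6.7.2", "perf version 5.15.0"):
    print(text, "->", jit_mode_supported(text))
```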

When using the perf JIT mode, you need an extra step before you can run perf report. You need to call the perf inject command to inject the JIT information into the perf.data file:

    $ perf record -F 9999 -g -k 1 --call-graph dwarf -o perf.data python -X perf_jit my_script.py
    $ perf inject -i perf.data --jit --output perf.jit.data
    $ perf report -g -i perf.jit.data

or using an environment variable:

    $ PYTHON_PERF_JIT_SUPPORT=1 perf record -F 9999 -g --call-graph dwarf -o perf.data python my_script.py
    $ perf inject -i perf.data --jit --output perf.jit.data
    $ perf report -g -i perf.jit.data

The perf inject --jit command will read perf.data, automatically pick up the perf dump file that Python creates (in /tmp/perf-$PID.dump), and then create perf.jit.data, which merges all the JIT information together. It should also create a lot of jitted-XXXX-N.so files in the current directory, which are ELF images for all the JIT trampolines that were created by Python.
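For scripted workflows, the three steps can be assembled in Python. The sketch below only builds and prints the command lines, using the same flags as the -X perf_jit example above, and runs nothing. The function name `jit_profile_commands` and its parameters are hypothetical.

```python
import shlex
from typing import List


def jit_profile_commands(script: str, data: str = "perf.data",
                         jit_data: str = "perf.jit.data") -> List[List[str]]:
    """Build the record, inject and report commands for perf JIT mode."""
    record = ["perf", "record", "-F", "9999", "-g", "-k", "1",
              "--call-graph", "dwarf", "-o", data,
              "python", "-X", "perf_jit", script]
    inject = ["perf", "inject", "-i", data, "--jit", "--output", jit_data]
    report = ["perf", "report", "-g", "-i", jit_data]
    return [record, inject, report]


for command in jit_profile_commands("my_script.py"):
    # Printed only; each list could be passed to subprocess.run() instead.
    print("$", shlex.join(command))
```

Keeping the commands as argument lists avoids shell quoting problems if the script path contains spaces.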

Warning

When using --call-graph dwarf, the perf tool will take snapshots of the stack of the process being profiled and save the information in the perf.data file. By default, the size of the stack dump is 8192 bytes, but you can change the size by passing it after a comma, like --call-graph dwarf,16384.

The size of the stack dump is important because if the size is too small, perf will not be able to unwind the stack and the output will be incomplete. On the other hand, if the size is too big, then perf won't be able to sample the process as frequently as it would like, as the overhead will be higher.

The stack size is particularly important when profiling Python code compiled with low optimization levels (like -O0), as these builds tend to have larger stack frames. If you are compiling Python with -O0 and not seeing Python functions in your profiling output, try increasing the stack dump size to 65528 bytes (the maximum):

    $ perf record -F 9999 -g -k 1 --call-graph dwarf,65528 -o perf.data python -X perf_jit my_script.py

Different compilation flags can significantly impact stack sizes:

  • Builds with -O0 typically have much larger stack frames than those with -O1 or higher

  • Adding optimizations (-O1, -O2, etc.) typically reduces stack size

  • Frame pointers (-fno-omit-frame-pointer) generally provide more reliable stack unwinding
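When experimenting with different stack dump sizes, a small helper can guard against values outside the range described above. This sketch relies only on the default (8192 bytes) and the maximum (65528 bytes) stated in the text. The function name `call_graph_option` is hypothetical, and perf may impose further constraints that are not modeled here.

```python
MAX_STACK_DUMP = 65528      # maximum stated in the text above
DEFAULT_STACK_DUMP = 8192   # perf default stated in the text above


def call_graph_option(stack_dump_size: int = DEFAULT_STACK_DUMP) -> str:
    """Return the value to pass after --call-graph for DWARF unwinding."""
    if not 0 < stack_dump_size <= MAX_STACK_DUMP:
        raise ValueError(
            f"stack dump size must be between 1 and {MAX_STACK_DUMP} bytes"
        )
    return f"dwarf,{stack_dump_size}"


print(call_graph_option(16384))
print(call_graph_option(MAX_STACK_DUMP))
```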