Background
Experimental optimization techniques using LLVM-BOLT have been applied in CPython 3.12, but in order to provide this as an official feature, it must be ensured that the optimized binary does not change the expected behavior.
It has been confirmed that the peephole optimization provided by LLVM-BOLT re-raises the#53093 issue, and it passes the unit tests normally when this is disabled.
A detailed explanation of LLVM_BOLT will be shared at this year's Python Language Summit lightning talk, and materials will be published after the presentation.
(Also, I already shared the draft material at Discord of Python core team)
Performance Impact
I measured that enabling or disabling peephole optimization of LLVM-BOLT has almost no impact on performance, and it shows only noise-level results in the pyperformance benchmarks also.
pyperformance
| Benchmark | as-is | to-be |
|---|
| async_generators | 531 ms | 550 ms: 1.04x slower |
| async_tree_cpu_io_mixed | 1.10 sec | 1.10 sec: 1.01x slower |
| chameleon | 10.3 ms | 10.4 ms: 1.01x slower |
| chaos | 104 ms | 103 ms: 1.01x faster |
| bench_thread_pool | 1.24 ms | 1.25 ms: 1.01x slower |
| coroutines | 44.0 ms | 44.2 ms: 1.01x slower |
| coverage | 151 ms | 147 ms: 1.02x faster |
| crypto_pyaes | 118 ms | 118 ms: 1.01x slower |
| deepcopy_reduce | 4.68 us | 4.70 us: 1.01x slower |
| deltablue | 5.32 ms | 5.28 ms: 1.01x faster |
| django_template | 53.0 ms | 54.3 ms: 1.02x slower |
| dulwich_log | 110 ms | 111 ms: 1.01x slower |
| fannkuch | 630 ms | 611 ms: 1.03x faster |
| float | 116 ms | 115 ms: 1.01x faster |
| generators | 112 ms | 112 ms: 1.00x slower |
| genshi_text | 34.3 ms | 34.5 ms: 1.01x slower |
| go | 214 ms | 212 ms: 1.01x faster |
| hexiom | 9.28 ms | 9.32 ms: 1.01x slower |
| json_loads | 38.6 us | 39.1 us: 1.01x slower |
| logging_format | 10.1 us | 10.0 us: 1.01x faster |
| logging_silent | 157 ns | 156 ns: 1.01x faster |
| logging_simple | 9.04 us | 9.14 us: 1.01x slower |
| mako | 15.4 ms | 15.7 ms: 1.02x slower |
| meteor_contest | 165 ms | 166 ms: 1.01x slower |
| nbody | 166 ms | 160 ms: 1.04x faster |
| nqueens | 130 ms | 132 ms: 1.02x slower |
| pathlib | 26.9 ms | 27.3 ms: 1.01x slower |
| pickle_list | 5.94 us | 5.90 us: 1.01x faster |
| pidigits | 303 ms | 303 ms: 1.00x slower |
| pprint_safe_repr | 1.11 sec | 1.11 sec: 1.01x slower |
| pprint_pformat | 2.26 sec | 2.27 sec: 1.00x slower |
| python_startup | 12.7 ms | 12.7 ms: 1.00x slower |
| python_startup_no_site | 9.40 ms | 9.39 ms: 1.00x faster |
| raytrace | 461 ms | 463 ms: 1.00x slower |
| regex_compile | 194 ms | 196 ms: 1.01x slower |
| regex_effbot | 4.87 ms | 4.80 ms: 1.01x faster |
| richards | 66.6 ms | 67.5 ms: 1.01x slower |
| scimark_fft | 497 ms | 492 ms: 1.01x faster |
| scimark_lu | 178 ms | 183 ms: 1.03x slower |
| scimark_monte_carlo | 107 ms | 106 ms: 1.01x faster |
| scimark_sor | 182 ms | 180 ms: 1.01x faster |
| scimark_sparse_mat_mult | 6.30 ms | 6.36 ms: 1.01x slower |
| spectral_norm | 165 ms | 169 ms: 1.02x slower |
| sqlglot_optimize | 83.0 ms | 83.4 ms: 1.00x slower |
| sqlglot_normalize | 172 ms | 173 ms: 1.00x slower |
| sqlite_synth | 3.89 us | 3.84 us: 1.01x faster |
| sympy_expand | 742 ms | 745 ms: 1.00x slower |
| unpickle_list | 7.53 us | 7.50 us: 1.00x faster |
| xml_etree_generate | 132 ms | 131 ms: 1.01x faster |
| Geometric mean | (ref) | 1.00x slower |
Benchmark hidden because not significant (31): 2to3, async_tree_none, async_tree_io, async_tree_memoization, bench_mp_pool, deepcopy, deepcopy_memo, docutils, genshi_xml, html5lib, json_dumps, mdp, pickle, pickle_dict, pickle_pure_python, pyflate, regex_dna, regex_v8, sqlglot_parse, sqlglot_transpile, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpack_sequence, unpickle, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_process
Uh oh!
There was an error while loading.Please reload this page.
Background
Experimental optimization techniques using LLVM-BOLT have been applied in CPython 3.12, but in order to provide this as an official feature, it must be ensured that the optimized binary does not change the expected behavior.
It has been confirmed that the peephole optimization provided by LLVM-BOLT re-raises the#53093 issue, and it passes the unit tests normally when this is disabled.
A detailed explanation of LLVM_BOLT will be shared at this year's Python Language Summit lightning talk, and materials will be published after the presentation.
(Also, I already shared the draft material at Discord of Python core team)
Performance Impact
I measured that enabling or disabling peephole optimization of LLVM-BOLT has almost no impact on performance, and it shows only noise-level results in the pyperformance benchmarks also.
pyperformance
Benchmark hidden because not significant (31): 2to3, async_tree_none, async_tree_io, async_tree_memoization, bench_mp_pool, deepcopy, deepcopy_memo, docutils, genshi_xml, html5lib, json_dumps, mdp, pickle, pickle_dict, pickle_pure_python, pyflate, regex_dna, regex_v8, sqlglot_parse, sqlglot_transpile, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpack_sequence, unpickle, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_process