Description
Bug report
Bug description:
Hi Team,
I have an X-Elite laptop with an ARM64-based SoC, and I’ve been running Python workloads on it. However, I’ve noticed that Python performs slower on Windows on ARM devices. To investigate, I used pybench, which provides a solid set of test cases for performance benchmarking. For comparison, I also ran it on an Intel x64 Lunar Lake 258V device, which has Geekbench scores similar to the X-Elite, to measure the performance delta.
Pybench: https://share.gtd-gmbh.de/d/7e9368c6350a4894bf8f/files/?p=%2FWorklets%2FPyBench%2Fpybench-for-3.10.tar.gz&dl=1
I collected the following results:
Environment | Total Time (ms) |
---|---|
Windows on ARM64 | 802 |
Windows on x64 | 507 |
WSL2 (Linux on Windows ARM64) | 515 |
To further analyze, I tested multiple Python versions and observed that earlier ARM64 Windows releases performed better than the latest one:
Python Version Comparison
Version | Windows ARM64 (ms) | Windows x64 (ms) |
---|---|---|
3.11.0 | 763 | 575 |
3.11.3 | 589 | Not tested |
3.11.6 | 590 | Not tested |
3.11.9 | 568 | Not tested |
3.12.0 | 666 | 545 |
3.12.5 | 688 | Not tested |
3.12.6 | 700 | Not tested |
3.12.7 | 802 | Not tested |
3.12.10 | 802 | 507 |
On x64, performance improved across the three releases I tested (575 ms, then 545 ms, then 507 ms). On ARM64 it improved through the 3.11.x series (763 ms down to 568 ms on 3.11.9), then regressed steadily through 3.12.x, reaching 802 ms on 3.12.7 and 3.12.10.
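To put numbers on the gap, the timings from the tables above can be turned into ratios. This is only a sketch over the figures reported in this issue; the dictionary and function names are made up for illustration.

```python
# Pybench totals in ms, copied from the tables in this report.
arm64_ms = {
    "3.11.0": 763, "3.11.3": 589, "3.11.6": 590, "3.11.9": 568,
    "3.12.0": 666, "3.12.5": 688, "3.12.6": 700, "3.12.7": 802,
    "3.12.10": 802,
}
x64_ms = {"3.11.0": 575, "3.12.0": 545, "3.12.10": 507}


def slowdown(arm_time_ms, x64_time_ms):
    """How many times longer the ARM64 run takes than the x64 run."""
    return arm_time_ms / x64_time_ms


# Compare only the versions that were measured on both platforms.
for version, x64_time in x64_ms.items():
    ratio = slowdown(arm64_ms[version], x64_time)
    print(f"{version}: ARM64 takes {ratio:.2f}x the x64 time")

# Fastest ARM64 release measured, and how far 3.12.10 is from it.
best = min(arm64_ms, key=arm64_ms.get)
regression = arm64_ms["3.12.10"] / arm64_ms[best] - 1
print(f"ARM64 3.12.10 is {regression:.0%} slower than {best}")
```

On these numbers the ARM64/x64 ratio grows from roughly 1.33x on 3.11.0 to roughly 1.58x on 3.12.10, and 3.12.10 on ARM64 is about 41% slower than 3.11.9.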
I also cloned the Python 3.12.10 source and compiled it on the ARM64 Windows device using different compilers. I found that using clang-cl (19.1.2) with computed gotos enabled yielded significantly better performance than the official release:
Compiled vs. Released (Python 3.12.10)
Version | ARM64 official release (ms) | ARM64 clang-cl build (ms) |
---|---|---|
3.12.10 | 802 | 628 |
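For reference, the same kind of arithmetic applied to the build comparison (a sketch using only the numbers reported in this issue; variable names are illustrative):

```python
release_ms = 802   # official 3.12.10 ARM64 release
clang_cl_ms = 628  # local clang-cl 19.1.2 build with computed gotos
x64_ms = 507       # official 3.12.10 x64 release

# Fraction of run time saved by the clang-cl build versus the official one.
time_saved = 1 - clang_cl_ms / release_ms
# Remaining gap between the clang-cl ARM64 build and the x64 release.
remaining_gap = clang_cl_ms / x64_ms

print(f"clang-cl build takes {time_saved:.1%} less time than the release")
print(f"clang-cl build still takes {remaining_gap:.2f}x the x64 time")
```

So the clang-cl build cuts the pybench total by about 21.7%, but it still takes about 1.24x the x64 time.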
My questions:
Can anybody please share the compilation steps (compiler and flags) used to build the release ARM64 Windows binaries? If it is MSVC, is there a specific reason for not using clang-cl? In my pybench experiment, clang-cl gives clearly better results. Are there other benchmarks run against the release binaries where clang-cl does not perform better?
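While waiting for an answer, the interpreter can be asked what it reports about its own build. A minimal sketch using the standard `platform` module; `describe_build` is a hypothetical helper name. Official Windows builds normally report an `MSC v.19xx ...` string here. The exact string a clang-cl build reports is not verified in this issue.

```python
import platform
import sys


def describe_build():
    """Collect what the running interpreter reports about its own build."""
    return {
        "version": platform.python_version(),
        # Compiler identification string baked in at build time.
        "compiler": platform.python_compiler(),
        # Architecture as reported by the OS (e.g. ARM64 or AMD64 on Windows).
        "machine": platform.machine(),
        # (build number, build date) tuple.
        "build": platform.python_build(),
    }


if __name__ == "__main__":
    for key, value in describe_build().items():
        print(f"{key}: {value}")
    print(sys.version)
```

Running this under both the official installer and a locally compiled build makes it easy to confirm which binary a benchmark run actually used.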
Analysis:
I collected ETL traces on Windows and profiled the test with Profile Explorer. The bottleneck I see is in the bytecode interpreter's main evaluation loop: the function python312.dll!_PyEval_EvalFrameDefault dominates the profile.
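Since the hot spot is the evaluation loop, a small interpreter-bound microbenchmark may be a quicker way than a full pybench run to compare builds. The following is only a sketch and is not part of pybench; the function names and loop body are made up for illustration.

```python
import timeit


def interpreter_bound_loop(n=200_000):
    """Pure-Python arithmetic loop; most of the time goes to bytecode dispatch."""
    total = 0
    for i in range(n):
        total += i * 2 % 7
    return total


def best_time_ms(func, repeat=5, number=10):
    """Best-of-`repeat` wall time in ms for `number` calls of `func`."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) * 1000


if __name__ == "__main__":
    print(f"best of 5: {best_time_ms(interpreter_bound_loop):.1f} ms")
```

Taking the minimum over several repeats reduces noise from background activity, which helps when comparing two builds on the same machine.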
CPython versions tested on:
CPython main branch, 3.12
Operating systems tested on:
Windows