
Performance issue with ARM64 windows Python release binaries #134524

Open

Labels

OS-windows; pending (The issue will be closed if no feedback is provided); performance (Performance or resource usage); triaged (The issue has been accepted as valid by a triager)

Description

Akash9824 opened on May 22, 2025

Bug report

Bug description:

Hi Team,

I have an X-Elite laptop with an ARM64-based SoC, and I have been running Python workloads on it. Python appears to run slower on Windows on ARM devices. To investigate, I used pybench, which provides a solid set of test cases for performance benchmarking. For comparison, I also ran it on an Intel x64 Lunar Lake 258V device, which has Geekbench performance similar to the X-Elite, to measure the performance delta.

Pybench: https://share.gtd-gmbh.de/d/7e9368c6350a4894bf8f/files/?p=%2FWorklets%2FPyBench%2Fpybench-for-3.10.tar.gz&dl=1
I collected the following results:

Environment	Total Time (ms)
Windows on ARM64	802
Windows on x64	507
WSL2 (Linux on Windows ARM64)	515
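
To put the table above in relative terms, here is a small sketch that expresses each total as a ratio against a chosen baseline. The timings are the ones reported above; the helper name `slowdown_vs` is hypothetical and only for illustration.

```python
# Total pybench times (ms) as reported in the table above.
timings_ms = {
    "Windows on ARM64": 802,
    "Windows on x64": 507,
    "WSL2 (Linux on Windows ARM64)": 515,
}


def slowdown_vs(baseline, timings):
    """Return each environment's time divided by the baseline's time."""
    base = timings[baseline]
    return {name: round(ms / base, 2) for name, ms in timings.items()}


# Ratios relative to the x64 machine, and relative to WSL2 on the same ARM64 device.
vs_x64 = slowdown_vs("Windows on x64", timings_ms)
vs_wsl2 = slowdown_vs("WSL2 (Linux on Windows ARM64)", timings_ms)

if __name__ == "__main__":
    for name, ratio in vs_x64.items():
        print(f"{name}: {ratio}x the x64 time")
    print(f"Native Windows ARM64 vs WSL2: {vs_wsl2['Windows on ARM64']}x")
```

On these numbers the native Windows ARM64 build takes about 1.58x the x64 time and about 1.56x the WSL2 time on the same device, while WSL2 is within a few percent of x64.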

To further analyze, I tested multiple Python versions and observed that earlier ARM64 Windows releases performed better than the latest one:
Python Version Comparison

Version	Windows ARM64 (ms)	Windows x64 (ms)
3.11.0	763	575
3.11.3	589	Not tested
3.11.6	590	Not tested
3.11.9	568	Not tested
3.12.0	666	545
3.12.5	688	Not tested
3.12.6	700	Not tested
3.12.7	802	Not tested
3.12.10	802	507

x64 performance improved across the three releases I tested (575, 545, and 507 ms), while ARM64 performance has been inconsistent: it improved through 3.11.9 (568 ms) and then regressed, reaching 802 ms in 3.12.7 and 3.12.10.
I also cloned the Python 3.12.10 source and compiled it on the ARM64 Windows device using different compilers. I found that using clang-cl (19.1.2) with computed gotos enabled yielded significantly better performance than the official release:
Compiled vs. Released (Python 3.12.10)

Version	ARM64 official release (ms)	ARM64 compiled with clang-cl (ms)
3.12.10	802	628

My questions:
Could someone share the build steps (compiler and flags) used to produce the official ARM64 Windows release binaries? If the compiler is MSVC, is there a specific reason for not using clang-cl? In my pybench experiment, clang-cl gives better results. Are there other test cases run against the release binaries where clang-cl does not perform better?
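
As a partial check from the user side, the compiler string embedded in an interpreter binary can be read with the standard library. This is a minimal sketch; it reports only what the binary says about itself, not the flags used to build it, and the helper name `describe_build` is hypothetical.

```python
import platform


def describe_build():
    """Return basic facts about how the running interpreter was built."""
    return {
        "version": platform.python_version(),
        # Compiler identification string parsed from sys.version.
        "compiler": platform.python_compiler(),
        "machine": platform.machine(),
        "implementation": platform.python_implementation(),
    }


if __name__ == "__main__":
    for key, value in describe_build().items():
        print(f"{key}: {value}")
```

Running this under both the official release and a locally compiled build would show whether the two report different compilers.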

Analysis:
I collected ETL traces on Windows and profiled the benchmark with Profile Explorer. The bottleneck I see is in the interpreter loop: most of the time is spent in the function python312.dll!_PyEval_EvalFrameDefault.
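
For anyone who wants a quick cross-check without pybench, a loop that does almost nothing except execute bytecode should be dominated by the same interpreter loop. The sketch below is an assumption about what a useful stand-in looks like, not part of pybench; the function names are hypothetical.

```python
import timeit


def interpreter_bound_loop(n=100_000):
    """Pure-Python loop: time is spent dispatching bytecode, not in C library code."""
    total = 0
    for i in range(n):
        total += i & 3
    return total


def best_time_ms(func, repeat=5, number=10):
    """Return the best per-call time in milliseconds over several repeats."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number * 1000.0


if __name__ == "__main__":
    print(f"best per-call time: {best_time_ms(interpreter_bound_loop):.2f} ms")
```

Comparing this number between the official ARM64 release and a clang-cl build on the same machine gives a rough signal of whether the gap follows the interpreter loop.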

CPython versions tested on:

CPython main branch, 3.12

Operating systems tested on:

Windows

Metadata

Assignees

No one assigned

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
