gh-129987: Disable GCC SLP autovectorization for the interpreter loop on x86-64 #132295

Merged

mpage merged 1 commit into python:main from mpage:gh-129987-no-slp-vectorize

Apr 9, 2025

mpage (Contributor) commented Apr 9, 2025 (edited)

#131750 mysteriously caused a ~6% regression for the free-threaded build. The cause was poor code generation for opcode dispatch in the interpreter loop. Before the change, the dispatch code looked like:

    /root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
                DISPATCH();
      19cd0a: mov    -0x268(%rbp),%rsi
      19cd11: movzbl %ah,%ecx
      19cd14: movzbl %al,%eax
      19cd17: mov    %ecx,%r10d
      19cd1a: jmp    *(%rsi,%rax,8)

After the change, the dispatch code looked like:

    # Shared dispatch code
    /root/src/cpython/Python/generated_cases.c.h:81 [BINARY_OP]
                DISPATCH();
      19dd67: mov    -0x280(%rbp),%r10
      19dd6e: movzbl %ah,%ecx
      19dd71: movzbl %al,%eax
      19dd74: mov    %ecx,%r14d
      19dd77: mov    -0x270(%rbp),%rcx
      19dd7e: mov    (%rcx,%rax,8),%rdx
      19dd82: nopw   0x0(%rax,%rax,1)
      19dd88: movq   -0x258(%rbp),%xmm0
      19dd90: movq   %r12,%xmm4
      19dd95: punpcklqdq %xmm4,%xmm0
      19dd99: movhlps %xmm0,%xmm3
      19dd9c: movq   %xmm0,%r15
      19dda1: movq   %xmm3,%r11
      19dda6: mov    %r11,%rcx
      19dda9: jmp    *%rdx

    # Duplicated dispatch code
    /root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
                DISPATCH();
      19dde4: movzbl %ah,%ecx
      19dde7: movzbl %al,%eax
      19ddea: mov    %ecx,%r14d
      19dded: mov    -0x270(%rbp),%rcx
      19ddf4: mov    (%rcx,%rax,8),%rdx
      19ddf8: jmp    19dd99 <_PyEval_EvalFrameDefault+0x289>

There are two problems:

1. We now have two jumps (one direct jump to the shared dispatch logic and one indirect jump to the next opcode handler) instead of one (the indirect jump to the opcode handler).
2. There's a significant amount of register shuffling in the shared dispatch code.
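To make the first problem concrete, here is a minimal sketch of computed-goto dispatch in C. It is not CPython's code: the opcode set, the `UNIT` macro, and `run_sketch` are hypothetical names invented for illustration, and the real `DISPATCH()` macro does more work. The sketch relies on the labels-as-values extension supported by GCC and Clang. The point it shows is that every handler ends with its own copy of the dispatch sequence (split the 16-bit code unit into opcode and oparg, then jump through the table), which is the shape the compiler is expected to preserve rather than merge into one shared block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch; not CPython's real opcode set. */
enum { OP_PUSH, OP_ADD, OP_HALT, OP_COUNT };

/* Pack an opcode (low byte) and an oparg (high byte) into one 16-bit unit. */
#define UNIT(op, arg) ((uint16_t)((op) | ((arg) << 8)))

/* Fetch the next code unit, split it, and jump straight to the handler.
 * Expanding this macro at the end of every handler gives each handler its
 * own indirect jump. */
#define DISPATCH()                              \
    do {                                        \
        uint16_t word = *next_instr++;          \
        oparg = (unsigned)(word >> 8);          \
        goto *table[word & 0xff];               \
    } while (0)

long run_sketch(const uint16_t *code)
{
    /* Jump table indexed by opcode (GCC/Clang labels-as-values). */
    static void *const table[OP_COUNT] = { &&op_push, &&op_add, &&op_halt };
    long stack[16];
    long *stack_pointer = stack;
    const uint16_t *next_instr = code;
    unsigned oparg;

    DISPATCH();

op_push:
    assert(stack_pointer < stack + 16);
    *stack_pointer++ = (long)oparg;
    DISPATCH();

op_add:
    assert(stack_pointer - stack >= 2);
    stack_pointer--;
    stack_pointer[-1] += stack_pointer[0];
    DISPATCH();

op_halt:
    assert(stack_pointer > stack);
    return stack_pointer[-1];
}
```

For example, a program built from `UNIT(OP_PUSH, 2)`, `UNIT(OP_PUSH, 40)`, `UNIT(OP_ADD, 0)`, `UNIT(OP_HALT, 0)` makes `run_sketch` return the sum of the two pushed values. Note that `next_instr` and `stack_pointer` are both live across every dispatch, which is what makes them candidates for being packed together by the vectorizer.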

Both of these problems appear to be caused by GCC's SLP autovectorizer. After the change, it decides to store both the next_instr pointer and the stack_pointer in a single 128-bit register in the shared basic block that contains the opcode dispatch. This is introduced in the slp1 pass (tree dump below):

      _24061 = VIEW_CONVERT_EXPR<long unsigned int>(stack_pointer_14587);
      _24062 = VIEW_CONVERT_EXPR<long unsigned int>(next_instr_14097);
      _24063 = {_24062, _24061};

      <bb 19> [count: 1658034300]:
      # frame_2363(ab) = PHI <frame_20485(4258), frame_20519(18)>
      # oparg_1245(ab) = PHI <oparg_20252(4258), oparg_14635(18)>
      # next_instr_1246(ab) = PHI <next_instr_11924(4258), next_instr_14097(18)>
      # stack_pointer_2976(ab) = PHI <stack_pointer_20484(4258), stack_pointer_14587(18)>
      # _3209 = PHI <_20217(4258), _20681(18)>
      #
      # Combination of next_instr and stack_pointer:
      #
      # vect_next_instr_1246.7061_24064 = PHI <vect_next_instr_11924.7060_24060(4258), _24063(18)>
      _24067 = BIT_FIELD_REF <vect_next_instr_1246.7061_24064, 64, 64>;
      _24068(ab) = (union _PyStackRef *) _24067;
      _24065 = BIT_FIELD_REF <vect_next_instr_1246.7061_24064, 64, 0>;
      _24066(ab) = (union _Py_CODEUNIT *) _24065;
      # DEBUG stack_pointer => stack_pointer_2976(ab)
      # DEBUG next_instr => next_instr_1246(ab)
      # DEBUG oparg => oparg_1245(ab)
      # DEBUG frame => frame_2363(ab)
      goto _3209;

Disabling the SLP autovectorization pass for the interpreter loop fixes both problems. After this change the opcode dispatch code looks like:

    /root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
                DISPATCH();
      19aa37: mov    -0x260(%rbp),%rsi
      19aa3e: movzbl %ah,%ecx
      19aa41: movzbl %al,%eax
      19aa44: movslq %ecx,%r15
      19aa47: jmp    *(%rsi,%rax,8)
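One way to disable SLP vectorization for a single function, rather than for a whole translation unit, is GCC's `optimize` function attribute, which accepts `-f` options without the `-f` prefix. The following is a minimal sketch under that assumption and is not the diff from this PR: the macro name `NO_SLP_VECTORIZE` and the function `sum_two_lanes` are hypothetical. The guard mirrors the scope in the PR title (GCC, x86-64) and expands to nothing elsewhere, so the code still builds with other compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro name for this sketch. Only GCC on x86-64 gets the
 * attribute; Clang also defines __GNUC__, so it is excluded explicitly. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
#  define NO_SLP_VECTORIZE __attribute__((optimize("no-tree-slp-vectorize")))
#else
#  define NO_SLP_VECTORIZE
#endif

/* Two independent accumulators updated side by side are the kind of
 * straight-line code the SLP pass may pack into one vector register.
 * The attribute asks GCC to skip that pass for this function only. */
NO_SLP_VECTORIZE
long sum_two_lanes(const long *values, size_t n)
{
    assert(values != NULL || n == 0);
    long even = 0;
    long odd = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        even += values[i];
        odd += values[i + 1];
    }
    if (n % 2 != 0) {
        even += values[n - 1];  /* leftover element for odd lengths */
    }
    return even + odd;
}
```

The attribute changes only code generation, not results, so the function returns the same sum with or without it. The coarser alternative is to pass `-fno-tree-slp-vectorize` on the command line for the whole file, which would also give up any vectorization benefit elsewhere in that file.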

Performance improves by ~8% for the free-threaded build.

Surprisingly, this also seems to improve performance for the default build by ~4%. I don't understand why and I don't fully trust the result. The generated dispatch code for the default build looks unaffected by this change. Additionally, measuring instructions retired using fastbench shows a negligible change, whereas it shows a ~8% reduction for the free-threaded build.

Issue: computed-goto interpreter: Prevent the compiler from merging DISPATCH calls #129987

Disable GCC SLP autovectorization for the interpreter loop on x86-64

45cd786

bedevere-app (bot) mentioned this pull request

Apr 9, 2025

computed-goto interpreter: Prevent the compiler from merging DISPATCH calls #129987

Closed

mpage added the skip news label

Apr 9, 2025

pinskia commented Apr 9, 2025

I think this is the same as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115777.

mpage requested review from colesbury and Yhg1s

April 9, 2025 15:50

mpage marked this pull request as ready for review

April 9, 2025 15:50

mpage requested a review from markshannon as a code owner

April 9, 2025 15:50

bedevere-app (bot) added the awaiting core review label

Apr 9, 2025

colesbury approved these changes

Apr 9, 2025

colesbury left a comment

Nice!

bedevere-app (bot) added the awaiting merge label and removed the awaiting core review label

Apr 9, 2025

mpage merged commit 1f5682f into python:main

Apr 9, 2025

55 checks passed

bedevere-app (bot) removed the awaiting merge label

Apr 9, 2025

mpage deleted the gh-129987-no-slp-vectorize branch

April 9, 2025 17:34

mpage mentioned this pull request

Apr 9, 2025

Don't inline slow path functions in the interpreter loop #132336

Open

seehwan pushed a commit to seehwan/cpython that referenced this pull request

Apr 16, 2025

pythongh-129987: Disable GCC SLP autovectorization for the interprete…

9f011a7

…r loop on x86-64 (python#132295)

The SLP autovectorizer can cause poor code generation for opcode dispatch, negating any benefit we get from vectorization elsewhere in the interpreter loop.

Labels

skip news

3 participants
