gh-129987: Disable GCC SLP autovectorization for the interpreter loop on x86-64#132295


Merged
mpage merged 1 commit into python:main from mpage:gh-129987-no-slp-vectorize
Apr 9, 2025

mpage (Contributor) commented Apr 9, 2025 (edited)

#131750 mysteriously caused a ~6% regression for the free-threaded build. The cause was poor code generation for opcode dispatch in the interpreter loop. Before the change, the dispatch code looked like:

/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();
          19cd0a: mov    -0x268(%rbp),%rsi
          19cd11: movzbl %ah,%ecx
          19cd14: movzbl %al,%eax
          19cd17: mov    %ecx,%r10d
          19cd1a: jmp    *(%rsi,%rax,8)

After the change, the dispatch code looked like:

# Shared dispatch code
/root/src/cpython/Python/generated_cases.c.h:81 [BINARY_OP]
            DISPATCH();
          19dd67: mov    -0x280(%rbp),%r10
          19dd6e: movzbl %ah,%ecx
          19dd71: movzbl %al,%eax
          19dd74: mov    %ecx,%r14d
          19dd77: mov    -0x270(%rbp),%rcx
          19dd7e: mov    (%rcx,%rax,8),%rdx
          19dd82: nopw   0x0(%rax,%rax,1)
          19dd88: movq   -0x258(%rbp),%xmm0
          19dd90: movq   %r12,%xmm4
          19dd95: punpcklqdq %xmm4,%xmm0
          19dd99: movhlps %xmm0,%xmm3
          19dd9c: movq   %xmm0,%r15
          19dda1: movq   %xmm3,%r11
          19dda6: mov    %r11,%rcx
          19dda9: jmp    *%rdx

# Duplicated dispatch code
/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();
          19dde4: movzbl %ah,%ecx
          19dde7: movzbl %al,%eax
          19ddea: mov    %ecx,%r14d
          19dded: mov    -0x270(%rbp),%rcx
          19ddf4: mov    (%rcx,%rax,8),%rdx
          19ddf8: jmp    19dd99 <_PyEval_EvalFrameDefault+0x289>

There are two problems:

  1. We now have two jumps (one direct jump to the shared dispatch logic and one indirect jump to the next opcode handler) instead of one (the indirect jump to the opcode handler).
  2. There's a significant amount of register shuffling in the shared dispatch code.
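
To make the first problem concrete, here is a minimal sketch of computed-goto dispatch, assuming a toy three-opcode bytecode. The names `run_toy`, `codeunit_t`, `OP_PUSH`, `OP_ADD`, and `OP_HALT` are hypothetical and are not CPython's. Each handler ends with its own copy of `DISPATCH()`, so the intended code generation is one indirect jump per handler, with no detour through a shared block.

```c
#include <stdint.h>

/* Hypothetical three-opcode bytecode; not CPython's real instruction set. */
enum { OP_PUSH = 0, OP_ADD = 1, OP_HALT = 2 };

/* Assumption for this sketch: opcode in the low byte, oparg in the next
 * byte, matching the %al / %ah extraction seen in the listings. */
typedef uint16_t codeunit_t;

long run_toy(const codeunit_t *next_instr)
{
    /* Labels-as-values is a GNU C extension (GCC and Clang). */
    static void *const dispatch_table[] = { &&op_push, &&op_add, &&op_halt };
    long stack[16];              /* no overflow checks: toy code only */
    long *stack_pointer = stack;
    unsigned oparg = 0;

    /* Fetch the next code unit, split it, and jump straight to its handler. */
#define DISPATCH()                                  \
    do {                                            \
        codeunit_t word = *next_instr++;            \
        oparg = (unsigned)(word >> 8);              \
        goto *dispatch_table[word & 0xff];          \
    } while (0)

    DISPATCH();

op_push:
    *stack_pointer++ = (long)oparg;
    DISPATCH();                  /* duplicated dispatch: one indirect jump */
op_add:
    stack_pointer--;
    stack_pointer[-1] += stack_pointer[0];
    DISPATCH();
op_halt:
    return stack_pointer[-1];
#undef DISPATCH
}
```

Running the program `{ (2 << 8) | OP_PUSH, (3 << 8) | OP_PUSH, OP_ADD, OP_HALT }` through `run_toy` pushes 2 and 3, adds them, and returns the sum. Whether a compiler keeps the dispatch duplicated per handler or merges it into a shared block is exactly the code-generation decision at issue here.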

Both of these problems appear to be caused by GCC's SLP autovectorizer. After the change, it decides to store both the next_instr pointer and the stack_pointer in a single 128-bit register in the shared basic block that contains the opcode dispatch. This is introduced in the slp1 pass (tree dump below):

  _24061 = VIEW_CONVERT_EXPR<long unsigned int>(stack_pointer_14587);
  _24062 = VIEW_CONVERT_EXPR<long unsigned int>(next_instr_14097);
  _24063 = {_24062, _24061};

  <bb 19> [count: 1658034300]:
  # frame_2363(ab) = PHI <frame_20485(4258), frame_20519(18)>
  # oparg_1245(ab) = PHI <oparg_20252(4258), oparg_14635(18)>
  # next_instr_1246(ab) = PHI <next_instr_11924(4258), next_instr_14097(18)>
  # stack_pointer_2976(ab) = PHI <stack_pointer_20484(4258), stack_pointer_14587(18)>
  # _3209 = PHI <_20217(4258), _20681(18)>
  #
  # Combination of next_instr and stack_pointer:
  #
  # vect_next_instr_1246.7061_24064 = PHI <vect_next_instr_11924.7060_24060(4258), _24063(18)>
  _24067 = BIT_FIELD_REF <vect_next_instr_1246.7061_24064, 64, 64>;
  _24068(ab) = (union _PyStackRef *) _24067;
  _24065 = BIT_FIELD_REF <vect_next_instr_1246.7061_24064, 64, 0>;
  _24066(ab) = (union _Py_CODEUNIT *) _24065;
  # DEBUG stack_pointer => stack_pointer_2976(ab)
  # DEBUG next_instr => next_instr_1246(ab)
  # DEBUG oparg => oparg_1245(ab)
  # DEBUG frame => frame_2363(ab)
  goto _3209;
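
As a rough source-level picture of what the dump describes, the sketch below packs two pointers into one 128-bit value and extracts them again, using the GNU C vector extension. The names `v2u64`, `hot_state`, `pack_state`, and `unpack_state` are hypothetical; this hand-written code only mimics the transformation and is not what the compiler emits internally.

```c
#include <stdint.h>

/* Two 64-bit lanes in one 128-bit value (GNU C vector extension). */
typedef uint64_t v2u64 __attribute__((vector_size(16)));

/* Hypothetical stand-ins for the interpreter's two hot pointers. */
struct hot_state {
    const uint16_t *next_instr;
    long *stack_pointer;
};

/* Pack: loosely corresponds to `_24063 = {_24062, _24061}` in the dump. */
v2u64 pack_state(struct hot_state s)
{
    v2u64 v = { (uint64_t)(uintptr_t)s.next_instr,
                (uint64_t)(uintptr_t)s.stack_pointer };
    return v;
}

/* Unpack: loosely corresponds to the two BIT_FIELD_REF extractions. */
struct hot_state unpack_state(v2u64 v)
{
    struct hot_state s;
    s.next_instr = (const uint16_t *)(uintptr_t)v[0];  /* bits 0..63   */
    s.stack_pointer = (long *)(uintptr_t)v[1];         /* bits 64..127 */
    return s;
}
```

The round trip preserves both pointers, but every pack and unpack costs moves between general-purpose and vector registers. That matches the `movq`, `punpcklqdq`, and `movhlps` shuffling visible in the shared dispatch block.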

Disabling the SLP autovectorization pass for the interpreter loop fixes both problems. After this change the opcode dispatch code looks like:

/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();
          19aa37: mov    -0x260(%rbp),%rsi
          19aa3e: movzbl %ah,%ecx
          19aa41: movzbl %al,%eax
          19aa44: movslq %ecx,%r15
          19aa47: jmp    *(%rsi,%rax,8)
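
One way to turn off SLP vectorization for a single function with GCC is the `optimize` function attribute, sketched below. This illustrates the general technique and is not necessarily the exact change merged here; the macro name `NO_SLP_VECTORIZE` and the function `sum_pairs` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical macro name. GCC's `optimize` attribute accepts -f options
 * without the "-f" prefix, so this requests -fno-tree-slp-vectorize for one
 * function only. Clang does not implement this attribute, hence the
 * !__clang__ guard; the x86-64 guard mirrors the scope in the PR title. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
#  define NO_SLP_VECTORIZE __attribute__((optimize("no-tree-slp-vectorize")))
#else
#  define NO_SLP_VECTORIZE
#endif

/* Hypothetical stand-in for the interpreter loop: two independent scalar
 * accumulators, the kind of pattern an SLP pass may try to pack together. */
NO_SLP_VECTORIZE
long sum_pairs(const long *values, size_t n_pairs)
{
    long lo = 0, hi = 0;
    for (size_t i = 0; i < n_pairs; i++) {
        lo += values[2 * i];
        hi += values[2 * i + 1];
    }
    return lo + hi;
}
```

The attribute changes only how the function is compiled, not what it computes. To see the effect, compare the disassembly (for example with `objdump -d`) of builds with and without the attribute.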

Performance improves by ~8% for the free-threaded build.

Surprisingly, this also seems to improve performance for the default build by ~4%. I don't understand why and I don't fully trust the result. The generated dispatch code for the default build looks unaffected by this change. Additionally, measuring instructions retired using fastbench shows a negligible change, whereas it shows a ~8% reduction for the free-threaded build.

pinskia commented:

I think this is the same as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115777.

mpage and colesbury reacted with thumbs up emoji

mpage requested review from colesbury and Yhg1s, April 9, 2025 15:50
mpage marked this pull request as ready for review, April 9, 2025 15:50
colesbury (Contributor) left a comment:

Nice!

mpage merged commit 1f5682f into python:main, Apr 9, 2025
55 checks passed
mpage deleted the gh-129987-no-slp-vectorize branch, April 9, 2025 17:34
seehwan pushed a commit to seehwan/cpython that referenced this pull request, Apr 16, 2025:
…r loop on x86-64 (python#132295)
The SLP autovectorizer can cause poor code generation for opcode dispatch, negating any benefit we get from vectorization elsewhere in the interpreter loop.
Reviewers

colesbury approved these changes
Yhg1s: awaiting requested review
markshannon (code owner): awaiting requested review


3 participants: @mpage, @pinskia, @colesbury
