gh-129987: Disable GCC SLP autovectorization for the interpreter loop on x86-64 #132295
Merged
pinskia commented Apr 9, 2025
I think this is the same as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115777.
colesbury approved these changes Apr 9, 2025
Nice!
1f5682f into python:main (55 checks passed)
seehwan pushed a commit to seehwan/cpython that referenced this pull request Apr 16, 2025
…r loop on x86-64 (python#132295)
The SLP autovectorizer can cause poor code generation for opcode dispatch, negating any benefit we get from vectorization elsewhere in the interpreter loop.
#131750 mysteriously caused a ~6% regression for the free-threaded build. The cause was poor code generation of opcode dispatch in the interpreter loop. Before the change the dispatch code looked like:
After the change, the dispatch code looked like:
There are two problems:
Both of these problems appear to be caused by GCC's SLP autovectorizer. After the change, it decides to store both the `next_instr` pointer and the `stack_pointer` in a single 128-bit register in the shared basic block that contains the opcode dispatch. This is introduced in the `slp1` pass (tree dump below):

Disabling the SLP autovectorization pass for the interpreter loop fixes both problems. After this change the opcode dispatch code looks like:
Performance improves by ~8% for the free-threaded build.
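The diff itself is not reproduced in this description. As a rough sketch of the general technique, GCC's `optimize` function attribute can turn off SLP vectorization for a single function while leaving it enabled for the rest of the translation unit. The macro name `NO_SLP_VECTORIZE`, the toy opcodes, and `toy_eval` below are hypothetical and only illustrate the idea; they are not taken from the CPython change.

```c
#include <stdint.h>

/* Assumption: the per-function optimize attribute is used to disable
 * -ftree-slp-vectorize. NO_SLP_VECTORIZE is a hypothetical name. Other
 * compilers and targets get an empty macro, so the code still builds. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
#  define NO_SLP_VECTORIZE __attribute__((optimize("no-tree-slp-vectorize")))
#else
#  define NO_SLP_VECTORIZE
#endif

/* Hypothetical opcodes for a toy stack machine. */
enum { OP_PUSH, OP_ADD, OP_HALT };

/* Toy interpreter loop that keeps next_instr and stack_pointer as two
 * separate locals, similar in spirit to a bytecode interpreter. It does
 * no bounds checking and is for illustration only. */
NO_SLP_VECTORIZE
long toy_eval(const uint8_t *code)
{
    long stack[16];
    long *stack_pointer = stack;
    const uint8_t *next_instr = code;

    for (;;) {
        uint8_t opcode = *next_instr++;
        switch (opcode) {
        case OP_PUSH:
            /* The operand is the byte following the opcode. */
            *stack_pointer++ = (long)*next_instr++;
            break;
        case OP_ADD:
            /* Pop one value and add it to the new top of stack. */
            stack_pointer--;
            stack_pointer[-1] += stack_pointer[0];
            break;
        case OP_HALT:
            return stack_pointer[-1];
        default:
            /* Unknown opcode. */
            return -1;
        }
    }
}
```

Scoping the attribute to one function keeps the rest of the file free to benefit from vectorization, which matches the intent stated above of disabling the pass only for the interpreter loop.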
Surprisingly, this also seems to improve performance for the default build by ~4%. I don't understand why and I don't fully trust the result. The generated dispatch code for the default build looks unaffected by this change. Additionally, measuring instructions retired using `fastbench` shows a negligible change, whereas it shows a ~8% reduction for the free-threaded build.

`DISPATCH` calls #129987