Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

GH-126795: Increase the JIT threshold from 16 to 4096#126816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
brandtbucher merged 8 commits intopython:mainfrombrandtbucher:jump-backoff
Nov 18, 2024

Conversation

@brandtbucher
Copy link
Member

@brandtbucherbrandtbucher commentedNov 14, 2024
edited by bedevere-appbot
Loading

The core change itself is simple, and results in2.1% speed improvement and a 3.6% memory improvement for JIT builds. The bulk of this PR is just modifying most of the tests intest_capi.test_opt to remove assumptions about the warmup threshold.

diegorusso and alonme reacted with hooray emoji
@brandtbucherbrandtbucher added performancePerformance or resource usage interpreter-core(Objects, Python, Grammar, and Parser dirs) topic-JIT labelsNov 14, 2024
@brandtbucherbrandtbucher self-assigned thisNov 14, 2024
@@ -0,0 +1 @@
Increase the threshold for JIT code warmup.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Do we want to include some mention of the performance improvements we've seen with the new threshold?

brandtbucher reacted with thumbs up emoji
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I agree with Savannah.

loop()
# Subtract because optimizer doesn't kick in sooner
self.assertEqual(opt.get_count(),1000-TIER2_THRESHOLD)
self.assertEqual(opt.get_count(),1001)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I might be missing something. Why did we remove the piece about theTIER2_THRESHOLD?

I guess, more generally, how did you decide when to add/remove the threshold from a test? It seems we're wholesale replacing hardcoded values in tests but also adding/removing the threshold when basic arithmetic is used. Was this just trial and error, or is there some convention I'm not picking up on?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

IIUC, we were iterating 1000 times,1000 - TIER2_THRESHOLD of which happened after reaching the threshold.

By that doesn't work ifTIER2_THRESHOLD > 1000, so now we iterate1000 + TIER2_THRESHOLD times, 1000 of which happen after reaching the threshold.

savannahostrowski reacted with thumbs up emoji
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Earlier logic assumed a low threshold (< 1000). With the new threshold this logic needs to be adjusted to keep the test working. Now it's generic enough, that the threshold could be any (positive) number.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ahhhh, that makes sense. Thanks!

diegorusso reacted with thumbs up emoji
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yeah, generally my approach was to change the number of loops to use the constant, then fix up any math to use the constant (for tests that assert some result). This one was a bit different, since it needed to runmore thanTIER2_THRESHOLD times. 1000 used to be a "big" number, but now it's not.

savannahostrowski reacted with thumbs up emoji
Copy link
Member

@markshannonmarkshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

A few comments on how we name the consts, but looks good in principle.

x0 = x1 = x2 = x3 = x4 = x5 = x6 = x7 = x8 = x9 =42
y0 = y1 = y2 = y3 = y4 = y5 = y6 = y7 = y8 = y9 =42
z0 = z1 = z2 = z3 = z4 = z5 = z6 = z7 = z8 = z9 =42
a0 = a1 = a2 = a3 = a4 = a5 = a6 = a7 = a8 = a9 ={TIER2_THRESHOLD}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't think those 42s are anything to do with thresholds, just a Douglas Adams reference.

brandtbucher reacted with thumbs up emojidiegorusso reacted with heart emoji
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

42 is always the right answer :)

w0 = w1 = w2 = w3 = w4 = w5 = w6 = w7 = w8 = w9 ={TIER2_THRESHOLD}
x0 = x1 = x2 = x3 = x4 = x5 = x6 = x7 = x8 = x9 ={TIER2_THRESHOLD}
y0 = y1 = y2 = y3 = y4 = y5 = y6 = y7 = y8 = y9 ={TIER2_THRESHOLD}
z0 = z1 = z2 = z3 = z4 = z5 = z6 = z7 = z8 = z9 ={TIER2_THRESHOLD}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

z9 does need to exceed the threshold though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

shouldn't we add an assert somewhere to check this condition? if z9 is bigger than the threshold the test fails.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The test passes if z9 meets or exceeds the threshold. It fails if it doesn't. I think it's fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, you are right, I got It wrong :)

foriinrange(100):
foriinrange(TIER2_THRESHOLD+100):
x+=a.attr
# for the first 90 iterations we set the attribute on this dummy function which shouldn't
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This comment needs updating

brandtbucher reacted with thumbs up emoji

if (PyModule_Add(module,"TIER2_THRESHOLD",
PyLong_FromLong(JUMP_BACKWARD_INITIAL_VALUE))<0) {
PyLong_FromLong(JUMP_BACKWARD_INITIAL_VALUE+1))<0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'd rather not have the+1 here.
Could you add another constEXCEEDS_TIER2_THRESHOLD = TIER2_THRESHOLD + 1?
In the tests above, useEXCEEDS_TIER2_THRESHOLD where we want to exceed the threshold, and in the places where we need the actual threshold, and you currently haveTIER2_THRESHOLD - 1, useTIER2_THRESHOLD

In theory, the tests should continue to work ifEXCEEDS_TIER2_THRESHOLD = TIER2_THRESHOLD + 5.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The thing is, the threshold to tier upis 4096 hits. We just haveJUMP_BACKWARD_INITIAL_VALUE set to 4095 because we "count" the final zero.

This makes sense for me when looking at the tests, personally. It's intuitive that something that loopsTIER2_THRESHOLD times would tier up, and something that loopsTIER2_THRESHOLD - 1 wouldn't.

@bedevere-app
Copy link

When you're done making the requested changes, leave the comment:I have made the requested changes; please review again.

@brandtbucher
Copy link
MemberAuthor

Results across platforms. Looks like the memory savings are more pronounced on AArch64 macOS and performance impact is more pronounced on AArch64 Linux. Windows seems to benefit less overall (we don't measure memory on Windows, though):

  • aarch64-apple-darwin: 2% faster, 5% less memory
  • aarch64-unknown-linux-gnu: 9% faster, 4% less memory
  • x86_64-unknown-linux-gnu: 1-2% faster, 3-4% less memory
  • x86_64-pc-windows-msvc: 1% faster
  • i686-pc-windows-msvc: 1% faster
diegorusso, savannahostrowski, Fidget-Spinner, thinkwelltwd, and Zynocs reacted with rocket emoji

Copy link
Contributor

@diegorussodiegorusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is more a question: are we going to make this threshold changeable at runtime?

loop()
# Subtract because optimizer doesn't kick in sooner
self.assertEqual(opt.get_count(),1000-TIER2_THRESHOLD)
self.assertEqual(opt.get_count(),1001)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Earlier logic assumed a low threshold (< 1000). With the new threshold this logic needs to be adjusted to keep the test working. Now it's generic enough, that the threshold could be any (positive) number.

x0 = x1 = x2 = x3 = x4 = x5 = x6 = x7 = x8 = x9 =42
y0 = y1 = y2 = y3 = y4 = y5 = y6 = y7 = y8 = y9 =42
z0 = z1 = z2 = z3 = z4 = z5 = z6 = z7 = z8 = z9 =42
a0 = a1 = a2 = a3 = a4 = a5 = a6 = a7 = a8 = a9 ={TIER2_THRESHOLD}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

42 is always the right answer :)

w0 = w1 = w2 = w3 = w4 = w5 = w6 = w7 = w8 = w9 ={TIER2_THRESHOLD}
x0 = x1 = x2 = x3 = x4 = x5 = x6 = x7 = x8 = x9 ={TIER2_THRESHOLD}
y0 = y1 = y2 = y3 = y4 = y5 = y6 = y7 = y8 = y9 ={TIER2_THRESHOLD}
z0 = z1 = z2 = z3 = z4 = z5 = z6 = z7 = z8 = z9 ={TIER2_THRESHOLD}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

shouldn't we add an assert somewhere to check this condition? if z9 is bigger than the threshold the test fails.

@@ -0,0 +1 @@
Increase the threshold for JIT code warmup.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I agree with Savannah.

@diegorusso
Copy link
Contributor

Results across platforms. Looks like the memory savings are more pronounced on AArch64 macOS and performance impact is more pronounced on AArch64 Linux. Windows seems to benefit less overall (we don't measure memory on Windows, though):

  • aarch64-apple-darwin: 2% faster, 5% less memory
  • aarch64-unknown-linux-gnu: 9% faster, 4% less memory
  • x86_64-unknown-linux-gnu: 1-2% faster, 3-4% less memory
  • x86_64-pc-windows-msvc: 1% faster
  • i686-pc-windows-msvc: 1% faster

Anyway, great stuff! These results are incredible!

savannahostrowski reacted with rocket emoji

@alonme
Copy link
Contributor

Results across platforms. Looks like the memory savings are more pronounced on AArch64 macOS and performance impact is more pronounced on AArch64 Linux. Windows seems to benefit less overall (we don't measure memory on Windows, though):

  • aarch64-apple-darwin: 2% faster, 5% less memory
  • aarch64-unknown-linux-gnu: 9% faster, 4% less memory
  • x86_64-unknown-linux-gnu: 1-2% faster, 3-4% less memory
  • x86_64-pc-windows-msvc: 1% faster
  • i686-pc-windows-msvc: 1% faster

Any idea whyaarch64-unknown-linux-gnu would be affected ~4X more?

@brandtbucher
Copy link
MemberAuthor

This is more a question: are we going to make this threshold changeable at runtime?

Maybe eventually. I'll open an issue about exactly what interfaces people would like to control the lifecycle of JIT code.

The thing is that changing the thresholds of our counters at runtime is sort of tricky, since they are initialized to some constant and count down from there towards zero. There's also alot of them.

That means that we need to re-initialize counters whenever the threshold is changed in order to make this work. It's doable, but not quite as simple as e.g. updating GC thresholds at runtime. An env var set at startup could be another option.

@brandtbucher
Copy link
MemberAuthor

Any idea why aarch64-unknown-linux-gnu would be affected ~4X more?

Not 100% sure.

Anecdotally, AArch64 Linux code is the largest and least efficient to compile (lots of tricky instruction patching, emitting range-extending trampolines, etc.), so it would make sense that compiling less of it helps us.

@brandtbucher
Copy link
MemberAuthor

Okay, I think I've addressed everyone's comments.

I have made the requested changes; please review again (or don't, either way is fine).

@bedevere-app
Copy link

Thanks for making the requested changes!

@markshannon: please review the changes made to this pull request.

Copy link
Member

@savannahostrowskisavannahostrowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

... this can result in performance gains of 1-9% and memory savings of 3-5%.

One small typo but LGTM 🔥

Co-authored-by: Savannah Ostrowski <savannahostrowski@gmail.com>
@diegorusso
Copy link
Contributor

Thanks Brandt, It LGTM.

@brandtbucherbrandtbucher merged commit4cd1076 intopython:mainNov 18, 2024
39 checks passed
ebonnal pushed a commit to ebonnal/cpython that referenced this pull requestJan 12, 2025
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@diegorussodiegorussodiegorusso left review comments

@savannahostrowskisavannahostrowskisavannahostrowski approved these changes

@markshannonmarkshannonAwaiting requested review from markshannon

Assignees

@brandtbucherbrandtbucher

Labels

interpreter-core(Objects, Python, Grammar, and Parser dirs)performancePerformance or resource usagetopic-JIT

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

5 participants

@brandtbucher@diegorusso@alonme@savannahostrowski@markshannon

[8]ページ先頭

©2009-2025 Movatter.jp