- Notifications
You must be signed in to change notification settings - Fork5.2k
Fix condition to align methods to 32 bytes boundary#42909
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
kunalspathak commentedSep 30, 2020
@dotnet/jit-contrib ,@AndyAyersMS |
AndyAyersMS commentedSep 30, 2020
Expected impact is
Would be good to highlight one or two benchmarks we really expect to stabilize with this (benchmarks with very simple timing payloads). Note because of the way BenchmarkDotNet operates the main "driving" loop for most benchmark runs (which invokes a delegate that runs the method to benchmark) will be jitted and has a loop, so bypasses Tiering. One hypothesis we're exploring is that the varying alignment of this loop over time contributes to some of our observed day to day perf fluctuations. Beyond that, once this is in, it would be good to do a retrospective study of benchmarks after we've gotten a few days worth of data, and see if we really did move the needle on stability. I had hoped to do that with#2249 but lab collection wasn't in great shape back then; looking back now it doesn't seem to have really had much impact, perhaps because it only affected Tier1 methods. Any idea what a typical delta this causes on the overall size of the code segment? |
kunalspathak commentedOct 1, 2020 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I picked upLoopReturn benchmark which wasrecently found by BenchmarkDotNet to have 22% regression but is Another thing that I experimented along with method alignment is the effect on the benchmark timing because of loop alignment within the method. With the help of Andy'sNopPadding branch, I added various padding of All the measurements are done on StabilityWithout the no-ops, the loop starts at offset Benchmark measurements for default alignment
Benchmark measurements for 32-byte alignment
PerformanceI also tried measuring the performance between default vs. 32-byte alignment and I see some segment of my observation that I saw in stability. As soon as the loop gets aligned to 32-byte boundary (after adding 3+ nops), the performance with 32 byte alignment becomes better. So perhaps, to improve performance as Andy pointed out multiple times, it will be good to align hot loops at 32-byte boundary which in combination with method alignment will give us both stability and performance wins.
|
kunalspathak commentedOct 1, 2020
@adamsitnik , @danmosemsft |
AndyAyersMS commentedOct 1, 2020
danmoseley commentedOct 1, 2020
Pretty nice if we get perf for customers as well as stability for us. i didn't expect that: I assumed we would just move perf around when we made alignment stable. So great! |
adamsitnik commentedOct 1, 2020
If I remember correctly,@AndreyAkinshin has added a manual loop unrolling to this particular loop to minimize the effects of the alignment. If you pass privatevoidWorkloadActionUnroll(System.Int64invokeCount){for(System.Int64i=0;i<invokeCount;i++){consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());consumer.Consume(workloadDelegate());}} to disable it, you need to pass privatevoidWorkloadActionUnroll(System.Int64invokeCount){for(System.Int64i=0;i<invokeCount;i++){consumer.Consume(workloadDelegate());}} |
adamsitnik commentedOct 1, 2020
@kunalspathak this looks and sounds awesome!
this would be perfect!! FWIW here are some benchmarks that I remember are very dependent on the loop alignment:
No loops, but super dependent on alignment on
|
kunalspathak commentedOct 6, 2020
I added logging around the change to see how many methods get 32-byte aligned vs. the default alignment and then ran pmi on .NET libraries. There is definitely duplicate methods that might get counted in either of them, but on an average, 532,667 methods continued getting default alignment while 70,573 methods got 32-bytes alignment i.e. out of 603,240 total methods, 70,573 methods got 32-byte aligned which is approx.11% of them. This is close to what@AndyAyersMS anticipated in#2249 (comment).
I tried gathering the results forFib benchmark (the only one which is bimodel and has no-loops) and seeing mixed results. master:
PR:
|
kunalspathak commentedOct 6, 2020
For x86 as well, I don't see much difference for Fib benchmark on my local machine. master:
PR:
|
AndyAyersMS commentedOct 7, 2020
If we assume half of these would have been 32 byte aligned anyways, then we have an additional 35K instances of 16 bytes wasted in the code segment, so around 560K bytes of extra allocation. I'd like to know how this compares to the total amount of code. What was the total code size for the 603,240 methods? |
kunalspathak commentedOct 7, 2020
That makes sense.
The total code size is 51377607 bytes or around 51MB. |
AndyAyersMS commentedOct 7, 2020
So perhaps as much as 1% of the code space is now padding. I suppose it's possible the code heap can fit some smaller methods into the larger padding voids. Would be nice at some point to actually measure the heap sizes instead of trying to do these back of the envelope estimations. Let's take this change and keep an eye on the perf lab runs to see if we do indeed see increased stability. |
kunalspathak commentedOct 7, 2020
Yes, I wanted to get the heap size number for this one and spent some time debugging theallocCode and allocCodeRaw but in order to get accurate numbers from the appropriate heap (specially around |
Sergio0694 commentedOct 7, 2020
This is pretty cool! 😄 @AndyAyersMS Do you think this could be related to the .NET perf regession from#36907? |
AndyAyersMS commentedOct 7, 2020
This work is a prerequisite for aligning branches -- we plan to explore doing that soon. I'd expect this change might stabilize performance of your code, but would not necessarily optimize it. |
Sergio0694 commentedOct 7, 2020
Ah, I see, thank you for the clarification! 😊 Was just curious whether this was at least in part related to that, happy to hear branch alignment work is planned soon! |
Indotnet#2249, we started doing alignment of methods to 32-byte boundary for Tier1. However, a method having loops bypass tiering and hence this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.


In#2249, we started doing alignment of methods to 32-byte boundary for
Tier1. However, a method having loops bypass tiering and hence this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.