Fix condition to align methods to 32 bytes boundary #42909


Merged: kunalspathak merged 1 commit into dotnet:master from kunalspathak:align32bytes on Oct 7, 2020

Conversation

@kunalspathak (Contributor)

In #2249, we started aligning methods to a 32-byte boundary for Tier1. However, a method having loops bypasses tiering, and hence this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
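For context, the 32-byte rounding involved is the standard power-of-two alignment computation. Here is a minimal sketch in Python, illustrative only and not the JIT's actual code:

```python
def align_up(addr: int, alignment: int = 32) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

# A method placed at 0x3D would be padded to start at 0x40;
# one already at 0x40 needs no padding.
```

Padding from 0x3D to 0x40 costs only 3 bytes, but it makes every offset within the method predictable relative to 32-byte fetch boundaries.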

@Dotnet-GitSync-Bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Sep 30, 2020
@kunalspathak (Contributor, Author)

@dotnet/jit-contrib, @AndyAyersMS

@AndyAyersMS (Member)

Expected impact is

  • (short term) perf stability for benchmarks that run jitted code with loops
  • (medium term) ability to optimize perf by adjusting alignment of code within methods

Would be good to highlight one or two benchmarks we really expect to stabilize with this (benchmarks with very simple timing payloads).

Note that because of the way BenchmarkDotNet operates, the main "driving" loop for most benchmark runs (which invokes a delegate that runs the method to benchmark) will be jitted and has a loop, so it bypasses tiering. One hypothesis we're exploring is that the varying alignment of this loop over time contributes to some of our observed day-to-day perf fluctuations.

Beyond that, once this is in, it would be good to do a retrospective study of benchmarks after we've gotten a few days' worth of data, and see if we really did move the needle on stability. I had hoped to do that with #2249, but lab collection wasn't in great shape back then; looking back now, it doesn't seem to have had much impact, perhaps because it only affected Tier1 methods.

Any idea what a typical delta this causes on the overall size of the code segment?

@kunalspathak (Contributor, Author)

kunalspathak commented Oct 1, 2020 (edited)

I picked up the LoopReturn benchmark, which was recently found by BenchmarkDotNet to have a 22% regression but is a bimodal test. The benchmark itself has a tight loop on which I can do some experimentation. For the sake of argument, I removed the only condition in the loop code to eliminate any branch-prediction effect. The loop is 17 bytes long, so a single fetch of a 32B line should bring the entire loop code into the decoder and cache it in the DSB.

Another thing I experimented with, along with method alignment, is the effect of loop alignment within the method on benchmark timing. With the help of Andy's NopPadding branch, I added various nop paddings to adjust the loop header to alignments ranging from 0x0 through 0xF, to see whether fixing the method alignment to a 32-byte boundary helps overall execution timing and stability of the benchmark regardless of where the loop is aligned. The results are positive.

All the measurements were done on an Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores.

Stability

Without the no-ops, the loop starts at offset 0x3D, so adding 1 no-op moves the loop to 0x3E, and so forth. In the graph below, the X-axis is the number of nops of padding added in the method and the Y-axis is the standard deviation of benchmark timings measured over 5 iterations. The key observation is that there is much less deviation with 32-byte alignment as compared to the default (16-byte alignment).

With default alignment, the benchmarks are stable for some nop paddings, but in general they deviate if the loop's alignment is not favorable to the DSB line. In contrast, the 32-byte alignment provides stable timings with very little deviation, no matter what the loop alignment is. For 32-byte method alignment, when the loop's alignment reaches 0x40 (after adding 3 through 15 nops), the performance even gets better.

My theory is that after 3 nops are added in a method, the loop (17 bytes long) gets loaded entirely in a single fetch. For example, with 2 nops the loop starts at address 0x3F, which is not a 32-byte boundary, so two fetches might be needed to bring the loop to the decoder. However, after adding, say, 5 nops, although the loop gets aligned to 0x42, I assume the fetch starts from 0x40 (because it is a 32-byte boundary), so the entire loop gets loaded even though the non-loop uops between 0x40–0x41 get fetched too. This performance side-effect is guaranteed because the method is 32-byte aligned. So yes, I think this is a step in the right direction.
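The fetch-count reasoning above can be sketched as a small model, a simplification of 32-byte fetch lines rather than an exact description of the CPU front end:

```python
def fetch_lines(start: int, length: int, line: int = 32) -> int:
    """Number of 32-byte fetch lines spanned by the byte range [start, start + length)."""
    return (start + length - 1) // line - start // line + 1

# The 17-byte loop starting at 0x40 fits in a single fetch line,
# while at 0x3F it straddles a 32-byte boundary and needs two fetches.
```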

[Chart: standard deviation of benchmark timings vs. number of nop paddings, default vs. 32-byte alignment]

Benchmark measurements for default alignment:

| No. of no-ops | iter 1 | iter 2 | iter 3 | iter 4 | iter 5 | Median | Average | Deviation | Loop offset |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 45.18 | 45.36 | 45.19 | 45.19 | 45.11 | 45.19 | 45.206 | 0.082608 | 0x3D |
| 1 | 45.03 | 45.03 | 45.58 | 45.28 | 45.29 | 45.28 | 45.242 | 0.203902 | 0x3E |
| 2 | 46.06 | 45.4 | 45.27 | 45.33 | 45.39 | 45.39 | 45.49 | 0.288791 | 0x3F |
| 3 | 52.59 | 52.79 | 52.59 | 44.78 | 44.94 | 52.59 | 49.538 | 3.820604 | 0x40 |
| 4 | 45.1 | 53.73 | 46.99 | 45.37 | 45.41 | 45.41 | 47.32 | 3.273408 | 0x41 |
| 5 | 52.29 | 52.55 | 45.33 | 52.36 | 52.33 | 52.33 | 50.972 | 2.822413 | 0x42 |
| 6 | 50.85 | 52.47 | 55.56 | 52.3 | 52.46 | 52.46 | 52.728 | 1.540693 | 0x43 |
| 7 | 52.32 | 46.1 | 47.64 | 52.44 | 46.09 | 47.64 | 48.918 | 2.88271 | 0x44 |
| 8 | 45.19 | 45.98 | 52.6 | 45.98 | 46.12 | 45.98 | 47.174 | 2.732761 | 0x45 |
| 9 | 53.38 | 46.77 | 52.3 | 52.5 | 45.4 | 52.3 | 50.07 | 3.302508 | 0x46 |
| 10 | 46.01 | 52.41 | 46.19 | 51.9 | 52.91 | 51.9 | 49.884 | 3.10661 | 0x47 |
| 11 | 52.38 | 52.36 | 52.38 | 52.29 | 52.33 | 52.36 | 52.348 | 0.034293 | 0x48 |
| 12 | 52.6 | 52.6 | 52.81 | 52.61 | 52.61 | 52.61 | 52.646 | 0.082122 | 0x49 |
| 13 | 52.55 | 52.74 | 52.77 | 52.67 | 52.61 | 52.67 | 52.668 | 0.081093 | 0x4A |
| 14 | 52.18 | 52.52 | 52.18 | 45.01 | 45.01 | 52.18 | 49.38 | 3.570249 | 0x4B |
| 15 | 44.99 | 52.32 | 44.97 | 52.55 | 52.49 | 52.32 | 49.464 | 3.661954 | 0x4C |
Benchmark measurements for 32-byte alignment:

| No. of nops | iter 1 | iter 2 | iter 3 | iter 4 | iter 5 | Median | Average | Deviation | Loop offset |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.62 | 52.15 | 52.21 | 52.43 | 52.15 | 52.21 | 52.32666667 | 0.185299757 | 0x3D |
| 1 | 52.07 | 52.18 | 52.01 | 52.15 | 52.25 | 52.07 | 52.08666667 | 0.084 | 0x3E |
| 2 | 51.96 | 51.97 | 52.41 | 52.23 | 52.01 | 51.97 | 52.11333333 | 0.176816289 | 0x3F |
| 3 | 45.91 | 45.85 | 46.47 | 46.09 | 45.99 | 45.91 | 46.07666667 | 0.219308003 | 0x40 |
| 4 | 46.04 | 45.84 | 45.87 | 45.85 | 45.91 | 45.87 | 45.91666667 | 0.073047929 | 0x41 |
| 5 | 45.2 | 45.31 | 45.36 | 45.23 | 45.27 | 45.31 | 45.29 | 0.056780278 | 0x42 |
| 6 | 45.94 | 45.87 | 45.78 | 45.88 | 45.81 | 45.87 | 45.86333333 | 0.056071383 | 0x43 |
| 7 | 45.18 | 45.2 | 45.16 | 45.21 | 45.3 | 45.18 | 45.18 | 0.048166378 | 0x44 |
| 8 | 45.39 | 45.06 | 45.27 | 45.2 | 45.21 | 45.27 | 45.24 | 0.107070071 | 0x45 |
| 9 | 45.21 | 45.27 | 45.27 | 45.27 | 45.51 | 45.27 | 45.25 | 0.104613575 | 0x46 |
| 10 | 45.28 | 45.19 | 45.22 | 45.2 | 45.21 | 45.22 | 45.23 | 0.031622777 | 0x47 |
| 11 | 45.52 | 45.24 | 45.32 | 45.31 | 45.24 | 45.32 | 45.36 | 0.102683981 | 0x48 |
| 12 | 45.25 | 45.16 | 45.22 | 45.18 | 45.19 | 45.22 | 45.21 | 0.031622777 | 0x49 |
| 13 | 45.65 | 45.54 | 45.23 | 45.57 | 45.72 | 45.54 | 45.47333333 | 0.168214149 | 0x4A |
| 14 | 46.87 | 45.6 | 45.44 | 45.55 | 44.98 | 45.6 | 45.97 | 0.630345937 | 0x4B |
| 15 | 45.28 | 45.45 | 45.28 | 45.22 | 45.6 | 45.28 | 45.33666667 | 0.139942845 | 0x4C |
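The Median/Average/Deviation columns can be reproduced from the five iteration timings; the deviation matches the population standard deviation, checked here against the first row of the default-alignment table:

```python
from statistics import mean, median, pstdev

# Row for 0 no-ops, default alignment (from the table above)
iters = [45.18, 45.36, 45.19, 45.19, 45.11]

print(median(iters))              # median of the 5 iterations
print(round(mean(iters), 3))      # average
print(round(pstdev(iters), 6))    # population standard deviation
```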

Performance

I also measured performance for default vs. 32-byte alignment, and I see the same pattern I observed for stability. As soon as the loop gets aligned to a 32-byte boundary (after adding 3+ nops), the performance with 32-byte alignment becomes better. So perhaps, to improve performance as Andy has pointed out multiple times, it will be good to align hot loops at a 32-byte boundary, which in combination with method alignment will give us both stability and performance wins.

| No. of nops | Median (default alignment) | Median (32-byte alignment) | % difference |
|---|---|---|---|
| 0 | 45.18 | 52.21 | 16% |
| 1 | 45.03 | 52.07 | 15% |
| 2 | 46.06 | 51.97 | 14% |
| 3 | 52.59 | 45.91 | -13% |
| 4 | 45.1 | 45.87 | 1% |
| 5 | 52.29 | 45.31 | -13% |
| 6 | 50.85 | 45.87 | -13% |
| 7 | 52.32 | 45.18 | -5% |
| 8 | 45.19 | 45.27 | -2% |
| 9 | 53.38 | 45.27 | -13% |
| 10 | 46.01 | 45.22 | -13% |
| 11 | 52.38 | 45.32 | -13% |
| 12 | 52.6 | 45.22 | -14% |
| 13 | 52.55 | 45.54 | -14% |
| 14 | 52.18 | 45.6 | -13% |
| 15 | 44.99 | 45.28 | -13% |
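The % difference column appears to be the change in the 32-byte-alignment median relative to the default-alignment median. A sketch of that assumed formula (rounding in a few table rows may differ from this computation):

```python
def pct_diff(default_median: float, aligned_median: float) -> int:
    """Percent change of the 32-byte-aligned median relative to the default, rounded."""
    return round((aligned_median - default_median) / default_median * 100)

# Checked against the first and fourth rows of the table above:
print(pct_diff(45.18, 52.21))   # 0 nops
print(pct_diff(52.59, 45.91))   # 3 nops
```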

@kunalspathak (Contributor, Author)

@adamsitnik, @danmosemsft

@AndyAyersMS (Member)

LoopReturn seems like a reasonable choice, though I am also interested in cases where the benchmark payload does not have a loop. Maybe we can find one?

Here's the historical record showing the bimodal behavior, seems like most of the time it gets a bad alignment and is slow.

[Chart: historical LoopReturn timings showing the bimodal behavior]

@danmoseley (Member)

Pretty nice if we get perf for customers as well as stability for us. I didn't expect that: I assumed we would just move perf around when we made alignment stable. So great!


@adamsitnik (Member)

Note that because of the way BenchmarkDotNet operates, the main "driving" loop for most benchmark runs (which invokes a delegate that runs the method to benchmark) will be jitted and has a loop, so it bypasses tiering. One hypothesis we're exploring is that the varying alignment of this loop over time contributes to some of our observed day-to-day perf fluctuations.

If I remember correctly, @AndreyAkinshin added manual loop unrolling to this particular loop to minimize the effects of alignment.

If you pass --keepFiles and open the auto-generated *.notcs file, you should be able to see something like this:

```cs
private void WorkloadActionUnroll(System.Int64 invokeCount)
{
    for (System.Int64 i = 0; i < invokeCount; i++)
    {
        consumer.Consume(workloadDelegate());
        // ... the call above is repeated 16 times per loop iteration
    }
}
```

To disable it, you need to pass --unrollFactor 1:

```cs
private void WorkloadActionUnroll(System.Int64 invokeCount)
{
    for (System.Int64 i = 0; i < invokeCount; i++)
    {
        consumer.Consume(workloadDelegate());
    }
}
```

@adamsitnik (Member)

Key observation is that there is much less deviation with 32-byte alignment as compared to the default (16-byte alignment).

@kunalspathak this looks and sounds awesome!

it will be good to align hot loops at 32-byte boundary which in combination with method alignment will give us both stability and performance wins

this would be perfect!!

FWIW here are some benchmarks that I remember are very dependent on the loop alignment:

  • System.Memory.Span<Char>.IndexOfValue
  • System.Memory.Span<Byte>.BinarySearch

No loops, but super dependent on alignment on x86:

  • Benchstone.BenchI.Fib.Test

@kunalspathak (Contributor, Author)

Any idea what a typical delta this causes on the overall size of the code segment?

I added logging around the change to see how many methods get 32-byte alignment vs. the default alignment, and then ran pmi on the .NET libraries. There are definitely duplicate methods that might get counted in either bucket, but on average, 532,667 methods kept the default alignment while 70,573 methods got 32-byte alignment; i.e., out of 603,240 total methods, 70,573 got 32-byte aligned, which is approx. 11% of them. This is close to what @AndyAyersMS anticipated in #2249 (comment).
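As a quick sanity check of the ratio quoted above:

```python
aligned, total = 70_573, 603_240
share = round(aligned / total * 100, 1)
print(share)  # close to the "approx. 11%" figure
```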

LoopReturn seems like a reasonable choice, though I am also interested cases where the benchmark payload does not have a loop. Maybe we can find one?

I tried gathering results for the Fib benchmark (the only one which is bimodal and has no loops) and I'm seeing mixed results.

master:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---|---|---|---|---|---|
| Fib | 136.0 us | 0.61 us | 0.51 us | 135.8 us | 135.3 us | 137.0 us |
| Fib | 142.2 us | 0.55 us | 0.51 us | 142.2 us | 141.4 us | 143.4 us |
| Fib | 142.3 us | 0.79 us | 0.70 us | 142.3 us | 141.5 us | 143.9 us |
| Fib | 142.2 us | 0.70 us | 0.65 us | 142.0 us | 141.5 us | 143.6 us |
| Fib | 143.2 us | 1.94 us | 1.72 us | 142.8 us | 141.5 us | 147.1 us |

PR:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---|---|---|---|---|---|
| Fib | 142.1 us | 0.82 us | 0.73 us | 141.9 us | 141.3 us | 144.1 us |
| Fib | 142.0 us | 0.51 us | 0.45 us | 142.0 us | 141.4 us | 143.0 us |
| Fib | 142.1 us | 0.54 us | 0.48 us | 142.3 us | 141.3 us | 142.9 us |
| Fib | 142.2 us | 0.70 us | 0.62 us | 142.0 us | 141.5 us | 143.6 us |
| Fib | 149.5 us | 3.73 us | 4.15 us | 147.7 us | 143.2 us | 158.1 us |

@kunalspathak (Contributor, Author)

For x86 as well, I don't see much difference for Fib benchmark on my local machine.

master:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---|---|---|---|---|---|
| Fib | 142.4 us | 1.04 us | 0.87 us | 142.4 us | 141.3 us | 144.2 us |
| Fib | 145.7 us | 2.79 us | 2.99 us | 144.8 us | 142.5 us | 151.9 us |
| Fib | 142.6 us | 0.91 us | 0.81 us | 142.5 us | 141.7 us | 144.5 us |
| Fib | 142.7 us | 0.79 us | 0.70 us | 142.5 us | 141.4 us | 143.8 us |
| Fib | 142.5 us | 0.53 us | 0.49 us | 142.4 us | 141.7 us | 143.4 us |

PR:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---|---|---|---|---|---|
| Fib | 142.3 us | 0.77 us | 0.72 us | 142.4 us | 141.3 us | 143.5 us |
| Fib | 142.5 us | 0.96 us | 0.85 us | 142.0 us | 141.6 us | 144.2 us |
| Fib | 143.7 us | 1.34 us | 1.25 us | 143.4 us | 141.7 us | 145.9 us |
| Fib | 142.1 us | 0.45 us | 0.38 us | 142.1 us | 141.5 us | 142.9 us |
| Fib | 144.3 us | 2.08 us | 1.94 us | 143.7 us | 142.1 us | 148.1 us |

@AndyAyersMS (Member)

out of 603,240 total methods, 70,573 methods got 32-byte aligned

If we assume half of these would have been 32 byte aligned anyways, then we have an additional 35K instances of 16 bytes wasted in the code segment, so around 560K bytes of extra allocation.

I'd like to know how this compares to the total amount of code. What was the total code size for the 603,240 methods?

@kunalspathak (Contributor, Author)

out of 603,240 total methods, 70,573 methods got 32-byte aligned

If we assume half of these would have been 32 byte aligned anyways, then we have an additional 35K instances of 16 bytes wasted in the code segment, so around 560K bytes of extra allocation.

That makes sense.

I'd like to know how this compares to the total amount of code. What was the total code size for the 603,240 methods?

The total code size is 51377607 bytes or around 51MB.

@AndyAyersMS (Member)

So perhaps as much as 1% of the code space is now padding. I suppose it's possible the code heap can fit some smaller methods into the larger padding voids.

Would be nice at some point to actually measure the heap sizes instead of trying to do these back of the envelope estimations.

Let's take this change and keep an eye on the perf lab runs to see if we do indeed see increased stability.
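Andy's back-of-the-envelope estimate can be reproduced directly (assumptions as stated: half of the 70,573 newly aligned methods would have landed on a 32-byte boundary anyway, at an average cost of 16 padding bytes each):

```python
extra_aligned = 70_573 // 2            # half assumed already aligned by chance
wasted_bytes = extra_aligned * 16      # average 16 bytes of padding per method
total_code = 51_377_607                # total code size reported above

print(wasted_bytes)                                     # roughly the "560K bytes" estimate
print(round(wasted_bytes / total_code * 100, 1), "%")   # roughly the "1% of code space"
```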


@kunalspathak (Contributor, Author)

Would be nice at some point to actually measure the heap sizes instead of trying to do these back of the envelope estimations.

Yes, I wanted to get the heap size numbers for this one and spent some time debugging allocCode and allocCodeRaw. But getting accurate numbers from the appropriate heap (especially around allocCodeRaw and UnlockedAllocMemForCode_NoThrow) needed some extra special-casing, because those methods are also called from the runtime. I deferred that work since I could get a rough ballpark the other way, but getting the heap size is on my TODO list.

@kunalspathak merged commit 03a0931 into dotnet:master on Oct 7, 2020
@Sergio0694 (Contributor)

This is pretty cool! 😄

@AndyAyersMS Do you think this could be related to the .NET perf regression from #36907? Your last hypothesis was about those switch branches in the hot path possibly not being aligned on .NET 5.

@AndyAyersMS (Member)

This work is a prerequisite for aligning branches -- we plan to explore doing that soon.

I'd expect this change might stabilize performance of your code, but would not necessarily optimize it.

@Sergio0694 (Contributor)

Ah, I see, thank you for the clarification! 😊

Was just curious whether this was at least in part related to that, happy to hear branch alignment work is planned soon!

kunalspathak added a commit to kunalspathak/runtime that referenced this pull request on Oct 23, 2020:

In dotnet#2249, we started doing alignment of methods to a 32-byte boundary for Tier1. However, a method having loops bypasses tiering, and hence this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
@ghost locked as resolved and limited conversation to collaborators on Dec 7, 2020

Reviewers

@AndyAyersMS approved these changes
@BruceForstall approved these changes

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

7 participants

@kunalspathak, @AndyAyersMS, @danmoseley, @adamsitnik, @Sergio0694, @BruceForstall, @Dotnet-GitSync-Bot
