Fix condition to align methods to 32 bytes boundary #42909


Merged: kunalspathak merged 1 commit into dotnet:master from kunalspathak:align32bytes on Oct 7, 2020

Conversation

@kunalspathak (Contributor)

In #2249, we started aligning methods to a 32-byte boundary for Tier1. However, a method having loops bypasses tiering, and hence this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
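For context, the 32-byte rounding involved is the standard power-of-two alignment computation. Here is a minimal sketch in Python, illustrative only and not the JIT's actual code:

```python
def align_up(addr: int, alignment: int = 32) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

# A method placed at 0x3D would be padded to start at 0x40;
# one already at 0x40 needs no padding.
```

Padding from 0x3D to 0x40 costs only 3 bytes, but it makes every offset within the method predictable relative to 32-byte fetch boundaries.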

@Dotnet-GitSync-Bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Sep 30, 2020
@kunalspathak (Contributor, Author)

@dotnet/jit-contrib, @AndyAyersMS

@AndyAyersMS (Member)

Expected impact is

  • (short term) perf stability for benchmarks that run jitted code with loops
  • (medium term) ability to optimize perf by adjusting alignment of code within methods

Would be good to highlight one or two benchmarks we really expect to stabilize with this (benchmarks with very simple timing payloads).

Note that because of the way BenchmarkDotNet operates, the main "driving" loop for most benchmark runs (which invokes a delegate that runs the method to benchmark) will be jitted and has a loop, so it bypasses tiering. One hypothesis we're exploring is that the varying alignment of this loop over time contributes to some of our observed day-to-day perf fluctuations.

Beyond that, once this is in, it would be good to do a retrospective study of benchmarks after we've gotten a few days' worth of data, and see if we really did move the needle on stability. I had hoped to do that with #2249, but lab collection wasn't in great shape back then; looking back now, it doesn't seem to have had much impact, perhaps because it only affected Tier1 methods.

Any idea what a typical delta this causes on the overall size of the code segment?

@kunalspathak (Contributor, Author)

kunalspathak commented Oct 1, 2020 (edited)

I picked up the LoopReturn benchmark, which was recently found by BenchmarkDotNet to have a 22% regression but is a bimodal test. The benchmark itself has a tight loop on which I can do some experimentation. For the sake of argument, I removed the only condition in the loop code to eliminate any branch-prediction effect. The loop is 17 bytes long, so a single fetch of a 32B line should bring the entire loop code into the decoder and cache it in the DSB.

Another thing I experimented with, along with method alignment, is the effect of loop alignment within the method on benchmark timing. With the help of Andy's NopPadding branch, I added various nop paddings to adjust the loop header to alignments ranging from 0x0 through 0xF, to see whether fixing the method alignment to a 32-byte boundary helps overall execution timing and stability of the benchmark regardless of where the loop is aligned. The results are positive.

All the measurements were done on an Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores.

Stability

Without the no-ops, the loop starts at offset 0x3D, so adding 1 no-op moves the loop to 0x3E, and so forth. In the graph below, the X-axis is the number of nops of padding added in the method and the Y-axis is the standard deviation of benchmark timings measured over 5 iterations. The key observation is that there is much less deviation with 32-byte alignment as compared to the default (16-byte alignment).

With default alignment, the benchmarks are stable for some nop paddings, but in general they deviate if the loop's alignment is not favorable to the DSB line. In contrast, the 32-byte alignment provides stable timings with very little deviation, no matter what the loop alignment is. For 32-byte method alignment, when the loop's alignment reaches 0x40 (after adding 3 through 15 nops), the performance even gets better.

My theory is that after 3 nops are added in a method, the loop (17 bytes long) gets loaded entirely in a single fetch. For example, with 2 nops the loop starts at address 0x3F, which is not a 32-byte boundary, so two fetches might be needed to bring the loop to the decoder. However, after adding, say, 5 nops, although the loop gets aligned to 0x42, I assume the fetch starts from 0x40 (because it is a 32-byte boundary), so the entire loop gets loaded even though the non-loop uops between 0x40–0x41 get fetched too. This performance side-effect is guaranteed because the method is 32-byte aligned. So yes, I think this is a step in the right direction.
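The fetch-count reasoning above can be sketched as a small model, a simplification of 32-byte fetch lines rather than an exact description of the CPU front end:

```python
def fetch_lines(start: int, length: int, line: int = 32) -> int:
    """Number of 32-byte fetch lines spanned by the byte range [start, start + length)."""
    return (start + length - 1) // line - start // line + 1

# The 17-byte loop starting at 0x40 fits in a single fetch line,
# while at 0x3F it straddles a 32-byte boundary and needs two fetches.
```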

[Chart: standard deviation of benchmark timings vs. number of nop paddings, default vs. 32-byte alignment]

Benchmark measurements for default alignment:

| No. of no-ops | iter 1 | iter 2 | iter 3 | iter 4 | iter 5 | Median | Average | Deviation | Loop offset |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 45.18 | 45.36 | 45.19 | 45.19 | 45.11 | 45.19 | 45.206 | 0.082608 | 0x3D |
| 1 | 45.03 | 45.03 | 45.58 | 45.28 | 45.29 | 45.28 | 45.242 | 0.203902 | 0x3E |
| 2 | 46.06 | 45.4 | 45.27 | 45.33 | 45.39 | 45.39 | 45.49 | 0.288791 | 0x3F |
| 3 | 52.59 | 52.79 | 52.59 | 44.78 | 44.94 | 52.59 | 49.538 | 3.820604 | 0x40 |
| 4 | 45.1 | 53.73 | 46.99 | 45.37 | 45.41 | 45.41 | 47.32 | 3.273408 | 0x41 |
| 5 | 52.29 | 52.55 | 45.33 | 52.36 | 52.33 | 52.33 | 50.972 | 2.822413 | 0x42 |
| 6 | 50.85 | 52.47 | 55.56 | 52.3 | 52.46 | 52.46 | 52.728 | 1.540693 | 0x43 |
| 7 | 52.32 | 46.1 | 47.64 | 52.44 | 46.09 | 47.64 | 48.918 | 2.88271 | 0x44 |
| 8 | 45.19 | 45.98 | 52.6 | 45.98 | 46.12 | 45.98 | 47.174 | 2.732761 | 0x45 |
| 9 | 53.38 | 46.77 | 52.3 | 52.5 | 45.4 | 52.3 | 50.07 | 3.302508 | 0x46 |
| 10 | 46.01 | 52.41 | 46.19 | 51.9 | 52.91 | 51.9 | 49.884 | 3.10661 | 0x47 |
| 11 | 52.38 | 52.36 | 52.38 | 52.29 | 52.33 | 52.36 | 52.348 | 0.034293 | 0x48 |
| 12 | 52.6 | 52.6 | 52.81 | 52.61 | 52.61 | 52.61 | 52.646 | 0.082122 | 0x49 |
| 13 | 52.55 | 52.74 | 52.77 | 52.67 | 52.61 | 52.67 | 52.668 | 0.081093 | 0x4A |
| 14 | 52.18 | 52.52 | 52.18 | 45.01 | 45.01 | 52.18 | 49.38 | 3.570249 | 0x4B |
| 15 | 44.99 | 52.32 | 44.97 | 52.55 | 52.49 | 52.32 | 49.464 | 3.661954 | 0x4C |
Benchmark measurements for 32-byte alignment:

| No. of nops | iter 1 | iter 2 | iter 3 | iter 4 | iter 5 | Median | Average | Deviation | Loop offset |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.62 | 52.15 | 52.21 | 52.43 | 52.15 | 52.21 | 52.32666667 | 0.185299757 | 0x3D |
| 1 | 52.07 | 52.18 | 52.01 | 52.15 | 52.25 | 52.07 | 52.08666667 | 0.084 | 0x3E |
| 2 | 51.96 | 51.97 | 52.41 | 52.23 | 52.01 | 51.97 | 52.11333333 | 0.176816289 | 0x3F |
| 3 | 45.91 | 45.85 | 46.47 | 46.09 | 45.99 | 45.91 | 46.07666667 | 0.219308003 | 0x40 |
| 4 | 46.04 | 45.84 | 45.87 | 45.85 | 45.91 | 45.87 | 45.91666667 | 0.073047929 | 0x41 |
| 5 | 45.2 | 45.31 | 45.36 | 45.23 | 45.27 | 45.31 | 45.29 | 0.056780278 | 0x42 |
| 6 | 45.94 | 45.87 | 45.78 | 45.88 | 45.81 | 45.87 | 45.86333333 | 0.056071383 | 0x43 |
| 7 | 45.18 | 45.2 | 45.16 | 45.21 | 45.3 | 45.18 | 45.18 | 0.048166378 | 0x44 |
| 8 | 45.39 | 45.06 | 45.27 | 45.2 | 45.21 | 45.27 | 45.24 | 0.107070071 | 0x45 |
| 9 | 45.21 | 45.27 | 45.27 | 45.27 | 45.51 | 45.27 | 45.25 | 0.104613575 | 0x46 |
| 10 | 45.28 | 45.19 | 45.22 | 45.2 | 45.21 | 45.22 | 45.23 | 0.031622777 | 0x47 |
| 11 | 45.52 | 45.24 | 45.32 | 45.31 | 45.24 | 45.32 | 45.36 | 0.102683981 | 0x48 |
| 12 | 45.25 | 45.16 | 45.22 | 45.18 | 45.19 | 45.22 | 45.21 | 0.031622777 | 0x49 |
| 13 | 45.65 | 45.54 | 45.23 | 45.57 | 45.72 | 45.54 | 45.47333333 | 0.168214149 | 0x4A |
| 14 | 46.87 | 45.6 | 45.44 | 45.55 | 44.98 | 45.6 | 45.97 | 0.630345937 | 0x4B |
| 15 | 45.28 | 45.45 | 45.28 | 45.22 | 45.6 | 45.28 | 45.33666667 | 0.139942845 | 0x4C |
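The Median/Average/Deviation columns can be reproduced from the five iteration timings; the deviation matches the population standard deviation, checked here against the first row of the default-alignment table:

```python
from statistics import mean, median, pstdev

# Row for 0 no-ops, default alignment (from the table above)
iters = [45.18, 45.36, 45.19, 45.19, 45.11]

print(median(iters))              # median of the 5 iterations
print(round(mean(iters), 3))      # average
print(round(pstdev(iters), 6))    # population standard deviation
```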

Performance

I also measured performance for default vs. 32-byte alignment, and I see the same pattern I observed for stability. As soon as the loop gets aligned to a 32-byte boundary (after adding 3+ nops), the performance with 32-byte alignment becomes better. So perhaps, to improve performance as Andy has pointed out multiple times, it will be good to align hot loops at a 32-byte boundary, which in combination with method alignment will give us both stability and performance wins.

| No. of nops | Median (default alignment) | Median (32-byte alignment) | % difference |
|---|---|---|---|
| 0 | 45.18 | 52.21 | 16% |
| 1 | 45.03 | 52.07 | 15% |
| 2 | 46.06 | 51.97 | 14% |
| 3 | 52.59 | 45.91 | -13% |
| 4 | 45.1 | 45.87 | 1% |
| 5 | 52.29 | 45.31 | -13% |
| 6 | 50.85 | 45.87 | -13% |
| 7 | 52.32 | 45.18 | -5% |
| 8 | 45.19 | 45.27 | -2% |
| 9 | 53.38 | 45.27 | -13% |
| 10 | 46.01 | 45.22 | -13% |
| 11 | 52.38 | 45.32 | -13% |
| 12 | 52.6 | 45.22 | -14% |
| 13 | 52.55 | 45.54 | -14% |
| 14 | 52.18 | 45.6 | -13% |
| 15 | 44.99 | 45.28 | -13% |
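The % difference column appears to be the change in the 32-byte-alignment median relative to the default-alignment median. A sketch of that assumed formula (rounding in a few table rows may differ from this computation):

```python
def pct_diff(default_median: float, aligned_median: float) -> int:
    """Percent change of the 32-byte-aligned median relative to the default, rounded."""
    return round((aligned_median - default_median) / default_median * 100)

# Checked against the first and fourth rows of the table above:
print(pct_diff(45.18, 52.21))   # 0 nops
print(pct_diff(52.59, 45.91))   # 3 nops
```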

@kunalspathak (Contributor, Author)

@adamsitnik, @danmosemsft

@AndyAyersMS (Member)

LoopReturn seems like a reasonable choice, though I am also interested in cases where the benchmark payload does not have a loop. Maybe we can find one?

Here's the historical record showing the bimodal behavior, seems like most of the time it gets a bad alignment and is slow.

[Chart: historical LoopReturn timings showing the bimodal behavior]

@danmoseley (Member)

Pretty nice if we get perf for customers as well as stability for us. I didn't expect that: I assumed we would just move perf around when we made alignment stable. So great!


@adamsitnik (Member)

Note that because of the way BenchmarkDotNet operates, the main "driving" loop for most benchmark runs (which invokes a delegate that runs the method to benchmark) will be jitted and has a loop, so it bypasses tiering. One hypothesis we're exploring is that the varying alignment of this loop over time contributes to some of our observed day-to-day perf fluctuations.

If I remember correctly, @AndreyAkinshin added manual loop unrolling to this particular loop to minimize the effects of alignment.

If you pass --keepFiles and open the auto-generated *.notcs file, you should be able to see something like this:

```cs
private void WorkloadActionUnroll(System.Int64 invokeCount)
{
    for (System.Int64 i = 0; i < invokeCount; i++)
    {
        consumer.Consume(workloadDelegate());
        // ... the call above is repeated 16 times per loop iteration
    }
}
```

To disable it, you need to pass --unrollFactor 1:

```cs
private void WorkloadActionUnroll(System.Int64 invokeCount)
{
    for (System.Int64 i = 0; i < invokeCount; i++)
    {
        consumer.Consume(workloadDelegate());
    }
}
```

@adamsitnik (Member)

Key observation is that there is much less deviation with 32-byte alignment as compared to the default (16-byte alignment).

@kunalspathak this looks and sounds awesome!

it will be good to align hot loops at 32-byte boundary which in combination with method alignment will give us both stability and performance wins

this would be perfect!!

FWIW here are some benchmarks that I remember are very dependent on the loop alignment:

  • System.Memory.Span<Char>.IndexOfValue
  • System.Memory.Span<Byte>.BinarySearch

No loops, but super dependent on alignment on x86:

  • Benchstone.BenchI.Fib.Test

@kunalspathak (Contributor, Author)

Any idea what a typical delta this causes on the overall size of the code segment?

I added logging around the change to see how many methods get 32-byte alignment vs. the default alignment, and then ran pmi on the .NET libraries. There are definitely duplicate methods that might get counted in either bucket, but on average, 532,667 methods kept the default alignment while 70,573 methods got 32-byte alignment; i.e., out of 603,240 total methods, 70,573 got 32-byte aligned, which is approx. 11% of them. This is close to what @AndyAyersMS anticipated in #2249 (comment).
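As a quick sanity check of the ratio quoted above:

```python
aligned, total = 70_573, 603_240
share = round(aligned / total * 100, 1)
print(share)  # close to the "approx. 11%" figure
```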

LoopReturn seems like a reasonable choice, though I am also interested cases where the benchmark payload does not have a loop. Maybe we can find one?

I tried gathering results for the Fib benchmark (the only one which is bimodal and has no loops) and I'm seeing mixed results.

master:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---|---|---|---|---|---|
| Fib | 136.0 us | 0.61 us | 0.51 us | 135.8 us | 135.3 us | 137.0 us |
| Fib | 142.2 us | 0.55 us | 0.51 us | 142.2 us | 141.4 us | 143.4 us |
| Fib | 142.3 us | 0.79 us | 0.70 us | 142.3 us | 141.5 us | 143.9 us |
| Fib | 142.2 us | 0.70 us | 0.65 us | 142.0 us | 141.5 us | 143.6 us |
| Fib | 143.2 us | 1.94 us | 1.72 us | 142.8 us | 141.5 us | 147.1 us |

PR:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---|---|---|---|---|---|
| Fib | 142.1 us | 0.82 us | 0.73 us | 141.9 us | 141.3 us | 144.1 us |
| Fib | 142.0 us | 0.51 us | 0.45 us | 142.0 us | 141.4 us | 143.0 us |
| Fib | 142.1 us | 0.54 us | 0.48 us | 142.3 us | 141.3 us | 142.9 us |
| Fib | 142.2 us | 0.70 us | 0.62 us | 142.0 us | 141.5 us | 143.6 us |
| Fib | 149.5 us | 3.73 us | 4.15 us | 147.7 us | 143.2 us | 158.1 us |

@kunalspathak (Contributor, Author)

For x86 as well, I don't see much difference for Fib benchmark on my local machine.

master:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---|---|---|---|---|---|
| Fib | 142.4 us | 1.04 us | 0.87 us | 142.4 us | 141.3 us | 144.2 us |
| Fib | 145.7 us | 2.79 us | 2.99 us | 144.8 us | 142.5 us | 151.9 us |
| Fib | 142.6 us | 0.91 us | 0.81 us | 142.5 us | 141.7 us | 144.5 us |
| Fib | 142.7 us | 0.79 us | 0.70 us | 142.5 us | 141.4 us | 143.8 us |
| Fib | 142.5 us | 0.53 us | 0.49 us | 142.4 us | 141.7 us | 143.4 us |

PR:

| Method | Mean | Error | StdDev | Median | Min | Max |
|---|---|---|---|---|---|---|
| Fib | 142.3 us | 0.77 us | 0.72 us | 142.4 us | 141.3 us | 143.5 us |
| Fib | 142.5 us | 0.96 us | 0.85 us | 142.0 us | 141.6 us | 144.2 us |
| Fib | 143.7 us | 1.34 us | 1.25 us | 143.4 us | 141.7 us | 145.9 us |
| Fib | 142.1 us | 0.45 us | 0.38 us | 142.1 us | 141.5 us | 142.9 us |
| Fib | 144.3 us | 2.08 us | 1.94 us | 143.7 us | 142.1 us | 148.1 us |

@AndyAyersMS (Member)

out of 603,240 total methods, 70,573 methods got 32-byte aligned

If we assume half of these would have been 32 byte aligned anyways, then we have an additional 35K instances of 16 bytes wasted in the code segment, so around 560K bytes of extra allocation.

I'd like to know how this compares to the total amount of code. What was the total code size for the 603,240 methods?

@kunalspathak (Contributor, Author)

out of 603,240 total methods, 70,573 methods got 32-byte aligned

If we assume half of these would have been 32 byte aligned anyways, then we have an additional 35K instances of 16 bytes wasted in the code segment, so around 560K bytes of extra allocation.

That makes sense.

I'd like to know how this compares to the total amount of code. What was the total code size for the 603,240 methods?

The total code size is 51377607 bytes or around 51MB.

@AndyAyersMS (Member)

So perhaps as much as 1% of the code space is now padding. I suppose it's possible the code heap can fit some smaller methods into the larger padding voids.

Would be nice at some point to actually measure the heap sizes instead of trying to do these back of the envelope estimations.

Let's take this change and keep an eye on the perf lab runs to see if we do indeed see increased stability.
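Andy's back-of-the-envelope estimate can be reproduced directly (assumptions as stated: half of the 70,573 newly aligned methods would have landed on a 32-byte boundary anyway, at an average cost of 16 padding bytes each):

```python
extra_aligned = 70_573 // 2            # half assumed already aligned by chance
wasted_bytes = extra_aligned * 16      # average 16 bytes of padding per method
total_code = 51_377_607                # total code size reported above

print(wasted_bytes)                                     # roughly the "560K bytes" estimate
print(round(wasted_bytes / total_code * 100, 1), "%")   # roughly the "1% of code space"
```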


@kunalspathak (Contributor, Author)

Would be nice at some point to actually measure the heap sizes instead of trying to do these back of the envelope estimations.

Yes, I wanted to get the heap size numbers for this one and spent some time debugging allocCode and allocCodeRaw. But getting accurate numbers from the appropriate heap (especially around allocCodeRaw and UnlockedAllocMemForCode_NoThrow) needed some extra special-casing, because those methods are also called from the runtime. I deferred that work since I could get a rough ballpark the other way, but getting the heap size is on my TODO list.

@kunalspathak merged commit 03a0931 into dotnet:master on Oct 7, 2020
@Sergio0694 (Contributor)

This is pretty cool! 😄

@AndyAyersMS Do you think this could be related to the .NET perf regression from #36907? Your last hypothesis was about those switch branches in the hot path possibly not being aligned on .NET 5.

@AndyAyersMS (Member)

This work is a prerequisite for aligning branches -- we plan to explore doing that soon.

I'd expect this change might stabilize performance of your code, but would not necessarily optimize it.

@Sergio0694 (Contributor)

Ah, I see, thank you for the clarification! 😊

Was just curious whether this was at least in part related to that, happy to hear branch alignment work is planned soon!

kunalspathak added a commit to kunalspathak/runtime that referenced this pull request on Oct 23, 2020:

In dotnet#2249, we started doing alignment of methods to a 32-byte boundary for Tier1. However, a method having loops bypasses tiering, and hence this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
@ghost locked as resolved and limited conversation to collaborators on Dec 7, 2020

Reviewers

@AndyAyersMS approved these changes
@BruceForstall approved these changes

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

7 participants

@kunalspathak, @AndyAyersMS, @danmoseley, @adamsitnik, @Sergio0694, @BruceForstall, @Dotnet-GitSync-Bot
