NotificationsYou must be signed in to change notification settings
Fork5.1k
Star16.6k

[mono] Fix SpanHelpers.Reverse regression#70650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

adamsitnik merged 3 commits intodotnet:mainfromBrzVlad:fix-span-reverse-perf

Jun 22, 2022

Merged

[mono] Fix SpanHelpers.Reverse regression#70650

adamsitnik merged 3 commits intodotnet:mainfromBrzVlad:fix-span-reverse-perf

Jun 22, 2022

Conversation

Copy link

Member

BrzVlad commentedJun 13, 2022

SpanHelpers.Reverse was optimized recently using vectorized operations. However, the unvectorized path (which is used by wasm for example) became slower. This change uses the old code pattern to reverse the array in non-vectorized case (or the rest of the array in the vectorized case). This is 2-3 times faster on wasm for example.

#64412
dotnet/perf-autofiling-issues#5014

ghost added the area-System.Memory label

Jun 13, 2022

ghost assignedBrzVlad

Jun 13, 2022

Copy link

ghost commentedJun 13, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info inarea-owners.md if you want to be subscribed.

Issue Details

#64412
dotnet/perf-autofiling-issues#5014

Author:	BrzVlad
Assignees:	-
Labels:	`area-System.Memory`
Milestone:	-

Copy link

MemberAuthor

BrzVlad commentedJun 13, 2022

@alexcovington @tannergooding

tannergooding reviewed

Jun 13, 2022

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs

		ref byte last = ref Unsafe.Add(ref buf, length - 1 - i);
		(last, first) = (first, last);
		}
		ReverseInner(ref buf, length);

Copy link

Member

tannergoodingJun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Could you help elaborate why this is faster now?

ReverseInner is effectively doing the same thing as this, just with atemp rather than using a tuple. There is a slight difference in how the addressing will end up as well, but none of this looks like it should be that impactful.

If there is something specifically impacting WASM, then it would be good to ensure we have a bug tracking that because these two loops should be effectively the same. Likewise, we should likely addAggressiveInlining toReverseInner to ensure we aren't paying the cost of a call to handle the at most 7-8 loop trailing iterations on x64/Arm64.

Copy link

Member

jkotasJun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Could you help elaborate why this is faster now?

It avoids this issue#64412 (comment) among other things.

Copy link

Member

tannergoodingJun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I remember that minor issue, but I wouldn't have expected 2-3x perf difference there. It's worse IL but the JIT is also able to optimize it to the same native codegen in many cases (especially for the primitives like we're using here).

This seems like a case where there is likely some WASM specific inefficiency and that's what I'd like to better understand.

Copy link

Member

jkotasJun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Note that this change also introduced 1.67x regression on arm64:#68667.

I agree that it would be nice for the JIT to optimize this to have the same performance as the original loop. None of our code-generators are there (each code-generator for different reasons).

Copy link

Member

tannergoodingJun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can we get some numbers for this change on Arm64 to see if it also addresses the regression?

-- Just noting I'm not pushing back against this change, I think its fine and simplifies things. Just wanting to better understand why there is such a drastic difference here so we know what to look out for in the future.

Copy link

Member

jkotasJun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can we get some numbers for this change on Arm64 to see if it also addresses the regression?

Yes, it is fixing the regression on coreclr arm64. For reference, here is the disassembly of the core loop forReverse<int>:

Current main:

    0x280e91070: lsl    x1, x24, #2    0x280e91074: add    x1, x22, x1    0x280e91078: sub    x3, x2, x24    0x280e9107c: lsl    x3, x3, #2    0x280e91080: add    x3, x3, x22    0x280e91084: ldr    w4, [x1]    0x280e91088: ldr    w5, [x3]    0x280e9108c: str    w4, [x3]    0x280e91090: str    w5, [x1]    0x280e91094: add    x24, x24, #0x1    0x280e91098: cmp    x25, x24    0x280e9109c: b.hi   0x280e91070

This change:

    0x280f41240: ldr    w2, [x0]    0x280f41244: ldr    w3, [x1]    0x280f41248: str    w3, [x0]    0x280f4124c: str    w2, [x1]    0x280f41250: add    x0, x0, #0x4    0x280f41254: sub    x1, x1, #0x4    0x280f41258: cmp    x0, x1    0x280f4125c: b.lo   0x280f41240

The regression is caused by an extra induction variable. RyuJIT is able to optimize out the extra local, but I do not expect that Mono is always able to optimize out the extra local (interpreter in particular).

Copy link

MemberAuthor

BrzVladJun 14, 2022•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

For reference, here is the code generated by the interpreter for the loop in question. The generated code is fairly similar with the arm64 code Jan posted above, with some additional redundancy in the unoptimized case. The main problem is thatfirst andlast are recomputed with every loop iteration, relative toi.

Current main:

IR_0007: conv.i8.i4     [48 <- 16],     // Computation of `first`, extra calculation due to non-constant indexerIR_000a: mul.i8.imm     [48 <- 48], 4   //IR_000e: add.i8         [24 <- 0 48],   /////////////////////IR_0012: sub1.i4        [48 <- 8],      // Computation of `last`, extra calculation due to non-constant indexerIR_0015: sub.i4         [48 <- 48 16],  //IR_0019: conv.i8.i4     [48 <- 48],     //IR_001c: mul.i8.imm     [48 <- 48], 4   //IR_0020: add.i8         [48 <- 0 48],   ///////////////////IR_0024: ldind.i4       [32 <- 24],IR_0027: ldind.i4       [40 <- 48],IR_002a: stind.i4       [nil <- 48 32],IR_002d: stind.i4       [nil <- 24 40],IR_0030: add1.i4        [16 <- 16],     // index must be incrementedIR_0033: ldc.i4.2       [48 <- nil],    // Division of length should be hoisted out of the loop, optimization not supported by interp yetIR_0035: div.i4         [48 <- 8 48],   //IR_0039: blt.i4.sp      [nil <- 16 48], IR_0007

This change:

IR_0015: ldind.i4       [32 <- 16],IR_0018: ldind.i4       [40 <- 24],IR_001b: stind.i4       [nil <- 16 40],IR_001e: stind.i4       [nil <- 24 32],IR_0021: add.i8.imm     [16 <- 16], 4IR_0025: add.i8.imm     [24 <- 24], -4IR_0029: clt.un.i8      [40 <- 16 24],IR_002d: brtrue.i4.sp   [nil <- 40], IR_0015

SamMonoRT approved these changes

Jun 14, 2022

View reviewed changes

lewing reviewed

Jun 14, 2022

View reviewed changes

Copy link

Member

lewing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Looks like you need to guard for the length == 0 case

runfoappbot mentioned this pull request

Jun 15, 2022

system.net.http.functional.tests crashing#70757

Closed

lewing approved these changes

Jun 15, 2022

View reviewed changes

Copy link

Member

radical commentedJun 16, 2022

/azp run runtime-wasm

Copy link

azure-pipelinesbot commentedJun 16, 2022

Azure Pipelines successfully started running 1 pipeline(s).

BrzVlad added3 commits

June 20, 2022 11:31

[mono] Fix SpanHelpers.Reverse regression

ac420d4

SpanHelpers.Reverse was optimized recently using vectorized operations. However, the slow path (which is used by wasm for example) became slower. This change uses the old code pattern to reverse the array in non-vectorized scenario (or the rest of the array in the vectorized scenario). This is 2-3 times faster on wasm for example.

Mark method for inlining

6a7f8c4

Add check for minimum length

80f701c

This method can now be called with length 0

BrzVlad force-pushed thefix-span-reverse-perf branch from3ec80c0 to80f701cCompare

June 20, 2022 08:31

tannergooding reviewed

Jun 20, 2022

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.cs

		private static void ReverseInner<T>(ref T elements, nuint length)
		{
		Debug.Assert(length > 0);
		if (length <= 1)

Copy link

Member

tannergoodingJun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This can beif (length == 0) { return } since zero is the only other value it could be here. It will also be a more efficient check.

Copy link

MemberAuthor

BrzVladJun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I added comparison with 1 since reversing one element is a nop, but it probably doesn't really make a difference.

Copy link

Member

jkotas commentedJun 20, 2022

#70944 is refactoring this in more significant ways.

Copy link

Member

SamMonoRT commentedJun 21, 2022

#70944 is refactoring this in more significant ways.

Since this seems ready, should we go ahead and merge this ?

tannergooding approved these changes

Jun 21, 2022

View reviewed changes

adamsitnik approved these changes

Jun 22, 2022

View reviewed changes

Copy link

Member

adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM, thank you@BrzVlad !

adamsitnik merged commit2a680c1 intodotnet:main

Jun 22, 2022

adamsitnik added this to the7.0.0 milestone

Jun 22, 2022

This was referencedJun 22, 2022

Improve Span.Reverse fast path performance#70944

Merged

Regressions in System.Tests.Perf_Array for ARM64#68667

Closed

Copy link

Member

EgorBo commentedJun 23, 2022

Improvements on win-arm64:dotnet/perf-autofiling-issues#6342

kunalspathak mentioned this pull request

Jul 7, 2022

[Perf] Changes at 6/22/2022 1:08:18 PMdotnet/perf-autofiling-issues#6488

Closed

ghost locked asresolvedand limited conversation to collaborators

Jul 23, 2022

Labels

area-System.Memory

8 participants

Movatterモバイル変換

[mono] Fix SpanHelpers.Reverse regression#70650

[mono] Fix SpanHelpers.Reverse regression#70650

Uh oh!

Conversation

BrzVlad commentedJun 13, 2022

Uh oh!

ghost commentedJun 13, 2022

Uh oh!

BrzVlad commentedJun 13, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

BrzVladJun 14, 2022• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lewing left a comment

Choose a reason for hiding this comment

Uh oh!

radical commentedJun 16, 2022

Uh oh!

azure-pipelinesbot commentedJun 16, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkotas commentedJun 20, 2022

Uh oh!

SamMonoRT commentedJun 21, 2022

Uh oh!

adamsitnik left a comment

Choose a reason for hiding this comment

Uh oh!

EgorBo commentedJun 23, 2022

Uh oh!

Uh oh!

BrzVladJun 14, 2022•
edited
Loading