Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

[mono] Fix SpanHelpers.Reverse regression#70650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
adamsitnik merged 3 commits intodotnet:mainfromBrzVlad:fix-span-reverse-perf
Jun 22, 2022

Conversation

BrzVlad
Copy link
Member

SpanHelpers.Reverse was optimized recently using vectorized operations. However, the unvectorized path (which is used by wasm for example) became slower. This change uses the old code pattern to reverse the array in non-vectorized case (or the rest of the array in the vectorized case). This is 2-3 times faster on wasm for example.

#64412
dotnet/perf-autofiling-issues#5014

@ghostghost added the area-System.Memory labelJun 13, 2022
@ghostghost assignedBrzVladJun 13, 2022
@ghost
Copy link

Tagging subscribers to this area: @dotnet/area-system-memory
See info inarea-owners.md if you want to be subscribed.

Issue Details

SpanHelpers.Reverse was optimized recently using vectorized operations. However, the unvectorized path (which is used by wasm for example) became slower. This change uses the old code pattern to reverse the array in non-vectorized case (or the rest of the array in the vectorized case). This is 2-3 times faster on wasm for example.

#64412
dotnet/perf-autofiling-issues#5014

Author:BrzVlad
Assignees:-
Labels:

area-System.Memory

Milestone:-

@BrzVlad
Copy link
MemberAuthor

ref byte last = ref Unsafe.Add(ref buf, length - 1 - i);
(last, first) = (first, last);
}
ReverseInner(ref buf, length);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Could you help elaborate why this is faster now?

ReverseInner is effectively doing the same thing as this, just with atemp rather than using a tuple. There is a slight difference in how the addressing will end up as well, but none of this looks like it should be that impactful.

If there is something specifically impacting WASM, then it would be good to ensure we have a bug tracking that because these two loops should be effectively the same. Likewise, we should likely addAggressiveInlining toReverseInner to ensure we aren't paying the cost of a call to handle the at most 7-8 loop trailing iterations on x64/Arm64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Could you help elaborate why this is faster now?

It avoids this issue#64412 (comment) among other things.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I remember that minor issue, but I wouldn't have expected 2-3x perf difference there. It's worse IL but the JIT is also able to optimize it to the same native codegen in many cases (especially for the primitives like we're using here).

This seems like a case where there is likely some WASM specific inefficiency and that's what I'd like to better understand.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Note that this change also introduced 1.67x regression on arm64:#68667.

I agree that it would be nice for the JIT to optimize this to have the same performance as the original loop. None of our code-generators are there (each code-generator for different reasons).

tannergooding reacted with thumbs up emoji

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can we get some numbers for this change on Arm64 to see if it also addresses the regression?

-- Just noting I'm not pushing back against this change, I think its fine and simplifies things. Just wanting to better understand why there is such a drastic difference here so we know what to look out for in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can we get some numbers for this change on Arm64 to see if it also addresses the regression?

Yes, it is fixing the regression on coreclr arm64. For reference, here is the disassembly of the core loop forReverse<int>:

Current main:

    0x280e91070: lsl    x1, x24, #2    0x280e91074: add    x1, x22, x1    0x280e91078: sub    x3, x2, x24    0x280e9107c: lsl    x3, x3, #2    0x280e91080: add    x3, x3, x22    0x280e91084: ldr    w4, [x1]    0x280e91088: ldr    w5, [x3]    0x280e9108c: str    w4, [x3]    0x280e91090: str    w5, [x1]    0x280e91094: add    x24, x24, #0x1    0x280e91098: cmp    x25, x24    0x280e9109c: b.hi   0x280e91070

This change:

    0x280f41240: ldr    w2, [x0]    0x280f41244: ldr    w3, [x1]    0x280f41248: str    w3, [x0]    0x280f4124c: str    w2, [x1]    0x280f41250: add    x0, x0, #0x4    0x280f41254: sub    x1, x1, #0x4    0x280f41258: cmp    x0, x1    0x280f4125c: b.lo   0x280f41240

The regression is caused by an extra induction variable. RyuJIT is able to optimize out the extra local, but I do not expect that Mono is always able to optimize out the extra local (interpreter in particular).

Copy link
MemberAuthor

@BrzVladBrzVladJun 14, 2022
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

For reference, here is the code generated by the interpreter for the loop in question. The generated code is fairly similar with the arm64 code Jan posted above, with some additional redundancy in the unoptimized case. The main problem is thatfirst andlast are recomputed with every loop iteration, relative toi.

Current main:

IR_0007: conv.i8.i4     [48 <- 16],     // Computation of `first`, extra calculation due to non-constant indexerIR_000a: mul.i8.imm     [48 <- 48], 4   //IR_000e: add.i8         [24 <- 0 48],   /////////////////////IR_0012: sub1.i4        [48 <- 8],      // Computation of `last`, extra calculation due to non-constant indexerIR_0015: sub.i4         [48 <- 48 16],  //IR_0019: conv.i8.i4     [48 <- 48],     //IR_001c: mul.i8.imm     [48 <- 48], 4   //IR_0020: add.i8         [48 <- 0 48],   ///////////////////IR_0024: ldind.i4       [32 <- 24],IR_0027: ldind.i4       [40 <- 48],IR_002a: stind.i4       [nil <- 48 32],IR_002d: stind.i4       [nil <- 24 40],IR_0030: add1.i4        [16 <- 16],     // index must be incrementedIR_0033: ldc.i4.2       [48 <- nil],    // Division of length should be hoisted out of the loop, optimization not supported by interp yetIR_0035: div.i4         [48 <- 8 48],   //IR_0039: blt.i4.sp      [nil <- 16 48], IR_0007

This change:

IR_0015: ldind.i4       [32 <- 16],IR_0018: ldind.i4       [40 <- 24],IR_001b: stind.i4       [nil <- 16 40],IR_001e: stind.i4       [nil <- 24 32],IR_0021: add.i8.imm     [16 <- 16], 4IR_0025: add.i8.imm     [24 <- 24], -4IR_0029: clt.un.i8      [40 <- 16 24],IR_002d: brtrue.i4.sp   [nil <- 40], IR_0015

tannergooding reacted with thumbs up emoji
Copy link
Member

@lewinglewing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Looks like you need to guard for the length == 0 case

@radical
Copy link
Member

/azp run runtime-wasm

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

SpanHelpers.Reverse was optimized recently using vectorized operations. However, the slow path (which is used by wasm for example) became slower. This change uses the old code pattern to reverse the array in non-vectorized scenario (or the rest of the array in the vectorized scenario). This is 2-3 times faster on wasm for example.
This method can now be called with length 0
@BrzVladBrzVladforce-pushed thefix-span-reverse-perf branch from3ec80c0 to80f701cCompareJune 20, 2022 08:31
private static void ReverseInner<T>(ref T elements, nuint length)
{
Debug.Assert(length > 0);
if (length <= 1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This can beif (length == 0) { return } since zero is the only other value it could be here. It will also be a more efficient check.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I added comparison with 1 since reversing one element is a nop, but it probably doesn't really make a difference.

@jkotas
Copy link
Member

#70944 is refactoring this in more significant ways.

SamMonoRT reacted with thumbs up emoji

@SamMonoRT
Copy link
Member

#70944 is refactoring this in more significant ways.

Since this seems ready, should we go ahead and merge this ?

Copy link
Member

@adamsitnikadamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM, thank you@BrzVlad !

@EgorBo
Copy link
Member

Improvements on win-arm64:dotnet/perf-autofiling-issues#6342

@ghostghost locked asresolvedand limited conversation to collaboratorsJul 23, 2022
Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.
Reviewers

@jkotasjkotasjkotas left review comments

@lewinglewinglewing approved these changes

@adamsitnikadamsitnikadamsitnik approved these changes

@tannergoodingtannergoodingtannergooding approved these changes

@SamMonoRTSamMonoRTSamMonoRT approved these changes

Assignees

@BrzVladBrzVlad

Projects
None yet
Milestone
7.0.0
Development

Successfully merging this pull request may close these issues.

8 participants
@BrzVlad@radical@jkotas@SamMonoRT@EgorBo@lewing@adamsitnik@tannergooding

[8]ページ先頭

©2009-2025 Movatter.jp