LSRA-throughput: Iterate over the regMaskTP instead of all registers #87424


Merged

kunalspathak merged 12 commits into dotnet:main from kunalspathak:reg-for-loop on Jun 19, 2023

Conversation

@kunalspathak (Contributor) commented on Jun 12, 2023 (edited):

  • In a few places, iterate over the regMaskTP directly instead of iterating over all registers and checking each one against the mask. This removes the impact of adding more registers, because with the changes in this PR we only iterate over the registers of interest.
  • Updated the pattern we use to extract the regNumber from the mask and toggle the corresponding bit in the mask.

Fixes: #87337

@ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jun 12, 2023

@ghost commented:

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@kunalspathak (Author) commented:

Results are very encouraging:

[images: throughput diff results]

However, the Linux native compiler is not happy:

[image: Linux build failure]

@kunalspathak marked this pull request as ready for review on June 13, 2023 22:20
@kunalspathak (Author) commented:

@dotnet/jit-contrib @BruceForstall

@tannergooding (Member) commented:

It's interesting that it's worse for Linux x64 on Linux x64.

I think that means Clang is generating more instructions now than it was before (the Linux x64 on Windows x64 case, where it's an improvement, is built with MSVC).

Do you have an example of the diffs here, and whether it's just Clang/LLVM being extra "clever" and producing something that is faster but has more instructions?

Comment on lines +300 to +302:

    regNumber regNum       = genFirstRegNumFromMask(candidates);
    regMaskTP candidateBit = genRegMask(regNum);
    candidates ^= candidateBit;

@tannergooding: Can this one not be?

Suggested change:

    regNumber regNum = genFirstRegNumFromMaskAndToggle(candidates);

@kunalspathak (Author) replied:

No, because we need candidateBit a few lines below. That's why I cannot use genFirstRegNumFromMaskAndToggle().

Comment on lines +2756 to +2764:

    if (availableRegCount < (sizeof(regMaskTP) * 8))
    {
        // Mask out the bits that are between 64 ~ availableRegCount
        actualRegistersMask = (1ULL << availableRegCount) - 1;
    }
    else
    {
        actualRegistersMask = ~RBM_NONE;
    }

@tannergooding: Why not have this always be actualRegistersMask = (1ULL << availableRegCount) - 1?

That way it's always exactly the bitmask of the actually available registers. No more, no less.

@kunalspathak (Author) replied:

Yes, that's ideally how it should be, but for arm64, availableRegCount == 65 (including REG_STK, etc.). So (1ULL << 65) returns 0x2, and after the - 1, actualRegistersMask becomes 1. The debugger shows the correct value.

[image: debugger watch window]

I am a little confused about why that happens.

@tannergooding replied:

regMaskTP is unsigned __int64 for Arm64, so we can represent at most 64 registers, and therefore 1 << 63 is the highest shift we can safely do: 1 << 64 is overshifting and therefore undefined behavior.

Some compilers will treat overshifting as if we had infinite bits and then truncated. That would make (1 << 65) == 0, then 0 - 1 == -1, which is AllBitsSet. Other compilers instead treat it the way C# and x86/x64 hardware do, masking the shift count, so (1ULL << (65 & 63)) == (1ULL << 1) == 2 and then 2 - 1 == 1; and others still do something completely different.

It looks like this isn't an "issue" today because the register allocator cannot allocate REG_SP itself. It's only used manually by codegenarm64, and so it doesn't need to be included in actualRegistersMask. That makes working around this "simpler", since it's effectively a "special" register like REG_STK.

Short term, we probably want to add an assert to validate that the tracked registers don't exceed 64 bits (that is, ACTUAL_REG_CNT <= 64) and to special-case when it is exactly 64 bits.

Long term, I imagine we want to consider better ways to represent this so we can avoid the problem altogether. Having distinct register files for each category (SIMD/FP vs General/Integer vs Special/Other) is one way. That may also help in other areas where some Integer registers are actually Special registers and cannot be used "generally" (i.e. REG_ZR is effectively reserved and cannot be assigned, just consumed). It would also reduce the cost of various operations in the case where only one register type is being used.

@kunalspathak (Author) replied:

> Some compilers will treat overshifting as if we had infinite bits and then truncated... and others still do something completely different.

That's exactly my understanding. What confuses me is that the compiler exhibits different behavior during execution vs. in the watch window while debugging.

@kunalspathak (Author) replied:

While I agree with your suggestion, for this PR I will keep the code I currently have to handle the arm64 case.

@tannergooding (Member) commented:

Changes overall look good/correct. I had a few open questions on certain parts and on whether we couldn't apply the same optimizations we were doing elsewhere.

@kunalspathak (Author) commented:

> Do you have an example of the diffs here, and whether it's just Clang/LLVM being extra "clever" and producing something that is faster but has more instructions?

I tried my best to check the disassembly, but I was not able to get it reliably. I tried objdump -d libclrjit.so on Release bits, but it doesn't even print the function name before the start of a disassembly section, so I couldn't find the code. I ended up debugging with gdb, putting breakpoints around the area, and copying the disassembly I got from the asm window.

Assembly code for processBlockStartLocations (gdb; <...+NNNN> abbreviates <LinearScan::processBlockStartLocations(BasicBlock*)+NNNN>):

Before:

    0x7fffe7675c21 <...+2961>  cmpl   $0x0,0x1368(%r15)
    0x7fffe7675c29 <...+2969>  je     0x7fffe7675dbd <...+3373>
    0x7fffe7675c2f <...+2975>  lea    0x110(%r15),%rax
    0x7fffe7675c36 <...+2982>  xor    %ecx,%ecx
    0x7fffe7675c38 <...+2984>  xorpd  %xmm0,%xmm0
    0x7fffe7675c3c <...+2988>  jmp    0x7fffe7675c88 <...+3064>
    0x7fffe7675c3e <...+2990>  movq   $0x0,0x18(%rax)
    0x7fffe7675c46 <...+2998>  mov    0x28(%rax),%edx
    0x7fffe7675c49 <...+3001>  movl   $0xffffffff,0x1034(%r15,%rdx,4)
    0x7fffe7675c55 <...+3013>  movq   $0x0,0x1118(%r15,%rdx,8)
    0x7fffe7675c61 <...+3025>  nopw   %cs:0x0(%rax,%rax,1)
    0x7fffe7675c6b <...+3035>  nopl   0x0(%rax,%rax,1)
    0x7fffe7675c70 <...+3040>  add    $0x1,%rcx
    0x7fffe7675c74 <...+3044>  mov    0x1368(%r15),%edx
    0x7fffe7675c7b <...+3051>  add    $0x30,%rax
    0x7fffe7675c7f <...+3055>  cmp    %rdx,%rcx
    0x7fffe7675c82 <...+3058>  jae    0x7fffe7675dbd <...+3373>

After:

    0x7fffe7675c2f <...+3039>  callq  0x7fffe76d6e40 <BitOperations::BitScanForward(unsigned long)>
    0x7fffe7675c34 <...+3044>  mov    %eax,%ecx
    0x7fffe7675c36 <...+3046>  btc    %rax,%r12
    0x7fffe7675c3a <...+3050>  lea    (%rcx,%rcx,2),%rcx
    0x7fffe7675c3e <...+3054>  shl    $0x4,%rcx
    0x7fffe7675c42 <...+3058>  mov    0xf40(%r15),%rbx
    0x7fffe7675c49 <...+3065>  bts    %rax,%rbx
    0x7fffe7675c4d <...+3069>  mov    %rbx,0xf40(%r15)
    0x7fffe7675c54 <...+3076>  mov    0x128(%r15,%rcx,1),%rax
    0x7fffe7675c5c <...+3084>  test   %rax,%rax
    0x7fffe7675c5f <...+3087>  je     0x7fffe7675c27 <...+3031>
    0x7fffe7675c61 <...+3089>  lea    (%r15,%rcx,1),%rdx
    0x7fffe7675c65 <...+3093>  add    $0x128,%rdx

I was able to get something reliable for my 2nd change, and I don't see anything suspicious there:

[image: disassembly comparison]

@tannergooding (Member) commented:

> I don't see anything suspicious here.

Yeah, it looks like the same code, just shuffled around a bit and with different registers. However, it's very odd/unexpected that BitScanForward is not being inlined here. We should annotate the method with __forceinline, since it's just abstracting an intrinsic call.

@kunalspathak (Author) commented:

It is worth pasting the TP gains here before the results get deleted:

[images: throughput improvement results]

    - gtRsvdRegs &= ~tempRegMask;
    - return genRegNumFromMask(tempRegMask);
    + regNumber tempReg = genFirstRegNumFromMask(availableSet);
    + gtRsvdRegs ^= genRegMask(tempReg);
A contributor commented:

Is this actually faster than the previous code? It needs to do either a left shift (on amd64) or a memory lookup (non-amd64). The same question applies to all the places where you introduced genRegMask.

It seems like you're saying b = genRegMask(...) followed by a ^= b is faster than a &= ~b?

The genFirstRegNumFromMaskAndToggle cases seem like a clear win, but I'm not as sure about these.

@tannergooding replied:

a ^= (1 << …) is specially recognized and transformed into btc on xarch. There are sometimes special optimizations possible on Arm64 as well, but in the worst case it is the same number of instructions and execution cost (and often slightly shorter).

@kunalspathak merged commit 60d00ec into dotnet:main on Jun 19, 2023
@kunalspathak deleted the reg-for-loop branch on June 19, 2023 16:40
@ghost locked as resolved and limited conversation to collaborators on Jul 20, 2023

Reviewers

@tannergooding left review comments (+1 more reviewer)
@BruceForstall approved these changes

Assignees

@kunalspathak

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

Milestone

None

Development

Successfully merging this pull request may close these issues:

Optimize the iteration over all registers in various places in LSRA

3 participants: @kunalspathak, @tannergooding, @BruceForstall
