LSRA-throughput: Iterate over the regMaskTP instead of all registers #87424


Merged

kunalspathak merged 12 commits into dotnet:main from kunalspathak:reg-for-loop on Jun 19, 2023

Conversation

@kunalspathak (Contributor) commented on Jun 12, 2023 (edited):

  • In a few places, iterate over the regMaskTP directly instead of iterating over all registers and checking each one against the mask. This removes the impact of adding more registers, because with the changes in this PR we only iterate over the registers of interest.
  • Updated the pattern we use to extract the regNumber from the mask and toggle the corresponding bit in the mask.

Fixes: #87337

@ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jun 12, 2023

@ghost commented:

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@kunalspathak (Author) commented:

Results are very encouraging:

[images: throughput diff results]

However, the Linux native compiler is not happy:

[image: Linux build failure]

@kunalspathak marked this pull request as ready for review on June 13, 2023 22:20
@kunalspathak (Author) commented:

@dotnet/jit-contrib @BruceForstall

@tannergooding (Member) commented:

It's interesting that it's worse for Linux x64 on Linux x64.

I think that means Clang is generating more instructions now than it was before (the Linux x64 on Windows x64 case, where it's an improvement, is built with MSVC).

Do you have an example of the diffs here, and whether it's just Clang/LLVM being extra "clever" and producing something that is faster but has more instructions?

Comment on lines +300 to +302:

    regNumber regNum       = genFirstRegNumFromMask(candidates);
    regMaskTP candidateBit = genRegMask(regNum);
    candidates ^= candidateBit;

@tannergooding: Can this one not be?

Suggested change:

    regNumber regNum = genFirstRegNumFromMaskAndToggle(candidates);

@kunalspathak (Author) replied:

No, because we need candidateBit a few lines below. That's why I cannot use genFirstRegNumFromMaskAndToggle().

Comment on lines +2756 to +2764:

    if (availableRegCount < (sizeof(regMaskTP) * 8))
    {
        // Mask out the bits that are between 64 ~ availableRegCount
        actualRegistersMask = (1ULL << availableRegCount) - 1;
    }
    else
    {
        actualRegistersMask = ~RBM_NONE;
    }

@tannergooding: Why not have this always be actualRegistersMask = (1ULL << availableRegCount) - 1?

That way it's always exactly the bitmask of the actually available registers. No more, no less.

@kunalspathak (Author) replied:

Yes, that's ideally how it should be, but for arm64, availableRegCount == 65 (including REG_STK, etc.). So (1ULL << 65) returns 0x2, and after the - 1, actualRegistersMask becomes 1. The debugger shows the correct value.

[image: debugger watch window]

I am a little confused about why that happens.

@tannergooding replied:

regMaskTP is unsigned __int64 for Arm64, so we can represent at most 64 registers, and therefore 1 << 63 is the highest shift we can safely do: 1 << 64 is overshifting and therefore undefined behavior.

Some compilers will treat overshifting as if we had infinite bits and then truncated. That would make (1 << 65) == 0, then 0 - 1 == -1, which is AllBitsSet. Other compilers instead treat it the way C# and x86/x64 hardware do, masking the shift count, so (1ULL << (65 & 63)) == (1ULL << 1) == 2 and then 2 - 1 == 1; and others still do something completely different.

It looks like this isn't an "issue" today because the register allocator cannot allocate REG_SP itself. It's only used manually by codegenarm64, and so it doesn't need to be included in actualRegistersMask. That makes working around this "simpler", since it's effectively a "special" register like REG_STK.

Short term, we probably want to add an assert to validate that the tracked registers don't exceed 64 bits (that is, ACTUAL_REG_CNT <= 64) and to special-case when it is exactly 64 bits.

Long term, I imagine we want to consider better ways to represent this so we can avoid the problem altogether. Having distinct register files for each category (SIMD/FP vs General/Integer vs Special/Other) is one way. That may also help in other areas where some Integer registers are actually Special registers and cannot be used "generally" (i.e. REG_ZR is effectively reserved and cannot be assigned, just consumed). It would also reduce the cost of various operations in the case where only one register type is being used.

@kunalspathak (Author) replied:

> Some compilers will treat overshifting as if we had infinite bits and then truncated... and others still do something completely different.

That's exactly my understanding. What confuses me is that the compiler exhibits different behavior during execution vs. in the watch window while debugging.

@kunalspathak (Author) replied:

While I agree with your suggestion, for this PR I will keep the code I currently have to handle the arm64 case.

@tannergooding (Member) commented:

Changes overall look good/correct. I had a few open questions on certain parts and on whether we couldn't apply the same optimizations we were doing elsewhere.

@kunalspathak (Author) commented:

> Do you have an example of the diffs here, and whether it's just Clang/LLVM being extra "clever" and producing something that is faster but has more instructions?

I tried my best to check the disassembly, but I was not able to get it reliably. I tried objdump -d libclrjit.so on Release bits, but it doesn't even print the function name before the start of a disassembly section, so I couldn't find the code. I ended up debugging with gdb, putting breakpoints around the area, and copying the disassembly I got from the asm window.

Assembly code for processBlockStartLocations (gdb; <...+NNNN> abbreviates <LinearScan::processBlockStartLocations(BasicBlock*)+NNNN>):

Before:

    0x7fffe7675c21 <...+2961>  cmpl   $0x0,0x1368(%r15)
    0x7fffe7675c29 <...+2969>  je     0x7fffe7675dbd <...+3373>
    0x7fffe7675c2f <...+2975>  lea    0x110(%r15),%rax
    0x7fffe7675c36 <...+2982>  xor    %ecx,%ecx
    0x7fffe7675c38 <...+2984>  xorpd  %xmm0,%xmm0
    0x7fffe7675c3c <...+2988>  jmp    0x7fffe7675c88 <...+3064>
    0x7fffe7675c3e <...+2990>  movq   $0x0,0x18(%rax)
    0x7fffe7675c46 <...+2998>  mov    0x28(%rax),%edx
    0x7fffe7675c49 <...+3001>  movl   $0xffffffff,0x1034(%r15,%rdx,4)
    0x7fffe7675c55 <...+3013>  movq   $0x0,0x1118(%r15,%rdx,8)
    0x7fffe7675c61 <...+3025>  nopw   %cs:0x0(%rax,%rax,1)
    0x7fffe7675c6b <...+3035>  nopl   0x0(%rax,%rax,1)
    0x7fffe7675c70 <...+3040>  add    $0x1,%rcx
    0x7fffe7675c74 <...+3044>  mov    0x1368(%r15),%edx
    0x7fffe7675c7b <...+3051>  add    $0x30,%rax
    0x7fffe7675c7f <...+3055>  cmp    %rdx,%rcx
    0x7fffe7675c82 <...+3058>  jae    0x7fffe7675dbd <...+3373>

After:

    0x7fffe7675c2f <...+3039>  callq  0x7fffe76d6e40 <BitOperations::BitScanForward(unsigned long)>
    0x7fffe7675c34 <...+3044>  mov    %eax,%ecx
    0x7fffe7675c36 <...+3046>  btc    %rax,%r12
    0x7fffe7675c3a <...+3050>  lea    (%rcx,%rcx,2),%rcx
    0x7fffe7675c3e <...+3054>  shl    $0x4,%rcx
    0x7fffe7675c42 <...+3058>  mov    0xf40(%r15),%rbx
    0x7fffe7675c49 <...+3065>  bts    %rax,%rbx
    0x7fffe7675c4d <...+3069>  mov    %rbx,0xf40(%r15)
    0x7fffe7675c54 <...+3076>  mov    0x128(%r15,%rcx,1),%rax
    0x7fffe7675c5c <...+3084>  test   %rax,%rax
    0x7fffe7675c5f <...+3087>  je     0x7fffe7675c27 <...+3031>
    0x7fffe7675c61 <...+3089>  lea    (%r15,%rcx,1),%rdx
    0x7fffe7675c65 <...+3093>  add    $0x128,%rdx

I was able to get something reliable for my 2nd change, and I don't see anything suspicious there:

[image: disassembly comparison]

@tannergooding (Member) commented:

> I don't see anything suspicious here.

Yeah, it looks like the same code, just shuffled around a bit and with different registers. However, it's very odd/unexpected that BitScanForward is not being inlined here. We should annotate the method with __forceinline, since it's just abstracting an intrinsic call.

@kunalspathak (Author) commented:

It is worth pasting the TP gains here before the results get deleted:

[images: throughput improvement results]

    - gtRsvdRegs &= ~tempRegMask;
    - return genRegNumFromMask(tempRegMask);
    + regNumber tempReg = genFirstRegNumFromMask(availableSet);
    + gtRsvdRegs ^= genRegMask(tempReg);
A contributor commented:

Is this actually faster than the previous code? It needs to do either a left shift (on amd64) or a memory lookup (non-amd64). The same question applies to all the places where you introduced genRegMask.

It seems like you're saying b = genRegMask(...) followed by a ^= b is faster than a &= ~b?

The genFirstRegNumFromMaskAndToggle cases seem like a clear win, but I'm not as sure about these.

@tannergooding replied:

a ^= (1 << …) is specially recognized and transformed into btc on xarch. There are sometimes special optimizations possible on Arm64 as well, but in the worst case it is the same number of instructions and execution cost (and often slightly shorter).

@kunalspathak merged commit 60d00ec into dotnet:main on Jun 19, 2023
@kunalspathak deleted the reg-for-loop branch on June 19, 2023 16:40
@ghost locked as resolved and limited conversation to collaborators on Jul 20, 2023

Reviewers

@tannergooding left review comments (+1 more reviewer)
@BruceForstall approved these changes

Assignees

@kunalspathak

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

Milestone

None

Development

Successfully merging this pull request may close these issues:

Optimize the iteration over all registers in various places in LSRA

3 participants: @kunalspathak, @tannergooding, @BruceForstall
