Replace successive "ldr" and "str" instructions with "ldp" and "stp" #77540
Conversation
This change serves to address the following four GitHub tickets:

1. ARM64: Optimize pair of "ldr reg, [fp]" to ldp dotnet#35130
2. ARM64: Optimize pair of "ldr reg, [reg]" to ldp dotnet#35132
3. ARM64: Optimize pair of "str reg, [reg]" to stp dotnet#35133
4. ARM64: Optimize pair of "str reg, [fp]" to stp dotnet#35134

The technique detects an optimisation opportunity as instruction sequences are being generated. The optimised instruction is then generated on top of the previous instruction, with no second instruction generated. Thus, there are no changes to instruction group size at "emission time" and no changes to jump instructions.
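The description above boils down to a handful of pattern-matching conditions on two consecutive instructions. As a rough illustration (a Python model, not the actual C++ emitter code; the tuple encoding, helper name, and 8-byte scale are assumptions for 64-bit accesses), the fusion check might look like:

```python
# Illustrative model of the conditions under which two consecutive
# ldr/str instructions can be fused into one ldp/stp.
# The tuple layout and helper name are invented for this sketch.

def can_fuse(prev, curr, scale=8):
    """prev/curr are (mnemonic, reg, base_reg, offset) tuples."""
    mnem1, reg1, base1, off1 = prev
    mnem2, reg2, base2, off2 = curr
    if mnem1 != mnem2 or mnem1 not in ("ldr", "str"):
        return False
    if base1 != base2:
        return False                  # must address off the same base register
    if off2 - off1 != scale:
        return False                  # accesses must be adjacent in memory
    # ldp destinations must be distinct; reloading into the same reg can't fuse
    if mnem1 == "ldr" and reg1 == reg2:
        return False
    # ldp/stp signed immediate is a 7-bit multiple of the scale
    if not (-64 * scale <= off1 <= 63 * scale):
        return False
    return True

print(can_fuse(("ldr", "x1", "fp", 16), ("ldr", "x2", "fp", 24)))  # True
print(can_fuse(("ldr", "x1", "fp", 16), ("ldr", "x1", "fp", 24)))  # False: same dest
```

The same check covers both the `[fp]` and `[reg]` ticket variants, since only the base register identity matters, not which register it is.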
dnfadmin commented Oct 27, 2022 (edited)
ghost commented Oct 27, 2022
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details: This change serves to address the following four GitHub tickets. A technique was employed that involved detecting an optimisation opportunity as instruction sequences were being generated. The optimised instruction was then generated on top of the previous instruction, with no second instruction generated. Thus, there were no changes to instruction group size at "emission time" and no changes to jump instructions.
a74nh commented Oct 27, 2022
kunalspathak commented Oct 27, 2022
@dotnet/jit-contrib @BruceForstall
kunalspathak commented Oct 27, 2022
While this is definitely a good change, it feels to me that we need to have some common method to do the replacement.
AndyJGraham commented Oct 28, 2022
Hi, Kunal. I am not sure what you mean here. It seems to me that any instruction can either be emitted and added to the instruction group, or used to overwrite the last emitted instruction. I cannot see any way that this can be achieved without altering each emitting function. Can you please advise? Thanks, Andy
kunalspathak commented Oct 28, 2022 (edited)
I think there is some GC tracking missing for the 2nd register. In the below diff, we need to report that both registers hold GC references (windows-arm64 benchmark diff 3861.dasm). Same here (windows-arm64 benchmark diff 26954.dasm). We see the problem towards the end of the method.
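The GC hole described here comes from the fused instruction doing the bookkeeping of only one of the two original loads. A toy model (the names and data structures are invented for this sketch; the real emitter's GC tracking is far more involved) of why both destination registers must be reported:

```python
# Sketch: when two ldrs are fused into an ldp, the GC liveness updates
# that each original ldr would have performed must both still happen.

gc_ref_regs = set()   # registers currently known to hold GC references

def track_load(reg, is_gc_ref):
    """Bookkeeping a single ldr would have done for its destination."""
    if is_gc_ref:
        gc_ref_regs.add(reg)
    else:
        gc_ref_regs.discard(reg)

def track_load_pair(reg1, reg2, gc1, gc2):
    # Fused form: must do the bookkeeping of BOTH original loads.
    track_load(reg1, gc1)
    track_load(reg2, gc2)   # omitting this call is the reported GC hole

track_load_pair("x19", "x20", True, True)
print(sorted(gc_ref_regs))  # ['x19', 'x20']
```

If the second call is dropped, the GC would miss (or stale-track) the reference in the second destination register, which matches the symptom described in the diffs above.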
kunalspathak left a comment
Looking closely at the PR, I think this should be fine to have the logic at emit_R_R_R_I(). Added some comments to point out the missing GC tracking information.
BruceForstall left a comment
I think you should consider a different model where we support "back up" in the emitter.
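A minimal sketch of this "back up" model (illustrative Python, not the real emitter API; all names here are invented): instead of overwriting the previous instruction in place, the emitter exposes a remove-last operation, and the peephole removes the last instruction and re-emits the fused one:

```python
# Toy emitter with a remove-last operation: the peephole backs up over
# the previous ldr and emits an ldp, rather than patching it in place.

class Emitter:
    def __init__(self):
        self.instrs = []

    def emit(self, instr):
        self.instrs.append(instr)

    def remove_last(self):
        return self.instrs.pop()

def emit_load(emitter, reg, base, offset):
    prev = emitter.instrs[-1] if emitter.instrs else None
    # Fuse only adjacent same-base loads with distinct destinations.
    if prev and prev[0] == "ldr" and prev[2] == base \
            and offset - prev[3] == 8 and prev[1] != reg:
        emitter.remove_last()
        emitter.emit(("ldp", prev[1], reg, base, prev[3]))
    else:
        emitter.emit(("ldr", reg, base, offset))

e = Emitter()
emit_load(e, "x1", "fp", 16)
emit_load(e, "x2", "fp", 24)
print(e.instrs)  # [('ldp', 'x1', 'x2', 'fp', 16)]
```

The advantage of this shape is that only the emitter core needs the back-up primitive; individual emitting functions don't each have to know how to overwrite their predecessor.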
BruceForstall commented Oct 29, 2022
It would be useful to include here in the comments a few examples of the asm diff improvements. Also useful would be insight into what could be done to improve codegen related to this in the future (i.e., what cases are we missing after this change?).
BruceForstall commented Nov 2, 2022
@AndyJGraham Given my comment #77540 (comment), I was curious if it would work. I implemented it with https://github.com/BruceForstall/runtime/tree/LdpStp_RemoveLastInstruction on top of this. I think it's the right way to go. It isn't completely bug free yet, though. However, while doing this, I noticed a couple of things about the implementation:
but doesn't handle:
kunalspathak commented Nov 2, 2022
Saw them here too: #77540 (comment)
BruceForstall commented Nov 2, 2022
I think that one has been addressed, but I think there is a potential str/str => stp GC hole.
kunalspathak commented Nov 9, 2022
You might also want to take into account
BruceForstall commented Nov 15, 2022
@AndyJGraham Can you please fix the conflict so tests can run?
BruceForstall commented Feb 6, 2023
@a74nh @AndyJGraham @kunalspathak Assuming the perf regressions in #81551 are directly attributable to this work: are there cases where a single ldp/stp is slower on some platforms than two consecutive ldr/str?
a74nh commented Feb 7, 2023
ldp/stp should never be slower than two consecutive ldr/str.

I ran this myself on an Altra, Ubuntu: 10 times with current head, and 10 times with this ldr/str patch reverted. The regression is reporting going from 4ns to 6ns. Looking at the 100, 1000, 10000 variations, with current HEAD versus the ldp/stp patch reverted, again we're only seeing a few nanoseconds difference on a much larger range.

My gut says we are within variance and this difference should just vanish on the next run of the test suite. How easy is it to rerun the CI for those tests?

I'll give IterateForEach a run and see if I get similar results.
tannergooding commented Feb 7, 2023
It's possible that loop alignment or some other peephole optimization is regressed from the differing instruction. Might be worth getting the disassembly to validate the before/after.
a74nh commented Feb 7, 2023 (edited)
Assembly for the main routine under test hasn't changed at all. (The LDP and STP here are prologue/epilogue entries, outside the scope of this patch.)
a74nh commented Feb 7, 2023
Meanwhile, I'm getting a consistent perf drop of 0.1us for System.Collections.IterateForEach when the optimisation is enabled. Will investigate this a little more.
kunalspathak commented Feb 7, 2023
You might want to check the disassembly of
a74nh commented Feb 8, 2023
Narrowed this down: disabling the LDP/STP optimization only on MoveNext recovers the lost performance.
Full assembly for MoveNext: the LDP is in G_M39015_IG07. This is outside of a loop; the only branches to code after the LDP are error cases.

In the code for MoveNext(), LDP is used for the load of entry.key and entry.value. In the Entry struct, TKey and TValue are both ints, so we shouldn't have any alignment issues within the struct, and I would hope that everything else in the dictionary is generally aligned too.

I'm a little concerned about register dependencies: x1/w1 is being used as both source and dest. But that shouldn't cause any issues. Still digging....
a74nh commented Feb 8, 2023
Looks like @tannergooding was right about the alignment issues.

2 and 3 are the same because they are both doing two loads.

Disassembly with addresses: it looks like what we've now got is that some of those branch targets are misaligned addresses. When the LDP (at 0x0000ffffb74f26a8) is two LDRs, the misaligned addresses become aligned.

I think we need to check that targets of branches are aligned, and if not, insert a NOP to align them - that'll be the start of every basic block whose predecessor isn't the previous block. LLVM already has something for this (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64Subtarget.cpp#L94), and it is quite specific depending on the exact Arm processor.

Before I start trying anything out - has aligning targets been discussed before? Is there already any code which does similar?
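The NOP-insertion fix sketched above amounts to padding branch targets up to a fetch boundary. A small model of the padding computation (assuming 4-byte instructions, and a 16-byte boundary as a guess at the relevant fetch granule - the real choice is microarchitecture-specific, as the LLVM code linked above shows):

```python
# How many NOPs would be needed to align a branch target address.

INSTR_SIZE = 4  # AArch64 instructions are fixed-width 4 bytes

def nops_needed(target_addr, alignment=16):
    """Number of 4-byte NOPs to insert so target_addr becomes aligned."""
    misalignment = target_addr % alignment
    if misalignment == 0:
        return 0
    return (alignment - misalignment) // INSTR_SIZE

print(nops_needed(0x0000ffffb74f26c4))  # address % 16 == 4 -> 3 NOPs
print(nops_needed(0x0000ffffb74f26c0))  # already aligned -> 0
```

Removing one 4-byte instruction (two LDRs becoming one LDP) shifts every later address up by 4, which is exactly how previously aligned branch targets like the ones quoted above become misaligned.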
tannergooding commented Feb 8, 2023
We already have this, but it is a heuristic and isn't always done. CC @kunalspathak
BruceForstall commented Feb 8, 2023
Are you suggesting that the loop back-end branch
BruceForstall commented Feb 8, 2023
I believe the perf jobs run every night. @kunalspathak would know for sure.
BruceForstall commented Feb 8, 2023
Overall, if we can verify that the issue is due to loop alignment (or cache effect or some non-deterministic feature) and there isn't a bad interaction between the peephole optimization and the loop alignment implementation that is breaking the loop alignment implementation, then we might choose to simply ignore a regression. It's still worthwhile doing the due diligence to ensure we understand the regression to the best of our ability.
kunalspathak commented Feb 8, 2023
They are run every few hours on a batch of commits. However, looking at the overall history of the benchmark at https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_arm64_ubuntu%2020.04/System.Hashing.GetStringHashCode(BytesCount%3a%2010).html, the regression looks fairly stable.
We added loop alignment back in .NET 6, and you can read the detailed heuristics in https://devblogs.microsoft.com/dotnet/loop-alignment-in-net-6/. Essentially, we try to align the start of the loop (the target of the backedge) as much as possible, given that it fits various criteria such as the size of the loop body, how much padding is needed, and whether the loop is call-free. We have seen cases in the past where loops in code/benchmarks were aligned and, because of optimizations, stopped being aligned, ending up in a regression. Again, this usually happens because the algorithm decides the loop body is too large for alignment to make sense, or the amount of padding needed to align it is more bytes than we can afford to waste.
I agree.
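The heuristic Kunal describes above can be caricatured in a few lines (the thresholds and 32-byte alignment here are illustrative only, not the actual JIT values):

```python
# Simplified model of the .NET loop-alignment decision: only pad a loop
# start if the body is small enough and the padding is cheap enough.

MAX_LOOP_SIZE = 96   # assumed: don't bother aligning large loop bodies
MAX_PADDING = 12     # assumed: don't waste more than this many bytes

def should_align(loop_start_addr, loop_size, alignment=32):
    """Decide whether padding should be emitted before this loop."""
    padding = (-loop_start_addr) % alignment
    if loop_size > MAX_LOOP_SIZE:
        return False      # body too big for alignment to pay off
    if padding > MAX_PADDING:
        return False      # too many wasted bytes to reach the boundary
    return padding > 0    # already aligned needs no padding

print(should_align(0x1004, loop_size=40))  # needs 28 bytes of padding -> False
print(should_align(0x1018, loop_size=40))  # needs 8 bytes of padding -> True
```

This shape also explains the failure mode described above: an unrelated code-size change can move a loop start so the required padding crosses the threshold, and a previously aligned loop silently stops being aligned.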
kunalspathak commented Feb 8, 2023
Just to be sure, try disabling loop alignment using
BruceForstall commented Feb 8, 2023
@kunalspathak Since this optimization only reduces code size, these cases shouldn't occur, right? Has the alignment padding already been committed when this peephole optimization occurs? Or will the required alignment padding be adjusted after the peep?
kunalspathak commented Feb 8, 2023
Ideally yes, but to confirm that, we need to really make sure that
No, the alignment padding adjustment happens after the peep. |
a74nh commented Feb 9, 2023
Setting this didn't cause any difference. I then hacked coreclr so that it inserted a NOP at the start of the next block after the LDP. And this regained all the lost performance! Back to 2.18ms. Note that this is the only function in which I'm allowing peepholes to occur.
This is with LDP:
This is with LDP and a NOP:
Not quite, it would be the jumps to 0x0000ffffb74f26c4 or 0x0000ffffb74f26dc. My NOP causes both of these to become aligned.
tannergooding commented Feb 9, 2023
I'd also notably expect the
kunalspathak commented Feb 9, 2023
Which means that loop alignment is definitely not affecting it, although your experiment shows that aligning a few places would improve performance (though not necessarily recover what was lost with the ldp change). Basically, if you add

Were they aligned before your ldp change?
a74nh commented Feb 9, 2023
Before the LDP change, they were both aligned. So adding the NOP is making the rest of the instructions sit in the same positions as they did with two LDRs.

I tried some more moving around, and moving the NOP to after the ret (or anywhere else afterwards) drops the performance again, back to 2.3ms. Which is odd, as G_M39015_IG09 is aligned and there is nothing branching to G_M39015_IG08.

I can give this a try too. The next step would be to recreate this as a standalone binary using that block of assembly and get it in a simulator. It might take a bit of time to get it showing the exact same behaviour. If we think it's important enough, then I can give it a go.
BruceForstall commented Feb 11, 2023
It certainly seems like there is some odd micro-architectural effect here. E.g., and this seems like grasping at straws, maybe there's an instruction prefetcher grabbing instructions at the presumably (!) cold, not-taken branch targets that are newly unaligned, causing conflicts with fetching the fall-through path? I'm not sure how much more I would invest in this investigation, although understanding more might save time the next time we see an unexplained regression.
kunalspathak commented Feb 13, 2023 (edited)
kunalspathak commented Feb 13, 2023
I went through the issues and I don't see any other regressions.
a74nh commented Feb 16, 2023
Some updates.....

I extracted the assembly for the entire function into a test program and set some dummy memory values for the dictionary. I ran this on a cycle-accurate simulator for the N1 and extracted the traces (including pipeline stages). I did this once for the program with LDP, and once with an LDP plus a NOP. There was nothing to suggest any difference between the two, except for the NOP adding a slight delay. Sadly, I'm unable to share any of the traces.

What my test app doesn't replicate is the exact memory setup of the coreclr version (e.g. the code has the same alignment but is in a different location; the contents of the dictionary are different and live in a different location). So it's possible this is causing a difference. There are also differences from coreclr (e.g. GC) to take into account.

As a diversion, now that I have some code to insert NOPs in arbitrary places during clr codegen, I experimented a bit more with moving the NOP around. I've annotated the code below with the benchmark result when a NOP (or 2 NOPs, or 3) is placed there. It's hard to make firm statements here, but:
kunalspathak commented Feb 16, 2023
Both the regressions seem to have come back, even though I don't see any PR that would have improved them. The diff range is: 6ad1205...dce07a8

At this point, I won't spend much more time on this, given that your experiments proved it was around general alignment (and not necessarily loop alignment). Thank you @a74nh for spending time investigating.
BruceForstall commented Feb 16, 2023
@a74nh That's some amazing in-depth analysis. I agree with @kunalspathak that it doesn't seem worth spending any more time on it at this point.