Replace successive "ldr" and "str" instructions with "ldp" and "stp" #77540
Conversation
This change serves to address the following four GitHub tickets:

1. ARM64: Optimize pair of "ldr reg, [fp]" to ldp dotnet#35130
2. ARM64: Optimize pair of "ldr reg, [reg]" to ldp dotnet#35132
3. ARM64: Optimize pair of "str reg, [reg]" to stp dotnet#35133
4. ARM64: Optimize pair of "str reg, [fp]" to stp dotnet#35134

The technique detects an optimisation opportunity as instruction sequences are being generated. The optimised instruction is then generated on top of the previous instruction, with no second instruction generated. Thus, there are no changes to instruction group size at "emission time" and no changes to jump instructions.
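The description above boils down to a handful of pattern-matching conditions on two consecutive instructions. As a rough illustration (a Python model, not the actual C++ emitter code; the tuple encoding, helper name, and 8-byte scale are assumptions for 64-bit accesses), the fusion check might look like:

```python
# Illustrative model of the conditions under which two consecutive
# ldr/str instructions can be fused into one ldp/stp.
# The tuple layout and helper name are invented for this sketch.

def can_fuse(prev, curr, scale=8):
    """prev/curr are (mnemonic, reg, base_reg, offset) tuples."""
    mnem1, reg1, base1, off1 = prev
    mnem2, reg2, base2, off2 = curr
    if mnem1 != mnem2 or mnem1 not in ("ldr", "str"):
        return False
    if base1 != base2:
        return False                  # must address off the same base register
    if off2 - off1 != scale:
        return False                  # accesses must be adjacent in memory
    # ldp destinations must be distinct; reloading into the same reg can't fuse
    if mnem1 == "ldr" and reg1 == reg2:
        return False
    # ldp/stp signed immediate is a 7-bit multiple of the scale
    if not (-64 * scale <= off1 <= 63 * scale):
        return False
    return True

print(can_fuse(("ldr", "x1", "fp", 16), ("ldr", "x2", "fp", 24)))  # True
print(can_fuse(("ldr", "x1", "fp", 16), ("ldr", "x1", "fp", 24)))  # False: same dest
```

The same check covers both the `[fp]` and `[reg]` ticket variants, since only the base register identity matters, not which register it is.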
dnfadmin commented Oct 27, 2022 (edited)
ghost commented Oct 27, 2022
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details: This change serves to address the following four GitHub tickets. A technique was employed that involved detecting an optimisation opportunity as instruction sequences were being generated. The optimised instruction was then generated on top of the previous instruction, with no second instruction generated. Thus, there were no changes to instruction group size at "emission time" and no changes to jump instructions.
a74nh commented Oct 27, 2022
kunalspathak commented Oct 27, 2022
@dotnet/jit-contrib @BruceForstall
kunalspathak commented Oct 27, 2022
While this is definitely a good change, it feels to me that we need to have some common method to do the replacement.
AndyJGraham commented Oct 28, 2022
Hi, Kunal. I am not sure what you mean here. It seems to me that any instruction can either be emitted and added to the instruction group, or used to overwrite the last emitted instruction. I cannot see any way that this can be achieved without altering each emitting function. Can you please advise? Thanks, Andy
kunalspathak commented Oct 28, 2022 (edited)
I think there is some GC tracking missing for the 2nd register. In the below diff, we need to report that both registers hold GC references (windows-arm64 benchmark diff 3861.dasm). Same here (windows-arm64 benchmark diff 26954.dasm). We see the problem towards the end of the method.
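The GC hole described here comes from the fused instruction doing the bookkeeping of only one of the two original loads. A toy model (the names and data structures are invented for this sketch; the real emitter's GC tracking is far more involved) of why both destination registers must be reported:

```python
# Sketch: when two ldrs are fused into an ldp, the GC liveness updates
# that each original ldr would have performed must both still happen.

gc_ref_regs = set()   # registers currently known to hold GC references

def track_load(reg, is_gc_ref):
    """Bookkeeping a single ldr would have done for its destination."""
    if is_gc_ref:
        gc_ref_regs.add(reg)
    else:
        gc_ref_regs.discard(reg)

def track_load_pair(reg1, reg2, gc1, gc2):
    # Fused form: must do the bookkeeping of BOTH original loads.
    track_load(reg1, gc1)
    track_load(reg2, gc2)   # omitting this call is the reported GC hole

track_load_pair("x19", "x20", True, True)
print(sorted(gc_ref_regs))  # ['x19', 'x20']
```

If the second call is dropped, the GC would miss (or stale-track) the reference in the second destination register, which matches the symptom described in the diffs above.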
kunalspathak left a comment
Looking closely at the PR, I think this should be fine to have the logic at emit_R_R_R_I(). Added some comments to point out the missing GC tracking information.
BruceForstall left a comment
I think you should consider a different model where we support "back up" in the emitter.
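A minimal sketch of this "back up" model (illustrative Python, not the real emitter API; all names here are invented): instead of overwriting the previous instruction in place, the emitter exposes a remove-last operation, and the peephole removes the last instruction and re-emits the fused one:

```python
# Toy emitter with a remove-last operation: the peephole backs up over
# the previous ldr and emits an ldp, rather than patching it in place.

class Emitter:
    def __init__(self):
        self.instrs = []

    def emit(self, instr):
        self.instrs.append(instr)

    def remove_last(self):
        return self.instrs.pop()

def emit_load(emitter, reg, base, offset):
    prev = emitter.instrs[-1] if emitter.instrs else None
    # Fuse only adjacent same-base loads with distinct destinations.
    if prev and prev[0] == "ldr" and prev[2] == base \
            and offset - prev[3] == 8 and prev[1] != reg:
        emitter.remove_last()
        emitter.emit(("ldp", prev[1], reg, base, prev[3]))
    else:
        emitter.emit(("ldr", reg, base, offset))

e = Emitter()
emit_load(e, "x1", "fp", 16)
emit_load(e, "x2", "fp", 24)
print(e.instrs)  # [('ldp', 'x1', 'x2', 'fp', 16)]
```

The advantage of this shape is that only the emitter core needs the back-up primitive; individual emitting functions don't each have to know how to overwrite their predecessor.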
BruceForstall commented Oct 29, 2022
It would be useful to include here in the comments a few examples of the asm diff improvements. Also useful would be insight into what could be done to improve codegen related to this in the future (i.e., what cases are we missing after this change?).
BruceForstall commented Nov 2, 2022
@AndyJGraham Given my comment #77540 (comment), I was curious if it would work. I implemented it with https://github.com/BruceForstall/runtime/tree/LdpStp_RemoveLastInstruction on top of this. I think it's the right way to go. It isn't completely bug free yet, though. However, while doing this, I noticed a couple of things about the implementation:
but doesn't handle:
kunalspathak commented Nov 2, 2022
Saw them here too: #77540 (comment)
BruceForstall commented Nov 2, 2022
I think that one has been addressed, but I think there is a potential str/str => stp GC hole.
kunalspathak commented Nov 9, 2022
You might also want to take into account
BruceForstall commented Nov 15, 2022
@AndyJGraham Can you please fix the conflict so tests can run?
BruceForstall commented Feb 6, 2023
@a74nh @AndyJGraham @kunalspathak Assuming the perf regressions in #81551 are directly attributable to this work: are there cases where a single ldp/stp is slower on some platforms than two consecutive ldr/str?
a74nh commented Feb 7, 2023
ldp/stp should never be slower than two consecutive ldr/str.

I ran this myself on an Altra, Ubuntu: 10 times with current head, and 10 times with this ldr/str patch reverted. The regression is reporting going from 4ns to 6ns. Looking at the 100, 1000, 10000 variations, with current HEAD versus the ldp/stp patch reverted, again we're only seeing a few nanoseconds difference on a much larger range.

My gut says we are within variance and this difference should just vanish on the next run of the test suite. How easy is it to rerun the CI for those tests?

I'll give IterateForEach a run and see if I get similar results.
tannergooding commented Feb 7, 2023
It's possible that loop alignment or some other peephole optimization is regressed from the differing instruction. Might be worth getting the disassembly to validate the before/after.
a74nh commented Feb 7, 2023 (edited)
Assembly for the main routine under test hasn't changed at all. (The LDP and STP here are prologue/epilogue entries, outside the scope of this patch.)
a74nh commented Feb 7, 2023
Meanwhile, I'm getting a consistent perf drop of 0.1us for System.Collections.IterateForEach when the optimisation is enabled. Will investigate this a little more.
kunalspathak commented Feb 7, 2023
You might want to check the disassembly of
a74nh commented Feb 8, 2023
Narrowed this down: disabling the LDP/STP optimization only on MoveNext recovers the lost performance.
Full assembly for MoveNext: the LDP is in G_M39015_IG07. This is outside of a loop; the only branches to code after the LDP are error cases.

In the code for MoveNext(), LDP is used for the load of entry.key and entry.value. In the Entry struct, TKey and TValue are both ints, so we shouldn't have any alignment issues within the struct, and I would hope that everything else in the dictionary is generally aligned too.

I'm a little concerned about register dependencies: x1/w1 is being used as both source and dest. But that shouldn't cause any issues. Still digging....
a74nh commented Feb 8, 2023
Looks like @tannergooding was right about the alignment issues.

2 and 3 are the same because they are both doing two loads.

Disassembly with addresses: it looks like what we've now got is that some of those branch targets are misaligned addresses. When the LDP (at 0x0000ffffb74f26a8) is two LDRs, the misaligned addresses become aligned.

I think we need to check that targets of branches are aligned, and if not, insert a NOP to align them - that'll be the start of every basic block whose predecessor isn't the previous block. LLVM already has something for this (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64Subtarget.cpp#L94), and it is quite specific depending on the exact Arm processor.

Before I start trying anything out - has aligning targets been discussed before? Is there already any code which does similar?
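The NOP-insertion fix sketched above amounts to padding branch targets up to a fetch boundary. A small model of the padding computation (assuming 4-byte instructions, and a 16-byte boundary as a guess at the relevant fetch granule - the real choice is microarchitecture-specific, as the LLVM code linked above shows):

```python
# How many NOPs would be needed to align a branch target address.

INSTR_SIZE = 4  # AArch64 instructions are fixed-width 4 bytes

def nops_needed(target_addr, alignment=16):
    """Number of 4-byte NOPs to insert so target_addr becomes aligned."""
    misalignment = target_addr % alignment
    if misalignment == 0:
        return 0
    return (alignment - misalignment) // INSTR_SIZE

print(nops_needed(0x0000ffffb74f26c4))  # address % 16 == 4 -> 3 NOPs
print(nops_needed(0x0000ffffb74f26c0))  # already aligned -> 0
```

Removing one 4-byte instruction (two LDRs becoming one LDP) shifts every later address up by 4, which is exactly how previously aligned branch targets like the ones quoted above become misaligned.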
tannergooding commented Feb 8, 2023
We already have this, but it is a heuristic and isn't always done. CC @kunalspathak
BruceForstall commented Feb 8, 2023
Are you suggesting that the loop back-end branch
BruceForstall commented Feb 8, 2023
I believe the perf jobs run every night. @kunalspathak would know for sure.
BruceForstall commented Feb 8, 2023
Overall, if we can verify that the issue is due to loop alignment (or cache effect or some non-deterministic feature) and there isn't a bad interaction between the peephole optimization and the loop alignment implementation that is breaking the loop alignment implementation, then we might choose to simply ignore a regression. It's still worthwhile doing the due diligence to ensure we understand the regression to the best of our ability.
kunalspathak commented Feb 8, 2023
They are run every few hours on a batch of commits. However, looking at the overall history of the benchmark at https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_arm64_ubuntu%2020.04/System.Hashing.GetStringHashCode(BytesCount%3a%2010).html, the regression looks fairly stable.
We added loop alignment back in .NET 6, and you can read the detailed heuristics in https://devblogs.microsoft.com/dotnet/loop-alignment-in-net-6/. Essentially, we try to align the start of the loop (the target of the backedge) as much as possible, given that it fits various criteria such as the size of the loop body, how much padding is needed, and whether the loop is call-free. We have seen cases in the past where loops in code/benchmarks were aligned and, because of optimizations, stopped being aligned, ending up in a regression. Again, this usually happens because the algorithm decides the loop body is too large for alignment to make sense, or the amount of padding needed to align it is more bytes than we can afford to waste.
I agree.
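The heuristic Kunal describes above can be caricatured in a few lines (the thresholds and 32-byte alignment here are illustrative only, not the actual JIT values):

```python
# Simplified model of the .NET loop-alignment decision: only pad a loop
# start if the body is small enough and the padding is cheap enough.

MAX_LOOP_SIZE = 96   # assumed: don't bother aligning large loop bodies
MAX_PADDING = 12     # assumed: don't waste more than this many bytes

def should_align(loop_start_addr, loop_size, alignment=32):
    """Decide whether padding should be emitted before this loop."""
    padding = (-loop_start_addr) % alignment
    if loop_size > MAX_LOOP_SIZE:
        return False      # body too big for alignment to pay off
    if padding > MAX_PADDING:
        return False      # too many wasted bytes to reach the boundary
    return padding > 0    # already aligned needs no padding

print(should_align(0x1004, loop_size=40))  # needs 28 bytes of padding -> False
print(should_align(0x1018, loop_size=40))  # needs 8 bytes of padding -> True
```

This shape also explains the failure mode described above: an unrelated code-size change can move a loop start so the required padding crosses the threshold, and a previously aligned loop silently stops being aligned.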
kunalspathak commented Feb 8, 2023
Just to be sure, try disabling loop alignment using
BruceForstall commented Feb 8, 2023
@kunalspathak Since this optimization only reduces code size, these cases shouldn't occur, right? Has the alignment padding already been committed when this peephole optimization occurs? Or will the required alignment padding be adjusted after the peep?
kunalspathak commented Feb 8, 2023
Ideally yes, but to confirm that, we need to really make sure that
No, the alignment padding adjustment happens after the peep. |
a74nh commented Feb 9, 2023
Setting this didn't cause any difference. I then hacked coreclr so that it inserted a NOP at the start of the next block after the LDP. And this regained all the lost performance! Back to 2.18ms. Note that this is the only function in which I'm allowing peepholes to occur.
This is with LDP:
This is with LDP and a NOP:
Not quite, it would be the jumps to 0x0000ffffb74f26c4 or 0x0000ffffb74f26dc. My NOP causes both of these to become aligned.
tannergooding commented Feb 9, 2023
I'd also notably expect the
kunalspathak commented Feb 9, 2023
Which means that loop alignment is definitely not affecting it, although your experiment shows that aligning a few places would improve performance (though not necessarily recover what was lost with the ldp change). Basically, if you add

Were they aligned before your ldp change?
a74nh commented Feb 9, 2023
Before the LDP change, they were both aligned. So adding the NOP is making the rest of the instructions sit in the same positions as they did with two LDRs.

I tried some more moving around, and moving the NOP to after the ret (or anywhere else afterwards) drops the performance again, back to 2.3ms. Which is odd, as G_M39015_IG09 is aligned and there is nothing branching to G_M39015_IG08.

I can give this a try too. The next step would be to recreate this as a standalone binary using that block of assembly and get it in a simulator. It might take a bit of time to get it showing the exact same behaviour. If we think it's important enough, then I can give it a go.
BruceForstall commented Feb 11, 2023
It certainly seems like there is some odd micro-architectural effect here. E.g., and this seems like grasping at straws, maybe there's an instruction prefetcher grabbing instructions at the presumably (!) cold, not-taken branch targets that are newly unaligned, causing conflicts with fetching the fall-through path? I'm not sure how much more I would invest in this investigation, although understanding more might save time the next time we see an unexplained regression.
kunalspathak commented Feb 13, 2023 (edited)
kunalspathak commented Feb 13, 2023
I went through the issues and I don't see any other regressions.
a74nh commented Feb 16, 2023
Some updates.....

I extracted the assembly for the entire function into a test program and set some dummy memory values for the dictionary. I ran this on a cycle-accurate simulator for the N1 and extracted the traces (including pipeline stages). I did this once for the program with LDP, and once with an LDP plus a NOP. There was nothing to suggest any difference between the two, except for the NOP adding a slight delay. Sadly, I'm unable to share any of the traces.

What my test app doesn't replicate is the exact memory setup of the coreclr version (e.g. the code has the same alignment but is in a different location; the contents of the dictionary are different and live in a different location). So it's possible this is causing a difference. There are also differences from coreclr (e.g. GC) to take into account.

As a diversion, now that I have some code to insert NOPs in arbitrary places during clr codegen, I experimented a bit more with moving the NOP around. I've annotated the code below with the benchmark result when a NOP (or 2 NOPs, or 3) is placed there. It's hard to make firm statements here, but:
kunalspathak commented Feb 16, 2023
Both the regressions seem to have come back, even though I don't see any PR that would have improved them. The diff range is: 6ad1205...dce07a8

At this point, I won't spend much more time on this, given that your experiments proved it was around general alignment (and not necessarily loop alignment). Thank you @a74nh for spending time investigating.
BruceForstall commented Feb 16, 2023
@a74nh That's some amazing in-depth analysis. I agree with @kunalspathak that it doesn't seem worth spending any more time on it at this point.