Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

JIT: Add a disabled-by-default implementation of strength reduction#104243

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
jakobbotsch merged 6 commits intodotnet:mainfromjakobbotsch:strength-reduction-2
Jul 4, 2024

Conversation

@jakobbotsch
Copy link
Member

@jakobbotschjakobbotsch commentedJul 1, 2024
edited
Loading

This adds a disabled-by-default implementation of strength reduction. At this point the implementation should be correct, however it is currently both a size and perfscore regression when it is enabled. More work will be needed to get the heuristics right and to make it kick in for more cases.

Strength reduction replaces "expensive" operations computed on every loop iteration with cheaper ones by creating more induction variables. In C# terms it effectively transforms something like

privatestructS{publicintA,B,C;}[MethodImpl(MethodImplOptions.NoInlining)]privatestaticfloatSum(S[]ss){intsum=0;for(inti=0;i<ss.Length;i++){Sv=ss[i];sum+=v.A;sum+=v.B;sum+=v.C;}returnsum;}

into an equivalent

intsum=0;refScurS=refss[0];for(inti=0;i<ss.Length;i++){sum+=curS.A;sum+=curS.B;sum+=curS.C;curS=refUnsafe.Add(refcurS,1);}

With strength reduction enabled this PR thus changes codegen of the standardforeach version above from

G_M63518_IG03:  ;; offset=0x0011lear10,[rdx+2*rdx]lear10, bword ptr[rcx+4*r10+0x10]movr9d, dword ptr[r10]movr11d, dword ptr[r10+0x04]movr10d, dword ptr[r10+0x08]addeax,r9daddeax,r11daddeax,r10dincedxcmpr8d,edxjg       SHORT G_M63518_IG03;; size=36 bbWeight=4 PerfScore 39.00

to

G_M63518_IG03:  ;; offset=0x000Daddrcx,16;; size=4 bbWeight=0.25 PerfScore 0.06G_M63518_IG04:  ;; offset=0x0011movr8,rcxmovr10d, dword ptr[r8]movr9d, dword ptr[r8+0x04]movr8d, dword ptr[r8+0x08]addeax,r10daddeax,r9daddeax,r8daddrcx,12decedxjne      SHORT G_M63518_IG04;; size=31 bbWeight=4 PerfScore 34.00

on x64. Further improvements can be made to enable better address modes.

The current heuristics try to ensure that we do not actually end up with more primary induction variables. The strength reduction only kicks in when it thinks that all uses of the primary IV can be replaced by the new primary IV. However, uses inside loop exit tests are allowed to stay unreplaced by the assumption that the downwards loop transformation will be able to get rid of them.

Getting the cases around overflow right turned out to be hard and required reasoning about trip counts that was added in a previous PR. Generally, the issue is that we need to prove that transforming a zero extension of an add recurrence to a 64-bit add recurrence is legal. For example, for a simple case of

for (int i = 0; i < arr.Length; i++)  sum += arr[i];

the IV analysis is eventually going to end up wanting to show thatzext<64>(int32 <L, 0, 1>) => int64 <L, 0, 1> is a correct transformation. This requires showing that the add recurrence does not step past 2^32-1, which requires the bound on the trip count that we can now compute. The reasoning done for both the trip count and around the overflow is still very limited but can be improved incrementally.

The implementation works by considering every primary IV of the loop in turn, and by initializing 'cursors' pointing to each use of the primary IV. It then tries to repeatedly advance these cursors to the parent of the uses while it results in a new set of cursors that still compute the same (now derived) IV. If it manages to do this once, then replacing the cursors by a new primary IV should result in the old primary IV no longer being necessary, while having replaced some operations by cheaper ones.

PaulusParssinen reacted with thumbs up emoji
This adds a disabled-by-default implementation of strength reduction. Atthis point the implementation should be correct, however it is currentlyboth a size and perfscore regression when it is enabled. More work willbe needed to get the heuristics right and to make it kick in for morecases.Strength reduction replaces "expensive" operations computed on everyloop iteration with cheaper ones by creating more inductionvariables. In C# terms it effectively transforms something like```private struct S{    public int A, B, C;}[MethodImpl(MethodImplOptions.NoInlining)]private static float Sum(S[] ss){    int sum = 0;    foreach (S v in ss)    {        sum += v.A;        sum += v.B;        sum += v.C;    }    return sum;}```into an equivalent```int sum = 0;ref S curS = ref ss[0];for (int i = 0; i < ss.Length; i++){  sum += curS.A;  sum += curS.B;  sum += curS.C;  curS = ref Unsafe.Add(ref curS, 1);}```With strength reduction enabled this PR thus changes codegen of thestandard `foreach` version above from```asmG_M63518_IG03:  ;; offset=0x0011       lea      r10, [rdx+2*rdx]       lea      r10, bword ptr [rcx+4*r10+0x10]       mov      r9d, dword ptr [r10]       mov      r11d, dword ptr [r10+0x04]       mov      r10d, dword ptr [r10+0x08]       add      eax, r9d       add      eax, r11d       add      eax, r10d       inc      edx       cmp      r8d, edx       jg       SHORT G_M63518_IG03;; size=36 bbWeight=4 PerfScore 39.00```to```asmG_M63518_IG04:  ;; offset=0x0011       mov      r8, rcx       mov      r10d, dword ptr [r8]       mov      r9d, dword ptr [r8+0x04]       mov      r8d, dword ptr [r8+0x08]       add      eax, r10d       add      eax, r9d       add      eax, r8d       add      rcx, 12       dec      edx       jne      SHORT G_M63518_IG04;; size=31 bbWeight=4 PerfScore 34.00```on x64. Further improvements can be made to enable better address modes.The current heuristics try to ensure that we do not actually end up withmore primary induction variables. The strength reduction only kicks inwhen it thinks that all uses of the primary IV can be replaced by thenew primary IV. However, uses inside loop exit tests are allowed to stayunreplaced by the assumption that the downwards loop transformationwill be able to get rid of them.Getting the cases around overflow right turned out to be hard andrequired reasoning about trip counts that was added in a previous PR.Generally, the issue is that we need to prove that transforming a zeroextension of an add recurrence to a 64-bit add recurrence is legal. Forexample, for a simple case of```for (int i = 0; i < arr.Length; i++)  sum += arr[i];```the IV analysis is eventually going to end up wanting to show that`zext<64>(int32 <L, 0, 1>) => int64 <L, 0, 1>` is a correcttransformation. This requires showing that the add recurrence does notstep past 2^32-1, which requires the bound on the trip count that we cannow compute. The reasoning done for both the trip count and around theoverflow is still very limited but can be improved incrementally.The implementation works by considering every primary IV of the loop inturn, and by initializing 'cursors' pointing to each use of the primaryIV. It then tries to repeatedly advance these cursors to the parent ofthe uses while it results in a new set of cursors that still compute thesame (now derived) IV. If it manages to do this once, then replacing thecursors by a new primary IV should result in the old primary IV nolonger being necessary, while having replaced some operations by cheaperones.
@ghostghost added the area-CodeGen-coreclrCLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labelJul 1, 2024
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area:@JulieLeeMSFT,@jakobbotsch
See info inarea-owners.md if you want to be subscribed.

@jakobbotsch
Copy link
MemberAuthor

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Copy link
MemberAuthor

@MichalStrehovsky Have you seen this one before during ilc?https://dev.azure.com/dnceng-public/public/_build/results?buildId=726752&view=logs&j=15d34cf8-8614-5f8a-e364-3c281b6c41dd&t=2f724b17-e574-5ab4-a401-210ce585d93f&l=767

Do I understand correctly this is during the ilc that was jitted without my changes from this PR?

@jakobbotsch
Copy link
MemberAuthor

jakobbotsch commentedJul 2, 2024
edited
Loading

Going to bounce the stress pipelines which seem to have been in bad shape with e.g.#104263 and#104232.

@jakobbotsch

This comment was marked as spam.

@azure-pipelines

This comment was marked as resolved.

@jakobbotsch

This comment was marked as resolved.

@azure-pipelines

This comment was marked as resolved.

@jakobbotsch
Copy link
MemberAuthor

Still have some correctness issues to sort out.. for example, we create an illegal byref inTest in the following example:

publicstaticvoidMain(){Test(refUnsafe.NullRef<ushort>(),()=>false);}[MethodImpl(MethodImplOptions.NoInlining)]publicstaticintTest(refushortb,Func<bool>sumIndex){intsum=0;for(nuinti=0;i<10000;i++){if(sumIndex())sum+=Unsafe.Add(refb,0x100000+i);}returnsum;}
G_M401_IG02:  ;; offset=0x000Bxoresi,esileardi, bword ptr[rcx+0x00200000]movebp,0x2710;; size=14 bbWeight=0.25 PerfScore 0.25G_M401_IG03:  ;; offset=0x0019movrcx, gword ptr[rbx+0x08]call[rbx+0x18]System.Func`1[ubyte]:Invoke():ubyte:thistesteax,eaxje       SHORT G_M401_IG05;; size=11 bbWeight=4 PerfScore 25.00G_M401_IG04:  ;; offset=0x0024movzxrax, word  ptr[rdi]addesi,eax;; size=5 bbWeight=2 PerfScore 4.50G_M401_IG05:  ;; offset=0x0029addrdi,2decrbpjne      SHORT G_M401_IG03;; size=9 bbWeight=4 PerfScore 6.00

We end up creating and reporting a byref0x00200000 eagerly that is never created in normal circumstances.

@jakobbotsch
Copy link
MemberAuthor

Just going to bail indiscriminately on the GC pointers case for now. Will handle some of the cases where we can prove it is ok in a follow-up.

@jakobbotsch

This comment was marked as resolved.

@azure-pipelines

This comment was marked as resolved.

@jakobbotsch

This comment was marked as resolved.

@azure-pipelines

This comment was marked as resolved.

@jakobbotsch

This comment was marked as resolved.

@azure-pipelines

This comment was marked as resolved.

@jakobbotschjakobbotsch marked this pull request as ready for reviewJuly 2, 2024 20:55
@jakobbotsch
Copy link
MemberAuthor

cc @dotnet/jit-contrib PTAL@AndyAyersMS

The jitstress/libraries-jitstress failures look like#103577,#104316,#104317,#102706.

Diffs with this enabled do not look too encouraging yet. They looked quite a bit better before disabling the GC types, so hopefully will be able to get it into a better state once I start improving on the legality checks and heuristics. I want to do that in some smaller changes separately. Am going to push a commit to disable this by default again.

Copy link
Member

@AndyAyersMSAndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Overall this looks great. I had few questions but they can be addressed (or ignored) in follow-ups.

returnfalse;
}

bool isInsideExitTest =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Maybe we should compute the set of exit blocks during loop analysis?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Or at least introduce some helper for this... let me do it as part of a follow-up.

//
// It is not a correctness requirement that we remove all uses; if we end up
// not doing so (e.g. because a cursor was not created by this function),
// then we may just end up with extra primary IVs in the loop.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would it be interesting to have a stress mode where you intentionally don't create cursors all the places you could?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yeah, I think so. Good idea for a follow-up.

// The algorithm here works in the following way: we process each primary
// IV in turn. For every primary IV, we create a 'cursor' pointing to every
// use of that primary IV. We then continuously advance each cursor to the
// parent node as long as all cursors represent the same derived IV. Once we
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm curious about the "same" derived IV requirement ... if we are say converting from a char array to an int array

int[]a= ...char[]b= ...for(inti=0; ...){a[i]=(int)b[i];}

if I understand correctly the current code would not strength reduce because we'd have <L, 0, 4> vs <L, 0, 2>? Seems like we still might want to strength reduce as long as the set of cursors doesn't expand into too many different new IVs? Or say if it is relatively cheap to convert from one family to another (say multiply by 2 which can be done with scaled address mode).

Also wonder if this profitability equation changes if we have post-increment modes (I forget where we ended up on this).

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

if I understand correctly the current code would not strength reduce because we'd have <L, 0, 4> vs <L, 0, 2>? Seems like we still might want to strength reduce as long as the set of cursors doesn't expand into too many different new IVs? Or say if it is relatively cheap to convert from one family to another (say multiply by 2 which can be done with scaled address mode).

Yeah, that's right. I think you're right we can generalize this: we can keep going if the addrecs have the same 'start' and all step values are divisible by the minimum step value. I'll consider that as a follow-up.

Also wonder if this profitability equation changes if we have post-increment modes (I forget where we ended up on this).

Currently there's really no heuristics here; we kind of just do strength reduction whenever we have a candidate. I definitely need to sit down and consider more carefully how the transformation affects performance and size (usually it seems to regress size because it requires computing some 'start' value in the preheader).

break;
}

// TODO-CQ: If this is now the source to a store, we can
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Do you know how often you see this? I suppose for normal array uses it's rare, we may only see it for pointer math where users have CSE'd parts of the computation or something.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Fairly often:

Advanced cursor node typeADD                  :   26343CAST                 :   15729LSH                  :   11239BOUNDS_CHECK         :    8355IND                  :    5714LT                   :    3143GT                   :    2536STORE_LCL_VAR        :    1319GE                   :    1252SUB                  :    1008CALL                 :     993STORE_BLK            :     907HWINTRINSIC          :     892BLK                  :     841UMOD                 :     822LE                   :     437STOREIND             :     357EQ                   :     297MUL                  :      89AND                  :      63NE                   :      61MOD                  :      38INTRINSIC            :      16NEG                  :      14UDIV                 :      10COMMA                :       7BITCAST              :       2DIV                  :       2RSH                  :       2

@jakobbotsch
Copy link
MemberAuthor

There's a few diffs from some of the new sanity checks added when creating SCEVs.

Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.

Reviewers

@AndyAyersMSAndyAyersMSAndyAyersMS approved these changes

Assignees

@jakobbotschjakobbotsch

Labels

area-CodeGen-coreclrCLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

2 participants

@jakobbotsch@AndyAyersMS

[8]ページ先頭

©2009-2025 Movatter.jp