NotificationsYou must be signed in to change notification settings
Fork5.1k
Star16.6k

Fix perf issues discovered in "For software performance, can you always trust inlining" blog#61408

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

EgorBo merged 9 commits intodotnet:mainfromEgorBo:fix-lemire-issues

Dec 7, 2021

Merged

Fix perf issues discovered in "For software performance, can you always trust inlining" blog#61408

EgorBo merged 9 commits intodotnet:mainfromEgorBo:fix-lemire-issues

Dec 7, 2021

Conversation

Copy link

Member

EgorBo commentedNov 10, 2021•
edited
Loading

In"For software performance, can you always trust inlining?" blog post@lemire discovered two issues with the following C# code:

[MethodImpl(MethodImplOptions.AggressiveInlining)]unsafestaticboolis_made_of_sixteen_digits(byte*chars){Vector128<sbyte>ascii0=Vector128.Create((sbyte)47);Vector128<sbyte>after_ascii9=Vector128.Create((sbyte)58);Vector128<sbyte>raw=Sse41.LoadDquVector128((sbyte*)chars);vara=Sse2.CompareGreaterThan(raw,ascii0);varb=Sse2.CompareLessThan(raw,after_ascii9);varc=Sse2.Subtract(a,b);// this is not optimalreturn(Sse41.TestZ(c,c));}unsafestaticintParseNumberString(byte*p,byte*pend){if((p+16<=pend)&&is_made_of_sixteen_digits(p))if((p+32<=pend)&&is_made_of_sixteen_digits(p+16))return2;return1;return0;}

is_made_of_sixteen_digits is not inlined intoParseNumberString without[AggressiveInlining]
bothVector128.Create() are not CSE'd after inlining.

This PR fixes both issues, new codegen:https://www.diffchecker.com/oFYTlikO (and[AggressiveInlining] is not needed any more)

1. Inlining

Actuallyis_made_of_sixteen_digits is inlined here in .NET 6.0 without any changes, but only if promoted from tier0 to tier1 naturally. I enabled "call" opcode resolving for TieredCompilation=0 mode or/and for methods with loops which currently bypass Tier0. Nowis_made_of_sixteen_digits is always inlineable because inliner understands what those calls are:

Inline candidate looks like a wrapper method.  Multiplier increased to 1.Inline candidate has SIMD type args, locals or return value.  Multiplier increased to 4.Inline has 5 intrinsics.  Multiplier increased to 6.5.Inline has 2 foldable intrinsics.  Multiplier increased to 9.5.Inline candidate callsite is boring.  Multiplier increased to 10.8.calleeNativeSizeEstimate=816callsiteNativeSizeEstimate=85benefit multiplier=10.8threshold=918Native estimate for function size is within threshold for inlining 81.6 <= 91.8 (multiplier = 10.8)

Will see how this impacts JIT throughput in TC=0 mode. Perhaps, I should only enable it for methods with loops when TC_QJFL is 0 (default).

2. Enable CSE for`Vector128.Create(42)`

We do support CSE for simd operations but it turns outVector128.Create(42) had super low "cost" (1) so CSE used to always gave up on it, e.g.:

staticboolCaller(){returnCallee()&&Callee();}staticboolCallee(){vara=Vector128.Create(42);varb=Vector128.Create(43);returnSse41.TestZ(a,b);}

After Inlining CSE refused to optimize Create(42):

Considering CSE #02 {$c2 , $4  } [def=100.000000, use=50.000000, cost=  2      ]CSE Expression : N002 (  2,  2) CSE #02 (def)[000016] ------------              *  HWINTRINSIC simd16 int Create $c2N001 (  1,  1)              [000015] ------------              \--*  CNS_INT   int    43 $42Aggressive CSE Promotion (250.000000 >= 200.000000)cseRefCnt=250.000000, aggressiveRefCnt=200.000000, moderateRefCnt=100.000000defCnt=100.000000, useCnt=50.000000, cost=2, size=2def_cost=1, use_cost=1, extra_no_cost=2, extra_yes_cost=0CSE cost savings check (102.000000 >= 150.000000) failsDid Not promote this CSE

After my change:

Considering CSE #01 {$c0 , $3  } [def=100.000000, use=50.000000, cost=  3      ]CSE Expression : N002 (  3,  3) CSE #01 (def)[000011] ------------              *  HWINTRINSIC simd16 int Create $c0N001 (  1,  1)              [000010] ------------              \--*  CNS_INT   int    42 $41Aggressive CSE Promotion (250.000000 >= 200.000000)cseRefCnt=250.000000, aggressiveRefCnt=200.000000, moderateRefCnt=100.000000defCnt=100.000000, useCnt=50.000000, cost=3, size=3def_cost=1, use_cost=1, extra_no_cost=4, extra_yes_cost=0CSE cost savings check (154.000000 >= 150.000000) passesPromoting CSE:

Codegen diff forCaller:

; Method Program:Caller():boolG_M51476_IG01:       vzeroupper G_M51476_IG02:       vmovupd  xmm0, xmmword ptr [reloc @RWD00]       vmovupd  xmm1, xmmword ptr [reloc @RWD16]       vptest   xmm0, xmm1       sete     al       movzx    rax, al       test     eax, eax       je       SHORT G_M51476_IG05-G_M51476_IG03:-       vmovupd  xmm0, xmmword ptr [reloc @RWD00]-       vmovupd  xmm0, xmmword ptr [reloc @RWD16]-G_M51476_IG04:       ret      G_M51476_IG05:       xor      eax, eaxG_M51476_IG06:       ret      RWD00  dq0000002A0000002Ah, 0000002A0000002AhRWD16  dq0000002B0000002Bh, 0000002B0000002Bh-; Total bytes of code: 54+; Total bytes of code: 38

@dotnet/jit-contrib@jakobbotsch (CSE-area owner)@tannergooding

Will post diffs a bit later

Make Vector128.Create more expensive, enable call opcode resolution i…

a433102

…n no-TC mode

ghost added the area-CodeGen-coreclrCLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label

Nov 10, 2021

Copy link

ghost commentedNov 10, 2021

Tagging subscribers to this area:@JulieLeeMSFT
See info inarea-owners.md if you want to be subscribed.

Issue Details

In"For software performance, can you always trust inlining?" blog post@lemire discovered two issues with the following C# code:

[MethodImpl(MethodImplOptions.AggressiveInlining)]unsafestaticboolis_made_of_sixteen_digits(byte*chars){Vector128<sbyte>ascii0=Vector128.Create((sbyte)47);Vector128<sbyte>after_ascii9=Vector128.Create((sbyte)58);Vector128<sbyte>raw=Sse41.LoadDquVector128((sbyte*)chars);vara=Sse2.CompareGreaterThan(raw,ascii0);varb=Sse2.CompareLessThan(raw,after_ascii9);varc=Sse2.Subtract(a,b);// this is not optimalreturn(Sse41.TestZ(c,c));}unsafestaticintParseNumberString(byte*p,byte*pend){if((p+16<=pend)&&is_made_of_sixteen_digits(p))if((p+32<=pend)&&is_made_of_sixteen_digits(p+16))return2;return1;return0;}

is_made_of_sixteen_digits is not inlined intoParseNumberString without[AggressiveInlining]
bothVector128.Create() are not CSE'd after inlining.

This PR fixes both issues, new codegen:https://www.diffchecker.com/oFYTlikO (and[AggressiveInlining] is not needed any more)

1. Inlining

First, I enabled "call" opcode resolving for TieredCompilation=0 mode or/and for methods with loops which currently bypass Tier0. Nowis_made_of_sixteen_digits is always inlineable because inliner understands what those calls are:

Inline candidate looks like a wrapper method.  Multiplier increased to 1.Inline candidate has SIMD type args, locals or return value.  Multiplier increased to 4.Inline has 5 intrinsics.  Multiplier increased to 6.5.Inline has 2 foldable intrinsics.  Multiplier increased to 9.5.Inline candidate callsite is boring.  Multiplier increased to 10.8.calleeNativeSizeEstimate=816callsiteNativeSizeEstimate=85benefit multiplier=10.8threshold=918Native estimate for function size is within threshold for inlining 81.6 <= 91.8 (multiplier = 10.8)

Will see how this impacts JIT throughput in TC=0 mode.

2. Enable CSE for`Vector128.Create(42)`

We do support CSE for simd operations but it turns outVector128.Create(42) had super low "cost" (1) so CSE used to always gave up on it, e.g.:

staticboolCaller(){returnCallee()&&Callee();}staticboolCallee(){vara=Vector128.Create(42);varb=Vector128.Create(43);returnSse41.TestZ(a,b);}

After Inlining CSE refused to optimize Create(42):

Considering CSE #02 {$c2 , $4  } [def=100.000000, use=50.000000, cost=  2      ]CSE Expression : N002 (  2,  2) CSE #02 (def)[000016] ------------              *  HWINTRINSIC simd16 int Create $c2N001 (  1,  1)              [000015] ------------              \--*  CNS_INT   int    43 $42Aggressive CSE Promotion (250.000000 >= 200.000000)cseRefCnt=250.000000, aggressiveRefCnt=200.000000, moderateRefCnt=100.000000defCnt=100.000000, useCnt=50.000000, cost=2, size=2def_cost=1, use_cost=1, extra_no_cost=2, extra_yes_cost=0CSE cost savings check (102.000000 >= 150.000000) failsDid Not promote this CSE

After my change:

Considering CSE #01 {$c0 , $3  } [def=100.000000, use=50.000000, cost=  3      ]CSE Expression : N002 (  3,  3) CSE #01 (def)[000011] ------------              *  HWINTRINSIC simd16 int Create $c0N001 (  1,  1)              [000010] ------------              \--*  CNS_INT   int    42 $41Aggressive CSE Promotion (250.000000 >= 200.000000)cseRefCnt=250.000000, aggressiveRefCnt=200.000000, moderateRefCnt=100.000000defCnt=100.000000, useCnt=50.000000, cost=3, size=3def_cost=1, use_cost=1, extra_no_cost=4, extra_yes_cost=0CSE cost savings check (154.000000 >= 150.000000) passesPromoting CSE:

Codegen diff forCaller:

; Method Program:Caller():boolG_M51476_IG01:       vzeroupper G_M51476_IG02:       vmovupd  xmm0, xmmword ptr [reloc @RWD00]       vmovupd  xmm1, xmmword ptr [reloc @RWD16]       vptest   xmm0, xmm1       sete     al       movzx    rax, al       test     eax, eax       je       SHORT G_M51476_IG05-G_M51476_IG03:-       vmovupd  xmm0, xmmword ptr [reloc @RWD00]-       vmovupd  xmm0, xmmword ptr [reloc @RWD16]-G_M51476_IG04:       ret      G_M51476_IG05:       xor      eax, eaxG_M51476_IG06:       ret      RWD00  dq0000002A0000002Ah, 0000002A0000002AhRWD16  dq0000002B0000002Bh, 0000002B0000002Bh-; Total bytes of code: 54+; Total bytes of code: 38

Author:	EgorBo
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

Copy link

MemberAuthor

EgorBo commentedNov 10, 2021•
edited
Loading

Inliner diffs (jit-diff tool, --pmi mode):

PMI CodeSize Diffs for System.Private.CoreLib.dll for  default jitSummary of Code Size diffs:(Lower is better)Total bytes of base: 5536930Total bytes of diff: 5589051Total bytes of delta: 52121 (0.94 % of base)Total relative delta: 394.76    diff is a regression.    relative diff is a regression.Top file regressions (bytes):       52121 : System.Private.CoreLib.dasm (0.94% of base)1 total files with Code Size differences (0 improved, 1 regressed), 0 unchanged.Top method regressions (bytes):        1402 (289.07% of base) : System.Private.CoreLib.dasm - ValueTuple`1:System.IValueTupleInternal.ToStringEnd():String:this (8 methods)        1208 (15.85% of base) : System.Private.CoreLib.dasm - WhenAllPromise`1:Invoke(Task):this (8 methods)         700 (167.06% of base) : System.Private.CoreLib.dasm - HashSet`1:EqualityComparersAreEqual(HashSet`1,HashSet`1):bool (8 methods)         664 (206.21% of base) : System.Private.CoreLib.dasm - BadImageFormatException:ToString():String:this         619 (86.82% of base) : System.Private.CoreLib.dasm - TypeNameBuilder:AddAssemblyQualifiedName(Type,int):this         613 (12.23% of base) : System.Private.CoreLib.dasm - Dictionary`2:GetObjectData(SerializationInfo,StreamingContext):this (8 methods)         610 (12.11% of base) : System.Private.CoreLib.dasm - HashSet`1:GetObjectData(SerializationInfo,StreamingContext):this (8 methods)         586 (13.63% of base) : System.Private.CoreLib.dasm - ArraySortHelper`1:IntroSort(Span`1,int,Comparison`1) (9 methods)         519 (52.16% of base) : System.Private.CoreLib.dasm - TextInfo:ToTitleCase(String):String:this         516 (56.27% of base) : System.Private.CoreLib.dasm - ConcurrentQueue`1:System.Collections.ICollection.CopyTo(Array,int):this (8 methods)         494 (81.92% of base) : System.Private.CoreLib.dasm - ContractHelper:GetFailureMessage(int,String):String         490 (103.81% of base) : System.Private.CoreLib.dasm - HashSet`1:.ctor(int,IEqualityComparer`1):this (8 methods)         482 (113.95% of base) : System.Private.CoreLib.dasm - ManifestBuilder:WriteMessageAttrib(StringBuilder,String,String,String):this         474 (14.31% of base) : System.Private.CoreLib.dasm - CustomAttributeTypedArgument:ToString(bool):String:this         468 (77.61% of base) : System.Private.CoreLib.dasm - EventProvider:GetDataFromController(int,long,byref,byref,byref):bool:this         467 (10.47% of base) : System.Private.CoreLib.dasm - DateTimeFormatInfo:CreateTokenHashTable():ref:this         461 (105.25% of base) : System.Private.CoreLib.dasm - Lazy`1:PublicationOnlyViaConstructor(LazyHelper):this (8 methods)         459 (24.88% of base) : System.Private.CoreLib.dasm - HashSet`1:.ctor(IEnumerable`1,IEqualityComparer`1):this (8 methods)         449 (231.44% of base) : System.Private.CoreLib.dasm - EventSource:ReportOutOfBandMessage(String):this         445 (64.77% of base) : System.Private.CoreLib.dasm - CultureInfo:GetCultureInfo(String,String):CultureInfoTop method improvements (bytes):        -807 (-11.16% of base) : System.Private.CoreLib.dasm - String:Concat(IEnumerable`1):String (9 methods)        -693 (-9.63% of base) : System.Private.CoreLib.dasm - String:JoinCore(ReadOnlySpan`1,IEnumerable`1):String (8 methods)        -552 (-11.36% of base) : System.Private.CoreLib.dasm - Vector128`1:ToString(String,IFormatProvider):String:this (6 methods)        -529 (-10.75% of base) : System.Private.CoreLib.dasm - Vector`1:ToString(String,IFormatProvider):String:this (6 methods)        -529 (-10.72% of base) : System.Private.CoreLib.dasm - Vector256`1:ToString(String,IFormatProvider):String:this (6 methods)        -526 (-10.27% of base) : System.Private.CoreLib.dasm - Vector64`1:ToString(String,IFormatProvider):String:this (6 methods)        -503 (-14.17% of base) : System.Private.CoreLib.dasm - StringSerializer:GetSerializedString(TimeZoneInfo):String        -448 (-24.71% of base) : System.Private.CoreLib.dasm - Dictionary`2:.ctor(int,IEqualityComparer`1):this (12 base, 10 diff methods)        -284 (-24.76% of base) : System.Private.CoreLib.dasm - RuntimeMethodInfo:ToString():String:this        -207 (-13.13% of base) : System.Private.CoreLib.dasm - ApplicationId:ToString():String:this        -204 (-30.09% of base) : System.Private.CoreLib.dasm - RTDynamicMethod:ToString():String:this        -188 (-24.67% of base) : System.Private.CoreLib.dasm - RuntimePropertyInfo:ToString():String:this        -128 (-12.12% of base) : System.Private.CoreLib.dasm - IdManager:.ctor():this (8 methods)        -120 (-18.07% of base) : System.Private.CoreLib.dasm - SystemThreadingTasks_FutureDebugView`1:get_CancellationPending():bool:this (8 methods)        -106 (-10.76% of base) : System.Private.CoreLib.dasm - ArraySortHelper`1:SwapIfGreater(Span`1,Comparison`1,int,int) (9 base, 8 diff methods)        -106 (-5.87% of base) : System.Private.CoreLib.dasm - GateThread:GateThreadStart()        -100 (-12.41% of base) : System.Private.CoreLib.dasm - String:JoinCore(ReadOnlySpan`1,ref):String         -94 (-15.59% of base) : System.Private.CoreLib.dasm - RuntimeConstructorInfo:ToString():String:this         -94 (-5.48% of base) : System.Private.CoreLib.dasm - String:Join(String,IEnumerable`1):String (9 methods)         -93 (-7.32% of base) : System.Private.CoreLib.dasm - CustomAttributeData:ToString():String:thisTop method regressions (percentages):         434 (1,808.33% of base) : System.Private.CoreLib.dasm - Assembly:CreateQualifiedName(String,String):String         153 (1,390.91% of base) : System.Private.CoreLib.dasm - SignatureHelper:AddArgument(Type):this          67 (1,340.00% of base) : System.Private.CoreLib.dasm - Array:System.Collections.IList.IndexOf(Object):int:this         140 (1,272.73% of base) : System.Private.CoreLib.dasm - ModuleHandle:GetRuntimeMethodHandleFromMetadataToken(int):RuntimeMethodHandle:this         140 (1,272.73% of base) : System.Private.CoreLib.dasm - ModuleHandle:ResolveMethodHandle(int):RuntimeMethodHandle:this          58 (1,160.00% of base) : System.Private.CoreLib.dasm - Path:EndsInDirectorySeparator(ReadOnlySpan`1):bool         425 (988.37% of base) : System.Private.CoreLib.dasm - Vector256:Abs(Vector256`1):Vector256`1 (6 methods)         209 (950.00% of base) : System.Private.CoreLib.dasm - EventListener:EnableEvents(EventSource,int,long):this          46 (920.00% of base) : System.Private.CoreLib.dasm - String:op_Equality(String,String):bool         203 (882.61% of base) : System.Private.CoreLib.dasm - EventListener:EnableEvents(EventSource,int):this         225 (681.82% of base) : System.Private.CoreLib.dasm - RuntimeModule:GetMethodImpl(String,int,Binder,int,ref,ref):MethodInfo:this         220 (564.10% of base) : System.Private.CoreLib.dasm - ModuleBuilder:GetMethodImpl(String,int,Binder,int,ref,ref):MethodInfo:this          61 (554.55% of base) : System.Private.CoreLib.dasm - FastResourceComparer:Equals(String,String):bool:this          61 (554.55% of base) : System.Private.CoreLib.dasm - NonRandomizedStringEqualityComparer:Equals(String,String):bool:this          61 (554.55% of base) : System.Private.CoreLib.dasm - OrdinalCaseSensitiveComparer:Equals(String,String):bool:this         415 (506.10% of base) : System.Private.CoreLib.dasm - RuntimeFieldInfo:ToString():String:this         430 (483.15% of base) : System.Private.CoreLib.dasm - ContractHelper:GetDisplayMessage(int,String,String):String         417 (468.54% of base) : System.Private.CoreLib.dasm - ParameterInfo:ToString():String:this         214 (465.22% of base) : System.Private.CoreLib.dasm - Dictionary`2:TryGetValue(__Canon,byref):bool:this (1 base, 5 diff methods)          51 (463.64% of base) : System.Private.CoreLib.dasm - Net5CompatSeedImpl:NextBytes(Span`1):thisTop method improvements (percentages):         -34 (-33.01% of base) : System.Private.CoreLib.dasm - NullabilityInfoContext:.ctor():this        -204 (-30.09% of base) : System.Private.CoreLib.dasm - RTDynamicMethod:ToString():String:this         -41 (-24.85% of base) : System.Private.CoreLib.dasm - Task:HandleException(Exception):this        -284 (-24.76% of base) : System.Private.CoreLib.dasm - RuntimeMethodInfo:ToString():String:this        -448 (-24.71% of base) : System.Private.CoreLib.dasm - Dictionary`2:.ctor(int,IEqualityComparer`1):this (12 base, 10 diff methods)        -188 (-24.67% of base) : System.Private.CoreLib.dasm - RuntimePropertyInfo:ToString():String:this         -47 (-22.71% of base) : System.Private.CoreLib.dasm - Task:ExecuteEntryUnsafe(Thread):this         -47 (-22.71% of base) : System.Private.CoreLib.dasm - Task:ExecuteFromThreadPool(Thread):this         -47 (-18.65% of base) : System.Private.CoreLib.dasm - Task:ExecuteEntry():bool:this         -18 (-18.18% of base) : System.Private.CoreLib.dasm - AssemblyLoadContext:get_AllContexts():Dictionary`2         -57 (-18.15% of base) : System.Private.CoreLib.dasm - ThreadPoolTaskScheduler:TryExecuteTaskInline(Task,bool):bool:this        -120 (-18.07% of base) : System.Private.CoreLib.dasm - SystemThreadingTasks_FutureDebugView`1:get_CancellationPending():bool:this (8 methods)         -15 (-18.07% of base) : System.Private.CoreLib.dasm - SystemThreadingTasks_TaskDebugView:get_CancellationPending():bool:this         -47 (-16.67% of base) : System.Private.CoreLib.dasm - <>c:<.cctor>b__10_0(Object):this (2 methods)         -94 (-15.59% of base) : System.Private.CoreLib.dasm - RuntimeConstructorInfo:ToString():String:this         -72 (-15.52% of base) : System.Private.CoreLib.dasm - HashSet`1:.ctor(int):this (8 methods)        -503 (-14.17% of base) : System.Private.CoreLib.dasm - StringSerializer:GetSerializedString(TimeZoneInfo):String         -13 (-13.68% of base) : System.Private.CoreLib.dasm - CultureInfo:get_CachedCulturesByLcid():Dictionary`2        -207 (-13.13% of base) : System.Private.CoreLib.dasm - ApplicationId:ToString():String:this        -100 (-12.41% of base) : System.Private.CoreLib.dasm - String:JoinCore(ReadOnlySpan`1,ref):String453 total methods with Code Size differences (93 improved, 360 regressed), 25750 unchanged.

450 methods affected, but actually this PR doesn't change anything in codegen for R2R and for JIT withTC_QuickJitForLoops=1. So for the very default mode it only improves inlining inside methods with loops in non-prejitted code. As a bonus - developers will see a better (and closer to reality) inlining on sharplab.io

CSE for Vector.Create:

Top method improvements (bytes): -31 (-1.10% of base) : 21617.dasm - HardwareIntrinsics.RayTracer.Packet256Tracer:GetNaturalColor(System.Runtime.Intrinsics.Vector256`1[Int32],HardwareIntrinsics.RayTracer.VectorPacket256,HardwareIntrinsics.RayTracer.VectorPacket256,HardwareIntrinsics.RayTracer.VectorPacket256,HardwareIntrinsics.RayTracer.Scene):HardwareIntrinsics.RayTracer.VectorPacket256:thisTop method regressions (percentages): 5 ( 0.37% of base) : 6066.dasm - BilinearTest:BilinearInterpol_Vector(System.Double[],System.Double[],double,double,System.Double[],double,double,double):System.Double[]:this

We mostly hoist Vector.Create by hands in the BCL so diffs couldn't find anything. The regression inBilinearInterpol_Vector doesn't look like a real regression - it CSE'd Vector.Create from an inner loop:https://www.diffchecker.com/FvtaAZzj (but was not CSE'd from the outerloop then, see#61420 (comment))

jakobbotsch approved these changes

Nov 10, 2021

View reviewed changes

Copy link

Member

jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM.

Copy link

MemberAuthor

EgorBo commentedNov 10, 2021

Regarding JIT's TP:
#51124 (comment) a similar change for Crossgen2 decreased TP by 1.5% so similar numbers are expected for JIT inTieredCompilation=0 mode. As for the default mode where we only regress in methods with loops the total TP regression should be way lower as not all methods have loops. Moreover, we plan to eventually (hopefully in .NET 7.0) make QuickJitForLoops=1 to be default so this change won't regress anything there.

@jkotas are we fine with it?

EgorBo mentioned this pull request

Nov 10, 2021

Hoisting the invariant out of multi-level nested loops#61420

Closed

jkotas reviewed

Nov 10, 2021

View reviewed changes

src/coreclr/jit/fgbasic.cppShow resolvedHide resolved

Copy link

Member

jkotas commentedNov 10, 2021

@jkotas are we fine with it?

Fine with me.

Remove dead code

1673244

Copy link

Member

AndyAyersMS commentedNov 10, 2021

Perhaps, I should only enable it for methods with loops when TC_QJFL is 0

I don't think we should take any dependence on QJFL in the jit.

AndyAyersMS approved these changes

Nov 10, 2021

View reviewed changes

src/coreclr/jit/gentree.cpp Outdated

		@@ -3964,9 +3966,32 @@ unsigned Compiler::gtSetEvalOrder(GenTree* tree)
		goto DONE;
		}
		}
		#endif

		switch (hwTree->gtHWIntrinsicId)

Copy link

Member

AndyAyersMSNov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Do we need to review costing for HW intrinsics more broadly?

Copy link

Member

tannergoodingDec 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We likely should. We probably aren't accounting for cases where helper intrinsics are more expensive than they appear nor cases where operands have less cost due to special handling that hardware intrinsics get.

There's also probably cases where operands (like scalar DBL_CNS) are currently participating in overall CSE and shouldn't for certain cases.

JulieLeeMSFT added this to the7.0.0 milestone

Nov 11, 2021

JulieLeeMSFT assignedEgorBo

Nov 11, 2021

EgorBo added4 commits

November 15, 2021 12:53

Merge branch 'main' ofhttps://github.com/dotnet/runtimeinto fix-lem…

72fb64e

…ire-issues

Merge branch 'main' ofhttps://github.com/dotnet/runtimeinto fix-lem…

dddf349

…ire-issues

Merge branch 'main' ofhttps://github.com/dotnet/runtimeinto fix-lem…

4f4c709

…ire-issues# Conflicts:#src/coreclr/jit/gentree.cpp

fix typo

80897b4

tannergooding reviewed

Dec 1, 2021

View reviewed changes

src/coreclr/jit/gentree.cpp Outdated

		if (hwTree->gtGetOp1()->OperIsConst() && (hwTree->gtGetOp2() == nullptr))
		{
		// Vector.Create(cns) is cheap but not that cheap to be (1,1)
		costEx = 2;

Copy link

Member

tannergoodingDec 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This should beIND_COST_EX

Copy link

Member

tannergoodingDec 1, 2021•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Actually, it probably needs to be a "bit" more complex.

If all operands are constant and its not representingall bits zero orall bits set then itsIND_COST_EX.

If part of the value isn'tconstant then the cost increases as the number of operands increases. We don't currently, but could eventually, handle "partial constants".

If the value representsall bits zero orall bits set, then its cheaper and its justxor or the relevantcmp SIMD instruction and is special cased by hardware.

Copy link

MemberAuthor

EgorBoDec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@tannergooding yes I had "all zeros/ones" cases in mind but the problem that they complicate code a lot (especially the AllBitsSet case for different types) for a very rare case where usually get_Zero/get_AllBitsSet intrinsics are used. I think it won't hurt if we do CSE more often for these cases or we better move the logic to recognize get_Zero/get_AllBitsSet early in morph/importer and it will work as expected + IR will be simplified earlier.

fix build issue

3368ff2

tannergooding reviewed

Dec 2, 2021

View reviewed changes

src/coreclr/jit/gentree.cpp

		case NI_Vector128_Create:
		#endif
		{
		if ((hwTree->GetOperandCount() == 1) && hwTree->Op(1)->OperIsConst())

Copy link

Member

tannergoodingDec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Why are we only doingOperandCount == 1?

What about the cases whereOperandCount == 2 throughOperandCount == 32? Are those being properly tracked as "expensive" and getting CSEd?

Copy link

MemberAuthor

EgorBoDec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@tannergooding yeah, they are assigned a higher cost automatically due to multiple arguments so the problem doesn't reproduce for them. but that's a good point, I guess Vector128.Create(1,2,3,4,5,6,7,8) currently gets a very high cost while in reality it should still be 3/2

Copy link

Member

tannergoodingDec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think its fine to log an issue for and cover in a separate issue here.

tannergooding approved these changes

Dec 2, 2021

View reviewed changes

Copy link

Member

tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Changes LGTM.

We should log an issue to see if we can track some of the overall costs more accurately, particularly for helper intrinsics or intrinsics which are more expensive than others.

EgorBo added2 commits

December 7, 2021 11:59

Merge branch 'main' ofhttps://github.com/dotnet/runtimeinto fix-lem…

bd38782

…ire-issues

Update gentree.cpp

8967fce

Copy link

MemberAuthor

EgorBo commentedDec 7, 2021

Changes LGTM.
We should log an issue to see if we can track some of the overall costs more accurately, particularly for helper intrinsics or intrinsics which are more expensive than others.

Sure, I'm taking a quick look now, e.g. on arm64 floating point constants are never hoisted currently (unlike x64 we can't "contain" them)