- Notifications
You must be signed in to change notification settings - Fork5.1k
Fix perf issues discovered in "For software performance, can you always trust inlining" blog#61408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
ghost commentedNov 10, 2021
Tagging subscribers to this area:@JulieLeeMSFT Issue DetailsIn"For software performance, can you always trust inlining?" blog post@lemire discovered two issues with the following C# code: [MethodImpl(MethodImplOptions.AggressiveInlining)]unsafestaticboolis_made_of_sixteen_digits(byte*chars){Vector128<sbyte>ascii0=Vector128.Create((sbyte)47);Vector128<sbyte>after_ascii9=Vector128.Create((sbyte)58);Vector128<sbyte>raw=Sse41.LoadDquVector128((sbyte*)chars);vara=Sse2.CompareGreaterThan(raw,ascii0);varb=Sse2.CompareLessThan(raw,after_ascii9);varc=Sse2.Subtract(a,b);// this is not optimalreturn(Sse41.TestZ(c,c));}unsafestaticintParseNumberString(byte*p,byte*pend){if((p+16<=pend)&&is_made_of_sixteen_digits(p))if((p+32<=pend)&&is_made_of_sixteen_digits(p+16))return2;return1;return0;}
This PR fixes both issues, new codegen:https://www.diffchecker.com/oFYTlikO (and 1. InliningFirst, I enabled "call" opcode resolving for TieredCompilation=0 mode or/and for methods with loops which currently bypass Tier0. Now
Will see how this impacts JIT throughput in TC=0 mode. 2. Enable CSE for |
Author: | EgorBo |
---|---|
Assignees: | - |
Labels: |
|
Milestone: | - |
EgorBo commentedNov 10, 2021 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Inliner diffs (jit-diff tool, --pmi mode):
450 methods affected, but actually this PR doesn't change anything in codegen for R2R and for JIT with CSE for Vector.Create:
We mostly hoist Vector.Create by hands in the BCL so diffs couldn't find anything. The regression in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM.
Regarding JIT's TP: @jkotas are we fine with it? |
Uh oh!
There was an error while loading.Please reload this page.
Fine with me. |
I don't think we should take any dependence on QJFL in the jit. |
src/coreclr/jit/gentree.cpp Outdated
@@ -3964,9 +3966,32 @@ unsigned Compiler::gtSetEvalOrder(GenTree* tree) | |||
goto DONE; | |||
} | |||
} | |||
#endif | |||
switch (hwTree->gtHWIntrinsicId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Do we need to review costing for HW intrinsics more broadly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
We likely should. We probably aren't accounting for cases where helper intrinsics are more expensive than they appear nor cases where operands have less cost due to special handling that hardware intrinsics get.
There's also probably cases where operands (like scalar DBL_CNS) are currently participating in overall CSE and shouldn't for certain cases.
…ire-issues# Conflicts:#src/coreclr/jit/gentree.cpp
src/coreclr/jit/gentree.cpp Outdated
if (hwTree->gtGetOp1()->OperIsConst() && (hwTree->gtGetOp2() == nullptr)) | ||
{ | ||
// Vector.Create(cns) is cheap but not that cheap to be (1,1) | ||
costEx = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This should beIND_COST_EX
tannergoodingDec 1, 2021 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Actually, it probably needs to be a "bit" more complex.
If all operands are constant and its not representingall bits zero
orall bits set
then itsIND_COST_EX
.
If part of the value isn'tconstant
then the cost increases as the number of operands increases. We don't currently, but could eventually, handle "partial constants".
If the value representsall bits zero
orall bits set
, then its cheaper and its justxor
or the relevantcmp
SIMD instruction and is special cased by hardware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
@tannergooding yes I had "all zeros/ones" cases in mind but the problem that they complicate code a lot (especially the AllBitsSet case for different types) for a very rare case where usually get_Zero/get_AllBitsSet intrinsics are used. I think it won't hurt if we do CSE more often for these cases or we better move the logic to recognize get_Zero/get_AllBitsSet early in morph/importer and it will work as expected + IR will be simplified earlier.
case NI_Vector128_Create: | ||
#endif | ||
{ | ||
if ((hwTree->GetOperandCount() == 1) && hwTree->Op(1)->OperIsConst()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Why are we only doingOperandCount == 1
?
What about the cases whereOperandCount == 2
throughOperandCount == 32
? Are those being properly tracked as "expensive" and getting CSEd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
@tannergooding yeah, they are assigned a higher cost automatically due to multiple arguments so the problem doesn't reproduce for them. but that's a good point, I guess Vector128.Create(1,2,3,4,5,6,7,8) currently gets a very high cost while in reality it should still be 3/2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think its fine to log an issue for and cover in a separate issue here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Changes LGTM.
We should log an issue to see if we can track some of the overall costs more accurately, particularly for helper intrinsics or intrinsics which are more expensive than others.
Sure, I'm taking a quick look now, e.g. on arm64 floating point constants are never hoisted currently (unlike x64 we can't "contain" them) |
EgorBo commentedDec 7, 2021 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Windows Arm64 Checked seems to be failing everywhere and is unrelated to this PR. |
Uh oh!
There was an error while loading.Please reload this page.
In"For software performance, can you always trust inlining?" blog post@lemire discovered two issues with the following C# code:
is_made_of_sixteen_digits
is not inlined intoParseNumberString
without[AggressiveInlining]
Vector128.Create()
are not CSE'd after inlining.This PR fixes both issues, new codegen:https://www.diffchecker.com/oFYTlikO (and
[AggressiveInlining]
is not needed any more)1. Inlining
Actually
is_made_of_sixteen_digits
is inlined here in .NET 6.0 without any changes, but only if promoted from tier0 to tier1 naturally. I enabled "call" opcode resolving for TieredCompilation=0 mode or/and for methods with loops which currently bypass Tier0. Nowis_made_of_sixteen_digits
is always inlineable because inliner understands what those calls are:Will see how this impacts JIT throughput in TC=0 mode. Perhaps, I should only enable it for methods with loops when TC_QJFL is 0 (default).
2. Enable CSE for
Vector128.Create(42)
We do support CSE for simd operations but it turns out
Vector128.Create(42)
had super low "cost" (1) so CSE used to always gave up on it, e.g.:After Inlining CSE refused to optimize Create(42):
After my change:
Codegen diff for
Caller
:@dotnet/jit-contrib@jakobbotsch (CSE-area owner)@tannergooding
Will post diffs a bit later