- Notifications
You must be signed in to change notification settings - Fork5.2k
Optimize Vector128<long> multiplication for arm64#104177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
EgorBo commentedJun 28, 2024
@EgorBot -arm64 -profiler usingSystem.IO.Hashing;usingBenchmarkDotNet.Attributes;publicclassBench{staticreadonlybyte[]Data=newbyte[1000000];[Benchmark]publicbyte[]BenchXxHash128(){XxHash128hash=new();hash.Append(Data);returnhash.GetHashAndReset();}} |
Tagging subscribers to this area:@JulieLeeMSFT,@jakobbotsch |
Uh oh!
There was an error while loading.Please reload this page.
| case TYP_LONG: | ||
| case TYP_ULONG: | ||
| { | ||
| assert(simdSize == 16); | ||
| // Make op1 and op2 multi-use: | ||
| GenTree* op1Dup = fgMakeMultiUse(&op1); | ||
| GenTree* op2Dup = fgMakeMultiUse(&op2); | ||
| // long left0 = op1.GetElement(0) | ||
| // long left1 = op1.GetElement(1) | ||
| GenTree* left0 = gtNewSimdGetElementNode(TYP_LONG, op1, gtNewIconNode(0), simdBaseJitType, 16); | ||
| GenTree* left1 = gtNewSimdGetElementNode(TYP_LONG, op1Dup, gtNewIconNode(1), simdBaseJitType, 16); | ||
| // long right0 = op2.GetElement(0) | ||
| // long right1 = op2.GetElement(1) | ||
| GenTree* right0 = gtNewSimdGetElementNode(TYP_LONG, op2, gtNewIconNode(0), simdBaseJitType, 16); | ||
| GenTree* right1 = gtNewSimdGetElementNode(TYP_LONG, op2Dup, gtNewIconNode(1), simdBaseJitType, 16); | ||
| // Vector128<long> vec = Vector128.Create(left0 * right0, left1 * right1) | ||
| op1 = gtNewOperNode(GT_MUL, TYP_LONG, left0, right0); | ||
| op2 = gtNewOperNode(GT_MUL, TYP_LONG, left1, right1); | ||
| GenTree* vec = gtNewSimdCreateScalarUnsafeNode(TYP_SIMD16, op1, simdBaseJitType, 16); | ||
| return gtNewSimdHWIntrinsicNode(TYP_SIMD16, vec, gtNewIconNode(1), op2, NI_AdvSimd_Insert, | ||
| simdBaseJitType, 16); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Is this just avoiding the cost of inlining, unrolling, and simplifying the work the JIT would have to do?
EgorBot commentedJun 28, 2024
Benchmark results on Arm64
Flame graphs:Main vsPR 🔥 For clean |
neon-sunset commentedJul 1, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I wanted to ask is there a reason LLVM's codegen variant did not work? On some cores, |
src/coreclr/jit/gentree.cpp Outdated
| return gtNewSimdHWIntrinsicNode(type, vec, gtNewIconNode(1), op2, NI_AdvSimd_Insert, | ||
| simdBaseJitType, 16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
nit: UsegtNewSimdWithElementNode(type, vec, gtNewIconNode(1), op2, simdBaseJitType, simdSize) which ensures all the optimal handling takes place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks! Applied
| op1 = gtNewBitCastNode(TYP_LONG, op1); | ||
| op2 = gtNewBitCastNode(TYP_LONG, op2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Why bitcast instead ofToScalar? If this is generating better code, it seems like a pretty "core" scenario we're not handling from theToScalar path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
@tannergooding because op2 can be either 8-byteTYP_SIMD8 or 8-byte scalar (TYP_LONG) so bitcast allowed me to simplify handling. In my initial version I forgot that this path is used for bothMUL(vector, vector) andMUL(vector, scalar) (wherescalar is broadcasted)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Ah, that makes sense, 👍
Follow up to#103555 for arm64