Accelerate Vector128<long>::op_Multiply on x64 #103555
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
EgorBo commented Jun 17, 2024
Note: results should be better if we do it in the JIT; that will enable loop hoisting, CSE, etc. for MUL.
neon-sunset commented Jun 17, 2024
Note #103539 (comment) (and https://godbolt.org/z/eqsrf341M) from the xxHash128 issue.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128_1.cs (outdated, resolved)
…sics/Vector128_1.cs
Co-authored-by: Tanner Gooding <tagoo@outlook.com>
EgorBo commented Jun 20, 2024
@EgorBot -amd -intel -arm64 -profiler --envvars DOTNET_PreferredVectorBitWidth:128

    using System.IO.Hashing;
    using BenchmarkDotNet.Attributes;

    public class Bench
    {
        static readonly byte[] Data = new byte[1000000];

        [Benchmark]
        public byte[] BenchXxHash128()
        {
            XxHash128 hash = new();
            hash.Append(Data);
            return hash.GetHashAndReset();
        }
    }
EgorBot commented Jun 20, 2024
Benchmark results on Intel
Flame graphs: Main vs PR 🔥 For clean
EgorBot commented Jun 20, 2024
Benchmark results on Amd
Flame graphs: Main vs PR 🔥 For clean
EgorBot commented Jun 20, 2024
Benchmark results on Arm64
Flame graphs: Main vs PR 🔥 For clean
EgorBo commented Jun 21, 2024
/azp list
EgorBo commented Jun 21, 2024
/azp run runtime-coreclr jitstress-isas-x86
Azure Pipelines successfully started running 1 pipeline(s).
EgorBo commented Jun 21, 2024 • edited
@tannergooding PTAL, I'll add arm64 separately, need to test different impls. Benchmark improvement: #103555 (comment)
    // Vector256<int> tmp3 = Avx2.HorizontalAdd(tmp2.AsInt32(), Vector256<int>.Zero);
    GenTreeHWIntrinsic* tmp3 =
        gtNewSimdHWIntrinsicNode(type, tmp2, gtNewZeroConNode(type),
                                 is256 ? NI_AVX2_HorizontalAdd : NI_SSSE3_HorizontalAdd,
                                 CORINFO_TYPE_UINT, simdSize);
I know in other places we've started avoiding hadd in favor of shuffle+add, might be worth seeing if that's appropriate here too (low priority, non-blocking)
I tried benchmarking different implementations for it and they were all equally fast, e.g. #99871 (comment)
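To make the hadd versus shuffle+add alternative concrete, here is a minimal scalar sketch in C++ of what each option computes on four 32-bit lanes of a 128-bit vector. It is only a lane-level model, not the JIT code and not a performance claim; the `Lanes` alias and both function names are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 128-bit vector viewed as four 32-bit lanes.
using Lanes = std::array<uint32_t, 4>;

// Models HorizontalAdd(x, zero): adjacent pairs of x are summed and packed
// into the two low lanes; the lanes that come from the zero operand stay 0.
inline Lanes horizontal_add_with_zero(const Lanes& x)
{
    return { x[0] + x[1], x[2] + x[3], 0u, 0u };
}

// Models the alternative: shuffle so each adjacent pair is swapped, then do a
// lane-wise add. Every lane then holds the sum of its own pair.
inline Lanes swap_pairs_then_add(const Lanes& x)
{
    const Lanes swapped = { x[1], x[0], x[3], x[2] };
    Lanes sum{};
    for (std::size_t i = 0; i < sum.size(); ++i)
    {
        sum[i] = x[i] + swapped[i]; // wraps modulo 2^32, like the SIMD add
    }
    return sum;
}
```

Both forms produce the same pair sums; they differ only in which lanes the sums land in, so whatever shuffle or mask follows has to be adjusted accordingly.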
    if (TARGET_POINTER_SIZE == 4)
    {
        // TODO-XARCH-CQ: We should support long/ulong multiplication
        // TODO-XARCH-CQ: 32bit support
What's blocking 32-bit support? It doesn't look like we're using any _X64 intrinsics in the fallback logic?
Not sure to be honest, that check was pre-existing, I only changed the comment
This PR optimizes Vector128 and Vector256 multiplication for long/ulong when AVX512 is not available on the system. It makes XxHash128 faster, see #103555 (comment).
Current codegen on an x64 CPU without AVX512:
New codegen:
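As a rough scalar model of the idea (not the emitted codegen), a 64-bit lane multiply can be assembled from 32-bit multiplies when no native 64-bit lane multiply is available. The sketch below is written in C++ under that assumption; the helper name `mul64_via_32` is hypothetical and the code illustrates the general arithmetic identity rather than the exact instruction sequence the JIT produces.

```cpp
#include <cstdint>

// Hypothetical helper: computes (a * b) mod 2^64 using only products of
// 32-bit halves, which is what a per-lane SIMD sequence has to do when the
// hardware lacks a 64-bit lane multiply.
inline uint64_t mul64_via_32(uint64_t a, uint64_t b)
{
    const uint64_t aLo = a & 0xFFFFFFFFu;
    const uint64_t aHi = a >> 32;
    const uint64_t bLo = b & 0xFFFFFFFFu;
    const uint64_t bHi = b >> 32;

    // Full 64-bit product of the two low halves (a 32x32 -> 64 multiply).
    const uint64_t lowProduct = aLo * bLo;

    // The cross products are shifted left by 32, so only their low 32 bits
    // can affect the result; summing them modulo 2^32 is sufficient.
    const uint32_t cross =
        static_cast<uint32_t>(aLo * bHi) + static_cast<uint32_t>(aHi * bLo);

    // aHi * bHi would be shifted by 64 bits and contributes nothing.
    return lowProduct + (static_cast<uint64_t>(cross) << 32);
}
```

In a vectorized form, summing the two cross products per lane is the step where a horizontal add (or an equivalent shuffle plus add) comes in.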