JIT: Add support for byte/sbyte SIMD multiplication on XArch #86811
Conversation
Implement SIMD-based byte/sbyte multiplication for Vector128/256/512 through widening/narrowing
ghost commented May 26, 2023
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

The PR adds support for SIMD-based byte/sbyte multiplication on XArch. Since there is no dedicated hardware intrinsic, hardware acceleration is achieved by widening/narrowing vectors, depending on the supported ISA. The basic algorithm is the following:
SPMI reports some regressions, but as far as I could see the previous implementation (software fallback) was not inlined, while the current one is.

The following code

```csharp
public static Vector128<byte> MultiplyB128(Vector128<byte> v1, Vector128<byte> v2) => v1 * v2;
```

produces this codegen:

```asm
G_M65351_IG01:                ;; offset=0000H
       vzeroupper
                              ;; size=3 bbWeight=1 PerfScore 1.00
G_M65351_IG02:                ;; offset=0003H
       vpmovzxbw ymm0, qword ptr [rdx]
       vpmovzxbw ymm1, qword ptr [r8]
       vpmullw  ymm0, ymm0, ymm1
       vpand    ymm0, ymm0, ymmword ptr [reloc @RWD00]
       vpackuswb ymm0, ymm0, ymm0
       vpermq   ymm0, ymm0, -40
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
                              ;; size=39 bbWeight=1 PerfScore 21.25
G_M65351_IG03:                ;; offset=002AH
       vzeroupper
       ret
                              ;; size=4 bbWeight=1 PerfScore 2.00

RWD00  dq  00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh
```
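For readers tracing the codegen, the same instruction sequence can be sketched with managed AVX2 intrinsics. This is an illustrative reconstruction of the emitted code, not the PR's implementation (which builds the equivalent IR directly in the JIT); the class and method names are invented:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical managed mirror of the AVX2 codegen shown above.
static class ByteMulSketch
{
    public static Vector128<byte> MultiplyAvx2(Vector128<byte> v1, Vector128<byte> v2)
    {
        // vpmovzxbw: zero-extend each byte to a 16-bit lane
        Vector256<short> w1 = Avx2.ConvertToVector256Int16(v1);
        Vector256<short> w2 = Avx2.ConvertToVector256Int16(v2);

        // vpmullw: 16-bit multiply, keeping the low 16 bits of each product
        Vector256<short> product = Avx2.MultiplyLow(w1, w2);

        // vpand with 0x00FF (the RWD00 constant): clear the high byte of each
        // lane so the saturating pack below behaves like a truncating narrow
        Vector256<short> masked = Avx2.And(product, Vector256.Create((short)0x00FF));

        // vpackuswb: pack 16-bit lanes back to bytes, per 128-bit lane
        Vector256<byte> packed = Avx2.PackUnsignedSaturate(masked, masked);

        // vpermq with control -40 (0xD8): undo the per-lane interleaving of the pack
        Vector256<ulong> reordered = Avx2.Permute4x64(packed.AsUInt64(), 0xD8);
        return reordered.GetLower().AsByte();
    }
}
```

Note the `vpand`/`vpackuswb` pair: packing saturates, so the products are masked to their low byte first to obtain truncating (wrap-around) byte semantics.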
Checked codegen only, no AVX512 tests run (hardware unavailable)
BladeWise commented May 29, 2023
I have updated the PR to fix the issue with AVX512.
kunalspathak commented Jun 5, 2023
JulieLeeMSFT commented Jun 19, 2023
@tannergooding, please review this community contribution and consider if this can be done in C# instead of in the JIT.
BruceForstall left a comment • edited
@tannergooding Could this code instead be implemented in C# in (for example) the `Vector128<T>` `operator *` fallback implementation, using all the same hardware intrinsics, ISA checking, etc.? Or is there some reason why that either can't or shouldn't be done?
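For context on what such a managed fallback might look like: with the portable `Vector128.Widen`/`Vector128.Narrow` helpers, the widen-multiply-narrow idea from this PR can be expressed in a few lines. This is a sketch of the suggestion (names are mine), not code from the PR:

```csharp
using System.Runtime.Intrinsics;

// Hypothetical managed fallback for element-wise byte multiplication.
static class PortableByteMul
{
    // Widen to 16-bit lanes, multiply, and narrow back. Vector128.Narrow
    // truncates each ushort to its low byte, matching the wrap-around
    // semantics of byte multiplication.
    public static Vector128<byte> Multiply(Vector128<byte> left, Vector128<byte> right)
    {
        (Vector128<ushort> leftLower, Vector128<ushort> leftUpper) = Vector128.Widen(left);
        (Vector128<ushort> rightLower, Vector128<ushort> rightUpper) = Vector128.Widen(right);
        return Vector128.Narrow(leftLower * rightLower, leftUpper * rightUpper);
    }
}
```

Whether the inliner reliably turns a fallback like this into codegen as tight as the JIT-expanded version is the trade-off discussed in the replies.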
```cpp
case TYP_BYTE:
case TYP_UBYTE:
```
It looks like there is some precedent for inserting additional IR nodes in this function, but this is quite a lot more than the other cases, and these new cases return without falling through like the others. Is this the right place for this code? Or should this be a separate function that is called before getting here?
tannergooding commented Jul 7, 2023 • edited
It "could" be, but there are certain pessimizations that I believe make doing that the wrong choice for these APIs. Most notably because these APIs are considered "perf critical" and are one of the primary ways users interact with the SIMD types.

One of the pessimizations is that the IR produced by inlining still differs and is "sub-optimal" in comparison. It often inserts additional copies, locals, temporaries, or other handling at the "boundary" just before/after the inlined code. A lot of this has been improved with the introduction of forward sub, but there are still notable edge cases.

Likewise, it is still dependent on the inliner, and while we've made some improvements there (particularly in adjusting the budget when many intrinsic calls exist), there are still plenty of cases (especially in more complex methods) where the budget is quickly eaten by APIs like this being in managed and it will simply give up. That is often less problematic for other APIs, but represents a particular concern when you're considering APIs that expose primitive functionality used to write all your other algorithms.

We ultimately are generating roughly 6 instructions of very specialized code for what is a baseline arithmetic operation on the SIMD types. If we were designing IL "today", we would potentially have SIMD type support built into the IL and so this would be the same handling we'd be doing for

Notably, it would probably be better for this logic to be in lowering or morph instead and for us to just track it as

Longer term I would definitely like to see the issues called out above addressed, even if that's done via special casing these APIs in the inliner/etc instead. Once we're confident that's the case, then we should look at incrementally moving all similar functionality into managed instead. We might still end up with some bits staying as handled by the JIT (such as for truly simple operations where equivalent handling exists for
JulieLeeMSFT commented Jul 31, 2023
Since it is past .NET 8 code complete, setting this to .NET 9.
BladeWise commented Sep 30, 2023
@BruceForstall I have merged the branch and fixed the break with current sources.
DeepakRajendrakumaran commented Oct 2, 2023
Sorry for the late reply. Changes look good at first glance. Couldn't find any similar implementation in native compilers to compare disasm though.
BladeWise commented Oct 6, 2023
@BruceForstall as per your request:
BladeWise commented Oct 7, 2023 • edited
Test failures seem unrelated to the PR (Build Analysis failure logs report 'No space left on device').
BruceForstall commented Oct 9, 2023
/azp run runtime-coreclr superpmi-diffs
Azure Pipelines successfully started running 1 pipeline(s).
BruceForstall commented Oct 9, 2023
/azp run runtime-coreclr jitstress-isas-avx512, runtime-coreclr jitstress-isas-x86
Azure Pipelines successfully started running 2 pipeline(s).
BladeWise commented Oct 9, 2023
Failing checks are related to #93163.
BruceForstall left a comment
@BladeWise Thank you for your work on this, and your patience for how long it took to get it in.
And thank you for your contribution to .NET!
BladeWise commented Oct 9, 2023
@BruceForstall thanks to you for the review and support!
The PR adds support for SIMD-based byte/sbyte multiplication on XArch.
Since there is no dedicated hardware intrinsic, hardware acceleration is achieved by widening/narrowing vectors, depending on the supported ISA.
The basic algorithm is the following:
- For a Vector256<byte>/Vector256<sbyte> input, if Avx512BW is supported, the vectors are widened to Vector512<ushort>/Vector512<short>, multiplied, and then narrowed back.
- For a Vector128<byte>/Vector128<sbyte> input, if Avx2 is supported, the vectors are widened to Vector256<ushort>/Vector256<short>, multiplied, and then narrowed back.
- Otherwise, each vector is widened in place (widening a Vector128<byte> produces two Vector128<ushort>, each holding half of the original vector), each half is multiplied with its counterpart, and the results are narrowed into a single vector.

SPMI reports some regressions, but as far as I could see the previous implementation (software fallback) was not inlined, while the current one is.
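As a usage-level illustration (values are hypothetical, assuming .NET 7+ where `Vector128<byte>` defines `operator *`), element-wise byte multiplication wraps modulo 256, which is the semantics the widen/multiply/narrow sequence preserves by keeping only the low byte of each 16-bit product:

```csharp
using System;
using System.Runtime.Intrinsics;

Vector128<byte> a = Vector128.Create((byte)250);
Vector128<byte> b = Vector128.Create((byte)2);
Vector128<byte> p = a * b; // each lane: (byte)(250 * 2) = (byte)500 = 244
Console.WriteLine(p.GetElement(0)); // 244
```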