
JIT: Add support for byte/sbyte SIMD multiplication on XArch #86811


Merged

Conversation

@BladeWise
Contributor

The PR adds support for SIMD-based byte/sbyte multiplication on XArch.

Given that there is no specific hardware intrinsic, hardware acceleration is achieved by widening/narrowing vectors, depending on the supported ISA.

The basic algorithm is the following:

  • In case of a Vector256<byte>/Vector256<sbyte> input, if Avx512BW is supported, the vectors are widened to Vector512<ushort>/Vector512<short>, multiplied, and then narrowed back.
  • In case of a Vector128<byte>/Vector128<sbyte> input, if Avx2 is supported, the vectors are widened to Vector256<ushort>/Vector256<short>, multiplied, and then narrowed back.
  • In any other case, the vectors are split into halves (lower/upper) and widened to vectors of the same size as the input (e.g., one Vector128<byte> produces two Vector128<ushort>, each holding half of the original elements); each half is multiplied with its counterpart and the results are narrowed into a single vector.
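The third case (the split/widen/narrow fallback) can be sketched in C# with the public Vector128 APIs. This is only an illustrative managed equivalent of the semantics, not the JIT's actual IR expansion; the helper name is made up for the example:

```csharp
using System.Runtime.Intrinsics;

static class ByteSimdMultiplySketch
{
    // Illustrative sketch: widen each half of the byte vectors to ushort,
    // multiply the widened halves, then narrow back into a single vector.
    // Vector128.Narrow truncates each 16-bit product to its low byte, which
    // matches the wraparound semantics of byte multiplication.
    public static Vector128<byte> Multiply(Vector128<byte> v1, Vector128<byte> v2)
    {
        Vector128<ushort> lower = Vector128.WidenLower(v1) * Vector128.WidenLower(v2);
        Vector128<ushort> upper = Vector128.WidenUpper(v1) * Vector128.WidenUpper(v2);
        return Vector128.Narrow(lower, upper);
    }
}
```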

SPMI reports some regressions, but as far as I could see the previous implementation (software fallback) was not inlined, while the current one is.

The following code

public static Vector128<byte> MultiplyB128(Vector128<byte> v1, Vector128<byte> v2) => v1 * v2;

produces this codegen:

G_M65351_IG01:        ;; offset=0000H
       vzeroupper
                                         ;; size=3 bbWeight=1 PerfScore 1.00
G_M65351_IG02:        ;; offset=0003H
       vpmovzxbw ymm0, qword ptr [rdx]
       vpmovzxbw ymm1, qword ptr [r8]
       vpmullw   ymm0, ymm0, ymm1
       vpand     ymm0, ymm0, ymmword ptr [reloc @RWD00]
       vpackuswb ymm0, ymm0, ymm0
       vpermq    ymm0, ymm0, -40
       vmovups   xmmword ptr [rcx], xmm0
       mov       rax, rcx
                                         ;; size=39 bbWeight=1 PerfScore 21.25
G_M65351_IG03:        ;; offset=002AH
       vzeroupper
       ret
                                         ;; size=4 bbWeight=1 PerfScore 2.00
RWD00  dq 00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh
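A note on the listing above: vpackuswb performs saturating narrowing rather than truncation, which is why each 16-bit product is first masked with 0x00FF (the RWD00 constant) so that no lane exceeds a byte and the pack degenerates into a truncation. Per lane, the result is the wrapped product, as in this scalar sketch (illustrative only):

```csharp
// Scalar model of a single lane: byte multiplication wraps modulo 256,
// so only the low byte of the 16-bit product survives. The explicit
// "& 0xFF" mirrors the vpand mask in the codegen above.
static byte MultiplyLane(byte a, byte b) => (byte)((a * b) & 0xFF);
```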

Implement SIMD-based byte/sbyte multiplication for Vector128/256/512 through widening/narrowing
@ghost added community-contribution (Indicates that the PR has been added by a community member) and area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) labels May 26, 2023
@ghost

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@BladeWise
Contributor (Author)

I have updated the PR to fix the issue with AVX512.
The source code got longer to address SIMD16 handling both with and without AVX512BW_VL support.

@kunalspathak
Contributor

@tannergooding

@JulieLeeMSFT
Member

@tannergooding, please review this community contribution and consider if this can be done in C# instead of in JIT.

@BruceForstall
Contributor

BruceForstall left a comment (edited)

@tannergooding Could this code instead be implemented in C# in (for example) the Vector128<T>::operator* fallback implementation, using all the same hardware intrinsics, ISA checking, etc.? Or is there some reason why that either can't or shouldn't be done?

Comment on lines +19782 to +19783
case TYP_BYTE:
case TYP_UBYTE:
Contributor

It looks like there is some precedent for inserting additional IR nodes in this function, but this is quite a lot more than the other cases, and these new cases return without falling through like the others. Is this the right place for this code? Or should this be a separate function that is called before getting here?

@tannergooding
Member

tannergooding commented Jul 7, 2023 (edited)

Could this code instead be implemented in C# in (for example) the Vector128::operator* fallback implementation, using all the same hardware intrinsics, ISA checking, etc.? Or is there some reason why that either can't or shouldn't be done?

It "could" be, but there are certain pessimizations that I believe make doing that the wrong choice for these APIs. Most notably because these APIs are considered "perf critical" and one of the primary ways users interact with the SIMD types.

One of the pessimizations is that the IR produced by inlining still differs and is "sub-optimal" in comparison. It is often inserting additional copies, locals, temporaries, or other handling at the "boundary" just before/after the inlined code. A lot of this has been improved with the introduction of forward sub, but there are still notable edge cases. Likewise, it is still dependent on the inliner and while we've made some improvements to that (particularly in adjusting the budget when many intrinsic calls exist), there are still plenty of cases (especially in more complex methods) where the budget is quickly eaten by APIs like this being in managed and it will simply give up. That is often less problematic for other APIs, but represents a particular concern when you're considering APIs that are exposing primitive functionality used to write all your other algorithms.

We ultimately are generating roughly 6 instructions of very specialized code for what is a baseline arithmetic operation on the SIMD types. If we were designing IL "today", we would potentially have SIMD type support built into the IL, and so this would be the same handling we'd be doing for CEE_MUL in that case. Because it's a baseline arithmetic operation, we want such cases to participate in VN, CSE, potentially in constant folding, to be properly costed, etc.

Notably, it would probably be better for this logic to be in lowering or morph instead and for us to just track it as Vector128_op_Multiply in the front end. That's how we handle other primitive operations that tend to be slightly more expensive on some platforms as well (such as conversions, checked arithmetic, long decomposition, etc). But, as is, it's consistent with the existing handling for other similar scenarios.

Longer term I would definitely like to see the issues called out above addressed, even if that's done via special casing these APIs in the inliner/etc instead. Once we're confident that's the case, then we should look at incrementally moving all similar functionality into managed instead. We might still end up with some bits staying as handled by the JIT (such as for truly simple operations where equivalent handling exists for int, long, float, or double), but it would allow all of simdashwintrinsic to be removed (they are really user defined types that are getting special optimizations today and are not ABI primitives) and many of the more complex helpers (like Dot or Sum) to have their complexity removed from the JIT.

@JulieLeeMSFT added this to the 9.0.0 milestone Jul 31, 2023
@JulieLeeMSFT
Member

Since it is past .NET 8 code complete, setting this to .NET 9.

@ghost removed the needs-author-action (An issue or pull request that requires more info or actions from the author.) label Sep 30, 2023
@BladeWise
Contributor (Author)

@BruceForstall I have merged the branch and fixed the break with current sources.
I will address the other concerns (more comments with examples and specific variables for new nodes) as soon as tests are completed.

@DeepakRajendrakumaran
Contributor

@DeepakRajendrakumaran @anthonycanino @tannergooding @dotnet/avx512-contrib Any additional comments/suggestions on this code?

Sorry for the late reply. Changes look good at first glance. Couldn't find any similar implementation in native compilers to compare disasm against, though.

@ghost removed the needs-author-action (An issue or pull request that requires more info or actions from the author.) label Oct 6, 2023
@BladeWise
Contributor (Author)

@BruceForstall as per your request:

  1. I have re-merged with latest main and addressed other concerns.
  2. More comments have been added to explain what each branch is meant to do, and samples of the equivalent C# code for the relevant statements have been provided. I did not provide an example of the generated ASM because it could be too verbose in some cases. Let me know if you would prefer that I add it anyway, either in the sources or as a comment on this PR.
  3. All new nodes now use new variables, named to explicitly indicate the contents.

@BladeWise
Contributor (Author)

BladeWise commented Oct 7, 2023 (edited)

Test failures seem unrelated to the PR (Build Analysis failure logs report 'No space left on device').
Is it possible to re-run failed checks?

@BruceForstall
Contributor

/azp run runtime-coreclr superpmi-diffs

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall
Contributor

/azp run runtime-coreclr jitstress-isas-avx512, runtime-coreclr jitstress-isas-x86

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@BladeWise
Contributor (Author)

Failing checks are related to #93163.

@BruceForstall
Contributor

BruceForstall left a comment

@BladeWise Thank you for your work on this, and your patience for how long it took to get it in.

And thank you for your contribution to .NET!

@BruceForstall merged commit 9330273 into dotnet:main Oct 9, 2023
@BladeWise
Contributor (Author)

@BruceForstall thank you for the review and support!

@BladeWise deleted the feature/simd-byte-multiply branch October 9, 2023 19:03
@ghost locked as resolved and limited conversation to collaborators Nov 9, 2023

Reviewers

@tannergooding: awaiting requested review from tannergooding

@DeepakRajendrakumaran left review comments

@SingleAccretion left review comments

@BruceForstall approved these changes

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
community-contribution (Indicates that the PR has been added by a community member)

Projects

None yet

Milestone

9.0.0


7 participants

@BladeWise, @kunalspathak, @JulieLeeMSFT, @tannergooding, @BruceForstall, @DeepakRajendrakumaran, @SingleAccretion
