Accelerate Vector128<long>::op_Multiply on x64 #103555


Merged
EgorBo merged 21 commits into dotnet:main from EgorBo:arm-mul-64bit
Jun 28, 2024

Conversation

@EgorBo (Member) commented Jun 17, 2024 (edited):

This PR optimizes Vector128 and Vector256 multiplication for long/ulong when AVX512 is not available on the system. It makes XxHash128 faster, see #103555 (comment)

public Vector128<long> Foo(Vector128<long> a, Vector128<long> b) => a * b;

Current codegen on x64 cpu without AVX512:

; Method MyBench:Foo
       push     rsi
       push     rbx
       sub      rsp, 104
       mov      rbx, rdx
       mov      rdx, qword ptr [r8]
       mov      qword ptr [rsp+0x58], rdx
       mov      rdx, qword ptr [r9]
       mov      qword ptr [rsp+0x50], rdx
       mov      rdx, qword ptr [rsp+0x58]
       imul     rdx, qword ptr [rsp+0x50]
       mov      qword ptr [rsp+0x60], rdx
       mov      rsi, qword ptr [rsp+0x60]
       mov      rdx, qword ptr [r8+0x08]
       mov      qword ptr [rsp+0x40], rdx
       mov      rdx, qword ptr [r9+0x08]
       mov      qword ptr [rsp+0x38], rdx
       mov      rcx, qword ptr [rsp+0x40]
       mov      rdx, qword ptr [rsp+0x38]
       call     [System.Runtime.Intrinsics.Scalar`1[long]:Multiply(long,long):long]   ;;; not inlined call!
       mov      qword ptr [rsp+0x48], rax
       mov      rax, qword ptr [rsp+0x48]
       mov      qword ptr [rsp+0x20], rsi
       mov      qword ptr [rsp+0x28], rax
       vmovaps  xmm0, xmmword ptr [rsp+0x20]
       vmovups  xmmword ptr [rbx], xmm0
       mov      rax, rbx
       add      rsp, 104
       pop      rbx
       pop      rsi
       ret
; Total bytes of code: 120

New codegen:

; Method MyBench:Foo
       vmovups  xmm0, xmmword ptr [r8]
       vmovups  xmm1, xmmword ptr [r9]
       vpmuludq xmm2, xmm1, xmm0
       vpshufd  xmm1, xmm1, -79
       vpmulld  xmm0, xmm1, xmm0
       vxorps   xmm1, xmm1, xmm1
       vphaddd  xmm0, xmm0, xmm1
       vpshufd  xmm0, xmm0, 115
       vpaddq   xmm0, xmm0, xmm2
       vmovups  xmmword ptr [rdx], xmm0
       mov      rax, rdx
       ret
; Total bytes of code: 50
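
For readers following the new sequence, here is a portable scalar sketch in C++ of what each instruction appears to contribute. It is not code from this PR. It relies on the identity a*b mod 2^64 = lo(a)*lo(b) + ((lo(a)*hi(b) + hi(a)*lo(b)) << 32), where lo and hi are the 32-bit halves of each 64-bit lane. The types and helper names (`U32x4`, `shuffle32`, `mul_even_widening`, and so on) are made up for illustration and only emulate `vpshufd`, `vpmuludq`, `vpmulld`, `vphaddd` and `vpaddq` on one 128-bit register. The immediates -79 and 115 are written as the bytes 0xB1 and 0x73.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper types: one 128-bit register viewed as 4x32 or 2x64 bits.
using U32x4 = std::array<std::uint32_t, 4>;
using U64x2 = std::array<std::uint64_t, 2>;

// Reinterpret two 64-bit lanes as four 32-bit elements (little-endian order).
static U32x4 as_u32(const U64x2& v) {
    return {static_cast<std::uint32_t>(v[0]), static_cast<std::uint32_t>(v[0] >> 32),
            static_cast<std::uint32_t>(v[1]), static_cast<std::uint32_t>(v[1] >> 32)};
}

static U64x2 as_u64(const U32x4& v) {
    return {v[0] | (static_cast<std::uint64_t>(v[1]) << 32),
            v[2] | (static_cast<std::uint64_t>(v[3]) << 32)};
}

// vpshufd: each 2-bit field of imm picks the source element for one destination slot.
static U32x4 shuffle32(const U32x4& v, std::uint8_t imm) {
    return {v[imm & 3], v[(imm >> 2) & 3], v[(imm >> 4) & 3], v[(imm >> 6) & 3]};
}

// vpmuludq: widening multiply of the low 32-bit element of each 64-bit lane.
static U64x2 mul_even_widening(const U32x4& x, const U32x4& y) {
    return {static_cast<std::uint64_t>(x[0]) * y[0],
            static_cast<std::uint64_t>(x[2]) * y[2]};
}

// vpmulld: element-wise multiply that keeps only the low 32 bits.
static U32x4 mul_low32(const U32x4& x, const U32x4& y) {
    U32x4 r{};
    for (int i = 0; i < 4; ++i)
        r[i] = static_cast<std::uint32_t>(static_cast<std::uint64_t>(x[i]) * y[i]);
    return r;
}

// vphaddd with a zero second operand: adjacent pairs are summed into slots 0 and 1.
static U32x4 hadd32_with_zero(const U32x4& x) {
    return {static_cast<std::uint32_t>(x[0] + x[1]),
            static_cast<std::uint32_t>(x[2] + x[3]), 0u, 0u};
}

U64x2 multiply_u64x2(const U64x2& a, const U64x2& b) {
    const U32x4 a32 = as_u32(a);
    const U32x4 b32 = as_u32(b);
    const U64x2 lo    = mul_even_widening(b32, a32);   // vpmuludq xmm2, xmm1, xmm0
    const U32x4 bswap = shuffle32(b32, 0xB1);          // vpshufd  xmm1, xmm1, -79
    const U32x4 cross = mul_low32(bswap, a32);         // vpmulld  xmm0, xmm1, xmm0
    const U32x4 sums  = hadd32_with_zero(cross);       // vxorps + vphaddd
    const U64x2 hi    = as_u64(shuffle32(sums, 0x73)); // vpshufd  xmm0, xmm0, 115
    return {lo[0] + hi[0], lo[1] + hi[1]};             // vpaddq
}

// Sanity check against the built-in wrapping 64-bit multiply.
void check_multiply(const U64x2& a, const U64x2& b) {
    const U64x2 r = multiply_u64x2(a, b);
    assert(r[0] == a[0] * b[0] && r[1] == a[1] * b[1]);
}
```

Calling `check_multiply` with arbitrary lane values exercises the decomposition. Only the low 32 bits of the cross terms matter, because they are shifted into the upper half of each lane before the final add.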

@dotnet-policy-service (Contributor):

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@EgorBo (Member, Author):

Note: results should be better if we do this in the JIT, since that will enable loop hoisting, CSE, etc. for the multiplication.

@neon-sunset (Contributor):

Note #103539 (comment) (and https://godbolt.org/z/eqsrf341M) from the xxHash128 issue.

EgorBo and others added 2 commits June 17, 2024 17:01
…sics/Vector128_1.cs Co-authored-by: Tanner Gooding <tagoo@outlook.com>
dotnet deleted a comment from EgorBot Jun 20, 2024
dotnet deleted a comment from EgorBot Jun 20, 2024
@EgorBo (Member, Author):

@EgorBot -amd -intel -arm64 -profiler --envvars DOTNET_PreferredVectorBitWidth:128

using System.IO.Hashing;
using BenchmarkDotNet.Attributes;

public class Bench
{
    static readonly byte[] Data = new byte[1000000];

    [Benchmark]
    public byte[] BenchXxHash128()
    {
        XxHash128 hash = new();
        hash.Append(Data);
        return hash.GetHashAndReset();
    }
}

@EgorBot:

Benchmark results on Intel

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-ITXSAG : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XSORFZ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128

Method          Toolchain  Mean      Error     Ratio
BenchXxHash128  Main       43.41 μs  0.087 μs  1.00
BenchXxHash128  PR         43.33 μs  0.009 μs  1.00

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBot:

Benchmark results on Amd

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores
  Job-SUBLYH : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-OPUYDY : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128

Method          Toolchain  Mean      Error     Ratio
BenchXxHash128  Main       71.20 μs  0.022 μs  1.00
BenchXxHash128  PR         43.84 μs  0.013 μs  0.62

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR


@EgorBot:

Benchmark results on Arm64

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-EDPWDU : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-TIALUR : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128

Method          Toolchain  Mean      Error    Ratio
BenchXxHash128  Main       116.9 μs  0.11 μs  1.00
BenchXxHash128  PR         116.8 μs  0.07 μs  1.00

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR


@EgorBo (Member, Author):

/azp list

@azure-pipelines

This comment was marked as resolved.

@EgorBo (Member, Author):

/azp run runtime-coreclr jitstress-isas-x86

@azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

@EgorBo (Member, Author) commented Jun 21, 2024 (edited):

@tannergooding PTAL. I'll add arm64 separately; I need to test different implementations first.
I've expanded it in the importer, similar to the existing op_Multiply expansions.

Benchmark improvement: #103555 (comment)

EgorBo marked this pull request as ready for review June 24, 2024 14:26
Comment on lines +21627 to +21631

        // Vector256<int> tmp3 = Avx2.HorizontalAdd(tmp2.AsInt32(), Vector256<int>.Zero);
        GenTreeHWIntrinsic* tmp3 =
            gtNewSimdHWIntrinsicNode(type, tmp2, gtNewZeroConNode(type),
                                     is256 ? NI_AVX2_HorizontalAdd : NI_SSSE3_HorizontalAdd,
                                     CORINFO_TYPE_UINT, simdSize);


I know in other places we've started avoiding hadd in favor of shuffle+add; it might be worth seeing if that's appropriate here too (low priority, non-blocking).

@EgorBo (Member, Author):
I tried benchmarking different implementations of it and they were all equally fast, e.g. #99871 (comment)
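
To make the two options concrete, here is a scalar C++ sketch (not code from this PR, and not necessarily the shuffle+add form the reviewer had in mind). It compares the hadd-then-shuffle step of the emitted sequence with one hypothetical shuffle+add replacement: swap neighbouring 32-bit elements, add, then shift each 64-bit lane left by 32. `U32x4`, `shuffle32`, `cross_sums_hadd` and `cross_sums_shuffle_add` are illustrative names only.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper type: one 128-bit register viewed as four 32-bit elements.
using U32x4 = std::array<std::uint32_t, 4>;

// pshufd-style shuffle: each 2-bit field of imm picks the source element for one slot.
static U32x4 shuffle32(const U32x4& v, std::uint8_t imm) {
    return {v[imm & 3], v[(imm >> 2) & 3], v[(imm >> 4) & 3], v[(imm >> 6) & 3]};
}

// Variant A: horizontal add against zero, then shuffle 0x73 (decimal 115)
// to move the two pair sums into the high half of each 64-bit lane.
U32x4 cross_sums_hadd(const U32x4& x) {
    const U32x4 sums = {static_cast<std::uint32_t>(x[0] + x[1]),
                        static_cast<std::uint32_t>(x[2] + x[3]), 0u, 0u};
    return shuffle32(sums, 0x73);
}

// Variant B (assumed alternative): swap neighbours with shuffle 0xB1, add,
// then emulate a 64-bit left shift by 32 on each lane.
U32x4 cross_sums_shuffle_add(const U32x4& x) {
    const U32x4 swapped = shuffle32(x, 0xB1);
    U32x4 sum{};
    for (int i = 0; i < 4; ++i)
        sum[i] = static_cast<std::uint32_t>(x[i] + swapped[i]);
    // Shifting a 64-bit lane left by 32 moves its low element into the
    // high slot and zeroes the low slot.
    return {0u, sum[0], 0u, sum[2]};
}

// Both variants should produce identical lanes for any input.
void check_equivalent(const U32x4& x) {
    assert(cross_sums_hadd(x) == cross_sums_shuffle_add(x));
}
```

Both variants produce the same lanes, so the choice between them is about instruction cost rather than results, which fits the observation above that the benchmarked implementations were equally fast.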

if (TARGET_POINTER_SIZE == 4)
{
    // TODO-XARCH-CQ: We should support long/ulong multiplication
    // TODO-XARCH-CQ: 32bit support


What's blocking 32-bit support? It doesn't look like we're using any _X64 intrinsics in the fallback logic.

@EgorBo (Member, Author):
Not sure, to be honest. That check was pre-existing; I only changed the comment.


Reviewers

tannergooding approved these changes

Assignees

EgorBo


4 participants

@EgorBo, @neon-sunset, @EgorBot, @tannergooding
