
Accelerate Vector128<long>::op_Multiply on x64 #103555


Merged

EgorBo merged 21 commits into dotnet:main from EgorBo:arm-mul-64bit

Jun 28, 2024

EgorBo (Member) commented Jun 17, 2024 (edited)

This PR optimizes Vector128 and Vector256 multiplication for long/ulong when AVX-512 is not available on the system. It makes XxHash128 faster; see #103555 (comment)

    public Vector128<long> Foo(Vector128<long> a, Vector128<long> b) => a * b;

Current codegen on x64 cpu without AVX512:

    ; Method MyBench:Foo
           push     rsi
           push     rbx
           sub      rsp, 104
           mov      rbx, rdx
           mov      rdx, qword ptr [r8]
           mov      qword ptr [rsp+0x58], rdx
           mov      rdx, qword ptr [r9]
           mov      qword ptr [rsp+0x50], rdx
           mov      rdx, qword ptr [rsp+0x58]
           imul     rdx, qword ptr [rsp+0x50]
           mov      qword ptr [rsp+0x60], rdx
           mov      rsi, qword ptr [rsp+0x60]
           mov      rdx, qword ptr [r8+0x08]
           mov      qword ptr [rsp+0x40], rdx
           mov      rdx, qword ptr [r9+0x08]
           mov      qword ptr [rsp+0x38], rdx
           mov      rcx, qword ptr [rsp+0x40]
           mov      rdx, qword ptr [rsp+0x38]
           call     [System.Runtime.Intrinsics.Scalar`1[long]:Multiply(long,long):long]   ;;; not inlined call!
           mov      qword ptr [rsp+0x48], rax
           mov      rax, qword ptr [rsp+0x48]
           mov      qword ptr [rsp+0x20], rsi
           mov      qword ptr [rsp+0x28], rax
           vmovaps  xmm0, xmmword ptr [rsp+0x20]
           vmovups  xmmword ptr [rbx], xmm0
           mov      rax, rbx
           add      rsp, 104
           pop      rbx
           pop      rsi
           ret
    ; Total bytes of code: 120

New codegen:

    ; Method MyBench:Foo
           vmovups  xmm0, xmmword ptr [r8]
           vmovups  xmm1, xmmword ptr [r9]
           vpmuludq xmm2, xmm1, xmm0
           vpshufd  xmm1, xmm1, -79
           vpmulld  xmm0, xmm1, xmm0
           vxorps   xmm1, xmm1, xmm1
           vphaddd  xmm0, xmm0, xmm1
           vpshufd  xmm0, xmm0, 115
           vpaddq   xmm0, xmm0, xmm2
           vmovups  xmmword ptr [rdx], xmm0
           mov      rax, rdx
           ret
    ; Total bytes of code: 50
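
For readers following the new sequence: it appears to rely on the identity that, modulo 2^64, a * b = lo(a) * lo(b) + ((lo(a) * hi(b) + hi(a) * lo(b)) << 32), where lo and hi are the 32-bit halves of each 64-bit lane. The sketch below mirrors the instructions one at a time using the public x86 intrinsics. It is only an illustration for reading along, not the JIT's actual expansion. The names MultiplySketch and MultiplyScalar are made up for this example, and the scalar path is an assumed fallback so that the snippet also runs on hardware without SSE4.1.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper: per-lane reference result, wrapping on overflow.
static Vector128<long> MultiplyScalar(Vector128<long> a, Vector128<long> b) =>
    Vector128.Create(
        unchecked(a.GetElement(0) * b.GetElement(0)),
        unchecked(a.GetElement(1) * b.GetElement(1)));

// Hypothetical helper mirroring the new codegen, one instruction per step.
static Vector128<long> MultiplySketch(Vector128<long> a, Vector128<long> b)
{
    if (!Sse41.IsSupported)
        return MultiplyScalar(a, b); // assumption: fallback for non-x64 or pre-SSE4.1 hardware

    // vpmuludq: lo(a) * lo(b), full 64-bit product per lane
    Vector128<ulong> lowProduct = Sse2.Multiply(b.AsUInt32(), a.AsUInt32());
    // vpshufd 0xB1 (-79): swap the 32-bit halves of each lane of b
    Vector128<int> swapped = Sse2.Shuffle(b.AsInt32(), 0xB1);
    // vpmulld: [lo(a) * hi(b), hi(a) * lo(b)] per lane, low 32 bits only
    Vector128<int> cross = Sse41.MultiplyLow(swapped, a.AsInt32());
    // vphaddd with zero: sum each pair, leaving [sum0, sum1, 0, 0]
    Vector128<int> sums = Ssse3.HorizontalAdd(cross, Vector128<int>.Zero);
    // vpshufd 0x73 (115): move the sums into the upper halves -> [0, sum0, 0, sum1]
    Vector128<int> shifted = Sse2.Shuffle(sums, 0x73);
    // vpaddq: combine the low product with the shifted cross terms
    return Sse2.Add(lowProduct, shifted.AsUInt64()).AsInt64();
}

Vector128<long> x = Vector128.Create(0x1234_5678_9ABC_DEF0L, -7L);
Vector128<long> y = Vector128.Create(0x0FED_CBA9_8765_4321L, 3_000_000_000L);
// Compare the sketch against the scalar reference for one sample input.
Console.WriteLine(MultiplySketch(x, y) == MultiplyScalar(x, y));
```

The hi(a) * hi(b) term never appears because it would be shifted left by 64 bits and so vanishes modulo 2^64, which is why three 32-bit multiplies' worth of work (one vpmuludq plus one vpmulld) is enough.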

Accelerate Vector128 mul for long/ulong

ae17211

ghost added the area-System.Runtime.Intrinsics label

Jun 17, 2024

dotnet-policy-service bot assigned EgorBo

Jun 17, 2024

dotnet-policy-service bot (Contributor) commented Jun 17, 2024

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

EgorBo (Member, Author) commented Jun 17, 2024

Note: results should be better if we do this in the JIT; that will enable loop hoisting, CSE, etc. for MUL.

neon-sunset (Contributor) commented Jun 17, 2024

Note #103539 (comment) (and https://godbolt.org/z/eqsrf341M) from the xxHash128 issue.

EgorBo added 2 commits

June 17, 2024 12:43

better ulong version

afda312

fix build

ab01574

This was referenced Jun 17, 2024

GC/Regressions/v2.0-beta2/452950 failed in CI #103494

Closed

System.Numerics.Tensors.Tests.TensorSpanTests test failure #103525

Closed

EgorBo added 2 commits

June 17, 2024 14:33

Update Vector128_1.cs

21b42de

Sse41 version

581f1e2

tannergooding reviewed

Jun 17, 2024

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128_1.cs (outdated, resolved)

EgorBo and others added 2 commits

June 17, 2024 17:01

Update src/libraries/System.Private.CoreLib/src/System/Runtime/Intrin…

49a359f

…sics/Vector128_1.cs

Co-authored-by: Tanner Gooding <tagoo@outlook.com>

Update Vector128_1.cs

57898f0

dotnet deleted a comment from EgorBot

Jun 17, 2024

This was referenced Jun 17, 2024

System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137 #100558

Closed

SslStreamTlsResumeTests.ClientDisableTlsResume_Succeeds failed in CI #103449

Open

TimeProviderTests.TestProviderTimer failed in CI #103459

Closed

FileSystemWatcher_InternalBufferSize_SynchronizingObject test failed in CI #103495

Closed

[Mono AOT] Failed to compile System.Private.CoreLib.dll.bc #103520

Open

EgorBo added 5 commits

June 19, 2024 15:23

Update Vector128_1.cs

f1be705

Update Vector128_1.cs

95d0eb8

Update Vector128_1.cs

dcfd93d

Update Vector128_1.cs

7fec9e3

Update Vector128_1.cs

e172296

This was referenced Jun 19, 2024

STJ NullPropertyNameFail test failing in CI #103715

Closed

NativeAOT legs timing out in CI #102239

Closed

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

Update Vector128_1.cs

0456d12

dotnet deleted a comment from EgorBot

Jun 20, 2024

dotnet deleted a comment from EgorBot

Jun 20, 2024

EgorBo (Member, Author) commented Jun 20, 2024

@EgorBot -amd -intel -arm64 -profiler --envvars DOTNET_PreferredVectorBitWidth:128

    using System.IO.Hashing;
    using BenchmarkDotNet.Attributes;

    public class Bench
    {
        static readonly byte[] Data = new byte[1000000];

        [Benchmark]
        public byte[] BenchXxHash128()
        {
            XxHash128 hash = new();
            hash.Append(Data);
            return hash.GetHashAndReset();
        }
    }

EgorBot commented Jun 20, 2024

Benchmark results on Intel

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-ITXSAG : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XSORFZ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128

Method	Toolchain	Mean	Error	Ratio
BenchXxHash128	Main	43.41 μs	0.087 μs	1.00
BenchXxHash128	PR	43.33 μs	0.009 μs	1.00

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

EgorBot commented Jun 20, 2024

Benchmark results on Amd

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores
  Job-SUBLYH : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-OPUYDY : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128

Method	Toolchain	Mean	Error	Ratio
BenchXxHash128	Main	71.20 μs	0.022 μs	1.00
BenchXxHash128	PR	43.84 μs	0.013 μs	0.62

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

EgorBot commented Jun 20, 2024

Benchmark results on Arm64

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-EDPWDU : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-TIALUR : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128

Method	Toolchain	Mean	Error	Ratio
BenchXxHash128	Main	116.9 μs	0.11 μs	1.00
BenchXxHash128	PR	116.8 μs	0.07 μs	1.00

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

This was referenced Jun 20, 2024

Test failure: GC\\Features\\HeapExpansion\\Finalizer\\Finalizer.cmd #102706

Closed

[Test Failure] System.Net.Http.WinHttpHandlerFunctional.Tests.BidirectionStreamingTest.BackwardsCompatibility_DowngradeToHttp11 #103754

Closed

revert unrelated changes

60441f3

EgorBo (Member, Author) commented Jun 21, 2024

/azp list

This comment was marked as resolved.

EgorBo (Member, Author) commented Jun 21, 2024

/azp run runtime-coreclr jitstress-isas-x86

azure-pipelines bot commented Jun 21, 2024

Azure Pipelines successfully started running 1 pipeline(s).

EgorBo (Member, Author) commented Jun 21, 2024 (edited)

@tannergooding PTAL. I'll add arm64 separately; I need to test different implementations there.
I've expanded it in the importer, similar to the existing op_Multiply expansions.

Benchmark improvement: #103555 (comment)

EgorBo requested a review from tannergooding

June 21, 2024 12:29

build-analysis bot mentioned this pull request

Jun 21, 2024

[browser] Unable to evaluate script: tab crashed #103623

Closed

EgorBo marked this pull request as ready for review

June 24, 2024 14:26

tannergooding reviewed

Jun 27, 2024

src/coreclr/jit/gentree.cpp (outdated, resolved)

tannergooding reviewed

Jun 27, 2024

src/coreclr/jit/gentree.cpp

Comment on lines +21627 to +21631

		// Vector256<int> tmp3 = Avx2.HorizontalAdd(tmp2.AsInt32(), Vector256<int>.Zero);
		GenTreeHWIntrinsic* tmp3 =
		    gtNewSimdHWIntrinsicNode(type, tmp2, gtNewZeroConNode(type),
		                             is256 ? NI_AVX2_HorizontalAdd : NI_SSSE3_HorizontalAdd,
		                             CORINFO_TYPE_UINT, simdSize);

tannergooding (Member) Jun 27, 2024

I know in other places we've started avoiding hadd in favor of shuffle+add; it might be worth seeing if that's appropriate here too (low priority, non-blocking).
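
One way the suggestion could look, purely as a sketch: after the 32-bit cross multiply, swap the halves of each lane again, add, and shift each 64-bit lane left by 32, which would stand in for HorizontalAdd plus the final shuffle. This is an assumption about what shuffle+add might mean in this spot, not code from this PR. CrossTermsHadd and CrossTermsShuffleAdd are hypothetical names, and the comparison only runs when SSSE3 is available.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper: cross terms via hadd, following the reviewed approach.
static Vector128<ulong> CrossTermsHadd(Vector128<int> cross)
{
    // Pairwise sums land in the low two elements: [s0, s1, 0, 0]
    Vector128<int> sums = Ssse3.HorizontalAdd(cross, Vector128<int>.Zero);
    // Move each sum into the upper half of its 64-bit lane: [0, s0, 0, s1]
    return Sse2.Shuffle(sums, 0x73).AsUInt64();
}

// Hypothetical alternative: shuffle + add + shift instead of hadd + shuffle.
static Vector128<ulong> CrossTermsShuffleAdd(Vector128<int> cross)
{
    // Swap the 32-bit halves of each 64-bit lane
    Vector128<int> swapped = Sse2.Shuffle(cross, 0xB1);
    // 32-bit add puts the pair sum in both halves: [s0, s0, s1, s1]
    Vector128<int> sums = Sse2.Add(cross, swapped);
    // Shift each 64-bit lane left by 32 to clear the low half: [0, s0, 0, s1]
    return Sse2.ShiftLeftLogical(sums.AsUInt64(), 32);
}

if (Ssse3.IsSupported)
{
    Vector128<int> sample = Vector128.Create(7, -3, int.MaxValue, 5);
    // Both formulations should agree, including when a pair sum wraps.
    Console.WriteLine(CrossTermsHadd(sample) == CrossTermsShuffleAdd(sample));
}
else
{
    Console.WriteLine("SSSE3 not available; nothing to compare.");
}
```

Whether this variant is actually preferable would need to be measured on the target hardware; the sketch only shows that the two formulations produce the same cross-term vector.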