Optimize Vector128<long> multiplication for arm64 #104177

Merged

EgorBo merged 8 commits into dotnet:main from EgorBo:mul-long-arm64

Jul 2, 2024

Conversation

EgorBo (Member) commented Jun 28, 2024

Follow up to #103555 for arm64

Optimize Vector128<long> multiplication for arm64

ef0e46f

ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

Jun 28, 2024

dotnet-policy-service bot assigned EgorBo

Jun 28, 2024

dotnet-policy-service bot commented Jun 28, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

tannergooding reviewed

Jun 28, 2024

src/coreclr/jit/hwintrinsicarm64.cpp (outdated, resolved)

tannergooding reviewed

Jun 28, 2024

src/coreclr/jit/gentree.cpp

Comment on lines 21932 to 21957

		case TYP_LONG:
		case TYP_ULONG:
		{
		    assert(simdSize == 16);

		    // Make op1 and op2 multi-use:
		    GenTree* op1Dup = fgMakeMultiUse(&op1);
		    GenTree* op2Dup = fgMakeMultiUse(&op2);

		    // long left0 = op1.GetElement(0)
		    // long left1 = op1.GetElement(1)
		    GenTree* left0 = gtNewSimdGetElementNode(TYP_LONG, op1, gtNewIconNode(0), simdBaseJitType, 16);
		    GenTree* left1 = gtNewSimdGetElementNode(TYP_LONG, op1Dup, gtNewIconNode(1), simdBaseJitType, 16);

		    // long right0 = op2.GetElement(0)
		    // long right1 = op2.GetElement(1)
		    GenTree* right0 = gtNewSimdGetElementNode(TYP_LONG, op2, gtNewIconNode(0), simdBaseJitType, 16);
		    GenTree* right1 = gtNewSimdGetElementNode(TYP_LONG, op2Dup, gtNewIconNode(1), simdBaseJitType, 16);

		    // Vector128<long> vec = Vector128.Create(left0 * right0, left1 * right1)
		    op1 = gtNewOperNode(GT_MUL, TYP_LONG, left0, right0);
		    op2 = gtNewOperNode(GT_MUL, TYP_LONG, left1, right1);
		    GenTree* vec = gtNewSimdCreateScalarUnsafeNode(TYP_SIMD16, op1, simdBaseJitType, 16);
		    return gtNewSimdHWIntrinsicNode(TYP_SIMD16, vec, gtNewIconNode(1), op2, NI_AdvSimd_Insert,
		                                    simdBaseJitType, 16);
		}

Is this just avoiding the cost of inlining, unrolling, and simplifying the work the JIT would have to do?
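
For readers who do not work on the JIT, the expansion in the hunk above can be modelled in standalone C++. This is only a sketch of what the generated tree computes, not JIT code: `Vec128Long`, `WrappingMul`, and `MulViaScalars` are hypothetical names introduced here, and a plain two-element array stands in for `Vector128<long>`.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for Vector128<long>: two 64-bit lanes.
using Vec128Long = std::array<std::int64_t, 2>;

// Wrapping 64-bit multiply. The product is computed in unsigned arithmetic
// because signed overflow is undefined behaviour in C++.
inline std::int64_t WrappingMul(std::int64_t a, std::int64_t b)
{
    return static_cast<std::int64_t>(static_cast<std::uint64_t>(a) *
                                     static_cast<std::uint64_t>(b));
}

// Mirrors the shape of the expansion: extract both lanes of each operand,
// multiply them as scalars, then rebuild the vector lane by lane.
inline Vec128Long MulViaScalars(const Vec128Long& op1, const Vec128Long& op2)
{
    // long left0 = op1.GetElement(0); long left1 = op1.GetElement(1)
    const std::int64_t left0 = op1[0];
    const std::int64_t left1 = op1[1];

    // long right0 = op2.GetElement(0); long right1 = op2.GetElement(1)
    const std::int64_t right0 = op2[0];
    const std::int64_t right1 = op2[1];

    // Counterpart of CreateScalarUnsafe: only lane 0 is meaningful at first.
    Vec128Long vec{};
    vec[0] = WrappingMul(left0, right0);

    // Counterpart of inserting the second product into lane 1.
    vec[1] = WrappingMul(left1, right1);
    return vec;
}
```

Each lane is an independent scalar multiply, which is why the expansion needs both operands to be multi-use: every operand is read once per lane.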

EgorBot commented Jun 28, 2024

Benchmark results on Arm64

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-OMCIXQ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-KTSNVH : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

Method	Toolchain	Mean	Error	Ratio
BenchXxHash128	Main	116.9 μs	0.05 μs	1.00
BenchXxHash128	PR	109.8 μs	0.04 μs	0.94

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

build-analysis bot mentioned this pull request

Jun 28, 2024

System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137 #100558

Closed

EgorBo added 5 commits

June 30, 2024 22:38

Merge branch 'main' of https://github.com/dotnet/runtime into mul-lon…

717f62a

…g-arm64

add Vector64

77977f5

remove assert

e0a2942

add a comment

86d4fb3

clean up

818b8bd

neon-sunset (Contributor) commented Jul 1, 2024 (edited)

I wanted to ask: is there a reason LLVM's codegen variant did not work? On some cores, UMOV/SMOV has pretty bad latency compared with code that avoids a round-trip to scalar registers.
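
The comment does not spell out which LLVM sequence is meant, so the following is only an illustration of one general way to get a 64-bit product without a native 64x64 multiply: split each operand into 32-bit halves and combine the partial products. Whether this matches LLVM's output is an assumption; `MulLow64ViaHalves` is a hypothetical name, and the sketch is scalar C++ rather than NEON code.

```cpp
#include <cstdint>

// Low 64 bits of a * b, built only from 32-bit halves.
// a * b = aLo*bLo + ((aHi*bLo + aLo*bHi) << 32) + ((aHi*bHi) << 64),
// and the last term vanishes modulo 2^64.
inline std::uint64_t MulLow64ViaHalves(std::uint64_t a, std::uint64_t b)
{
    const std::uint64_t aLo = a & 0xFFFFFFFFu;
    const std::uint64_t aHi = a >> 32;
    const std::uint64_t bLo = b & 0xFFFFFFFFu;
    const std::uint64_t bHi = b >> 32;

    // Only the low 32 bits of the cross sum survive the shift, so any
    // wraparound in this addition is harmless.
    const std::uint64_t cross = aHi * bLo + aLo * bHi;

    return aLo * bLo + (cross << 32);
}
```

Because every step is a 32-bit widening multiply, an add, or a shift, a vectorized form of this idea can stay in SIMD registers, at the cost of more instructions than two scalar multiplies.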

build-analysis bot mentioned this pull request

Jul 1, 2024

Test failure: GC\\Scenarios\\FinalizeTimeout\\FinalizeTimeout\\FinalizeTimeout.cmd #103874

Closed

tannergooding reviewed

Jul 1, 2024

src/coreclr/jit/gentree.cpp Outdated

Comment on lines 22005 to 22006

		return gtNewSimdHWIntrinsicNode(type, vec, gtNewIconNode(1), op2, NI_AdvSimd_Insert,
		simdBaseJitType, 16);

nit: Use gtNewSimdWithElementNode(type, vec, gtNewIconNode(1), op2, simdBaseJitType, simdSize), which ensures all the optimal handling takes place.

Member, Author

Thanks! Applied

tannergooding approved these changes

Jul 1, 2024

EgorBo added 2 commits

July 2, 2024 15:59

Merge branch 'main' of https://github.com/dotnet/runtime into mul-lon…

f7a53d3

…g-arm64

handle scalarOp

b2206c9

EgorBo marked this pull request as ready for review

July 2, 2024 18:03

EgorBo merged commit 6e039a8 into dotnet:main

Jul 2, 2024

EgorBo deleted the mul-long-arm64 branch

July 2, 2024 18:03

tannergooding reviewed

Jul 2, 2024

src/coreclr/jit/gentree.cpp

Comment on lines +21981 to +21982

		op1 = gtNewBitCastNode(TYP_LONG, op1);
		op2 = gtNewBitCastNode(TYP_LONG, op2);

Why bitcast instead of ToScalar? If this is generating better code, it seems like a pretty "core" scenario we're not handling from the ToScalar path.

EgorBo (Member, Author), Jul 2, 2024 (edited)

@tannergooding because op2 can be either an 8-byte TYP_SIMD8 or an 8-byte scalar (TYP_LONG), so the bitcast allowed me to simplify the handling. In my initial version I forgot that this path is used for both MUL(vector, vector) and MUL(vector, scalar) (where the scalar is broadcast).

tannergooding (Member), Jul 2, 2024

Ah, that makes sense, 👍
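
The point in the reply above can be illustrated outside the JIT: if both possible operand shapes are exactly 8 bytes, one bit-level reinterpretation handles either of them, so the multiply needs no separate vector and scalar paths. This is a sketch under that assumption, not the JIT's implementation; `Vec64Long`, `BitCastToLong`, and `MulVec64` are hypothetical names, with `Vec64Long` standing in for an 8-byte `TYP_SIMD8` value holding one `long`.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for an 8-byte SIMD value with a single 64-bit lane.
struct Vec64Long
{
    std::int64_t lane0;
};

// Reinterprets any 8-byte operand as a 64-bit integer without changing bits.
template <typename T>
inline std::int64_t BitCastToLong(const T& value)
{
    static_assert(sizeof(T) == sizeof(std::int64_t), "operand must be exactly 8 bytes");
    std::int64_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// One code path for MUL(vector, vector) and MUL(vector, scalar): the right
// operand may be a Vec64Long or a plain std::int64_t.
template <typename TRight>
inline Vec64Long MulVec64(const Vec64Long& op1, const TRight& op2)
{
    // Multiply in unsigned arithmetic so overflow wraps instead of being UB.
    const std::uint64_t left  = static_cast<std::uint64_t>(BitCastToLong(op1));
    const std::uint64_t right = static_cast<std::uint64_t>(BitCastToLong(op2));
    return Vec64Long{static_cast<std::int64_t>(left * right)};
}
```

With a single lane, broadcasting a scalar and passing it directly are the same thing, which is why the two operand shapes can share the path.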

github-actions bot locked and limited conversation to collaborators

Aug 2, 2024

Labels

area-CodeGen-coreclr

CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

4 participants
