The optimization is implemented by replacing the constant V512 vector with a V128 constant and a broadcasti128 node when lowering a GT_STOREIND that has an eligible constant V512 vector as its operand.

Currently, the implementation only covers V512 -> broadcasti128(V128). We are open to adjusting the implementation or bringing more cases into this PR, ideally V512/256 -> broadcasti128(V128) when AVX512 is available (possibly plus V512 -> broadcast64x4(V256)).
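To make "eligible" concrete, here is a minimal sketch of the kind of check that could decide whether a 64-byte constant can be rebuilt from a 16-byte one. The helper names (`isRepeatedChunk`, `isAllBytes`, `isEligibleForBroadcast128`) are hypothetical and are not taken from the JIT. The exclusion of all-zero and all-bits-set constants mirrors the commit in this PR that filters them out, on the assumption that such vectors can be materialized without reading a constant from memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not JIT code): true when `size` bytes of constant data
// consist of the first `chunk` bytes repeated back to back.
bool isRepeatedChunk(const uint8_t* data, size_t size, size_t chunk)
{
    if (chunk == 0 || size % chunk != 0)
    {
        return false;
    }
    for (size_t offset = chunk; offset < size; offset += chunk)
    {
        if (std::memcmp(data, data + offset, chunk) != 0)
        {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: true when every byte equals `value`
// (0x00 models a Zero vector, 0xFF models an AllBitsSet vector).
bool isAllBytes(const uint8_t* data, size_t size, uint8_t value)
{
    for (size_t i = 0; i < size; i++)
    {
        if (data[i] != value)
        {
            return false;
        }
    }
    return true;
}

// Assumed eligibility rule for V512 -> broadcasti128(V128): the constant is
// neither Zero nor AllBitsSet, and it repeats a single 16-byte pattern.
bool isEligibleForBroadcast128(const uint8_t (&v512)[64])
{
    if (isAllBytes(v512, 64, 0x00) || isAllBytes(v512, 64, 0xFF))
    {
        return false;
    }
    return isRepeatedChunk(v512, 64, 16);
}
```

Under this rule, a constant whose four 128-bit lanes are identical qualifies, while a constant such as the byte sequence 0..63 does not.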

ghost added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and community-contribution (indicates that the PR has been added by a community member) labels

Sep 13, 2023

ghost commented Sep 13, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR is trying to solve #90328.

The optimization is implemented by replacing the constant V512 vector with a V128 constant and a broadcasti128 node when lowering a GT_STOREIND that has an eligible constant V512 vector as its operand.

Author:	Ruihan-Yin
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

Ruihan-Yin changed the title ~~[JIT] Optimize constant V512 vector~~ [JIT] Optimize constant V512 vector with broadcast

Sep 13, 2023

Ruihan-Yin closed this

Sep 13, 2023

Ruihan-Yin reopened this

Sep 13, 2023

build-analysisbot mentioned this pull request

Sep 14, 2023

Build crashes in System.Runtime.Serialization.SerializationGuard #92007

Closed

Member, Author

Ruihan-Yin commented Sep 14, 2023 (edited)

Ran the test suite twice; the remaining failures look like known or random failures, so I am marking the PR as ready for review.

Ruihan-Yin marked this pull request as ready for review

September 14, 2023 23:14

Member, Author

Ruihan-Yin commented Sep 18, 2023

Hi @EgorBo, this PR is ready for review. Please see whether it covers #90328, thanks!

Member

tannergooding commented Sep 18, 2023 (edited)

Is this going to perform better or just save on the size of the rodata section?

What about on hardware without AVX-512 (Haswell, Skylake, etc)?

What about scalar->V128, scalar->V256, scalar->V512, V128->V256, V256->V512, etc?

For AVX-512, the scalar->Vector scenario can at least be covered by an embedded broadcast. But for other scenarios, this seems like it's trading more instructions for a smaller data section.

Member, Author

Ruihan-Yin commented Sep 18, 2023

> Is this going to perform better or just save on the size of the rodata section?

The expected improvement is saving some memory space for constant values.

> What about on hardware without AVX-512 (Haswell, Skylake, etc)?

I intended to use VBROADCASTI32X4, which is an AVX512-only instruction, but it seems VBROADCASTI128 can also do this job for V256->V128.

> What about scalar->V128, scalar->V256, scalar->V512, V128->V256, V256->V512, etc?

I presume the scope is to compress a larger existing constant vector into a smaller one in a pure memory operation, which embedded broadcast might not be able to handle.

If we want to take compressing to a scalar into consideration, we might also have this opportunity: V128/256/512 -> Byte/Word/DWord/QWord.

> For AVX-512, the scalar->Vector scenario can at least be covered by an embedded broadcast. But for other scenarios, this seems like it's trading more instructions for a smaller data section.

From my understanding of #90328, the issue is about a pure store case, where the codegen is mostly:

vmovups zmm, zmmword ptr[constant section]
vmovups zmmword ptr[target], zmm

The optimization is essentially replacing the first load with a broadcast instruction that has a smaller constant operand.

I may have misunderstood the issue or understood it only partially, so please correct me if so.
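The equivalence described above can be modeled in portable C++ without AVX-512 hardware. This is only an illustrative sketch: `V128`, `V512`, `storeFullConstant`, and `storeBroadcastConstant` are hypothetical names, and plain `memcpy` stands in for the load, broadcast, and store instructions. It shows that, for a constant whose four 128-bit lanes are identical, broadcasting a 16-byte constant writes the same 64 bytes to the target as loading the full 64-byte constant.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using V128 = std::array<uint8_t, 16>; // models a 16-byte constant in the data section
using V512 = std::array<uint8_t, 64>; // models a 64-byte constant or a zmm register

// Models the original sequence: load the full 64-byte constant, then store it.
void storeFullConstant(const V512& constant, uint8_t* target)
{
    V512 reg = constant;                         // stands in for: vmovups zmm, [constant]
    std::memcpy(target, reg.data(), reg.size()); // stands in for: vmovups [target], zmm
}

// Models the optimized sequence: broadcast a 16-byte constant into all four
// 128-bit lanes of the register, then store the register.
void storeBroadcastConstant(const V128& lane, uint8_t* target)
{
    V512 reg;
    for (size_t i = 0; i < reg.size() / lane.size(); i++)
    {
        std::memcpy(reg.data() + i * lane.size(), lane.data(), lane.size());
    }
    std::memcpy(target, reg.data(), reg.size());
}
```

In this model the instruction count stays at two; what changes is that the constant kept in memory shrinks from 64 bytes to 16 bytes.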

Member

tannergooding commented Sep 18, 2023

> the optimization is essentially replacing the first load with a broadcast instruction with a smaller constant operand.

👍, if it's primarily for the case where we'd otherwise have a vmovups reg1, [addr] then it sounds great to replace that with vbroadcast reg1, [addr] where possible.

I was initially concerned it would also change:

vadd reg1, reg2, [addr]

into

vbroadcast reg3, [addr]
vadd reg1, reg2, reg3

Member, Author

Ruihan-Yin commented Sep 18, 2023

> > the optimization is essentially replacing the first load with a broadcast instruction with a smaller constant operand.
>
> 👍, if it's primarily for the case where we'd otherwise have a vmovups reg1, [addr] then it sounds great to replace that with vbroadcast reg1, [addr] where possible.
> I was initially concerned it would also change:
> vadd reg1, reg2, [addr]
> into
> vbroadcast reg3, [addr]
> vadd reg1, reg2, reg3

I think it wouldn't cover that case (at least it is not intended to), as the entry point of this optimization is LowerStoreIndir().

Member, Author

Ruihan-Yin commented Sep 21, 2023

The failure should be unrelated.

Hi @tannergooding @EgorBo, I added the optimization for V512->V256 and V256->V128. I think it now reaches the expected coverage and is ready for review.

EgorBo self-requested a review

October 16, 2023 10:33

Member

JulieLeeMSFT commented Oct 23, 2023

@tannergooding, this community PR is ready to review. PTAL.

JulieLeeMSFT requested a review from tannergooding

October 23, 2023 16:19

tannergooding reviewed

Oct 24, 2023

src/coreclr/jit/lowerxarch.cpp Outdated

		return;
		}

		if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32) && !node->Data()->AsVecCon()->TypeIs(TYP_SIMD64))

Member

tannergooding Oct 24, 2023

I believe you can just do:

Suggested change

-	if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32) && !node->Data()->AsVecCon()->TypeIs(TYP_SIMD64))
+	if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32, TYP_SIMD64))
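For readers unfamiliar with this pattern, the following is a minimal sketch of how a variadic `TypeIs` can make the two forms in the suggestion equivalent. The `VarType` enum and `Node` struct are simplified, hypothetical stand-ins and do not reproduce the JIT's actual node classes.

```cpp
// Hypothetical, simplified type tags; the real JIT defines many more.
enum VarType
{
    TYP_INT,
    TYP_SIMD16,
    TYP_SIMD32,
    TYP_SIMD64
};

// Hypothetical stand-in for a tree node that carries a type.
struct Node
{
    VarType type;

    // Base case: exact match against a single type.
    bool TypeIs(VarType t) const
    {
        return type == t;
    }

    // Variadic case: true when the node's type matches any listed type.
    template <typename... Rest>
    bool TypeIs(VarType t, Rest... rest) const
    {
        return TypeIs(t) || TypeIs(rest...);
    }
};
```

With a helper of this shape, `!n.TypeIs(A) && !n.TypeIs(B)` and `!n.TypeIs(A, B)` produce the same result by De Morgan's law, so the shorter form only changes readability.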

tannergooding approved these changes

Oct 24, 2023

Member

tannergooding left a comment

LGTM. This should get a secondary review from someone on the JIT team.

CC. @dotnet/jit-contrib

Member

tannergooding commented Oct 24, 2023

CC. @jakobbotsch, @EgorBo in particular

Ruihan-Yin added 8 commits

October 25, 2023 09:57

UseBroadcasti128

d264c75

remove un-needed changes

baf3bcc

Nit: remove some unnecessary commetns and line deletion.

698b72e

filter out the AllBitsSet and Zero vector from the opts

afda4de

Apply format patch

e9d6ca7

extend the coverage to V512->V256 and V256->V128

4a3bf08

apply format patch

5352998

Resolve comment

a01fd58

Ruihan-Yin force-pushed the broadcastMov branch from 56041f3 to a165107

October 25, 2023 17:11

Ruihan-Yin force-pushed the broadcastMov branch from a165107 to a01fd58

October 25, 2023 17:14

Member, Author

Ruihan-Yin commented Nov 20, 2023

Hi @jakobbotsch @EgorBo, this PR is ready for review, would you please take a look? Thanks!

Member

jakobbotsch commented Nov 20, 2023

Since this is AVX-512 backend work, I think @BruceForstall should take a look... At a quick glance it seemed a bit odd to do it during STORE_INDIR lowering when presumably constants can benefit in many other cases (as long as they're not already contained), but I am not very familiar with these instructions. Also going to close and reopen this to rerun CI.