NotificationsYou must be signed in to change notification settings
Fork5.2k
Star17.2k

Adding`Avx10v1` to the runtime#99784

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

BruceForstall merged 10 commits intodotnet:mainfromRuihan-Yin:Avx10

Mar 21, 2024

Merged

Adding`Avx10v1` to the runtime#99784

BruceForstall merged 10 commits intodotnet:mainfromRuihan-Yin:Avx10

Mar 21, 2024

Conversation

Copy link

Member

Ruihan-Yin commentedMar 14, 2024•
edited
Loading

Overview

As discussed in#98069, we will exposeAvx10 converged vector ISA into the runtime. This PR is adding CPUID detection logics of the first version.

Following thespec doc (C1 1.2), availability ofAvx10v1 will be checked in the given paradigm:

A CPUID feature bit indicating that the Intel AVX10 ISA is supported.
A version number to ensure that the supported version is greater than or equal to the desired version (say 1, in this case).
A vector length bit indicating the maximum supported vector length.

We also exposed the following environment variables to control the ISA support:

Avx10MaxVectorLength: an integer that specifies the max vector length expected.
EnableAvx10v1: covers all the instructions operating on the length of 128.
EnableAvx10v1_V256: covers all the instructions operating on the length of 256.
EnableAvx10v1_V512: covers all the instructions operating on the length of 512.

Avx10MaxVectorLength has priority over the rest of the three vars, e.g. whenAvx10MaxVectorLength < 512 andEnableAvx10v1_V512 = 1, Instructions underAvx10v1_V512 will not be available.

Ruihan-Yin added2 commits

March 4, 2024 10:10

script-gen changes

b9b3c8f

hand-written changes

619fd39

ghost added the needs-area-labelAn area label is needed to ensure this gets routed to the appropriate area owners label

Mar 14, 2024

dotnet-policy-servicebot added the community-contributionIndicates that the PR has been added by a community member label

Mar 14, 2024

Copy link

Contributor

DeepakRajendrakumaran commentedMar 14, 2024

@dotnet/avx512-contrib@BruceForstall Is there a similar group for 'arch-avx10' that we can add? Don't think we have permission to add the 'arch-avx10' label

tannergooding added avx10

Related to the AVX10 architecture

area-CodeGen-coreclrCLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-labelAn area label is needed to ensure this gets routed to the appropriate area owners labels

Mar 15, 2024

tannergooding reviewed

Mar 15, 2024

View reviewed changes

src/coreclr/inc/clrconfigvalues.h Outdated

Comment on lines 764 to 766

		RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableAVX10v1_V256,W("EnableAVX10v1_V256"),1,"Allows AVX10v1_V256+ hardware intrinsics to be disabled")
		RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableAVX10v1_V512,W("EnableAVX10v1_V512"),1,"Allows AVX10v1_V512+ hardware intrinsics to be disabled")
		RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_Avx10MaxVectorLength,W("Avx10MaxVectorLength"),0,"The max vector length supported",CLRConfig::LookupOptions::ParseIntegerAsBase10)

Copy link

Member

tannergoodingMar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't think we need these three.

In general theseDOTNET_Enable{Isa} knobs are primarily there for testing purposes. That is, they exist so that developers with newer hardware can still test downlevel code paths without having to recompile their applications. Such code is not typically meant for use in actual production scenarios.

To that end, we typically expose one knob per CPUID ISA bit and so you can disableAvx but notAvx.X64.Avx512F.VL was somewhat an exception and in hindsight, probably a knob we don't actually need to let users control. I expect that theV256 andV512 nested classes fit into the same bucket, where users don't need the ability to enable/disable them directly and independently ofAvx10 itself.

Instead, users who need to control access to the used vector bit width can utilize theDOTNET_PreferredVectorBitWidth knob and the correspondingVector###.IsHardwareAccelerated checks, which is how they're expected to detect this in existing code paths.

Copy link

Member

tannergoodingMar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

--@BruceForstall, there should be no issue in removing a config knob like the variousDOTNET_EnableAvx512*_VL entries, right?

We can simply remove them and document that users should utilizeDOTNET_PreferredVectorBitWidth andDOTNET_EnableAvx512* instead? This would allow a consistent user story here and reduce the overall complexity we need to support/consider

Copy link

Contributor

BruceForstallMar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't see any problem with removing theDOTNET_EnableAvx512*_VL configs. They are available in Release and have been shipped, and use theEXTERNAL prefix, which indicates (or at least historically indicated) a documented config. (Maybe we should have used the prefixUNSUPPORTED?) So technically removing them is a breaking change, but it doesn't seem like a problem.

Copy link

MemberAuthor

Ruihan-YinMar 15, 2024•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks for the feedback.

Please correct me if I am wrong, so we want to removeDOTNET_EnableAvx10v1_V256/512 andDOTNET_Avx10MaxVectorLength. And keepDOTNET_EnableAvx10v1, let it plusDOTNET_PreferredVectorBitWidth to control the emit behavior ofVector256/512 APIs. (Which I suppose won't be covered in this PR.)

As for theVL vars, do we want to handle it in this PR, or separately?

Copy link

Member

tannergoodingMar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

so we want to

Yep!

Which I suppose won't be covered in this PR

This should already be somewhat implicit based on the existing checks we have in the JIT, so I don't expect we need to do anything special. Users will still seeAvx10v1.V512 report supported if the hardware actually supports it and they would simply checkVector512.IsHardwareAccelerated if they don't want limit 512-bit usage to the user selected behavior (which is controlled byPreferredVectorBitWidth).

As for the VL vars, do we want to handle it in this PR, or separately?

I'll cover it in a separate PR.

Ruihan-Yin requested a review fromtannergooding

March 15, 2024 18:32

tannergooding reviewed

Mar 15, 2024

View reviewed changes

src/coreclr/tools/Common/JitInterface/ThunkGenerator/InstructionSetDesc.txtShow resolvedHide resolved

tannergooding reviewed

Mar 15, 2024

View reviewed changes

src/native/minipal/cpufeatures.c OutdatedShow resolvedHide resolved

tannergooding reviewed

Mar 15, 2024

View reviewed changes

src/native/minipal/cpufeatures.c OutdatedShow resolvedHide resolved

tannergooding reviewed

Mar 15, 2024

View reviewed changes

src/coreclr/vm/codeman.cpp OutdatedShow resolvedHide resolved

tannergooding reviewed

Mar 15, 2024

View reviewed changes

src/coreclr/tools/Common/Internal/Runtime/ReadyToRunInstructionSet.csShow resolvedHide resolved

Ruihan-Yin added4 commits

March 15, 2024 15:07

improve the CPUID check logic.

15d9a47

Add missing changes in managed code and fix bad naming.

8e2f33f

Resolve comments:

f269098

Removed seperate env vars to control Avx10v1_V256/512, now Avx10v1 instructions with different vector length will all be controlled by EnableAvx10v1 alone.

Add missing definition and implication in ISA descriptor.

de0a52a

Ruihan-Yin requested a review fromtannergooding

March 18, 2024 16:52

tannergooding reviewed

Mar 18, 2024

View reviewed changes

src/coreclr/tools/Common/Compiler/HardwareIntrinsicHelpers.cs Outdated

Comment on lines 72 to 74

		publicconstintAvx10v1=20000000;
		publicconstintAvx10v1_v256=40000000;
		publicconstintAvx10v1_v512=50000000;

Copy link

Member

tannergoodingMar 18, 2024•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

These should behex and the last one should be8, not5.

Suggested change

	publicconstintAvx10v1=20000000;
	publicconstintAvx10v1_v256=40000000;
	publicconstintAvx10v1_v512=50000000;
	publicconstintAvx10v1=0x20000000;
	publicconstintAvx10v1_v256=0x40000000;
	publicconstintAvx10v1_v512=0x80000000;

Copy link

MemberAuthor

Ruihan-YinMar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

public const int Avx10v1_v512 = 0x80000000;

This line will raise an error that 0x80000000 cannot be implicitly converted toint, although it can be fixed by explicitly defining as:
public const int Avx10v1_v512 = -2147483648;, I wonder if this is a proper fix (I suppose it should be good in terms of the number itself as it is just a flag bit), considering it has reached the max number of ISA this design can hold, and we are expecting some more ISAs coming in the future, e.g. higher version of Avx10 and APX, etc.

Copy link

Member

tannergoodingMar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Eitherpublic const int Avx10v1_v512 = -2147483648; orpublic const int Avx10v1_v512 = unchecked((int)0x80000000)); should be fine

Changing it touint orulong to hold more bits might also be appropriate, it's probably a better question for@MichalStrehovsky on how he wants it handled long term for this part of the codebase.

Copy link

Member

MichalStrehovskyMar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Changing it to uint or ulong to hold more bits might also be appropriate, it's probably a better question for@MichalStrehovsky on how he wants it handled long term for this part of the codebase.

I don't have an opinion; whatever works. At this pace we may run out of ulong bits too. I don't have a good understanding for why we need so many bits: the VL suffixes look redundant; the various VectorT are also redundant with other flags we set at the same time, etc.

I was involved in the design for this up to AVX2/BMI but haven't really though about it since. The old design doesn't scale if we use bits at this pace because we can't make SSE3/SSE41/etc. baseline yet (and free up bits that way), but we may already need new bits. It would be better for someone who actually understands this to design a shape that will work.

There is an advantage if this can fit in 32bits (run-time check when IsSupported is done dynamically at runtime use the same enum and 32bits are simpler), but we can also introduce yet another enum for those purposes that doesn't waste bits, or have two ints, or something like that. If we can reclaim those 7 bits, it might be enough breathing room to keep using this enum and 32bits.

tannergooding reviewed

Mar 18, 2024

View reviewed changes

src/coreclr/tools/Common/InstructionSetHelpers.cs Outdated

Comment on lines 228 to 250

		Debug.Assert(InstructionSet.X64_AVX10v1_V256==InstructionSet.X86_AVX10v1_V256);
		if(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX10v1_V256))
		{
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX10v1));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX2));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_FMA));
		}

		Debug.Assert(InstructionSet.X64_AVX10v1_V512==InstructionSet.X86_AVX10v1_V512);
		if(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX10v1_V512))
		{
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX10v1_V256));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512F));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512F_VL));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512BW));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512BW_VL));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512CD));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512CD_VL));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512DQ));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512DQ_VL));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512VBMI));
		Debug.Assert(supportedInstructionSet.HasInstructionSet(InstructionSet.X64_AVX512VBMI_VL));
		}

Copy link

Member

tannergoodingMar 18, 2024•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm not sure we need these.

The main intent of these paths is to addoptimisticInstructionSet that is ones where a cached CPUID check is beneficial for AOT code. I don't think that's worth doing forAvx10v1 here since it would only be applicable ifAvx512F was there and that itself requiresVL support.

TheDebug.Assert are then there to setup the expectation thatAvx512F was only supported whenBW+CD+DQ+VL were also supported, even though they are technically supersets ofAvx512F. We don't have a similar consideration here so noDebug.Assert is needed

tannergooding reviewed

Mar 18, 2024

View reviewed changes

src/native/minipal/cpufeatures.c Outdated

		constintversionMask=0xFF;// [7:0]
		if((cpuidInfo[CPUID_EBX]&versionMask) >=1)// version higher than 1
		{
		result \|=XArchIntrinsicConstants_Avx10v1;

Copy link

Member

tannergoodingMar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

nit: We should explicitly checkEBX[16] is set to indicateAVX10/V128. The bit exists, so we should be explicit about checking it.

tannergooding reviewed

Mar 18, 2024

View reviewed changes

src/native/minipal/cpufeatures.c Outdated

Comment on lines 290 to 300

		constintvector256Mask= (1 <<17);
		constintvector512Mask= (1 <<18);
		if((cpuidInfo[CPUID_EBX]&vector256Mask)!=0)
		{
		result \|=XArchIntrinsicConstants_Avx10v1_V256;
		}

		if((cpuidInfo[CPUID_EBX]&vector512Mask)!=0)
		{
		result \|=XArchIntrinsicConstants_Avx10v1_V512;
		}

Copy link

Member

tannergoodingMar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

nit: These can only be considered ifAVX10/V128 is set and if theversion was greater than1, so we should have them appropriately nested

tannergooding reviewed

Mar 18, 2024

View reviewed changes

src/native/minipal/cpufeatures.c Outdated

		result \|=XArchIntrinsicConstants_Avx10v1;
		}

		constintvector256Mask= (1 <<17);

Copy link

Member

tannergoodingMar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

nit: None of the other checks use aconst int like this, they inline the(1 << 17) into the if statement and then add a comment indicating what bit it was checking

Copy link

MemberAuthor

Ruihan-YinMar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I can make it to the commonly used form, it looks unnecessary to define this way as the bit-shift form should be easy enough to read in this case.

Resolve comments:

cc876a0

1. Make sure the version check and length check are correctly nested.2. remove unnecessary debug assert in mananged code.

Ruihan-Yin requested a review fromtannergooding

March 19, 2024 20:03

tannergooding approved these changes

Mar 19, 2024

View reviewed changes

Copy link

Member

tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM.

Copy link

Member

tannergooding commentedMar 19, 2024

CC. @dotnet/jit-contrib for secondary review and merge

BruceForstall reviewed

Mar 19, 2024

View reviewed changes

Copy link

Contributor

BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Please resolve the merge conflict with the JIT GUID

src/coreclr/vm/codeman.cpp

		CPUCompileFlags.Set(InstructionSet_AVX10v1);
		}

		if (((cpuFeatures & XArchIntrinsicConstants_Avx10v1_V256) !=0))

Copy link

Contributor

BruceForstallMar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Doesn't this also need&& CLRConfig::GetConfigValue(CLRConfig::EXTERNAL_EnableAVX10v1)?

And the Avx10v1_V512 case below?

That is, shouldn't all the AVX10 instruction sets be subject to the same configuration flag? (Since you didn't add separate flags for each)

Copy link

MemberAuthor

Ruihan-YinMar 19, 2024•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Please see thediscussion here.

We intended to let the implication thatAvx10v1_V512 guaranteesAvx10v1_V256, andAvx10v1_V256 guaranteesAvx10v1 to handle the invalid case (The implication is confirmed by the Avx10 spec doc.). Say, the functionEnusreValidInstructionSetSupport would cancel the illegal flag combination set during the flow.

I can put some comments there if needed.

Copy link

Contributor

BruceForstallMar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Comments would be useful.

Copy link

Contributor

BruceForstallMar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

So on an AVX10 + V512 machine, withDOTNET_EnableAVX10v1=0, we'll leave this code with:

InstructionSet_AVX10v1_V256InstructionSet_AVX10v1_V512

set butInstructionSet_AVX10v1 unset.

Then, we callCPUCompileFlags.EnsureValidInstructionSetSupport(); to "clean it up" in EnsureInstructionSetFlagsAreValid` which has this new code:

        if (resultflags.HasInstructionSet(InstructionSet_AVX10v1_V512) && !resultflags.HasInstructionSet(InstructionSet_AVX10v1_V256))            resultflags.RemoveInstructionSet(InstructionSet_AVX10v1_V512);        if (resultflags.HasInstructionSet(InstructionSet_AVX10v1_V256) && !resultflags.HasInstructionSet(InstructionSet_AVX10v1))            resultflags.RemoveInstructionSet(InstructionSet_AVX10v1_V256);        if (resultflags.HasInstructionSet(InstructionSet_AVX10v1) && !resultflags.HasInstructionSet(InstructionSet_AVX2))            resultflags.RemoveInstructionSet(InstructionSet_AVX10v1);

The first case won't hit, but the others will. That seems to mean thatInstructionSet_AVX10v1_V512 will not be disabled.

If the code ran in a different order, it would work (i.e., check from deepest in the dependency tree first).

Copy link

MemberAuthor

Ruihan-YinMar 20, 2024•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I see, thanks for pointing out, in that case, if we want to keep the idea that only checkEnableAvx10v1 once, I can makeAvx10v1_V512 impliesAvx10v1 andAvx10_V256, so there will be an extra check like:

        if (resultflags.HasInstructionSet(InstructionSet_AVX10v1_V512) && !resultflags.HasInstructionSet(InstructionSet_AVX10v1))            resultflags.RemoveInstructionSet(InstructionSet_AVX10v1_V512);

this should be able to avoid the described case.

I presume changing the order for the checks could fix the issue but might not be a solid solution, as if the order changes somehow in the future, this "implicit" bug could be hard to find.

Copy link

Contributor

BruceForstallMar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I guess it makes sense to add:

implication        ,X86   ,AVX10v1_V512         ,AVX10v1

and the auto-generated code will do the rest?

Seems like either implication handling could be improved to handle transitive implications better, or, implications are not expected to be transitive. It would be helpful if the documentation/comments made it clear.

Copy link

MemberAuthor

Ruihan-YinMar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I agree, I will document the conclusion in the code.

Copy link

Member

tannergoodingMar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think the issue may just be in the order the implications are listed in the doc. We typically want the newest ISAs listed at the bottom

Copy link

Contributor

BruceForstall commentedMar 19, 2024

If this PR is no longer in "Draft", then please mark it "Ready for review".

Ruihan-Yin marked this pull request as ready for review

March 19, 2024 23:52

Copy link

MemberAuthor

Ruihan-Yin commentedMar 19, 2024

Please resolve the merge conflict with the JIT GUID

I am not sure how I can resolve this conflict, shall I just rebase the branch and overwrite the GUID with the one in my branch, or I need to rebase and then re-generate the ID?

Copy link

Contributor

BruceForstall commentedMar 20, 2024

I am not sure how I can resolve this conflict, shall I just rebase the branch and overwrite the GUID with the one in my branch, or I need to rebase and then re-generate the ID?

I believe "merge" is preferred over "rebase".

In any case, simply overwrite the GUID with yours. No need to create a new GUID, since yours is already new and hasn't been used.