Movatterモバイル変換

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs Outdated


		// Invalid values are set to 255 during above look-ups using 'decLutTwo' and 'decLutTwo'.
		// Thus the intermediate results 'decOne' and 'decTwo' could be OR-ed to get final values.
		str1=AdvSimd.Or(decOne1,decTwo1);

Copy link

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Question for others: When writing an routine that is purelyAdvSimd, for basic functions such asOr should we be usingAdvSimd throughout or mixing APIs and usingVector128 ?

Copy link

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I agree. We should just usedecOne1 | decTwo1

Copy link

Member

tannergoodingApr 8, 2024•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

In general we want to avoid writing routines that are purely for a given architecture or ISA where-ever possible.

It is simply not maintainable to write and support implementations for all of:

Arm64
- AdvSimd
- Sve/Sve2
x86/x64
- Sse2/Sse4.1
- Avx/Avx2
- Avx512/Avx10
WASM
RISC-V
LoongArch64
etc

So, in the ideal world we are using xplat code and sharing across all target platforms for the vast majority of code we write. The only duplication we should have in the common case is for different vector sizes. That is, currently duplicating code across V128/V256/V512 and Vector, based on the sizes we want to support -- Even this is something we're looking at allowing more sharing of in the future viaISimdVector<TSelf, T>, so it is truly write once as much as possible.

We then opportunistically light up usage of platform specific behavior only if the additional complexity is justified by the perf gains and ideally only where absolutely required, still using shared xplat code for most of the algorithm.

The xplat APIs should cover most core functionality and the JIT should recognize common patterns and optimize to better platform specific codegen where possible. Where there is some functionality that might be missing, we can propose and expose new APIs that work better across the range of platforms we need to support.

a74nh reviewed

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs OutdatedShow resolvedHide resolved

a74nh approved these changes

Copy link

Contributor

a74nh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Otherwise LGTM

kunalspathak reviewed

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs OutdatedShow resolvedHide resolved

kunalspathak suggested changes

GC\Regressions\v2.0-beta2\452950\452950\452950.cmd failing on Mono minijit Windows x64#99729

Copy link

Contributor

kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

add few comments and could you please share the code generated.

SwapnilGaikwad added2 commits

April 9, 2024 17:47

Merge main

d881cdb

Address review comments

1ab9bcf

Copy link

ContributorAuthor

SwapnilGaikwad commentedApr 9, 2024

add few comments and could you please share the code generated.

Assembly for the 'AdvSimdDecode'. It's extracted from the 'System.Private.CoreLib.dll'.

b.cc0xffffa8327378  // b.lo, b.ul, b.lastmov     x6, x27str     x28,[x29, #272]movi    v8.16b, #0x3f           // <==== Line#894: DuplicateToVector128()str     q8,[x29, #256]ldr     q9,0xffffa83278b0str     q9,[x29, #96]ldr     q10,0xffffa83278c0str     q10,[x29, #80]ldr     q11,0xffffa83278d0str     q11,[x29, #64]ldr     q12,0xffffa83278e0str     q12,[x29, #48]ldr     q13,0xffffa83278f0str     q13,[x29, #32]ldr     q14,0xffffa8327900str     q14,[x29, #16]str     w4,[x29, #344]mov     w2, w4str     x6,[x29, #280]mov     x0, x6mov     x1, x27adrp    x11,0xffffa8e1c000add     x11, x11, #0xef8mov     v8.d[0], v9.d[1]mov     v11.d[0], v10.d[1]ldr     x13,[x11]blr     x13ldr     x6,[x29, #280]ld4     {v16.16b-v19.16b},[x6]  // <==== Line#902: LoadVector128x4AndUnzip()/approx.loop startstp     q16, q17,[x29, #192]stp     q18, q19,[x29, #224]ldp     q16, q17,[x29, #192]ldp     q18, q19,[x29, #224]ldr     q20,[x29, #256]uqsub   v21.16b, v16.16b, v20.16buqsub   v22.16b, v17.16b, v20.16buqsub   v23.16b, v18.16b, v20.16buqsub   v24.16b, v19.16b, v20.16bmvni    v25.4s, #0x0mvni    v26.4s, #0x0mov     v9.d[1], v8.d[0]mov     v10.d[1], v11.d[0]mov     v27.16b, v9.16bmov     v28.16b, v10.16btbl     v16.16b, {v25.16b-v28.16b}, v16.16bmvni    v25.4s, #0x0mvni    v26.4s, #0x0mov     v27.16b, v25.16bmov     v28.16b, v26.16bmov     v29.16b, v9.16bmov     v30.16b, v10.16btbl     v17.16b, {v27.16b-v30.16b}, v17.16bmvni    v25.4s, #0x0mvni    v26.4s, #0x0mov     v27.16b, v9.16bmov     v28.16b, v10.16btbl     v18.16b, {v25.16b-v28.16b}, v18.16bmvni    v25.4s, #0x0mvni    v26.4s, #0x0mov     v27.16b, v25.16bmov     v28.16b, v26.16bmov     v29.16b, v9.16bmov     v30.16b, v10.16btbl     v19.16b, {v27.16b-v30.16b}, v19.16bldp     q26, q25,[x29, #48]ldp     q28, q27,[x29, #16]tbx     v21.16b, {v25.16b-v28.16b}, v21.16btbx     v22.16b, {v25.16b-v28.16b}, v22.16btbx     v23.16b, {v25.16b-v28.16b}, v23.16btbx     v24.16b, {v25.16b-v28.16b}, v24.16borr     v16.16b, v16.16b, v21.16borr     v17.16b, v17.16b, v22.16borr     v18.16b, v18.16b, v23.16borr     v19.16b, v19.16b, v24.16bcmhi    v21.16b, v16.16b, v20.16bcmhi    v22.16b, v17.16b, v20.16borr     v21.16b, v21.16b, v22.16bcmhi    v22.16b, v18.16b, v20.16borr     v21.16b, v21.16b, v22.16bcmhi    v22.16b, v19.16b, v20.16borr     v21.16b, v21.16b, v22.16bumaxp   v21.4s, v21.4s, v21.4smov     x2, v21.d[0]cmp     x2, #0x0b.ne0xffffa8327350  // b.anyshl     v16.16b, v16.16b, #2ushr    v21.16b, v17.16b, #4orr     v8.16b, v16.16b, v21.16bshl     v16.16b, v17.16b, #4ushr    v17.16b, v18.16b, #2orr     v11.16b, v16.16b, v17.16bshl     v16.16b, v18.16b, #6orr     v12.16b, v16.16b, v19.16bmov     w2, w19ldr     x0,[x29, #272]mov     x1, x28adrp    x11,0xffffa8e1c000add     x11, x11, #0xf00mov     v14.d[0], v8.d[1]mov     v9.d[0], v11.d[1]mov     v10.d[0], v12.d[1]ldr     x3,[x11]blr     x3mov     v8.d[1], v14.d[0]mov     v11.d[1], v9.d[0]mov     v12.d[1], v10.d[0]ldr     x7,[x29, #272]mov     v16.16b, v8.16bmov     v17.16b, v11.16bmov     v18.16b, v12.16bst3     {v16.16b-v18.16b},[x7]ldr     x6,[x29, #280]add     x6, x6, #0x40add     x7, x7, #0x30ldr     x3,[x29, #288]cmp     x6, x3str     x7,[x29, #272]str     x3,[x29, #288]ldp     q10, q9,[x29, #80]b.ls0xffffa832754c  // b.plaststr     x6,[x29, #280]         //  <==== End ofloopldr     x6,[x29, #280]mov     x4, x6ldr     x7,[x29, #272]mov     x5, x7ldr     x6,[x29, #312]cmp     x4, x6b.eq0xffffa8327740  // b.none

This was referencedApr 9, 2024

Closed

at GC_Regressions._v2_0_beta2_452950_452950_452950_._v2_0_beta2_452950_452950_452950_cmd()#100174

Closed

[browser]Process terminated assert handling at icall.c:6162#100249

Closed

System.Text.RegularExpressions.Tests on Mono_Minijit_Debug-Ubuntu#100715

Closed

Copy link

ContributorAuthor

SwapnilGaikwad commentedApr 13, 2024

Hi@kunalspathak, there is no change in assembly after the refactoring.

New assembly sequence

b.cc0xffffa8337378  // b.lo, b.ul, b.lastmovx6, x27strx28, [x29, #272]moviv8.16b, #0x3fstrq8, [x29, #256]ldrq9, 0xffffa83378b0strq9, [x29, #96]ldrq10, 0xffffa83378c0strq10, [x29, #80]ldrq11, 0xffffa83378d0strq11, [x29, #64]ldrq12, 0xffffa83378e0strq12, [x29, #48]ldrq13, 0xffffa83378f0strq13, [x29, #32]ldrq14, 0xffffa8337900strq14, [x29, #16]strw4, [x29, #344]movw2, w4strx6, [x29, #280]movx0, x6movx1, x27adrpx11, 0xffffa8e2c000addx11, x11, #0xef8movv8.d[0], v9.d[1]movv11.d[0], v10.d[1]ldrx13, [x11]blrx13ldrx6, [x29, #280]ld4{v16.16b-v19.16b}, [x6]stpq16, q17, [x29, #192]stpq18, q19, [x29, #224]ldpq16, q17, [x29, #192]ldpq18, q19, [x29, #224]ldrq20, [x29, #256]uqsubv21.16b, v16.16b, v20.16buqsubv22.16b, v17.16b, v20.16buqsubv23.16b, v18.16b, v20.16buqsubv24.16b, v19.16b, v20.16bmvniv25.4s, #0x0mvniv26.4s, #0x0movv9.d[1], v8.d[0]movv10.d[1], v11.d[0]movv27.16b, v9.16bmovv28.16b, v10.16btblv16.16b, {v25.16b-v28.16b}, v16.16bmvniv25.4s, #0x0mvniv26.4s, #0x0movv27.16b, v25.16bmovv28.16b, v26.16bmovv29.16b, v9.16bmovv30.16b, v10.16btblv17.16b, {v27.16b-v30.16b}, v17.16bmvniv25.4s, #0x0mvniv26.4s, #0x0movv27.16b, v9.16bmovv28.16b, v10.16btblv18.16b, {v25.16b-v28.16b}, v18.16bmvniv25.4s, #0x0mvniv26.4s, #0x0movv27.16b, v25.16bmovv28.16b, v26.16bmovv29.16b, v9.16bmovv30.16b, v10.16btblv19.16b, {v27.16b-v30.16b}, v19.16bldpq26, q25, [x29, #48]ldpq28, q27, [x29, #16]tbxv21.16b, {v25.16b-v28.16b}, v21.16btbxv22.16b, {v25.16b-v28.16b}, v22.16btbxv23.16b, {v25.16b-v28.16b}, v23.16btbxv24.16b, {v25.16b-v28.16b}, v24.16borrv16.16b, v16.16b, v21.16borrv17.16b, v17.16b, v22.16borrv18.16b, v18.16b, v23.16borrv19.16b, v19.16b, v24.16bcmhiv21.16b, v16.16b, v20.16bcmhiv22.16b, v17.16b, v20.16borrv21.16b, v21.16b, v22.16bcmhiv22.16b, v18.16b, v20.16borrv21.16b, v21.16b, v22.16bcmhiv22.16b, v19.16b, v20.16borrv21.16b, v21.16b, v22.16bumaxpv21.4s, v21.4s, v21.4smovx2, v21.d[0]cmpx2, #0x0b.ne0xffffa8337350  // b.anyshlv16.16b, v16.16b, #2ushrv21.16b, v17.16b, #4orrv8.16b, v16.16b, v21.16bshlv16.16b, v17.16b, #4ushrv17.16b, v18.16b, #2orrv11.16b, v16.16b, v17.16bshlv16.16b, v18.16b, #6orrv12.16b, v16.16b, v19.16bmovw2, w19ldrx0, [x29, #272]movx1, x28adrpx11, 0xffffa8e2c000addx11, x11, #0xf00movv14.d[0], v8.d[1]movv9.d[0], v11.d[1]movv10.d[0], v12.d[1]ldrx3, [x11]blrx3movv8.d[1], v14.d[0]movv11.d[1], v9.d[0]movv12.d[1], v10.d[0]ldrx7, [x29, #272]movv16.16b, v8.16bmovv17.16b, v11.16bmovv18.16b, v12.16bst3{v16.16b-v18.16b}, [x7]ldrx6, [x29, #280]addx6, x6, #0x40addx7, x7, #0x30ldrx3, [x29, #288]cmpx6, x3strx7, [x29, #272]strx3, [x29, #288]ldpq10, q9, [x29, #80]b.ls0xffffa833754c  // b.plaststrx6, [x29, #280]ldrx6, [x29, #280]movx4, x6ldrx7, [x29, #272]movx5, x7ldrx6, [x29, #312]cmpx4, x6b.eq0xffffa8337740

kunalspathak approved these changes

Copy link

Contributor

kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM

kunalspathak merged commitfa1164c intodotnet:main

tannergooding reviewed

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

Comment on lines +867 to +868

		Vector128<byte>decLutOne1=Vector128.Create(0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF).AsByte();
		Vector128<byte>decLutOne2=Vector128.Create(0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF).AsByte();

Copy link

Member

tannergoodingApr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

These could have beenVector128<byte>.AllBitsSet

tannergooding reviewed

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

Comment on lines +867 to +874

		Vector128<byte>decLutOne1=Vector128.Create(0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF).AsByte();
		Vector128<byte>decLutOne2=Vector128.Create(0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF).AsByte();
		Vector128<byte>decLutOne3=Vector128.Create(0xFFFFFFFF,0xFFFFFFFF,0x3EFFFFFF,0x3FFFFFFF).AsByte();
		Vector128<byte>decLutOne4=Vector128.Create(0x37363534,0x3B3A3938,0xFFFF3D3C,0xFFFFFFFF).AsByte();
		Vector128<byte>decLutTwo1=Vector128.Create(0x0100FF00,0x05040302,0x09080706,0x0D0C0B0A).AsByte();
		Vector128<byte>decLutTwo2=Vector128.Create(0x11100F0E,0x15141312,0x19181716,0xFFFFFFFF).AsByte();
		Vector128<byte>decLutTwo3=Vector128.Create(0x1B1AFFFF,0x1F1E1D1C,0x23222120,0x27262524).AsByte();
		Vector128<byte>decLutTwo4=Vector128.Create(0x2B2A2928,0x2F2E2D2C,0x33323130,0xFFFFFFFF).AsByte();

Copy link

Member

tannergoodingApr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It's not clear why these all need to be declared up front. It's effectively defining unnecessary locals and work for the JIT to do.

This could have beenvar decLutOne = (Vector128<byte>.AllBitsSet, Vector128<byte>.AllBitsSet, Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0x3EFFFFFF, 0x3FFFFFFF).AsByte(), Vector128.Create(0x37363534, 0x3B3A3938, 0xFFFF3D3C, 0xFFFFFFFF).AsByte()) instead, as an example.

tannergooding reviewed

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

Comment on lines +876 to +890

		Vector128<byte>decOne1;
		Vector128<byte>decOne2;
		Vector128<byte>decOne3;
		Vector128<byte>decOne4;
		Vector128<byte>decTwo1;
		Vector128<byte>decTwo2;
		Vector128<byte>decTwo3;
		Vector128<byte>decTwo4;
		Vector128<byte>str1;
		Vector128<byte>str2;
		Vector128<byte>str3;
		Vector128<byte>str4;
		Vector128<byte>res1;
		Vector128<byte>res2;
		Vector128<byte>res3;

Copy link

Member

tannergoodingApr 14, 2024•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Similar here, it's not clear why these were defined up front like this. They could have been declared in the loop at the declaration site (since they aren't used outside the loop) making the overall code much more readable, without needing 15 up front declarations

tannergooding reviewed

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs


		byte*src=srcBytes;
		byte*dest=destBytes;
		Vector128<byte>offset=AdvSimd.DuplicateToVector128((byte)0x3F);

Copy link

Member

tannergoodingApr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Vector128.Create<byte>(0x3F) is preferred overAdvSimd.DuplicateToVector128.

The former allows more general optimizations to kick in as it does not prescribe a particular instruction be emitted.

tannergooding reviewed

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

		do
		{
		// Load 64 bytes and de-interleave.
		AssertRead<Vector128<byte>>(src,srcStart,sourceLength);

Copy link

Member

tannergoodingApr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I know this is pre-existing to the PR, but this is not a great method name. It should probably be something likeAssertReadIsValid or similar.

tannergooding reviewed

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

		{
		// Load 64 bytes and de-interleave.
		AssertRead<Vector128<byte>>(src,srcStart,sourceLength);
		(str1,str2,str3,str4)=AdvSimd.Arm64.LoadVector128x4AndUnzip(src);

Copy link

Member

tannergoodingApr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It would be nice ifLoadVector128x4AndUnzip was something we could pattern match and implicitly light up.

Copy link

ContributorAuthor

SwapnilGaikwadApr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Did you mean a wrapper method, sayLoadVectorAndUnzip(), that would do the right thing based on the return type?

Copy link

Member

tannergoodingApr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

No, rather as per my comment above (#100589 (comment)) we prefer to write xplat code so logic can be shared across CPU architectures where possible.

To facilitate this, we ideally have the JIT recognize simple patterns and optimize them.LoadVector128x4, for example, could be done automatically if the JIT were to recognize 4x sequentialVector128.Load.

LoadVector128x4AndUnzip could theoretically have the same done if we were to recognize 4x sequenceVector128.Load and a general pattern forunzip. Today that would be recognizingShuffle in particular, but longer term it might be a good reason for there to be an xplatVector128.Unzip (since that just corresponds toUnzip on Arm64 andUnpack onx64).

The driving question tends to be, "does this logic actually have to be platform/ISA specific, or is there a simple pattern we could enable the JIT to recognize instead?". In the case simple pattern recognition is possible, then it's generally ideal to enable it so that existing user code lights up and gets faster on Arm64 without the need for users to go and manually do the work. It also lets the same code automatically light-up for things likeAdvSimd vsSVE, rather than being strictly tied down toAdvSimd.

Copy link

ContributorAuthor

SwapnilGaikwadApr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Right, thanks for this clarification. I'll keep this mind for the common patterns in the future.

tannergooding reviewed

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

Comment on lines +942 to +944

		res1=AdvSimd.ShiftLeftLogical(str1,2)\|AdvSimd.ShiftRightLogical(str2,4);
		res2=AdvSimd.ShiftLeftLogical(str2,4)\|AdvSimd.ShiftRightLogical(str3,2);
		res3=AdvSimd.ShiftLeftLogical(str3,6)\|str4;

Copy link

Member

tannergoodingApr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

These could have been of the formres1 = (str1 << 2) | (str2 << 4);

tannergooding reviewed

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

Comment on lines +930 to +933

		Vector128<byte>classified=AdvSimd.CompareGreaterThan(str1,offset)
		\|AdvSimd.CompareGreaterThan(str2,offset)
		\|AdvSimd.CompareGreaterThan(str3,offset)
		\|AdvSimd.CompareGreaterThan(str4,offset);

Copy link

Member

tannergoodingApr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

These could have been the xplatVector128.GreaterThan(str1, offset)

Copy link

ContributorAuthor

SwapnilGaikwadApr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I didn't understand the xplat bit. Could you elaborate on this, please?

Copy link

Member

tannergoodingApr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We have APIs defined onVector128 that expose functionality common across multiple ISAs (AdvSimd, SVE, etc) and common across multiple architectures (Arm64, x64, WASM, etc).

Using these APIs tends to be preferred over using the platform specific APIs, particularly when there is a clear 1-to-1 mapping. Thusx + y is preferred overAdvSimd.Add andVector128.GreaterThan is preffered overAdvSimd.CompareGreaterThan.

Using the xplat (cross platform) APIs gives the JIT more flexibility in the code it generates and allows more portability, while also often improving readability.

Copy link

ContributorAuthor

SwapnilGaikwadApr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Makes sense, I'll update these calls 👍

tannergooding reviewed