- Notifications
You must be signed in to change notification settings - Fork5.2k
Use multi-reg load/store for DecodeFromUTF8#100589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
SwapnilGaikwad commentedApr 3, 2024
On an N1 machine, the patch improves performance: |
SwapnilGaikwad commentedApr 5, 2024
kunalspathak commentedApr 5, 2024
@dotnet/jit-contrib@tannergooding |
| // Invalid values are set to 255 during above look-ups using 'decLutTwo' and 'decLutTwo'. | ||
| // Thus the intermediate results 'decOne' and 'decTwo' could be OR-ed to get final values. | ||
| str1=AdvSimd.Or(decOne1,decTwo1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Question for others: When writing an routine that is purelyAdvSimd, for basic functions such asOr should we be usingAdvSimd throughout or mixing APIs and usingVector128 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I agree. We should just usedecOne1 | decTwo1
tannergoodingApr 8, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
In general we want to avoid writing routines that are purely for a given architecture or ISA where-ever possible.
It is simply not maintainable to write and support implementations for all of:
- Arm64
- AdvSimd
- Sve/Sve2
- x86/x64
- Sse2/Sse4.1
- Avx/Avx2
- Avx512/Avx10
- WASM
- RISC-V
- LoongArch64
- etc
So, in the ideal world we are using xplat code and sharing across all target platforms for the vast majority of code we write. The only duplication we should have in the common case is for different vector sizes. That is, currently duplicating code across V128/V256/V512 and Vector, based on the sizes we want to support -- Even this is something we're looking at allowing more sharing of in the future viaISimdVector<TSelf, T>, so it is truly write once as much as possible.
We then opportunistically light up usage of platform specific behavior only if the additional complexity is justified by the perf gains and ideally only where absolutely required, still using shared xplat code for most of the algorithm.
The xplat APIs should cover most core functionality and the JIT should recognize common patterns and optimize to better platform specific codegen where possible. Where there is some functionality that might be missing, we can propose and expose new APIs that work better across the range of platforms we need to support.
src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
a74nh left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Otherwise LGTM
src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
kunalspathak left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
add few comments and could you please share the code generated.
SwapnilGaikwad commentedApr 9, 2024
Assembly for the 'AdvSimdDecode'. It's extracted from the 'System.Private.CoreLib.dll'.b.cc0xffffa8327378 // b.lo, b.ul, b.lastmov x6, x27str x28,[x29, #272]movi v8.16b, #0x3f // <==== Line#894: DuplicateToVector128()str q8,[x29, #256]ldr q9,0xffffa83278b0str q9,[x29, #96]ldr q10,0xffffa83278c0str q10,[x29, #80]ldr q11,0xffffa83278d0str q11,[x29, #64]ldr q12,0xffffa83278e0str q12,[x29, #48]ldr q13,0xffffa83278f0str q13,[x29, #32]ldr q14,0xffffa8327900str q14,[x29, #16]str w4,[x29, #344]mov w2, w4str x6,[x29, #280]mov x0, x6mov x1, x27adrp x11,0xffffa8e1c000add x11, x11, #0xef8mov v8.d[0], v9.d[1]mov v11.d[0], v10.d[1]ldr x13,[x11]blr x13ldr x6,[x29, #280]ld4 {v16.16b-v19.16b},[x6] // <==== Line#902: LoadVector128x4AndUnzip()/approx.loop startstp q16, q17,[x29, #192]stp q18, q19,[x29, #224]ldp q16, q17,[x29, #192]ldp q18, q19,[x29, #224]ldr q20,[x29, #256]uqsub v21.16b, v16.16b, v20.16buqsub v22.16b, v17.16b, v20.16buqsub v23.16b, v18.16b, v20.16buqsub v24.16b, v19.16b, v20.16bmvni v25.4s, #0x0mvni v26.4s, #0x0mov v9.d[1], v8.d[0]mov v10.d[1], v11.d[0]mov v27.16b, v9.16bmov v28.16b, v10.16btbl v16.16b, {v25.16b-v28.16b}, v16.16bmvni v25.4s, #0x0mvni v26.4s, #0x0mov v27.16b, v25.16bmov v28.16b, v26.16bmov v29.16b, v9.16bmov v30.16b, v10.16btbl v17.16b, {v27.16b-v30.16b}, v17.16bmvni v25.4s, #0x0mvni v26.4s, #0x0mov v27.16b, v9.16bmov v28.16b, v10.16btbl v18.16b, {v25.16b-v28.16b}, v18.16bmvni v25.4s, #0x0mvni v26.4s, #0x0mov v27.16b, v25.16bmov v28.16b, v26.16bmov v29.16b, v9.16bmov v30.16b, v10.16btbl v19.16b, {v27.16b-v30.16b}, v19.16bldp q26, q25,[x29, #48]ldp q28, q27,[x29, #16]tbx v21.16b, {v25.16b-v28.16b}, v21.16btbx v22.16b, {v25.16b-v28.16b}, v22.16btbx v23.16b, {v25.16b-v28.16b}, v23.16btbx v24.16b, {v25.16b-v28.16b}, v24.16borr v16.16b, v16.16b, v21.16borr v17.16b, v17.16b, v22.16borr v18.16b, v18.16b, v23.16borr v19.16b, v19.16b, v24.16bcmhi v21.16b, v16.16b, v20.16bcmhi v22.16b, v17.16b, v20.16borr v21.16b, v21.16b, v22.16bcmhi v22.16b, v18.16b, v20.16borr v21.16b, v21.16b, v22.16bcmhi v22.16b, v19.16b, v20.16borr v21.16b, v21.16b, v22.16bumaxp v21.4s, v21.4s, v21.4smov x2, v21.d[0]cmp x2, #0x0b.ne0xffffa8327350 // b.anyshl v16.16b, v16.16b, #2ushr v21.16b, v17.16b, #4orr v8.16b, v16.16b, v21.16bshl v16.16b, v17.16b, #4ushr v17.16b, v18.16b, #2orr v11.16b, v16.16b, v17.16bshl v16.16b, v18.16b, #6orr v12.16b, v16.16b, v19.16bmov w2, w19ldr x0,[x29, #272]mov x1, x28adrp x11,0xffffa8e1c000add x11, x11, #0xf00mov v14.d[0], v8.d[1]mov v9.d[0], v11.d[1]mov v10.d[0], v12.d[1]ldr x3,[x11]blr x3mov v8.d[1], v14.d[0]mov v11.d[1], v9.d[0]mov v12.d[1], v10.d[0]ldr x7,[x29, #272]mov v16.16b, v8.16bmov v17.16b, v11.16bmov v18.16b, v12.16bst3 {v16.16b-v18.16b},[x7]ldr x6,[x29, #280]add x6, x6, #0x40add x7, x7, #0x30ldr x3,[x29, #288]cmp x6, x3str x7,[x29, #272]str x3,[x29, #288]ldp q10, q9,[x29, #80]b.ls0xffffa832754c // b.plaststr x6,[x29, #280] // <==== End ofloopldr x6,[x29, #280]mov x4, x6ldr x7,[x29, #272]mov x5, x7ldr x6,[x29, #312]cmp x4, x6b.eq0xffffa8327740 // b.none |
SwapnilGaikwad commentedApr 13, 2024
Hi@kunalspathak, there is no change in assembly after the refactoring. New assembly sequence |
kunalspathak left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM
| Vector128<byte>decLutOne1=Vector128.Create(0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF).AsByte(); | ||
| Vector128<byte>decLutOne2=Vector128.Create(0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF).AsByte(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
These could have beenVector128<byte>.AllBitsSet
| Vector128<byte>decLutOne1=Vector128.Create(0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF).AsByte(); | ||
| Vector128<byte>decLutOne2=Vector128.Create(0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF).AsByte(); | ||
| Vector128<byte>decLutOne3=Vector128.Create(0xFFFFFFFF,0xFFFFFFFF,0x3EFFFFFF,0x3FFFFFFF).AsByte(); | ||
| Vector128<byte>decLutOne4=Vector128.Create(0x37363534,0x3B3A3938,0xFFFF3D3C,0xFFFFFFFF).AsByte(); | ||
| Vector128<byte>decLutTwo1=Vector128.Create(0x0100FF00,0x05040302,0x09080706,0x0D0C0B0A).AsByte(); | ||
| Vector128<byte>decLutTwo2=Vector128.Create(0x11100F0E,0x15141312,0x19181716,0xFFFFFFFF).AsByte(); | ||
| Vector128<byte>decLutTwo3=Vector128.Create(0x1B1AFFFF,0x1F1E1D1C,0x23222120,0x27262524).AsByte(); | ||
| Vector128<byte>decLutTwo4=Vector128.Create(0x2B2A2928,0x2F2E2D2C,0x33323130,0xFFFFFFFF).AsByte(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It's not clear why these all need to be declared up front. It's effectively defining unnecessary locals and work for the JIT to do.
This could have beenvar decLutOne = (Vector128<byte>.AllBitsSet, Vector128<byte>.AllBitsSet, Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0x3EFFFFFF, 0x3FFFFFFF).AsByte(), Vector128.Create(0x37363534, 0x3B3A3938, 0xFFFF3D3C, 0xFFFFFFFF).AsByte()) instead, as an example.
| Vector128<byte>decOne1; | ||
| Vector128<byte>decOne2; | ||
| Vector128<byte>decOne3; | ||
| Vector128<byte>decOne4; | ||
| Vector128<byte>decTwo1; | ||
| Vector128<byte>decTwo2; | ||
| Vector128<byte>decTwo3; | ||
| Vector128<byte>decTwo4; | ||
| Vector128<byte>str1; | ||
| Vector128<byte>str2; | ||
| Vector128<byte>str3; | ||
| Vector128<byte>str4; | ||
| Vector128<byte>res1; | ||
| Vector128<byte>res2; | ||
| Vector128<byte>res3; |
tannergoodingApr 14, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Similar here, it's not clear why these were defined up front like this. They could have been declared in the loop at the declaration site (since they aren't used outside the loop) making the overall code much more readable, without needing 15 up front declarations
| byte*src=srcBytes; | ||
| byte*dest=destBytes; | ||
| Vector128<byte>offset=AdvSimd.DuplicateToVector128((byte)0x3F); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Vector128.Create<byte>(0x3F) is preferred overAdvSimd.DuplicateToVector128.
The former allows more general optimizations to kick in as it does not prescribe a particular instruction be emitted.
| do | ||
| { | ||
| // Load 64 bytes and de-interleave. | ||
| AssertRead<Vector128<byte>>(src,srcStart,sourceLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I know this is pre-existing to the PR, but this is not a great method name. It should probably be something likeAssertReadIsValid or similar.
| { | ||
| // Load 64 bytes and de-interleave. | ||
| AssertRead<Vector128<byte>>(src,srcStart,sourceLength); | ||
| (str1,str2,str3,str4)=AdvSimd.Arm64.LoadVector128x4AndUnzip(src); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It would be nice ifLoadVector128x4AndUnzip was something we could pattern match and implicitly light up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Did you mean a wrapper method, sayLoadVectorAndUnzip(), that would do the right thing based on the return type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
No, rather as per my comment above (#100589 (comment)) we prefer to write xplat code so logic can be shared across CPU architectures where possible.
To facilitate this, we ideally have the JIT recognize simple patterns and optimize them.LoadVector128x4, for example, could be done automatically if the JIT were to recognize 4x sequentialVector128.Load.
LoadVector128x4AndUnzip could theoretically have the same done if we were to recognize 4x sequenceVector128.Load and a general pattern forunzip. Today that would be recognizingShuffle in particular, but longer term it might be a good reason for there to be an xplatVector128.Unzip (since that just corresponds toUnzip on Arm64 andUnpack onx64).
The driving question tends to be, "does this logic actually have to be platform/ISA specific, or is there a simple pattern we could enable the JIT to recognize instead?". In the case simple pattern recognition is possible, then it's generally ideal to enable it so that existing user code lights up and gets faster on Arm64 without the need for users to go and manually do the work. It also lets the same code automatically light-up for things likeAdvSimd vsSVE, rather than being strictly tied down toAdvSimd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Right, thanks for this clarification. I'll keep this mind for the common patterns in the future.
| res1=AdvSimd.ShiftLeftLogical(str1,2)|AdvSimd.ShiftRightLogical(str2,4); | ||
| res2=AdvSimd.ShiftLeftLogical(str2,4)|AdvSimd.ShiftRightLogical(str3,2); | ||
| res3=AdvSimd.ShiftLeftLogical(str3,6)|str4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
These could have been of the formres1 = (str1 << 2) | (str2 << 4);
| Vector128<byte>classified=AdvSimd.CompareGreaterThan(str1,offset) | ||
| |AdvSimd.CompareGreaterThan(str2,offset) | ||
| |AdvSimd.CompareGreaterThan(str3,offset) | ||
| |AdvSimd.CompareGreaterThan(str4,offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
These could have been the xplatVector128.GreaterThan(str1, offset)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I didn't understand the xplat bit. Could you elaborate on this, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
We have APIs defined onVector128 that expose functionality common across multiple ISAs (AdvSimd, SVE, etc) and common across multiple architectures (Arm64, x64, WASM, etc).
Using these APIs tends to be preferred over using the platform specific APIs, particularly when there is a clear 1-to-1 mapping. Thusx + y is preferred overAdvSimd.Add andVector128.GreaterThan is preffered overAdvSimd.CompareGreaterThan.
Using the xplat (cross platform) APIs gives the JIT more flexibility in the code it generates and allows more portability, while also often improving readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Makes sense, I'll update these calls 👍
| // Get indices for second LUT: | ||
| decTwo1=AdvSimd.SubtractSaturate(str1,offset); | ||
| decTwo2=AdvSimd.SubtractSaturate(str2,offset); | ||
| decTwo3=AdvSimd.SubtractSaturate(str3,offset); | ||
| decTwo4=AdvSimd.SubtractSaturate(str4,offset); | ||
| // Get values from first LUT. Out-of-range indices are set to 0. | ||
| decOne1=AdvSimd.Arm64.VectorTableLookup(decLutOne,str1); | ||
| decOne2=AdvSimd.Arm64.VectorTableLookup(decLutOne,str2); | ||
| decOne3=AdvSimd.Arm64.VectorTableLookup(decLutOne,str3); | ||
| decOne4=AdvSimd.Arm64.VectorTableLookup(decLutOne,str4); | ||
| // Get values from second LUT. Out-of-range indices are unchanged. | ||
| decTwo1=AdvSimd.Arm64.VectorTableLookupExtension(decTwo1,decLutTwo,decTwo1); | ||
| decTwo2=AdvSimd.Arm64.VectorTableLookupExtension(decTwo2,decLutTwo,decTwo2); | ||
| decTwo3=AdvSimd.Arm64.VectorTableLookupExtension(decTwo3,decLutTwo,decTwo3); | ||
| decTwo4=AdvSimd.Arm64.VectorTableLookupExtension(decTwo4,decLutTwo,decTwo4); | ||
| // Invalid values are set to 255 during above look-ups using 'decLutTwo' and 'decLutTwo'. | ||
| // Thus the intermediate results 'decOne' and 'decTwo' could be OR-ed to get final values. | ||
| str1=decOne1|decTwo1; | ||
| str2=decOne2|decTwo2; | ||
| str3=decOne3|decTwo3; | ||
| str4=decOne4|decTwo4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This general algorithm probably could do with a comment walking through what's happening so the "magic" is better understood, much as exists forVector128Decode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Agreed, it's difficult to follow. Although, it's a bit tricky to describe the entire workings with lookup tables and interleaving. I'll try to make it less opaque.
SwapnilGaikwad commentedApr 15, 2024
Hi@tannergooding , thanks for your comments. I'll add a follow-up PR incorporating as many suggestions as I can. |
SwapnilGaikwad commentedApr 26, 2024
#101620 addresses above suggestions. |
* Use multi-reg load/store for DecodeFromUTF8* Address review comments
It implements the decode from UTF8 algorithm listedhere.