Use multi-reg load/store for DecodeFromUTF8 (#100589)


Merged

Conversation

@SwapnilGaikwad (Contributor)

It implements the decode-from-UTF8 algorithm listed here.

@ghost added the needs-area-label (An area label is needed to ensure this gets routed to the appropriate area owners) label on Apr 3, 2024
@dotnet-policy-service bot added the community-contribution (Indicates that the PR has been added by a community member) label on Apr 3, 2024
@MihaZupan added area-System.Memory and removed needs-area-label labels on Apr 3, 2024
@SwapnilGaikwad (Contributor, Author)

On an N1 machine, the patch improves performance:

| Method                          | Toolchain                                                                  | MemoryRandomization | NumberOfBytes | Mean     | Error    | StdDev   | Median   | Min      | Max      | Ratio | MannWhitney(2%) |
|-------------------------------- |--------------------------------------------------------------------------- |-------------------- |-------------- |---------:|---------:|---------:|---------:|---------:|---------:|------:|---------------- |
| Base64Decode                    | /base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | Default             | 1000          | 357.7 ns |  0.18 ns |  0.16 ns | 357.7 ns | 357.3 ns | 357.9 ns |  1.00 | Base            |
| Base64Decode                    | /patch/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | Default             | 1000          | 283.7 ns |  0.03 ns |  0.03 ns | 283.7 ns | 283.7 ns | 283.8 ns |  0.79 | Faster          |
| Base64DecodeDestinationTooSmall | /base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | True                | 1000          | 369.0 ns | 10.47 ns | 12.06 ns | 368.5 ns | 356.9 ns | 383.1 ns |  1.00 | Base            |
| Base64DecodeDestinationTooSmall | /patch/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | True                | 1000          | 286.4 ns |  3.69 ns |  3.46 ns | 283.9 ns | 283.8 ns | 291.5 ns |  0.78 | Faster          |

@SwapnilGaikwad (Contributor, Author)

@kunalspathak @a74nh

@kunalspathak (Contributor)

@dotnet/jit-contrib @tannergooding


// Invalid values are set to 255 during the above look-ups using 'decLutOne' and 'decLutTwo'.
// Thus the intermediate results 'decOne' and 'decTwo' can be OR-ed to get the final values.
str1 = AdvSimd.Or(decOne1, decTwo1);
Contributor

Question for others: when writing a routine that is purely AdvSimd, for basic functions such as Or, should we be using AdvSimd throughout, or mixing APIs and using Vector128?

Contributor

I agree. We should just use decOne1 | decTwo1.

@tannergooding (Member), Apr 8, 2024 (edited)

In general we want to avoid writing routines that are purely for a given architecture or ISA wherever possible.

It is simply not maintainable to write and support implementations for all of:

  • Arm64
    • AdvSimd
    • Sve/Sve2
  • x86/x64
    • Sse2/Sse4.1
    • Avx/Avx2
    • Avx512/Avx10
  • WASM
  • RISC-V
  • LoongArch64
  • etc

So, in the ideal world we are using xplat code and sharing across all target platforms for the vast majority of code we write. The only duplication we should have in the common case is for different vector sizes, that is, currently duplicating code across V128/V256/V512 and Vector based on the sizes we want to support. Even this is something we're looking at allowing more sharing of in the future via ISimdVector<TSelf, T>, so it is truly write-once as much as possible.

We then opportunistically light up usage of platform specific behavior only if the additional complexity is justified by the perf gains and ideally only where absolutely required, still using shared xplat code for most of the algorithm.

The xplat APIs should cover most core functionality and the JIT should recognize common patterns and optimize to better platform specific codegen where possible. Where there is some functionality that might be missing, we can propose and expose new APIs that work better across the range of platforms we need to support.

@a74nh (Contributor) left a comment

Otherwise LGTM

@kunalspathak (Contributor) left a comment

Add a few comments, and could you please share the generated code?

@SwapnilGaikwad (Contributor, Author)

> Add a few comments, and could you please share the generated code?

Assembly for the 'AdvSimdDecode'. It's extracted from the 'System.Private.CoreLib.dll'.
```asm
b.cc    0xffffa8327378          // b.lo, b.ul, b.last
mov     x6, x27
str     x28, [x29, #272]
movi    v8.16b, #0x3f           // <==== Line#894: DuplicateToVector128()
str     q8, [x29, #256]
ldr     q9, 0xffffa83278b0
str     q9, [x29, #96]
ldr     q10, 0xffffa83278c0
str     q10, [x29, #80]
ldr     q11, 0xffffa83278d0
str     q11, [x29, #64]
ldr     q12, 0xffffa83278e0
str     q12, [x29, #48]
ldr     q13, 0xffffa83278f0
str     q13, [x29, #32]
ldr     q14, 0xffffa8327900
str     q14, [x29, #16]
str     w4, [x29, #344]
mov     w2, w4
str     x6, [x29, #280]
mov     x0, x6
mov     x1, x27
adrp    x11, 0xffffa8e1c000
add     x11, x11, #0xef8
mov     v8.d[0], v9.d[1]
mov     v11.d[0], v10.d[1]
ldr     x13, [x11]
blr     x13
ldr     x6, [x29, #280]
ld4     {v16.16b-v19.16b}, [x6] // <==== Line#902: LoadVector128x4AndUnzip() / approx. loop start
stp     q16, q17, [x29, #192]
stp     q18, q19, [x29, #224]
ldp     q16, q17, [x29, #192]
ldp     q18, q19, [x29, #224]
ldr     q20, [x29, #256]
uqsub   v21.16b, v16.16b, v20.16b
uqsub   v22.16b, v17.16b, v20.16b
uqsub   v23.16b, v18.16b, v20.16b
uqsub   v24.16b, v19.16b, v20.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v9.d[1], v8.d[0]
mov     v10.d[1], v11.d[0]
mov     v27.16b, v9.16b
mov     v28.16b, v10.16b
tbl     v16.16b, {v25.16b-v28.16b}, v16.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v25.16b
mov     v28.16b, v26.16b
mov     v29.16b, v9.16b
mov     v30.16b, v10.16b
tbl     v17.16b, {v27.16b-v30.16b}, v17.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v9.16b
mov     v28.16b, v10.16b
tbl     v18.16b, {v25.16b-v28.16b}, v18.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v25.16b
mov     v28.16b, v26.16b
mov     v29.16b, v9.16b
mov     v30.16b, v10.16b
tbl     v19.16b, {v27.16b-v30.16b}, v19.16b
ldp     q26, q25, [x29, #48]
ldp     q28, q27, [x29, #16]
tbx     v21.16b, {v25.16b-v28.16b}, v21.16b
tbx     v22.16b, {v25.16b-v28.16b}, v22.16b
tbx     v23.16b, {v25.16b-v28.16b}, v23.16b
tbx     v24.16b, {v25.16b-v28.16b}, v24.16b
orr     v16.16b, v16.16b, v21.16b
orr     v17.16b, v17.16b, v22.16b
orr     v18.16b, v18.16b, v23.16b
orr     v19.16b, v19.16b, v24.16b
cmhi    v21.16b, v16.16b, v20.16b
cmhi    v22.16b, v17.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
cmhi    v22.16b, v18.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
cmhi    v22.16b, v19.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
umaxp   v21.4s, v21.4s, v21.4s
mov     x2, v21.d[0]
cmp     x2, #0x0
b.ne    0xffffa8327350          // b.any
shl     v16.16b, v16.16b, #2
ushr    v21.16b, v17.16b, #4
orr     v8.16b, v16.16b, v21.16b
shl     v16.16b, v17.16b, #4
ushr    v17.16b, v18.16b, #2
orr     v11.16b, v16.16b, v17.16b
shl     v16.16b, v18.16b, #6
orr     v12.16b, v16.16b, v19.16b
mov     w2, w19
ldr     x0, [x29, #272]
mov     x1, x28
adrp    x11, 0xffffa8e1c000
add     x11, x11, #0xf00
mov     v14.d[0], v8.d[1]
mov     v9.d[0], v11.d[1]
mov     v10.d[0], v12.d[1]
ldr     x3, [x11]
blr     x3
mov     v8.d[1], v14.d[0]
mov     v11.d[1], v9.d[0]
mov     v12.d[1], v10.d[0]
ldr     x7, [x29, #272]
mov     v16.16b, v8.16b
mov     v17.16b, v11.16b
mov     v18.16b, v12.16b
st3     {v16.16b-v18.16b}, [x7]
ldr     x6, [x29, #280]
add     x6, x6, #0x40
add     x7, x7, #0x30
ldr     x3, [x29, #288]
cmp     x6, x3
str     x7, [x29, #272]
str     x3, [x29, #288]
ldp     q10, q9, [x29, #80]
b.ls    0xffffa832754c          // b.plast
str     x6, [x29, #280]         // <==== End of loop
ldr     x6, [x29, #280]
mov     x4, x6
ldr     x7, [x29, #272]
mov     x5, x7
ldr     x6, [x29, #312]
cmp     x4, x6
b.eq    0xffffa8327740          // b.none
```

@SwapnilGaikwad (Contributor, Author)

Hi @kunalspathak, there is no change in the assembly after the refactoring.

New assembly sequence
```asm
b.cc    0xffffa8337378          // b.lo, b.ul, b.last
mov     x6, x27
str     x28, [x29, #272]
movi    v8.16b, #0x3f
str     q8, [x29, #256]
ldr     q9, 0xffffa83378b0
str     q9, [x29, #96]
ldr     q10, 0xffffa83378c0
str     q10, [x29, #80]
ldr     q11, 0xffffa83378d0
str     q11, [x29, #64]
ldr     q12, 0xffffa83378e0
str     q12, [x29, #48]
ldr     q13, 0xffffa83378f0
str     q13, [x29, #32]
ldr     q14, 0xffffa8337900
str     q14, [x29, #16]
str     w4, [x29, #344]
mov     w2, w4
str     x6, [x29, #280]
mov     x0, x6
mov     x1, x27
adrp    x11, 0xffffa8e2c000
add     x11, x11, #0xef8
mov     v8.d[0], v9.d[1]
mov     v11.d[0], v10.d[1]
ldr     x13, [x11]
blr     x13
ldr     x6, [x29, #280]
ld4     {v16.16b-v19.16b}, [x6]
stp     q16, q17, [x29, #192]
stp     q18, q19, [x29, #224]
ldp     q16, q17, [x29, #192]
ldp     q18, q19, [x29, #224]
ldr     q20, [x29, #256]
uqsub   v21.16b, v16.16b, v20.16b
uqsub   v22.16b, v17.16b, v20.16b
uqsub   v23.16b, v18.16b, v20.16b
uqsub   v24.16b, v19.16b, v20.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v9.d[1], v8.d[0]
mov     v10.d[1], v11.d[0]
mov     v27.16b, v9.16b
mov     v28.16b, v10.16b
tbl     v16.16b, {v25.16b-v28.16b}, v16.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v25.16b
mov     v28.16b, v26.16b
mov     v29.16b, v9.16b
mov     v30.16b, v10.16b
tbl     v17.16b, {v27.16b-v30.16b}, v17.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v9.16b
mov     v28.16b, v10.16b
tbl     v18.16b, {v25.16b-v28.16b}, v18.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v25.16b
mov     v28.16b, v26.16b
mov     v29.16b, v9.16b
mov     v30.16b, v10.16b
tbl     v19.16b, {v27.16b-v30.16b}, v19.16b
ldp     q26, q25, [x29, #48]
ldp     q28, q27, [x29, #16]
tbx     v21.16b, {v25.16b-v28.16b}, v21.16b
tbx     v22.16b, {v25.16b-v28.16b}, v22.16b
tbx     v23.16b, {v25.16b-v28.16b}, v23.16b
tbx     v24.16b, {v25.16b-v28.16b}, v24.16b
orr     v16.16b, v16.16b, v21.16b
orr     v17.16b, v17.16b, v22.16b
orr     v18.16b, v18.16b, v23.16b
orr     v19.16b, v19.16b, v24.16b
cmhi    v21.16b, v16.16b, v20.16b
cmhi    v22.16b, v17.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
cmhi    v22.16b, v18.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
cmhi    v22.16b, v19.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
umaxp   v21.4s, v21.4s, v21.4s
mov     x2, v21.d[0]
cmp     x2, #0x0
b.ne    0xffffa8337350          // b.any
shl     v16.16b, v16.16b, #2
ushr    v21.16b, v17.16b, #4
orr     v8.16b, v16.16b, v21.16b
shl     v16.16b, v17.16b, #4
ushr    v17.16b, v18.16b, #2
orr     v11.16b, v16.16b, v17.16b
shl     v16.16b, v18.16b, #6
orr     v12.16b, v16.16b, v19.16b
mov     w2, w19
ldr     x0, [x29, #272]
mov     x1, x28
adrp    x11, 0xffffa8e2c000
add     x11, x11, #0xf00
mov     v14.d[0], v8.d[1]
mov     v9.d[0], v11.d[1]
mov     v10.d[0], v12.d[1]
ldr     x3, [x11]
blr     x3
mov     v8.d[1], v14.d[0]
mov     v11.d[1], v9.d[0]
mov     v12.d[1], v10.d[0]
ldr     x7, [x29, #272]
mov     v16.16b, v8.16b
mov     v17.16b, v11.16b
mov     v18.16b, v12.16b
st3     {v16.16b-v18.16b}, [x7]
ldr     x6, [x29, #280]
add     x6, x6, #0x40
add     x7, x7, #0x30
ldr     x3, [x29, #288]
cmp     x6, x3
str     x7, [x29, #272]
str     x3, [x29, #288]
ldp     q10, q9, [x29, #80]
b.ls    0xffffa833754c          // b.plast
str     x6, [x29, #280]
ldr     x6, [x29, #280]
mov     x4, x6
ldr     x7, [x29, #272]
mov     x5, x7
ldr     x6, [x29, #312]
cmp     x4, x6
b.eq    0xffffa8337740
```

@kunalspathak (Contributor) left a comment

LGTM

@kunalspathak merged commit fa1164c into dotnet:main on Apr 14, 2024
Comment on lines +867 to +868
Vector128<byte> decLutOne1 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutOne2 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();


These could have been Vector128<byte>.AllBitsSet.

Comment on lines +867 to +874
Vector128<byte> decLutOne1 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutOne2 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutOne3 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0x3EFFFFFF, 0x3FFFFFFF).AsByte();
Vector128<byte> decLutOne4 = Vector128.Create(0x37363534, 0x3B3A3938, 0xFFFF3D3C, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutTwo1 = Vector128.Create(0x0100FF00, 0x05040302, 0x09080706, 0x0D0C0B0A).AsByte();
Vector128<byte> decLutTwo2 = Vector128.Create(0x11100F0E, 0x15141312, 0x19181716, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutTwo3 = Vector128.Create(0x1B1AFFFF, 0x1F1E1D1C, 0x23222120, 0x27262524).AsByte();
Vector128<byte> decLutTwo4 = Vector128.Create(0x2B2A2928, 0x2F2E2D2C, 0x33323130, 0xFFFFFFFF).AsByte();


It's not clear why these all need to be declared up front. It's effectively defining unnecessary locals and work for the JIT to do.

This could have been var decLutOne = (Vector128<byte>.AllBitsSet, Vector128<byte>.AllBitsSet, Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0x3EFFFFFF, 0x3FFFFFFF).AsByte(), Vector128.Create(0x37363534, 0x3B3A3938, 0xFFFF3D3C, 0xFFFFFFFF).AsByte()) instead, as an example.

Comment on lines +876 to +890
Vector128<byte> decOne1;
Vector128<byte> decOne2;
Vector128<byte> decOne3;
Vector128<byte> decOne4;
Vector128<byte> decTwo1;
Vector128<byte> decTwo2;
Vector128<byte> decTwo3;
Vector128<byte> decTwo4;
Vector128<byte> str1;
Vector128<byte> str2;
Vector128<byte> str3;
Vector128<byte> str4;
Vector128<byte> res1;
Vector128<byte> res2;
Vector128<byte> res3;
@tannergooding (Member), Apr 14, 2024 (edited)

Similar here; it's not clear why these were defined up front like this. They could have been declared inside the loop at their first use (since they aren't used outside the loop), making the overall code much more readable without needing 15 up-front declarations.


byte* src = srcBytes;
byte* dest = destBytes;
Vector128<byte> offset = AdvSimd.DuplicateToVector128((byte)0x3F);


Vector128.Create<byte>(0x3F) is preferred over AdvSimd.DuplicateToVector128.

The former allows more general optimizations to kick in as it does not prescribe a particular instruction be emitted.

do
{
// Load 64 bytes and de-interleave.
AssertRead<Vector128<byte>>(src, srcStart, sourceLength);


I know this is pre-existing to the PR, but this is not a great method name. It should probably be something like AssertReadIsValid or similar.

{
// Load 64 bytes and de-interleave.
AssertRead<Vector128<byte>>(src, srcStart, sourceLength);
(str1, str2, str3, str4) = AdvSimd.Arm64.LoadVector128x4AndUnzip(src);


It would be nice if LoadVector128x4AndUnzip were something we could pattern match and implicitly light up.

Contributor (Author)

Did you mean a wrapper method, say LoadVectorAndUnzip(), that would do the right thing based on the return type?


No, rather as per my comment above (#100589 (comment)), we prefer to write xplat code so logic can be shared across CPU architectures where possible.

To facilitate this, we ideally have the JIT recognize simple patterns and optimize them. LoadVector128x4, for example, could be done automatically if the JIT were to recognize 4x sequential Vector128.Load.

LoadVector128x4AndUnzip could theoretically have the same done if we were to recognize a 4x sequential Vector128.Load and a general pattern for unzip. Today that would be recognizing Shuffle in particular, but longer term it might be a good reason for there to be an xplat Vector128.Unzip (since that just corresponds to Unzip on Arm64 and Unpack on x64).


The driving question tends to be, "does this logic actually have to be platform/ISA specific, or is there a simple pattern we could enable the JIT to recognize instead?". In the case simple pattern recognition is possible, it's generally ideal to enable it so that existing user code lights up and gets faster on Arm64 without the need for users to go and manually do the work. It also lets the same code automatically light up for things like AdvSimd vs SVE, rather than being strictly tied down to AdvSimd.
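As an aside for readers following the unzip discussion: the de-interleaving that the ld4 instruction (exposed as LoadVector128x4AndUnzip) performs can be modeled in scalar Python. The function name and list representation below are illustrative only, not part of the runtime.

```python
def load_x4_and_unzip(src):
    """De-interleave a byte sequence into 4 streams, mirroring Arm64 ld4:
    element i of stream k comes from src[4*i + k]."""
    assert len(src) % 4 == 0
    return tuple(src[k::4] for k in range(4))

# 64 consecutive input bytes become 4 streams of 16 bytes each,
# one per Base64 character position within each 4-character group.
streams = load_x4_and_unzip(list(range(64)))
```

With this layout, stream 0 holds bytes 0, 4, 8, ..., which is what lets the decode step operate on all "first characters" of each group in one vector.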

Contributor (Author)

Right, thanks for this clarification. I'll keep this in mind for the common patterns in the future.

Comment on lines +942 to +944
res1 = AdvSimd.ShiftLeftLogical(str1, 2) | AdvSimd.ShiftRightLogical(str2, 4);
res2 = AdvSimd.ShiftLeftLogical(str2, 4) | AdvSimd.ShiftRightLogical(str3, 2);
res3 = AdvSimd.ShiftLeftLogical(str3, 6) | str4;


These could have been of the form res1 = (str1 << 2) | (str2 >>> 4);
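For readers, the shift-and-OR repacking these lines perform (four 6-bit values into three output bytes) can be sketched in scalar Python. This is an illustrative model of the per-byte arithmetic, not the vectorized code; the function name is made up here.

```python
def repack(s1, s2, s3, s4):
    """Combine four 6-bit Base64 values into three bytes, mirroring the
    res1/res2/res3 lines above. The vector shifts act per byte, so each
    result is masked to 8 bits."""
    res1 = ((s1 << 2) | (s2 >> 4)) & 0xFF
    res2 = ((s2 << 4) | (s3 >> 2)) & 0xFF
    res3 = ((s3 << 6) | s4) & 0xFF
    return res1, res2, res3

# "TWFu" decodes to the 6-bit values 19, 22, 5, 46, which repack to b"Man".
decoded = bytes(repack(19, 22, 5, 46))
```

The vector form does exactly this in parallel across 16 lanes per register.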

Comment on lines +930 to +933
Vector128<byte> classified = AdvSimd.CompareGreaterThan(str1, offset)
                           | AdvSimd.CompareGreaterThan(str2, offset)
                           | AdvSimd.CompareGreaterThan(str3, offset)
                           | AdvSimd.CompareGreaterThan(str4, offset);


These could have been the xplat Vector128.GreaterThan(str1, offset).

Contributor (Author)

I didn't understand the xplat bit. Could you elaborate on this, please?


We have APIs defined on Vector128 that expose functionality common across multiple ISAs (AdvSimd, SVE, etc.) and common across multiple architectures (Arm64, x64, WASM, etc.).

Using these APIs tends to be preferred over using the platform-specific APIs, particularly when there is a clear 1-to-1 mapping. Thus x + y is preferred over AdvSimd.Add, and Vector128.GreaterThan is preferred over AdvSimd.CompareGreaterThan.

Using the xplat (cross platform) APIs gives the JIT more flexibility in the code it generates and allows more portability, while also often improving readability.

Contributor (Author)

Makes sense, I'll update these calls 👍

Comment on lines +904 to +927
// Get indices for second LUT:
decTwo1 = AdvSimd.SubtractSaturate(str1, offset);
decTwo2 = AdvSimd.SubtractSaturate(str2, offset);
decTwo3 = AdvSimd.SubtractSaturate(str3, offset);
decTwo4 = AdvSimd.SubtractSaturate(str4, offset);

// Get values from first LUT. Out-of-range indices are set to 0.
decOne1 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str1);
decOne2 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str2);
decOne3 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str3);
decOne4 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str4);

// Get values from second LUT. Out-of-range indices are unchanged.
decTwo1 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo1, decLutTwo, decTwo1);
decTwo2 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo2, decLutTwo, decTwo2);
decTwo3 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo3, decLutTwo, decTwo3);
decTwo4 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo4, decLutTwo, decTwo4);

// Invalid values are set to 255 during the above look-ups using 'decLutOne' and 'decLutTwo'.
// Thus the intermediate results 'decOne' and 'decTwo' can be OR-ed to get the final values.
str1 = decOne1 | decTwo1;
str2 = decOne2 | decTwo2;
str3 = decOne3 | decTwo3;
str4 = decOne4 | decTwo4;


This general algorithm probably could do with a comment walking through what's happening so the "magic" is better understood, much as exists for Vector128Decode.

Contributor (Author)

Agreed, it's difficult to follow. It's a bit tricky to describe the entire workings with the lookup tables and interleaving, but I'll try to make it less opaque.
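For anyone trying to follow the two-table scheme above, here is a scalar Python model of one byte's decode. The table contents below are reconstructed from the Base64 alphabet to mirror the approach; they are not the PR's packed constants, and the tbl/tbx out-of-range behavior is modeled explicitly.

```python
OFFSET = 0x3F  # split point between the two tables

# lut_one covers input bytes 0x00-0x3F: '+', '/', and '0'-'9' are valid,
# everything else maps to 0xFF (invalid).
lut_one = [0xFF] * 64
lut_one[ord('+')] = 62
lut_one[ord('/')] = 63
for i, c in enumerate('0123456789'):
    lut_one[ord(c)] = 52 + i

# lut_two is indexed by saturate_sub(byte, 0x3F) and covers 'A'-'Z'/'a'-'z'.
# Index 0 must be 0: every byte <= 0x3F saturates to it, and the OR below
# must not disturb lut_one's result for those bytes.
lut_two = [0xFF] * 64
lut_two[0] = 0
for i, c in enumerate('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'):
    lut_two[ord(c) - OFFSET] = i

def decode_byte(b):
    dec_one = lut_one[b] if b < 64 else 0        # tbl: out-of-range -> 0
    idx = max(b - OFFSET, 0)                     # SubtractSaturate
    dec_two = lut_two[idx] if idx < 64 else idx  # tbx: out-of-range -> unchanged
    return dec_one | dec_two                     # valid results are 0-63
```

Any result above 0x3F marks invalid input, which is what the CompareGreaterThan/classified check detects afterwards.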

@SwapnilGaikwad (Contributor, Author)

Hi @tannergooding, thanks for your comments. I'll add a follow-up PR incorporating as many suggestions as I can.


@SwapnilGaikwad (Contributor, Author)

#101620 addresses the above suggestions.

matouskozak pushed a commit to matouskozak/runtime that referenced this pull request on Apr 30, 2024:

* Use multi-reg load/store for DecodeFromUTF8
* Address review comments
@github-actions bot locked and limited conversation to collaborators on May 27, 2024

Reviewers

@tannergooding left review comments
@a74nh approved these changes
@kunalspathak approved these changes
+2 more reviewers

Assignees

No one assigned

Labels

arch-arm64, area-System.Memory, community-contribution (Indicates that the PR has been added by a community member)

Projects

None yet

Milestone

No milestone


6 participants

@SwapnilGaikwad @kunalspathak @a74nh @tannergooding @BruceForstall @MihaZupan
