October 3rd, 2023

Arm64 Performance Improvements in .NET 8

Kunal Pathak
Senior Software Engineer

.NET 8 comes with a host of powerful features, enhanced support for cutting-edge architectural capabilities, and a significant boost in performance. For a comprehensive look at the general performance enhancements in .NET 8, see Stephen Toub’s blog post on the subject.

Building on the foundation laid by previous releases such as ARM64 Performance in .NET 5 and ARM64 Performance in .NET 7, this article focuses on the feature enhancements and performance optimizations tailored to the ARM64 architecture in .NET 8.

A key objective for .NET 8 was to enhance the performance of the platform on Arm64 systems. However, we also set our sights on incorporating support for advanced features offered by the Arm architecture, thereby elevating the overall code quality for Arm-based platforms. In this blog post, we’ll examine some of these noteworthy feature additions before diving into the extensive performance improvements that have been implemented. Lastly, we’ll provide insights into the results of our performance analysis on real-world applications designed for Arm64 devices.

Conditional Selection

Branches within the code can significantly impact the efficiency of the processor pipeline, potentially causing execution delays. While modern processors have made strides in branch prediction to maximize accuracy, they still mispredict branches, resulting in wasted processing cycles. Conditional selection instructions offer an alternative by encoding the necessary conditional flags directly within the instruction, eliminating the need for branch instructions. This article offers a comprehensive overview of the various formats of these instructions.

In .NET 8, we’ve taken on the challenge of addressing all the scenarios outlined in dotnet/runtime#55364 to implement conditional selection instructions effectively. In dotnet/runtime#73472, @a74nh introduced a new “if-conversion” phase designed to generate conditional selection instructions. Let’s dig into an illustrative example.

if (op1 > 0) { op1 = 5; }

In .NET 7, we generated a branch for such code:

G_M12157_IG02:
            cmp     w0, #0
            ble     G_M12157_IG04
G_M12157_IG03:
            mov     w0, #5
G_M12157_IG04:
            ...

In .NET 8, we introduced the generation of csel instructions, illustrated below. The cmp instruction sets the flags based on the value in w0, determining whether it’s greater than, equal to, or less than zero. The csel instruction then evaluates the “le” condition (the 4th operand of csel) against the flags set by the cmp. Essentially, it checks if w0 <= 0. If this condition holds true, it preserves the current value of w0 (the 2nd operand of csel) as the result (the 1st operand of csel). However, if w0 > 0, it updates w0 with the value in w1 (the 3rd operand of csel).

G_M12157_IG02:
            mov     w1, #5
            cmp     w0, #0
            csel    w0, w0, w1, le

dotnet/runtime#77728 extended this work to handle else conditions, as seen in the following example:

if (op1 < 7) { op2 = 5; } else { op2 = 9; }

Previously, we generated two branch instructions for if-else style code:

G_M52833_IG02:
            cmp     w0, #7
            bge     G_M52833_IG04
G_M52833_IG03:
            mov     w1, #5
            b       G_M52833_IG05
G_M52833_IG04:
            mov     w1, #9
G_M52833_IG05:
            ...

With the introduction of csel, there is no longer a need for branches in this example. Instead, we can set the appropriate result value based on the conditional flag encoded within the instruction itself.

G_M52833_IG02:
            mov     w2, #9
            mov     w3, #5
            cmp     w0, #7
            csel    w0, w2, w3, ge

As indicated in this, this and this reports, we have observed nearly a 50% improvement in such scenarios.

pr_csel_1.png
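To look at if-conversion on your own machine, a minimal sketch along the following lines may help. The class and method names are made up for illustration; only the two source patterns come from the examples above. Running a program that calls these methods on an Arm64 machine with the DOTNET_JitDisasm environment variable set to a method name (for example DOTNET_JitDisasm=IfElse) should print the code the JIT generated for it. Whether csel actually appears depends on the runtime version and the JIT's heuristics.

```csharp
using System.Runtime.CompilerServices;

// Hypothetical sample class; only the two source patterns come from the article.
public static class IfConversionSamples
{
    // NoInlining gives each method its own disassembly. AggressiveOptimization
    // asks the JIT to skip the unoptimized tier, so the optimized code is shown.
    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
    public static int IfOnly(int op1)
    {
        if (op1 > 0) { op1 = 5; }   // candidate for cmp + csel
        return op1;
    }

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
    public static int IfElse(int op1)
    {
        int op2;
        if (op1 < 7) { op2 = 5; } else { op2 = 9; }   // candidate for cmp + csel
        return op2;
    }
}
```

The methods return the selected value so that the JIT cannot discard the assignment as dead code.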

Conditional Comparison

In dotnet/runtime#83089, we enhanced the code even further by consolidating multiple conditions separated by the || operation into a conditional comparison ccmp instruction.

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test3(int a, int b, int c)
{
    if (a < b || b < c || c < 10)
        return 42;
    return 13;
}

In the code snippet above, we observe several conditions linked together using the logical OR || operator. As a quick reminder, when using ||, we evaluate the first condition, and if it’s true, we don’t check the remaining conditions. We only evaluate the subsequent conditions if the preceding ones are false.

Prior to .NET 8, our approach involved generating numerous instructions to conduct separate comparisons using cmp and storing the result of each comparison in a register. Even if the first condition evaluated to true, we still processed all the conditions joined by the || operator. The outcomes of these individual comparisons were then combined using the orr instruction, and a branch instruction cbz was employed to determine whether to jump based on the final result of all the conditions collectively.

G_M35702_IG02:
            cmp     w0, w1
            cset    x0, lt
            cmp     w1, w2
            cset    x1, lt
            orr     w0, w0, w1
            cmp     w2, #10
            cset    x1, lt
            orr     w0, w0, w1
            cbz     w0, G_M35702_IG05
G_M35702_IG03:
            mov     w0, #42
G_M35702_IG04:
            ldp     fp, lr, [sp],#0x10
            ret     lr
G_M35702_IG05:
            mov     w0, #13
G_M35702_IG06:
            ...

In .NET 8, we’ve introduced a more efficient approach. We start by comparing the values of w0 and w1 (as indicated by the a < b condition in our code snippet) using the cmp instruction. Subsequently, the ccmp w1, w2, nc, ge instruction comes into play. This instruction will only compare the values of w1 and w2 if the previous cmp instruction determined that the condition ge (greater than or equal) was met. In simpler terms, w1 and w2 are compared only when the cmp instruction finds that w0 is greater than or equal to w1, meaning our condition a < b is false. If the first condition evaluates to true, there’s no need for further condition checks, so the ccmp instruction simply sets the flags to the nc value given as its third operand. This process is repeated for the subsequent ccmp instruction. Finally, the csel instruction determines the value (w3 or w4) to place in the result (w0) based on the outcome of the ge condition. This example illustrates how processor cycles are conserved through the use of the ccmp instruction.

G_M35702_IG02:
            mov     w3, #13
            mov     w4, #42
            cmp     w0, w1
            ccmp    w1, w2, nc, ge
            ccmp    w2, #10, nc, ge
            csel    w0, w3, w4, ge
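The flag handling in this sequence can be checked with a small model. The following is only a sketch: the Nzcv type and the helper names are hypothetical, and it assumes that "nc" in the listing means the N and C flags are set while Z and V are clear. Under that reading, N != V, so the ge condition is false after a skipped comparison and the chain falls through to the value 42.

```csharp
// Hypothetical model of the Arm64 condition flags; not a runtime type.
public readonly record struct Nzcv(bool N, bool Z, bool C, bool V)
{
    // Signed "greater than or equal" holds when N == V.
    public bool Ge => N == V;
}

public static class CcmpModel
{
    // cmp a, b: set the flags from the subtraction a - b.
    public static Nzcv Cmp(int a, int b)
    {
        int diff = unchecked(a - b);
        bool n = diff < 0;
        bool z = diff == 0;
        bool c = (uint)a >= (uint)b;            // no unsigned borrow
        bool v = ((a ^ b) & (a ^ diff)) < 0;    // signed overflow
        return new Nzcv(n, z, c, v);
    }

    // ccmp a, b, fallback, ge: compare only if ge holds, else load the fallback flags.
    public static Nzcv CcmpGe(Nzcv flags, int a, int b, Nzcv fallback)
        => flags.Ge ? Cmp(a, b) : fallback;

    // Mirrors the cmp / ccmp / ccmp / csel sequence generated for Test3.
    public static int Test3Model(int a, int b, int c)
    {
        var nc = new Nzcv(N: true, Z: false, C: true, V: false); // assumed meaning of "nc"
        Nzcv flags = Cmp(a, b);
        flags = CcmpGe(flags, b, c, nc);
        flags = CcmpGe(flags, c, 10, nc);
        return flags.Ge ? 13 : 42;              // csel w0, w3, w4, ge
    }
}
```

Once any comparison yields "less than", every later ccmp is skipped and simply re-loads the fallback flags, which is how the short-circuit behaviour of || is preserved without branches.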

Conditional Increment, Negation and Inversion

The cinc instruction is part of the conditional selection family of instructions, and it serves to increment the value of the source register by 1 if a specific condition is satisfied.

In dotnet/runtime#82031, the code sequence was optimized by @SwapnilGaikwad to generate cinc instructions instead of csel instructions whenever the code conditionally needed to increase a value by 1. This optimization helps streamline the code and improves its efficiency by directly incrementing the value when necessary, rather than relying on conditional selection for this specific operation.

public static int Test4(bool c)
{
    return c ? 100 : 101;
}

Before .NET 8, two registers were required to store the values of both branches. After the cmp instruction, a selection would be made to determine which of the two values (w1 and w2) should be stored in the result register w0. This approach used additional registers and instructions, which could impact performance and code size.

Here’s how the code looked in older releases:

            mov     w1, #100
            mov     w2, #101
            cmp     w0, #0
            csel    w0, w1, w2, ne

With this structure, w0 would contain the value of either w1 or w2 based on the condition, leading to potential register usage inefficiencies.

In .NET 8, we have eliminated the need to maintain an additional register. Instead, the tst instruction tests the low byte of w0 (the bool argument c), and the cinc instruction increments the value in w1 only when the eq condition holds, that is, when c is false. This enhancement simplifies the code and reduces the reliance on extra registers, potentially leading to improved performance and more efficient code execution.

            mov     w1, #100
            tst     w0, #255
            cinc    w0, w1, eq

Similarly, in dotnet/runtime#84926, we introduced support for cinv and cneg instructions, replacing the use of csel. This is demonstrated in the following example, where the disassembly shows the csinv and csneg forms of which cinv and cneg are aliases:

public static int cinv_test(int x, bool c)
{
    return c ? x : ~x;
}

public static int cneg_test(int x, bool c)
{
    return c ? x : -x;
}

            tst     w1, #255
            csinv   w0, w0, w0, ne
            ...
            tst     w1, #255
            csneg   w0, w0, w0, ne

@a74nh has published a three-part deep dive blog series explaining this work that was done in .NET. Read part 1, part 2 and part 3 for more details.

VectorTableLookup and VectorTableLookupExtension

In .NET 8, we added two new sets of APIs under the System.Runtime.Intrinsics.Arm namespace: VectorTableLookup and VectorTableLookupExtension.

public static Vector64<byte> VectorTableLookup((Vector128<byte>, Vector128<byte>) table, Vector64<byte> byteIndexes);
public static Vector64<byte> VectorTableLookupExtension(Vector64<byte> defaultValues, (Vector128<byte>, Vector128<byte>) table, Vector64<byte> byteIndexes);

Let us see an example of each of these APIs.

// Vector128<byte> a = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
// Vector128<byte> b = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160
// Vector64<byte> index = 3, 31, 4, 40, 18, 19, 30, 1
Vector64<byte> ans = VectorTableLookup((a, b), index);
// ans = 4, 160, 5, 0, 30, 40, 150, 2

In the example above, the vectors a and b are treated as a single table with a total of 32 entries (16 from a and 16 from b), indexed starting from 0. The index parameter allows you to retrieve values from this table at specific indices. If an index is out of bounds, such as attempting to access index 40 in our example, the API will return a value of 0 for that out-of-bounds index.

// Vector64<byte> d = 201, 202, 203, 204, 205, 206, 207, 208
// Vector128<byte> a = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
// Vector128<byte> b = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160
// Vector64<byte> index = 3, 31, 4, 40, 18, 19, 30, 1
Vector64<byte> ans = VectorTableLookupExtension(d, (a, b), index);
// ans = 4, 160, 5, 204, 30, 40, 150, 2

In contrast to VectorTableLookup, when using the VectorTableLookupExtension method, if an index falls outside the valid range, the corresponding element in the result will be determined by the values provided in the defaultValues parameter. It’s worth noting that there are other variations of these APIs that operate on 3-entity and 4-entity tuples as well, providing flexibility for various use cases.
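The difference between the two lookups can be modelled with plain arrays, which also run on machines without Arm64 intrinsics. This is a scalar sketch of the semantics described above, not the intrinsic itself; the class and method names are hypothetical, and the table is passed as one concatenated byte array instead of a tuple of vectors.

```csharp
// Hypothetical scalar model of the table lookup semantics; not the intrinsic API.
public static class TableLookupModel
{
    // Models VectorTableLookup: an out-of-range index produces 0.
    public static byte[] Lookup(byte[] table, byte[] indexes)
    {
        var result = new byte[indexes.Length];
        for (int i = 0; i < indexes.Length; i++)
        {
            result[i] = indexes[i] < table.Length ? table[indexes[i]] : (byte)0;
        }
        return result;
    }

    // Models VectorTableLookupExtension: an out-of-range index keeps the
    // element of defaultValues that sits in the same lane.
    public static byte[] LookupExtension(byte[] defaultValues, byte[] table, byte[] indexes)
    {
        var result = new byte[indexes.Length];
        for (int i = 0; i < indexes.Length; i++)
        {
            result[i] = indexes[i] < table.Length ? table[indexes[i]] : defaultValues[i];
        }
        return result;
    }
}
```

With a 32-entry table built from a followed by b and the indexes 3, 31, 4, 40, 18, 19, 30, 1, only the lane holding index 40 differs between the two methods: Lookup yields 0 there, while LookupExtension yields whatever defaultValues holds in that lane.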

In dotnet/runtime#85189, @MihaZupan leveraged this API to optimize IndexOfAny, resulting in a 30% improvement in performance. Similarly, in dotnet/runtime#87126, @SwapnilGaikwad significantly enhanced the performance of the Guid formatter, achieving up to a 40% performance boost. These optimizations demonstrate the substantial performance gains that can be achieved with this API.

Consecutive Register Allocation

The VectorTableLookup and VectorTableLookupExtension APIs utilize the tbl and tbx instructions, respectively, for table lookup operations. These instructions work with input vectors contained in the table parameter, which can consist of 1, 2, 3, or 4 vector entities. It’s important to note that these instructions require the input vectors to be stored in consecutive vector registers on the processor. For example, consider the code VectorTableLookup((a, b, c, d), index), which generates the tbl v11.16b, {v16.16b, v17.16b, v18.16b, v19.16b}, v20.16b instruction. In this instruction, v11 serves as the result vector, v20 holds the value of the index variable, and the table parameter, represented as the tuple (a, b, c, d), is stored in consecutive registers v16 through v19.

Before .NET 8, RyuJIT’s register allocator could only allocate a single register for each variable. However, in dotnet/runtime#80297, we introduced a new feature in the register allocator to allocate multiple consecutive registers for these types of instructions, enabling more efficient code generation.

To implement this feature, we made adjustments to the existing allocator algorithm and certain data structures to effectively manage variables that belong to a tuple series requiring consecutive registers. Here’s an overview of how it works:

  1. Tracking Consecutive Register Requirement: The algorithm now identifies the first variable in a tuple series that requires consecutive registers.

  2. Register Allocation: Before assigning registers to variables, the algorithm checks for the availability of X consecutive free registers, where X is the number of variables in the table tuple.

  3. Handling Register Shortages: If there aren’t X consecutive free registers available, the algorithm looks for busy registers that are adjacent to some of the free registers. It then frees up these busy registers to meet the requirement of assigning X consecutive registers to the X variables in the table tuple.

  4. Versatility: The allocation of consecutive registers isn’t limited to just tbl and tbx instructions. It’s also used for various load and store instructions that will be implemented in .NET 9. For instance, ld2, ld3, and ld4 are load instructions capable of loading 2, 3, or 4 vector registers from memory, respectively. These instructions, as well as store instructions like st2, st3, and st4, all require consecutive registers.

This enhancement in the register allocator ensures that variables involved in tuple-based operations are allocated consecutive registers when necessary, improving code generation efficiency not only for existing instructions but also for upcoming load and store instructions in future .NET versions.
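The availability check in step 2 can be sketched with a bit mask of free registers. This is an illustration only and not RyuJIT's implementation: the class and method names are hypothetical, the mask layout (bit i set means register i is free) is an assumption, and the sketch neither frees busy registers (step 3) nor lets a run wrap around from the last register to the first.

```csharp
using System;
using System.Numerics;

// Hypothetical helper; RyuJIT's real allocator is considerably more involved.
public static class ConsecutiveRegisters
{
    // Returns the lowest register number that starts a run of `count` free
    // registers in `freeMask` (bit i set = register i is free), or -1 if
    // no such run exists.
    public static int FindFirstRun(uint freeMask, int count)
    {
        if (count < 1 || count > 32)
        {
            throw new ArgumentOutOfRangeException(nameof(count));
        }

        uint runs = freeMask;
        // After the step for k, bit i is still set only if registers
        // i, i+1, ..., i+k are all free.
        for (int k = 1; k < count; k++)
        {
            runs &= freeMask >> k;
        }

        return runs == 0 ? -1 : BitOperations.TrailingZeroCount(runs);
    }
}
```

For a four-vector table and a mask in which only registers 16 through 19 are free, FindFirstRun returns 16, matching the v16 through v19 assignment in the tbl example earlier in this section.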

Peephole optimizations

In .NET 5, we encountered several issues, as documented in dotnet/runtime#55365, where the application of peephole optimizations could significantly enhance the generated .NET code. Collaborating closely with engineers from Arm Corp., including @a74nh, @SwapnilGaikwad, and @AndyJGraham, we successfully addressed all these issues in .NET 8. In the sections below, I will provide an overview of each of these improvements, along with illustrative examples.

Replace successive ldr and str with ldp and stp

Through the combined efforts of dotnet/runtime#77540, dotnet/runtime#85032, and dotnet/runtime#84399, we’ve introduced an optimization that replaces successive pairs of load (ldr) and store (str) instructions with load pair (ldp) and store pair (stp) instructions, respectively. This enhancement has reduced the code size of several of our libraries and benchmark methods by nearly 700 KB.

As a result of this optimization, the following sequence of two ldr instructions and two str instructions has been streamlined to generate a single ldp instruction and a single stp instruction:

- ldr x0, [fp, #0x10]
- ldr x1, [fp, #0x18]
+ ldp x0, x1, [fp, #0x10]
...
- str x12, [x14]
- str x12, [x14, #8]
+ stp x12, x12, [x14]

This transformation contributes to improved code efficiency and a reduction in overall binary size.
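The core legality check behind such a peephole can be sketched as follows. This is a simplified model rather than the JIT's actual code: the Load record and CanPair are hypothetical names, only ascending offsets are handled, and the offset range reflects the scaled 7-bit immediate of the ldp encoding.

```csharp
// Hypothetical description of a single "ldr Dest, [Base, #Offset]" of Size bytes.
public readonly record struct Load(string Dest, string Base, int Offset, int Size);

public static class LoadPairPeephole
{
    // Returns true when two back-to-back loads could be merged into one ldp.
    public static bool CanPair(Load first, Load second)
    {
        int size = first.Size;
        return first.Base == second.Base
            && second.Size == size
            // The second slot must sit exactly one register width after the first.
            && second.Offset == first.Offset + size
            // ldp needs two different destination registers.
            && first.Dest != second.Dest
            // The first load must not overwrite the base used by the second.
            && first.Dest != first.Base
            // ldp encodes the offset as a signed 7-bit multiple of the size.
            && first.Offset % size == 0
            && first.Offset >= -64 * size
            && first.Offset <= 63 * size;
    }
}
```

Two 8-byte loads from [fp, #0x10] and [fp, #0x18] pass this check, whereas loads that are not exactly 8 bytes apart do not.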

Use ldp/stp for SIMD registers

In dotnet/runtime#84135, we introduced an optimization where a pair of load or store instructions involving SIMD (Single Instruction, Multiple Data) registers is replaced with a single ldp (Load Pair) or stp (Store Pair) instruction. This optimization is demonstrated in the following example:

- ldr q1, [x0, #0x20]
- ldr q2, [x0, #0x30]
+ ldp q1, q2, [x0, #0x20]

Replace pair of str wzr with str xzr

In dotnet/runtime#84350, we introduced an optimization that consolidated a pair of stores involving 4-byte zero registers into a single store instruction that utilized an 8-byte zero register. This optimization enhances code efficiency by reducing the number of instructions required, resulting in improved performance and potentially smaller code size.

- stp wzr, wzr, [x2, #0x08]
+ str xzr, [x2, #0x08]
...
- stp wzr, wzr, [x14, #0x20]
- str wzr, [x14, #0x18]
+ stp xzr, xzr, [x14, #0x18]

Replace load with cheaper mov

In dotnet/runtime#83458, we implemented an optimization that replaced a heavier load instruction with a more efficient mov instruction in certain scenarios. For example, when the content of memory loaded into register w1 was subsequently loaded into w0, we used the mov instruction instead of performing a redundant load operation.

  ldr w1, [fp, #0x28]
- ldr w0, [fp, #0x28]
+ mov w0, w1

The same PR also eliminated redundant load instructions when the contents were already present in a register. For instance, in the following example, if w1 had already been loaded from [fp + 0x20], the second load instruction could be removed.

  ldr w1, [fp, #0x20]
- ldr w1, [fp, #0x20]

Convert mul + neg -> mneg

In dotnet/runtime#79550, we transformed two operations involving multiplication and negation instructions into a single mneg instruction.
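A source pattern of roughly the following shape is the kind of code this transformation targets. This is only an illustrative guess at a matching pattern; the class and method names are hypothetical, and whether mneg is actually emitted depends on the runtime version.

```csharp
using System.Runtime.CompilerServices;

// Hypothetical sample; the name is not taken from the runtime repository.
public static class MnegExample
{
    // A multiplication whose result is immediately negated: a mul followed
    // by a neg in older code, and a candidate for a single mneg.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int NegatedProduct(int a, int b) => -(a * b);
}
```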

Code quality improvements

In .NET 8, we also implemented various enhancements in other aspects of RyuJIT to enhance the performance of Arm64 code.

Faster Vector128/Vector64 compare

In dotnet/runtime#75864, we optimized vector comparisons in commonly used algorithms like SequenceEqual or IndexOf. Let’s take a look at the code generated before these optimizations.

bool Test1(Vector128<int> a, Vector128<int> b) => a == b;

In the generated code, the first step involves comparing two vectors, a and b, which are stored in vector registers v0 and v1. This comparison is performed using the cmeq instruction, which compares the two vectors lane by lane. If the elements in a lane are equal, it sets every bit of that lane (0xFF in each of its bytes); otherwise, it clears the lane to zero. Following the comparison, the uminv instruction is used to find the minimum byte among all the lanes. It’s important to note that the uminv instruction has a higher latency because it operates on all lanes after the data for all the lanes is available, and therefore, it does not operate in parallel.

          cmeq    v16.4s, v0.4s, v1.4s
          uminv   b16, v16.16b
          umov    w0, v16.b[0]
          cmp     w0, #0
          cset    x0, ne

In dotnet/runtime#75864, we made a significant improvement by changing the instruction from uminv to uminp in the generated code. The uminp instruction finds the minimum of each pair of adjacent lanes, which allows it to operate more efficiently. After that, the umov instruction copies the combined result from the lower half of the vector into x0. Finally, the cmn x0, #1 instruction adds 1 to that value and sets the flags, so the eq condition holds only when every bit was set, that is, when all lanes compared equal. This change results in better performance compared to the previous uminv instruction because uminp can find the minimum values in parallel.

          uminp   v16.4s, v16.4s, v16.4s
          umov    x0, v16.d[0]
          cmn     x0, #1
          cset    x0, eq
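To make the data flow of the new sequence concrete, here is a scalar sketch of what each instruction computes for four 32-bit lanes. It models the semantics only and uses no intrinsics; the class and method names are hypothetical.

```csharp
using System;

// Hypothetical scalar model of the cmeq / uminp / umov / cmn sequence.
public static class VectorEqualsModel
{
    public static bool AllLanesEqual(uint[] a, uint[] b)
    {
        if (a.Length != 4 || b.Length != 4)
        {
            throw new ArgumentException("Expected two vectors of four 32-bit lanes.");
        }

        // cmeq: a lane becomes all ones when the elements match, zero otherwise.
        var mask = new uint[4];
        for (int i = 0; i < 4; i++)
        {
            mask[i] = a[i] == b[i] ? 0xFFFFFFFFu : 0u;
        }

        // uminp with the same vector as both operands: the two low lanes of the
        // result hold min(lane0, lane1) and min(lane2, lane3).
        uint low0 = Math.Min(mask[0], mask[1]);
        uint low1 = Math.Min(mask[2], mask[3]);

        // umov x0, v16.d[0]: read the low 64 bits of the result.
        ulong x0 = ((ulong)low1 << 32) | low0;

        // cmn x0, #1 followed by cset eq: true only when x0 + 1 wraps to zero,
        // that is, when x0 is all ones.
        return unchecked(x0 + 1) == 0;
    }
}
```

A single mismatching lane zeroes one of the pairwise minimums, so x0 is no longer all ones and the method returns false.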

We observed significant improvements in various core .NET library methods as a result of these optimizations, with performance gains of up to 40%. You can find more details about these improvements in this report.

pr_75864_1.png

pr_75864_3.png

Improve vector == Vector128<>.Zero

Thanks to the suggestion from @TamarChristinaArm, we made further improvements to vector comparisons with the Zero vector in dotnet/runtime#75999. These optimizations resulted in significant performance improvements. In this optimization, we replaced the umaxv instruction, which finds the maximum value across all lanes, with the more efficient umaxp instruction, which finds the maximum of each pair of adjacent lanes.

static bool IsZero1(Vector128<int> v) => v == Vector128<int>.Zero;

In the previous code, we observed that the umaxv instruction was generated, which found the maximum value across all lanes and stored it in the 0th lane. Then, the value from the 0th lane was extracted and compared with 0.

          umaxv   s16, v0.4s
          umov    w0, v16.s[0]
          cmp     w0, #0

Now, we use the umaxp instruction, which has better latency.

          umaxp   v16.4s, v0.4s, v0.4s
          umov    x0, v16.d[0]
          cmp     x0, #0

Here is an example of the performance improvement in the SequenceEqual benchmark.

pr_75999_1.png

Unroll Memmove

In dotnet/runtime#83740, we introduced a feature to unroll the memory move operation in the Buffer.Memmove API. Without unrolling, the memory move operations need to be done in a loop that operates on smaller chunks of the memory data to be moved. Unrolling optimizes the operation by reducing the loop overhead and enhancing memory access patterns. By copying larger chunks of data in each iteration, it minimizes the number of loop control instructions and leverages modern processors’ ability to perform multiple memory transfers in parallel. This can result in faster and more efficient code execution, especially when working with large data sets or when optimizing critical performance-sensitive code paths. This enhancement is utilized in various APIs such as TryCopyTo, ToArray, and others.
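As a rough picture of what unrolling means for a small copy of known size, the sketch below moves 32 bytes with two 16-byte vector loads followed by two vector stores, with no loop. It is hand-written for illustration and is not the code the JIT emits; the class and method names are hypothetical. Performing both loads before either store keeps the result correct even when source and destination overlap, which is the guarantee a memory move has to give.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical illustration of an unrolled 32-byte move; not JIT output.
public static class UnrolledCopy
{
    public static void Copy32(ReadOnlySpan<byte> source, Span<byte> destination)
    {
        // Load both 16-byte halves first...
        Vector128<byte> low = Vector128.Create<byte>(source.Slice(0, 16));
        Vector128<byte> high = Vector128.Create<byte>(source.Slice(16, 16));

        // ...then store them, so an overlapping destination cannot
        // overwrite source bytes that have not been read yet.
        low.CopyTo(destination.Slice(0, 16));
        high.CopyTo(destination.Slice(16, 16));
    }
}
```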

As demonstrated in this and this performance reports, these optimizations have led to significant performance improvements, with speedups of up to 20%.

pr_83740.png

Throughput improvements

In our continuous efforts to enhance the quality and performance of our code, we are equally dedicated to ensuring that the time taken to produce code remains efficient. This aspect is particularly crucial for applications where startup time is of paramount importance, such as user interface applications. As a result, we have integrated measures to evaluate the Just-In-Time (JIT) compiler’s throughput in our development process. For those unfamiliar, superpmi is an internally developed tool used by RyuJIT for validating the compilation of millions of methods. The superpmi-diff functionality is employed to validate changes made to the codebase by comparing the assembly code produced before and after these changes. This comparison not only checks the generated code but also evaluates the time it takes for code compilation. With the recent introduction of our superpmi-diffs CI pipeline, we now systematically assess the JIT throughput impact of every pull request that involves the JIT codebase. This rigorous process ensures that any changes made do not compromise the efficiency of code production, ultimately benefiting the performance of .NET applications.

Efficient throughput is essential for applications with strict startup time requirements. To optimize this aspect, we have implemented several improvements, with a focus on the JIT’s register allocation process for generating Arm64 code.

In dotnet/runtime#85744, we introduced a mechanism to detect whether a method utilizes floating-point variables. If no floating-point variables are present, we skip the iteration over floating-point registers during the register allocation phase. This seemingly minor change resulted in a throughput gain of up to 0.5%.

During register allocation, the algorithm iterates through various access points of both user-defined variables and internally created variables used for storing temporary results. When determining which register to assign at a given access point, the algorithm traditionally iterated through all possible registers. However, it is more efficient to iterate only over the pre-determined set of allocatable registers at each access point. In dotnet/runtime#87424, we addressed this issue, leading to significant throughput improvements of up to 5%, as demonstrated below:

lsra_87424.png
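The idea of visiting only the allocatable registers, rather than every possible register, can be illustrated with a bit mask walk. This is a sketch and not the allocator's code; the class name, the method name, and the mask layout (bit i set means register i is a candidate) are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.Numerics;

// Hypothetical illustration; the mask layout is an assumption.
public static class RegisterMaskWalk
{
    // Yields only the registers whose bit is set in `candidates`, so the
    // loop runs once per candidate instead of once per possible register.
    public static IEnumerable<int> CandidateRegisters(ulong candidates)
    {
        while (candidates != 0)
        {
            yield return BitOperations.TrailingZeroCount(candidates);
            candidates &= candidates - 1;   // clear the lowest set bit
        }
    }
}
```

For a mask with bits 0, 5 and 7 set, the loop body runs three times instead of once for each of the 64 bit positions, which is where the saving comes from when many registers exist but few are candidates.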

It’s important to note that due to the larger number of registers available in Arm64 compared to x64 architecture, these changes resulted in more substantial throughput improvements for Arm64 code generation as compared to x64 targets.

lsra_87424_x64.png

Conclusion

In .NET 8, our collaboration with Arm Holdings engineers resulted in significant feature enhancements and improved code quality for Arm64 platforms. We addressed longstanding issues, introduced peephole optimizations, and adopted advanced instructions such as conditional selection. The addition of consecutive register allocation was a crucial feature that not only enabled instructions like VectorTableLookup but also paved the way for future instructions like those capable of loading and storing multiple vectors, as seen in the API proposal dotnet/runtime#84510. Looking ahead, our goals also include adding support for advanced architecture features like SVE and SVE2.

We extend our gratitude to the many contributors who have helped us deliver a faster .NET 8 on Arm64 devices.

Thank you for taking the time to explore .NET on Arm64, and please share your feedback with us. Happy coding on Arm64!

Author

Kunal Pathak is a developer on the JIT team of the .NET Runtime.
