This article describes new features and performance improvements in the .NET runtime for .NET 10. It has been updated for Preview 6.
The JIT compiler in .NET 10 includes significant enhancements that improve performance through better code generation and optimization strategies.
.NET's JIT compiler is capable of an optimization called physical promotion, where the members of a struct are placed in registers rather than on the stack, eliminating memory accesses. This optimization is particularly useful when a struct is passed to a method and the calling convention requires its members to be passed in registers.
.NET 10 improves the JIT compiler's internal representation to handle values that share a register. Previously, when struct members needed to be packed into a single register, the JIT would store values to memory first and then load them into a register. Now, the JIT compiler can place the promoted members of struct arguments into shared registers directly, eliminating unnecessary memory operations.
Consider the following example:
```csharp
using System.Runtime.CompilerServices;

struct Point
{
    public long X;
    public long Y;

    public Point(long x, long y)
    {
        X = x;
        Y = y;
    }
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Consume(Point p)
{
    Console.WriteLine(p.X + p.Y);
}

private static void Main()
{
    Point p = new Point(10, 20);
    Consume(p);
}
```
On x64, we pass the members of `Point` to `Consume` in separate registers, and since physical promotion kicked in for the local `p`, we don't allocate anything on the stack first:
```
Program:Main() (FullOpts):
       mov      edi, 10
       mov      esi, 20
       tail.jmp [Program:Consume(Program+Point)]
```
Now, suppose we changed the type of the members of `Point` to `int` instead of `long`. Because an `int` is four bytes wide, and registers are eight bytes wide on x64, the calling convention requires us to pass the members of `Point` in one register. Previously, the JIT compiler would first store the values to memory, and then load the eight-byte chunk into a register. With the .NET 10 improvements, the JIT compiler can now place the promoted members of struct arguments into shared registers directly:
```
Program:Main() (FullOpts):
       mov      rdi, 0x140000000A
       tail.jmp [Program:Consume(Program+Point)]
```
The single constant `0x140000000A` packs both fields: 20 (`0x14`) in the upper four bytes and 10 (`0x0A`) in the lower four. This eliminates the need for intermediate memory storage, resulting in more efficient assembly code.
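Conceptually, the packed value can be formed the way the following hand-written helper does. This is a sketch for illustration only; `Pack` is not part of the runtime:

```csharp
// Packs two 4-byte values into one 8-byte, register-sized value:
// Pack(10, 20) == 0x140000000A, the constant in the assembly above.
static ulong Pack(int low, int high) => (uint)low | ((ulong)(uint)high << 32);
```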
The JIT compiler can hoist the condition of a `while` loop and transform the loop body into a `do-while` loop, producing the final shape:
```csharp
if (loopCondition)
{
    do
    {
        // loop body
    }
    while (loopCondition);
}
```
This transformation is called loop inversion. By moving the condition to the bottom of the loop, the JIT removes the need to branch to the top of the loop to test the condition, improving code layout. Numerous optimizations (like loop cloning, loop unrolling, and induction variable optimizations) also depend on loop inversion to produce this shape to aid analysis.
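For example, a counting loop written with the condition at the top is conceptually rewritten as follows. This is a source-level sketch of the JIT's internal transformation, not something you need to write yourself:

```csharp
// Original shape: the condition is tested before every iteration,
// including the first, at the top of the loop.
static int CountBits(uint value)
{
    int count = 0;
    while (value != 0)
    {
        count += (int)(value & 1);
        value >>= 1;
    }
    return count;
}

// Inverted shape: one up-front test, then a bottom-tested do-while,
// so each iteration no longer branches back to a top-of-loop test.
static int CountBitsInverted(uint value)
{
    int count = 0;
    if (value != 0)
    {
        do
        {
            count += (int)(value & 1);
            value >>= 1;
        }
        while (value != 0);
    }
    return count;
}
```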
.NET 10 enhances loop inversion by switching from a lexical analysis implementation to a graph-based loop recognition implementation. This change brings improved precision by considering all natural loops (loops with a single entry point) and ignoring false positives that were previously considered. This translates into higher optimization potential for .NET programs with `for` and `while` statements.
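To make "graph-based loop recognition" concrete, here is a minimal sketch of the general technique, using a simple iterative dominator computation. This illustrates the textbook algorithm under simplifying assumptions (all blocks reachable); it is not the JIT's implementation:

```csharp
using System.Collections.Generic;
using System.Linq;

static class LoopRecognition
{
    // A back edge u -> h exists when h dominates u; each back edge
    // identifies a natural loop with header h (the single entry point).
    public static List<(int Header, int Tail)> FindNaturalLoops(List<int>[] successors, int entry)
    {
        int n = successors.Length;

        // Build predecessor lists from the successor lists.
        var preds = new List<int>[n];
        for (int i = 0; i < n; i++) preds[i] = new List<int>();
        for (int u = 0; u < n; u++)
            foreach (int v in successors[u]) preds[v].Add(u);

        // Iterative dataflow: dom(v) = {v} ∪ ⋂ dom(p) over predecessors p.
        var all = new HashSet<int>(Enumerable.Range(0, n));
        var dom = new HashSet<int>[n];
        for (int v = 0; v < n; v++)
            dom[v] = v == entry ? new HashSet<int> { entry } : new HashSet<int>(all);

        bool changed = true;
        while (changed)
        {
            changed = false;
            for (int v = 0; v < n; v++)
            {
                if (v == entry || preds[v].Count == 0) continue;
                var nd = new HashSet<int>(dom[preds[v][0]]);
                foreach (int p in preds[v].Skip(1)) nd.IntersectWith(dom[p]);
                nd.Add(v);
                if (!nd.SetEquals(dom[v])) { dom[v] = nd; changed = true; }
            }
        }

        // Collect back edges: u -> h where h dominates u.
        var loops = new List<(int, int)>();
        for (int u = 0; u < n; u++)
            foreach (int h in successors[u])
                if (dom[u].Contains(h)) loops.Add((h, u));
        return loops;
    }
}
```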
One of the focus areas for .NET 10 is to reduce the abstraction overhead of popular language features. In pursuit of this goal, the JIT's ability to devirtualize method calls has expanded to cover array interface methods.
Consider the typical approach of looping over an array:
```csharp
static int Sum(int[] array)
{
    int sum = 0;
    for (int i = 0; i < array.Length; i++)
    {
        sum += array[i];
    }
    return sum;
}
```
This code shape is easy for the JIT to optimize, mainly because there aren't any virtual calls to reason about. Instead, the JIT can focus on removing bounds checks on the array access and applying the loop optimizations that were added in .NET 9. The following example adds some virtual calls:
```csharp
static int Sum(int[] array)
{
    int sum = 0;
    IEnumerable<int> temp = array;
    foreach (var num in temp)
    {
        sum += num;
    }
    return sum;
}
```
The type of the underlying collection is clear, and the JIT should be able to transform this snippet into the first one. However, array interfaces are implemented differently from "normal" interfaces, such that the JIT previously didn't know how to devirtualize them. This meant the enumerator calls in the `foreach` loop remained virtual, blocking multiple optimizations such as inlining and stack allocation.
Starting in .NET 10, the JIT can devirtualize and inline array interface methods. This is the first of many steps to achieve performance parity between the implementations, as detailed in the .NET 10 de-abstraction plans.
Efforts to reduce the abstraction overhead of array iteration via enumerators have improved the JIT's inlining, stack allocation, and loop cloning abilities. For example, the overhead of enumerating arrays via `IEnumerable` is reduced, and conditional escape analysis now enables stack allocation of enumerators in certain scenarios.
The JIT compiler in .NET 10 introduces a new approach to organizing method code into basic blocks for better runtime performance. Previously, the JIT used a reverse postorder (RPO) traversal of the program's flowgraph as an initial layout, followed by iterative transformations. While effective, this approach had limitations in modeling the trade-offs between reducing branching and increasing hot code density.
In .NET 10, the JIT models the block reordering problem as a reduction of the asymmetric Travelling Salesman Problem and implements the 3-opt heuristic to find a near-optimal traversal. This optimization improves hot path density and reduces branch distances, resulting in better runtime performance.
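As a rough illustration of the idea (an assumed simplification; the JIT's actual cost model and move set are more sophisticated, and the names `BlockLayout`, `Score`, and `TryImprove` are invented), a layout can be scored by the profile weight of its fall-through edges, with segment moves accepted only when they improve that score:

```csharp
using System.Collections.Generic;
using System.Linq;

static class BlockLayout
{
    // Score a layout by the weight of consecutive (fall-through) edges;
    // higher means hotter paths fall through instead of taking branches.
    static double Score(List<int> layout, double[,] edgeWeight)
    {
        double score = 0;
        for (int i = 0; i < layout.Count - 1; i++)
            score += edgeWeight[layout[i], layout[i + 1]];
        return score;
    }

    // One 3-opt-style pass: cut the path into S1|S2|S3 and try the
    // reordering S1|S3|S2, keeping it only if the score improves.
    static bool TryImprove(List<int> layout, double[,] edgeWeight)
    {
        double best = Score(layout, edgeWeight);
        for (int i = 1; i < layout.Count - 1; i++)
        {
            for (int j = i + 1; j < layout.Count; j++)
            {
                var candidate = layout.Take(i)              // S1
                    .Concat(layout.Skip(j))                 // S3
                    .Concat(layout.Skip(i).Take(j - i))     // S2
                    .ToList();
                if (Score(candidate, edgeWeight) > best)
                {
                    layout.Clear();
                    layout.AddRange(candidate);
                    return true; // caller repeats until no move helps
                }
            }
        }
        return false;
    }

    public static void Optimize(List<int> layout, double[,] edgeWeight)
    {
        while (TryImprove(layout, edgeWeight)) { }
    }
}
```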
Various inlining improvements have been made in .NET 10.
The JIT can now inline methods that become eligible for devirtualization due to previous inlining. This improvement allows the JIT to uncover more optimization opportunities, such as further inlining and devirtualization.
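For example (an illustrative snippet, not from the runtime), inlining a factory method exposes the concrete type to the JIT:

```csharp
interface IGreeter
{
    string Greet();
}

sealed class English : IGreeter
{
    public string Greet() => "Hello";
}

static IGreeter Create() => new English();

// Once Create is inlined, the JIT sees that the receiver is exactly
// English, so the interface call to Greet can be devirtualized and
// then inlined in turn.
static string Greeting() => Create().Greet();
```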
Some methods that have exception-handling semantics, in particular those with `try-finally` blocks, can also be inlined.
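For example, a small helper like the following (illustrative) is no longer automatically excluded from inlining by its `try-finally`:

```csharp
static int TakeAndReset(ref int counter)
{
    try
    {
        return counter;
    }
    finally
    {
        counter = 0; // the cleanup in the finally block always runs
    }
}
```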
To better take advantage of the JIT's ability to stack-allocate some arrays, the inliner's heuristics have been adjusted to increase the profitability of candidates that might be returning small, fixed-sized arrays.
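For example (illustrative names), inlining a callee that returns a small fixed-size array lets the JIT see the allocation in the caller's context, where it may then be stack-allocated:

```csharp
static int[] SortedPair(int a, int b) => a <= b ? new[] { a, b } : new[] { b, a };

static int Spread(int a, int b)
{
    // After SortedPair is inlined, the two-element array can be
    // stack-allocated because it never escapes Spread.
    int[] pair = SortedPair(a, b);
    return pair[1] - pair[0];
}
```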
During inlining, the JIT now updates the type of temporary variables that hold return values. If all return sites in a callee yield the same type, this precise type information is used to devirtualize subsequent calls. This enhancement complements the improvements in late devirtualization and array enumeration de-abstraction.
.NET 10 enhances the JIT's inlining policy to take better advantage of profile data. Among numerous heuristics, the JIT's inliner doesn't consider methods over a certain size to avoid bloating the caller method. When the caller has profile data that suggests an inlining candidate is frequently executed, the inliner increases its size tolerance for the candidate.
Suppose the JIT inlines some callee `Callee` without profile data into some caller `Caller` with profile data. This discrepancy can occur if the callee is too small to be worth instrumenting, or if it's inlined too often to have a sufficient call count. If `Callee` has its own inlining candidates, the JIT previously evaluated them under its default size limit because `Callee` lacks profile data. Now, the JIT recognizes that `Caller` has profile data and loosens its size restriction (but, to account for the loss of precision, not to the same degree as if `Callee` had profile data).
Similarly, when the JIT decides a call site isn't profitable for inlining, it marks the method with `NoInlining` to save future inlining attempts from considering it. However, many inlining heuristics are sensitive to profile data. For example, the JIT might decide a method is too large to be worth inlining in the absence of profile data. But when the caller is sufficiently hot, the JIT might be willing to relax its size restriction and inline the call. In .NET 10, the JIT no longer flags unprofitable inlinees with `NoInlining` to avoid pessimizing call sites with profile data.
.NET 10 introduces support for the Advanced Vector Extensions (AVX) 10.2 for x64-based processors. The new intrinsics available in the `System.Runtime.Intrinsics.X86.Avx10v2` class can be tested once capable hardware is available.
Because AVX10.2-enabled hardware isn't yet available, the JIT's support for AVX10.2 is currently disabled by default.
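Code targeting the new intrinsics can follow the standard hardware-intrinsics pattern, guarding the accelerated path behind the class's `IsSupported` check so it remains safe on current hardware:

```csharp
using System.Runtime.Intrinsics.X86;

static void Compute()
{
    if (Avx10v2.IsSupported)
    {
        // AVX10.2-accelerated path (currently unreachable, since the
        // JIT reports the ISA as unsupported by default).
    }
    else
    {
        // Portable fallback path.
    }
}
```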
Stack allocation reduces the number of objects the GC has to track, and it also unlocks other optimizations. For example, after an object is stack-allocated, the JIT can consider replacing it entirely with its scalar values. Because of this, stack allocation is key to reducing the abstraction penalty of reference types. .NET 10 adds stack allocation for small arrays of value types and small arrays of reference types. It also includes escape analysis for local struct fields and delegates. (Objects that can't escape can be allocated on the stack.)
The JIT now stack-allocates small, fixed-sized arrays of value types that don't contain GC pointers when they can be guaranteed not to outlive their parent method. In the following example, the JIT knows at compile time that `numbers` is an array of only three integers that doesn't outlive a call to `Sum`, and therefore allocates it on the stack.
```csharp
static void Sum()
{
    int[] numbers = { 1, 2, 3 };
    int sum = 0;
    for (int i = 0; i < numbers.Length; i++)
    {
        sum += numbers[i];
    }
    Console.WriteLine(sum);
}
```
.NET 10 extends the .NET 9 stack allocation improvements to small arrays of reference types. Previously, arrays of reference types were always allocated on the heap, even when their lifetime was scoped to a single method. Now, the JIT can stack-allocate such arrays when it determines that they don't outlive their creation context. In the following example, the array `words` is now allocated on the stack.
```csharp
static void Print()
{
    string[] words = { "Hello", "World!" };
    foreach (var str in words)
    {
        Console.WriteLine(str);
    }
}
```
Escape analysis determines if an object can outlive its parent method. Objects "escape" when assigned to non-local variables or passed to functions not inlined by the JIT. If an object can't escape, it can be allocated on the stack. .NET 10 includes escape analysis for:

- Objects referenced by local struct fields
- Delegates
Starting in .NET 10, the JIT's escape analysis considers objects referenced by local struct fields, which enables more stack allocations and reduces heap overhead. Consider the following example:
```csharp
public class Program
{
    struct GCStruct
    {
        public int[] arr;
    }

    public static int Main()
    {
        int[] x = new int[10];
        GCStruct y = new GCStruct() { arr = x };
        return y.arr[0];
    }
}
```
Normally, the JIT stack-allocates small, fixed-sized arrays that don't escape, such as `x`. Its assignment to `y.arr` doesn't cause `x` to escape, because `y` doesn't escape either. However, the JIT's previous escape analysis implementation didn't model struct field references. In .NET 9, the x64 assembly generated for `Main` includes a call to `CORINFO_HELP_NEWARR_1_VC` to allocate `x` on the heap, indicating it was marked as escaping:
```
Program:Main():int (FullOpts):
       push     rax
       mov      rdi, 0x719E28028A98      ; int[]
       mov      esi, 10
       call     CORINFO_HELP_NEWARR_1_VC
       mov      eax, dword ptr [rax+0x10]
       add      rsp, 8
       ret
```
In .NET 10, the JIT no longer marks objects referenced by local struct fields as escaping, as long as the struct in question does not escape. The assembly now looks like this (notice that the heap allocation helper call is gone):
```
Program:Main():int (FullOpts):
       sub      rsp, 56
       vxorps   xmm8, xmm8, xmm8
       vmovdqu  ymmword ptr [rsp], ymm8
       vmovdqa  xmmword ptr [rsp+0x20], xmm8
       xor      eax, eax
       mov      qword ptr [rsp+0x30], rax
       mov      rax, 0x7F9FC16F8CC8      ; int[]
       mov      qword ptr [rsp], rax
       lea      rax, [rsp]
       mov      dword ptr [rax+0x08], 10
       lea      rax, [rsp]
       mov      eax, dword ptr [rax+0x10]
       add      rsp, 56
       ret
```
For more information about de-abstraction improvements in .NET 10, see dotnet/runtime#108913.
When source code is compiled to IL, each delegate is transformed into a closure class with a method corresponding to the delegate's definition and fields matching any captured variables. At run time, a closure object is created to instantiate the captured variables, along with a `Func` object to invoke the delegate. If escape analysis determines the `Func` object won't outlive its current scope, the JIT allocates it on the stack.
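Conceptually, the compiler lowers a capturing lambda into something like the following. The names are illustrative; the actual generated class and method names are compiler-internal:

```csharp
// For "var func = (int x) => x + local;", the compiler emits roughly:
sealed class DisplayClass
{
    public int local;                       // captured variable
    public int Lambda(int x) => x + local;  // the lambda body
}

// At the use site:
// var closure = new DisplayClass { local = 1 };  // closure object
// Func<int, int> func = closure.Lambda;          // Func object
```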
Consider the following `Main` method:
```csharp
public static int Main()
{
    int local = 1;
    int[] arr = new int[100];
    var func = (int x) => x + local;
    int sum = 0;

    foreach (int num in arr)
    {
        sum += func(num);
    }

    return sum;
}
```
Previously, the JIT produced the following abbreviated x64 assembly for `Main`. Before entering the loop, `arr`, `func`, and the closure class for `func`, called `Program+<>c__DisplayClass0_0`, are all allocated on the heap, as indicated by the `CORINFO_HELP_NEW*` calls.
```
; prolog omitted for brevity
       mov      rdi, 0x7DD0AE362E28      ; Program+<>c__DisplayClass0_0
       call     CORINFO_HELP_NEWSFAST
       mov      rbx, rax
       mov      dword ptr [rbx+0x08], 1
       mov      rdi, 0x7DD0AE268A98      ; int[]
       mov      esi, 100
       call     CORINFO_HELP_NEWARR_1_VC
       mov      r15, rax
       mov      rdi, 0x7DD0AE4A9C58      ; System.Func`2[int,int]
       call     CORINFO_HELP_NEWSFAST
       mov      r14, rax
       lea      rdi, bword ptr [r14+0x08]
       mov      rsi, rbx
       call     CORINFO_HELP_ASSIGN_REF
       mov      rsi, 0x7DD0AE461140      ; code for Program+<>c__DisplayClass0_0:<Main>b__0(int):int:this
       mov      qword ptr [r14+0x18], rsi
       xor      ebx, ebx
       add      r15, 16
       mov      r13d, 100

G_M24375_IG03:  ;; offset=0x0075
       mov      esi, dword ptr [r15]
       mov      rdi, gword ptr [r14+0x08]
       call     [r14+0x18]System.Func`2[int,int]:Invoke(int):int:this
       add      ebx, eax
       add      r15, 4
       dec      r13d
       jne      SHORT G_M24375_IG03

; epilog omitted for brevity
```
Now, because `func` is never referenced outside the scope of `Main`, it's also allocated on the stack:
```
; prolog omitted for brevity
       mov      rdi, 0x7B52F7837958      ; Program+<>c__DisplayClass0_0
       call     CORINFO_HELP_NEWSFAST
       mov      rbx, rax
       mov      dword ptr [rbx+0x08], 1
       mov      rsi, 0x7B52F7718CC8      ; int[]
       mov      qword ptr [rbp-0x1C0], rsi
       lea      rsi, [rbp-0x1C0]
       mov      dword ptr [rsi+0x08], 100
       lea      r15, [rbp-0x1C0]
       xor      r14d, r14d
       add      r15, 16
       mov      r13d, 100

G_M24375_IG03:  ;; offset=0x0099
       mov      esi, dword ptr [r15]
       mov      rdi, rbx
       mov      rax, 0x7B52F7901638      ; address of definition for "func"
       call     rax
       add      r14d, eax
       add      r15, 4
       dec      r13d
       jne      SHORT G_M24375_IG03

; epilog omitted for brevity
```
Notice there is one remaining `CORINFO_HELP_NEW*` call, which is the heap allocation for the closure. The runtime team plans to expand escape analysis to support stack allocation of closures in a future release.
NativeAOT's type preinitializer now supports all variants of the `conv.*` and `neg` opcodes. This enhancement allows preinitialization of methods that include casting or negation operations, further optimizing runtime performance.
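For example (an illustrative type, not from the runtime), a static field initializer that performs a conversion and a negation can now be evaluated at build time:

```csharp
static class Limits
{
    private static readonly double Rate = 2.5;

    // Rate is a readonly field (not a const), so the initializer below
    // compiles to real conv.* and neg opcodes in the static constructor;
    // the NativeAOT preinitializer can now evaluate them at build time
    // instead of running the static constructor at startup.
    public static readonly long NegatedCents = -(long)(Rate * 100.0);
}
```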
.NET's garbage collector (GC) is generational, meaning it separates live objects by age to improve collection performance. The GC collects younger generations more often, under the assumption that long-lived objects are less likely to be unreferenced (or "dead") at any given time. However, suppose an old object starts referencing a young object: the GC needs to know it can't collect the young object, yet scanning older objects just to collect a young one would defeat the performance gains of a generational GC.
To solve this problem, the JIT inserts write barriers before object reference updates to keep the GC informed. On x64, the runtime can dynamically switch between write-barrier implementations to balance write speeds and collection efficiency, depending on the GC's configuration. In .NET 10, this functionality is also available on Arm64. In particular, the new default write-barrier implementation on Arm64 handles GC regions more precisely, which improves collection performance at a slight cost to write-barrier throughput. Benchmarks show GC pause improvements from 8% to over 20% with the new GC defaults.
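As a conceptual illustration only (the runtime's write barriers are hand-written assembly, and all names here are invented), a card-marking write barrier records which chunk of the heap was written so the GC can later scan just the dirty cards for old-to-young references:

```csharp
using System.Runtime.CompilerServices;

static unsafe class Barrier
{
    // Each card covers 2^CardShift bytes of heap; a nonzero byte in the
    // card table means "this chunk may contain an old-to-young reference."
    const int CardShift = 11; // 2 KB cards, an illustrative granularity

    public static void WriteWithBarrier(ref object field, object value, byte* cardTable)
    {
        field = value; // the actual reference update

        // Mark the card covering the written location as dirty so the GC
        // scans it when collecting younger generations.
        nuint address = (nuint)Unsafe.AsPointer(ref field);
        cardTable[address >> CardShift] = 0xFF;
    }
}
```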