GitHub Copilot is now available for free

No trial. No credit card required. Just your GitHub account.

September 12th, 2024

150 reactions

Performance Improvements in .NET 9

Partner Software Engineer

Each year, summer arrives to find me daunted and excited to write about the performance improvements in the upcoming release of .NET. “Daunted,” because these posts, covering .NET 8,.NET 7,.NET 6,.NET 5,.NET Core 3.0,.NET Core 2.1, and.NET Core 2.0, have garnered a bit of a reputation I want to ensure the next iteration lives up to. And “excited,” because there’s such an abundance of material to cover due to just how much goodness has been packed into the next .NET release, I struggle to get it all written down as quickly as my thoughts whirl.

And so, every year, I start these posts talking about how the next release of .NET is the fastest and best release to date. That’s true for .NET 9 as well, of course, but the statement that .NET 9 is the fastest and best release of .NET to date is now a bit… mundane. So, let’s spice it up a bit. How about… a haiku?

As the falcon flies,.NET 9 speeds joy intoDevelopers' hearts.

Or, maybe a limerick:

In the coding world, there's a star,.NET 9, the best by far.With speed that's supreme,It's every coder's dream,Taking development to a new par.

A little gimmicky? Maybe something more classical, a sonnet perhaps:

In realms of code where brilliance finds its way,.NET 9 shines with an unmatched array.Its speed and grace, a marvel to behold,Transforming tasks to treasures, fast and bold.Developers, with joy, embrace its might,Their projects soar, efficiency in sight.No longer bound by limits of the past,In .NET 9, their dreams are built to last.Its libs, a symphony of pure delight,Turning complex to simple, dim to light.With every line of code, a masterpiece,In .NET 9, dev burdens find release.Oh, wondrous .NET 9, you light the way,In your embrace, our future's bright as day.

Ok, so, yeah, I should stick to writing software rather than poetry (something with which my college poetry professor likely agreed). Nevertheless, the sentiment remains: .NET 9 is an incredibly exciting release. More than 7,500 pull requests (PRs) have merged intodotnet/runtime in the last year, of which a significant percentage have touched on performance in one way, shape, or form. In this post, we’ll take a tour through over 350 PRs that have all found their way into packing .NET 9 full of performance yumminess. Please grab a large cup of your favorite hot beverage, sit back, settle in, and enjoy.

Benchmarking Setup

In this post, I’ve included micro-benchmarks to showcase various performance improvements. Most of these benchmarks are implemented usingBenchmarkDotNet v0.14.0, and, unless otherwise noted, there is a simple setup for each.

To follow along, first make sure you have.NET 8 and.NET 9 installed. The numbers I share were gathered using the .NET 9 Release Candidate.

Once you have the appropriate prerequisites installed, create a new C# project in a new benchmarks directory:

dotnet new console -o benchmarkscd benchmarks

The resulting directory will contain two files:benchmarks.csproj, which is the project file with information about how the application should be compiled, andProgram.cs, which contains the code for the application. Replace the entire contents ofbenchmarks.csproj with this:

<Project Sdk="Microsoft.NET.Sdk">  <PropertyGroup>    <OutputType>Exe</OutputType>    <TargetFrameworks>net9.0;net8.0</TargetFrameworks>    <LangVersion>Preview</LangVersion>    <ImplicitUsings>enable</ImplicitUsings>    <Nullable>enable</Nullable>    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>    <ServerGarbageCollection>true</ServerGarbageCollection>  </PropertyGroup>  <ItemGroup>    <PackageReference Include="BenchmarkDotNet" Version="0.14.0" />  </ItemGroup></Project>

The preceding project file tells the build system we want:

to build a runnable application, as opposed to a library.
to be able to run on both .NET 8 and .NET 9, so that BenchmarkDotNet can build multiple versions of the application, one to run on each version, in order to compare the results.
to be able to use all of the latest features from the C# language even though C# 13 hasn’t officially shipped yet.
to automatically import common namespaces.
to be able to use nullable reference type annotations in the code.
to be able to use theunsafe keyword in the code.
to configure the garbage collector (GC) into its “server” configuration, which impacts the trade-offs it makes between memory consumption and throughput. This isn’t required, but it’s how most services are configured.
to pull inBenchmarkDotNet v0.14.0 from NuGet so that we’re able to use the library inProgram.cs.

For each benchmark, I’ve then included the fullProgram.cs source; to test it, just replace the entire contents of yourProgram.cs with the shown benchmark. Each test may be configured slightly differently from others, in order to highlight the key aspects being shown. For example, some tests include the[MemoryDiagnoser(false)] attribute, which tells BenchmarkDotNet to track allocation-related metrics, or the[DisassemblyDiagnoser] attribute, which tells BenchmarkDotNet to find and share the assembly code for the test, or the[HideColumns] attribute, which removes some output columns that BenchmarkDotNet might otherwise emit but that are unnecessary clutter for our needs in this post.

Running the benchmarks is then simple. Each test includes a comment at its top for thedotnet command to use to run the benchmark. It’s typically something like this:

dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0

That:

builds the benchmarks in a Release build. Compiling for Release is important as both the C# compiler and the JIT compiler have optimizations that are disabled for Debug. Thankfully, BenchmarkDotNet warns if Debug is accidentally used:

// Validating benchmarks://    * Assembly Benchmarks which defines benchmarks is non-optimizedBenchmark was built without optimization enabled (most probably a DEBUG configuration). Please, build it in RELEASE.If you want to debug the benchmarks, please see https://benchmarkdotnet.org/articles/guides/troubleshooting.html#debugging-benchmarks.

targets .NET 8 for the host project. There are multiple builds involved here: the “host” application you run with the above command, which uses BenchmarkDotNet, which will in turn generate and build an application per target runtime. Because the code for the benchmark is compiled into all of these, you typically want the host project to target the oldest runtime you’ll be testing, so that building the host application will fail if you try to use an API that’s not available in all of the target runtimes.
runs all of the benchmarks in the whole program. If you don’t specify the--filter argument, BenchmarkDotNet will prompt you to ask which benchmarks to run. By specifying “*”, we’re saying “don’t prompt, just run ’em all.” You can also specify an expression to filter down which subset of the tests you want invoked.
runs the tests on both .NET 8 and .NET 9.

Throughout the post, I’ve shown many benchmarks and the results I received from running them. Unless otherwise stated (e.g. because I’m demonstrating an OS-specific improvement), the results shown for benchmarks are from running them on Linux (Ubuntu 22.04) on an x64 processor.

BenchmarkDotNet v0.14.0, Ubuntu 22.04.3 LTS (Jammy Jellyfish) WSL11th Gen Intel Core i9-11950H 2.60GHz, 1 CPU, 16 logical and 8 physical cores.NET SDK 9.0.100-rc.1.24452.12  [Host]     : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

My standard caveat: these are micro-benchmarks, often measuring operations that take very short periods of time, but where improvements to those times add up to be impactful when executed over and over and over. Different hardware, different operating systems, what other processes might be running on your machine, who you had breakfast with this morning, and the alignment of the planets can all impact the numbers you get out. In short, the numbers you see are unlikely to match exactly the numbers I share here; however, I’ve chosen benchmarks that should be broadly repeatable.

With all that out of the way, let’s do this!

JIT

Improvements in .NET show up at all levels of the stack. Some changes result in large improvements in one specific area. Other changes result in small improvements across many things. When it comes to broad-reaching impact, there are few areas of .NET that result in changes more broadly-impactful than those changes made to the Just In Time (JIT) compiler. Code generation improvements help make everything better, and it’s where we’ll start our journey.

PGO

InPerformance Improvements in .NET 8, I called out the enabling of dynamic profile guided optimization (PGO) as my favorite feature in the release, so PGO seems like a good place to start for .NET 9.

As a brief refresher, dynamic PGO is a feature that enables the JIT to profile code and use what it learns from that profiling to help it generate more efficient code based on the exact usage patterns of the application. The JIT utilizes tiered compilation, which allows code to be compiled and then re-compiled, possibly multiple times, achieving something new each time the code is compiled. For example, a typical method might start out at “tier 0,” where the JIT applies very few optimizations and has a goal of simply getting to functional assembly as quickly as possible. This helps with startup performance, as optimizations are one of the most costly things a compiler does. Then the runtime tracks the number of times the method is invoked, and if the number of invocations trips over a particular threshold, such that it seems like performance could actually matter, the JIT will re-generate code for it, still at tier 0, but this time with a bunch of additional instrumentation injected into the method, tracking all manner of things that could help the JIT better optimize, e.g. for a given virtual dispatch, what is the most common type on which the call is being performed. Then after enough data has been gathered, the JIT can compile the method yet again, this time at “tier 1,” fully optimized, also incorporating all of the learnings from that profile data. This same flow is relevant as well for code that’s already been pre-compiled with ReadyToRun (R2R), except instead of instrumenting tier 0 code, the JIT will generate optimized, instrumented code on its way to generating a re-optimized implementation.

In .NET 8, the JIT in particular paid attention to PGO data about types and methods involved in virtual, interface, and delegate dispatch. In .NET 9, it’s also able to use PGO data to optimize casts. Thanks todotnet/runtime#90594,dotnet/runtime#90735,dotnet/runtime#96597,dotnet/runtime#96731, anddotnet/runtime#97773, dynamic PGO is now able to track the most common input types to cast operations (castclass/isinst, e.g. what you get from doing operations like(T)obj orobj is T), and then when generating the optimized code, emit special checks that add fast paths for the most common types. For example, in the following benchmark, we have a field of typeA initialized to a typeC that’s derived from bothB andA. Then the benchmark is type checking the instance stored in thatA field to see whether it’s aB or anything derived fromB.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser(maxDepth: 0)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private A _obj = new C();    [Benchmark]    public bool IsInstanceOf() => _obj is B;    public class A { }    public class B : A { }    public class C : B { }}

ThatIsInstanceOf benchmark results in the following disassembly on .NET 8:

; Tests.IsInstanceOf()       push      rax       mov       rsi,[rdi+8]       mov       rdi,offset MT_Tests+B       call      qword ptr [7F3D91524360]; System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)       test      rax,rax       setne     al       movzx     eax,al       add       rsp,8       ret; Total bytes of code 35

but now on .NET 9, it produces this:

; Tests.IsInstanceOf()       push      rbp       mov       rbp,rsp       mov       rsi,[rdi+8]       mov       rcx,rsi       test      rcx,rcx       je        short M00_L00       mov       rax,offset MT_Tests+C       cmp       [rcx],rax       jne       short M00_L01M00_L00:       test      rcx,rcx       setne     al       movzx     eax,al       pop       rbp       retM00_L01:       mov       rdi,offset MT_Tests+B       call      System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)       mov       rcx,rax       jmp       short M00_L00; Total bytes of code 62

On .NET 8, it’s loading the reference to the object and the desired method token forB, and calling theCastHelpers.IsInstanceOfClass JIT helper to do the type check. On .NET 9, instead it’s loading the method token forC, which it saw during profiling to be the most common type used, and then comparing that against the actual object’s method token. If they match, since the JIT knows thatC derives fromB, it then knows the object is in fact aB. If they don’t match, then it jumps down to the fallback path where it does the same thing that was being done on .NET 8, loading the reference and the desired method token forB and callingIsInstanceOfClass.

It’s also capable of optimizing for the negative case where the cast most often fails. Consider this benchmark:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser(maxDepth: 0)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private object _obj = "hello";    [Benchmark]    public bool IsInstanceOf() => _obj is Tests;}

On .NET 9, we get this assembly:

; Tests.IsInstanceOf()       push      rbp       mov       rbp,rsp       mov       rsi,[rdi+8]       mov       rcx,rsi       test      rcx,rcx       je        short M00_L00       mov       rax,offset MT_System.String       cmp       [rcx],rax       jne       short M00_L01       xor       ecx,ecxM00_L00:       test      rcx,rcx       setne     al       movzx     eax,al       pop       rbp       retM00_L01:       mov       rdi,offset MT_Tests       call      System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)       mov       rcx,rax       jmp       short M00_L00; Total bytes of code 64

Here the incoming object is always astring and never theTests class that’s being tested for. The generated code is comparing the incoming object againststring, and then, assuming the types match, the JIT knows the object is not aTests.

dotnet/runtime#96311 also breaks new ground with dynamic PGO, by teaching it how to profile integers and paying attention to their most common values. Then in conjunction withdotnet/runtime#96571, it uses this super power to optimizeBuffer.Memmove (which is the workhorse behind methods likeSpan<T>.CopyTo) andSpanHelpers.SequenceEqual (which is the implementation behind methods likestring.Equals). Previously, the JIT was taught how to unroll such operations, where if a constant length was provided, the JIT could generate the exact code sequence to implement the operation for that length. Now with this capability, the JIT can track the most common lengths provided to these methods, and if there’s one length that really stands out, it can special-case it, unrolling and vectorizing the operation when the length matches and falling back to calling the original when it doesn’t. While this is expected to improve in the future, for .NET 9 this set of length-profiling optimizations only kicks in when R2R is disabled, as the JIT is otherwise unable to do the exact profiling required. Disabling R2R is something services can do when startup performance isn’t a big concern and they instead care about maximum throughput at run-time.

// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Configs;using BenchmarkDotNet.Environments;using BenchmarkDotNet.Jobs;using BenchmarkDotNet.Running;var config = DefaultConfig.Instance    .AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_ReadyToRun", "0"))    .AddJob(Job.Default.WithId(".NET 9").WithRuntime(CoreRuntime.Core90).WithEnvironmentVariable("DOTNET_ReadyToRun", "0"));BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);[DisassemblyDiagnoser(maxDepth: 0)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "a", "b")]public class Tests{    [Benchmark]    [Arguments("abcd", "abcg")]    public bool Equals(string a, string b) => a == b;}

Method	Runtime	Mean	Code Size
Equals	.NET 8.0	2.8592 ns	78 B
Equals	.NET 9.0	0.6754 ns	87 B

Tier 0

Tier 0 is all about getting to functioning code quickly, and as such most optimizations are disabled. However, every now and then there’s a reason to do a bit more optimization in tier 0, in situations where the benefits of doing so outweigh the cons. Several of those occurred in .NET 9.

dotnet/runtime#104815 is a simple example. TheArgumentNullException.ThrowIfNull method is now used in thousands upon thousands of places for doing argument validation. It’s a non-generic method, accepting anobject argument and checking to see whether it’snull. That non-genericity causes some friction for folks when it’s used with value types. It’s rare for someone to directly callThrowIfNull with a value type (other than maybe with aNullable<T>), and in fact if they do, thanks todotnet/roslyn-analyzers from@CollinAlpert, there’s now theCA2264 analyzer that will warn that what’s being done is nonsensical: CA2264 using ThrowIfNull with a value type Instead, the common case is when the argument being validated is an unconstrained generic. In such cases, if the generic argument ends up being a value type, it’ll be boxed in the call toThrowIfNull. That boxing allocation gets removed in tier 1, because theThrowIfNull call gets inlined and the JIT can see at the call site that the boxing was unnecessary. But, because inlining doesn’t happen in tier 0, such boxing has remained in tier 0. As the API is so ubiquitous, this caused developers to fret that there was something bad happening, and it caused enough consternation that the JIT now special-casesArgumentNullException.ThrowIfNull and avoids the boxing, even in tier 0. This is easy to see with a little test console app:

// dotnet run -c Release -f net8.0 --filter "*"// dotnet run -c Release -f net9.0 --filter "*"using System.Runtime.CompilerServices;while (true){    Test();}[MethodImpl(MethodImplOptions.NoInlining)]static void Test(){    long gc = GC.GetAllocatedBytesForCurrentThread();    for (int i = 0; i < 100; i++)    {        ThrowIfNull(i);    }    gc = GC.GetAllocatedBytesForCurrentThread() - gc;    Console.WriteLine(gc);    Thread.Sleep(1000);}static void ThrowIfNull<T>(T value) => ArgumentNullException.ThrowIfNull(value);

When I run that on .NET 8, I get results like this:

240024002400000

The first few iterations are invokingTest() at tier 0, such that each call toArgumentNullException.ThrowIfNull boxes the inputint. Then when the method gets recompiled at tier 1, the boxing gets elided, and we stabilize at zero allocation. Now on .NET 9, I get results like this:

With these tweaks to tier 0, the boxing is also elided in tier 0, and so starts out without any allocation.

Another tier 0 boxing example isdotnet/runtime#90496. There’s a hot path method in theasync/await machinery:AsyncTaskMethodBuilder<TResult>.AwaitUnsafeOnCompleted (seeHow Async/Await Really Works in C# for all the details). It’s really important that this method be optimized well, but it performs various type tests that can end up boxing in tier 0. In a previous release, that boxing was deemed too impactful to startup forasync methods invoked early in an application’s lifetime, so[MethodImpl(MethodImplOptions.AggressiveOptimization)] was used to opt the method out of tiering, such that it gets optimized from the get-go. But that itself has downsides, because if it skips tiering up, it also skips dynamic PGO, and thus the optimized code isn’t as good as it possibly could be. So, this PR specifically addresses those type tests patterns that box, removing the boxing in tier 0, enabling removing thatAggressiveOptimization fromAwaitUnsafeOnCompleted, and thereby enabling better optimized code generation for it.

Optimizations are avoided in tier 0 because they might slow down compilation. If there are really cheap optimizations, though, and they can have a meaningful impact, they can be worth enabling. That’s especially true if the optimizations can actually help to make compilations and startup faster, such as by minimizing calls to helpers that may take locks, trigger certain kinds of loading, etc. And that’s whatdotnet/runtime#105190 does, enabling some constant folding in tier 0 at relatively little cost. Even with the low cost, though, there were still concerns about possible impact to JIT throughput, and the PR was fast-followed bydotnet/runtime#105250 which optimized some JIT code paths to make up for any impact from the former change.

Another similar case isdotnet/runtime#91403 from@MichalPetryka, which allows optimizations aroundRuntimeHelpers.CreateSpan to kick in for tier 0. Without that, the runtime can end up allocating many field stubs, which themselves add overhead to the startup path.

Loops

Applications spend a lot of time iterating through loops, and finding ways to reduce the overheads of loops has been a key focus for .NET 9. It’s also been quite successful.

dotnet/runtime#102261 anddotnet/runtime#103181 help to remove some instructions from even the tightest of loops by converting upward counting loops into downward counting loops. Consider a loop like the following:

// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public int UpwardCounting()    {        int count = 0;        for (int i = 0; i < 100; i++)        {            count++;        }        return count;    }}

Here’s what the generated assembly code for that core loop looks like on .NET 8:

M00_L00:       inc       eax       inc       ecx       cmp       ecx,64       jl        short M00_L00

It’s incrementingeax, which is storingcount. And it’s incrementingecx, which is storingi. It’s then comparingecx against 100 (0x64) to see if it’s reached the end of the loop, and jumping back up to the beginning of the loop if it hasn’t.

Now let’s manually rewrite the loop to be downward counting:

// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public int DownwardCounting()    {        int count = 0;        for (int i = 99; i >= 0; i--)        {            count++;        }        return count;    }}

And here’s what the generated assembly code for that core loop looks like there:

M00_L00:       inc       eax       dec       ecx       jns       short M00_L00

The key observation here is that by counting down, we can replace acmp/jl for comparing against a specific bound to instead just be ajns that jumps if the value isn’t negative. We’ve thus removed an instruction from a tight loop that only had four to begin with. With the aforementioned PRs, the JIT can now do that transformation automatically where it’s applicable and deemed valuable, such that the loop inUpwardCounting now results in the same assembly code on .NET 9 as does the loop inDownwardCounting.

Method	Runtime	Mean	Ratio
UpwardCounting	.NET 8.0	30.27 ns	1.00
UpwardCounting	.NET 9.0	26.52 ns	0.88

However, the JIT is only able to do this transformation if the iteration variable (i) isn’t used in the body of the loop, and obviously there are many loops where it is, such as by indexing into an array being iterated over. Thankfully, other optimizations in .NET 9 are able to reduce the actual reliance on the iteration variable, such that this optimization now kicks in frequently.

One such optimization is strength reduction in loops. In compilers, “strength reduction” is the simple idea of taking something relatively expensive and replacing it with something cheaper. In the context of loops, that typically means introducing more “induction variables” (variables whose values change in a predictable pattern on each iteration, such as being incremented by a constant amount). For example, consider a simple loop that sums all of the elements of an array:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int[] _array = Enumerable.Range(0, 1000).ToArray();    [Benchmark]    public int Sum()    {        int[] array = _array;        int sum = 0;        for (int i = 0; i < array.Length; i++)        {            sum += array[i];        }        return sum;    }}

We get the following assembly on .NET 8:

; Tests.Sum()       push      rbp       mov       rbp,rsp       mov       rax,[rdi+8]       xor       ecx,ecx       xor       edx,edx       mov       edi,[rax+8]       test      edi,edi       jle       short M00_L01M00_L00:       mov       esi,edx       add       ecx,[rax+rsi*4+10]       inc       edx       cmp       edi,edx       jg        short M00_L00M00_L01:       mov       eax,ecx       pop       rbp       ret; Total bytes of code 35

The interesting part is the loop starting atM00_L00.i is being stored inedx (though it gets copied intoesi), and as part of adding the next element from the array tosum (which is stored inecx), we’re loading that next value from the array with the addressrax+rsi*4+10. A strength reduction view of this would say “rather than re-computing the address on each iteration, we can instead have another induction variable and increment it by 4 on each iteration.” A key benefit of that is it then removes a dependency oni from inside of the loop, which then means the iteration variable is no longer used in the loop, enabling the aforementioned downward counting optimization to kick in. That leads to the following assembly on .NET 9:

; Tests.Sum()       push      rbp       mov       rbp,rsp       mov       rax,[rdi+8]       xor       ecx,ecx       mov       edx,[rax+8]       test      edx,edx       jle       short M00_L01       add       rax,10M00_L00:       add       ecx,[rax]       add       rax,4       dec       edx       jne       short M00_L00M00_L01:       mov       eax,ecx       pop       rbp       ret; Total bytes of code 35

Note the loop atM00_L00: it’s now downward counting, reading the next value from the array is simply dereferencing the address inrax, and the address inrax is incremented by 4 each go around.

A lot of work went into enabling this strength reduction, including providing the basic implementation (dotnet/runtime#104243), enabling it by default (dotnet/runtime#105131), finding more opportunities to apply it (dotnet/runtime#105169), and using it to enable post-indexed addressing (dotnet/runtime#105181 anddotnet/runtime#105185), which is an Arm addressing mode where the address stored in the base register is used but then that register is updated to point to the next target memory location. A new phase was also added to the JIT to help with optimizing such induction variables (dotnet/runtime#97865), and in particular, to do induction variable widening where 32-bit induction variables (think of every loop you’ve ever written that starts withfor (int i = ...)) are widened to 64-bit induction variables. This widening can help to avoid zero extensions that might otherwise occur on every iteration of the loop.

These optimizations are all new, but of course there are also many loop optimizations already present in the JIT compiler, from loop unrolling to loop cloning to loop hoisting. In order to apply such loop optimizations, though, the JIT first needs to recognize loops, and that can sometimes be more challenging than it would seem (dotnet/runtime#43713 describes a case where the JIT was failing to do so). Historically, the JIT’s loop recognition was based on a relatively simplistic lexical analysis. In .NET 8, as part of the work to improve dynamic PGO, a more powerful graph-based loop analyzer was added that was able to recognize many more loops. For .NET 9 withdotnet/runtime#95251, that analyzer was factored out so that it could be used for generalized loop reasoning. And then with PRs likedotnet/runtime#96756 for loop alignment,dotnet/runtime#96754 anddotnet/runtime#96553 for loop cloning,dotnet/runtime#96752 for loop unrolling,dotnet/runtime#96751 for loop canonicalization, anddotnet/runtime#96753 for loop hoisting, many of these loop-related optimizations have now been moved to the better scheme. All of that means that more loops get optimized.

Bounds Checks

.NET code is, by default, “memory safe.” Unlike in C, where you can iterate through an array and easily walk off the end of it, by default accesses to arrays, strings, and spans are “bounds checked” to ensure you can’t walk off the end or before the beginning. Of course, such bounds checking adds overhead, and so wherever the JIT can prove that adding such checks would be unnecessary, it’ll elide the bounds check, knowing that it’s impossible for the guarded accesses to be problematic. The quintessential example of this is a loop over an array from0 toarray.Length. Let’s look at the same benchmark we just looked at, summing all the elements of an integer array:

// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser(maxDepth: 0)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int[] _array = new int[1000];    [Benchmark]    public int Test()    {        int[] array = _array;        int sum = 0;        for (int i = 0; i < array.Length; i++)        {            sum += array[i];        }        return sum;    }}

ThatTest benchmark results in this assembly code on .NET 8:

; Tests.Test()       push      rbp       mov       rbp,rsp       mov       rax,[rdi+8]       xor       ecx,ecx       xor       edx,edx       mov       edi,[rax+8]       test      edi,edi       jle       short M00_L01M00_L00:       mov       esi,edx       add       ecx,[rax+rsi*4+10]       inc       edx       cmp       edi,edx       jg        short M00_L00M00_L01:       mov       eax,ecx       pop       rbp       ret; Total bytes of code 35

The key part to pay attention to is the loop atM00_L00, for which the only branch is the one comparingedx (which is trackingi) toedi (which was earlier on initialized to the length of the array,[rax+8]) as part of knowing when it’s done iterating. There’s no additional check required to make this safe, as the JIT knows the loop started at0 (and thus isn’t walking off the beginning of the array) and the JIT knows iteration ends at the array length, which the JIT is already checking for, so it’s safe to index into the array without additional checks.

Now, let’s tweak the benchmark ever so slightly. In the above, I was copying the_array field to a localarray and then doing all accesses against thatarray; this is critical, because there’s nothing else that could be changing that local out from under the loop. But if we instead change the code to refer to the field directly:

// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser(maxDepth: 0)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int[] _array = new int[1000];    [Benchmark]    public int Test()    {        int sum = 0;        for (int i = 0; i < _array.Length; i++)        {            sum += _array[i];        }        return sum;    }}

now we get this on .NET 8:

; Tests.Test()       push      rbp       mov       rbp,rsp       xor       eax,eax       xor       ecx,ecx       mov       rdx,[rdi+8]       cmp       dword ptr [rdx+8],0       jle       short M00_L01       nop       dword ptr [rax]       nop       dword ptr [rax]M00_L00:       mov       rdi,rdx       cmp       ecx,[rdi+8]       jae       short M00_L02       mov       esi,ecx       add       eax,[rdi+rsi*4+10]       inc       ecx       cmp       [rdx+8],ecx       jg        short M00_L00M00_L01:       pop       rbp       retM00_L02:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 61

That’s a whole lot worse. Note how much that loop starting atM00_L00 has grown, and in particular note that instead of just having the onecmp/jg pair at the end, there’s anothercmp/jae pair in the middle, just before it accesses the array element. Since the code is reading from the field on every access, the JIT needs to accommodate the fact that the reference could change between any two accesses; thus, even though the JIT is comparing against_array.Length as part of the loop bounds, it also needs to ensure that the subsequent reference to_array[i] is still in bounds, since by then_array may be an entirely different object. That’s a “bounds check,” which is obvious from the tell-tale sign that immediately after thecmp, there’s a conditional jump to code that unconditionally callsCORINFO_HELP_RNGCHKFAIL; that’s the helper function that’s called to throw anIndexOutOfRangeException when you try to walk off the end of one of these data structures.

Every release the JIT gets better at removing more and more bounds checks where it can prove they’re superfluous. One of my favorite such improvements in .NET 9 is there on my favorites list because I’ve historically expected the optimization to “just work”, for various reasons it didn’t, and now it does (it also shows up in a fair amount of real code, which is why I’ve bumped up against it). In this benchmark, the function is handed an offset and a span, and its job is to sum all of the numbers from that offset to the end of the span.

// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments(3)]    public int Test() => M(0, "1234567890abcdefghijklmnopqrstuvwxyz");    [MethodImpl(MethodImplOptions.NoInlining)]    public static int M(int i, ReadOnlySpan<char> src)    {        int sum = 0;        while (true)        {            if ((uint)i >= src.Length)            {                break;            }            sum += src[i++];        }        return sum;    }}

By castingi touint as part of the comparison tosrc.Length, the JIT knows thati is in bounds ofsrc by the timei is used to index intosrc, because ifi were negative, the cast touint would have made it larger thanint.MaxValue and thus also larger thansrc.Length (which can’t possibly be larger thanint.MaxValue). The .NET 8 assembly shows the bounds check has been elided (note the lack ofCORINFO_HELP_RNGCHKFAIL):

; Tests.M(Int32, System.ReadOnlySpan`1<Char>)       push      rbp       mov       rbp,rsp       xor       eax,eaxM01_L00:       cmp       edi,edx       jae       short M01_L01       lea       ecx,[rdi+1]       mov       edi,edi       movzx     edi,word ptr [rsi+rdi*2]       add       eax,edi       mov       edi,ecx       jmp       short M01_L00M01_L01:       pop       rbp       ret; Total bytes of code 27

But, this is a fairly awkward way to write such a condition. A more natural way would be to have that check as part of the loop condition:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments(3)]    public int Test() => M(0, "1234567890abcdefghijklmnopqrstuvwxyz");    [MethodImpl(MethodImplOptions.NoInlining)]    public static int M(int i, ReadOnlySpan<char> src)    {        int sum = 0;        for (; (uint)i < src.Length; i++)        {            sum += src[i];        }        return sum;    }}

Unfortunately, as a result of my code cleanup here to make the code more canonical, the JIT in .NET 8 fails to see that the bounds check can be elided… note theCORINFO_HELP_RNGCHKFAIL at the end:

; Tests.M(Int32, System.ReadOnlySpan`1<Char>)       push      rbp       mov       rbp,rsp       xor       eax,eax       cmp       edi,edx       jae       short M01_L01M01_L00:       cmp       edi,edx       jae       short M01_L02       mov       ecx,edi       movzx     ecx,word ptr [rsi+rcx*2]       add       eax,ecx       inc       edi       cmp       edi,edx       jb        short M01_L00M01_L01:       pop       rbp       retM01_L02:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 36

But in .NET 9, thanks todotnet/runtime#100777, the JIT is better able to track the knowledge about guarantees made by the loop condition and is able to elide the bounds check on this variation as well.

; Tests.M(Int32, System.ReadOnlySpan`1<Char>)       push      rbp       mov       rbp,rsp       xor       eax,eax       cmp       edi,edx       jae       short M01_L01       mov       ecx,ediM01_L00:       movzx     edi,word ptr [rsi+rcx*2]       add       eax,edi       inc       ecx       cmp       ecx,edx       jb        short M01_L00M01_L01:       pop       rbp       ret; Total bytes of code 26

Yay!

Now consider this benchmark:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser(maxDepth: 0)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments(3)]    public int Test(int i)    {        ReadOnlySpan<byte> rva = [1, 2, 3, 5, 8, 13, 21, 34];        return rva[7 - (i & 7)];    }}

The test method here has a span of data initialized in a way where the JIT is able to see how long it is. It’s then indexing into the span, using the supplied index to read not from the start but from the end (the(i & 7) is there to ensure the JIT can see that the value will always be in range); if it were reading from the start, this was already optimized, but from the end, the JIT hadn’t previously been taught how to reason about the bounds checks. On .NET 8, it can’t prove the access is always in-bounds, and we can see the bounds check in place:

; Tests.Test(Int32)       push      rax       and       esi,7       mov       eax,esi       neg       eax       add       eax,7       cmp       eax,8       jae       short M00_L00       mov       rcx,7FC98A741EC8       movzx     eax,byte ptr [rax+rcx]       add       rsp,8       retM00_L00:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 41

But, now on .NET 9, thanks todotnet/runtime#96123, the bounds check gets elided.

; Tests.Test(Int32)       and       esi,7       mov       eax,esi       neg       eax       add       eax,7       mov       rcx,7F39B8724EC8       movzx     eax,byte ptr [rax+rcx]       ret; Total bytes of code 25

Here’s another case. We’re special-casing spans of lengths less than or equal to 1, returningstring.Empty if the span is of length 0 or returning the first string if the span is of length 1:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments(3)]    public string? Test() => M(["123"]);    [MethodImpl(MethodImplOptions.NoInlining)]    private static string? M(ReadOnlySpan<string> values)    {        if (values.Length <= 1)        {            return values.Length == 0 ?                string.Empty :                values[0];        }        return null;    }}

You and I can see that the access tovalues[0] will always succeed, but on .NET 8 we get this:

; Tests.M(System.ReadOnlySpan`1<System.String>)       push      rbp       mov       rbp,rsp       cmp       esi,1       jg        short M01_L01       test      esi,esi       je        short M01_L00       test      esi,esi       je        short M01_L02       mov       rax,[rdi]       pop       rbp       retM01_L00:       mov       rax,7FB62147C008       pop       rbp       retM01_L01:       xor       eax,eax       pop       rbp       retM01_L02:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 44

The JIT keeps track of what it knows about the lengths of various things, what conditions it’s proved, but here it’s lost track of the fact that, for the else branch of the ternary,values is guaranteed to be of length1, and thus indexing at index0 is safe.dotnet/runtime#101323 improves the JIT’s range tracking ability, such that on .NET 9, the bounds check is successfully elided:

; Tests.M(System.ReadOnlySpan`1<System.String>)       push      rbp       mov       rbp,rsp       cmp       esi,1       jg        short M01_L01       test      esi,esi       je        short M01_L00       mov       rax,[rdi]       pop       rbp       retM01_L00:       mov       rax,7F5700FB1008       pop       rbp       retM01_L01:       xor       eax,eax       pop       rbp       ret; Total bytes of code 34

Most if not all of these bounds check elimination improvements come about because someone is optimizing something and sees a bounds check that could have been eliminated but wasn’t. In the case that inspired the improvement indotnet/runtime#101352, that someone was me, while working on improvingEnum for .NET 8. Enums can be backed by various numerical types, includingulong, and there’s a code path inEnum.GetName that’s effectively this:

if (ulongValue < (ulong)names.Length){    return names[(uint)ulongValue];}

That bounds check wasn’t previously being removed, but now in .NET 9, it is:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private readonly string[] _names = Enum.GetNames<MyEnum>();    [Benchmark]    [Arguments(2)]    public string? GetNameOrNull(ulong ulValue)    {        string[] names = _names;        return ulValue < (ulong)names.Length ?            names[(uint)ulValue] :            null;    }    public enum MyEnum : ulong { A, B, C, D }}

// .NET 8; Tests.GetNameOrNull(UInt64)       push      rbp       mov       rbp,rsp       mov       rax,[rdi+8]       mov       ecx,[rax+8]       mov       edx,ecx       cmp       rdx,rsi       jbe       short M00_L00       cmp       esi,ecx       jae       short M00_L01       mov       ecx,esi       mov       rax,[rax+rcx*8+10]       pop       rbp       retM00_L00:       xor       eax,eax       pop       rbp       retM00_L01:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 41// .NET 9; Tests.GetNameOrNull(UInt64)       mov       rax,[rdi+8]       mov       ecx,[rax+8]       cmp       rcx,rsi       jbe       short M00_L00       mov       ecx,esi       mov       rax,[rax+rcx*8+10]       retM00_L00:       xor       eax,eax       ret; Total bytes of code 23

Sometimes eliding bounds checks is about learning new tricks; other times, it’s about fixing old ones. Consider this benchmark:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static ReadOnlySpan<int> Lookup => [1, 2, 3, 5, 8, 13, 21];    [Benchmark]    [Arguments(3)]    public int Test1(int i) => (uint)i < 7 ? Lookup[i] : -1;    [Benchmark]    [Arguments(3)]    public int Test2(int i) => (uint)i <= 6 ? Lookup[i] : -1;}

Test1 andTest2 are effectively the same thing, both guarding a lookup table by a known length and only accessing the table if we know the index to be in bounds. The bounds check will then be elided by the JIT in both cases, right? Wrong. On .NET 8, we get this:

; Tests.Test1(Int32)       cmp       esi,7       jae       short M00_L00       mov       eax,esi       mov       rcx,7F6D40064030       mov       eax,[rcx+rax*4]       retM00_L00:       mov       eax,0FFFFFFFF       ret; Total bytes of code 27; Tests.Test2(Int32)       push      rbp       mov       rbp,rsp       cmp       esi,6       ja        short M00_L00       cmp       esi,7       jae       short M00_L01       mov       eax,esi       mov       rcx,7F8D11621030       mov       eax,[rcx+rax*4]       pop       rbp       retM00_L00:       mov       eax,0FFFFFFFF       pop       rbp       retM00_L01:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 44

Note the bounds check inTest2.dotnet/runtime#97908 fixes this, such that on .NET 9Test2 now successfully elides the bounds check as well:

; Tests.Test1(Int32)       cmp       esi,7       jae       short M00_L00       mov       eax,esi       mov       rcx,7F5B9DC5E030       mov       eax,[rcx+rax*4]       retM00_L00:       mov       eax,0FFFFFFFF       ret; Total bytes of code 27; Tests.Test2(Int32)       cmp       esi,6       ja        short M00_L00       mov       eax,esi       mov       rcx,7F7FDE2C9030       mov       eax,[rcx+rax*4]       retM00_L00:       mov       eax,0FFFFFFFF       ret; Total bytes of code 27

Interestingly, sometimes even if we can’t elide a bounds check, we can learn things from the fact that one occurred, and then use that knowledge to optimize subsequent things. Consider this benchmark:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private readonly int[] _x = new int[10];    [Benchmark]    [Arguments(2)]    public int Add(int y) => _x[y] + (y % 8);}

There’s nothing the JIT can do here to elide the bounds check on_x[y]; it has no information about the value ofy or the length of_x. As such, as shown in the .NET 8 assembly here, we see a bounds check:

; Tests.Add(Int32)       push      rax       mov       rax,[rdi+8]       cmp       esi,[rax+8]       jae       short M00_L00       mov       ecx,esi       mov       edx,esi       sar       edx,1F       and       edx,7       add       edx,esi       and       edx,0FFFFFFF8       mov       edi,esi       sub       edi,edx       add       edi,[rax+rcx*4+10]       mov       eax,edi       add       rsp,8       retM00_L00:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 46

However, all is not lost. After indexing into the array, we proceed to usey as the numerator of a% operation. C#’s% operator supports bothint anduint numerators, but it has to do a little more work forint in case the value is negative. However, by the time we get to that% operation, weknow thaty is not negative, as if it were negative, the_x[y] would have thrown and we’d never end up here.dotnet/runtime#102089 teaches the JIT how to learn such non-negative information from such bounds checks, such that in .NET 9, we get code generation equivalent to if we’d explicitly casty touint.

; Tests.Add(Int32)       push      rax       mov       rax,[rdi+8]       cmp       esi,[rax+8]       jae       short M00_L00       mov       ecx,esi       and       esi,7       add       esi,[rax+rcx*4+10]       mov       eax,esi       add       rsp,8       retM00_L00:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 32

Arm64

Making .NET on Arm an awesome and fast experience has been a critical, multi-year investment. You can read about it inArm64 Performance Improvements in .NET 5,Arm64 Performance Improvements in .NET 7, andArm64 Performance Improvements in .NET 8. And things continue to improve even further in .NET 9. Here are some examples:

Better barriers.dotnet/runtime#91553 implements volatile writes via using thestlur (Store-Release Register) instruction rather than admb (Data Memory Barrier) /str (Store) pair of instructions (stlur is generally cheaper). Similarly,dotnet/runtime#101359 eliminates full memory barriers when dealing with volatile reads and writes onfloats. For example, code that would previously have produced aldr (Load Register) /dmb pair may now produce aldar (Load-Acquire Register) /fmov (Floating-point Move) pair.
Better switches. Depending on the shape of aswitch statement, the C# compiler may generate a variety of IL patterns, one of which is to use aswitch IL instruction. Normally for aswitch IL instruction, the JIT will generate a jump table, but for some forms, it has an optimization to instead rely on a bit test. Thus far this optimization only existed for x86/64, with thebt (Bit Test) instruction. Now withdotnet/runtime#91811, it also exists for Arm, with thetbz (Test bit and Branch if Zero) instruction.
Better conditionals. Arm has conditional instructions that logically contain a branch albeit without any branching, e.g.Performance Improvements in .NET 8 talked about thecsel (Conditional Select) instruction that “conditionally selects” a value from one of two registers based on some condition. Another such instruction iscsinc (Conditional Select Increment), which conditionally selects either the value from one register or the value from another register incremented by one.dotnet/runtime#91262 from@c272 enables the JIT to utilizecsinc, so that a statement likex = condition ? x + 1 : y; will be able to compile down to acsinc rather than to a branching construct.dotnet/runtime#92810 also improves the custom comparison operation the JIT emits for someSequenceEqual operations (e.g."hello, there"u8.SequenceEqual(spanOfBytes)) to be able to useccmp (Conditional Compare).
Better multiplies. Arm has single instructions that represent doing a multiply followed by an addition, subtraction, or negation.dotnet/runtime#91886 from@c272 finds such sequences of multiplies followed by one of those operations and consolidates them to use the single combined instruction.
Better loads. Arm has instructions for loading a value from memory into a single register, but it also has instructions for loading multiple values into multiple registers. When the JIT emits a customized memory copy (such as forbyteArray.AsSpan(0, 32).SequenceEqual(otherByteArray)), it may emit multipleldr instructions for loading a value into a register.dotnet/runtime#92704 enables consolidating pairs of those intoldp (Load Pair of Registers) instructions, which load two values into two registers.

ARM SVE

Bringing up a new instruction set is a huge deal and a huge undertaking. I’ve mentioned in the past my process for gearing up to write one of these “Performance Improvements in .NET X” posts, including that throughout the year I keep a running list of the PRs I might want to talk about when it comes time to actually put pen to paper. Just for “SVE”, I found myself with over 200 links. I’m not going to bore you with such a laundry list; if you’re interested, you can search forSVE PRs, which includes PRs from@a74nh, from@ebepho, from@mikabl-arm, from@snickolls-arm, and from@SwapnilGaikwad. But, we can still talk a bit about what it is and what it means for .NET.

Single instruction, multiple data (SIMD) is a kind of parallel processing where one instruction performs the same operation on multiple pieces of data at the same time, rather than one instruction manipulating just a single piece of data. For example, theadd instruction on x86/64 can add together one pair of 32-bit integers, whereas thepaddd (Add Packed Doubleword Integers) instruction that’s part of Intel’s SSE2 (Streaming SIMD Extensions 2) instruction set operates on a pair ofxmm registers that can each store four 32-bit integer values at once. Many such instructions have been added to many different hardware platforms over the years, coming in groups referred to as instruction set architectures (ISA), where an ISA defines what the instructions are, what registers they interact with, how memory is accessed, and so on. Even if you’re not steeped in this stuff, you’ve likely heard names of these ISAs mentioned, like Intel’s SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions), or Arm’s Advanced SIMD (also known as Neon). In general, the instructions in all of these ISAs operate on a fixed number of values of a fixed size, e.g. thepaddd previously mentioned only works with 128-bits at a time, no more, no less. Different instructions exist for 256 bits at a time or 512 bits at a time.

SVE, or “Scalable Vector Extensions,” is an ISA from Arm that’s a bit different. The instructions in SVE don’t operate on a fixed size. Rather, the specification allows for them to operate on sizes from 128 bits up to 2048 bits, and the specific hardware can choose which size to use (allowed sizes are multiples of 128, and with SVE 2 further constrained to be powers of 2). The same assembly code using these instructions might operate on 128 bits at a time on one piece of hardware and 256 bits at a time on another piece of hardware.

There are multiple ways such an ISA impacts .NET, and in particular the JIT. The JIT needs to be able to be able to work with the ISA, understand the associated registers and be able to do register allocation, be taught about encoding and emitting the instructions, and so on. The JIT needs to be taught when and where it’s appropriate to use these instructions, so that as part of compiling IL down to assembly, if operating on a machine that supports SVE, the JIT might be able to pick SVE instructions for use in the generated assembly. And the JIT needs to be taught how to represent this data, these vectors, to user code. All of that is a huge amount of work, especially when you consider that there are thousands of operations represented. What makes it even more work is hardware intrinsics.

Hardware intrinsics are a feature of .NET where, effectively, each of these instructions shows up as its own dedicated .NET method, such asSse2.Add, and the JIT emits use of that method as the underlying instruction to which it maps. If you look atSve.cs in dotnet/runtime, you’ll see theSystem.Runtime.Intrinsics.Arm.Sve type, which already exposes more than 1400 public methods (that number is not a typo).

Two interesting things to notice if you open that file (beyond its sheer length):

The use ofVector<T>. .NET’s foray into SIMDstarted in 2014 and was accompanied by theVector<T> type.Vector<T> represents a single vector (list) of theT numeric type. To provide a platform-agnostic representation, since different platforms were capable of different vector widths,Vector<T> was defined to be variable in size, so for example on x86/x64 hardware that supported AVX2,Vector<T> might be 256 bits wide, whereas on an Arm machine that supported Neon,Vector<T> might be 128 bits wide. If the hardware supported both 128 bits and 256 bits,Vector<T> would map to the larger. Since the introduction ofVector<T>, various fixed-width vector types have been introduced, likeVector64<T>,Vector128<T>,Vector256<T>, andVector512<T>, and the hardware intrinsics for most of the other ISAs are all in terms of those fixed-width vector sizes, since the instructions themselves are fixed width. But SVE is not; its instructions might be 128 bits here and 512 bits there, thus it’s not possible to use those same fixed-width vector types in theSve definition… but it makes a lot of sense to use the variable withVector<T>. What’s old is new again.
TheSve class is tagged as[Experimental]. The[Experimental] attribute was introduced in .NET 8 and C# 12. The intent is it can be used to indicate that some functionality in an otherwise stable assembly is not yet stable and may change in the future. If code tries to use such a member, by default the C# compiler will issue an error telling the developer they’re using something that could break in the future. As long as the developer is willing to accept such breaking change risk, they can then suppress the error. Designing and enabling the SVE support is a monstrous, multi-year effort, and while the support is functional and folks are encouraged to take it for a spin, it’s not yet baked enough for us to be 100% confident the shape won’t need to evolve (for .NET 9, it’s also restricted to hardware with a vector width of 128 bits, but that restriction will be removed subsequently). Hence,[Experimental].

AVX10.1

Even with the size of the SVE effort, it’s not the only new ISA available in .NET 9. Thanks in large part todotnet/runtime#99784 from@Ruihan-Yin anddotnet/runtime#101938 from@khushal1996, .NET 9 now also supports AVX10.1 (AVX10 version 1). AVX10.1 provides everything AVX512 provides, all of the base support, the updated encodings, support for embedded broadcasts, masking, and so on, but it only requires 256-bit support in the hardware (with 512-bits being optional, whereas AVX512 requires 512-bit support), and it does so in a much less incremental manner (AVX512 has multiple instruction sets like “F”, “DQ”, “Vbmi”, etc.). That’s modeled in the .NET APIs as well, where you can checkAvx10v1.IsSupported as well asAvx10v1.V512.IsSupported, both of which govern more than 500 new APIs available for consumption. (Note that at the time of this writing, there aren’t actually any chips on the market that support AVX10.1, but they’re expected in the foreseeable future.)

AVX512

On the subject of ISAs, it’s worth mentioning AVX512. .NET 8 added broad support for AVX512, including support in the JIT and employment of it throughout the libraries. Both of those improve further in .NET 9. We’ll talk more about places it’s better used in the libraries later. For now, here are some JIT-specific improvements.

One of the things the JIT needs to generate code for is zeroing, e.g. by default all locals in a method need to be set to zero, and even if[SkipLocalsInit] is employed, references still need to be zeroed (otherwise, when the GC does a pass through all of the locals looking for references to objects to see what’s no longer referenced, it could see the references as being whatever garbage happened to be in that location in memory and end up making bad choices). Such zeroing of locals is overhead that occurs on every invocation of that method, so obviously it’s valuable for that to be as efficient as possible. Rather than zeroing out each word with a single instruction, if the current hardware supports the appropriate SIMD instructions, the JIT can instead emit code to use those instructions, so that it can zero out more per instruction. Withdotnet/runtime#91166, it’s now able to use AVX512 instructions if available to zero out 512 bits per instruction, rather than “only” 256 bits or 128 bits using other ISAs. As an example, here’s a benchmark that needs to zero out 256 bytes:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;using System.Runtime.InteropServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public unsafe class Tests{    [Benchmark]    public void Sum()    {        Bytes values;        Nop(&values);    }    [SkipLocalsInit]    [Benchmark]    public void SumSkipLocalsInit()    {        Bytes values;        Nop(&values);    }    [MethodImpl(MethodImplOptions.NoInlining)]    private static void Nop(Bytes* value) { }    [StructLayout(LayoutKind.Sequential, Size = 256)]    private struct Bytes { }}

Here’s the assembly forSum on .NET 8:

; Tests.Sum()       sub       rsp,108       xor       eax,eax       mov       [rsp+8],rax       vxorps    xmm8,xmm8,xmm8       mov       rax,0FFFFFFFFFFFFFF10M00_L00:       vmovdqa   xmmword ptr [rsp+rax+100],xmm8       vmovdqa   xmmword ptr [rsp+rax+110],xmm8       vmovdqa   xmmword ptr [rsp+rax+120],xmm8       add       rax,30       jne       short M00_L00       mov       [rsp+100],rax       lea       rdi,[rsp+8]       call      qword ptr [7F6B56B85CB0]; Tests.Nop(Bytes*)       nop       add       rsp,108       ret; Total bytes of code 90

This is on a machine with AVX512 hardware support, but we can see the zero’ing is happening using a loop (M00_L00 through to thejne that jumps back to it), as with only 256-bit instructions, this was deemed by the JIT’s heuristics too large to unroll completely. Now, here’s .NET 9:

; Tests.Sum()       sub       rsp,108       xor       eax,eax       mov       [rsp+8],rax       vxorps    xmm8,xmm8,xmm8       vmovdqu32 [rsp+10],zmm8       vmovdqu32 [rsp+50],zmm8       vmovdqu32 [rsp+90],zmm8       vmovdqa   xmmword ptr [rsp+0D0],xmm8       vmovdqa   xmmword ptr [rsp+0E0],xmm8       vmovdqa   xmmword ptr [rsp+0F0],xmm8       mov       [rsp+100],rax       lea       rdi,[rsp+8]       call      qword ptr [7F4D3D3A44C8]; Tests.Nop(Bytes*)       nop       add       rsp,108       ret; Total bytes of code 107

Now there’s no loop, becausevmovdqu32 (Move unaligned packed doubleword integer values) can be used to zero twice as much at a time (64 bytes) as vmovdqa (Move aligned packed integer values), and thus the zeroing can be done in fewer instructions that’s still considered a reasonable number.

Zeroing also shows up elsewhere, such as when initializing structs. Those have also previously employed SIMD instructions where relevant, e.g. this:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public MyStruct Init() => new();    public struct MyStruct    {        public Int128 A, B, C, D;    }}

produces this assembly today on .NET 8:

; Tests.Init()       vzeroupper       vxorps    ymm0,ymm0,ymm0       vmovdqu32 [rsi],zmm0       mov       rax,rsi       ret; Total bytes of code 17

But, if we tweakMyStruct to add a field of a reference type anywhere in the struct (e.g. addpublic string Oops; as the first line of the struct above), it knocks the initialization off this optimized path, and we end up with initialization like this on .NET 8:

; Tests.Init()       xor       eax,eax       mov       [rsi],rax       mov       [rsi+8],rax       mov       [rsi+10],rax       mov       [rsi+18],rax       mov       [rsi+20],rax       mov       [rsi+28],rax       mov       [rsi+30],rax       mov       [rsi+38],rax       mov       [rsi+40],rax       mov       [rsi+48],rax       mov       rax,rsi       ret; Total bytes of code 45

This is due to alignment requirements in order to provide necessary atomicity guarantees. But rather than giving up wholesale,dotnet/runtime#102132 allows the SIMD zeroing to be used for the contiguous portions that don’t contain GC references, so now on .NET 9 we get this:

; Tests.Init()       xor       eax,eax       mov       [rsi],rax       vxorps    xmm0,xmm0,xmm0       vmovdqu32 [rsi+8],zmm0       mov       [rsi+48],rax       mov       rax,rsi       ret; Total bytes of code 27

This optimization isn’t specific to AVX512, but it includes the ability to use AVX512 instructions when available. (dotnet/runtime#99140 provides similar support for Arm64.)

Other optimizations improve the JIT’s ability to select AVX512 instructions as part of generating code. One neat example of this isdotnet/runtime#91227 from@Ruihan-Yin, which utilizes the coolvpternlog (Bitwise Ternary Logic) instruction. Imagine you have threebools (a,b, andc), and you want to perform a series of Boolean operations on them, e.g.a ? (b ^ c) : (b & c). If you were to naively compile that down, you’d end up with branches. We could make it branchless by distributing thea to both sides of the ternary, e.g.(a & (b ^ c)) | (!a & (b & c)), but now we’ve gone from one branch and one Boolean operation to six Boolean operations. What if instead we could do all of that in a single instructionand do it for all of the lanes in a vector at the same time so it could be applied to multiple values as part of a SIMD operation? That’d be cool, right? That’s whatvpternlog enables. Try running this:

// dotnet run -c Release -f net9.0internal class Program{    private static bool Exp(bool a, bool b, bool c) => (a & (b ^ c)) | (!a & b & c);    private static void Main()    {        Console.WriteLine("a b c result");        Console.WriteLine("------------");        int control = 0;        foreach (var (a, b, c, result) in from a in new[] { true, false }                                          from b in new[] { true, false }                                          from c in new[] { true, false }                                          select (a, b, c, Exp(a, b, c)))        {            Console.WriteLine($"{Convert.ToInt32(a)} {Convert.ToInt32(b)} {Convert.ToInt32(c)} {Convert.ToInt32(result)}");            control = control << 1 | Convert.ToInt32(result);        }        Console.WriteLine("------------");        Console.WriteLine($"Control: {control:b8} == 0x{control:X2}");    }}

Here we’ve put our Boolean operation into anExp function, which is then being invoked for all 8 possible combinations of inputs (each of the threebools each having two possible values). We’re then printing out the resulting “truth table,” that details the Boolean output for each possible input. With this particular Boolean expression, that yields this truth table being output:

a b c result------------1 1 1 01 1 0 11 0 1 11 0 0 00 1 1 10 1 0 00 0 1 00 0 0 0------------

We then take that lastresult column and we treat it as a binary number:

Control: 01101000 == 0x68

So the values are0 1 1 0 1 0 0 0, which we read as the binary0b01101000, which is0x68. That byte is used as a “control code” to thevpternlog instruction to encode which of the 256 possible truth tables that exist for any possible (deterministic) Boolean combination of those inputs is being chosen. This PR then teaches the JIT how to analyze the tree structures produced by the JIT to recognize such sequences of Boolean operations, compute the control code, and substitute in the use of the better instruction. Of course, the JIT isn’t going to do the enumeration I did above; turns out there’s a more efficient way to compute the control code, performing the same sequence of operations but on specific byte values instead of Booleans, e.g. this:

// dotnet run -c Release -f net9.0Console.WriteLine($"0x{Exp(0xF0, 0xCC, 0xAA):X2}");static int Exp(int a, int b, int c) => (a & (b ^ c)) | (~a & b & c);

also yields:

0x68

Why those specific three values of0xF0,0xCC, and0xAA? Let’s expand them from hex to binary:0b11110000,0b11001100,0b10101010. Look familiar? They’re the columns fora,b, andc in the earlier truth table, so we’re really just running this expression over each of the 8 rows in the table at the same time. Fun.

Another neat example is indotnet/runtime#92017 from@Ruihan-Yin, which optimizes 512-bit vector constants viabroadcast. “broadcast” is a fancy way of saying “replicate,” or “copy to each.” The instruction is used to take a single value and duplicate it to be used for each element of a vector. If, for example, I write:

Vector512<int> vector = Vector512.Create(42);

that’s broadcasting the single value42, replicating it 16 times to fill up the 512-bit vector. Now imagine I have the following C# code, which is creating aVector512<byte> composed of the byte sequence for the hex digits, but manually replicated four times, to fill up the 64 bytes that compose a 512-bit vector.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.Intrinsics;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public Vector512<byte> HexLookupTable() =>        Vector512.Create("0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"u8);}

This would result in that whole byte sequence being stored in the assembly data section, and then the JIT would emit the code to load that data into the appropriate registers; no broadcasting. But instead, the JIT should be able to recognize that this is actually the same 16-byte sequence repeated four times, store the sequence once, and then use abroadcast to load and replicate that value to fill out the vector. With this PR, that’s exactly what happens.

// .NET 8; Tests.HexLookupTable()       push      rax       vzeroupper       vmovups   zmm0,[7FB205399700]       vmovups   [rsi],zmm0       mov       rax,rsi       vzeroupper       add       rsp,8       ret; Total bytes of code 31// .NET 9; Tests.HexLookupTable()       push      rax       vbroadcasti32x4 zmm0,xmmword ptr [7F78F75290F0]       vmovups   [rsi],zmm0       mov       rax,rsi       vzeroupper       add       rsp,8       ret; Total bytes of code 28

This is beneficial for a variety of reasons, including less data to store, less data to load, and if the register containing this state needed to be spilled (meaning something else needs to be put into the register, so the value currently in the register is temporarily stored in memory), reloading it is similarly cheaper.

Two of the more far-reaching changes related to AVX512, though, come fromdotnet/runtime#97675 anddotnet/runtime#101886, which do the work to enable the JIT to utilize AVX512 “embedded masking.” Masking is a commonly needed solution when writing SIMD code; anywhere you see aConditionalSelect, that’s masking. Consider again a ternary operation, e.g.a ? (b + c) : (b - c). Here,a would be considered the “mask”: anywhere it’s true, the value ofb + c is used, and anywhere it’s false, the value ofb - c is used. If each of these wereVector512<byte>, for example, it would look like this in C#:

public static Vector512<byte> Exp(Vector512<byte> a, Vector512<byte> b, Vector512<byte> c) =>    Vector512.ConditionalSelect(a, b + c, b - c);

And guess what I’d get for assembly? You guessed it, our good friendvpternlogd:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;using System.Runtime.Intrinsics;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public Vector512<byte> Test() => Exp(default, default, default);    [MethodImpl(MethodImplOptions.NoInlining)]    public static Vector512<byte> Exp(Vector512<byte> a, Vector512<byte> b, Vector512<byte> c) =>        Vector512.ConditionalSelect(a, b + c, b - c);}

; Tests.Exp(System.Runtime.Intrinsics.Vector512`1<Byte>, System.Runtime.Intrinsics.Vector512`1<Byte>, System.Runtime.Intrinsics.Vector512`1<Byte>)       vzeroupper       vmovups   zmm0,[rsp+48]       vmovups   zmm1,[rsp+88]       vpaddb    zmm2,zmm0,zmm1       vpsubb    zmm0,zmm0,zmm1       vmovups   zmm1,[rsp+8]       vpternlogd zmm1,zmm2,zmm0,0CA       vmovups   [rdi],zmm1       mov       rax,rdi       vzeroupper       ret; Total bytes of code 68

We can see it’s computing both theb + c (vpaddb zmm2,zmm0,zmm1) and theb - c (vpsubb zmm0,zmm0,zmm1), and it’s then choosing between them based on the mask ([rsp+8], aka thea parameter). In this example, the maska was being passed in and computed in a manner unknown to theConditionalSelect. A more common scheme, however, is that the mask is computed as an argument to theConditionalSelect. Let’s say for example that instead of passing ina as a mask, we pass inVector512.LessThan(b, c) as the mask:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;using System.Runtime.Intrinsics;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public Vector512<byte> Test() => Exp(default, default);    [MethodImpl(MethodImplOptions.NoInlining)]    public static Vector512<byte> Exp(Vector512<byte> b, Vector512<byte> c) =>        Vector512.ConditionalSelect(Vector512.LessThan(b, c), b + c, b - c);}

AVX512 supports this implicitly via embedded masking, which means that instructions can include the masking operation as part of them rather than performing the operation separately and then doing the masking viavpternlogd. Instructions like the comparison operation employed byLessThan can target storing the result into a new kind of register defined by AVX512, a mask register, and then that mask register can be used as part of other compound operations to incorporate the mask into them. .NET developers don’t need to do anything to take advantage of this support, though: the JIT just uses the specialized masking instructions where it sees an opportunity to do so. For the previous example, on .NET 8, we’d get this:

; Tests.Exp(System.Runtime.Intrinsics.Vector512`1<Byte>, System.Runtime.Intrinsics.Vector512`1<Byte>)       vzeroupper       vmovups   zmm0,[rsp+8]       vmovups   zmm1,[rsp+48]       vpcmpltub k1,zmm0,zmm1       vpmovm2b  zmm2,k1       vpaddb    zmm3,zmm0,zmm1       vpsubb    zmm0,zmm0,zmm1       vpternlogd zmm2,zmm3,zmm0,0CA       vmovups   [rdi],zmm2       mov       rax,rdi       vzeroupper       ret; Total bytes of code 70

Here we still have avpternlogd. But, with the aforementioned PRs, now here’s what we get on .NET 9:

; Tests.Exp(System.Runtime.Intrinsics.Vector512`1<Byte>, System.Runtime.Intrinsics.Vector512`1<Byte>)       vmovups   zmm0,[rsp+8]       vmovups   zmm1,[rsp+48]       vpcmpltub k1,zmm0,zmm1       vpsubb    zmm2,zmm0,zmm1       vpaddb    zmm2{k1},zmm0,zmm1       vmovups   [rdi],zmm2       mov       rax,rdi       vzeroupper       ret; Total bytes of code 54

Thatvpcmpltub instruction is doing theLessThan betweenb andc and storing the result as a mask in thek1 masking register. Thevpsubbfor theb - c still happens as it did before. But now theb + c operation is significantly different, and note there’s novpternlogd anymore. Thevpternlogd and thevpaddb we previously saw have now effectively been folded into a singlevpaddb instructionwith the mask. The result of theb - c is sitting in thezmm2 register. Thevpaddb instruction then performs the addition betweenzmm0 (b) andzmm1 (c), and uses the maskk1 to decide whether to use that addition result or the existing subtraction result inzmm2. (dotnet/runtime#97468 also enables some such usage ofvpternlogd to instead usevblendmps.vblendmps is similar tovpternlogd except that it’s specific to floating-point and works with one of the dedicated mask registers.)

dotnet/runtime#97529 also improved casting fromdouble andfloat to integer types, in particular when AVX512 is available such that it can benefit from dedicated AVX512 instructions for the purpose, e.g. theVCVTTSD2USI (Convert With Truncation Scalar Double Precision Floating-Point Value to Unsigned Integer) instruction.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Linq; BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); [DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private double[] _doubles = Enumerable.Range(0, 1024).Select(i => (double)i).ToArray();    private ulong[] _ulongs = new ulong[1024];     [Benchmark]    public void DoubleToUlong()    {        ReadOnlySpan doubles = _doubles;        Span ulongs = _ulongs;        for (int i = 0; i < doubles.Length; i++)        {            ulongs[i] = (ulong)doubles[i];        }    }}

Method	Runtime	Mean	Ratio	Code Size
DoubleToUlong	.NET 8.0	1,386.5 ns	1.00	135 B
DoubleToUlong	.NET 9.0	461.4 ns	0.33	102 B

Vectorization

In addition to improvements that teach the JIT about entirely new architectures, there have also been a plethora of improvements that simply help the JIT to better employ SIMD in general.

One of my favorites isdotnet/runtime#92852, which merges consecutive stores into a single operation. Consider wanting to implement a method likebool.TryFormat:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private bool _value;    private char[] _destination = new char[10];    [Benchmark]    public bool TryFormat() => TryFormat(_destination, out _);    private bool TryFormat(char[] destination, out int charsWritten)    {        if (_value)        {            if (destination.Length >= 4)            {                destination[0] = 'T';                destination[1] = 'r';                destination[2] = 'u';                destination[3] = 'e';                charsWritten = 4;                return true;            }        }        else        {            if (destination.Length >= 5)            {                destination[0] = 'F';                destination[1] = 'a';                destination[2] = 'l';                destination[3] = 's';                destination[4] = 'e';                charsWritten = 5;                return true;            }        }        charsWritten = 0;        return false;    }}

Pretty simple: we’re writing out each individual value. That’s a bit unfortunate, though, in that we’re naively then spending severalmovs to write each character individually, when instead we could pack all of those values into a single value to write. In fact, that’s exactly what the realbool.TryFormat does. Here is its handling of thetrue case today:

if (destination.Length > 3){    ulong true_val = BitConverter.IsLittleEndian ? 0x65007500720054ul : 0x54007200750065ul; // "True"    MemoryMarshal.Write(MemoryMarshal.AsBytes(destination), in true_val);    charsWritten = 4;    return true;}

The developer has manually done the work of computing the value of the merged writes, e.g.

ulong true_val = (((ulong)'e' << 48) | ((ulong)'u' << 32) | ((ulong)'r' << 16) | (ulong)'T')Assert.Equal(0x65007500720054ul, true_val);

in order to be able to perform a single write rather than doing four individual ones. For this particular case, now in .NET 9, the JIT can automatically do this merging so the developer doesn’t have to. The developer just writes the code that’s natural to write, and the JIT does the heavy lifting of optimizing its output (note below themov rax, 65007500720054 instruction, loading the same value we manually computed above).

// .NET 8; Tests.TryFormat(Char[], Int32 ByRef)       push      rbp       mov       rbp,rsp       cmp       byte ptr [rdi+10],0       jne       short M01_L01       mov       ecx,[rsi+8]       cmp       ecx,5       jl        short M01_L00       mov       word ptr [rsi+10],46       mov       word ptr [rsi+12],61       mov       word ptr [rsi+14],6C       mov       word ptr [rsi+16],73       mov       word ptr [rsi+18],65       mov       dword ptr [rdx],5       mov       eax,1       pop       rbp       retM01_L00:       xor       eax,eax       mov       [rdx],eax       pop       rbp       retM01_L01:       mov       ecx,[rsi+8]       cmp       ecx,4       jl        short M01_L00       mov       word ptr [rsi+10],54       mov       word ptr [rsi+12],72       mov       word ptr [rsi+14],75       mov       word ptr [rsi+16],65       mov       dword ptr [rdx],4       mov       eax,1       pop       rbp       ret; Total bytes of code 112// .NET 9; Tests.TryFormat(Char[], Int32 ByRef)       push      rbp       mov       rbp,rsp       cmp       byte ptr [rdi+10],0       jne       short M01_L00       mov       ecx,[rsi+8]       cmp       ecx,5       jl        short M01_L01       mov       rax,73006C00610046       mov       [rsi+10],rax       mov       word ptr [rsi+18],65       mov       dword ptr [rdx],5       mov       eax,1       pop       rbp       retM01_L00:       mov       ecx,[rsi+8]       cmp       ecx,4       jl        short M01_L01       mov       rax,65007500720054       mov       [rsi+10],rax       mov       dword ptr [rdx],4       mov       eax,1       pop       rbp       retM01_L01:       xor       eax,eax       mov       [rdx],eax       pop       rbp       ret; Total bytes of code 92

dotnet/runtime#92939 improves this further by enabling longer sequences to similarly be merged using SIMD instructions.

Of course, you may then wonder, why wasn’tbool.TryFormat reverted to use the simpler code? The unfortunate answer is that this optimization only currently applies to array targets rather than span targets. That’s because there are alignment requirements for performing these kinds of writes, and whereas the JIT can make certain assumptions about the alignment of arrays, it can’t make those same assumptions about spans, which can represent slices of something else at unaligned boundaries. This is now one of the few cases where arrays are better than spans; typically span is as good or better. But I’m hopeful it will be improved in the future.

Another nice improvement isdotnet/runtime#86811 from@BladeWise, which adds SIMD support for multiplying two vectors ofbytes orsbytes. Previously this would end up falling back to a software implementation, which is very slow compared to true SIMD operations. Now, the code is much faster and much more compact.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.Intrinsics;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private Vector128<byte> _v1 = Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);    [Benchmark]    public Vector128<byte> Square() => _v1 * _v1;}

Method	Runtime	Mean	Ratio
Square	.NET 8.0	15.4731 ns	1.000
Square	.NET 9.0	0.0284 ns	0.002

dotnet/runtime#103555 (x64, when AVX512 isn’t available) anddotnet/runtime#104177 (Arm64) also improve vector multiplication, this time forlong/ulong. This can be seen with a simple micro-benchmark (and because I’m running on a machine that supports AVX512, the benchmark is explicitly disabling it):

// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Configs;using BenchmarkDotNet.Jobs;using BenchmarkDotNet.Environments;using BenchmarkDotNet.Running;using System.Runtime.Intrinsics;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, DefaultConfig.Instance    .AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_EnableAVX512F", "0").AsBaseline())    .AddJob(Job.Default.WithId(".NET 9").WithRuntime(CoreRuntime.Core90).WithEnvironmentVariable("DOTNET_EnableAVX512F", "0")));[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private Vector256<long> _a = Vector256.Create(1, 2, 3, 4);    private Vector256<long> _b = Vector256.Create(5, 6, 7, 8);    [Benchmark]    public Vector256<long> Multiply() => Vector256.Multiply(_a, _b);}

Method	Runtime	Mean	Ratio
Multiply	.NET 8.0	9.5448 ns	1.00
Multiply	.NET 9.0	0.3868 ns	0.04

It’s also evident, however, on higher-level benchmarks, for example on this benchmark forXxHash128, an implementation that makes heavy use of multiplication of such vectors.

// Add a <PackageReference Include="System.IO.Hashing" Version="8.0.0" /> to the csproj.// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using BenchmarkDotNet.Configs;using BenchmarkDotNet.Jobs;using BenchmarkDotNet.Environments;using System.IO.Hashing;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, DefaultConfig.Instance    .AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_EnableAVX512F", "0").AsBaseline())    .AddJob(Job.Default.WithId(".NET 9").WithRuntime(CoreRuntime.Core90).WithEnvironmentVariable("DOTNET_EnableAVX512F", "0")));[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private byte[] _data;    [GlobalSetup]    public void Setup()    {        _data = new byte[1024 * 1024];        new Random(42).NextBytes(_data);    }    [Benchmark]    public UInt128 Hash() => XxHash128.HashToUInt128(_data);}

This benchmark references the System.IO.Hashing nuget package. Note that we’re explicitly adding in a reference to the 8.0.0 version; that means that even when running on .NET 9, we’re using the .NET 8 version of the hashing code, yet it’s still significantly faster, because of these runtime improvements.

Method	Runtime	Mean	Ratio
Hash	.NET 8.0	40.49 us	1.00
Hash	.NET 9.0	26.40 us	0.65

Some other notable examples:

Improved SIMD comparisons.dotnet/runtime#104944 anddotnet/runtime#104215 improve how vector comparisons are handled.
Improved ConditionalSelects.dotnet/runtime#104092 from@ezhevita improves the generated code forConditionalSelects when the condition is a set of constants.
Better Const Handling. Certain operations are only optimized when one of their arguments is a constant, otherwise falling back to a much slower software emulation implementation.dotnet/runtime#102827 enables such instructions (like for shuffling) to continue to be treated as optimized operations if the non-const argument becomes a constant as part of other optimizations (like inlining).
Unblocking other optimizations. Some changes don’t themselves introduce optimizations, but instead make tweaks that enable other optimizations to do a better job.dotnet/runtime#104517 decomposes some bitwise operations (e.g. replacing a unified “and not” operation with an “and” and a “not”), which in turn enables other existing optimizations like common sub-expression elimination (CSE) to kick in more often. Anddotnet/runtime#104214 normalized various negation patterns, which similarly enables other optimizations to apply in more places.

Branching

Just like the JIT tries to elide redundant bounds checking, where it can prove the bounds check is unnecessary, it similarly does so for branching.

The ability to handle the relationships between branches is improved in .NET 9. Consider this benchmark:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments(50)]    public void Test(int x)    {        if (x > 100)        {            Helper(x);        }    }    private void Helper(int x)    {        if (x > 10)        {            Console.WriteLine("Hello!");        }    }}

TheHelper function is simple enough to be inlined, and in .NET 8 we end up with this assembly:

; Tests.Test(Int32)       push      rbp       mov       rbp,rsp       cmp       esi,64       jg        short M00_L01M00_L00:       pop       rbp       retM00_L01:       cmp       esi,0A       jle       short M00_L00       mov       rdi,7F35E44C7E18       pop       rbp       jmp       qword ptr [7F35E914C7C8]; Total bytes of code 33

We can see in the original code that the branch within the inlinedHelper is entirely unnecessary: we’re only there ifx is greater than 100, so it’s definitely greater than 10, yet in the assembly code, we have both comparisons happening (notice the twocmps). Now in .NET 9, thanks todotnet/runtime#95234 which improves the JIT’s ability to reason about the relationship between two ranges and whether one is implied by the other, we get this instead:

; Tests.Test(Int32)       cmp       esi,64       jg        short M00_L00       retM00_L00:       mov       rdi,7F81C120EE20       jmp       qword ptr [7F8148626628]; Total bytes of code 22

Just the one outercmp. The same thing happens for the negative case: if we tweak thex > 10 to instead bex < 10, we end up with this:

// .NET 8; Tests.Test(Int32)       push      rbp       mov       rbp,rsp       cmp       esi,64       jg        short M00_L01M00_L00:       pop       rbp       retM00_L01:       cmp       esi,0A       jge       short M00_L00       mov       rdi,7F6138428DE0       pop       rbp       jmp       qword ptr [7FA1DDD4C7C8]; Total bytes of code 33// .NET 9; Tests.Test(Int32)       ret; Total bytes of code 1

Similar to thex > 10 case, on .NET 8 the JIT retained both branches. But on .NET 9, it recognized that not only was the inner conditional redundant, it was redundant in a way that would make it always false, which then allowed it to dead-code eliminate the body of thatif, leaving the whole method a nop.dotnet/runtime#94689 makes this kind of information flow by enabling the JIT’s support for “cross-block local assertion prop”.

Another PR that eliminated some redundant branches isdotnet/runtime#94563, which feeds information from value numbering (a technique used to eliminate redundant expressions by giving every unique expression its own unique identifier) into the building of PHIs (a kind of node in the JIT’s intermediate representation of the code that aids in determining a variable’s value based on control flow). Consider this benchmark:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public unsafe class Tests{    [Benchmark]    [Arguments(50)]    public void Test(int x)    {        byte[] data = new byte[128];        fixed (byte* ptr = data)        {            Nop(ptr);        }    }    [MethodImpl(MethodImplOptions.NoInlining)]    private static void Nop(byte* ptr) { }}

This is allocating abyte[] and then pinning it in order to use it with a method that requires a pointer. The C# specification forfixed width arrays states “If the array expression isnull or if the array has zero elements, the initializer computes an address equal to zero,” and as such if you look at the IL for this code, you’ll see that it’s checking the length and setting the pointer equal to 0 if the length is 0. You can see this same behavior explicitly implemented as well for spans if you look atSpan<T>‘sGetPinnableReference implementation:

public ref T GetPinnableReference(){    ref T ret = ref Unsafe.NullRef<T>();    if (_length != 0) ret = ref _reference;    return ref ret;}

As such, there’s actually an extra branch not visible in theTests.Test test. But, in this particular case, that branch is also redundant, because we can very clearly see (and the JIT should be able to as well) that the length of the array is non-0. On .NET 8, we still get that branch:

; Tests.Test(Int32)       push      rbp       sub       rsp,10       lea       rbp,[rsp+10]       xor       eax,eax       mov       [rbp-8],rax       mov       rdi,offset MT_System.Byte[]       mov       esi,80       call      CORINFO_HELP_NEWARR_1_VC       mov       [rbp-8],rax       mov       rdi,[rbp-8]       cmp       dword ptr [rdi+8],0       je        short M00_L01       mov       rdi,[rbp-8]       cmp       dword ptr [rdi+8],0       jbe       short M00_L02       mov       rdi,[rbp-8]       add       rdi,10M00_L00:       call      qword ptr [7F3F99B45C98]; Tests.Nop(Byte*)       xor       eax,eax       mov       [rbp-8],rax       add       rsp,10       pop       rbp       retM00_L01:       xor       edi,edi       jmp       short M00_L00M00_L02:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 96

But now on .NET 9, that branch (in fact, multiple redundant branches) is removed:

; Tests.Test(Int32)       push      rax       xor       eax,eax       mov       [rsp],rax       mov       rdi,offset MT_System.Byte[]       mov       esi,80       call      CORINFO_HELP_NEWARR_1_VC       mov       [rsp],rax       add       rax,10       mov       rdi,rax       call      qword ptr [7F22DAC844C8]; Tests.Nop(Byte*)       xor       eax,eax       mov       [rsp],rax       add       rsp,8       ret; Total bytes of code 55

dotnet/runtime#87656 is another nice example and addition to the JIT’s optimization repertoire. As was discussed earlier, branches have costs associated with them. A hardware’s branch predictor can often do a very good job of mitigating the bulk of those costs, but there’s still some, and even if it were fully mitigated in the common case, a branch prediction failure can be relatively very costly. As such, minimizing branches can be very helpful, and if nothing else, turning branch-based operations into branchless ones leads to more consistent and predictable throughput, as it’s then less subject to the nature of the data being processed. Consider the following function that’s used to determine whether a character is a particular subset of whitespace characters:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments('s')]    public bool IsJsonWhitespace(int c)    {        if (c == ' ' || c == '\t' || c == '\r' || c == '\n')        {            return true;        }        return false;    }}

On .NET 8, we get what you’d probably expect, a series ofcmps followed by conditional jumps, one for each character:

; Tests.IsJsonWhitespace(Int32)       push      rbp       mov       rbp,rsp       cmp       esi,20       je        short M00_L00       cmp       esi,9       je        short M00_L00       cmp       esi,0D       je        short M00_L00       cmp       esi,0A       je        short M00_L00       xor       eax,eax       pop       rbp       retM00_L00:       mov       eax,1       pop       rbp       ret; Total bytes of code 35

On .NET 9, though, we now get this:

; Tests.IsJsonWhitespace(Int32)       push      rbp       mov       rbp,rsp       cmp       esi,20       ja        short M00_L00       mov       eax,0FFFFD9FF       bt        rax,rsi       jae       short M00_L01M00_L00:       xor       eax,eax       pop       rbp       retM00_L01:       mov       eax,1       pop       rbp       ret; Total bytes of code 31

It’s now using abt instruction (a bit test) against a pattern where there’s a bit set for each of the characters being tested against, consolidating most of the branches down to just this one.

Unfortunately, this also highlights that such optimizations, which are looking for a particular pattern, can get knocked off their golden path, at which point the optimization won’t kick in. In this case, there are several ways it can get knocked off. The most obvious is if there are too many values or if they’re too spread out, such that they can’t fit into the 32-bit or 64-bit bit mask. More interesting, if you switch it to instead use C# pattern matching (e.g.c is ' ' or '\t' or '\r' or '\n'), it also doesn’t kick in. Why? Because the C# compiler itself is trying to optimize, and the pattern it ends up generating in the IL is different from what this optimization is expecting. I expect this’ll get better in the future, but it’s a good reminder that these kinds of optimizations are useful when they make arbitrary code better, but if you’re coding to the exact nature of the optimization and relying on it happening, you really need to be paying attention.

A related optimization was added indotnet/runtime#93521. Consider a function like the following, which is checking to see whether a character is a lower-case hexadecimal char:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments('s')]    public bool IsHexLower(char c)    {        if ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'))        {            return true;        }        return false;    }}

On .NET 8, we get a comparison against'0', a comparison again'9', a comparison against'a', and a comparison against'f', with a conditional branch for each:

; Tests.IsHexLower(Char)       push      rbp       mov       rbp,rsp       movzx     eax,si       cmp       eax,30       jl        short M00_L00       cmp       eax,39       jle       short M00_L02M00_L00:       cmp       eax,61       jl        short M00_L01       cmp       eax,66       jle       short M00_L02M00_L01:       xor       eax,eax       pop       rbp       retM00_L02:       mov       eax,1       pop       rbp       ret; Total bytes of code 38

But on .NET 9, we instead get this:

; Tests.IsHexLower(Char)       push      rbp       mov       rbp,rsp       movzx     eax,si       mov       ecx,eax       sub       ecx,30       cmp       ecx,9       jbe       short M00_L00       sub       eax,61       cmp       eax,5       jbe       short M00_L00       xor       eax,eax       pop       rbp       retM00_L00:       mov       eax,1       pop       rbp       ret; Total bytes of code 36

Effectively the JIT has rewritten the condition as if I’d written it like this:

(((uint)c - '0') <= ('9' - '0')) || (((uint)c - 'a') <= ('f' - 'a'))

which is nice, because it’s replaced two of the conditional branches with two (cheaper) subtractions.

Write Barriers

The .NET garbage collector (GC) is a generational collector. That means it divides the heap up logically by object age, where “generation 0” (or “gen0”) are the newest objects that haven’t been around for very long, “gen2” are the objects that have been around for a while, and “gen1” are in the middle. This approach is based on the theory (that also generally plays out in practice) that most objects end up being very short-lived, created for some task and then quickly dropped, and conversely that if an object has been around for a while, there’s a really good chance it’ll continue to be around for a while. By partitioning up objects like this, the GC can reduce the amount of work it needs to do when it scans for objects to be collected. It can do a scan focused only on gen0 objects, allowing it to ignore anything in gen1 or gen2 and thereby make its scan much faster. Or at least, that’s the goal. If it were to only scan gen0 objects, though, it could easily think a gen0 object wasn’t referenced because it couldn’t find any references to one from other gen0 objects… but there may have been a reference from a gen1 or gen2 object. That would be bad. How does the GC deal with this then, having its cake and eating it, too? It colludes with the rest of the runtime to track any time its generational assumptions might be violated. The GC maintains a table (called the “card table”) that indicates whether an object in a higher generationmight contain a reference to a lower generation object, and any time a reference is written such that there could end up being a reference from a higher generation to a lower one, this table is updated. Then when the GC does its scan, it only needs to examine higher generation objects if the relevant bit in the table is set (the table doesn’t track individual objects, just ranges of them, so it’s similar to a“Bloom filter”, where the lack of a bit means there’s definitely not a reference but the presence of a bit only means theremight be a reference).

The code that’s executed to track the reference write and possibly update the card table is referred to as a GC write barrier. And, obviously, if that code is happening every time a reference is written to an object, you really, really, really want that code to be efficient. There are actually multiple different forms of GC write barriers, all specialized for slightly different purposes.

The standard GC write barrier isCORINFO_HELP_ASSIGN_REF. However, there’s another one calledCORINFO_HELP_CHECKED_ASSIGN_REF that needs to do a bit more work. The JIT is the one deciding which of these to use, and it uses the latter when it’s possible the target isn’t on the heap, in which case the barrier needs to do a little more work to figure that out.

dotnet/runtime#98166 helps the JIT do better in a certain case. If you have a static field of a value type:

static SomeStruct s_someField;...struct SomeStruct{    public object Obj;}

the runtime implements that by having a box associated with that field for storing that struct. Such static boxes are always on the heap, so if you then do:

static void Store(object o) => s_someField.Obj = o;

the JIT can prove that the cheaper unchecked write barrier may be used, and this PR teaches it that. Previously sometimes the JIT would be able to figure it out, but this effectively ensures it.

Another similar improvement comes fromdotnet/runtime#97953. Here’s an example based onConcurrentQueue<T>, which maintains arrays of elements, each of which is the actual item tagged with a sequence number that’s used by the implementation to maintain correctness in the face of concurrency.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private Slot<object>[] _arr = new Slot<object>[1];    private object _obj = new object();    [Benchmark]    public void Test() => Store(_arr, _obj);    private static void Store<T>(Slot<T>[] arr, T o)    {        arr[0].Item = o;        arr[0].SequenceNumber = 1;    }    private struct Slot<T>    {        public T Item;        public int SequenceNumber;    }}

Here as well we can see on .NET 8 it’s using the more expensive checked write barrier, but on .NET 9 the JIT has recognized it can use the cheaper unchecked write barrier:

// .NET 8; Tests.Test()       push      rbx       mov       rbx,[rdi+8]       mov       rsi,[rdi+10]       cmp       dword ptr [rbx+8],0       jbe       short M00_L00       add       rbx,10       mov       rdi,rbx       call      CORINFO_HELP_CHECKED_ASSIGN_REF       mov       dword ptr [rbx+8],1       pop       rbx       retM00_L00:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 42// .NET 9; Tests.Test()       push      rbx       mov       rbx,[rdi+8]       mov       rsi,[rdi+10]       cmp       dword ptr [rbx+8],0       jbe       short M00_L00       add       rbx,10       mov       rdi,rbx       call      CORINFO_HELP_ASSIGN_REF       mov       dword ptr [rbx+8],1       pop       rbx       retM00_L00:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 42

dotnet/runtime#101761 actually introduces a new form of write barrier. Consider this:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private MyStruct _value;    private Wrapper _wrapper = new();    [Benchmark]    public void Store() => _wrapper.Value = _value;    private record struct MyStruct(string a1, string a2, string a3, string a4);    private class Wrapper    {        public MyStruct Value;    }}

Previously as part of copying that struct, each of those fields (represented bya1 througha4) would individually incur a write barrier:

; Tests.Store()       push      rax       mov       [rsp],rdi       mov       rax,[rdi+8]       lea       rdi,[rax+8]       mov       rsi,[rsp]       add       rsi,10       call      CORINFO_HELP_ASSIGN_BYREF       call      CORINFO_HELP_ASSIGN_BYREF       call      CORINFO_HELP_ASSIGN_BYREF       call      CORINFO_HELP_ASSIGN_BYREF       nop       add       rsp,8       ret; Total bytes of code 47

Now in .NET 9, this PR added a new bulk write barrier, which can implement the operation more efficiently.

; Tests.Store()       push      rax       mov       rsi,[rdi+8]       add       rsi,8       cmp       [rsi],sil       add       rdi,10       mov       [rsp],rdi       cmp       [rdi],dil       mov       rdi,rsi       mov       rsi,[rsp]       mov       edx,20       call      qword ptr [7F5831BC5740]; System.Buffer.BulkMoveWithWriteBarrier(Byte ByRef, Byte ByRef, UIntPtr)       nop       add       rsp,8       ret; Total bytes of code 47

Making GC write barriers faster is good; after all, they’re useda lot. However, switching from the checked write barrier to the non-checked write barrier is a very micro optimization; the extra overhead of the checked variant is often just a couple of comparisons. A better optimization is avoiding the need for a barrier entirely!dotnet/runtime#103503 recognizes thatref structs can’t possibly be on the GC heap by their very nature, and as such, write barriers can be entirely elided when writing into the fields of aref struct.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public void Store()    {        MyRefStruct s = default;        Test(ref s, new object(), new object());    }    [MethodImpl(MethodImplOptions.NoInlining)]    private void Test(ref MyRefStruct s, object o1, object o2)    {        s.Obj1 = o1;        s.Obj2 = o2;    }    private ref struct MyRefStruct    {        public object Obj1;        public object Obj2;    }}

On .NET 8, we have two barriers; on .NET 9, zero:

// .NET 8; Tests.Test(MyRefStruct ByRef, System.Object, System.Object)       push      r15       push      rbx       mov       rbx,rsi       mov       r15,rcx       mov       rdi,rbx       mov       rsi,rdx       call      CORINFO_HELP_CHECKED_ASSIGN_REF       lea       rdi,[rbx+8]       mov       rsi,r15       call      CORINFO_HELP_CHECKED_ASSIGN_REF       nop       pop       rbx       pop       r15       ret; Total bytes of code 37// .NET 9; Tests.Test(MyRefStruct ByRef, System.Object, System.Object)       mov       [rsi],rdx       mov       [rsi+8],rcx       ret; Total bytes of code 8

Similarly,dotnet/runtime#102084 is able to remove some barriers on Arm64 as part ofref struct copies.

Object Stack Allocation

For years, .NET has explored the possibility of stack-allocating managed objects. It’s something that other managed languages like Java are already capable of doing, but it’s also more critical in Java, which lacks the equivalent of value types (e.g. if you want a list of integers, that’d most likely beList<Integer>, which will box each integer value added to the list, similar to ifList<object> were used in .NET). In .NET 9, object stack allocation starts to happen. Before you get too excited, it’s limited in scope right now, but in the future it’s likely to expand out further.

The hardest part of stack allocating objects is ensuring that it’s safe. If a reference to the object were to escape and end up being stored somewhere that outlived the stack frame containing the stack-allocated object, that would be very bad; when the method returned, those outstanding references would be pointing to garbage. So, the JIT needs to perform escape analysis to ensure that never happens, and doing that well is extremely challenging. For .NET 9, the support was introduced indotnet/runtime#103361 (and brought to Native AOT indotnet/runtime#104411), and it doesn’t do any interprocedural analysis, which means it’s limited to only handling cases where it can easily prove the object reference doesn’t leave the current frame. Even so, there are plenty of situations where this will help to eliminate allocations, and I expect it’ll be expanded to handle more and more cases in the future. When the JIT does choose to allocate an object on the stack, it effectively promotes the fields of the object to be individual variables in the stack frame.

Here’s a very simple example of the mechanism in action:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public int GetValue() => new MyObj(42).Value;    private class MyObj    {        public MyObj(int value) => Value = value;        public int Value { get; }    }}

On .NET 8, the generated code forGetValue looks like this:

; Tests.GetValue()       push      rax       mov       rdi,offset MT_Tests+MyObj       call      CORINFO_HELP_NEWSFAST       mov       dword ptr [rax+8],2A       mov       eax,[rax+8]       add       rsp,8       ret; Total bytes of code 31

The generated code is allocating a new object, populating that object’sValue, and then reading thatValue as the value to return. On .NET 9, we instead end up with this picture of simplicity:

; Tests.GetValue()       mov       eax,2A       ret; Total bytes of code 6

The JIT has inlined the constructor, inlined accesses to theValue property, promoted the field backing that property to be a variable, and in effect optimized the entire operation simply to bereturn 42;.

Method	Runtime	Mean	Ratio	Code Size	Allocated	Alloc Ratio
GetValue	.NET 8.0	3.6037 ns	1.00	31 B	24 B	1.00
GetValue	.NET 9.0	0.0519 ns	0.01	6 B	–	0.00

Here’s another more impactful example. When it comes to performance optimization, it’s really nice when the right things just happen; otherwise, developers need to learn the minute differences between performing an operation this way or that way. Every programming language and platform has non-trivial amounts of such things, but we really want to drive the number of them down. One interesting case for .NET has had to do with structs and casting. Consider these twoDispose1 andDispose2 methods:

using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public void Test()    {        Dispose1<MyStruct>(default);        Dispose2<MyStruct>(default);    }    [MethodImpl(MethodImplOptions.NoInlining)]    private bool Dispose1<T>(T o)    {        bool disposed = false;        if (o is IDisposable disposable)        {            disposable.Dispose();            disposed = true;        }        return disposed;    }    [MethodImpl(MethodImplOptions.NoInlining)]    private bool Dispose2<T>(T o)    {        bool disposed = false;        if (o is IDisposable)        {            ((IDisposable)o).Dispose();            disposed = true;        }        return disposed;    }    private struct MyStruct : IDisposable    {        public void Dispose() { }    }}

Ideally, if you call them with a value typeT, there wouldn’t be any allocation, but unfortunately, inDispose1 because of how things line up here, the JIT would end up needing to boxo to produce theIDisposable. Interestingly, due to optimizations several years ago, inDispose2 the JIT is in fact able to elide the boxing. On .NET 8, we get this:

; Tests.Dispose1[[Tests+MyStruct, benchmarks]](MyStruct)       push      rbx       mov       rdi,offset MT_Tests+MyStruct       call      CORINFO_HELP_NEWSFAST       add       rax,8       mov       ebx,[rsp+10]       mov       [rax],bl       mov       eax,1       pop       rbx       ret; Total bytes of code 33; Tests.Dispose2[[Tests+MyStruct, benchmarks]](MyStruct)       mov       eax,1       ret; Total bytes of code 6

This is one of those things that a developer would have to “just know,” and also fight against tooling likeIDE0038 that pushes developers to write this code like in my first version, whereas for structs the latter ends up being more efficient. This work on stack allocation makes that difference go away, because the boxing that occurs as part of the first version is a quintessential example of the allocation the compiler is now able to stack allocate. On .NET 9, we now end up with this:

; Tests.Dispose1[[Tests+MyStruct, benchmarks]](MyStruct)       mov       eax,1       ret; Total bytes of code 6; Tests.Dispose2[[Tests+MyStruct, benchmarks]](MyStruct)       mov       eax,1       ret; Total bytes of code 6

Method	Runtime	Mean	Ratio	Code Size	Allocated	Alloc Ratio
Test	.NET 8.0	5.726 ns	1.00	94 B	24 B	1.00
Test	.NET 9.0	2.095 ns	0.37	45 B	–	0.00

Inlining

Improvements in inlining were a major focus of previous releases, and will likely be a major focus again in the future. For .NET 9, there weren’t a ton of changes, but there was one particularly impactful improvement.

As a motivating example, considerArgumentNullException.ThrowIfNull again. It is defined like this:

public static void ThrowIfNull(object? arg, [CallerArgumentExpression(nameof(arg))] string? paramName = null);

Notably, it’s non-generic, and that’s a question we get asked about at some relevant frequency. We chose not to make it generic for three reasons:

The main benefit of making it generic would be to avoid boxing structs, but the JIT already eliminated said boxing in tier 1, and as was highlighted earlier in this post, it’s possible for it to eliminate it in tier 0 as well (and now does).
Every generic instantiation (using a generic with a different type) adds runtime overhead. We didn’t want to bloat a process with such additional metadata and runtime data structures just to support argument validation that should rarely if ever fail in production.
When used with reference types (which is its raison d’etre), it would not play well with inlining, but inlining of such a “throw helper” is critical for performance. Generic methods with coreclr and Native AOT work in one of two ways. For value types, every time a generic is used with a different value type, an entire copy of the generic method is made and specialized for that parameter type; it’s as if you wrote a dedicated version of that generic code that wasn’t generic and was instead customized specifically for that type. For reference types, there’s only one copy of the code that’s then shared across all reference types, and it’s parameterized at run-time based on the actual type being used. When you access such a shared generic, at run-time it ends up looking up in a dictionary the information about the generic argument and using the discovered information to inform the rest of the method. Historically, this has not been conducive to inlining.

So,ThrowIfNull is non-generic. But, there are other throw helpers, many of them are generic. That’s because a) they’re primarily expected to work with value types, and b) we had no choice, given the nature of the methods. So, for example,ArgumentOutOfRangeException.ThrowIfEqual is generic onT, accepting two values ofT to compare and throw if they’re the same. And ifT is a reference type, on .NET 8 it may not successfully inline if the caller is a shared generic as well. With this:

// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;namespace Benchmarks;[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public unsafe class Tests{    private static void Main(string[] args) =>        BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);    [Benchmark]    public void Test() => ThrowOrDispose(new Version(1, 0), new Version(1, 1));    [MethodImpl(MethodImplOptions.NoInlining)]    private static void ThrowOrDispose<T>(T value, T invalid) where T : IEquatable<T>    {        ArgumentOutOfRangeException.ThrowIfEqual(value, invalid);        if (value is IDisposable disposable)        {            disposable.Dispose();        }    }}

on .NET 8 we get this for theThrowOrDispose method (this example benchmark has a slightly different shape from previous examples and this output is from Windows, for reasons to be made clearer shortly):

; Benchmarks.Tests.ThrowOrDispose[[System.__Canon, System.Private.CoreLib]](System.__Canon, System.__Canon)       push      rsi       push      rbx       sub       rsp,28       mov       [rsp+20],rcx       mov       rbx,rdx       mov       rsi,r8       mov       rdx,[rcx+10]       mov       rax,[rdx+10]       test      rax,rax       je        short M01_L00       mov       rcx,rax       jmp       short M01_L01M01_L00:       mov       rdx,7FF996A8B170       call      CORINFO_HELP_RUNTIMEHANDLE_METHOD       mov       rcx,raxM01_L01:       mov       rdx,rbx       mov       r8,rsi       mov       r9,1DB81B20390       call      qword ptr [7FF996AC5BC0]; System.ArgumentOutOfRangeException.ThrowIfEqual[[System.__Canon, System.Private.CoreLib]](System.__Canon, System.__Canon, System.String)       mov       rdx,rbx       mov       rcx,offset MT_System.IDisposable       call      qword ptr [7FF996664348]; System.Runtime.CompilerServices.CastHelpers.IsInstanceOfInterface(Void*, System.Object)       test      rax,rax       jne       short M01_L03M01_L02:       add       rsp,28       pop       rbx       pop       rsi       retM01_L03:       mov       rcx,rax       mov       r11,7FF9965204F8       call      qword ptr [r11]       jmp       short M01_L02; Total bytes of code 124

Two things in particular to notice here. First, we see there’s acall toCORINFO_HELP_RUNTIMEHANDLE_METHOD; that’s the helper being used to obtain information about the actual typeT being used. Second,ThrowIfEqual is not being inlined; if that were being inlined, we wouldn’t see thatcall toThrowIfEqual here but instead we’d see the actual code forThrowIfEqual. We can confirm why it’s not being inlined via another BenchmarkDotNet diagnoser:[InliningDiagnoser]. The JIT is capable of emitting events for much of its activity, including reporting on any successful or failed inlining operations, and[InliningDiagnoser] listens to those events and reports them as part of the benchmarking results. This particular diagnoser is in a separateBenchmarkDotNet.Diagnostics.Windows package and only works today when running on Windows, because it relies on ETW, hence for comparison why I made the previous benchmark also be Windows. When I add:

[InliningDiagnoser(allowedNamespaces: ["Benchmarks"])]

to myTests class, and run the benchmarks for .NET 8:

// Add a <PackageReference Include="BenchmarkDotNet.Diagnostics.Windows" Version="9.0.0" /> to the csproj.// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Diagnostics.Windows.Configs;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;namespace Benchmarks;[InliningDiagnoser(allowedNamespaces: ["Benchmarks"])][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static void Main(string[] args) =>        BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);    [Benchmark]    public void Test() => ThrowOrDispose(new Version(1, 0), new Version(1, 1));    [MethodImpl(MethodImplOptions.NoInlining)]    private static void ThrowOrDispose<T>(T value, T invalid) where T : IEquatable<T>    {        ArgumentOutOfRangeException.ThrowIfEqual(value, invalid);        if (value is IDisposable disposable)        {            disposable.Dispose();        }    }}

I see this as part of the output:

Inliner: Benchmarks.Tests.ThrowOrDispose - generic void  (!!0,!!0)Inlinee: System.ArgumentOutOfRangeException.ThrowIfEqual - generic void  (!!0,!!0,class System.String)Fail Reason: runtime dictionary lookup

In other words,ThrowOrDispose calledThrowIfEqual but couldn’t inline it becauseThrowIfEqual contained a “runtime dictionary lookup;” in other words, it’s a shared generic method.

Now on .NET 9, thanks todotnet/runtime#99265, it is inlined! The resulting assembly is too large for me to show here, but we can see the impact in the benchmark results:

Method	Runtime	Mean	Ratio
Test	.NET 8.0	17.54 ns	1.00
Test	.NET 9.0	12.76 ns	0.73

and we can see it in the inlining report as successfully inlining.

GC

Applications end up having different needs when it comes to memory management. Would you be willing to throw more memory at maximizing throughput, or do you care more about minimizing working set? How important is it that unused memory be returned to the system aggressively? Is your expected workload constant or ebbing and flowing in nature? The GC has long had lots of knobs for configuring behavior based on these kinds of questions, but none more apparent than the choice of whether to use the “workstation GC” or “server GC”.

By default, an application uses the workstation GC, though some environments (like ASP.NET) opt-in to using server GC automatically. You can explicitly opt-in in a variety of ways, including by adding<ServerGarbageCollection>true</ServerGarbageCollection> into your project file (as we did in theBenchmarking Setup section of this post). Workstation GC optimizes for reduced memory consumption, while server GC optimizes for maximum throughput. Historically, workstation employs a single heap, whereas server employs a heap per core. That typically represents a tradeoff between amount of memory consumed and the overhead of accessing a heap, such as the cost of allocating. If a bunch of threads are all trying to allocate at the same time, with server GC they’re very likely to all be accessing different heaps, thereby reducing contention, whereas with workstation GC, they’re all going to be fighting for access. Conversely, more heaps generally means more memory consumed (even though each heap could be smaller than the single one), especially in lull periods where the system might not be fully loaded, yet is paying in working set for those extra heaps.

The decision for which to use isn’t always so clear. Especially in the presence of containers, you frequently still care about really good throughput, but also don’t want to be spending memory uselessly. Enter“DATAS,” or “Dynamically Adapting To Application Sizes”. DATAS was introduced in .NET 8 and serves to narrow the gap between workstation and server GC, bringing server GC closer to workstation memory consumption. It dynamically scales how much memory is being consumed by server GC, such that in times of less load, less memory is being used. While DATAS shipped in .NET 8, it was only on by default for Native AOT-based projects, and even there it still had some issues to be sorted. Those issues have now been sorted (e.g.dotnet/runtime#98743,dotnet/runtime#100390,dotnet/runtime#102368, anddotnet/runtime#105545), such that in .NET 9, as ofdotnet/runtime#103374, DATAS is now enabled by default for server GC.

If you have a workload where absolute best possible throughput is paramount and you’re ok with additional memory being consumed to enable that, you should feel free to disable DATAS, e.g. by adding this to your project file:

<GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>

While DATAS being enabled by default is a very impactful improvement for .NET 9, there are other GC-related improvements in the release as well. For example, when compacting heaps, the GC may end up sorting objects by addresses. For large numbers of objects, this sort can be relatively expensive, and it behooves the GC to parallelize the sorting operation. For this purpose, several releases ago the GC incorporated a parallelized sorting algorithm called vxsort, which is effectively a quicksort with a parallelized partitioning step. However, it was only enabled for Windows (and only on x64). In .NET 9, it’s enabled for Linux as well as part ofdotnet/runtime#98712. This helps to reduce GC pause times.

VM

The .NET runtime provides many services to managed code. There’s the GC, of course, and the JIT compiler, and then there’s a whole bunch of functionality around things like assembly and type loading, exception handling, configuration management, virtual dispatch, interop infrastructure, stub management, and so on. All of that functionality is generally referred to as being part of the coreclr virtual machine (VM).

Many performance changes in this area are hard to demonstrate, but they’re still worth mentioning.dotnet/runtime#101580 lazily-allocates some information related to method entrypoints, resulting in smaller heap sizes and less work on startup.dotnet/runtime#96857 also removed some unnecessary allocation happening related to data structures around methods.dotnet/runtime#96703 reduced the algorithmic complexity of some key functions involved in building up method tables, whiledotnet/runtime#96466 streamlined access to those tables, minimizing the number of indirections involved.

Another set of changes went in to improving various calls from managed code into the VM. When managed code needs to call into the runtime, it has a couple of mechanisms it can employ. One is a “QCALL,” which is effectively just a P/Invoke /DllImport into functions declared in the runtime. The other is an “FCALL,” a much more specialized and complicated mechanism for invoking runtime code that’s capable of accessing managed objects. FCALL used to be the dominant mechanism, but each release more and more such calls are transitioned over to being QCALLs, which helps with both correctness (FCALLs can be hard to “get right”) and in some cases performance (some FCALLS need helper method frames that in turn typically make them more expensive than QCALLs).dotnet/runtime#96860 converted over members ofMarshal,dotnet/runtime#96916 did so forInterlocked,dotnet/runtime#96926 handled several more threading-related members,dotnet/runtime#97432 converted some of the built-in marshaling support,dotnet/runtime#97469 anddotnet/runtime#100939 handled methods fromGC and throughout reflection,dotnet/runtime#103211 from@AustinWise convertedGC.ReRegisterForFinalize, anddotnet/runtime#105584 convertedDelegate.GetMulticastInvoke (which is used in APIs likeDelegate.Combine andDelegate.Remove).dotnet/runtime#97590 did the same for the slow path inValueType.GetHashCode, while also converting the fast path to managed to avoid the transition entirely.

But arguably the most impactful change in this area for .NET 9 is around exceptions. Exceptions are expensive and should be avoided where performance matters. But… just because they’re expensive doesn’t mean it’s not valuable to make them less expensive. And in fact, there are cases where it’s really worthwhile to make them less expensive. One of the things we sporadically observe in the wild are “exception storms.” Some failure happens, which causes another failure, which causes another. Each of those incurs exceptions. CPU consumption starts to spike as the overhead of those exceptions is incurred. Now other things start to time out because they’re getting starved, and they throw exceptions, which in turn causes more failures. You get the idea.

InPerformance Improvements in .NET 8, I highlighted that in my opinion the single most important performance improvement in the release was a single character change, enabling dynamic PGO by default. Now in .NET 9,dotnet/runtime#98570 is similar, a super small and simple PR that belies all the work that came before it. Earlier on,dotnet/runtime#88034 had ported the Native AOT exception handling implementation over to coreclr, but it was disabled by default due to still needing bake time. It’s now had that bake time, and the new implementation is now on by default in .NET 9. And it’s much faster. Things get better still withdotnet/runtime#103076, which removes a global spinlock involved in the handling of exceptions.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public async Task ExceptionThrowCatch()    {        for (int i = 0; i < 1000; i++)        {            try { await Recur(10); } catch { }        }    }    private async Task Recur(int depth)    {        if (depth <= 0)        {            await Task.Yield();            throw new Exception();        }        await Recur(depth - 1);    }}

Method	Runtime	Mean	Ratio
ExceptionThrowCatch	.NET 8.0	123.03 ms	1.00
ExceptionThrowCatch	.NET 9.0	54.68 ms	0.44

Mono

We frequently say “the runtime,” but in reality there are currently multiple runtime implementations in .NET. “coreclr” is the runtime thus far referred to, which is the default runtime used on Windows, Linux, and macOS, and for services and desktop applications, but there’s also “mono,” which is mainly used when the form factor of the target application requires a small runtime: by default, it’s the runtime that’s used when building mobile apps for Android and iOS today, as well as the runtime used for Blazor WASM apps. mono has also seen a multitude of performance improvements in .NET 9:

Save/restoring of profile data. One of the features provided by mono is an interpreter, which enables .NET code to execute in environments where JIT’ing isn’t permitted, as well as to enable faster startup. Specifically for when targeting WASM, the interpreter has a form of PGO where after methods have been invoked some number of times and are deemed important, it’ll generate WASM on-the-fly to optimize those methods. This tiering gets better in .NET 9 withdotnet/runtime#92981, which enables keeping track of which methods tiered up, and if the code is running in a browser, storing that information in the browser’s cache for subsequent runs. When the code then runs subsequently, it can incorporate the previous learnings to tier up better and more quickly.
SSA-based Optimization. The compiler that generates that WASM applied optimizations primarily at the basic block level.dotnet/runtime#96315 overhauls the implementation to employ Static Single Assignment (SSA) form, which is commonly used by optimizing compilers and which ensures that every variable is assigned in exactly one location. That form simplifies many resulting analyses and thus helps to better optimize the code.
Vector improvements. More and more vectorization is being done by the core libraries, utilizing hardware intrinsics and the variousVector types. To enable such library code to execute well on mono, the various mono backends need to also handle those operations efficiently. One of the most impactful changes here isdotnet/runtime#105299, which updated mono to accelerateShuffle for types other thanbyte andsbyte (which were already handled). This is very impactful to functionality in the core libraries, many of which useShuffle as part of core algorithms, like throughoutIndexOfAny, hex encoding and decoding, Base64 encoding and decoding,Guid, and more.dotnet/runtime#92714 anddotnet/runtime#98037 also improved vector construction, such as by enabling the mono JIT to utilize the Arm64ins (Insert) instruction when creating onefloat ordouble vector from the values in another.
More intrinsics.dotnet/runtime#98077,dotnet/runtime#98514, anddotnet/runtime#98710 implemented variousAdvSimd.Load* andAdvSimd.Store* APIs.dotnet/runtime#99115 anddotnet/runtime#101622 intrinsified several clearing and filling methods that backSpan<T>.Clear/Fill. Anddotnet/runtime#105150 anddotnet/runtime#104698 optimized variousUnsafe methods, such asBitCast.dotnet/runtime#91813 also significantly improved unaligned access on a variety of CPUs by not forcing the implementation down a slow path if the CPU is able to handle such reads and writes.
Startup.dotnet/runtime#100146 is a fun one, as it had accidentally positive benefits for mono startup. The change updated dotnet/runtime’s configuration to enable more static analysis, and in particular enforcingCA1865, CA1866, and CA1867, which we hadn’t yet gotten around to enabling for the repo. The change included fixing all of the violations of the rules, which mostly meant fixing call sites likeIndexOf("!") (IndexOf taking a single-characterstring) and replacing it withIndexOf('!'). The intent of the rule was that doing so is a little bit faster and the call site ends up being a little bit cleaner. ButIndexOf(string) is cultural-aware, which means using it can force the globalization library ICU to be loaded and initialized. As it turns out, some of these uses were on mono’s startup path and were forcing ICU to be loaded when it wasn’t actually necessary. Fixing those meant the loading could be delayed, and startup performance improved as a result.dotnet/runtime#101312 also improved startup with the interpreter by adding a cache to the code that does vtable setups. This uses a custom hash table implementation added indotnet/runtime#100386, which is then also used elsewhere, such as indotnet/runtime#101460 anddotnet/runtime#102476. That hash table is itself interesting, as it’s lookups are vectorized for x64, Arm, and WASM and it’s generally optimized for cache locality.
Variance check removal. When storing objects into an array, that operation needs to be validated to ensure compatibility between the type being stored and the concrete type of the array. Given a base typeB and two derived typesD1 : B andD2 : B, you could have an arrayB[] array = new D1[42];, and then the codearray[0] = new D2() would successfully compile, becauseD2 is aB, but at run-time this must fail, asD2 is not aD1, and so the runtime needs a check to ensure correctness. If the array’s type is sealed, though, this check can be avoided, since then you can’t end up with this discrepancy. coreclr already does that optimization; now as part ofdotnet/runtime#99829, the mono interpreter does so as well.

Native AOT

Native AOT is a solution for generating native executables directly from .NET applications. The resulting binary doesn’t require .NET to be installed and does not require JIT’ing; instead it contains in it all of the assembly code for the whole app, inclusive of the code for any core library functionality accessed, the assembly for the garbage collector, and so on. Native AOT first shipped in .NET 7 and was then significantly improved for .NET 8, in particular around reducing the size of the resulting applications. Now in .NET 9, investment continues in Native AOT, with some very nice fruits of the labor on it. (Note that the Native AOT tool chain uses the JIT to generate assembly code, so most of the code generation improvements discussed in theJIT section and elsewhere in this post accrue to Native AOT as well.)

One of the biggest concerns for Native AOT is size and trimming. Native AOT-based applications and libraries compile everything, all user code, all the library code, the runtime, everything, into the single native binary. It’s thus imperative that the tool chain goes to extra lengths to get rid of as much as possible in order to keep that size down. This can include being more clever about how the runtime stores the state necessary for execution. It can include being more thoughtful about generics in order to reduce the possible code size explosion that can result from lots of generic instantiations (effectively multiple copies of the exact same code all specialized for different type arguments). And it can include being very diligent about avoiding dependencies that can bring in lots of code unexpectedly and that the trimming tools are unable to reason about enough to remove. Here are some examples of all of these in .NET 9:

Refactoring choke points. Think through your code: how many times have you written a method that takes some input and then dispatches to one of many different kinds of things based on the input provided? That’s reasonably common. Unfortunately, it can also be problematic for Native AOT code size. A good example is fixed bydotnet/runtime#91185 inSystem.Security.Cryptography. There are a bunch of hashing related types, likeSHA256 orSHA3_384, that all offer aHashData method. Then there are places where the exact hashing algorithm to be used is specified via aHashAlgorithmName. You can likely envision the large switch statement that results (or don’t imagine and instead just look atthe code), where based on the exactHashAlgorithmName specified, the implementation selects the right type’sHashData method to call. That is what’s often referred to as a “choke point,” where all callers end up coming through this one method, which then fans out to the relevant implementations, but that also then causes this size problem for Native AOT: if that choke point is referenced, it typically ends up needing to generate the code for all of the referenced methods, even if only a subset are actually used. Some of these cases are really challenging to solve. In this particular case, though, thankfully all of thoseHashData methods turned around and called to a parameterized, shared implementation. So the fix was to just skip the middle tier and have theHashAlgorithmName layer go directly to the workhorse implementation, without naming the intermediate layer methods.
Less LINQ. LINQ is an amazing productivity tool. We love LINQ and invest in it every release of .NET (see the later section in this post ontons of performance improvements in LINQ in .NET 9). With Native AOT, however, significant use of LINQ can also measurably increase code size, in particular when value types are involved. As will be discussed later when talking about LINQ optimizations, one of the optimizations LINQ employs is to special-case based on the inputs what kind ofIEnumerable<T> its methods give back. So, for example, if you callSelect with an array input, theIEnumerable<T> you get back might actually be an instance of the internalArraySelectIterator<T>, and if you callSelect with aList<T>, theIEnumerable<T> you get back might actually be an instance of the internalListSelectIterator<T>. The Native AOT trimmer can’t readily determine which of those paths might be used, so the Native AOT compiler needs to generate code for all such types when you callSelect<T>. If theT is a reference type, there will just be a single copy of the generated code shared for all reference types. But if theT is a value type, there will be a custom stamp of the code generated for and optimized for each uniqueT. That means if such LINQ APIs (and other similar APIs) are used a lot, they can disproportionately increase the size of a Native AOT binary.dotnet/runtime#98109 is an example of a PR that replaced a bit of LINQ code in order to measurably reduce the size of ASP.NET applications compiled with Native AOT. But you can also see that PR being thoughtful about which LINQ usage was removed, citing these few specific instances making a measurable difference and leaving the rest of the LINQ usage in the library intact.
Avoiding unnecessary array types. TheSharedArrayPool<T> that backsArrayPool<T>.Shared was storing lots of state, including several fields with types along the lines ofT[][]. This makes sense; it’s pooling arrays, so it needs an array of arrays. From a Native AOT perspective, though, ifT is a value type (as is very common withArrayPool<T>),T[][] as its own unique array type needs its own code generated for it, distinct from the code for, for example,T[]. As it turns out,ArrayPool<T> doesn’t actually need to work with the array instances in these cases, so it doesn’t actually need the strongly-typed nature of the arrays; this could just as well beobject[] orArray[]. And that’s one of the main things thatdotnet/runtime#97058 does: with that, the compiled binary can carry the code generated for justArray[] rather than needing code forbyte[][] andchar[][] andobject[][] and for whatever other typeTs are used withArrayPool<T> in the application.
Avoiding unnecessarily generic code. The Native AOT compiler doesn’t do any kind of “outlining” today (the opposite of inlining, where, rather than moving code from a called method into the caller, the compiler would extract code from a method out into a separate method). If you have a large method, the compiler will need to generate code for the whole method, and if that method is generic and multiple generic specializations are compiled, the whole method will be compiled and optimized for each. But, if you have any meaningful amounts of code in that method that don’t actually depend on the generic types in question, you can avoid that duplication by refactoring the code into separate non-generic methods that are invoked by the generic. That’s whatdotnet/runtime#101474 does in some of the types inMicrosoft.Extensions.Logging.Console, likeSimpleConsoleFormatter andJsonConsoleFormatter. There’s a genericWrite<TState> method, but theTState is only used in the very first line of the method, which formats the arguments into a string. After that, there’s a lot of logic about doing the actual writing, but all of it only needs the output of that formatting operation, not the input. So, this PR simply refactors thatWrite<TState> to do the formatting and then delegates to the bulk of the work in a separate non-generic method.
Cutting out unnecessary dependencies. There are many small but meaningful dependencies one doesn’t think about until they start focusing on generated code size and zooming in on exactly where all that code size is coming from.dotnet/runtime#95710 is a good example. TheAppContext.OnProcessExit method is rooted (never trimmed) by the runtime because it’s invoked when the process is exiting. ThatOnProcessExit was accessingAppDomain.CurrentDomain, which returns anAppDomain.AppDomain‘sToString override depends on a bunch of stuff. AndToString on a type that’s not trimmed away is itself basically never trimmed because if anything anywhere calls to the baseobject.ToString, the system needs to know that any possible derived type that might find its way to that call site will be invokable. That all means that all of that stuff used byAppDomain.ToString was never being trimmed. This small refactoring made it so that all that stuff would only need to be kept ifAppDomain.CurrentDomain is ever actually accessed by user code. Another example of this comes indotnet/runtime#101858, which removes a dependency on some of theConvert methods.
Using a better tool for the job. Sometimes there’s just a better, simpler answer.dotnet/runtime#100916 highlights one such case. Some code inMicrosoft.Extensions.DependencyInjection needed aMethodInfo for a particular method, and it was usingSystem.Linq.Expressions to extract one, when it could instead just use a delegate. That’s not only cheaper in terms of allocation and overhead, it removes a dependency on theExpressions library.
Compile time instead of run time. Source generators are a great boon for Native AOT, as they enable computing things at build time and baking the results into the assembly rather than computing those same things at run time (which, in the relevant situations, typically is done once and then cached). That’s useful for startup performance, as you’re not having to do that work just to get going. It’s useful for steady-state throughput, as you can often take more time to do a better job when the work is being done at build time. But it’s also useful for size, because it removes a dependency on anything that might have been used as part of the computation. And often that dependency is reflection, which brings with it a lot of size. As it turns out,System.Private.CoreLib has its own source generator that’s used when buildingCoreLib, anddotnet/runtime#102164 augmented that source generator to generate a dedicated implementation ofEnvironment.Version andRuntimeInformation.FrameworkDescription. Previously, both of these methods that are implemented inCoreLib would use reflection to look up attributes also inCoreLib, but that’s something the source generator can instead do at a build time, and just bake the answer into the implementation of these methods.
Avoiding duplication. It’s not uncommon to have two methods somewhere in your app that have the same implementations, especially for small helper methods, like property accessors.dotnet/runtime#101969 teaches the Native AOT tool chain to deduplicate those, such that the code is only stored once.
Interfaces be gone. Previously, unused interface methods could be trimmed away (effectively removing them from the interface type and removing all implementations of that method), but the compiler wasn’t able to fully remove the actual interface types themselves. Now withdotnet/runtime#100000, it can.
Unnecessary static constructors. The trimmer was keeping the static constructor of a type if any field was accessed. This is unnecessarily broad: those static constructors only need to be kept if astatic field was accessed.dotnet/runtime#96656 improves that.

Previous releases saw a considerable amount of time spent on driving down binary sizes, but these kinds of changes chip away at them even further. Let’s create a new ASP.NET minimal APIs application using Native AOT. This command uses thewebapiaot template and creates the new project in a newmyapp directory:

dotnet new webapiaot -o myapp

Replace the contents of the generatedmyapp.csproj with this:

<Project Sdk="Microsoft.NET.Sdk.Web">  <PropertyGroup>    <TargetFrameworks>net9.0;net8.0</TargetFrameworks>    <Nullable>enable</Nullable>    <ImplicitUsings>enable</ImplicitUsings>    <InvariantGlobalization>true</InvariantGlobalization>    <PublishAot>true</PublishAot>    <OptimizationPreference>Size</OptimizationPreference>    <StackTraceSupport>false</StackTraceSupport>  </PropertyGroup></Project>

All I’ve done on top of the template’s defaults is have bothnet9.0 andnet8.0 as target frameworks and then add a couple of settings (at the bottom) focused on driving down the size of Native AOT apps. The app is a simple site that exposes a/todos list as JSON. Todos application

We can publish this app with Native AOT:

dotnet publish -f net8.0 -r linux-x64 -c Releasels -hs bin/Release/net8.0/linux-x64/publish/myapp

which yields:

9.4M bin/Release/net8.0/linux-x64/publish/myapp

We can see here that the whole site, web server, garbage collector, everything, are contained inmyapp app, which on .NET 8 is weighing in at 9.4 megabytes. Now, let’s do the same thing for .NET 9:

dotnet publish -f net9.0 -r linux-x64 -c Releasels -hs bin/Release/net9.0/linux-x64/publish/myapp

which results in:

8.5M bin/Release/net9.0/linux-x64/publish/myapp

Now, just by moving to the new version, that samemyapp has shrunk to 8.5 megabytes, an ~10% reduction in binary size.

Beyond a focus on size, ahead-of-time compilation also differs from just-in-time compilation in that each has their own opportunities for unique optimizations. The JIT can see the exact details of the current machine and employ the best possible instructions based on what’s available (e.g. using AVX512 instructions on hardware that supports it), and the JIT can use dynamic PGO to evolve the code based on execution characteristic. But, Native AOT is capable of doing whole program optimization, where it can look at everything in the program and optimize based on the totality of everything involved (in contrast, a JIT’d .NET application may load additional .NET libraries at any point). For example,dotnet/runtime#92923 enables automatically making fieldsreadonly based on looking at the whole program to see if anything could possibly write to the field from outside of the constructor; this can in turn help things like improving pre-initialization.

dotnet/runtime#99761 provides a nice example where, based on whole program analysis, the compiler can see that a particular type is never instantiated. If it’s never instantiated, then type checks for that type will never succeed. And thus if a program has a check likeif (variable is SomethingNeverInstantiated), that can be turned into a constantfalse, and all of the code associated with thatif block then eliminated.dotnet/runtime#102248 is similar, but for types; if code is doingif (someType == typeof(X)) and the compiler never had to construct a method table forX, it can similarly turn this into a constant result.

Whole program analysis is also applicable to devirtualization in really cool ways. Withdotnet/runtime#92440, the compiler can now devirtualize all calls to a virtual methodC.M if the compiler doesn’t see any instantiations of types that derive fromC. And withdotnet/runtime#97812 anddotnet/runtime#97867, the compiler can now treatvirtual methods as instead being non-virtual andsealed when there are no overrides of those methods anywhere in the program.

NativeAOT also has a super power in its ability to do pre-initialization. The compiler contains an interpreter that’s able to evaluate code at build time and replace that code with just the result; for some objects, it’s then also able to blit the object’s data into the binary in a way that it can be cheaply dehydrated at execution time. The interpreter is limited in what it’s able and allowed to do, but over time its capabilities are improving.dotnet/runtime#92470 extends it to support more type checks, static interface method calls, constrained method calls, and various operations on spans, whiledotnet/runtime#92666 expands it to have some support for hardware intrinsics and the variousIsSupported methods.dotnet/runtime#92739 further rounds it out with support forstackalloc‘ing spans,IntPtr/nint math, andUnsafe.Add.

Threading

Since the beginning of .NET, general wisdom has been that the vast majority of code that needs to synchronize access to shared state should just useMonitor, either directly or more likely via the the C# language syntax for it withlock(...). There are a plethora of other synchronization primitives available, at various levels of complexity and with varying purposes, butlock(...) is the workhorse and the thing that everyone should reach for by default.

Over 20 years after the introduction of .NET, that’s evolving, just a bit.lock(...) is still the go-to syntax, but in .NET 9 as ofdotnet/runtime#87672 anddotnet/runtime#102222, there is now a dedicatedSystem.Threading.Lock type. Anywhere you were previously allocating anobject just to use thatobject withlock(...), you should consider using the newLock type. You can absolutely still just useobject, and you’ll still need to do so in certain situations, like if you’re using the “condition variable” aspects ofMonitor (such asSignal andWait), and you’ll still want to in others (such as if you’re trying to reduce managed allocation and you have another existing object that can serve double-duty as the monitor). But locking aLock can be a more efficient answer. It can also help to be self-documenting, making the code cleaner and more maintainable.

As is evident from this benchmark, the syntax for using both can be identical.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private readonly object _monitor = new();    private readonly Lock _lock = new();    private int _value;    [Benchmark]    public void WithMonitor()    {        lock (_monitor)        {            _value++;        }    }    [Benchmark]    public void WithLock()    {        lock (_lock)        {            _value++;        }    }}

Lock, however, will generally be a tad cheaper (and in the future, as most locking shifts to use the new type, we may be able to make mostobjects lighter weight by not optimizing for direct locking on arbitrary objects):

Method	Mean
WithMonitor	14.30 ns
WithLock	13.86 ns

Note that C# 13 has special-recognition ofSystem.Threading.Lock. If you look at the code that’s generated forWithMonitor above, it’s equivalent to this:

public void WithMonitor(){    object monitor = _monitor;    bool lockTaken = false;    try    {        Monitor.Enter(monitor, ref lockTaken);        _value++;    }    finally    {        if (lockTaken)        {            Monitor.Exit(monitor);        }    }}

but even though the syntax is identical, here’s an equivalent of what’s generated forWithLock:

Lock.Scope scope = _lock.EnterScope();try{    _value++;}finally{    scope.Dispose();}

We’ve also started usingLock internally.dotnet/runtime#103085 anddotnet/runtime#103104 used it instead ofobject locks inTimer,ThreadLocal, andRegisteredWaitHandle. In time, I expect to see more and more use switched over.

Of course, while locks are the default recommendation for synchronization, there’s still a lot of code that demands the higher throughput and scalability that comes from more lock-free programming, and the workhorse for such implementations isInterlocked. In .NET 9,Interlocked.Exchange andInterlocked.CompareExchange gain some very welcome capabilities. First,dotnet/runtime#92974 from@MichalPetryka,dotnet/runtime#97588 from@filipnavara, anddotnet/runtime#106660 grantInterlocked some new powers, the ability to operate over types smaller thanint. It introduces new overloads ofExchange andCompareExchange that can work onbyte,sbyte,ushort, andshort. These overloads are public and available for anyone to call, but they’re also then consumed bydotnet/runtime#97528 from@MichalPetryka to improveParallel.ForAsync<T>.ForAsync is given a range ofT to be processed, and schedules multiple workers that all need to repeatedly get the next item from the range, until the range is exhausted. For arbitrary types, that meansForAsync needs to lock to protect the increment while iterating through the range. But for types where anInterlocked operation is available, we can use that with low-lock techniques to avoid the lock entirely (both the need to access it and the need to allocate it).

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public async Task ParallelForAsync()    {        await Parallel.ForAsync('\0', '\uFFFF', async (c, _) =>        {        });    }}

Method	Runtime	Mean	Ratio
ParallelForAsync	.NET 8.0	42.807 ms	1.00
ParallelForAsync	.NET 9.0	7.184 ms	0.17

Even with those new overloads, though, there are still places it’s desirable to useInterlocked.Exchange orInterlocked.CompareExchange where they can’t be used easily. Consider the aforementionedParallel.ForAsync. It’d be really nice if we could just callInterlocked.CompareExchange<T>, butCompareExchange<T> only works with reference types. So we’re instead left with unsafe code:

static unsafe bool CompareExchange(ref T location, T value, T comparand) =>    sizeof(T) == sizeof(byte) ? Interlocked.CompareExchange(ref Unsafe.As<T, byte>(ref location), Unsafe.As<T, byte>(ref value), Unsafe.As<T, byte>(ref comparand)) == Unsafe.As<T, byte>(ref comparand) :    sizeof(T) == sizeof(ushort) ? Interlocked.CompareExchange(ref Unsafe.As<T, ushort>(ref location), Unsafe.As<T, ushort>(ref value), Unsafe.As<T, ushort>(ref comparand)) == Unsafe.As<T, ushort>(ref comparand) :    sizeof(T) == sizeof(uint) ? Interlocked.CompareExchange(ref Unsafe.As<T, uint>(ref location), Unsafe.As<T, uint>(ref value), Unsafe.As<T, uint>(ref comparand)) == Unsafe.As<T, uint>(ref comparand) :    sizeof(T) == sizeof(ulong) ? Interlocked.CompareExchange(ref Unsafe.As<T, ulong>(ref location), Unsafe.As<T, ulong>(ref value), Unsafe.As<T, ulong>(ref comparand)) == Unsafe.As<T, ulong>(ref comparand) :    throw new UnreachableException();

Another place it’d be really nice to useInterlocked.Exchange andInterlocked.CompareExchange is with enums. It’s reasonable common to use these APIs to transition between states in some algorithm, and often the ideal is for those states to be represented as an enum. However, there are no overloads of{Compare}Exchange that have worked with enums, so developers have been forced to use integers instead, often with comments stating something along the lines of “This should be an enum, but enums can’t work with CompareExchange.” Or, at least, they couldn’t, until .NET 9.

Now in .NET 9, as ofdotnet/runtime#104558 the genericExchange andCompareExchange have had theirclass constraint removed. This means use ofExchange<T> andCompareExchange<T> will compile for anyT. Then at runtime, theT is checked to ensure it’s a reference type, a primitive type, or an enum type; anything else, and it’ll throw. When it is one of those, it delegates to the correspondingly-sized overload. For example, this now compiles and runs successfully:

static DayOfWeek UpdateIfEqual(ref DayOfWeek location, DayOfWeek newValue, DayOfWeek expectedValue) =>    Interlocked.CompareExchange(ref location, newValue, expectedValue);

This is not only good for usability, it’s good for performance in a few ways. First, it enables performance improvements like theParallel.ForAsync one described without needing to resort toUnsafe tricks. But second, it enables smaller objects. The previously listed change not only updatedCompareExchange to remove the constraint but also then employed the overload in dozens of places. InHttp3Connection, for example, the object previously had these three fields which were updated withInterlocked.Exchange:

private int _haveServerControlStream;private int _haveServerQpackDecodeStream;private int _haveServerQpackEncodeStream;

but these are really justbools masquerading asints, exactly because they needed to be updated atomically. Now withInterlocked.Exchange<T> andInterlocked.CompareExchange<T> supportingbool, these have been updated to just be:

private bool _haveServerControlStream;private bool _haveServerQpackDecodeStream;private bool _haveServerQpackEncodeStream;

Any additional padding aside, that reduces 12 bytes down to 3 bytes on the object.

Also related toInterlocked,dotnet/runtime#96258 intrinsifies theInterlocked.And andInterlocked.Or methods for additional platforms; previously they were specially handled on Arm, but now they’re also specially handled on x86/64. As an example, the implementation in theAnd method is a fairly typicalCompareExchange loop:

public static int And(ref int location1, int value){    int current = location1;    while (true)    {        int newValue = current & value;        int oldValue = CompareExchange(ref location1, newValue, current);        if (oldValue == current)        {            return oldValue;        }        current = oldValue;    }}

You’ll see a very similar loop any time you want to use optimistic concurrency to create a new value and substitute it for the original in an atomic manner. The actual& operation is just one line here, and to highlight that this is broadly applicable, you could create a generalized version of this method for any operation using a delegate, like this:

public static int CompareExchange(ref int location1, int value, Func<int, int, int> update){    int current = location1;    while (true)    {        int newValue = update(current, value);        int oldValue = CompareExchange(ref location1, newValue, current);        if (oldValue == current)        {            return oldValue;        }        current = oldValue;    }}

such thatAnd could be implemented then like:

public static int And(ref int location1, int value) =>    CompareExchange(ref location1, value, static (current, value) => current & value)

The approach employed byAnd is reasonable when there’s nothing better you can do, but as it turns out, modern hardware platforms have single instructions capable of performing such an interlockedand andor in a much more efficient manner. The JIT already handled this for Arm because the instructions on Arm have semantics that very closely align with the semantics ofInterlocked.And andInterlocked.Or. On x86/64, however, the relevant instruction sequence (lock and orlock or) doesn’t enable accessing the original value atomically replaced, whereasAnd/Or require that as part of their definition. Luckily, most uses ofInterlocked.And/Interlocked.Or don’t actually need the return value. For example,SafeHandle.SetHandleAsInvalid simply wants to atomically OR an additional flag into some bit flags, ignoring the result ofOr:

public void SetHandleAsInvalid(){    Interlocked.Or(ref _state, StateBits.Closed);    GC.SuppressFinalize(this);}

And luckily, the JIT can see that it’s ignoring the result. As such, on x86/64, the JIT can use the optimal sequence when it can see that the result isn’t being used, and even if it is being used, it can still emit a slightly more concise instruction sequence than would have naturally resulted from our open-coded implementation:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int _location;    [Benchmark] public void Test_ResultNotUsed() => Interlocked.And(ref _location, 42);    [Benchmark] public int  Test_ResultUsed() =>    Interlocked.And(ref _location, 42);}

// .NET 8; Tests.Test_ResultNotUsed()       push      rbp       sub       rsp,10       lea       rbp,[rsp+10]       add       rdi,8       mov       eax,[rdi]M00_L00:       mov       ecx,eax       and       ecx,2A       mov       [rbp-4],eax       lock cmpxchg [rdi],ecx       mov       ecx,[rbp-4]       cmp       eax,ecx       je        short M00_L01       mov       ecx,eax       mov       eax,ecx       jmp       short M00_L00M00_L01:       add       rsp,10       pop       rbp       ret; Total bytes of code 47; Tests.Test_ResultUsed()       push      rbp       sub       rsp,10       lea       rbp,[rsp+10]       add       rdi,8       mov       eax,[rdi]M00_L00:       mov       ecx,eax       and       ecx,2A       mov       [rbp-4],eax       lock cmpxchg [rdi],ecx       mov       ecx,[rbp-4]       cmp       eax,ecx       je        short M00_L01       mov       ecx,eax       mov       eax,ecx       jmp       short M00_L00M00_L01:       add       rsp,10       pop       rbp       ret; Total bytes of code 47// .NET 9; Tests.Test_ResultNotUsed()       add       rdi,8       mov       eax,2A       lock and  [rdi],eax       ret; Total bytes of code 13; Tests.Test_ResultUsed()       add       rdi,8       mov       ecx,2A       mov       eax,[rdi]M00_L00:       mov       edx,eax       and       edx,ecx       lock cmpxchg [rdi],edx       jne       short M00_L00       ret; Total bytes of code 22

Method	Runtime	Mean	Ratio	Code Size
Test_ResultNotUsed	.NET 8.0	6.630 ns	1.00	47 B
Test_ResultNotUsed	.NET 9.0	3.132 ns	0.47	13 B

Test_ResultUsed	.NET 8.0	6.853 ns	1.00	47 B
Test_ResultUsed	.NET 9.0	6.435 ns	0.94	22 B

Locks and interlocked operations are about coordinating between operations, at a relatively low level. There are higher level coordination constructs as well; that’s effectively whatTask is, providing a representation for a piece of work with which you can later join. Such joining can be accomplished withawait along with a myriad of APIs that faciliate joining with tasks in various ways. In that regard, one of my favorite new APIs in .NET 9 is onTask:Task.WhenEach. I like it because it utilizes newer language features to cleanly solve a problem that we wanted to solve over a decade ago whenTask was originally introduced, and the lack of it has led to folks writing code with known pits of failure.

Task.WhenAll is fairly easy to understand: you give it a collection of tasks, and the task it returns will complete only when all of the constituent tasks have completed:

await Task.WhenAll([t1, t2, t3]);... // only get here when t1, t2, and t3 have all completed successfully

Task.WhenAny is a bit more complex, in that it returns when any of the constituent tasks has completed, and it gives you back a reference to that task:

Task tCompleted = await Task.WhenAny([t1, t2, t3]);... // tCompleted is either t1, t2, or t3, and will be completed here

and you can then explicitly join with that returned task to observe any exceptions it may have incurred or consume its result value if it has one. But what do you then do to join with the remaining two tasks? You might end up writing code something like this:

List<Task> tasks = new() { t1, t2, t3 };while (tasks.Count > 0){    Task completed = await Task.WhenAny(tasks);    Handle(completed);    tasks.Remove(completed);}

That’s not terribly hard, but it’s also not terribly efficient. Or, rather, for larger number of tasks, it’s terribly inefficient, as it’s anO(N^2) algorithm. Some of the complexity is likely obvious: you’ve got a loop and inside that loop you’ve got aList<T>.Remove call, which will do anO(N) walk of the list looking for the target element to remove: there’s theO(N^2). But, there’s actually another less obviousO(N) operation in the loop: theWhenAny itself. Every call toWhenAny needs to hook a continuation up to each of the constituentTask objects. (There are of course cheaper ways to implement this functionality than using such aWhenAny, but they’re all more complicated and thus not the answers towards which folks have gravitated.)

EnterTask.WhenEach.WhenEach‘s purpose is to make consuming tasks as they complete both simple and efficient. To do so, rather than returning aTask<Task> as doesWhenAny, it returns anIAsyncEnumerable<Task>, so one can simply iterate through the completing tasks as they complete.

await foreach (Task completed in Task.WhenEach([t1, t2, t3])){    Debug.Assert(completed.IsCompleted);    Handle(completed);}

It’s a little hard to get a good applies-to-apples comparison of the overhead here, but this benchmark is a reasonable approximation:

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Params(10, 1_000)]    public int Count { get; set; }    [Benchmark]    public async Task WithWhenAny()    {        var tcs = Enumerable.Range(0, Count).Select(_ => new TaskCompletionSource()).ToList();        List<Task> tasks = tcs.Select(t => t.Task).ToList();        tcs[^1].SetResult();        while (tasks.Count > 0)        {            Task completed = await Task.WhenAny(tasks);            tasks.Remove(completed);            tcs.RemoveAt(tcs.Count - 1);            if (tasks.Count == 0) break;            tcs[^1].SetResult();        }    }    [Benchmark]    public async Task WithWhenEach()    {        var tcs = Enumerable.Range(0, Count).Select(_ => new TaskCompletionSource()).ToList();        int remaining = tcs.Count - 1;        tcs[remaining].SetResult();        await foreach (Task completed in Task.WhenEach(tcs.Select(t => t.Task)))        {            if (remaining == 0) break;            tcs[--remaining].SetResult();        }    }}

Method	Count	Mean	Allocated
WithWhenAny	10	3.232 us	3.47 KB
WithWhenEach	10	1.223 us	1.43 KB
WithWhenAny	1000	20,082.683 us	4207.12 KB
WithWhenEach	1000	102.759 us	94.24 KB

WhenAll also gets a bit cheaper, in a couple of ways.dotnet/runtime#93953 utilizes a trick employed elsewhere inTask in .NET 8, which is to use its usedm_stateObject field (unused because there’s no way to set it withWhenAll) to store some of the state that previously had a dedicated field (a field for storing information about constituent tasks that failed or were canceled). That means theTask objectWhenAll returns gets 8 bytes smaller (on 64-bit). On top of that,dotnet/runtime#101308 adds newReadOnlySpan<T>-based overloads to a bunch of methods, includingTask.WhenAll. This enables passing in any number of tasks without needing to allocate.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public async Task WhenAll()    {        var atmb1 = new AsyncTaskMethodBuilder();        var atmb2 = new AsyncTaskMethodBuilder();        Task whenAll = Task.WhenAll([atmb1.Task, atmb2.Task]);        atmb1.SetResult();        atmb2.SetResult();        await whenAll;    }}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
WhenAll	.NET 8.0	123.8 ns	1.00	264 B	1.00
WhenAll	.NET 9.0	103.8 ns	0.86	216 B	0.82

There are some other interesting performance improvements in threading in .NET 9 as well.

Debugger.NotifyOfCrossThreadDependency. This is a big deal. When you’re debugging a .NET process and you break in the debugger, it pauses all threads in the debuggee process so that nothing is making forward progress while you examine state. However, .NET debuggers, like the one in Visual Studio, support invoking properties and methods in the debuggee while debugging. That can be a big problem if the functionality being invoked relies on one of those paused threads to do something, e.g. if the property you access tries to take a lock that’s held by another thread or tries toWait on aTask. To mitigate problems here, theDebugger.NotifyOfCrossThreadDependency method exists. Functionality that relies on another thread to do something can callNotifyOfCrossThreadDependency; if there’s no debugger attached, it’s a nop, but if there is a debugger attached, this signals the problem to the debugger, which can then react accordingly. The Visual Studio debugger reacts by stopping the evaluation but then by offering an opt-in option of “slipping” all threads, unpausing all threads until the evaluated operation completes, at which point all threads will be paused again, thereby again trying to mitigate any problems that might occur from the cross-thread dependency.NotifyOfCrossThreadDependency is generally not used by application code, but it’s used in a few critical choke points in the core libraries, in particular throughoutSystem.Threading and the infrastructure forasync/await. That means, for example, that this method is being called any time youawait aTask that’s not yet completed. And, unfortunately, while the method is a cheap nop when the debugger isn’t attached, historically it’s been fairly expensive when the debugger is attached, to the point where it can meaningfully impact a developer’s experience in the tool. Thankfully, .NET 9 addresses this withdotnet/runtime#101864, which significantly improves the performance ofNotifyOfCrossThreadDependency when a debugger is attached. We can see this with a low-tech benchmark. Replace the contents of yourProgram.cs with this:
```
using System.Diagnostics;const int Iters = 100_000;Stopwatch sw = new();while (true){    sw.Restart();    for (int i = 0; i < Iters; i++)    {        Debugger.NotifyOfCrossThreadDependency();    }    sw.Stop();    Console.WriteLine($"{sw.Elapsed.TotalMicroseconds / Iters:N3}us");}
```
open the project in Visual Studio, ensure that .NET 8 is selected as the target framework and that you’re targeting Release:
and run with the debugger attached (F5, not ctrl-F5). When I do that, I see numbers like this (on Windows):
```
48.360us45.281us46.714us46.945us46.525us
```
Then change the target framework to be .NET 9, and run with the debugger attached again. I then see numbers like this:
```
1.973us1.713us1.714us1.871us1.963us
```
While such an improvement shouldn’t impact your production workloads, it can make an impactful difference to your productivity as a developer.
Volatile. A “memory model” is a description of how threads interact with memory and what guarantees are made about how different threads produce and consume changes in shared memory. Reads and writes to memory from a single thread are guaranteed to be observed by that thread in the order they occurred, but once multiple threads enter the picture, it’s up to the memory model to define what behaviors can be relied on and which can’t. For example, if there are two fields,_a and_b, both of which start as0, and if one thread does:
```
_a = 1;_b = 2;
```
and then another does:
```
while (_b != 2);Assert(_a == 1);
```
is that assert guaranteed to always pass? It depends on the memory model, and whether the writes from thread 1 might get reordered (by any of the involved compilers or even by the hardware) such that the write to_b became visible to thread 2 before the write to_a. For the longest time, the only official memory model for .NET was the one defined by theECMA 335 specification, but real implementations, including coreclr, generally had stronger guarantees than what ECMA detailed. Thankfully, the official.NET memory model has now been documented. However, some of the practices that were being employed in the core libraries (due to defensive coding or uncertainty of the memory model or out-of-date requirements) are no longer necessary. One of the main tools available for folks coding at a level where memory model is relevant is thevolatile keyword / theVolatile class. Marking a field asvolatile causes any reads or writes of that field to be considered “volatile,” just as does usingVolatile.Read/Volatile.Write to perform that read or write. Making the read or write volatile means it prevents certain kinds of “movement,” e.g. if both_a and_b in the previous example were marked asvolatile, the assert would always pass. Marking fields or operations asvolatile can come with an expense, depending on the circumstance and the target platform. For example, it can restrict the C# compiler and the JIT compiler from performing certain optimizations. Let’s take a simple example. This code:
```
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private volatile int _volatile;    private int _nonVolatile;    [Benchmark]    public int UsingVolatile() => _volatile + _volatile;    [Benchmark]    public int UsingNonVolatile() => _nonVolatile + _nonVolatile;}
```
results in this assembly on .NET 9:
```
; Tests.UsingVolatile()       mov       eax,[rdi+8]       add       eax,[rdi+8]       ret; Total bytes of code 7; Tests.UsingNonVolatile()       mov       eax,[rdi+0C]       add       eax,eax       ret; Total bytes of code 6
```
The important difference between the two assembly blocks is in theadd instruction. In theUsingVolatile method, the first instruction is loading the value from memory stored at addressrcx+8, and then re-reading that samercx+8 memory location again to add whatever is there with what it just read. InUsingNonVolatile, it starts the same way, reading the value stored atrcx+0xc, but then theadd isn’t doing another memory read and is instead just doubling the value stored in the register. One of the effects ofvolatile requiring that reads can’t be moved is also that they can’t beremoved, which means both reads in the code are required to stay. Here’s another example:
```
using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private volatile bool _volatile;    private bool _nonVolatile;    [Benchmark]    public int CountVolatile()    {        int count = 0;        while (_volatile) count++;        return count;    }    [Benchmark]    public int CountNonVolatile()    {        int count = 0;        while (_nonVolatile) count++;        return count;    }}
```
which on .NET 9 produces this assembly:
```
; Tests.CountVolatile()       push      rbp       mov       rbp,rsp       xor       eax,eax       cmp       byte ptr [rdi+8],0       jne       short M00_L01M00_L00:       pop       rbp       retM00_L01:       inc       eax       cmp       byte ptr [rdi+8],0       jne       short M00_L01       jmp       short M00_L00; Total bytes of code 24; Tests.CountNonVolatile()       push      rbp       mov       rbp,rsp       xor       eax,eax       cmp       byte ptr [rdi+9],0       jne       short M00_L00       pop       rbp       retM00_L00:       jmp       short M00_L00; Total bytes of code 16
```
They look somewhat similar, in fact the first five instructions are almost identical, but there’s a critical difference. In both cases, thebool value is being loaded and checked to see if it’sfalse (thecmp against0 followed by a conditional jump), in which case the implementations both fall through to the endingret to exit out of the method. The compiler is rewriting thewhile (cond) { ... } loop to instead be more like anif (cond) { do { ... } while(cond); }, so this initial test is the one for thatif (cond). But then things diverge meaningfully.CountVolatile then proceeds to do thedo while equivalent, incrementing thecount (stored ineax), reading_volatile and comparing it to0 (false), and if it’s stilltrue, jumping back up to loop again. So basically what you’d expect. But now look atCountNonVolatile. The loop is just this:
```
M00_L00:       jmp       short M00_L00
```
It’s now sitting in an infinite loop, with an unconditional jump back to the samejmp instruction, looping forever. That’s because the JIT was able to hoist the read of_nonVolatile out of the loop. It then also sees that no one will ever observecount‘s changed value, so it also elides the increment. At which point it’s more like if I’d written this C#:
```
public int CountNonVolatile(){    int count = 0;    if (_nonVolatile)    {        while (true);    }    return count;}
```
That hoisting can’t be done when the field isvolatile, because it can’t reorder or remove reads associated with the field. But with_nonVolatile, nothing prevents that. On multiple occasions I’ve seen folks trying to engage in low-lock programming experience the ramifications of this latter example: they’ll be using somebool to signal to a consumer that it should break out of the loop, but thebool isn’t volatile, and the consumer then never notices when the producer eventually sets it.
Those are examples of the ramifications ofvolatile in terms of what the C# or JIT compiler are constrained from doing. But there are also things the JITneeds to do (rather than avoid) in order to ensure the hardware can respect the requirements put in place by the developer. On some hardware, like x64, the memory model of the hardware is relatively “strong,” meaning it doesn’t do most of the kinds of reorderings thatvolatile inhibits, and therefore you won’t see anything emitted into the assembly code by the JIT to help the hardware enforce the constraints. On other hardware, like Arm64, though, the hardware has a relatively “weaker” model, meaning it allows more of these kinds of reorderings, and as a result, the JIT needs to actively inhibit such reorderings by inserting appropriate “memory barriers” into the code. On Arm, this shows up with instructions likedmb (“data memory barrier”). Such barriers have some overhead associated with them.
For all of these reasons, fewervolatiles is good for performance, but of course you need to ensure you have enoughvolatiles to actually achieve a correct application (with the best answer being avoid writing lock-lock code in the first place, and then you never need to know or think aboutvolatile). It’s a balance. Luckily, and bringing us full circle to why we’re talking about this, there are a set of common cases wherevolatile used to be recommended but now that we have a well-defined memory model, those uses are obsolete. Removing them can help to avoid a layer of thin cost across the code. Sodotnet/runtime#100969 anddotnet/runtime#101346 removed a bunch ofvolatile usage where it was no longer necessary. Almost all of these uses were as part of lazy initialization of reference types, e.g.
```
private volatile MyReferenceType? _instance;public MyReferenceType Instance => _instance ??= new MyReferenceType();
```
which if we expand that out to not use??= looks something like this:
```
private MyReferenceType? _instance;public MyReferenceType Instance{    get    {        MyReferenceType? instance = _instance;        if (instance is null)        {            _instance = instance = new MyReferenceType();        }        return instance;    }}
```
The reason for thevolatile here would be two-fold, one for the part of the operation that reads and one for the part of the operation that writes. Without thevolatile, the concern would be that one of the compilers or the hardware could actually “introduce a read,” effectively making the code equivalent to this:
```
private MyReferenceType? _instance;public MyReferenceType Instance{    get    {        MyReferenceType? instance = _instance;        if (_instance is null) // NOTE THE _ HERE        {            _instance = instance = new MyReferenceType();        }        return instance;    }}
```
If that were to happen, there’s a problem that between the two reads, the value of_instance could go fromnull to non-null, in which caseinstance could be assignednull,_instance is null might then be false, andreturn instance would returnnull. However, the .NET memory model explicitly states “Reads cannot be introduced.” Then there’s the concern about the write. The concern there that leads tovolatile being used is the initialization that happens inside ofMyReferenceType. Imagine ifMyReferenceType were defined like this:
```
class MyReferenceType(){    internal int _value;    public MyReferenceType() => _value = 42;}
```
The question then becomes “is it possible for the write to_value inside of the constructor to be viewed by another threadafter the write of the instance to_instance“? In other words, could the code logically become the equivalent of this:
```
private MyReferenceType? _instance;public MyReferenceType Instance{    get    {        MyReferenceType? instance = _instance;        if (_instance is null)        {            _instance = instance = RuntimeHelpers.GetUninitializedObject(typeof(MyReferenceType));            instance._value = 42;        }        return instance;    }}
```
If that could happen, then two threads could be racing to accessInstance, one of them could get as far as setting_instance (but_value hasn’t been set yet), then another thread could accessInstance, see_instance as non-null, and start using it, even though_value hasn’t yet been initialized. Thankfully here as well, the .NET memory model doesn’t allow such transformations, explicitly covering this point:
“Object assignment to a location potentially accessible by other threads is a release with respect to accesses to the instance’s fields/elements and metadata. An optimizing compiler must preserve the order of object assignment and data-dependent memory accesses. The motivation is to ensure that storing an object reference to shared memory acts as a “committing point” to all modifications that are reachable through the instance reference.”
Phew!
ManagedThreadId.dotnet/runtime#91232 is fun, in a “why didn’t we already do this” sort of way.Thread.ManagedThreadId is implemented as an internal call (an FCALL) into the runtime, resulting in a call toThreadNative::GetManagedThreadId, which in turn reads the thread object’sm_ManagedThreadId field. At least, that’s the field in the C definition of the object. The managedThread object has corresponding fields at the exact location that are available for the C# code to use, in this case_managedThreadId. So what did this PR do? It removed those complicated gymnastics and just made the whole implementation bepublic int ManagedThreadId => _managedThreadId. (It’s worth noting, though, thatThread.CurrentThread.ManagedThreadId was already previously recognized specially by the JIT, so this is only relevant when accessing theManagedThreadId from some otherThread instance.) The main benefit of this is avoiding the extra function call, as the FCALL can’t be inlined.
```
using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private Thread _thread = Thread.CurrentThread;    [Benchmark]    public int GetID() => _thread.ManagedThreadId;}
```
```
// .NET 8; Tests.GetID()       mov       rdi,[rdi+8]       cmp       [rdi],edi       jmp       near ptr System.Threading.Thread.get_ManagedThreadId(); Total bytes of code 11**Extern method**System.Threading.Thread.get_ManagedThreadId()// .NET 9; Tests.GetID()       mov       rax,[rdi+8]       mov       eax,[rax+34]       ret; Total bytes of code 8
```
Ports to NativeAOT. Previous releases of .NET enabled inlining the fast path of thread-local state (TLS) access on coreclr. Withdotnet/runtime#104282,dotnet/runtime#89472, anddotnet/runtime#97910, this improvement comes to NativeAOT as well. Similarly,dotnet/runtime#103675 ports coreclr’s “yield normalization” implementation to NativeAOT; this is in support of enabling the runtime to measure the cost of variouspause instructions, which can then be used as part of tuning spinning and spin waiting.
Startup time. Performance improvements related to threading are generally about steady-state throughput improvements, e.g. reducing synchronization costs while processing requests. That’s what makesdotnet/runtime#106724 from@harisokanovic so interesting, in that it’s instead about reducing startup overheads of a process using .NET on Linux. The GC uses the equivalent of a process-wide memory barrier (also exposed publicly asInterlocked.MemoryBarrierProcessWide) to ensure that all threads involved in a collection see a consistent state. On Linux, implementing this method efficiently involves using themembarrier system call withMEMBARRIER_CMD_PRIVATE_EXPEDITED, and using that requires the same syscall to have been made earlier on withMEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, which really means doing so at startup. However, the Linux kernel has some optimizations that make this registration a lot cheaper to use when there’s only one thread in the process. The way it was being used in .NET previously guaranteed there would be multiple. This PR changed where this initialization was performed in order to maximize the possibility of there only being the single thread in the process, which in turn makes startup faster. The improvement was upwards of 10ms on various systems on which it was measured, which is a large percentage of a .NET process’ startup overhead on Linux.

Reflection

Reflection is a very powerful (though sometimes overused) capability of .NET that enables code to load and introspect .NET assemblies and invoke their functionality. It is used in all manner of library and application, including by the core .NET libraries themselves, and it’s important that we continue to find ways to reduce the overheads associated with reflection.

Several PRs in .NET 9 whittle away at some of the allocation overheads in reflection.dotnet/runtime#92310 anddotnet/runtime#93115 avoid some defensive array copies by instead handing aroundReadOnlySpan<T> instances, whiledotnet/runtime#95952 removed a use ofstring.Split that turned out to only be used with constants and thus could be replaced by just manually splitting those constants. But a more interesting and impactful addition comes fromdotnet/runtime#97683, which added an allocation-free way to get the invocation list from a delegate. Delegates in .NET are “multicast,” meaning a single delegate instance might actually represent multiple methods to be invoked; this is how .NET events are implemented. If I invoke a delegate, the delegate implementation handles invoking each constituent method, sequentially, in turn. But what if I want to customize the invocation logic? Maybe I want to wrap each individual method in a try/catch, or maybe I want to track the return values from all of the methods rather than just the last, or some such behavior. To achieve that, delegates expose a way to get an array of delegates, one for each method that’s part of the original. So, if I have:

Action action = () => Console.Write("A ");action += () => Console.Write("B ");action += () => Console.Write("C ");action();

that’ll print out"A B C ", and if I have:

Action action = () => Console.Write("A ");action += () => Console.Write("B ");action += () => Console.Write("C ");Delegate[] actions = action.GetInvocationList();for (int i = 0; i < actions.Length; ++i){    Console.Write($"{i}: ");    ((Action)actions[i])();    Console.WriteLine();}

that’ll print out:

0: A1: B2: C

However, thatGetInvocationList needs to allocate. Now in .NET 9, there’s the newDelegate.EnumerateInvocationList<TDelegate> method, which returns astruct-based enumerable for iterating through the delegates rather than needing to allocate a new array to store them all.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private Action _action;    private int _count;    [GlobalSetup]    public void Setup()    {        _action = () => _count++;        _action += () => _count += 2;        _action += () => _count += 3;    }    [Benchmark(Baseline = true)]    public void GetInvocationList()    {        foreach (Action action in _action.GetInvocationList())        {            action();        }    }    [Benchmark]    public void EnumerateInvocationList()    {        foreach (Action action in Delegate.EnumerateInvocationList(_action))        {            action();        }    }}

Method	Mean	Ratio	Allocated	Alloc Ratio
GetInvocationList	32.11 ns	1.00	48 B	1.00
EnumerateInvocationList	11.07 ns	0.34	–	0.00

Reflection is particularly important with libraries involved in dependency injection, where object construction is frequently done in a more dynamic fashion.ActivatorUtilities.CreateInstance plays a key role there, and has also seen improvements from allocation reduction.dotnet/runtime#99383, in particular, helped to significantly reduce allocation by employing theConstructorInvoker type introduced in .NET 8, and by piggybacking on the changes fromdotnet/runtime#99175 to cut down on the number of constructors it needs to examine.

// Add a <PackageReference Include="Microsoft.Extensions.DependencyInjection" Version="8.0.0" /> to the csproj.// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Configs;using BenchmarkDotNet.Environments;using BenchmarkDotNet.Jobs;using BenchmarkDotNet.Running;using Microsoft.Extensions.DependencyInjection;var config = DefaultConfig.Instance    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80)        .WithNuGet("Microsoft.Extensions.DependencyInjection", "8.0.0")        .WithNuGet("Microsoft.Extensions.DependencyInjection.Abstractions", "8.0.1").AsBaseline())    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90)        .WithNuGet("Microsoft.Extensions.DependencyInjection", "9.0.0-rc.1.24431.7")        .WithNuGet("Microsoft.Extensions.DependencyInjection.Abstractions", "9.0.0-rc.1.24431.7"));BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]public class Tests{    private IServiceProvider _serviceProvider = new ServiceCollection().BuildServiceProvider();    [Benchmark]    public MyClass Create() => ActivatorUtilities.CreateInstance<MyClass>(_serviceProvider, 1, 2, 3);    public class MyClass    {        public MyClass() { }        public MyClass(int a) { }        public MyClass(int a, int b) { }        [ActivatorUtilitiesConstructor]        public MyClass(int a, int b, int c) { }    }}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Create	.NET 8.0	163.60 ns	1.00	288 B	1.00
Create	.NET 9.0	83.46 ns	0.51	144 B	0.50

The aforementionedConstructorInvoker, along with aMethodInvoker, was introduced in .NET 8 as a way to cache first-use information to enable all subsequent operations to be much faster. Without introducing a new publicFieldInvoker,dotnet/runtime#98199 is able to achieve similar levels of speedup for field access via aFieldInfo by employing an internalFieldAccessor that’s cached onto theFieldInfo object (dotnet/runtime#92512 also helped here by moving some native runtime implementations back up into C#). Varying levels of large speedups are achieved depending on the exact nature of the field being accessed.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Reflection;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]public class Tests{    private static object s_staticReferenceField = new object();    private object _instanceReferenceField = new object();    private static int s_staticValueField = 1;    private int _instanceValueField = 2;    private object _obj = new();    private FieldInfo _staticReferenceFieldInfo = typeof(Tests).GetField(nameof(s_staticReferenceField), BindingFlags.NonPublic | BindingFlags.Static)!;    private FieldInfo _instanceReferenceFieldInfo = typeof(Tests).GetField(nameof(_instanceReferenceField), BindingFlags.NonPublic | BindingFlags.Instance)!;    private FieldInfo _staticValueFieldInfo = typeof(Tests).GetField(nameof(s_staticValueField), BindingFlags.NonPublic | BindingFlags.Static)!;    private FieldInfo _instanceValueFieldInfo = typeof(Tests).GetField(nameof(_instanceValueField), BindingFlags.NonPublic | BindingFlags.Instance)!;    [Benchmark] public object? GetStaticReferenceField() => _staticReferenceFieldInfo.GetValue(null);    [Benchmark] public void SetStaticReferenceField() => _staticReferenceFieldInfo.SetValue(null, _obj);    [Benchmark] public object? GetInstanceReferenceField() => _instanceReferenceFieldInfo.GetValue(this);    [Benchmark] public void SetInstanceReferenceField() => _instanceReferenceFieldInfo.SetValue(this, _obj);    [Benchmark] public int GetStaticValueField() => (int)_staticValueFieldInfo.GetValue(null)!;    [Benchmark] public void SetStaticValueField() => _staticValueFieldInfo.SetValue(null, 3);    [Benchmark] public int GetInstanceValueField() => (int)_instanceValueFieldInfo.GetValue(this)!;    [Benchmark] public void SetInstanceValueField() => _instanceValueFieldInfo.SetValue(this, 4);}

Method	Runtime	Mean	Ratio
GetStaticReferenceField	.NET 8.0	24.839 ns	1.00
GetStaticReferenceField	.NET 9.0	1.720 ns	0.07

SetStaticReferenceField	.NET 8.0	41.025 ns	1.00
SetStaticReferenceField	.NET 9.0	6.964 ns	0.17

GetInstanceReferenceField	.NET 8.0	29.595 ns	1.00
GetInstanceReferenceField	.NET 9.0	5.960 ns	0.20

SetInstanceReferenceField	.NET 8.0	31.753 ns	1.00
SetInstanceReferenceField	.NET 9.0	9.577 ns	0.30

GetStaticValueField	.NET 8.0	43.847 ns	1.00
GetStaticValueField	.NET 9.0	36.011 ns	0.82

SetStaticValueField	.NET 8.0	39.462 ns	1.00
SetStaticValueField	.NET 9.0	10.396 ns	0.26

GetInstanceValueField	.NET 8.0	45.125 ns	1.00
GetInstanceValueField	.NET 9.0	39.104 ns	0.87

SetInstanceValueField	.NET 8.0	36.664 ns	1.00
SetInstanceValueField	.NET 9.0	13.571 ns	0.37

Of course, if you can avoid using these more expensive reflection approaches in the first place, that’s very desirable. One reason for using reflection is to access private members of other types, and while that’s a scary thing to do and generally something to be avoided, there are valid cases for it where having an efficient solution is highly desirable. .NET 8 added such a mechanism in[UnsafeAccessor], which enables a type to declare a method that effectively serves as direct access to a member of another type. So, for example, with this:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Reflection;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]public class Tests{    private MyClass _myClass = new MyClass(new List<int>() { 1, 2, 3 });    private FieldInfo _fieldInfo = typeof(MyClass).GetField("_list", BindingFlags.NonPublic | BindingFlags.Instance)!;    private static class Accessors    {        [UnsafeAccessor(UnsafeAccessorKind.Field, Name = "_list")]        public static extern ref object GetList(MyClass myClass);    }    [Benchmark(Baseline = true)]    public object WithFieldInfo() => _fieldInfo.GetValue(_myClass)!;    [Benchmark]    public object WithUnsafeAccessor() => Accessors.GetList(_myClass);}public class MyClass(object list){    private object _list = list;}

I get this:

Method	Runtime	Mean	Ratio
WithFieldInfo	.NET 8.0	27.5299 ns	1.00
WithFieldInfo	.NET 9.0	4.0789 ns	0.15
WithUnsafeAccessor	.NET 8.0	0.5005 ns	0.02
WithUnsafeAccessor	.NET 9.0	0.5499 ns	0.02

However, in .NET 8, this mechanism could only be used with non-generic members. Now in .NET 9, thanks todotnet/runtime#99468 anddotnet/runtime#99830, this capability now extends to generics, as well.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Reflection;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]public class Tests{    private MyClass<int> _myClass = new MyClass<int>(new List<int>() { 1, 2, 3 });    private FieldInfo _fieldInfo = typeof(MyClass<int>).GetField("_list", BindingFlags.NonPublic | BindingFlags.Instance)!;    private static class Accessors<T>    {        [UnsafeAccessor(UnsafeAccessorKind.Field, Name = "_list")]        public static extern ref List<T> GetList(MyClass<T> myClass);    }    [Benchmark(Baseline = true)]    public List<int> WithFieldInfo() => (List<int>)_fieldInfo.GetValue(_myClass)!;    [Benchmark]    public List<int> WithUnsafeAccessor() => Accessors<int>.GetList(_myClass);}public class MyClass<T>(List<T> list){    private List<T> _list = list;}

Method	Mean	Ratio
WithFieldInfo	4.4251 ns	1.00
WithUnsafeAccessor	0.5147 ns	0.12

Parsing that occurs as part of reflection, and in particular as part of type names, was also improved as part of some work to consolidate type name parsing into a reusable component.dotnet/runtime#100094‘s primary purpose wasn’t to improve performance, but it ended up doing so, anyway.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public Type? Parse() =>        Type.GetType("System.Collections.Generic.Dictionary`2[" +                         "[System.Collections.Generic.List`1[" +                            "[System.Int32, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], " +                            "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]," +                         "[System.Collections.Generic.List`1[" +                            "[System.String, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], " +                            "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], " +                         "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e");}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Parse	.NET 8.0	7.590 us	1.00	5.03 KB	1.00
Parse	.NET 9.0	6.361 us	0.84	4.73 KB	0.94

And then there are more intrinsics. In compiler speak, an “intrinsic” is something the compiler has “intrinsic” knowledge of, a fancy way of saying it’s something the compiler implicitly knows about. This typically manifests as a method whose implementation is provided by the compiler, sometimes always or sometimes based on certain conditions. For example,string.Equals is attributed as[Intrinsic]: it has its own fully-functional implementation, but if the JIT sees that at least one of the inputs is a constant string, the JIT may emit its own optimized implementation forEquals that unrolls and vectorizes the comparison based on the exact value being compared.

Several new members became intrinsics in .NET 9.dotnet/runtime#96226 turnstypeof(T).IsPrimitive into an intrinsic, which allows the JIT to supply a constant replacement for the expression, which in turn allows branches to be eliminated and possibly whole swaths of then dead code to follow. For example, as part of its code path for moving to the next value,Parallel.ForAsync has a code path that looks like this:

if (typeof(T).IsPrimitive){    UseInterlockedCompareExchangeToAdvance();}else{    UseALockAroundAReadIncrementStoreToAdvance();}

WithIsPrimitive as an intrinsic, thatif/else will reduce entirely to either:

UseInterlockedCompareExchangeToAdvance();

UseALockAroundAReadIncrementStoreToAdvance();

based on the nature ofT.

typeof(T).IsGenericType andtypeof(T).GetGenericTypeDefinition were also made into intrinsics, bydotnet/runtime#99555 anddotnet/runtime#103528, respectively. Imagine code like that in ASP.NET where it wants to special-case APIs that returnTask<T> vsValueTask<T> vsIAsyncEnumerable<T> vsT vs other types; it’ll often use members likeIsGenericType andGetGenericTypeDefinition (which will throw an exception ifIsGenericType isfalse) to determine whether a concrete instantiation of a generic type is the one in question. With this benchmark:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public bool Test() => IsTaskT<Task<string>>();    private static bool IsTaskT<T>() =>        typeof(T).IsGenericType &&        typeof(T).GetGenericTypeDefinition() == typeof(Task<>);}

on .NET 8 we end up with over 250 bytes of assembly code for implementing this operation. On .NET 9, we get just this:

; Tests.Test()       mov       eax,1       ret; Total bytes of code 6

The magic of intrinsics.

Numerics

Primitive Types

The core data types in .NET sit at the very bottom of the stack and are used everywhere. It’s thus a desire every release to whittle away at any overheads we can avoid. .NET 9 is no exception, where a multitude of PRs have gone into reducing overheads of various operations on these core types.

ConsiderDateTime. When it comes to performance optimization, we typically focus on the happy path, on the “hot path,” on the successful path. Exceptions already add significant expense to error paths, and are intended to be “exceptional” and relatively rare, and so we generally don’t worry about an extra operation here or an extra allocation there. But, sometimes, one type’s error path is another type’s success path. This is especially true withTry methods, where failure is conveyed via abool rather than with an expensive exception. As part of profiling a commonly-used .NET library, the profiler highlighted some unexpected allocations coming fromDateTime handling, unexpected because we’ve spent a lot of time over the years trying to eliminate allocations in this area of the code. The allocation, it turned out, was occurring on an error path, both withDateTime.Parse when an exception would be thrown, butalso withDateTime.TryParse whenfalse would be returned. As it happened, deep in the call tree where parsing work is going on, if an error is encountered, the code stores information about the failure (e.g. aParseFailureKind enum value); after unwinding the call stack back to the public method,Parse uses that to throw an appropriately-detailed exception, whileTryParse just ignores it and returnsfalse. But the way the code was written, that enum value would end up getting boxed when it was stored, resulting in an allocation as part ofTryParse returningfalse. The consuming library was usingTryParse on a bunch of different data primitive types as part of interpreting the data, e.g.

if (int.TryParse(value, out int parsedInt32)) { ... }else if (DateTime.TryParse(value, out DateTime parsedDateTime)) { ... }else if (double.TryParse(value, out double parsedDouble)) { ... }else if ...

such thatits success path might include the failure path from some number of these primitives’TryParse methods.dotnet/runtime#91303 tweaked how the information is stored to avoid that boxing, while also reducing a bit of additional overhead along the way.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "input")]public class Tests{    [Benchmark]    [Arguments("hello")]    public bool TryParse(string input) => DateTime.TryParse(input, out _);}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
TryParse	.NET 8.0	31.95 ns	1.00	24 B	1.00
TryParse	.NET 9.0	25.96 ns	0.81	–	0.00

BothDateTime andTimeSpan also saw parsing and formatting gains fromdotnet/runtime#101640 from@lilinus. The PR takes advantage of an existing internalCountDigits helper that was optimized in .NET 8 as part of integer parsing; it employs a lookup table to compute the number of digits that will be required for a number, doing so in just a few instructions. And it replaces aswitch with a lookup table as part of computing powers of ten, replacing a method likePow10_Old in this benchmark with one more likePow10_New:

using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "input")]public class Tests{    private int _pow = 3;    [Benchmark(Baseline = true)]    public long Pow10_Old() =>        _pow switch        {            0 => 1,            1 => 10,            2 => 100,            3 => 1000,            4 => 10000,            5 => 100000,            6 => 1000000,            _ => 10000000, // _pow will never be greater than 7        };    [Benchmark]    public long Pow10_New()    {        ReadOnlySpan<int> powersOfTen =        [            1,            10,            100,            1000,            10000,            100000,            1000000,            10000000, // _pow will never be greater than 7        ];        return powersOfTen[_pow];    }}

The JIT is able to do a bit better job with the latter, for the former producing:

; Tests.Pow10_Old()       push      rbp       mov       rbp,rspM00_L00:       mov       ecx,[rdi+8]       cmp       ecx,3       jne       short M00_L02       mov       edx,3E8M00_L01:       movsxd    rax,edx       pop       rbp       retM00_L02:       cmp       ecx,6       ja        short M00_L03       mov       edx,ecx       lea       rax,[7F3D29A690E8]       mov       eax,[rax+rdx*4]       lea       rcx,[M00_L00]       add       rax,rcx       jmp       raxM00_L03:       mov       edx,989680       jmp       short M00_L01       mov       edx,0F4240       jmp       short M00_L01       mov       edx,186A0       jmp       short M00_L01       mov       edx,2710       jmp       short M00_L01       mov       edx,64       jmp       short M00_L01       mov       edx,0A       jmp       short M00_L01       mov       edx,1       jmp       short M00_L01; Total bytes of code 100

but for the latter producing:

; Tests.Pow10_New()       push      rax       mov       eax,[rdi+8]       cmp       eax,8       jae       short M00_L00       mov       rcx,7F3CC0AE6018       movsxd    rax,dword ptr [rcx+rax*4]       add       rsp,8       retM00_L00:       call      CORINFO_HELP_RNGCHKFAIL       int       3; Total bytes of code 34

The net result is a nice improvement to these operations, e.g.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private string _input = TimeSpan.FromMilliseconds(12345.6789).ToString();    [Benchmark]    public TimeSpan Parse() => TimeSpan.Parse(_input);}

Method	Runtime	Mean	Ratio
Parse	.NET 8.0	137.55 ns	1.00
Parse	.NET 9.0	117.78 ns	0.86

Various operations on the primitive types were also improved across a plethora of PRs:

Round.dotnet/runtime#98186 from@MichalPetryka optimized the variousMath.Round andMathF.Round overloads (which are the same implementations asdouble.Round andfloat.Round).

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private double _value = 12345.6789;    [Benchmark]    public double RoundDigits() => Math.Round(_value, 2);}

Method	Runtime	Mean	Ratio
RoundDigits	.NET 8.0	1.6930 ns	1.00
RoundDigits	.NET 9.0	0.3496 ns	0.21

SinCos.dotnet/runtime#103724 updatedMath.SinCos andMathF.SinCos to use the internalRuntimeHelpers.IsKnownConstant intrinsic. This method enables code inCoreLib to check whether the argument to the method is coming in as a constant the JIT can see, at which point the implementation might choose to do something special. In this case, theSin andCos functions are already capable of producing constant results for constant input, so rather than doing the normal implementation, which tries to reuse most of the computation that’s shared betweenSin andCos, it instead just calls to each, knowing that the constant input for each will result in a constant output overall.
```
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser(maxDepth: 0)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public float Sum() => SumSinCos(123.456f);    private float SumSinCos(float f)    {        (float sin, float cos) = MathF.SinCos(f);        return sin + cos;    }}
```
Method Runtime Mean Ratio Code Size
Sum .NET 8.0 5.2719 ns 1.000 46 B
Sum .NET 9.0 0.0177 ns 0.003 9 B
In cases like this, it’s helpful to pay attention to the warnings BenchmarkDotNet issues:
```
// * Warnings *ZeroMeasurement  Tests.Sum: Runtime=.NET 9.0, Toolchain=net9.0 -> The method duration is indistinguishable from the empty method duration
```
The .NET 9 run is indistinguishable from an empty method because itis an empty method, or at least a method that just returns a constant. We can see that by looking at the disassembly. The .NET 8 code has a few moves and loads and then calls toSinCos:
```
; Tests.Sum()       push      rax       vzeroupper       vmovss    xmm0,dword ptr [7F7686979610]       lea       rdi,[rsp+4]       lea       rsi,[rsp]       call      System.MathF.SinCos(Single, Single*, Single*)       vmovss    xmm0,dword ptr [rsp+4]       vmovss    xmm1,dword ptr [rsp]       vaddss    xmm0,xmm0,xmm1       add       rsp,8       ret; Total bytes of code 46
```
In contrast, here’s .NET 9:
```
; Tests.Sum()       vmovss    xmm0,dword ptr [7F4D10FD9080]       ret; Total bytes of code 9
```
It’s simply loading a value and returning it, since the whole operation compiled down to a constant.
Enum.{Try}Parse. Interop scenarios drove the introduction of two newRuntimeHelpers APIs,SizeOf indotnet/runtime#100618 andBox indotnet/runtime#100561. But,dotnet/runtime#100846 was then able to utilize these APIs to optimize the implementation of the non-genericEnum.Parse andEnum.TryParse overloads, which give back the parsed enum value asobject. This is a special kind of boxing, because the parse methods internally extract a numerical value but then need the boxed object to be of the enum type (rather than the numerical type) specified via theType argument.
```
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private string _input = "Monday";    [Benchmark]    public object Parse() => Enum.Parse(typeof(DayOfWeek), _input);}
```
Method Runtime Mean Ratio
Parse .NET 8.0 62.01 ns 1.00
Parse .NET 9.0 28.13 ns 0.45

Method	Runtime	Mean	Ratio	Code Size
Sum	.NET 8.0	5.2719 ns	1.000	46 B
Sum	.NET 9.0	0.0177 ns	0.003	9 B

Method	Runtime	Mean	Ratio
Parse	.NET 8.0	62.01 ns	1.00
Parse	.NET 9.0	28.13 ns	0.45

Integer Division. Consider this benchmark:

// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments(5)]    public uint DivideBy4_UInt32(uint value) => value / 4;    [Benchmark]    [Arguments(5)]    public int DivideBy4_Int32(int value) => value / 4;}

With theuint-based example, dividing by 4 is already optimized into a simple right shift, since for auint,value / 4 andvalue >> 2 are functionally equivalent. However, that’s not the case for anint, or at least, not always. For a non-negativeint, the same optimization could be employed, but if theint is negative, for some values switching fromvalue / 4 tovalue >> 2 would be functionally incorrect. Consider-5 / 4… the answer is-1. But-5 >> 2 is-2. Oops. So when you look at the assembly code for theint case (here on .NET 8), it’s more complex:

; Tests.DivideBy4_UInt32(UInt32)       mov       eax,esi       shr       eax,2       ret; Total bytes of code 6; Tests.DivideBy4_Int32(Int32)       mov       eax,esi       sar       eax,1F       and       eax,3       add       eax,esi       sar       eax,2       ret; Total bytes of code 14

Given that, you might hope that if the compiler could prove that theint was non-negative, it could still employ the simpler shifting:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments(5)]    public int DivideBy4_Int32(int value) => value < 4 ? 0 : value / 4;}

But alas, on .NET 8, we still get:

; Tests.DivideBy4_Int32(Int32)       cmp       esi,4       jl        short M00_L00       mov       eax,esi       sar       eax,1F       and       eax,3       add       eax,esi       sar       eax,2       retM00_L00:       xor       eax,eax       ret; Total bytes of code 22

On .NET 9,dotnet/runtime#94347 updates the JIT for exactly that, replacing signed division with unsigned division if it can prove that both the numerator and denominator are non-negative.

; Tests.DivideBy4_Int32(Int32)       cmp       esi,4       jl        short M00_L00       mov       eax,esi       shr       eax,2       retM00_L00:       xor       eax,eax       ret; Total bytes of code 14

Nullable. Several optimizations went into makingNullable<T> cheaper, in particular when used with generics. Consider this benchmark:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public void TestStruct() => Dispose<List.Enumerator>(default);    [Benchmark]    public void TestNullableStruct() => Dispose<List<int>.Enumerator?>(default);    [MethodImpl(MethodImplOptions.AggressiveInlining)]    private static void Dispose<T>(T t)    {        if (t is IDisposable)        {            ((IDisposable)t).Dispose();        }    }}

We have an unconstrained generic methodDispose whose job it is to cast the argument toIDisposable and invoke itsDispose. While such an operation would seemingly box ifT were a value type, for a long time now the JIT has had optimizations that end up eliminating that boxing. In the case of theList<T>.Enumerator, itsDispose implementation is a nop, so withDispose<T> getting inlined, no boxing, and theIDisposable.Dispose implementation nop’ing, this whole method is a nop (on both .NET 8 and .NET 9):

; Tests.TestStruct()       ret; Total bytes of code 1

That’s unfortunately not the case forTestNullableStruct. Theonly difference betweenTestStruct andTestNullableStruct is that pesky? in the generic type argument, which meansT will be aNullable<List<int>.Enumerator> rather thanList<int>.Enumerator. That complicates things.Nullable<T> is very special, with a boxed nullable implementing the same interfaces as does the underlying struct, but it ends up being very hard for the JIT to deal with. On .NET 8, we end up with this assembly:

; Tests.TestNullableStruct()       push      rbp       sub       rsp,20       lea       rbp,[rsp+20]       vxorps    xmm8,xmm8,xmm8       vmovdqa   xmmword ptr [rbp-20],xmm8       vmovdqa   xmmword ptr [rbp-10],xmm8       cmp       byte ptr [rbp-20],0       jne       short M00_L01M00_L00:       add       rsp,20       pop       rbp       retM00_L01:       lea       rsi,[rbp-20]       mov       rdi,offset MT_System.Nullable`1[[System.Collections.Generic.List`1+Enumerator[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]]       call      CORINFO_HELP_BOX_NULLABLE       mov       rsi,rax       mov       rdi,offset MT_System.IDisposable       call      qword ptr [7F178F2543C0]; System.Runtime.CompilerServices.CastHelpers.ChkCastInterface(Void*, System.Object)       mov       rdi,rax       mov       r11,7F178E5904C0       call      qword ptr [r11]       jmp       short M00_L00; Total bytes of code 93

That’s a whole lot more than aret. Thankfully, for .NET 9,dotnet/runtime#95764 makes this better by optimizingcastclass forNullable<T>:

; Tests.TestNullableStruct()       sub       rsp,28       xor       eax,eax       mov       [rsp+8],rax       vxorps    xmm8,xmm8,xmm8       vmovdqa   xmmword ptr [rsp+10],xmm8       mov       [rsp+20],rax       cmp       byte ptr [rsp+8],0       jne       short M00_L01M00_L00:       add       rsp,28       retM00_L01:       lea       rsi,[rsp+8]       mov       rdi,offset MT_System.Nullable`1[[System.Collections.Generic.List`1+Enumerator[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]]       call      CORINFO_HELP_BOX_NULLABLE       movsx     rax,byte ptr [rax+8]       jmp       short M00_L00; Total bytes of code 66

We still have thecall toCORINFO_HELP_BOX_NULLABLE, but the relatively expensivecall toChkCastInterface is now gone. While this may seem a little corner case, it actually shows up in well-known places. For example:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int? _value = 42;    [Benchmark]    public string Interpolate() => $"{_value}";}

Here we’re just doing string interpolation, using a nullable value type as one of the arguments. TheDefaultInterpolatedStringHandler has a genericAppendFormatted method which this will end up using, passing theNullable<int> as its argument, and that method employs similar patterns of type testing for an interface and using it if it’s available. And as a result, this optimization can have a measurable impact on such interpolated string use:

Method	Runtime	Mean	Ratio
Interpolate	.NET 8.0	78.12 ns	1.00
Interpolate	.NET 9.0	62.95 ns	0.81

AnotherNullable<T>-related optimization isdotnet/runtime#95711, which ends up avoiding boxing for some forms of type testing. Consider this:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int? _value = 42;    [Benchmark]    public bool Test() => IsInt(_value);    private static bool IsInt<T>(T value) => value is int;}

This should be relatively straightforward: the JIT can see thatT is aNullable<int>, and then whether it satisfies the type test is a question of whether the value isnull or not, since if it’snull, it’s not anint, and if it’s notnull, it is anint. Unfortunately, on .NET 8, not so much:

; Tests.Test()       push      rbp       sub       rsp,10       lea       rbp,[rsp+10]       mov       rsi,[rdi+8]       mov       [rbp-8],rsi       lea       rsi,[rbp-8]       mov       rdi,offset MT_System.Nullable`1[[System.Int32, System.Private.CoreLib]]       call      CORINFO_HELP_BOX_NULLABLE       test      rax,rax       je        short M00_L00       mov       rcx,offset MT_System.Int32       xor       edx,edx       cmp       [rax],rcx       cmovne    rax,rdxM00_L00:       test      rax,rax       setne     al       movzx     eax,al       add       rsp,10       pop       rbp       ret; Total bytes of code 76

In fact, we can see it’s usingCORINFO_HELP_BOX_NULLABLE tobox theNullable<int>, which means we actually end up with an allocation as part of this type test. And that’s visible in the benchmark results:

Method	Runtime	Mean	Ratio	Code Size	Allocated	Alloc Ratio
Test	.NET 8.0	39.1567 ns	1.000	76 B	24 B	1.00
Test	.NET 9.0	0.0006 ns	0.000	5 B	–	0.00

On .NET 9, it ends up being what we thought it should be, a simplenull check:

; Tests.Test()       movzx     eax,byte ptr [rdi+8]       ret; Total bytes of code 5

where the result of the method is simplyNullable<T>.HasValue.

As a small tangent since we’re talking about optimizing casting,dotnet/runtime#98284 improves code generation for casts where the JIT can end up seeing that the object being cast isnull (while you’d probably never explicitly writeif (null is SomeClass), you might very well writeif (GetObject() is SomeClass) wereGetObject() might get inlined and returnnull, especially ifGetObject() is virtual and due to dynamic PGO anull-returning override gets inlined).

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public Tests? NullCast() => GetObj() as Tests;    private object? GetObj() => null;}

On .NET 8, it doesn’t pay attention to whether it knows that the source will benull, but now in .NET 9, it does:

// .NET 8; Tests.NullCast()       push      rax       mov       rdi,offset MT_Tests       xor       esi,esi       call      qword ptr [7F0457E24360]; System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)       nop       add       rsp,8       ret; Total bytes of code 25// .NET 9; Tests.NullCast()       push      rax       xor       eax,eax       add       rsp,8       ret; Total bytes of code 8

Back toNullable<T>,dotnet/runtime#105073 enables the JIT to inline the fast path of the unboxing helper that’s used when extracting aNullable<T> from anobject. There’s anCORINFO_HELP_UNBOX_NULLABLE helper function that’s called to perform the unboxing (e.g.(int?)o for someobject o), but the success path (where the object is eithernull or the boxed target type) is small and it’s worth inlining that.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private object _o = 42;    [Benchmark]    public int? Unbox() => (int?)_o;}

On .NET 8, we get the following, effectively just a call toCORINFO_HELP_UNBOX_NULLABLE:

; Tests.Unbox()       push      rax       mov       rdx,[rdi+8]       lea       rdi,[rsp]       mov       rsi,offset MT_System.Nullable`1[[System.Int32, System.Private.CoreLib]]       call      CORINFO_HELP_UNBOX_NULLABLE       mov       rax,[rsp]       add       rsp,8       ret; Total bytes of code 33

whereas on .NET 9, we get the following, which is creating a defaultNullable<int> if the object isnull, or aNullable<int> with the value from the object if it’s a boxedint, or callingCORINFO_HELP_UNBOX_NULLABLE if it’s something else (in which case we’ll be throwing an exception shortly):

; Tests.Unbox()       push      rbp       sub       rsp,10       lea       rbp,[rsp+10]       mov       rdx,[rdi+8]       test      rdx,rdx       jne       short M00_L00       xor       edx,edx       mov       [rbp-8],rdx       jmp       short M00_L01M00_L00:       mov       rax,offset MT_System.Int32       cmp       [rdx],rax       jne       short M00_L02       mov       byte ptr [rbp-8],1       mov       eax,[rdx+8]       mov       [rbp-4],eaxM00_L01:       mov       rax,[rbp-8]       add       rsp,10       pop       rbp       retM00_L02:       lea       rdi,[rbp-8]       mov       rsi,offset MT_System.Nullable`1[[System.Int32, System.Private.CoreLib]]       call      CORINFO_HELP_UNBOX_NULLABLE       jmp       short M00_L01; Total bytes of code 83

This is one of those cases where you actually want the code to be larger, at least for the micro-benchmark, because the inlining is the purpose and is bringing in more code.

Method	Runtime	Mean	Ratio	Code Size
Unbox	.NET 8.0	6.014 ns	1.00	33 B
Unbox	.NET 9.0	2.854 ns	0.47	83 B

BigInteger

Not exactly a “primitive” type, but in the same ballpark, isBigInteger. As withsbyte,short,int, andlong,System.Numerics.BigInteger is anIBinaryInteger<> andISignedNumber<>. Unlike those types, which are all of a fixed bit size (8, 16, 32, and 64 bits, respectively),BigInteger can represent signed integers with any number of bits (within reason… the current representation allows up toArray.MaxLength / 64 bits, which means representing numbers up to 2^33,554,432… that’s… big). Such large sizes brings with it performance complexities, and historicallyBigInteger hasn’t been a beacon of high throughput. While there’s still more that can be done (and in fact there are several pending PRs even as I write this), a bunch of nice changes have landed for .NET 9.

dotnet/runtime#91176 from@Rob-Hague improvedBigInteger‘sbyte-based constructors (e.g.public BigInteger(byte[] value)) by utilizing vectorized operations fromMemoryMarshal andBinaryPrimitives. In particular, a lot of the time spent in theseBigInteger constructors is in walking the list of bytes, building up integers out of each grouping of four, and storing those into a destinationuint[]. With spans, however, that whole operation is unnecessary and can be achieved with an optimizedCopyTo operation (effectively amemcpy) with the destination just being thatuint[] reinterpreted as a span of bytes.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Numerics;using System.Security.Cryptography;BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private byte[] _bytes;    [GlobalSetup]    public void Setup()    {        _bytes = new byte[10_000];        new Random(42).NextBytes(_bytes);    }    [Benchmark]    public BigInteger NewBigInteger() => new BigInteger(_bytes);}

Method	Runtime	Mean	Ratio
NewBigInteger	.NET 8.0	5.886 us	1.00
NewBigInteger	.NET 9.0	1.434 us	0.24

Parsing is another common way of creatingBigIntegers.dotnet/runtime#95543 improved the performance of parsing hex and binary-formatted values (this is on top of the .NET 9 addition indotnet/runtime#85392 from@lateapexearlyspeed that added support for the"b" format specifier for formatting and parsingBigInteger as binary). Previously, parsing would go digit-by-digit, but with the new algorithm, it parses multiple chars at the same time, using a vectorized implementation for larger inputs.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Globalization;using System.Numerics;BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private string _hex = string.Create(1024, 0, (dest, _) => new Random(42).GetItems<char>("0123456789abcdef", dest));    [Benchmark]    public BigInteger ParseHex() => BigInteger.Parse(_hex, NumberStyles.HexNumber);}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
ParseHex	.NET 8.0	5,155.5 ns	1.00	5208 B	1.00
ParseHex	.NET 9.0	236.8 ns	0.05	536 B	0.10

This isn’t the first time efforts have been made to improveBigInteger parsing. .NET 7, for example, included a change that introduced a new parsing algorithm. The previous algorithm wasO(N^2) in the number of digits, and the new algorithm had a lower algorithmic complexity, but due to the constants involved was only worthwhile with a larger number of digits. Both algorithms were included, switching between them based on a cut-off of 20,000 digits. As it turns out, with more analysis, that threshold was significantly higher than it needed to be, anddotnet/runtime#97101 from@kzrnm lowered the threshold to a much smaller value (1233). On top of this,dotnet/runtime#97589 from@kzrnm improves parsing further by a) recognizing that the multiplier being used during parsing (shifting down digits to make room for adding in the next set) includes many leading zeros that can be ignored during the operation, and b) trailing zeros when parsing powers of 10 could be calculated more efficiently.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Numerics;BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private string _digits = string.Create(2000, 0, (dest, _) => new Random(42).GetItems<char>("0123456789", dest));    [Benchmark]    public BigInteger ParseDecimal() => BigInteger.Parse(_digits);}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
ParseDecimal	.NET 8.0	24.60 us	1.00	5528 B	1.00
ParseDecimal	.NET 9.0	18.95 us	0.77	856 B	0.15

Once you have aBigInteger, there are of course various operations you can do with it.BigInteger.Equals was improved bydotnet/runtime#91416 from@Rob-Hague, which changed the implementation to use the optimizedMemoryExtensions.SequenceEqual rather than walking the arrays backing eachBigInteger element-by-element.dotnet/runtime#104513 from@Rob-Hague improvedBigInteger.IsPowerOfTwo by similarly replacing a manual walk of the elements with a call toContainsAnyExcept, looking to see whether all elements after a certain point were 0.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Numerics;BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private BigInteger _value1, _value2;    [GlobalSetup]    public void Setup()    {        var value1 = new byte[10_000];        new Random(42).NextBytes(value1);        _value1 = new BigInteger(value1);        _value2 = new BigInteger(value1.AsSpan().ToArray());    }    [Benchmark]    public bool Equals() => _value1 == _value2;}

Method	Runtime	Mean	Ratio
Equals	.NET 8.0	1,110.38 ns	1.00
Equals	.NET 9.0	79.80 ns	0.07

dotnet/runtime#92208 from@kzrnm also improvedBigInteger.Multiply, and in particular when multiplying a first value that’s much larger than a second value.

using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Numerics;BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private BigInteger _value1 = BigInteger.Parse(string.Concat(Enumerable.Repeat("1234567890", 1000)));    private BigInteger _value2 = BigInteger.Parse(string.Concat(Enumerable.Repeat("1234567890", 300)));    [Benchmark]    public BigInteger MultiplyLargeSmall() => _value1 * _value2;}

Method	Runtime	Mean	Ratio
MultiplyLargeSmall	.NET 8.0	231.0 us	1.00
MultiplyLargeSmall	.NET 9.0	118.8 us	0.51

Lastly, in addition to parsing,BigInteger formatting also saw some improvements.dotnet/runtime#100181 removed various temporary buffer allocations that were occurring as part of formatting and optimized various calculations in order to reduce overheads while formatting these values.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Numerics;BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private BigInteger _value = BigInteger.Parse(string.Concat(Enumerable.Repeat("1234567890", 300)));    private char[] _dest = new char[10_000];    [Benchmark]    public bool TryFormat() => _value.TryFormat(_dest, out _);}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
TryFormat	.NET 8.0	102.49 us	1.00	7456 B	1.00
TryFormat	.NET 9.0	94.52 us	0.92	–	0.00

TensorPrimitives

Numerics has been a big focus for .NET over the last several releases. A large stable of numerical operations is now exposed on every numerical type as well as on a set of generic interfaces those types implement. But sometimes you want to perform the same operation on a set of values rather than on an individual value, and for that, we haveTensorPrimitives. .NET 8 introduced theTensorPrimitive type, which provides a plethora of numerical APIs, but for spans of them rather than for individual values. For example,float has aCosh method:

public static float Cosh(float x);

which provides thehyberbolic cosine of onefloat, and a corresponding method shows up on theIHyperbolicFunctions<TSelf> interface:

static abstract TSelf Cosh(TSelf x);

TensorPrimitives then has a corresponding method, but rather than accepting onefloat, it accepts a span of them, and rather than returning the results, it writes the results into a provided destination span:

public static void Cosh(ReadOnlySpan<float> x, Span<float> destination);

In .NET 8,TensorPrimitives provided approximately 40 such methods, and only did so forfloat. Now in .NET 9, this has been significantly expanded. There are now over 200 overloads onTensorPrimitives, covering most of the numerical operations that are also exposed on the generic math interfaces (and some that aren’t),and they’re exposed using generics, such that they can work with many more data types than justfloat. For example, while it maintains itsfloat-specific overload ofCosh for backwards binary compatibility,TensorPrimitives now also sports this overload:

public static void Cosh<T>(ReadOnlySpan<T> x, Span<T> destination)    where T : IHyperbolicFunctions<T>

such that it can be used withHalf,float,double,NFloat, or any custom floating-point type you might have, as long as it implements the relevant interface. Most of these operations are also vectorized, such that it’s more than just a simple loop around the corresponding scalar function.

// Add a <PackageReference Include="System.Numerics.Tensors" Version="9.0.0" /> to the csproj.// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Numerics.Tensors;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private float[] _source, _destination;    [GlobalSetup]    public void Setup()    {        var r = new Random(42);        _source = Enumerable.Range(0, 1024).Select(_ => (float)r.NextSingle()).ToArray();        _destination = new float[1024];    }    [Benchmark(Baseline = true)]    public void ManualLoop()    {        ReadOnlySpan<float> source = _source;        Span<float> destination = _destination;        for (int i = 0; i < source.Length; i++)        {            destination[i] = float.Cosh(source[i]);        }    }    [Benchmark]    public void BuiltIn()    {        TensorPrimitives.Cosh<float>(_source, _destination);    }}

Method	Mean	Ratio
ManualLoop	7,804.4 ns	1.00
BuiltIn	621.6 ns	0.08

A huge number of APIs is available, most of which see similar or better gains over the simple loop. Here’s what’s currently available in .NET 9, all as generic methods, and with multiple overloads available for most:

Abs, Acosh, AcosPi, Acos, AddMultiply, Add, Asinh, AsinPi, Asin, Atan2Pi, Atan2, Atanh, AtanPi, Atan, BitwiseAnd, BitwiseOr, Cbrt, Ceiling, ConvertChecked, ConvertSaturating, ConvertTruncating, ConvertToHalf, ConvertToSingle, CopySign, CosPi, Cos, Cosh, CosineSimilarity, DegreesToRadians, Distance, Divide, Dot, Exp, Exp10M1, Exp10, Exp2M1, Exp2, ExpM1, Floor, FusedMultiplyAdd, HammingDistance, HammingBitDistance, Hypot, Ieee754Remainder, ILogB, IndexOfMaxMagnitude, IndexOfMax, IndexOfMinMagnitude, IndexOfMin, LeadingZeroCount, Lerp, Log2, Log2P1, LogP1, Log, Log10P1, Log10, MaxMagnitude, MaxMagnitudeNumber, Max, MaxNumber, MinMagnitude, MinMagnitudeNumber, Min, MinNumber, MultiplyAdd, MultiplyAddEstimate, Multiply, Negate, Norm, OnesComplement, PopCount, Pow, ProductOfDifferences, ProductOfSums, Product, RadiansToDegrees, ReciprocalEstimate, ReciprocalSqrtEstimate, ReciprocalSqrt, Reciprocal, RootN, RotateLeft, RotateRight, Round, ScaleB, ShiftLeft, ShiftRightArithmetic, ShiftRightLogical, Sigmoid, SinCosPi, SinCos, Sinh, SinPi, Sin, SoftMax, Sqrt, Subtract, SumOfMagnitudes, SumOfSquares, Sum, Tanh, TanPi, Tan, TrailingZeroCount, Truncate, Xor

The possible speedups are even more pronounced on other operations and data types; for example, here is a manual implementation of hamming distance on two inputbyte arrays (hamming distance is the number of elements that differ between the two inputs), and an implementation usingTensorPrimitives.HammingDistance<byte>:

// Add a <PackageReference Include="System.Numerics.Tensors" Version="9.0.0" /> to the csproj.// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Numerics.Tensors;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private byte[] _x, _y;    [GlobalSetup]    public void Setup()    {        var r = new Random(42);        _x = Enumerable.Range(0, 1024).Select(_ => (byte)r.Next(0, 256)).ToArray();        _y = Enumerable.Range(0, 1024).Select(_ => (byte)r.Next(0, 256)).ToArray();    }    [Benchmark(Baseline = true)]    public int ManualLoop()    {        ReadOnlySpan<byte> source = _x;        Span<byte> destination = _y;        int count = 0;        for (int i = 0; i < source.Length; i++)        {            if (source[i] != destination[i])            {                count++;            }        }        return count;    }    [Benchmark]    public int BuiltIn() => TensorPrimitives.HammingDistance<byte>(_x, _y);}

Method	Mean	Ratio
ManualLoop	484.61 ns	1.00
BuiltIn	15.76 ns	0.03

A slew of PRs went into making this happen. The generic method surface area was added viadotnet/runtime#94555,dotnet/runtime#97192,dotnet/runtime#97572,dotnet/runtime#101435,dotnet/runtime#103305, anddotnet/runtime#104651. And then many more PRs added or improved vectorization, includingdotnet/runtime#97361,dotnet/runtime#97623,dotnet/runtime#97682,dotnet/runtime#98281,dotnet/runtime#97835,dotnet/runtime#97846,dotnet/runtime#97874,dotnet/runtime#97999,dotnet/runtime#98877,dotnet/runtime#103214 from@neon-sunset, anddotnet/runtime#103820.

As part of all of this work, there was also a recognition that we had the scalar operations and we had the operations on an unbounded number of elements as part of spans, but doing the latter efficiently required effectively having the same set of operations on the variousVector128<T>,Vector256<T>, andVector512<T> types, since the typical structure of one of these operations will process vectors of elements at time. As such, progress has been made towards also exposing the same set of operations on these vector types. That’s been done indotnet/runtime#104848,dotnet/runtime#102181,dotnet/runtime#103837,dotnet/runtime#97114, anddotnet/runtime#96455. More to come.

Other related numerical types have also seen improvements. Quaternion multiplication was vectorized indotnet/runtime#96624 by@TJHeuvel, anddotnet/runtime#103527 accelerated a variety of operations onQuaternion,Plane,Vector2,Vector3,Vector4,Matrix4x4, andMatrix3x2.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Numerics;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private Quaternion _value1 = Quaternion.CreateFromYawPitchRoll(0.5f, 0.3f, 0.2f);    private Quaternion _value2 = Quaternion.CreateFromYawPitchRoll(0.1f, 0.2f, 0.3f);    [Benchmark]    public Quaternion Multiply() => _value1 * _value2;}

Method	Runtime	Mean	Ratio
Multiply	.NET 8.0	3.064 ns	1.00
Multiply	.NET 9.0	1.086 ns	0.35

dotnet/runtime#102301 also moves a lot of the implementation for types likeQuaternion out of the JIT / native code into C#, something that’s only possible now because of many of the other improvements discussed elsewhere in this post.

Strings, Arrays, Spans

IndexOf

As previously noted inPerformance Improvements in .NET 8 and earlier in this post, my single favorite performance improvement in .NET 8 came from enabling dynamic PGO. But my second favorite improvement came from the introduction ofSearchValues<T>.SearchValues<T> enables optimizing searches, by pre-computing an algorithm to use when searching for a specific set of values (or for anything other than those specific values) and storing that information for later repeated use. Internally, .NET 8 included upwards of 15 different implementations that might be chosen based on the nature of the supplied data. The type was so good at what it did that it was used in over 60 places as part of the .NET 8 release. In .NET 9, it’s used even more, and it gets even better, in a multitude of ways.

TheSearchValues<T> type is generic, so in theory it can be used for anyT, but in practice, the algorithms involved need to special-case the nature of the data, and so theSearchValues.Create factory methods only enabled creatingSearchValues<byte> andSearchValues<char> instances for which dedicated implementations were provided. For example, many of the previously noted uses ofSearchValues<T> are searching for a subset of ASCII, such as this use fromRegex.Escape, which enables quickly searching for all characters that require escaping:

private static readonly SearchValues<char> s_metachars = SearchValues.Create("\t\n\f\r #$()*+.?[\\^{|");

If you print out the name of the type of the instance returned by thatCreate call, as an implementation detail today you’ll see something like this:

System.Buffers.AsciiCharSearchValues`1[System.Buffers.IndexOfAnyAsciiSearcher+Default]

That type provides a specialization ofSearchValues<char> optimized for searching for any ASCII subset, doing so with an implementation based on the “Universal algorithm” described athttp://0x80.pl/articles/simd-byte-lookup.html. Essentially, the algorithm maintains an 8 by 16 bitmap, which not coincidentally is the size of ASCII (0 through 127). Each of the 128 bits in the bitmap represents whether the corresponding ASCII value is in the set. The input chars are mapped down to bytes in a way where chars greater than 127 are mapped to a value meaning no match. The lower nibble (4 bits) of the ASCII value is used to select one of the 16 bitmap rows, and the upper nibble is used to select one of the 8 bitmap columns. And the beauty of this algorithm is, on most supported platforms, there exist SIMD instructions that enable the processing of many characters concurrently as part of just a few instructions.

So, in .NET 8,SearchValues<T> was only forbyte andchar. But, now in .NET 9, thanks todotnet/runtime#88394,dotnet/runtime#96429,dotnet/runtime#96928,dotnet/runtime#98901, anddotnet/runtime#98902, you can also createSearchValues<string> instances. The string handling is different frombyte andchar, however. Withbyte, you’re searching for one of a set ofbytes within a span ofbytes. Withchar, you’re searching for one of a set ofchars within a span ofchars. But withstring,SearchValues<string> doesn’t search for one of a set ofstrings within a span ofstrings, but rather it enables searching for one of a set ofstrings within a span ofchars. In other words, it’s a multi-string search. For example, let’s say you want to search some text for the ISO 8601 days of the week, and to do so in an ordinal case-insensitive manner (such that, for example, both “Monday” and “MONDAY” would match). That can now be expressed like this:

private static readonly SearchValues<string> s_daysOfWeek = SearchValues.Create(    ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],    StringComparison.OrdinalIgnoreCase);...ReadOnlySpan<char> textToSearch = ...;int i = textToSearch.IndexOfAny(s_daysOfWeek);

This also highlights another interesting difference from the existingbyte andchar support. For those types,SearchValues is purely an optimization:IndexOfAny overloads have long existed for searching for sets ofT values within a larger collection ofTs (e.g.string.IndexOfAny(char[] anyOf) was introduced over two decades ago), and theSearchValues support simply makes those use cases faster (oftenmuch faster). In contrast, until .NET 9 there have not been any built-in methods for doing multi-string search, so this new support both adds such support and adds it in a way that is highly-efficient.

But, let’s say we did want to perform such a search, without that functionality existing in the core libraries. One approach is to simply walk through the input, position by position, comparing each of the target values at that location:

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;    private static readonly string[] s_daysOfWeek = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];    [Benchmark(Baseline = true)]    public bool Contains_Iterate()    {        ReadOnlySpan<char> input = s_input;        for (int i = 0; i < input.Length; i++)        {            foreach (string dow in s_daysOfWeek)            {                if (input.Slice(i).StartsWith(dow, StringComparison.OrdinalIgnoreCase))                {                    return true;                }            }        }        return false;    }}

Method	Mean	Ratio
Contains_Iterate	227.526 us	1.000

Classic. Functional. And slow. This is doing a fair amount of work for every single character in the input, for each looping over every day name and doing a comparison. How can we do better? First, we could try making the inner loop more efficient. Rather than iterating through the strings, we could hardcode our own switch:

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;    [Benchmark]    public bool Contains_Iterate_Switch()    {        ReadOnlySpan<char> input = s_input;        for (int i = 0; i < input.Length; i++)        {            ReadOnlySpan<char> slice = input.Slice(i);            switch ((char)(input[i] | 0x20))            {                case 's' when slice.StartsWith("Sunday", StringComparison.OrdinalIgnoreCase) || slice.StartsWith("Saturday", StringComparison.OrdinalIgnoreCase):                case 'm' when slice.StartsWith("Monday", StringComparison.OrdinalIgnoreCase):                case 't' when slice.StartsWith("Tuesday", StringComparison.OrdinalIgnoreCase) || slice.StartsWith("Thursday", StringComparison.OrdinalIgnoreCase):                case 'w' when slice.StartsWith("Wednesday", StringComparison.OrdinalIgnoreCase):                case 'f' when slice.StartsWith("Friday", StringComparison.OrdinalIgnoreCase):                    return true;            }        }        return false;    }}

The main benefit of this is it makes theStartsWith calls much more efficient. Because each call is dedicated to a specific needle that the JIT can see, it can emit customized code to optimize that comparison (for context on my choice of language, “needle” is often used when describing a thing being searched for, a reference to the proverbial “needle in a haystack,” and thus “haystack” is used to describe the thing being searched). We’re also reducing the number of cases in the switch by employing an ASCII casing trick; the upper case ASCII letters differ in numerical value from the lower case ASCII letters by a single bit, so we simply ensure that bit is set and then compare against only the lower case letters.

Method	Mean	Ratio
Contains_Iterate	227.526 us	1.000
Contains_Iterate_Switch	13.885 us	0.061

Much better, more than a 16x improvement. What if we instead just kept things simple and searched for each individual string using the already-optimizedIndexOf?

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;    private static readonly string[] s_daysOfWeek = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];    [Benchmark]    public bool Contains_ContainsEachNeedle()    {        ReadOnlySpan<char> input = s_input;        foreach (string dow in s_daysOfWeek)        {            if (input.Contains(dow, StringComparison.OrdinalIgnoreCase))            {                return true;            }        }        return false;    }}

Nice and simple, but…

Method	Mean	Ratio
Contains_Iterate	227.526 us	1.000
Contains_Iterate_Switch	13.885 us	0.061
Contains_ContainsEachNeedle	302.330 us	1.329

Ouch. On the positive side, this approach benefits from vectorization, as theContains operation itself is vectorized to efficiently check multiple locations at once using SIMD. Unfortunately, this case is heavily impacted by the order in which we perform the search. As it turns out, most of the days of the week show up in the input text (in this case, “War and Peace”), but at very different positions, and Monday doesn’t show up at all. This:

using var hc = new HttpClient();var s = await hc.GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt");Console.WriteLine($"Length:    {s.Length}");Console.WriteLine($"Monday:    {s.IndexOf("Monday", StringComparison.OrdinalIgnoreCase)}");Console.WriteLine($"Tuesday:   {s.IndexOf("Tuesday", StringComparison.OrdinalIgnoreCase)}");Console.WriteLine($"Wednesday: {s.IndexOf("Wednesday", StringComparison.OrdinalIgnoreCase)}");Console.WriteLine($"Thursday:  {s.IndexOf("Thursday", StringComparison.OrdinalIgnoreCase)}");Console.WriteLine($"Friday:    {s.IndexOf("Friday", StringComparison.OrdinalIgnoreCase)}");Console.WriteLine($"Saturday:  {s.IndexOf("Saturday", StringComparison.OrdinalIgnoreCase)}");Console.WriteLine($"Sunday:    {s.IndexOf("Sunday", StringComparison.OrdinalIgnoreCase)}");

yields this:

Length:    3293614Monday:    -1Tuesday:   971396Wednesday: 10652Thursday:  107470Friday:    640801Saturday:  1529549Sunday:    891753

That means that whereasContains_Iterate_Switch only needs to examine 10,652 positions (the position of the first “Wednesday”) before it finds a match,Contains_ContainsEachNeedle needs to examine 3,293,614 (no match found for “Monday” so it’ll look at everything) + 971,396 (the index of “Tuesday”) == 4,265,010 positions before it finds a match. That’s 400x as many positions to be examined as the iterative approach. Even the SIMD vectorization gains can’t match that gap in amount of work to be performed.

Ok, so what if we changed approach, and instead searched for the first letter in each word in order to quickly skip past the locations that couldn’t possibly match. We could even useSearchValues<char> to perform that search.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;    private static readonly SearchValues<char> s_daysOfWeekFCSV = SearchValues.Create(['S', 's', 'M', 'm', 'T', 't', 'W', 'w', 'F', 'f']);    [Benchmark]    public bool Contains_IndexOfAnyFirstChars_SearchValues()    {        ReadOnlySpan<char> input = s_input;        int i;        while ((i = input.IndexOfAny(s_daysOfWeekFCSV)) >= 0)        {            ReadOnlySpan<char> slice = input.Slice(i);            switch ((char)(input[i] | 0x20))            {                case 's' when slice.StartsWith("Sunday", StringComparison.OrdinalIgnoreCase) || slice.StartsWith("Saturday", StringComparison.OrdinalIgnoreCase):                case 'm' when slice.StartsWith("Monday", StringComparison.OrdinalIgnoreCase):                case 't' when slice.StartsWith("Tuesday", StringComparison.OrdinalIgnoreCase) || slice.StartsWith("Thursday", StringComparison.OrdinalIgnoreCase):                case 'w' when slice.StartsWith("Wednesday", StringComparison.OrdinalIgnoreCase):                case 'f' when slice.StartsWith("Friday", StringComparison.OrdinalIgnoreCase):                    return true;            }            input = input.Slice(i + 1);        }        return false;    }}

In some situations, this is a very viable strategy; in fact, it’s a technique often employed byRegex. In other situations, it’s less appropriate. The potential problem is that letters like ‘s’ and ‘t’ are incredibly common. The characters here (‘s’, ‘m’, ‘t’, ‘w’, and ‘f’), both upper- and lower-case variants, make up ~17% of the input text (in contrast to just the capital subset, which makes up only ~0.54%). That means that, on average, thisIndexOfAny call needs to break out of its inner vectorized processing loop every six characters, which decreases the possible efficiency gains from said vectorization. Even so, this is still our best so far:

Method	Mean	Ratio
Contains_Iterate	227.526 us	1.000
Contains_Iterate_Switch	13.885 us	0.061
Contains_ContainsEachNeedle	302.330 us	1.329
Contains_IndexOfAnyFirstChars_SearchValues	7.151 us	0.031

Now, let’s try withSearchValues<string>:

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;    private static readonly SearchValues<string> s_daysOfWeekSV = SearchValues.Create(        ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],         StringComparison.OrdinalIgnoreCase);    [Benchmark]    public bool Contains_StringSearchValues() =>        s_input.AsSpan().ContainsAny(s_daysOfWeekSV);}

The functionality is built-in, so we haven’t had to write any custom logic other than the call toContainsAny. And the results:

Method	Mean	Ratio
Contains_Iterate	227.526 us	1.000
Contains_Iterate_Switch	13.885 us	0.061
Contains_ContainsEachNeedle	302.330 us	1.329
Contains_IndexOfAnyFirstChars_SearchValues	7.151 us	0.031
Contains_StringSearchValues	2.153 us	0.009

Not only simpler, then, but also several times faster than the fastest result we’d previously managed, and ~105x faster than our original attempt. Sweet!

How does this all work? The algorithms behind it are quite fascinating. As withbyte andchar, there are multiple concrete implementations that might be employed, selected based on the exact needle values passed toCreate. The simplest implementations are those for handling degenerate cases, like zero inputs (in which case all of the methods can just return hard-coded “not found” results). There’s also a dedicated implementation for a single input, in which case it can perform the same search asIndexOf(needle) would have done, but lifting out the choice of characters within the needle for which to perform a vectorized search.IndexOf(string) chooses a couple of characters from the needle (typically just the first and last character in the needle), creates a vector for each of those, and then with appropriate offsets based on the distance between the chosen characters, iterates through the input, comparing against those vectors, and doing a full string comparison only if both vectors match at a particular location.SearchValues<string> does the same thing (in an internal implementation today calledSingleStringSearchValuesThreeChars), except it uses three instead of two characters, and it employs frequency analysis to choose those characters rather than simply picking the first and last, trying to use characters that are less likely to appear in general (e.g. given the string “amazing”, it’d likely pick the ‘m’, ‘z’, and ‘g’, as it deems those statistically less likely in average inputs than ‘a’, ‘i’, or ‘n’). It can take more time to do this given it can perform the computation once and then cache it for all subsequent searches. We’ll refer back to this in a bit.

Beyond those special cases, it starts to get really interesting. There’s been a lot of research done over the last 50 years for the most efficient ways to perform a multi-string search. One popular algorithm is Rabin-Karp, which was created by Richard Karp and Michael Rabin in the 1980s, and which works via a “rolling hash.” Imagine creating a hash of the first N characters in the haystack (input) text, where N is the length of the needle (the substring) for which you’re searching, and comparing the haystack hash against the needle hash; if they match, do the actual full comparison at that location, otherwise continue. Then update the hash by removing the first character and adding the next character, and repeat the check. And then repeat, and repeat, and so on. Each time you move forward, you’re just updating the hash via a fixed number of operations, meaning that all of the updates to the hash function for the whole operation are onlyO(Haystack). Best case, you only find a single location that the substring could match, and you’ve gotO(Haystack + Needle) algorithmic complexity. Worst case (but generally unlikely), every location is a possible match, and you’ve gotO(Haystack * Needle) algorithmic complexity. A simple implementation might look like this (for pedagogical purposes, this uses a terrible hash function that just sums the character’s numerical values; the real algorithm recommends a better one):

private static bool RabinKarpContains(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle){    if (haystack.Length >= needle.Length)    {        // Hash the needle and the first needle.Length chars of the haystack.        // Super simple hash for pedagogical purposes: just sum the chars.        int i, rollingHash = 0, needleHash = 0;        for (i = 0; i < needle.Length; i++)        {            rollingHash += haystack[i];            needleHash += needle[i];        }        while (true)        {            // If the hashes match, compare the strings.            if (needleHash == rollingHash && haystack.Slice(i - needle.Length).StartsWith(needle))            {                return true;            }            // If we've reached the end of the haystack, break.            if (i == haystack.Length)            {                break;            }            // Update the rolling hash.            rollingHash += haystack[i] - haystack[i - needle.Length];            i++;        }    }    return needle.IsEmpty;}

This supports one needle, but extending to support multiple needles can be accomplished in a variety of ways, such as by bucketing needles by their hash codes (ala what a hash map does), and then either checking all needles in the corresponding bucket when there’s a hit, or further reduction in what needs to be checked based on using a Bloom filter or similar technique.SearchValues<string> will utilize Rabin-Karp, but only for very short inputs, as for longer inputs there are more efficient approaches.

Another popular algorithm is Aho-Corasick, which was designed by Alfred Aho and Margaret Corasick even earlier, in the 1970s. Its primary purpose is multi-string search, enabling finding a match to be performed in linear time in the length of the input, assuming a fixed set of needles. It works by building up a form of a trie, a finite automaton where you start at the root of the graph and transition to children based on matching the character associated with the edge to that child. But, it extends a typical trie with additional edges between nodes that can be used as fallbacks. For example, here’s the automaton for the days of the week discussed earlier: Aho-Corasick automaton for the days of the week Given the input text “wednesunday”, this will start at the root, progress through the “w”, “we”, “wed”, “wedn”, “wedne”, and “wednes” nodes, but then upon encountering the subsequent ‘u’ and not being able to progress down that path, it’ll employ the fallback link over to the “s” node, at which point it’ll be able to traverse down through “s”, “su”, and so on, until it hits the leaf “sunday” node and can declare success. Aho-Corasick efficiently supports larger strings, and is the go-to implementationSearchValues<string> uses as a general fallback. However, in many situations, it can do even better…

The real workhorse ofSearchValues<string> that’s chosen whenever possible is a vectorized implementation of the “Teddy” algorithm. This algorithm originated in Intel’sHyperscan library, was later adopted by the Rust aho_corasick crate, and is now employed as part ofSearchValues<string> in .NET 9. It is super cool, and super efficient.

Earlier, I gave a rough summary of how theSingleStringSearchValuesThreeChars andIndexOfAnyAsciiSearcher implementations work.SingleStringSearchValuesThreeChars optimizes finding likely positions where a substring might start, reducing the number of false positives by checking for multiple contained characters, and then likely positions are validated by doing the full string comparison at that location. AndIndexOfAnyAsciiSearcher optimizes finding the next position of any character in a large-ish set. You can think of Teddy as a combination of those. There’s a really nice description of the algorithmin the source, so I won’t go into much detail here. In summary, though, it maintains a similar bitmap as withIndexOfAnyAsciiSearcher, but instead of a single bit per ASCII character, it maintains an 8-bit bitmap for each nibble, and instead of just one bitmap, it maintains two or three, each of which corresponds to a location in the substrings (e.g. one bitmap for the 0th character and one bitmap for the 1st character). Those 8 bits in the bitmap are used to indicate which of up to 8 needles contain that nibble at that location. If there are fewer than 8 needles being searched for, then each of these bits individually identifies one of them, and if there are more than 8 needles, just as with Rabin-Karp, we can create buckets of the needle substrings, with a bit in the bitmap referring to one of the buckets. If the comparisons against the bitmaps indicates a likely match, the full match is performed against the relevant needle (or needles, in the case of matching a bucket). And as withIndexOfAnyAsciiSearcher, all of this support employs SIMD instructions to perform the lookups on chunks of input text from between 16 and 64 characters at a time, yielding significant speedups.

SearchValues<string> is great for larger numbers of strings, but it’s relevant even for just a few. Consider, for example, this code from MSBuild that’s part of parsing build output looking for warnings and errors:

if (message.IndexOf("warning", StringComparison.OrdinalIgnoreCase) == -1 &&    message.IndexOf("error", StringComparison.OrdinalIgnoreCase) == -1){    return null;}

Rather than doing two individual searches, we can perform a single search:

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;    private static readonly SearchValues<string> s_warningError = SearchValues.Create(["warning", "error"], StringComparison.OrdinalIgnoreCase);    [Benchmark(Baseline = true)]    public bool TwoContains() =>        s_input.Contains("warning", StringComparison.OrdinalIgnoreCase) ||        s_input.Contains("error", StringComparison.OrdinalIgnoreCase);    [Benchmark]    public bool ContainsAny() =>        s_input.AsSpan().ContainsAny(s_warningError);}

This is searching “War and Peace” for both “warning” or “error”, but even though both appear in the text, such that the second search for “error” in the original code will never happen, theSearchValues<string> search ends up being faster because “error” appears much earlier in the text than does “warning”.

Method	Mean	Ratio
TwoContains	70.03 us	1.00
ContainsAny	14.05 us	0.20

BeyondSearchValues<string>, the existingSearchValues<byte> andSearchValues<char> support also gets a variety of boosts in .NET 9.dotnet/runtime#96588, for example, makes some commonSearchValues<char> searches faster, specifically when there are 2 or 4 characters being searched for that represent 1 or 2 ASCII case-insensitive characters, such as['A', 'a'] or['A', 'a', 'B', 'b']. In .NET 8, for['A', 'a'], for example,SearchValues.Create will end up picking an implementation that will create a vector for each of'A' and'a', and then in the inner loop of the search, it’ll compare each vector against the input haystack text. This PR teaches it to do a similar ASCII trick we discussed earlier: rather than having two separate vectors, it can have a single vector for'a', and then do a single comparison against the input vector that’s been OR’d with0x20, such that any'A's become'a's. The OR plus a single comparison is cheaper than the two comparisons plus the OR of the resulting comparisons. Funnily enough, this needn’t even be about casing: since all we’re doing is OR’ing in0x20, it applies to any two characters that differ by that same single bit.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly string s_input = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/100/pg100.txt").Result;    private static readonly SearchValues<char> s_symbols = SearchValues.Create("@`");    [Benchmark]    public bool ContainsAny() => s_input.AsSpan().ContainsAny(s_symbols);}

Method	Runtime	Mean	Ratio
ContainsAny	.NET 8.0	262.7 us	1.02
ContainsAny	.NET 9.0	232.3 us	0.90

The same thing applies with four characters: instead of doing four vector comparisons and three OR operations to combine them, we can do a single OR on the input to mix in0x20, two vector comparisons, and a single OR to combine those results. In fact, the four-vector approach was already more expensive than theIndexOfAnyAsciiSearcher implementation previously described, and since that supports any number of ASCII characters, when applicable even for just four-character needles,SearchValues.Create would have preferred that. But now in .NET 9 with this optimization,SearchValues.Create will prefer to use this specialized two-comparison path.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly string s_input = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/100/pg100.txt").Result;    private static readonly SearchValues<char> s_symbols = SearchValues.Create("@`^~");    [Benchmark]    public bool ContainsAny() => s_input.AsSpan().ContainsAny(s_symbols);}

Method	Runtime	Mean	Ratio
ContainsAny	.NET 8.0	247.5 us	1.01
ContainsAny	.NET 9.0	196.2 us	0.80

OtherSearchValues implementations also improve in .NET 9, notably the “ProbabilisticMap” implementations. These implementations are preferred bySearchValues<char> as a fallback when the faster vectorized implementations aren’t applicable but when the number of characters in the needle isn’t exorbitant (the current limit is 256). It works via a form of Bloom filter. Effectively, it maintains a 256-bit bitmap, with needle characters mapping to one or two bits, depending on thechar. If any of the bits for a givenchar isn’t1, then thechar is definitively not in the set. If all of the bits for a givenchar are1, then thecharmay be in the set, and a more expensive check needs to be performed to determine inclusion. Whether those bits are set is a vectorizable operation, and so as long as false positives are relatively rare (which is why there’s a limit on the number of characters; the more characters are represented, the more false positives there’s likely to be), it’s an efficient means for doing the search. However, this vectorization only applies to positive cases (e.g.IndexOfAny /ContainsAny) but not negative cases (e.g.IndexOfAnyExcept /ContainsAnyExcept); for those “Except” methods, the implementation still walks character by character, and the check it employed per character wasO(Needle). Thanks todotnet/runtime#101001, which replaces a linear search with a “perfect hash,” thatO(Needle) drops toO(1), making such “Except” calls much more efficient.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    public static readonly string s_aristotle = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/39963/pg39963.txt").Result;    public static readonly SearchValues<char> s_greekOrAsciiDigits = SearchValues.Create(        Enumerable.Range(0, char.MaxValue + 1)        .Where(i => Regex.IsMatch(((char)i).ToString(), @"[\p{IsGreek}0-9]"))        .Select(i => (char)i)        .ToArray());    [Benchmark]    public int CountNonGreekOrAsciiDigitsChars()    {        int count = 0;        ReadOnlySpan<char> text = s_aristotle;        int index;        while ((index = text.IndexOfAnyExcept(s_greekOrAsciiDigits)) >= 0)        {            count++;            text = text.Slice(index + 1);        }        return count;    }}

Method	Runtime	Mean	Ratio
CountNonGreekOrAsciiDigitsChars	.NET 8.0	1,814.7 us	1.00
CountNonGreekOrAsciiDigitsChars	.NET 9.0	881.7 us	0.49

That same PR also made another significant improvement related to the probabilistic map: not using it as much. It’s a terrific implementation for some sets of inputs, but for others it can end up performing poorly. .NET 8 included aLatin1CharSearchValues, which was used when some of the needle characters were non-ASCII but all were less than 256. In such cases, if the probabilistic map couldn’t vectorize,SearchValues.Create would return aLatin1CharSearchValues instance, which maintained a simple 256-bit bitmap that detailed whether each character is in the needle. For .NET 9, that type has been replaced by a more general one that supports arbitrarily large bitmaps, used when there are simply too many for the probabilistic map implementation to handle well or when the values are sufficiently dense. Consider a case like this:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    public static readonly string s_markTwain = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;    public static readonly SearchValues<char> s_greekChars = SearchValues.Create(        Enumerable.Range(0, char.MaxValue + 1)        .Where(i => Regex.IsMatch(((char)i).ToString(), @"[\p{IsGreek}\p{IsGreekExtended}]"))        .Select(i => (char)i)        .ToArray());    [Benchmark]    public int CountGreekChars()    {        int count = 0;        ReadOnlySpan<char> text = s_markTwain;        int index;        while ((index = text.IndexOfAny(s_greekChars)) >= 0)        {            count++;            text = text.Slice(index + 1);        }        return count;    }}

The needle here includes all of the characters in the Greek and Greek Extended Unicode blocks, approximately 400 characters. With the way the probabilistic map builds up its filter bitmap, every single bit in the bitmap ends up being set, which means every examined character will fall back to the expensive path. Now in .NET 9, it’ll use a simpler, non-probabilistic bitmap, and even though it’s not vectorized, it yields significantly faster throughput.

Method	Runtime	Mean	Ratio
CountGreekChars	.NET 8.0	126.454 ms	1.00
CountGreekChars	.NET 9.0	8.956 ms	0.07

dotnet/runtime#96931 also extended this probabilistic map support to benefit from AVX512 such that when the probabilistic map implementation is used, it can be significantly faster. Previously, its implementation would utilize 128-bit or 256-bit vectors, depending on hardware support, but now in .NET 9, it can also use 512-bit vectors. This not only possibly doubles throughput due to vector width, AVX512 also includes some applicable instructions that the older instruction sets don’t have (e.g.VPERMB, which is exposed asAvx512Vbmi.PermuteVar64x8), enabling even faster processing due to employing those more sophisticated instructions where relevant. This ends up being particularly impactful when searching for a reasonably small number of non-ASCII characters.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    public static readonly string s_aristotle = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/39963/pg39963.txt").Result;    public static readonly SearchValues<char> s_setSymbols = SearchValues.Create("⊂⊃⊆⊇⊄∩∪∈∊∉∋∍∌∅");    [Benchmark]    public int Count()    {        int count = 0;        ReadOnlySpan<char> text = s_aristotle;        int index;        while ((index = text.IndexOfAny(s_setSymbols)) >= 0)        {            count++;            text = text.Slice(index + 1);        }        return count;    }}

Method	Runtime	Mean	Ratio
Count	.NET 8.0	28.35 us	1.00
Count	.NET 9.0	13.19 us	0.47

Further, while the probabilistic map implementations were previously vectorized forIndexOfAny (and therefore implicitly forContainsAny), they weren’t forLastIndexOfAny, which means just changing whether you were searching from start to end or from end to start could have a significant impact on throughput.dotnet/runtime#102331 improves that as well, enabling theLastIndexOfAny path to also take advantage of SIMD.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    public static readonly string s_markTwain = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;    public static readonly SearchValues<char> s_accentedChars = SearchValues.Create(        ['À', 'È', 'Ì', 'Ò', 'Ù', 'Á', 'É', 'Í', 'Ó', 'Ú',          'Â', 'Ê', 'Î', 'Ô', 'Û', 'Ã', 'Ẽ', 'Ĩ', 'Õ', 'Ũ',          'Ä', 'Ë', 'Ï', 'Ö', 'Ü', 'Ÿ']);    [Benchmark]    public bool HasAnyAccented_IndexOfAny() => s_markTwain.AsSpan().IndexOfAny(s_accentedChars) >= 0;    [Benchmark]    public bool HasAnyAccented_LastIndexOfAny() => s_markTwain.AsSpan().LastIndexOfAny(s_accentedChars) >= 0;}

Method	Runtime	Mean	Ratio
HasAnyAccented_IndexOfAny	.NET 8.0	7.910 ms	1.00
HasAnyAccented_IndexOfAny	.NET 9.0	4.476 ms	0.57

HasAnyAccented_LastIndexOfAny	.NET 8.0	17.491 ms	1.00
HasAnyAccented_LastIndexOfAny	.NET 9.0	5.253 ms	0.30

In many of my examples, I’ve usedContainsAny rather thanIndexOfAny. The former is functionally equivalent toinput.IndexOfAny(searchValues) >= 0, and in fact that was the entirety of the implementation in .NET 8. However, asIndexOfAny employs vectorization and is comparing multiple elements as part of the same instruction, when a match is found, there is a bit of overhead involved to then determine exactly which element matched (or if multiple matched, which match had the lowest index).ContainsAny doesn’t actually need to care about the exact index: as exemplified by its implementation, it only cares about whether there was a match rather than where one was. As such, we can shave off some cycles by customizing the implementation forContainsAny to avoid that unnecessary computation, and that’s exactly whatdotnet/runtime#96924 does. The effects of this are most notable where that overhead would be measurable, which is when a match is found really early in the input.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private string _haystack = "Hello, world! How are you today?";    private static readonly SearchValues<char> s_vowels = SearchValues.Create("aeiou");    [Benchmark]    public bool ContainsAny() => _haystack.AsSpan().ContainsAny(s_vowels);}

Method	Runtime	Mean	Ratio
ContainsAny	.NET 8.0	3.640 ns	1.00
ContainsAny	.NET 9.0	2.382 ns	0.65

Improvements aroundSearchValues aren’t limited just to new APIs or the implementation of the APIs; there’s also been work to help developers better consumeSearchValues.dotnet/roslyn-analyzers#6898 anddotnet/roslyn-analyzers#7252 added a new analyzer (CA1870) that will find opportunities to useSearchValues and automatically fix the call sites to do so.

CA1870 analyzer for SearchValues

It’s also worth highlighting that there have been improvements aroundIndexOf /Contains in .NET 9 besides withSearchValues. One simple but interesting change isdotnet/runtime#97632. This simply added anif block tostring.Contains(string):

public bool Contains(string value){    if (value == null)        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.value);    // PR added this if block    if (RuntimeHelpers.IsKnownConstant(value) && value.Length == 1)        return Contains(value[0]);    return SpanHelpers.IndexOf(        ref _firstChar,        Length,        ref value._firstChar,        value.Length) >= 0;}

What’s interesting about this is theSpanHelpers.IndexOf it delegates to already contains a fast path that special-cases single-character strings:

if (valueTailLength == 0){    // for single-char values use plain IndexOf    return IndexOfChar(ref searchSpace, value, searchSpaceLength);}

Why then is this extraif block helpful? It’s taking advantage of that same internalIsKnownConstant intrinsic we saw earlier. The JIT will always compile this method down to atrue/false constant, so it ends up adding no runtime overhead. If the value isfalse, the wholeif block evaporates. But if it’strue, that necessarily means the argument passed to the method is recognized by the JIT as being a constant, e.g. a developer calledsomeString.Contains("-") such that the JIT can see thatvalue is"-". In such a case, the JIT also knowsvalue.Length, such that it can see at compile time whether it’s1 or not. And that in turn means this whole method becomes:

public bool Contains(string value){    if (value == null)        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.value);    return SpanHelpers.IndexOf(        ref _firstChar,        Length,        ref value._firstChar,        value.Length) >= 0;}

if the JIT can’t prove the argument is a constant or if it’s not exactly one character in length, or:

public bool Contains(string value){    return Contains('the constant char');}

if it can. This eliminates a bit of overhead from the call.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private string _input = "!@#$%^&";    [Benchmark]    public bool Contains() => _input.Contains("$");}

Method	Runtime	Mean	Ratio
Contains	.NET 8.0	3.7649 ns	1.00
Contains	.NET 9.0	0.9614 ns	0.26

Regex

Regular expression support in .NET has received a lot of love over the past few years. The implementation was overhauled in.NET 5 to yield significant performance gains, and then in.NET 7 it not only saw another round of huge performance gains, it also gained a source generator, a new non-backtracking implementation, and more. In .NET 8, it sawadditional performance improvements, in part because of usingSearchValues.

Now in .NET 9, the trend continues. First and foremost, it’s important to recognize that many of the changes discussed thus far implicitly accrue toRegex.Regex already usesSearchValues, and so improvements toSearchValues benefitRegex (it’s one of my favorite things about working at the lowest levels of the stack: improvements there have a multiplicative effect, in that direct use of them improves, but so too does indirect use via intermediate components that instantly get better as the lower level does). Beyond that, though,Regex has increased its reliance onSearchValues.

There are multiple engines backingRegex today:

An interpreter, which is what you get when you don’t explicitly ask for one of the other engines.
A reflection-emit-based compiler, which at run-time emits custom IL for the specific regular expression and options. This is what you get when you specifyRegexOptions.Compiled.
A non-backtracking engine, which doesn’t support all ofRegex‘s features but which guaranteesO(N) throughput in the length of the input. This is what you get when you specifyRegexOptions.NonBacktracking.
And a source generator, which is very similar to the compiler, except it emits C# at build-time rather than emitting IL at run-time. This is what you get when you use[GeneratedRegex(...)].

As ofdotnet/runtime#98791,dotnet/runtime#103496, anddotnet/runtime#98880, all of the engines other than the interpreter avail themselves of the newSearchValues<string> support (the interpreter could as well, but we make an assumption that someone is using the interpreter in order to optimize for the speed ofRegex construction, and the analyses involved in choosing to useSearchValues<string> can take measurable time). The best way to see what this looks like is via the source generator, as we can easily examine the code it outputs in both .NET 8 and .NET 9. Consider this code:

using System.Text.RegularExpressions;internal partial class Example{    [GeneratedRegex("(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday): (.*)", RegexOptions.IgnoreCase)]    public static partial Regex ParseEntry();}

In Visual Studio, you can right-click onParseEntry, select “Go To Definition,” and the tool will take you to the C# code for this pattern as generated by the regular expression source generator (the pattern is looking for a day of the week, followed by a colon, then followed by any text, and is capturing both the day and that subsequent text into capture groups for subsequent exploration). The generated code contains two relevant methods: aTryFindNextPossibleStartingPosition method, which is used to skip ahead as quickly as possible to the first location that might possibly match, and aTryMatchAtCurrentPosition method, which performs the full match attempt at that location. For our purposes here, we care aboutTryFindNextPossibleStartingPosition, as that’s the most impactful placeSearchValues shows up. On .NET 8, we see this code:

private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan){    int pos = base.runtextpos;    ulong charMinusLowUInt64;    // Any possible match is at least 8 characters.    if (pos <= inputSpan.Length - 8)    {        // The pattern matches a character in the set [ADSYadsy] at index 5.        // Find the next occurrence. If it can't be found, there's no match.        ReadOnlySpan<char> span = inputSpan.Slice(pos);        for (int i = 0; i < span.Length - 7; i++)        {            int indexOfPos = span.Slice(i + 5).IndexOfAny(Utilities.s_ascii_1200080212000802);            if (indexOfPos < 0)            {                goto NoMatchFound;            }            i += indexOfPos;            if (((long)((0x8106400081064000UL << (int)(charMinusLowUInt64 = (uint)span[i] - 'F')) & (charMinusLowUInt64 - 64)) < 0) &&                ((long)((0x8023400080234000UL << (int)(charMinusLowUInt64 = (uint)span[i + 3] - 'D')) & (charMinusLowUInt64 - 64)) < 0))            {                base.runtextpos = pos + i;                return true;            }        }    }    // No match found.    NoMatchFound:    base.runtextpos = inputSpan.Length;    return false;}

The code is using anIndexOfAny withUtilities.s_ascii_1200080212000802; what is that? It’s aSearchValues<char>:

/// <summary>Supports searching for characters in or not in "ADSYadsy".</summary>internal static readonly SearchValues<char> s_ascii_1200080212000802 = SearchValues.Create("ADSYadsy");

The source generator is employing the approach we looked at earlier, searching for a single character from each string. Here it’s decided that its best chance for an optimal search is to look for the character at offset 5 in each string, so ‘y’ for “Monday”, ‘a’ for “Tuesday”, etc., plus looking for the upper-case variants sinceRegexOptions.IgnoreCase was specified. Then after the single-character search, it’s doing a quick test for a couple of other positions in the string to try to weed out false positives, looking at the 0th offset to ensure the character is in the set[MTWFSmtwfs] and the 3rd offset to ensure the character is in the set[DSNRUdsnru]. (The check for those is obscured by it using a branchless technique to query a bitmap stored in a 64-bit ulong.)

Now, here’s what we get in .NET 9:

private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan){    int pos = base.runtextpos;    // Any possible match is at least 8 characters.    if (pos <= inputSpan.Length - 8)    {        // The pattern has multiple strings that could begin the match. Search for any of them.        // If none can be found, there's no match.        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_OrdinalIgnoreCase_B7E3C0B8368AC400913BEA56D1872F43698FDA2C54D1AD4886F6734244613374);        if (i >= 0)        {            base.runtextpos = pos + i;            return true;        }    }    // No match found.    base.runtextpos = inputSpan.Length;    return false;}

Again, we see anIndexOfAny, but notice that the subsequent checks for other positions are gone. Why? Because thatSearchValues being passed to theIndexOfAny now is aSearchValues<string>, and thus already confirms that one of the provided strings matches:

/// <summary>Supports searching for the specified strings.</summary>internal static readonly SearchValues<string> s_indexOfAnyStrings_OrdinalIgnoreCase_B7E3C0B8368AC400913BEA56D1872F43698FDA2C54D1AD4886F6734244613374 =    SearchValues.Create(        ["monday", "tuesday", "wednesda", "thursday", "friday", "saturday", "sunday"],         StringComparison.OrdinalIgnoreCase);

The sharp-eyed amongst you might notice that there’s no ‘y’ at the end of “Wednesday”; that’s simply due to a heuristic in theRegex implementation. When it searches for strings to use as part of such aSearchValues<string>, it limits itself to using no more than length 8 strings. And “searches” is an appropriate word here, as the implementation isn’t limited just to clean alternations as in the previous example. If I instead change the program to be:

using System.Text.RegularExpressions;internal partial class Example{    [GeneratedRegex("[Aa]([Bb][Cc]|[Dd][Ee])")]    public static partial Regex ParseEntry();}

we still end up with aSearchValues<string>, now for this:

/// <summary>Supports searching for the specified strings.</summary>internal static readonly SearchValues<string> s_indexOfAnyStrings_OrdinalIgnoreCase_33A76C255741CD9630059173F803FB92EBDDFBF62328261428CF8838D6379CE9 =    SearchValues.Create(["abc", "ade"], StringComparison.OrdinalIgnoreCase);

Interestingly, as ofdotnet/runtime#96402,SearchValues<string> will also be used when doing a single string search. As previously noted,IndexOf(string) will try to pick two characters and do a vectorized search for both, whereasSearchValues<string> for that same input can spend a bit more time trying to pick more characters and characters that will be better for the search. As such,Regex now opts to useSearchValues<string> as part ofTryFindNextPossibleStartingPosition. We can see this with the following benchmarks that count the number of occurrences of the words “Hello” or “earth”:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public partial class Tests{    public static readonly string s_markTwain = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;    [GeneratedRegex(@"\bHello\b")]    public static partial Regex FindHello();    [GeneratedRegex(@"\bearth\b")]    public static partial Regex FindEarth();    [Benchmark]    public int CountHello() => FindHello().Count(s_markTwain);    [Benchmark]    public int CountEarth() => FindEarth().Count(s_markTwain);}

On .NET 8, the code generated forTryFindNextPossibleStartingPosition forFindEarth includes:

int i = inputSpan.Slice(pos).IndexOf("earth");

whereas on .NET 9, the generated code is:

int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_earth_Ordinal);...internal static readonly SearchValues<string> s_indexOfString_earth_Ordinal =    SearchValues.Create(["earth"], StringComparison.Ordinal);

And the results:

Method	Runtime	Mean	Ratio
CountHello	.NET 8.0	2.020 ms	1.00
CountHello	.NET 9.0	2.042 ms	1.01

CountEarth	.NET 8.0	2.738 ms	1.00
CountEarth	.NET 9.0	2.339 ms	0.85

This highlights that usingSearchValues<string> in this one-string case doesn’t always help, but it can improve things, in particular in situations where the extra one-time work done by theSearchValues.Create enables it to find meaningfully better characters for which to search.

My seeming obsession withSearchValues<T> might lead one to believe that it’s the only source of improvements inRegex, but that’s far from the truth. There are many other PRs in .NET 9 focused on different aspects that improved the area.

dotnet/runtime#93190 is a nice addition. One of the optimizations introduced toRegex in .NET 7 was a “literal-after-loop” search. A lot of effort goes into finding ways to helpRegex‘sTryFindNextPossibleStartingPosition be as efficient as possible at skipping unnecessary locations, and this “literal-after-loop” search is one such mechanism. It looks for a particular shape of pattern, where the pattern starts with a loop that’s then followed by a literal. For example, the industry regex benchmarks inmariomka/regex-benchmark includes this pattern for finding URIs:

@"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?"

The pattern starts with a word character loop. We don’t have a good way to vectorize a search for any word character, nor would we really want to; there are over 50,000 word characters that are part of the\w set, and in most inputs we’d typically find an occurrence so quickly that it wouldn’t be worth the vectorization. However, the"://" that follows is easily searchable and much less likely to occur, making it a good candidate forTryFindNextPossibleStartingPosition. However, we can’t just search for the"://" because it doesn’t start the pattern, nor is it at a fixed-offset from the beginning of the pattern that would enable us to find the"://" and then jump backwards a known number of positions. Instead, with the “literal-after-loop” optimization, we can find the"://" and then iterate backwards to the beginning of the loop in order to find the actual starting position for the match attempt (we can also keep track of where the loop ends so that we don’t need to re-match it).

There were, however, a number of gaps in this optimization. Most notably, the implementation needs to examine the pattern to determine whether it’s applicable. If the starting loop was wrapped in a capture or an atomic group, it was unnecessarily giving up and would fail to discover the loop for the purposes of enabling the “literal-after-loop” mechanism. The search would also give up if the literal after the loop was a set inside of various grouping constructs, like a concatenation.

This PR fixed those gaps. The impact of this can be seen by looking at another industry benchmark, this time from theBurntSushi/rebar repo:

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public partial class Tests{    private static string s_haystack = new HttpClient().GetStringAsync("https://raw.githubusercontent.com/BurntSushi/rebar/master/benchmarks/haystacks/wild/cpython-226484e4.py").Result;    [GeneratedRegex(@"(\s*)((?:# [Nn][Oo][Qq][Aa])(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)")]    private static partial Regex RuffNoQA();    [Benchmark]    public int Count() => RuffNoQA().Count(s_haystack);}

The impact of the literal-after-loop optimization ends up being obvious in the resulting numbers:

Method	Runtime	Mean	Ratio
Count	.NET 8.0	197.47 ms	1.00
Count	.NET 9.0	18.67 ms	0.09

Improvements inRegex go beyond just the initial searching, as well. An interesting changes comes indotnet/runtime#98723, not because it results in massive performance improvements (though it does yield some nice benefits), but rather because it highlights how improvement possibilities can be found in all manner of places. One of the areas we (and pretty much everyone else in the world it seems) has been investing a lot of energy into lately is AI, including into tokenizers, which are components that take text and translate it into a series of numbers that are meaningful to the model into which they’ll be fed. Each model is trained on a set of tokens from a specific tokenizer, and in the case of OpenAI’s models, that tokenizer algorithm is “tiktoken.” The official .NET implementation of tiktoken lives in theMicrosoft.ML.Tokenizers library, and as part of implementing tiktoken, it follows the reference implementation provided by OpenAI, which usesa regular expression as part of parsing; for consistency and to help ensure correctness, therefore, the .NET implementation does as well. This regex includes the following pattern:

(?i:'s|'t|'re|'ve|'m|'ll|'d)

What jumped out about this pattern is that it should trigger an optimization in the regex source generator that emits alternations like this as a C#switch statement, as the analyzer should be able to determine that all branches of the alternation are distinct, such that picking one branch because the first character matches necessarily means that no other branch could match. The benefit of aswitch here is it allows the C# compiler to implement a jump table, which means we’re not stuck exploring each branch when we could instead jump right to the correct one. But that optimization wasn’t kicking in. Why? A series of unfortunate events. An earlier optimization was seeing thell and rewriting that into a repeater (l{2}), which then defeated this alternation optimization because the implementation wasn’t written to examine loops. Loops were explicitly being skipped because a loop could be empty, and if empty, it wouldn’t have a first character required by theswitch. However, we can see whether a loop has a non-zero minimum bound on number of iterations set, as it does in this case, and in such cases we can still factor in up to that minimum number, which are all guaranteed iterations. This PR improved the analysis to handle loops well, as evidenced by this micro-benchmark (which has been crafted to accentuate this aspect of the pattern):

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public partial class Tests{    private static string s_haystack = new string('y', 10_000);    [GeneratedRegex("(?i:a|bb|c|dd|e|ff|g|hh|i|jj|k|ll|m|nn|o|pp|q|rr|s|tt|u|vv|w|xx|y|zz)")]    public static partial Regex Parse();    [Benchmark]    public int Count() => Parse().Count(s_haystack);}

Method	Runtime	Mean	Ratio
Count	.NET 8.0	315.7 us	1.00
Count	.NET 9.0	211.6 us	0.67

The non-backtracking engine also got some nice attention in .NET 9.dotnet/runtime#102655 from@ieviev (who submitting a small subset of the changes they’d made as part of some excitingregex research being done in a fork of the library), followed bydotnet/runtime#104766 anddotnet/runtime#105668 made a variety of changes to the non-backtracking implementation, including:

DFA limits. The non-backtracking implementation works by constructing a finite automata, which can be thought of as a graph, with the implementation walking around the graph as it consumes additional characters from the input and uses those to guide what node(s) it transitions to next. The graph is built out lazily, such that nodes are only added as those states are explored, and the nodes can be one of two kinds: DFA (deterministic) or NFA (non-deterministic). DFA nodes ensure that for any given character that comes next in the input, there’s only ever one possible node to which to transition. Not so for NFA, where at any point in time there’s a list of all the possible nodes the system could be in, and moving to the next state means examining each of the current states, finding all possible transitions out of each, and treating the union of all of those new positions as the next state. DFA is thusmuch cheaper than NFA in terms of the overheads involved in walking around the graph, and we want to fall back to NFA only when we absolutely have to, which is when the DFA graph would be too large: some patterns have the potential to create massive numbers of DFA nodes. Thus, there’s a threshold where once that number of constructed nodes in the graph is hit, new nodes are constructed as NFA rather than DFA. In .NET 8 and earlier, that limit was somewhat arbitrarily set at 10,000. For .NET 9 as part of this PR, analysis was done to show that a much higher limit was worth the memory trade-offs, and the limit was raised to 125,000, which means many more patterns can fully execute as DFA.
Minterm mappings. The implementation works in terms of “minterms,” which are equivalence classes for all characters that behave the same in the pattern. For example, with the pattern"[a-z]*", the lowercase ASCII letters are all treated the same and are all treated differently from every other character, so there are two minterms here, one for the 26 lowercase ASCII letters, and the other for the remaining 65,510 characters. This is used as a compression mechanism, as rather than needing to describe the transitions between nodes for every character, the system can instead do so for every minterm. Of course, that means during matching there’s a step where a character needs to be mapped to its minterm in order to know which edge to follow to the next state. Previously, that mapping was cached for all ASCII characters but recomputed each time for non-ASCII (recomputing it amounts to a binary search on a tree data structure). As you can imagine, this can lead to significant overhead when non-ASCII is encountered. Now in .NET 9, mappings for all characters represented in the pattern are stored. In degenerate cases, this can measurably increase memory consumption for theRegex instance, but on average it doesn’t; in fact, for common cases the new scheme actually reduces memory consumption, as it takes into account the fact that all but the most niche patterns have fewer than 256 minterms, and the per-character mapping can thus be stored in abyte rather than aushort oruint. Additionally, for cases where only a subset of ASCII is used in the pattern (which is common), theRegex instance needn’t allocate an array to represent all 128 ASCII characters, but can instead be shrunk to only support those characters that need be represented.
Timeout checks.Regex has long supported a timeout mechanism, where if a match operation takes longer than a specified limit, an exception is thrown. This mechanism exists to help mitigate possible regex denial of service (ReDOS) attacks, where a maliciously-constructed pattern when fed to a backtracking engine could lead to “catastrophic backtracking” (you can see an example of this in myDeep .NET discussion on Regex with Scott Hanselman). These timeouts are thus enabled in the interpreter, the compiler, and the source generator. For the non-backtracking engine, timeouts aren’t necessary to avoid catastrophic backtracking, as there is no backtracking. The engine still pays some attention to timeouts, though, purely for consistency with the other engines, yet the frequency of the checks was actually adding measurable overhead in some cases. The PR reduced the frequency of the checks to mitigate that overhead while not meaningfully affecting the effectiveness of the checks.
Hot path inner loop. The inner matching loop is the hot path for a matching operation: read the next character, look up its minterm, follow the corresponding edge to the next node in the graph, rinse and repeat. Performance of the engine is tied to efficiency of this loop. These PRs recognized that there were some checks being performed in that inner loop which were only relevant to a minority of patterns. For the majority, the code could be specialized such that those checks wouldn’t be needed in the hot path.
General good hygiene. Care was taken to remove unnecessary overheads, such as duplicate array lookups that could be removed, bounds checks that could be avoided, indirect reads viarefs that could instead be done against locals, and so on.

The net result of these changes is most patterns get faster, some significantly, especially on non-ASCII inputs.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    public static readonly string s_aristotle = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/39963/pg39963.txt").Result;    private readonly Regex _word = new Regex(@"\b[\p{IsGreek}]+\b", RegexOptions.NonBacktracking);    [Benchmark]    public int CountWords() => _word.Matches(s_aristotle).Count;}

Method	Runtime	Mean	Ratio
CountWords	.NET 8.0	14.808 ms	1.00
CountWords	.NET 9.0	9.673 ms	0.65

Thedotnet/runtime employs an automated performance regression testing system, with tests indotnet/performance constantly running on various operating systems and hardware, with the goal of detecting regressions. When a possible regression is noticed, an issue is opened containing the details. However, the system also notices statistically-significant improvements and also opens issues on those, just to ensure that we’re all aware of when and how things change in a meaningful way. When possible, the issues reference the PR known to have caused the regression or improvement, so it’s always a treat to see a list of references like this on a PR, as was the case withdotnet/runtime#102655: Automated performance analysis links

Both the non-backtracking engine and the interpreter now also gain additional optimized searching for certain classes of prefixes they didn’t previously support. Withdotnet/runtime#100315, patterns that begin with ranges can now be optimized with anIndexOfAny{Except}InRange call, whereas previously such patterns would likely result in walking character by character.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    public static readonly string s_markTwain = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;    private readonly Regex _interpreter = new Regex(@"\b[0-9]+\b");    private readonly Regex _nonBacktracking = new Regex(@"\b[0-9]+\b", RegexOptions.NonBacktracking);    [Benchmark]    public int Interpreter() => _interpreter.Count(s_markTwain);    [Benchmark]    public int NonBacktracking() => _nonBacktracking.Count(s_markTwain);}

Method	Runtime	Mean	Ratio
Interpreter	.NET 8.0	21.223 ms	1.00
Interpreter	.NET 9.0	1.726 ms	0.08

NonBacktracking	.NET 8.0	21.945 ms	1.00
NonBacktracking	.NET 9.0	1.749 ms	0.08

Finally,Regex gains some new APIs in .NET 9, focused on performance.Regex currently has a set ofSplit overloads; these logically behave likeMatch, except instead of returning what matched, they effectively return what’s between the matches, treating the match as a split separator. As withstring.Split, theseRegex.Split methods return astring[], which means allocating the array to store all the results and allocating each of the individualstrings. There was also no overload for supporting span inputs, which meant if one had a span to search, that span would first need to be converted into a string, yet another allocation. .NET 7 saw a similar predicament fixed with the introduction of theEnumerateMatches method, which provided an allocation-free alternative toMatch orMatches. Now in .NET 9, thanks todotnet/runtime#103307,Regex gets newEnumerateSplits methods, that similarly provide an allocation-free way to access the same splits. The method accepts aReadOnlySpan<char>, and then rather than returning an array of strings, it returns an enumerator ofRanges pointing into the original.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    public static readonly string s_markTwain = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;    private readonly Regex _whitespace = new Regex(@"\s+", RegexOptions.Compiled);    [Benchmark(Baseline = true)]    public int SplitOnWhitespace_Split()    {        int lengths = 0;        foreach (string split in _whitespace.Split(s_markTwain))        {            lengths += split.Length;        }        return lengths;    }    [Benchmark]    public int SplitOnWhitespace_EnumerateSplits()    {        int lengths = 0;        foreach (Range range in _whitespace.EnumerateSplits(s_markTwain))        {            ReadOnlySpan<char> split = s_markTwain.AsSpan(range);            lengths += split.Length;        }        return lengths;    }}

Method	Mean	Ratio	Allocated	Alloc Ratio
SplitOnWhitespace_Split	189.1 ms	1.00	185305389 B	1.000
SplitOnWhitespace_EnumerateSplits	116.6 ms	0.62	272 B	0.000

Encoding

Base64 encoding has been supported in .NET since the beginning, with methods likeConvert.ToBase64String andConvert.FromBase64CharArray. More recently, a plethora of Base64-related APIs have been added, including span-based APIs onConvert but also a dedicatedSystem.Buffers.Text.Base64 with methods for encoding and decoding between arbitrary bytes and UTF8 text, and most recently for very efficiently checking whether UTF8 and UTF16 text represents a valid Base64 payload.

Base64 is a fairly simple encoding scheme for taking arbitrary binary data and converting it to ASCII text, splitting the input up into groups of 6 bits (2^6 == 64 possible values) and mapping each of those values to a specific character in the Base64 alphabet: the 26 upper-case ASCII letters, the 26 lower-case ASCII letters, the 10 ASCII digits,'+', and'/'. While this is an incredibly popular encoding mechanism, it runs into problems for some use cases because of the exact choice of alphabet. Including Base64 data in a URI is possibly problematic, as'+' and'/' both have special meaning in URIs, as does the special'=' symbol used for padding Base64 data out to a specific length. That means that in addition to Base64-encoding data, the resulting data might also need to be URL-encoded for such a use, both taking additional time and further increasing the size of the payload. To address this, a variant was introduced, Base64Url, which does away with the need for padding, and which uses a slightly different alphabet,'-' instead of'+' and'_' instead of'/'. Base64Url is used in a variety of domains, including as part ofJSON Web Tokens (JWT), where it’s used to encode each segment of the token.

While .NET has had Base64 support for a long time, it hasn’t had Base64Url support, and as such, developers have had to craft their own. Many have done so by layering on top of the Base64 implementations inConvert orBase64. For example, here’s what the core part of ASP.NET’s implementation forWebEncoders.Base64UrlEncode looked like in .NET 8:

private static int Base64UrlEncode(ReadOnlySpan<byte> input, Span<char> output){    if (input.IsEmpty)        return 0;    Convert.TryToBase64Chars(input, output, out int charsWritten);    for (var i = 0; i < charsWritten; i++)    {        var ch = output[i];        if (ch == '+') output[i] = '-';        else if (ch == '/') output[i] = '_';        else if (ch == '=') return i;    }    return charsWritten;}

We can obviously write more code to do that more efficiently, but with .NET 9 we don’t have to. Withdotnet/runtime#102364, .NET now has a fully-featuredBase64Url type that is also very efficient. It actually shares almost all of its implementation with the same functionality onBase64 andConvert, using generic tricks to substitute the different alphabets in an optimized manner. (The ASP.NET implementation has also been updated to useBase64Url withdotnet/aspnetcore#56959 anddotnet/aspnetcore#57050.)

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers.Text;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private byte[] _data;    private char[] _destination = new char[Base64.GetMaxEncodedToUtf8Length(1024 * 1024)];    [GlobalSetup]    public void Setup()    {        _data = new byte[1024 * 1024];        new Random(42).NextBytes(_data);    }    [Benchmark(Baseline = true)]    public int Old() => Base64UrlOld(_data, _destination);    [Benchmark]    public int New() => Base64Url.EncodeToChars(_data, _destination);    static int Base64UrlOld(ReadOnlySpan<byte> input, Span<char> output)    {        if (input.IsEmpty)            return 0;        Convert.TryToBase64Chars(input, output, out int charsWritten);        for (var i = 0; i < charsWritten; i++)        {            var ch = output[i];            if (ch == '+')            {                output[i] = '-';            }            else if (ch == '/')            {                output[i] = '_';            }            else if (ch == '=')            {                return i;            }        }        return charsWritten;    }}

Method	Mean	Ratio
Old	1,314.20 us	1.00
New	81.36 us	0.06

This also benefits from a set of changes that improved the performance ofBase64, and thus alsoBase64Url, since they now share the same code.dotnet/runtime#92241 from@DeepakRajendrakumaran added an AVX512-optimized Base64 encoding/decoding implementation, anddotnet/runtime#95513 from@SwapnilGaikwad anddotnet/runtime#100589 from@SwapnilGaikwad optimized Base64 encoding and decoding for Arm64.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private byte[] _toEncode;    private char[] _encoded;    [GlobalSetup]    public void Setup()    {        _toEncode = new byte[1000];        new Random(42).NextBytes(_toEncode);        _encoded = new char[Convert.ToBase64String(_toEncode).Length];    }    [Benchmark]    public void ConvertToBase64() => Convert.ToBase64CharArray(_toEncode, 0, _toEncode.Length, _encoded, 0);}

Method	Runtime	Mean	Ratio
ConvertToBase64	.NET 8.0	104.55 ns	1.00
ConvertToBase64	.NET 9.0	60.19 ns	0.58

Another simpler form of encoding is hex, effectively employing an alphabet of 16 characters (for each group of 4 bits) rather than 64 (for each group of 6 bits). .NET 5 introduced theConvert.ToHexString set of methods, which take an inputReadOnlySpan<byte> orbyte[] and produce an outputstring with two hex chars per input byte. The alphabet selected for that encoding are the hexademical characters of ‘0’ through ‘9’ and then upper-case ‘A’ through ‘F’. That’s great when you want upper-case, but sometimes you want the lower-case ‘a’ through ‘f’ instead. As a result, it’s not uncommon now to see calls like this:

string result =  Convert.ToHexString(bytes).ToLowerInvariant();

whereToHexString produces one string and thenToLowerInvariant possibly produces another (“possibly” because it’ll only need to create a new string if the data contained any letters). With .NET 9 anddotnet/runtime#92483 from@determ1ne, the newConvert.ToHexStringLower methods may be used to go directly to the lower-case version; that PR also introduced theTryToHexString andTryToHexStringLower methods, which format directly into a provided destination span rather than allocating anything.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers.Text;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private byte[] _data = new byte[100];    private char[] _dest = new char[200];    [GlobalSetup]    public void Setup() => new Random(42).NextBytes(_data);    [Benchmark(Baseline = true)]    public string Old() => Convert.ToHexString(_data).ToLowerInvariant();    [Benchmark]    public string New() => Convert.ToHexStringLower(_data).ToLowerInvariant();    [Benchmark]    public bool NewTry() => Convert.TryToHexStringLower(_data, _dest, out int charsWritten);}

Method	Mean	Ratio	Allocated	Alloc Ratio
Old	136.69 ns	1.00	848 B	1.00
New	119.09 ns	0.87	424 B	0.50
NewTry	21.97 ns	0.16	–	0.00

Before .NET 5 introducedConvert.ToHexString, there actually already was some functionality in .NET for converting bytes to hex:BitConverter.ToString.BitConverter.ToString does the same thingConvert.ToHexString now does, except inserting dashes between every two hex characters (i.e. between every byte). As a result, it became fairly common for folks that wanted the equivalent ofToHexString to instead writeBitConverter.ToString(bytes).Replace("-", ""). It’s so common to want the dashes removed, in fact, that it’s what GitHub copilot suggests for me just by typingBitConverter.ToString: GitHub copilot suggesting use of BitConverter.ToString Of course, that operation is much more expensive (and more complicated) than just usingConvert.ToHexString, so it’d be nice to help developers switch over toToHexString{Lower}. That’s exactly whatdotnet/roslyn-analyzers#6967 from@mpidash does.CA1872 will now flag both cases that can be converted toConvert.ToHexString: CA1872 analyzer for Convert.ToHexString and cases that can be converted toConvert.ToHexStringLower:And that’s good for performance, as the difference is quite stark:

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private byte[] _bytes = Enumerable.Range(0, 100).Select(i => (byte) i).ToArray();    [Benchmark(Baseline = true)]    public string WithBitConverter() => BitConverter.ToString(_bytes).Replace("-", "").ToLowerInvariant();    [Benchmark]    public string WithConvert() => Convert.ToHexStringLower(_bytes);}

Method	Mean	Ratio	Allocated	Alloc Ratio
WithBitConverter	1,707.46 ns	1.00	1472 B	1.00
WithConvert	61.66 ns	0.04	424 B	0.29

There are a variety of reasons for that difference, including the obvious one that theReplace is having to search the input, find all the dashes, and allocate a brand new string without them. However,BitConverter.ToString is also slower in general as it’s not as easily vectorized, due to needing to insert dashes between the resulting characters.

In the other direction,Convert.FromHexString decodes a string of hex back into a newbyte[].dotnet/runtime#86556 from@hrrrrustic adds overloads ofFromHexString that write into a destination span rather than allocating a newbyte[] each time.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private string _hex = string.Concat(Enumerable.Repeat("0123456789abcdef", 10));    private byte[] _dest = new byte[100];    [Benchmark(Baseline = true)]    public byte[] FromHexString() => Convert.FromHexString(_hex);    [Benchmark]    public OperationStatus FromHexStringSpan() => Convert.FromHexString(_hex.AsSpan(), _dest, out int charsWritten, out int bytesWritten);}

Method	Mean	Ratio	Allocated	Alloc Ratio
FromHexString	33.78 ns	1.00	104 B	1.00
FromHexStringSpan	18.22 ns	0.54	–	0.00

Span, Span, and more Span

The introduction ofSpan<T> andReadOnlySpan<T> back in .NET Core 2.1 have revolutionized how we write .NET code (especially in the core libraries) and what APIs we expose (seeA Complete .NET Developer’s Guide to Span if you’re interested in a deeper dive.) .NET 9 has continued the trend of doubling-down on spans as a great way to both implicitly provide performance boosts and also expose APIs that enables developers to do more for performance in their own code.

One great example of this is the new C# 13 support for “params collections,” which merged into the C# compiler’s main branch indotnet/roslyn#72511. This feature enables the C#params keyword to be used with more than just array parameters, but rather any collection type that’s usable with collection expressions… that includes span. In fact, the feature makes it so that if there are two overloads, one taking aparams T[] and one taking aparams ReadOnlySpan<T>, the latter overload will win overload resolution. Moreover, the code generated for a call site for aparams ReadOnlySpan<T> is the same non-allocating approach you get for collection expressions, e.g. given code like this:

using System;public class C {    public void M()    {        Helpers.DoAwesomeStuff("Hello", "World");    }}public static class Helpers{    public static void DoAwesomeStuff<T>(params T[] values) { }    public static void DoAwesomeStuff<T>(params ReadOnlySpan<T> values) { }}

the IL the C# compiler generates forC.M will be equivalent to something like the following C#:

<>y__InlineArray2<string> buffer = default;<PrivateImplementationDetails>.InlineArrayElementRef<<>y__InlineArray2<string>, string>(ref buffer, 0) = "Hello";<PrivateImplementationDetails>.InlineArrayElementRef<<>y__InlineArray2<string>, string>(ref buffer, 1) = "World";Helpers.DoAwesomeStuff(<PrivateImplementationDetails>.InlineArrayAsReadOnlySpan<<>y__InlineArray2<string>, string>(ref buffer, 2));

This is using the[InlineArray] feature introduced in .NET 8 to stack-allocate a span of strings, and then pass that span into the method. No heap allocation. This is awesome for library developers, because it means any place where you have a method taking aparams T[], you can add aparams ReadOnlySpan<T> overload, and when consuming code calling that method recompiles, it just gets better.dotnet/runtime#101308 anddotnet/runtime#101499 rely on that to add ~40 new overloads for methods that didn’t previously accept spans and now do, and addedparams to over 20 existing overloads that were already taking spans. For example, if code had been usingPath.Join to build up a path comprised of five or more segments, it previously would have been using theparams string[] overload, but now upon recompilation it’ll switch to using theparams ReadOnlySpan<string> overload, and won’t need to allocate astring[] for the inputs.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;using System.Numerics;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public string Join() => Path.Join("a", "b", "c", "d", "e");}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Join	.NET 8.0	30.83 ns	1.00	104 B	1.00
Join	.NET 9.0	24.85 ns	0.81	40 B	0.38

The C# compiler has also improved around spans in other ways. For example,dotnet/roslyn#71261 extends the assembly data support for initializing arrays andReadOnlySpan<T> to also apply tostackalloc. If you have code like this:

var array = new char[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g' };

the compiler will generate code along the lines of the following:

char[] array = new char[7];RuntimeHelpers.InitializeArray(array, (RuntimeFieldHandle)&<PrivateImplementationDetails>.FD43C34A357FF620C00C04D0247059F8628CBB3DB349DF05DFA15EF6C7AC514C2);

The compiler has taken that char data and blit it into the assembly; then when it creates the array, rather than setting each individual value into the array, it just copies that data directly from the assembly into the array. Similarly, if you have:

ReadOnlySpan<char> span = new char[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g' };

the compiler recognizes that all of the data is constant and is being stored into a “read-only” location, so it doesn’t actually need to allocate an array. Instead, it emits code like:

ReadOnlySpan<char> span =RuntimeHelpers.CreateSpan<char>((RuntimeFieldHandle)&<PrivateImplementationDetails>.FD43C34A357FF620C00C04D0247059F8628CBB3DB349DF05DFA15EF6C7AC514C2);

which effectively creates a span that points directly into the assembly data; no allocationand no copy needed. However, if you have this:

ReadOnlySpan<char> span = stackalloc char[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g' };

or this:

Span<char> span = stackalloc char[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g' };

you’d get codegen more like this:

char* ptr = stackalloc char[7];*(char*)ptr = 97;*(char*)(ptr + 1) = 98;*(char*)(ptr + 2) = 99;*(char*)(ptr + 3) = 100;*(char*)(ptr + 4) = 101;*(char*)(ptr + 5) = 102;*(char*)(ptr + 6) = 103;Span<char> span = new Span<char>(ptr, 7);

But now, thanks todotnet/roslyn#71261, that last example will also be unified with the same approach for the other constructions, resulting in code more like this:

char* ptr = stackalloc char[7];Unsafe.CopyBlockUnaligned(ptr, &<PrivateImplementationDetails>.FD43C34A357FF620C00C04D0247059F8628CBB3DB349DF05DFA15EF6C7AC514C2, 14);Span<char> span = new Span<char>(ptr, 7);

(the compiler will actually generate acpblk IL instruction rather than a call toUnsafe.CopyBlockUnaligned).

The C# compiler has also improved its ability to avoid allocations when creatingReadOnlySpan<T> from some expressed array constructions or collection expressions. One of the really nice optimizations the C# compiler added several years back was the ability to recognize when a newbyte/sbyte/bool array was being constructed, filled with only constants, and directly assigned to aReadOnlySpan<T>. In such a case, it would recognize that the data was all blittable and could never be modified, so rather than allocating an array and wrapping a span around it, it would blit the data into the assembly and then just construct a span around a pointer into the assembly data with the appropriate length. So this:

ReadOnlySpan<byte> Values => new[] { (byte)0, (byte)1, (byte)2 };

got lowered into something more like this:

ReadOnlySpan<byte> Values => new ReadOnlySpan<byte>(    &<PrivateImplementationDetails>.AE4B3280E56E2FAF83F414A6E3DABE9D5FBE18976544C05FED121ACCB85B53FC),    3);

The optimization at the time was limited to only single-byte primitive types because of endianness concerns, but .NET 7 added aRuntimeHelpers.CreateSpan method which handled such endianness concerns, so then this was expanded to all such primitive types regardless of size. So this:

ReadOnlySpan<char> Values1 => new[] { 'a', 'b', 'c' };ReadOnlySpan<int> Values2 => new[] { 1, 2, 3 };ReadOnlySpan<long> Values3 => new[] { 1L, 2, 3 };ReadOnlySpan<DayOfWeek> Values4 => new[] { DayOfWeek.Monday, DayOfWeek.Friday };

gets lowered into something more like this:

ReadOnlySpan<char> Values1 => new ReadOnlySpan<char>(    &<PrivateImplementationDetails>.13E228567E8249FCE53337F25D7970DE3BD68AB2653424C7B8F9FD05E33CAEDF2),    3);ReadOnlySpan<int> Values2 => new ReadOnlySpan<int>(    &<PrivateImplementationDetails>.4636993D3E1DA4E9D6B8F87B79E8F7C6D018580D52661950EABC3845C5897A4D4),    3);ReadOnlySpan<long> Values3 => new ReadOnlySpan<long>(    &<PrivateImplementationDetails>.E2E2033AE7E19D680599D4EB0A1359A2B48EC5BAAC75066C317FBF85159C54EF8),    3);ReadOnlySpan<DayOfWeek> Values3 => new ReadOnlySpan<DayOfWeek>(    &<PrivateImplementationDetails>.ECA75F8497701D6223817CDE38BF42CDD1124E01EF6B705BCFE9A584F7B42F0F4),    2);

Lovely. But… what about types that are supported as constants at the C# level but that aren’t blittable in this fashion? That includesnint andnuint (which vary in size based on the bitness of the process),decimal (for which a constant is actually represented in metadata via a[DecimalConstant(...)] attribute), andstring (which is a reference type). In those cases, even though we’re still targeting something that can be mutated and we’re still using constants, we still get the array allocation:

ReadOnlySpan<nint> Values1 => new nint[] { 1, 2, 3 };ReadOnlySpan<nuint> Values2 => new nuint[] { 1, 2, 3 };ReadOnlySpan<decimal> Values3 => new[] { 1m, 2m, 3m };ReadOnlySpan<string> Values4 => new[] { "a", "b", "c" };

which are lowered to, well, themselves, such that there’s still an allocation. Or, at least there was. Thanks todotnet/roslyn#69820, these cases are now handled as well. They’re addressed by lazily allocating an array that’s then cached for all subsequent use. So now, that same example gets lowered into the equivalent of something more like this:

ReadOnlySpan<nint> Values1 =>    <PrivateImplementationDetails>.4636993D3E1DA4E9D6B8F87B79E8F7C6D018580D52661950EABC3845C5897A4D_B8 ??=    new nint[] { 1, 2, 3 };ReadOnlySpan<nuint> Values2 =>    <PrivateImplementationDetails>.4636993D3E1DA4E9D6B8F87B79E8F7C6D018580D52661950EABC3845C5897A4D_B16 ??=    new nuint[] { 1, 2, 3 };ReadOnlySpan<decimal> Values3 =>    <PrivateImplementationDetails>.04B64E80BCEFE521678C4D6565B6EEBCE2791130A600CCB5D23E1B5538155110_B18 ??=    new[] { 1m, 2m, 3m };ReadOnlySpan<string> Values4 =>    <PrivateImplementationDetails>.13E228567E8249FCE53337F25D7970DE3BD68AB2653424C7B8F9FD05E33CAEDF_B11 ??=    new[] { "a", "b", "c" };

There are, of course, many more span-related improvements in the libraries, too. One improvement for an existing span-related method isdotnet/runtime#103728, which further optimizesMemoryExtensions.Count used to count the number of occurrences of an element in a span. The implementation is vectorized, processing a vector’s worth of data at a time, e.g. if 256-bit vectors are hardware accelerated, and it’s searchingchars, it’ll process 16chars at a time (16chars 2 bytes perchar 8 bits per byte == 256). What happens if the number of elements isn’t an even multiple of 16? Then we’re left with some remaining elements after processing the last full vector. Previously the implementation would fall back to processing those remaining elements one at a time; now, it’ll process one last vector at the end of the input. Doing so means we’ll end up re-examining one or more elements we already examined, but that doesn’t really matter, as we can examine all of the elements in approximately the same number of instructions as processing just a single element.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Buffers;using System.Numerics;using System.Runtime.CompilerServices;using System.Runtime.InteropServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private char[][] _values = new char[10_000][];    [GlobalSetup]    public void Setup()    {        var rng = new Random(42);        for (int i = 0; i < _values.Length; i++)        {            _values[i] = new char[rng.Next(0, 128)];            rng.NextBytes(MemoryMarshal.AsBytes(_values[i].AsSpan()));        }    }    [Benchmark]    public int Count()    {        int count = 0;        foreach (char[] numbers in _values)        {            count += numbers.AsSpan().Count('a');        }        return count;    }}

Method	Runtime	Mean	Ratio
Count	.NET 8.0	133.25 us	1.00
Count	.NET 9.0	74.30 us	0.56

New span-related functionality also shows up in .NET 9. String splitting is an operation that’s used all over the place; a search for “.Split(” in C# code on GitHub yields millions of hits, and data from a variety of sources suggests that just the simplest overloadSplit(params char[]? separator) is used by upwards of 90% of applications and 20% of nuget packages. So it should come as no surprise that a request to have this functionality for spans is very popular.

Upvotes for span splitting

The devil is in the details, of course, and it’s taken a long time to figure out exactly how it should be exposed. There are largely two different use cases for splitting we see in the wild. One case is where the content being split has an expected or max number of segments, and splitting is used to extract them. For example,FileVersionInfo needs to be able to take a version string and parse from it up to 4 components separated by periods. .NET 8 introduced newSplit extension methods onMemoryExtensions to address this use case, by havingSplit take a destinationSpan<Range> to write the bounds of each segment into. That, however, still leaves the second major category of usage, which is for iterating through an unbounded number of segments. A representative example there is this snippet fromHttpListener‘s web sockets implementation:

string[] requestProtocols = clientSecWebSocketProtocol.Split(',', StringSplitOptions.TrimEntries | StringSplitOptions.RemoveEmptyEntries);for (int i = 0; i < requestProtocols.Length; i++){    if (string.Equals(acceptProtocol, requestProtocols[i], StringComparison.OrdinalIgnoreCase))    {        return true;    }}

TheclientSecWebSocketProtocol string is composed of comma-separated values, and this is iterating through them to see if any is equal to the targetacceptProtocol. It’s doing that, though, with a relatively expensive operation. ThatSplit call needs to allocate thestring[] that’s returned and that holds all the constituent strings, and then each segment results in astring being allocated. We can do better, anddotnet/runtime#104534 from@bbartels enables that. It adds four new overloads ofMemoryExtensions.Split andMemoryExtensions.SplitAny:

public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> source, T separator) where T : IEquatable<T>;public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separator) where T : IEquatable<T>;public static SpanSplitEnumerator<T> SplitAny<T>(this ReadOnlySpan<T> source, params ReadOnlySpan<T> separators) where T : IEquatable<T>;public static SpanSplitEnumerator<T> SplitAny<T>(this ReadOnlySpan<T> source, SearchValues<T> separators) where T : IEquatable<T>;

With that, this same operation can be written as:

foreach (Range r in clientSecWebSocketProtocol.AsSpan().Split(',')){    if (clientSecWebSocketProtocol.AsSpan(r).Trim().Equals(acceptProtocol, StringComparison.OrdinalIgnoreCase))    {        return true;    }}

In doing so, it becomes allocation-free, as thisSplit doesn’t need to allocate astring[] to hold results and doesn’t need to allocate astring for each segment: instead, it’s returning aref struct enumerator that yields aRange representing each segment. The caller can then use thatRange to slice the input. It’s yielding aRange rather than, say, aReadOnlySpan<T>, to enable the splitting to be used with original sources other than spans and be able to get the segments in the original form. For example, if I had aReadOnlyMemory<T> and wanted to add segments from it into a list, I could do:

ReadOnlyMemory<T> source = ...;List<ReadOnlyMemory<T>> list = ...;foreach (Range r in source.Split(separator)){    list.Add(source.Slice(r));}

whereas that wouldn’t be possible ifSplit forced all yielded results to be spans.

You might notice that there’s noStringSplitOptions on these overloads. That’s because it’s both not applicable and not necessary. It’s not applicable because we’re working here withT, which might be something other thanchar, but an option likeStringSplitOptions.TrimEntries implies a notion of whitespace, and that’s only relevant for text. And it’s not necessary, because the main benefit ofStringSplitOptions, bothTrimEntries andRemoveEmptyEntries, is reducing allocation overheads. If these options didn’t exist with thestring overloads, and you wanted to simulate them with our original example (and spans didn’t exist), it would end up looking like this:

string[] requestProtocols = clientSecWebSocketProtocol.Split(',');for (int i = 0; i < requestProtocols.Length; i++){    if (string.Equals(acceptProtocol, requestProtocols[i].Trim(), StringComparison.OrdinalIgnoreCase))    {        return true;    }}

There are several possible performance problems here. Imagine theclientSecWebSocketProtocol input was"a , b, , , , , , c". There are only three entries we care about here ("a","b", and"c"), but the returned array is going to be astring[8] instead of astring[3], because it’s going to have entries for each of those whitespace-only segments. That’s a larger allocation than is necessary. Then, we’ll be producingstrings for all eight of those segments, even though only three of thestrings were necessary. And, all of"a "," b", and" c" have some extraneous whitespace that needs to be trimmed, such that the followingTrim() call will allocate a new string for each. TheStringSplitOptions enables the implementation ofSplit to avoid all of that overhead, by only allocating what’s desired. But with the span version, none of that allocation exists anyway. The consuming loop can trim the spans itself without incurring more overhead than would theSplit implementation, and the consuming loop can choose to ignore empty entries without increasing the size of astring[] allocation.

The net result is such operations can be significantly more efficient while not sacrificing much if anything in the way of maintainability.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private string _input = "a , b, , , , , , c";    private string _target = "d";    [Benchmark(Baseline = true)]    public bool ContainsString()    {        foreach (string item in _input.Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries))        {            if (item.Equals(_target, StringComparison.OrdinalIgnoreCase))            {                return true;            }        }        return false;    }    [Benchmark]    public bool ContainsSpan()    {        foreach (Range r in _input.AsSpan().Split(','))        {            if (_input.AsSpan(r).Trim().Equals(_target, StringComparison.OrdinalIgnoreCase))            {                return true;            }        }        return false;    }}

Method	Mean	Ratio	Allocated	Alloc Ratio
ContainsString	127.26 ns	1.00	208 B	1.00
ContainsSpan	61.89 ns	0.49	–	0.00

The nature of this new set of splitting APIs is that they find just the next separator / segment; that’s both practical and possibly a performance improvement by itself. It’s practical because we’re only yielding a single segment at a time, and we don’t have anywhere to store all possible found separator positions (nor do we want to allocate space to do so). And it’s desirable because the consumer may early-exit from the consuming loop, in which case we don’t want to have spent time unnecessarily searching for additional segments that are going to be ignored. The existing set of splitting APIs, however, hand back all found segments in one go, either via a returnedstring[] or via ranges being written to a destination span. And as such, it makes more sense for those overloads to find all separators at once because that operation can be vectorized. In fact, previous versions have done so. But, that vectorization has only benefited from 128-bit vectors. Withdotnet/runtime#93043 from@khushal1996 in .NET 9, that vectorization will now light-up with 512-bit or 256-bit vectors if they’re available, enabling that separator searching that happens as part of splitting to run up to four times faster.

Spans show up in other new methods as well.dotnet/runtime#93938 from@TheMaximum added new overloads ofStringBuilder.Replace that acceptReadOnlySpan<char> instead ofstring. As is the case with most such overloads, they share the same implementation, with thestring-based overloads just creating a span from thestring and using a span-based implementation. In practice, the majority of use ofStringBuilder.Replace uses constant strings as arguments, for example to escape some known delimiter (Replace("$", "\\$")), or use previously-createdstring instances, such as to remove some substring from text (Replace(substring, "")). But, there are a minority of cases whereReplace is used with something that’s created on the spot, and for that, these new overloads can help to avoid allocation for creating the arguments. For example, here’s some escaping code used today by MSBuild:

char[] charsToEscape = ...;StringBuilder escapedString = ...;foreach (char unescapedChar in charsToEscape){    string escapedCharacterCode = string.Format(CultureInfo.InvariantCulture, "%{0:x00}", (int)unescapedChar);    escapedString.Replace(unescapedChar.ToString(CultureInfo.InvariantCulture), escapedCharacterCode);}

This is having to perform twostring allocations to create the input to thisReplace, which is going to be invoked for eachchar incharsToEscape. IfcharsToEscape is something fixed, it could be better to avoid these formatting operations per iteration, and instead just cache the necessary strings for all uses, e.g.

private static readonly char[] charsToEscape = ...;private static readonly string[] escapedCharsToEscape = charsToEscape.Select(c => $"%{(uint)unescapedChar:x00}").ToArray();private static readonly string[] stringsToEscape = charsToEscape.Select(c => c.ToString()).ToArray();...for (int i = 0; i < charsToEscape.Length; i++){    escapedString.Replace(stringsToEscape[i], escapedCharsToEscape[i]);}

but ifcharsToEscape isn’t predictable, then we can at least avoid the allocation by employing the new overloads, e.g.

char[] charsToEscape = ...;StringBuilder escapedString = ...;Span<char> escapedSpan = stackalloc char[5];foreach (char unescapedChar in charsToEscape){    escapedSpan.TryWrite($"%{(uint)unescapedChar:x00}", out int charsWritten);    escapedString.Replace(new ReadOnlySpan<char>(in unescapedChar), escapedSpan.Slice(0, charsWritten));}

and, boom, no more allocation for the arguments.

A variety of other improvements were made tostring manipulation, mainly around better employing vectorization.StringComparison.OrdinalIgnoreCase operations were previously vectorized, but only with 128-bit vectors, which means handling up to 8chars at a time. Thanks todotnet/runtime#93116, those code paths have been updated to support 256-bit and 512-bit vectors, which means handling up to 16 or 32chars at a time on hardware accelerated to support it.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static string s_s1 = """        Let me not to the marriage of true minds        Admit impediments; love is not love        Which alters when it alteration finds,        Or bends with the remover to remove.        O no, it is an ever-fixed mark        That looks on tempests and is never shaken;        It is the star to every wand'ring bark        Whose worth's unknown, although his height be taken.        Love's not time's fool, though rosy lips and cheeks        Within his bending sickle's compass come.        Love alters not with his brief hours and weeks,        But bears it out even to the edge of doom:        If this be error and upon me proved,        I never writ, nor no man ever loved.        """;    private static string s_s2 = s_s1[0..^1] + "!";    [Benchmark]    public bool EqualsIgnoreCase() => s_s1.Equals(s_s2, StringComparison.OrdinalIgnoreCase);}

Method	Runtime	Mean	Ratio
EqualsIgnoreCase	.NET 8.0	86.79 ns	1.00
EqualsIgnoreCase	.NET 9.0	20.97 ns	0.24

EndsWith also gets better, for both strings and spans. Previous releases sawStartsWith become a JIT intrinsic, enabling the JIT to generate dedicated SIMD code forStartsWith in the case where it’s passed a constant. Now withdotnet/runtime#98593, the same thing is done forEndsWith.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser(maxDepth: 0)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments("helloworld.txt")]    public bool EndsWith(string path) => path.EndsWith(".txt", StringComparison.OrdinalIgnoreCase);}

Method	Runtime	path	Mean	Ratio	Code Size
EndsWith	.NET 8.0	helloworld.txt	3.5006 ns	1.00	26 B
EndsWith	.NET 9.0	helloworld.txt	0.6653 ns	0.19	61 B

More interesting to me than these nice gains is the code that was generated to achieve them. This is what the assembly for this benchmark looked like with .NET 8:

; Tests.EndsWith(System.String)       mov       rdi,rsi       mov       rsi,7EE3C2D25E38       mov       edx,5       cmp       [rdi],edi       jmp       qword ptr [7F24678663A0]; System.String.EndsWith(System.String, System.StringComparison); Total bytes of code 26

Pretty straightforward, a bit of argument manipulation and then jumping to the actualstring.EndsWith implementation. Now here’s .NET 9:

; Tests.EndsWith(System.String)       push      rbp       mov       rbp,rsp       mov       eax,[rsi+8]       cmp       eax,4       jge       short M00_L00       xor       ecx,ecx       jmp       short M00_L01M00_L00:       mov       ecx,eax       lea       rax,[rsi+rcx*2-8]       mov       rcx,20002000200000       or        rcx,[rax+0C]       mov       rax,7400780074002E       cmp       rcx,rax       sete      cl       movzx     ecx,clM00_L01:       movzx     eax,cl       pop       rbp       ret; Total bytes of code 61

Notice there’s no call tostring.EndsWith in sight. That’s because the JIT has implemented theEndsWith functionality here, specific to".txt" andOrdinalIgnoreCase, in just a few instructions. The address of the string is being passed into this method in thersi register, and the secondmov instruction is grabbing itsLength (which is stored 8 bytes from the start of the string object) and storing that into theeax register. It’s then checking whether the string is at least 4 characters long; if it’s not, it can’t possibly end with".txt" and thus it jumps to the end to returnfalse. If it was at least 4 characters long, it then proceeds to load the last four characters of the string as a 64-bit value intorcx and OR it with the value20002000200000. Why? It’s playing the same ASCII trick we’ve seen before. The'.' is not subject to casing, so we don’t need to manipulate its value, and hence the 16-bits that aligns with the'.' are 0. But the other three characters all need to be comparable with both their lower-case and upper-case forms, so this is OR’ing each of the three 16-bit characters with0x2000 to produce the lower-case form. At that point, the 64-bit value can be compared against the 64-bit representation of".txt", which is7400780074002E (the ASCII value for'.' is 0x2E, for't' is 0x74, and for'x' is 0x78). Then it’s just a simple matter of whether that compared equally or not.

Finally, we’ve not talked much about arrays separate from spans, but there have been improvements there as well.dotnet/runtime#102739 anddotnet/runtime#104103 move more logic for array handling from native code in the runtime up into C# code inCoreLib. For example,Array.Copy has to handle a wide array of cases (pun intended), some of which can be implemented very efficiently and some of which are more laborious, and it tries to optimize the “simple” cases, such as whether the bits from one array can simply be memcpy’d over to the other, with as little overhead as possible. Some of those cases are easy to determine, such as single-dimensional arrays having the exact same type, but other cases require more introspection, such as if one array is enums and the other array is of the underlying type of that enum. The checks to make those determinations previously lived in native code in the runtime, but as of this PR they’re now implemented in C#, and in doing so, some of the overhead associated with the checks has been removed.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private DayOfWeek[] _enums = Enum.GetValues<DayOfWeek>();    private int[] _ints = new int[7];    [Benchmark]    public void Copy() => Array.Copy(_enums, _ints, _enums.Length);}

Method	Runtime	Mean	Ratio
Copy	.NET 8.0	16.25 ns	1.00
Copy	.NET 9.0	11.05 ns	0.68

In addition to other benefits that come from moving such logic into managed code (better maintainability, more implicitly safe code, reduced overhead from transitioning between managed and native, etc.), there’s another less obvious benefit: impact on GC pause times. And that’s nowhere more obvious than withdotnet/runtime#98623, which moved the implementations ofmemset/memcpy helpers used for core operations likeSpan<T>.Fill andArray.Copy from native to managed. Consider this C# console app:

using System.Diagnostics;new Thread(() =>{    var a = new object[1000];    while (true) a.AsSpan().Fill(a);}){ IsBackground = true }.Start();var sw = new Stopwatch();while (true){    sw.Restart();    for (int i = 0; i < 10; i++)    {        GC.Collect();        Thread.Sleep(15);    }    Console.WriteLine(sw.Elapsed.TotalSeconds);}

This is sitting in a loop that simply times how long it takes to perform 10 gen2 collections, each spaced out by ~15 ms. If each collection were free then, this loop should take ~150 ms. Since it’s not free, let’s round up and estimate that the loop should be around ~200 ms. Before we run the loop, though, we launch a thread that just sits in an infinite loop filling a span. That shouldn’t mess with our timing loop… or should it? When I run this on .NET 8, I get values like this:

1.06835240.88847590.84207481.11018041.2730635

Those values are in seconds, and that’s approximately 5x larger than we’d predicated. Now I try on .NET 9, and I get results like this:

0.16382370.21297480.28595660.30204490.2871952

What happened? In order to do some of its work, the GC needs to be able to get a consistent view of the world, which is violated if things are concurrently changing out from under it. As such, it may need to temporarily suspend all threads in the process, but to do that, it needs to wait for each thread to get to a safe point, and if a thread is executing code in the runtime, that can be hard to do. In this particular case, there’s a thread spending almost all of its time sitting in a call toSpan<T>.Fill, akamemset, which was implemented in .NET 8 as a native function in the runtime; this couldn’t be interrupted, and the GC would need to wait until the call returned and it could catch it before it could interrupt that thread. In .NET 9, these implementations are all in managed code, and the GC can trivially get the threads to a safe point.

Collections

LINQ

Language Integrated Query, or LINQ, is a mainstay of .NET. At its heart, LINQ is a specification for hundreds of overloads of methods that manipulate data, and then implementations of that specification for different types. One of the most prominent implementations comes fromSystem.Linq.Enumerable, sometimes referred to as “LINQ to Objects,” which provides an implementation of these operations as methods for working withIEnumerable<T>. It’s an incredibly useful set of operations, used ubiquitously, and thus it’s a common target for performance optimization. In many .NET releases, it’ll get a new additional method here or an optimized method there, a trickle of focused improvements. But in .NET 9, it’s received a huge amount of attention, with some improvements localized to particular methods and others broadly applicable across much of the surface area.

One of the more sweeping LINQ changes in .NET 9 has to do with how various optimizations are implemented. In the original implementation of LINQ circa 2007, almost every method was logically independent from every other. A method likeSelectMany took in anIEnumerable<TSource> and didn’t know anything about where that input came from; every enumerable was processed the same as every other. Some methods would special-case more optimizable data types, though, for exampleToArray would check whether the incomingIEnumerable<TSource> implementedICollection<TSource>, preferring if it did to use the collection’sCount andCopyTo in order to avoid having toMoveNext/Current through the whole input. But a couple of methods, in particular some overloads ofSelect andWhere, did something more interesting. Much of LINQ was implemented using the C# compiler’s support for iterators, where a method that returnsIEnumerable<T> can useyield return t; to produce instances ofT, and the compiler handles rewriting that method into a class that implementsIEnumerable<T> and handles all the gnarly state-machine details for you. These fewSelect andWhere overloads, however, didn’t use iterators, with the developer that authored them instead preferring to write a custom enumerable by hand. Why? It’s possible to hand-author an ever-so-slightly more efficient implementation in some cases, but the compiler is actually really good at doing it well, so that’s not the reason. The reason is because it a) gives the type a name that can be referred to elsewhere in the code, and b) it allows that type to expose state that other code can interrogate. This enables information to flow from one query operator to the next. So, for example,Where could return aWhereEnumerableIterator instance:

class WhereEnumerableIterator<TSource> : Iterator<TSource>{    IEnumerable<TSource> source;    Func<TSource, bool> predicate;    ...}

And thenSelect can look for that type, or, rather, its base type,Iterator<TSource>:

public static IEnumerable<TResult> Select<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector) {    if (source == null) throw Error.ArgumentNull("source");    if (selector == null) throw Error.ArgumentNull("selector");    if (source is Iterator<TSource>) return ((Iterator<TSource>)source).Select(selector);    ...}

and thatWhereEnumerableIterator<TSource> can override that virtualSelect method onIterator<TSource> to specialize what happens when aWhere is followed bySelect:

public override IEnumerable<TResult> Select<TResult>(Func<TSource, TResult> selector) {    return new WhereSelectEnumerableIterator<TSource, TResult>(source, predicate, selector);}

This is useful because it allows for avoiding one of the major sources of overhead with enumerables. Without this optimization, if I hadsource.Where(x => true).Select(x => x), the resulting enumerable would be for theSelect, which would in turn wrap the enumerable for theWhere, which would in turn wrap thesource enumerable. That means that when I callMoveNext on the select iterator, it in turn needs to callMoveNext on theWhere, which will in turn callMoveNext on the source, and then the same forCurrent. That means for each element in the source, we end up making 6 interface calls. With the cited optimization, we no longer have separate iterators for theSelect andWhere. Those end up being combined into a single iterator that does the work of both, eliminating one level from the call chain, so instead of 6 interface calls per element, there’s only 4. (SeeDeep Dive on LINQ andAn event DEEPER Dive into LINQ for a more in-depth exploration of how exactly this works.)

Over the last decade with .NET, those optimizations have been significantly extended, and in some cases to much greater benefit than just saving a few interface calls. For example, in a previous .NET release, a similar mechanism was used to special-caseOrderBy followed byFirst. Without special-casing, theOrderBy would need to do both a full copy of the input source and anO(N log N) sort of the data, all as part of the first call toMoveNext fromFirst. But with the optimization,First is able to see that its source is thatOrderBy, in which case it doesn’t need a copy or sort at all, and can instead simply do anO(N) search ofOrderBy‘s source for its minimum value. That difference can yield monstrous wins.

This additional special-casing was achieved with internal interfaces in the library. AnIIListProvider<TElement> providedToArray,ToList, andGetCount methods, and anIPartition<TElement> interface (which inheritedIIListProvider<TElement>) provided additional methods likeSkip,Take, andTryGetFirst. Custom iterators used to back various LINQ methods could then also implement one or more of these interfaces to specialize being followed by a call likeToArray orCount(). For example, it’s very common (e.g. as part of “paging”) to see call sequences like.Skip(...).Take(...); with these optimizations, those two operations can be consolidated down into a single iterator, and if it were followed by an operation likeLast() orToList(), those could see through both operators to the source in order to possibly optimize based on it (e.g. if the source were an array,Last() could calculate the exact element to return without needing to do any iteration at all).

dotnet/runtime#98969 anddotnet/runtime#99344 remove those internal interfaces and consolidate all of their members down to the baseIterator<TSource> type. This has a variety of benefits. Not directly related to performance, it simplifies the code base, making it easier to maintain (and easier to maintain code is also generally easier to optimize); the interface members ofIPartition<TElement> becamevirtual methods on the base class, which also resulted in some code reduction due to being able to share the same default implementation (though with the introduction of default interface methods a few releases ago, thiscould have been done separately without this consolidation). On the performance front, though, there are three main benefits of this PR:

Virtual dispatch is generally a bit cheaper than interface dispatch. All of those interface methods became virtual methods, enabling all call sites to them to be a bit cheaper.
In various places, type tests were being done for multiple targets, and those could now be consolidated to reduce type checks. For example,Select looked something like this:
```
if (source is Iterator<TSource> iterator){    ...}if (source is IPartition partition){    ...}
```
That means for non-specialized iterators,Select was incurring a type check forIterator<TSource> and an interface check forIPartition<TSource>. With this change, the latter check is now removed.
Some types were only inheriting from the base class but not implementing any of the interfaces, some were implementing an interface but not the other, some were even implementing one of the interfaces but not deriving from the base class. The new approach makes it such that all of the provided virtual methods are implemented by any iterator deriving from the base class.

dotnet/runtime#97905,dotnet/runtime#97956,dotnet/runtime#98874, anddotnet/runtime#99216 also added more implementations.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private IEnumerable<int> _arrayDistinct = Enumerable.Range(0, 1000).ToArray().Distinct();    private IEnumerable<int> _appendSelect = Enumerable.Range(0, 1000).ToArray().Append(42).Select(i => i * 2);    private IEnumerable<int> _rangeReverse = Enumerable.Range(0, 1000).Reverse();    private IEnumerable<int> _listDefaultIfEmptySelect = Enumerable.Range(0, 1000).ToList().DefaultIfEmpty().Select(i => i * 2);    private IEnumerable<int> _listSkipTake = Enumerable.Range(0, 1000).ToList().Skip(500).Take(100);    private IEnumerable<int> _rangeUnion = Enumerable.Range(0, 1000).Union(Enumerable.Range(500, 1000));    [Benchmark] public int DistinctFirst() => _arrayDistinct.First();    [Benchmark] public int AppendSelectLast() => _appendSelect.Last();    [Benchmark] public int RangeReverseCount() => _rangeReverse.Count();    [Benchmark] public int DefaultIfEmptySelectElementAt() => _listDefaultIfEmptySelect.ElementAt(999);    [Benchmark] public int ListSkipTakeElementAt() => _listSkipTake.ElementAt(99);    [Benchmark] public int RangeUnionFirst() => _rangeUnion.First();}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
DistinctFirst	.NET 8.0	49.844 ns	1.00	328 B	1.00
DistinctFirst	.NET 9.0	7.928 ns	0.16	–	0.00

AppendSelectLast	.NET 8.0	3,668.347 ns	1.000	144 B	1.00
AppendSelectLast	.NET 9.0	2.222 ns	0.001	–	0.00

RangeReverseCount	.NET 8.0	8.703 ns	1.00	–	NA
RangeReverseCount	.NET 9.0	3.465 ns	0.40	–	NA

DefaultIfEmptySelectElementAt	.NET 8.0	2,772.283 ns	1.000	144 B	1.00
DefaultIfEmptySelectElementAt	.NET 9.0	4.399 ns	0.002	–	0.00

ListSkipTakeElementAt	.NET 8.0	3.699 ns	1.00	–	NA
ListSkipTakeElementAt	.NET 9.0	2.103 ns	0.57	–	NA

RangeUnionFirst	.NET 8.0	53.670 ns	1.00	344 B	1.00
RangeUnionFirst	.NET 9.0	5.181 ns	0.10	–	0.00

Subsequent PRs also further benefited from this consolidation.dotnet/runtime#99218, for example, uses it to improveEnumerable.Any(IEnumerable<T>).Any just needs to say whether the source has any elements, and it tries hard to determine that without having to get an enumerator from the source, which allocates, and callMoveNext (an interface call) to see if it returns true. In .NET 8, it was doing this usingEnumerable.TryGetNonEnumeratedCount, which usesIterator<T>.GetCount(onlyIfCheap: true) (the “onlyIfCheap” part basically means “don’t enumerate to compute the count”). However, for iterators where it’s not “cheap”,TryGetNonEnumeratedCount would returnfalse, andAny would still be forced to get an enumerator. However, now that everyIterator<T> provides aTryGetFirst,Any can use that in the case where the source is anIterator<T> butGetCount isn’t successful. Worst case,TryGetFirst will itself end up callingGetEnumerator, but best case, the iterator will have provided a more efficient implementation ofTryGetFirst. And either way, it’s still a win, because enumerating would require not only aGetEnumerator call on theIterator<T>, but that in turn would need to callGetEnumerator<T> on whatever source it was wrapping, whereas this ends up saving one layer.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private IEnumerable<int> _data1 = Iterations(100).Where(i => i % 2 == 0).Select(i => i);    private IEnumerable<int> _data2 = Enumerable.Range(0, 100).ToArray().Where(i => i % 2 == 0).Select(i => i);    [Benchmark] public bool Any1() => _data1.Any();    [Benchmark] public bool Any2() => _data2.Any();    private static IEnumerable<int> Iterations(int count)    {        for (int i = 0; i < count; i++) yield return i;    }}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Any1	.NET 8.0	31.967 ns	1.00	104 B	1.00
Any1	.NET 9.0	15.818 ns	0.49	40 B	0.38

Any2	.NET 8.0	21.062 ns	1.00	56 B	1.00
Any2	.NET 9.0	3.780 ns	0.18	–	0.00

Another cross-cutting improvement across LINQ comes indotnet/runtime#96602 and has to do with empty inputs. It’s also a nice example of how what’s considered an optimization ebbs and flows. In the beginning of LINQ,Enumerable.Empty<T>(), which is strongly-typed to returnIEnumerable<T>, returned an emptyT[] as the actual instance. WhenArray.Empty<T>() was introduced, it used that. Then, however, the aforementionedIPartition<T> was introduced internally in LINQ, andEnumerable.Empty<T>() was changed to return a singletonEmptyPartition<T>, an implementation of the interface with all of the methods dedicated to being efficient for empty inputs. This was helpful internally as an implementation detail, as methods that were typed to returnIPartition<T> could return thatEmptyPartition<T> instance, whereas they couldn’t return aT[], since it doesn’t implement that interface. However, it had a downside. A variety of APIs can optimize very well if they know the input is empty, e.g. aTake call can immediately return an empty singleton if it knows the input is empty. But, it can’t be based solely on whether it’s emptynow, but rather if it’s empty now and for always; otherwise, you could callTake, it would see it was empty, then elements get added to the source, and then you callGetEnumerator on the enumerable returned fromTake… according to the rules for how all of this behaves, that should yield the newly-added elements, but ifTake had returned an empty singleton, it wouldn’t. There are a variety of types that we know will always be empty once seen as empty (e.g.ImmutableArray<T>,T[],FrozenSet<T>, etc.), but it’d be too costly to check for each of them individually. Instead, the implementation just picked the same type asEnumerable.Empty<T>() returned as the one to check for. That’s fairly reasonable, but as it turns out, when that type isEmptyPartition<T>, there are a lot of empty arrays that are no longer noticed as being a special empty input. This gets even worse with collection expressions in the picture, as initializing anIEnumerable<T> with[] will, as an implementation detail, produceArray.Empty<T>(). So, this PR put everything back on a plan ofEnumerable.Empty<T>() beingArray.Empty<T>() and aT[0] being what’s checked for when special-casing empty inputs. The PR also included new checks for empty in many different places that warranted it.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private string[] _values = [];    [Benchmark] public object Chunk() => _values.Chunk(10);    [Benchmark] public object Distinct() => _values.Distinct();    [Benchmark] public object GroupJoin() => _values.GroupJoin(_values, i => i, i => i, (i, j) => i);    [Benchmark] public object Join() => _values.Join(_values, i => i, i => i, (i, j) => i);    [Benchmark] public object ToLookup() => _values.ToLookup(i => i);    [Benchmark] public object Reverse() => _values.Reverse();    [Benchmark] public object SelectIndex() => _values.Select((s, i) => i);    [Benchmark] public object SelectMany() => _values.SelectMany(i => i);    [Benchmark] public object SkipWhile() => _values.SkipWhile(i => true);    [Benchmark] public object TakeWhile() => _values.TakeWhile(i => true);    [Benchmark] public object WhereIndex() => _values.Where((s, i) => true);}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Chunk	.NET 8.0	10.7213 ns	1.00	72 B	1.00
Chunk	.NET 9.0	4.1320 ns	0.39	–	0.00

Distinct	.NET 8.0	9.4410 ns	1.00	64 B	1.00
Distinct	.NET 9.0	0.7162 ns	0.08	–	0.00

GroupJoin	.NET 8.0	22.4746 ns	1.00	144 B	1.00
GroupJoin	.NET 9.0	1.1356 ns	0.05	–	0.00

Join	.NET 8.0	18.6332 ns	1.00	168 B	1.00
Join	.NET 9.0	1.3585 ns	0.07	–	0.00

ToLookup	.NET 8.0	23.3518 ns	1.00	128 B	1.00
ToLookup	.NET 9.0	0.9539 ns	0.04	–	0.00

Reverse	.NET 8.0	9.5791 ns	1.00	48 B	1.00
Reverse	.NET 9.0	0.9947 ns	0.10	–	0.00

SelectIndex	.NET 8.0	11.1235 ns	1.00	72 B	1.00
SelectIndex	.NET 9.0	0.5603 ns	0.05	–	0.00

SelectMany	.NET 8.0	10.7537 ns	1.00	64 B	1.00
SelectMany	.NET 9.0	0.9906 ns	0.09	–	0.00

SkipWhile	.NET 8.0	11.2900 ns	1.00	72 B	1.00
SkipWhile	.NET 9.0	1.0988 ns	0.10	–	0.00

TakeWhile	.NET 8.0	11.8818 ns	1.00	72 B	1.00
TakeWhile	.NET 9.0	1.0381 ns	0.09	–	0.00

WhereIndex	.NET 8.0	11.1751 ns	1.00	80 B	1.00
WhereIndex	.NET 9.0	1.2185 ns	0.11	–	0.00

dotnet/runtime#98963 also has to do with emptiness, but actually improves non-empty cases.DefaultIfEmpty needs to produce anIEnumerable<T> containing all of the elements from the source, or if the source is empty, an enumerable with a singledefault(T) value. In most cases, that means it has to allocate a new enumerable, because for the same reasons as just described, it can’t know untilGetEnumerator is called whether the source is empty. Except, it can if the source is aT[], which has an immutable length. This PR thus special-cases arrays, which are very common, such that if the array isn’t empty, it’s just returned directly rather than allocating a wrapper enumerable for it. That’s more than just about avoiding an allocation: that wrapper object would be in the middle of all subsequent iterations of the object, so avoiding it not only avoids an allocation but also a layer of interface calls. And for subsequent code paths that special-case arrays, the result ofDefaultIfEmpty will still be seen as anT[] and thus now special-cased, whereas it wouldn’t if it were wrapped.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int[] _data = Enumerable.Range(0, 1000).ToArray();    [Benchmark]    public double Average() => _data.DefaultIfEmpty().Average();}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Average	.NET 8.0	1,915.4 ns	1.00	80 B	1.00
Average	.NET 9.0	117.6 ns	0.06	–	0.00

Another change taking advantage of emptiness isdotnet/runtime#99256, this time forEnumerable.Chunk.Chunk(int size) creates anIEnumerable<T[]> that pages through the inputsize elements at a time. Normally, this requires iterating through the source and buffering untilsize elements have been reached, then yielding an array with those elements, and then rinsing and repeating. With an array input, we could do this much more efficiently, as we could just do math to compute the right bounds for each set to be yielded and do an efficient copy of the elements, rather than iterating through each element one by one. And while it might not be worth adding a specialized check for array here (Chunk isn’t exactly a high-performance method to begin with, given it’s allocating a new array for each set), as it turns out we now have a check for array, as part of determining whether the source is permanently empty. This PR just leverages that check to take advantage of both answers. If the array is empty, then it still just returns an empty array. But if it’s not empty, rather than falling back to the normal iteration path, it employs a 7-line alternative that’s specialized to arrays and much more efficient.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int[] _values = Enumerable.Range(0, 1000).ToArray();    [Benchmark]    public int Count()    {        int count = 0;        foreach (var chunk in _values.Chunk(10)) count += chunk.Length;        return count;    }}

Method	Runtime	Mean	Ratio
Count	.NET 8.0	3.612 us	1.00
Count	.NET 9.0	1.334 us	0.37

The statement about checking for empty now and permanently applies in particular to methods that accept and return enumerables. It’s the laziness of these methods that makes that relevant. There is, however, a set of LINQ methods that are not lazy because they produce things that aren’t enumerables, such asToArray returning an array,Sum returning a single value,Count returning anint, and so on. These methods also received attention, thanks todotnet/runtime#102884 from@neon-sunset. One of the optimizations applied in various LINQ methods is to special-case input types that are super common, in particularT[] andList<T>. These can be special-cased not just asIList<T>, which would generally be more efficient than enumerating an input via anIEnumerator<T>, but rather as aReadOnlySpan<T>, which can be iterated through very efficiently. This PR extended that optimization to apply to most of these other non-enumerable producing methods, in particular overloads ofAny,All,Count,First, andSingle that take predicates. This is particularly helpful because recent additions to analyzers have resulted in developers being told about opportunities to simplify their LINQ usage.IDE0120 flags code likesource.Where(predicate).First() and instead recommends simplifying it tosource.First(predicate). And while that is a nice simplification and is likely to reduce allocation,Where is considerably more optimized thanFirst(predicate) has been, with the former having special-casing forT[] andList<T> but the latter historically not. That difference is now addressed in .NET 9.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private IEnumerable<int> _list = Enumerable.Range(0, 1000).ToList();    [Benchmark] public bool Any() => _list.Any(i => i == 1000);    [Benchmark] public bool All() => _list.All(i => i >= 0);    [Benchmark] public int Count() => _list.Count(i => i == 0);    [Benchmark] public int First() => _list.First(i => i == 999);    [Benchmark] public int Single() => _list.Single(i => i == 0);}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Any	.NET 8.0	1,553.3 ns	1.00	40 B	1.00
Any	.NET 9.0	222.2 ns	0.14	–	0.00

All	.NET 8.0	1,586.0 ns	1.00	40 B	1.00
All	.NET 9.0	224.9 ns	0.14	–	0.00

Count	.NET 8.0	1,535.6 ns	1.00	40 B	1.00
Count	.NET 9.0	244.6 ns	0.16	–	0.00

First	.NET 8.0	1,600.7 ns	1.00	40 B	1.00
First	.NET 9.0	245.4 ns	0.15	–	0.00

Single	.NET 8.0	1,550.6 ns	1.00	40 B	1.00
Single	.NET 9.0	239.4 ns	0.15	–	0.00

dotnet/runtime#97004 from@neon-sunset uses that same mechanism to improve performance forList<T> inputs inside ofEnumerable.SequenceEqual.Enumerable.SequenceEqual already had a special-case that checked whether both inputs were arrays, and if they were, it created spans from those arrays and delegated toMemoryExtensions.SequenceEquals, which will efficiently iterate through the spans, vectorizing if possible. This PR just tweaked that special-case to use the same helper that’s used elsewhere to try to get a span from the source, and that gives this super power toList<T> as well.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private IEnumerable<int> _source1, _source2;    [GlobalSetup]    public void Setup()    {        _source1 = Enumerable.Range(0, 10_000).ToArray();        _source2 = _source1.ToList();    }    [Benchmark]    public bool SequenceEqual() => _source1.SequenceEqual(_source2);}

Method	Runtime	Mean	Ratio
SequenceEqual	.NET 8.0	26,623.3 ns	1.00
SequenceEqual	.NET 9.0	913.4 ns	0.03

ToArray andToList were generally improved via a variety of PRs, in particular bydotnet/runtime#96570.ToArray in particular is used so ubiquitously that over the years, many folks have attempted to optimize it. In doing so, however, it’s gotten too complex for its own good. This PR takes advantage of newer runtime capabilities to significantly simplify the implementation, while also improving common case performance. The easy cases were already handled well and continue to be: if the source is anICollection<T>, itsCount /CopyTo methods can be used to provide a very efficientToArray, and if the source is anIterator<TSource>,ToArray just delegates to the iterator’sToArray implementation. The challenge, instead, is in efficiently handling the case where we’re dealing with anIEnumerable<T> of unknown length, needing to handle both short and long inputs, and doing so in a way that minimizes allocation and maximizes throughput. To achieve that, this PR deleted the internalArrayBuilder,LargeArrayBuilder, andSparseArrayBuilder types that were previously being used and replaced them all with a simpler internalSegmentedArrayBuilder. The builder is seeded with an[InlineArray]-based struct that’s large enough to hold eightT instances. For up to eight items, the builder can simply use that stack-based space to store the elements. For more than eight items, the builder contains another[InlineArray] of up to 27T[]s. The arrays stored in there are rented from theArrayPool<T>, and based on the starting size and standard doubling growth algorithm, 27 arrays is enough to storeArray.MaxLength elements. This approach means that small inputs never need to allocate (other than for the finalT[], which is unavoidable as it’s the whole purpose of the method), and larger inputs can useArrayPool<T> arrays without having to copy while growing, leading to on average significantly less allocation than before and generally faster throughput. There are trade-offs to this approach when compared to the previous one, with a few niche corner cases it doesn’t handle quite as efficiently, but on the whole it’s an improvement in both performance and maintainability.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments(1)]    [Arguments(8)]    [Arguments(500)]    public string[] IteratorToArray(int count) => GetItems(count).ToArray();    private IEnumerable<string> GetItems(int count)    {        for (int i = 0; i < count; i++)        {            yield return ".NET 9";        }    }}

Method	Runtime	count	Mean	Ratio	Allocated	Alloc Ratio
IteratorToArray	.NET 8.0	1	65.51 ns	1.00	136 B	1.00
IteratorToArray	.NET 9.0	1	41.39 ns	0.63	80 B	0.59

IteratorToArray	.NET 8.0	8	103.30 ns	1.00	192 B	1.00
IteratorToArray	.NET 9.0	8	74.66 ns	0.72	136 B	0.71

IteratorToArray	.NET 8.0	500	3,100.69 ns	1.00	8536 B	1.00
IteratorToArray	.NET 9.0	500	3,080.31 ns	0.99	4072 B	0.48

dotnet/runtime#104365 from@andrewjsaid followed-up on this to use that sameSegmentedArrayBuilder to improveToList. Everything stays the same, except for the last step of constructing the final collection to be returned: rather than allocating an array, it allocates aList<T> and uses theCollectionsMarshal.SetCount method to set both theCapacity andCount of the list to the desired size, then copies the elements directly into the backing array for the list, thanks toCollectionsMarshal.AsSpan.ToList was also improved indotnet/runtime#86796 from@brantburnett. In variousIterator<T>.ToList specializations, the common pattern is to useList<T>.Add to fill in the resulting collection. This PR used a similar approach as with the previous PR, using a combination ofCollectionsMarshal.SetCount andCollectionsMarshal.AsSpan to get a span for the list and directly write into the span. This saves some of the overhead fromList<T>.Add, including bounds checks that would otherwise occur when writing to its backing array.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public List<int> IteratorSelectToList() => GetItems(8).Select(i => i).ToList();    [Benchmark]    public List<int> IteratorWhereSelectToList() => GetItems(8).Where(i => true).Select(i => i).ToList();    private IEnumerable<int> GetItems(int count)    {        for (int i = 0; i < count; i++)        {            yield return i;        }    }}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
IteratorSelectToList	.NET 8.0	75.14 ns	1.00	224 B	1.00
IteratorSelectToList	.NET 9.0	67.50 ns	0.90	184 B	0.82

IteratorWhereSelectToList	.NET 8.0	94.84 ns	1.00	288 B	1.00
IteratorWhereSelectToList	.NET 9.0	89.42 ns	0.94	248 B	0.86

A few more tweaks were made toToList andToArray indotnet/runtime#95224 from@Windows10CE anddotnet/runtime#100218. The former improvedToList on the result of aDistinct orUnion by enablingHashSet<T>‘sCopyTo implementation to be used; previously a custom function was manually iterating through the set, and this PR deleted that code (yay!) and just usedList<T>‘s constructor directly. The latter PR also improvedDistinct andUnion, but forToArray, and specifically in the case where it would have allocated a 0-length array when the source was empty.dotnet/runtime#99639 also improvedToArray andToList on the result of anOrderBy;OrderBy‘s iterator already special-cased empty sources, but with small tweaks it could also be made to optimize sources with only a single element, in which case no additional work needs to be done (a length-1 array is inherently sorted).

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public int[] OrderByToArray() => GetItems(1).OrderBy(x => x).ToArray();    private IEnumerable<int> GetItems(int count)    {        for (int i = 0; i < count; i++)        {            yield return i;        }    }}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
OrderByToArray	.NET 8.0	66.99 ns	1.00	352 B	1.00
OrderByToArray	.NET 9.0	53.92 ns	0.81	160 B	0.45

Not to be left out from the fun itsTo cousins are having,ToDictionary also sees improvements fromdotnet/runtime#96574 from@xin9le. The PR changes the code to do a better job setting the capacity of theDictionary<TKey, TValue> prior to filling it, and also using theCollectionsMarshal.AsSpan to share code for handling sources that are arrays and lists, while also shaving off some overhead by enumerating the span instead of the list directly.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private readonly IEnumerable<KeyValuePair<int, int>> _enumerable = Enumerable.Range(0, 10_000).Select(x => new KeyValuePair<int, int>(x, x));    [Benchmark]    public Dictionary<int, KeyValuePair<int, int>> EnumerableToDictionary() => _enumerable.ToDictionary(x => x.Key);}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
EnumerableToDictionary	.NET 8.0	284.3 us	1.00	788.73 KB	1.00
EnumerableToDictionary	.NET 9.0	149.9 us	0.53	237.01 KB	0.30

dotnet/runtime#96605 updatedEnumerable.Min andEnumerable.Max to specialize forchar,Int128, andUInt128 (previous changes specialized other numerical primitives, but these had been left out). By taking advantage of the existing code paths for handling those other primitives, with just a few lines added/changed, these types can now utilize those faster paths, which in particular special-case arrays and lists (which means it can then avoid an enumerator allocation in addition to faster access to each element).

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private Int128[] _values = Enumerable.Range(0, 1000).Select(x => (Int128)x).ToArray();    [Benchmark]    public Int128 Max() => _values.Max();}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Max	.NET 8.0	1,882.0 ns	1.00	32 B	1.00
Max	.NET 9.0	624.7 ns	0.33	–	0.00

The aforementioned special code paths for the primitive types also support vectorization. Previously that vectorization only supported 128-bit and 256-bit vector widths, but as ofdotnet/runtime#93369 from@Spacefish, it now also supports 512-bit vector widths, possibly doubling the throughput ofEnumerable.Min andEnumerable.Max on supported hardware with the core numerical primitive types.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int[] _values = Enumerable.Range(0, 10_000).ToArray();    [Benchmark]    public int Max() => _values.Max();}

Method	Runtime	Mean	Ratio
Max	.NET 8.0	327.6 ns	1.00
Max	.NET 9.0	166.3 ns	0.51

One caveat here about AVX512. Some AVX512 hardware, even on recent chips, can take a measurable amount of time to “power up,” such that it might be tens, hundreds, or even thousands of cycles before AVX512 processing ends up actually dispatching 512-bit vectors. Until then, the hardware might end up doing the equivalent of dispatching two 256-bit vectors. On my machine, for example, if I lower the size in the previous benchmark from 10,000 elements to 1,000 elements, the .NET 9 improvement disappears and it ends up running at exactly the same throughput as on .NET 8; on a colleague’s machine with a different processor, even at 1,000 elements the .NET 9 throughput is almost twice that of .NET 8. This is all to say, your mileage may vary. In some of the micro-benchmarks discussed in this post, small improvements are made to already very fast operations, and the gains then come from those operations being done many, many, many times on hot paths. In others, the gains come from taking an expensive operation and making it measurably cheaper. In general the benefits with using AVX512 in these kinds of vectorized implementations come for the latter case, where large data sizes lead to operations taking significant amounts of time, and the use of 512-bit vectors instead of 256-bit vectors measurably speeds up those longer operations.

TheOrderBy family of methods onEnumerable were also improved in several ways:

Ordering operations followed by aFirst() orLast() call were already specialized to completely avoid theO(N log N) sort and instead do anO(N) search for the min or max. However,OrderBy in LINQ is fairly complicated because it needs to account for the possibility of one or more subsequentThenBy operations that impact the sort order, and thus it uses a custom comparison mechanism that factors in the possibility of such refinement. That custom comparer mechanism was being used as part of thoseFirst/Last specializations.dotnet/runtime#97483 detects whether there are anyThenBys in play, and if there aren’t, it bypasses that customization and, in doing so, avoids its overhead. That can be very measurable, but in certain cases, it can be enormous, as it can enable other optimizations to kick in, e.g. anOrder().Last() on anint[] can just end up doing a vectorized search for the max.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.InteropServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private List _ints;    [GlobalSetup]    public void Setup()    {        _ints = new(Enumerable.Range(-8000, 8000 * 2));        new Random(42).Shuffle(CollectionsMarshal.AsSpan(_ints));    }    [Benchmark]    public int OrderByLast_Int32() => _ints.OrderBy(x => x).Last();    [Benchmark]    public int OrderLast_Int32() => _ints.Order().Last();}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
OrderByLast_Int32	.NET 8.0	34,715.6 ns	1.00	136 B	1.00
OrderByLast_Int32	.NET 9.0	25,001.1 ns	0.72	128 B	0.94
OrderLast_Int32	.NET 8.0	36,064.9 ns	1.00	112 B	1.00
OrderLast_Int32	.NET 9.0	693.8 ns	0.02	56 B	0.50

In .NET 8,Enumerable.Order was updated to recognize that sorting of certain primitive types is implicitly stable even if an unstable sorting algorithm is used, because any two values of such types that compare equally are indistinguishable in memory (e.g. the onlyInt32 values that compare equally are those with the exact same bit patterns in memory).dotnet/runtime#99533 improves this logic to also handle enums whose underlying type counts.

dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private IEnumerable<DayOfWeek> _days = new Random(42).GetItems(Enum.GetValues<DayOfWeek>(), 100);    [Benchmark]    public int Order()    {        int sum = 0;        foreach (DayOfWeek dow in _days.Order())        {            sum += (int)dow;        }        return sum;    }}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Order	.NET 8.0	1,652.9 ns	1.00	1088 B	1.00
Order	.NET 9.0	873.0 ns	0.53	544 B	0.50

In .NET 8, a change was submitted toEnumerable.Range to vectorize its operation when followed by methods likeToArray. At the time, we had some debate about whether to merge the change, with me asking questions like “Who would actually useEnumerable.Range(...).ToArray() on code paths that care about performance?” As it turns out, we do! As part ofOrderBy‘s stable sort implementation, it had code like this:
```
int[] map = new int[count];for (int i = 0; i < map.Length; i++){    map[i] = i;}
```
For all intents and purposes, that’sEnumerable.Range(0, count).ToArray().dotnet/runtime#99538 recognizes this and uses that same vectorized helper to fill this array in a vectorized manner, and that overhead can actually be measurable in some cases.
```
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private IEnumerable<int> _data = Enumerable.Range(0, 1000).ToArray();    [Benchmark]    public int OrderBy()    {        int sum = 0;        foreach (int value in _data.OrderBy(i => i))        {            sum += value;        }        return sum;    }}
```
Method Runtime Mean Ratio
OrderBy .NET 8.0 14.83 us 1.00
OrderBy .NET 9.0 13.48 us 0.91

GroupBy andToLookup also get some dedicated improvements in .NET 9, thanks todotnet/runtime#99365.GetEnumerator on the grouping object returned by these methods was implemented using a simple C# iterator:

public IEnumerator<TElement> GetEnumerator(){    for (int i = 0; i < _count; i++)    {        yield return _elements[i];    }}

In general we favor using C# iterators over manual implementations (unless we’re going to go all out and implement all of theIterator<TSource> logic) because C# iterators make the code so simple and maintainable. In this particular case, however, this is a reasonably common hot path and we can actually do meaningfully better by hand than the compiler is able to do today. When the compiler generates a state machine for the previous iterator, it does so with a dedicated state field, but with a manual implementation, we can use the same field for state as we use for the iteration variable, which also means we only need to update one thing per loop iteration.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private ILookup<int, string> _stringsByLength =        (from i in Enumerable.Range(0, 10)         from item in Enumerable.Range(0, 8)         select new string((char)('a' + item), i + 1)).ToLookup(s => s.Length);    [Benchmark]    public int Sum()    {        int sum = 0;        foreach (IGrouping<int, string> group in _stringsByLength)        {            foreach (string item in group)            {                sum += item.Length;            }        }        return sum;    }}

Method	Runtime	Mean	Ratio
Sum	.NET 8.0	290.3 ns	1.00
Sum	.NET 9.0	267.4 ns	0.92

Core Collections

As shared inPerformance Improvements in .NET 8,Dictionary<TKey, TValue> is one of the most popular collections in all of .NET, by far (probably not surprising to anyone). And in .NET 9, it gets a performance-focused feature I’ve been wanting for years.

One of the most common uses for a dictionary is as a cache, often indexed by astring key. And for high-performance scenarios, such caches are frequently used in situations where an actualstring object may not be available, but where the text is available, just in a different form, like aReadOnlySpan<char> (or for caches indexed by UTF8 data, the key might be abyte[] yet the data to perform the lookup is only available as aReadOnlySpan<byte>). Performing the lookup on the dictionary then would either require materializing a string from the data, which makes the lookup more costly (and in some cases can entirely defeat the purposes of the cache), or require using a custom key type that’s capable of handling multiple forms of the data, which then also generally requires a custom comparer.

This has been addressed in .NET 9 with the introduction ofIAlternateEqualityComparer<TAlternate, T>. A comparer that implementsIEqualityComparer<T> may now also implement this additional interface one or more times for otherTAlternate types, making it possible for that comparer to treat alternate types the same as theT. Then a type likeDictionary<TKey, TValue> can expose additional methods that work in terms of aTAlternateKey and allow them to work if the comparer in thatDictionary<TKey, TValue> implementsIAlternateEqualityComparer<TAlternateKey, TKey>. In .NET 9 withdotnet/runtime#102907 anddotnet/runtime#103191,Dictionary<TKey, TValue>,ConcurrentDictionary<TKey, TValue>,FrozenDictionary<TKey, TValue>,HashSet<T>, andFrozenSet<T> all do exactly that. For example, here I have aDictionary<string, int> I’m using to count the number of occurrences of each word in a span:

static Dictionary<string, int> CountWords1(ReadOnlySpan<char> input){    Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);    foreach (ValueMatch match in Regex.EnumerateMatches(input, @"\b\w+\b"))    {        ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);        string key = word.ToString();        result[key] = result.TryGetValue(key, out int count) ? count + 1 : 1;    }    return result;}

I’m returning aDictionary<string, int>, so I certainly need to materialize thestring for eachReadOnlySpan<char> in order tostore it in the dictionary, but I should only need to do so once, the first time the word is found. I shouldn’t need to create a new string each time, yet I’m having to in order to do theTryGetValue call. Now with .NET 9, a newGetAlternateLookup method (and a correspondingTryGetAlternateLookup) exists to produce a separate value type wrapper that enables using an alternate key type for all the relevant operations, which means I can now write this:

static Dictionary<string, int> CountWords2(ReadOnlySpan<char> input){    Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);    Dictionary<string, int>.AlternateLookup<ReadOnlySpan<char>> alternate = result.GetAlternateLookup<ReadOnlySpan<char>>();    foreach (ValueMatch match in Regex.EnumerateMatches(input, @"\b\w+\b"))    {        ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);        alternate[word] = alternate.TryGetValue(word, out int count) ? count + 1 : 1;    }    return result;}

Note the distinct lack of aToString(), which means no allocation will occur here for words already seen. How then does thealternate[word] = ... part work? Surely this isn’t storing aReadOnlySpan<char> into the dictionary? Nope. Rather,IAlternateEqualityComparer<TAlternate, T> looks like this:

public interface IAlternateEqualityComparer<in TAlternate, T>    where TAlternate : allows ref struct    where T : allows ref struct{    bool Equals(TAlternate alternate, T other);    int GetHashCode(TAlternate alternate);    T Create(TAlternate alternate);}

TheEquals andGetHashCode should look familiar, the main difference from the corresponding members ofIEqualityComparer<T> just being the type of the first parameter. But then there’s this additionalCreate method. That method accepts aTAlternate and returns aT, giving the comparer the ability to map from one to the other. That setter we saw previously (and other methods likeTryAdd) are able to use this to only create theTKey from theTAlternateKey when they have to, so the setter here will only allocate the string for the word if the word doesn’t already exist in the collection.

Another possibly perplexing thing for anyone reading this and who’s well versed in the ways ofReadOnlySpan<T>: how in the world isDictionary<string, int>.AlternateLookup<ReadOnlySpan<char>> valid?ref structs like span can’t be used as generic parameters, right? Right… until now. C# 13 and .NET 9 now permitref structs as generic parameters, but the generic parameter needs to opt-in to it via the newallows ref struct constraint (or “anti-constraint” as some of us frequently refer to it). There are things a method can do with an instance of an unconstrained generic parameter, like cast it toobject or store it into a field of a class, that can’t be done withref struct. By addingallows ref struct to a generic parameter, it tells the compiler compiling the consumer that it may specify aref struct, and it tells the compiler compiling the type or method with the constraint that the generic instantiation might be aref struct and thus the generic parameter can only be used in situations where aref struct would be legal.

Of course, all of this working hinges on the supplied comparer sporting the appropriateIAlternateEqualityComparer<TAlternate, T> implementation; if it doesn’t, attempts to callGetAlternateLookup will throw an exception, and attempts to callTryGetAlternateLookup will returnfalse. You can use these collection types with whatever comparer you want, and that comparer can provide implementations of this interface for whatever alternate key types you want. But withstring andReadOnlySpan<char> being so common, it’d be a shame if there wasn’t built-in support for this combination. And indeed, with the aforementioned PRs, all of the built-inStringComparer types implementIAlternateEqualityComparer<ReadOnlySpan<char>, string>. That’s why theDictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase); line is successful in the previous code example, as the subsequent call toresult.GetAlternateLookup<ReadOnlySpan<char>>() will successfully find the interface on the supplied comparer.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public partial class Tests{    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;    [GeneratedRegex(@"\b\w+\b")]    private static partial Regex WordParser();    [Benchmark(Baseline = true)]    public Dictionary<string, int> CountWords1()    {        ReadOnlySpan<char> input = s_input;        Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);        foreach (ValueMatch match in WordParser().EnumerateMatches(input))        {            ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);            string key = word.ToString();            result[key] = result.TryGetValue(key, out int count) ? count + 1 : 1;        }        return result;    }    [Benchmark]    public Dictionary<string, int> CountWords2()    {        ReadOnlySpan<char> input = s_input;        Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);        Dictionary<string, int>.AlternateLookup<ReadOnlySpan<char>> alternate = result.GetAlternateLookup<ReadOnlySpan<char>>();        foreach (ValueMatch match in WordParser().EnumerateMatches(input))        {            ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);            alternate[word] = alternate.TryGetValue(word, out int count) ? count + 1 : 1;        }        return result;    }}

Method	Mean	Ratio	Allocated	Alloc Ratio
CountWords1	60.35 ms	1.00	20.67 MB	1.00
CountWords2	57.40 ms	0.95	2.54 MB	0.12

Note the huge reduction in allocation.

For fun, we can also take this example one step further. .NET 6 introduced theCollectionsMarshal.GetValueRefOrAddDefault method, which returns a writableref to the actual location where theTValue for a givenTKey is stored, creating the entry in the dictionary if it doesn’t exist. This is very handy for operations like the one used above, as it helps to avoid an extra dictionary lookup. Without it, we’re doing one lookup as part of theTryGetValue and then another lookup as part of the setter, but with it, we just do the single lookup as part ofGetValueRefOrAddDefault and then no additional lookup is necessary because we already have the location into which we can directly write. And as the lookups in this benchmark are one of the more costly elements, eliminating half of them can significantly reduce the cost of the operation. As part of this alternate key effort, a new overload ofGetValueRefOrAddDefault was added that works with it, such that the same operation can be performed with aTAlternateKey.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.InteropServices;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public partial class Tests{    private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;    [GeneratedRegex(@"\b\w+\b")]    private static partial Regex WordParser();    [Benchmark(Baseline = true)]    public Dictionary<string, int> CountWords1()    {        ReadOnlySpan<char> input = s_input;        Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);        foreach (ValueMatch match in WordParser().EnumerateMatches(input))        {            ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);            string key = word.ToString();            result[key] = result.TryGetValue(key, out int count) ? count + 1 : 1;        }        return result;    }    [Benchmark]    public Dictionary<string, int> CountWords2()    {        ReadOnlySpan<char> input = s_input;        Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);        Dictionary<string, int>.AlternateLookup<ReadOnlySpan<char>> alternate = result.GetAlternateLookup<ReadOnlySpan<char>>();        foreach (ValueMatch match in WordParser().EnumerateMatches(input))        {            ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);            alternate[word] = alternate.TryGetValue(word, out int count) ? count + 1 : 1;        }        return result;    }    [Benchmark]    public Dictionary<string, int> CountWords3()    {        ReadOnlySpan<char> input = s_input;        Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);        Dictionary<string, int>.AlternateLookup<ReadOnlySpan<char>> alternate = result.GetAlternateLookup<ReadOnlySpan<char>>();        foreach (ValueMatch match in WordParser().EnumerateMatches(input))        {            ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);            CollectionsMarshal.GetValueRefOrAddDefault(alternate, word, out _)++;        }        return result;    }}

Method	Mean	Ratio	Allocated	Alloc Ratio
CountWords1	60.73 ms	1.00	20.67 MB	1.00
CountWords2	54.01 ms	0.89	2.54 MB	0.12
CountWords3	44.38 ms	0.73	2.54 MB	0.12

“But wait, there’s more!”dotnet/runtime#104202 extends the alternate comparer implementation forstring/ReadOnlySpan<char> further to also apply toEqualityComparer<string>.Default, which means that if you don’t supply a comparer at all, these collection types will still supportReadOnlySpan<char> lookups. That change not only then improves the usability of these new APIs, but it actually had an additional unintended but welcome performance benefit. Previously,EqualityComparer<string>.Default would return an internalGenericEqualityComparer<string> type, derived fromEqualityComparer<string>. It wouldn’t be possible to implementIAlternateEqualityComparer<ReadOnlySpan<char>, string> onGenericEqualityComparer<string> because doing so would actually have to be done onGenericEqualityComparer<T>, which would mean everyEqualityComparer<T>.Default would supportIAlternateEqualityComparer<ReadOnlySpan<char>, T>, and we have no correct way of providing such an implementation for allTs. Instead, we introduced a new internal non-genericStringEqualityComparer type and madeEqualityComparer<T>.Default return an instance of that whenT isstring (the implementation ofDefault already knows about and returns a bunch of specialized comparers, this is just one more). In doing so, it made the type that’s used non-generic, which in turn means that in some situations, it eliminates some of the overhead associated with generics.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private IEqualityComparer<string> _comparer = EqualityComparer<string>.Default;    private string[] _values = Enumerable.Range(0, 1000).Select(i => i.ToString()).ToArray();    [Benchmark]    public int Count() => CountEquals(_values, "500", _comparer);    [MethodImpl(MethodImplOptions.NoInlining)]    private static int CountEquals<T>(T[] haystack, T needle, IEqualityComparer<T> comparer)    {        int count = 0;        foreach (T value in haystack)        {            if (comparer.Equals(value, needle))            {                count++;            }        }        return count;    }}

Method	Runtime	Mean	Ratio
Count	.NET 8.0	4.477 us	1.00
Count	.NET 9.0	2.808 us	0.63

HashSet<T> also gains all of these super powers, but several additional PRs went into making other performance improvements to it.dotnet/runtime#85877 from@hrrrrustic added aTrimExcess(int capacity) method toHashSet<T> (as well as toQueue<T> andStack<T>), enabling more fine-grained control over how much memory to cull from a set that might have grown larger than is now required. Anddotnet/runtime#102758 from@lilinus improved itsIsSubsetOf,IsProperSubsetOf, andSetEquals methods by tweaking the fast paths already present. The methods were attempting to early-exit as soon as the condition could be provedtrue orfalse, but some common conditions were being missed, and this rectified those.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private HashSet<int> _set = new(Enumerable.Range(0, 1000));    private List<int> _list = Enumerable.Range(0, 999).ToList();    [Benchmark]    public bool IsSubsetOf() => _set.IsSubsetOf(_list);}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
IsSubsetOf	.NET 8.0	7,351.373 ns	1.000	40 B	1.00
IsSubsetOf	.NET 9.0	1.216 ns	0.000	–	0.00

dotnet/runtime#96573 from@ndsvw also identified a few places in various libraries where aDictionary<T, T> was being used as a set and replaced them withHashSet<T>. The implementations ofDictionary<> andHashSet<> are very close in nature, but the latter consumes less memory because it doesn’t need to store separate values. Using aDictionary<T, T> effectively doubles the required storage, so if aHashSet<T> suffices, it’s preferable.

A variety of other collection types have also seen improvements in .NET 9:

PriorityQueue<TElement, TPriority>. TheEnqueueRange(IEnumerable<Telement>, TPriority) method enables multiple elements to all be inserted at the same priority. If there are already elements in the collection, this is akin to just callingEnqueue for each. However, if the collection is currently empty, then it can skip the per-element addition costs and instead just store the elements directly into the backing array. After doing so, it was then performing a heapify operation. Butdotnet/runtime#99139 from@skyoxZ recognized that this heapify was entirely unnecessary, because all of the elements were inserted at the same priority, and there were no elements of any other priority already in the collection. Many performance optimizations come with trade-offs, making one common thing much faster at the expense of making some less common thing a little slower. This, however, is my favorite kind of optimization: elimination of unnecessary work with effectively zero downside.
```
using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[DisassemblyDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private PriorityQueue<int, int> _pq = new();    private int[] _elements = Enumerable.Range(0, 100).ToArray();    [Benchmark]    public void EnqueueRange()    {        _pq.Clear();        _pq.EnqueueRange(_elements, priority: 42);    }}
```
Method Runtime Mean Ratio
EnqueueRange .NET 8.0 239.3 ns 1.00
EnqueueRange .NET 9.0 206.7 ns 0.86

BitArray. Multiple methods onBitArray are already accelerated usingVector128 andVector256, enabling much faster throughput for methods likeAnd,Or, andNot.dotnet/runtime#91903 from@khushal1996 addsVector512 support to all of these as well, enabling hardware with AVX512 support to process these operations upwards of twice as fast as before.

using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Collections;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private BitArray _first = new BitArray(1024 * 1024);    private BitArray _second = new BitArray(1024 * 1024);    [Benchmark]    public void Or() => _first.Or(_second);}

Method	Runtime	Mean	Ratio
Or	.NET 8.0	2.894 us	1.00
Or	.NET 9.0	2.354 us	0.81

List<T>.dotnet/runtime#90089 from@karakasa avoided an extraArray.Copy call as part ofInsert. Previously the implementation may have done up to threeArray.Copy operations, and as part of this change that can drop to just two.
FrozenDictionary<TKey, TValue> andFrozenSet<T>. These frozen collections introduced in .NET 8 have also received some attention in .NET 9. As a reminder,FrozenDictionary<TKey, TValue> andFrozenSet<T> are immutable collections optimized for reading, willing to spend more time and effort during construction to make subsequent operations on the collections as fast as possible. When theTKey/T is astring, one optimization employed is to track the minimum and maximum lengths of all strings in the collection; if astring that’s shorter or longer than that is used in a lookup, the collection can immediately report that it’s not in the collection without having to actually perform any lookup, instead just comparing against the min and max.dotnet/runtime#92546 from@andrewjsaid extends this further by employing a bitmap of up to 64 bits corresponding to lengths of strings contained in the collection. On lookup, rather than only comparing against min/max, the implementation can test whether the corresponding bit for thestring‘s length is set, bailing immediately if it’s not.dotnet/runtime#100998 also reduced creation overheads with frozen collections created withstring keys andStringComparer.OrdinalIgnoreCase. The implementation had been using its own custom comparison logic for hash code generation, in order to support building fornetstandard2.0 in addition to .NET Core, but this PR specialized the code for .NET Core to usestring.GetHashCode(ReadOnlySpan<char>, StringComparison), which is more efficient than the custom implementation.
```
using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Collections.Frozen;using System.Text.RegularExpressions;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly FrozenSet<string> s_words = Regex.Matches("""        Let me not to the marriage of true minds        Admit impediments; love is not love        Which alters when it alteration finds,        Or bends with the remover to remove.        O no, it is an ever-fixed mark        That looks on tempests and is never shaken;        It is the star to every wand'ring bark        Whose worth's unknown, although his height be taken.        Love's not time's fool, though rosy lips and cheeks        Within his bending sickle's compass come.        Love alters not with his brief hours and weeks,        But bears it out even to the edge of doom:        If this be error and upon me proved,        I never writ, nor no man ever loved.        """, @"\b\w+\b").Cast<Match>().Select(w => w.Value).ToFrozenSet();    private string _word = "quickness";    [GlobalSetup] public void Setup() => Console.WriteLine(s_words);    [Benchmark]    public bool Contains() => s_words.Contains(_word);}
```
Method Runtime Mean Ratio
Contains .NET 8.0 4.373 ns 1.00
Contains .NET 9.0 1.154 ns 0.26

Compression

It’s an important goal of the core .NET libraries to be as platform-agnostic as possible. Things should generally behave the same way regardless of which operating system or which hardware is being used, excepting things that really are operating system or hardware specific (e.g. we purposefully don’t try to paper over casing differences of different file systems). To that end, we generally implement as much as possible in C#, deferring down to the operating system and native platform libraries only when necessary; for example, the default .NET HTTP implementation,System.Net.Http.SocketsHttpHandler, is written in C# on top ofSystem.Net.Sockets,System.Net.Dns, etc., and subject to the implementation of sockets on each platform (where behaviors are implemented by the operating system), generally behaves the same wherever you’re running.

There are, however, just a few specific places where we’ve actively made the choice to defer more to something in the platform. The most important case here is cryptography, where we want to rely on the operating system for such security-related functionality; so on Windows, for example, TLS is implemented in terms of components likeSChannel, on Linux in terms ofOpenSSL, and on macOS in terms ofSecureTransport. The other notable case has been compression, and in particularzlib. We decided long ago to simply use whateverzlib was distributed with the operating system. That has had various implications, however. Starting with the fact that Windows doesn’t ship withzlib as a library exposed for consumption, so the .NET build targeting Windows still had to include its own copy ofzlib. That was then improved but also complicated by a decision to switch to distribute a variant ofzlib produced by Intel, which was nicely optimized further for x64, but which didn’t have as much attention paid to other hardware, like Arm64. And very recently, theintel/zlib repository was archived and is not actively being maintained by Intel.

To simplify things, to improve consistency and performance across more platforms, and to move to an actively supported and evolving implementation, this changes for .NET 9. Thanks to a stream of PRs, and in particulardotnet/runtime#104454 anddotnet/runtime#105771, .NET 9 now includes thezlib functionality built-in across Windows, Linux, and macOS, based on the newerzlib-ng/zlib-ng.zlib-ng is azlib-compatible API that is actively maintained, includes improvements previously made to both Intel and Cloudflare’s forks, and has received improvements across many different CPU intrinsics.

Benchmarking just throughput is easy with BenchmarkDotNet. Unfortunately, while I love the tool, the lack ofdotnet/BenchmarkDotNet#784 makes it very challenging to appropriately benchmark compression, because throughput is only one part of the equation. Compression ratio is also a key part of the equation (you can make “compression” super fast by just outputting the input without actually manipulating it at all), so we also need to know about compressed output size when discussing compression speeds. To do that for this post, I’ve hacked up just enough in this benchmark to make it work for this example, implementing a custom column for BenchmarkDotNet, but please note this is not a general-purpose implementation.

// dotnet run -c Release -f net8.0 --filter "*"// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Columns;using BenchmarkDotNet.Configs;using BenchmarkDotNet.Reports;using BenchmarkDotNet.Running;using System.IO.Compression;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(    args,    DefaultConfig.Instance.AddColumn(new CompressedSizeColumn()));[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private byte[] _uncompressed = new HttpClient().GetByteArrayAsync(@"https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;    [Params(CompressionLevel.NoCompression, CompressionLevel.Fastest, CompressionLevel.Optimal, CompressionLevel.SmallestSize)]    public CompressionLevel Level { get; set; }    private MemoryStream _compressed = new MemoryStream();    private long _compressedSize;    [Benchmark]    public void Compress()    {        _compressed.Position = 0;        _compressed.SetLength(0);        using (var ds = new DeflateStream(_compressed, Level, leaveOpen: true))        {            ds.Write(_uncompressed, 0, _uncompressed.Length);        }        _compressedSize = _compressed.Length;    }    [GlobalCleanup]    public void SaveSize()    {        File.WriteAllText(Path.Combine(Path.GetTempPath(), $"Compress_{Level}"), _compressedSize.ToString());    }}public class CompressedSizeColumn : IColumn{    public string Id => nameof(CompressedSizeColumn);    public string ColumnName { get; } = "CompressedSize";    public bool AlwaysShow => true;    public ColumnCategory Category => ColumnCategory.Custom;    public int PriorityInCategory => 1;    public bool IsNumeric => true;    public UnitType UnitType { get; } = UnitType.Size;    public string Legend => "CompressedSize Bytes";    public bool IsAvailable(Summary summary) => true;    public bool IsDefault(Summary summary, BenchmarkCase benchmarkCase) => true;    public string GetValue(Summary summary, BenchmarkCase benchmarkCase, SummaryStyle style) =>        GetValue(summary, benchmarkCase);    public string GetValue(Summary summary, BenchmarkCase benchmarkCase) =>        File.ReadAllText(Path.Combine(Path.GetTempPath(), $"Compress_{benchmarkCase.Parameters.Items[0].Value}")).Trim();}

Running that for .NET 8, I get this:

Method	Level	Mean	CompressedSize
Compress	NoCompression	1.783 ms	16015049
Compress	Fastest	164.495 ms	7312367
Compress	Optimal	620.987 ms	6235314
Compress	SmallestSize	867.076 ms	6208245

and for .NET 9, I get this:

Method	Level	Mean	CompressedSize
Compress	NoCompression	1.814 ms	16015049
Compress	Fastest	64.345 ms	9578398
Compress	Optimal	230.646 ms	6276158
Compress	SmallestSize	567.579 ms	6215048

A few interesting things to note here:

On both .NET 8 and .NET 9, there’s an obvious correlation: the more compression is requested, the slower it gets and the smaller the file size becomes.
NoCompression, which really just echos the input bytes back as output, produces the exact same compressed size across .NET 8 and .NET 9, as one would hope; the compressed size should be identical to the input size.
The compressed size forSmallestSize is almost the same between .NET 8 and .NET 9; they differ by only ~0.1%, but for that small increase, theSmallestSize throughput ends up being ~35% faster. In both cases, the .NET layer is just passing down a zlib compression level of 9, which is the largest value possible and denotes best-possible compression. It just happens that withzlib-ng, that best possible compression is significantly faster and just a tad bit worse compression-ratio-wise than withzlib.
ForOptimal, which is also the default and represents a balanced tradeoff between speed and compression ratio (with 20/20 hindsight, the name for this member should have beenBalanced), the .NET 9 version usingzlib-ng is 60% faster while only sacrificing ~0.6% on compression ratio.
Fastest is interesting. The .NET implementation is just passing down a compression level of 1 to thezlib-ng native code, indicating to choose the fastest speed while still doingsome compression (0 means don’t compress at all). But thezlib-ng implementation is obviously making different trade-offs than did the olderzlib code, as it’s truer to its name: it’s running more than 2x as fast and still compressing, but the compressed output is ~30% larger than the output on .NET 8.

The net effect of this is, especially if you’re usingFastest, you might want to re-evaluate to see whether the throughput / compression ratios meet your needs. If you want to tweak it further, though, you’re no longer limited to just these options.dotnet/runtime#105430 adds new constructors toDeflateStream,GZipStream,ZLibStream, and also the unrelatedBrotliStream, enabling more fine-grained control over the parameters passed to the native implementations, e.g.

private static readonly ZLibCompressionOptions s_options = new ZLibCompressionOptions(){    CompressionLevel = 2,};...Stream sourceStream = ...;using var ds = new DeflateStream(compressed, s_options, leaveOpen: true){    sourceStream.CopyTo(ds);}

Cryptography

Investments inSystem.Security.Cryptography are generally focused on improving the security of a system, supporting new cryptographic primitives, better integrating with security capabilities of the underlying operating system, and so on. But as cryptography is ever present in most modern systems, it’s also impactful to make the existing functionality more efficient, and a variety of PRs in .NET 9 have done so.

Let’s start with random number generation. .NET 8 added a newGetItems method to bothRandom (the core non-cryptographically-secure random number generator) andRandomNumberGenerator (the core cryptographically-secure random number generator). This method is very handy when you need to randomly generate N elements sourced from a specific set of values. For example, if you wanted to write 100 random hex characters to a destinationSpan<char>, you could do:

Span<char> dest = stackalloc char[100];Random.Shared.GetItems("0123456789abcdef", dest);

The core implementation is very simple, and is just a convenience implementation for something you could easily do yourself:

for (int i = 0; i < dest.Length; i++){    dest[i] = choices[Next(choices.Length)];}

Easy peasy. However, in some situations we can do better. This implementation ends up making a call to the random number generator for each element, and that roundtrip adds measurable overhead. If we could instead make fewer calls, we could ammortize that overhead across however many elements could be filled by that single call. That’s exactly whatdotnet/runtime#92229 does. If the number of choices is less than or equal to 256 and a power of two, rather than asking for a random integer for each element, we can instead get a byte for each element, and we can do that in bulk with a single call to NextBytes. The max of 256 choices is because that’s the number of values a byte can represent, and the power of two is so that we can simply mask off unnecessary bits from the byte, which helps to avoid bias. This makes a measurable impact forRandom, but even more so forRandomNumberGenerator, where each call to get random bytes requires a transition into the operating system.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Security.Cryptography;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private char[] _dest = new char[100];    [Benchmark]    public void GetRandomHex() => RandomNumberGenerator.GetItems<char>("0123456789abcdef", _dest);}

Method	Runtime	Mean	Ratio
GetRandomHex	.NET 8.0	58,659.2 ns	1.00
GetRandomHex	.NET 9.0	746.5 ns	0.01

Sometimes performance improvements are about revisiting past assumptions. .NET 5 added a newGC.AllocateArray method which optionally allows that array to be created on the “pinned object heap,” or POH. Allocating on the POH is the same as allocating normally, except that the GC guarantees that objects on the POH won’t be moved (normally the GC is free to compact the heap, moving objects around in order to reduce fragmentation). This is a useful guarantee for cryptography, which employs defense-in-depth measures like zero’ing out buffers to reduce the chances of an attacker being able to find sensitive information in the memory (or memory dump) of a process. The crypto library wants to be able to allocate some memory, use it to temporarily contain some sensitive information, and then zero out the memory before stopping using it, but if the GC is able to move the object around in the interim, it could end up leaving shadows of the data on the heap. When the POH was introduced, then,System.Security.Cryptography started using it, including for relatively short-lived objects. This is potentially problematic, however. Because the nature of the POH is that objects can’t be moved around, creating short-lived objects on the POH can significantly increase fragmentation, which can in turn increase memory consumption, GC costs, and so on. And as a result, the POH is really only recommended for long-lived objects, ideally ones that you create and then hold onto for the remainder of the process’ lifetime.dotnet/runtime#99168 undidSystem.Security.Cryptography‘s reliance on the POH, instead preferring to use native memory (e.g. viaNativeMemory.Alloc andNativeMemory.Free) for such needs.

On the subject of memory, multiple PRs went into the crypto libraries to reduce allocation. Here are some examples:

Marshaling pointers instead of temporary arrays. TheCngKey type exposes properties likeExportPolicy,IsMachineKey, andKeyUsage, all of which utilize an internalGetPropertyAsDword method that P/Invokes to retrieve an integer from Windows. It was doing so, however, via a shared helper that was allocating a 4-bytebyte[], passing that to the OS to fill, and then converting those four bytes into anint.dotnet/runtime#91521 changed the interop path to instead just store theint on the stack, passing a pointer to it to the OS, avoiding the need to allocate and parse.
Special-casing empty. Throughout the core libraries, we rely heavily onArray.Empty<T>() to avoid allocating lots of empty arrays when we could instead just employ singletons. The crypto libraries work with a lot of arrays, and as part of defense-in-depth, will often hand out clones of those arrays rather than handing out the same array to everyone; that’s handled by a sharedCloneByteArray helper. As it turns out, however, it’s reasonably common for arrays to be empty, yetCloneByteArray wasn’t special-casing them, and was thus always allocating new arrays even if the input was empty.dotnet/runtime#93231 simply special-cased empty input arrays to return themselves rather than clone them.
Avoiding unnecessary defensive copies.dotnet/runtime#97108 avoids more defensive copies than just those for empty arrays mentioned above. ThePublicKey type is passed twoAsnEncodedData instances, one for parameters and one for a key value, and both of which it clones to avoid any issues that might arise with that provided instance being mutated. But in some internal uses, the caller is constructing a temporaryAsnEncodedData and effectively transferring ownership, yetPublicKey would still then make a defensive copy, even though the temporary could have just been used in its stead. This change enables the original instances to just be used without copy in such cases.
Using collection expressions with spans. One of the really neat things about the collection expressions feature introduced in C# 11 is it allows you to express your intent for what you want and allow the system to implement that as best it can. As part of initializingOidLookup, it had multiple lines that look like this:
```
AddEntry("1.2.840.10045.3.1.7", "ECDSA_P256", new[] { "nistP256", "secP256r1", "x962P256v1", "ECDH_P256" });AddEntry("1.3.132.0.34", "ECDSA_P384", new[] { "nistP384", "secP384r1", "ECDH_P384" });AddEntry("1.3.132.0.35", "ECDSA_P521", new[] { "nistP521", "secP521r1", "ECDH_P521" });
```
This effectively forced it to allocate these arrays, even though theAddEntry method doesn’t actually require the array-ness and just iterates through the supplied values.dotnet/runtime#100252 changedAddEntry to take aReadOnlySpan<string> instead ofstring[], and changed all the call sites to be collection expressions:
```
AddEntry("1.2.840.10045.3.1.7", "ECDSA_P256", ["nistP256", "secP256r1", "x962P256v1", "ECDH_P256"]);AddEntry("1.3.132.0.34", "ECDSA_P384", ["nistP384", "secP384r1", "ECDH_P384"]);AddEntry("1.3.132.0.35", "ECDSA_P521", ["nistP521", "secP521r1", "ECDH_P521"]);
```
allowing the compiler to do the “right thing.” All of those call sites then instead just end up using stack space to store the strings passed toAddEntry, rather than allocating any arrays at all.
Presizing collections. Many collections, such asList<T> orDictionary<TKey, TValue>, allow you to create a new one, with no a priori knowledge of how large they’ll grow to be, and internally they handle growing their storage to accommodate additional data. The growth algorithm employed typically involves doubling capacity, as doing so strikes a reasonable balance between possibly wasting some memory and not having to re-grow too frequently. However, such growing does have overhead, avoiding it is desirable, and so many collections offer the ability to pre-size the capacity of the collection, e.g.List<T> has a constructor that accepts anint capacity, where the list will immediately create a backing store large enough to accommodate that many elements. TheOidCollection in cryptography didn’t have such a capability even though many of the places it was being created did know the exact required size, which in turn results in unnecessary allocation and copying as the collection grows to reach the target size.dotnet/runtime#97106 added such a constructor internally and used it in various places, in order to avoid that overhead. As withOidCollection,CborWriter also lacked the ability to presize, making the aforementioned growth algorithm problem even more stark.dotnet/runtime#92538 added such a constructor.
AvoidingO(N^2) growth algorithms.dotnet/runtime#92435 from@MichalPetryka fixes a good example of what happens when youdon’t employ such a doubling scheme as part of collection resizing. The algorithm used to grow the buffer used byCborWriter would increase the backing buffer by a fixed number of elements each time. A doubling strategy ensures you need no more thanO(log N) growth operations, and ensures thatN items can be added to a collection inO(N) time, since the number of element copies will beO(2N), which is justO(N) (e.g. if N == 128, and you start with a buffer of size 1, and you grow to 2, then 4, 8, 16, 32, 64, and 128, that’s 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128, which is 255, or just under twice N). But increasing by a fixed number can meanO(N) such operations. And since each growth operation also needs to copy all the elements (assuming the growing is done by array resizing), that makes the algorithmO(N^2). In the extreme, if that fixed number was 1, and we were again growing from 1 to 128 one at a time, that’s just summing all the numbers from 1 to 128, the formula for which isN(N+1)/2, which isO(N^2). This PR changedCborWriters‘s growth strategy to use doubling instead.
```
// Add a <PackageReference Include="System.Formats.Cbor" Version="8.0.0" /> to the csproj.// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Configs;using BenchmarkDotNet.Environments;using BenchmarkDotNet.Jobs;using BenchmarkDotNet.Running;using System.Formats.Cbor;var config = DefaultConfig.Instance    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet("System.Formats.Cbor", "8.0.0").AsBaseline())    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet("System.Formats.Cbor", "9.0.0-rc.1.24431.7"));BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]public class Tests{    [Benchmark]    public CborWriter Test()    {        const int NumArrayElements = 100_000;        CborWriter writer = new();        writer.WriteStartArray(NumArrayElements);        for (int i = 0; i < NumArrayElements; i++)        {            writer.WriteInt32(i);        }        writer.WriteEndArray();        return writer;    }}
```
Method Runtime Mean Ratio Allocated Alloc Ratio
Test .NET 8.0 25,185.2 us 1.00 65350.11 KB 1.00
Test .NET 9.0 697.2 us 0.03 1023.82 KB 0.02

Of course, improving performance is more than just avoiding allocation. A variety of changes helped in other ways.

dotnet/runtime#99053 “memoizes” (caches) various properties onCngKey that are accessed multiple times but where the answer stays the same from call to call; it does so simply by adding a few fields to the type to cache these values, which is a significant win if any is accessed multiple times over the lifetime of the object. The affected properties (Algorithm,AlgorithmGroup, andProvider) are particularly expensive because the OS implementation of these functions needs to make a remote procedure call to another Windows process to access the relevant data.

// Windows-only test.// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Security.Cryptography;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private RSACng _rsa = new RSACng(2048);    [GlobalCleanup]    public void Cleanup() => _rsa.Dispose();    [Benchmark]    public CngAlgorithm GetAlgorithm() => _rsa.Key.Algorithm;    [Benchmark]    public CngAlgorithmGroup? GetAlgorithmGroup() => _rsa.Key.AlgorithmGroup;    [Benchmark]    public CngProvider? GetProvider() => _rsa.Key.Provider;}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
GetAlgorithm	.NET 8.0	63,619.352 ns	1.000	88 B	1.00
GetAlgorithm	.NET 9.0	10.216 ns	0.000	–	0.00

GetAlgorithmGroup	.NET 8.0	62,580.363 ns	1.000	88 B	1.00
GetAlgorithmGroup	.NET 9.0	8.354 ns	0.000	–	0.00

GetProvider	.NET 8.0	62,108.489 ns	1.000	232 B	1.00
GetProvider	.NET 9.0	8.393 ns	0.000	–	0.00

There were also several improvements related to loading certificates and keys.dotnet/runtime#97267 from@birojnayak addressed an issue on Linux where the same certificate was being processed multiple times rather than just once, anddotnet/runtime#97827 improved the performance of RSA key loading by avoiding some unnecessary work that the key validation was performing.

Networking

Quick, when was the last time you worked on a real application or service that didn’t involve networking at all? I’ll wait… (I’m so funny.) Effectively every modern application relies on networking in one way, shape, or form, especially one that’s following more cloud-native architectures, involving microservices, and the like. Driving down the costs associated with networking is something we take very seriously, and the .NET community whittles away at these costs every release, including in .NET 9.

SslStream has been a key focus for performance optimization in past releases. It’s used by a significant portion of traffic with bothHttpClient and the ASP.NET Kestrel web server, putting it on the hot path for many systems. Previous improvements have targeted both steady-state throughput as well as creation overhead.

In .NET 9, a few PRs focused on steady-state throughput, such asdotnet/runtime#95595, which addressed an issue where some packets were being unnecessarily split into two, leading to extra overhead associated with needing to send and receive that extra packet. This was particularly impactful when writing out exactly 16K, and especially on Windows (where I’ve run this test):

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Net;using System.Net.Security;using System.Net.Sockets;using System.Runtime.InteropServices;using System.Security.Authentication;using System.Security.Cryptography.X509Certificates;using System.Security.Cryptography;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private SslStream _client, _server;    private byte[] _buffer = new byte[16 * 1024];    private readonly SslServerAuthenticationOptions _serverOptions = new SslServerAuthenticationOptions    {        ServerCertificateContext = SslStreamCertificateContext.Create(GetCertificate(), null),        EnabledSslProtocols = SslProtocols.Tls13,    };    private readonly SslClientAuthenticationOptions _clientOptions = new SslClientAuthenticationOptions    {        TargetHost = "localhost",        RemoteCertificateValidationCallback = delegate { return true; },        EnabledSslProtocols = SslProtocols.Tls13,    };    [GlobalSetup]    public async Task Setup()    {        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));        listener.Listen(1);        var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };        client.Connect(listener.LocalEndPoint!);        Socket serverSocket = listener.Accept();        serverSocket.NoDelay = true;        _client = new SslStream(new NetworkStream(client, ownsSocket: true), leaveInnerStreamOpen: true);        _server = new SslStream(new NetworkStream(serverSocket, ownsSocket: true), leaveInnerStreamOpen: true);        await Task.WhenAll(            _client.AuthenticateAsClientAsync(_clientOptions),            _server.AuthenticateAsServerAsync(_serverOptions));    }    [GlobalCleanup]    public void Cleanup()    {        _client.Dispose();        _server.Dispose();    }    [Benchmark]    public async Task SendReceive()    {        await _client.WriteAsync(_buffer);        await _server.ReadExactlyAsync(_buffer);    }    private static X509Certificate2 GetCertificate()    {        X509Certificate2 cert;        using (RSA rsa = RSA.Create())        {            var certReq = new CertificateRequest("CN=localhost", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);            certReq.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));            certReq.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(new OidCollection { new Oid("1.3.6.1.5.5.7.3.1") }, false));            certReq.CertificateExtensions.Add(new X509KeyUsageExtension(X509KeyUsageFlags.DigitalSignature, false));            cert = certReq.CreateSelfSigned(DateTimeOffset.UtcNow.AddMonths(-1), DateTimeOffset.UtcNow.AddMonths(1));            if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))            {                cert = new X509Certificate2(cert.Export(X509ContentType.Pfx));            }        }        return cert;    }}

Method	Runtime	Mean	Ratio
SendReceive	.NET 8.0	43.07 us	1.00
SendReceive	.NET 9.0	29.38 us	0.68

dotnet/runtime#100513 also reduced the cost of checkingSslStream.IsMutuallyAuthenticated orSslStream.LocalCertificate from a client when a client certificate is being used.

However, the bigger impacts in .NET 9 weren’t on steady-state throughput but rather on TLS connection establishment, aka the handshake. Establishing a TLS connection requires the client and server to engage in a conversation where they agree on details like TLS version, what cipher suite to use, confirm the other is who they say they are, create and exchange dedicated symmetric keys for the communication, and so on. That’s a relatively expensive endeavor. For long-lived connections, that overhead is generally not a big deal, but there are plenty of scenarios where connections are more routinely established and torn down, and for those, we want to drive down the overhead associated with setting up such a connection.

dotnet/runtime#87874 focused on reducing allocations associated with the TLS handshake, by renting some buffers fromArrayPool<byte> rather than always allocating. Anddotnet/runtime#97348 continued the effort by avoiding some unnecessarySafeHandle allocation.dotnet/runtime#103814 also switched a rarely-neededConcurrentDictionary<> in the Linux implementation to be lazily allocated rather than always being allocated as part of the handshake. These changes combine to significantly reduce the allocation incurred as part of setting up TLS:

using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Net;using System.Net.Security;using System.Net.Sockets;using System.Runtime.InteropServices;using System.Security.Authentication;using System.Security.Cryptography.X509Certificates;using System.Security.Cryptography;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private NetworkStream _client, _server;    private readonly SslServerAuthenticationOptions _serverOptions = new SslServerAuthenticationOptions    {        ServerCertificateContext = SslStreamCertificateContext.Create(GetCertificate(), null),        EnabledSslProtocols = SslProtocols.Tls13,    };    private readonly SslClientAuthenticationOptions _clientOptions = new SslClientAuthenticationOptions    {        TargetHost = "localhost",        RemoteCertificateValidationCallback = delegate { return true; },        EnabledSslProtocols = SslProtocols.Tls13,    };    [GlobalSetup]    public void Setup()    {        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));        listener.Listen(1);        var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };        client.Connect(listener.LocalEndPoint!);        Socket serverSocket = listener.Accept();        serverSocket.NoDelay = true;        _server = new NetworkStream(serverSocket, ownsSocket: true);        _client = new NetworkStream(client, ownsSocket: true);    }    [GlobalCleanup]    public void Cleanup()    {        _client.Dispose();        _server.Dispose();    }    [Benchmark]    public async Task Handshake()    {        using var client = new SslStream(_client, leaveInnerStreamOpen: true);        using var server = new SslStream(_server, leaveInnerStreamOpen: true);        await Task.WhenAll(            client.AuthenticateAsClientAsync(_clientOptions),            server.AuthenticateAsServerAsync(_serverOptions));    }    private static X509Certificate2 GetCertificate()    {        X509Certificate2 cert;        using (RSA rsa = RSA.Create())        {            var certReq = new CertificateRequest("CN=localhost", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);            certReq.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));            certReq.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(new OidCollection { new Oid("1.3.6.1.5.5.7.3.1") }, false));            certReq.CertificateExtensions.Add(new X509KeyUsageExtension(X509KeyUsageFlags.DigitalSignature, false));            cert = certReq.CreateSelfSigned(DateTimeOffset.UtcNow.AddMonths(-1), DateTimeOffset.UtcNow.AddMonths(1));            if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))            {                cert = new X509Certificate2(cert.Export(X509ContentType.Pfx));            }        }        return cert;    }}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Handshake	.NET 8.0	2.652 ms	1.00	5.03 KB	1.00
Handshake	.NET 9.0	2.581 ms	0.97	3.3 KB	0.66

Of course, while driving down the costs of doing something is good, avoiding that thing altogether is even better. “TLS resumption” is a capability in the TLS protocol where, if a connection is closed and the same client later opens a new connection to the same server, the client may be able to effectively pick up where it left off with the previous TLS connection rather than starting a brand new one from scratch. Support for TLS resumption on Linux was added in .NET 7, but clients using client certificates weren’t supported… now in .NET 9 thanks todotnet/runtime#102656, even such clients can benefit from this significant time saver.

TLS resumption is an optimization where information is stored to enable more efficient operation later. In some ways, it’s not unlike pooling in that regard. We frequently talk about pooling as a way to optimize. Often our conversations are around avoiding allocations, where employing a pool is betting that you can be more efficient than the garbage collector. For small, cheap to create objects, that’s often a bad bet. For larger objects, such as for larger arrays, it can be a good bet, which is whyArrayPool<T> exists and is used throughout the core libraries for getting temporary buffers (see thisDeep .NET discussion onArrayPool for more info). But there’s a much more impactful class where pooling is useful, and that’s where the thing being pooled is really expensive to create. Such cases are no longer about memory management, they’re about ammortizing the cost of that creation. And out of all of the pooling done throughout the core libraries, it’s hard to imagine a more impactful case of that then the HTTP connection pool. The objects in this pool represent established connections to an HTTP server, and establishing such connections can be measured in microseconds, or even seconds in certain environments. If such costs had to be paid every single time you were making an HTTP request, it would add huge latency throughout the system. Instead, outgoing HTTP requests try to grab a connection from the connection pool, reusing that connection for the individual request/response, and then putting the connection back into the pool when done.

However, as with any pool, the pool itself has cost. In the case of the HTTP connection pool in aSocketsHttpHandler instance, the most important factor impacting performance is how quickly a connection can be rented and returned to the pool, especially when under load. That load aspect is important, because as a shared resource, access to this pool must be synchronized, in order to ensure the correctness of the system: it’d be really bad, for example, if two requests went to rent a connection at the same time and ended up incorrectly being given the same connection to use, concurrently. “Really bad” in such a case could not only be corrupting data, but possibly even sending the wrong data to the wrong server. That obviously needs to be avoided. So, synchronization is used, but that synchronization creates a bottleneck, where under load lots of requests can end up being blocked just waiting to check whether a connection is even available. Over the years we’ve whittled away at that cost, but it gets even lower in .NET 9, in particular for HTTP/1.1 connections (we talk about “the” pool, but in reality connections are only pooled together when they’re interchangeable, so there are actually many groupings of connections, for example with HTTP/1.1 connections separate from HTTP/2 connections or HTTP/3 connections, a separate pool for each endpoint, etc.).dotnet/runtime#99364 changes the synchronization mechanism from using a pure lock-based scheme to a more opportunistic concurrency scheme that employs a first-layer of lockless synchronization. There’s now still a lock, but for the hot path it’s avoided as long as there are connections in the pool by using aConcurrentStack<T>, such that renting is aTryPop and returning is aPush.ConcurrentStack<T> itself uses a lock-free algorithm, that’s a lot more scalable than a lock. There is an interesting downside toConcurrentStack<T>, which is that the algorithm it employs necessarily involves an allocation per pushed element, and for reasons related to the“ABA” problem, those allocations can’t be pooled. That means that every time a connection is returned to the pool now, there’s a small allocation. However, for an HTTP request/response, even though we’ve significantly reduced it over the years, there’s still a non-trivial amount of allocation that occurs over the lifetime of the operation, so one more tiny one doesn’t break the bank, and it’s worth it for the reduced synchronization overheads. We’ve experimented with other data structures, such asConcurrentQueue<T> (which is able to avoid allocation perEnqueue at steady state), but they’ve had other downsides. I expect we’ll continue to push on this in the future, but for now, what’s there now is a nice improvement.

Of course once you’ve got the connection, there’s all of the costs associated with actually making the request and handling the response, and those have been whittled away at as well.

Using vectorized helpers.dotnet/runtime#93511 replaces a scalar loop for writing out bytes from an HTTP/1.1 request header, instead usingAscii.FromUtf16, which is vectorized. In fact, that vectorization was further improved bydotnet/runtime#102735, which improved the 256-bit and 512-bit code paths by using better instructions possible due to not having to care about edge cases already weeded out.
Avoiding extra async state machines.dotnet/runtime#100205 avoids an extra layer of async state machine that was incurred by most of the HTTP/1.1 response streams; a method wasasync only to accommodate some rare logging being enabled, so theasync wrapper is now only employed when that logging is enabled.
More caching of very common data.dotnet/runtime#100177 removes some more allocation and reduces some overheads by computing and caching some bytes that need to be written out on every request.
Special-casing main use cases.dotnet/runtime#102859 anddotnet/runtime#103008 madeJsonContent andStringContent cheaper by special-casing the vastly most common media types used.dotnet/runtime#93759 further improvedJsonContent by reducing the number ofasync frames on the hot path. Anddotnet/runtime#102845 from@pedrobsaila madeTryAddWithoutValidation cheaper for multiple values by special-casing the most common case of the input being anIList<string>, which enables presizing arrays while also avoiding an enumerator allocation.
More ArrayPool.dotnet/runtime#103764 avoided a possibly largechar[] allocation in the parsing ofAlt-Svc headers by usingArrayPool rather than direct allocation.

While this simple benchmark doesn’t touch on all of these changes, it does highlight that the end-to-end performance of HTTP requests gets cheaper:

using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Net.Sockets;using System.Net;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly Socket s_listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);    private static readonly HttpMessageInvoker s_client = new(new SocketsHttpHandler());    private static Uri? s_uri;    [Benchmark]    public async Task HttpGet()    {        var m = new HttpRequestMessage(HttpMethod.Get, s_uri) { Content = new StringContent("Hello, there! How are you today?") };        using (HttpResponseMessage r = await s_client.SendAsync(m, default))        using (Stream s = r.Content.ReadAsStream())            await s.CopyToAsync(Stream.Null);    }    [GlobalSetup]    public void CreateSocketServer()    {        s_listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));        s_listener.Listen(int.MaxValue);        var ep = (IPEndPoint)s_listener.LocalEndPoint!;        s_uri = new Uri($"http://{ep.Address}:{ep.Port}/");        Task.Run(async () =>        {            while (true)            {                Socket s = await s_listener.AcceptAsync();                _ = Task.Run(() =>                {                    using (var ns = new NetworkStream(s, true))                    {                        byte[] buffer = new byte[1024];                        int totalRead = 0;                        while (true)                        {                            int read = ns.Read(buffer, totalRead, buffer.Length - totalRead);                            if (read == 0) return;                            totalRead += read;                            if (buffer.AsSpan(0, totalRead).IndexOf("\r\n\r\n"u8) == -1)                            {                                if (totalRead == buffer.Length) Array.Resize(ref buffer, buffer.Length * 2);                                continue;                            }                            ns.Write("HTTP/1.1 200 OK\r\nDate: Sun, 05 Jul 2020 12:00:00 GMT \r\nServer: Example\r\nContent-Length: 5\r\n\r\nHello"u8);                            totalRead = 0;                        }                    }                });            }        });    }}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
HttpGet	.NET 8.0	92.42 us	1.00	1.98 KB	1.00
HttpGet	.NET 9.0	77.13 us	0.83	1.8 KB	0.91

Related to HTTP, theWebUtility andHttpUtility types both got more efficient this release.dotnet/runtime#103737, in particular, made a variety of changes that have a measurable impact onHtmlEncode andUrlEncode:

HtmlEncode had a scalar loop looking for characters that need to be encoded. That loop can instead be vectorized by usingSearchValues<char>.
UrlEncode also had a simlar scalar loop as part of looking for the first non-safe character.SearchValues<char> can also solve this.
UrlEncode had a complicated scheme where it would UTF8-encode into a newly-allocatedbyte[], percent-encode in place in that (thanks to the ability to reinterpret cast with spans), and then use the resulting chars to create a newstring. Instead,string.Create can be used, with all of the work done in-place in the buffer generated for that operation.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Net;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    [Arguments("""        How much wood could a woodchuck chuck        If a woodchuck could chuck wood?        A woodchuck would chuck as much wood        As much wood as a woodchuck could chuck,        If a woodchuck could chuck wood.        """)]    public string HtmlEncode(string input) => WebUtility.HtmlEncode(input);    [Benchmark]    [Arguments("short_name.txt")]    public string UrlEncode(string input) => WebUtility.UrlEncode(input);}

Method	Runtime	input	Mean	Ratio
HtmlEncode	.NET 8.0	How (…)ood. [181]	102.607 ns	1.00
HtmlEncode	.NET 9.0	How (…)ood. [181]	10.188 ns	0.10

UrlEncode	.NET 8.0	short_name.txt	8.656 ns	1.00
UrlEncode	.NET 9.0	short_name.txt	2.463 ns	0.28

HttpUtility also received some attention.dotnet/runtime#102805 from@TrayanZapryanov updatedUrlEncodeToBytes, using stack space instead of allocation for smaller inputs, and usingSearchValues<byte> to optimize the search for invalid bytes.dotnet/runtime#102753 from@TrayanZapryanov did the same forUrlDecodeToBytes.dotnet/runtime#102909 from@TrayanZapryanov similarly reduced allocation inUrlPathEncode, but via theArrayPool.dotnet/runtime#102917 from@TrayanZapryanov optimizedJavaScriptStringEncode, in particular by usingSearchValues. Anddotnet/runtime#102745 from@TrayanZapryanov optimizedParseQueryString by usingstackalloc instead of array allocation for smaller input lengths and by replacingstring.Substrings with span slicing.

There were also changes elsewhere in the networking stack that contribute to HTTP use cases. Indotnet/runtime#98074, for example,Uri gained newTryEscapeDataString andTryUnescapeDataString methods that store the output characters into a provided destination span rather than allocating new strings on each call. These methods were then used elsewhere in the networking stack, such as inFormUrlEncodedContent, to improve throughput and reduce allocation.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private KeyValuePair<string, string>[] _data =    [        new("key1", "value1"),        new("key2", "value2"),        new("key3", "value3"),        new("key4", "value4")    ];    [Benchmark]    public FormUrlEncodedContent Create() => new FormUrlEncodedContent(_data);}

Method	Runtime	Mean	Ratio	Allocated	Alloc Ratio
Create	.NET 8.0	311.5 ns	1.00	848 B	1.00
Create	.NET 9.0	218.7 ns	0.70	384 B	0.45

Beyond raw HTTP, there were also some new features forWebSocket in .NET 9, namely support for keep-alive pings and timeouts, though not many PRs focused solely on performance (thoughdotnet/runtime#101953 from@PaulusParssinen did utilize some newer APIs inManagedWebSocket in a way that may have removed a bit of fat). There was one notable improvement, however, indotnet/runtime#104865. The web socketsRFC 6455 specification requires that when a data frame’s payload data has an opcode denoting it as text, that text must be checked to be valid UTF8-encoded bytes. The validation for that had been a hand-rolled scalar comparison loop. However, now thatUtf8.IsValid exists (it was introduced in .NET 8), that accelerated method can be used here instead. It can’t be used in all situations, which is probably why it wasn’t immediately employed when the method was added in the first place. Web sockets payloads may be split across data frames, so it’s possible that the frame being validated is actually the continuation of some previously-received data, and it’s possible that this frame is not the end of the payload, either. But, we know those two pieces of information up-front: if it’s a continuation from a previous frame, we would have already noted it as such, and if it’s not complete, its end-of-message bit won’t have been set. Thus, for the common case where the payload is complete, we can use the accelerated helper for UTF8 validation, and only fall back to the slower path for the corner cases. And this matters because even with the networking costs involved, that UTF8 validation shows up.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Net;using System.Net.Sockets;using System.Net.WebSockets;using System.Text;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private WebSocket _client, _server;    private Memory<byte> _buffer = Encoding.UTF8.GetBytes("""        Shall I compare thee to a summer’s day?        Thou art more lovely and more temperate:        Rough winds do shake the darling buds of May,        And summer’s lease hath all too short a date;        Sometime too hot the eye of heaven shines,        And often is his gold complexion dimm'd;        And every fair from fair sometime declines,        By chance or nature’s changing course untrimm'd;        But thy eternal summer shall not fade,        Nor lose possession of that fair thou ow’st;        Nor shall death brag thou wander’st in his shade,        When in eternal lines to time thou grow’st:        So long as men can breathe or eyes can see,        So long lives this, and this gives life to thee.        """);    private Memory<byte> _tmp = new byte[1024];    [GlobalSetup]    public void Setup()    {        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));        listener.Listen();        var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);        client.Connect(listener.LocalEndPoint!);        Socket server = listener.Accept();        _client = WebSocket.CreateFromStream(new NetworkStream(client, ownsSocket: true), new WebSocketCreationOptions { IsServer = false, });        _server = WebSocket.CreateFromStream(new NetworkStream(server, ownsSocket: true), new WebSocketCreationOptions { IsServer = true });    }    [GlobalCleanup]    public void Cleanup()    {        _client.Dispose();        _server.Dispose();    }    [Benchmark]    public async Task SendReceive()    {        await _client.SendAsync(_buffer, WebSocketMessageType.Text, true, default);        while (!(await _server.ReceiveAsync(_tmp, default)).EndOfMessage) ;    }}

Method	Runtime	Mean	Ratio
SendReceive	.NET 8.0	4.093 us	1.00
SendReceive	.NET 9.0	3.438 us	0.84

There are of course a variety of reasons that performance could have improved, e.g. maybeWebSockets is just exercising a code path that benefits from one of the other optimizations already discussed. How do we know it’s connected to the validation? Let’s profile. And since we already have a benchmark written, we can just use it. There’s another very handy nuget package,Microsoft.VisualStudio.DiagnosticsHub.BenchmarkDotNetDiagnosers, which contains additional “diagnosers” for BenchmarkDotNet. Diagnosers are one of the main extensibility points within BenchmarkDotNet, enabling developers to perform additional tracking and analyses over benchmarks. You’ve already seen me use some, including the built-in[MemoryDiagnoser(false)] and[DisassemblyDiagnoser]; there are other built-in ones we haven’t used in this post but that are helpful in various situations, like[ThreadingDiagnoser] and[ExceptionDiagnoser], but diagnosers can come from anywhere, and the aforementioned nuget package provides several more. The purpose of those diagnosers is to collect and export performance traces that Visual Studio’s performance tools can then consume. In my case, I want to collect a CPU trace, so as to understand where CPU consumption is going, so I added a[CPUUsageDiagnoser] attribute to myTests class:

[CPUUsageDiagnoser][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests

and then re-ran (on Windows). That’s it. While the test is running, you’ll see the same output as you’re used to seeing, plus a little more. For example, at the end of the benchmarking, I now also see this:

// * Diagnostic Output - VSDiagnosticsDiagnoser *Collection result moved to 'BenchmarkDotNet_Tests_20240804_081400.diagsession'.Session : {a1671047-d6da-4a56-9c71-eadef6c1dd00}  StoppedExported diagsession file: d:\Benchmarks\BenchmarkDotNet_Tests_20240804_081400.diagsession.

I then simply opened that.diagsession file, just typing its name at the command-line, since that file extension is by default associated with Visual Studio, but you could also File->Open from within Visual Studio itself. That results in a view like the following: Benchmark traces in Visual Studio Notice this single trace covers both the .NET 8 test execution and the .NET 9 test execution, and each is represented by a different entry in the Benchmarks table (but both are on the same execution timeline). I can then double-click one of the tests to narrow the timeline down to just the relevant portion of activity, and then switch over to the CPU Usage tab. When I do, here’s what I see for .NET 8 for the top impacting methods: Top impacting methods in .NET 8 benchmark and for .NET 9:Notice in the first trace thatTryValidateUtf8 is taking up almost 8% of the CPU time, but it doesn’t show up in the second trace at all, instead being replaced byUtf8Utility.GetPointerToFirstInvalidByte, which is the implementation ofUtf8.IsValid and which is only half a percent. That ~8% correlates with the ~10% reduction we saw in benchmark execution time. Neat.

JSON

System.Text.Json hit the scene in .NET Core 3.0, and every release since it’s gotten more capable and more efficient. .NET 9 is no exception. In addition to new features like support for exporting JSON schema, deep semantic equality comparison ofJsonElements, the ability to respect nullable reference type annotations, support for ordering JsonObject properties, new contract metadata APIs, and more, performance has also been a significant focus.

One improvement comes from the integration ofJsonSerializer withSystem.IO.Pipelines. Much of the .NET stack moves bytes around viaStream, however ASP.NET internally is implemented withSystem.IO.Pipelines. There are built-in bidirectional adapters between streams and pipes, but in some cases those adapters add some overhead. As JSON is so critical to modern services, it’s important thatJsonSerializer be able to work equally well with both streams and pipes. As such,dotnet/runtime#101461 adds newJsonSerializer.SerializeAsync overloads that targetPipeWriter in addition to the existing overloads that targetStream. That way, whether you have aStream or aPipeWriter,JsonSerializer will natively work with either without requiring any indirection to adapt between them. Just use whichever you already have.

JsonSerializer‘s handling of enums was also improved bydotnet/runtime#105032. In addition to adding support for the new[JsonEnumMemberName] attribute, it also moved to an allocation-free parsing solution for enums, utilizing theGetAlternateLookup support added toDictionary<TKey, TValue> andConcurrentDictionary<TKey, TValue> to enable a cache of enum information queryable via aReadOnlySpan<char>.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.Json;using System.Reflection;using System.Text.Json.Serialization;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private static readonly JsonSerializerOptions s_options = new()    {        Converters = { new JsonStringEnumConverter() },        DictionaryKeyPolicy = JsonNamingPolicy.KebabCaseLower,    };    [Params(BindingFlags.Default, BindingFlags.NonPublic | BindingFlags.Instance)]    public BindingFlags _value;    private byte[] _jsonValue;    private Utf8JsonWriter _writer = new(Stream.Null);    [GlobalSetup]    public void Setup() => _jsonValue = JsonSerializer.SerializeToUtf8Bytes(_value, s_options);    [Benchmark]    public void Serialize()    {        _writer.Reset();        JsonSerializer.Serialize(_writer, _value, s_options);    }    [Benchmark]    public BindingFlags Deserialize() =>        JsonSerializer.Deserialize<BindingFlags>(_jsonValue, s_options);}

Method	Runtime	_value	Mean	Ratio	Allocated	Alloc Ratio
Serialize	.NET 8.0	Default	38.67 ns	1.00	24 B	1.00
Serialize	.NET 9.0	Default	27.23 ns	0.70	–	0.00

Deserialize	.NET 8.0	Default	73.86 ns	1.00	–	NA
Deserialize	.NET 9.0	Default	70.48 ns	0.95	–	NA

Serialize	.NET 8.0	Instance, NonPublic	37.60 ns	1.00	24 B	1.00
Serialize	.NET 9.0	Instance, NonPublic	26.82 ns	0.71	–	0.00

Deserialize	.NET 8.0	Instance, NonPublic	97.54 ns	1.00	–	NA
Deserialize	.NET 9.0	Instance, NonPublic	70.72 ns	0.73	–	NA

JsonSerializer relies on lots of other functionality fromSystem.Text.Json, which has also improved. Here’s a sampling:

Direct use of UTF8.JsonProperty.WriteTo would always usewriter.WritePropertyName(Name) to output the property name. However, thatName property might end up allocating a newstring if theJsonProperty wasn’t already caching one.dotnet/runtime#90074 from@karakasa tweaked the implementation to write out thestring if it already had one, or else to directly write out a name based on the UTF8 bytes it would have used to create thatstring.
Avoiding unnecessary intermediate state.dotnet/runtime#97687 from@habbes is one of those lovely PRs that’s a pure win. The primary change here is to aBase64EncodeAndWrite method that’s Base64-encoding a sourceReadOnlySpan<byte> to a destinationSpan<byte>. The implementation was eitherstackalloc‘ing a buffer or renting a buffer, then encoding into that temporary, and then copying the data into a buffer that is guaranteed to be large enough. Why wasn’t it just encoding directly into that destination buffer rather than going through a temporary? Unclear. But thanks to this PR, that intermediate overhead was simply removed. Similarly,dotnet/runtime#92284 removed some unnecessary intermediate state fromJsonNode.GetPath.JsonNode.GetPath was doing a lot of allocation, creating aList<string> of all of the path segments which were then combined in reverse order into aStringBuilder. This changes the implementation to extract the path segments in the reverse order in the first place, then building up the resulting path in stack space or an array rented from theArrayPool.
```
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.Json.Nodes;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private JsonNode _json = JsonNode.Parse("""        {            "staff": {                "Elsa": {                    "age": 21,                    "position": "queen"                }            }        }        """)["staff"]["Elsa"]["position"];    [Benchmark]    public string GetPath() => _json.GetPath();}
```
Method Runtime _value Mean Ratio Allocated Alloc Ratio
GetPath .NET 8.0 Default 176.68 ns 1.00 472 B 1.00
GetPath .NET 9.0 Default 27.23 ns 0.30 64 B 0.14
Using existing caches.JsonNode.ToString andJsonNode.ToJsonString were allocating a newPooledByteBufferWriter andUtf8JsonWriter, but the internalUtf8JsonWriterCache type already provides support for using cached versions of these same objects.dotnet/runtime#92358 just updated theseJsonNode methods to utilize the existing cache.
Pre-sizing collections.JsonObject has a constructor that accepts an enumerable of properties to add to the object. For a lot of properties, as it’s adding properties, the backing store may need to keep growing and growing, incurring the overhead of allocation and copies.dotnet/runtime#96486 from@olo-ntaylor tests to see whether a count can be retrieved from the enumerable, and if it can, it uses that count to pre-size the dictionary.

Allow fast paths to be fast.JsonValue has a niche feature that enables it to wrap an arbitrary .NET object. AsJsonValue derives fromJsonNode,JsonNode needs to take that capability into account. The current way it does so makes some common operations much more expensive than they’d need to be.dotnet/runtime#103733 refactors the implementation to optimize for the common cases.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.Json;using System.Text.Json.Nodes;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private JsonNode[] _nodes = [42, "I am a string", false, DateTimeOffset.Now];    [Benchmark]    [Arguments(JsonValueKind.String)]    public int Count(JsonValueKind kind)    {        var count = 0;        foreach (var node in _nodes)        {            if (node.GetValueKind() == kind)            {                count++;            }        }        return count;    }}

Method	Runtime	kind	Mean	Ratio
Count	.NET 8.0	String	729.26 ns	1.00
Count	.NET 9.0	String	12.14 ns	0.02

Deduplicating accesses.JsonValue.CreateFromElement accessesJsonElement.ValueKind to determine how to process the data, e.g.
```
if (element.ValueKind is JsonValueKind.Null) { ... }else if (element.ValueKind is JsonValueKind.Object or JsonValueKind.Array) { ... }else { ... }
```
IfValueKind were a simple field access, that’d be fine. But it’s a bit more complicated than that, involving a largeswitch to determine what kind to return. Rather than possibly reading from it twice,dotnet/runtime#104108 from@andrewjsaid just makes a small tweak to only access the property once. No point in doing that work twice.

Spans over existing data. TheJsonElement.GetRawText method is useful for extracting the original input backing theJsonElement, but the data is stored as UTF8 bytes andGetRawText returns astring, so every call allocates and transcodes to produce the result. Fromdotnet/runtime#104595, the newJsonMarshal.GetRawUtf8Value simply returns a span over the original data, no encoding, no allocation.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.InteropServices;using System.Text.Json;using System.Text.Json.Nodes;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private JsonElement _json = JsonSerializer.Deserialize("""        {            "staff": {                "Elsa": {                    "age": 21,                    "position": "queen"                }            }        }        """);    [Benchmark(Baseline = true)]    public string GetRawText() => _json.GetRawText();    [Benchmark]    public ReadOnlySpan<byte> TryGetRawText() => JsonMarshal.GetRawUtf8Value(_json);}

Method	Mean	Ratio	Allocated	Alloc Ratio
GetRawText	51.627 ns	1.00	192 B	1.00
TryGetRawText	7.998 ns	0.15	–	0.00

Note that the new method is on the newJsonMarshal class because it’s an API with safety concerns (in general, APIs on theUnsafe class or in theSystem.Runtime.InteropServices namespace are considered “unsafe”). The concern here is that theJsonElement might be backed by an array rented from theArrayPool, if theJsonElement came from aJsonDocument. TheReadOnlySpan<byte> you get back is simply pointing into that array. If after getting the span, theJsonDocument is disposed, it’ll return that array back to the pool, and now the span is referencing an array that someone else might rent. If they do and write into that array, the span will now contain whatever was written there, effectively yielding corrupted data. Try this:

// dotnet run -c Release -f net9.0using System.Runtime.InteropServices;using System.Text.Json;using System.Text;ReadOnlySpan<byte> elsaUtf8;using (JsonDocument elsaJson = JsonDocument.Parse("""        {          "staff": {            "Elsa": {              "age": 21,              "position": "queen"            }          }        }        """)){    elsaUtf8 = JsonMarshal.GetRawUtf8Value(elsaJson.RootElement);}using (JsonDocument annaJson = JsonDocument.Parse("""        {          "staff": {            "Anna": {              "age": 18,              "position": "princess"            }          }        }        """)){    Console.WriteLine(Encoding.UTF8.GetString(elsaUtf8)); // uh oh!}

When I run that, it prints out the information about “Anna,” even though I retrieved the raw text from the “Elsa”JsonElement. Oops! As with anything in C# or .NET that’s “unsafe,” you just need to make sure you hold it correctly.

One last improvement I want to call out. The feature itself is not actually about performance, but the workarounds I’ve seen folks employ for the lack of this capability do have a significant performance impact, and so having the feature built-in will be a net performance win.dotnet/runtime#104328 adds support to bothUtf8JsonReader andJsonSerializer for parsing out multiple top-level JSON objects from an input. Previously if any data was found after a JSON object in the input, that would be considered erroneous and fail to parse, and that means that if a particular data source served up multiple JSON objects one after the other, the data would need to be pre-parsed in order to feed only the relevant portions toSystem.Text.Json. This is particularly relevant with services that stream data, as some of them use such a format.

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Text.Json;using System.Text.Json.Nodes;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private MemoryStream _source = new MemoryStream("""        {          "name": "Alice",          "age": 30,          "city": "New York"        }        {          "name": "Bob",          "age": 25,          "city": "Los Angeles"        }        {          "name": "Charlie",          "age": 35,          "city": "Chicago"        }        """u8.ToArray());    [Benchmark]    [Arguments("Dave")]    public async Task<Person?> FindAsync(string name)    {        _source.Position = 0;        await foreach (var p in JsonSerializer.DeserializeAsyncEnumerable<Person>(_source, topLevelValues: true))        {            if (p?.Name == name)            {                return p;            }        }        return null;    }    public class Person    {        public string Name { get; set; }        public int Age { get; set; }        public string City { get; set; }    }}

Diagnostics

Being able to observe one’s application in production is critical to the operation of modern services.System.Diagnostics.Metrics.Meter is .NET’s recommended type for emitting metrics, and several improvements have gone into making it more efficient in .NET 9.

Counter andUpDownCounter are often used for hot-path tracking of metrics like number of active or queued requests. In production environments, these instruments are frequently bombarded from multiple threads concurrently, which both means they need to be thread-safe but also that they need to be able to scale well. The thread-safety had been achieved by using alock around updates (which were simply reading a value, adding to it, and storing it back), but under heavy load that could result in significant contention on the lock. To address this,dotnet/runtime#91566 changed the implementation in a few ways. First, rather than using alock to protect the state:

lock (this){    _delta += value;}

it used an interlocked operation to perform the addition atomically. Here_delta is adouble, and there’s noInterlocked.Add that works withdouble values, so instead the standard approach of using a loop around anInterlocked.CompareExchange was employed.

double currentValue;do{    currentValue = _delta;}while (Interlocked.CompareExchange(ref _delta, currentValue + value, currentValue) != currentValue);

That helps, but while this does reduce the overhead and improve scalability, it still represents a bottleneck under heavy contention. To address that, the change also split the single_delta into an array of values, one per core, and a thread picks one of them to update, typically the value associated with the core on which it’s currently running. That way, contention is significantly reduced as it’s both distributed across N values instead of 1 value, and because threads prefer the value for the core on which they’re on, and because there’s only ever one thread executing on a specific core at a given moment, chances for conflicts are significantly reduced. There is still some contention, both because a thread isn’t guaranteed to use the associated value (e.g. the thread could migrate between the time it checks what core it’s on and does the access) and because we actually cap the size of the array (so that it doesn’t consume too much memory), but it still makes the system much more scalable.

// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Diagnostics.Metrics;using System.Diagnostics.Tracing;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private MetricsEventListener _listener = new MetricsEventListener();    private Meter _meter = new Meter("Example");    private Counter<int> _counter;    [GlobalSetup]    public void Setup() => _counter = _meter.CreateCounter<int>("counter");    [GlobalCleanup]    public void Cleanup()    {        _listener.Dispose();        _meter.Dispose();    }    [Benchmark]    public void Counter_Parallel()    {        Parallel.For(0, 1_000_000, i =>        {            _counter.Add(1);            _counter.Add(1);        });    }    private sealed class MetricsEventListener : EventListener    {        protected override void OnEventSourceCreated(EventSource eventSource)        {            if (eventSource.Name == "System.Diagnostics.Metrics")            {                EnableEvents(eventSource, EventLevel.LogAlways, EventKeywords.All, new Dictionary<string, string?>() { { "Metrics", "Example\\upDownCounter;Example\\counter" } });            }        }    }}

Method	Runtime	Mean	Ratio
Counter_Parallel	.NET 8.0	137.90 ms	1.00
Counter_Parallel	.NET 9.0	30.65 ms	0.22

There’s another interesting aspect of the improvement worth mentioning, and that’s the padding employed in the array. Going from a singledouble _delta to an array of deltas, you might imagine we’d end up with:

private readonly double[] _deltas;

but if you look at the code, it’s instead:

private readonly PaddedDouble[] _deltas;

wherePaddedDouble is defined as:

[StructLayout(LayoutKind.Explicit, Size = 64)]private struct PaddedDouble{    [FieldOffset(0)]    public double Value;}

This effectively increases the size of each value from 8 bytes to 64 bytes, where only the first 8 bytes of each value is used and the other 56 bytes are padding. That’s odd, right? Normally we’d jump at an opportunity to shrink 64 bytes down to 8 bytes in order to reduce allocation and memory consumption, but here we’re purposefully going in the other direction.

The reason for that is “false sharing.” Consider this benchmark, which I’ve shamelessly borrowed from a conversation Scott Hanselman and I recently had inLet’s Talk Parallel Programming in theDeep .NET series:

// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int[] _values = new int[32];    [Params(1, 31)]    public int Index { get; set; }    [Benchmark]    public void ParallelIncrement()    {        Parallel.Invoke(            () => IncrementLoop(ref _values[0]),            () => IncrementLoop(ref _values[Index]));        static void IncrementLoop(ref int value)        {            for (int i = 0; i < 100_000_000; i++)            {                Interlocked.Increment(ref value);            }        }    }}

When I run that, I get results like this:

Method	Index	Mean
ParallelIncrement	1	1,779.9 ms
ParallelIncrement	31	432.3 ms

In this benchmark, one thread is incrementing_values[0] and the other thread is incrementing either_values[1] or_values[31]. That index is the only difference, yet the one accessing_values[31] is several times faster than the one accessing_values[1]. That’s because there’s contention here even if it’s not obvious in the code. The contention comes from the fact that the hardware works with memory in groups of bytes called a “cache line.” Most hardware has caches lines of 64 bytes. In order to update a particular memory location, the hardware will acquire the whole cache line. If another core wants to update that same cache line, it’ll need to acquire it. That back and forth results in a lot of overhead. It doesn’t matter if one core is touching the first of those 64 bytes and another thread is touching the last, from the hardware’s perspective there’s still sharing happening. “False sharing.” Thus, theCounter fix is using padding around thedouble values to try to space them out more so as to minimize the sharing that limits scalability.

As an aside, there are some additional BenchmarkDotNet diagnosers that can help to highlight the effects of false sharing. ETW on Windows enables collecting various CPU performances counters, such as for branch misses or instructions retired, and BenchmarkDotNet has a[HardwareCounters] diagnoser that’s able to collect this ETW data. One such counter is for cache misses, which often reflect false sharing issues. If you’re on Windows, you can try grabbing the separateBenchmarkDotNet.Diagnostics.Windows nuget package and using it as in this benchmark:

// This benchmark only works on Windows.// Add a <PackageReference Include="BenchmarkDotNet.Diagnostics.Windows" Version="0.14.0" /> to the csproj.// dotnet run -c Release -f net9.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Diagnosers;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HardwareCounters(HardwareCounter.InstructionRetired, HardwareCounter.CacheMisses)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private int[] _values = new int[32];    [Params(1, 31)]    public int Index { get; set; }    [Benchmark]    public void ParallelIncrement()    {        Parallel.Invoke(            () => IncrementLoop(ref _values[0]),            () => IncrementLoop(ref _values[Index]));        static void IncrementLoop(ref int value)        {            for (int i = 0; i < 100_000_000; i++)            {                Interlocked.Increment(ref value);            }        }    }}

Here I’ve asked for both instructions retired, which reflects how much instructions were fully executed (this in and of itself can be a useful metric when analyzing performance, as it’s not as prone to variation as wall-clock measurements), and cache misses, which reflects how many times data wasn’t available in the CPU’s cache.

Method	Index	Mean	InstructionRetired/Op	CacheMisses/Op
ParallelIncrement	1	1,846.2 ms	804,300,000	177,889
ParallelIncrement	31	442.5 ms	824,333,333	52,429

In the two benchmarks, we can see that the number of instructions executed is almost the same between when false sharing occurred (Index == 1) and didn’t (Index == 31), but the number of cache misses is more than three times larger in the false sharing case, and reasonably well correlated with the time increase. When one core performs a write, that invalidates the corresponding cache line in the other core’s cache, such that the other core then needs to reload the cache line, resulting in cache misses. But I digress…

Another nice improvement comes indotnet/runtime#105011 from@stevejgordon, adding a new constructor toMeasurement. Often when creatingMeasurements, you’re also tagging them with additional key/value pairs of information, for which theTagList type exists.TagList implementsIList<KeyValuePair<string, object?>>, andMeasurement has a constructor that takes anIEnumerable<KeyValuePair<string, object?>>, so you can pass aTagList to aMeasurement and it “just works”… albeit slower than it could. If you had code like:

measurements.Add(new Measurement<long>(    snapshotV4.LastAckCount,    new TagList { tcpVersionFourTag, new(NetworkStateKey, "last_ack") }));

that would end up boxing theTagList struct as an enumerable, and then enumerating through it via the interface, which also entails an enumerator allocation. The new constructor this PR adds takes aTagList, avoiding those overheads.TagList is also a large struct, as common usage has it living only on the stack and so as an optimization it stores some of the contained key/value pairs directly in fields on the struct rather than always forcing an array allocation. The net result is much less overhead in constructing these measurements.

TagList itself was also improved bydotnet/runtime#104132, which re-implemented the type for .NET 8+ on top of[InlineArray].TagList is effectively a list of key/value pairs, but in order to avoid always allocating a backing store, it stores some of those pairs inline in itself. This previously was done with dedicated fields for each pair, and then code that directly accessed each field. Now, an[InlineArray] is used, cleaning up the code and enabling access via spans.

using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Diagnostics;using System.Diagnostics.Metrics;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private Counter<long> _counter;    private Meter _meter;    [GlobalSetup]    public void Setup()    {        this._meter = new Meter("TestMeter");        this._counter = this._meter.CreateCounter<long>("counter");    }    [GlobalCleanup]    public void Cleanup() => this._meter.Dispose();    [Benchmark]    public void CounterAdd()    {        this._counter?.Add(100, new TagList        {            { "Name1", "Val1" },            { "Name2", "Val2" },            { "Name3", "Val3" },            { "Name4", "Val4" },            { "Name5", "Val5" },            { "Name6", "Val6" },            { "Name7", "Val7" },        });    }}

Method	Runtime	Mean	Ratio
CounterAdd	.NET 8.0	31.88 ns	1.00
CounterAdd	.NET 9.0	13.93 ns	0.44

Peanut Butter

Throughout this post, I’ve tried to group improvements by topic area in order to create a more fluid and interesting discussion. However, over the course of a year, with as vibrant a community as exists for .NET, and with the breadth of functionality that exists across the platform, there are invariably a large number of one-off PRs that improve this or that by a little. It’s often challenging to imagine any one of these significantly “moving the needle,” but altogether, such changes reduce the “peanut butter” of performance overhead spread thinly across the libraries. In no particular order, here’s a non-comprehensive look at some of these:

StreamWriter.Null.StreamWriter exposes a staticNull field. It stores aStreamWriter instance that’s intended to be a “bit bucket,” a sink you can write to that just ignores all of the data, ala/dev/null on Unix,Stream.Null, and so on. Unfortunately, the way it was implemented had two problems, one of which I’m incredibly surprised took us this long to discover (as it’s been this way for as long as .NET has existed). It was implemented asnew StreamWriter(Stream.Null, ...). All of the state tracking done inStreamWriter is not thread-safe, yet here this instance is exposed from a public static member, which means it should be thread-safe, and if multiple threads hammered thatStreamWriter instance, it could result in really strange exceptions occurring, like arithmetic overflow. Performance-wise, it’s also problematic, because even though actual writes to the underlyingStream are ignored, all of the work actually done byStreamWriter is still done, even though it’s useless.dotnet/runtime#98473 fixes both of those problems by creating an internalNullStreamWriter : StreamWriter type that overrides everything to be nops, and thenNull is initialized to an instance of that.
```
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    [Benchmark]    public void WriteLine() => StreamWriter.Null.WriteLine("Hello, world!");}
```
Method Runtime Mean Ratio
WriteLine .NET 8.0 7.5164 ns 1.00
WriteLine .NET 9.0 0.0283 ns 0.004
NonCryptographicHashAlgorithm.Append{Async}NonCryptographicHashAlgorithm is the base class inSystem.IO.Hashing for types likeXxHash3 andCrc32. One nice feature it provides is the ability to append an entireStream‘s contents to it in a single call, e.g.
```
XxHash3 hash = new();hash.Append(someStream);
```
The implementation ofAppend was fairly straightforward: rent a buffer from theArrayPool and then in a loop repeatedlyStream.Read (orStream.ReadAsync in the case ofAppendAsync) into that buffer andAppend that filled portion of the buffer. This has a couple of performance downsides, however. First, the buffer being rented was 4096 bytes. That’s not tiny, but using a larger buffer can reduce the number of calls to the source stream being appended, which in turn can reduce any I/O performed by thatStream. Second, many streams have optimized implementations for pushing all of their contents to a sink like this:CopyTo.MemoryStream.CopyTo, for example, will just perform a single write of its entire internal buffer to theStream passed to itsCopyTo. But even if aStream doesn’t overrideCopyTo, the baseCopyTo implementation already provides such a copying loop, and it does so by default using a much larger rented buffer. As such,dotnet/runtime#103669 changes the implementation ofAppend to allocate a small temporaryStream object that wraps thisNonCryptographicHashAlgorithm instance, and any calls toWrite are just translated to be calls toAppend. This is a neat example where sometimes we actually choose to pay for a small, short-lived allocation in exchange for significant throughput benefits.
```
// dotnet run -c Release -f net8.0 --filter "*"using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Configs;using BenchmarkDotNet.Environments;using BenchmarkDotNet.Jobs;using BenchmarkDotNet.Running;using System.IO.Hashing;var config = DefaultConfig.Instance    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet("System.IO.Hashing", "8.0.0").AsBaseline())    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet("System.IO.Hashing", "9.0.0-rc.1.24431.7"));BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]public class Tests{    private Stream _stream;    private byte[] _bytes;    [GlobalSetup]    public void Setup()    {        _bytes = new byte[1024 * 1024];        new Random(42).NextBytes(_bytes);        string path = Path.GetRandomFileName();        File.WriteAllBytes(path, _bytes);        _stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 0, FileOptions.DeleteOnClose);    }    [GlobalCleanup]    public void Cleanup() => _stream.Dispose();    [Benchmark]    public ulong Hash()    {        _stream.Position = 0;        var hash = new XxHash3();        hash.Append(_stream);        return hash.GetCurrentHashAsUInt64();    }}
```
Method Runtime Mean
Hash .NET 8.0 91.60 us
Hash .NET 9.0 61.26 us
Unnecessary virtual.virtual methods have overhead. First, they’re more expensive to invoke than non-virtual methods because it requires several indirections to find the actual target method to invoke (the actual target may differ based on the concrete type being used). And second, without a technology like dynamic PGO,virtual methods won’t be inlined, because the compiler can’t statically see which target should be inlined (and even if dynamic PGO makes such inlining possible for the most common type, there’s still a check required to ensure it’s ok to follow that path). As such, if things don’tneed to bevirtual, it’s better performance-wise for them to not be. And if such things areinternal, unless they’re actively being overridden by something, there’s no reason to keep themvirtual.dotnet/runtime#104453 from@xtqqczze,dotnet/runtime#104456 from@xtqqczze, anddotnet/runtime#104483 from@xtqqczze all address exactly such cases, removingvirtual from a smattering ofinternal members that weren’t being overridden. It might only save a few instructions here and there, but there’s effectively no downside to such a change (other than some minimal code churn), a pure win.
ReadOnlySpan vs Span. We as developers like to protect ourselves from ourselves, for example making fieldsreadonly to avoid accidentally changing them. Such changes can also have performance benefits, for example the JIT can better optimize static fields that arereadonly than those that aren’t. The same set of principles applies toSpan<T> andReadOnlySpan<T>. If a method doesn’t need to mutate the contents of a collection being passed in, it’s less accident prone to use aReadOnlySpan<T> rather than aSpan<T>. It also signals to the caller that they don’t need to be concerned about the data changing out from under them. Interestingly, here, too, there’s both a correctness and a performance benefit to usingReadOnlySpan<T> instead ofSpan<T>. The implementations of these two types is almost word-for-word identical, the critical difference being whether the indexer returns aref T or aref readonly T. There is one additional line inSpan<T>, however, that doesn’t exist inReadOnlySpan<T>.Span<T>‘s constructor has this one extra check:
```
if (!typeof(T).IsValueType && array.GetType() != typeof(T[]))    ThrowHelper.ThrowArrayTypeMismatchException();
```
This check exists because of array covariance. Let’s say you have this:
```
Base[] array = new Derived[3];class Base { }class Derived : Base { }
```
That code compiles and runs successfully, because .NET supports array covariance, meaning an array of a derived type can be used as an array as the base type. But there’s an important catch here. Let’s augment the example slightly:
```
Base[] array = new Derived[3];array[0] = new AlsoDerived(); // uh oh!class Base { }class Derived : Base { }class AlsoDerived : Base { }
```
This will compile successfully, but at run-time it’ll fail with anArrayTypeMismatchException. That’s because it’s trying to store anAlsoDerived instance into aDerived[], and there’s no relationship between the two types that should permit that. The check required to enforce that comes at a cost, every single time you try to write into an array (except in cases where the compiler can prove it’s safe and elide the costs). WhenSpan<T> was introduced, the decision was made to hoist that check up to the span’s constructor; that way, once you get a valid span, no such checking needs to be performed on every write, only once on construction. That’s what that additional line of code is doing, checking to ensure that the specifiedT is the same as the provided array’s element type. That means code like this will also throw anArrayTypeMismatchException:
```
Span<Object> span = new string[2]; // uh oh
```
But that also means if you useSpan<T> in situations where you could have usedReadOnlySpan<T>, there’s a good chance you’re unnecessarily incurring that check, which means you’re both possibly going to hit unexpected exceptions depending on what arrays are passed in, and you’re incurring a bit of peanut butter cost.dotnet/runtime#104864 replaced a bunch ofSpan<T>s withReadOnlySpan<T>s to reduce the chances we’d incur such overhead, while also just improving the maintainability of the code.
readonly and const. In the same vein, changing fields that could beconst to be so, changing non-readonly fields that could bereadonly to be so, and removing unnecessary property setters is all goodness for maintainability while also having the chance of improving performance. Making fieldsconst avoids unnecessary memory accesses while also allowing the JIT to better employ constant propagation. And making static fieldsreadonly enables the JIT to treat them as if they wereconst in tier 1 compilation.dotnet/runtime#100728 updates hundreds of occurrences.
MemoryCache.dotnet/runtime#103992 from@ADNewsom09 addresses an inefficiency inMicrosoft.Extensions.Caching.Memory. If multiple concurrent operations end up triggering the cache’s compaction operation, many of the involved threads can all end up duplicating each other’s work. The fix is to simply have only one of the threads do the compaction operation.
BinaryReader.dotnet/runtime#80331 from@teo-tsirpanis madeBinaryReader allocations relevant only to reading text be lazily allocated only when such reading occurs. If the reader is never used for reading text, the application won’t need to pay for the allocation.
ArrayBufferWriter.dotnet/runtime#88009 from@AlexRadch adds a newResetWrittenCount method toArrayBufferWriter.ArrayBufferWriter.Clear already exists, but in addition to setting the written count to 0, it also clears the underlying buffer. In many situations, that clearing is unnecessary overhead, soResetWrittenCount allows it to be avoided. (There was an interesting debate about whether such a new method is even necessary, and whetherClear could just be changed to remove the zeroing. But concerns about stale data finding their way into consuming code as corrupted data led to the new method being added instead.)
Span-based File methods. The staticFile class provides simple helpers for interacting with files, e.g.File.WriteAllText. Historically, these methods have worked with strings and arrays. That means, though, that if someone instead has a span, they either can’t use these simple helpers or they need to pay to create a string or an array from the span.dotnet/runtime#103308 adds new span-based overloads so that developers don’t need to choose here between simplicity and performance.
string concat vs Append. string concatenation inside of a loop is well-known no-no, as in the extreme it can lead to significantO(N^2) costs. Such a string concatenation was occurring, however, inside ofMailAddressCollection, where an encoded version of every address in the collection was being appended onto a string using string concatenation.dotnet/runtime#95760 from@YohDeadfall changed that to use a builder instead.
Closures. The config source generator was introduced in .NET 8 to significantly improve the performance of configuration binding, while also making it friendlier to Native AOT. It achieved both. However, it can be improved further. There’s an unanticipated extra allocation that occurs on success paths that’s only relevant to failure paths, because of how the code is being generated. For a call site like this:
```
public static void M(IConfiguration configuration, C1 c) => configuration.Bind(c);
```
the source generator would emit a method like this:
```
public static void BindCore(IConfiguration configuration, ref C1 obj, BinderOptions? binderOptions){    ValidateConfigurationKeys(typeof(C1), s_configKeys_C1, configuration, binderOptions);    if (configuration["Value"] is string value15)        obj.Value = ParseInt(value15, () => configuration.GetSection("Value").Path);}
```
That lambda being passed to theParseInt helper is accessingconfiguration, which is defined outside of the lambda as a parameter. To get that data into the lambda, the compiler allocates a “display class” to store the information, with the body of the lambda translated into a method on that display class. That display class gets allocated at the beginning of the scope that contains the data, which in this case means it’s allocated at the beginning of theBindCore method. That means it’s allocated regardless of whether theif block is true, and even ifParseInt is called, the delegate passed to it is only ever invoked when there’s a failure.dotnet/runtime#100257 from@pedrobsaila reworks the source generator code so that this allocation isn’t incurred.
Stream.Read/Write Span Overrides.Streams that don’t override the span-basedRead/Write methods end up utilizing the base implementations, which allocate. There are a ton ofStream implementations in dotnet/runtime, and we’ve overridden such methods almost everywhere, but now and again we find one that slipped through.dotnet/runtime#86674 from@hrrrrustic fixed one such case on theStreamOnSqlBytes type.
Globalization Arrays. EveryNumberFormatInfo object defaults itsNumberGroupSizes,CurrentGroupSizes, andPercentGroupSizes to each be new instances ofnew int[] { 3 } (even if subsequent initialization overwrites them). And yet these arrays are never handed out to consumers: the properties that expose them make defensive copies. Which means all of these can just refer to the same shared singleton array. The same is true forNativeDigits, which is initialized first to a new array of the numbers0 through9.dotnet/runtime#93117 addresses all of these by creating and using such singletons.
ColorTranslator.ToWin32. There’s a.NET design guideline that says properties should be like smarter fields. The resulting expectation is that they should be cheap, like just accessing a field or doing a very simple calculation over a field. Unfortunately, we don’t always follow our own guidance, and there exist some properties that really, really look like they should be trivial but are actually sometimes not.System.Drawing.Color is a good example. A very reasonable mental model forColor (which according to the docs “Represents an ARGB (alpha, red, green, blue) color”) is that it’d just be four byte values, one for each channel, either in their own fields or packed together into anint. Unfortunately, it’s not quite as simple as that.Colorcan be that, such as if it’s constructed usingColor.FromArgb(uint), but it can also be used to represent a “known colors,” as is evident by theSystemColors type having a bunch of static properties (e.g.SystemColors.Control) that return a color for the underlying OS. And even there, you might think “oh, ok, well those properties must be what Stephen is referring to, they must call out to the OS to get the color, they probably do so and then useFromArgb.” And again, that’s a very intuitive mental model, and again it’s not what actually happens. Those properties actually are cheap; all they do is construct aColor with an enum value corresponding to the system color. Then where is the actual OS color value retrieved, you ask? As part of theR,G,B, andA properties onColor!. That means if you access each of these properties, asColorTranslator was doing in a variety of its methods, you’re making three or four times as many P/Invokes as you’d otherwise need to.dotnet/runtime#106042 fixes this forColorTranslator, but it serves as a good reminder why such guidelines exist. (This benchmark is Windows-specific asSystemColor doesn’t currently rely on OS information for Linux or macOS.)
```
// Windows-specific (it works on Linux and macOS, but doesn't demonstrate the same thing.)// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Drawing;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[MemoryDiagnoser(false)][HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]public class Tests{    private Color _color = SystemColors.Control;    [Benchmark]    public int ColorToWin32() => ColorTranslator.ToWin32(_color);}
```
Method Runtime Mean Ratio
ColorToWin32 .NET 8.0 11.263 ns 1.00
ColorToWin32 .NET 9.0 4.711 ns 0.42

What’s Next?

Maybe one more poem? An acrostic this time:

Driving innovation with unmatched speed, Opening doors to what developers need.Turbocharged perf, breaking the mold,New benchmarks surpassed, metrics so bold.Empowering coders, dreams take flight,Transforming visions with .NET might.Navigating challenges with precision and flair,Inspiring creativity, improvements everywhere.Nurturing growth, pushing limits high,Elevating success, reaching for the sky.

Several hundred pages later and still not a poet. Oh well.

I’m asked from time to time why I invest in writing these “Performance Improvements in .NET” posts. There’s no one answer. In no particular order:

Personal learning. I pay close attention throughout the year to all of the various performance improvements happening in the release, sometimes from a distance, sometimes as the one making the changes. Writing this post serves as a forcing-function for me to revisit them all and really internalize the changes that were made and their relevance to the broader picture. It’s a learning opportunity for me.
Testing. As one of the developers on the team recently said to me, “I like this time of the year when you give our optimizations a stress-test and uncover inefficiencies.” Every year when I’m going through the improvements, just the act of re-validating the improvements often highlights regressions, cases that were missed, or further opportunities that can be addressed in the future. Again, it’s a forcing function to do more testing and with a fresh set of eyes.
Thanks. Many of the performance improvements in each release aren’t from the folks working on the .NET team or even at Microsoft. They’re from passionate and talented individuals throughout the global .NET ecosystem, and I like to highlight their contributions. That’s why throughout the post you see me calling out when PRs are from folks not employed by Microsoft as full-time employees. In this post, that accounts for ~20% of all the cited PRs. Amazing. Heartfelt thanks to everyone who’s worked to make .NET better for everyone.
Excitement. Developers often have conflicting opinions about the speed at which .NET is advancing, some really appreciating the frequent introduction of new features, others concerned that they can’t keep up with all of the newness. But the one thing everyone seems to agree on is the love of “free perf,” and that’s a lot of what these posts talk about. .NET gets faster and faster every release, and it’s exciting to see a tour through the highlights collected all in one place.
Education. There are multiple forms of performance improvements covered throughout the post. Some of the improvements you get completely for free just by upgrading the runtime; the implementations in the runtime are better, and so when you run on them, your code just gets better, too. Some of the improvements you get completely for free by upgrading the runtimeand recompiling; the C# compiler itself generates better code, often taking advantage of newer surface area exposed in the runtime. And other improvements are new features that, in addition to the runtime and compiler utilizing, you can utilize directly and make your code even faster. Educating about those capabilities and why and where you’d want to utilize them is important to me. But beyond the new features, the techniques employed in making all of the rest of the optimizations throughout the runtime are often more broadly applicable. By learning how these optimizations are applied in the runtime, you can extrapolate and apply similar techniques to your own code, making it that much faster.

If you’ve read this far, I hope you indeed have learned something and are excited about the .NET 9 release. As is likely obvious from my enthusiastic ramblings and awkward poetry, I’m incredibly excited about .NET, everything that’s been achieved in .NET 9, and the future of the platform. If you’re already using .NET 8, upgrading to .NET 9 should be a breeze (the.NET 9 Release Candidate is available for download), and I’d love it if you’d do so and share with us any successes you achieve or issues you face along the way. We’d love to learn from you. And if you have ideas about how to further improve the performance of .NET for .NET 10, please join us indotnet/runtime.

Happy coding!