September 13th, 2023

Performance Improvements in .NET 8

Stephen Toub - MSFT
Partner Software Engineer

I look forward to summer every year. Sun, beach, warm nights, and putting the finishing touches on the next version of .NET. It’s also the time I get to continue a tradition I started for myself back in 2017 of writing about the performance improvements that have gone into the latest .NET incarnation. A year ago that was Performance Improvements in .NET 7, which followed similar posts for .NET 6, .NET 5, .NET Core 3.0, .NET Core 2.1, and .NET Core 2.0.

Since .NET 7 was released a year ago, you’ve likely been inundated with news about AI and ChatGPT. You may have even invested effort in using AI and/or incorporated it into your applications. And I got to wondering: could AI write this blog post for me? So I issued the following request to Bing Chat:

“You are an expert .NET developer with a penchant for pedagogy and a love of details. Write a treatise covering the most important performance improvements in .NET 8. The article should educate developers not just about what improvements were made, but also about why and how they were made so that developers can learn the relevant techniques and apply them in their own codebases. The article should also excite developers about the upcoming release, highlighting the very real benefits to upgrading their services, applications, and libraries to the new version. And the article should give credit to the myriad of developers that contributed significantly to the release, calling out the hundreds upon hundreds of pull requests that were merged to make .NET 8 the fastest .NET to date. The article should follow the style of the post about .NET 7 performance at https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/. Based on the extent and depth of everything to cover, it is estimated that the article should be around 200 pages in length, including all code samples and exposition.”

And the response?

“I’m sorry, but I can’t write a 200-page treatise for you. That’s beyond the scope of my capabilities and my chat settings. I can only generate short and concise content, such as summaries, snippets, or paragraphs. If you want to learn more about the performance improvements in .NET 8, you can check out some of the web search results I found for you. You can also read the official release notes and the GitHub issue tracker for more details. I hope you understand and appreciate my limitations. Thank you for your interest in .NET 8. 😊”

Fair enough (especially since, after writing this post and measuring the token count, it’s north of 300,000). And I’m happy with that outcome, because it means I still get to have the fun of writing this myself.

Throughout the past year, as I was reviewing PRs in various .NET repos, I maintained a list of all the PRs that I might want to cover in this post, which is focused on the core runtime and libraries (Performance Improvements in ASP.NET Core 8 provides an in-depth focus on ASP.NET). And as I sat down to write this, I found myself staring at a daunting list of 1289 links. This post can’t cover all of them, but it does take a tour through more than 500 PRs, all of which have gone into making .NET 8 an irresistible release, one I hope you’ll all upgrade to as soon as humanly possible.

.NET 7 was super fast. .NET 8 is faster.

Benchmarking Setup

Throughout this post, I include microbenchmarks to highlight various aspects of the improvements being discussed. Most of those benchmarks are implemented using BenchmarkDotNet v0.13.8, and, unless otherwise noted, there is a simple setup for each of these benchmarks.

To follow along, first make sure you have .NET 7 and .NET 8 installed. For this post, I’ve used the .NET 8 Release Candidate (8.0.0-rc.1.23419.4).

With those prerequisites taken care of, create a new C# project in a new benchmarks directory:

dotnet new console -o benchmarks
cd benchmarks

That directory will contain two files: benchmarks.csproj (the project file with information about how the application should be built) and Program.cs (the code for the application). Replace the entire contents of benchmarks.csproj with this:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net8.0;net7.0</TargetFrameworks>
    <LangVersion>Preview</LangVersion>
    <ImplicitUsings>enable</ImplicitUsings>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    <ServerGarbageCollection>true</ServerGarbageCollection>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.13.8" />
  </ItemGroup>

</Project>

The preceding project file tells the build system we want:

  • to build a runnable application (as opposed to a library),
  • to be able to run on both .NET 8 and .NET 7 (so that BenchmarkDotNet can run multiple processes, one with .NET 7 and one with .NET 8, in order to be able to compare the results),
  • to be able to use all of the latest features from the C# language even though C# 12 hasn’t officially shipped yet,
  • to automatically import common namespaces,
  • to be able to use the unsafe keyword in the code,
  • and to configure the garbage collector (GC) into its “server” configuration, which impacts the tradeoffs it makes between memory consumption and throughput (this isn’t strictly necessary, I’m just in the habit of using it, and it’s the default for ASP.NET apps).

The <PackageReference/> at the end pulls in BenchmarkDotNet from NuGet so that we’re able to use the library in Program.cs. (A handful of benchmarks require additional packages be added; I’ve noted those where applicable.)

For each benchmark, I’ve then included the full Program.cs source; just copy and paste that code into Program.cs, replacing its entire contents. In each test, you’ll notice several attributes may be applied to the Tests class. The [MemoryDiagnoser] attribute indicates I want it to track managed allocation, the [DisassemblyDiagnoser] attribute indicates I want it to report on the actual assembly code generated for the test (and by default one level deep of functions invoked by the test), and the [HideColumns] attribute simply suppresses some columns of data that BenchmarkDotNet might otherwise emit by default but that are unnecessary for our purposes here.

Running the benchmarks is then straightforward. Each shown test also includes a comment at the beginning for the dotnet command to run the benchmark. Typically, it’s something like this:

dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

The preceding dotnet run command:

  • builds the benchmarks in a Release build. This is important for performance testing, as most optimizations are disabled in Debug builds, in both the C# compiler and the JIT compiler.
  • targets .NET 7 for the host project. In general with BenchmarkDotNet, you want to target the lowest-common denominator of all runtimes you’ll be executing against, so as to ensure that all of the APIs being used are available everywhere they’re needed.
  • runs all of the benchmarks in the whole program. The --filter argument can be refined to scope down to just a subset of benchmarks desired, but "*" says “run ’em all.”
  • runs the tests on both .NET 7 and .NET 8.

Throughout the post, I’ve shown many benchmarks and the results I received from running them. All of the code works well on all supported operating systems and architectures. Unless otherwise stated, the results shown for benchmarks are from running them on Linux (Ubuntu 22.04) on an x64 processor (the one bulk exception to this is when I’ve used [DisassemblyDiagnoser] to show assembly code, in which case I’ve run them on Windows 11 due to a sporadic issue on Unix with [DisassemblyDiagnoser] on .NET 7 not always producing the requested assembly). My standard caveat: these are microbenchmarks, often measuring operations that take very short periods of time, but where improvements to those times add up to be impactful when executed over and over and over. Different hardware, different operating systems, what else is running on your machine, your current mood, and what you ate for breakfast can all affect the numbers involved. In short, don’t expect the numbers you see to match exactly the numbers I report here, though I have chosen examples where the magnitude of differences cited is expected to be fully repeatable.

With all that out of the way, let’s dive in…

JIT

Code generation permeates every single line of code we write, and it’s critical to the end-to-end performance of applications that the compiler doing that code generation achieves high code quality. In .NET, that’s the job of the Just-In-Time (JIT) compiler, which is used both “just in time” as an application executes as well as in Ahead-Of-Time (AOT) scenarios as the workhorse to perform the codegen at build-time. Every release of .NET has seen significant improvements in the JIT, and .NET 8 is no exception. In fact, I dare say the improvements in .NET 8 in the JIT are an incredible leap beyond what was achieved in the past, in large part due to dynamic PGO…

Tiering and Dynamic PGO

To understand dynamic PGO, we first need to understand “tiering.” For many years, a .NET method was only ever compiled once: on first invocation of the method, the JIT would kick in to generate code for that method, and then that invocation and every subsequent one would use that generated code. It was a simple time, but also one fraught with conflict… in particular, a conflict between how much the JIT should invest in code quality for the method and how much benefit would be gained from that enhanced code quality. Optimization is one of the most expensive things a compiler does; a compiler can spend an untold amount of time searching for additional ways to shave off an instruction here or improve the instruction sequence there. But none of us has an infinite amount of time to wait for the compiler to finish, especially in a “just in time” scenario where the compilation is happening as the application is running. As such, in a world where a method is compiled once for that process, the JIT has to either pessimize code quality or pessimize how long it takes to run, which means a tradeoff between steady-state throughput and startup time.

As it turns out, however, the vast majority of methods invoked in an application are only ever invoked once or a small number of times. Spending a lot of time optimizing such methods would actually be a deoptimization, as likely it would take much more time to optimize them than those optimizations would gain. So, .NET Core 3.0 introduced a new feature of the JIT known as “tiered compilation.” With tiering, a method could end up being compiled multiple times. On first invocation, the method would be compiled in “tier 0,” in which the JIT prioritizes speed of compilation over code quality; in fact, the mode the JIT uses is often referred to as “min opts,” or minimal optimization, because it does as little optimization as it can muster (it still maintains a few optimizations, primarily the ones that result in less code to be compiled such that the JIT actually runs faster). In addition to minimizing optimizations, however, it also employs call counting “stubs”; when you invoke the method, the call goes through a little piece of code (the stub) that counts how many times the method was invoked, and once that count crosses a predetermined threshold (e.g. 30 calls), the method gets queued for re-compilation, this time at “tier 1,” in which the JIT throws every optimization it’s capable of at the method. Only a small subset of methods make it to tier 1, and those that do are the ones worthy of additional investment in code quality. Interestingly, there are things the JIT can learn about the method from tier 0 that can lead to even better tier 1 code quality than if the method had been compiled to tier 1 directly. For example, the JIT knows that a method “tiering up” from tier 0 to tier 1 has already been executed, and if it’s already been executed, then any static readonly fields it accesses are now already initialized, which means the JIT can look at the values of those fields and base the tier 1 code gen on what’s actually in the field (e.g. if it’s a static readonly bool, the JIT can now treat the value of that field as if it were const bool). If the method were instead compiled directly to tier 1, the JIT might not be able to make the same optimizations. Thus, with tiering, we can “have our cake and eat it, too.” We get both good startup and good throughput. Mostly…

One wrinkle to this scheme, however, is the presence of longer-running methods. Methods might be important because they’re invoked many times, but they might also be important because they’re invoked only a few times but end up running forever, in particular due to looping. As such, tiering was disabled by default for methods containing backward branches, such that those methods would go straight to tier 1. To address that, .NET 7 introduced On-Stack Replacement (OSR). With OSR, the code generated for loops also included a counting mechanism, and after a loop iterated to a certain threshold, the JIT would compile a new optimized version of the method and jump from the minimally-optimized code to continue execution in the optimized variant. Pretty slick, and with that, in .NET 7 tiering was also enabled for methods with loops.
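To make the shape of such a method concrete, here is a minimal sketch of my own (SumTo is a hypothetical name chosen for illustration, not anything from the runtime). The method is invoked exactly once, so call counting alone would never promote it, yet nearly all of the program's time is spent inside its loop. Assuming .NET 7 or later with default settings, this is the kind of method that would start out in tier 0 code and rely on OSR to reach optimized code partway through the loop.

```csharp
using System;

// Invoked a single time: a call-counting stub would never see enough
// invocations to queue this method for tier 1 on its own.
long total = SumTo(100_000_000);
Console.WriteLine(total);

// The backward branch of this loop is what makes the method a candidate
// for On-Stack Replacement while it is still executing.
static long SumTo(int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
    {
        sum += i;
    }
    return sum;
}
```

The result is the same regardless of which tier ends up executing the loop; only the time taken to get there should differ.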

But why is OSR important? If there are only a few such long-running methods, what’s the big deal if they just go straight to tier 1? Surely startup isn’t significantly negatively impacted? First, it can be: if you’re trying to trim milliseconds off startup time, every method counts. But second, as noted before, there are throughput benefits to going through tier 0, in that there are things the JIT can learn about a method from tier 0 which can then improve its tier 1 compilation. And the list of things the JIT can learn gets a whole lot bigger with dynamic PGO.

Profile-Guided Optimization (PGO) has been around for decades, for many languages and environments, including in the .NET world. The typical flow is you build your application with some additional instrumentation, you then run your application on key scenarios, you gather up the results of that instrumentation, and then you rebuild your application, feeding that instrumentation data into the optimizer, allowing it to use the knowledge about how the code executed to impact how it’s optimized. This approach is often referred to as “static PGO.” “Dynamic PGO” is similar, except there’s no effort required around how the application is built, scenarios it’s run on, or any of that. With tiering, the JIT is already generating a tier 0 version of the code and then a tier 1 version of the code… why not sprinkle some instrumentation into the tier 0 code as well? Then the JIT can use the results of that instrumentation to better optimize tier 1. It’s the same basic “build, run and collect, re-build” flow as with static PGO, but now on a per-method basis, entirely within the execution of the application, and handled automatically for you by the JIT, with zero additional dev effort required and zero additional investment needed in build automation or infrastructure.

Dynamic PGO first previewed in .NET 6, off by default. It was improved in .NET 7, but remained off by default. Now, in .NET 8, I’m thrilled to say it’s not only been significantly improved, it’s now on by default. This one-character PR to enable it might be the most valuable PR in all of .NET 8: dotnet/runtime#86225.

There have been a multitude of PRs to make all of this work better in .NET 8, both on tiering in general and then on dynamic PGO in particular. One of the more interesting changes is dotnet/runtime#70941, which added more tiers, though we still refer to the unoptimized as “tier 0” and the optimized as “tier 1.” This was done primarily for two reasons. First, instrumentation isn’t free; if the goal of tier 0 is to make compilation as cheap as possible, then we want to avoid adding yet more code to be compiled. So, the PR adds a new tier to address that. Most code first gets compiled to an unoptimized and uninstrumented tier (though methods with loops currently skip this tier). Then after a certain number of invocations, it gets recompiled unoptimized but instrumented. And then after a certain number of invocations, it gets compiled as optimized using the resulting instrumentation data. Second, crossgen/ReadyToRun (R2R) images were previously unable to participate in dynamic PGO. This was a big problem for taking full advantage of all that dynamic PGO offers, in particular because there’s a significant amount of code that every .NET application uses that’s already R2R’d: the core libraries. ReadyToRun is an AOT technology that enables most of the code generation work to be done at build-time, with just some minimal fix-ups applied when that precompiled code is prepared for execution. That code is optimized and not instrumented, or else the instrumentation would slow it down. So, this PR also adds a new tier for R2R. After an R2R method has been invoked some number of times, it’s recompiled, again with optimizations but this time also with instrumentation, and then when that’s been invoked sufficiently, it’s promoted again, this time to an optimized implementation utilizing the instrumentation data gathered in the previous tier.

[Figure: Code flow between JIT tiers]

There have also been multiple changes focused on doing more optimization in tier 0. As noted previously, the JIT wants to be able to compile tier 0 as quickly as possible, however some optimizations in code quality actually help it to do that. For example, dotnet/runtime#82412 teaches it to do some amount of constant folding (evaluating constant expressions at compile time rather than at execution time), as that can enable it to generate much less code. Much of the time the JIT spends compiling in tier 0 is for interactions with the Virtual Machine (VM) layer of the .NET runtime, such as resolving types, and so if it can significantly trim away branches that won’t ever be used, it can actually speed up tier 0 compilation while also getting better code quality. We can see this with a simple repro app like the following:

// dotnet run -c Release -f net8.0

MaybePrint(42.0);

static void MaybePrint<T>(T value)
{
    if (value is int)
        Console.WriteLine(value);
}

I can set the DOTNET_JitDisasm environment variable to *MaybePrint*; that will result in the JIT printing out to the console the code it emits for this method. On .NET 7, when I run this (dotnet run -c Release -f net7.0), I get the following tier 0 code:

; Assembly listing for method Program:<<Main>$>g__MaybePrint|0_0[double](double)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       4883EC30             sub      rsp, 48
       C5F877               vzeroupper
       488D6C2430           lea      rbp, [rsp+30H]
       33C0                 xor      eax, eax
       488945F8             mov      qword ptr [rbp-08H], rax
       C5FB114510           vmovsd   qword ptr [rbp+10H], xmm0

G_M000_IG02:                ;; offset=0018H
       33C9                 xor      ecx, ecx
       85C9                 test     ecx, ecx
       742D                 je       SHORT G_M000_IG03
       48B9B877CB99F97F0000 mov      rcx, 0x7FF999CB77B8
       E813C9AE5F           call     CORINFO_HELP_NEWSFAST
       488945F8             mov      gword ptr [rbp-08H], rax
       488B4DF8             mov      rcx, gword ptr [rbp-08H]
       C5FB104510           vmovsd   xmm0, qword ptr [rbp+10H]
       C5FB114108           vmovsd   qword ptr [rcx+08H], xmm0
       488B4DF8             mov      rcx, gword ptr [rbp-08H]
       FF15BFF72000         call     [System.Console:WriteLine(System.Object)]

G_M000_IG03:                ;; offset=0049H
       90                   nop

G_M000_IG04:                ;; offset=004AH
       4883C430             add      rsp, 48
       5D                   pop      rbp
       C3                   ret

; Total bytes of code 80

The important thing to note here is that all of the code associated with the Console.WriteLine had to be emitted, including the JIT needing to resolve the method tokens involved (which is how it knew to print “System.Console:WriteLine”), even though that branch will provably never be taken (it’s only taken when value is int and the JIT can see that value is a double). Now in .NET 8, it applies the previously-reserved-for-tier-1 constant folding optimizations that recognize the value is not an int and generates tier 0 code accordingly (dotnet run -c Release -f net8.0):

; Assembly listing for method Program:<<Main>$>g__MaybePrint|0_0[double](double) (Tier0)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier0 code
; rbp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       mov      rbp, rsp
       vmovsd   qword ptr [rbp+0x10], xmm0

G_M000_IG02:                ;; offset=0x0009

G_M000_IG03:                ;; offset=0x0009
       pop      rbp
       ret

; Total bytes of code 11

dotnet/runtime#77357 and dotnet/runtime#83002 also enable some JIT intrinsics to be employed in tier 0 (a JIT intrinsic is a method the JIT has some special knowledge of, either knowing about its behavior so it can optimize around it accordingly, or in many cases actually supplying its own implementation to replace the one in the method’s body). This is in part for the same reason; many intrinsics can result in better dead code elimination (e.g. if (typeof(T).IsValueType) { ... }). But more so, without recognizing intrinsics as being special, we might end up generating code for an intrinsic method that we would never otherwise need to generate code for, even in tier 1. dotnet/runtime#88989 also eliminates some forms of boxing in tier 0.
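As a small illustration of the typeof(T).IsValueType pattern mentioned above, here is a sketch of my own (Describe is a hypothetical helper, not an API from the core libraries). For any given generic instantiation the condition is a constant the JIT can know at compile time, so only one of the two branches is ever reachable; treating the check as an intrinsic is what gives the JIT the chance to drop the other branch rather than compile it.

```csharp
using System;

Console.WriteLine(Describe(42));       // T is int
Console.WriteLine(Describe("hello"));  // T is string

// For a specific T, typeof(T).IsValueType is effectively a constant,
// so one of these two return paths is dead code for that instantiation.
static string Describe<T>(T value)
{
    if (typeof(T).IsValueType)
    {
        return $"{typeof(T).Name} is a value type";
    }

    return $"{typeof(T).Name} is a reference type";
}
```

The observable behavior is identical whether or not the dead branch is eliminated; the difference is only in how much code the JIT has to generate and resolve.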

Collecting all of this instrumentation in tier 0 instrumented code brings with it some of its own challenges. The JIT is augmenting a bunch of methods to track a lot of additional data; where and how does it track it? And how does it do so safely and correctly when multiple threads are potentially accessing all of this at the same time? For example, one of the things the JIT tracks in an instrumented method is which branches are followed and how frequently; that requires it to count each time code traverses that branch. You can imagine that happens, well, a lot. How can it do the counting in a thread-safe yet efficient way?

The answer previously was, it didn’t. It used racy, non-synchronized updates to a shared value, e.g. _branches[branchNum]++. This means that some updates might get lost in the presence of multithreaded access, but as the answer here only needs to be approximate, that was deemed ok. As it turns out, however, in some cases it was resulting in a lot of lost counts, which in turn caused the JIT to optimize for the wrong things. Another approach tried for comparison purposes in dotnet/runtime#82775 was to use interlocked operations (e.g. if this were C#, Interlocked.Increment); that results in perfect accuracy, but that explicit synchronization represents a huge potential bottleneck when heavily contended. dotnet/runtime#84427 provides the approach that’s now enabled by default in .NET 8. It’s an implementation of a scalable approximate counter that employs some amount of pseudo-randomness to decide how often to synchronize and by how much to increment the shared count. There’s a great description of all of this in the dotnet/runtime repo; here is a C# implementation of the counting logic based on that discussion:

static void Count(ref uint sharedCounter)
{
    uint currentCount = sharedCounter, delta = 1;

    if (currentCount > 0)
    {
        int logCount = 31 - (int)uint.LeadingZeroCount(currentCount);

        if (logCount >= 13)
        {
            delta = 1u << (logCount - 12);

            uint random = (uint)Random.Shared.NextInt64(0, uint.MaxValue + 1L);
            if ((random & (delta - 1)) != 0)
            {
                return;
            }
        }
    }

    Interlocked.Add(ref sharedCounter, delta);
}

For current count values less than 8192, it ends up just doing the equivalent of an Interlocked.Add(ref counter, 1). However, as the count increases to beyond that threshold, it starts only doing the add randomly half the time, and when it does, it adds 2. Then randomly a quarter of the time it adds 4. Then an eighth of the time it adds 8. And so on. In this way, as more and more increments are performed, it requires writing to the shared counter less and less frequently.
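To see that schedule laid out, here is a small sketch that reuses the same bit arithmetic as the counting logic but, instead of updating a counter, reports what would happen at a given count. Schedule is a hypothetical helper of mine, and the write probability of 1/delta assumes the random bits are uniformly distributed. Note that delta multiplied by that probability is always 1, which is why the counter stays approximately accurate even as writes become rare. It assumes .NET 7 or later for uint.LeadingZeroCount.

```csharp
using System;

foreach (uint count in new uint[] { 1, 8_191, 8_192, 16_384, 1_000_000 })
{
    var (delta, probability) = Schedule(count);
    Console.WriteLine($"count {count,9:N0}: adds {delta} with probability {probability}");
}

// Returns the amount that would be added to the shared counter at this count,
// and the chance that the add happens at all (assuming uniform random bits).
static (uint Delta, double WriteProbability) Schedule(uint currentCount)
{
    if (currentCount == 0)
    {
        return (1u, 1.0);
    }

    int logCount = 31 - (int)uint.LeadingZeroCount(currentCount);
    if (logCount < 13)
    {
        return (1u, 1.0); // below 8192: every increment is written
    }

    uint delta = 1u << (logCount - 12);
    return (delta, 1.0 / delta); // the low (logCount - 12) random bits must all be zero
}
```

For example, at a count of 1,000,000 the highest set bit is bit 19, so the delta is 128 and only about one increment in 128 touches the shared counter.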

We can test this out with a little app like the following (if you want to try running it, just copy the above Count into the program as well):

// dotnet run -c Release -f net8.0

using System.Diagnostics;

uint counter = 0;
const int ItersPerThread = 100_000_000;

while (true)
{
    Run("Interlock", _ => { for (int i = 0; i < ItersPerThread; i++) Interlocked.Increment(ref counter); });
    Run("Racy     ", _ => { for (int i = 0; i < ItersPerThread; i++) counter++; });
    Run("Scalable ", _ => { for (int i = 0; i < ItersPerThread; i++) Count(ref counter); });
    Console.WriteLine();
}

void Run(string name, Action<int> body)
{
    counter = 0;

    long start = Stopwatch.GetTimestamp();
    Parallel.For(0, Environment.ProcessorCount, body);
    long end = Stopwatch.GetTimestamp();

    Console.WriteLine($"{name} => Expected: {Environment.ProcessorCount * ItersPerThread:N0}, Actual: {counter,13:N0}, Elapsed: {Stopwatch.GetElapsedTime(start, end).TotalMilliseconds}ms");
}

When I run that, I get results like this:

Interlock => Expected: 1,200,000,000, Actual: 1,200,000,000, Elapsed: 20185.548ms
Racy      => Expected: 1,200,000,000, Actual:   138,526,798, Elapsed: 987.4997ms
Scalable  => Expected: 1,200,000,000, Actual: 1,193,541,836, Elapsed: 1082.8471ms

I find these results fascinating. The interlocked approach gets the exact right count, but it’s super slow, ~20x slower than the other approaches. The fastest is the racy additions one, but its count is also wildly inaccurate: it was off by a factor of 8x! The scalable counters solution was only a hair slower than the racy solution, but its count was only off the expected value by 0.5%. This scalable approach then enables the JIT to track what it needs with the efficiency and approximate accuracy it needs. Other PRs like dotnet/runtime#82014, dotnet/runtime#81731, and dotnet/runtime#81932 also went into improving the JIT’s efficiency around tracking this information.

As it turns out, this isn’t the only use of randomness in dynamic PGO. Another is used as part of determining which types are the most common targets of virtual and interface method calls. At a given call site, the JIT wants to know which type is most commonly used and by what percentage; if there’s a clear winner, it can then generate a fast path specific to that type. As in the previous example, tracking a count for every possible type that might come through is expensive. Instead, it uses an algorithm known as “reservoir sampling”. Let’s say I have a char[1_000_000] containing ~60% 'a's, ~30% 'b's, and ~10% 'c's, and I want to know which is the most common. With reservoir sampling, I might do so like this:

// dotnet run -c Release -f net8.0

// Create random input for testing, with 60% a, 30% b, 10% c
char[] chars = new char[1_000_000];
Array.Fill(chars, 'a', 0, 600_000);
Array.Fill(chars, 'b', 600_000, 300_000);
Array.Fill(chars, 'c', 900_000, 100_000);
Random.Shared.Shuffle(chars);

for (int trial = 0; trial < 5; trial++)
{
    // Reservoir sampling
    char[] reservoir = new char[32]; // same reservoir size as the JIT
    int next = 0;
    for (int i = 0; i < reservoir.Length && next < chars.Length; i++, next++)
    {
        reservoir[i] = chars[i];
    }
    for (; next < chars.Length; next++)
    {
        int r = Random.Shared.Next(next + 1);
        if (r < reservoir.Length)
        {
            reservoir[r] = chars[next];
        }
    }

    // Print resulting percentages
    Console.WriteLine($"a: {reservoir.Count(c => c == 'a') * 100.0 / reservoir.Length}");
    Console.WriteLine($"b: {reservoir.Count(c => c == 'b') * 100.0 / reservoir.Length}");
    Console.WriteLine($"c: {reservoir.Count(c => c == 'c') * 100.0 / reservoir.Length}");
    Console.WriteLine();
}

When I run this, I get results like the following:

a: 53.125
b: 31.25
c: 15.625

a: 65.625
b: 28.125
c: 6.25

a: 68.75
b: 25
c: 6.25

a: 40.625
b: 31.25
c: 28.125

a: 59.375
b: 25
c: 15.625

Note that in the above example, I actually had all the data in advance; in contrast, the JIT likely has multiple threads all running instrumented code and overwriting elements in the reservoir. I also happened to choose the same size reservoir the JIT is using as of dotnet/runtime#87332, which highlights how that value was chosen for its use case and why it needed to be tweaked.

On all five runs above, it correctly found there to be more 'a's than 'b's and more 'b's than 'c's, and it was often reasonably close to the actual percentages. But, importantly, randomness is involved here, and every run produced slightly different results. I mention this because that means the JIT compiler now incorporates randomness, which means that the produced dynamic PGO instrumentation data is very likely to be slightly different from run to run. However, even without explicit use of randomness, there’s already non-determinism in such code, and in general there’s enough data produced that the overall behavior is quite stable and repeatable.
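Once a reservoir has been filled, deciding whether there is a "clear winner" is just a matter of tallying it. The following is a rough sketch of that last step, sticking with chars as stand-ins for types. Dominant is a hypothetical helper, the reservoir contents are made up for the example, and the 50% cutoff is purely an assumption of mine for illustration, not the threshold the JIT actually uses.

```csharp
using System;
using System.Linq;

// Hypothetical cutoff for illustration only; not the JIT's real value.
const double ClearWinnerPercent = 50.0;

// A made-up 32-entry reservoir: 19 'a's, 9 'b's, and 4 'c's.
char[] reservoir = (new string('a', 19) + new string('b', 9) + new string('c', 4)).ToCharArray();

var (winner, percent) = Dominant(reservoir);
Console.WriteLine(percent >= ClearWinnerPercent
    ? $"'{winner}' dominates at {percent}%: a candidate for a dedicated fast path"
    : $"no clear winner (best is '{winner}' at {percent}%)");

// Finds the most frequent sample and the percentage of the reservoir it occupies.
static (char Winner, double Percent) Dominant(char[] samples)
{
    var top = samples.GroupBy(c => c).OrderByDescending(g => g.Count()).First();
    return (top.Key, top.Count() * 100.0 / samples.Length);
}
```

With 19 of 32 slots, 'a' comes out at 59.375%, which clears the assumed cutoff. With only 32 samples each slot is worth a little over 3%, which is one reason the percentages in the runs above jump around as much as they do.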

Interestingly, the JIT’s PGO-based optimizations aren’t just based on the data gathered during instrumented tier 0 execution. With dotnet/runtime#82926 (and a handful of follow-on PRs like dotnet/runtime#83068, dotnet/runtime#83567, dotnet/runtime#84312, and dotnet/runtime#84741), the JIT will now create a synthetic profile based on statically analyzing the code and estimating a profile, such as with various approaches to static branch prediction. The JIT can then blend this data together with the instrumentation data, helping to fill in data where there are gaps (think “Jurassic Park” and using modern reptile DNA to plug the gaps in the recovered dinosaur DNA).

Beyond the mechanisms used to enable tiering and dynamic PGO getting better (and, did I mention, being on by default?!) in .NET 8, the optimizations it performs also get better. One of the main optimizations dynamic PGO feeds is the ability to devirtualize virtual and interface calls per call site. As noted, the JIT tracks what concrete types are used, and then can generate a fast path for the most common type; this is known as guarded devirtualization (GDV). Consider this benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    internal interface IValueProducer
    {
        int GetValue();
    }

    class Producer42 : IValueProducer
    {
        public int GetValue() => 42;
    }

    private IValueProducer _valueProducer;
    private int _factor = 2;

    [GlobalSetup]
    public void Setup() => _valueProducer = new Producer42();

    [Benchmark]
    public int GetValue() => _valueProducer.GetValue() * _factor;
}

The GetValue method is doing:

return _valueProducer.GetValue() * _factor;

Without PGO, that’s just a normal interface dispatch. With PGO, however, the JIT will end up seeing that the actual type of _valueProducer is most commonly Producer42, and it will end up generating tier 1 code closer to what it would produce if my benchmark were instead:

int result = _valueProducer.GetType() == typeof(Producer42) ?
    Unsafe.As<Producer42>(_valueProducer).GetValue() :
    _valueProducer.GetValue();
return result * _factor;

It can then in turn see that theProducer42.GetValue() method is really simple, and so not only is theGetValue call devirtualized, it’s also inlined, such that the code effectively becomes:

int result = _valueProducer.GetType() == typeof(Producer42) ?    42 :    _valueProducer.GetValue();return result * _factor;

We can confirm this by running the above benchmark. The resulting numbers certainly show something going on:

| Method | Runtime | Mean | Ratio | Code Size |
|---|---|---|---|---|
| GetValue | .NET 7.0 | 1.6430 ns | 1.00 | 35 B |
| GetValue | .NET 8.0 | 0.0523 ns | 0.03 | 57 B |

We see it’s both faster (which we expected) and more code (which we also expected). Now for the assembly. On .NET 7, we get this:

; Tests.GetValue()
       push      rsi
       sub       rsp,20
       mov       rsi,rcx
       mov       rcx,[rsi+8]
       mov       r11,7FF999B30498
       call      qword ptr [r11]
       imul      eax,[rsi+10]
       add       rsp,20
       pop       rsi
       ret
; Total bytes of code 35

We can see it’s performing the interface call (the three movs followed by the call) and then multiplying the result by _factor (imul eax,[rsi+10]). Now on .NET 8, we get this:

; Tests.GetValue()
       push      rbx
       sub       rsp,20
       mov       rbx,rcx
       mov       rcx,[rbx+8]
       mov       rax,offset MT_Tests+Producer42
       cmp       [rcx],rax
       jne       short M00_L01
       mov       eax,2A
M00_L00:
       imul      eax,[rbx+10]
       add       rsp,20
       pop       rbx
       ret
M00_L01:
       mov       r11,7FFA1FAB04D8
       call      qword ptr [r11]
       jmp       short M00_L00
; Total bytes of code 57

We still see the call, but it’s buried in a cold section at the end. Instead, we see the type of the object being compared against MT_Tests+Producer42, and if it matches (the cmp [rcx],rax followed by the jne), we store 2A into eax; 2A is the hex representation of 42, so this is the entirety of the inlined body of the devirtualized Producer42.GetValue call.

.NET 8 is also capable of doing multiple GDVs, meaning it can generate fast paths for more than 1 type, thanks in large part to dotnet/runtime#86551 and dotnet/runtime#86809. However, this is off by default and for now needs to be opted into with a configuration setting (setting the DOTNET_JitGuardedDevirtualizationMaxTypeChecks environment variable to the desired maximum number of types for which to test). We can see the impact of that with this benchmark (note that because I’ve explicitly specified the configs to use in the code itself, I’ve omitted the --runtimes argument in the dotnet command):

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId("ChecksOne").WithRuntime(CoreRuntime.Core80))
    .AddJob(Job.Default.WithId("ChecksThree").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_JitGuardedDevirtualizationMaxTypeChecks", "3"));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables")]
[DisassemblyDiagnoser]
public class Tests
{
    private readonly A _a = new();
    private readonly B _b = new();
    private readonly C _c = new();

    [Benchmark]
    public void Multiple()
    {
        DoWork(_a);
        DoWork(_b);
        DoWork(_c);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int DoWork(IMyInterface i) => i.GetValue();

    private interface IMyInterface { int GetValue(); }
    private class A : IMyInterface { public int GetValue() => 123; }
    private class B : IMyInterface { public int GetValue() => 456; }
    private class C : IMyInterface { public int GetValue() => 789; }
}

| Method | Job | Mean | Code Size |
|---|---|---|---|
| Multiple | ChecksOne | 7.463 ns | 90 B |
| Multiple | ChecksThree | 5.632 ns | 133 B |

And in the assembly code with the environment variable set, we can indeed see it doing multiple checks for three types before falling back to the general interface dispatch:

; Tests.DoWork(IMyInterface)
       sub       rsp,28
       mov       rax,offset MT_Tests+A
       cmp       [rcx],rax
       jne       short M01_L00
       mov       eax,7B
       jmp       short M01_L02
M01_L00:
       mov       rax,offset MT_Tests+B
       cmp       [rcx],rax
       jne       short M01_L01
       mov       eax,1C8
       jmp       short M01_L02
M01_L01:
       mov       rax,offset MT_Tests+C
       cmp       [rcx],rax
       jne       short M01_L03
       mov       eax,315
M01_L02:
       add       rsp,28
       ret
M01_L03:
       mov       r11,7FFA1FAC04D8
       call      qword ptr [r11]
       jmp       short M01_L02
; Total bytes of code 88

(Interestingly, this optimization gets a bit better in Native AOT. There, with dotnet/runtime#87055, there can be no need for the fallback path. The compiler can see the entire program being optimized and can generate fast paths for all of the types that implement the target abstraction if it’s a small number.)

dotnet/runtime#75140 provides another really nice optimization, still related to GDV, but now for delegates and in relation to loop cloning. Take the following benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser]
public class Tests
{
    private readonly Func<int, int> _func = i => i + 1;

    [Benchmark]
    public int Sum() => Sum(_func);

    private static int Sum(Func<int, int> func)
    {
        int sum = 0;
        for (int i = 0; i < 10_000; i++)
        {
            sum += func(i);
        }
        return sum;
    }
}

Dynamic PGO is capable of doing GDV with delegates just as it is with virtual and interface methods. The JIT’s profiling of this method will highlight that the function being invoked is always the same i => i + 1 lambda, and as we saw, that can then be transformed into a method something like the following pseudo-code:

private static int Sum(Func<int, int> func)
{
    int sum = 0;
    for (int i = 0; i < 10_000; i++)
    {
        sum += func.Method == KnownLambda ? i + 1 : func(i);
    }
    return sum;
}

It’s not very visible that inside our loop we’re performing the same check over and over and over. We’re also branching based on it. One common compiler optimization is “hoisting,” where a computation that’s “loop invariant” (meaning it doesn’t change per iteration) can be pulled out of the loop to be above it, e.g.

private static int Sum(Func<int, int> func)
{
    int sum = 0;
    bool isAdd = func.Method == KnownLambda;
    for (int i = 0; i < 10_000; i++)
    {
        sum += isAdd ? i + 1 : func(i);
    }
    return sum;
}

but even with that, we still have the branch on each iteration. Wouldn’t it be nice if we could hoist that as well? What if we could “clone” the loop, duplicating it once for when the method is the known target and once for when it’s not? That’s “loop cloning,” an optimization the JIT is already capable of for other reasons, and now in .NET 8 the JIT is capable of that with this exact scenario, too. The code it’ll produce ends up then being very similar to this:

private static int Sum(Func<int, int> func)
{
    int sum = 0;
    if (func.Method == KnownLambda)
    {
        for (int i = 0; i < 10_000; i++)
        {
            sum += i + 1;
        }
    }
    else
    {
        for (int i = 0; i < 10_000; i++)
        {
            sum += func(i);
        }
    }
    return sum;
}

Looking at the generated assembly on .NET 8 confirms this:

; Tests.Sum(System.Func`2<Int32,Int32>)
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,20
       mov       rbx,rcx
       xor       esi,esi
       xor       edi,edi
       test      rbx,rbx
       je        short M01_L01
       mov       rax,7FFA2D630F78
       cmp       [rbx+18],rax
       jne       short M01_L01
M01_L00:
       inc       edi
       mov       eax,edi
       add       esi,eax
       cmp       edi,2710
       jl        short M01_L00
       jmp       short M01_L03
M01_L01:
       mov       rax,7FFA2D630F78
       cmp       [rbx+18],rax
       jne       short M01_L04
       lea       eax,[rdi+1]
M01_L02:
       add       esi,eax
       inc       edi
       cmp       edi,2710
       jl        short M01_L01
M01_L03:
       mov       eax,esi
       add       rsp,20
       pop       rbx
       pop       rsi
       pop       rdi
       ret
M01_L04:
       mov       edx,edi
       mov       rcx,[rbx+8]
       call      qword ptr [rbx+18]
       jmp       short M01_L02
; Total bytes of code 103

Focus just on the M01_L00 block: you can see it ends with a jl short M01_L00 to loop back around to M01_L00 if edi (which is storing i) is less than 0x2710, or 10,000 decimal, aka our loop’s upper bound. Note that there are just a few instructions in the middle, nothing at all resembling a call… this is the optimized cloned loop, where our lambda has been inlined. There’s another loop that alternates between M01_L02, M01_L01, and M01_L04, and that one does have a call… that’s the fallback loop. And if we run the benchmark, we see a huge resulting improvement:

| Method | Runtime | Mean | Ratio | Code Size |
|---|---|---|---|---|
| Sum | .NET 7.0 | 16.546 us | 1.00 | 55 B |
| Sum | .NET 8.0 | 2.320 us | 0.14 | 113 B |

As long as we’re discussing hoisting, it’s worth noting other improvements have also contributed. In particular, dotnet/runtime#81635 enables the JIT to hoist more code used in generic method dispatch. We can see that in action with a benchmark like this:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    [Benchmark]
    public void Test() => Test<string>();

    static void Test<T>()
    {
        for (int i = 0; i < 100; i++)
        {
            Callee<T>();
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Callee<T>() { }
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Test | .NET 7.0 | 170.8 ns | 1.00 |
| Test | .NET 8.0 | 147.0 ns | 0.86 |

Before moving on, one word of warning about dynamic PGO: it’s good at what it does, really good. Why is that a “warning?” Dynamic PGO is very good about seeing what your code is doing and optimizing for it, which is awesome when you’re talking about your production applications. But there’s a particular kind of coding where you might not want that to happen, or at least you need to be acutely aware of it happening, and you’re currently looking at it: benchmarks. Microbenchmarks are all about isolating a particular piece of functionality and running that over and over and over and over in order to get good measurements about its overhead. With dynamic PGO, however, the JIT will then optimize for the exact thing you’re testing. If the thing you’re testing is exactly how the code will execute in production, then awesome. But if your test isn’t fully representative, you can get a skewed understanding of the costs involved, which can lead to making less-than-ideal assumptions and decisions.

For example, consider this benchmark:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId("No PGO").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_TieredPGO", "0"))
    .AddJob(Job.Default.WithId("PGO").WithRuntime(CoreRuntime.Core80));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables")]
public class Tests
{
    private static readonly Random s_rand = new();
    private readonly IEnumerable<int> _source = Enumerable.Repeat(0, 1024);

    [Params(1.0, 0.5)]
    public double Probability { get; set; }

    [Benchmark]
    public bool Any() => s_rand.NextDouble() < Probability ?
        _source.Any(i => i == 42) :
        _source.Any(i => i == 43);
}

This runs a benchmark with two different “Probability” values. Regardless of that value, the code that’s executed for the benchmark does exactly the same thing and should result in exactly the same assembly code (other than one path checking for the value 42 and the other for 43). In a world without PGO, there should be close to zero difference in performance between the runs, and if we set the DOTNET_TieredPGO environment variable to 0 (to disable PGO), that’s exactly what we see, but with PGO, we observe a larger difference:

| Method | Job | Probability | Mean |
|---|---|---|---|
| Any | No PGO | 0.5 | 5.354 us |
| Any | No PGO | 1 | 5.314 us |
| Any | PGO | 0.5 | 1.969 us |
| Any | PGO | 1 | 1.495 us |

When all of the calls use i == 42 (because we set the probability to 1, all of the random values are less than that, and we always take the first branch), we see throughput ends up being 25% faster than when half of the calls use i == 42 and half use i == 43. If your benchmark was only trying to measure the overhead of using Enumerable.Any, you might not realize that the resulting code was being optimized for calling Any with the same delegate every time, in which case you get different results than if Any is called with multiple delegates and all with reasonably equal chances of being used. (As an aside, the nice overall improvement between dynamic PGO being disabled and enabled comes in part from the use of Random, which internally makes a virtual call that dynamic PGO can help elide.)

Throughout the rest of this post, I’ve kept this in mind and tried hard to show benchmarks where the resulting wins are due primarily to the cited improvements in the relevant code; where dynamic PGO plays a larger role in the improvements, I’ve called that out, often showing the results with and without dynamic PGO. There are many more benchmarks I could have shown but have avoided where it would look like a particular method had massive improvements, yet in reality it’d all be due to dynamic PGO being its awesome self rather than some explicit change made to the method’s C# code.

One final note about dynamic PGO: it’s awesome, but it doesn’t obviate the need for thoughtful coding. If you know and can use something’s concrete type rather than an abstraction, from a performance perspective it’s better to do so rather than hoping the JIT will be able to see through it and devirtualize. To help with this, a new analyzer, CA1859, was added to the .NET SDK in dotnet/roslyn-analyzers#6370. The analyzer looks for places where interfaces or base classes could be replaced by derived types in order to avoid interface and virtual dispatch. dotnet/runtime#80335 and dotnet/runtime#80848 rolled this out across dotnet/runtime. As you can see from the first PR in particular, there were hundreds of places identified where, with just an edit of one character (e.g. replacing IList<T> with List<T>), we could possibly reduce overheads.

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId("No PGO").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_TieredPGO", "0"))
    .AddJob(Job.Default.WithId("PGO").WithRuntime(CoreRuntime.Core80));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables")]
public class Tests
{
    private readonly IList<int> _ilist = new List<int>();
    private readonly List<int> _list = new();

    [Benchmark]
    public void IList()
    {
        _ilist.Add(42);
        _ilist.Clear();
    }

    [Benchmark]
    public void List()
    {
        _list.Add(42);
        _list.Clear();
    }
}

| Method | Job | Mean |
|---|---|---|
| IList | No PGO | 2.876 ns |
| IList | PGO | 1.777 ns |
| List | No PGO | 1.718 ns |
| List | PGO | 1.476 ns |
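To make the kind of edit concrete, here is a minimal sketch of the before/after shape, assuming the analyzer treats private fields the way its description above suggests. Ca1859Sketch and its members are hypothetical names used only for illustration; this is not code from either PR.

```csharp
using System;
using System.Collections.Generic;

Console.WriteLine(Ca1859Sketch.SumViaInterface()); // 6
Console.WriteLine(Ca1859Sketch.SumViaConcrete());  // 6

static class Ca1859Sketch
{
    // "Before": a private field typed as the interface even though it only ever
    // holds a List<int>, so Count and the indexer go through interface dispatch.
    private static readonly IList<int> s_viaInterface = new List<int> { 1, 2, 3 };

    // "After": the same kind of object, but the declared type is the concrete
    // List<int>, so the calls can be direct (and are candidates for inlining).
    private static readonly List<int> s_viaConcrete = new List<int> { 1, 2, 3 };

    public static int SumViaInterface()
    {
        int sum = 0;
        for (int i = 0; i < s_viaInterface.Count; i++)
        {
            sum += s_viaInterface[i];
        }
        return sum;
    }

    public static int SumViaConcrete()
    {
        int sum = 0;
        for (int i = 0; i < s_viaConcrete.Count; i++)
        {
            sum += s_viaConcrete[i];
        }
        return sum;
    }
}
```

Both methods compute the same result; the only difference is the declared type of the field, which is the whole point of the change.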

Vectorization

Another huge area of investment in code generation in .NET 8 is around vectorization. This is a continuation of a theme that’s been going for multiple .NET releases. Almost a decade ago, .NET gained the Vector<T> type. .NET Core 3.0 and .NET 5 added thousands of intrinsic methods for directly targeting specific hardware instructions. .NET 7 provided hundreds of cross-platform operations for Vector128<T> and Vector256<T> to enable SIMD algorithms on fixed-width vectors. And now in .NET 8, .NET gains support for AVX512, both with new hardware intrinsics directly exposing AVX512 instructions and with the new Vector512 and Vector512<T> types.

There were a plethora of changes that went into improving existing SIMD support, such as dotnet/runtime#76221 that improves the handling of Vector256<T> when it’s not hardware accelerated by lowering it as two Vector128<T> operations. Or like dotnet/runtime#87283, which removed the generic constraint on the T in all of the vector types in order to make them easier to use in a larger set of contexts. But the bulk of the work in this area in this release is focused on AVX512.
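To picture what “lowering as two Vector128<T> operations” means, here is a minimal hand-written sketch that performs a 256-bit addition as two 128-bit halves. It is only an illustration of the idea, not the JIT’s implementation (the JIT does this internally); SplitVector and AddAsTwoHalves are hypothetical names.

```csharp
using System;
using System.Runtime.Intrinsics;

Vector256<int> a = Vector256.Create(1, 2, 3, 4, 5, 6, 7, 8);
Vector256<int> b = Vector256.Create(10, 20, 30, 40, 50, 60, 70, 80);

// The split version produces the same eight lanes as the single 256-bit add.
Console.WriteLine(SplitVector.AddAsTwoHalves(a, b) == a + b); // True

static class SplitVector
{
    // Operate on the lower and upper 128-bit halves separately, then
    // stitch the two results back together into one 256-bit vector.
    public static Vector256<int> AddAsTwoHalves(Vector256<int> left, Vector256<int> right) =>
        Vector256.Create(
            left.GetLower() + right.GetLower(),
            left.GetUpper() + right.GetUpper());
}
```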

Wikipedia has a good overview of AVX512, which provides instructions for processing 512 bits at a time. In addition to providing wider versions of the 256-bit instructions seen in previous instruction sets, it also adds a variety of new operations, almost all of which are exposed via one of the new types in System.Runtime.Intrinsics.X86, like Avx512BW, Avx512CD, Avx512DQ, Avx512F, and Avx512Vbmi. dotnet/runtime#83040 kicked things off by stubbing out the various files, followed by dozens of PRs that filled in the functionality, for example dotnet/runtime#84909 that added the 512-bit variants of the SSE through SSE4.2 intrinsics that already exist; like dotnet/runtime#75934 from @DeepakRajendrakumaran and dotnet/runtime#77419 from @DeepakRajendrakumaran that added support for the EVEX encoding used by AVX512 instructions; like dotnet/runtime#74113 from @DeepakRajendrakumaran that added the logic for detecting AVX512 support; like dotnet/runtime#80960 from @DeepakRajendrakumaran and dotnet/runtime#79544 from @anthonycanino that enlightened the register allocator and emitter about AVX512’s additional registers; and like dotnet/runtime#87946 from @Ruihan-Yin and dotnet/runtime#84937 from @jkrishnavs that plumbed through knowledge of various intrinsics.

Let’s take it for a spin. The machine on which I’m writing this doesn’t have AVX512 support, but my Dev Box does, so I’m using that for AVX512 comparisons (using WSL with Ubuntu). In last year’s Performance Improvements in .NET 7, we wrote a Contains method that used Vector256<T> if there was sufficient data available and it was accelerated, or else Vector128<T> if there was sufficient data available and it was accelerated, or else a scalar fallback. Tweaking that to also “light up” with AVX512 took me literally less than 30 seconds: copy/paste the code block for Vector256 and then search and replace in that copy from “Vector256” to “Vector512”… boom, done. Here it is in a benchmark, using environment variables to disable the JIT’s ability to use the various instruction sets so that we can try out this method with each acceleration path:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId("Scalar").WithEnvironmentVariable("DOTNET_EnableHWIntrinsic", "0").AsBaseline())
    .AddJob(Job.Default.WithId("Vector128").WithEnvironmentVariable("DOTNET_EnableAVX2", "0").WithEnvironmentVariable("DOTNET_EnableAVX512F", "0"))
    .AddJob(Job.Default.WithId("Vector256").WithEnvironmentVariable("DOTNET_EnableAVX512F", "0"))
    .AddJob(Job.Default.WithId("Vector512"));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables", "value")]
public class Tests
{
    private readonly byte[] _data = Enumerable.Repeat((byte)123, 999).Append((byte)42).ToArray();

    [Benchmark]
    [Arguments((byte)42)]
    public bool Find(byte value) => Contains(_data, value);

    private static unsafe bool Contains(ReadOnlySpan<byte> haystack, byte needle)
    {
        if (Vector128.IsHardwareAccelerated && haystack.Length >= Vector128<byte>.Count)
        {
            ref byte current = ref MemoryMarshal.GetReference(haystack);

            if (Vector512.IsHardwareAccelerated && haystack.Length >= Vector512<byte>.Count)
            {
                Vector512<byte> target = Vector512.Create(needle);
                ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector512<byte>.Count);
                do
                {
                    if (Vector512.EqualsAny(target, Vector512.LoadUnsafe(ref current)))
                        return true;

                    current = ref Unsafe.Add(ref current, Vector512<byte>.Count);
                }
                while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

                if (Vector512.EqualsAny(target, Vector512.LoadUnsafe(ref endMinusOneVector)))
                    return true;
            }
            else if (Vector256.IsHardwareAccelerated && haystack.Length >= Vector256<byte>.Count)
            {
                Vector256<byte> target = Vector256.Create(needle);
                ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector256<byte>.Count);
                do
                {
                    if (Vector256.EqualsAny(target, Vector256.LoadUnsafe(ref current)))
                        return true;

                    current = ref Unsafe.Add(ref current, Vector256<byte>.Count);
                }
                while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

                if (Vector256.EqualsAny(target, Vector256.LoadUnsafe(ref endMinusOneVector)))
                    return true;
            }
            else
            {
                Vector128<byte> target = Vector128.Create(needle);
                ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector128<byte>.Count);
                do
                {
                    if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref current)))
                        return true;

                    current = ref Unsafe.Add(ref current, Vector128<byte>.Count);
                }
                while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

                if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref endMinusOneVector)))
                    return true;
            }
        }
        else
        {
            for (int i = 0; i < haystack.Length; i++)
                if (haystack[i] == needle)
                    return true;
        }

        return false;
    }
}

| Method | Job | Mean | Ratio |
|---|---|---|---|
| Find | Scalar | 461.49 ns | 1.00 |
| Find | Vector128 | 37.94 ns | 0.08 |
| Find | Vector256 | 22.98 ns | 0.05 |
| Find | Vector512 | 10.93 ns | 0.02 |

Numerous PRs elsewhere in the JIT then take advantage of AVX512 support when it’s available. For example, separate from AVX512, dotnet/runtime#83945 and dotnet/runtime#84530 taught the JIT how to unroll SequenceEqual operations, such that the JIT can emit optimized, vectorized replacements when it can see a constant length for at least one of the inputs. “Unrolling” means that rather than emitting a loop for N iterations, each of which does the loop body once, a loop is emitted for N / M iterations, where every iteration does the loop body M times (and if N == M, there is no loop at all). So for a benchmark like this:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private byte[] _scheme = "Transfer-Encoding"u8.ToArray();

    [Benchmark]
    public bool SequenceEqual() => "Transfer-Encoding"u8.SequenceEqual(_scheme);
}

we now get results like this:

| Method | Runtime | Mean | Ratio | Code Size |
|---|---|---|---|---|
| SequenceEqual | .NET 7.0 | 3.0558 ns | 1.00 | 65 B |
| SequenceEqual | .NET 8.0 | 0.8055 ns | 0.26 | 91 B |

For .NET 7, we see assembly code like this (note the call instruction to the underlying SequenceEqual helper):

; Tests.SequenceEqual()
       sub       rsp,28
       mov       r8,1D7BB272E48
       mov       rcx,[rcx+8]
       test      rcx,rcx
       je        short M00_L03
       lea       rdx,[rcx+10]
       mov       eax,[rcx+8]
M00_L00:
       mov       rcx,r8
       cmp       eax,11
       je        short M00_L02
       xor       eax,eax
M00_L01:
       add       rsp,28
       ret
M00_L02:
       mov       r8d,11
       call      qword ptr [7FF9D33CF120]; System.SpanHelpers.SequenceEqual(Byte ByRef, Byte ByRef, UIntPtr)
       jmp       short M00_L01
M00_L03:
       xor       edx,edx
       xor       eax,eax
       jmp       short M00_L00
; Total bytes of code 65

And now for .NET 8, we get assembly code like this:

; Tests.SequenceEqual()
       vzeroupper
       mov       rax,1EBDDA92D38
       mov       rcx,[rcx+8]
       test      rcx,rcx
       je        short M00_L01
       lea       rdx,[rcx+10]
       mov       r8d,[rcx+8]
M00_L00:
       cmp       r8d,11
       jne       short M00_L03
       vmovups   xmm0,[rax]
       vmovups   xmm1,[rdx]
       vmovups   xmm2,[rax+1]
       vmovups   xmm3,[rdx+1]
       vpxor     xmm0,xmm0,xmm1
       vpxor     xmm1,xmm2,xmm3
       vpor      xmm0,xmm0,xmm1
       vptest    xmm0,xmm0
       sete      al
       movzx     eax,al
       jmp       short M00_L02
M00_L01:
       xor       edx,edx
       xor       r8d,r8d
       jmp       short M00_L00
M00_L02:
       ret
M00_L03:
       xor       eax,eax
       jmp       short M00_L02
; Total bytes of code 91

Now there’s no call, with the entire implementation provided by the JIT; we can see it making liberal use of the 128-bit xmm SIMD registers. However, those PRs only enabled the JIT to handle up to 64 bytes being compared (unrolling results in larger code, so at some length it no longer makes sense to unroll). With AVX512 support in the JIT, dotnet/runtime#84854 then extends that up to 128 bytes. This is easily visible in a benchmark like this, which is similar to the previous example, but with larger data:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private byte[] _data1, _data2;

    [GlobalSetup]
    public void Setup()
    {
        _data1 = Enumerable.Repeat((byte)42, 200).ToArray();
        _data2 = (byte[])_data1.Clone();
    }

    [Benchmark]
    public bool SequenceEqual() => _data1.AsSpan(0, 128).SequenceEqual(_data2.AsSpan(128));
}

On my Dev Box with AVX512 support, for .NET 8 we get:

; Tests.SequenceEqual()
       sub       rsp,28
       vzeroupper
       mov       rax,[rcx+8]
       test      rax,rax
       je        short M00_L01
       cmp       dword ptr [rax+8],80
       jb        short M00_L01
       add       rax,10
       mov       rcx,[rcx+10]
       test      rcx,rcx
       je        short M00_L01
       mov       edx,[rcx+8]
       cmp       edx,80
       jb        short M00_L01
       add       rcx,10
       add       rcx,80
       add       edx,0FFFFFF80
       cmp       edx,80
       je        short M00_L02
       xor       eax,eax
M00_L00:
       vzeroupper
       add       rsp,28
       ret
M00_L01:
       call      qword ptr [7FF820745F08]
       int       3
M00_L02:
       vmovups   zmm0,[rax]
       vmovups   zmm1,[rcx]
       vmovups   zmm2,[rax+40]
       vmovups   zmm3,[rcx+40]
       vpxorq    zmm0,zmm0,zmm1
       vpxorq    zmm1,zmm2,zmm3
       vporq     zmm0,zmm0,zmm1
       vxorps    ymm1,ymm1,ymm1
       vpcmpeqq  k1,zmm0,zmm1
       kortestb  k1,k1
       setb      al
       movzx     eax,al
       jmp       short M00_L00
; Total bytes of code 154

Now instead of the 128-bit xmm registers, we see use of the 512-bit zmm registers from AVX512.
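The unrolled code in these listings follows one pattern: load two (possibly overlapping) vectors from each input, XOR the pairs, OR the results, and test for zero. For the earlier 17-byte “Transfer-Encoding” case, that means one 16-byte load at offset 0 and another at offset 1. A rough hand-written C# equivalent, offered only as a sketch (OverlappingCompare and Equals17 are hypothetical names, and the JIT emits this pattern itself rather than calling anything like it), might look like:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

ReadOnlySpan<byte> expected = "Transfer-Encoding"u8;
Console.WriteLine(OverlappingCompare.Equals17(expected, "Transfer-Encoding"u8)); // True
Console.WriteLine(OverlappingCompare.Equals17(expected, "Transfer-Encodinh"u8)); // False

static class OverlappingCompare
{
    // Compares exactly 17 bytes with two overlapping 16-byte loads per input.
    public static bool Equals17(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right)
    {
        if (left.Length != 17 || right.Length != 17)
            return false;

        ref byte l = ref MemoryMarshal.GetReference(left);
        ref byte r = ref MemoryMarshal.GetReference(right);

        // Bytes 0..15 of each input; XOR is all zeros only where the bytes match.
        Vector128<byte> first = Vector128.LoadUnsafe(ref l) ^ Vector128.LoadUnsafe(ref r);

        // Bytes 1..16 of each input; this overlaps the first load by 15 bytes
        // but picks up the final byte that the first load missed.
        Vector128<byte> second =
            Vector128.LoadUnsafe(ref Unsafe.Add(ref l, 1)) ^
            Vector128.LoadUnsafe(ref Unsafe.Add(ref r, 1));

        // If any bit differs anywhere, the OR of the two XOR results is non-zero.
        return (first | second) == Vector128<byte>.Zero;
    }
}
```

The second call differs only in the last byte, which is covered solely by the load at offset 1, so it shows why the overlapping load is needed.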

The JIT in .NET 8 also now unrolls memmoves (CopyTo, ToArray, etc.) for small-enough constant lengths, thanks to dotnet/runtime#83638 and dotnet/runtime#83740. And then with dotnet/runtime#84348 that unrolling takes advantage of AVX512 if it’s available. dotnet/runtime#85501 extends this to Span<T>.Fill, too.

dotnet/runtime#84885 extended the unrolling and vectorization done as part of string/ReadOnlySpan<char> Equals and StartsWith to utilize AVX512 when available, as well.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private readonly string _str = "Let me not to the marriage of true minds admit impediments";

    [Benchmark]
    public bool Equals() => _str.AsSpan().Equals(
        "LET ME NOT TO THE MARRIAGE OF TRUE MINDS ADMIT IMPEDIMENTS",
        StringComparison.OrdinalIgnoreCase);
}

| Method | Runtime | Mean | Ratio | Code Size |
|---|---|---|---|---|
| Equals | .NET 7.0 | 30.995 ns | 1.00 | 101 B |
| Equals | .NET 8.0 | 1.658 ns | 0.05 | 116 B |

It’s so fast in .NET 8 because, whereas with .NET 7 it ends up calling through to the underlying helper:

; Tests.Equals()
       sub       rsp,48
       xor       eax,eax
       mov       [rsp+28],rax
       vxorps    xmm4,xmm4,xmm4
       vmovdqa   xmmword ptr [rsp+30],xmm4
       mov       [rsp+40],rax
       mov       rcx,[rcx+8]
       test      rcx,rcx
       je        short M00_L03
       lea       rdx,[rcx+0C]
       mov       ecx,[rcx+8]
M00_L00:
       mov       r8,21E57C058A0
       mov       r8,[r8]
       add       r8,0C
       cmp       ecx,3A
       jne       short M00_L02
       mov       rcx,rdx
       mov       rdx,r8
       mov       r8d,3A
       call      qword ptr [7FF8194B1A08]; System.Globalization.Ordinal.EqualsIgnoreCase(Char ByRef, Char ByRef, Int32)
M00_L01:
       nop
       add       rsp,48
       ret
M00_L02:
       xor       eax,eax
       jmp       short M00_L01
M00_L03:
       xor       ecx,ecx
       xor       edx,edx
       xchg      rcx,rdx
       jmp       short M00_L00
; Total bytes of code 101

in .NET 8, the JIT generates code for the operation directly, taking advantage of AVX512’s greater width and thus able to process a larger input without significantly increasing code size:

; Tests.Equals()
       vzeroupper
       mov       rax,[rcx+8]
       test      rax,rax
       jne       short M00_L00
       xor       ecx,ecx
       xor       edx,edx
       jmp       short M00_L01
M00_L00:
       lea       rcx,[rax+0C]
       mov       edx,[rax+8]
M00_L01:
       cmp       edx,3A
       jne       short M00_L02
       vmovups   zmm0,[rcx]
       vmovups   zmm1,[7FF820495080]
       vpternlogq zmm0,zmm1,[7FF8204950C0],56
       vmovups   zmm1,[rcx+34]
       vporq     zmm1,zmm1,[7FF820495100]
       vpternlogq zmm0,zmm1,[7FF820495140],0F6
       vxorps    ymm1,ymm1,ymm1
       vpcmpeqq  k1,zmm0,zmm1
       kortestb  k1,k1
       setb      al
       movzx     eax,al
       jmp       short M00_L03
M00_L02:
       xor       eax,eax
M00_L03:
       vzeroupper
       ret
; Total bytes of code 116

Even super simple operations get in on the action. Here we just have a cast from a ulong to a double:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "val")]
[DisassemblyDiagnoser]
public class Tests
{
    [Benchmark]
    [Arguments(1234567891011121314ul)]
    public double UIntToDouble(ulong val) => val;
}

Thanks to dotnet/runtime#84384 from @khushal1996, the code for that shrinks from this:

; Tests.UIntToDouble(UInt64)
       vzeroupper
       vxorps    xmm0,xmm0,xmm0
       vcvtsi2sd xmm0,xmm0,rdx
       test      rdx,rdx
       jge       short M00_L00
       vaddsd    xmm0,xmm0,qword ptr [7FF819E776C0]
M00_L00:
       ret
; Total bytes of code 26

using the AVX vcvtsi2sd instruction, to this:

; Tests.UIntToDouble(UInt64)
       vzeroupper
       vcvtusi2sd xmm0,xmm0,rdx
       ret
; Total bytes of code 10

using the AVX512 vcvtusi2sd instruction.

As yet another example, with dotnet/runtime#87641 we see the JIT using AVX512 to accelerate various Math APIs:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "left", "right")]
public class Tests
{
    [Benchmark]
    [Arguments(123456.789f, 23456.7890f)]
    public float Max(float left, float right) => MathF.Max(left, right);
}

| Method | Runtime  | Mean      | Ratio |
|--------|----------|-----------|-------|
| Max    | .NET 7.0 | 1.1936 ns | 1.00  |
| Max    | .NET 8.0 | 0.2865 ns | 0.24  |

Branching

Branching is integral to all meaningful code; while some algorithms are written in a branch-free manner, branch-free algorithms typically are challenging to get right and complicated to read, and typically are isolated to only small regions of code. For everything else, branching is the name of the game. Loops, if/else blocks, ternaries… it’s hard to imagine any real code without them. Yet they can also represent one of the more significant costs in an application. Modern hardware gets big speed boosts from pipelining, for example from being able to start reading and decoding the next instruction while the previous ones are still processing. That, of course, relies on the hardware knowing what the next instruction is. If there’s no branching, that’s easy, it’s whatever instruction comes next in the sequence. For when there is branching, CPUs have built-in support in the form of branch predictors, used to determine what the next instruction most likely will be, and they’re often right… but when they’re wrong, the cost incurred from that incorrect branch prediction can be huge. Compilers thus strive to minimize branching.

One way the impact of branches is reduced is by removing them completely. Redundant branch optimizers look for places where the compiler can prove that all paths leading to that branch will lead to the same outcome, such that the compiler can remove the branch and everything in the path not taken. Consider the following example:

// dotnet run -c Release -f net7.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private static readonly Random s_rand = new();
    private readonly string _text = "hello world!";

    [Params(1.0, 0.5)]
    public double Probability { get; set; }

    [Benchmark]
    public ReadOnlySpan<char> TrySlice() => SliceOrDefault(_text.AsSpan(), s_rand.NextDouble() < Probability ? 3 : 20);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ReadOnlySpan<char> SliceOrDefault(ReadOnlySpan<char> span, int i)
    {
        if ((uint)i < (uint)span.Length)
        {
            return span.Slice(i);
        }

        return default;
    }
}

Running that on .NET 7, we can glimpse into the impact of failed branch prediction. When we always take the branch the same way, the throughput is 2.5x what it was when it was impossible for the branch predictor to determine where we were going next:

| Method   | Probability | Mean     | Code Size |
|----------|-------------|----------|-----------|
| TrySlice | 0.5         | 8.845 ns | 136 B     |
| TrySlice | 1           | 3.436 ns | 136 B     |

We can also use this example for a .NET 8 improvement. That guarded ReadOnlySpan<char>.Slice call has its own branch to ensure that i is within the bounds of the span; we can see that very clearly by looking at the disassembly generated on .NET 7:

; Tests.TrySlice()
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,28
       vzeroupper
       mov       rdi,rcx
       mov       rsi,rdx
       mov       rcx,[rdi+8]
       test      rcx,rcx
       je        short M00_L01
       lea       rbx,[rcx+0C]
       mov       ebp,[rcx+8]
M00_L00:
       mov       rcx,1EBBFC01FA0
       mov       rcx,[rcx]
       mov       rcx,[rcx+8]
       mov       rax,[rcx]
       mov       rax,[rax+48]
       call      qword ptr [rax+20]
       vmovsd    xmm1,qword ptr [rdi+10]
       vucomisd  xmm1,xmm0
       ja        short M00_L02
       mov       eax,14
       jmp       short M00_L03
M00_L01:
       xor       ebx,ebx
       xor       ebp,ebp
       jmp       short M00_L00
M00_L02:
       mov       eax,3
M00_L03:
       cmp       eax,ebp
       jae       short M00_L04
       cmp       eax,ebp
       ja        short M00_L06
       mov       edx,eax
       lea       rdx,[rbx+rdx*2]
       sub       ebp,eax
       jmp       short M00_L05
M00_L04:
       xor       edx,edx
       xor       ebp,ebp
M00_L05:
       mov       [rsi],rdx
       mov       [rsi+8],ebp
       mov       rax,rsi
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       ret
M00_L06:
       call      qword ptr [7FF999FEB498]
       int       3
; Total bytes of code 136

In particular, look at M00_L03:

M00_L03:
       cmp       eax,ebp
       jae       short M00_L04
       cmp       eax,ebp
       ja        short M00_L06
       mov       edx,eax
       lea       rdx,[rbx+rdx*2]

At this point, either 3 or 20 (0x14) has been loaded into eax, and it’s being compared against ebp, which was loaded from the span’s Length earlier (mov ebp,[rcx+8]). There’s a very obvious redundant branch here, as the code does cmp eax,ebp, and then if it doesn’t jump as part of the jae, it does the exact same comparison again; the first is the one we wrote in SliceOrDefault, the second is the one from Slice itself, which got inlined.
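To make the redundancy concrete, here is a hand-expanded sketch of roughly what the method looks like once Slice has been inlined. The name SliceOrDefaultExpanded is hypothetical, and the inner check is a simplified stand-in for the validation Slice(int) performs rather than a copy of the library source:

```csharp
using System;

// Hypothetical, hand-inlined version of SliceOrDefault, for illustration only.
static ReadOnlySpan<char> SliceOrDefaultExpanded(ReadOnlySpan<char> span, int i)
{
    // Guard written by the caller: corresponds to the first cmp/jae pair.
    if ((uint)i < (uint)span.Length)
    {
        // Simplified stand-in for Slice's own validation: corresponds to the
        // second cmp/ja pair. Given the guard above, it can never be true.
        if ((uint)i > (uint)span.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(i));
        }

        return span.Slice(i);
    }

    return default;
}

Console.WriteLine(SliceOrDefaultExpanded("hello world!", 3).ToString()); // lo world!
Console.WriteLine(SliceOrDefaultExpanded("hello world!", 20).Length);    // 0
```

Once the outer guard has established that i is less than the length, the inner comparison is dead code, which is exactly the kind of thing a redundant branch optimization can prove and remove.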

On .NET 8, thanks to dotnet/runtime#72979 and dotnet/runtime#75804, that branch (and many others of a similar ilk) is optimized away. We can run the exact same benchmark, this time on .NET 8, and if we look at the assembly at the corresponding code block (which isn’t numbered exactly the same because of other changes):

M00_L04:
       cmp       eax,ebp
       jae       short M00_L07
       mov       ecx,eax
       lea       rdx,[rdi+rcx*2]

we can see that, indeed, the redundant branch has been eliminated.

Another way the overhead associated with branches (and branch misprediction) is removed is by avoiding them altogether. Sometimes simple bit manipulation tricks can be employed to avoid branches. dotnet/runtime#62689 from @pedrobsaila, for example, finds expressions like i >= 0 && j >= 0 for signed integers i and j, and rewrites them to the equivalent of (i | j) >= 0.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "i", "j")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    [Benchmark]
    [Arguments(42, 84)]
    public bool BothGreaterThanOrEqualZero(int i, int j) => i >= 0 && j >= 0;
}

Here instead of code like we’d get on .NET 7, which involves a branch for the &&:

; Tests.BothGreaterThanOrEqualZero(Int32, Int32)
       test      edx,edx
       jl        short M00_L00
       mov       eax,r8d
       not       eax
       shr       eax,1F
       ret
M00_L00:
       xor       eax,eax
       ret
; Total bytes of code 16

now on .NET 8, the result is branchless:

; Tests.BothGreaterThanOrEqualZero(Int32, Int32)
       or        edx,r8d
       mov       eax,edx
       not       eax
       shr       eax,1F
       ret
; Total bytes of code 11
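If the equivalence isn’t obvious, it follows from the sign bit: (i | j) has its sign bit set exactly when i or j does. The following is a small sanity-check sketch of that claim, not the JIT’s implementation; the helper names are mine:

```csharp
using System;

// Branching form, as written in the benchmark.
static bool BothNonNegativeBranching(int i, int j) => i >= 0 && j >= 0;

// Branch-free form: the OR of two ints is negative if and only if
// at least one of them has its sign bit set.
static bool BothNonNegativeBitwise(int i, int j) => (i | j) >= 0;

int[] samples = { int.MinValue, -84, -1, 0, 1, 42, int.MaxValue };
foreach (int a in samples)
{
    foreach (int b in samples)
    {
        if (BothNonNegativeBranching(a, b) != BothNonNegativeBitwise(a, b))
        {
            throw new InvalidOperationException($"Mismatch for ({a}, {b})");
        }
    }
}

Console.WriteLine("Both forms agree on all sample pairs.");
```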

Such bit tricks, however, only get you so far. To go further, both x86/64 and Arm provide conditional move instructions, like cmov on x86/64 and csel on Arm, that encapsulate the condition into the single instruction. For example, csel “conditionally selects” the value from one of two register arguments based on whether the condition is true or false and writes that value into the destination register. The instruction pipeline stays filled then because the instruction after the csel is always the next instruction; there’s no control flow that would result in a different instruction coming next.

The JIT in .NET 8 is now capable of emitting conditional instructions, on both x86/64 and Arm. With PRs like dotnet/runtime#73472 from @a74nh and dotnet/runtime#77728 from @a74nh, the JIT gains an additional “if conversion” optimization phase, where various conditional patterns are recognized and morphed into conditional nodes in the JIT’s internal representation; these can then later be emitted as conditional instructions, as was done by dotnet/runtime#78879, dotnet/runtime#81267, dotnet/runtime#82235, dotnet/runtime#82766, and dotnet/runtime#83089. Other PRs, like dotnet/runtime#84926 from @SwapnilGaikwad and dotnet/runtime#82031 from @SwapnilGaikwad, optimized which exact instructions would be employed, in these cases using the Arm cinv and cinc instructions.

We can see all this in a simple benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private static readonly Random s_rand = new();

    [Params(1.0, 0.5)]
    public double Probability { get; set; }

    [Benchmark]
    public FileOptions GetOptions() => GetOptions(s_rand.NextDouble() < Probability);

    private static FileOptions GetOptions(bool useAsync) => useAsync ? FileOptions.Asynchronous : FileOptions.None;
}

| Method     | Runtime  | Probability | Mean     | Ratio | Code Size |
|------------|----------|-------------|----------|-------|-----------|
| GetOptions | .NET 7.0 | 0.5         | 7.952 ns | 1.00  | 64 B      |
| GetOptions | .NET 8.0 | 0.5         | 2.327 ns | 0.29  | 86 B      |
| GetOptions | .NET 7.0 | 1           | 2.587 ns | 1.00  | 64 B      |
| GetOptions | .NET 8.0 | 1           | 2.357 ns | 0.91  | 86 B      |

Two things to notice:

  1. In .NET 7, the cost with a probability of 0.5 is 3x that of when it had a probability of 1.0, due to the branch predictor not being able to successfully predict which way the actual branch would go.
  2. In .NET 8, it doesn’t matter whether the probability is 0.5 or 1: the cost is the same (and cheaper than on .NET 7).

We can also look at the generated assembly to see the difference. In particular, on .NET 8, we see this for the generated assembly:

; Tests.GetOptions()
       push      rbx
       sub       rsp,20
       vzeroupper
       mov       rbx,rcx
       mov       rcx,2C54EC01E40
       mov       rcx,[rcx]
       mov       rcx,[rcx+8]
       mov       rax,offset MT_System.Random+XoshiroImpl
       cmp       [rcx],rax
       jne       short M00_L01
       call      qword ptr [7FFA2D790C88]; System.Random+XoshiroImpl.NextDouble()
M00_L00:
       vmovsd    xmm1,qword ptr [rbx+8]
       mov       eax,40000000
       xor       ecx,ecx
       vucomisd  xmm1,xmm0
       cmovbe    eax,ecx
       add       rsp,20
       pop       rbx
       ret
M00_L01:
       mov       rax,[rcx]
       mov       rax,[rax+48]
       call      qword ptr [rax+20]
       jmp       short M00_L00
; Total bytes of code 86

That vucomisd; cmovbe sequence in there is the comparison between the randomly-generated floating-point value and the probability threshold followed by the conditional move (“conditionally move if below or equal”).

There are many methods that implicitly benefit from these transformations. Take even a simple method, like Math.Max, whose code I’ve copied here:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser]
public class Tests
{
    [Benchmark]
    public int Max() => Max(1, 2);

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Max(int val1, int val2)
    {
        return (val1 >= val2) ? val1 : val2;
    }
}

That pattern should look familiar. Here’s the assembly we get on .NET 7:

; Tests.Max(Int32, Int32)
       cmp       ecx,edx
       jge       short M01_L00
       mov       eax,edx
       ret
M01_L00:
       mov       eax,ecx
       ret
; Total bytes of code 10

The two arguments come in via the ecx and edx registers. They’re compared, and if the first argument is greater than or equal to the second, it jumps down to the bottom where the first argument is moved into eax as the return value; if it wasn’t, then the second value is moved into eax. And on .NET 8:

; Tests.Max(Int32, Int32)
       cmp       ecx,edx
       mov       eax,edx
       cmovge    eax,ecx
       ret
; Total bytes of code 8

Again the two arguments come in via the ecx and edx registers, and they’re compared. The second argument is then moved into eax as the return value. If the comparison showed that the first argument was greater than or equal to the second (the ge in cmovge), it’s then moved into eax (overwriting the second argument that was just moved there). Fun.

Note if you ever find yourself wanting to do a deeper-dive into this area, BenchmarkDotNet has some excellent additional tools at your disposal. On Windows, it enables you to collect hardware counters, which expose a wealth of information about how things actually executed on the hardware, whether it be number of instructions retired, cache misses, or branch mispredictions. To use it, add another package reference to your .csproj:

<PackageReference Include="BenchmarkDotNet.Diagnostics.Windows" Version="0.13.8" />

and add an additional attribute to your tests class:

[HardwareCounters(HardwareCounter.BranchMispredictions, HardwareCounter.BranchInstructions)]

Then make sure you’re running the benchmarks from an elevated / admin terminal. When I do that, now I see this:

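| Method     | Runtime  | Probability | Mean     | Ratio | BranchMispredictions/Op | BranchInstructions/Op |
|------------|----------|-------------|----------|-------|-------------------------|-----------------------|
| GetOptions | .NET 7.0 | 0.5         | 8.585 ns | 1.00  | 1                       | 5                     |
| GetOptions | .NET 8.0 | 0.5         | 2.488 ns | 0.29  | 0                       | 4                     |
| GetOptions | .NET 7.0 | 1           | 2.783 ns | 1.00  | 0                       | 4                     |
| GetOptions | .NET 8.0 | 1           | 2.531 ns | 0.91  | 0                       | 4                     |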

We can see it confirms what we already knew: on .NET 7 with a 0.5 probability, it ends up mispredicting a branch.

The C# compiler (aka “Roslyn”) also gets in on the branch-elimination game in .NET 8, for a very specific kind of branch. In .NET, while we think of System.Boolean as only being a two-value type (false and true), sizeof(bool) is actually one byte. That means a bool can technically have 256 different values, where 0 is considered false and [1,255] are all considered true. Thankfully, unless a developer is poking around the edges of interop or otherwise using unsafe code to purposefully manipulate these other values, developers can remain blissfully unaware of the actual numeric value here, for two reasons. First, C# doesn’t consider bool to be a numerical type, and thus you can’t perform arithmetic on it or cast it to a type like int. Second, all of the bools produced by the runtime and C# are normalized to actually be 0 or 1 in value, e.g. the cgt IL instruction is documented as “If value1 is greater than value2, 1 is pushed onto the stack; otherwise 0 is pushed onto the stack.” There is a class of algorithms, however, where being able to rely on such 0 and 1 values is handy, and we were just talking about them: branch-free algorithms.

Let’s say we didn’t have the JIT’s new-found ability to use conditional moves and we wanted to write our own ConditionalSelect method for integers:

static int ConditionalSelect(bool condition, int whenTrue, int whenFalse);

If we could rely on bool always being 0 or 1 (we can’t), and if we could do arithmetic on a bool (we can’t), then we could use the behavior of multiplication to implement our ConditionalSelect function. Anything multiplied by 0 is 0, and anything multiplied by 1 is itself, so we could write our ConditionalSelect like this:

// pseudo-code; this won't compile!
static int ConditionalSelect(bool condition, int whenTrue, int whenFalse) =>
    (whenTrue  *  condition) +
    (whenFalse * !condition);

Then if condition is 1, whenTrue * condition would be whenTrue and whenFalse * !condition would be 0, such that the whole expression would evaluate to whenTrue. And, conversely, if condition is 0, whenTrue * condition would be 0 and whenFalse * !condition would be whenFalse, such that the whole expression would evaluate to whenFalse. As noted, though, we can’t write the above, but we could write this:

static int ConditionalSelect(bool condition, int whenTrue, int whenFalse) =>
    (whenTrue  * (condition ? 1 : 0)) +
    (whenFalse * (condition ? 0 : 1));

That provides the exact semantics we want… but we’ve introduced two branches into our supposedly branch-free algorithm. This is the IL produced for that ConditionalSelect in .NET 7:

.method private hidebysig static
    int32 ConditionalSelect (bool condition, int32 whenTrue, int32 whenFalse) cil managed
{
    .maxstack 8

    IL_0000: ldarg.1
    IL_0001: ldarg.0
    IL_0002: brtrue.s IL_0007
    IL_0004: ldc.i4.0
    IL_0005: br.s IL_0008
    IL_0007: ldc.i4.1
    IL_0008: mul
    IL_0009: ldarg.2
    IL_000a: ldarg.0
    IL_000b: brtrue.s IL_0010
    IL_000d: ldc.i4.1
    IL_000e: br.s IL_0011
    IL_0010: ldc.i4.0
    IL_0011: mul
    IL_0012: add
    IL_0013: ret
}

Note all those brtrue.s and br.s instructions in there. Are they necessary, though? Earlier I noted that the runtime will only produce bools with a value of 0 or 1. And thanks to dotnet/roslyn#67191, the C# compiler now recognizes that and optimizes the pattern (b ? 1 : 0) to be branchless. Our same ConditionalSelect function now in .NET 8 compiles to this:

.method private hidebysig static
    int32 ConditionalSelect (bool condition, int32 whenTrue, int32 whenFalse) cil managed
{
    .maxstack 8

    IL_0000: ldarg.1
    IL_0001: ldarg.0
    IL_0002: ldc.i4.0
    IL_0003: cgt.un
    IL_0005: mul
    IL_0006: ldarg.2
    IL_0007: ldarg.0
    IL_0008: ldc.i4.0
    IL_0009: ceq
    IL_000b: mul
    IL_000c: add
    IL_000d: ret
}

Zero branch instructions. Of course, you wouldn’t actually want to write this function like this anymore; just because it’s branch-free doesn’t mean it’s the most efficient. On .NET 8, here’s the assembly code produced by the JIT for the above:

       movzx    rax, cl
       xor      ecx, ecx
       test     eax, eax
       setne    cl
       imul     ecx, edx
       test     eax, eax
       sete     al
       movzx    rax, al
       imul     eax, r8d
       add      eax, ecx
       ret

whereas if you just wrote it as:

static int ConditionalSelect(bool condition, int whenTrue, int whenFalse) =>
    condition ? whenTrue : whenFalse;

here’s what you’d get:

       test     cl, cl
       mov      eax, r8d
       cmovne   eax, edx
       ret

Even so, this C# compiler optimization is useful for other branch-free algorithms. Let’s say I wanted to write a Compare method that would compare two ints, returning -1 if the first is less than the second, 0 if they’re equal, and 1 if the first is greater than the second. I could write that like this:

static int Compare(int x, int y)
{
    if (x < y) return -1;
    if (x > y) return 1;
    return 0;
}

Simple, but every invocation will incur at least one branch, if not two. With the (b ? 1 : 0) optimization, we can instead write it like this:

static int Compare(int x, int y)
{
    int gt = (x > y) ? 1 : 0;
    int lt = (x < y) ? 1 : 0;
    return gt - lt;
}

This is now branch-free, with the C# compiler producing:

    IL_0000: ldarg.0
    IL_0001: ldarg.1
    IL_0002: cgt
    IL_0004: ldarg.0
    IL_0005: ldarg.1
    IL_0006: clt
    IL_0008: stloc.0
    IL_0009: ldloc.0
    IL_000a: sub
    IL_000b: ret

and, from that, the JIT producing:

       xor      eax, eax
       cmp      ecx, edx
       setg     al
       setl     cl
       movzx    rcx, cl
       sub      eax, ecx
       ret

Does that mean that everyone should now be running to rewrite their algorithms in a branch-free manner? Most definitely not. It’s another tool in your tool belt, and in some cases it’s quite beneficial, especially when it can provide more consistent throughput results due to doing the same work regardless of outcome. It’s not always a win, however, and in general it’s best not to try to outsmart the compiler. Take the example we just looked at. There’s a function with that exact implementation in the core libraries: int.CompareTo. And if you look at its implementation in .NET 8, you’ll find that it’s still using the branch-based implementation. Why? Because it often yields better results, in particular in the common case where the operation gets inlined and the JIT is able to combine the branches in the CompareTo method with ones based on processing the result of CompareTo. Most uses of CompareTo involve additional branching based on its result, such as in a quick sort partitioning step that’s deciding whether to move elements. So let’s take an example where code makes a decision based on the result of such a comparison:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser]
public class Tests
{
    private int _x = 2, _y = 1;

    [Benchmark]
    public int GreaterThanOrEqualTo_Branching()
    {
        if (Compare_Branching(_x, _y) >= 0)
        {
            return _x * 2;
        }

        return _y * 3;
    }

    [Benchmark]
    public int GreaterThanOrEqualTo_Branchless()
    {
        if (Compare_Branchless(_x, _y) >= 0)
        {
            return _x * 2;
        }

        return _y * 3;
    }

    private static int Compare_Branching(int x, int y)
    {
        if (x < y) return -1;
        if (x > y) return 1;
        return 0;
    }

    private static int Compare_Branchless(int x, int y)
    {
        int gt = (x > y) ? 1 : 0;
        int lt = (x < y) ? 1 : 0;
        return gt - lt;
    }
}

And the resulting assembly:

[Image: Branching vs Branchless Assembly Difference]

Note that both implementations now have just one branch (a jl in the “branching” case and a js in the “branchless” case), and the “branching” implementation results in less assembly code.

Bounds Checking

Arrays, strings, and spans are all bounds checked by the runtime. That means that indexing into one of these data structures incurs validation to ensure that the index is within the bounds of the data structure. For example, the Get(byte[],int) method here:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser]
public class Tests
{
    private byte[] _array = new byte[8];
    private int _index = 4;

    [Benchmark]
    public void Get() => Get(_array, _index);

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static byte Get(byte[] array, int index) => array[index];
}

results in this code being generated for the method:

; Tests.Get(Byte[], Int32)
       sub       rsp,28
       cmp       edx,[rcx+8]
       jae       short M01_L00
       mov       eax,edx
       movzx     eax,byte ptr [rcx+rax+10]
       add       rsp,28
       ret
M01_L00:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 27

Here, the byte[] is passed in rcx, the int index is in edx, and the code is comparing the value of the index against the value stored at an 8-byte offset from the beginning of the array: that’s where the array’s length is stored. The jae instruction (jump if above or equal) is an unsigned comparison, such that if (uint)index >= (uint)array.Length, it’ll jump to M01_L00, where we see a call to a helper function CORINFO_HELP_RNGCHKFAIL that will throw an IndexOutOfRangeException. All of that is the “bounds check.” The actual access into the array is the two mov and movzx instructions, where the index is moved into eax, and then the value located at rcx (the address of the array) + rax (the index) + 0x10 (the offset of the start of the data in the array) is moved into the return eax register.
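Expressed back in C#, the check the JIT emitted is roughly equivalent to the following sketch. GetWithExplicitCheck is a hypothetical helper written for illustration; in real code the runtime adds this check for you:

```csharp
using System;

// Hypothetical helper that spells out the check the JIT emits for array[index].
static byte GetWithExplicitCheck(byte[] array, int index)
{
    // Casting to uint folds "index < 0" and "index >= array.Length" into one
    // unsigned comparison: a negative int becomes a uint of at least 2^31,
    // which is larger than any possible array length.
    if ((uint)index >= (uint)array.Length)
    {
        throw new IndexOutOfRangeException();
    }

    return array[index];
}

byte[] data = { 10, 20, 30 };
Console.WriteLine(GetWithExplicitCheck(data, 1)); // 20
```

The single unsigned comparison is why the assembly needs only one cmp/jae pair rather than separate tests for negative and too-large indices.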

It’s the runtime’s responsibility to ensure that all accesses are guaranteed in bounds. It can do so with a bounds check. But it can also do so by proving that the index is always in range, in which case it can elide adding a bounds check that would only add overhead and provide zero benefit. Every .NET release, the JIT improves its ability to recognize patterns that don’t need a bounds check added because there’s no way the access could be out of range. And .NET 8 is no exception, with it learning several new and valuable tricks.

One such trick comes from dotnet/runtime#84231, where it learns how to avoid bounds checks in a pattern that’s very prevalent in collections, in particular in hash tables. In a hash table, you generally compute a hash code for a key and then use that hash code to index into an array (often referred to as “buckets”). As the hash code might be any int and the buckets array is invariably going to be much smaller than the full range of a 32-bit integer, all of the hash codes need to be mapped down to an element in the array, and a good way to do that is by mod’ing the hash code by the array’s length, e.g.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private readonly int[] _array = new int[7];

    [Benchmark]
    public int GetBucket() => GetBucket(_array, 42);

    private static int GetBucket(int[] buckets, int hashcode) =>
        buckets[(uint)hashcode % buckets.Length];
}

In .NET 7, that produces:

; Tests.GetBucket()
       sub       rsp,28
       mov       rcx,[rcx+8]
       mov       eax,2A
       mov       edx,[rcx+8]
       mov       r8d,edx
       xor       edx,edx
       idiv      r8
       cmp       rdx,r8
       jae       short M00_L00
       mov       eax,[rcx+rdx*4+10]
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 44

Note the CORINFO_HELP_RNGCHKFAIL, the tell-tale sign of a bounds check. Now in .NET 8, the JIT recognizes that it’s impossible for a uint value %’d by an array’s length to be out of bounds of that array; either the array’s Length is greater than 0, in which case the result of the % will always be >= 0 and < array.Length, or the Length is 0, and % 0 will throw an exception. As such, it can elide the bounds check:

; Tests.GetBucket()
       mov       rcx,[rcx+8]
       mov       eax,2A
       mov       r8d,[rcx+8]
       xor       edx,edx
       div       r8
       mov       eax,[rcx+rdx*4+10]
       ret
; Total bytes of code 23
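The reasoning can be checked by hand. This sketch uses the same expression as GetBucket but returns the computed index so it can be inspected; BucketIndex and the sample values are my own, chosen to include negative hash codes:

```csharp
using System;

// Same mapping as GetBucket in the benchmark, but returning the index itself.
// C# evaluates uint % int as long, so the result is a non-negative long.
static long BucketIndex(int[] buckets, int hashCode) => (uint)hashCode % buckets.Length;

int[] table = new int[7];
foreach (int code in new[] { int.MinValue, -1, 0, 42, int.MaxValue })
{
    long index = BucketIndex(table, code);
    if (index < 0 || index >= table.Length)
    {
        throw new InvalidOperationException($"Index {index} is out of range for {code}");
    }

    // For comparison: without the uint cast, % keeps the sign of the dividend,
    // so negative hash codes would produce negative (invalid) indices.
    Console.WriteLine($"{code}: unsigned -> {index}, signed -> {code % table.Length}");
}
```

The uint cast is what makes the proof possible: a signed remainder can be negative, so without it the JIT would still need the check.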

Now consider this:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private readonly string _s = "\"Hello, World!\"";

    [Benchmark]
    public bool IsQuoted() => IsQuoted(_s);

    private static bool IsQuoted(string s) =>
        s.Length >= 2 && s[0] == '"' && s[^1] == '"';
}

Our function is checking to see whether the supplied string begins and ends with a quote. It needs to be at least two characters long, and the first and last characters need to be quotes (s[^1] is shorthand for and expanded by the C# compiler into the equivalent of s[s.Length - 1]). Here’s the .NET 7 assembly:

; Tests.IsQuoted(System.String)
       sub       rsp,28
       mov       eax,[rcx+8]
       cmp       eax,2
       jl        short M01_L00
       cmp       word ptr [rcx+0C],22
       jne       short M01_L00
       lea       edx,[rax-1]
       cmp       edx,eax
       jae       short M01_L01
       mov       eax,edx
       cmp       word ptr [rcx+rax*2+0C],22
       sete      al
       movzx     eax,al
       add       rsp,28
       ret
M01_L00:
       xor       eax,eax
       add       rsp,28
       ret
M01_L01:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 58

Note that our function is indexing into the string twice, and the assembly does have a call CORINFO_HELP_RNGCHKFAIL at the end of the method, but there’s only one jae referring to the location of that call. That’s because the JIT already knows to avoid the bounds check on the s[0] access: it sees that it’s already been verified that the string’s Length >= 2, so it’s safe to index without a bounds check into any index < 2. But, we do still have the bounds check for the s[s.Length - 1]. Now in .NET 8, we get this:

; Tests.IsQuoted(System.String)
       mov       eax,[rcx+8]
       cmp       eax,2
       jl        short M01_L00
       cmp       word ptr [rcx+0C],22
       jne       short M01_L00
       dec       eax
       cmp       word ptr [rcx+rax*2+0C],22
       sete      al
       movzx     eax,al
       ret
M01_L00:
       xor       eax,eax
       ret
; Total bytes of code 33

Note the distinct lack of the call CORINFO_HELP_RNGCHKFAIL; no more bounds checks. Not only did the JIT recognize that s[0] is safe because s.Length >= 2, thanks to dotnet/runtime#84213 it also recognized that since s.Length >= 2, s.Length - 1 is >= 0 and < s.Length, which means it’s in-bounds and thus no range check is needed.
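For reference, here is a small sketch showing the two spellings side by side along with the edge cases the Length >= 2 guard protects against. IsQuotedExpanded and the sample strings are mine, added for illustration:

```csharp
using System;

// As written in the benchmark, using the index-from-end operator.
static bool IsQuoted(string s) =>
    s.Length >= 2 && s[0] == '"' && s[^1] == '"';

// The same check with s[^1] spelled out as s[s.Length - 1].
static bool IsQuotedExpanded(string s) =>
    s.Length >= 2 && s[0] == '"' && s[s.Length - 1] == '"';

string[] samples = { "", "\"", "\"\"", "\"a", "a\"", "\"Hello, World!\"" };
foreach (string sample in samples)
{
    if (IsQuoted(sample) != IsQuotedExpanded(sample))
    {
        throw new InvalidOperationException($"Mismatch for [{sample}]");
    }

    Console.WriteLine($"[{sample}] -> {IsQuoted(sample)}");
}
```

Because the length check comes first and && short-circuits, neither index expression is ever evaluated for the empty or one-character inputs, which is the fact the JIT relies on to drop both range checks.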

Constant Folding

Another important operation employed by compilers is constant folding (and the closely related constant propagation). Constant folding is just a fancy name for a compiler evaluating expressions at compile-time, e.g. if you have 2 * 3, rather than emitting a multiplication instruction, it can just do the multiplication at compile-time and substitute 6. Constant propagation is then the act of taking that new constant and using it anywhere this expression’s result feeds, e.g. if you have:

int a = 2 * 3;
int b = a * 4;

a compiler can instead pretend it was:

int a = 6;
int b = 24;

I bring this up here, after we just talked about bounds-check elimination, because there are scenarios where constant folding and bounds check elimination go hand-in-hand. If we can determine a data structure’s length at compile-time, and we can determine an index at compile-time, then also at compile-time we can determine whether the index is in bounds and avoid the bounds check. We can also take it further: if we can determine not only the data structure’s length but also its contents, then we can do the indexing at compile-time and substitute the value from the data structure.

Consider this example, which is similar in nature to the kind of code types often have in theirToString orTryFormat implementations:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Runtime.CompilerServices;BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);[HideColumns("Error", "StdDev", "Median", "RatioSD")][DisassemblyDiagnoser]public class Tests{    [Benchmark]    [Arguments(42)]    public string Format(int value) => Format(value, "B");    [MethodImpl(MethodImplOptions.AggressiveInlining)]    static string Format(int value, ReadOnlySpan<char> format)    {        if (format.Length == 1)        {            switch (format[0] | 0x20)            {                case 'd': return DecimalFormat(value);                case 'x': return HexFormat(value);                case 'b': return BinaryFormat(value);            }        }        return FallbackFormat(value, format);    }    [MethodImpl(MethodImplOptions.NoInlining)] private static string DecimalFormat(int value) => null;    [MethodImpl(MethodImplOptions.NoInlining)] private static string HexFormat(int value) => null;    [MethodImpl(MethodImplOptions.NoInlining)] private static string BinaryFormat(int value) => null;    [MethodImpl(MethodImplOptions.NoInlining)] private static string FallbackFormat(int value, ReadOnlySpan<char> format) => null;}

We have aFormat(int value, ReadOnlySpan<char> format) method for formatting theint value according to the specifiedformat. The call site is explicit about the format to use, as many such call sites are, explicitly passing"B" here. The implementation is then special-casing formats that are one-character long and match in an ignore-case manner against one of three known formats (it’s using an ASCII trick based on the values of the lowercase letters being one bit different from their uppercase counterparts, such thatOR‘ing an uppercase ASCII letter with0x20 lowercases it). If we look at the assembly generated for this method in .NET 7, we get this:

; Tests.Format(Int32)
       sub       rsp,38
       xor       eax,eax
       mov       [rsp+28],rax
       mov       ecx,edx
       mov       rax,251C4801418
       mov       rax,[rax]
       add       rax,0C
       movzx     edx,word ptr [rax]
       or        edx,20
       cmp       edx,62
       je        short M00_L01
       cmp       edx,64
       je        short M00_L00
       cmp       edx,78
       jne       short M00_L02
       call      qword ptr [7FFF3DD47918]; Tests.HexFormat(Int32)
       jmp       short M00_L03
M00_L00:
       call      qword ptr [7FFF3DD47900]; Tests.DecimalFormat(Int32)
       jmp       short M00_L03
M00_L01:
       call      qword ptr [7FFF3DD47930]; Tests.BinaryFormat(Int32)
       jmp       short M00_L03
M00_L02:
       mov       [rsp+28],rax
       mov       dword ptr [rsp+30],1
       lea       rdx,[rsp+28]
       call      qword ptr [7FFF3DD47948]; Tests.FallbackFormat
M00_L03:
       nop
       add       rsp,38
       ret
; Total bytes of code 105

We can see the code here from Format(Int32, ReadOnlySpan<char>), but this is the code for Format(Int32), so the callee was successfully inlined. We also don’t see any code for the format.Length == 1 check (the first cmp is part of the switch), nor do we see any signs of a bounds check (there’s no call CORINFO_HELP_RNGCHKFAIL). We do, however, see it loading the first character from format:

mov       rax,251C4801418       ; loads the address of where the format const string reference is stored
mov       rax,[rax]             ; loads the address of format
add       rax,0C                ; loads the address of format's first character
movzx     edx,word ptr [rax]    ; reads the first character of format

and then using the equivalent of a cascading if/else. Now let’s look at .NET 8:

; Tests.Format(Int32)
       sub       rsp,28
       mov       ecx,edx
       call      qword ptr [7FFEE0BAF4C8]; Tests.BinaryFormat(Int32)
       nop
       add       rsp,28
       ret
; Total bytes of code 18

Whoa. It not only saw that format’s Length was 1 and not only was able to avoid the bounds check, it actually read the first character, lowercased it, and matched it against all the switch branches, such that the entire operation was constant folded and propagated away, leaving just a call to BinaryFormat. That’s primarily thanks to dotnet/runtime#78593.
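To make the ASCII trick from that switch concrete, here is a minimal sketch of the lowercasing step in isolation. The ToLowerAscii helper is a hypothetical name introduced only for illustration; it is not part of the benchmark or of .NET.

```csharp
using System;

// Print each letter alongside the result of OR'ing it with 0x20.
foreach (char c in "BbDdXx")
{
    char lowered = ToLowerAscii(c);
    Console.WriteLine($"'{c}' (0x{(int)c:X2}) | 0x20 = '{lowered}' (0x{(int)lowered:X2})");
}

// Hypothetical helper: sets bit 5 (0x20), which is the only bit that differs
// between an uppercase ASCII letter and its lowercase counterpart.
// Only meaningful for A-Z and a-z; other characters are changed arbitrarily
// (for example, '@' becomes '`'), which is why the benchmark's switch only
// compares the result against known lowercase letters.
static char ToLowerAscii(char c) => (char)(c | 0x20);
```

Note that 'b' is 0x62, 'd' is 0x64, and 'x' is 0x78, which are the same constants the .NET 7 assembly above compares against after its or edx,20 instruction.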

There are a multitude of other such improvements, such as dotnet/runtime#77593, which enables it to constant fold the length of a string or T[] stored in a static readonly field. Consider:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private static readonly string s_newline = Environment.NewLine;

    [Benchmark]
    public bool IsLineFeed() => s_newline.Length == 1 && s_newline[0] == '\n';
}

On .NET 7, I get the following assembly:

; Tests.IsLineFeed()
       mov       rax,18AFF401F78
       mov       rax,[rax]
       mov       edx,[rax+8]
       cmp       edx,1
       jne       short M00_L00
       cmp       word ptr [rax+0C],0A
       sete      al
       movzx     eax,al
       ret
M00_L00:
       xor       eax,eax
       ret
; Total bytes of code 36

This is effectively a 1:1 translation of the C#, with not much interesting happening: it loads the string from s_newline and compares its Length to 1; if it doesn’t match, it returns 0 (false), otherwise it compares the first character of the string against 0xA (line feed) and returns whether they match. Now, .NET 8:

; Tests.IsLineFeed()
       xor       eax,eax
       ret
; Total bytes of code 3

That’s more interesting. I ran this code on Windows, where Environment.NewLine is "\r\n". The JIT has constant folded the entire operation, seeing that the length is not 1, such that the whole operation boils down to just returning false.

Or consider dotnet/runtime#78783 and dotnet/runtime#80661, which teach the JIT how to actually peer into the contents of an “RVA static.” These are “Relative Virtual Address” static fields, which is a fancy way of saying they live in the assembly’s data section. The C# compiler has optimizations that put constant data into such fields; for example, when you write:

private static ReadOnlySpan<byte> Prefix => "http://"u8;

the C# compiler will actually emit IL like this:

.method private hidebysig specialname static
    valuetype [System.Runtime]System.ReadOnlySpan`1<uint8> get_Prefix () cil managed
{
    .maxstack 8
    IL_0000: ldsflda int64 '<PrivateImplementationDetails>'::'6709A82409D4C9E2EC04E1E71AB12303402A116B0F923DB8114F69CB05F1E926'
    IL_0005: ldc.i4.7
    IL_0006: newobj instance void valuetype [System.Runtime]System.ReadOnlySpan`1<uint8>::.ctor(void*, int32)
    IL_000b: ret
}
...
.class private auto ansi sealed '<PrivateImplementationDetails>'
    extends [System.Runtime]System.Object
{
    .field assembly static initonly int64 '6709A82409D4C9E2EC04E1E71AB12303402A116B0F923DB8114F69CB05F1E926' at I_00002868
    .data cil I_00002868 = bytearray ( 68 74 74 70 3a 2f 2f 00 )
}

With these PRs, when indexing into such RVA statics, the JIT is now able to actually read the data at the relevant location, constant folding the operation to the value at that location. So, take the following benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    [Benchmark]
    public bool IsWhiteSpace() => char.IsWhiteSpace('\n');
}

The char.IsWhiteSpace method is implemented via a lookup into such an RVA static, using the char passed in as an index into it. If the index ends up being a const, now on .NET 8 the whole operation can be constant folded away. .NET 7:

; Tests.IsWhiteSpace()
       xor       eax,eax
       test      byte ptr [7FFF9BCCD83A],80
       setne     al
       ret
; Total bytes of code 13

and .NET 8:

; Tests.IsWhiteSpace()
       mov       eax,1
       ret
; Total bytes of code 6

You get the idea. Of course, a developer hopefully wouldn’t explicitly write char.IsWhiteSpace('\n'), but such code can result nonetheless, especially via inlining.
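As a rough sketch of how that can happen, consider a small helper that is reasonable on its own but ends up being called with a constant. The IsSeparator and LineBreakIsSeparator names are hypothetical, chosen only for this illustration; whether the JIT actually inlines and folds the call depends on the runtime version and tiering.

```csharp
using System;
using System.Runtime.CompilerServices;

Console.WriteLine(LineBreakIsSeparator());

// Hypothetical general-purpose helper: nothing constant about it in isolation.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static bool IsSeparator(char c) => char.IsWhiteSpace(c) || c == ',';

// Hypothetical caller: once IsSeparator is inlined here, the JIT effectively
// sees char.IsWhiteSpace('\n'), even though nobody wrote that expression.
static bool LineBreakIsSeparator() => IsSeparator('\n');
```

No one typed char.IsWhiteSpace('\n') in this sketch, yet after inlining that is the expression the JIT gets to optimize.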

There are a multitude of these kinds of improvements in .NET 8. dotnet/runtime#77102 made it so that a static readonly value type’s primitive fields can be constant folded as if they were themselves static readonly fields, and dotnet/runtime#80431 extended that to strings. dotnet/runtime#85804 taught the JIT how to handle RuntimeTypeHandle.ToIntPtr(typeof(T).TypeHandle) (which is used in methods like GC.AllocateUninitializedArray), while dotnet/runtime#87101 taught it to handle obj.GetType() (such that if the JIT knows the exact type of an instance obj, it can replace the GetType() invocation with the known answer). However, one of my favorite examples, purely because of just how magical it seems, comes from a series of PRs, including dotnet/runtime#80622, dotnet/runtime#78961, dotnet/runtime#80888, and dotnet/runtime#81005. Together, they enable this:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    [Benchmark]
    public DateTime Get() => new DateTime(2023, 9, 1);
}

to produce this:

; Tests.Get()
       mov       rax,8DBAA7E629B4000
       ret
; Total bytes of code 11

The JIT was able to successfully inline and constant fold the entire operation down to a single constant. That 8DBAA7E629B4000 in that mov instruction is the value for the private readonly ulong _dateData field that backs DateTime. Sure enough, if you run:

new DateTime(0x8DBAA7E629B4000)

you’ll see it produces:

[9/1/2023 12:00:00 AM]

Very cool.
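The same relationship can be checked from the other direction using only public API. This is a small sketch, relying on the fact that for a DateTime with an unspecified kind the public Ticks value carries the same number as the constant in the disassembly above.

```csharp
using System;

// Build the date the benchmark builds and look at its tick count in hex.
var date = new DateTime(2023, 9, 1);
long ticks = date.Ticks;
Console.WriteLine($"Ticks: 0x{ticks:X}");

// Round-trip: construct a DateTime from the constant seen in the assembly.
var roundTripped = new DateTime(0x8DBAA7E629B4000);
Console.WriteLine($"Same date: {roundTripped == date}");
```

In other words, the constant the JIT bakes into the mov is just the tick count for midnight on September 1, 2023.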

Non-GC Heap

Earlier we saw an example of the codegen when loading a constant string. As a reminder, this code:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    [Benchmark]
    public string GetPrefix() => "https://";
}

results in this assembly on .NET 7:

; Tests.GetPrefix()
       mov       rax,126A7C01498
       mov       rax,[rax]
       ret
; Total bytes of code 14

There are two mov instructions here. The first is loading the location where the address to the string object is stored, and the second is reading the address stored at that location (this requires two movs because on x64 there’s no addressing mode that supports moving the value stored at an absolute address larger than 32 bits). Even though we’re dealing with a string literal here, such that the data for the string is constant, that constant data still ends up being copied into a heap-allocated string object. That object is interned, such that there’s only one of them in the process, but it’s still a heap object, and that means it’s still subject to being moved around by the GC. That means the JIT can’t just bake in the address of the string object, since the address can change, hence why it needs to read the address each time, in order to know where it currently is. Or, does it?

What if we could ensure that the string object for this literal is created some place where it would never move, for example on the Pinned Object Heap (POH)? Then the JIT could avoid the indirection and instead just hardcode the address of the string, knowing that it would never move. Of course, the POH guarantees objects on it will never move, but it doesn’t guarantee addresses to them will always be valid; after all, it doesn’t root the objects, so objects on the POH are still collectible by the GC, and if they were collected, their addresses would be pointing at garbage or other data that ended up reusing the space.

To address that, .NET 8 introduces a new mechanism used by the JIT for these kinds of situations: the Non-GC Heap (an evolution of the older “Frozen Segments” concept used by Native AOT). The JIT can ensure relevant objects are allocated on the Non-GC Heap, which is, as the name suggests, not managed by the GC and is intended to store objects where the JIT can prove the object has no references the GC needs to be aware of and will be rooted for the lifetime of the process, which in turn implies it can’t be part of an unloadable context.

[Figure: Heaps where .NET Objects Live]

The JIT can then avoid indirections in code generated to access that object, instead just hardcoding the object’s address. That’s exactly what it does now for string literals, as of dotnet/runtime#49576. Now in .NET 8, that same method above results in this assembly:

; Tests.GetPrefix()
       mov       rax,227814EAEA8
       ret
; Total bytes of code 11

dotnet/runtime#75573 makes a similar play, but with the RuntimeType objects produced by typeof(T) (subject to various constraints, like the T not coming from an unloadable assembly, in which case permanently rooting the object would prevent unloading). Again, we can see this with a simple benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    [Benchmark]
    public Type GetTestsType() => typeof(Tests);
}

where we get the following difference between .NET 7 and .NET 8:

; .NET 7
; Tests.GetTestsType()
       sub       rsp,28
       mov       rcx,offset MT_Tests
       call      CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPE
       nop
       add       rsp,28
       ret
; Total bytes of code 25

; .NET 8
; Tests.GetTestsType()
       mov       rax,1E0015E73F8
       ret
; Total bytes of code 11

The same capability can be extended to other kinds of objects, as it is in dotnet/runtime#85559 (which is based on work from dotnet/runtime#76112), making Array.Empty<T>() cheaper by allocating the empty arrays on the Non-GC Heap.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    [Benchmark]
    public string[] Test() => Array.Empty<string>();
}

; .NET 7
; Tests.Test()
       mov       rax,17E8D801FE8
       mov       rax,[rax]
       ret
; Total bytes of code 14

; .NET 8
; Tests.Test()
       mov       rax,1A0814EAEA8
       ret
; Total bytes of code 11

And as of dotnet/runtime#77737, it also applies to the heap object associated with static value type fields, at least those that don’t contain any GC references. Wait, heap object for value type fields? Surely, Stephen, you got that wrong, value types aren’t allocated on the heap when stored in fields. Well, actually they are when they’re stored in static fields; the runtime creates a heap-allocated box associated with that field to store the value (but the same box is reused for all writes to that field). And that means for a benchmark like this:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public partial class Tests
{
    private static readonly ConfigurationData s_config = ConfigurationData.ReadData();

    [Benchmark]
    public TimeSpan GetRefreshInterval() => s_config.RefreshInterval;

    // Struct for storing fictional configuration data that might be read from a configuration file.
    private struct ConfigurationData
    {
        public static ConfigurationData ReadData() => new ConfigurationData
        {
            Index = 0x12345,
            Id = Guid.NewGuid(),
            IsEnabled = true,
            RefreshInterval = TimeSpan.FromSeconds(100)
        };

        public int Index;
        public Guid Id;
        public bool IsEnabled;
        public TimeSpan RefreshInterval;
    }
}

we see the following assembly code for reading that RefreshInterval on .NET 7:

; Tests.GetRefreshInterval()
       mov       rax,13D84001F78
       mov       rax,[rax]
       mov       rax,[rax+20]
       ret
; Total bytes of code 18

That code is loading the address of the field, reading from it the address of the box object, and then reading from that box object the TimeSpan value that’s stored inside of it. But, now on .NET 8 we get the assembly you’ve now come to expect:

; Tests.GetRefreshInterval()
       mov       rax,20D9853AE48
       mov       rax,[rax]
       ret
; Total bytes of code 14

The box gets allocated on the Non-GC Heap, which means the JIT can bake in the address of the object, and we get to save a mov.

Beyond fewer indirections to access these Non-GC Heap objects, there are other benefits. For example, a “generational GC” like the one used in .NET divides the heap into multiple “generations,” where generation 0 (“gen0”) is for recently created objects and generation 2 (“gen2”) is for objects that have been around for a while. When the GC performs a collection, it needs to determine which objects are still alive (still referenced) and which ones can be collected, and to do that it has to trace through all references to find out what objects are still reachable. However, the generational model is beneficial because it can enable the GC to scour through much less of the heap than it might otherwise need to. If it can tell, for example, that there aren’t any references from gen2 back to gen0, then when doing a gen0 collection, it can avoid enumerating gen2 objects entirely. But to be able to know about such references, the GC needs to know any time a reference is written to a shared location. We can see that in this benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser]
public class Tests
{
    [Benchmark]
    public void Write()
    {
        string dst = "old";
        Write(ref dst, "new");
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Write(ref string dst, string s) => dst = s;
}

where the code generated for that Write(ref string, string) method on both .NET 7 and .NET 8 is:

; Tests.Write(System.String ByRef, System.String)
       call      CORINFO_HELP_CHECKED_ASSIGN_REF
       nop
       ret
; Total bytes of code 7

That CORINFO_HELP_CHECKED_ASSIGN_REF is a JIT helper function that contains what’s known as a “GC write barrier,” a little piece of code that runs to let the GC track that a reference is being written that it might need to know about, e.g. because the object being assigned might be gen0 and the destination might be gen2. We see the same thing on .NET 7 for a tweak to the benchmark like this:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser]
public class Tests
{
    [Benchmark]
    public void Write()
    {
        string dst = "old";
        Write(ref dst);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Write(ref string dst) => dst = "new";
}

Now we’re storing a string literal into the destination, and on .NET 7 we see assembly similarly calling CORINFO_HELP_CHECKED_ASSIGN_REF:

; Tests.Write(System.String ByRef)
       mov       rdx,1FF0E4014A0
       mov       rdx,[rdx]
       call      CORINFO_HELP_CHECKED_ASSIGN_REF
       nop
       ret
; Total bytes of code 20

But, now on .NET 8 we see this:

; Tests.Write(System.String ByRef)
       mov       rax,1B3814EAEC8
       mov       [rcx],rax
       ret
; Total bytes of code 14

No write barrier. That’s thanks to dotnet/runtime#76135, which recognizes that these Non-GC Heap objects don’t need to be tracked, since they’ll never be collected anyway. There are multiple other PRs that improve how constant folding works with these Non-GC Heap objects, too, like dotnet/runtime#85127, dotnet/runtime#85888, and dotnet/runtime#86318.

Zeroing

The JIT frequently needs to generate code that zeroes out memory. Unless you’ve used [SkipLocalsInit], for example, any stack space allocated with stackalloc needs to be zeroed, and it’s the JIT’s responsibility to generate the code that does so. Consider this benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    [Benchmark] public void Constant256() => Use(stackalloc byte[256]);
    [Benchmark] public void Constant1024() => Use(stackalloc byte[1024]);

    [MethodImpl(MethodImplOptions.NoInlining)] // prevent stackallocs from being optimized away
    private static void Use(Span<byte> span) { }
}

Here’s what the .NET 7 assembly looks like for both Constant256 and Constant1024:

; Tests.Constant256()
       push      rbp
       sub       rsp,40
       lea       rbp,[rsp+20]
       xor       eax,eax
       mov       [rbp+10],rax
       mov       [rbp+18],rax
       mov       rax,0A77E4BDA96AD
       mov       [rbp+8],rax
       add       rsp,20
       mov       ecx,10
M00_L00:
       push      0
       push      0
       dec       rcx
       jne       short M00_L00
       sub       rsp,20
       lea       rcx,[rsp+20]
       mov       [rbp+10],rcx
       mov       dword ptr [rbp+18],100
       lea       rcx,[rbp+10]
       call      qword ptr [7FFF3DD37900]; Tests.Use(System.Span`1<Byte>)
       mov       rcx,0A77E4BDA96AD
       cmp       [rbp+8],rcx
       je        short M00_L01
       call      CORINFO_HELP_FAIL_FAST
M00_L01:
       nop
       lea       rsp,[rbp+20]
       pop       rbp
       ret
; Total bytes of code 110

; Tests.Constant1024()
       push      rbp
       sub       rsp,40
       lea       rbp,[rsp+20]
       xor       eax,eax
       mov       [rbp+10],rax
       mov       [rbp+18],rax
       mov       rax,606DD723A061
       mov       [rbp+8],rax
       add       rsp,20
       mov       ecx,40
M00_L00:
       push      0
       push      0
       dec       rcx
       jne       short M00_L00
       sub       rsp,20
       lea       rcx,[rsp+20]
       mov       [rbp+10],rcx
       mov       dword ptr [rbp+18],400
       lea       rcx,[rbp+10]
       call      qword ptr [7FFF3DD47900]; Tests.Use(System.Span`1<Byte>)
       mov       rcx,606DD723A061
       cmp       [rbp+8],rcx
       je        short M00_L01
       call      CORINFO_HELP_FAIL_FAST
M00_L01:
       nop
       lea       rsp,[rbp+20]
       pop       rbp
       ret
; Total bytes of code 110

We can see in the middle there that the JIT has written a zeroing loop, zeroing 16 bytes at a time by pushing two 8-byte 0s onto the stack on each iteration:

M00_L00:
       push      0
       push      0
       dec       rcx
       jne       short M00_L00

Now in .NET 8 with dotnet/runtime#83255, the JIT unrolls and vectorizes that zeroing, and after a certain threshold (which as of dotnet/runtime#83274 has also been updated and made consistent with what other native compilers do), it switches over to using an optimized memset routine rather than emitting a large amount of code to achieve the same thing. Here’s what we now get on .NET 8 for Constant256 (on my machine… I call that out because the limits are based on what instruction sets are available):

; Tests.Constant256()
       push      rbp
       sub       rsp,40
       vzeroupper
       lea       rbp,[rsp+20]
       xor       eax,eax
       mov       [rbp+10],rax
       mov       [rbp+18],rax
       mov       rax,6281D64D33C3
       mov       [rbp+8],rax
       test      [rsp],esp
       sub       rsp,100
       lea       rcx,[rsp+20]
       vxorps    ymm0,ymm0,ymm0
       vmovdqu   ymmword ptr [rcx],ymm0
       vmovdqu   ymmword ptr [rcx+20],ymm0
       vmovdqu   ymmword ptr [rcx+40],ymm0
       vmovdqu   ymmword ptr [rcx+60],ymm0
       vmovdqu   ymmword ptr [rcx+80],ymm0
       vmovdqu   ymmword ptr [rcx+0A0],ymm0
       vmovdqu   ymmword ptr [rcx+0C0],ymm0
       vmovdqu   ymmword ptr [rcx+0E0],ymm0
       mov       [rbp+10],rcx
       mov       dword ptr [rbp+18],100
       lea       rcx,[rbp+10]
       call      qword ptr [7FFEB7D3F498]; Tests.Use(System.Span`1<Byte>)
       mov       rcx,6281D64D33C3
       cmp       [rbp+8],rcx
       je        short M00_L00
       call      CORINFO_HELP_FAIL_FAST
M00_L00:
       nop
       lea       rsp,[rbp+20]
       pop       rbp
       ret
; Total bytes of code 156

Notice there’s no zeroing loop, and instead we see a bunch of 256-bit vmovdqu move instructions to copy the zeroed out ymm0 register to the next portion of the stack. And then for Constant1024 we see:

; Tests.Constant1024()
       push      rbp
       sub       rsp,40
       lea       rbp,[rsp+20]
       xor       eax,eax
       mov       [rbp+10],rax
       mov       [rbp+18],rax
       mov       rax,0CAF12189F783
       mov       [rbp],rax
       test      [rsp],esp
       sub       rsp,400
       lea       rcx,[rsp+20]
       mov       [rbp+8],rcx
       xor       edx,edx
       mov       r8d,400
       call      CORINFO_HELP_MEMSET
       mov       rcx,[rbp+8]
       mov       [rbp+10],rcx
       mov       dword ptr [rbp+18],400
       lea       rcx,[rbp+10]
       call      qword ptr [7FFEB7D5F498]; Tests.Use(System.Span`1<Byte>)
       mov       rcx,0CAF12189F783
       cmp       [rbp],rcx
       je        short M00_L00
       call      CORINFO_HELP_FAIL_FAST
M00_L00:
       nop
       lea       rsp,[rbp+20]
       pop       rbp
       ret
; Total bytes of code 119

Again, no zeroing loop, and instead we see call CORINFO_HELP_MEMSET, relying on the optimized underlying memset to efficiently handle the zeroing. The effects of this are visible in throughput numbers as well:

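| Method       | Runtime  | Mean      | Ratio |
|--------------|----------|-----------|-------|
| Constant256  | .NET 7.0 | 7.927 ns  | 1.00  |
| Constant256  | .NET 8.0 | 3.181 ns  | 0.40  |
| Constant1024 | .NET 7.0 | 30.523 ns | 1.00  |
| Constant1024 | .NET 8.0 | 8.850 ns  | 0.29  |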

dotnet/runtime#83488 improved this further by using a standard trick frequently employed when vectorizing algorithms. Let’s say you want to zero out 120 bytes and you have at your disposal an instruction for zeroing out 32 bytes at a time. We can issue three such instructions to zero out 96 bytes, but we’re then left with 24 bytes that still need to be zeroed. What do we do? We can’t write another 32 bytes from where we left off, as we might then be overwriting 8 bytes we shouldn’t be touching. We could use scalar zeroing and issue three instructions each for 8 bytes, but could we do it in just a single instruction? Yes! Since the writes are idempotent, we can just zero out the last 32 bytes of the 120 bytes, even though that means we’ll be re-zeroing 8 bytes we already zeroed. You can see this same approach utilized in many of the vectorized operations throughout the core libraries, and as of this PR, the JIT employs it when zeroing as well.
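The arithmetic in that 120-byte example can be sketched in C#. This is only an illustration of the overlapping-tail idea, not the JIT's implementation: ZeroWithOverlappingTail is a hypothetical helper, and Span.Clear on a 32-byte slice stands in for a single 32-byte vector store.

```csharp
using System;

// 128-byte buffer filled with 0xFF; only the first 120 bytes should be zeroed.
byte[] data = new byte[128];
Array.Fill(data, (byte)0xFF);

ZeroWithOverlappingTail(data.AsSpan(0, 120));

Console.WriteLine($"byte 0 = {data[0]}, byte 119 = {data[119]}, byte 120 = {data[120]}");

// Hypothetical helper: zeroes `buffer` using only fixed 32-byte writes.
static void ZeroWithOverlappingTail(Span<byte> buffer)
{
    const int ChunkSize = 32;

    if (buffer.Length < ChunkSize)
    {
        buffer.Clear(); // too small for the trick; fall back to a plain clear
        return;
    }

    // Full chunks: for 120 bytes this covers offsets 0, 32, and 64 (96 bytes).
    int offset = 0;
    for (; offset + ChunkSize <= buffer.Length; offset += ChunkSize)
    {
        buffer.Slice(offset, ChunkSize).Clear();
    }

    // Remainder: one more 32-byte write aligned to the END of the buffer.
    // For 120 bytes this covers [88, 120), harmlessly re-zeroing [88, 96).
    if (offset < buffer.Length)
    {
        buffer.Slice(buffer.Length - ChunkSize, ChunkSize).Clear();
    }
}
```

The final write never reaches past byte 119, so the 8 bytes after the region keep their 0xFF values, and the remainder is handled in one store instead of three 8-byte ones.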

dotnet/runtime#85389 takes this further and uses AVX512 to improve bulk operations like this zeroing. So, running the same benchmark on my Dev Box with AVX512, I see this assembly generated for Constant256:

; Tests.Constant256()
       push      rbp
       sub       rsp,40
       vzeroupper
       lea       rbp,[rsp+20]
       xor       eax,eax
       mov       [rbp+10],rax
       mov       [rbp+18],rax
       mov       rax,992482B435F7
       mov       [rbp+8],rax
       test      [rsp],esp
       sub       rsp,100
       lea       rcx,[rsp+20]
       vxorps    ymm0,ymm0,ymm0
       vmovdqu32 [rcx],zmm0
       vmovdqu32 [rcx+40],zmm0
       vmovdqu32 [rcx+80],zmm0
       vmovdqu32 [rcx+0C0],zmm0
       mov       [rbp+10],rcx
       mov       dword ptr [rbp+18],100
       lea       rcx,[rbp+10]
       call      qword ptr [7FFCE555F4B0]; Tests.Use(System.Span`1<Byte>)
       mov       rcx,992482B435F7
       cmp       [rbp+8],rcx
       je        short M00_L00
       call      CORINFO_HELP_FAIL_FAST
M00_L00:
       nop
       lea       rsp,[rbp+20]
       pop       rbp
       ret
; Total bytes of code 132

; Tests.Use(System.Span`1<Byte>)
       ret
; Total bytes of code 1

Note that now, rather than eight vmovdqu instructions with ymm0, we see four vmovdqu32 instructions with zmm0, as each move instruction is able to zero out twice as much, with each instruction handling 64 bytes at a time.
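The eight-versus-four count follows directly from the vector widths, and which widths are usable depends on the machine. A small sketch, assuming .NET 8 (where the Vector512 types are available), reports what the current hardware accelerates and derives the store counts for a 256-byte buffer; the stores256 and stores512 names are just locals for this illustration.

```csharp
using System;
using System.Runtime.Intrinsics;

// Which vector widths the JIT can use depends on the instruction sets available.
Console.WriteLine($"Vector128 accelerated: {Vector128.IsHardwareAccelerated}");
Console.WriteLine($"Vector256 accelerated: {Vector256.IsHardwareAccelerated}");
Console.WriteLine($"Vector512 accelerated: {Vector512.IsHardwareAccelerated}");

// Bytes covered per store, and the number of stores needed to zero 256 bytes.
int stores256 = 256 / Vector256<byte>.Count; // 32 bytes per store
int stores512 = 256 / Vector512<byte>.Count; // 64 bytes per store

Console.WriteLine($"256-bit stores needed: {stores256}");
Console.WriteLine($"512-bit stores needed: {stores512}");
```

The acceleration flags will differ from machine to machine, which is why the disassembly you get for this benchmark may not match the listings above.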

Value Types

Value types (structs) have been used increasingly as part of high-performance code. Yet while they have obvious advantages (they don’t require heap allocation and thus reduce pressure on the GC), they also have disadvantages (more data being copied around) and have historically not been as optimized as someone relying on them heavily for performance might like. It’s been a key focus area of improvement for the JIT in the last several releases of .NET, and that continues into .NET 8.

One specific area of improvement here is around “promotion.” In this context, promotion is the idea of splitting a struct apart into its constituent fields, effectively treating each field as its own local. This can lead to a number of valuable optimizations, including being able to enregister portions of a struct. As of .NET 7, the JIT does support struct promotion, but with limitations, including only supporting structs with at most four fields and not supporting nested structs (other than for primitive types).

A lot of work in .NET 8 went into removing those restrictions. dotnet/runtime#83388 improves upon the existing promotion support with an additional optimization pass the JIT refers to as “physical promotion”; it does away with both of those cited limitations, though as of this PR the feature was still disabled by default. Other PRs like dotnet/runtime#85105 and dotnet/runtime#86043 improved it further, and dotnet/runtime#88090 enabled the optimizations by default. The net result is visible in a benchmark like the following:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser]
public class Tests
{
    private ParsedStat _stat;

    [Benchmark]
    public ulong GetTime()
    {
        ParsedStat stat = _stat;
        return stat.utime + stat.stime;
    }

    internal struct ParsedStat
    {
        internal int pid;
        internal string comm;
        internal char state;
        internal int ppid;
        internal int session;
        internal ulong utime;
        internal ulong stime;
        internal long nice;
        internal ulong starttime;
        internal ulong vsize;
        internal long rss;
        internal ulong rsslim;
    }
}

Here we have a struct modeling some data that might be extracted from a procfs stat file on Linux. The benchmark makes a local copy of the struct and returns a sum of the user and kernel times. In .NET 7, the assembly looks like this:

; Tests.GetTime()
       push      rdi
       push      rsi
       sub       rsp,58
       lea       rsi,[rcx+8]
       lea       rdi,[rsp+8]
       mov       ecx,0A
       rep movsq
       mov       rax,[rsp+10]
       add       rax,[rsp+18]
       add       rsp,58
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 40

The two really interesting instructions here are these:

mov ecx,0A
rep movsq

The ParsedStat struct is 80 bytes in size, and this pair of instructions is repeatedly (rep) copying 8 bytes (movsq) 10 times (ecx that’s been populated with 0xA) from the source location in rsi (which was initialized with [rcx+8], aka the location of the _stat field) to the destination location in rdi (a stack location at [rsp+8]). In other words, this is making a full copy of the whole struct, even though we only need two fields from it. Now in .NET 8, we get this:

; Tests.GetTime()
       add       rcx,8
       mov       rax,[rcx+8]
       mov       rcx,[rcx+10]
       add       rax,rcx
       ret
; Total bytes of code 16

Ahhh, so much nicer. Now it’s avoided the whole copy, and is simply moving the relevant ulong values into registers and adding them together.
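At the source level, the effect is roughly as if the method had been rewritten to read only the fields it uses into their own locals. The following is a conceptual sketch, not what the JIT literally emits: a three-field value tuple (itself a struct) is a hypothetical stand-in for ParsedStat, and SumViaCopy and SumViaFields are illustrative names.

```csharp
using System;

// Hypothetical stand-in for ParsedStat, trimmed to three fields.
var stat = (pid: 1234, utime: 700UL, stime: 300UL);

Console.WriteLine(SumViaCopy(in stat));
Console.WriteLine(SumViaFields(in stat));

// Shape of the benchmark's source: copy the whole struct, then use two fields.
static ulong SumViaCopy(in (int pid, ulong utime, ulong stime) source)
{
    var local = source; // full struct copy at the source level
    return local.utime + local.stime;
}

// Roughly what promotion allows: each used field is treated as its own local,
// so the unused fields never need to be copied at all.
static ulong SumViaFields(in (int pid, ulong utime, ulong stime) source)
{
    ulong utime = source.utime;
    ulong stime = source.stime;
    return utime + stime;
}
```

Both methods return the same value; the point of physical promotion is that you can keep writing the first form and still get code resembling the second.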

Here’s another example:

// dotnet run -c Release -f net7.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId(".NET 7").WithRuntime(CoreRuntime.Core70).AsBaseline())
    .AddJob(Job.Default.WithId(".NET 8 w/o PGO").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_TieredPGO", "0"))
    .AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables", "Runtime")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private readonly List<int?> _list = Enumerable.Range(0, 10000).Select(i => (int?)i).ToList();

    [Benchmark]
    public int CountList()
    {
        int count = 0;
        foreach (int? i in _list)
            if (i is not null)
                count++;

        return count;
    }
}

List<T> has a struct List<T>.Enumerator that’s returned from List<T>.GetEnumerator(), such that when you foreach the list directly (rather than as an IEnumerable<T>), the C# compiler binds to this struct enumerator via the enumerator pattern. This example runs afoul of the previous limitations in two ways. That Enumerator has a field for the current T, so if T is a non-primitive value type, it violates the “no nested structs” limitation. And that Enumerator has four fields, so if that T has multiple fields, it pushes it beyond the four-field limit. Now in .NET 8, the JIT is able to see through the struct to its fields, and optimize the enumeration of the list to a much more efficient result.

| Method | Job | Mean | Ratio | Code Size |
|---|---|---|---|---|
| CountList | .NET 7 | 18.878 us | 1.00 | 215 B |
| CountList | .NET 8 w/o PGO | 11.726 us | 0.62 | 70 B |
| CountList | .NET 8 | 5.912 us | 0.31 | 66 B |

Note the significant improvement in both throughput and code size from .NET 7 to .NET 8 even without PGO. However, the gap here between .NET 8 without PGO and with PGO is also interesting, albeit for other reasons. We see an almost halving of execution time with PGO applied, but only four bytes of difference in assembly code size. Those four bytes stem from a single mov instruction that PGO was able to help remove, which we can see easily by pasting the two snippets into a diffing tool:

[Image: An extra mov highlighted in a diff tool]

~12us down to ~6us is a lot for a difference of a single mov… why such an outsized impact? This ends up being a really good example of what I mentioned at the beginning of this article: beware microbenchmarks, as they can differ from machine to machine. Or in this case, in particular from processor to processor. The machine on which I’m writing this and on which I’ve run the majority of the benchmarks in this post is a several year old desktop with an Intel Coffee Lake processor. When I run the same benchmark on my Dev Box, which has an Intel Xeon Platinum 8370C, I see this:

| Method | Job | Mean | Ratio | Code Size |
|---|---|---|---|---|
| CountList | .NET 7 | 15.804 us | 1.00 | 215 B |
| CountList | .NET 8 w/o PGO | 7.138 us | 0.45 | 70 B |
| CountList | .NET 8 | 6.111 us | 0.39 | 66 B |

Same code size, still a large improvement due to physical promotion, but now only a small ~15% rather than ~2x improvement from PGO. As it turns out, Coffee Lake is one of the processors affected by the Jump Conditional Code (JCC) Erratum issued in 2019 (“erratum” here is a fancy way of saying “bug”, or alternatively, “documentation about a bug”). The problem involved jump instructions on a 32-byte boundary, and the hardware caching information about those instructions. The issue was then subsequently fixed via a microcode update that disabled the relevant caching, but that then created a possible performance issue, as whether a jump is on a 32-byte boundary impacts whether it’s cached and therefore the resulting performance gains that cache was introduced to provide. If I set the DOTNET_JitDisasm environment variable to *CountList* (to get the JIT to output the disassembly directly, rather than relying on BenchmarkDotNet to fish it out), and set the DOTNET_JitDisasmWithAlignmentBoundaries environment variable to 1 (to get the JIT to include alignment boundary information in that output), I see this:

G_M000_IG04:                ;; offset=0018H
       mov      r8d, dword ptr [rcx+10H]
       cmp      edx, r8d
       jae      SHORT G_M000_IG05
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 1 ; jcc erratum) 32B boundary ...............................
       mov      r8, gword ptr [rcx+08H]

Sure enough, we see that this jump instruction is falling on a 32-byte boundary. When PGO kicks in and removes the earlier mov, that changes the alignment such that the jump is no longer on a 32-byte boundary:

G_M000_IG05:                ;; offset=0018H
       cmp      edx, dword ptr [rcx+10H]
       jae      SHORT G_M000_IG06
       mov      r8, gword ptr [rcx+08H]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 1) 32B boundary ...............................
       cmp      edx, dword ptr [r8+08H]

This is all to say, again, there are many things that can impact microbenchmarks, and it’s valuable to understand the source of a difference rather than just taking it at face value.
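To make the alignment point a bit more concrete, here is a minimal sketch of the kind of check involved. It assumes that the jumps of interest are those that either cross a 32-byte boundary or end exactly on one; that reading of the erratum, the helper name TouchesBoundary32, and the offsets used are all illustrative assumptions rather than anything taken from the JIT or from the disassembly above.

```csharp
using System;

// Hypothetical helper: does an instruction occupying the byte range
// [offset, offset + length) cross a 32-byte boundary or end exactly on one?
static bool TouchesBoundary32(int offset, int length)
{
    int firstBlock = offset / 32;                // 32-byte block holding the first byte
    int blockAfterEnd = (offset + length) / 32;  // block holding the byte just past the instruction
    return firstBlock != blockAfterEnd;
}

// Illustrative offsets only.
Console.WriteLine(TouchesBoundary32(0x1F, 2)); // True: straddles 0x20
Console.WriteLine(TouchesBoundary32(0x1E, 2)); // True: ends exactly at 0x20
Console.WriteLine(TouchesBoundary32(0x1B, 2)); // False: stays inside the first block
```

Removing a single earlier instruction shifts every later offset, which is all it takes to move a jump from the first or second case into the third.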

Ok, where were we? Oh yeah, structs. Another improvement related to structs comes in dotnet/runtime#79346, which adds an additional “liveness” optimization pass earlier than the others it already has (liveness is just an indication of whether a variable might still be needed because its value might be used again in the future). This then allows the JIT to remove some struct copies it wasn’t previously able to, in particular in situations where the last time the struct is used is in passing it to another method. However, this additional liveness pass has other benefits as well, in particular with relation to “forward substitution.” Forward substitution is an optimization that can be thought of as the opposite of “common subexpression elimination” (CSE). With CSE, the compiler replaces an expression with something containing the result already computed for that expression, so for example if you had:

int c = (a + b) + 3;
int d = (a + b) * 4;

a compiler might use CSE to rewrite that as:

int tmp = a + b;
int c = tmp + 3;
int d = tmp * 4;

Forward substitution could be used to undo that, distributing the expression feeding into tmp back to where tmp is used, such that we end up back with:

int c = (a + b) + 3;
int d = (a + b) * 4;

Why would a compiler want to do that? It can make certain subsequent optimizations easier for it to see. For example, consider this benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser]
public class Tests
{
    [Benchmark]
    [Arguments(42)]
    public int Merge(int a)
    {
        a *= 3;
        a *= 3;
        return a;
    }
}

On .NET 7, that results in this assembly:

; Tests.Merge(Int32)
       lea       edx,[rdx+rdx*2]
       lea       edx,[rdx+rdx*2]
       mov       eax,edx
       ret
; Total bytes of code 9

The generated code here is performing each multiplication individually. But when we view:

a *= 3;
a *= 3;
return a;

instead as:

a = a * 3;
a = a * 3;
return a;

and knowing that the initial result stored into a is temporary (thank you, liveness), forward substitution can turn that into:

a = (a * 3) * 3;
return a;

at which point constant folding can kick in. Now on .NET 8 we get:

; Tests.Merge(Int32)
       lea       eax,[rdx+rdx*8]
       ret
; Total bytes of code 4
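As a sanity check on why that folding is legal, the following sketch compares the two-step form with the folded a * 9 form, including inputs that overflow. It assumes the default unchecked arithmetic (made explicit here with unchecked), where int multiplication wraps; the helper names MergeTwoSteps and MergeFolded are made up for the illustration.

```csharp
using System;

// The two-step form from the benchmark.
static int MergeTwoSteps(int a)
{
    unchecked
    {
        a *= 3;
        a *= 3;
        return a;
    }
}

// The shape left after forward substitution and constant folding.
static int MergeFolded(int a) => unchecked(a * 9);

foreach (int a in new[] { 0, 1, 42, -7, int.MaxValue, int.MinValue })
{
    // Both forms agree even when the multiplication wraps around.
    Console.WriteLine($"{a}: {MergeTwoSteps(a) == MergeFolded(a)}");
}
```

Because wrapping multiplication is associative, (a * 3) * 3 and a * 9 produce the same bits for every int, so the JIT is free to emit the single lea.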

Another change related to liveness is dotnet/runtime#77990 from @SingleAccretion. This adds another pass over one of the JIT’s internal representations, eliminating writes it finds to be useless.

Casting

Various improvements have gone into improving the performance of casting in .NET 8.

dotnet/runtime#75816 improved the performance of using is T[] when T is sealed. There’s a CORINFO_HELP_ISINSTANCEOFARRAY helper the JIT uses to determine whether an object is of a specified array type, but when the T is sealed, the JIT can instead emit it without the helper, generating code as if it were written like obj is not null && obj.GetType() == typeof(T[]). This is another example where dynamic PGO has a measurable impact, so the benchmark highlights the improvements with and without it.

// dotnet run -c Release -f net7.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId(".NET 7").WithRuntime(CoreRuntime.Core70).AsBaseline())
    .AddJob(Job.Default.WithId(".NET 8 w/o PGO").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_TieredPGO", "0"))
    .AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);
[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables", "Runtime")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private readonly object _obj = new string[1];
    [Benchmark]
    public bool IsStringArray() => _obj is string[];
}

| Method | Job | Mean | Ratio |
|---|---|---|---|
| IsStringArray | .NET 7 | 1.2290 ns | 1.00 |
| IsStringArray | .NET 8 w/o PGO | 0.2365 ns | 0.19 |
| IsStringArray | .NET 8 | 0.0825 ns | 0.07 |
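The reason sealedness matters can be seen with a small sketch. For string[], the pattern test and an exact type comparison always agree, because nothing can derive from string. For an unsealed element type such as object, array covariance means they can disagree, so the exact-type shortcut would be wrong. The helper names here are invented for the illustration and are not what the JIT emits.

```csharp
#nullable enable
using System;

// What the developer writes.
static bool ViaPattern(object? obj) => obj is string[];

// The exact-type comparison described above, valid because string is sealed.
static bool ViaExactType(object? obj) => obj is not null && obj.GetType() == typeof(string[]);

// The same two shapes for an unsealed element type.
static bool ObjectArrayViaPattern(object? obj) => obj is object[];
static bool ObjectArrayViaExactType(object? obj) => obj is not null && obj.GetType() == typeof(object[]);

object?[] inputs = { new string[1], new object[1], new int[1], "hello", null };
foreach (object? input in inputs)
{
    // The two columns match for every input.
    Console.WriteLine($"{input?.GetType().Name ?? "null"}: {ViaPattern(input)} / {ViaExactType(input)}");
}

object strings = new string[1];
Console.WriteLine(ObjectArrayViaPattern(strings));   // True: a string[] is usable as an object[]
Console.WriteLine(ObjectArrayViaExactType(strings)); // False: its exact type is string[], not object[]
```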

Moving on, consider this benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser(maxDepth: 0)]
public class Tests
{
    private readonly string[] _strings = new string[1];
    [Benchmark]
    public string Get1() => _strings[0];
    [Benchmark]
    public ref string Get2() => ref _strings[0];
}

Get1 here is just reading and returning the 0th element from the array. Get2 here is returning a ref to the 0th element from the array. Here’s the assembly we get in .NET 7:

; Tests.Get1()
       sub       rsp,28
       mov       rax,[rcx+8]
       cmp       dword ptr [rax+8],0
       jbe       short M00_L00
       mov       rax,[rax+10]
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 29
; Tests.Get2()
       sub       rsp,28
       mov       rcx,[rcx+8]
       xor       edx,edx
       mov       r8,offset MT_System.String
       call      CORINFO_HELP_LDELEMA_REF
       nop
       add       rsp,28
       ret
; Total bytes of code 31

In Get1, we’re immediately using the array element, so the C# compiler can emit a ldelem.ref IL instruction, but in Get2, the reference to the array element is being returned, so the C# compiler emits a ldelema (load element address) instruction. In the general case, ldelema requires a type check, because of covariance; you could have a Base[] array = new DerivedFromBase[1];, in which case if you handed out a ref Base pointing into that array and someone wrote a new AlsoDerivedFromBase() via that ref, type safety would be violated (since you’d be storing an AlsoDerivedFromBase into a DerivedFromBase[] even though DerivedFromBase and AlsoDerivedFromBase aren’t related). As such, the .NET 7 assembly for this code includes a call to CORINFO_HELP_LDELEMA_REF, which is the helper function the JIT uses to perform that type check. But the array element type here is string, which is sealed, which means we can’t get into that problematic situation: there’s no type you can store into a string variable other than string. Thus, this helper call is superfluous, and with dotnet/runtime#85256, the JIT can now avoid using it. On .NET 8, then, we get this for Get2:

; Tests.Get2()
       sub       rsp,28
       mov       rax,[rcx+8]
       cmp       dword ptr [rax+8],0
       jbe       short M00_L00
       add       rax,10
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 29

No CORINFO_HELP_LDELEMA_REF in sight.
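The type check that ldelema performs in the general case is observable from C#. This is a small sketch using object[] and string[] as the covariant pair, in place of the Base/DerivedFromBase types from the explanation; the helper name TakingRefThrows is made up for the example.

```csharp
using System;

// Takes a ref to element 0 of an object[] and reports whether the
// runtime's element-type check rejected it.
static bool TakingRefThrows(object[] array)
{
    try
    {
        ref object slot = ref array[0]; // compiles to ldelema, which checks the array's real element type
        slot = "written through the ref";
        return false;
    }
    catch (ArrayTypeMismatchException)
    {
        return true;
    }
}

Console.WriteLine(TakingRefThrows(new object[1])); // False: the array really is an object[]
Console.WriteLine(TakingRefThrows(new string[1])); // True: a string[] viewed as object[] fails the check
```

With a string[] accessed as a string[], no such mismatch is possible, which is why the helper call can be dropped for sealed element types.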

And then dotnet/runtime#86728 reduces the costs associated with a generic cast. Previously the JIT would always use a CastHelpers.ChkCastAny method to perform the cast, but with this change, it inlines a fast success path.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly object _o = "hello";
    [Benchmark]
    public string GetString() => Cast<string>(_o);
    [MethodImpl(MethodImplOptions.NoInlining)]
    public T Cast<T>(object o) => (T)o;
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| GetString | .NET 7.0 | 2.247 ns | 1.00 |
| GetString | .NET 8.0 | 1.300 ns | 0.58 |

Peephole Optimizations

A “peephole optimization” is one in which a small sequence of instructions is replaced by a different sequence that is expected to perform better. This could include getting rid of instructions deemed unnecessary or replacing two instructions with one instruction that can accomplish the same task. Every release of .NET features a multitude of new peephole optimizations, often inspired by real-world examples where some overhead could be trimmed by slightly increasing code quality, and .NET 8 is no exception. Here are just some of these optimizations in .NET 8:

(I’ve touched here on some of the improvements specific to Arm. For a more in-depth look, see Arm64 Performance Improvements in .NET 8).

Native AOT

Native AOT shipped in .NET 7. It enables .NET programs to be compiled at build time into a self-contained executable or library composed entirely of native code: no JIT is required at execution time to compile anything, and in fact there’s no JIT included with the compiled program. The result is an application that can have a very small on-disk footprint, a small memory footprint, and very fast startup time. In .NET 7, the primary supported workloads were console applications. Now in .NET 8, a lot of work has gone into making ASP.NET applications shine when compiled with Native AOT, as well as driving down overall costs, regardless of app model.

A significant focus in .NET 8 was on reducing the size of built applications, and the net effect of this is quite easy to see. Let’s start by creating a new Native AOT console app:

dotnet new console -o nativeaotexample -f net7.0

That creates a new nativeaotexample directory and adds to it a new “Hello, world” app that targets .NET 7. Edit the generated nativeaotexample.csproj in two ways:

  1. Change the <TargetFramework>net7.0</TargetFramework> to instead be <TargetFrameworks>net7.0;net8.0</TargetFrameworks>, so that we can easily build for either .NET 7 or .NET 8.
  2. Add <PublishAot>true</PublishAot> to the <PropertyGroup>...</PropertyGroup>, so that when we dotnet publish, it uses Native AOT.

Now, publish the app for .NET 7. I’m currently targeting Linux for x64, so I’m using linux-x64, but you can follow along on Windows with a Windows identifier, like win-x64:

dotnet publish -f net7.0 -r linux-x64 -c Release

That should successfully build the app, producing a standalone executable, and we can ls/dir the output directory to see the produced binary size (here I’ve used ls -s --block-size=k):

12820K /home/stoub/nativeaotexample/bin/Release/net7.0/linux-x64/publish/nativeaotexample

So, on .NET 7 on Linux, this “Hello, world” application, including all necessary library support, the GC, everything, is ~13MB. Now, we can do the same for .NET 8:

dotnet publish -f net8.0 -r linux-x64 -c Release

and again see the generated output size:

1536K /home/stoub/nativeaotexample/bin/Release/net8.0/linux-x64/publish/nativeaotexample

Now on .NET 8, that ~13MB has dropped to ~1.5MB! We can get it smaller, too, using various supported configuration flags. First, we can set a size vs speed option introduced in dotnet/runtime#85133, adding <OptimizationPreference>Size</OptimizationPreference> to the .csproj. Then if I don’t need globalization-specific code and data and am ok utilizing an invariant mode, I can add <InvariantGlobalization>true</InvariantGlobalization>. Maybe I don’t care about having good stack traces if an exception occurs? dotnet/runtime#88235 added the <StackTraceSupport>false</StackTraceSupport> option. Add all of those and republish:

1248K /home/stoub/nativeaotexample/bin/Release/net8.0/linux-x64/publish/nativeaotexample

Sweet.
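To put numbers on those three listings, here is a quick back-of-the-envelope calculation using only the sizes reported by ls above (12820K, 1536K, and 1248K). The helper name PercentSmaller is just for illustration.

```csharp
using System;

// Percentage by which 'after' is smaller than 'before'.
static double PercentSmaller(double before, double after) => (before - after) / before * 100.0;

double net7 = 12820;        // K, .NET 7 publish output
double net8 = 1536;         // K, .NET 8 with default settings
double net8Trimmed = 1248;  // K, .NET 8 with the three extra properties

Console.WriteLine($".NET 7 -> .NET 8:            {PercentSmaller(net7, net8):F1}% smaller");
Console.WriteLine($".NET 7 -> .NET 8 with flags: {PercentSmaller(net7, net8Trimmed):F1}% smaller");
Console.WriteLine($".NET 8 -> .NET 8 with flags: {PercentSmaller(net8, net8Trimmed):F1}% smaller");
```

That works out to roughly an 88% reduction from simply moving to .NET 8, and about 90% once the optional settings are applied.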

A good chunk of those improvements came from a relentless effort that involved hacking away at the size, 10KB here, 20KB there. Some examples that drove down these sizes:

  • There are a variety of data structures the Native AOT compiler needs to create that then need to be used by the runtime when the app executes. dotnet/runtime#77884 added support for these data structures, including ones containing pointers, to be stored into the application and then rehydrated at execution time. Even before being extended in a variety of ways by subsequent PRs, this shaved hundreds of kilobytes off the app size, on both Windows and Linux (but more so on Linux).
  • Every type with a static field containing references has a data structure associated with it containing a few pointers. dotnet/runtime#78794 made those pointers relative, saving ~0.5% of the HelloWorld app size (at least on Linux, a bit less on Windows). dotnet/runtime#78801 did the same for another set of pointers, saving another ~1%.
  • dotnet/runtime#79594 removed some over-aggressive tracking of types and methods that needed data stored about them for reflection. This saved another ~32KB on HelloWorld.
  • In some cases, generic type dictionaries were being created even if they were never used and thus empty. dotnet/runtime#82591 got rid of these, saving another ~1.5% on a simple ASP.NET minimal APIs app. dotnet/runtime#83367 saved another ~20KB by ridding itself of other empty type dictionaries.
  • Members declared on a generic type have their code copied and specialized for each value type that’s substituted for the generic type parameter. However, if with some tweaks those members can be made non-generic and moved out of the type, such as into a non-generic base type, that duplication can be avoided. dotnet/runtime#82923 did so for array enumerators, moving down the IDisposable and non-generic IEnumerator interface implementations.
  • CoreLib has an implementation of an empty array enumerator that can be used when enumerating a T[] that’s empty, and that singleton may be used in non-array enumerables, e.g. enumerating an empty (IEnumerable<KeyValuePair<TKey, TValue>>)Dictionary<TKey, TValue> could produce that array enumerator singleton. That enumerator, however, has a reference to a T[], and in the Native AOT world, using the enumerator then means code needs to be produced for the various members of T[]. If, however, the enumerator in question is for a T[] that’s unlikely to be used elsewhere (e.g. KeyValuePair<TKey, TValue>[]), dotnet/runtime#82899 supplies a specialized enumerator singleton that doesn’t reference T[], avoiding forcing that code to be created and kept (for example, code for a Dictionary<TKey, TValue>’s IEnumerable<KeyValuePair<TKey, TValue>>).
  • No one ever calls the Equals/GetHashCode methods on the AsyncStateMachine structs produced by the C# compiler for async methods; they’re a hidden implementation detail, but even so, such virtual methods are in general kept rooted in a Native AOT app (and whereas CoreCLR can use reflection to provide the implementation of these methods for value types, Native AOT needs customized code emitted for each). dotnet/runtime#83369 special-cased these to avoid them being kept, shaving another ~1% off a minimal APIs app.
  • dotnet/runtime#83937 reduced the size of static constructor contexts, data structures used to pass information about a type’s static cctor between portions of the system.
  • dotnet/runtime#84463 made a few tweaks that ended up avoiding creating MethodTables for double/float and that reduced reliance on some array methods, shaving another ~3% off HelloWorld.
  • dotnet/runtime#84156 manually split a method into two portions such that some lesser-used code isn’t always brought in when using the more commonly-used code; this saved another several hundred kilobytes.
  • dotnet/runtime#84224 improved handling of the common pattern typeof(T) == typeof(Something) that’s often used to do generic specialization (e.g. in code like MemoryExtensions), doing it in a way that makes it easier to get rid of side effects from branches that are trimmed away.
  • The GC includes a vectorized sort implementation called vxsort. When building with a configuration optimized for size, dotnet/runtime#85036 enabled removing that throughput optimization, saving several hundred kilobytes.
  • ValueTuple<...> is a very handy type, but it brings a lot of code with it, as it implements multiple interfaces which then end up rooting functionality on the generic type parameters. dotnet/runtime#87120 removed a use of ValueTuple<T1, T2> from SynchronizationContext, saving ~200KB.
  • On Linux specifically, a large improvement came from dotnet/runtime#85139. Debug symbols were previously being stored in the published executable; with this change, symbols are stripped from the executable and are instead stored in a separate .dbg file built next to it. Someone who wants to revert to keeping the symbols in the executable can add <StripSymbols>false</StripSymbols> to their project.

You get the idea. The improvements go beyond nipping and tucking here and there within the Native AOT compiler, though. Individual libraries also contributed. For example:

  • HttpClient supports automatic decompression of response streams, for both deflate and brotli, and that in turn means that any HttpClient use implicitly brings with it most of System.IO.Compression. However, by default that decompression isn’t enabled, and you need to opt in to it by explicitly setting the AutomaticDecompression property on the HttpClientHandler or SocketsHttpHandler in use. So, dotnet/runtime#78198 employs a trick where rather than SocketsHttpHandler’s main code paths relying directly on the internal DecompressionHandler that does this work, it instead relies on a delegate. The field storing that delegate starts out as null, and then as part of the AutomaticDecompression setter, that field is set to a delegate that will do the decompression work. That means that if the trimmer doesn’t see any code accessing the AutomaticDecompression setter such that the setter can be trimmed away, then all of the DecompressionHandler and its reliance on DeflateStream and BrotliStream can also be trimmed away. Since it’s a little confusing to read, here’s a representation of it:

    private DecompressionMethods _automaticDecompression;
    private Func<Stream, Stream>? _getStream;
    public DecompressionMethods AutomaticDecompression
    {
        get => _automaticDecompression;
        set
        {
            _automaticDecompression = value;
            _getStream ??= CreateDecompressionStream;
        }
    }
    public Stream GetStreamAsync()
    {
        Stream response = ...;
        return _getStream is not null ? _getStream(response) : response;
    }
    private static Stream CreateDecompressionStream(Stream stream) =>
        UseGZip   ? new GZipStream(stream, CompressionMode.Decompress) :
        UseZLib   ? new ZLibStream(stream, CompressionMode.Decompress) :
        UseBrotli ? new BrotliStream(stream, CompressionMode.Decompress) :
        stream;

    The CreateDecompressionStream method here is the one that references all of the compression-related code, and the only code path that touches it is in the AutomaticDecompression setter. Therefore, if nothing in the app accesses the setter, the setter can be trimmed, which means the CreateDecompressionStream method can also be trimmed, which means if nothing else in the app is using these compression streams, they can also be trimmed.

  • dotnet/runtime#80884 is another example, saving ~90KB of size when Regex is used by just being a bit more intentional about what types are being used in its implementation (e.g. using a bool[30] instead of a HashSet<UnicodeCategory> to store a bitmap).
  • Or particularly interesting, dotnet/runtime#84169, which adds a new feature switch to System.Xml. Various APIs in System.Xml use Uri, which can trigger use of XmlUrlResolver, which in turn references the networking stack; an app that’s using XML but not otherwise using networking can end up inadvertently bringing in upwards of 3MB of networking code, just by using an API like XDocument.Load("filepath.xml"). Such an app can use the <XmlResolverIsNetworkingEnabledByDefault> MSBuild property added in dotnet/sdk#34412 to enable all of those code paths in XML to be trimmed away.
  • ActivatorUtilities.CreateFactory in Microsoft.Extensions.DependencyInjection.Abstractions tries to optimize throughput by spending some time upfront to build a factory that’s then very efficient at creating things. Its main strategy for doing so involved using System.Linq.Expressions as a simpler API for using reflection emit, building up custom IL for the exact thing being constructed. When you have a JIT, that can work very well. But when dynamic code isn’t supported, System.Linq.Expressions can’t use reflection emit and instead falls back to using an interpreter. That makes such an “optimization” in CreateFactory actually a deoptimization, plus it brings with it the size impact of System.Linq.Expressions.dll. dotnet/runtime#81262 adds a reflection-based alternative for when !RuntimeFeature.IsDynamicCodeSupported, resulting in faster code and allowing the System.Linq.Expressions usage to be trimmed away.
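The general shape of branching on RuntimeFeature.IsDynamicCodeSupported can be sketched as follows. This is not the ActivatorUtilities implementation; CreateFactory here is a hypothetical helper that only handles parameterless constructors, and it is meant to show the two strategies side by side rather than to be trim-safe as written.

```csharp
using System;
using System.Linq.Expressions;
using System.Runtime.CompilerServices;
using System.Text;

// Hypothetical factory builder: picks a construction strategy based on
// whether the runtime can generate code dynamically.
static Func<object> CreateFactory(Type type)
{
    if (RuntimeFeature.IsDynamicCodeSupported)
    {
        // With a JIT available, the expression tree is compiled to real IL.
        Expression body = Expression.Convert(Expression.New(type), typeof(object));
        return Expression.Lambda<Func<object>>(body).Compile();
    }

    // Without dynamic code (e.g. Native AOT), plain reflection avoids the
    // expression interpreter entirely.
    return () => Activator.CreateInstance(type)
        ?? throw new InvalidOperationException("No instance was created.");
}

Func<object> factory = CreateFactory(typeof(StringBuilder));
Console.WriteLine(factory().GetType().Name); // StringBuilder
```

Either branch returns a delegate with the same behavior, so callers don’t need to know which one was chosen.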

Of course, while size was a large focus for .NET 8, there are a multitude of other ways in which performance with Native AOT has improved. For example, dotnet/runtime#79709 and dotnet/runtime#80969 avoid helper calls as part of reading static fields. BenchmarkDotNet works with Native AOT as well, so we can run the following benchmark to compare; instead of using --runtimes net7.0 net8.0, we just use --runtimes nativeaot7.0 nativeaot8.0 (BenchmarkDotNet also currently doesn’t support the [DisassemblyDiagnoser] with Native AOT):

// dotnet run -c Release -f net7.0 --filter "*" --runtimes nativeaot7.0 nativeaot8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly int s_configValue = 42;
    [Benchmark]
    public int GetConfigValue() => s_configValue;
}

For that, BenchmarkDotNet outputs:

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| GetConfigValue | NativeAOT 7.0 | 1.1759 ns | 1.000 |
| GetConfigValue | NativeAOT 8.0 | 0.0000 ns | 0.000 |

including:

// * Warnings *
ZeroMeasurement
  Tests.GetConfigValue: Runtime=NativeAOT 8.0, Toolchain=Latest ILCompiler -> The method duration is indistinguishable from the empty method duration

(When looking at the output of optimizations, that warning always brings a smile to my face.)

dotnet/runtime#83054 is another good example. It improves upon EqualityComparer<T> support in Native AOT by ensuring that the comparer can be stored in a static readonly to enable better constant folding in consumers.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes nativeaot7.0 nativeaot8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly int[] _array = Enumerable.Range(0, 1000).ToArray();
    [Benchmark]
    public int FindIndex() => FindIndex(_array, 999);
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int FindIndex<T>(T[] array, T value)
    {
        for (int i = 0; i < array.Length; i++)
            if (EqualityComparer<T>.Default.Equals(array[i], value))
                return i;
        return -1;
    }
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| FindIndex | NativeAOT 7.0 | 876.2 ns | 1.00 |
| FindIndex | NativeAOT 8.0 | 367.8 ns | 0.42 |

As another example, dotnet/runtime#83911 avoids some overhead related to static class initialization. As we discussed in the JIT section, the JIT is able to rely on tiering to know that a static field accessed by a method must have already been initialized if the method is being promoted from tier 0 to tier 1, but tiering doesn’t exist in the Native AOT world, so this PR adds a fast-path check to help avoid most of the costs.

Other fundamental support has also improved. dotnet/runtime#79519, for example, changes how locks are implemented for Native AOT, employing a hybrid approach that starts with a lightweight spinlock and upgrades to using the System.Threading.Lock type (which is currently internal to Native AOT but likely to ship publicly in .NET 9).

VM

The VM is, loosely speaking, the part of the runtime that’s not the JIT or the GC. It’s what handles things like assembly and type loading. While there were a multitude of improvements throughout, I’ll call out three notable improvements.

First, dotnet/runtime#79021 optimized the operation of mapping an instruction pointer to a MethodDesc (a data structure that represents a method, with various pieces of information about it, like its signature), which happens in particular any time stack walking is performed (e.g. exception handling, Environment.StackTrace, etc.) and as part of some delegate creations. The change not only makes this conversion faster but also mostly lock-free, which means on a benchmark like the following, there’s a significant improvement for sequential use but an even larger one for multi-threaded use:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    [Benchmark]
    public void InSerial()
    {
        for (int i = 0; i < 10_000; i++)
        {
            CreateDelegate<string>();
        }
    }
    [Benchmark]
    public void InParallel()
    {
        Parallel.For(0, 10_000, i =>
        {
            CreateDelegate<string>();
        });
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Action<T> CreateDelegate<T>() => new Action<T>(GenericMethod);
    private static void GenericMethod<T>(T t) { }
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| InSerial | .NET 7.0 | 1,868.4 us | 1.00 |
| InSerial | .NET 8.0 | 706.5 us | 0.38 |
| InParallel | .NET 7.0 | 1,247.3 us | 1.00 |
| InParallel | .NET 8.0 | 222.9 us | 0.18 |

Second, dotnet/runtime#83632 improves the performance of the ExecutableAllocator. This allocator is responsible for allocation related to all executable memory in the runtime, e.g. the JIT uses it to get memory into which to write the generated code that will then need to be executed. When memory is mapped, it has permissions associated with it for what can be done with that memory, e.g. can it be read and written, can it be executed, etc. The allocator maintains a cache, and this PR improved the performance of the allocator by reducing the number of cache misses incurred and reducing the cost of those cache misses when they do occur.

Third, dotnet/runtime#85743 makes a variety of changes focused on significantly reducing startup time. This includes reducing the amount of time spent on validation of types in R2R images, making lookups for generic parameters and nested types in R2R images much faster due to dedicated metadata in the R2R image, converting an O(n^2) lookup into an O(1) lookup by storing an additional index in a method description, and ensuring that vtable chunks are always shared.

GC

At the beginning of this post, I suggested that <ServerGarbageCollection>true</ServerGarbageCollection> be added to the csproj used for running the benchmarks in this post. That setting configures the GC to run in “server” mode, as opposed to “workstation” mode. The workstation mode was designed for use with client applications and is less resource intensive, preferring to use less memory but at the possible expense of throughput and scalability if the system is placed under heavier load. In contrast, the server mode was designed for larger-scale services. It is much more resource hungry, with a dedicated heap by default per logical core in the machine, and a dedicated thread per heap for servicing that heap, but it is also significantly more scalable. This tradeoff often leads to complication, as while applications might demand the scalability of the server GC, they may also want memory consumption closer to that of workstation, at least at times when demand is lower and the service needn’t have so many heaps.

In .NET 8, the server GC now has support for a dynamic heap count, thanks to dotnet/runtime#86245, dotnet/runtime#87618, and dotnet/runtime#87619, which add a feature dubbed “Dynamic Adaptation To Application Sizes”, or DATAS. It’s off-by-default in .NET 8 in general (though on-by-default when publishing for Native AOT), but it can be enabled trivially, either by setting the DOTNET_GCDynamicAdaptationMode environment variable to 1, or via the <GarbageCollectionAdaptationMode>1</GarbageCollectionAdaptationMode> MSBuild property. The employed algorithm is able to increase and decrease the heap count over time, trying to maximize its view of throughput, and maintaining a balance between that and overall memory footprint.

Here’s a simple example. I create a console app with <ServerGarbageCollection>true</ServerGarbageCollection> in the .csproj and the following code in Program.cs, which just spawns a bunch of threads that continually allocate, and then repeatedly prints out the working set:

// dotnet run -c Release -f net8.0using System.Diagnostics;for (int i = 0; i < 32; i++){    new Thread(() =>    {        while (true) Array.ForEach(new byte[1], b => { });    }).Start();}using Process process = Process.GetCurrentProcess();while (true){    process.Refresh();    Console.WriteLine($"{process.WorkingSet64:N0}");    Thread.Sleep(1000);}

When I run that, I consistently see output like:

    154,226,688
    154,226,688
    154,275,840
    154,275,840
    154,816,512
    154,816,512
    154,816,512
    154,824,704
    154,824,704
    154,824,704

When I then add <GarbageCollectionAdaptationMode>1</GarbageCollectionAdaptationMode> to the .csproj, the working set drops significantly:

    71,430,144
    72,187,904
    72,196,096
    72,196,096
    72,245,248
    72,245,248
    72,245,248
    72,245,248
    72,245,248
    72,253,440

For a more detailed examination of the feature and plans for it, see Dynamically Adapting To Application Sizes.

Mono

Thus far I’ve referred to “the runtime”, “the JIT”, “the GC”, and so on. That’s all in the context of the “CoreCLR” runtime, which is the primary runtime used for console applications, ASP.NET applications, services, desktop applications, and the like. For mobile and browser .NET applications, however, the primary runtime used is the “Mono” runtime. And it also has seen some huge improvements in .NET 8, improvements that accrue to scenarios like Blazor WebAssembly apps.

Just as with CoreCLR there’s the ability to both JIT and AOT, there are multiple ways in which code can be shipped for Mono. Mono includes an AOT compiler; for WASM in particular, the AOT compiler enables all of the IL to be compiled to WASM, which is then shipped down to the browser. As with CoreCLR, however, AOT is opt-in. The default experience for WASM is to use an interpreter: the IL is shipped down to the browser, and the interpreter (which itself is compiled to WASM) then interprets the IL. Of course, interpretation has performance implications, and so .NET 7 augmented the interpreter with a tiering scheme similar in concept to the tiering employed by the CoreCLR JIT. The interpreter has its own representation of the code to be interpreted, and the first few times a method is invoked, it just interprets that byte code with little effort put into optimizing it. Then after enough invocations, the interpreter will take some time to optimize that internal representation so as to speed up subsequent interpretations. Even with that, however, it’s still interpreting: it’s still an interpreter implemented in WASM reading instructions for what to do and doing them. One of the most notable improvements to Mono in .NET 8 expands on this tiering by introducing a partial JIT into the interpreter. dotnet/runtime#76477 provided the initial code for this “jiterpreter,” as some folks refer to it. As part of the interpreter, this JIT is able to participate in the same data structures used by the interpreter and process the same byte code, and works by replacing sequences of that byte code with on-the-fly generated WASM. That could be a whole method, it could just be a hot loop within a method, or it could be just a few instructions. This provides significant flexibility, including a very progressive on-ramp where optimizations can be added incrementally, shifting more and more logic from interpretation to jitted WASM. Dozens of PRs went into making the jiterpreter a reality for .NET 8, such as dotnet/runtime#82773 that added basic SIMD support, dotnet/runtime#82756 that added basic loop support, and dotnet/runtime#83247 that added a control-flow optimization pass.

Let’s see this in action. I created a new .NET 7 Blazor WebAssembly project, added a NuGet reference to the System.IO.Hashing project, and replaced the contents of Counter.razor with the following:

    @page "/counter"
    @using System.Diagnostics;
    @using System.IO.Hashing;
    @using System.Text;
    @using System.Threading.Tasks;

    <h1>.NET 7</h1>
    <p role="status">Current time: @_time</p>
    <button @onclick="Hash">Click me</button>

    @code {
        private TimeSpan _time;

        private void Hash()
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 50_000; i++) XxHash64.HashToUInt64(_data);
            _time = sw.Elapsed;
        }

        private byte[] _data =
            @"Shall I compare thee to a summer's day?
              Thou art more lovely and more temperate:
              Rough winds do shake the darling buds of May,
              And summer's lease hath all too short a date;
              Sometime too hot the eye of heaven shines,
              And often is his gold complexion dimm'd;
              And every fair from fair sometime declines,
              By chance or nature's changing course untrimm'd;
              But thy eternal summer shall not fade,
              Nor lose possession of that fair thou ow'st;
              Nor shall death brag thou wander'st in his shade,
              When in eternal lines to time thou grow'st:
              So long as men can breathe or eyes can see,
              So long lives this, and this gives life to thee."u8.ToArray();
    }

Then I did the exact same thing, but for .NET 8, built both in Release, and ran them both. When the resulting page opened for each, I clicked the “Click me” button (a few times, but it didn’t change the results).

[Figure: Interpreted WASM on .NET 7 vs .NET 8]

The timing measurements for how long the operation took in .NET 7 compared to .NET 8 speak for themselves.

Beyond the jiterpreter, the interpreter itself saw a multitude of improvements, for example:

  • dotnet/runtime#79165 added special handling of the stobj IL instruction for when the value type doesn’t contain any references, and thus doesn’t need to interact with the GC.
  • dotnet/runtime#80046 special-cased a compare followed by brtrue/brfalse, creating a single interpreter opcode for the very common pattern.
  • dotnet/runtime#79392 added an intrinsic to the interpreter for string creation.
  • dotnet/runtime#78840 added a cache to the Mono runtime (including for but not limited to the interpreter) for various pieces of information about types, like IsValueType, IsGenericTypeDefinition, and IsDelegate.
  • dotnet/runtime#81782 added intrinsics for some of the most common operations on Vector128, and dotnet/runtime#86859 augmented this to use those same opcodes for Vector<T>.
  • dotnet/runtime#83498 special-cased division by powers of 2 to instead employ shifts.
  • dotnet/runtime#83490 tweaked the inlining size limit to ensure that key methods could be inlined, like List<T>’s indexer.
  • dotnet/runtime#85528 added devirtualization support in situations where enough type information is available to enable doing so.

I’ve already alluded several times to vectorization in Mono, but in its own right this has been a big area of focus for Mono in .NET 8, across all backends. As of dotnet/runtime#86546, which completed adding Vector128<T> support for Mono’s AMD64 JIT backend, Vector128<T> is now supported across all Mono backends. Mono’s WASM backends not only support Vector128<T>; .NET 8 also includes the new System.Runtime.Intrinsics.Wasm.PackedSimd type, which is specific to WASM and exposes hundreds of overloads that map down to WASM SIMD operations. The basis for this type was introduced in dotnet/runtime#73289, where the initial SIMD support was added as internal. dotnet/runtime#76539 continued the effort by adding more functionality and also making the type public, as it now is in .NET 8. Over a dozen PRs continued to build it out, such as dotnet/runtime#80145 that added ConditionalSelect intrinsics, dotnet/runtime#87052 and dotnet/runtime#87828 that added load and store intrinsics, dotnet/runtime#85705 that added floating-point support, and dotnet/runtime#88595, which overhauled the surface area based on learnings since its initial design.
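To make the value of cross-backend Vector128<T> support concrete, here is a small sketch of the kind of portable vectorized code that benefits from it. This is my own illustrative example (the Sum helper is hypothetical, not something from the runtime), and it assumes the .NET 7+ Vector128 APIs. Whether the vector path is actually accelerated depends on the backend and on SIMD being enabled for the target.

```csharp
using System;
using System.Runtime.Intrinsics;

// Sums a span of ints, using Vector128<int> for the bulk of the data
// and a scalar loop for whatever is left over.
static int Sum(ReadOnlySpan<int> values)
{
    int i = 0;
    int total = 0;

    if (Vector128.IsHardwareAccelerated && values.Length >= Vector128<int>.Count)
    {
        Vector128<int> acc = Vector128<int>.Zero;
        for (; i <= values.Length - Vector128<int>.Count; i += Vector128<int>.Count)
        {
            // Load the next Count elements and add them lane-wise.
            acc += Vector128.Create(values.Slice(i, Vector128<int>.Count));
        }

        // Collapse the lanes into a single scalar.
        total = Vector128.Sum(acc);
    }

    // Scalar tail (or the whole input when vectors aren't accelerated).
    for (; i < values.Length; i++)
    {
        total += values[i];
    }

    return total;
}

int[] data = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
Console.WriteLine(Sum(data)); // 55
```

The same source works unchanged whether or not the vector path is taken, since the scalar loop handles both the tail and the non-accelerated case.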

Another effort in .NET 8, related to app size, has been around reducing reliance on ICU’s data files (ICU is the globalization library employed by .NET and many other systems). Instead, the goal is to rely on the target platform’s native APIs wherever possible (for WASM, APIs provided by the browser). This effort is referred to as “hybrid globalization,” because the dependence on ICU’s data files still remains, it’s just lessened, and it comes with behavioral changes, so it’s opt-in for situations where someone really wants the smaller size and is willing to deal with the behavioral accommodations. A multitude of PRs have also gone into making this a reality for .NET 8, such as dotnet/runtime#81470, dotnet/runtime#84019, and dotnet/runtime#84249. To enable the feature, you can add <HybridGlobalization>true</HybridGlobalization> to your .csproj, and for more information, there’s a good design document that goes into much more depth.

Threading

Recent releases of .NET saw huge improvements to the area of threading, parallelism, concurrency, and asynchrony, such as a complete rewrite of the ThreadPool (in .NET 6 and .NET 7), a complete rewrite of the async method infrastructure (in .NET Core 2.1), a complete rewrite of ConcurrentQueue<T> (in .NET Core 2.0), and so on. This release doesn’t include such massive overhauls, but it does include some thoughtful and impactful improvements.

ThreadStatic

The .NET runtime makes it easy to associate data with a thread, often referred to as thread-local storage (TLS). The most common way to achieve this is by annotating a static field with the [ThreadStatic] attribute (another for more advanced uses is via the ThreadLocal<T> type), which causes the runtime to replicate the storage for that field to be per thread rather than global for the process.

    private static int s_onePerProcess;

    [ThreadStatic]
    private static int t_onePerThread;
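To see the difference between the two fields in practice, here’s a small sketch that bumps both from the main thread and from a second thread. The Counters class and its Bump/Read helpers are hypothetical names of mine, introduced purely for illustration.

```csharp
using System;
using System.Threading;

// Increment both counters twice on the main thread.
Counters.Bump();
Counters.Bump();

// Increment them once on a second thread and capture what that thread observed.
(int perProcess, int perThread) seenByWorker = default;
var worker = new Thread(() => seenByWorker = Counters.Bump());
worker.Start();
worker.Join();

// worker saw: process=3, thread=1
Console.WriteLine($"worker saw: process={seenByWorker.perProcess}, thread={seenByWorker.perThread}");
// main sees:  process=3, thread=2
Console.WriteLine($"main sees:  process={Counters.Read().perProcess}, thread={Counters.Read().perThread}");

static class Counters
{
    private static int s_onePerProcess;   // one slot shared by every thread

    [ThreadStatic]
    private static int t_onePerThread;    // a separate slot for each thread

    // Increments both counters and returns the values the calling thread observes.
    public static (int perProcess, int perThread) Bump() =>
        (Interlocked.Increment(ref s_onePerProcess), ++t_onePerThread);

    // Reads both counters from the calling thread's point of view.
    public static (int perProcess, int perThread) Read() =>
        (Volatile.Read(ref s_onePerProcess), t_onePerThread);
}
```

The process-wide field reflects all three increments, while each thread only ever sees its own increments of the [ThreadStatic] field. Note that the per-thread field needs no interlocked operation, since no other thread can touch that storage.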

Historically, accessing such a [ThreadStatic] field has required a non-inlined JIT helper call (e.g. CORINFO_HELP_GETSHARED_NONGCTHREADSTATIC_BASE_NOCTOR), but now with dotnet/runtime#82973 and dotnet/runtime#85619, the common and fast path from that helper can be inlined into the caller. We can see this with a simple benchmark that just increments an int stored in a [ThreadStatic].

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
    // dotnet run -c Release -f net7.0 --filter "*" --runtimes nativeaot7.0 nativeaot8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    public partial class Tests
    {
        [ThreadStatic]
        private static int t_value;

        [Benchmark]
        public int Increment() => ++t_value;
    }

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Increment | .NET 7.0 | 8.492 ns | 1.00 |
| Increment | .NET 8.0 | 1.453 ns | 0.17 |

[ThreadStatic] was similarly optimized for Native AOT, via both dotnet/runtime#84566 and dotnet/runtime#87148:

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Increment | NativeAOT 7.0 | 2.305 ns | 1.00 |
| Increment | NativeAOT 8.0 | 1.325 ns | 0.57 |

ThreadPool

Let’s try an experiment. Create a new console app, and add <PublishAot>true</PublishAot> to the .csproj. Then make the entirety of the program this:

    // dotnet run -c Release -f net8.0

    Task.Run(() => Console.WriteLine(Environment.StackTrace)).Wait();

The idea is to see the stack trace of a work item running on a ThreadPool thread. Now run it, and you should see something like this:

       at System.Environment.get_StackTrace()
       at Program.<>c.<<Main>$>b__0_0()
       at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
       at System.Threading.ThreadPoolWorkQueue.Dispatch()
       at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()

The important piece here is the bottom line: we see we’re being called from the PortableThreadPool, which is the managed thread pool implementation that’s been used across operating systems since .NET 6. Now, instead of running directly, let’s publish for Native AOT and run the resulting app (for the specific thing we’re looking for, this part should be done on Windows).

    dotnet publish -c Release -r win-x64
    D:\examples\tmpapp\bin\Release\net8.0\win-x64\publish\tmpapp.exe

Now, we see this:

       at System.Environment.get_StackTrace() + 0x21
       at Program.<>c.<<Main>$>b__0_0() + 0x9
       at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread, ExecutionContext, ContextCallback, Object) + 0x3d
       at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task&, Thread) + 0xcc
       at System.Threading.ThreadPoolWorkQueue.Dispatch() + 0x289
       at System.Threading.WindowsThreadPool.DispatchCallback(IntPtr, IntPtr, IntPtr) + 0x45

Again, note the last line: “WindowsThreadPool.” Applications published with Native AOT on Windows have historically used a ThreadPool implementation that wraps the Windows thread pool. The work item queues and dispatching code are all the same as with the portable pool, but the thread management itself is delegated to the Windows pool. Now in .NET 8 with dotnet/runtime#85373, projects on Windows have the option of using either pool; Native AOT apps can opt to instead use the portable pool, and other apps can opt to instead use the Windows pool. Opting in or out is easy: in a <PropertyGroup/> in the .csproj, add <UseWindowsThreadPool>false</UseWindowsThreadPool> to opt-out in a Native AOT app, and conversely use true in other apps to opt-in. When using this MSBuild switch, in a Native AOT app, whichever pool isn’t being used can automatically be trimmed away. For experimentation, the DOTNET_ThreadPool_UseWindowsThreadPool environment variable can also be set to 0 or 1 to explicitly opt out or in, respectively.

There’s currently no hard-and-fast rule about why one pool might be better; the option has been added to allow developers to experiment. We’ve seen with the Windows pool that I/O doesn’t scale as well on larger machines as it does with the portable pool. However, if the Windows thread pool is already being used heavily elsewhere in the application, consolidating into the same pool can reduce oversubscription. Further, if thread pool threads get blocked very frequently, the Windows thread pool has more information about that blocking and can potentially handle those scenarios more efficiently. We can see this with a simple example. Compile this code:

    // dotnet run -c Release -f net8.0

    using System.Diagnostics;

    var sw = Stopwatch.StartNew();

    var barrier = new Barrier(Environment.ProcessorCount * 2 + 1);
    for (int i = 0; i < barrier.ParticipantCount; i++)
    {
        ThreadPool.QueueUserWorkItem(id =>
        {
            Console.WriteLine($"{sw.Elapsed}: {id}");
            barrier.SignalAndWait();
        }, i);
    }

    barrier.SignalAndWait();
    Console.WriteLine($"Done: {sw.Elapsed}");

This is a dastardly repro that creates a bunch of work items, all of which block until all of the work items have been processed: basically it takes every thread the thread pool gives it and never gives it back (until the program exits). When I run this on my machine where Environment.ProcessorCount is 12, I get output like this:

    00:00:00.0038906: 0
    00:00:00.0038911: 1
    00:00:00.0042401: 4
    00:00:00.0054198: 9
    00:00:00.0047249: 6
    00:00:00.0040724: 3
    00:00:00.0044894: 5
    00:00:00.0052228: 8
    00:00:00.0049638: 7
    00:00:00.0056831: 10
    00:00:00.0039327: 2
    00:00:00.0057127: 11
    00:00:01.0265278: 12
    00:00:01.5325809: 13
    00:00:02.0471848: 14
    00:00:02.5628161: 15
    00:00:03.5805581: 16
    00:00:04.5960218: 17
    00:00:05.1087192: 18
    00:00:06.1142907: 19
    00:00:07.1331915: 20
    00:00:07.6467355: 21
    00:00:08.1614072: 22
    00:00:08.6749720: 23
    00:00:08.6763938: 24
    Done: 00:00:08.6768608

The portable pool quickly injects Environment.ProcessorCount threads, but after that it proceeds to only inject an additional thread once or twice a second. Now, set DOTNET_ThreadPool_UseWindowsThreadPool to 1 and try again:

    00:00:00.0034909: 3
    00:00:00.0036281: 4
    00:00:00.0032404: 0
    00:00:00.0032727: 1
    00:00:00.0032703: 2
    00:00:00.0447256: 5
    00:00:00.0449398: 6
    00:00:00.0451899: 7
    00:00:00.0454245: 8
    00:00:00.0456907: 9
    00:00:00.0459155: 10
    00:00:00.0461399: 11
    00:00:00.0463612: 12
    00:00:00.0465538: 13
    00:00:00.0467497: 14
    00:00:00.0469477: 15
    00:00:00.0471055: 16
    00:00:00.0472961: 17
    00:00:00.0474888: 18
    00:00:00.0477131: 19
    00:00:00.0478795: 20
    00:00:00.0480844: 21
    00:00:00.0482900: 22
    00:00:00.0485110: 23
    00:00:00.0486981: 24
    Done: 00:00:00.0498603

Zoom. The Windows pool is much more aggressive about injecting threads here. Whether that’s good or bad can depend on your scenario. If you’ve found yourself setting a really high minimum thread pool thread count for your application, you might want to give this option a go.
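For reference, the “really high minimum thread count” workaround typically looks something like the following sketch. The value chosen here (twice the processor count plus one) is an arbitrary assumption of mine that simply mirrors the repro above; it is not a recommendation, and I haven’t verified how these APIs interact with the Windows pool when it’s enabled.

```csharp
using System;
using System.Threading;

// Read the current floor for worker and I/O completion threads.
ThreadPool.GetMinThreads(out int minWorker, out int minIo);
Console.WriteLine($"Default minimums: worker={minWorker}, io={minIo}");

// Raise the worker floor so that a burst of blocking work items doesn't
// have to wait for the pool's gradual thread injection.
int desired = Environment.ProcessorCount * 2 + 1;
bool changed = ThreadPool.SetMinThreads(desired, minIo);
Console.WriteLine($"SetMinThreads({desired}, {minIo}) succeeded: {changed}");

// Confirm the new floor and see how many threads the pool has right now.
ThreadPool.GetMinThreads(out minWorker, out _);
Console.WriteLine($"New worker minimum: {minWorker}");
Console.WriteLine($"Threads currently in the pool: {ThreadPool.ThreadCount}");
```

With the portable pool, adding the SetMinThreads call at the top of the barrier repro should let it finish without the one-or-two-per-second injection delay, at the cost of telling the pool up front to keep that many threads available on demand.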

Tasks

Even with all the improvements to async/await in previous releases, this release sees async methods get cheaper still, both when they complete synchronously and when they complete asynchronously.

When an async Task/Task<TResult>-returning method completes synchronously, it tries to give back a cached task object rather than creating a new one and incurring the allocation. In the case of Task, that’s easy, it can simply use Task.CompletedTask. In the case of Task<TResult>, it uses a cache that stores cached tasks for some TResult values. When TResult is Boolean, for example, it can successfully cache a Task<bool> for both true and false, such that it’ll always successfully avoid the allocation. For int, it caches a few tasks for common values (e.g. -1 through 8). For reference types, it caches a task for null. And for the primitive integer types (sbyte, byte, short, ushort, char, int, uint, long, ulong, nint, and nuint), it caches a task for 0. It used to be that all of this logic was dedicated to async methods, but in .NET 6 that logic moved into Task.FromResult, such that all use of Task.FromResult now benefits from this caching. In .NET 8, thanks to dotnet/runtime#76349 and dotnet/runtime#87541, the caching is improved further. In particular, the optimization of caching a task for 0 for the primitive types is extended to be the caching of a task for default(TResult) for any value type TResult that is 1, 2, 4, 8, or 16 bytes. In such cases, we can do an unsafe cast to one of these primitives, and then use that primitive’s equality to compare against default. If that comparison is true, it means the value is entirely zeroed, which means we can use a cached task for Task<TResult> created from default(TResult), as that is also entirely zeroed. What if that type has a custom equality comparer? That actually doesn’t matter, since the original value and the one stored in the cached task have identical bit patterns, which means they’re indistinguishable. The net effect of this is we can cache tasks for other commonly used types.

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        [Benchmark] public async Task<TimeSpan> ZeroTimeSpan() => TimeSpan.Zero;
        [Benchmark] public async Task<DateTime> MinDateTime() => DateTime.MinValue;
        [Benchmark] public async Task<Guid> EmptyGuid() => Guid.Empty;
        [Benchmark] public async Task<DayOfWeek> Sunday() => DayOfWeek.Sunday;
        [Benchmark] public async Task<decimal> ZeroDecimal() => 0m;
        [Benchmark] public async Task<double> ZeroDouble() => 0;
        [Benchmark] public async Task<float> ZeroFloat() => 0;
        [Benchmark] public async Task<Half> ZeroHalf() => (Half)0f;
        [Benchmark] public async Task<(int, int)> ZeroZeroValueTuple() => (0, 0);
    }

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| ZeroTimeSpan | .NET 7.0 | 31.327 ns | 1.00 | 72 B | 1.00 |
| ZeroTimeSpan | .NET 8.0 | 8.851 ns | 0.28 | - | 0.00 |
| MinDateTime | .NET 7.0 | 31.457 ns | 1.00 | 72 B | 1.00 |
| MinDateTime | .NET 8.0 | 8.277 ns | 0.26 | - | 0.00 |
| EmptyGuid | .NET 7.0 | 32.233 ns | 1.00 | 80 B | 1.00 |
| EmptyGuid | .NET 8.0 | 9.013 ns | 0.28 | - | 0.00 |
| Sunday | .NET 7.0 | 30.907 ns | 1.00 | 72 B | 1.00 |
| Sunday | .NET 8.0 | 8.235 ns | 0.27 | - | 0.00 |
| ZeroDecimal | .NET 7.0 | 33.109 ns | 1.00 | 80 B | 1.00 |
| ZeroDecimal | .NET 8.0 | 13.110 ns | 0.40 | - | 0.00 |
| ZeroDouble | .NET 7.0 | 30.863 ns | 1.00 | 72 B | 1.00 |
| ZeroDouble | .NET 8.0 | 8.568 ns | 0.28 | - | 0.00 |
| ZeroFloat | .NET 7.0 | 31.025 ns | 1.00 | 72 B | 1.00 |
| ZeroFloat | .NET 8.0 | 8.531 ns | 0.28 | - | 0.00 |
| ZeroHalf | .NET 7.0 | 33.906 ns | 1.00 | 72 B | 1.00 |
| ZeroHalf | .NET 8.0 | 9.008 ns | 0.27 | - | 0.00 |
| ZeroZeroValueTuple | .NET 7.0 | 33.339 ns | 1.00 | 72 B | 1.00 |
| ZeroZeroValueTuple | .NET 8.0 | 11.274 ns | 0.34 | - | 0.00 |
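Since the caching lives in Task.FromResult, you can also observe it directly, without a benchmark, by checking whether two calls hand back the very same object. The IsCached helper below is a hypothetical name of mine for illustration. Based on the rules described above, I’d expect true and 5 to report as cached on .NET 6 and later, the zeroed TimeSpan and Guid to report as cached only on .NET 8, and 1000 and a non-zero TimeSpan to never be cached, but treat the output on your runtime as the source of truth.

```csharp
using System;
using System.Threading.Tasks;

// Returns true if two calls with the same value hand back the same Task object,
// which can only happen if the task came from a cache.
static bool IsCached<T>(T value) =>
    ReferenceEquals(Task.FromResult(value), Task.FromResult(value));

Console.WriteLine($"true:           {IsCached(true)}");
Console.WriteLine($"5:              {IsCached(5)}");
Console.WriteLine($"1000:           {IsCached(1000)}");
Console.WriteLine($"TimeSpan.Zero:  {IsCached(TimeSpan.Zero)}");
Console.WriteLine($"Guid.Empty:     {IsCached(Guid.Empty)}");
Console.WriteLine($"1 second:       {IsCached(TimeSpan.FromSeconds(1))}");
```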

Those changes helped some async methods to become leaner when they complete synchronously. Other changes have helped practically all async methods to become leaner when they complete asynchronously. When an async method suspends for the first time, assuming it’s returning Task/Task<TResult>/ValueTask/ValueTask<TResult> and the default async method builders are in use (i.e. they haven’t been overridden using [AsyncMethodBuilder(...)] on the method in question), a single allocation occurs: the task object to be returned. That task object is actually a type derived from Task (in the implementation today the internal type is called AsyncStateMachineBox<TStateMachine>) and that has on it a strongly-typed field for the state machine struct generated by the C# compiler. In fact, as of .NET 7, it has three additional fields beyond what’s on the base Task<TResult>:

  1. One to hold the TStateMachine state machine struct generated by the C# compiler.
  2. One to cache an Action delegate that points to MoveNext.
  3. One to store an ExecutionContext to flow to the next MoveNext invocation.

If we can trim down the fields required, we can make every async method less expensive by allocating smaller instead of larger objects. That’s exactly what dotnet/runtime#83696 and dotnet/runtime#83737 accomplish, together shaving 16 bytes (in a 64-bit process) off the size of every such async method task. How?

The C# language allows anything to be awaitable as long as it follows the right pattern, exposing a GetAwaiter() method that returns a type with the right shape. That pattern includes a set of “OnCompleted” methods that take an Action delegate, enabling the async method builder to provide a continuation to the awaiter, such that when the awaited operation completes, it can invoke the Action to resume the method’s processing. As such, the AsyncStateMachineBox type has on it a field used to cache an Action delegate that’s lazily created to point to its MoveNext method; that Action is created during the first suspending await where it’s needed and can then be used for all subsequent awaits, such that the Action is allocated at most once for the lifetime of an async method, regardless of how many times the invocation suspends. (The delegate is only needed, however, if the state machine awaits something that’s not a known awaiter; the runtime has fast paths that avoid requiring that Action when awaiting all of the built-in awaiters). Interestingly, though, Task itself has a field for storing a delegate, and that field is only used when the Task is created to invoke a delegate (e.g. Task.Run, ContinueWith, etc.). Since most tasks allocated today come from async methods, that means that the majority of tasks have all had a wasted field. It turns out we can just use that base field on the Task for this cached MoveNext Action as well, making the field relevant to almost all tasks, and allowing us to remove the extra Action field on the state machine box.

There’s another existing field on the base Task that also goes unused in async methods: the state object field. When you use a method like StartNew or ContinueWith to create a Task, you can provide an object state that’s then passed to the Task’s delegate. In an async method, though, the field just sits there, unused, lonely, forgotten, forlorn. Instead of having a separate field for the ExecutionContext, then, we can just store the ExecutionContext in this existing state field (being careful not to allow it to be exposed via the Task’s AsyncState property that normally exposes the object state).
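That parenthetical about AsyncState is observable from the outside. As a small sketch (the YieldOnceAsync method is just an illustrative name of mine), a task created with an explicit state object exposes it, while the task for an async method reports null regardless of what the runtime may be storing in the underlying field internally:

```csharp
using System;
using System.Threading.Tasks;

// A task created with an explicit state object exposes it via AsyncState.
Task withState = Task.Factory.StartNew(state => { }, "my state");
Console.WriteLine(withState.AsyncState);           // my state

// An async method's task has no user-supplied state, so AsyncState is null,
// both while the method is suspended and after it completes.
Task fromAsync = YieldOnceAsync();
Console.WriteLine(fromAsync.AsyncState is null);   // True
await fromAsync;
Console.WriteLine(fromAsync.AsyncState is null);   // True

static async Task YieldOnceAsync() => await Task.Yield();
```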

We can see the effect of getting rid of those two fields with a simple benchmark like this:

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        [Benchmark]
        public async Task YieldOnce() => await Task.Yield();
    }

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| YieldOnce | .NET 7.0 | 918.6 ns | 1.00 | 112 B | 1.00 |
| YieldOnce | .NET 8.0 | 865.8 ns | 0.94 | 96 B | 0.86 |

Note the 16-byte decrease just as we predicted.

Async method overheads are reduced in other ways, too. dotnet/runtime#82181, for example, shrinks the size of the ManualResetValueTaskSourceCore<TResult> type that’s used as the workhorse for custom IValueTaskSource/IValueTaskSource<TResult> implementations; it takes advantage of the 99.9% case to use a single field for something that previously required two fields. But my favorite addition in this regard is dotnet/runtime#22144, which adds new ConfigureAwait overloads. Yes, I know ConfigureAwait is a sore subject with some, but these new overloads a) address a really useful scenario that many folks end up writing their own custom awaiters for, b) do it in a way that’s cheaper than custom solutions can provide, and c) actually help with the ConfigureAwait naming, as it fulfills the original purpose of ConfigureAwait that led us to name it that in the first place. When ConfigureAwait was originally devised, we debated many names, and we settled on “ConfigureAwait” because that’s what it was doing: it was allowing you to provide arguments that configured how the await behaved. Of course, for the last decade, the only configuration you’ve been able to do is pass a single Boolean to indicate whether to capture the current context / scheduler or not, and that in part has led folks to bemoan the naming as overly verbose for something that’s a single bool. Now in .NET 8, there are new overloads of ConfigureAwait that take a ConfigureAwaitOptions enum:

    [Flags]
    public enum ConfigureAwaitOptions
    {
        None = 0,
        ContinueOnCapturedContext = 1,
        SuppressThrowing = 2,
        ForceYielding = 4,
    }

ContinueOnCapturedContext you know; that’s the same as ConfigureAwait(true) today. ForceYielding is something that comes up now and again in various capacities, but essentially you’re awaiting something and rather than continuing synchronously if the thing you’re awaiting has already completed by the time you await it, you effectively want the system to pretend it’s not completed even if it is. Then rather than continuing synchronously, the continuation will always end up running asynchronously from the caller. This can be helpful as an optimization in a variety of ways. Consider this code that was in SocketsHttpHandler’s HTTP/2 implementation in .NET 7:

    private void DisableHttp2Connection(Http2Connection connection)
    {
        _ = Task.Run(async () => // fire-and-forget
        {
            bool usable = await connection.WaitForAvailableStreamsAsync().ConfigureAwait(false);
            ... // other stuff
        });
    }

With ForceYielding in .NET 8, the code is now:

    private void DisableHttp2Connection(Http2Connection connection)
    {
        _ = DisableHttp2ConnectionAsync(connection); // fire-and-forget

        async Task DisableHttp2ConnectionAsync(Http2Connection connection)
        {
            bool usable = await connection.WaitForAvailableStreamsAsync().ConfigureAwait(ConfigureAwaitOptions.ForceYielding);
            ... // other stuff
        }
    }

Rather than have a separate Task.Run, we’ve just piggy-backed on the await for the task returned from WaitForAvailableStreamsAsync (which we know will quickly return the task to us), ensuring that the work that comes after it doesn’t run synchronously as part of the call to DisableHttp2Connection. Or imagine you had code that was doing:

    return Task.Run(WorkAsync);

    static async Task WorkAsync()
    {
        while (...) await Something();
    }

This is using Task.Run to queue an async method’s invocation. That async method results in a Task being allocated, plus the Task.Run results in a Task being allocated, plus a work item needs to be queued to the ThreadPool, so at least three allocations. Now, this same functionality can be written as:

    return WorkAsync();

    static async Task WorkAsync()
    {
        await Task.CompletedTask.ConfigureAwait(ConfigureAwaitOptions.ForceYielding);
        while (...) await Something();
    }

and rather than three allocations, we end up with just one: for the async Task. That’s because with all the optimizations introduced in previous releases, the state machine box object is also what will be queued to the thread pool.
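The snippets above are fragments, so here is a self-contained sketch you can run to watch ForceYielding change where the rest of a method executes. The RunsOnPoolThreadAsync helper is a hypothetical name of mine. The sketch assumes it’s started from a console app’s main thread, which is not a thread pool thread and has no SynchronizationContext.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Without ForceYielding, awaiting an already-completed task continues inline,
// so the rest of the method keeps running on the caller's (main) thread.
bool inline = await RunsOnPoolThreadAsync(ConfigureAwaitOptions.None);
Console.WriteLine($"None:          on pool thread = {inline}");   // False

// With ForceYielding, the continuation is always scheduled asynchronously,
// so the rest of the method runs on a thread pool thread instead.
bool yielded = await RunsOnPoolThreadAsync(ConfigureAwaitOptions.ForceYielding);
Console.WriteLine($"ForceYielding: on pool thread = {yielded}");  // True

// Awaits a completed task with the given options, then reports which kind
// of thread the code after the await ended up on.
static async Task<bool> RunsOnPoolThreadAsync(ConfigureAwaitOptions options)
{
    await Task.CompletedTask.ConfigureAwait(options);
    return Thread.CurrentThread.IsThreadPoolThread;
}
```

The order of the two calls matters in this sketch: once the second await has yielded, the top-level code itself resumes on a pool thread, so a later call with None would also report True.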

Arguably the most valuable addition to this support, though, is SuppressThrowing. It does what it sounds like: when you await a task that completes in failure or cancellation, such that normally the await would propagate the exception, it won’t. So, for example, in System.Text.Json where we previously had this code:

    // Exceptions should only be propagated by the resuming converter
    try
    {
        await state.PendingTask.ConfigureAwait(false);
    }
    catch { }

now we have this code:

    // Exceptions should only be propagated by the resuming converter
    await state.PendingTask.ConfigureAwait(ConfigureAwaitOptions.SuppressThrowing);

or in SemaphoreSlim where we had this code:

    await new ConfiguredNoThrowAwaiter<bool>(asyncWaiter.WaitAsync(TimeSpan.FromMilliseconds(millisecondsTimeout), cancellationToken));
    if (cancellationToken.IsCancellationRequested)
    {
        // If we might be running as part of a cancellation callback, force the completion to be asynchronous.
        await TaskScheduler.Default;
    }

    private readonly struct ConfiguredNoThrowAwaiter<T> : ICriticalNotifyCompletion, IStateMachineBoxAwareAwaiter
    {
        private readonly Task<T> _task;
        public ConfiguredNoThrowAwaiter(Task<T> task) => _task = task;
        public ConfiguredNoThrowAwaiter<T> GetAwaiter() => this;
        public bool IsCompleted => _task.IsCompleted;
        public void GetResult() => _task.MarkExceptionsAsHandled();
        public void OnCompleted(Action continuation) => TaskAwaiter.OnCompletedInternal(_task, continuation, continueOnCapturedContext: false, flowExecutionContext: true);
        public void UnsafeOnCompleted(Action continuation) => TaskAwaiter.OnCompletedInternal(_task, continuation, continueOnCapturedContext: false, flowExecutionContext: false);
        public void AwaitUnsafeOnCompleted(IAsyncStateMachineBox box) => TaskAwaiter.UnsafeOnCompletedInternal(_task, box, continueOnCapturedContext: false);
    }

    internal readonly struct TaskSchedulerAwaiter : ICriticalNotifyCompletion
    {
        private readonly TaskScheduler _scheduler;
        public TaskSchedulerAwaiter(TaskScheduler scheduler) => _scheduler = scheduler;
        public bool IsCompleted => false;
        public void GetResult() { }
        public void OnCompleted(Action continuation) => Task.Factory.StartNew(continuation, CancellationToken.None, TaskCreationOptions.DenyChildAttach, _scheduler);
        public void UnsafeOnCompleted(Action continuation)
        {
            if (ReferenceEquals(_scheduler, Default))
            {
                ThreadPool.UnsafeQueueUserWorkItem(s => s(), continuation, preferLocal: true);
            }
            else
            {
                OnCompleted(continuation);
            }
        }
    }

now we just have this:

    await ((Task)asyncWaiter.WaitAsync(TimeSpan.FromMilliseconds(millisecondsTimeout), cancellationToken)).ConfigureAwait(ConfigureAwaitOptions.SuppressThrowing);
    if (cancellationToken.IsCancellationRequested)
    {
        // If we might be running as part of a cancellation callback, force the completion to be asynchronous.
        await Task.CompletedTask.ConfigureAwait(ConfigureAwaitOptions.ForceYielding);
    }

It is useful to note the (Task) cast that’s in there. WaitAsync returns a Task<bool>, but that Task<bool> is being cast to the base Task because SuppressThrowing is incompatible with Task<TResult>. That’s because, without an exception propagating, the await will complete successfully and return a TResult, which may be invalid if the task actually faulted. So if you have a Task<TResult> that you want to await with SuppressThrowing, cast to the base Task and await it, and then you can inspect the Task<TResult> immediately after the await completes. (If you do end up using ConfigureAwaitOptions.SuppressThrowing with a Task<TResult>, the CA2261 analyzer introduced in dotnet/roslyn-analyzers#6669 will alert you to it.)
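As a runnable sketch of that cast-then-inspect pattern, consider the following. ParseAsync is a hypothetical helper I’m introducing only for illustration; the pattern itself is the one described above.

```csharp
using System;
using System.Threading.Tasks;

Task<int> parse = ParseAsync("not a number");

// Cast to the base Task: SuppressThrowing isn't supported on Task<TResult>.
await ((Task)parse).ConfigureAwait(ConfigureAwaitOptions.SuppressThrowing);

// The await never throws, so inspect the original Task<int> to see what happened.
if (parse.IsCompletedSuccessfully)
{
    Console.WriteLine($"Parsed: {parse.Result}");
}
else
{
    // Prints "Failed: FormatException"
    Console.WriteLine($"Failed: {parse.Exception?.InnerException?.GetType().Name}");
}

// Hypothetical helper: yields, then either parses the text or throws FormatException.
static async Task<int> ParseAsync(string text)
{
    await Task.Yield();
    return int.Parse(text);
}
```

Checking IsCompletedSuccessfully before touching Result is the important part: reading Result on the faulted task would throw, which is exactly what the SuppressThrowing await was avoiding.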

The above example with SemaphoreSlim is using the new ConfigureAwaitOptions to replace a previous optimization added in .NET 8, as well. dotnet/runtime#83294 added to that ConfiguredNoThrowAwaiter<T> an implementation of the internal IStateMachineBoxAwareAwaiter interface, which is the special sauce that enables the async method builders to backchannel with a known awaiter to avoid the Action delegate allocation. Now that the behaviors this ConfiguredNoThrowAwaiter was providing are built-in, it’s no longer needed, and the built-in implementation enjoys the same privileges via IStateMachineBoxAwareAwaiter. The net result of these changes for SemaphoreSlim is that it now not only has simpler code, but faster code, too. Here’s a benchmark showing the decrease in execution time and allocation associated with SemaphoreSlim.WaitAsync calls that need to wait with a CancellationToken and/or timeout:

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        private readonly CancellationToken _token = new CancellationTokenSource().Token;
        private readonly SemaphoreSlim _sem = new SemaphoreSlim(0);
        private readonly Task[] _tasks = new Task[100];

        [Benchmark]
        public Task WaitAsync()
        {
            for (int i = 0; i < _tasks.Length; i++)
            {
                _tasks[i] = _sem.WaitAsync(_token);
            }
            _sem.Release(_tasks.Length);
            return Task.WhenAll(_tasks);
        }
    }

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| WaitAsync | .NET 7.0 | 85.48 us | 1.00 | 44.64 KB | 1.00 |
| WaitAsync | .NET 8.0 | 69.37 us | 0.82 | 36.02 KB | 0.81 |

There have been other improvements on other operations on Task as well. dotnet/runtime#81065 removes a defensive Task[] allocation from Task.WhenAll. It was previously doing a defensive copy such that it could then validate on the copy whether any of the elements were null (a copy because another thread could erroneously and concurrently null out elements); that’s a large cost to pay for argument validation in the face of multi-threaded misuse. Instead, the method will still validate whether null is in the input, and if a null slips through because the input collection was erroneously mutated concurrently with the synchronous call to WhenAll, it’ll just ignore the null at that point. In making these changes, the PR also special-cased a List<Task> input to avoid making a copy, as List<Task> is also one of the main types we see fed into WhenAll (e.g. someone builds up a list of tasks and then waits for all of them).

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    using System.Collections.ObjectModel;
    using System.Runtime.CompilerServices;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        [Benchmark]
        public void WhenAll_Array()
        {
            AsyncTaskMethodBuilder atmb1 = AsyncTaskMethodBuilder.Create();
            AsyncTaskMethodBuilder atmb2 = AsyncTaskMethodBuilder.Create();
            Task whenAll = Task.WhenAll(atmb1.Task, atmb2.Task);
            atmb1.SetResult();
            atmb2.SetResult();
            whenAll.Wait();
        }

        [Benchmark]
        public void WhenAll_List()
        {
            AsyncTaskMethodBuilder atmb1 = AsyncTaskMethodBuilder.Create();
            AsyncTaskMethodBuilder atmb2 = AsyncTaskMethodBuilder.Create();
            Task whenAll = Task.WhenAll(new List<Task>(2) { atmb1.Task, atmb2.Task });
            atmb1.SetResult();
            atmb2.SetResult();
            whenAll.Wait();
        }

        [Benchmark]
        public void WhenAll_Collection()
        {
            AsyncTaskMethodBuilder atmb1 = AsyncTaskMethodBuilder.Create();
            AsyncTaskMethodBuilder atmb2 = AsyncTaskMethodBuilder.Create();
            Task whenAll = Task.WhenAll(new ReadOnlyCollection<Task>(new[] { atmb1.Task, atmb2.Task }));
            atmb1.SetResult();
            atmb2.SetResult();
            whenAll.Wait();
        }

        [Benchmark]
        public void WhenAll_Enumerable()
        {
            AsyncTaskMethodBuilder atmb1 = AsyncTaskMethodBuilder.Create();
            AsyncTaskMethodBuilder atmb2 = AsyncTaskMethodBuilder.Create();
            var q = new Queue<Task>(2);
            q.Enqueue(atmb1.Task);
            q.Enqueue(atmb2.Task);
            Task whenAll = Task.WhenAll(q);
            atmb1.SetResult();
            atmb2.SetResult();
            whenAll.Wait();
        }
    }

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| WhenAll_Array | .NET 7.0 | 210.8 ns | 1.00 | 304 B | 1.00 |
| WhenAll_Array | .NET 8.0 | 160.9 ns | 0.76 | 264 B | 0.87 |
| WhenAll_List | .NET 7.0 | 296.4 ns | 1.00 | 376 B | 1.00 |
| WhenAll_List | .NET 8.0 | 185.5 ns | 0.63 | 296 B | 0.79 |
| WhenAll_Collection | .NET 7.0 | 271.3 ns | 1.00 | 360 B | 1.00 |
| WhenAll_Collection | .NET 8.0 | 199.7 ns | 0.74 | 328 B | 0.91 |
| WhenAll_Enumerable | .NET 7.0 | 328.2 ns | 1.00 | 472 B | 1.00 |
| WhenAll_Enumerable | .NET 8.0 | 230.0 ns | 0.70 | 432 B | 0.92 |

The generic WhenAny was also improved as part of dotnet/runtime#88154, which removes a Task allocation from an extra continuation that was an implementation detail. This is one of my favorite kinds of PRs: it not only improved performance, it also resulted in cleaner code, and less code.

[Figure: GitHub plus/minus line count indicator for Task.WhenAny]

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    using System.Runtime.CompilerServices;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        [Benchmark]
        public Task<Task<int>> WhenAnyGeneric_ListNotCompleted()
        {
            AsyncTaskMethodBuilder<int> atmb1 = default;
            AsyncTaskMethodBuilder<int> atmb2 = default;
            AsyncTaskMethodBuilder<int> atmb3 = default;

            Task<Task<int>> wa = Task.WhenAny(new List<Task<int>>() { atmb1.Task, atmb2.Task, atmb3.Task });

            atmb3.SetResult(42);

            return wa;
        }
    }

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| WhenAnyGeneric_ListNotCompleted | .NET 7.0 | 555.0 ns | 1.00 | 704 B | 1.00 |
| WhenAnyGeneric_ListNotCompleted | .NET 8.0 | 260.3 ns | 0.47 | 504 B | 0.72 |

One last example related to tasks, though this one is a bit different, as it’s specifically about improving test performance (and test reliability). Imagine you have a method like this:

    public static async Task LogAfterDelay(Action<string, TimeSpan> log)
    {
        long startingTimestamp = Stopwatch.GetTimestamp();
        await Task.Delay(TimeSpan.FromSeconds(30));
        log("Completed", Stopwatch.GetElapsedTime(startingTimestamp));
    }

The purpose of this method is to wait for 30 seconds and then log a completion message as well as how much time the method observed to pass. This is obviously a simplification of the kind of functionality you’d find in real applications, but you can extrapolate from it to code you’ve likely written. How do you test this? Maybe you’ve written a test like this:

    [Fact]
    public async Task LogAfterDelay_Success_CompletesAfterThirtySeconds()
    {
        TimeSpan ts = default;
        Stopwatch sw = Stopwatch.StartNew();

        await LogAfterDelay((message, time) => ts = time);
        sw.Stop();

        Assert.InRange(ts, TimeSpan.FromSeconds(30), TimeSpan.MaxValue);
        Assert.InRange(sw.Elapsed, TimeSpan.FromSeconds(30), TimeSpan.MaxValue);
    }

This is validating both that the method included a value of at least 30 seconds in its log and also that at least 30 seconds passed. What’s the problem? From a performance perspective, the problem is this test had to wait 30 seconds! That’s a ton of overhead for something which would otherwise complete close to instantaneously. Now imagine the delay was longer, like 10 minutes, or that we had a bunch of tests that all needed to do the same thing. It becomes untenable to test well and thoroughly.

To address these kinds of situations, many developers have introduced their own abstractions for the flow of time. Now in .NET 8, that’s no longer needed. As of dotnet/runtime#83604, the core libraries include System.TimeProvider. This abstract base class abstracts over the flow of time, with members for getting the current UTC time, getting the current local time, getting the current time zone, getting a high-frequency timestamp, and creating a timer (which in turn returns the new System.Threading.ITimer that supports changing the timer’s tick interval). Then core library members like Task.Delay and CancellationTokenSource’s constructor have new overloads that accept a TimeProvider, and use it for time-related functionality rather than being hardcoded to DateTime.UtcNow, Stopwatch, or System.Threading.Timer. With that, we can rewrite our previous method:

    public static async Task LogAfterDelay(Action<string, TimeSpan> log, TimeProvider provider)
    {
        long startingTimestamp = provider.GetTimestamp();
        await Task.Delay(TimeSpan.FromSeconds(30), provider);
        log("Completed", provider.GetElapsedTime(startingTimestamp));
    }

It’s been augmented to accept a TimeProvider parameter, though in a system that uses a dependency injection (DI) mechanism, it would likely just fetch a TimeProvider singleton from DI. Then instead of using Stopwatch.GetTimestamp or Stopwatch.GetElapsedTime, it uses the corresponding members on the provider, and instead of using the Task.Delay overload that just takes a duration, it uses the overload that also takes a TimeProvider. When used in production, this can be passed TimeProvider.System, which is implemented based on the system clock (exactly what you would get without providing a TimeProvider at all), but in a test, it can be passed a custom instance, one that manually controls the observed flow of time. Exactly such a custom TimeProvider exists in the Microsoft.Extensions.TimeProvider.Testing NuGet package: FakeTimeProvider. Here’s an example of using it with our LogAfterDelay method:

    // dotnet run -c Release -f net8.0 --filter "*"

    using Microsoft.Extensions.Time.Testing;
    using System.Diagnostics;

    Stopwatch sw = Stopwatch.StartNew();

    var fake = new FakeTimeProvider();

    Task t = LogAfterDelay((message, time) => Console.WriteLine($"{message}: {time}"), fake);

    fake.Advance(TimeSpan.FromSeconds(29));
    Console.WriteLine(t.IsCompleted);

    fake.Advance(TimeSpan.FromSeconds(1));
    Console.WriteLine(t.IsCompleted);

    Console.WriteLine($"Actual execution time: {sw.Elapsed}");

    static async Task LogAfterDelay(Action<string, TimeSpan> log, TimeProvider provider)
    {
        long startingTimestamp = provider.GetTimestamp();
        await Task.Delay(TimeSpan.FromSeconds(30), provider);
        log("Completed", provider.GetElapsedTime(startingTimestamp));
    }

When I run this, it outputs the following:

    False
    Completed: 00:00:30
    True
    Actual execution time: 00:00:00.0119943

In other words, after manually advancing time by 29 seconds, the operation still hadn’t completed. Then we manually advanced time by one more second, and the operation completed. It reported that 30 seconds passed, but in reality, the whole operation took only 0.01 seconds of actual wall clock time.
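FakeTimeProvider is the ready-made option, but because TimeProvider’s members are virtual, a tiny hand-written subclass can be enough when the code under test only reads the clock. The following is a minimal sketch under that assumption; FixedTimeProvider, Expiry, and IsExpired are hypothetical names invented for this example, and the sketch doesn’t attempt to fake timers or timestamps.

```csharp
// Sketch: a hand-rolled TimeProvider that reports whatever time it is told to.
// FixedTimeProvider, Expiry, and IsExpired are hypothetical names for illustration.
using System;

var expiry = new DateTimeOffset(2024, 1, 1, 0, 0, 0, TimeSpan.Zero);
var clock = new FixedTimeProvider(expiry.AddMinutes(-1));

Console.WriteLine(Expiry.IsExpired(expiry, clock)); // clock is one minute before expiry

clock.UtcNow = expiry.AddMinutes(1);
Console.WriteLine(Expiry.IsExpired(expiry, clock)); // clock is one minute after expiry

sealed class FixedTimeProvider : TimeProvider
{
    public FixedTimeProvider(DateTimeOffset utcNow) => UtcNow = utcNow;

    // The test moves time by assigning this property.
    public DateTimeOffset UtcNow { get; set; }

    // Only the wall clock is overridden; timers and timestamps keep their default behavior.
    public override DateTimeOffset GetUtcNow() => UtcNow;
}

static class Expiry
{
    // The code under test asks the provider for the time instead of DateTimeOffset.UtcNow.
    public static bool IsExpired(DateTimeOffset expiresAt, TimeProvider provider) =>
        provider.GetUtcNow() >= expiresAt;
}
```

For anything involving Task.Delay or timers, as in the LogAfterDelay example, FakeTimeProvider remains the more practical choice, since it also controls timer firing.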

With that, let’s move up the stack to Parallel.

Parallel

.NET 6 introduced new async methods onto Parallel in the form of Parallel.ForEachAsync. After its introduction, we started getting requests for an equivalent for for loops, so now in .NET 8, with dotnet/runtime#84804, the class gains a set of Parallel.ForAsync methods. These were previously achievable by passing in an IEnumerable<T> created from a method like Enumerable.Range, e.g.

    await Parallel.ForEachAsync(Enumerable.Range(0, 1_000), async (i, ct) =>
    {
       ... 
    });

but you can now achieve the same more simply and cheaply with:

    await Parallel.ForAsync(0, 1_000, async (i, ct) =>
    {
       ... 
    });

It ends up being cheaper because you don’t need to allocate the enumerable/enumerator, and the synchronization involved in multiple workers trying to peel off the next iteration can be done in a much less expensive manner, a single Interlocked rather than using an asynchronous lock like SemaphoreSlim.
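To make the “single Interlocked” idea concrete, here is a deliberately simplified sketch of several workers claiming indices from a shared counter. This is not the actual Parallel.ForAsync implementation; RangeWorkers and SumAsync are hypothetical names, and the sketch ignores cancellation, exceptions, and asynchronous bodies.

```csharp
// Sketch: workers split an integer range using one atomic increment per iteration.
// RangeWorkers and SumAsync are hypothetical; this is not how Parallel.ForAsync is written.
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

int total = await RangeWorkers.SumAsync(0, 1_000, workers: 4);
Console.WriteLine(total);

static class RangeWorkers
{
    public static async Task<int> SumAsync(int fromInclusive, int toExclusive, int workers)
    {
        int next = fromInclusive; // shared cursor: the next index nobody has claimed yet
        int sum = 0;

        Task[] tasks = Enumerable.Range(0, workers).Select(_ => Task.Run(() =>
        {
            while (true)
            {
                // Claim an index atomically; no lock and no enumerator are involved.
                int i = Interlocked.Increment(ref next) - 1;
                if (i >= toExclusive)
                {
                    break;
                }

                Interlocked.Add(ref sum, i); // stand-in for the loop body
            }
        })).ToArray();

        await Task.WhenAll(tasks);
        return sum;
    }
}
```

With an arbitrary IEnumerable<T> source there is no such counter to increment, which is why ForEachAsync has to serialize access to the enumerator instead. The benchmark that follows shows the difference when the body does no work at all.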

    // dotnet run -c Release -f net8.0 --filter "*"

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        [Benchmark(Baseline = true)]
        public Task ForEachAsync() => Parallel.ForEachAsync(Enumerable.Range(0, 1_000_000), (i, ct) => ValueTask.CompletedTask);

        [Benchmark]
        public Task ForAsync() => Parallel.ForAsync(0, 1_000_000, (i, ct) => ValueTask.CompletedTask);
    }

| Method | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|
| ForEachAsync | 589.5 ms | 1.00 | 87925272 B | 1.000 |
| ForAsync | 147.5 ms | 0.25 | 792 B | 0.000 |

The allocation column here is particularly stark, and also a tad misleading. Why is ForEachAsync so much worse here allocation-wise? It’s because of the synchronization mechanism. There’s zero work being performed here by the delegate in the test, so all of the time is spent hammering on the source. In the case of Parallel.ForAsync, that’s a single Interlocked instruction to get the next value. In the case of Parallel.ForEachAsync, it’s a WaitAsync, and under a lot of contention, many of those WaitAsync calls are going to complete asynchronously, resulting in allocation. In a real workload, where the body delegate is doing real work, synchronously or asynchronously, the impact of that synchronization is much, much less dramatic. Here I’ve changed the calls to just be a simple Task.Delay for 1ms (and also significantly lowered the iteration count):

    // dotnet run -c Release -f net8.0 --filter "*"

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        [Benchmark(Baseline = true)]
        public Task ForEachAsync() => Parallel.ForEachAsync(Enumerable.Range(0, 100), async (i, ct) => await Task.Delay(1));

        [Benchmark]
        public Task ForAsync() => Parallel.ForAsync(0, 100, async (i, ct) => await Task.Delay(1));
    }

and the two methods are effectively the same:

| Method | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|
| ForEachAsync | 89.39 ms | 1.00 | 27.96 KB | 1.00 |
| ForAsync | 89.44 ms | 1.00 | 27.84 KB | 1.00 |

Interestingly, this Parallel.ForAsync method is also one of the first public methods in the core libraries to be based on the generic math interfaces introduced in .NET 7:

    public static Task ForAsync<T>(T fromInclusive, T toExclusive, Func<T, CancellationToken, ValueTask> body)
        where T : notnull, IBinaryInteger<T>

When initially designing the method, we copied the synchronous For counterpart, which has overloads specific to int and overloads specific to long. Now that we have IBinaryInteger<T>, however, we realized we could not only reduce the number of overloads and the number of implementations; by using IBinaryInteger<T> we could also open the same method up to other types folks want to use, such as nint or UInt128 or BigInteger; they all “just work,” which is pretty cool. (The new TotalOrderIeee754Comparer<T>, added in .NET 8 in dotnet/runtime#75517 by @huoyaoyuan, is another new public type relying on these interfaces.) Once we did that, in dotnet/runtime#84853 we used a similar technique to deduplicate the Parallel.For implementations, such that both int and long share the same generic implementations internally.
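As a quick illustration of the same method being used with different integer types, here is a small sketch. The variable names and loop bounds are arbitrary choices for the example; the only API assumed is the ForAsync<T> signature shown above.

```csharp
// Sketch: one generic ForAsync<T> used with long and with BigInteger bounds.
using System;
using System.Numerics;
using System.Threading;
using System.Threading.Tasks;

long longSum = 0;
await Parallel.ForAsync(0L, 100L, (i, ct) =>
{
    Interlocked.Add(ref longSum, i); // i is a long here
    return ValueTask.CompletedTask;
});
Console.WriteLine(longSum);

int bigCount = 0;
await Parallel.ForAsync(BigInteger.Zero, new BigInteger(10), (i, ct) =>
{
    Interlocked.Increment(ref bigCount); // i is a BigInteger here
    return ValueTask.CompletedTask;
});
Console.WriteLine(bigCount);
```

In both calls T is inferred from the bounds, so no type argument needs to be spelled out at the call site.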

Exceptions

In .NET 6, ArgumentNullException gained a ThrowIfNull method, as we dipped our toes into the waters of providing “throw helpers.” The intent of the method is to concisely express the constraint being verified, letting the system throw a consistent exception for failure to meet the constraint while also optimizing the success and 99.999% case where no exception need be thrown. The method is structured in such a way that the fast path performing the check gets inlined, with as little work as possible on that path, and then everything else is relegated to a method that performs the actual throwing (the JIT won’t inline that throwing method, as it’ll look at its implementation and see that the method always throws).

    public static void ThrowIfNull(
        [NotNull] object? argument,
        [CallerArgumentExpression(nameof(argument))] string? paramName = null)
    {
        if (argument is null)
            Throw(paramName);
    }

    [DoesNotReturn]
    internal static void Throw(string? paramName) => throw new ArgumentNullException(paramName);

In .NET 7, ArgumentNullException.ThrowIfNull gained another overload, this time for pointers, and two new methods were introduced: ArgumentException.ThrowIfNullOrEmpty for strings and ObjectDisposedException.ThrowIf.

Now in .NET 8, a slew of new such helpers have been added. Thanks to dotnet/runtime#86007, ArgumentException gains ThrowIfNullOrWhiteSpace to complement ThrowIfNullOrEmpty:

public static void ThrowIfNullOrWhiteSpace([NotNull] string? argument, [CallerArgumentExpression(nameof(argument))] string? paramName = null);

and thanks to dotnet/runtime#78222 from @hrrrrustic and dotnet/runtime#83853, ArgumentOutOfRangeException gains 9 new methods:

    public static void ThrowIfEqual<T>(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : System.IEquatable<T>?;
    public static void ThrowIfNotEqual<T>(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : System.IEquatable<T>?;

    public static void ThrowIfLessThan<T>(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable<T>;
    public static void ThrowIfLessThanOrEqual<T>(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable<T>;

    public static void ThrowIfGreaterThan<T>(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable<T>;
    public static void ThrowIfGreaterThanOrEqual<T>(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable<T>;

    public static void ThrowIfNegative<T>(T value, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : INumberBase<T>;
    public static void ThrowIfZero<T>(T value, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : INumberBase<T>;
    public static void ThrowIfNegativeOrZero<T>(T value, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : INumberBase<T>;

Those PRs used these new methods in a few places, but then dotnet/runtime#79460, dotnet/runtime#80355, dotnet/runtime#82357, dotnet/runtime#82533, and dotnet/runtime#85858 rolled out their use more broadly throughout the core libraries. To get a sense for the usefulness of these methods, here are the number of times each of these methods is being called from within the src for the core libraries in dotnet/runtime as of the time I’m writing this paragraph:

| Method | Count |
|---|---|
| ANE.ThrowIfNull(object) | 4795 |
| AOORE.ThrowIfNegative | 873 |
| AE.ThrowIfNullOrEmpty | 311 |
| ODE.ThrowIf | 237 |
| AOORE.ThrowIfGreaterThan | 223 |
| AOORE.ThrowIfNegativeOrZero | 100 |
| AOORE.ThrowIfLessThan | 89 |
| ANE.ThrowIfNull(void*) | 55 |
| AOORE.ThrowIfGreaterThanOrEqual | 39 |
| AE.ThrowIfNullOrWhiteSpace | 32 |
| AOORE.ThrowIfLessThanOrEqual | 20 |
| AOORE.ThrowIfNotEqual | 13 |
| AOORE.ThrowIfZero | 5 |
| AOORE.ThrowIfEqual | 3 |
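For a sense of what these calls look like at a call site, here is a minimal sketch of a type validating its arguments with the helpers. BufferPool, its parameters, and its limits are hypothetical and exist only to exercise the APIs listed above.

```csharp
// Sketch: argument validation written with the throw helpers.
// BufferPool and its limits are hypothetical, made up for this example.
using System;

var pool = new BufferPool("scratch", bufferSize: 4096, maxBuffers: 8);
Console.WriteLine(pool.Rent().Length);

try
{
    _ = new BufferPool("scratch", bufferSize: 0, maxBuffers: 8);
}
catch (ArgumentOutOfRangeException e)
{
    // CallerArgumentExpression supplies the parameter name automatically.
    Console.WriteLine(e.ParamName);
}

sealed class BufferPool : IDisposable
{
    private readonly int _bufferSize;
    private bool _disposed;

    public BufferPool(string name, int bufferSize, int maxBuffers)
    {
        ArgumentException.ThrowIfNullOrWhiteSpace(name);
        ArgumentOutOfRangeException.ThrowIfNegativeOrZero(bufferSize);
        ArgumentOutOfRangeException.ThrowIfLessThan(maxBuffers, 1);
        ArgumentOutOfRangeException.ThrowIfGreaterThan(maxBuffers, 1024);
        _bufferSize = bufferSize;
    }

    public byte[] Rent()
    {
        ObjectDisposedException.ThrowIf(_disposed, this);
        return new byte[_bufferSize];
    }

    public void Dispose() => _disposed = true;
}
```

Each check is a single line that names the constraint, and none of the call sites needs to pass nameof(...) or build a message.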

These new methods also do more work in the throwing portion (e.g. formatting the exception message with the invalid arguments), which helps to better exemplify the benefits of moving all of that work out into a separate method. For example, here is the ThrowIfGreaterThan copied straight from System.Private.CoreLib:

    public static void ThrowIfGreaterThan<T>(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable<T>
    {
        if (value.CompareTo(other) > 0)
            ThrowGreater(value, other, paramName);
    }

    private static void ThrowGreater<T>(T value, T other, string? paramName) =>
        throw new ArgumentOutOfRangeException(paramName, value, SR.Format(SR.ArgumentOutOfRange_Generic_MustBeLessOrEqual, paramName, value, other));

and here is a benchmark showing what consumption would look like if the throw expression were directly part of ThrowIfGreaterThan:

    // dotnet run -c Release -f net8.0 --filter "*"

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    using System.Runtime.CompilerServices;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD", "value1", "value2")]
    [DisassemblyDiagnoser]
    public class Tests
    {
        [Benchmark(Baseline = true)]
        [Arguments(1, 2)]
        public void WithOutline(int value1, int value2)
        {
            ArgumentOutOfRangeException.ThrowIfGreaterThan(value1, 100);
            ArgumentOutOfRangeException.ThrowIfGreaterThan(value2, 200);
        }

        [Benchmark]
        [Arguments(1, 2)]
        public void WithInline(int value1, int value2)
        {
            ThrowIfGreaterThan(value1, 100);
            ThrowIfGreaterThan(value2, 200);
        }

        public static void ThrowIfGreaterThan<T>(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable<T>
        {
            if (value.CompareTo(other) > 0)
                throw new ArgumentOutOfRangeException(paramName, value, SR.Format(SR.ArgumentOutOfRange_Generic_MustBeLessOrEqual, paramName, value, other));
        }

        internal static class SR
        {
            public static string Format(string format, object arg0, object arg1, object arg2) => string.Format(format, arg0, arg1, arg2);

            internal static string ArgumentOutOfRange_Generic_MustBeLessOrEqual => GetResourceString("ArgumentOutOfRange_Generic_MustBeLessOrEqual");

            [MethodImpl(MethodImplOptions.NoInlining)]
            static string GetResourceString(string resourceKey) => "{0} ('{1}') must be less than or equal to '{2}'.";
        }
    }

| Method | Mean | Ratio | Code Size |
|---|---|---|---|
| WithOutline | 0.4839 ns | 1.00 | 118 B |
| WithInline | 2.4976 ns | 5.16 | 235 B |

The most relevant highlight from the generated assembly is from the WithInline case:

    ; Tests.WithInline(Int32, Int32)
           push      rbx
           sub       rsp,20
           mov       ebx,r8d
           mov       ecx,edx
           mov       edx,64
           mov       r8,1F5815EA8F8
           call      qword ptr [7FF99C03DEA8]; Tests.ThrowIfGreaterThan[[System.Int32, System.Private.CoreLib]](Int32, Int32, System.String)
           mov       ecx,ebx
           mov       edx,0C8
           mov       r8,1F5815EA920
           add       rsp,20
           pop       rbx
           jmp       qword ptr [7FF99C03DEA8]; Tests.ThrowIfGreaterThan[[System.Int32, System.Private.CoreLib]](Int32, Int32, System.String)
    ; Total bytes of code 59

Because there’s more cruft inside the ThrowIfGreaterThan method, the system decides not to inline it, and so we end up with two method invocations that occur even when the value is within range (the first is a call, the second here is a jmp, since there was no follow-up work in this method that would require control flow returning).

To make it easier to roll out usage of these helpers, dotnet/roslyn-analyzers#6293 added new analyzers to look for argument validation that can be replaced by one of the throw helper methods on ArgumentNullException, ArgumentException, ArgumentOutOfRangeException, or ObjectDisposedException. dotnet/runtime#80149 enables the analyzers for dotnet/runtime and fixes up many call sites. The corresponding rules are CA1510, CA1511, CA1512, and CA1513.

Reflection

There have been a variety of improvements here and there in the reflection stack in .NET 8, mostly around reducing allocation or caching information so that subsequent access is faster. For example, dotnet/runtime#87902 tweaks some code in GetCustomAttributes to avoid allocating an object[1] array in order to set a property on an attribute.

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        [Benchmark]
        public object[] GetCustomAttributes() => typeof(C).GetCustomAttributes(typeof(MyAttribute), inherit: true);

        [My(Value1 = 1, Value2 = 2)]
        class C { }

        [AttributeUsage(AttributeTargets.All)]
        public class MyAttribute : Attribute
        {
            public int Value1 { get; set; }
            public int Value2 { get; set; }
        }
    }

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| GetCustomAttributes | .NET 7.0 | 1,287.1 ns | 1.00 | 296 B | 1.00 |
| GetCustomAttributes | .NET 8.0 | 994.0 ns | 0.77 | 232 B | 0.78 |

Other changes like dotnet/runtime#76574 from @teo-tsirpanis, dotnet/runtime#81059 from @teo-tsirpanis, and dotnet/runtime#86657 from @teo-tsirpanis also removed allocations in the reflection stack, in particular by more liberal use of spans. And dotnet/runtime#78288 from @lateapexearlyspeed improves the handling of generics information on a Type, leading to a boost for various generics-related members, in particular for GetGenericTypeDefinition for which the result is now cached on the Type object.

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    public class Tests
    {
        private readonly Type _type = typeof(List<int>);

        [Benchmark]
        public Type GetGenericTypeDefinition() => _type.GetGenericTypeDefinition();
    }

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| GetGenericTypeDefinition | .NET 7.0 | 47.426 ns | 1.00 |
| GetGenericTypeDefinition | .NET 8.0 | 3.289 ns | 0.07 |

However, the largest impact on performance in reflection in .NET 8 comes from dotnet/runtime#88415. This is a continuation of work done in .NET 7 to improve the performance of MethodBase.Invoke. When you know at compile-time the signature of the target method you want to invoke via reflection, you can achieve the best performance by using CreateDelegate<DelegateType> to get and cache a delegate for the method in question, and then performing all invocations via that delegate. However, if you don’t know the signature at compile-time, you need to rely on more dynamic means, like MethodBase.Invoke, which historically has been much more costly. Some enterprising developers turned to reflection emit to avoid that overhead by emitting custom invocation stubs at run-time, and that’s one of the optimization approaches taken under the covers in .NET 7 as well. Now in .NET 8, the code generated for many of these cases has improved; previously the emitter was always generating code that could accommodate ref/out arguments, but many methods don’t have such arguments, and the generated code can be more efficient when it needn’t factor those in.

    // If you have .NET 6 installed, you can update the csproj to include a net6.0 in the target frameworks, and then run:
    //     dotnet run -c Release -f net6.0 --filter "*" --runtimes net6.0 net7.0 net8.0
    // Otherwise, you can run:
    //     dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    using System.Reflection;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    public class Tests
    {
        private MethodInfo _method0, _method1, _method2, _method3;
        private readonly object[] _args1 = new object[] { 1 };
        private readonly object[] _args2 = new object[] { 2, 3 };
        private readonly object[] _args3 = new object[] { 4, 5, 6 };

        [GlobalSetup]
        public void Setup()
        {
            _method0 = typeof(Tests).GetMethod("MyMethod0", BindingFlags.NonPublic | BindingFlags.Static);
            _method1 = typeof(Tests).GetMethod("MyMethod1", BindingFlags.NonPublic | BindingFlags.Static);
            _method2 = typeof(Tests).GetMethod("MyMethod2", BindingFlags.NonPublic | BindingFlags.Static);
            _method3 = typeof(Tests).GetMethod("MyMethod3", BindingFlags.NonPublic | BindingFlags.Static);
        }

        [Benchmark] public void Method0() => _method0.Invoke(null, null);
        [Benchmark] public void Method1() => _method1.Invoke(null, _args1);
        [Benchmark] public void Method2() => _method2.Invoke(null, _args2);
        [Benchmark] public void Method3() => _method3.Invoke(null, _args3);

        private static void MyMethod0() { }
        private static void MyMethod1(int arg1) { }
        private static void MyMethod2(int arg1, int arg2) { }
        private static void MyMethod3(int arg1, int arg2, int arg3) { }
    }

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Method0 | .NET 6.0 | 91.457 ns | 1.00 |
| Method0 | .NET 7.0 | 7.205 ns | 0.08 |
| Method0 | .NET 8.0 | 5.719 ns | 0.06 |
| Method1 | .NET 6.0 | 132.832 ns | 1.00 |
| Method1 | .NET 7.0 | 26.151 ns | 0.20 |
| Method1 | .NET 8.0 | 21.602 ns | 0.16 |
| Method2 | .NET 6.0 | 172.224 ns | 1.00 |
| Method2 | .NET 7.0 | 37.937 ns | 0.22 |
| Method2 | .NET 8.0 | 26.951 ns | 0.16 |
| Method3 | .NET 6.0 | 211.247 ns | 1.00 |
| Method3 | .NET 7.0 | 42.988 ns | 0.20 |
| Method3 | .NET 8.0 | 34.112 ns | 0.16 |

However, there’s overhead involved here that’s repeated on each call. If we could extract that upfront work, do it once, and cache it, we could achieve much better performance. That’s exactly what the new MethodInvoker and ConstructorInvoker types implemented in dotnet/runtime#88415 provide. These don’t incorporate all of the obscure corner-cases that MethodBase.Invoke handles (like specially recognizing and handling Type.Missing), but for everything else, they provide a great solution for optimizing the repeated invocation of methods whose signatures are unknown at build time.

    // dotnet run -c Release -f net8.0 --filter "*"

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    using System.Reflection;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    public class Tests
    {
        private readonly object _arg0 = 4, _arg1 = 5, _arg2 = 6;
        private readonly object[] _args3 = new object[] { 4, 5, 6 };
        private MethodInfo _method3;
        private MethodInvoker _method3Invoker;

        [GlobalSetup]
        public void Setup()
        {
            _method3 = typeof(Tests).GetMethod("MyMethod3", BindingFlags.NonPublic | BindingFlags.Static);
            _method3Invoker = MethodInvoker.Create(_method3);
        }

        [Benchmark(Baseline = true)]
        public void MethodBaseInvoke() => _method3.Invoke(null, _args3);

        [Benchmark]
        public void MethodInvokerInvoke() => _method3Invoker.Invoke(null, _arg0, _arg1, _arg2);

        private static void MyMethod3(int arg1, int arg2, int arg3) { }
    }

| Method | Mean | Ratio |
|---|---|---|
| MethodBaseInvoke | 32.42 ns | 1.00 |
| MethodInvokerInvoke | 11.47 ns | 0.35 |
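The benchmark above exercises MethodInvoker with a static method; ConstructorInvoker follows the same create-once, invoke-many shape. Here is a minimal sketch combining the two on an instance type. Point and its members are hypothetical and exist only for this example.

```csharp
// Sketch: create the invokers once, then reuse them for every call.
// Point is a hypothetical type used only for illustration.
using System;
using System.Reflection;

ConstructorInfo ctor = typeof(Point).GetConstructor(new[] { typeof(int), typeof(int) })!;
MethodInfo sum = typeof(Point).GetMethod(nameof(Point.Sum))!;

// One-time setup that would typically be cached in a field.
ConstructorInvoker createPoint = ConstructorInvoker.Create(ctor);
MethodInvoker invokeSum = MethodInvoker.Create(sum);

// Arguments are passed individually, so no object[] is needed per call.
object point = createPoint.Invoke(3, 4);
Console.WriteLine(invokeSum.Invoke(point));

sealed class Point
{
    public Point(int x, int y) => (X, Y) = (x, y);

    public int X { get; }
    public int Y { get; }

    public int Sum() => X + Y;
}
```

For instance methods the target object is the first argument to MethodInvoker.Invoke, where the benchmark above passes null for a static method.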

As of dotnet/runtime#90119, these types are then used by the ActivatorUtilities.CreateFactory method in Microsoft.Extensions.DependencyInjection.Abstractions to further improve DI service construction performance. dotnet/runtime#91881 improves it further by adding an additional caching layer that avoids reflection on each construction.

Primitives

It’s hard to believe that after two decades we’re still finding opportunity to improve the core primitive types in .NET, yet here we are. Some of this comes from new scenarios that drive optimization into different places; some of it comes from new opportunity based on new support that enables different approaches to the same problem; some of it comes from new research highlighting new ways to approach a problem; and some of it simply comes from many new eyes looking at a well-worn space (yay open source!). Regardless of the reason, there’s a lot to be excited about here in .NET 8.

Enums

Let’s start with Enum. Enum has obviously been around since the earliest days of .NET and is used heavily. Although Enum’s functionality and implementation have evolved, and although it’s received new APIs, at its core, how the data is stored has fundamentally remained the same for many years. In the .NET Framework implementation, there’s an internal ValuesAndNames class that stores a ulong[] and a string[], and in .NET 7, there’s an EnumInfo that serves the same purpose. That string[] contains the names of all of the enum’s values, and the ulong[] stores their numeric counterparts. It’s a ulong[] to accommodate all possible underlying types an Enum can be, including those supported by C# (sbyte, byte, short, ushort, int, uint, long, ulong) and those additionally supported by the runtime (nint, nuint, char, float, double) even though effectively no one uses those (partial bool support used to be on this list as well, but was deleted in .NET 8 in dotnet/runtime#79962 by @pedrobsaila).

As an aside, as part of all of this work, we examined the breadth of appropriately-licensed NuGet packages, looking for what the most common underlying types were in their use of enum. Out of ~163 million enums found, here’s the breakdown of their underlying types. The result is likely not surprising, given the default underlying type for Enum, but it’s still interesting:

Graph of how common is each underlying Enum type

There are several issues with the cited design for how Enum stores its data. Every operation translates between these ulong[] values and the actual type being used by the particular Enum, plus the array is often twice as large as it needs to be (int is the default underlying type for an enum and, as seen in the above graph, by far the most commonly used). The approach also leads to significant assembly code bloat when dealing with all the new generic methods that have been added to Enum in recent years. enums are structs, and when a struct is used as a generic type argument, the JIT specializes the code for that value type (whereas for reference types it emits a single shared implementation used by all of them). That specialization is great for throughput, but it means that you get a copy of the code for every value type it’s used with; if you have a lot of code (e.g. Enum formatting) and a lot of possible types being substituted (e.g. every declared enum type), that’s a lot of possible increase in code size.

To address all of this, to modernize the implementation, and to make various operations faster, dotnet/runtime#78580 rewrites Enum. Rather than having a non-generic EnumInfo that stores a ulong[] array of all values, it introduces a generic EnumInfo<TUnderlyingValue> that stores a TUnderlyingValue[]. Then based on the enum’s type, every generic and non-generic Enum method looks up the underlying TUnderlyingValue and invokes a generic method with that TUnderlyingValue but not with a generic type parameter for the enum type, e.g. Enum.IsDefined<TEnum>(...) and Enum.IsDefined(typeof(TEnum), ...) both look up the TUnderlyingValue for TEnum and invoke the internal Enum.IsDefinedPrimitive<TUnderlyingValue>(typeof(TEnum)). In this way, the implementation stores a strongly-typed TUnderlyingValue[] value rather than storing the worst case ulong[], and all of the implementations across generic and non-generic entrypoints are shared while not having full generic specialization for every TEnum: worst case, we end up with one generic specialization per underlying type, of which only the previously cited 8 are expressible in C#. The generic entrypoints are able to do the mapping very efficiently, thanks to dotnet/runtime#71685 from @MichalPetryka which makes typeof(TEnum).IsEnum a JIT intrinsic (such that it effectively becomes a const), and the non-generic entrypoints use switches on TypeCode/CorElementType as was already being done in a variety of methods.
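To make the shape of that dispatch concrete, here is a minimal sketch, assuming only public APIs. The names EnumDispatchSketch and IsDefinedPrimitive are illustrative stand-ins, not the runtime's internals; the sketch handles just three underlying types and re-fetches the values on each call, where the real implementation caches them.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class EnumDispatchSketch
{
    // Generic entry point: specialized per TEnum, but all it does is dispatch.
    public static bool IsDefined<TEnum>(TEnum value) where TEnum : struct, Enum
    {
        Type underlying = Enum.GetUnderlyingType(typeof(TEnum));
        if (underlying == typeof(int)) return IsDefinedPrimitive(typeof(TEnum), Unsafe.As<TEnum, int>(ref value));
        if (underlying == typeof(byte)) return IsDefinedPrimitive(typeof(TEnum), Unsafe.As<TEnum, byte>(ref value));
        if (underlying == typeof(long)) return IsDefinedPrimitive(typeof(TEnum), Unsafe.As<TEnum, long>(ref value));
        throw new NotSupportedException("This sketch only handles int, byte, and long.");
    }

    // Shared worker: one specialization per underlying type, not one per enum type.
    private static bool IsDefinedPrimitive<TUnderlying>(Type enumType, TUnderlying value)
        where TUnderlying : struct
    {
        // Strongly-typed values (e.g. int[]) rather than a worst-case ulong[].
        // Assumption: fetched on every call for brevity; a real implementation would cache.
        var values = (TUnderlying[])Enum.GetValuesAsUnderlyingType(enumType);
        return Array.IndexOf(values, value) >= 0;
    }
}
```

With this shape, IsDefined<DayOfWeek> and IsDefined<ConsoleColor> each get a tiny specialized stub, but both funnel into the same IsDefinedPrimitive<int> code.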

Other improvements were made to Enum as well. dotnet/runtime#76162 improves the performance of various methods like ToString and IsDefined in cases where all of the enum’s defined values are sequential starting from 0. In that common case, the internal function that looks up the value in the EnumInfo<TUnderlyingValue> can do so with a simple array access, rather than needing to search for the target.
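The sequential fast path can be sketched roughly as follows. The Info class here is a hypothetical stand-in for the cached per-enum data, and binary search is my assumption for the general path; the point is only that a value which equals its own index needs no search at all.

```csharp
#nullable enable
using System;

public static class EnumNameLookupSketch
{
    // Hypothetical stand-in for per-enum cached data; not the runtime's actual type.
    public sealed class Info
    {
        public readonly int[] Values;    // sorted ascending
        public readonly string[] Names;  // Names[i] belongs to Values[i]
        public readonly bool IsSequentialFromZero;

        public Info(int[] values, string[] names)
        {
            Values = values;
            Names = names;

            // Computed once, up front: are the values exactly 0, 1, 2, ...?
            bool sequential = true;
            for (int i = 0; i < values.Length; i++)
            {
                if (values[i] != i) { sequential = false; break; }
            }
            IsSequentialFromZero = sequential;
        }
    }

    public static string? GetName(Info info, int value)
    {
        if (info.IsSequentialFromZero)
        {
            // Fast path: the value is its own index, so one range check suffices.
            return (uint)value < (uint)info.Values.Length ? info.Names[value] : null;
        }

        // General path: search the sorted values.
        int index = Array.BinarySearch(info.Values, value);
        return index >= 0 ? info.Names[index] : null;
    }
}
```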

The net result of all of these changes is some very nice performance improvements:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly DayOfWeek _dow = DayOfWeek.Saturday;

    [Benchmark] public bool IsDefined() => Enum.IsDefined(_dow);
    [Benchmark] public string GetName() => Enum.GetName(_dow);
    [Benchmark] public string[] GetNames() => Enum.GetNames<DayOfWeek>();
    [Benchmark] public DayOfWeek[] GetValues() => Enum.GetValues<DayOfWeek>();
    [Benchmark] public Array GetUnderlyingValues() => Enum.GetValuesAsUnderlyingType<DayOfWeek>();
    [Benchmark] public string EnumToString() => _dow.ToString();
    [Benchmark] public bool TryParse() => Enum.TryParse<DayOfWeek>("Saturday", out _);
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| IsDefined | .NET 7.0 | 20.021 ns | 1.00 | - | NA |
| IsDefined | .NET 8.0 | 2.502 ns | 0.12 | - | NA |
| GetName | .NET 7.0 | 24.563 ns | 1.00 | - | NA |
| GetName | .NET 8.0 | 3.648 ns | 0.15 | - | NA |
| GetNames | .NET 7.0 | 37.138 ns | 1.00 | 80 B | 1.00 |
| GetNames | .NET 8.0 | 22.688 ns | 0.61 | 80 B | 1.00 |
| GetValues | .NET 7.0 | 694.356 ns | 1.00 | 224 B | 1.00 |
| GetValues | .NET 8.0 | 39.406 ns | 0.06 | 56 B | 0.25 |
| GetUnderlyingValues | .NET 7.0 | 41.012 ns | 1.00 | 56 B | 1.00 |
| GetUnderlyingValues | .NET 8.0 | 17.249 ns | 0.42 | 56 B | 1.00 |
| EnumToString | .NET 7.0 | 32.842 ns | 1.00 | 24 B | 1.00 |
| EnumToString | .NET 8.0 | 14.620 ns | 0.44 | 24 B | 1.00 |
| TryParse | .NET 7.0 | 49.121 ns | 1.00 | - | NA |
| TryParse | .NET 8.0 | 30.394 ns | 0.62 | - | NA |

These changes, however, also made enums play much more nicely with string interpolation. First, Enum now sports a new static TryFormat method, which enables formatting an enum’s string representation directly into a Span<char>:

public static bool TryFormat<TEnum>(TEnum value, Span<char> destination, out int charsWritten, [StringSyntax(StringSyntaxAttribute.EnumFormat)] ReadOnlySpan<char> format = default) where TEnum : struct, Enum

Second, Enum now implements ISpanFormattable, such that any code written to use a value’s ISpanFormattable.TryFormat method now lights up with enums, too. However, even though enums are value types, they’re special and weird in that they derive from the reference type Enum, and that means calling instance methods like ToString or ISpanFormattable.TryFormat ends up boxing the enum value.

So, third, the various interpolated string handlers in System.Private.CoreLib were updated to special-case typeof(T).IsEnum, which as noted is now effectively free thanks to JIT optimizations, using Enum.TryFormat directly in order to avoid the boxing. We can see the impact this has by running the following benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly char[] _dest = new char[100];
    private readonly FileAttributes _attr = FileAttributes.Hidden | FileAttributes.ReadOnly;

    [Benchmark]
    public bool Interpolate() => _dest.AsSpan().TryWrite($"Attrs: {_attr}", out int charsWritten);
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| Interpolate | .NET 7.0 | 81.58 ns | 1.00 | 80 B | 1.00 |
| Interpolate | .NET 8.0 | 34.41 ns | 0.42 | - | 0.00 |
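If you want the same boxing-free behavior outside of string interpolation, you can call the static Enum.TryFormat yourself. The following is a small usage sketch that requires .NET 8; the helper name TryWriteWithPrefix is made up for illustration.

```csharp
using System;

public static class EnumTryFormatSketch
{
    // Writes "<prefix><enum name>" into destination without boxing the enum value.
    public static bool TryWriteWithPrefix<TEnum>(
        Span<char> destination, ReadOnlySpan<char> prefix, TEnum value, out int charsWritten)
        where TEnum : struct, Enum
    {
        charsWritten = 0;

        if (!prefix.TryCopyTo(destination))
        {
            return false; // not even the prefix fits
        }

        // The generic static method formats straight into the remaining space.
        if (!Enum.TryFormat(value, destination[prefix.Length..], out int enumChars))
        {
            return false; // the enum's text does not fit
        }

        charsWritten = prefix.Length + enumChars;
        return true;
    }
}
```

For example, calling it with a 100-element char[] destination, "Attrs: " as the prefix, and DayOfWeek.Saturday writes the 15 characters "Attrs: Saturday".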

Numbers

Such formatting improvements weren’t just reserved for enums. The performance of number formatting also sees a nice set of improvements in .NET 8. Daniel Lemire has a nice blog post from 2021 discussing various approaches to counting the number of digits in an integer. Digit counting is relevant to number formatting as we need to know how many characters the number will be, either to allocate a string of the right length to format into or to ensure that a destination buffer is of a sufficient length. dotnet/runtime#76519 implements this inside of .NET’s number formatting, providing a branch-free, table-based lookup solution for computing the number of digits in a formatted value.
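Here is a sketch of how such a table-based digit count can work for 32-bit values. Each table entry is indexed by Log2 of the value and is chosen so that adding the value and shifting right by 32 yields the digit count. One assumption to call out: this sketch builds its table at startup so the derivation is visible, whereas a production implementation would more likely hard-code the constants.

```csharp
using System;
using System.Numerics;

public static class DigitCountSketch
{
    // One entry per possible Log2(value): 32 buckets for a 32-bit integer.
    private static readonly ulong[] s_table = BuildTable();

    private static ulong[] BuildTable()
    {
        var table = new ulong[32];
        for (int k = 0; k < 32; k++)
        {
            ulong smallest = 1UL << k;             // smallest value in bucket k
            ulong largest = (1UL << (k + 1)) - 1;  // largest value in bucket k
            int digits = smallest.ToString().Length;

            ulong nextPowerOf10 = 1;
            for (int i = 0; i < digits; i++) nextPowerOf10 *= 10;

            // If a power of 10 falls inside the bucket, values at or above it have one
            // more digit; subtracting it makes the addition carry into the upper 32 bits
            // exactly for those values.
            table[k] = nextPowerOf10 <= largest
                ? ((ulong)(digits + 1) << 32) - nextPowerOf10
                : (ulong)digits << 32;
        }
        return table;
    }

    // No branches: one Log2, one table load, one add, one shift.
    public static int CountDigits(uint value) =>
        (int)((value + s_table[BitOperations.Log2(value | 1)]) >> 32);
}
```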

dotnet/runtime#76726 improves performance further by using a trick other formatting libraries use. One of the more expensive parts of formatting a decimal is in dividing by 10 to pull off each digit; if we can reduce the number of divisions, we can reduce the overall expense of the formatting operation. The trick here is, rather than dividing by 10 for each digit in the number, we instead divide by 100 for each pair of digits in the number, and then have a precomputed lookup table for the char-based representation of all values 0 to 99. This lets us cut the number of divisions in half.
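A minimal sketch of the two-digits-at-a-time idea, assuming a simple right-to-left fill of a stack buffer (the class and member names are illustrative, and the real code is organized differently):

```csharp
using System;

public static class TwoDigitFormattingSketch
{
    // "000102...9899": the two characters for value v start at index v * 2.
    private static readonly string s_pairs = BuildPairs();

    private static string BuildPairs()
    {
        var chars = new char[200];
        for (int v = 0; v < 100; v++)
        {
            chars[v * 2] = (char)('0' + v / 10);
            chars[v * 2 + 1] = (char)('0' + v % 10);
        }
        return new string(chars);
    }

    public static string Format(uint value)
    {
        Span<char> buffer = stackalloc char[10]; // uint.MaxValue has 10 digits
        int pos = buffer.Length;

        // Each division by 100 yields two digits, looked up in the table.
        while (value >= 100)
        {
            uint pair = value % 100;
            value /= 100;
            buffer[--pos] = s_pairs[(int)pair * 2 + 1];
            buffer[--pos] = s_pairs[(int)pair * 2];
        }

        // One or two digits remain.
        if (value >= 10)
        {
            buffer[--pos] = s_pairs[(int)value * 2 + 1];
            buffer[--pos] = s_pairs[(int)value * 2];
        }
        else
        {
            buffer[--pos] = (char)('0' + value);
        }

        return buffer[pos..].ToString();
    }
}
```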

dotnet/runtime#79061 also expands on a previous optimization already present in .NET. The formatting code contained a table of precomputed strings for single digit numbers, so if you asked for the equivalent of 0.ToString(), the implementation wouldn’t need to allocate a new string, it would just fetch "0" from the table and return it. This PR expands that cache from single digit numbers to being all numbers 0 through 299 (it also makes the cache lazy, such that we don’t need to pay for the strings for values that are never used). The choice of 299 is somewhat arbitrary and could be raised in the future if the need presents itself, but in examining data from various services, this addresses a significant chunk of the allocations that come from number formatting. Coincidentally or not, it also includes all success status codes from the HTTP protocol.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    [Benchmark]
    [Arguments(12)]
    [Arguments(123)]
    [Arguments(1_234_567_890)]
    public string Int32ToString(int i) => i.ToString();
}

| Method | Runtime | i | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|
| Int32ToString | .NET 7.0 | 12 | 16.253 ns | 1.00 | 32 B | 1.00 |
| Int32ToString | .NET 8.0 | 12 | 1.985 ns | 0.12 | - | 0.00 |
| Int32ToString | .NET 7.0 | 123 | 18.056 ns | 1.00 | 32 B | 1.00 |
| Int32ToString | .NET 8.0 | 123 | 1.971 ns | 0.11 | - | 0.00 |
| Int32ToString | .NET 7.0 | 1234567890 | 26.964 ns | 1.00 | 48 B | 1.00 |
| Int32ToString | .NET 8.0 | 1234567890 | 17.082 ns | 0.63 | 48 B | 1.00 |
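The lazy small-number cache behind those allocation-free results can be approximated like this. This is a sketch of the idea only, not the runtime's code; the class name and the decision to tolerate a benign race on first use are my assumptions.

```csharp
#nullable enable
using System;

public static class SmallNumberCacheSketch
{
    private const int CacheSize = 300; // covers 0 through 299

    // Entries start out null and are filled in on first use.
    private static readonly string?[] s_cache = new string?[CacheSize];

    public static string ToStringCached(int value)
    {
        // The unsigned compare rejects negatives and values >= 300 in one check.
        if ((uint)value < CacheSize)
        {
            // Two threads may race to create the same string; either result is fine.
            return s_cache[value] ??= value.ToString();
        }

        return value.ToString();
    }
}
```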

Numbers in .NET 8 also gain the ability to format as binary (via dotnet/runtime#84889) and parse from binary (via dotnet/runtime#84998), via the new “b” specifier. For example, this:

// dotnet run -f net8.0

int i = 12345;
Console.WriteLine(i.ToString("x16")); // 16 hex digits
Console.WriteLine(i.ToString("b16")); // 16 binary digits

outputs:

0000000000003039
0011000000111001
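Going the other direction, a round trip might look like the following sketch. It assumes .NET 8 and uses NumberStyles.BinaryNumber for the parsing side; the helper names are illustrative.

```csharp
using System;
using System.Globalization;

public static class BinaryFormatSketch
{
    // Formats with the "b" specifier, padded to at least minDigits binary digits.
    public static string ToBinary(int value, int minDigits) =>
        value.ToString("b" + minDigits, CultureInfo.InvariantCulture);

    // Parses a string of binary digits back into an int.
    public static int FromBinary(string text) =>
        int.Parse(text, NumberStyles.BinaryNumber, CultureInfo.InvariantCulture);
}
```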

That implementation is then used to reimplement the existing Convert.ToString(int value, int toBase) method, such that it’s also now optimized:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly int _value = 12345;

    [Benchmark]
    public string ConvertBinary() => Convert.ToString(_value, 2);
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| ConvertBinary | .NET 7.0 | 104.73 ns | 1.00 |
| ConvertBinary | .NET 8.0 | 23.76 ns | 0.23 |

In a significant addition to the primitive types (numerical and beyond), .NET 8 also sees the introduction of the new IUtf8SpanFormattable interface. ISpanFormattable was introduced in .NET 6, and with it TryFormat methods on many types that enable those types to directly format into a Span<char>:

public interface ISpanFormattable : IFormattable
{
    bool TryFormat(Span<char> destination, out int charsWritten, ReadOnlySpan<char> format, IFormatProvider? provider);
}

Now in .NET 8, we also have the IUtf8SpanFormattable interface:

public interface IUtf8SpanFormattable
{
    bool TryFormat(Span<byte> utf8Destination, out int bytesWritten, ReadOnlySpan<char> format, IFormatProvider? provider);
}

that enables types to directly format into a Span<byte>. These are by design almost identical, the key difference being whether the implementation of these interfaces writes out UTF16 chars or UTF8 bytes. With dotnet/runtime#84587 and dotnet/runtime#84841, all of the numerical primitives in System.Private.CoreLib both implement the new interface and expose a public TryFormat method. So, for example, ulong exposes these:

public bool TryFormat(Span<char> destination, out int charsWritten, [StringSyntax(StringSyntaxAttribute.NumericFormat)] ReadOnlySpan<char> format = default, IFormatProvider? provider = null);
public bool TryFormat(Span<byte> utf8Destination, out int bytesWritten, [StringSyntax(StringSyntaxAttribute.NumericFormat)] ReadOnlySpan<char> format = default, IFormatProvider? provider = null);

They have the exact same functionality, support the exact same format strings, the same general performance characteristics, and so on, and simply differ in whether writing out UTF16 or UTF8. How can I be so sure they’re so similar? Because, drumroll, they share the same implementation. Thanks to generics, the two methods above delegate to the exact same helper:

public static bool TryFormatUInt64<TChar>(ulong value, ReadOnlySpan<char> format, IFormatProvider? provider, Span<TChar> destination, out int charsWritten)

just with one with TChar as char and the other as byte. So, when we run a benchmark like this:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly ulong _value = 12345678901234567890;
    private readonly char[] _chars = new char[20];
    private readonly byte[] _bytes = new byte[20];

    [Benchmark] public void FormatUTF16() => _value.TryFormat(_chars, out _);
    [Benchmark] public void FormatUTF8() => _value.TryFormat(_bytes, out _);
}

we get practically identical results like this:

| Method | Mean |
|---|---|
| FormatUTF16 | 12.10 ns |
| FormatUTF8 | 12.96 ns |
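To give a feel for how one generic routine can serve both encodings, here is a simplified sketch. It is not the runtime's helper: the IBinaryInteger<TChar> constraint is my assumption for illustration (the runtime uses its own internal abstraction), and the sketch ignores format strings and providers entirely.

```csharp
using System;
using System.Numerics;

public static class SharedFormattingSketch
{
    // TChar is char for UTF-16 output and byte for UTF-8 output.
    public static bool TryFormatUInt64<TChar>(ulong value, Span<TChar> destination, out int written)
        where TChar : unmanaged, IBinaryInteger<TChar>
    {
        // Count digits first so we can fail before writing anything.
        int digits = 1;
        for (ulong v = value; v >= 10; v /= 10) digits++;

        if (digits > destination.Length)
        {
            written = 0;
            return false;
        }

        // ASCII digits have the same code unit value in UTF-16 and UTF-8,
        // so the only difference is the width of what gets stored.
        for (int i = digits - 1; i >= 0; i--)
        {
            destination[i] = TChar.CreateTruncating('0' + value % 10);
            value /= 10;
        }

        written = digits;
        return true;
    }
}
```

Calling TryFormatUInt64<char> and TryFormatUInt64<byte> runs the exact same logic, which is why the two benchmark results above are so close.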

And now that the primitive types themselves are able to format with full fidelity as UTF8, the Utf8Formatter class largely becomes legacy. In fact, the previously mentioned PR also rips out Utf8Formatter’s implementation and just reparents it on top of the same formatting logic from the primitive types. All of the previously cited performance improvements to number formatting then not only accrue to ToString and TryFormat for UTF16, and not only to TryFormat for UTF8, but then also to Utf8Formatter (plus, removing duplicated code and reducing maintenance burden makes me giddy).

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers.Text;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly byte[] _bytes = new byte[10];

    [Benchmark]
    [Arguments(123)]
    [Arguments(1234567890)]
    public bool Utf8FormatterTryFormat(int i) => Utf8Formatter.TryFormat(i, _bytes, out int bytesWritten);
}

| Method | Runtime | i | Mean | Ratio |
|---|---|---|---|---|
| Utf8FormatterTryFormat | .NET 7.0 | 123 | 8.849 ns | 1.00 |
| Utf8FormatterTryFormat | .NET 8.0 | 123 | 4.645 ns | 0.53 |
| Utf8FormatterTryFormat | .NET 7.0 | 1234567890 | 15.844 ns | 1.00 |
| Utf8FormatterTryFormat | .NET 8.0 | 1234567890 | 7.174 ns | 0.45 |

Not only is UTF8 formatting directly supported by all these types, so, too, is parsing. dotnet/runtime#86875 added the new IUtf8SpanParsable<TSelf> interface and implemented it on the primitive numeric types. Just as with its formatting counterpart, this provides identical behavior to IParsable<TSelf>, just for UTF8 instead of UTF16. And just as with its formatting counterpart, all of the parsing logic is shared in generic routines between the two modes. In fact, not only does this share logic between UTF16 and UTF8 parsing, it follows closely on the heels of dotnet/runtime#84582, which uses the same generic tricks to deduplicate the parsing logic across all the primitive types, such that the same generic routines end up being used for all the types and both UTF8 and UTF16. That PR removed almost 2,000 lines of code from System.Private.CoreLib:

GitHub plus/minus line count indicator for parsing deduplication
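As a usage sketch of the UTF8 formatting and parsing surface area on .NET 8, the following round-trips an int through UTF8 bytes without ever creating a string. The helper name is made up for illustration.

```csharp
using System;
using System.Globalization;

public static class Utf8RoundTripSketch
{
    public static int RoundTrip(int value)
    {
        // 16 bytes is plenty: "-2147483648" is 11 bytes long.
        Span<byte> utf8 = stackalloc byte[16];

        // Span<byte> destination selects the UTF8 overload (IUtf8SpanFormattable).
        if (!value.TryFormat(utf8, out int bytesWritten, default, CultureInfo.InvariantCulture))
        {
            throw new InvalidOperationException("Destination too small.");
        }

        // ReadOnlySpan<byte> input selects the UTF8 overload (IUtf8SpanParsable<int>).
        ReadOnlySpan<byte> written = utf8[..bytesWritten];
        return int.Parse(written, CultureInfo.InvariantCulture);
    }
}
```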

DateTime

Parsing and formatting are improved on other types, as well. Take DateTime and DateTimeOffset. dotnet/runtime#84963 improved a variety of aspects of DateTime{Offset} formatting:

  • The formatting logic has general support used as a fallback and that supports any custom format, but then there are dedicated routines used for the most popular formats, allowing them to be optimized and tuned. Dedicated routines already existed for the very popular “r” (RFC1123 pattern) and “o” (round-trip date/time pattern) formats; this PR adds dedicated routines for the default format (“G”) when used with the invariant culture, the “s” format (sortable date/time pattern), and “u” format (universal sortable date/time pattern), all of which are used frequently in a variety of domains.
  • For the “U” format (universal full date/time pattern), the implementation would end up always allocating new DateTimeFormatInfo and GregorianCalendar instances, resulting in a significant amount of allocation even though it was only needed in a rare fallback case. This fixed it to only allocate when truly required.
  • When there’s no dedicated formatting routine, formatting is done into an internal ref struct called ValueListBuilder<T> that starts with a provided span buffer (typically seeded from a stackalloc) and then grows with ArrayPool memory as needed. After the formatting has completed, that builder is either copied into a destination span or a new string, depending on the method that triggered the formatting. However, we can avoid that copy for a destination span if we just seed the builder with the destination span. Then if the builder still contains the initial span when formatting has completed (having not grown out of it), we know all the data fit, and we can skip the copy, as all the data is already there.
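The last of those bullets, seeding the builder with the destination so the final copy can be skipped, can be sketched as follows. SeededBuilderSketch is a hypothetical, much-reduced stand-in for the internal ValueListBuilder<T>, written only to show the "is the builder still sitting on the destination?" check.

```csharp
#nullable enable
using System;
using System.Buffers;

public ref struct SeededBuilderSketch
{
    private Span<char> _span;   // current backing storage
    private char[]? _rented;    // non-null only after growing
    private int _length;

    public SeededBuilderSketch(Span<char> initialBuffer)
    {
        _span = initialBuffer;
        _rented = null;
        _length = 0;
    }

    public void Append(ReadOnlySpan<char> text)
    {
        if (text.Length > _span.Length - _length) Grow(text.Length);
        text.CopyTo(_span[_length..]);
        _length += text.Length;
    }

    private void Grow(int additional)
    {
        int newSize = Math.Max(_span.Length * 2, _length + additional);
        char[] bigger = ArrayPool<char>.Shared.Rent(newSize);
        _span[.._length].CopyTo(bigger);
        if (_rented is not null) ArrayPool<char>.Shared.Return(_rented);
        _rented = bigger;
        _span = bigger;
    }

    // Completes formatting into destination, skipping the copy when the
    // builder was seeded with destination and never grew out of it.
    public bool TryFinish(Span<char> destination, out int charsWritten)
    {
        bool success = true;
        if (_span != destination)
        {
            // The data lives somewhere else, so it has to be copied over.
            success = _span[.._length].TryCopyTo(destination);
        }
        charsWritten = success ? _length : 0;

        if (_rented is not null)
        {
            ArrayPool<char>.Shared.Return(_rented);
            _rented = null;
        }
        return success;
    }
}
```

When the formatted text fits, TryFinish does no copying at all; when it doesn't, the builder has already moved to pooled memory and the usual "destination too small" failure falls out of the copy attempt.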

Here’s some of the example impact:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Globalization;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly DateTime _dt = new DateTime(2023, 9, 1, 12, 34, 56);
    private readonly char[] _chars = new char[100];

    [Params(null, "s", "u", "U", "G")]
    public string Format { get; set; }

    [Benchmark] public string DT_ToString() => _dt.ToString(Format);
    [Benchmark] public string DT_ToStringInvariant() => _dt.ToString(Format, CultureInfo.InvariantCulture);
    [Benchmark] public bool DT_TryFormat() => _dt.TryFormat(_chars, out _, Format);
    [Benchmark] public bool DT_TryFormatInvariant() => _dt.TryFormat(_chars, out _, Format, CultureInfo.InvariantCulture);
}

| Method | Runtime | Format | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|
| DT_ToString | .NET 7.0 | ? | 166.64 ns | 1.00 | 64 B | 1.00 |
| DT_ToString | .NET 8.0 | ? | 102.45 ns | 0.62 | 64 B | 1.00 |
| DT_ToStringInvariant | .NET 7.0 | ? | 161.94 ns | 1.00 | 64 B | 1.00 |
| DT_ToStringInvariant | .NET 8.0 | ? | 28.74 ns | 0.18 | 64 B | 1.00 |
| DT_TryFormat | .NET 7.0 | ? | 151.52 ns | 1.00 | - | NA |
| DT_TryFormat | .NET 8.0 | ? | 78.57 ns | 0.52 | - | NA |
| DT_TryFormatInvariant | .NET 7.0 | ? | 140.35 ns | 1.00 | - | NA |
| DT_TryFormatInvariant | .NET 8.0 | ? | 18.26 ns | 0.13 | - | NA |
| DT_ToString | .NET 7.0 | G | 162.86 ns | 1.00 | 64 B | 1.00 |
| DT_ToString | .NET 8.0 | G | 109.49 ns | 0.68 | 64 B | 1.00 |
| DT_ToStringInvariant | .NET 7.0 | G | 162.20 ns | 1.00 | 64 B | 1.00 |
| DT_ToStringInvariant | .NET 8.0 | G | 102.71 ns | 0.63 | 64 B | 1.00 |
| DT_TryFormat | .NET 7.0 | G | 148.32 ns | 1.00 | - | NA |
| DT_TryFormat | .NET 8.0 | G | 83.60 ns | 0.57 | - | NA |
| DT_TryFormatInvariant | .NET 7.0 | G | 145.05 ns | 1.00 | - | NA |
| DT_TryFormatInvariant | .NET 8.0 | G | 79.77 ns | 0.55 | - | NA |
| DT_ToString | .NET 7.0 | s | 186.44 ns | 1.00 | 64 B | 1.00 |
| DT_ToString | .NET 8.0 | s | 29.35 ns | 0.17 | 64 B | 1.00 |
| DT_ToStringInvariant | .NET 7.0 | s | 182.15 ns | 1.00 | 64 B | 1.00 |
| DT_ToStringInvariant | .NET 8.0 | s | 27.67 ns | 0.16 | 64 B | 1.00 |
| DT_TryFormat | .NET 7.0 | s | 165.08 ns | 1.00 | - | NA |
| DT_TryFormat | .NET 8.0 | s | 15.53 ns | 0.09 | - | NA |
| DT_TryFormatInvariant | .NET 7.0 | s | 155.24 ns | 1.00 | - | NA |
| DT_TryFormatInvariant | .NET 8.0 | s | 15.50 ns | 0.10 | - | NA |
| DT_ToString | .NET 7.0 | u | 184.71 ns | 1.00 | 64 B | 1.00 |
| DT_ToString | .NET 8.0 | u | 29.62 ns | 0.16 | 64 B | 1.00 |
| DT_ToStringInvariant | .NET 7.0 | u | 184.01 ns | 1.00 | 64 B | 1.00 |
| DT_ToStringInvariant | .NET 8.0 | u | 26.98 ns | 0.15 | 64 B | 1.00 |
| DT_TryFormat | .NET 7.0 | u | 171.73 ns | 1.00 | - | NA |
| DT_TryFormat | .NET 8.0 | u | 16.08 ns | 0.09 | - | NA |
| DT_TryFormatInvariant | .NET 7.0 | u | 158.42 ns | 1.00 | - | NA |
| DT_TryFormatInvariant | .NET 8.0 | u | 15.58 ns | 0.10 | - | NA |
| DT_ToString | .NET 7.0 | U | 1,622.28 ns | 1.00 | 1240 B | 1.00 |
| DT_ToString | .NET 8.0 | U | 206.08 ns | 0.13 | 96 B | 0.08 |
| DT_ToStringInvariant | .NET 7.0 | U | 1,567.92 ns | 1.00 | 1240 B | 1.00 |
| DT_ToStringInvariant | .NET 8.0 | U | 207.60 ns | 0.13 | 96 B | 0.08 |
| DT_TryFormat | .NET 7.0 | U | 1,590.27 ns | 1.00 | 1144 B | 1.00 |
| DT_TryFormat | .NET 8.0 | U | 190.98 ns | 0.12 | - | 0.00 |
| DT_TryFormatInvariant | .NET 7.0 | U | 1,560.00 ns | 1.00 | 1144 B | 1.00 |
| DT_TryFormatInvariant | .NET 8.0 | U | 184.11 ns | 0.12 | - | 0.00 |

Parsing has also improved meaningfully. For example, dotnet/runtime#82877 improves the handling of “ddd” (abbreviated name of the day of the week), “dddd” (full name of the day of the week), “MMM” (abbreviated name of the month), and “MMMM” (full name of the month) in a custom format string; these show up in a variety of commonly used format strings, such as in the expanded definition of the RFC1123 format: ddd, dd MMM yyyy HH':'mm':'ss 'GMT'. When the general parsing routine encounters these in a format string, it needs to consult the supplied CultureInfo / DateTimeFormatInfo for that culture’s associated month and day names, e.g. DateTimeFormatInfo.GetAbbreviatedMonthName, and then needs to do a linguistic ignore-case comparison for each name against the input text; that’s not particularly cheap. However, if we’re given an invariant culture, we can do the comparison much, much faster. Take “MMM” for abbreviated month name, for example. We can read the next three characters (uint m0 = span[0], m1 = span[1], m2 = span[2]), ensure they’re all ASCII ((m0 | m1 | m2) <= 0x7F), and then combine them all into a single uint, employing the same ASCII casing trick discussed earlier ((m0 << 16) | (m1 << 8) | m2 | 0x202020). We can do the same thing, precomputed, for each month name, which for the invariant culture we know in advance, and the entire lookup becomes a single numerical switch:

switch ((m0 << 16) | (m1 << 8) | m2 | 0x202020)
{
    case 0x6a616e: /* 'jan' */ result = 1; break;
    case 0x666562: /* 'feb' */ result = 2; break;
    case 0x6d6172: /* 'mar' */ result = 3; break;
    case 0x617072: /* 'apr' */ result = 4; break;
    case 0x6d6179: /* 'may' */ result = 5; break;
    case 0x6a756e: /* 'jun' */ result = 6; break;
    case 0x6a756c: /* 'jul' */ result = 7; break;
    case 0x617567: /* 'aug' */ result = 8; break;
    case 0x736570: /* 'sep' */ result = 9; break;
    case 0x6f6374: /* 'oct' */ result = 10; break;
    case 0x6e6f76: /* 'nov' */ result = 11; break;
    case 0x646563: /* 'dec' */ result = 12; break;
    default: maxMatchStrLen = 0; break; // undo match assumption
}

Nifty, and way faster.
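If you want to play with the trick in isolation, here is a self-contained version wrapped in a hypothetical TryParseAbbreviatedMonth helper (the name and the bool/out shape are mine, not the runtime's). It shows why one switch handles "Nov", "NOV", and "nov" alike: OR-ing 0x20 into each character folds ASCII upper-case letters onto their lower-case forms.

```csharp
using System;

public static class MonthNameSketch
{
    public static bool TryParseAbbreviatedMonth(ReadOnlySpan<char> span, out int month)
    {
        month = 0;
        if (span.Length < 3) return false;

        uint m0 = span[0], m1 = span[1], m2 = span[2];

        // Non-ASCII input: a real parser would fall back to the general path here.
        if ((m0 | m1 | m2) > 0x7F) return false;

        // Pack three characters into one number, lower-casing each with | 0x20.
        switch ((m0 << 16) | (m1 << 8) | m2 | 0x202020)
        {
            case 0x6a616e: month = 1; break;  // 'jan'
            case 0x666562: month = 2; break;  // 'feb'
            case 0x6d6172: month = 3; break;  // 'mar'
            case 0x617072: month = 4; break;  // 'apr'
            case 0x6d6179: month = 5; break;  // 'may'
            case 0x6a756e: month = 6; break;  // 'jun'
            case 0x6a756c: month = 7; break;  // 'jul'
            case 0x617567: month = 8; break;  // 'aug'
            case 0x736570: month = 9; break;  // 'sep'
            case 0x6f6374: month = 10; break; // 'oct'
            case 0x6e6f76: month = 11; break; // 'nov'
            case 0x646563: month = 12; break; // 'dec'
        }

        return month != 0;
    }
}
```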

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Globalization;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private const string Format = "ddd, dd MMM yyyy HH':'mm':'ss 'GMT'";
    private readonly string _s = new DateTime(1955, 11, 5, 6, 0, 0, DateTimeKind.Utc).ToString(Format, CultureInfo.InvariantCulture);

    [Benchmark]
    public void ParseExact() => DateTimeOffset.ParseExact(_s, Format, CultureInfo.InvariantCulture, DateTimeStyles.AllowInnerWhite | DateTimeStyles.AssumeUniversal);
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| ParseExact | .NET 7.0 | 1,139.3 ns | 1.00 | 80 B | 1.00 |
| ParseExact | .NET 8.0 | 318.6 ns | 0.28 | - | 0.00 |

A variety of other PRs contributed as well. The decreased allocation in the previous benchmark is thanks to dotnet/runtime#82861, which removed a string allocation that might occur when the format string contained quotes; the PR simply replaced the string allocation with use of spans. dotnet/runtime#82925 further reduced the cost of parsing with the “r” and “o” formats by removing some work that ended up being unnecessary, removing a virtual dispatch, and general streamlining of the code paths. And dotnet/runtime#84964 removed some string[] allocations that occurred in ParseExact when parsing with some cultures, in particular those that employ genitive month names. If the parser needed to retrieve the MonthGenitiveNames or AbbreviatedMonthGenitiveNames arrays, it would do so via the public properties for these on DateTimeFormatInfo; however, out of concern that code could mutate those arrays, these public properties hand back copies. That means that the parser was allocating a copy every time it accessed one of these. The parser can instead access the underlying original array, and pinky swear not to change it.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Globalization;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly CultureInfo _ci = new CultureInfo("ru-RU");

    [Benchmark] public DateTime Parse() => DateTime.ParseExact("вторник, 18 апреля 2023 04:31:26", "dddd, dd MMMM yyyy HH:mm:ss", _ci);
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| Parse | .NET 7.0 | 2.654 us | 1.00 | 128 B | 1.00 |
| Parse | .NET 8.0 | 2.353 us | 0.90 | - | 0.00 |

DateTime and DateTimeOffset also implement IUtf8SpanFormattable, thanks to dotnet/runtime#84469, and as with the numerical types, the implementations are all shared between UTF16 and UTF8; thus all of the optimizations previously mentioned accrue to both. And again, Utf8Formatter’s support for formatting DateTimeOffset is just reparented on top of this same shared logic.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers.Text;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly DateTime _dt = new DateTime(2023, 9, 1, 12, 34, 56);
    private readonly byte[] _bytes = new byte[100];

    [Benchmark] public bool TryFormatUtf8Formatter() => Utf8Formatter.TryFormat(_dt, _bytes, out _);
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| TryFormatUtf8Formatter | .NET 7.0 | 19.35 ns | 1.00 |
| TryFormatUtf8Formatter | .NET 8.0 | 16.24 ns | 0.83 |

Since we’re talking about DateTime, a brief foray into TimeZoneInfo. TimeZoneInfo.FindSystemTimeZoneById gets a TimeZoneInfo object for the specified identifier. One of the improvements introduced in .NET 6 is that FindSystemTimeZoneById supports both the Windows time zone set as well as the IANA time zone set, regardless of whether running on Windows or Linux or macOS. However, the TimeZoneInfo was only being cached when its ID matched that for the current OS, and as such calls that resolved to the other set weren’t being fulfilled by the cache and were falling back to re-reading from the OS. dotnet/runtime#85615 ensures a cache can be used in both cases. It also allows returning the immutable TimeZoneInfo objects directly, rather than cloning them on every access. dotnet/runtime#88368 also improves TimeZoneInfo, in particular GetSystemTimeZones on Linux and macOS, by lazily loading several of the properties. dotnet/runtime#89985 then improves on that with a new overload of GetSystemTimeZones that allows the caller to skip the sort the implementation would otherwise perform on the result.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    [Benchmark]
    [Arguments("America/Los_Angeles")]
    [Arguments("Pacific Standard Time")]
    public TimeZoneInfo FindSystemTimeZoneById(string id) => TimeZoneInfo.FindSystemTimeZoneById(id);
}

| Method | Runtime | id | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|
| FindSystemTimeZoneById | .NET 7.0 | America/Los_Angeles | 1,503.75 ns | 1.00 | 80 B | 1.00 |
| FindSystemTimeZoneById | .NET 8.0 | America/Los_Angeles | 40.96 ns | 0.03 | - | 0.00 |
| FindSystemTimeZoneById | .NET 7.0 | Pacif(…) Time [21] | 3,951.60 ns | 1.00 | 568 B | 1.00 |
| FindSystemTimeZoneById | .NET 8.0 | Pacif(…) Time [21] | 57.00 ns | 0.01 | - | 0.00 |

Back to formatting and parsing…

Guid

Formatting and parsing improvements go beyond the numerical and date types. Guid also gets in on the game. Thanks to dotnet/runtime#84553, Guid implements IUtf8SpanFormattable, and as with all the other cases, it shares the exact same routines between UTF16 and UTF8 support. Then dotnet/runtime#81650, dotnet/runtime#81666, and dotnet/runtime#87126 from @SwapnilGaikwad vectorize that formatting support.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly Guid _guid = Guid.Parse("7BD626F6-4396-41E3-A491-4B1DC538DD92");
    private readonly char[] _dest = new char[100];

    [Benchmark]
    [Arguments("D")]
    [Arguments("N")]
    [Arguments("B")]
    [Arguments("P")]
    public bool TryFormat(string format) => _guid.TryFormat(_dest, out _, format);
}

| Method | Runtime | format | Mean | Ratio |
|---|---|---|---|---|
| TryFormat | .NET 7.0 | B | 23.622 ns | 1.00 |
| TryFormat | .NET 8.0 | B | 7.341 ns | 0.31 |
| TryFormat | .NET 7.0 | D | 22.134 ns | 1.00 |
| TryFormat | .NET 8.0 | D | 5.485 ns | 0.25 |
| TryFormat | .NET 7.0 | N | 20.891 ns | 1.00 |
| TryFormat | .NET 8.0 | N | 4.852 ns | 0.23 |
| TryFormat | .NET 7.0 | P | 24.139 ns | 1.00 |
| TryFormat | .NET 8.0 | P | 6.101 ns | 0.25 |

Before moving on from primitives and numerics, let’s take a quick look at System.Random, which has methods for producing pseudo-random numerical values.

Random

dotnet/runtime#79790 from @mla-alm provides an implementation in Random based on @lemire’s unbiased range functions. When a method like Next(int min, int max) is invoked, it needs to provide a value in the range [min, max). In order to provide an unbiased answer, the .NET 7 implementation generates a 32-bit value, narrows down the range to the smallest power of 2 that contains the max (by taking the log2 of the max and shifting to throw away bits), and then checks whether the result is less than the max: if it is, it returns the result as the answer. But if it’s not, it rejects the value (a process referred to as “rejection sampling”) and loops around to start the whole process over. While the cost to produce each sample in the current approach isn’t terrible, the nature of the approach makes it reasonably likely the sample will need to be rejected, which means looping and retries. With the new approach, it effectively implements modulo reduction (e.g. Next() % max), except replacing the expensive modulo operation with a cheaper multiplication and shift; then a rejection sampling loop is still employed, but the bias it corrects for happens much more rarely and thus the more expensive path happens much more rarely. The net result is a nice boost on average to the throughput of Random’s methods. (Random can also get a boost from dynamic PGO, as the internal abstraction Random uses can be devirtualized, so I’ve shown here the impact with and without PGO enabled.)

// dotnet run -c Release -f net7.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId(".NET 7").WithRuntime(CoreRuntime.Core70).AsBaseline())
    .AddJob(Job.Default.WithId(".NET 8 w/o PGO").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_TieredPGO", "0"))
    .AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables")]
public class Tests
{
    private static readonly Random s_rand = new();

    [Benchmark]
    public int NextMax() => s_rand.Next(12345);
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| NextMax | .NET 7.0 | 5.793 ns | 1.00 |
| NextMax | .NET 8.0 w/o PGO | 1.840 ns | 0.32 |
| NextMax | .NET 8.0 | 1.598 ns | 0.28 |
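For the curious, the multiply-and-shift range reduction can be sketched like this for 32-bit values. This follows the general shape of Lemire's published algorithm rather than the exact code in Random; the Func<uint> parameter is just a stand-in I'm using for whatever produces uniformly distributed 32-bit values.

```csharp
using System;

public static class UnbiasedRangeSketch
{
    // Returns a value in [0, maxValue), given a source of uniform 32-bit values.
    public static uint Next(uint maxValue, Func<uint> nextUInt32)
    {
        if (maxValue == 0) throw new ArgumentOutOfRangeException(nameof(maxValue));

        // The high 32 bits of the 64-bit product are the candidate result.
        ulong product = (ulong)maxValue * nextUInt32();
        uint low = (uint)product;

        if (low < maxValue)
        {
            // Rare path: only here do we pay for a modulo, and only a few
            // samples (those below 2^32 % maxValue) actually get rejected.
            uint threshold = unchecked(0u - maxValue) % maxValue;
            while (low < threshold)
            {
                product = (ulong)maxValue * nextUInt32();
                low = (uint)product;
            }
        }

        return (uint)(product >> 32);
    }
}
```

For small maxValue the low < maxValue check almost never passes, so the common case is one multiply, one compare, and one shift.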

dotnet/runtime#87219 from @MichalPetryka then further improves this for long values. The core part of the algorithm involves multiplying the random value by the max value and then taking the low part of the product:

UInt128 randomProduct = (UInt128)maxValue * xoshiro.NextUInt64();
ulong lowPart = (ulong)randomProduct;

This can be made more efficient by not using UInt128’s multiplication implementation and instead using Math.BigMul,

```csharp
ulong randomProduct = Math.BigMul(maxValue, xoshiro.NextUInt64(), out ulong lowPart);
```

which is implemented to use the Bmi2.X64.MultiplyNoFlags or ArmBase.Arm64.MultiplyHigh intrinsics when one is available.
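As a quick sanity check of the equivalence described above, the following sketch compares the two ways of obtaining the high and low 64-bit halves of a 64x64-bit product. The input values are arbitrary stand-ins chosen for illustration; they are not taken from the Random implementation.

```csharp
using System;

ulong maxValue = 1_000_000_007UL;          // arbitrary illustrative bound
ulong random = 0x9E3779B97F4A7C15UL;       // stand-in for a 64-bit random sample

// Approach 1: widen to UInt128 and multiply.
UInt128 wide = (UInt128)maxValue * random;

// Approach 2: Math.BigMul returns the high 64 bits and outputs the low 64 bits.
ulong high = Math.BigMul(maxValue, random, out ulong low);

Console.WriteLine(high == (ulong)(wide >> 64)); // True
Console.WriteLine(low == (ulong)wide);          // True
```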

```csharp
// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables")]
public class Tests
{
    private static readonly Random s_rand = new();

    [Benchmark]
    public long NextMinMax() => s_rand.NextInt64(123456789101112, 1314151617181920);
}
```

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| NextMinMax | .NET 7.0 | 9.839 ns | 1.00 |
| NextMinMax | .NET 8.0 | 1.927 ns | 0.20 |

Finally, I’ll mention dotnet/runtime#81627. Random is both a commonly-used type in its own right and also an abstraction; many of the APIs on Random are virtual, such that a derived type can be implemented to completely swap out the algorithm employed. So, for example, if you wanted to implement a MersenneTwisterRandom that derived from Random and completely replaced the base algorithm by overriding every virtual method, you could do so, pass your instance around as Random, and everyone’s happy… unless you’re creating your derived type frequently and care about allocation. Random actually includes multiple pseudo-random generators. .NET 6 imbued it with an implementation of the xoshiro128**/xoshiro256** algorithms, which are used when you just do new Random(). However, if you instead instantiate a derived type, the implementation falls back to the same algorithm (a variant of Knuth’s subtractive random number generator algorithm) it’s used since the dawn of Random, as it doesn’t know what the derived type will be doing nor what dependencies it may have taken on the nature of the algorithm employed. That algorithm carries with it a 56-element int[], which means that derived classes end up instantiating and initializing that array even if they never use it. With this PR, the creation of that array is made lazy, such that it’s only initialized if and when it’s used. With that, a derived implementation that wants to avoid that cost can.

```csharp
// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    [Benchmark]
    public Random NewDerived() => new NotRandomRandom();

    private sealed class NotRandomRandom : Random { }
}
```

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- |
| NewDerived | .NET 7.0 | 1,237.73 ns | 1.00 | 312 B | 1.00 |
| NewDerived | .NET 8.0 | 20.49 ns | 0.02 | 72 B | 0.23 |

Strings, Arrays, and Spans

.NET 8 sees a tremendous amount of improvement in the realm of data processing, in particular in the efficient manipulation of strings, arrays, and spans. Since we’ve just been talking about UTF8 and IUtf8SpanFormattable, let’s start there.

UTF8

As noted, IUtf8SpanFormattable is now implemented on a bunch of types. I noted all the numerical primitives, DateTime{Offset}, and Guid, and with dotnet/runtime#84556 the System.Version type also implements it, as do IPAddress and the new IPNetwork types, thanks to dotnet/runtime#84487. However, .NET 8 doesn’t just provide implementations of this interface on all of these types, it also consumes the interface in a key place.

If you’ll recall, string interpolation in C# 10 and .NET 6 was completely overhauled. This included not only making string interpolation much more efficient, but also providing a pattern that a type could implement to allow for the string interpolation syntax to be used efficiently to do things other than create a new string. For example, a new TryWrite extension method for Span<char> was added that makes it possible to format an interpolated string directly into a destination char buffer:

```csharp
public bool Format(Span<char> span, DateTime dt, out int charsWritten) =>
    span.TryWrite($"Date: {dt:R}", out charsWritten);
```

The above gets translated (“lowered”) by the compiler into the equivalent of the following:

```csharp
public bool Format(Span<char> span, DateTime dt, out int charsWritten)
{
    var handler = new MemoryExtensions.TryWriteInterpolatedStringHandler(6, 1, span, out bool shouldAppend);
    _ = shouldAppend &&
        handler.AppendLiteral("Date: ") &&
        handler.AppendFormatted<DateTime>(dt, "R");
    return MemoryExtensions.TryWrite(span, ref handler, out charsWritten);
}
```

The implementation of that generic AppendFormatted<T> call examines the T and tries to do the most optimal thing. In this case, it’ll see that T implements ISpanFormattable, and it’ll end up using its TryFormat to format directly into the destination span.

That’s for UTF16. Now with IUtf8SpanFormattable, we have the opportunity to do the same thing but for UTF8. And that’s exactly what dotnet/runtime#83852 does. It introduces the new Utf8.TryWrite method, which behaves exactly like the aforementioned TryWrite, except writing as UTF8 into a destination Span<byte> instead of as UTF16 into a destination Span<char>. The implementation also special-cases IUtf8SpanFormattable, using its TryFormat to write directly into the destination buffer.

With that, we can write the equivalent to the method we wrote earlier:

```csharp
public bool Format(Span<byte> span, DateTime dt, out int bytesWritten) =>
    Utf8.TryWrite(span, $"Date: {dt:R}", out bytesWritten);
```

and that gets lowered as you’d now expect:

```csharp
public bool Format(Span<byte> span, DateTime dt, out int bytesWritten)
{
    var handler = new Utf8.TryWriteInterpolatedStringHandler(6, 1, span, out bool shouldAppend);
    _ = shouldAppend &&
        handler.AppendLiteral("Date: ") &&
        handler.AppendFormatted<DateTime>(dt, "R");
    return Utf8.TryWrite(span, ref handler, out bytesWritten);
}
```

So, identical, other than the parts you expect to change. But that’s also a problem in some ways. Take a look at that AppendLiteral("Date: ") call. In the UTF16 case where we’re dealing with a destination Span<char>, the implementation of AppendLiteral simply needs to copy that string into the destination; not only that, but the JIT will inline the call, see that a string literal is being copied, and will unroll the copy, making it super efficient. But in the UTF8 case, we can’t just copy the UTF16 string chars into the destination UTF8 Span<byte> buffer; we need to UTF8 encode the string. And while we can certainly do that (dotnet/runtime#84609 and dotnet/runtime#85120 make that trivial with the addition of a new Encoding.TryGetBytes method), it’s frustratingly inefficient to need to spend cycles repeatedly at run-time doing work that could be done at compile time. After all, we’re dealing with a string literal known at JIT time; it’d be really, really nice if the JIT could do the UTF8 encoding and then do an unrolled copy just as it’s already doing in the UTF16 case. And with dotnet/runtime#85328 and dotnet/runtime#89376, that’s exactly what happens, such that performance is effectively the same between them.

```csharp
// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Unicode;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly char[] _chars = new char[100];
    private readonly byte[] _bytes = new byte[100];
    private readonly int _major = 1, _minor = 2, _build = 3, _revision = 4;

    [Benchmark]
    public bool FormatUTF16() => _chars.AsSpan().TryWrite($"{_major}.{_minor}.{_build}.{_revision}", out int charsWritten);

    [Benchmark]
    public bool FormatUTF8() => Utf8.TryWrite(_bytes, $"{_major}.{_minor}.{_build}.{_revision}", out int bytesWritten);
}
```

| Method | Mean |
| --- | --- |
| FormatUTF16 | 19.07 ns |
| FormatUTF8 | 19.33 ns |
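For readers who want to try Utf8.TryWrite outside of a benchmark, here is a small usage sketch. The buffer sizes and values are arbitrary choices for illustration; the point is that the interpolated string is written as UTF8 bytes straight into the destination, and that the call reports failure rather than throwing when the destination is too small.

```csharp
using System;
using System.Text;
using System.Text.Unicode;

int major = 1, minor = 2;

// Large enough destination: the interpolated string is written as UTF8 bytes.
byte[] buffer = new byte[32];
bool ok = Utf8.TryWrite(buffer, $"v{major}.{minor}", out int bytesWritten);
string text = Encoding.UTF8.GetString(buffer, 0, bytesWritten);
Console.WriteLine($"{ok} {bytesWritten} {text}"); // True 4 v1.2

// Destination too small: TryWrite returns false instead of throwing.
byte[] tiny = new byte[2];
bool fits = Utf8.TryWrite(tiny, $"v{major}.{minor}", out _);
Console.WriteLine(fits); // False
```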

ASCII

UTF8 is the predominant encoding for text on the internet and for the movement of text between endpoints. However, much of this data is actually the ASCII subset, the 128 values in the range [0, 127]. When you know the data you’re working with is ASCII, you can achieve even better performance by using routines optimized for the subset. The new Ascii class in .NET 8, introduced in dotnet/runtime#75012 and dotnet/runtime#84886, and then further optimized in dotnet/runtime#85926 from @gfoidl, dotnet/runtime#85266 from @Daniel-Svensson, dotnet/runtime#84881, and dotnet/runtime#87141, provides this:

```csharp
namespace System.Text;

public static class Ascii
{
    public static bool Equals(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right);
    public static bool Equals(ReadOnlySpan<byte> left, ReadOnlySpan<char> right);
    public static bool Equals(ReadOnlySpan<char> left, ReadOnlySpan<byte> right);
    public static bool Equals(ReadOnlySpan<char> left, ReadOnlySpan<char> right);

    public static bool EqualsIgnoreCase(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right);
    public static bool EqualsIgnoreCase(ReadOnlySpan<byte> left, ReadOnlySpan<char> right);
    public static bool EqualsIgnoreCase(ReadOnlySpan<char> left, ReadOnlySpan<byte> right);
    public static bool EqualsIgnoreCase(ReadOnlySpan<char> left, ReadOnlySpan<char> right);

    public static bool IsValid(byte value);
    public static bool IsValid(char value);
    public static bool IsValid(ReadOnlySpan<byte> value);
    public static bool IsValid(ReadOnlySpan<char> value);

    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static OperationStatus ToLowerInPlace(Span<byte> value, out int bytesWritten);
    public static OperationStatus ToLowerInPlace(Span<char> value, out int charsWritten);
    public static OperationStatus ToUpperInPlace(Span<byte> value, out int bytesWritten);
    public static OperationStatus ToUpperInPlace(Span<char> value, out int charsWritten);

    public static OperationStatus FromUtf16(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);
    public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);

    public static Range Trim(ReadOnlySpan<byte> value);
    public static Range Trim(ReadOnlySpan<char> value);
    public static Range TrimEnd(ReadOnlySpan<byte> value);
    public static Range TrimEnd(ReadOnlySpan<char> value);
    public static Range TrimStart(ReadOnlySpan<byte> value);
    public static Range TrimStart(ReadOnlySpan<char> value);
}
```

Note that it provides overloads that operate on UTF16 (char) and UTF8 (byte), and in many cases, intermixes them, such that you can, for example, compare a UTF8 ReadOnlySpan<byte> with a UTF16 ReadOnlySpan<char>, or transcode a UTF16 ReadOnlySpan<char> to a UTF8 Span<byte> (which, when working with ASCII, is purely a narrowing operation, getting rid of the leading 0 byte in each char). The PR that added these methods also used them in a variety of places (something I advocate for strongly, in order to ensure what has been designed is actually meeting the need, or ensure that other core library code is benefiting from the new APIs, which in turn makes those APIs more valuable, as their benefits accrue to more indirect consumers), including in multiple places in SocketsHttpHandler. Previously, SocketsHttpHandler had its own helpers for this purpose, an example of which I’ve copied here into this benchmark:

```csharp
// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly byte[] _bytes = "Strict-Transport-Security"u8.ToArray();
    private readonly string _chars = "Strict-Transport-Security";

    [Benchmark(Baseline = true)]
    public bool Equals_OpenCoded() => EqualsOrdinalAsciiIgnoreCase(_chars, _bytes);

    [Benchmark]
    public bool Equals_Ascii() => Ascii.EqualsIgnoreCase(_chars, _bytes);

    internal static bool EqualsOrdinalAsciiIgnoreCase(string left, ReadOnlySpan<byte> right)
    {
        if (left.Length != right.Length)
            return false;

        for (int i = 0; i < left.Length; i++)
        {
            uint charA = left[i], charB = right[i];

            if ((charA - 'a') <= ('z' - 'a')) charA -= ('a' - 'A');
            if ((charB - 'a') <= ('z' - 'a')) charB -= ('a' - 'A');

            if (charA != charB)
                return false;
        }

        return true;
    }
}
```

| Method | Mean | Ratio |
| --- | --- | --- |
| Equals_OpenCoded | 31.159 ns | 1.00 |
| Equals_Ascii | 3.985 ns | 0.13 |

Many of these new Ascii APIs also got the Vector512 treatment, such that they light up when AVX512 is supported by the current machine, thanks to dotnet/runtime#88532 from @anthonycanino and dotnet/runtime#88650 from @khushal1996.
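To give a feel for how the mixed UTF8/UTF16 overloads are used, here is a short sketch exercising a few of the Ascii methods listed above. The header name and sample strings are made up for illustration.

```csharp
using System;
using System.Buffers;
using System.Text;

byte[] headerUtf8 = "Content-Type"u8.ToArray(); // UTF8 bytes, e.g. read off the wire
string headerUtf16 = "content-type";            // UTF16 text, e.g. a known header name

// Compare UTF8 bytes against UTF16 chars, ignoring ASCII case, without transcoding first.
bool same = Ascii.EqualsIgnoreCase(headerUtf8, headerUtf16);
Console.WriteLine(same); // True

// Upper-case while widening from UTF8 bytes to UTF16 chars in a single pass.
char[] upper = new char[headerUtf8.Length];
OperationStatus status = Ascii.ToUpper(headerUtf8, upper, out int charsWritten);
string upperText = new string(upper, 0, charsWritten);
Console.WriteLine($"{status} {upperText}"); // Done CONTENT-TYPE

// Validation: any value above 127 makes the input non-ASCII.
bool valid = Ascii.IsValid("h\u00E9llo");
Console.WriteLine(valid); // False
```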

Base64

An even further constrained subset of text is Base64-encoded data. This is used when arbitrary bytes need to be transferred as text, and results in text that uses only 64 characters (lowercase ASCII letters, uppercase ASCII letters, ASCII digits, ‘+’, and ‘/’). .NET has long had methods on System.Convert for encoding and decoding Base64 with UTF16 (char), and it got an additional set of span-based methods in .NET Core 2.1 with the introduction of Span<T>. At that point, the System.Buffers.Text.Base64 class was also introduced, with dedicated surface area for encoding and decoding Base64 with UTF8 (byte). That’s now improved further in .NET 8.

dotnet/runtime#85938 from @heathbm and dotnet/runtime#86396 make two contributions here. First, they bring the behavior of the Base64.Decode methods for UTF8 in line with its counterparts on the Convert class, in particular around handling of whitespace. As it’s very common for there to be newlines in Base64-encoded data, the Convert class’ methods for decoding Base64 permitted whitespace; in contrast, the Base64 class’ methods for decoding would fail if whitespace was encountered. These decoding methods now permit exactly the same whitespace that Convert does. And that’s important in part because of the second contribution from these PRs, which is a new set of Base64.IsValid static methods. As with Ascii.IsValid and Utf8.IsValid, these methods simply state whether the supplied UTF8 or UTF16 input represents a valid Base64 input, such that the decoding methods on both Convert and Base64 could successfully decode it. And as with all such processing we see introduced into .NET, we’ve strived to make the new functionality as efficient as possible so that it can be used to maximal benefit elsewhere. For example, dotnet/runtime#86221 from @WeihanLi updated the new Base64StringAttribute to use it, and dotnet/runtime#86002 updated PemEncoding.TryCountBase64 to use it. Here we can see a benchmark comparing the old non-vectorized TryCountBase64 with the new version using the vectorized Base64.IsValid:

```csharp
// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers.Text;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly string _exampleFromPemEncodingTests =
        "MHQCAQEEICBZ7/8T1JL2amvNB/QShghtgZPtnPD4W+sAcHxA+hJsoAcGBSuBBAAK\n" +
        "oUQDQgAE3yNC5as8JVN5MjF95ofNSgRBVXjf0CKtYESWfPnmvT3n+cMMJUB9lUJf\n" +
        "dkFNgaSB7JlB+krZVVV8T7HZQXVDRA==\n";

    [Benchmark(Baseline = true)]
    public bool Count_Old() => TryCountBase64_Old(_exampleFromPemEncodingTests, out _, out _, out _);

    [Benchmark]
    public bool Count_New() => TryCountBase64_New(_exampleFromPemEncodingTests, out _, out _, out _);

    private static bool TryCountBase64_New(ReadOnlySpan<char> str, out int base64Start, out int base64End, out int base64DecodedSize)
    {
        int start = 0, end = str.Length - 1;
        for (; start < str.Length && IsWhiteSpaceCharacter(str[start]); start++) ;
        for (; end > start && IsWhiteSpaceCharacter(str[end]); end--) ;

        if (Base64.IsValid(str.Slice(start, end + 1 - start), out base64DecodedSize))
        {
            base64Start = start;
            base64End = end + 1;
            return true;
        }

        base64Start = 0;
        base64End = 0;
        return false;
    }

    private static bool TryCountBase64_Old(ReadOnlySpan<char> str, out int base64Start, out int base64End, out int base64DecodedSize)
    {
        base64Start = 0;
        base64End = str.Length;

        if (str.IsEmpty)
        {
            base64DecodedSize = 0;
            return true;
        }

        int significantCharacters = 0;
        int paddingCharacters = 0;

        for (int i = 0; i < str.Length; i++)
        {
            char ch = str[i];

            if (IsWhiteSpaceCharacter(ch))
            {
                if (significantCharacters == 0) base64Start++;
                else base64End--;
                continue;
            }

            base64End = str.Length;

            if (ch == '=') paddingCharacters++;
            else if (paddingCharacters == 0 && IsBase64Character(ch)) significantCharacters++;
            else
            {
                base64DecodedSize = 0;
                return false;
            }
        }

        int totalChars = paddingCharacters + significantCharacters;

        if (paddingCharacters > 2 || (totalChars & 0b11) != 0)
        {
            base64DecodedSize = 0;
            return false;
        }

        base64DecodedSize = (totalChars >> 2) * 3 - paddingCharacters;
        return true;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static bool IsBase64Character(char ch) => char.IsAsciiLetterOrDigit(ch) || ch is '+' or '/';

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static bool IsWhiteSpaceCharacter(char ch) => ch is ' ' or '\t' or '\n' or '\r';
}
```

| Method | Mean | Ratio |
| --- | --- | --- |
| Count_Old | 356.37 ns | 1.00 |
| Count_New | 33.72 ns | 0.09 |
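As a small illustration of the validation API discussed above, this sketch checks a Base64 string containing newlines and compares the reported decoded length against what Convert actually produces. The sample input (the Base64 encoding of "Hello, world" split across two lines) is made up for the example.

```csharp
using System;
using System.Buffers.Text;

// "Hello, world" encoded as Base64, with newlines as often seen in PEM-style data.
string withNewlines = "SGVsbG8s\nIHdvcmxk\n";

// IsValid tolerates the same whitespace as Convert and reports the decoded size.
bool valid = Base64.IsValid(withNewlines, out int decodedLength);
byte[] decoded = Convert.FromBase64String(withNewlines);
Console.WriteLine($"{valid} {decodedLength} {decoded.Length}"); // True 12 12

// A character outside the Base64 alphabet makes the input invalid.
bool invalid = Base64.IsValid("not base64!");
Console.WriteLine(invalid); // False
```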

Hex

Another relevant subset of ASCII is hexadecimal, and improvements have been made in .NET 8 around conversions between bytes and their representation in hex. In particular, dotnet/runtime#82521 vectorized the Convert.FromHexString method using an algorithm outlined by Langdale and Mula. On even a moderate length input, this has a very measurable impact on throughput:

```csharp
// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Security.Cryptography;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private string _hex;

    [Params(4, 16, 128)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup() => _hex = Convert.ToHexString(RandomNumberGenerator.GetBytes(Length));

    [Benchmark]
    public byte[] ConvertFromHex() => Convert.FromHexString(_hex);
}
```

| Method | Runtime | Length | Mean | Ratio |
| --- | --- | --- | --- | --- |
| ConvertFromHex | .NET 7.0 | 4 | 24.94 ns | 1.00 |
| ConvertFromHex | .NET 8.0 | 4 | 20.71 ns | 0.83 |
| ConvertFromHex | .NET 7.0 | 16 | 57.66 ns | 1.00 |
| ConvertFromHex | .NET 8.0 | 16 | 17.29 ns | 0.30 |
| ConvertFromHex | .NET 7.0 | 128 | 337.41 ns | 1.00 |
| ConvertFromHex | .NET 8.0 | 128 | 56.72 ns | 0.17 |

Of course, the improvements in .NET 8 go well beyond just the manipulation of certain known sets of characters; there is a wealth of other improvements to explore. Let’s start with System.Text.CompositeFormat, which was introduced in dotnet/runtime#80753.

String Formatting

Since the beginning of .NET, string and friends have provided APIs for handling composite format strings, strings with text interspersed with format item placeholders, e.g. "The current time is {0:t}". These strings can then be passed to various APIs, like string.Format, which are provided with both the composite format string and the arguments that should be substituted in for the placeholders, e.g. string.Format("The current time is {0:t}", DateTime.Now) will return a string like "The current time is 3:44 PM" (the 0 in the placeholder indicates the 0-based number of the argument to substitute, and the t is the format that should be used, in this case the standard short time pattern). Such a method invocation needs to parse the composite format string each time it’s called, even though for a given call site the composite format string typically doesn’t change from invocation to invocation. These APIs are also generally non-generic, which means if an argument is a value type (as is DateTime in my example), it’ll incur a boxing allocation. To simplify the syntax around these operations, C# 6 gained support for string interpolation, such that instead of writing string.Format(null, "The current time is {0:t}", DateTime.Now), you could instead write $"The current time is {DateTime.Now:t}", and it was then up to the compiler to achieve the same behavior as if string.Format had been used (which the compiler typically achieved simply by lowering the interpolation into a call to string.Format).

In .NET 6 and C# 10, string interpolation was significantly improved, both in terms of the scenarios supported and in terms of its efficiency. One key aspect of the efficiency is it enabled the parsing to be performed once (at compile-time). It also enabled avoiding all of the allocation associated with providing arguments. These improvements contributed to all use of string interpolation and a significant portion of the use of string.Format in real-world applications and services. However, the compiler support works by being able to see the string at compile time. What if the format string isn’t known until run-time, such as if it’s pulled from a .resx resource file or some other source of configuration? At that point, string.Format remains the answer.

Now in .NET 8, there’s a new answer available: CompositeFormat. Just as an interpolated string allows the compiler to do the heavy lifting once in order to optimize repeated use, CompositeFormat allows that reusable work to be done once in order to optimize repeated use. As it does the parsing at run-time, it’s able to tackle the remaining cases that string interpolation can’t reach. To create an instance, one simply calls its Parse method, which takes a composite format string, parses it, and returns a CompositeFormat instance:

```csharp
private static readonly CompositeFormat s_currentTimeFormat = CompositeFormat.Parse(SR.CurrentTime);
```

Then, existing methods like string.Format now have new overloads, exactly the same as the existing ones, but instead of taking a string format, they take a CompositeFormat format. The same formatting as was done earlier can then instead be done like this:

```csharp
string result = string.Format(null, s_currentTimeFormat, DateTime.Now);
```

This overload (and other new overloads of methods like StringBuilder.AppendFormat and MemoryExtensions.TryWrite) accepts generic arguments, avoiding the boxing.

```csharp
// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private static readonly CompositeFormat s_format = CompositeFormat.Parse(SR.CurrentTime);

    [Benchmark(Baseline = true)]
    public string FormatString() => string.Format(null, SR.CurrentTime, DateTime.Now);

    [Benchmark]
    public string FormatComposite() => string.Format(null, s_format, DateTime.Now);
}

internal static class SR
{
    public static string CurrentTime => /*load from resource file*/"The current time is {0:t}";
}
```

| Method | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- |
| FormatString | 163.6 ns | 1.00 | 96 B | 1.00 |
| FormatComposite | 146.5 ns | 0.90 | 72 B | 0.75 |

If you know the composite format string at compile time, interpolated strings are the answer. Otherwise, CompositeFormat can give you throughput in the same ballpark at the expense of some startup costs. Formatting with a CompositeFormat is actually implemented with the same interpolated string handlers that are used for string interpolation, e.g. string.Format(..., compositeFormat, ...) ends up calling into methods on DefaultInterpolatedStringHandler to do the actual formatting work.

There’s also a new analyzer to help with this. CA1863 “Use ‘CompositeFormat’” was introduced in dotnet/roslyn-analyzers#6675 to identify string.Format and StringBuilder.AppendFormat calls that could possibly benefit from switching to use a CompositeFormat argument instead.
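As a compact illustration of the parse-once, format-many pattern, here is a sketch that reuses a single CompositeFormat with both string.Format and StringBuilder.AppendFormat. The template text and argument values are invented for the example, and the invariant culture is used only to keep the output predictable.

```csharp
using System;
using System.Globalization;
using System.Text;

// Pretend this template came from a resource file or configuration at run-time.
string template = "{0} scored {1:F1} points";

// Parse once up front...
CompositeFormat format = CompositeFormat.Parse(template);
Console.WriteLine(format.MinimumArgumentCount); // 2

// ...then reuse the parsed format; the generic overloads avoid boxing the double.
string first = string.Format(CultureInfo.InvariantCulture, format, "Ada", 9.5);
Console.WriteLine(first); // Ada scored 9.5 points

var builder = new StringBuilder();
builder.AppendFormat(CultureInfo.InvariantCulture, format, "Bob", 7.0);
Console.WriteLine(builder.ToString()); // Bob scored 7.0 points
```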

Spans

Moving on from formatting, let’s turn our attention to all the other kinds of operations one frequently wants to perform on sequences of data, whether that be arrays, strings, or the unifying force of spans. A home for many routines for manipulating all of these, via spans, is the System.MemoryExtensions type, which has received a multitude of new APIs in .NET 8.

One very common operation is to count how many of something there are. For example, in support of multiline comments, System.Text.Json needs to count how many line feed characters there are in a given piece of JSON. This is, of course, trivial to write as a loop, whether character-by-character or using IndexOf and slicing. Now in .NET 8, you can also just call the Count extension method, thanks to dotnet/runtime#80662 from @bollhals and dotnet/runtime#82687 from @gfoidl. Here we’re counting the number of line feed characters in “The Adventures of Sherlock Holmes” from Project Gutenberg:

```csharp
// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly byte[] s_utf8 = new HttpClient().GetByteArrayAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

    [Benchmark(Baseline = true)]
    public int Count_ForeachLoop()
    {
        int count = 0;
        foreach (byte c in s_utf8)
        {
            if (c == '\n') count++;
        }
        return count;
    }

    [Benchmark]
    public int Count_IndexOf()
    {
        ReadOnlySpan<byte> remaining = s_utf8;
        int count = 0;

        int pos;
        while ((pos = remaining.IndexOf((byte)'\n')) >= 0)
        {
            count++;
            remaining = remaining.Slice(pos + 1);
        }

        return count;
    }

    [Benchmark]
    public int Count_Count() => s_utf8.AsSpan().Count((byte)'\n');
}
```

| Method | Mean | Ratio |
| --- | --- | --- |
| Count_ForeachLoop | 314.23 us | 1.00 |
| Count_IndexOf | 95.39 us | 0.30 |
| Count_Count | 13.68 us | 0.04 |

The core of the implementation here that enables MemoryExtensions.Count to be so fast, in particular when searching for a single value, is based on just two key primitives: PopCount and ExtractMostSignificantBits. Here’s the Vector128 loop that forms the bulk of the Count implementation (the implementation has similar loops for Vector256 and Vector512 as well):

```csharp
Vector128<T> targetVector = Vector128.Create(value);
ref T oneVectorAwayFromEnd = ref Unsafe.Subtract(ref end, Vector128<T>.Count);
do
{
    count += BitOperations.PopCount(Vector128.Equals(Vector128.LoadUnsafe(ref current), targetVector).ExtractMostSignificantBits());
    current = ref Unsafe.Add(ref current, Vector128<T>.Count);
}
while (!Unsafe.IsAddressGreaterThan(ref current, ref oneVectorAwayFromEnd));
```

This is creating a vector where every element of the vector is the target (in this case, '\n'). Then, as long as there’s at least one vector’s worth of data remaining, it loads the next vector (Vector128.LoadUnsafe) and compares that with the target vector (Vector128.Equals). That produces a new Vector128<T> where each T element is all ones when the values are equal and all zeros when they’re not. We then extract out the most significant bit of each element (ExtractMostSignificantBits), so getting a bit with the value 1 where the values were equal, otherwise 0. And then we use BitOperations.PopCount on the resulting uint to get the “population count,” i.e. the number of bits that are 1, and we add that to our running tally. In this way, the inner loop of the count operation remains branch-free, and the implementation can churn through the data very quickly. You can find several examples of using Count in dotnet/runtime#81325, which used it in several places in the core libraries.
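The same compare, extract, and pop-count sequence can be tried with only safe APIs. The following is a simplified sketch rather than the core library implementation: the helper name CountByte is hypothetical, it handles only bytes, and it uses span slicing plus a scalar tail loop instead of the ref arithmetic shown above.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

byte[] text = "one\ntwo\nthree\nfour\nfive\nsix\nseven\n"u8.ToArray();
int lineFeeds = CountByte(text, (byte)'\n');
Console.WriteLine(lineFeeds); // 7

// Hypothetical helper: counts occurrences of target, 16 bytes at a time where possible.
static int CountByte(ReadOnlySpan<byte> data, byte target)
{
    int count = 0, i = 0;

    if (Vector128.IsHardwareAccelerated)
    {
        Vector128<byte> targetVector = Vector128.Create(target);
        for (; i <= data.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
        {
            Vector128<byte> chunk = Vector128.Create(data.Slice(i, Vector128<byte>.Count));

            // Equal elements become all-ones; their most significant bits form a bitmask.
            uint matches = Vector128.Equals(chunk, targetVector).ExtractMostSignificantBits();

            // Each set bit is one match, so the population count is the number of matches.
            count += BitOperations.PopCount(matches);
        }
    }

    // Scalar loop for whatever is left over (or for everything, without acceleration).
    for (; i < data.Length; i++)
    {
        if (data[i] == target) count++;
    }

    return count;
}
```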

A similar new MemoryExtensions method is Replace, which comes in .NET 8 in two shapes. dotnet/runtime#76337 from @gfoidl added an in-place variant:

```csharp
public static unsafe void Replace<T>(this Span<T> span, T oldValue, T newValue) where T : IEquatable<T>?;
```

and dotnet/runtime#83120 added a copying variant:

```csharp
public static unsafe void Replace<T>(this ReadOnlySpan<T> source, Span<T> destination, T oldValue, T newValue) where T : IEquatable<T>?;
```

As an example of where this comes in handy, Uri has some code paths that need to normalize directory separators to be '/', such that any '\\' characters need to be replaced. This previously used an IndexOf loop as was shown in the previous Count benchmark, and now it can just use Replace. Here’s a comparison (which, purely for benchmarking purposes, is normalizing back and forth so that each time the benchmark runs it finds things in the original state):

```csharp
// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly char[] _uri = "server/somekindofpathneeding/normalizationofitsslashes".ToCharArray();

    [Benchmark(Baseline = true)]
    public void Replace_ForLoop()
    {
        Replace(_uri, '/', '\\');
        Replace(_uri, '\\', '/');

        static void Replace(char[] chars, char from, char to)
        {
            for (int i = 0; i < chars.Length; i++)
            {
                if (chars[i] == from)
                {
                    chars[i] = to;
                }
            }
        }
    }

    [Benchmark]
    public void Replace_IndexOf()
    {
        Replace(_uri, '/', '\\');
        Replace(_uri, '\\', '/');

        static void Replace(char[] chars, char from, char to)
        {
            Span<char> remaining = chars;
            int pos;
            while ((pos = remaining.IndexOf(from)) >= 0)
            {
                remaining[pos] = to;
                remaining = remaining.Slice(pos + 1);
            }
        }
    }

    [Benchmark]
    public void Replace_Replace()
    {
        _uri.AsSpan().Replace('/', '\\');
        _uri.AsSpan().Replace('\\', '/');
    }
}
```

| Method | Mean | Ratio |
| --- | --- | --- |
| Replace_ForLoop | 40.28 ns | 1.00 |
| Replace_IndexOf | 29.26 ns | 0.73 |
| Replace_Replace | 18.88 ns | 0.47 |

The new Replace does better than both the manual loop and the IndexOf loop. As with Count, Replace has a fairly simple and tight inner loop; again, here’s the Vector128 variant of that loop:

```csharp
do
{
    original = Vector128.LoadUnsafe(ref src, idx);
    mask = Vector128.Equals(oldValues, original);
    result = Vector128.ConditionalSelect(mask, newValues, original);
    result.StoreUnsafe(ref dst, idx);

    idx += (uint)Vector128<T>.Count;
}
while (idx < lastVectorIndex);
```

This is loading the next vector’s worth of data (Vector128.LoadUnsafe) and comparing that with a vector filled with theoldValue, which produces a newmask vector with1s for equality and0 for inequality. It then calls the super handyVector128.ConditionalSelect. This is a branchless SIMD condition operation: it produces a new vector that has an element from one vector if mask’s bits were1s and from another vector if the mask’s bits were0s (think a ternary operator). That resulting vector is then saved out as the result. In this manner, it’s overwriting the whole span, in some cases just writing back the value that was previously there, and in cases where the original value was the targetoldValue, writing out thenewValue instead. This loop body is branch-free and doesn’t change in cost based on how many elements need to be replaced. In an extreme case where there’s nothing to be replaced, anIndexOf-based loop could end up being a tad bit faster, since the body ofIndexOf‘s inner loop has even fewer instructions, but such anIndexOf loop pays a relatively high cost for every replacement that needs to be done.
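To experiment with the ConditionalSelect pattern using only safe APIs, here is a simplified byte-only sketch. It is not the core library code: the helper name ReplaceByte is hypothetical, and it uses span slicing with a scalar tail loop in place of LoadUnsafe and StoreUnsafe.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Text;

byte[] path = "a\\b\\c\\dddddddddddddddd\\e"u8.ToArray();
ReplaceByte(path, (byte)'\\', (byte)'/');
Console.WriteLine(Encoding.ASCII.GetString(path)); // a/b/c/dddddddddddddddd/e

// Hypothetical helper: replaces oldValue with newValue in place, 16 bytes at a time where possible.
static void ReplaceByte(Span<byte> data, byte oldValue, byte newValue)
{
    int i = 0;

    if (Vector128.IsHardwareAccelerated)
    {
        Vector128<byte> oldValues = Vector128.Create(oldValue);
        Vector128<byte> newValues = Vector128.Create(newValue);

        for (; i <= data.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
        {
            Span<byte> window = data.Slice(i, Vector128<byte>.Count);
            Vector128<byte> original = Vector128.Create((ReadOnlySpan<byte>)window);

            // All-ones where the element equals oldValue, all-zeros elsewhere.
            Vector128<byte> mask = Vector128.Equals(oldValues, original);

            // Branch-free select: newValues where the mask is set, original otherwise.
            Vector128.ConditionalSelect(mask, newValues, original).CopyTo(window);
        }
    }

    // Scalar loop for the remaining elements (or for everything, without acceleration).
    for (; i < data.Length; i++)
    {
        if (data[i] == oldValue) data[i] = newValue;
    }
}
```

Note that every window is written back regardless of whether anything matched, which is what keeps the cost independent of the number of replacements.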

StringBuilder also had such an IndexOf-based implementation for its Replace(char oldChar, char newChar) and Replace(char oldChar, char newChar, int startIndex, int count) methods, and they’re now based on MemoryExtensions.Replace, so the improvements accrue there as well.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly StringBuilder _sb = new StringBuilder("http://server\\this\\is\\a\\test\\of\\needing\\to\\normalize\\directory\\separators\\");

    [Benchmark]
    public void Replace()
    {
        _sb.Replace('\\', '/');
        _sb.Replace('/', '\\');
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| Replace | .NET 7.0 | 150.47 ns | 1.00 |
| Replace | .NET 8.0 | 24.79 ns | 0.16 |

Interestingly, whereas StringBuilder.Replace(char, char) was using IndexOf and switched to use Replace, StringBuilder.Replace(string, string) wasn’t using IndexOf at all, a gap that’s been fixed in dotnet/runtime#81098. IndexOf when dealing with strings is more complicated in StringBuilder because of its segmented nature. StringBuilder isn’t just backed by an array: it’s actually a linked list of segments, each of which stores an array. With the char-based Replace, it can simply operate on each segment individually, but for the string-based Replace, it needs to deal with the possibility that the value being searched for crosses a segment boundary. StringBuilder.Replace(string, string) was thus walking each segment character-by-character, doing an equality check at each position. Now with this PR, it’s using IndexOf and only falling back to a character-by-character check when close enough to a segment boundary that it might be crossed.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly StringBuilder _sb = new StringBuilder()
        .Append("Shall I compare thee to a summer's day? ")
        .Append("Thou art more lovely and more temperate: ")
        .Append("Rough winds do shake the darling buds of May, ")
        .Append("And summer's lease hath all too short a date; ")
        .Append("Sometime too hot the eye of heaven shines, ")
        .Append("And often is his gold complexion dimm'd; ")
        .Append("And every fair from fair sometime declines, ")
        .Append("By chance or nature's changing course untrimm'd; ")
        .Append("But thy eternal summer shall not fade, ")
        .Append("Nor lose possession of that fair thou ow'st; ")
        .Append("Nor shall death brag thou wander'st in his shade, ")
        .Append("When in eternal lines to time thou grow'st: ")
        .Append("So long as men can breathe or eyes can see, ")
        .Append("So long lives this, and this gives life to thee.");

    [Benchmark]
    public void Replace()
    {
        _sb.Replace("summer", "winter");
        _sb.Replace("winter", "summer");
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| Replace | .NET 7.0 | 5,158.0 ns | 1.00 |
| Replace | .NET 8.0 | 476.4 ns | 0.09 |

As long as we’re on the subject of StringBuilder, it saw some other nice improvements in .NET 8. dotnet/runtime#85894 from @yesmey tweaked both StringBuilder.Append(string value) and the JIT to enable the JIT to unroll the memory copies that occur as part of appending a constant string.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly StringBuilder _sb = new();

    [Benchmark]
    public void Append()
    {
        _sb.Clear();
        _sb.Append("This is a test of appending a string to StringBuilder");
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| Append | .NET 7.0 | 7.597 ns | 1.00 |
| Append | .NET 8.0 | 3.756 ns | 0.49 |

And dotnet/runtime#86287 from @yesmey changed StringBuilder.Append(char value, int repeatCount) to use Span<T>.Fill instead of manually looping, taking advantage of the optimized Fill implementation, even for reasonably small counts.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly StringBuilder _sb = new();

    [Benchmark]
    public void Append()
    {
        _sb.Clear();
        _sb.Append('x', 8);
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| Append | .NET 7.0 | 11.520 ns | 1.00 |
| Append | .NET 8.0 | 5.292 ns | 0.46 |

Back to MemoryExtensions, another new helpful method is MemoryExtensions.Split (and MemoryExtensions.SplitAny). This is a span-based counterpart to string.Split for some uses of string.Split. I say “some” because there are effectively two main patterns for using string.Split: when you expect a certain number of parts, and when there are an unknown number of parts. For example, if you want to parse a version string as would be used by System.Version, there are at most four parts (“major.minor.build.revision”). But if you want to split, say, the contents of a file into all of the lines in the file (delimited by a \n), that’s an unknown (and potentially quite large) number of parts. The new MemoryExtensions.Split method is focused on the situations where there’s a known (and reasonably small) maximum number of parts expected. In such a case, it can be significantly more efficient than string.Split, especially from an allocation perspective.

string.Split has overloads that accept an int count, and MemoryExtensions.Split behaves identically to these overloads; however, rather than giving it an int count, you give it a Span<Range> destination whose length is the same value you would have used for count. For example, let’s say you want to split a key/value pair separated by an '='. If this were string.Split, you could write that as:

string[] parts = keyValuePair.Split('=');

Of course, if the input was actually erroneous for what you were expecting and there were 100 equal signs, you’d end up creating an array of 101 strings. So instead, you might write that as:

string[] parts = keyValuePair.Split('=', 3);

Wait, “3”? Aren’t there only two parts, and if so, why not pass “2”? Because of the behavior of what happens with the last part. The last part contains the remainder of the string after the separator before it, so for example the call:

"shall=i=compare=thee".Split(new[] { '=' }, 2)

produces the array:

string[2] { "shall", "i=compare=thee" }

If you want to know whether there were more than two parts, you need to request at least one more, and then if that last one was produced, you know the input was erroneous. For example, this:

"shall=i=compare=thee".Split(new[] { '=' }, 3)

produces this:

string[3] { "shall", "i", "compare=thee" }

and this:

"shall=i".Split(new[] { '=' }, 3)

produces this:

string[2] { "shall", "i" }

We can do the same thing with the new overload, except a) the caller provides the destination span to write the results into, and b) the results are stored as a System.Range rather than as a string. That means that the whole operation is allocation-free. And thanks to the indexer on Span<T> that lets you pass in a Range and slice the span, you can easily use the written ranges to access the relevant portions of the input.

Span<Range> parts = stackalloc Range[3];
int count = keyValuePairSpan.Split(parts, '=');
if (count == 2)
{
    Console.WriteLine($"Key={keyValuePairSpan[parts[0]]}, Value={keyValuePairSpan[parts[1]]}");
}
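As a slightly fuller sketch of that pattern, the following wraps the split in a helper that rejects input with too few or too many separators by asking for one more range than it expects. TrySplitKeyValue is a hypothetical name of mine for illustration, and the code assumes .NET 8, where the Span<Range>-based Split overload is available:

```csharp
using System;

// Hypothetical helper: succeeds only when the input has exactly one '='.
static bool TrySplitKeyValue(ReadOnlySpan<char> pair, out Range key, out Range value)
{
    // Ask for up to 3 parts so that a third part reveals erroneous input.
    Span<Range> parts = stackalloc Range[3];
    int count = pair.Split(parts, '=');
    if (count == 2)
    {
        key = parts[0];
        value = parts[1];
        return true;
    }

    key = default;
    value = default;
    return false;
}

ReadOnlySpan<char> input = "shall=i";
if (TrySplitKeyValue(input, out Range k, out Range v))
{
    Console.WriteLine($"Key={input[k]}, Value={input[v]}");
}
```

No strings or arrays are allocated along the way; the ranges are only turned into text when they’re used to slice the original input.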

Here’s an example from dotnet/runtime#80211, which used SplitAny to reduce the cost of MimeBasePart.DecodeEncoding:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly string _input = "=?utf-8?B?RmlsZU5hbWVf55CG0Y3Qq9C60I5jw4TRicKq0YIM0Y1hSsSeTNCy0Klh?=";
    private static readonly char[] s_decodeEncodingSplitChars = new char[] { '?', '\r', '\n' };

    [Benchmark(Baseline = true)]
    public Encoding Old()
    {
        if (string.IsNullOrEmpty(_input))
        {
            return null;
        }

        string[] subStrings = _input.Split(s_decodeEncodingSplitChars);
        if (subStrings.Length < 5 ||
            subStrings[0] != "=" ||
            subStrings[4] != "=")
        {
            return null;
        }

        string charSet = subStrings[1];
        return Encoding.GetEncoding(charSet);
    }

    [Benchmark]
    public Encoding New()
    {
        if (string.IsNullOrEmpty(_input))
        {
            return null;
        }

        ReadOnlySpan<char> valueSpan = _input;
        Span<Range> subStrings = stackalloc Range[6];
        if (valueSpan.SplitAny(subStrings, "?\r\n") < 5 ||
            valueSpan[subStrings[0]] is not "=" ||
            valueSpan[subStrings[4]] is not "=")
        {
            return null;
        }

        return Encoding.GetEncoding(_input[subStrings[1]]);
    }
}

| Method | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- |
| Old | 143.80 ns | 1.00 | 304 B | 1.00 |
| New | 94.52 ns | 0.66 | 32 B | 0.11 |

More examples of MemoryExtensions.Split and MemoryExtensions.SplitAny being used are in dotnet/runtime#80471 and dotnet/runtime#82007. Both of those remove allocations from various System.Net types that were previously using string.Split.

MemoryExtensions also includes a new set of IndexOf methods for ranges, thanks to dotnet/runtime#76803:

public static int IndexOfAnyInRange<T>(this ReadOnlySpan<T> span, T lowInclusive, T highInclusive) where T : IComparable<T>;
public static int IndexOfAnyExceptInRange<T>(this ReadOnlySpan<T> span, T lowInclusive, T highInclusive) where T : IComparable<T>;
public static int LastIndexOfAnyInRange<T>(this ReadOnlySpan<T> span, T lowInclusive, T highInclusive) where T : IComparable<T>;
public static int LastIndexOfAnyExceptInRange<T>(this ReadOnlySpan<T> span, T lowInclusive, T highInclusive) where T : IComparable<T>;

Want to find the index of the next ASCII digit? No problem:

int pos = text.IndexOfAnyInRange('0', '9');

Want to determine whether some input contains any non-ASCII or control characters? You got it:

bool nonAsciiOrControlCharacters = text.IndexOfAnyExceptInRange((char)0x20, (char)0x7e) >= 0;

For example, dotnet/runtime#78658 uses IndexOfAnyInRange to quickly determine whether portions of a Uri might contain a bidirectional control character, searching for anything in the range [\u200E, \u202E], and then only examining further if anything in that range is found. And dotnet/runtime#79357 uses IndexOfAnyExceptInRange to determine whether to use Encoding.UTF8 or Encoding.ASCII. It was previously implemented with a simple foreach loop, and it’s now implemented with an even simpler call to IndexOfAnyExceptInRange:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly string _text =
        "Shall I compare thee to a summer's day? " +
        "Thou art more lovely and more temperate: " +
        "Rough winds do shake the darling buds of May, " +
        "And summer's lease hath all too short a date; " +
        "Sometime too hot the eye of heaven shines, " +
        "And often is his gold complexion dimm'd; " +
        "And every fair from fair sometime declines, " +
        "By chance or nature's changing course untrimm'd; " +
        "But thy eternal summer shall not fade, " +
        "Nor lose possession of that fair thou ow'st; " +
        "Nor shall death brag thou wander'st in his shade, " +
        "When in eternal lines to time thou grow'st: " +
        "So long as men can breathe or eyes can see, " +
        "So long lives this, and this gives life to thee.";

    [Benchmark(Baseline = true)]
    public Encoding Old()
    {
        foreach (char c in _text)
            if (c > 126 || c < 32)
                return Encoding.UTF8;

        return Encoding.ASCII;
    }

    [Benchmark]
    public Encoding New() =>
        _text.AsSpan().IndexOfAnyExceptInRange((char)32, (char)126) >= 0 ?
            Encoding.UTF8 :
            Encoding.ASCII;
}

| Method | Mean | Ratio |
| --- | --- | --- |
| Old | 297.56 ns | 1.00 |
| New | 20.69 ns | 0.07 |

More of a productivity thing than performance (at least today), but .NET 8 also includes new ContainsAny methods (dotnet/runtime#87621) that allow writing these kinds of IndexOf calls that are then compared against 0 in a slightly cleaner fashion, e.g. the previous example could have been simplified slightly to:

public Encoding New() =>
    _text.AsSpan().ContainsAnyExceptInRange((char)32, (char)126) ?
        Encoding.UTF8 :
        Encoding.ASCII;

One of the things I love about these kinds of helpers is that code can simplify down to use them, and then as the helpers improve, so too does the code that relies on them. And in .NET 8, there’s a lot of “the helpers improve.”

dotnet/runtime#86655 from @DeepakRajendrakumaran added support for Vector512 to most of these span-based helpers in MemoryExtensions. That means that when running on hardware which supports AVX512, many of these operations simply get faster. This benchmark uses environment variables to explicitly disable support for the various instruction sets, such that we can compare performance of a given operation when nothing is vectorized, when Vector128 is used and hardware accelerated, when Vector256 is used and hardware accelerated, and when Vector512 is used and hardware accelerated. I’ve run this on my Dev Box that does support AVX512:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Toolchains.CoreRun;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId("Scalar").WithEnvironmentVariable("DOTNET_EnableHWIntrinsic", "0").AsBaseline())
    .AddJob(Job.Default.WithId("Vector128").WithEnvironmentVariable("DOTNET_EnableAVX512F", "0").WithEnvironmentVariable("DOTNET_EnableAVX2", "0"))
    .AddJob(Job.Default.WithId("Vector256").WithEnvironmentVariable("DOTNET_EnableAVX512F", "0"))
    .AddJob(Job.Default.WithId("Vector512"));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables")]
public class Tests
{
    private readonly char[] _sourceChars = Enumerable.Repeat('a', 1024).ToArray();

    [Benchmark]
    public bool Contains() => _sourceChars.AsSpan().IndexOfAny('b', 'c') >= 0;
}

| Method | Job | Mean | Ratio |
| --- | --- | --- | --- |
| Contains | Scalar | 491.50 ns | 1.00 |
| Contains | Vector128 | 53.77 ns | 0.11 |
| Contains | Vector256 | 34.75 ns | 0.07 |
| Contains | Vector512 | 21.12 ns | 0.04 |

So, not quite a halving going from 128-bit to 256-bit or another halving going from 256-bit to 512-bit, but pretty close.

dotnet/runtime#77947 vectorized Equals(..., StringComparison.OrdinalIgnoreCase) for large enough inputs (the same underlying implementation is used for both string and ReadOnlySpan<char>). In a loop, it loads the next two vectors. It then checks to see whether anything in those vectors is non-ASCII; it can do so efficiently by OR’ing them together (vec1 | vec2) and then seeing whether the high bit of any of the elements is set… if none are, then all the elements in both of the input vectors are ASCII (((vec1 | vec2) & Vector128.Create(unchecked((ushort)~0x007F))) == Vector128<ushort>.Zero). If it finds anything non-ASCII, it just continues on with the old mode of comparison. But as long as everything is ASCII, then it can proceed to do the comparison in a vectorized manner. For each vector, it uses some bit hackery to create a lowercased version of the vector, and then compares the lowercased versions for equality.
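Here’s a rough sketch of that shape of algorithm. The names EqualsIgnoreCaseSketch and ToLowerAscii are hypothetical, and the lowercasing shown (a range check followed by OR’ing in 0x20) is just one straightforward way to do it, not necessarily the bit hackery the runtime uses; only the ASCII check is taken directly from the description above:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration; not the runtime's implementation.
static bool EqualsIgnoreCaseSketch(ReadOnlySpan<char> left, ReadOnlySpan<char> right)
{
    if (left.Length != right.Length) return false;

    ReadOnlySpan<ushort> a = MemoryMarshal.Cast<char, ushort>(left);
    ReadOnlySpan<ushort> b = MemoryMarshal.Cast<char, ushort>(right);
    Vector128<ushort> nonAsciiBits = Vector128.Create(unchecked((ushort)~0x007F));

    int i = 0;
    for (; i + Vector128<ushort>.Count <= a.Length; i += Vector128<ushort>.Count)
    {
        Vector128<ushort> va = Vector128.Create<ushort>(a.Slice(i, Vector128<ushort>.Count));
        Vector128<ushort> vb = Vector128.Create<ushort>(b.Slice(i, Vector128<ushort>.Count));

        // Any bit above the low 7 set in either vector means non-ASCII: stop vectorizing.
        if (((va | vb) & nonAsciiBits) != Vector128<ushort>.Zero) break;

        // All ASCII: compare lowercased copies of both vectors.
        if (ToLowerAscii(va) != ToLowerAscii(vb)) return false;
    }

    // Remaining elements (or anything non-ASCII) go through the regular comparison.
    return left.Slice(i).Equals(right.Slice(i), StringComparison.OrdinalIgnoreCase);

    static Vector128<ushort> ToLowerAscii(Vector128<ushort> v)
    {
        // Lanes holding 'A'..'Z' get 0x20 OR'd in; all other lanes are unchanged.
        Vector128<ushort> isUpper =
            Vector128.GreaterThanOrEqual(v, Vector128.Create((ushort)'A')) &
            Vector128.LessThanOrEqual(v, Vector128.Create((ushort)'Z'));
        return v | (isUpper & Vector128.Create((ushort)0x20));
    }
}

Console.WriteLine(EqualsIgnoreCaseSketch("shall i compare thee", "SHALL I COMPARE THEE"));
```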

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly string _a = "shall i compare thee to a summer's day? thou art more lovely and more temperate";
    private readonly string _b = "SHALL I COMPARE THEE TO A SUMMER'S DAY? THOU ART MORE LOVELY AND MORE TEMPERATE";

    [Benchmark]
    public bool Equals() => _a.AsSpan().Equals(_b, StringComparison.OrdinalIgnoreCase);
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| Equals | .NET 7.0 | 47.97 ns | 1.00 |
| Equals | .NET 8.0 | 18.93 ns | 0.39 |

dotnet/runtime#78262 uses the same tricks to vectorize ToLowerInvariant and ToUpperInvariant:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly string _a = "shall i compare thee to a summer's day? thou art more lovely and more temperate";
    private readonly char[] _b = new char[100];

    [Benchmark]
    public int ToUpperInvariant() => _a.AsSpan().ToUpperInvariant(_b);
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| ToUpperInvariant | .NET 7.0 | 33.22 ns | 1.00 |
| ToUpperInvariant | .NET 8.0 | 16.16 ns | 0.49 |

dotnet/runtime#78650 from @yesmey also streamlined MemoryExtensions.Reverse:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly byte[] _bytes = Enumerable.Range(0, 32).Select(i => (byte)i).ToArray();

    [Benchmark]
    public void Reverse() => _bytes.AsSpan().Reverse();
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| Reverse | .NET 7.0 | 3.801 ns | 1.00 |
| Reverse | .NET 8.0 | 2.052 ns | 0.54 |

dotnet/runtime#75640 improves the internal RuntimeHelpers.IsBitwiseEquatable method that’s used by the vast majority of MemoryExtensions. If you look in the source for MemoryExtensions, you’ll find a fairly common pattern: special-case byte, ushort, uint, and ulong with a vectorized implementation, and then fall back to a general non-vectorized implementation for everything else. Except it’s not exactly “special-case byte, ushort, uint, and ulong”, but rather “special-case bitwise-equatable types that are the same size as byte, ushort, uint, or ulong.” If something is “bitwise equatable,” that means we don’t need to worry about any IEquatable<T> implementation it might provide or any Equals override it might have, and we can instead simply rely on the value’s bits being the same or different from another value to identify whether the values are the same or different. And if such bitwise equality semantics apply for a type, then the intrinsics that determine equality for byte, ushort, uint, and ulong can be used for any type that’s 1, 2, 4, or 8 bytes, respectively. In .NET 7, RuntimeHelpers.IsBitwiseEquatable would be true only for a finite and hardcoded list in the runtime: bool, byte, sbyte, char, short, ushort, int, uint, long, ulong, nint, nuint, Rune, and enums. Now in .NET 8, that list is extended to a dynamically discoverable set where the runtime can easily see that the type itself doesn’t provide any equality implementation.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private MyColor[] _values1, _values2;

    [GlobalSetup]
    public void Setup()
    {
        _values1 = Enumerable.Range(0, 1_000).Select(i => new MyColor { R = (byte)i, G = (byte)i, B = (byte)i, A = (byte)i }).ToArray();
        _values2 = (MyColor[])_values1.Clone();
    }

    [Benchmark] public int IndexOf() => Array.IndexOf(_values1, new MyColor { R = 1, G = 2, B = 3, A = 4 });

    [Benchmark] public bool SequenceEquals() => _values1.AsSpan().SequenceEqual(_values2);

    struct MyColor { public byte R, G, B, A; }
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- |
| IndexOf | .NET 7.0 | 24,912.42 ns | 1.000 | 48000 B | 1.00 |
| IndexOf | .NET 8.0 | 70.44 ns | 0.003 | - | 0.00 |
| SequenceEquals | .NET 7.0 | 25,041.00 ns | 1.000 | 48000 B | 1.00 |
| SequenceEquals | .NET 8.0 | 68.40 ns | 0.003 | - | 0.00 |

Note this not only means the result gets vectorized, it also ends up avoiding excessive boxing (hence all that allocation), as it’s no longer calling Equals(object) on each value type instance.

dotnet/runtime#85437 improved the vectorization of IndexOf(string/span, StringComparison.OrdinalIgnoreCase). Imagine we’re searching some text for the word “elementary.” In .NET 7, it would end up doing an IndexOfAny('E', 'e') in order to find the first possible place “elementary” could match, and would then do the equivalent of an Equals("elementary", textAtFoundPosition, StringComparison.OrdinalIgnoreCase). If the Equals fails, then it loops around to search for the next possible starting location. This is ok if the characters being searched for are rare, but in this example, 'e' is the most common letter in the English alphabet, and so an IndexOfAny('E', 'e') is frequently stopping, breaking out of the vectorized inner loop, in order to do the full Equals comparison. In contrast to this, in .NET 7 IndexOf(string/span, StringComparison.Ordinal) was improved using the algorithm outlined by Mula; the idea there is that rather than just searching for one character (e.g. the first), you have a vector for another character as well (e.g. the last), you offset them appropriately, and you AND their comparison results together as part of the inner loop. Even if 'e' is very common, 'e' and then a 'y' nine characters later is much, much less common, and thus it can stay in its tight inner loop for longer. Now in .NET 8, we apply the same trick to OrdinalIgnoreCase when we can find two ASCII characters in the input, e.g. it’ll simultaneously search for 'E' or 'e' followed by a 'Y' or 'y' nine characters later.
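To illustrate the two-anchor idea, here’s a minimal sketch of the simpler ordinal (case-sensitive) form. IndexOfWithTwoAnchors is a hypothetical name, always choosing the first and last characters as the anchors is an assumption on my part, and the real implementation is considerably more sophisticated:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration; not the runtime's implementation.
static int IndexOfWithTwoAnchors(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
{
    if (needle.Length < 2) return haystack.IndexOf(needle);

    ReadOnlySpan<ushort> text = MemoryMarshal.Cast<char, ushort>(haystack);
    int lastOffset = needle.Length - 1;
    Vector128<ushort> first = Vector128.Create((ushort)needle[0]);
    Vector128<ushort> last = Vector128.Create((ushort)needle[lastOffset]);

    int i = 0;
    for (; i + lastOffset + Vector128<ushort>.Count <= text.Length; i += Vector128<ushort>.Count)
    {
        // Load the text at the candidate start positions and again lastOffset later.
        Vector128<ushort> atFirst = Vector128.Create<ushort>(text.Slice(i, Vector128<ushort>.Count));
        Vector128<ushort> atLast = Vector128.Create<ushort>(text.Slice(i + lastOffset, Vector128<ushort>.Count));

        // A lane survives the AND only if both the first and last characters match.
        uint mask = (Vector128.Equals(first, atFirst) & Vector128.Equals(last, atLast)).ExtractMostSignificantBits();
        while (mask != 0)
        {
            int candidate = i + BitOperations.TrailingZeroCount(mask);
            if (haystack.Slice(candidate, needle.Length).SequenceEqual(needle)) return candidate;
            mask &= mask - 1; // clear the lowest set bit and try the next candidate
        }
    }

    // Leave whatever doesn't fill a whole vector to the built-in search.
    int tail = haystack.Slice(i).IndexOf(needle);
    return tail >= 0 ? i + tail : -1;
}

Console.WriteLine(IndexOfWithTwoAnchors("it is elementary, my dear", "elementary"));
```

The full SequenceEqual comparison only runs for positions where both anchors line up, so a common first character on its own no longer knocks the search out of the vectorized loop.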

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;
    private readonly string _needle = "elementary";

    [Benchmark]
    public int Count()
    {
        ReadOnlySpan<char> haystack = s_haystack;
        ReadOnlySpan<char> needle = _needle;
        int count = 0;

        int pos;
        while ((pos = haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            haystack = haystack.Slice(pos + needle.Length);
        }

        return count;
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| Count | .NET 7.0 | 676.91 us | 1.00 |
| Count | .NET 8.0 | 62.04 us | 0.09 |

Even just a simple IndexOf(char) is also significantly improved in .NET 8. Here I’m searching “The Adventures of Sherlock Holmes” for an '@', which I happen to know doesn’t appear, such that the entire search will be spent in IndexOf(char)’s tight inner loop.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

    [Benchmark]
    public int IndexOfAt() => s_haystack.AsSpan().IndexOf('@');
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| IndexOfAt | .NET 7.0 | 32.17 us | 1.00 |
| IndexOfAt | .NET 8.0 | 20.84 us | 0.64 |

That improvement is thanks to dotnet/runtime#78861. The goal of SIMD and vectorization is to do more with the same; rather than processing one thing at a time, process 2 or 4 or 8 or 16 or 32 or 64 things at a time. For chars, which are 16 bits in size, in a 128-bit vector you can process 8 of them at a time; double that for 256-bit, and double it again for 512-bit. But it’s not just about the size of the vector; you can also find creative ways to use a vector to process more than you otherwise could. For example, in a 128-bit vector, you can process 8 chars at a time… but you can process 16 bytes at a time. What if you could process the chars instead as bytes? You could of course reinterpret the 8 chars as 16 bytes, but for most algorithms you’d end up with the wrong answer (since each byte of the char would be treated independently). What if instead you could condense two vectors’ worth of chars down to a single vector of byte, and then do the subsequent processing on that single vector of byte? Then as long as you were doing a few instructions-worth of processing on the byte vector and the cost of that condensing was cheap enough, you could approach doubling your algorithm’s performance. And that’s exactly what this PR does, at least for very common needles, and on hardware that supports SSE2. SSE2 has dedicated instructions for taking two vectors and narrowing them down to a single vector, e.g. take a Vector128<short> a and a Vector128<short> b, and combine them into a Vector128<byte> c by taking the low byte from each short in the input. However, these particular instructions don’t simply ignore the other byte in each short completely; instead, they “saturate.” That means if casting the short value to a byte would overflow, it produces 255, and if it would underflow, it produces 0. That means we can take two vectors of 16-bit values, pack them into a single vector of 8-bit values, and then as long as the thing we’re searching for is in the range [1, 254], we can be sure that equality checks against the vector will be accurate (comparisons against 0 or 255 might lead to false positives). Note that while Arm does have support for similar “narrowing with saturation,” the cost of those particular instructions was measured to be high enough that it wasn’t feasible to use them here (they are used elsewhere). This improvement applies to several other char-based methods as well, including IndexOfAny(char, char) and IndexOfAny(char, char, char).
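As a small sketch of that packing step, the following checks 16 chars with a single byte-vector comparison. ContainsInSixteen is a hypothetical name and the fixed 16-char input is purely for illustration; the code uses Sse2.PackUnsignedSaturate where SSE2 is available and otherwise falls back to a plain Contains:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper for illustration; not the runtime's implementation.
static bool ContainsInSixteen(ReadOnlySpan<char> sixteenChars, char value)
{
    // The packed comparison is only trustworthy for values in [1, 254].
    if (!Sse2.IsSupported || sixteenChars.Length != 16 || value < 1 || value > 254)
    {
        return sixteenChars.Contains(value);
    }

    ReadOnlySpan<short> source = MemoryMarshal.Cast<char, short>(sixteenChars);
    Vector128<short> lower = Vector128.Create<short>(source.Slice(0, 8));
    Vector128<short> upper = Vector128.Create<short>(source.Slice(8, 8));

    // Narrow 2 x 8 shorts into 16 bytes; out-of-range values saturate to 0 or 255.
    Vector128<byte> packed = Sse2.PackUnsignedSaturate(lower, upper);

    // One comparison now covers all 16 original chars.
    return Vector128.EqualsAny(packed, Vector128.Create((byte)value));
}

Console.WriteLine(ContainsInSixteen("0123456789abcdef", 'c'));
```

Saturation is what keeps this correct: a char like '\u1063' shares its low byte with 'c', but it packs to 255 rather than 0x63, so it can’t be mistaken for a match.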

One last Span-centric improvement to highlight. The Memory<T> and ReadOnlyMemory<T> types don’t implement IEnumerable<T>, but the MemoryMarshal.ToEnumerable method does exist to enable getting an enumerable from them. It’s buried away in MemoryMarshal primarily so as to guide developers not to iterate through the Memory<T> directly, but to instead iterate through its Span, e.g.

foreach (T value in memory.Span) { ... }

The driving force behind this is that the Memory<T>.Span property has some overhead, as a Memory<T> can be backed by multiple different object types (namely a T[], a string if it’s a ReadOnlyMemory<char>, or a MemoryManager<T>), and Span needs to fetch a Span<T> for the right one. Even so, from time to time you do actually need an IEnumerable<T> from a {ReadOnly}Memory<T>, and ToEnumerable provides that. In such situations, it’s actually beneficial from a performance perspective that one doesn’t just pass the {ReadOnly}Memory<T> as an IEnumerable<T>, since doing so would box the value, and then enumerating that enumerable would require a second allocation for the IEnumerator<T>. In contrast, MemoryMarshal.ToEnumerable can return an IEnumerable<T> instance that is both the IEnumerable<T> and the IEnumerator<T>. In fact, that’s what it’s done since it was added, with the entirety of the implementation being:

public static IEnumerable<T> ToEnumerable<T>(ReadOnlyMemory<T> memory)
{
    for (int i = 0; i < memory.Length; i++)
        yield return memory.Span[i];
}

The C# compiler generates an IEnumerable<T> for such an iterator that does in fact also implement IEnumerator<T> and return itself from GetEnumerator to avoid an extra allocation, so that’s good. As noted, though, Memory<T>.Span has some overhead, and this is accessing .Span once per element… not ideal. dotnet/runtime#89274 addresses this in multiple ways. First, ToEnumerable itself can check the type of the underlying object behind the Memory<T>, and for a T[] or a string can return a different iterator that just directly indexes into the array or string rather than going through .Span on every access. Moreover, ToEnumerable can check to see whether the bounds represented by the Memory<T> are for the full length of the array or string… if they are, then ToEnumerable can just return the original object, without any additional allocation. The net result is a much more efficient enumeration scheme for anything other than a MemoryManager<T>, which is much more rare (but also not negatively impacted by the improvements for the other types).
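The array half of that idea can be approximated with public APIs, as in the sketch below. ToEnumerableSketch and FromArray are hypothetical names, and this version only special-cases arrays via MemoryMarshal.TryGetArray, deferring to the real ToEnumerable for strings and MemoryManager<T>; the actual implementation inspects the underlying object directly:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical helper for illustration; not the runtime's implementation.
static IEnumerable<T> ToEnumerableSketch<T>(ReadOnlyMemory<T> memory)
{
    if (MemoryMarshal.TryGetArray(memory, out ArraySegment<T> segment) && segment.Array is T[] array)
    {
        // The memory covers the whole array: hand back the array itself, no allocation.
        if (segment.Offset == 0 && segment.Count == array.Length) return array;

        // Otherwise index the array directly instead of calling .Span per element.
        return FromArray(array, segment.Offset, segment.Count);
    }

    // Not array-backed: defer to the built-in method.
    return MemoryMarshal.ToEnumerable(memory);

    static IEnumerable<T> FromArray(T[] array, int offset, int count)
    {
        for (int i = 0; i < count; i++)
            yield return array[offset + i];
    }
}

char[] backing = "hello".ToCharArray();
Console.WriteLine(ReferenceEquals(ToEnumerableSketch<char>(backing), backing));
```

Returning the array itself also means consumers like LINQ can recognize it as a collection and take their own fast paths, which would be consistent with the large CountLINQ improvement in the benchmark that follows.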

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
using System.Runtime.InteropServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly Memory<char> _array = Enumerable.Repeat('a', 1000).ToArray();

    [Benchmark]
    public int Count() => Count(MemoryMarshal.ToEnumerable<char>(_array));

    [Benchmark]
    public int CountLINQ() => Enumerable.Count(MemoryMarshal.ToEnumerable<char>(_array));

    private static int Count<T>(IEnumerable<T> source)
    {
        int count = 0;
        foreach (T item in source) count++;
        return count;
    }

    private sealed class WrapperMemoryManager<T>(Memory<T> memory) : MemoryManager<T>
    {
        public override Span<T> GetSpan() => memory.Span;
        public override MemoryHandle Pin(int elementIndex = 0) => throw new NotSupportedException();
        public override void Unpin() => throw new NotSupportedException();
        protected override void Dispose(bool disposing) { }
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| Count | .NET 7.0 | 6,336.147 ns | 1.00 |
| Count | .NET 8.0 | 1,323.376 ns | 0.21 |
| CountLINQ | .NET 7.0 | 4,972.580 ns | 1.000 |
| CountLINQ | .NET 8.0 | 9.200 ns | 0.002 |

SearchValues

As should be obvious from the length of this document, there are a sheer ton of performance-focused improvements in .NET 8. As I previously noted, I think the most valuable addition in .NET 8 is enabling dynamic PGO by default. After that, I think the next most exciting addition is the new System.Buffers.SearchValues type. It is simply awesome, in my humble opinion.

Functionally, SearchValues doesn’t do anything you couldn’t already do. For example, let’s say you wanted to search for the next ASCII letter or digit in text. You can already do that via IndexOfAny:

ReadOnlySpan<char> text = ...;
int pos = text.IndexOfAny("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");

And that works, but it hasn’t been particularly fast. In .NET 7, IndexOfAny(ReadOnlySpan<char>) is optimized for searching for up to 5 target characters, e.g. it could efficiently vectorize a search for English vowels (IndexOfAny("aeiou")). But with a target set of 62 characters like in the previous example, it would no longer vectorize, and instead of trying to see how many characters it could process per instruction, switches to trying to see how few instructions it can employ per character (meaning we’re no longer talking about fractions of an instruction per character in the haystack and now talking about multiple instructions per character in the haystack). It does this via a Bloom filter, referred to in the implementation as a “probabilistic map.” The idea is to maintain a bitmap of 256 bits. For every needle character, it sets 2 bits in that bitmap. Then when searching the haystack, for each character it looks to see whether both bits are set in the bitmap; if at least one isn’t set, then this character can’t be in the needle and the search can continue, but if both bits are in the bitmap, then it’s likely but not confirmed that the haystack character is in the needle, and the needle is then searched for the character to see whether we’ve found a match.
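A scalar sketch of that probabilistic map might look like the following. IndexOfAnyProbabilistic is a hypothetical name, and deriving the two bits from the low and high bytes of each char is an assumption made for illustration rather than a statement of how the runtime picks them:

```csharp
using System;

// Hypothetical helper for illustration; not the runtime's implementation.
static int IndexOfAnyProbabilistic(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
{
    // 256-bit bitmap stored as 8 x 32 bits.
    Span<uint> bitmap = stackalloc uint[8];
    bitmap.Clear();

    // Each needle character sets two bits (here: one per byte of the char).
    foreach (char c in needle)
    {
        Set(bitmap, (byte)(c & 0xFF));
        Set(bitmap, (byte)(c >> 8));
    }

    for (int i = 0; i < haystack.Length; i++)
    {
        char c = haystack[i];

        // If either bit is missing, c definitely isn't in the needle.
        if (IsSet(bitmap, (byte)(c & 0xFF)) && IsSet(bitmap, (byte)(c >> 8)))
        {
            // Both bits set means "probably": confirm by searching the needle.
            if (needle.Contains(c)) return i;
        }
    }

    return -1;

    static void Set(Span<uint> bits, byte b) => bits[b >> 5] |= 1u << (b & 31);
    static bool IsSet(ReadOnlySpan<uint> bits, byte b) => (bits[b >> 5] & (1u << (b & 31))) != 0;
}

Console.WriteLine(IndexOfAnyProbabilistic("hello, world", "ow"));
```

The bitmap check costs a handful of instructions per haystack character, and the confirming Contains only runs on probable hits.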

There are actually known algorithms for doing these searches more efficiently. For example, the “Universal” algorithm described by Mula is a great choice when searching for an arbitrary set of ASCII characters, enabling us to efficiently vectorize a search for a needle composed of any subset of ASCII. Doing so requires some amount of computation to analyze the needle and build up the relevant bitmaps and vectors that are required for performing the search, just as we have to do for the Bloom filter (albeit generating different artifacts). dotnet/runtime#76740 implements these techniques in {Last}IndexOfAny{Except}. Rather than always building up a probabilistic map, it first examines the needle to see if all of the values are ASCII, and if they are, then it switches over to this optimized ASCII-based search; if they’re not, it falls back to the same probabilistic map approach used previously. The PR also recognizes that it’s only worth attempting either optimization under the right conditions; if the haystack is really short, for example, we’re better off just doing the naive O(M*N) search, where for every character in the haystack we search through the needle to see if the char is a target.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

    [Benchmark]
    public int CountEnglishVowelsAndSometimeVowels()
    {
        ReadOnlySpan<char> remaining = s_haystack;
        int count = 0, pos;
        while ((pos = remaining.IndexOfAny("aeiouyAEIOUY")) >= 0)
        {
            count++;
            remaining = remaining.Slice(pos + 1);
        }

        return count;
    }
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| CountEnglishVowelsAndSometimeVowels | .NET 7.0 | 6.823 ms | 1.00 |
| CountEnglishVowelsAndSometimeVowels | .NET 8.0 | 3.735 ms | 0.55 |
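The decision described above (naive search for short haystacks, ASCII-based search for all-ASCII needles, probabilistic map otherwise) can be sketched roughly as follows. The enum, the helper, and especially the threshold value are hypothetical and only illustrate the shape of the choice; the sketch also ignores the separate path for needles of up to 5 characters.

```csharp
// dotnet run -f net8.0
using System;
using System.Text;

Console.WriteLine(StrategySketch.Choose(haystackLength: 4, "aeiouyAEIOUY"));
Console.WriteLine(StrategySketch.Choose(haystackLength: 10_000, "aeiouyAEIOUY"));
Console.WriteLine(StrategySketch.Choose(haystackLength: 10_000, "\n\r\f\u0085\u2028\u2029"));

// Hypothetical names, for illustration only.
enum Strategy { Naive, AsciiVectorized, ProbabilisticMap }

static class StrategySketch
{
    // Assumption: the real cutoff is an implementation detail; 8 is a placeholder.
    private const int ShortHaystackThreshold = 8;

    public static Strategy Choose(int haystackLength, ReadOnlySpan<char> needle)
    {
        // For very short inputs, the setup cost of either optimization isn't worth paying.
        if (haystackLength < ShortHaystackThreshold)
        {
            return Strategy.Naive;
        }

        // All-ASCII needles can use the vectorized ASCII search;
        // anything else falls back to the probabilistic map.
        return Ascii.IsValid(needle) ? Strategy.AsciiVectorized : Strategy.ProbabilisticMap;
    }
}
```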

Even with those improvements, this work of building up these vectors is quite repetitive, and it’s not free. If you have such an IndexOfAny in a loop, you’re paying to build up those vectors over and over and over again. There’s also additional work we could do to further examine the data to choose an even more optimal approach, but every additional check performed comes at the cost of more overhead for the IndexOfAny call. This is where SearchValues comes in. The idea behind SearchValues is to perform all this work once and then cache it. Almost invariably, the pattern for using a SearchValues is to create one, store it in a static readonly field, and then use that SearchValues for all searching operations for that target set. And there are now overloads of methods like IndexOfAny that take a SearchValues<char> or SearchValues<byte>, for example, instead of a ReadOnlySpan<char> or ReadOnlySpan<byte>, respectively. Thus, my previous ASCII letter or digit example would instead look like this:

private static readonly SearchValues<char> s_asciiLettersOrDigits = SearchValues.Create("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
...
int pos = text.IndexOfAny(s_asciiLettersOrDigits);
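As a slightly fuller sketch of that pattern, the same cached instance can serve both IndexOfAny and IndexOfAnyExcept. The Tokenizer type and FirstToken method here are hypothetical names used only for this example.

```csharp
// dotnet run -f net8.0
using System;
using System.Buffers;

Console.WriteLine(Tokenizer.FirstToken("  -- net8 rocks").ToString()); // net8

// Hypothetical helper, for illustration only.
static class Tokenizer
{
    // Created once and reused for every search.
    private static readonly SearchValues<char> s_asciiLettersOrDigits =
        SearchValues.Create("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");

    // Returns the first run of ASCII letters or digits in text, or an empty span if there is none.
    public static ReadOnlySpan<char> FirstToken(ReadOnlySpan<char> text)
    {
        int start = text.IndexOfAny(s_asciiLettersOrDigits);
        if (start < 0)
        {
            return ReadOnlySpan<char>.Empty;
        }

        ReadOnlySpan<char> rest = text.Slice(start);
        int length = rest.IndexOfAnyExcept(s_asciiLettersOrDigits);
        return length < 0 ? rest : rest.Slice(0, length);
    }
}
```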

dotnet/runtime#78093 provided the initial implementation of SearchValues (it was originally named IndexOfAnyValues, but we renamed it subsequently to the more general SearchValues so that we can use it now and in the future with other methods, like Count or Replace). If you peruse the implementation, you’ll see that the Create factory methods don’t just return a concrete SearchValues<T> type; rather, SearchValues<T> provides an internal abstraction that’s then implemented by more than fifteen derived implementations, each specialized for a different scenario. You can see this fairly easily in code by running the following program:

// dotnet run -f net8.0

using System.Buffers;

Console.WriteLine(SearchValues.Create(""));
Console.WriteLine(SearchValues.Create("a"));
Console.WriteLine(SearchValues.Create("ac"));
Console.WriteLine(SearchValues.Create("ace"));
Console.WriteLine(SearchValues.Create("ab\u05D0\u05D1"));
Console.WriteLine(SearchValues.Create("abc\u05D0\u05D1"));
Console.WriteLine(SearchValues.Create("abcdefghijklmnopqrstuvwxyz"));
Console.WriteLine(SearchValues.Create("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"));
Console.WriteLine(SearchValues.Create("\u00A3\u00A5\u00A7\u00A9\u00AB\u00AD"));
Console.WriteLine(SearchValues.Create("abc\u05D0\u05D1\u05D2"));

and you’ll see output like the following:

System.Buffers.EmptySearchValues`1[System.Char]
System.Buffers.SingleCharSearchValues`1[System.Buffers.SearchValues+TrueConst]
System.Buffers.Any2CharSearchValues`1[System.Buffers.SearchValues+TrueConst]
System.Buffers.Any3CharSearchValues`1[System.Buffers.SearchValues+TrueConst]
System.Buffers.Any4SearchValues`2[System.Char,System.Int16]
System.Buffers.Any5SearchValues`2[System.Char,System.Int16]
System.Buffers.RangeCharSearchValues`1[System.Buffers.SearchValues+TrueConst]
System.Buffers.AsciiCharSearchValues`1[System.Buffers.IndexOfAnyAsciiSearcher+Default]
System.Buffers.ProbabilisticCharSearchValues
System.Buffers.ProbabilisticWithAsciiCharSearchValues`1[System.Buffers.IndexOfAnyAsciiSearcher+Default]

highlighting that each of these different inputs ends up getting mapped to a different SearchValues<T>-derived type.

After that initial PR, SearchValues has been successively improved and refined. dotnet/runtime#78863, for example, added AVX2 support, such that with 256-bit vectors being employed (when available) instead of 128-bit vectors, some benchmarks close to doubled in throughput, and dotnet/runtime#83122 enabled WASM support. dotnet/runtime#78996 added a Contains method to be used when implementing scalar fallback paths. And dotnet/runtime#86046 reduced the overhead of calling IndexOfAny with a SearchValues simply by tweaking how the relevant bitmaps and vectors are internally passed around. But two of my favorite tweaks are dotnet/runtime#82866 and dotnet/runtime#84184, which improve overheads when ‘\0’ (null) is one of the characters in the needle. Why would this matter? Surely searching for ‘\0’ can’t be so common? Interestingly, in a variety of scenarios it can be. Imagine you have an algorithm that’s really good at searching for any subset of ASCII, but you want to use it to search for either a specific subset of ASCII or something non-ASCII. If you just search for the subset, you won’t learn about non-ASCII hits. And if you search for everything other than the subset, you’ll learn about non-ASCII hits but also all the wrong ASCII characters. Instead what you want to do is invert the ASCII subset, e.g. if your target characters are ‘A’ through ‘Z’ and ‘a’ through ‘z’, you instead create the subset including ‘\u0000’ through ‘\u0040’, ‘\u005B’ through ‘\u0060’, and ‘\u007B’ through ‘\u007F’. Then, rather than doing an IndexOfAny with that inverted subset, you instead do IndexOfAnyExcept with that inverted subset; this is a true case of “two wrongs make a right,” as we’ll end up with our desired behavior of searching for the original subset of ASCII letters plus anything non-ASCII. And as you’ll note, ‘\0’ is in our inverted subset, making the performance when ‘\0’ is in there more important than it otherwise would be.
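Here is a small sketch of that inversion trick using the public API. The LetterOrNonAscii type is a hypothetical name for this example; it builds the set of every ASCII character that is not a letter and then searches with IndexOfAnyExcept.

```csharp
// dotnet run -f net8.0
using System;
using System.Buffers;
using System.Text;

Console.WriteLine(LetterOrNonAscii.IndexOf("123 abc"));    // 4: 'a' is an ASCII letter
Console.WriteLine(LetterOrNonAscii.IndexOf("123 \u05D0")); // 4: U+05D0 is non-ASCII
Console.WriteLine(LetterOrNonAscii.IndexOf("123 456"));    // -1: only ASCII non-letters

// Hypothetical helper, for illustration only.
static class LetterOrNonAscii
{
    // The inverted subset: every ASCII character that is NOT a letter (this includes '\0').
    private static readonly SearchValues<char> s_asciiExceptLetters = CreateInverted();

    private static SearchValues<char> CreateInverted()
    {
        var sb = new StringBuilder();
        for (char c = '\0'; c <= '\u007F'; c++)
        {
            if (!char.IsAsciiLetter(c))
            {
                sb.Append(c);
            }
        }

        return SearchValues.Create(sb.ToString());
    }

    // Anything not in the inverted subset is either an ASCII letter or non-ASCII.
    public static int IndexOf(ReadOnlySpan<char> text) =>
        text.IndexOfAnyExcept(s_asciiExceptLetters);
}
```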

Interestingly, the probabilistic map code path in .NET 8 actually also enjoys some amount of vectorization, even without SearchValues, thanks to dotnet/runtime#80963 (it was also further improved in dotnet/runtime#85189 that used better instructions on Arm, and in dotnet/runtime#85203 that avoided some wasted work). That means that whether or not SearchValues is used, searches involving probabilistic map get much faster than in .NET 7. For example, here’s a benchmark that again searches “The Adventures of Sherlock Holmes” and counts the number of line endings in it, using the same needle that string.ReplaceLineEndings uses:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

    [Benchmark]
    public int CountLineEndings()
    {
        int count = 0;
        ReadOnlySpan<char> haystack = s_haystack;
        int pos;
        while ((pos = haystack.IndexOfAny("\n\r\f\u0085\u2028\u2029")) >= 0)
        {
            count++;
            haystack = haystack.Slice(pos + 1);
        }

        return count;
    }
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| CountLineEndings | .NET 7.0 | 2.155 ms | 1.00 |
| CountLineEndings | .NET 8.0 | 1.323 ms | 0.61 |

SearchValues can then be used to improve upon that. It does so not only by caching the probabilistic map that each call to IndexOfAny above needs to recompute, but also by recognizing that when a needle contains ASCII, that’s a good indication (heuristically) that ASCII haystacks will be prominent. As such, dotnet/runtime#89155 adds a fast path that performs a search for either any of the ASCII needle values or any non-ASCII value, and if it finds a non-ASCII value, then it falls back to performing the vectorized probabilistic map search.

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;
    private static readonly SearchValues<char> s_lineEndings = SearchValues.Create("\n\r\f\u0085\u2028\u2029");

    [Benchmark]
    public int CountLineEndings_Chars()
    {
        int count = 0;
        ReadOnlySpan<char> haystack = s_haystack;
        int pos;
        while ((pos = haystack.IndexOfAny("\n\r\f\u0085\u2028\u2029")) >= 0)
        {
            count++;
            haystack = haystack.Slice(pos + 1);
        }

        return count;
    }

    [Benchmark]
    public int CountLineEndings_SearchValues()
    {
        int count = 0;
        ReadOnlySpan<char> haystack = s_haystack;
        int pos;
        while ((pos = haystack.IndexOfAny(s_lineEndings)) >= 0)
        {
            count++;
            haystack = haystack.Slice(pos + 1);
        }

        return count;
    }
}

| Method | Mean |
|---|---|
| CountLineEndings_Chars | 1,300.3 us |
| CountLineEndings_SearchValues | 430.9 us |

dotnet/runtime#89224 further augments that heuristic by guarding that ASCII fast path behind a quick check to see if the very next character is non-ASCII, skipping the ASCII-based search if it is and thereby avoiding the overhead when dealing with an all non-ASCII input. For example, here’s the result of running the previous benchmark, with the exact same code, except changing the URL to be https://www.gutenberg.org/files/39963/39963-0.txt, which is an almost entirely Greek document containing Aristotle’s “The Constitution of the Athenians”:

| Method | Mean |
|---|---|
| CountLineEndings_Chars | 542.6 us |
| CountLineEndings_SearchValues | 283.6 us |

With all of that goodness imbued in SearchValues, it’s now being used extensively throughout dotnet/runtime. For example, System.Text.Json previously had its own dedicated implementation of a function IndexOfQuoteOrAnyControlOrBackSlash that it used to search for any character with an ordinal value less than 32, or a quote, or a backslash. That implementation in .NET 7 was ~200 lines of complicated Vector<T>-based code. Now in .NET 8 thanks to dotnet/runtime#82789, it’s simply this:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int IndexOfQuoteOrAnyControlOrBackSlash(this ReadOnlySpan<byte> span) =>
    span.IndexOfAny(s_controlQuoteBackslash);

private static readonly SearchValues<byte> s_controlQuoteBackslash = SearchValues.Create(
    "\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\u0009\u000A\u000B\u000C\u000D\u000E\u000F\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F"u8 + // Any Control, < 32 (' ')
    "\""u8 + // Quote
    "\\"u8); // Backslash

Such use was rolled out in a bunch of PRs, for example dotnet/runtime#78664 that used SearchValues in System.Private.Xml, dotnet/runtime#81976 in JsonSerializer, dotnet/runtime#78676 in X500NameEncoder, dotnet/runtime#78667 in Regex.Escape, dotnet/runtime#79025 in ZipFile and TarFile, dotnet/runtime#79974 in WebSocket, dotnet/runtime#81486 in System.Net.Mail, and dotnet/runtime#78896 in Cookie. dotnet/runtime#78666 and dotnet/runtime#79024 in Uri are particularly nice, including optimizing the commonly-used Uri.EscapeDataString helper with SearchValues; this shows up as a sizable improvement, especially when there’s nothing to escape.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private string _value = Convert.ToBase64String("How did I escape? With difficulty. How did I plan this moment? With pleasure. "u8);

    [Benchmark]
    public string EscapeDataString() => Uri.EscapeDataString(_value);
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| EscapeDataString | .NET 7.0 | 85.468 ns | 1.00 |
| EscapeDataString | .NET 8.0 | 8.443 ns | 0.10 |

All in all, just in dotnet/runtime, SearchValues.Create is now used in more than 40 places, and that’s not including all the uses that get generated as part of Regex (more on that in a bit). This is helped along by dotnet/roslyn-analyzers#6898, which adds a new analyzer that will flag opportunities for SearchValues and update the code to use it (CA1870).

Throughout this discussion, I’ve mentioned ReplaceLineEndings several times, using it as an example of the kind of thing that wants to efficiently search for multiple characters. After dotnet/runtime#78678 and dotnet/runtime#81630, it now also uses SearchValues, plus has been enhanced with other optimizations. Given the discussion of SearchValues, it’ll be obvious how it’s employed here, at least the basics of it. Previously, ReplaceLineEndings relied on an internal helper IndexOfNewlineChar which did this:

internal static int IndexOfNewlineChar(ReadOnlySpan<char> text, out int stride)
{
    const string Needles = "\r\n\f\u0085\u2028\u2029";
    int idx = text.IndexOfAny(Needles);
    ...
}

Now, it does:

int idx = text.IndexOfAny(SearchValuesStorage.NewLineChars);

where that NewLineChars is just:

internal static class SearchValuesStorage
{
    public static readonly SearchValues<char> NewLineChars = SearchValues.Create("\r\n\f\u0085\u2028\u2029");
}

Straightforward. However, it takes things a bit further. Note that there are 6 characters in that list, some of which are ASCII, some of which aren’t. Knowing the algorithms SearchValues currently employs, we know that this will knock it off the path of just doing an ASCII search, and it’ll instead use the algorithm that does a search for one of the 3 ASCII characters plus anything non-ASCII, and if it finds anything non-ASCII, will then fall back to doing the probabilistic map search. If we could remove just one of those characters, we’d be back into the range of just being able to use the IndexOfAny implementation that can work with any 5 characters. On non-Windows systems, we’re in luck. ReplaceLineEndings by default replaces a line ending with Environment.NewLine; on Windows, that’s "\r\n", but on Linux and macOS, that’s "\n". If the replacement text is "\n" (which can also be opted-into on Windows by using the ReplaceLineEndings(string replacementText) overload), then searching for '\n' only to replace it with '\n' is a nop, which means we can remove '\n' from the search list when the replacement text is "\n", bringing us down to only 5 target characters, and giving us a little edge. And while that’s a nice little gain, the bigger gain is that we won’t end up breaking out of the vectorized loop as frequently, or at all if all of the line endings are the replacement text. Further, the .NET 7 implementation was always creating a new string to return, but we can avoid allocating it if we didn’t actually replace anything with anything new. The net result of all of this is huge improvements to ReplaceLineEndings, some due to SearchValues and some beyond.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    // NOTE: This text uses \r\n as its line endings
    private static readonly string s_text = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

    [Benchmark]
    [Arguments("\r\n")]
    [Arguments("\n")]
    public string ReplaceLineEndings(string replacement) => s_text.ReplaceLineEndings(replacement);
}

| Method | Runtime | replacement | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|
| ReplaceLineEndings | .NET 7.0 | \n | 2,746.3 us | 1.00 | 1163121 B | 1.00 |
| ReplaceLineEndings | .NET 8.0 | \n | 995.9 us | 0.36 | 1163121 B | 1.00 |
| ReplaceLineEndings | .NET 7.0 | \r\n | 2,920.1 us | 1.00 | 1187729 B | 1.00 |
| ReplaceLineEndings | .NET 8.0 | \r\n | 356.5 us | 0.12 | - | 0.00 |
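The “drop '\n' when the replacement is "\n"” idea can be sketched as below. This is only an illustration of the technique, not the actual ReplaceLineEndings code; the LineEndingSearch type, its field names, and the IndexOfNextCharToReplace method are hypothetical.

```csharp
// dotnet run -f net8.0
using System;
using System.Buffers;

Console.WriteLine(LineEndingSearch.IndexOfNextCharToReplace("a\nb\r\nc", "\n"));   // 3: the '\n' at index 1 is skipped
Console.WriteLine(LineEndingSearch.IndexOfNextCharToReplace("a\nb\r\nc", "\r\n")); // 1

// Hypothetical helper, for illustration only.
static class LineEndingSearch
{
    // All six newline characters: ASCII and non-ASCII mixed.
    private static readonly SearchValues<char> s_newLineChars =
        SearchValues.Create("\r\n\f\u0085\u2028\u2029");

    // The same set minus '\n', leaving only five target characters.
    private static readonly SearchValues<char> s_newLineCharsExceptLineFeed =
        SearchValues.Create("\r\f\u0085\u2028\u2029");

    public static int IndexOfNextCharToReplace(ReadOnlySpan<char> text, string replacement) =>
        // Replacing '\n' with "\n" would be a nop, so there's no need to stop at it.
        replacement == "\n"
            ? text.IndexOfAny(s_newLineCharsExceptLineFeed)
            : text.IndexOfAny(s_newLineChars);
}
```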

The SearchValues changes also accrue to the span-based non-allocating EnumerateLines:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_text = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

    [Benchmark]
    public int CountLines()
    {
        int count = 0;
        foreach (ReadOnlySpan<char> _ in s_text.AsSpan().EnumerateLines()) count++;
        return count;
    }
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| CountLines | .NET 7.0 | 2,029.9 us | 1.00 |
| CountLines | .NET 8.0 | 353.2 us | 0.17 |

Regex

Having just examined SearchValues, it’s a good time to talk about Regex, as the former now plays an integral role in the latter. Regex was significantly improved in .NET 5, and then again was overhauled for .NET 7, which saw the introduction of the regex source generator. Now in .NET 8, Regex continues to receive significant investment, in particular this release in taking advantage of much of the work already discussed that was introduced lower in the stack to enable more efficient searching.

As a reminder, there are effectively three different “engines” within System.Text.RegularExpressions, meaning effectively three different components for actually processing a regex. The simplest engine is the “interpreter”; the Regex constructor translates the regular expression into a series of regex opcodes which the RegexInterpreter then evaluates against the incoming text. This is done in a “scan” loop, which (simplified) looks like this:

while (TryFindNextStartingPosition(text))
{
    if (TryMatchAtCurrentPosition(text) || _currentPosition == text.Length) break;
    _currentPosition++;
}

TryFindNextStartingPosition tries to move through as much of the input text as possible until it finds a position in the input that could feasibly start a match, and then TryMatchAtCurrentPosition evaluates the pattern at that position against the input. That evaluation in the interpreter involves a loop like this, processing the opcodes that were produced from the pattern:

while (true)
{
    switch (_opcode)
    {
        case RegexOpcode.Stop:
            return match.FoundMatch;

        case RegexOpcode.Goto:
            Goto(Operand(0));
            continue;

        ... // cases for ~50 other opcodes
    }
}

Then there’s the non-backtracking engine, which is what you get when you select the RegexOptions.NonBacktracking option introduced in .NET 7. This engine shares the same TryFindNextStartingPosition implementation as the interpreter, such that all of the optimizations involved in skipping through as much text as possible (ideally via vectorized IndexOf operations) accrue to both the interpreter and the non-backtracking engine. However, that’s where the similarities end. Rather than processing regex opcodes, the non-backtracking engine works by converting the regular expression pattern into a lazily-constructed deterministic finite automaton (DFA) or non-deterministic finite automaton (NFA), which it then uses to evaluate the input text. The key benefit of the non-backtracking engine is that it provides linear-time execution guarantees in the length of the input. For a lot more detail, please read Regular Expression Improvements in .NET 7.

The third engine actually comes in two forms: RegexOptions.Compiled and the regex source generator (introduced in .NET 7). Except for a few corner-cases, these are effectively the same as each other in terms of how they work. They both generate custom code specific to the input pattern provided, with the former generating IL at run-time and the latter generating C# (which is then compiled to IL by the C# compiler) at build-time. The structure of the resulting code, and 99% of the optimizations applied, are identical between them; in fact, in .NET 7, the RegexCompiler was completely rewritten to be a block-by-block translation of the C# code the regex source generator emits. For both, the actual emitted code is fully customized to the exact pattern supplied, with both trying to generate code that processes the regex as efficiently as possible, and with the source generator trying to do so by generating code that is as close as possible to what an expert .NET developer might write. That’s in large part because the source it generates is visible, even live in Visual Studio as you edit your pattern.

(Image: GeneratedRegex in Visual Studio)

I mention all of this because there is ample opportunity throughout Regex, both in the TryFindNextStartingPosition used by the interpreter and non-backtracking engines and throughout the code generated by RegexCompiler and the regex source generator, to use APIs introduced to make searching faster. I’m looking at you, IndexOf and friends.

As noted earlier, new IndexOf variants have been introduced in .NET 8 for searching for ranges, and as of dotnet/runtime#76859, Regex will now take full advantage of them in generated code. For example, consider [GeneratedRegex(@"[0-9]{5}")], which might be used to search for a zip code in the United States. The regex source generator in .NET 7 would emit code for TryFindNextStartingPosition that contained this:

// The pattern begins with '0' through '9'.
// Find the next occurrence. If it can't be found, there's no match.
ReadOnlySpan<char> span = inputSpan.Slice(pos);
for (int i = 0; i < span.Length - 4; i++)
{
    if (char.IsAsciiDigit(span[i]))
    ...
}

Now in .NET 8, that same attribute instead generates this:

// The pattern begins with a character in the set [0-9].
// Find the next occurrence. If it can't be found, there's no match.
ReadOnlySpan<char> span = inputSpan.Slice(pos);
for (int i = 0; i < span.Length - 4; i++)
{
    int indexOfPos = span.Slice(i).IndexOfAnyInRange('0', '9');
    ...
}

That .NET 7 implementation is examining one character at a time, whereas the .NET 8 code is vectorizing the search via IndexOfAnyInRange, examining multiple characters at a time. This can lead to significant speedups.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;
    private readonly Regex _regex = new Regex("[0-9]{5}", RegexOptions.Compiled);

    [Benchmark]
    public int Count() => _regex.Count(s_haystack);
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Count | .NET 7.0 | 423.88 us | 1.00 |
| Count | .NET 8.0 | 29.91 us | 0.07 |
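To see how the “find a starting position, then try to match there” structure combines with IndexOfAnyInRange, here is a hand-written sketch of a search for [0-9]{5}. It is not the code the source generator emits; the FiveDigits type is a hypothetical name for this example.

```csharp
// dotnet run -f net8.0
using System;

Console.WriteLine(FiveDigits.IndexOf("Redmond, WA 98052")); // 12

// Hypothetical hand-written equivalent of searching for [0-9]{5}.
static class FiveDigits
{
    public static int IndexOf(ReadOnlySpan<char> input)
    {
        int pos = 0;
        while (true)
        {
            // Find the next starting position: skip ahead to the next ASCII digit.
            int i = input.Slice(pos).IndexOfAnyInRange('0', '9');
            if (i < 0 || input.Length - (pos + i) < 5)
            {
                return -1; // no digit left, or not enough room for five of them
            }
            pos += i;

            // Try to match at the current position: all five characters must be digits.
            int nonDigit = input.Slice(pos, 5).IndexOfAnyExceptInRange('0', '9');
            if (nonDigit < 0)
            {
                return pos;
            }

            // No match can start at or before the offending character, so skip past it.
            pos += nonDigit + 1;
        }
    }
}
```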

The generated code can use these APIs in other places as well, even as part of validating the match itself. Let’s say your pattern was instead [GeneratedRegex(@"(\w{3,})[0-9]")], which is going to look for and capture a sequence of at least three word characters that is then followed by an ASCII digit. This is a standard greedy loop, so it’s going to consume as many word characters as it can (which includes ASCII digits), and will then backtrack, giving back some of the word characters consumed, until it can find a digit. Previously, that was implemented just by giving back a single character, seeing if it was a digit, giving back a single character, seeing if it was a digit, and so on. Now? The source generator emits code that includes this:

charloop_ending_pos = inputSpan.Slice(charloop_starting_pos, charloop_ending_pos - charloop_starting_pos).LastIndexOfAnyInRange('0', '9')

In other words, it’s using LastIndexOfAnyInRange to optimize that backwards search for the next viable backtracking location.
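As a rough hand-written sketch of that idea (not the generated code), the following matches (\w{3,})[0-9] anchored at the start of the input. The GreedyBacktrack type is hypothetical, and treating \w as ASCII letters, digits, and underscore is a simplifying assumption of the sketch.

```csharp
// dotnet run -f net8.0
using System;

Console.WriteLine(GreedyBacktrack.MatchLengthAtStart("abc123x")); // 6: "abc12" followed by '3'

// Hypothetical illustration; approximates \w as ASCII letters, digits, and '_'.
static class GreedyBacktrack
{
    // Returns the length of the match of (\w{3,})[0-9] at the start of input, or -1.
    public static int MatchLengthAtStart(ReadOnlySpan<char> input)
    {
        // Greedy loop: consume as many word characters as possible.
        int consumed = 0;
        while (consumed < input.Length &&
               (char.IsAsciiLetterOrDigit(input[consumed]) || input[consumed] == '_'))
        {
            consumed++;
        }

        // We need at least three word characters plus one digit.
        if (consumed < 4)
        {
            return -1;
        }

        // Backtrack in one step: the digit must sit at index 3 or later within the
        // consumed region, and greedy matching prefers the last such position.
        int digit = input.Slice(3, consumed - 3).LastIndexOfAnyInRange('0', '9');
        return digit < 0 ? -1 : 3 + digit + 1;
    }
}
```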

Another significant improvement that builds on improvements lower in the stack is dotnet/runtime#85438. As was previously covered, the vectorization of span.IndexOf("...", StringComparison.OrdinalIgnoreCase) has been improved in .NET 8. Previously, Regex wasn’t utilizing this API, as it was often able to do better with its own custom-generated code. But now that the API has been optimized, this PR changes Regex to use it, making the generated code both simpler and faster. Here I’m searching case-insensitively for the whole word “year”:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;
    private readonly Regex _regex = new Regex(@"\byear\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    [Benchmark]
    public int Count() => _regex.Count(s_haystack);
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Count | .NET 7.0 | 181.80 us | 1.00 |
| Count | .NET 8.0 | 63.10 us | 0.35 |
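For a feel of what leaning on IndexOf(..., StringComparison.OrdinalIgnoreCase) looks like by hand, here is a sketch that counts whole-word, case-insensitive occurrences. It is not what Regex generates: the WholeWord type is hypothetical, and its boundary check (letters, digits, underscore) is only an approximation of \b.

```csharp
// dotnet run -f net8.0
using System;

Console.WriteLine(WholeWord.Count("Year after year, the yearly YEAR.", "year")); // 3

// Hypothetical illustration; the word-boundary test is an approximation of \b.
static class WholeWord
{
    public static int Count(ReadOnlySpan<char> text, ReadOnlySpan<char> word)
    {
        if (word.IsEmpty)
        {
            return 0;
        }

        int count = 0, pos = 0;
        while (true)
        {
            // Let the vectorized case-insensitive IndexOf find the next candidate.
            int i = text.Slice(pos).IndexOf(word, StringComparison.OrdinalIgnoreCase);
            if (i < 0)
            {
                return count;
            }

            int start = pos + i, end = start + word.Length;

            // Only count the candidate if it isn't embedded in a longer word.
            bool boundaryBefore = start == 0 || !IsWordChar(text[start - 1]);
            bool boundaryAfter = end == text.Length || !IsWordChar(text[end]);
            if (boundaryBefore && boundaryAfter)
            {
                count++;
            }

            pos = start + 1;
        }
    }

    private static bool IsWordChar(char c) => char.IsLetterOrDigit(c) || c == '_';
}
```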

In addition to learning how to use the existing IndexOf(..., StringComparison.OrdinalIgnoreCase) and the new IndexOfAnyInRange and IndexOfAnyExceptInRange, Regex in .NET 8 also learns how to use the new SearchValues<char>. This is a big boost for Regex, as it now means that it can vectorize searches for many more sets than it previously could. For example, let’s say you wanted to search for all hex numbers. You might use a pattern like [0123456789ABCDEFabcdef]+. If you plug that into the regex source generator in .NET 7, you’ll get a TryFindNextPossibleStartingPosition emitted that contains code like this:

// The pattern begins with a character in the set [0-9A-Fa-f].
// Find the next occurrence. If it can't be found, there's no match.
ReadOnlySpan<char> span = inputSpan.Slice(pos);
for (int i = 0; i < span.Length; i++)
{
    if (char.IsAsciiHexDigit(span[i]))
    {
        base.runtextpos = pos + i;
        return true;
    }
}

Now in .NET 8, thanks in large part to dotnet/runtime#78927, you’ll instead get code like this:

// The pattern begins with a character in the set [0-9A-Fa-f].
// Find the next occurrence. If it can't be found, there's no match.
int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_asciiHexDigits);
if (i >= 0)
{
    base.runtextpos = pos + i;
    return true;
}

What is that Utilities.s_asciiHexDigits? It’s a SearchValues<char> emitted into the file’s Utilities class:

/// <summary>Supports searching for characters in or not in "0123456789ABCDEFabcdef".</summary>
internal static readonly SearchValues<char> s_asciiHexDigits = SearchValues.Create("0123456789ABCDEFabcdef");

The source generator explicitly recognized this set and so created a nice name for it, but that’s purely about readability; it can still use SearchValues<char> even if it doesn’t recognize the set as something that’s well-known and easily nameable. For example, if I instead augment the set to be all valid hex digits and an underscore, I then instead get this:

/// <summary>Supports searching for characters in or not in "0123456789ABCDEF_abcdef".</summary>
internal static readonly SearchValues<char> s_ascii_FF037E0000807E000000 = SearchValues.Create("0123456789ABCDEF_abcdef");

When initially added to Regex, SearchValues<char> was only used when the input set was all ASCII. But as SearchValues<char> improved over the development of .NET 8, so too did Regex’s use of it. With dotnet/runtime#89205, Regex now relies on SearchValues’s ability to efficiently search for both ASCII and non-ASCII, and will similarly emit a SearchValues<char> if it’s able to efficiently enumerate the contents of a set and that set contains a reasonably small number of characters (today, that means no more than 128). Interestingly, SearchValues’s optimization to first do a search for the ASCII subset of a target and then fall back to a vectorized probabilistic map search was first prototyped in Regex (dotnet/runtime#89140), after which we decided to push the optimization downwards into SearchValues so that Regex could generate simpler code and so that other non-Regex consumers would benefit.

That still, however, leaves the cases where we can’t efficiently enumerate the set in order to determine every character it includes, nor would we want to pass a gigantic number of characters off to SearchValues. Consider the set \w, i.e. “word characters.” Of the 65,536 char values, 50,409 match the set \w. It would be inefficient to enumerate all of those characters in order to try to create a SearchValues<char> for them, and Regex doesn’t try. Instead, as of dotnet/runtime#83992, Regex employs a similar approach as noted above, but with a scalar fallback. For example, for the pattern \w+, it emits the following helper into Utilities:

internal static int IndexOfAnyWordChar(this ReadOnlySpan<char> span)
{
    int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptWordChars);
    if ((uint)i < (uint)span.Length)
    {
        if (char.IsAscii(span[i]))
        {
            return i;
        }

        do
        {
            if (Utilities.IsWordChar(span[i]))
            {
                return i;
            }
            i++;
        }
        while ((uint)i < (uint)span.Length);
    }

    return -1;
}

/// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&amp;'()*+,-./:;&lt;=&gt;?@[\\]^`{|}~\u007f".</summary>
internal static readonly SearchValues<char> s_asciiExceptWordChars = SearchValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./:;<=>?@[\\]^`{|}~\u007f");

The fact that it named the helper “IndexOfAnyWordChar” is, again, separate from the fact that it was able to generate this helper; it simply recognizes the set here as part of determining a name and was able to come up with a nicer one, but if it didn’t recognize it, the body of the method would be the same and the name would just be less readable, as it would come up with something fairly gibberish but unique.

As an interesting aside, I noted that the source generator and RegexCompiler are effectively the same, just with one generating C# and one generating IL. That’s 99% correct. There is one interesting difference around their use of SearchValues, though, one which makes the source generator a bit more efficient in how it’s able to utilize the type. Any time the source generator needs a SearchValues instance for a new combination of characters, it can just emit another static readonly field for that instance, and because it’s static readonly, the JIT’s optimizations around devirtualization and inlining can kick in, with calls to use this seeing the actual type of the instance and optimizing based on that. Yay.

RegexCompiler is a different story. RegexCompiler emits IL for a given Regex, and it does so using DynamicMethod; this provides the lightest-weight solution to reflection emit, also allowing the generated methods to be garbage collected when they’re no longer referenced. DynamicMethods, however, are just that, methods. There’s no support for creating additional static fields on demand, without growing up into the much more expensive TypeBuilder-based solution. How then can RegexCompiler create and store an arbitrary number of SearchValues instances, and how can it do so in a way that similarly enables devirtualization? It employs a few tricks. First, a field was added to the internal CompiledRegexRunner type that stores the delegate to the generated method: private readonly SearchValues<char>[]? _searchValues; As an array, this enables any number of SearchValues to be stored; the emitted IL can access the field, grab the array, and index into it to grab the relevant SearchValues<char> instance.

Just doing that, of course, would not allow for devirtualization, and even dynamic PGO doesn’t help here because currently DynamicMethods don’t participate in tiering; compilation goes straight to tier 1, so there’d be no opportunity for instrumentation to see the actual SearchValues<char>-derived type employed. Thankfully, there are available solutions. The JIT can learn about the type of an instance from the type of a local in which it’s stored, so one solution is to create a local of the concrete and sealed SearchValues<char> derived type (we’re writing IL at this point, so we can do things like that without actually having access to the type in question), read the SearchValues<char> from the array, store it into the local, and then use the local for the subsequent access. And, in fact, we did that for a while during the .NET 8 development process. However, that does require a local, and requires an extra read/write of that local. Instead, a tweak in dotnet/runtime#85954 allows the JIT to use the T in Unsafe.As<T>(object o) to learn about the actual type of T, and so RegexCompiler can just use Unsafe.As to inform the JIT as to the actual type of the instance such that it’s then devirtualized. The code RegexCompiler uses then to emit the IL to load a SearchValues<char> is this:

// from RegexCompiler.cs, tweaked for readability in this post
private void LoadSearchValues(ReadOnlySpan<char> chars)
{
    List<SearchValues<char>> list = _searchValues ??= new();
    int index = list.Count;
    list.Add(SearchValues.Create(chars));

    // Unsafe.As<DerivedSearchValues>(Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(this._searchValues), index));
    _ilg.Emit(OpCodes.Ldarg_0);
    _ilg.Emit(OpCodes.Ldfld, s_searchValuesArrayField);
    _ilg.Emit(OpCodes.Call, s_memoryMarshalGetArrayDataReferenceSearchValues);
    _ilg.Emit(OpCodes.Ldc_I4, index * IntPtr.Size);
    _ilg.Emit(OpCodes.Add);
    _ilg.Emit(OpCodes.Ldind_Ref);
    _ilg.Emit(OpCodes.Call, typeof(Unsafe).GetMethod("As", new[] { typeof(object) })!.MakeGenericMethod(list[index].GetType()));
}

We can see all of this in action with a benchmark like this:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;
    private static readonly Regex s_names = new Regex("Holmes|Watson|Lestrade|Hudson|Moriarty|Adler|Moran|Morstan|Gregson", RegexOptions.Compiled);

    [Benchmark]
    public int Count() => s_names.Count(s_haystack);
}

Here we’re searching the same Sherlock Holmes text for the names of some of the most common characters in the detective stories. The regex pattern analyzer will try to find something for which it can vectorize a search, and it will look at all of the characters that can validly exist at each position in a match, e.g. all matches begin with ‘H’, ‘W’, ‘L’, ‘M’, ‘A’, or ‘G’. And since the shortest match is five letters (“Adler”), it’ll end up looking at the first five positions, coming up with these sets:

0: [AGHLMW]
1: [adeoru]
2: [delrst]
3: [aegimst]
4: [aenorst]

All of those sets have more than five characters in them, though, an important delineation as in .NET 7 that is the largest number of characters for which IndexOfAny will vectorize a search. Thus, in .NET 7, Regex ends up emitting code that walks the input checking character by character (though it does match the set using a fast branch-free bitmap mechanism):

ReadOnlySpan<char> span = inputSpan.Slice(pos);
for (int i = 0; i < span.Length - 4; i++)
{
    if (((long)((0x8318020000000000UL << (int)(charMinusLow = (uint)span[i] - 'A')) & (charMinusLow - 64)) < 0) &&
    ...
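To make that bitmap check a little more concrete, here is a minimal standalone sketch of the same branch-free test for the set [AGHLMW]. The helper name IsInSet is hypothetical (it is not what the generator emits), and the sketch assumes charMinusLow is a ulong, which is what lets the subtraction wrap to a value with its top bit set for in-range characters.

```csharp
using System;

// Hypothetical helper illustrating the bitmap test; not the generator's actual output.
static bool IsInSet(char c)
{
    // One bit per character, counted down from the most significant bit:
    // 'A' -> bit 63, 'G' -> bit 57, 'H' -> bit 56, 'L' -> bit 52, 'M' -> bit 51, 'W' -> bit 41.
    const ulong Bitmap = 0x8318020000000000UL;

    // Distance from 'A'; characters below 'A' wrap around to a large value.
    ulong charMinusLow = (uint)c - 'A';

    // The shift moves the character's bit into the sign position. The subtraction
    // only has its top bit set when charMinusLow < 64, which rejects out-of-range input.
    return (long)((Bitmap << (int)charMinusLow) & (charMinusLow - 64)) < 0;
}

Console.WriteLine(IsInSet('H')); // a character in the set
Console.WriteLine(IsInSet('B')); // a character outside the set
```

The whole membership test is a shift, a subtract, an AND, and a sign check, with no branches and no table lookup, but it still has to run once per character.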

Now in .NET 8, with SearchValues<char> we can efficiently search for any of these sets, and the implementation ends up picking the one it thinks is statistically least likely to match:

int indexOfPos = span.Slice(i).IndexOfAny(Utilities.s_ascii_8231800000000000);

where that s_ascii_8231800000000000 is defined as:

/// <summary>Supports searching for characters in or not in "AGHLMW".</summary>
internal static readonly SearchValues<char> s_ascii_8231800000000000 = SearchValues.Create("AGHLMW");

This leads the overall searching process to be much more efficient.
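The same pattern is available to ordinary application code. As a minimal sketch (the variable names and sample text are assumptions made for illustration), create the SearchValues<char> once and pass it to IndexOfAny:

```csharp
using System;
using System.Buffers;

// Build the set once and reuse it; in real code this would typically live in a static readonly field.
SearchValues<char> firstLetters = SearchValues.Create("AGHLMW");

ReadOnlySpan<char> text = "the adventures of Sherlock Holmes";

// Returns the index of the first character that is in the set, or -1 if there is none.
int index = text.IndexOfAny(firstLetters);
Console.WriteLine(index); // 27, the 'H' in "Holmes"
```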

| Method | Runtime  | Mean     | Ratio |
|--------|----------|----------|-------|
| Count  | .NET 7.0 | 630.5 us | 1.00  |
| Count  | .NET 8.0 | 142.3 us | 0.23  |

Other PRs like dotnet/runtime#84370, dotnet/runtime#89099, and dotnet/runtime#77925 have also contributed to how IndexOf and friends are used, tweaking the various heuristics involved. But there have been improvements to Regex as well outside of this realm. dotnet/runtime#84003, for example, streamlines the matching performance of \w when matching against non-ASCII characters by using a bit-twiddling trick. And dotnet/runtime#84843 changes the underlying type of an internal enum from int to byte, and in doing so ends up shrinking the size of the object containing a value of this enum by 8 bytes (in a 64-bit process). More impactful is dotnet/runtime#85564, which makes a measurable improvement for Regex.Replace. Replace was maintaining a list of ReadOnlyMemory<char> segments to be composed back into the final string; some segments would come from the original string, while some would be the replacement string. As it turns out, though, the string reference contained in that ReadOnlyMemory<char> is unnecessary. We can instead just maintain a list of ints, where every time we add a segment we add to the list the int offset and the int count, and with the nature of replace, we can simply rely on the fact that we’ll need to insert the replacement text between every pair of values.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;
    private static readonly Regex s_vowels = new Regex("[aeiou]", RegexOptions.Compiled);

    [Benchmark]
    public string RemoveVowels() => s_vowels.Replace(s_haystack, "");
}

| Method       | Runtime  | Mean     | Ratio |
|--------------|----------|----------|-------|
| RemoveVowels | .NET 7.0 | 8.763 ms | 1.00  |
| RemoveVowels | .NET 8.0 | 7.084 ms | 0.81  |
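The offset/count bookkeeping behind that Replace change can be sketched outside of Regex. The following is a simplified illustration that replaces a single character rather than a regex match; ReplaceAll and its parameters are hypothetical names, and the real implementation differs in its details.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: records the kept segments of `input` as (offset, count) int pairs,
// then stitches them together with `replacement` between each consecutive pair.
static string ReplaceAll(string input, char target, string replacement)
{
    var segments = new List<int>(); // flat list: offset0, count0, offset1, count1, ...
    int start = 0;
    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == target)
        {
            segments.Add(start);
            segments.Add(i - start);
            start = i + 1;
        }
    }

    if (segments.Count == 0) return input; // nothing matched, so nothing to rebuild

    segments.Add(start);
    segments.Add(input.Length - start);

    var sb = new StringBuilder(input.Length);
    for (int i = 0; i < segments.Count; i += 2)
    {
        if (i != 0) sb.Append(replacement); // the replacement sits between every pair
        sb.Append(input, segments[i], segments[i + 1]);
    }
    return sb.ToString();
}

Console.WriteLine(ReplaceAll("a-b-c", '-', "+")); // a+b+c
```

Because every segment comes from the same input string, two ints per segment are enough; there is no need to carry a string reference in each entry.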

One last improvement in Regex to highlight isn’t actually due to anything in Regex itself, but rather to a primitive Regex uses on every operation: Interlocked.Exchange. Consider this benchmark:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly Regex s_r = new Regex("", RegexOptions.Compiled);

    [Benchmark]
    public bool Overhead() => s_r.IsMatch("");
}

This is purely measuring the overhead of calling into a Regex instance; the matching routine will complete immediately as the pattern matches any input. Since we’re only talking about tens of nanoseconds, your numbers may vary here, but I routinely get results like this:

| Method   | Runtime  | Mean     | Ratio |
|----------|----------|----------|-------|
| Overhead | .NET 7.0 | 32.01 ns | 1.00  |
| Overhead | .NET 8.0 | 28.81 ns | 0.90  |

That several nanosecond improvement is primarily due to dotnet/runtime#79181, which made Interlocked.CompareExchange and Interlocked.Exchange for reference types into intrinsics, special-casing when the JIT can see that the new value to be written is null. These APIs need to employ a GC write barrier as part of writing the object reference into the shared location, for the same reasons discussed earlier in this post, but when writing null, no such barrier is required. This benefits Regex, which uses Interlocked.Exchange as part of renting a RegexRunner to use to actually process the match. Each Regex instance caches a runner object, and every operation tries to rent and return it… that renting is done with Interlocked.Exchange:

RegexRunner runner = Interlocked.Exchange(ref _runner, null) ?? CreateRunner();
try { ... }
finally { _runner = runner; }

Many object pool implementations employ a similar use of Interlocked.Exchange and will similarly benefit.
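As a rough illustration of that rent/return shape, here is a minimal single-slot pool. SingleItemPool is a hypothetical type written for this sketch, not something from the .NET libraries, and it deliberately mirrors the Regex pattern of caching at most one instance.

```csharp
using System;
using System.Text;
using System.Threading;

var pool = new SingleItemPool<StringBuilder>();

StringBuilder first = pool.Rent();   // slot is empty, so a new instance is created
pool.Return(first);
StringBuilder second = pool.Rent();  // slot is populated, so the cached instance comes back

Console.WriteLine(ReferenceEquals(first, second)); // True

// Hypothetical single-slot pool, named for illustration only.
sealed class SingleItemPool<T> where T : class, new()
{
    private T? _cached;

    // Atomically takes whatever is in the slot and leaves null behind.
    // The value being written is the constant null, the case the intrinsic special-cases.
    public T Rent() => Interlocked.Exchange(ref _cached, null) ?? new T();

    // Puts the instance back; if another one is already cached, it is simply overwritten.
    public void Return(T item) => _cached = item;
}
```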

Hashing

The System.IO.Hashing library was introduced in .NET 6 to provide non-cryptographic hash algorithm implementations; initially, it shipped with four types: Crc32, Crc64, XxHash32, and XxHash64. In .NET 8, it gets significant investment, in adding new optimized algorithms, in improving the performance of existing implementations, and in adding new surface area across all of the algorithms.

The xxHash family of hash algorithms has become quite popular of late due to its high performance on both large and small inputs and its overall level of quality (e.g. how few collisions are produced, how well inputs are dispersed, etc.). System.IO.Hashing previously included implementations of the older XXH32 and XXH64 algorithms (as XxHash32 and XxHash64, respectively). Now in .NET 8, thanks to dotnet/runtime#76641, it includes the XXH3 algorithm (as XxHash3), and thanks to dotnet/runtime#77944 from @xoofx, it includes the XXH128 algorithm (as XxHash128). The XxHash3 algorithm was also further optimized in dotnet/runtime#77756 from @xoofx by amortizing the costs of some loads and stores, and in dotnet/runtime#77881 from @xoofx, which improved throughput on Arm by making better use of the AdvSimd hardware intrinsics.

To see overall performance of these hash functions, here’s a microbenchmark comparing the throughput of the cryptographic SHA256 with each of these non-cryptographic hash functions. I’ve also included an implementation of FNV-1a, which is the hash algorithm that may be used by the C# compiler for switch statements (when it needs to switch over a string, for example, and it can’t come up with a better scheme, it hashes the input, and then does a binary search through the pregenerated hashes for each of the cases), as well as an implementation based on System.HashCode (noting that HashCode is different from the rest of these, in that it’s focused on enabling the hashing of arbitrary .NET types, and includes per-process randomization, whereas a goal of these other hash functions is to be 100% deterministic across process boundaries).

// For this test, you'll also need to add:
//     <PackageReference Include="System.IO.Hashing" Version="8.0.0-rc.1.23419.4" />
// to the benchmarks.csproj's <ItemGroup>.
// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers.Binary;
using System.IO.Hashing;
using System.Security.Cryptography;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly byte[] _result = new byte[100];
    private byte[] _source;

    [Params(3, 33_333)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup() => _source = Enumerable.Range(0, Length).Select(i => (byte)i).ToArray();

    // Cryptographic
    [Benchmark(Baseline = true)] public void TestSHA256() => SHA256.HashData(_source, _result);

    // Non-cryptographic
    [Benchmark] public void TestCrc32() => Crc32.Hash(_source, _result);
    [Benchmark] public void TestCrc64() => Crc64.Hash(_source, _result);
    [Benchmark] public void TestXxHash32() => XxHash32.Hash(_source, _result);
    [Benchmark] public void TestXxHash64() => XxHash64.Hash(_source, _result);
    [Benchmark] public void TestXxHash3() => XxHash3.Hash(_source, _result);
    [Benchmark] public void TestXxHash128() => XxHash128.Hash(_source, _result);

    // Algorithm used by the C# compiler for switch statements
    [Benchmark]
    public void TestFnv1a()
    {
        int hash = unchecked((int)2166136261);
        foreach (byte b in _source) hash = (hash ^ b) * 16777619;
        BinaryPrimitives.WriteInt32LittleEndian(_result, hash);
    }

    // Randomized with a custom seed per process
    [Benchmark]
    public void TestHashCode()
    {
        HashCode hc = default;
        hc.AddBytes(_source);
        BinaryPrimitives.WriteInt32LittleEndian(_result, hc.ToHashCode());
    }
}

| Method        | Length | Mean          | Ratio |
|---------------|--------|---------------|-------|
| TestSHA256    | 3      | 856.168 ns    | 1.000 |
| TestHashCode  | 3      | 9.933 ns      | 0.012 |
| TestXxHash64  | 3      | 7.724 ns      | 0.009 |
| TestXxHash128 | 3      | 5.522 ns      | 0.006 |
| TestXxHash32  | 3      | 5.457 ns      | 0.006 |
| TestCrc32     | 3      | 3.954 ns      | 0.005 |
| TestCrc64     | 3      | 3.405 ns      | 0.004 |
| TestXxHash3   | 3      | 3.343 ns      | 0.004 |
| TestFnv1a     | 3      | 1.617 ns      | 0.002 |
| TestSHA256    | 33333  | 60,407.625 ns | 1.00  |
| TestFnv1a     | 33333  | 31,027.249 ns | 0.51  |
| TestHashCode  | 33333  | 4,879.262 ns  | 0.08  |
| TestXxHash32  | 33333  | 4,444.116 ns  | 0.07  |
| TestXxHash64  | 33333  | 3,636.989 ns  | 0.06  |
| TestCrc64     | 33333  | 1,571.445 ns  | 0.03  |
| TestXxHash3   | 33333  | 1,491.740 ns  | 0.03  |
| TestXxHash128 | 33333  | 1,474.551 ns  | 0.02  |
| TestCrc32     | 33333  | 1,295.663 ns  | 0.02  |

A key reason XxHash3 and XxHash128 do so much better than XxHash32 and XxHash64 is that their design is focused on being vectorizable. As such, the .NET implementations employ the support in System.Runtime.Intrinsics to take full advantage of the underlying hardware. This data also hints at why the C# compiler uses FNV-1a: it’s really simple and also really low overhead for small inputs, which are the most common form of input used in switch statements, but it would be a poor choice if you expected primarily longer inputs.
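To illustrate the hash-then-binary-search scheme mentioned for switch statements, here is a small sketch. It builds its table at run time, whereas the compiler bakes the hashes into the generated code, so treat it as a model of the idea rather than of the compiler's output; Fnv1a, Find, and the labels array are hypothetical names.

```csharp
using System;
using System.Linq;

// The same FNV-1a recurrence as in the benchmark, applied to a string's UTF-16 chars.
static uint Fnv1a(string s)
{
    uint hash = 2166136261;
    foreach (char c in s) hash = (hash ^ c) * 16777619;
    return hash;
}

string[] labels = { "Holmes", "Watson", "Moriarty", "Adler" };
uint[] hashes = labels.Select(s => Fnv1a(s)).ToArray();
Array.Sort(hashes, labels); // sort the hashes and keep each label aligned with its hash

// Returns the index of the matching label in the sorted table, or -1 if there is none.
int Find(string s)
{
    int i = Array.BinarySearch(hashes, Fnv1a(s));
    // A matching hash is only a candidate; the string comparison confirms it.
    return i >= 0 && labels[i] == s ? i : -1;
}

Console.WriteLine(Find("Watson") >= 0);
Console.WriteLine(Find("Lestrade") >= 0);
```

With short case labels, computing the hash costs only a handful of multiply and XOR steps, which is the regime where FNV-1a came out ahead in the table above.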

You’ll note that in the previous example, Crc32 and Crc64 both end up in the same ballpark as XxHash3 in terms of throughput (XXH3 generally ranks better than CRC32/64 in terms of quality). CRC32 in that comparison benefits significantly from dotnet/runtime#83321 from @brantburnett, dotnet/runtime#86539 from @brantburnett, and dotnet/runtime#85221 from @brantburnett. These vectorize the Crc32 and Crc64 implementations, based on a decade-old paper from Intel titled “Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction.” The cited PCLMULQDQ instruction is an x86 carry-less multiplication instruction (exposed in .NET via the Pclmulqdq class, which derives from Sse2); the implementation is also able to vectorize on Arm by taking advantage of Arm’s PMULL instruction. The net result is huge gains over .NET 7, in particular for larger inputs being hashed.

// For this test, you'll also need to add:
//     <PackageReference Include="System.IO.Hashing" Version="7.0.0" />
// to the benchmarks.csproj's <ItemGroup>.
// dotnet run -c Release -f net7.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System.IO.Hashing;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithNuGet("System.IO.Hashing", "7.0.0").AsBaseline())
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet("System.IO.Hashing", "8.0.0-rc.1.23419.4"));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
public class Tests
{
    private readonly byte[] _source = Enumerable.Range(0, 1024).Select(i => (byte)i).ToArray();
    private readonly byte[] _destination = new byte[4];

    [Benchmark]
    public void Hash() => Crc32.Hash(_source, _destination);
}

| Method | Runtime  | Mean        | Ratio |
|--------|----------|-------------|-------|
| Hash   | .NET 7.0 | 2,416.24 ns | 1.00  |
| Hash   | .NET 8.0 | 39.01 ns    | 0.02  |

Another change also further improves performance of some of these algorithms, but with a primary purpose of actually making them easier to use in a variety of scenarios. The original design of NonCryptographicHashAlgorithm was focused on creating non-cryptographic alternatives to the existing cryptographic algorithms folks were using, and thus the APIs are all focused on writing out the resulting digests, which are opaque bytes, e.g. CRC32 produces a 4-byte hash. However, especially for these non-cryptographic algorithms, many developers are more familiar with getting back a numerical result, e.g. CRC32 produces a uint. Same data, just a different representation. Interestingly, as well, some of these algorithms operate in terms of such integers, so getting back bytes actually requires a separate step, both ensuring some kind of storage location is available in which to write the resulting bytes and then extracting the result to that location. To address all of this, dotnet/runtime#78075 adds to all of the types in System.IO.Hashing new utility methods for producing such numbers. For example, Crc32 has two new methods added to it:

public static uint HashToUInt32(ReadOnlySpan<byte> source);
public uint GetCurrentHashAsUInt32();

If you just want the uint-based CRC32 hash for some input bytes, you can simply call this one-shot static method HashToUInt32. Or if you’re building up the hash incrementally, having created an instance of the Crc32 type and having appended data to it, you can get the current uint hash via GetCurrentHashAsUInt32. This also shaves off a few instructions for an algorithm like XxHash3 which actually needs to do more work to produce the result as bytes, only to then need to get those bytes back as a ulong:

// For this test, you'll also need to add:
//     <PackageReference Include="System.IO.Hashing" Version="8.0.0-rc.1.23419.4" />
// to the benchmarks.csproj's <ItemGroup>.
// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.IO.Hashing;
using System.Runtime.InteropServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly byte[] _source = new byte[] { 1, 2, 3 };

    [Benchmark(Baseline = true)]
    public ulong HashToBytesThenGetUInt64()
    {
        ulong hash = 0;
        XxHash3.Hash(_source, MemoryMarshal.AsBytes(new Span<ulong>(ref hash)));
        return hash;
    }

    [Benchmark]
    public ulong HashToUInt64() => XxHash3.HashToUInt64(_source);
}

| Method                   | Mean     | Ratio |
|--------------------------|----------|-------|
| HashToBytesThenGetUInt64 | 3.686 ns | 1.00  |
| HashToUInt64             | 3.095 ns | 0.84  |
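For the two Crc32 methods named earlier, a minimal usage sketch looks like the following. It assumes a reference to the System.IO.Hashing package (8.0 or later); the sample text and the split point are arbitrary choices for illustration.

```csharp
// Assumes: <PackageReference Include="System.IO.Hashing" Version="8.0.0" /> or later.
using System;
using System.IO.Hashing;
using System.Text;

byte[] data = Encoding.UTF8.GetBytes("The quick brown fox");

// One-shot: hash the whole input in a single call and get the result as a uint.
uint oneShot = Crc32.HashToUInt32(data);

// Incremental: append the same bytes in two pieces, then read the current hash.
var crc = new Crc32();
crc.Append(data.AsSpan(0, 4));
crc.Append(data.AsSpan(4));
uint incremental = crc.GetCurrentHashAsUInt32();

Console.WriteLine(oneShot == incremental); // True: same bytes, same hash
```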

Also on the hashing front, dotnet/runtime#61558 from @deeprobin adds new BitOperations.Crc32C methods that allow for iterative crc32c hash computation. A nice aspect of crc32c is that multiple platforms provide instructions for this operation, including SSE 4.2 and Arm, and the .NET method will employ whatever hardware support is available, by delegating into the relevant hardware intrinsics in System.Runtime.Intrinsics, e.g.

if (Sse42.X64.IsSupported) return (uint)Sse42.X64.Crc32(crc, data);
if (Sse42.IsSupported) return Sse42.Crc32(Sse42.Crc32(crc, (uint)(data)), (uint)(data >> 32));
if (Crc32.Arm64.IsSupported) return Crc32.Arm64.ComputeCrc32C(crc, data);

We can see the impact those intrinsics have by comparing a manual implementation of the crc32c algorithm against the now built-in implementation:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;
using System.Security.Cryptography;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly byte[] _data = RandomNumberGenerator.GetBytes(1024 * 1024);

    [Benchmark(Baseline = true)]
    public uint Crc32c_Manual()
    {
        uint c = 0;
        foreach (byte b in _data) c = Tests.Crc32C(c, b);
        return c;
    }

    [Benchmark]
    public uint Crc32c_BitOperations()
    {
        uint c = 0;
        foreach (byte b in _data) c = BitOperations.Crc32C(c, b);
        return c;
    }

    private static readonly uint[] s_crcTable = Generate(0x82F63B78u);

    internal static uint Crc32C(uint crc, byte data) =>
        s_crcTable[(byte)(crc ^ data)] ^ (crc >> 8);

    internal static uint[] Generate(uint reflectedPolynomial)
    {
        var table = new uint[256];
        for (int i = 0; i < 256; i++)
        {
            uint val = (uint)i;
            for (int j = 0; j < 8; j++)
            {
                if ((val & 0b0000_0001) == 0)
                {
                    val >>= 1;
                }
                else
                {
                    val = (val >> 1) ^ reflectedPolynomial;
                }
            }
            table[i] = val;
        }
        return table;
    }
}

| Method               | Mean       | Ratio |
|----------------------|------------|-------|
| Crc32c_Manual        | 1,977.9 us | 1.00  |
| Crc32c_BitOperations | 739.9 us   | 0.37  |
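The benchmark feeds one byte at a time, but BitOperations.Crc32C also has overloads taking wider integers. As a sketch (the helper names Crc32CBytewise and Crc32CWide are hypothetical, and no performance claim is made here), consuming eight bytes per call in little-endian order should produce the same value as the byte-wise loop:

```csharp
using System;
using System.Buffers.Binary;
using System.Numerics;

// Byte-at-a-time accumulation, as in the benchmark.
static uint Crc32CBytewise(ReadOnlySpan<byte> data)
{
    uint crc = 0;
    foreach (byte b in data) crc = BitOperations.Crc32C(crc, b);
    return crc;
}

// Eight bytes per step using the ulong overload, then a byte-wise tail.
static uint Crc32CWide(ReadOnlySpan<byte> data)
{
    uint crc = 0;
    while (data.Length >= sizeof(ulong))
    {
        // Little-endian read so the bytes are consumed in memory order.
        crc = BitOperations.Crc32C(crc, BinaryPrimitives.ReadUInt64LittleEndian(data));
        data = data.Slice(sizeof(ulong));
    }
    foreach (byte b in data) crc = BitOperations.Crc32C(crc, b);
    return crc;
}

byte[] input = new byte[1003]; // not a multiple of 8, so the tail loop runs too
for (int i = 0; i < input.Length; i++) input[i] = (byte)(i * 31);

Console.WriteLine(Crc32CBytewise(input) == Crc32CWide(input));
```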

Initialization

Several releases ago, the C# compiler added a valuable optimization that’s now heavily employed throughout the core libraries, and that newer C# constructs (like u8) rely on heavily. It’s quite common to want to store and access sequences or tables of data in code. For example, let’s say I want to quickly look up how many days there are in a month in the Gregorian calendar, based on that month’s 0-based index. I can use a lookup table like this (ignoring leap years for explanatory purposes):

byte[] daysInMonth = new byte[] { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };

Of course, now I’m allocating a byte[], so I should move that out to a static readonly field. Even then, though, that array has to be allocated, and the data loaded into it, incurring some startup overhead the first time it’s used. Instead, I can write it as:

ReadOnlySpan<byte> daysInMonth = new byte[] { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };

While this looks like it’s allocating, it’s actually not. The C# compiler recognizes that all of the data being used to initialize the byte[] is constant and that the array is being stored directly into a ReadOnlySpan<byte>, which doesn’t provide any means for extracting the array back out. As such, the compiler instead lowers this into code that effectively does this (we can’t exactly express in C# the IL that gets generated, so this is pseudo-code):

ReadOnlySpan<byte> daysInMonth = new ReadOnlySpan<byte>(
    &<PrivateImplementationDetails>.9D61D7D7A1AA7E8ED5214C2F39E0C55230433C7BA728C92913CA4E1967FAF8EA,
    12);

It blits the data for the array into the assembly, and then constructing the span isn’t via an array allocation, but rather just wrapping the span around a pointer directly into the assembly’s data. This not only avoids the startup overhead and the extra object on the heap, it also better enables various JIT optimizations, especially when the JIT is able to see what offset is being accessed. If I run this benchmark:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[DisassemblyDiagnoser]
public class Tests
{
    private static readonly byte[] s_daysInMonthArray = new byte[] { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
    private static ReadOnlySpan<byte> DaysInMonthSpan => new byte[] { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };

    [Benchmark] public int ViaArray() => s_daysInMonthArray[0];
    [Benchmark] public int ViaSpan() => DaysInMonthSpan[0];
}

it produces this assembly:

; Tests.ViaArray()
       mov       rax,1B860002028
       mov       rax,[rax]
       movzx     eax,byte ptr [rax+10]
       ret
; Total bytes of code 18

; Tests.ViaSpan()
       mov       eax,1F
       ret
; Total bytes of code 6

In other words, for the array, it’s reading the address of the array and is then reading the element at offset 0x10, or decimal 16, which is where the array’s data begins. For the span, it’s simply loading the value 0x1F, or decimal 31, as it’s directly reading the data from the assembly data. (This isn’t a case of a missing optimization in the JIT for the array example… arrays are mutable, so the JIT can’t constant fold based on the current value stored in the array, since technically it could change.)

However, this compiler optimization only applied to byte, sbyte, and bool. Any other primitive, and the compiler would simply do exactly what you asked it to do: allocate the array. Far from ideal. The reason for the limitation was endianness. The compiler needs to generate binaries that work on both little-endian and big-endian systems; for single-byte types, there’s no endianness concern (since endianness is about the ordering of the bytes, and if there’s only one byte, there’s only one ordering), but for multi-byte types, the generated code could no longer just point directly into the data, as on some systems the data’s bytes would be reversed.
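To see the byte-ordering problem directly, here is a small sketch that writes the same int in both orders using BinaryPrimitives. The buffer names are arbitrary; the point is only that a multi-byte value has two possible layouts, while a single byte has one.

```csharp
using System;
using System.Buffers.Binary;

Span<byte> little = stackalloc byte[4];
Span<byte> big = stackalloc byte[4];

// The same 32-bit value laid out in the two possible byte orders.
BinaryPrimitives.WriteInt32LittleEndian(little, 0x01020304);
BinaryPrimitives.WriteInt32BigEndian(big, 0x01020304);

Console.WriteLine(Convert.ToHexString(little)); // 04030201
Console.WriteLine(Convert.ToHexString(big));    // 01020304
```

Data blitted into the assembly in one of these layouts can only be read in place as an int on a machine whose native order matches it.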

.NET 7 added a new API to help with this, RuntimeHelpers.CreateSpan<T>. Rather than just emitting new ReadOnlySpan<T>(ptrIntoData, dataLength), the idea was that the compiler would emit a call to CreateSpan<T>, passing in a reference to the field containing the data. The JIT and VM would then collude to ensure the data was loaded correctly and efficiently; on a little-endian system, the code would be emitted as if the call weren’t there (replaced by the equivalent of wrapping a span around the pointer and length), and on a big-endian system, the data would be loaded, reversed, and cached into an array, and the code gen would then be creating a span wrapping that array. Unfortunately, although the API shipped in .NET 7, the compiler support for it didn’t, and because no one was then actually using it, there were a variety of issues in the toolchain that went unnoticed.

Thankfully, all of these issues are now addressed in .NET 8 and the C# compiler (and also backported to .NET 7). dotnet/roslyn#61414 added support to the C# compiler for also supporting short, ushort, char, int, uint, long, ulong, double, float, and enums based on these. On target frameworks where CreateSpan<T> is available (.NET 7+), the compiler generates code that uses it. On frameworks where the function isn’t available, the compiler falls back to emitting a static readonly array to cache the data and wrapping a span around that. This was an important consideration for libraries that build for multiple target frameworks, so that when building “downlevel”, the implementation doesn’t fall off the proverbial performance cliff due to relying on this optimization (this optimization is a bit of an oddity, as you actually need to write your code in a way that, without the optimization, ends up performing worse than what you would have otherwise had). With the compiler implementation in place, and fixes to the Mono runtime in dotnet/runtime#82093 and dotnet/runtime#81695, and with fixes to the trimmer (which needs to preserve the alignment of the data that’s emitted by the compiler) in dotnet/cecil#60, the rest of the runtime was then able to consume the feature, which it did in dotnet/runtime#79461. So now, for example, System.Text.Json can use this to store not only how many days there are in a (non-leap) year, but also store how many days there are before a given month, something that wasn’t previously possible efficiently in this form due to there being values larger than can be stored in a byte.

// dotnet run -c Release -f net8.0 --filter **

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "i")]
[MemoryDiagnoser(displayGenColumns: false)]
[DisassemblyDiagnoser]
public class Tests
{
    private static ReadOnlySpan<int> DaysToMonth365 => new int[] { 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365 };

    [Benchmark]
    [Arguments(1)]
    public int DaysToMonth(int i) => DaysToMonth365[i];
}

| Method      | Mean      | Code Size | Allocated |
|-------------|-----------|-----------|-----------|
| DaysToMonth | 0.0469 ns | 35 B      | -         |

; Tests.DaysToMonth(Int32)
       sub       rsp,28
       cmp       edx,0D
       jae       short M00_L00
       mov       eax,edx
       mov       rcx,12B39072DD0
       mov       eax,[rcx+rax*4]
       add       rsp,28
       ret
M00_L00:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 35

dotnet/roslyn#69820 (which hasn’t yet merged but should soon) then rounds things out by ensuring that the pattern of initializing a ReadOnlySpan<T> to a new T[] { const of T, const of T, ... /* all const values */ } will always avoid the array allocation, regardless of the type of T being used. The T need only be expressible as a constant in C#. That means this optimization now also applies to string, decimal, nint, and nuint. For these, the compiler will fall back to using a cached array singleton. With that, this code:

// dotnet build -c Release -f net8.0

internal static class Program
{
    private static void Main() { }

    private static ReadOnlySpan<bool> Booleans => new bool[] { false, true };
    private static ReadOnlySpan<sbyte> SBytes => new sbyte[] { 0, 1, 2 };
    private static ReadOnlySpan<byte> Bytes => new byte[] { 0, 1, 2 };

    private static ReadOnlySpan<short> Shorts => new short[] { 0, 1, 2 };
    private static ReadOnlySpan<ushort> UShorts => new ushort[] { 0, 1, 2 };
    private static ReadOnlySpan<char> Chars => new char[] { '0', '1', '2' };
    private static ReadOnlySpan<int> Ints => new int[] { 0, 1, 2 };
    private static ReadOnlySpan<uint> UInts => new uint[] { 0, 1, 2 };
    private static ReadOnlySpan<long> Longs => new long[] { 0, 1, 2 };
    private static ReadOnlySpan<ulong> ULongs => new ulong[] { 0, 1, 2 };
    private static ReadOnlySpan<float> Floats => new float[] { 0, 1, 2 };
    private static ReadOnlySpan<double> Doubles => new double[] { 0, 1, 2 };

    private static ReadOnlySpan<nint> NInts => new nint[] { 0, 1, 2 };
    private static ReadOnlySpan<nuint> NUInts => new nuint[] { 0, 1, 2 };
    private static ReadOnlySpan<decimal> Decimals => new decimal[] { 0, 1, 2 };
    private static ReadOnlySpan<string> Strings => new string[] { "0", "1", "2" };
}

now compiles down to something like this (again, this is pseudo-code, since we can’t exactly represent in C# what’s emitted in IL):

internal static class Program
{
    private static void Main() { }

    //
    // No endianness concerns. Create a span that points directly into the assembly data,
    // using the `ReadOnlySpan<T>(void*, int)` constructor.
    //

    private static ReadOnlySpan<bool> Booleans => new ReadOnlySpan<bool>(
        &<PrivateImplementationDetails>.B413F47D13EE2FE6C845B2EE141AF81DE858DF4EC549A58B7970BB96645BC8D2, 2);

    private static ReadOnlySpan<sbyte> SBytes => new ReadOnlySpan<sbyte>(
        &<PrivateImplementationDetails>.AE4B3280E56E2FAF83F414A6E3DABE9D5FBE18976544C05FED121ACCB85B53FC, 3);

    private static ReadOnlySpan<byte> Bytes => new ReadOnlySpan<byte>(
        &<PrivateImplementationDetails>.AE4B3280E56E2FAF83F414A6E3DABE9D5FBE18976544C05FED121ACCB85B53FC, 3);

    //
    // Endianness concerns but with data that a span could point to directly if
    // of the correct byte ordering. Go through the RuntimeHelpers.CreateSpan intrinsic.
    //

    private static ReadOnlySpan<short> Shorts => RuntimeHelpers.CreateSpan<short>((RuntimeFieldHandle)
        &<PrivateImplementationDetails>.90C2698921CA9FD02950BE353F721888760E33AB5095A21E50F1E4360B6DE1A02);

    private static ReadOnlySpan<ushort> UShorts => RuntimeHelpers.CreateSpan<ushort>((RuntimeFieldHandle)
        &<PrivateImplementationDetails>.90C2698921CA9FD02950BE353F721888760E33AB5095A21E50F1E4360B6DE1A02);

    private static ReadOnlySpan<char> Chars => RuntimeHelpers.CreateSpan<char>((RuntimeFieldHandle)
        &<PrivateImplementationDetails>.9B9A3CBF2B718A8F94CE348CB95246738A3A9871C6236F4DA0A7CC126F03A8B42);

    private static ReadOnlySpan<int> Ints => RuntimeHelpers.CreateSpan<int>((RuntimeFieldHandle)
        &<PrivateImplementationDetails>.AD5DC1478DE06A4C2728EA528BD9361A4B945E92A414BF4D180CEDAAEAA5F4CC4);

    private static ReadOnlySpan<uint> UInts => RuntimeHelpers.CreateSpan<uint>((RuntimeFieldHandle)
        &<PrivateImplementationDetails>.AD5DC1478DE06A4C2728EA528BD9361A4B945E92A414BF4D180CEDAAEAA5F4CC4);

    private static ReadOnlySpan<long> Longs => RuntimeHelpers.CreateSpan<long>((RuntimeFieldHandle)
        &<PrivateImplementationDetails>.AB25350E3E65EFEBE24584461683ECDA68725576E825E550038B90E7B14799468);

    private static ReadOnlySpan<ulong> ULongs => RuntimeHelpers.CreateSpan<ulong>((RuntimeFieldHandle)
        &<PrivateImplementationDetails>.AB25350E3E65EFEBE24584461683ECDA68725576E825E550038B90E7B14799468);

    private static ReadOnlySpan<float> Floats => RuntimeHelpers.CreateSpan<float>((RuntimeFieldHandle)
        &<PrivateImplementationDetails>.75664B4DA1C08DE9E8FAD52303CC458B3E420EDDE6591E58761E138CC5E3F1634);

    private static ReadOnlySpan<double> Doubles => RuntimeHelpers.CreateSpan<double>((RuntimeFieldHandle)
        &<PrivateImplementationDetails>.B0C45303F7F11848CB5E6E5B2AF2FB2AECD0B72C28748B88B583AB6BB76DF1748);

    //
    // Create a span around a cached array.
    //

    private unsafe static ReadOnlySpan<nuint> NUInts => new ReadOnlySpan<nuint>(
        <PrivateImplementationDetails>.AD5DC1478DE06A4C2728EA528BD9361A4B945E92A414BF4D180CEDAAEAA5F4CC_B16
            ??= new nuint[] { 0, 1, 2 });

    private static ReadOnlySpan<nint> NInts => new ReadOnlySpan<nint>(
        <PrivateImplementationDetails>.AD5DC1478DE06A4C2728EA528BD9361A4B945E92A414BF4D180CEDAAEAA5F4CC_B8
            ??= new nint[] { 0, 1, 2 });

    private static ReadOnlySpan<decimal> Decimals => new ReadOnlySpan<decimal>(
        <PrivateImplementationDetails>.93AF9093EDC211A9A941BDE5EF5640FD395604257F3D945F93C11BA9E918CC74_B18
            ??= new decimal[] { 0, 1, 2 });

    private static ReadOnlySpan<string> Strings => new ReadOnlySpan<string>(
        <PrivateImplementationDetails>.9B9A3CBF2B718A8F94CE348CB95246738A3A9871C6236F4DA0A7CC126F03A8B4_B11
            ??= new string[] { "0", "1", "2" });
}

Another closely related C# compiler improvement comes in dotnet/roslyn#66251 from @alrz. The previously mentioned optimization around single-byte types also applies to stackalloc initialization. If I write:

Span<int> span = stackalloc int[] { 1, 2, 3 };

the C# compiler emits code similar to if I’d written the following:

byte* ptr = stackalloc byte[12];
*(int*)ptr = 1;
*(int*)(ptr + 4) = 2;
*(int*)(ptr + (nint)2 * (nint)4) = 3;
Span<int> span = new Span<int>(ptr, 3);

If, however, I switch from the multi-byte int to the single-byte byte:

Span<byte> span = stackalloc byte[] { 1, 2, 3 };

then I get something closer to this:

byte* ptr = stackalloc byte[3];
Unsafe.CopyBlock(ptr, ref <PrivateImplementationDetails>.039058C6F2C0CB492C533B0A4D14EF77CC0F78ABCCCED5287D84A1A2011CFB81, 3); // actually the cpblk instruction
Span<byte> span = new Span<byte>(ptr, 3);

Unlike the new[] case, however, which optimized not only for byte, sbyte, and bool but also for enums with byte and sbyte as an underlying type, the stackalloc optimization didn’t. Thanks to this PR, it now does.
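
As a small illustration of the case this PR covers, here is a sketch that stack-allocates a span of an enum whose underlying type is byte. The Color enum and the StackallocEnumDemo class are hypothetical names made up for this example. Whether the compiler emits the block copy is only visible in the generated IL, not in the program's output.

```csharp
using System;

Console.WriteLine(StackallocEnumDemo.CountNonRed()); // prints 2

// Hypothetical enum for illustration; its underlying type is byte.
public enum Color : byte { Red, Green, Blue }

public static class StackallocEnumDemo
{
    public static int CountNonRed()
    {
        // Constant initializer of a single-byte enum type: the kind of
        // stackalloc the text says can now be initialized from assembly data.
        Span<Color> colors = stackalloc Color[] { Color.Red, Color.Green, Color.Blue };

        int count = 0;
        foreach (Color c in colors)
        {
            if (c != Color.Red) count++;
        }
        return count;
    }
}
```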

There’s another semi-related new feature spanning C# 12 and .NET 8: InlineArrayAttribute. stackalloc has long provided a way to use stack space as a buffer, rather than needing to allocate memory on the heap; however, for most of .NET’s history, this was “unsafe,” in that it produced a pointer:

byte* buffer = stackalloc byte[8];

C# 7.2 introduced the immensely useful improvement of stack allocating directly into a span, at which point it becomes “safe”: it no longer requires an unsafe context, and all access to the span is bounds checked appropriately, as with any other span:

Span<byte> buffer = stackalloc byte[8];

The C# compiler will lower that to something along the lines of:

Span<byte> buffer;
unsafe
{
    byte* tmp = stackalloc byte[8];
    buffer = new Span<byte>(tmp, 8);
}

However, this is still limited to the kinds of things that can be stackalloc’d, namely unmanaged types (types which don’t contain any managed references), and it’s limited in where it can be used. That’s not only because stackalloc can’t be used in places like catch and finally blocks, but also because there are places where you want to be able to have such buffers that aren’t limited to the stack: inside of other types. C# has long supported the notion of “fixed-size buffers,” e.g.

struct C
{
    internal unsafe fixed char name[30];
}

but these require being in an unsafe context since they present to a consumer as a pointer (in the above example, the type of C.name is a char*) and they’re not bounds-checked, and they’re limited in the element type supported (it can only be bool, sbyte, byte, short, ushort, char, int, uint, long, ulong, double, or float).

.NET 8 and C# 12 provide an answer for this: [InlineArray]. This new attribute can be placed onto a struct containing a single field, like this:

[InlineArray(8)]
internal struct EightStrings
{
    private string _field;
}

The runtime then expands that struct to be logically the same as if you wrote:

internal struct EightStrings
{
    private string _field0;
    private string _field1;
    private string _field2;
    private string _field3;
    private string _field4;
    private string _field5;
    private string _field6;
    private string _field7;
}

ensuring that all of the storage is appropriately contiguous and aligned. Why is that important? Because C# 12 then makes it easy to get a span from one of these instances, e.g.

EightStrings strings = default;
Span<string> span = strings;

This is all “safe,” and the type of the field can be anything that’s valid as a generic type argument. That means pretty much anything other than refs, ref structs, and pointers. This is a constraint imposed by the C# language, since with such a field type T you wouldn’t be able to construct a Span<T>, but the warning can be suppressed, as the runtime itself does support anything as the field type. The compiler-generated code for getting a span is equivalent to if you wrote:

EightStrings strings = default;
Span<string> span = MemoryMarshal.CreateSpan(ref Unsafe.As<EightStrings, string>(ref strings), 8);

which is obviously complicated and not something you’d want to be writing frequently. In fact, the compiler doesn’t want to emit that frequently, either, so it puts it into a helper in the assembly that it can reuse.

[CompilerGenerated]
internal sealed class <PrivateImplementationDetails>
{
    internal static Span<TElement> InlineArrayAsSpan<TBuffer, TElement>(ref TBuffer buffer, int length) =>
        MemoryMarshal.CreateSpan(ref Unsafe.As<TBuffer, TElement>(ref buffer), length);
    ...
}

(<PrivateImplementationDetails> is a class the C# compiler emits to contain helpers and other compiler-generated artifacts used by code it emits elsewhere in the program. You saw it in the previous discussion as well, as it’s where it emits the data in support of array and span initialization from constants.)

The [InlineArray]-attributed type is also a normal struct like any other, and can be used anywhere any other struct can be used; that it’s using [InlineArray] is effectively an implementation detail. So, for example, you can embed it into another type, and the following code will print out “0” through “7” as you’d expect:

// dotnet run -c Release -f net8.0

using System.Runtime.CompilerServices;

MyData data = new();
Span<string> span = data.Strings;
for (int i = 0; i < span.Length; i++) span[i] = i.ToString();
foreach (string s in data.Strings) Console.WriteLine(s);

public class MyData
{
    private EightStrings _strings;

    public Span<string> Strings => _strings;

    [InlineArray(8)]
    private struct EightStrings { private string _field; }
}

dotnet/runtime#82744 provided the CoreCLR runtime support for InlineArray, dotnet/runtime#83776 and dotnet/runtime#84097 provided the Mono runtime support, and dotnet/roslyn#68783 merged the C# compiler support.

This feature isn’t just about you using it directly, either. The compiler itself also uses [InlineArray] as an implementation detail behind other new and planned features… we’ll talk more about that when discussing collections.
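
To make the direct-use case a little more concrete, here is a minimal sketch assuming C# 12 and .NET 8. FourInts and InlineArrayDemo are hypothetical names. It relies on two language features of inline arrays: indexing the struct directly, and the implicit conversion to Span<T> shown earlier.

```csharp
using System;
using System.Runtime.CompilerServices;

Console.WriteLine(InlineArrayDemo.SumOfSquares()); // prints 14

// Hypothetical four-element buffer; the single field declares the element type.
[InlineArray(4)]
public struct FourInts
{
    private int _element0;
}

public static class InlineArrayDemo
{
    public static int SumOfSquares()
    {
        FourInts buffer = default;

        // C# 12 allows indexing an inline array directly.
        for (int i = 0; i < 4; i++)
        {
            buffer[i] = i * i;
        }

        // The implicit conversion to Span<int> views the same storage.
        Span<int> span = buffer;
        int sum = 0;
        foreach (int value in span) sum += value;
        return sum; // 0 + 1 + 4 + 9
    }
}
```

No unsafe context is involved, and every access through the indexer or the span is bounds checked.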

Analyzers

Lastly, even though the runtime and core libraries have made great strides in improving the performance of existing functionality and adding new performance-focused support, sometimes the best fix is actually in the consuming code. That’s where analyzers come in. Several new analyzers have been added in .NET 8 to help find particular classes of string-related performance issues.

CA1858, added in dotnet/roslyn-analyzers#6295 from @Youssef1313, looks for calls to IndexOf where the result is then being checked for equality with 0. This is functionally the same as a call to StartsWith, but is much more expensive as it could end up examining the entire source string rather than just the starting position (dotnet/runtime#79896 fixes a few such uses in dotnet/runtime).

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly string _haystack = """
        It was the best of times, it was the worst of times,
        it was the age of wisdom, it was the age of foolishness,
        it was the epoch of belief, it was the epoch of incredulity,
        it was the season of light, it was the season of darkness,
        it was the spring of hope, it was the winter of despair.
        """;
    private readonly string _needle = "hello";

    [Benchmark(Baseline = true)]
    public bool StartsWith_IndexOf0() =>
        _haystack.IndexOf(_needle, StringComparison.OrdinalIgnoreCase) == 0;

    [Benchmark]
    public bool StartsWith_StartsWith() =>
        _haystack.StartsWith(_needle, StringComparison.OrdinalIgnoreCase);
}

| Method | Mean | Ratio |
|---|---|---|
| StartsWith_IndexOf0 | 31.327 ns | 1.00 |
| StartsWith_StartsWith | 4.501 ns | 0.14 |

CA1865, CA1866, and CA1867 are all related to each other. Added in dotnet/roslyn-analyzers#6799 from @mrahhal, these look for calls to string methods like StartsWith that pass a single-character string argument, e.g. str.StartsWith("@"), and recommend converting the argument into a char. Which diagnostic ID the analyzer raises depends on whether the transformation is 100% equivalent behavior or whether a change in behavior could potentially result, e.g. switching from a linguistic comparison to an ordinal comparison.

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly string _haystack = "All we have to decide is what to do with the time that is given us.";

    [Benchmark(Baseline = true)]
    public int IndexOfString() => _haystack.IndexOf("v");

    [Benchmark]
    public int IndexOfChar() => _haystack.IndexOf('v');
}

| Method | Mean | Ratio |
|---|---|---|
| IndexOfString | 37.634 ns | 1.00 |
| IndexOfChar | 1.979 ns | 0.05 |

CA1862, added in dotnet/roslyn-analyzers#6662, looks for places where code is performing a case-insensitive comparison (which is fine) but doing so by first lower/uppercasing an input string and then comparing that (which is far from fine). It’s much more efficient to just use a StringComparison. dotnet/runtime#89539 fixes a few such cases.

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly string _input = "https://dot.net";

    [Benchmark(Baseline = true)]
    public bool IsHttps_ToUpper() => _input.ToUpperInvariant().StartsWith("HTTPS://");

    [Benchmark]
    public bool IsHttps_StringComparison() => _input.StartsWith("HTTPS://", StringComparison.OrdinalIgnoreCase);
}

| Method | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|
| IsHttps_ToUpper | 46.3702 ns | 1.00 | 56 B | 1.00 |
| IsHttps_StringComparison | 0.4781 ns | 0.01 | - | 0.00 |

And CA1861, added in dotnet/roslyn-analyzers#5383 from @steveberdy, looks for opportunities to lift and cache arrays being passed as arguments. dotnet/runtime#86229 addresses the issues found by the analyzer in dotnet/runtime.

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private static readonly char[] s_separator = new[] { ',', ':' };
    private readonly string _value = "1,2,3:4,5,6";

    [Benchmark(Baseline = true)]
    public string[] Split_Original() => _value.Split(new[] { ',', ':' });

    [Benchmark]
    public string[] Split_Refactored() => _value.Split(s_separator);
}

| Method | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|
| Split_Original | 108.6 ns | 1.00 | 248 B | 1.00 |
| Split_Refactored | 104.0 ns | 0.96 | 216 B | 0.87 |

Collections

Collections are the bread and butter of practically every application and service. Have more than one of something? You need a collection to manage them. And since they’re so commonly needed and used, every release of .NET invests meaningfully in improving their performance and driving down their overheads.

General

Some of the changes made in .NET 8 are largely collection-agnostic and affect a large number of collections. For example, dotnet/runtime#82499 special-cases “empty” on a bunch of the built-in collection types to return an empty singleton enumerator, thus avoiding allocating a largely useless object. This is wide-reaching, affecting List<T>, Queue<T>, Stack<T>, LinkedList<T>, PriorityQueue<TElement, TPriority>, SortedDictionary<TKey, TValue>, SortedList<TKey, TValue>, HashSet<T>, Dictionary<TKey, TValue>, and ArraySegment<T>. Interestingly, T[] was already on this plan (as were a few other collections, like ConditionalWeakTable<TKey, TValue>); if you called IEnumerable<T>.GetEnumerator on any T[] of length 0, you already got back a singleton enumerator hardcoded to return false from its MoveNext. That same enumerator singleton is what’s now returned from the GetEnumerator implementations of all of those cited collection types when they’re empty at the moment GetEnumerator is called.

// dotnet run -c Release -f net7.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId(".NET 7").WithRuntime(CoreRuntime.Core70).AsBaseline())
    .AddJob(Job.Default.WithId(".NET 8 w/o PGO").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_TieredPGO", "0"))
    .AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables", "Runtime")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly IEnumerable<int> _list = new List<int>();
    private readonly IEnumerable<int> _queue = new Queue<int>();
    private readonly IEnumerable<int> _stack = new Stack<int>();
    private readonly IEnumerable<int> _linkedList = new LinkedList<int>();
    private readonly IEnumerable<int> _hashSet = new HashSet<int>();
    private readonly IEnumerable<int> _segment = new ArraySegment<int>(Array.Empty<int>());
    private readonly IEnumerable<KeyValuePair<int, int>> _dictionary = new Dictionary<int, int>();
    private readonly IEnumerable<KeyValuePair<int, int>> _sortedDictionary = new SortedDictionary<int, int>();
    private readonly IEnumerable<KeyValuePair<int, int>> _sortedList = new SortedList<int, int>();
    private readonly IEnumerable<(int, int)> _priorityQueue = new PriorityQueue<int, int>().UnorderedItems;

    [Benchmark] public IEnumerator<int> GetList() => _list.GetEnumerator();
    [Benchmark] public IEnumerator<int> GetQueue() => _queue.GetEnumerator();
    [Benchmark] public IEnumerator<int> GetStack() => _stack.GetEnumerator();
    [Benchmark] public IEnumerator<int> GetLinkedList() => _linkedList.GetEnumerator();
    [Benchmark] public IEnumerator<int> GetHashSet() => _hashSet.GetEnumerator();
    [Benchmark] public IEnumerator<int> GetArraySegment() => _segment.GetEnumerator();
    [Benchmark] public IEnumerator<KeyValuePair<int, int>> GetDictionary() => _dictionary.GetEnumerator();
    [Benchmark] public IEnumerator<KeyValuePair<int, int>> GetSortedDictionary() => _sortedDictionary.GetEnumerator();
    [Benchmark] public IEnumerator<KeyValuePair<int, int>> GetSortedList() => _sortedList.GetEnumerator();
    [Benchmark] public IEnumerator<(int, int)> GetPriorityQueue() => _priorityQueue.GetEnumerator();
}

| Method | Job | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| GetList | .NET 7 | 15.9046 ns | 1.00 | 40 B | 1.00 |
| GetList | .NET 8 w/o PGO | 2.1016 ns | 0.13 | - | 0.00 |
| GetList | .NET 8 | 0.8954 ns | 0.06 | - | 0.00 |
| GetQueue | .NET 7 | 16.5115 ns | 1.00 | 40 B | 1.00 |
| GetQueue | .NET 8 w/o PGO | 1.8934 ns | 0.11 | - | 0.00 |
| GetQueue | .NET 8 | 1.1068 ns | 0.07 | - | 0.00 |
| GetStack | .NET 7 | 16.2183 ns | 1.00 | 40 B | 1.00 |
| GetStack | .NET 8 w/o PGO | 4.5345 ns | 0.28 | - | 0.00 |
| GetStack | .NET 8 | 2.7712 ns | 0.17 | - | 0.00 |
| GetLinkedList | .NET 7 | 19.9335 ns | 1.00 | 48 B | 1.00 |
| GetLinkedList | .NET 8 w/o PGO | 4.6176 ns | 0.23 | - | 0.00 |
| GetLinkedList | .NET 8 | 2.5660 ns | 0.13 | - | 0.00 |
| GetHashSet | .NET 7 | 15.8322 ns | 1.00 | 40 B | 1.00 |
| GetHashSet | .NET 8 w/o PGO | 1.8871 ns | 0.12 | - | 0.00 |
| GetHashSet | .NET 8 | 1.1129 ns | 0.07 | - | 0.00 |
| GetArraySegment | .NET 7 | 17.0096 ns | 1.00 | 40 B | 1.00 |
| GetArraySegment | .NET 8 w/o PGO | 3.9111 ns | 0.23 | - | 0.00 |
| GetArraySegment | .NET 8 | 1.3438 ns | 0.08 | - | 0.00 |
| GetDictionary | .NET 7 | 18.3397 ns | 1.00 | 48 B | 1.00 |
| GetDictionary | .NET 8 w/o PGO | 2.3202 ns | 0.13 | - | 0.00 |
| GetDictionary | .NET 8 | 1.0185 ns | 0.06 | - | 0.00 |
| GetSortedDictionary | .NET 7 | 49.5423 ns | 1.00 | 112 B | 1.00 |
| GetSortedDictionary | .NET 8 w/o PGO | 5.6333 ns | 0.11 | - | 0.00 |
| GetSortedDictionary | .NET 8 | 2.9824 ns | 0.06 | - | 0.00 |
| GetSortedList | .NET 7 | 18.9600 ns | 1.00 | 48 B | 1.00 |
| GetSortedList | .NET 8 w/o PGO | 4.4282 ns | 0.23 | - | 0.00 |
| GetSortedList | .NET 8 | 2.2451 ns | 0.12 | - | 0.00 |
| GetPriorityQueue | .NET 7 | 17.4375 ns | 1.00 | 40 B | 1.00 |
| GetPriorityQueue | .NET 8 w/o PGO | 4.3855 ns | 0.25 | - | 0.00 |
| GetPriorityQueue | .NET 8 | 2.8931 ns | 0.17 | - | 0.00 |

Enumerator allocations are avoided in other contexts, as well. dotnet/runtime#78613 from @madelson avoids an unnecessary enumerator allocation in HashSet<T>.SetEquals and HashSet<T>.IsProperSupersetOf, rearranging some code in order to use HashSet<T>’s struct-based enumerator rather than relying on it being boxed as an IEnumerator<T>. This both saves an allocation and avoids unnecessary interface dispatch.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly HashSet<int> _source1 = new HashSet<int> { 1, 2, 3, 4, 5 };
    private readonly IEnumerable<int> _source2 = new HashSet<int> { 1, 2, 3, 4, 5 };

    [Benchmark]
    public bool SetEquals() => _source1.SetEquals(_source2);
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| SetEquals | .NET 7.0 | 75.02 ns | 1.00 | 40 B | 1.00 |
| SetEquals | .NET 8.0 | 26.29 ns | 0.35 | - | 0.00 |

There are other places where “empty” has been special-cased. dotnet/runtime#76097 and dotnet/runtime#76764 added an Empty singleton to ReadOnlyCollection<T>, ReadOnlyDictionary<TKey, TValue>, and ReadOnlyObservableCollection<T>, and then used that singleton in a bunch of places, multiple of which accrue further to many other places that consume them. For example, Array.AsReadOnly now checks whether the array being wrapped is empty, and if it is, AsReadOnly returns ReadOnlyCollection<T>.Empty rather than allocating a new ReadOnlyCollection<T> to wrap the empty array (it also makes a similar update to ReadOnlyCollection<T>.GetEnumerator as was discussed with the previous PRs). ConcurrentDictionary<TKey, TValue>’s Keys and Values will now return the same singleton if the count is known to be 0. And so on. These kinds of changes reduce the overall “peanut butter” layer of allocation across uses of collections.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.ObjectModel;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly int[] _array = new int[0];

    [Benchmark]
    public ReadOnlyCollection<int> AsReadOnly() => Array.AsReadOnly(_array);
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| AsReadOnly | .NET 7.0 | 13.380 ns | 1.00 | 24 B | 1.00 |
| AsReadOnly | .NET 8.0 | 1.460 ns | 0.11 | - | 0.00 |
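
The new Empty singleton can also be used directly from your own code. The following is a small sketch assuming .NET 8; EmptyDemo is a hypothetical name. The identity check simply reports what the runtime does, and whether the two references match should be treated as an implementation detail rather than a documented guarantee.

```csharp
using System;
using System.Collections.ObjectModel;

Console.WriteLine(EmptyDemo.SharedEmpty().Count);
Console.WriteLine(EmptyDemo.WrapsToSameInstance());

public static class EmptyDemo
{
    // Returns the shared empty instance instead of allocating a new wrapper.
    public static ReadOnlyCollection<int> SharedEmpty() => ReadOnlyCollection<int>.Empty;

    // Reports whether wrapping an empty array hands back that shared instance.
    public static bool WrapsToSameInstance() =>
        ReferenceEquals(Array.AsReadOnly(Array.Empty<int>()), ReadOnlyCollection<int>.Empty);
}
```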

Of course, there are many much more targeted and impactful improvements for specific collection types, too.

List

The most widely used collection in .NET, other than T[], is List<T>. While that claim feels accurate, I also like to be data-driven, so as one measure, looking at the same NuGet packages we looked at earlier for enums, here’s a graph showing the number of references to the various concrete collection types:

[Figure: References to collection types in NuGet packages]

Given its ubiquity, List<T> sees a variety of improvements in .NET 8. dotnet/runtime#76043 improves the performance of its AddRange method, in particular when dealing with non-ICollection<T> inputs. When adding an ICollection<T>, AddRange reads the collection’s Count, ensures the list’s array is large enough to store all the incoming data, and then copies it as efficiently as the source collection can muster by invoking the collection’s CopyTo method to propagate the data directly into the List<T>’s backing store. But if the input enumerable isn’t an ICollection<T>, AddRange has little choice but to enumerate the collection and add each item one at a time. Prior to this release, AddRange(collection) simply delegated to InsertRange(Count, collection), which meant that when InsertRange discovered the source wasn’t an ICollection<T>, it would fall back to calling Insert(i++, item) with each item from the enumerable. That Insert method is too large to be inlined by default, plus involves additional checks that aren’t necessary for the AddRange usage (e.g. it needs to validate that the supplied position is within the range of the list, but for adding, we’re always just implicitly adding at the end, with a position implicitly known to be valid). This PR rewrote AddRange to not just delegate to InsertRange, at which point when it falls back to enumerating the non-ICollection<T> enumerable, it calls the optimized Add, which is inlineable, and which doesn’t have any extraneous checks.

// dotnet run -c Release -f net7.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId(".NET 7").WithRuntime(CoreRuntime.Core70).AsBaseline())
    .AddJob(Job.Default.WithId(".NET 8 w/o PGO").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_TieredPGO", "0"))
    .AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "EnvironmentVariables", "Runtime")]
public class Tests
{
    private readonly IEnumerable<int> _source = GetItems(1024);
    private readonly List<int> _list = new();

    [Benchmark]
    public void AddRange()
    {
        _list.Clear();
        _list.AddRange(_source);
    }

    private static IEnumerable<int> GetItems(int count)
    {
        for (int i = 0; i < count; i++) yield return i;
    }
}

For this test, I’ve configured it to run with and without PGO on .NET 8, because this particular test benefits significantly from PGO, and I want to tease those improvements apart from those that come from the cited improvements to AddRange. Why does PGO help here? Because the AddRange method will see that the type of the enumerable is always the compiler-generated iterator for GetItems and will thus generate code specific to that type, enabling the calls that would otherwise involve interface dispatch to instead be devirtualized.

| Method | Job | Mean | Ratio |
|---|---|---|---|
| AddRange | .NET 7 | 6.365 us | 1.00 |
| AddRange | .NET 8 w/o PGO | 4.396 us | 0.69 |
| AddRange | .NET 8 | 2.445 us | 0.38 |
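
The two paths AddRange takes can be sketched with a toy list. This is not the List<T> source: SimpleList<T> and Sequences are hypothetical types written for illustration, with growth and validation simplified. It only shows the shape described above, a bulk copy when the source is an ICollection<T> and a plain Add loop otherwise.

```csharp
using System;
using System.Collections.Generic;

var list = new SimpleList<int>();
list.AddRange(new[] { 1, 2, 3 });      // ICollection<T> path: one bulk copy
list.AddRange(Sequences.Squares(3));   // iterator path: one Add per item
Console.WriteLine(list.Count);         // prints 6

public static class Sequences
{
    // A compiler-generated iterator; it does not implement ICollection<T>.
    public static IEnumerable<int> Squares(int count)
    {
        for (int i = 0; i < count; i++) yield return i * i;
    }
}

// Hypothetical, simplified list used only to illustrate the two code paths.
public sealed class SimpleList<T>
{
    private T[] _items = new T[4];

    public int Count { get; private set; }

    public T this[int index] =>
        (uint)index < (uint)Count ? _items[index] : throw new ArgumentOutOfRangeException(nameof(index));

    public void Add(T item)
    {
        if (Count == _items.Length) Array.Resize(ref _items, _items.Length * 2);
        _items[Count++] = item;
    }

    public void AddRange(IEnumerable<T> source)
    {
        if (source is ICollection<T> collection)
        {
            // Known count: grow once, then let the source copy itself in bulk.
            int needed = Count + collection.Count;
            if (needed > _items.Length) Array.Resize(ref _items, Math.Max(needed, _items.Length * 2));
            collection.CopyTo(_items, Count);
            Count = needed;
        }
        else
        {
            // Unknown count: append via Add, which needs no index validation
            // and no element shifting, unlike a general-purpose Insert.
            foreach (T item in source) Add(item);
        }
    }
}
```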

AddRange has improved in other ways, too. One of the long-requested features for List<T>, ever since spans were introduced in .NET Core 2.1, was better integration between List<T> and {ReadOnly}Span<T>. dotnet/runtime#76274 provides that, adding support to both AddRange and InsertRange for data stored in a ReadOnlySpan<T>, and also support for copying all of the data in a List<T> to a Span<T> via a CopyTo method. It was of course previously possible to achieve this, but doing so required handling one element at a time, which when compared to vectorized copy implementations is significantly slower.

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly int[] _source = new int[1024];
    private readonly List<int> _list = new();

    [Benchmark(Baseline = true)]
    public void OpenCoded()
    {
        _list.Clear();
        foreach (int i in (ReadOnlySpan<int>)_source)
        {
            _list.Add(i);
        }
    }

    [Benchmark]
    public void AddRange()
    {
        _list.Clear();
        _list.AddRange((ReadOnlySpan<int>)_source);
    }
}

| Method | Mean | Ratio |
|---|---|---|
| OpenCoded | 1,261.66 ns | 1.00 |
| AddRange | 51.74 ns | 0.04 |

You may note that these new AddRange, InsertRange, and CopyTo methods were added as extension methods rather than as instance methods on List<T>. That was done for a few reasons, but the primary motivating factor was avoiding ambiguity. Consider this example:

var c = new MyCollection<int>();
c.AddRange(new int[] { 1, 2, 3 });

public class MyCollection<T>
{
    public void AddRange(IEnumerable<T> source) { }
    public void AddRange(ReadOnlySpan<T> source) { }
}

This will fail to compile with:

error CS0121: The call is ambiguous between the following methods or properties: ‘MyCollection<T>.AddRange(IEnumerable<T>)’ and ‘MyCollection<T>.AddRange(ReadOnlySpan<T>)’

because an array T[] both implements IEnumerable<T> and has an implicit conversion to ReadOnlySpan<T>, and as such the compiler doesn’t know which to use. It’s likely this ambiguity will be resolved in a future version of the language, but for now we resolved it ourselves by making the span-based overload an extension method:

namespace System.Collections.Generic
{
    public static class CollectionExtensions
    {
        public static void AddRange<T>(this List<T> list, ReadOnlySpan<T> source) { ... }
    }
}
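
Here is a short usage sketch of the three span-based extension methods, assuming .NET 8. SpanListDemo is a hypothetical name. The source variables are typed as ReadOnlySpan<int> explicitly so that overload resolution finds the extension methods rather than the IEnumerable<T>-based instance methods.

```csharp
using System;
using System.Collections.Generic;

Console.WriteLine(string.Join(",", SpanListDemo.Build())); // prints -1,0,1,2,3,4,5

public static class SpanListDemo
{
    public static int[] Build()
    {
        var list = new List<int> { 1, 2, 3 };

        // Append from stack-allocated data without an intermediate array.
        ReadOnlySpan<int> tail = stackalloc int[] { 4, 5 };
        list.AddRange(tail);

        // Insert span data at the front of the list.
        ReadOnlySpan<int> head = stackalloc int[] { -1, 0 };
        list.InsertRange(0, head);

        // Copy the list's contents into a caller-provided span.
        Span<int> destination = new int[list.Count];
        list.CopyTo(destination);
        return destination.ToArray();
    }
}
```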

The other significant addition for List<T> comes in dotnet/runtime#82146 from @MichalPetryka. In .NET 5, the CollectionsMarshal.AsSpan(List<T>) method was added; it returns a Span<T> for the in-use area of a List<T>’s backing store. For example, if you write:

var list = new List<int>(42) { 1, 2, 3 };
Span<int> span = CollectionsMarshal.AsSpan(list);

that will provide you with a Span<int> with length 3, since the list’s Count is 3. This is very useful for a variety of scenarios, in particular for consuming a List<T>’s data via span-based APIs. It doesn’t, however, enable scenarios that want to efficiently write to a List<T>, in particular where it would require increasing a List<T>’s count. Let’s say, for example, you wanted to create a new List<char> that contained 100 ‘a’ values. You might think you could write:

var list = new List<char>(100);
Span<char> span = CollectionsMarshal.AsSpan(list); // oops
span.Fill('a');

but that won’t impact the contents of the created list at all, because the span’s Length will match the Count of the list: 0. What we need to be able to do is change the count of the list, effectively telling it “pretend like 100 values were just added to you, even though they weren’t.” This PR adds the new SetCount method, which does just that. We can now write the previous example like:

var list = new List<char>();
CollectionsMarshal.SetCount(list, 100);
Span<char> span = CollectionsMarshal.AsSpan(list);
span.Fill('a'); // yay!

and we will successfully find ourselves with a list containing 100 ‘a’ elements.
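
The same pattern works for values that have to be computed rather than filled. The sketch below assumes .NET 8; SetCountDemo and CreateSquares are hypothetical names. Because growing the count exposes slots that no one has written yet, the code writes every element of the span before returning the list.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

List<int> squares = SetCountDemo.CreateSquares(5);
Console.WriteLine(string.Join(",", squares)); // prints 0,1,4,9,16

public static class SetCountDemo
{
    public static List<int> CreateSquares(int count)
    {
        var list = new List<int>();

        // Grow the list to 'count' elements without adding them one by one.
        CollectionsMarshal.SetCount(list, count);

        // The span now covers all 'count' slots; write each of them.
        Span<int> span = CollectionsMarshal.AsSpan(list);
        for (int i = 0; i < span.Length; i++)
        {
            span[i] = i * i;
        }
        return list;
    }
}
```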

LINQ

That new SetCount method is not only exposed publicly, it’s also used as an implementation detail now in LINQ (Language-Integrated Query), thanks to dotnet/runtime#85288. Enumerable’s ToList method now benefits from this in a variety of places. For example, calling Enumerable.Repeat('a', 100).ToList() will behave very much like the previous example (albeit with an extra enumerable allocation for the Repeat), creating a new list, using SetCount to set its count to 100, getting the backing span, and calling Fill to populate it. The impact of directly writing to the span rather than going through List<T>.Add for each item is visible in the following examples:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly IEnumerable<int> _source = Enumerable.Range(0, 1024).ToArray();

    [Benchmark]
    public List<int> SelectToList() => _source.Select(i => i * 2).ToList();

    [Benchmark]
    public List<byte> RepeatToList() => Enumerable.Repeat((byte)'a', 1024).ToList();

    [Benchmark]
    public List<int> RangeSelectToList() => Enumerable.Range(0, 1024).Select(i => i * 2).ToList();
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| SelectToList | .NET 7.0 | 2,627.8 ns | 1.00 |
| SelectToList | .NET 8.0 | 1,096.6 ns | 0.42 |
| RepeatToList | .NET 7.0 | 1,543.2 ns | 1.00 |
| RepeatToList | .NET 8.0 | 106.1 ns | 0.07 |
| RangeSelectToList | .NET 7.0 | 2,908.9 ns | 1.00 |
| RangeSelectToList | .NET 8.0 | 865.2 ns | 0.29 |

In the case of SelectToList and RangeSelectToList, the benefit is almost entirely due to writing directly into the span for each element vs the overhead of Add. In the case of RepeatToList, because the ToList call has direct access to the span, it’s able to use the vectorized Fill method (as it was previously doing just for ToArray), achieving an even larger speedup.

You’ll note that I didn’t include a test for Enumerable.Range(...).ToList() above. That’s because it was improved in other ways, and I didn’t want to conflate them in the measurements. In particular, dotnet/runtime#87992 from @neon-sunset vectorized the internal Fill method that’s used by the specialization of both ToArray and ToList on the iterator returned from Enumerable.Range. That means that rather than writing one int at a time, on a system that supports 128-bit vectors (which is pretty much all hardware you might use today) it’ll instead write four ints at a time, and on a system that supports 256-bit vectors, it’ll write eight ints at a time. Thus, Enumerable.Range(...).ToList() benefits both from writing directly into the span and from the now vectorized implementation, which means it ends up with similar speedups as RepeatToList above. We can also tease apart these improvements by changing what instruction sets are seen as available.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    [Benchmark]
    public List<int> RangeToList() => Enumerable.Range(0, 16_384).ToList();
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| RangeToList | .NET 7.0 | 25.374 us | 1.00 |
| RangeToList | .NET 8.0 | 6.872 us | 0.27 |

These optimized span-based implementations now also accrue to other usage beyond ToArray and ToList. If you look at the Enumerable.Repeat and Enumerable.Range implementations in .NET Framework, you’ll see that they’re just normal C# iterators, e.g.

static IEnumerable<int> RangeIterator(int start, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return start + i;
    }
}

but years ago, these methods were changed in .NET Core to return a custom iterator (just a normal class implementing IEnumerator<T> where we provide the full implementation rather than the compiler doing so). Once we have a dedicated type, we can add additional interfaces to it, and dotnet/runtime#88249 does exactly that, making these internal RangeIterator, RepeatIterator, and several other types implement IList<T>. That then means that any code which queries an IEnumerable<T> for whether it implements IList<T>, such as to use its Count and CopyTo methods, will light up when passed one of these instances as well. And the same Fill implementation that’s used internally to implement ToArray and ToList is then used as well with CopyTo. That means if you write code like:

List<T> list = ...;
IEnumerable<T> enumerable = ...;
list.AddRange(enumerable);

and that enumerable came from one of these enlightened types, it’ll now benefit from the exact same use of vectorization previously discussed, as the List<T> will ensure its array is appropriately sized to handle the incoming data and will then hand its array off to the iterator’s ICollection<T>.CopyTo method to write into directly.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly List<byte> _list = new();

    [Benchmark]
    public void AddRange()
    {
        _list.Clear();
        _list.AddRange(Enumerable.Repeat((byte)'a', 1024));
    }
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| AddRange | .NET 7.0 | 6,826.89 ns | 1.000 |
| AddRange | .NET 8.0 | 20.30 ns | 0.003 |
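
Your own helpers can pick up the same benefit by probing for the collection interface before falling back to enumeration. The following is a hedged sketch rather than anything from the runtime: CollectionProbe and ToArrayFast are hypothetical names. Which LINQ iterators actually implement the interface depends on the runtime version, so the code is written to produce the same result on either path.

```csharp
using System;
using System.Collections.Generic;

int[] copy = CollectionProbe.ToArrayFast(new List<int> { 1, 2, 3 });
Console.WriteLine(copy.Length); // prints 3

public static class CollectionProbe
{
    public static T[] ToArrayFast<T>(IEnumerable<T> source)
    {
        if (source is ICollection<T> collection)
        {
            // Count is known up front, so allocate once and let the source
            // copy itself (possibly with a vectorized implementation).
            var result = new T[collection.Count];
            collection.CopyTo(result, 0);
            return result;
        }

        // Fallback: enumerate one item at a time.
        var buffer = new List<T>();
        foreach (T item in source) buffer.Add(item);
        return buffer.ToArray();
    }
}
```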

Vectorization with LINQ was also improved in other ways. In .NET 7, Enumerable.Min and Enumerable.Max were taught how to vectorize the handling of some inputs (when the enumerable was actually an array or list of int or long values), and in .NET 8 dotnet/runtime#76144 expanded that to cover byte, sbyte, ushort, short, uint, ulong, nint, and nuint as well (it also switched the implementation from using Vector<T> to using both Vector128<T> and Vector256<T>, so that shorter inputs could still benefit from some level of vectorization).

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly byte[] _values = Enumerable.Range(0, 4096).Select(_ => (byte)Random.Shared.Next(0, 256)).ToArray();
    [Benchmark]
    public byte Max() => _values.Max();
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| Max | .NET 7.0 | 16,496.96 ns | 1.000 |
| Max | .NET 8.0 | 53.77 ns | 0.003 |
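
For a feel of what a vectorized maximum over bytes can look like, here is a simplified sketch using only Vector128<T>. It is not the Enumerable.Max implementation (which, per the description above, also uses Vector256<T>); the MaxByte name and the loop structure are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Runtime.Intrinsics;

// Illustrative sketch only; the real LINQ implementation differs.
static byte MaxByte(ReadOnlySpan<byte> values)
{
    if (values.IsEmpty)
    {
        throw new InvalidOperationException("Sequence contains no elements");
    }

    byte max = values[0];
    int i = 0;

    if (Vector128.IsHardwareAccelerated && values.Length >= Vector128<byte>.Count)
    {
        // Each lane tracks the largest value seen so far at that lane position.
        Vector128<byte> best = Vector128.Create(values);
        for (i = Vector128<byte>.Count; i <= values.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
        {
            best = Vector128.Max(best, Vector128.Create(values.Slice(i)));
        }

        // Reduce the per-lane maxima down to a single value.
        for (int lane = 0; lane < Vector128<byte>.Count; lane++)
        {
            max = Math.Max(max, best.GetElement(lane));
        }
    }

    // Scalar loop for whatever the vector loop did not cover.
    for (; i < values.Length; i++)
    {
        max = Math.Max(max, values[i]);
    }
    return max;
}

byte[] data = new byte[100];
for (int i = 0; i < data.Length; i++)
{
    data[i] = (byte)(i * 7 % 251);
}
Console.WriteLine(MaxByte(data) == data.Max()); // True
```

With 128-bit vectors, sixteen bytes are compared per iteration, which is why narrow element types like byte stand to gain the most from this kind of change.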

Enumerable.Sum has now also been vectorized, for int and long, thanks to dotnet/runtime#84519 from @brantburnett. Sum in LINQ performs checked arithmetic, and normal Vector<T> operations are unchecked, which makes the vectorization of this method a bit more challenging. To achieve it, the implementation takes advantage of a neat little bit hack for determining whether an addition of two signed two’s-complement numbers underflows or overflows. The same logic applies for both int and long here, so we’ll focus just on int. Adding a negative int to a positive int can never overflow, so the only way two summed values can underflow or overflow is if they have the same sign. Further, if any wrapping occurs, it can’t wrap back to the same sign; if you add two positive numbers together and it overflows, the result will be negative, and if you add two negative numbers together and it underflows, the result will be positive. Thus, a function like this can tell us whether the sum wrapped:

static int Sum(int a, int b, out bool overflow)
{
    int sum = a + b;
    overflow = (((sum ^ a) & (sum ^ b)) & int.MinValue) != 0;
    return sum;
}

We’re xor’ing the result with each of the inputs, and and’ing those together. That will produce a number whose top-most bit is 1 if there was overflow/underflow, and otherwise 0, so we can then mask off all the other bits and compare to 0 to determine whether wrapping occurred. This is useful for vectorization, because we can easily do the same thing with vectors, summing the two vectors and reporting on whether any of the elemental sums overflowed:

static Vector128<int> Sum(Vector128<int> a, Vector128<int> b, out bool overflow)
{
    Vector128<int> sum = a + b;
    overflow = (((sum ^ a) & (sum ^ b)) & Vector128.Create(int.MinValue)) != Vector128<int>.Zero;
    return sum;
}

With that, Enumerable.Sum can be vectorized. For sure, it’s not as efficient as if we didn’t need to care about the checked; after all, for every addition operation, there’s at least an extra set of instructions for the two xors and the and’ing of them (we can amortize the bit check across several operations by doing some loop unrolling). With 256-bit vectors, an ideal speedup for such a sum operation over int values would be 8x, since we can process eight 32-bit values at a time in a 256-bit vector. We’re then doing fairly well that we get a 4x speedup in that situation:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly IEnumerable<int> _values = Enumerable.Range(0, 1024).ToArray();
    [Benchmark]
    public int Sum() => _values.Sum();
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| Sum | .NET 7.0 | 347.28 ns | 1.00 |
| Sum | .NET 8.0 | 78.26 ns | 0.23 |
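
The sign-bit trick described above is easy to sanity-check on a few inputs. The sketch below compares the bit-trick flag against a reference answer computed in 64-bit arithmetic; the names WrappingAdd and OverflowsReference are made up for this example.

```csharp
using System;

// Same logic as the scalar Sum helper shown earlier, under a hypothetical name.
static int WrappingAdd(int a, int b, out bool overflow)
{
    int sum = unchecked(a + b);
    // The top bit of (sum ^ a) & (sum ^ b) is set only when both inputs
    // share a sign and the result has the opposite sign.
    overflow = (((sum ^ a) & (sum ^ b)) & int.MinValue) != 0;
    return sum;
}

// Reference answer: do the addition in 64 bits and check the 32-bit range.
static bool OverflowsReference(int a, int b)
{
    long wide = (long)a + b;
    return wide < int.MinValue || wide > int.MaxValue;
}

(int A, int B)[] cases =
{
    (1, 2),
    (int.MaxValue, 1),
    (int.MinValue, -1),
    (int.MaxValue, int.MinValue),
    (-5, 5),
};

foreach ((int a, int b) in cases)
{
    WrappingAdd(a, b, out bool flagged);
    Console.WriteLine($"{a} + {b}: bit trick = {flagged}, reference = {OverflowsReference(a, b)}");
}
```

The two columns agree for every pair, including the mixed-sign cases where no overflow is possible.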

LINQ has improved in .NET 8 beyond just vectorization; other operators have seen other kinds of optimization. Take Order/OrderDescending, for example. These LINQ operators implement a “stable sort”; that means that while sorting the data, if two items compare equally, they’ll end up in the final result in the same order they were in the original (an “unstable sort” doesn’t care about the ordering of two values that compare equally). The core sorting routine shared by spans, arrays, and lists in .NET (e.g. Array.Sort) provides an unstable sort, so to use that implementation and provide stable ordering guarantees, LINQ has to layer the stability on top, which it does by factoring into the comparison operation between keys the original location of the key in the input (e.g. if two values otherwise compare equally, then it proceeds to compare their original locations). That, however, means it needs to remember their original locations, which means it needs to allocate a separate int[] for positions. Interestingly, though, sometimes you can’t tell the difference between whether a sort is stable or unstable. dotnet/runtime#76733 takes advantage of the fact that for primitive types like int, two values that compare equally with the default comparer are indistinguishable, in which case it’s fine to use an unstable sort because the only values that can compare equally have identical bits and thus trying to maintain an order between them doesn’t matter. It thus enables avoiding all of the overhead associated with maintaining a stable sort.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private IEnumerable<int> _source;
    [GlobalSetup]
    public void Setup() => _source = Enumerable.Range(0, 1000).Reverse();
    [Benchmark]
    public int EnumerateOrdered()
    {
        int sum = 0;
        foreach (int i in _source.Order())
        {
            sum += i;
        }
        return sum;
    }
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- |
| EnumerateOrdered | .NET 7.0 | 73.728 us | 1.00 | 8.09 KB | 1.00 |
| EnumerateOrdered | .NET 8.0 | 9.753 us | 0.13 | 4.02 KB | 0.50 |
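
The distinction between stable and unstable sorting only becomes visible when elements that compare equally can still be told apart. A small example, with made-up sample data, shows where stability is observable and where it is not:

```csharp
using System;
using System.Linq;

var people = new[]
{
    (Name: "Ann", Age: 30),
    (Name: "Bob", Age: 25),
    (Name: "Cid", Age: 30),
    (Name: "Dee", Age: 25),
};

// OrderBy is a stable sort: people with the same Age keep their input order,
// so Bob stays ahead of Dee and Ann stays ahead of Cid.
string byAge = string.Join(", ", people.OrderBy(p => p.Age).Select(p => p.Name));
Console.WriteLine(byAge); // Bob, Dee, Ann, Cid

// With plain ints and the default comparer, equal values are indistinguishable,
// so no caller could observe whether the sort was stable.
int[] numbers = { 3, 1, 3, 2, 1 };
string ordered = string.Join(", ", numbers.Order());
Console.WriteLine(ordered); // 1, 1, 2, 3, 3
```

In the second case there is nothing for the position array to protect, which is the observation the PR builds on.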

dotnet/runtime#76418 also improves sorting in LINQ, this time for OrderBy/OrderByDescending, and in particular when the type of the key used (the type returned by the keySelector delegate provided to OrderBy) is a value type and the default comparer is used. This change employs the same approach that some of the .NET collections like Dictionary<TKey, TValue> already do, which is to take advantage of the fact that value types when used as generics get a custom copy of the code dedicated to that type (“generic specialization”), and that Comparer<TValueType>.Default.Compare will get devirtualized and possibly inlined. As such, it adds a dedicated path for when the key is a value type, and that enables the comparison operation (which is invoked O(n log n) times) to be sped up.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly int[] _values = Enumerable.Range(0, 1_000_000).Reverse().ToArray();
    [Benchmark]
    public int OrderByToArray()
    {
        int sum = 0;
        foreach (int i in _values.OrderBy(i => i * 2)) sum += i;
        return sum;
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| OrderByToArray | .NET 7.0 | 187.17 ms | 1.00 |
| OrderByToArray | .NET 8.0 | 67.54 ms | 0.36 |

Of course, sometimes the most efficient use of LINQ is simply not using it. It’s an amazing productivity tool, and it goes to great lengths to be efficient, but sometimes there are better answers that are just as simple. CA1860, added in dotnet/roslyn-analyzers#6236 from @CollinAlpert, flags one such case. It looks for use of Enumerable.Any on collections that directly expose a Count, Length, or IsEmpty property that could be used instead. While Any does use Enumerable.TryGetNonEnumeratedCount in an attempt to check the collection’s number of items without allocating or using an enumerator, even if it’s successful in doing so it incurs the overhead of the interface check and dispatch. It’s faster to just use the properties directly. dotnet/runtime#81583 fixed several cases of this.

// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly string _str = "hello";
    private readonly List<int> _list = new() { 1, 2, 3 };
    private readonly int[] _array = new int[] { 4, 5, 6 };
    [Benchmark(Baseline = true)]
    public bool AllNonEmpty_Any() =>
        _str.Any() &&
        _list.Any() &&
        _array.Any();
    [Benchmark]
    public bool AllNonEmpty_Property() =>
        _str.Length != 0 &&
        _list.Count != 0 &&
        _array.Length != 0;
}

| Method | Mean | Ratio |
| --- | --- | --- |
| AllNonEmpty_Any | 12.5302 ns | 1.00 |
| AllNonEmpty_Property | 0.3701 ns | 0.03 |

Dictionary

In addition to making existing methods faster, LINQ has also gained some new methods in .NET 8. dotnet/runtime#85811 from @lateapexearlyspeed added new overloads of ToDictionary. Unlike the existing overloads that are extensions on any arbitrary IEnumerable<TSource> and accept delegates for extracting from each TSource a TKey and/or TValue, these new overloads are extensions on IEnumerable<KeyValuePair<TKey, TValue>> and IEnumerable<(TKey, TValue)>. This is primarily an addition for convenience, as it means that such an enumerable that previously used code like:

return collection.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

can instead be simplified to just be:

return collection.ToDictionary();

Beyond being simpler, this has the nice benefit of also being cheaper, as it means the method doesn’t need to invoke two delegates per item. It also means that this new method is a simple passthrough to Dictionary<TKey, TValue>’s constructor, which has its own optimizations that take advantage of knowing about Dictionary<TKey, TValue> internals, e.g. it can more efficiently copy the source data if it’s a Dictionary<TKey, TValue>.

// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly IEnumerable<KeyValuePair<string, int>> _source = Enumerable.Range(0, 1024).ToDictionary(i => i.ToString(), i => i);
    [Benchmark(Baseline = true)]
    public Dictionary<string, int> WithDelegates() => _source.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
    [Benchmark]
    public Dictionary<string, int> WithoutDelegates() => _source.ToDictionary();
}

| Method | Mean | Ratio |
| --- | --- | --- |
| WithDelegates | 21.208 us | 1.00 |
| WithoutDelegates | 8.652 us | 0.41 |

It also benefits from Dictionary<TKey, TValue>’s constructor being optimized in additional ways. As noted, its constructor accepting an IEnumerable<KeyValuePair<TKey, TValue>> already special-cased when the enumerable is actually a Dictionary<TKey, TValue>. With dotnet/runtime#86254, it now also special-cases when the enumerable is a KeyValuePair<TKey, TValue>[] or a List<KeyValuePair<TKey, TValue>>. When such a source is found, a span is extracted from it (a simple cast for an array, or via CollectionsMarshal.AsSpan for a List<>), and then that span (rather than the original IEnumerable<>) is what’s enumerated. That saves an enumerator allocation and several interface dispatches per item for these reasonably common cases.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly List<KeyValuePair<int, int>> _list = Enumerable.Range(0, 1000).Select(i => new KeyValuePair<int, int>(i, i)).ToList();
    [Benchmark] public Dictionary<int, int> FromList() => new Dictionary<int, int>(_list);
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| FromList | .NET 7.0 | 12.250 us | 1.00 |
| FromList | .NET 8.0 | 6.780 us | 0.55 |
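
The span-extraction pattern is one you can apply in your own code as well. The following is a rough sketch of the idea rather than the Dictionary<TKey, TValue> constructor itself; SumValues is a hypothetical helper chosen only to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical helper illustrating the pattern; not the Dictionary constructor.
static long SumValues(IEnumerable<KeyValuePair<int, int>> source)
{
    ReadOnlySpan<KeyValuePair<int, int>> span;
    if (source is KeyValuePair<int, int>[] array)
    {
        span = array; // arrays convert to spans implicitly
    }
    else if (source is List<KeyValuePair<int, int>> list)
    {
        // Exposes the list's backing array; the list must not be modified while the span is in use.
        span = CollectionsMarshal.AsSpan(list);
    }
    else
    {
        // General path: allocates an enumerator and dispatches through interfaces per item.
        long slowTotal = 0;
        foreach (KeyValuePair<int, int> item in source)
        {
            slowTotal += item.Value;
        }
        return slowTotal;
    }

    // Fast path: a plain indexed loop over contiguous memory.
    long total = 0;
    foreach (KeyValuePair<int, int> pair in span)
    {
        total += pair.Value;
    }
    return total;
}

var pairs = new List<KeyValuePair<int, int>> { new(1, 10), new(2, 20), new(3, 30) };
Console.WriteLine(SumValues(pairs));           // 60
Console.WriteLine(SumValues(pairs.ToArray())); // 60
```

The results are identical on every path; only the per-item cost differs.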

The most common operation performed on a dictionary is looking up a key, whether to see if it exists, to add a value, or to get the current value. Previous .NET releases have seen significant improvements in this lookup time, but even better than optimizing a lookup is not needing to do one at all. One common place we’ve seen unnecessary lookups is with guard clauses that end up being unnecessary, for example code that does:

if (!dictionary.ContainsKey(key))
{
    dictionary.Add(key, value);
}

This incurs two lookups, one as part of ContainsKey, and then if the key wasn’t in the dictionary, another as part of the Add call. Code can instead achieve the same operation with:

dictionary.TryAdd(key, value);

which incurs only one lookup. CA1864, added in dotnet/roslyn-analyzers#6199 from @CollinAlpert, looks for such places where an Add call is guarded by a ContainsKey call. dotnet/runtime#88700 fixed a few occurrences of this in dotnet/runtime.

// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly Dictionary<string, string> _dict = new();
    [Benchmark(Baseline = true)]
    public void ContainsThenAdd()
    {
        _dict.Clear();
        if (!_dict.ContainsKey("key"))
        {
            _dict.Add("key", "value");
        }
    }
    [Benchmark]
    public void TryAdd()
    {
        _dict.Clear();
        _dict.TryAdd("key", "value");
    }
}

| Method | Mean | Ratio |
| --- | --- | --- |
| ContainsThenAdd | 25.93 ns | 1.00 |
| TryAdd | 19.50 ns | 0.75 |

Similarly, dotnet/roslyn-analyzers#6767 from @mpidash added CA1868, which looks for Add or Remove calls on ISet<T>s where the call is guarded by a Contains, and recommends removing the Contains call. dotnet/runtime#89652 from @mpidash fixes occurrences of this in dotnet/runtime.

// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly HashSet<string> _set = new();
    [Benchmark(Baseline = true)]
    public bool ContainsThenAdd()
    {
        _set.Clear();
        if (!_set.Contains("key"))
        {
            _set.Add("key");
            return true;
        }
        return false;
    }
    [Benchmark]
    public bool Add()
    {
        _set.Clear();
        return _set.Add("key");
    }
}

| Method | Mean | Ratio |
| --- | --- | --- |
| ContainsThenAdd | 22.98 ns | 1.00 |
| Add | 17.99 ns | 0.78 |

Other related analyzers previously released have also been improved. dotnet/roslyn-analyzers#6387 improved CA1854 to find more opportunities for using IDictionary<TKey, TValue>.TryGetValue, with dotnet/runtime#85613 and dotnet/runtime#80996 using the analyzer to find and fix more occurrences.
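
As a reminder of the shape of code CA1854 targets, here is a small before/after sketch. The method names and sample data are invented for the example; the point is just that a ContainsKey check followed by an indexer read performs two lookups where TryGetValue performs one.

```csharp
using System;
using System.Collections.Generic;

// Two lookups: one for ContainsKey, another for the indexer.
static int GetOrZeroTwoLookups(Dictionary<string, int> stock, string key)
{
    if (stock.ContainsKey(key))
    {
        return stock[key];
    }
    return 0;
}

// One lookup: TryGetValue both tests for the key and retrieves the value.
static int GetOrZeroOneLookup(Dictionary<string, int> stock, string key)
{
    return stock.TryGetValue(key, out int count) ? count : 0;
}

var stock = new Dictionary<string, int> { ["apples"] = 3 };
Console.WriteLine(GetOrZeroTwoLookups(stock, "apples")); // 3
Console.WriteLine(GetOrZeroOneLookup(stock, "apples"));  // 3
Console.WriteLine(GetOrZeroOneLookup(stock, "pears"));   // 0
```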

Other dictionaries have also improved in .NET 8. ConcurrentDictionary<TKey, TValue> in particular got a nice boost from dotnet/runtime#81557, for all key types but especially for the very common case where TKey is string and the equality comparer is either the default comparer (whether that be null, EqualityComparer<TKey>.Default, or StringComparer.Ordinal, all of which behave identically) or StringComparer.OrdinalIgnoreCase. In .NET Core, string hash codes are randomized, meaning there’s a random seed value unique to any given process that’s incorporated into string hash codes. So if, for example, I run the following program:

// dotnet run -f net8.0
string s = "Hello, world!";
Console.WriteLine(s.GetHashCode());
Console.WriteLine(s.GetHashCode());
Console.WriteLine(s.GetHashCode());

I get the following output, showing that the hash code for a given string is stable across multiple GetHashCode calls:

1442385232
1442385232
1442385232

but when I run the program again, I get a different stable value:

740992523
740992523
740992523

This randomization is done to help mitigate a class of denial-of-service (DoS) attacks involving dictionaries, where an attacker might be able to trigger the worst-case algorithmic complexity of a dictionary by forcing lots of collisions amongst the keys. However, the randomization also incurs some amount of overhead. It’s enough overhead that Dictionary<TKey, TValue> actually special-cases string keys with a default or OrdinalIgnoreCase comparer to skip the randomization until a sufficient number of collisions has been detected. Now in .NET 8, ConcurrentDictionary<string, TValue> employs the same trick. When it starts life, a ConcurrentDictionary<string, TValue> instance using a default or OrdinalIgnoreCase comparer performs hashing using a non-randomized comparer. Then as it’s adding an item and traversing its internal data structure, it keeps track of how many keys it has to examine that had the same hash code. If that count surpasses a threshold, it then switches back to using a randomized comparer, rehashing the whole dictionary in order to mitigate possible attacks.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Concurrent;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private KeyValuePair<string, string>[] _pairs;
    private ConcurrentDictionary<string, string> _cd;
    [GlobalSetup]
    public void Setup()
    {
        _pairs =
            // from https://github.com/dotnet/runtime/blob/a30de6d40f69ef612b514344a5ec83fffd10b957/src/libraries/System.Formats.Asn1/src/System/Formats/Asn1/WellKnownOids.cs#L317-L419
            new[]
            {
                "1.2.840.10040.4.1", "1.2.840.10040.4.3", "1.2.840.10045.2.1", "1.2.840.10045.1.1", "1.2.840.10045.1.2", "1.2.840.10045.3.1.7", "1.2.840.10045.4.1", "1.2.840.10045.4.3.2", "1.2.840.10045.4.3.3", "1.2.840.10045.4.3.4",
                "1.2.840.113549.1.1.1", "1.2.840.113549.1.1.5", "1.2.840.113549.1.1.7", "1.2.840.113549.1.1.8", "1.2.840.113549.1.1.9", "1.2.840.113549.1.1.10", "1.2.840.113549.1.1.11", "1.2.840.113549.1.1.12", "1.2.840.113549.1.1.13",
                "1.2.840.113549.1.5.3", "1.2.840.113549.1.5.10", "1.2.840.113549.1.5.11", "1.2.840.113549.1.5.12", "1.2.840.113549.1.5.13", "1.2.840.113549.1.7.1", "1.2.840.113549.1.7.2", "1.2.840.113549.1.7.3", "1.2.840.113549.1.7.6",
                "1.2.840.113549.1.9.1", "1.2.840.113549.1.9.3", "1.2.840.113549.1.9.4", "1.2.840.113549.1.9.5", "1.2.840.113549.1.9.6", "1.2.840.113549.1.9.7", "1.2.840.113549.1.9.14", "1.2.840.113549.1.9.15", "1.2.840.113549.1.9.16.1.4",
                "1.2.840.113549.1.9.16.2.12", "1.2.840.113549.1.9.16.2.14", "1.2.840.113549.1.9.16.2.47", "1.2.840.113549.1.9.20", "1.2.840.113549.1.9.21", "1.2.840.113549.1.9.22.1", "1.2.840.113549.1.12.1.3", "1.2.840.113549.1.12.1.5",
                "1.2.840.113549.1.12.1.6", "1.2.840.113549.1.12.10.1.1", "1.2.840.113549.1.12.10.1.2", "1.2.840.113549.1.12.10.1.3", "1.2.840.113549.1.12.10.1.5", "1.2.840.113549.1.12.10.1.6", "1.2.840.113549.2.5", "1.2.840.113549.2.7",
                "1.2.840.113549.2.9", "1.2.840.113549.2.10", "1.2.840.113549.2.11", "1.2.840.113549.3.2", "1.2.840.113549.3.7", "1.3.6.1.4.1.311.17.1", "1.3.6.1.4.1.311.17.3.20", "1.3.6.1.4.1.311.20.2.3", "1.3.6.1.4.1.311.88.2.1",
                "1.3.6.1.4.1.311.88.2.2", "1.3.6.1.5.5.7.3.1", "1.3.6.1.5.5.7.3.2", "1.3.6.1.5.5.7.3.3", "1.3.6.1.5.5.7.3.4", "1.3.6.1.5.5.7.3.8", "1.3.6.1.5.5.7.3.9", "1.3.6.1.5.5.7.6.2", "1.3.6.1.5.5.7.48.1", "1.3.6.1.5.5.7.48.1.2",
                "1.3.6.1.5.5.7.48.2", "1.3.14.3.2.26", "1.3.14.3.2.7", "1.3.132.0.34", "1.3.132.0.35", "2.5.4.3", "2.5.4.5", "2.5.4.6", "2.5.4.7", "2.5.4.8", "2.5.4.10", "2.5.4.11", "2.5.4.97", "2.5.29.14", "2.5.29.15", "2.5.29.17", "2.5.29.19",
                "2.5.29.20", "2.5.29.35", "2.16.840.1.101.3.4.1.2", "2.16.840.1.101.3.4.1.22", "2.16.840.1.101.3.4.1.42", "2.16.840.1.101.3.4.2.1", "2.16.840.1.101.3.4.2.2", "2.16.840.1.101.3.4.2.3", "2.23.140.1.2.1", "2.23.140.1.2.2",
            }.Select(s => new KeyValuePair<string, string>(s, s)).ToArray();
        _cd = new ConcurrentDictionary<string, string>(_pairs, StringComparer.OrdinalIgnoreCase);
    }
    [Benchmark]
    public int TryGetValue()
    {
        int count = 0;
        foreach (KeyValuePair<string, string> pair in _pairs)
        {
            if (_cd.TryGetValue(pair.Key, out _))
            {
                count++;
            }
        }
        return count;
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| TryGetValue | .NET 7.0 | 2.917 us | 1.00 |
| TryGetValue | .NET 8.0 | 1.462 us | 0.50 |

The above benchmark also benefited from dotnet/runtime#77005, which tweaked another long-standing optimization in the type. ConcurrentDictionary<TKey, TValue> maintains a Node object for every key/value pair it stores. As multiple threads might be reading from the dictionary concurrent with updates happening, the dictionary needs to be really careful about how it mutates data stored in the collection. If an update is performed that needs to update a TValue in an existing node (e.g. cd[existingKey] = newValue), the dictionary needs to be very careful to avoid torn reads, such that one thread could be reading the value while another thread is writing the value, leading to the reader seeing part of the old value and part of the new value. It does this by only reusing that same Node for an update if it can write the TValue atomically. It can write it atomically if the TValue is a reference type, in which case it’s simply writing a pointer-sized reference, or if the TValue is a primitive value that’s defined by the platform to always be written atomically when written with appropriate alignment, e.g. int, or long when in a 64-bit process. To make this check efficient, ConcurrentDictionary<TKey, TValue> computes once whether a given TValue is writable atomically, storing it into a static readonly field, such that in tier 1 compilation, the JIT can treat the value as a const. However, this const trick doesn’t always work. The field was on ConcurrentDictionary<TKey, TValue> itself, and if one of those generic type parameters ended up being a reference type (e.g. ConcurrentDictionary<object, int>), accessing the static readonly field would require a generic lookup (the JIT isn’t currently able to see that the value stored in the field is only dependent on the TValue and not on the TKey). To fix this, the field was moved to a separate type where TValue is the only generic parameter, and a check for typeof(TValue).IsValueType (which is itself a JIT intrinsic that manifests as a const) is done separately.
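
The shape of that fix can be sketched as follows. This is a loose illustration under assumptions, not the code from the PR: the WriteAtomicity<TValue> name and the exact list of types treated as atomic are invented here. What matters is that the cached flag lives on a type whose only generic parameter is TValue, so reading it never depends on TKey.

```csharp
using System;

Console.WriteLine($"string: {WriteAtomicity<string>.IsAtomic}");
Console.WriteLine($"int:    {WriteAtomicity<int>.IsAtomic}");
Console.WriteLine($"long:   {WriteAtomicity<long>.IsAtomic}");
Console.WriteLine($"Guid:   {WriteAtomicity<Guid>.IsAtomic}");

// Hypothetical type for illustration; the name and the type list are assumptions.
static class WriteAtomicity<TValue>
{
    // Computed once per TValue; a static readonly field can be treated as a
    // constant by the JIT once the type has been initialized.
    public static readonly bool IsAtomic = Compute();

    private static bool Compute()
    {
        Type t = typeof(TValue);

        // References are pointer-sized and written atomically.
        if (!t.IsValueType)
        {
            return true;
        }

        // Primitives no larger than 32 bits, plus native-sized integers.
        if (t == typeof(bool) || t == typeof(byte) || t == typeof(sbyte) ||
            t == typeof(char) || t == typeof(short) || t == typeof(ushort) ||
            t == typeof(int) || t == typeof(uint) || t == typeof(float) ||
            t == typeof(nint) || t == typeof(nuint))
        {
            return true;
        }

        // 64-bit primitives are only written atomically in a 64-bit process.
        if (t == typeof(long) || t == typeof(ulong) || t == typeof(double))
        {
            return IntPtr.Size == 8;
        }

        // Anything else (e.g. a multi-field struct) could be torn.
        return false;
    }
}
```

A dictionary with any key type can then consult WriteAtomicity<TValue>.IsAtomic without the lookup being tied to the instantiation of the dictionary itself.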

ConcurrentDictionary<TKey, TValue>’s TryRemove was also improved this release, via dotnet/runtime#82004. Mutation of a ConcurrentDictionary<TKey, TValue> requires taking a lock. However, in the case of TryRemove, we only actually need the lock if it’s possible the item being removed is contained. If the number of items protected by the given lock is 0, we know TryRemove will be a nop. Thus, this PR added a fast path to TryRemove that reads the count for that lock and immediately bails if it is 0.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Concurrent;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly ConcurrentDictionary<int, int> _empty = new();
    [Benchmark]
    public bool TryRemoveEmpty() => _empty.TryRemove(default, out _);
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| TryRemoveEmpty | .NET 7.0 | 26.963 ns | 1.00 |
| TryRemoveEmpty | .NET 8.0 | 5.853 ns | 0.22 |

Another dictionary that’s been improved in .NET 8 is ConditionalWeakTable<TKey, TValue>. As background if you haven’t used this type before, ConditionalWeakTable<TKey, TValue> is a very specialized dictionary based on DependentHandle; think of it as every key being a weak reference (so if the GC runs, the key in the dictionary won’t be counted as a strong root that would keep the object alive), and that if the key is collected, the whole entry is removed from the table. It’s particularly useful in situations where additional data needs to be associated with an object but where for whatever reason you’re unable to modify that object to have a reference to the additional data. dotnet/runtime#80059 improves the performance of lookups on a ConditionalWeakTable<TKey, TValue>, in particular for objects that aren’t in the collection, and even more specifically for an object that’s never been in any dictionary. Since ConditionalWeakTable<TKey, TValue> is about object references, unlike other dictionaries in .NET, it doesn’t use the default EqualityComparer<TKey>.Default to determine whether an object is in the collection; it just uses object reference equality. And that means to get a hash code for an object, it uses the same functionality that the base object.GetHashCode does. It can’t just call GetHashCode, as the method could have been overridden, so instead it directly calls to the same public RuntimeHelpers.GetHashCode that object.GetHashCode uses:

public class Object
{
    public virtual int GetHashCode() => RuntimeHelpers.GetHashCode(this);
    ...
}

This PR tweaks what ConditionalWeakTable<,> does here. It introduces a new internal RuntimeHelpers.TryGetHashCode that will avoid creating and storing a hash code for the object if the object doesn’t already have one. It then uses that method from ConditionalWeakTable<TKey, TValue> as part of TryGetValue (and Remove, and other related APIs). If TryGetHashCode returns a value indicating the object doesn’t yet have one, then the operation can early-exit, because for the object to have been stored into the collection, it must have had a hash code generated for it.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private ConditionalWeakTable<SomeObject, Data> _cwt;
    private List<object> _rooted;
    private readonly SomeObject _key = new();
    [GlobalSetup]
    public void Setup()
    {
        _cwt = new();
        _rooted = new();
        for (int i = 0; i < 1000; i++)
        {
            SomeObject key = new();
            _rooted.Add(key);
            _cwt.Add(key, new());
        }
    }
    [Benchmark]
    public int GetValue() => _cwt.TryGetValue(_key, out Data d) ? d.Value : 0;
    private sealed class SomeObject { }
    private sealed class Data
    {
        public int Value;
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| GetValue | .NET 7.0 | 4.533 ns | 1.00 |
| GetValue | .NET 8.0 | 3.028 ns | 0.67 |
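
If you haven’t used ConditionalWeakTable<TKey, TValue> before, a minimal usage sketch may help. The Tag class and variable names are invented for the example; it attaches extra data to an object the code doesn’t own and then looks up both an object that was added and one that never was (the latter being the case the PR speeds up).

```csharp
using System;
using System.Runtime.CompilerServices;

var tags = new ConditionalWeakTable<object, Tag>();

object tracked = new object();
object untracked = new object();

// Associate extra data with 'tracked' without modifying its type.
tags.Add(tracked, new Tag { Label = "seen" });

bool foundTracked = tags.TryGetValue(tracked, out var trackedTag);
bool foundUntracked = tags.TryGetValue(untracked, out var untrackedTag);

Console.WriteLine($"tracked: {foundTracked}, label = {trackedTag?.Label}");
Console.WriteLine($"untracked: {foundUntracked}");

// Keep the key alive until here so the entry cannot be collected mid-example.
GC.KeepAlive(tracked);

// Hypothetical payload type used only for this example.
sealed class Tag
{
    public string Label = "";
}
```

Lookups are by reference identity, so a different object instance never matches an existing entry even if its Equals override would say otherwise.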

So, improvements to Dictionary<TKey, TValue>, ConcurrentDictionary<TKey, TValue>, ConditionalWeakTable<TKey, TValue>… are those the “end all be all” of hash table world? Don’t be silly…

Frozen Collections

There are many specialized libraries available on NuGet, providing all manner of data structures with this or that optimization or targeted at this or that scenario. Our goal with the core .NET libraries has never been to provide all possible data structures (it’s actually been a goal not to), but rather to provide the most commonly needed data structures focused on the most commonly needed scenarios, and rely on the ecosystem to provide alternatives where something else is deemed valuable. As a result, we don’t add new collection types all that frequently; we continually optimize the ones that are there and we routinely augment them with additional functionality, but we rarely introduce brand new collection types. In fact, in the last several years, the only new general-purpose collection type introduced into the core libraries was the PriorityQueue<TElement, TPriority> class, which was added in .NET 6. However, enough of a need has presented itself that .NET 8 sees the introduction of not one but two new collection types: System.Collections.Frozen.FrozenDictionary<TKey, TValue> and System.Collections.Frozen.FrozenSet<T>.

Beyond causing “Let It Go” to be stuck in your head for the rest of the day (“you’re welcome”), what benefit do these new types provide, especially when we already have System.Collections.Immutable.ImmutableDictionary<TKey, TValue> and System.Collections.Immutable.ImmutableHashSet<T>? There are enough similarities between the existing immutable collections and the new frozen collections that the latter are actually included in the System.Collections.Immutable library, which means they’re also available as part of the System.Collections.Immutable NuGet package. But there are also enough differences to warrant us adding them. In particular, this is an example of where scenario and intended use make a big impact on whether a particular data structure makes sense for your needs.

Arguably, the existing System.Collections.Immutable collections were misnamed. Yes, they’re “immutable,” meaning that once you’ve constructed an instance of one of the collection types, you can’t change its contents. However, that could have easily been achieved simply by wrapping an immutable facade around one of the existing mutable ones, e.g. an immutable dictionary type that just copied the data into a mutable Dictionary<TKey, TValue> and exposed only reading operations:

public sealed class MyImmutableDictionary<TKey, TValue> :
    IReadOnlyDictionary<TKey, TValue>
    where TKey : notnull
{
    private readonly Dictionary<TKey, TValue> _data;
    public MyImmutableDictionary(IEnumerable<KeyValuePair<TKey, TValue>> source) => _data = source.ToDictionary();
    public bool TryGetValue(TKey key, [MaybeNullWhen(false)] out TValue value) => _data.TryGetValue(key, out value);
    ...
}

Yet, if you look at the implementation of ImmutableDictionary<TKey, TValue>, you’ll see a ton of code involved in making the type tick. Why? Because it and its friends are optimized for something very different. In academic nomenclature, the immutable collections are actually “persistent” collections. A persistent data structure is one that provides mutating operations on the collection (e.g. Add, Remove, etc.) but where those operations don’t actually change the existing instance, instead resulting in a new instance being created that contains that modification. So, for example, ImmutableDictionary<TKey, TValue> ironically exposes an Add(TKey key, TValue value) method, but this method doesn’t actually modify the collection instance on which it’s called; instead, it creates and returns a brand new ImmutableDictionary<TKey, TValue> instance, containing all of the key/value pairs from the original instance as well as the new key/value pair being added. Now, you could imagine that being done simply by copying all of the data to a new Dictionary<TKey, TValue> and adding in the new value, e.g.

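public sealed class MyPersistentDictionary<TKey, TValue> where TKey : notnull
{
    private readonly Dictionary<TKey, TValue> _data;
    private MyPersistentDictionary(Dictionary<TKey, TValue> data) => _data = data;
    public MyPersistentDictionary<TKey, TValue> Add(TKey key, TValue value)
    {
        // Copy everything, add the new pair to the copy, and leave this instance untouched.
        var newDictionary = new Dictionary<TKey, TValue>(_data);
        newDictionary.Add(key, value);
        // Wrap the copy so the method returns the persistent type rather than the raw Dictionary.
        return new MyPersistentDictionary<TKey, TValue>(newDictionary);
    }
    ...
}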

but while functional, that’s terribly inefficient from a memory consumption perspective, as every addition results in a brand new copy of all of the data being made, just to store that one additional pair in the new instance. It’s also terribly inefficient from an algorithmic complexity perspective, as adding N values would end up being an O(n^2) algorithm (each new item would result in copying all previous items). As such, ImmutableDictionary<TKey, TValue> is optimized to share as much as possible between instances. Its implementation uses an AVL tree, a self-balancing binary search tree. Adding into such a tree not only requires O(log n) time (whereas the full copy shown in MyPersistentDictionary<TKey, TValue> above is O(n)), it also enables reusing entire portions of a tree between instances of dictionaries. If adding a key/value pair doesn’t require mutating a particular subtree, then both the new and old dictionary instances can point to that same subtree, thereby avoiding significant memory increase. You can see this from a benchmark like the following:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Immutable;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private const int Items = 10_000;

    [Benchmark(Baseline = true)]
    public Dictionary<int, int> DictionaryAdds()
    {
        Dictionary<int, int> d = new();
        for (int i = 0; i < Items; i++)
        {
            var newD = new Dictionary<int, int>(d);
            newD.Add(i, i);
            d = newD;
        }
        return d;
    }

    [Benchmark]
    public ImmutableDictionary<int, int> ImmutableDictionaryAdds()
    {
        ImmutableDictionary<int, int> d = ImmutableDictionary<int, int>.Empty;
        for (int i = 0; i < Items; i++)
        {
            d = d.Add(i, i);
        }
        return d;
    }
}

which when run on .NET 8 yields the following results for me:

| Method | Mean | Ratio |
|---|---|---|
| DictionaryAdds | 478.961 ms | 1.000 |
| ImmutableDictionaryAdds | 4.067 ms | 0.009 |

That highlights that the tree-based nature of ImmutableDictionary<TKey, TValue> makes it significantly more efficient (~120x faster in this run) for this example of performing lots of additions, when compared with using for the same purpose a Dictionary<TKey, TValue> treated as being immutable. And that’s why these immutable collections came into being in the first place. The C# compiler uses lots and lots of dictionaries and sets and the like, and it employs a lot of concurrency. It needs to enable one thread to “tear off” an immutable view of a collection even while other threads are updating the collection, and for such purposes it uses System.Collections.Immutable.
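To make the subtree sharing concrete, here’s a minimal sketch of path copying in a persistent binary search tree. This is not how ImmutableDictionary<TKey, TValue> is actually written (that uses a balanced AVL tree with a lot more machinery); Node and PersistentTree are hypothetical names, and the tree is left unbalanced to keep the sketch short.

```csharp
#nullable enable
using System;

// Build one version of the tree, then derive a second version from it.
Node v1 = PersistentTree.Add(PersistentTree.Add(PersistentTree.Add(null, 50), 30), 70);
Node v2 = PersistentTree.Add(v1, 80); // only the path 50 -> 70 is copied

Console.WriteLine(ReferenceEquals(v1, v2));           // a new root was created
Console.WriteLine(ReferenceEquals(v1.Left, v2.Left)); // the untouched left subtree is shared
Console.WriteLine(PersistentTree.Contains(v1, 80));   // the old version is unchanged
Console.WriteLine(PersistentTree.Contains(v2, 80));

// Hypothetical immutable node type: no field is ever modified after construction.
public sealed record Node(int Key, Node? Left, Node? Right);

public static class PersistentTree
{
    // Returns the root of a new tree containing key. Nodes on the search path
    // are copied; every subtree off that path is reused by reference.
    public static Node Add(Node? root, int key)
    {
        if (root is null) return new Node(key, null, null);
        if (key < root.Key) return root with { Left = Add(root.Left, key) };
        if (key > root.Key) return root with { Right = Add(root.Right, key) };
        return root; // key already present, nothing to copy
    }

    public static bool Contains(Node? root, int key)
    {
        while (root is not null)
        {
            if (key == root.Key) return true;
            root = key < root.Key ? root.Left : root.Right;
        }
        return false;
    }
}
```

Each Add allocates only as many nodes as the path from the root to the insertion point is long, which is where the O(log n) cost comes from once the tree is kept balanced.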

However, just because the above numbers look amazing doesn’t mean ImmutableDictionary<TKey, TValue> is always the right tool for the immutable job… it actually rarely is. Why? Because the exact thing that made it so fast and memory efficient for the above benchmark is also its downfall on one of the most common tasks needed for an “immutable” dictionary: reading. With its tree-based data structure, not only are adds O(log n), but lookups are also O(log n), which for a large dictionary can be extremely inefficient when compared to the O(1) access times of a type like Dictionary<TKey, TValue>. We can see this as well with a simple benchmark. Let’s say we’ve built up a 1,000,000-element dictionary, and now we want to query it:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Immutable;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private const int Items = 1_000_000;

    private static readonly Dictionary<int, int> s_d = new Dictionary<int, int>(Enumerable.Range(0, Items).ToDictionary(x => x, x => x));
    private static readonly ImmutableDictionary<int, int> s_id = ImmutableDictionary.CreateRange(Enumerable.Range(0, Items).ToDictionary(x => x, x => x));

    [Benchmark]
    public int EnumerateDictionary()
    {
        int sum = 0;
        foreach (var pair in s_d) sum++;
        return sum;
    }

    [Benchmark]
    public int EnumerateImmutableDictionary()
    {
        int sum = 0;
        foreach (var pair in s_id) sum++;
        return sum;
    }

    [Benchmark]
    public int IndexerDictionary()
    {
        int sum = 0;
        for (int i = 0; i < Items; i++)
        {
            sum += s_d[i];
        }
        return sum;
    }

    [Benchmark]
    public int IndexerImmutableDictionary()
    {
        int sum = 0;
        for (int i = 0; i < Items; i++)
        {
            sum += s_id[i];
        }
        return sum;
    }
}

| Method | Mean |
|---|---|
| EnumerateImmutableDictionary | 28.065 ms |
| EnumerateDictionary | 1.404 ms |
| IndexerImmutableDictionary | 46.538 ms |
| IndexerDictionary | 3.780 ms |

Uh oh. Our ImmutableDictionary<TKey, TValue> in this example is ~12x as expensive for lookups and ~20x as expensive for enumeration as Dictionary<TKey, TValue>. If your process will be spending most of its time performing reads on the dictionary rather than creating it and/or performing mutation, that’s a lot of cycles being left on the table.

And that’s where frozen collections come in. The collections in System.Collections.Frozen are immutable, just as are those in System.Collections.Immutable, but they’re optimized for a different scenario. Whereas the purpose of a type like ImmutableDictionary<TKey, TValue> is to enable efficient mutation (into a new instance), the purpose of FrozenDictionary<TKey, TValue> is to represent data that never changes, and thus it doesn’t expose any operations that suggest mutation, only operations for reading. Maybe you’re loading some configuration data into a dictionary once when your process starts (and then re-loading it only rarely when the configuration changes) and then querying that data over and over and over again. Maybe you’re creating a mapping from HTTP status codes to delegates representing how those status codes should be handled. Maybe you’re caching schema information about a set of dynamically-discovered types and then using the resulting parsed information every time you encounter those types later on. Whatever the scenario, you’re creating an immutable collection that you want to be optimized for reads, and you’re willing to spend some more cycles creating the collection (because you do it only once, or only once in a while) in order to make reads as fast as possible. That’s exactly what FrozenDictionary<TKey, TValue> and FrozenSet<T> provide.
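As a rough illustration of the “create once, read a lot” pattern, here’s a sketch of the status-code scenario mentioned above. The handlers, their messages, and the Describe helper are all made up for the example; only ToFrozenDictionary and TryGetValue come from the library.

```csharp
#nullable enable
using System;
using System.Collections.Frozen;
using System.Collections.Generic;

// Built once (e.g. at startup). The status codes and messages are hypothetical.
FrozenDictionary<int, Func<string>> handlers = new Dictionary<int, Func<string>>
{
    [200] = () => "ok",
    [404] = () => "not found",
    [503] = () => "retry later",
}.ToFrozenDictionary();

// Read many times afterwards. There is no Add or Remove on the frozen type.
string Describe(int status) =>
    handlers.TryGetValue(status, out Func<string>? handler) ? handler() : "unhandled";

Console.WriteLine(Describe(404));
Console.WriteLine(Describe(418));
```

The mutable Dictionary<TKey, TValue> here is just scaffolding for building the data; once frozen, the only way to “change” the mapping is to build a new frozen instance and swap it in.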

Let’s update our previous example to now also include FrozenDictionary<TKey, TValue>:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Frozen;
using System.Collections.Immutable;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private const int Items = 10_000;

    private static readonly Dictionary<int, int> s_d = new Dictionary<int, int>(Enumerable.Range(0, Items).ToDictionary(x => x, x => x));
    private static readonly ImmutableDictionary<int, int> s_id = ImmutableDictionary.CreateRange(Enumerable.Range(0, Items).ToDictionary(x => x, x => x));
    private static readonly FrozenDictionary<int, int> s_fd = FrozenDictionary.ToFrozenDictionary(Enumerable.Range(0, Items).ToDictionary(x => x, x => x));

    [Benchmark]
    public int DictionaryGets()
    {
        int sum = 0;
        for (int i = 0; i < Items; i++)
        {
            sum += s_d[i];
        }
        return sum;
    }

    [Benchmark]
    public int ImmutableDictionaryGets()
    {
        int sum = 0;
        for (int i = 0; i < Items; i++)
        {
            sum += s_id[i];
        }
        return sum;
    }

    [Benchmark(Baseline = true)]
    public int FrozenDictionaryGets()
    {
        int sum = 0;
        for (int i = 0; i < Items; i++)
        {
            sum += s_fd[i];
        }
        return sum;
    }
}

| Method | Mean | Ratio |
|---|---|---|
| ImmutableDictionaryGets | 360.55 us | 13.89 |
| DictionaryGets | 39.43 us | 1.52 |
| FrozenDictionaryGets | 25.95 us | 1.00 |

Now we’re talkin’. Whereas for this lookup test Dictionary<TKey, TValue> was ~9x faster than ImmutableDictionary<TKey, TValue>, FrozenDictionary<TKey, TValue> was 50% faster than even Dictionary<TKey, TValue>.

How does that improvement happen? Just as ImmutableDictionary<TKey, TValue> doesn’t just wrap a Dictionary<TKey, TValue>, FrozenDictionary<TKey, TValue> doesn’t just wrap one, either. It has a customized implementation focused on making read operations as fast as possible, both for lookups and for enumerations. In fact, it doesn’t have just one implementation; it has many implementations.

To start to see that, let’s change the example. In the United States, the Social Security Administration tracks the popularity of baby names. In 2022, the most popular baby names for girls were Olivia, Emma, Charlotte, Amelia, Sophia, Isabella, Ava, Mia, Evelyn, and Luna. Here’s a benchmark that checks to see whether a name is one of those:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Frozen;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly HashSet<string> s_s = new(StringComparer.OrdinalIgnoreCase)
    {
        "Olivia", "Emma", "Charlotte", "Amelia", "Sophia", "Isabella", "Ava", "Mia", "Evelyn", "Luna"
    };
    private static readonly FrozenSet<string> s_fs = s_s.ToFrozenSet(StringComparer.OrdinalIgnoreCase);

    [Benchmark(Baseline = true)]
    public bool HashSet_IsMostPopular() => s_s.Contains("Alexandria");

    [Benchmark]
    public bool FrozenSet_IsMostPopular() => s_fs.Contains("Alexandria");
}

| Method | Mean | Ratio |
|---|---|---|
| HashSet_IsMostPopular | 9.824 ns | 1.00 |
| FrozenSet_IsMostPopular | 1.518 ns | 0.15 |

Significantly faster. Internally, ToFrozenSet can pick an implementation based on the data supplied, both the type of the data and the exact values being used. In this case, if we print out the type of s_fs, we see:

System.Collections.Frozen.LengthBucketsFrozenSet

That’s an implementation detail, but what we’re seeing here is that s_fs, even though it’s strongly-typed as FrozenSet<string>, is actually a derived type named LengthBucketsFrozenSet. ToFrozenSet has analyzed the data supplied to it and chosen a strategy that it thinks will yield the best overall throughput. Part of that is just seeing that the type of the data is string, in which case all the string-based strategies are able to quickly discard queries that can’t possibly match. In this example, the set will have tracked that the longest string in the collection is “Charlotte” at only nine characters long; as such, when it’s asked whether the set contains “Alexandria”, it can immediately answer “no,” because it does a quick length check and sees that “Alexandria” at 10 characters can’t possibly be contained.
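The quick length check can be sketched in a few lines. This is only an illustration of the idea, not the library’s code: it wraps a plain HashSet<string> and uses ordinal comparison, and tracking the minimum length alongside the maximum is my own assumption for the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] names = { "Olivia", "Emma", "Charlotte", "Amelia", "Sophia", "Isabella", "Ava", "Mia", "Evelyn", "Luna" };

// Computed once at construction time.
var set = new HashSet<string>(names, StringComparer.Ordinal);
int minLength = names.Min(n => n.Length);
int maxLength = names.Max(n => n.Length);

// Anything outside [minLength, maxLength] is rejected without hashing a single character.
bool Contains(string value) =>
    value.Length >= minLength && value.Length <= maxLength && set.Contains(value);

Console.WriteLine(Contains("Alexandria")); // rejected by the length check alone
Console.WriteLine(Contains("Emma"));       // passes the length check, then found in the set
```

A string’s length is available in constant time, so the check costs almost nothing for values that could match and saves the entire hash computation for values that can’t.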

Let’s take another example. Internal to the C# compiler, it has the notion of “special types,” and it has a dictionary that maps from a string-based type name to an enum used to identify that special-type. As a simplified representation of this, I’ve just extracted those strings to create a set:

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Frozen;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly HashSet<string> s_s = new()
    {
        "System.Object", "System.Enum", "System.MulticastDelegate", "System.Delegate", "System.ValueType", "System.Void",
        "System.Boolean", "System.Char", "System.SByte", "System.Byte", "System.Int16", "System.UInt16", "System.Int32",
        "System.UInt32", "System.Int64","System.UInt64", "System.Decimal", "System.Single", "System.Double", "System.String",
        "System.IntPtr", "System.UIntPtr", "System.Array", "System.Collections.IEnumerable", "System.Collections.Generic.IEnumerable`1",
        "System.Collections.Generic.IList`1", "System.Collections.Generic.ICollection`1", "System.Collections.IEnumerator",
        "System.Collections.Generic.IEnumerator`1", "System.Collections.Generic.IReadOnlyList`1", "System.Collections.Generic.IReadOnlyCollection`1",
        "System.Nullable`1", "System.DateTime", "System.Runtime.CompilerServices.IsVolatile", "System.IDisposable", "System.TypedReference",
        "System.ArgIterator", "System.RuntimeArgumentHandle", "System.RuntimeFieldHandle", "System.RuntimeMethodHandle", "System.RuntimeTypeHandle",
        "System.IAsyncResult", "System.AsyncCallback", "System.Runtime.CompilerServices.RuntimeFeature", "System.Runtime.CompilerServices.PreserveBaseOverridesAttribute",
    };
    private static readonly FrozenSet<string> s_fs = s_s.ToFrozenSet();

    [Benchmark(Baseline = true)]
    public bool HashSet_IsSpecial() => s_s.Contains("System.Collections.Generic.IEnumerable`1");

    [Benchmark]
    public bool FrozenSet_IsSpecial() => s_fs.Contains("System.Collections.Generic.IEnumerable`1");
}

| Method | Mean | Ratio |
|---|---|---|
| HashSet_IsSpecial | 15.228 ns | 1.00 |
| FrozenSet_IsSpecial | 8.218 ns | 0.54 |

Here the item we’re searching for is in the collection, so it’s not getting its performance boost from a fast path to fail out of the search. The concrete type of s_fs in this case sheds some light on it:

System.Collections.Frozen.OrdinalStringFrozenSet_RightJustifiedSubstring

One of the biggest costs involved in looking up something in a hash table is often the cost of producing the hash in the first place. For a type like int, it’s trivial, as it’s literally just its value. But for a type like string, the hash is produced by looking at the string’s contents and factoring each character into the resulting value. The more characters need to be considered, the more it costs. In this case, the type has identified that in order to differentiate all of the items in the collection, only a subset of each string’s characters needs to be hashed, such that it only needs to examine a subset of the incoming string to determine what a possible match might be in the collection.

A bunch of PRs went into making System.Collections.Frozen happen in .NET 8. It started as an internal project used by several services at Microsoft, and was then cleaned up and added as part of dotnet/runtime#77799. That provided the core types and initial strategy implementations, with dotnet/runtime#79794 following it to provide additional strategies (although we subsequently backed out a few due to lack of motivating scenarios for what their optimizations were targeting).

dotnet/runtime#81021 then removed some virtual dispatch from the string-based implementations. As noted in the previous example, one approach the strategies take is to try to hash less, so there’s a phase of analysis where the implementation looks at the various substrings in each of the items and determines whether there’s an offset and length for a substring that provides ideal differentiation across all of the items. For example, consider the strings “12a34”, “12b34”, “12c34”; the analyzer would determine that there’s no need to hash the whole string, it need only consider the character at index 2, as that’s enough to uniquely hash the relevant strings. This was initially achieved by using a custom comparer type, but that then meant that virtual dispatch was needed in order to invoke the hashing routine. Instead, this PR created more concrete derived types from FrozenSet/FrozenDictionary, such that the choice of hashing logic was dictated by the choice of concrete collection type to instantiate, saving on the per-operation dispatch.
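Here’s a small sketch of what such an analysis phase might look like. It’s a brute-force illustration under my own assumptions (left-justified offsets only, ordinal comparison, a hypothetical FindUniqueSubstring helper), not the actual analyzer, which considers more cases and bounds its work.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] items = { "12a34", "12b34", "12c34" };

var result = FindUniqueSubstring(items);
if (result.HasValue)
{
    // For the three strings above, the single character at index 2 is enough.
    Console.WriteLine($"offset {result.Value.Offset}, length {result.Value.Length}");
}

// Hypothetical helper: returns the shortest (and then left-most) offset/length whose
// substrings are distinct across all items, or null if no such substring exists.
static (int Offset, int Length)? FindUniqueSubstring(string[] items)
{
    if (items.Length == 0) return null;

    int minLength = items.Min(s => s.Length);
    var seen = new HashSet<string>(StringComparer.Ordinal);

    for (int length = 1; length <= minLength; length++)
    {
        for (int offset = 0; offset + length <= minLength; offset++)
        {
            seen.Clear();
            bool unique = true;
            foreach (string item in items)
            {
                if (!seen.Add(item.Substring(offset, length)))
                {
                    unique = false; // two items share this substring; try the next candidate
                    break;
                }
            }
            if (unique) return (offset, length);
        }
    }
    return null;
}
```

Once an offset and length are known, lookups only need to hash that slice of the incoming string, and a full comparison is still performed afterwards to confirm a match.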

In any good story, there’s a twist, and we encountered a twist with these frozen collection types as well. I’ve already described the scenarios that drove the creation of these types: create once, use a lot. And as such, a lot of attention was paid to overheads involved in reading from the collection, but initially very little time was spent optimizing construction time. In fact, improving construction time was initially a non-goal, with a willingness to spend as much time as was needed to eke out more throughput for reading. This makes sense if you’re focusing on long-lived services, where you’re happy to spend extra seconds once an hour or day or week to optimize something that will then be used many thousands of times per second. However, the equation changes a bit when types like this are exposed in the core libraries, such that the expected number of developers using them, the use cases they have for them, and the variations of data thrown at them grow by orders of magnitude. We started hearing from developers that they were excited to use FrozenDictionary/FrozenSet not just because of performance but also because they were truly immutable, both in implementation and in surface area (e.g. no Add method to confuse things), and that they’d be interested in employing them in object models, UIs, and so on. At that point, you’re no longer in the world of “we can take as much time for construction as we want,” and instead need to be concerned about construction taking inordinate amounts of time and resources.

As a stop-gap measure, dotnet/runtime#81194 changed the existing ToFrozenDictionary/ToFrozenSet methods to not do any analysis of the incoming data, and instead have both construction time and read throughput in line with that of Dictionary/HashSet. It then added new overloads with a bool optimizeForReading argument, to enable developers to opt-in to those longer construction times in exchange for better read throughput. This wasn’t an ideal solution, as it meant that it took more discovery and more code for a developer to achieve the primary purpose of these types, but it also helped developers avoid pits of failure by using what looked like a harmless method but could result in significant increases in processing time (one degenerate example I created resulted in ToFrozenDictionary running literally for minutes).

We then set about to improve the overall performance of the collections, with a bunch of PRs geared towards driving down the costs:

  • dotnet/runtime#81389 removed various allocations and a dependency from some of the optimizations on the generic math interfaces from .NET 7, such that the optimizations would apply downlevel as well, simplifying the code.
  • dotnet/runtime#81603 moved some code around to reduce how much code was in a generic context. With Native AOT, with type parameters involving value types, every unique set of type parameters used with these collections results in a unique copy of the code being made, and with all of the various strategies around just in case they’re necessary to optimize a given set, there’s potentially a lot of code that gets duplicated. This change was able to shave ~10 KB off each generic instantiation.
  • dotnet/runtime#86293 made a large number of tweaks, including limiting the maximum length substring that would be evaluated as part of determining the optimal hashing length to employ. This significantly reduced the worst-case running time when supplying problematic inputs.
  • dotnet/runtime#84301 added similar early-exit optimizations as were seen earlier with string, but for a host of other types, including all the primitives, TimeSpan, Guid, and such. For these types, when no comparer is provided, we can sort the inputs, quickly check whether a supplied input is greater than anything known to be in the collection, and when dealing with a small number of elements such that we don’t hash at all and instead just do a linear search, we can stop searching once we’ve reached an item in the collection that’s larger than the one being tested (e.g. if the first item in the sorted list is larger than the one being tested, nothing will match). It’s worth asking why we don’t just do this for any IComparable<T>; we did, initially, actually, but removed it because of several prominent IComparable<T> implementations that didn’t work for this purpose. ValueTuple<...>, for example, implements IComparable<ValueTuple<...>>, but the T1, T2, etc. types the ValueTuple<...> wraps may not, and the frozen collections didn’t have a good way to determine the viability of an IComparable<T> implementation. Instead, this PR added the optimization back with an allow list, such that all the relevant known good types that could be referenced were special-cased.
  • dotnet/runtime#87510 was the first in a series of PRs to focus significantly on driving down the cost of construction. Its main contribution in this regard was in how collisions are handled. One of the main optimizations employed in the general case by ToFrozenDictionary/ToFrozenSet is to try to drive down the number of collisions in the hash table, since the more collisions there are, the more work will need to be performed during lookups. It does this by populating the table and tracking the number of collisions, and then if there were too many, increasing the size of the table and trying again, repeatedly, until the table has grown large enough that collisions are no longer an issue. This process would hash everything, and then check to make sure it was as good as was desired. This PR changed that to instead bail the moment we knew there were enough collisions that we’d need to retry, rather than waiting until having processed everything.
  • dotnet/runtime#87630, dotnet/runtime#87688, and dotnet/runtime#88093 in particular improve collections keyed by ints, by avoiding unnecessary work. For example, as part of determining the ideal table size (to minimize collisions), the implementation generates a set of all unique hash codes, eliminating duplicate hash codes because they’d always collide regardless of the size of the table. But with ints, we can skip this step, because ints are their own hash codes, and so a set of unique ints is guaranteed to be a set of unique hash codes as well. This was then extended to also apply for uint, short/ushort, byte/sbyte, and nint/nuint (in 32-bit processes), as they all similarly use their own value as the hash code.
  • dotnet/runtime#87876 and dotnet/runtime#87989 improve the “LengthBucket” strategy referenced in the earlier examples. This implementation buckets strings by their length and then does a lookup just within the strings of that length; if there are only a few strings per length, this can make searching very efficient. The initial implementation used an array of arrays, and this PR flattens that into a single array. This makes construction time much faster for this strategy, as there’s significantly less allocation involved.
  • dotnet/runtime#87960 is based on an observation that we would invariably need to resize at least once in order to obtain the desired minimal collision rate, so it simply starts at a higher initial count than was previously being used.
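To illustrate the “bail early on collisions, then grow and retry” idea from the list above, here’s a deliberately simplified sketch. The names (FitsWithin, ChooseBucketCount), the doubling growth policy, the zero-collision threshold, and the maxBuckets cap are all assumptions of mine rather than what the library does; the input is also assumed to already contain only unique hash codes.

```csharp
using System;

// Hypothetical, already de-duplicated hash codes.
int[] hashCodes = { 0, 8, 16, 24, 1 };

bool fits = FitsWithin(hashCodes, 8, 0, out int examined);
Console.WriteLine($"8 buckets acceptable: {fits}; hash codes examined before deciding: {examined}");
Console.WriteLine($"chosen bucket count: {ChooseBucketCount(hashCodes, 8, 0, 1024)}");

// Returns false as soon as the collision count exceeds maxCollisions,
// instead of placing every hash code and only checking the total at the end.
static bool FitsWithin(ReadOnlySpan<int> hashCodes, int numBuckets, int maxCollisions, out int examined)
{
    var occupied = new bool[numBuckets];
    int collisions = 0;
    examined = 0;

    foreach (int hash in hashCodes)
    {
        examined++;
        int bucket = (int)((uint)hash % (uint)numBuckets);
        if (occupied[bucket])
        {
            if (++collisions > maxCollisions) return false; // early bail-out
        }
        else
        {
            occupied[bucket] = true;
        }
    }
    return true;
}

// Doubles the table until the collision limit is met or the cap is reached.
static int ChooseBucketCount(ReadOnlySpan<int> hashCodes, int initialBuckets, int maxCollisions, int maxBuckets)
{
    int numBuckets = Math.Max(1, initialBuckets);
    while (numBuckets < maxBuckets && !FitsWithin(hashCodes, numBuckets, maxCollisions, out _))
    {
        numBuckets *= 2;
    }
    return numBuckets;
}
```

With this input, the attempt at 8 buckets is abandoned after looking at just two hash codes, which is the kind of saving that adds up when a failed attempt would otherwise process the whole collection. Starting from a larger initial count, as described in the last bullet, simply skips attempts that were almost certain to fail.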

With all of those optimizations in place, construction time has now improved to the point where it’s no longer a threat, and dotnet/runtime#87988 effectively reverted dotnet/runtime#81194, getting rid of the optimizeForReading-based overloads, such that everything is now optimized for reading.

As an aside, it’s worth noting that for string keys in particular, the C# compiler has now also gotten in on the game of better optimizing based on the known characteristics of the data, such that if you know all of your string keys at compile-time, and you just need an ordinal, case-sensitive lookup, you might be best off simply writing a switch statement or expression. This is all thanks to dotnet/roslyn#66081. Let’s take the name popularity example from earlier, and express it as a switch statement:

static bool IsMostPopular(string name)
{
    switch (name)
    {
        case "Olivia":
        case "Emma":
        case "Charlotte":
        case "Amelia":
        case "Sophia":
        case "Isabella":
        case "Ava":
        case "Mia":
        case "Evelyn":
        case "Luna":
            return true;

        default:
            return false;
    }
}

Previously compiling this would result in the C# compiler providing a lowered equivalent to this:

static bool IsMostPopular(string name)
{
    uint num = <PrivateImplementationDetails>.ComputeStringHash(name);
    if (num <= 1803517931)
    {
        if (num <= 452280388)
        {
            if (num != 83419291)
            {
                if (num == 452280388 && name == "Isabella")
                    goto IL_012c;
            }
            else if (name == "Olivia")
                goto IL_012c;
        }
        else if (num != 596915366)
        {
            if (num != 708112360)
            {
                if (num == 1803517931 && name == "Charlotte")
                    goto IL_012c;
            }
            else if (name == "Evelyn")
                goto IL_012c;
        }
        else if (name == "Mia")
            goto IL_012c;
    }
    else if (num <= 2263917949u)
    {
        if (num != 2234485159u)
        {
            if (num == 2263917949u && name == "Ava")
                goto IL_012c;
        }
        else if (name == "Luna")
            goto IL_012c;
    }
    else if (num != 2346269629u)
    {
        if (num != 3517830433u)
        {
            if (num == 3552467688u && name == "Amelia")
                goto IL_012c;
        }
        else if (name == "Sophia")
            goto IL_012c;
    }
    else if (name == "Emma")
        goto IL_012c;
    return false;

    IL_012c:
    return true;
}

If you stare at that for a moment, you’ll see the compiler has implemented a binary search tree. It hashes the name, and then having hashed all of the cases at build time, it does a binary search on the hash codes to find the right case. Now with the recent improvements, it instead generates an equivalent of this:

static bool IsMostPopular(string name)
{
    if (name != null)
    {
        switch (name.Length)
        {
            case 3:
                switch (name[0])
                {
                    case 'A':
                        if (name == "Ava")
                            goto IL_012f;
                        break;
                    case 'M':
                        if (name == "Mia")
                            goto IL_012f;
                        break;
                }
                break; // no match in this length bucket

            case 4:
                switch (name[0])
                {
                    case 'E':
                        if (name == "Emma")
                            goto IL_012f;
                        break;
                    case 'L':
                        if (name == "Luna")
                            goto IL_012f;
                        break;
                }
                break;

            case 6:
                switch (name[0])
                {
                    case 'A':
                        if (name == "Amelia")
                            goto IL_012f;
                        break;
                    case 'E':
                        if (name == "Evelyn")
                            goto IL_012f;
                        break;
                    case 'O':
                        if (name == "Olivia")
                            goto IL_012f;
                        break;
                    case 'S':
                        if (name == "Sophia")
                            goto IL_012f;
                        break;
                }
                break;

            case 8:
                if (name == "Isabella")
                    goto IL_012f;
                break;

            case 9:
                if (name == "Charlotte")
                    goto IL_012f;
                break;
        }
    }
    return false;

    IL_012f:
    return true;
}

Now what’s it doing? First, it’s bucketed the strings by their length; any string that comes in that’s not 3, 4, 6, 8, or 9 characters long will be immediately rejected. For 8 and 9 characters, there’s only one possible answer it could be for each, so it simply checks against that string. For the others, it’s recognized that each name in that length begins with a different letter, and switches over that. In this particular example, the first character in each bucket is a perfect differentiator, but if it wasn’t, the compiler will also consider other indices to see if any of those might be better differentiators. This is implementing the same basic strategy as the System.Collections.Frozen.LengthBucketsFrozenSet we saw earlier.

I was careful in my choice above to use a switch. If I’d instead written the possibly more natural is expression:

static bool IsMostPopular(string name) =>
    name is "Olivia" or
            "Emma" or
            "Charlotte" or
            "Amelia" or
            "Sophia" or
            "Isabella" or
            "Ava" or
            "Mia" or
            "Evelyn" or
            "Luna";

then up until recently the compiler wouldn’t even have output the binary search, and would have instead just generated a cascading if/else if as if I’d written:

static bool IsMostPopular(string name) =>
    name == "Olivia" ||
    name == "Emma" ||
    name == "Charlotte" ||
    name == "Amelia" ||
    name == "Sophia" ||
    name == "Isabella" ||
    name == "Ava" ||
    name == "Mia" ||
    name == "Evelyn" ||
    name == "Luna";

With dotnet/roslyn#65874 from @alrz, however, the is-based version is now lowered the same as the switch-based version.

Back to frozen collections. As noted, System.Collections.Frozen types are in the System.Collections.Immutable library, and they’re not the only improvements to that library. A variety of new APIs have been added to help enable more productive and efficient use of the existing immutable collections…

Immutable Collections

For years, developers have found the need to bypass an ImmutableArray<T>’s immutability. For example, the previously-discussed FrozenDictionary<TKey, TValue> exposes an ImmutableArray<TKey> for its keys and an ImmutableArray<TValue> for its values. It does this by creating a TKey[], which it uses for a variety of purposes while building up the collection, and then it wants to wrap that as an ImmutableArray<TKey> to be exposed for consumption. But with the public APIs available on ImmutableArray/ImmutableArray<T>, there’s no way to transfer ownership like that; all the APIs that accept an input T[] or IEnumerable<T> allocate a new array and copy all of the data into it, so that the implementation can be sure no one else is still holding onto a reference to the array being wrapped (if someone was, they could use that mutable reference to mutate the contents of the immutable array, and guarding against that is one of the key differentiators between a read-only collection and an immutable collection). Enabling such wrapping of the original array is thus an “unsafe” operation, albeit one that’s valuable to enable for developers willing to accept the responsibility. Previously, developers could achieve this by employing a hack that works but only because of implementation detail: using Unsafe.As to cast between the types. When a value type’s first field is a reference type, a reference to the beginning of the struct is also a reference to the reference type, since they’re both at the exact same memory location. Thus, because ImmutableArray<T> contains just a single field (for the T[] it wraps), a method like the following will successfully wrap an ImmutableArray<T> around a T[]:

static ImmutableArray<T> UnsafeWrap<T>(T[] array) => Unsafe.As<T[], ImmutableArray<T>>(ref array);

That, however, is both unintuitive and depends on ImmutableArray<T> having the array at a 0-offset from the start of the struct, making it a brittle solution. To provide something robust, dotnet/runtime#85526 added the new System.Runtime.InteropServices.ImmutableCollectionsMarshal class, and on it two new methods: AsImmutableArray and AsArray. These methods support casting back and forth between a T[] and an ImmutableArray<T>, without allocation. They’re defined in InteropServices on a Marshal class, as that’s one of the ways we have to both hide more dangerous functionality and declare that something is inherently “unsafe” in some capacity.
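A short sketch of the two methods in use, and of why they’re considered “unsafe”: the wrapper shares storage with the original array, so writes through the array show up in the supposedly immutable view. The variable names are just for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Immutable;
using System.Runtime.InteropServices;

int[] buffer = { 1, 2, 3 };

// Wraps the existing array without allocating or copying.
ImmutableArray<int> wrapped = ImmutableCollectionsMarshal.AsImmutableArray(buffer);

// The regular factory copies the data into a new array.
ImmutableArray<int> copied = ImmutableArray.Create(buffer);

buffer[0] = 42; // the caller still holds a mutable reference...

Console.WriteLine(wrapped[0]); // ...so the wrapped view observes the write
Console.WriteLine(copied[0]);  // the copy is unaffected

// Going the other way hands back the very same array instance.
int[]? unwrapped = ImmutableCollectionsMarshal.AsArray(wrapped);
Console.WriteLine(ReferenceEquals(buffer, unwrapped));
```

The intended pattern is to build the array privately, wrap it, and then drop every other reference to it, so that the immutability promise holds in practice even though nothing enforces it.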

There are also new overloads exposed for constructing immutable collections with less allocation. All of the immutable collections have a corresponding static class that provides a Create method, e.g. ImmutableList<T> has the corresponding static class ImmutableList which provides a static ImmutableList<T> Create<T>(params T[] items) method. Now in .NET 8 as of dotnet/runtime#87945, these methods all have a new overload that takes a ReadOnlySpan<T>, e.g. static ImmutableList<T> Create<T>(ReadOnlySpan<T> items). This means an immutable collection can be created without incurring the allocation required to either go through the associated builder (which is a reference type) or to allocate an array of the exact right size.

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Immutable;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    [Benchmark(Baseline = true)]
    public ImmutableList<int> CreateArray() => ImmutableList.Create<int>(1, 2, 3, 4, 5);

    [Benchmark]
    public ImmutableList<int> CreateBuilder()
    {
        var builder = ImmutableList.CreateBuilder<int>();
        for (int i = 1; i <= 5; i++) builder.Add(i);
        return builder.ToImmutable();
    }

    [Benchmark]
    public ImmutableList<int> CreateSpan() => ImmutableList.Create<int>(stackalloc int[] { 1, 2, 3, 4, 5 });
}

| Method | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|
| CreateBuilder | 132.22 ns | 1.42 | 312 B | 1.00 |
| CreateArray | 92.98 ns | 1.00 | 312 B | 1.00 |
| CreateSpan | 85.54 ns | 0.92 | 264 B | 0.85 |

BitArray

dotnet/runtime#81527 from @lateapexearlyspeed added two new methods to BitArray, HasAllSet and HasAnySet, which do exactly what their names suggest: HasAllSet returns whether all of the bits in the array are set, and HasAnySet returns whether any of the bits in the array are set. While useful, what I really like about these additions is that they make good use of the ContainsAnyExcept method introduced in .NET 8. BitArray’s storage is an int[], where each element in the array represents 32 bits (for the purposes of this discussion, I’m ignoring the corner-case it needs to deal with of the last element’s bits not all being used because the count of the collection isn’t a multiple of 32). Determining whether any bits are set is then simply a matter of doing _array.AsSpan().ContainsAnyExcept(0). Similarly, determining whether all bits are set is simply a matter of doing !_array.AsSpan().ContainsAnyExcept(-1). The bit pattern for -1 is all 1s, so ContainsAnyExcept(-1) will return true if and only if it finds any integer that doesn’t have all of its bits set; thus if the call doesn’t find any, all bits are set. The net result is BitArray gets to maintain simple code that’s also vectorized and optimized, thanks to delegating to these shared helpers. You can see examples of these methods being used in dotnet/runtime#82057, which replaced bespoke implementations of the same functionality with the new built-in helpers.

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly BitArray _bitArray = new BitArray(1024);

    [Benchmark(Baseline = true)]
    public bool HasAnySet_Manual()
    {
        for (int i = 0; i < _bitArray.Length; i++)
        {
            if (_bitArray[i])
            {
                return true;
            }
        }

        return false;
    }

    [Benchmark]
    public bool HasAnySet_BuiltIn() => _bitArray.HasAnySet();
}

| Method | Mean | Ratio |
| --- | --- | --- |
| HasAnySet_Manual | 731.041 ns | 1.000 |
| HasAnySet_BuiltIn | 5.423 ns | 0.007 |
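
To make the ContainsAnyExcept trick concrete, here is a rough sketch of the same two checks applied to a plain int[] of bit words. AnySet and AllSet are hypothetical helper names of my own, not BitArray’s internals, and like the discussion above they ignore the partially-used last word.

```csharp
using System;

// Hypothetical helpers over a raw int[] bit store (32 bits per element).
// Any bit is set if some word is something other than 0.
static bool AnySet(ReadOnlySpan<int> words) => words.ContainsAnyExcept(0);

// All bits are set if no word is something other than -1 (all 1s).
static bool AllSet(ReadOnlySpan<int> words) => !words.ContainsAnyExcept(-1);

int[] none = new int[4];
int[] some = { 0, 0, 8, 0 };
int[] all = { -1, -1, -1, -1 };

Console.WriteLine(AnySet(none)); // False
Console.WriteLine(AnySet(some)); // True
Console.WriteLine(AllSet(some)); // False
Console.WriteLine(AllSet(all));  // True
```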

Collection Expressions

With dotnet/roslyn#68831 and then a myriad of subsequent PRs, C# 12 introduces a new terse syntax for constructing collections: “collection expressions.” Let’s say I want to construct a List<int>, for example, with the elements 1, 2, and 3. I could do it like so:

var list = new List<int>();
list.Add(1);
list.Add(2);
list.Add(3);

or utilizing collection initializers that were added in C# 3:

var list = new List<int>() { 1, 2, 3 };

Now in C# 12, I can write that as:

List<int> list = [1, 2, 3];

I can also use “spreads,” where enumerables can be used in the syntax and have all of their contents splat into the collection. For example, instead of:

var list = new List<int>() { 1, 2 };
foreach (int i in GetData())
{
    list.Add(i);
}
list.Add(3);

or:

var list = new List<int>() { 1, 2 };
list.AddRange(GetData());
list.Add(3);

I can simply write:

List<int> list = [1, 2, ..GetData(), 3];

If it were just a simpler syntax for collections, it wouldn’t be worth discussing in this particular post. What makes it relevant from a performance perspective, however, is that the C# compiler is free to optimize this however it sees fit, and it goes to great lengths to write the best code it can for the given circumstance; some optimizations are already in the compiler, more will be in place by the time .NET 8 and C# 12 are released, and even more will come later, with the language specified in such a way that gives the compiler the freedom to innovate here. Let’s take a few examples…

If you write:

IEnumerable<int> e = [];

the compiler won’t just translate that into:

IEnumerable<int> e = new int[0];

After all, we have a perfectly good singleton for this in the way of Array.Empty<int>(), something the compiler already emits use of for things like params T[], and it can emit the same thing here:

IEnumerable<int> e = Array.Empty<int>();

Ok, what about the optimizations we previously saw around the compiler lowering the creation of an array involving only constants and storing that directly into a ReadOnlySpan<T>? Yup, that applies here, too. So, instead of writing:

ReadOnlySpan<int> daysToMonth365 = new int[] { 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365 };

you can write:

ReadOnlySpan<int> daysToMonth365 = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365];

and the exact same code results.

What about List<T>? Earlier in the discussion of collections we saw that List<T> now sports an AddRange(ReadOnlySpan<T>), and the compiler is free to use that. For example, if you write this:

Span<int> source1 = ...;
IList<int> source2 = ...;
List<int> result = [1, 2, ..source1, ..source2];

the compiler could emit the equivalent of this:

Span<int> source1 = ...;
IList<int> source2 = ...;
List<int> result = new List<int>(2 + source1.Length + source2.Count);
result.Add(1);
result.Add(2);
result.AddRange(source1);
result.AddRange(source2);

One of my favorite optimizations it achieves, though, is with spans and the use of the [InlineArray] attribute we already saw. If you write:

int a = ..., b = ..., c = ..., d = ..., e = ..., f = ..., g = ..., h = ...;
Span<int> span = [a, b, c, d, e, f, g, h];

the compiler can lower that to code along the lines of this:

int a = ..., b = ..., c = ..., d = ..., e = ..., f = ..., g = ..., h = ...;
<>y__InlineArray8<int> buffer = default;
Span<int> span = buffer;
span[0] = a;
span[1] = b;
span[2] = c;
span[3] = d;
span[4] = e;
span[5] = f;
span[6] = g;
span[7] = h;

...

[InlineArray(8)]
internal struct <>y__InlineArray8<T>
{
    private T _element0;
}

In short, this collection expression syntax becomes the way to utilize [InlineArray] in the vast majority of situations, allowing the compiler to create a shared definition for you.

That optimization also feeds into another, which is both an optimization and a functional improvement over what’s in C# 11. Let’s say you have this code… what do you expect it to print?

// dotnet run -f net8.0

using System.Collections.Immutable;

ImmutableArray<int> array = new ImmutableArray<int> { 1, 2, 3 };
foreach (int i in array)
{
    Console.WriteLine(i);
}

Unless you’re steeped in System.Collections.Immutable and how collection initializers work, you likely didn’t predict the (unfortunate) answer:

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Immutable.ImmutableArray`1.get_IsEmpty()
   at System.Collections.Immutable.ImmutableArray`1.Add(T item)
   at Program.<Main>$(String[] args)

ImmutableArray<T> is a struct, so this will end up using its default initialization, which contains a null array. But even if that was made to work, the C# compiler will have lowered the code I wrote to the equivalent of this:

ImmutableArray<int> immutableArray = default;
immutableArray.Add(1);
immutableArray.Add(2);
immutableArray.Add(3);
foreach (int i in immutableArray)
{
    Console.WriteLine(i);
}

which is “wrong” in multiple ways. ImmutableArray<int>.Add doesn’t actually mutate the original collection, but instead returns a new instance that contains the additional element, so when we enumerate immutableArray, we wouldn’t see any of the additions. Plus, we’re doing all this work and allocation to create the results of Add, only to drop those results on the floor.

Collection expressions fix this. Now you can write this:

// dotnet run -f net8.0

using System.Collections.Immutable;

ImmutableArray<int> array = [1, 2, 3];
foreach (int i in array)
{
    Console.WriteLine(i);
}

and running it successfully produces:

1
2
3

Why? Because dotnet/runtime#88470 added a new [CollectionBuilder] attribute that’s recognized by the C# compiler. That attribute is placed on a type and points to a factory method for creating that type, accepting a ReadOnlySpan<T> and returning the instance constructed from that data. That PR also tagged ImmutableArray<T> with this attribute:

[CollectionBuilder(typeof(ImmutableArray), nameof(ImmutableArray.Create))]

such that when the compiler sees an ImmutableArray<T> being constructed from a collection expression, it knows to use ImmutableArray.Create<T>(ReadOnlySpan<T>). Not only that, it’s able to use the [InlineArray]-based optimization we just talked about for creating that input. As such, the code the compiler generates for this example as of today is equivalent to this:

<>y__InlineArray3<int> buffer = default;
buffer._element0 = 1;
Unsafe.Add(ref buffer._element0, 1) = 2;
Unsafe.Add(ref buffer._element0, 2) = 3;
ImmutableArray<int> array = ImmutableArray.Create(buffer);
foreach (int i in array)
{
    Console.WriteLine(i);
}

ImmutableList<T>, ImmutableStack<T>, ImmutableQueue<T>, ImmutableHashSet<T>, and ImmutableSortedSet<T> are all similarly attributed such that they all work with collection expressions as well.

Of course, the compiler could actually do a bit better for ImmutableArray<T>. As was previously noted, the compiler is free to optimize these how it sees fit, and we already mentioned the new ImmutableCollectionsMarshal.AsImmutableArray method. As I write this, the compiler doesn’t currently employ that method, but in the future the compiler can special-case ImmutableArray<T>, such that it could then generate code equivalent to the following:

ImmutableArray<int> array = ImmutableCollectionsMarshal.AsImmutableArray(new[] { 1, 2, 3 });

saving on both stack space as well as an extra copy of the data. This is just one of the additional optimizations possible.

In short, collection expressions are intended to be a great way to express the collection you want built, and the compiler will ensure it’s done efficiently.

File I/O

.NET 6 overhauled how file I/O is implemented in .NET, rewriting FileStream, introducing the RandomAccess class, and a multitude of other changes. .NET 8 continues to improve performance with file I/O further.

One of the more interesting ways performance of a system can be improved is cancellation. After all, the fastest work is work you don’t have to do at all, and cancellation is about stopping doing unneeded work. The original patterns for asynchrony in .NET were based on a non-cancelable model (see How Async/Await Really Works in C# for an in-depth history and discussion), and over time as all of that support has shifted to the Task-based model based on CancellationToken, more and more implementations have become fully cancelable as well. As of .NET 7, the vast majority of code paths that accepted a CancellationToken actually respected it, more than just doing an up-front check to see whether cancellation was already requested but then not paying attention to it during the operation. Most of the holdouts have been very corner-case, but there’s one notable exception: FileStreams created without FileOptions.Asynchronous.

FileStream inherited the bifurcated model of asynchrony from Windows, where at the time you open a file handle you need to specify whether it’s being opened for synchronous or asynchronous (“overlapped”) access. A file handle opened for overlapped access requires that all operations be asynchronous, and vice versa: a handle opened for non-overlapped access requires that all operations be synchronous. That causes some friction with FileStream, which exposes both synchronous (e.g. Read) and asynchronous (e.g. ReadAsync) methods, as it means that one set of those needs to emulate the behavior. If the FileStream is opened for asynchronous access, then Read needs to do the operation asynchronously and block waiting for it to complete (a pattern we less-than-affectionately refer to as “sync-over-async”), and if the FileStream is opened for synchronous access, then ReadAsync needs to queue a work item that will do the operation synchronously (“async-over-sync”). Even though that ReadAsync method accepts a CancellationToken, the actual synchronous Read that ends up being invoked as part of a ThreadPool work item hasn’t been cancelable. Now in .NET 8, thanks to dotnet/runtime#87103, it is, at least on Windows.

In .NET 7, PipeStream was fixed for this same case, relying on an internal AsyncOverSyncWithIoCancellation helper that would use the Win32 CancelSynchronousIo to interrupt pending I/O, while also using appropriate synchronization to ensure that only the intended associated work was interrupted and not work that happened to be running on the same worker thread before or after (Linux already fully supported PipeStream cancellation as of .NET 5). This PR adapted that same helper to then be usable as well inside of FileStream on Windows, in order to gain the same benefits. The same PR also further improved the implementation of that helper to reduce allocation and to further streamline the processing, such that the existing support in PipeStream gets leaner as well.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.IO.Pipes;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly CancellationTokenSource _cts = new();
    private readonly byte[] _buffer = new byte[1];
    private AnonymousPipeServerStream _server;
    private AnonymousPipeClientStream _client;

    [GlobalSetup]
    public void Setup()
    {
        _server = new AnonymousPipeServerStream(PipeDirection.Out);
        _client = new AnonymousPipeClientStream(PipeDirection.In, _server.ClientSafePipeHandle);
    }

    [GlobalCleanup]
    public void Cleanup()
    {
        _server.Dispose();
        _client.Dispose();
    }

    [Benchmark(OperationsPerInvoke = 100_000)]
    public async Task ReadWriteAsync()
    {
        for (int i = 0; i < 100_000; i++)
        {
            ValueTask<int> read = _client.ReadAsync(_buffer, _cts.Token);
            await _server.WriteAsync(_buffer, _cts.Token);
            await read;
        }
    }
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- |
| ReadWriteAsync | .NET 7.0 | 3.863 us | 1.00 | 181 B | 1.00 |
| ReadWriteAsync | .NET 8.0 | 2.941 us | 0.76 | - | 0.00 |

Interacting with paths via Path and File has also improved in various ways. dotnet/runtime#74855 improved Path.GetTempFileName() on Windows both functionally and for performance; in many situations in the past, we’ve made the behavior of .NET on Unix match the behavior of .NET on Windows, but this PR interestingly goes in the other direction. On Unix, Path.GetTempFileName() uses the libc mkstemp function, which accepts a template that must end in “XXXXXX” (6 Xs), and it populates those Xs with random values, using the resulting name for a new file that gets created. On Windows, GetTempFileName() was using the Win32 GetTempFileNameW function, which uses a similar pattern but with only 4 Xs. With the characters Windows will fill in, that enables only 65,536 possible names, and as the temp directory fills up, it becomes more and more likely there will be conflicts, leading to longer and longer times for creating a temp file (it also means that on Windows Path.GetTempFileName() has been limited to creating 65,536 simultaneously-existing files). This PR changes the format on Windows to match that used on Unix, and avoids the use of GetTempFileNameW, instead doing the random name assignment and retries-on-conflict itself. The net result is more consistency across OSes, a much larger number of temporary files possible (a billion instead of tens of thousands), as well as a better-performing method:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
// NOTE: The results for this benchmark will vary wildly based on how full the temp directory is.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly List<string> _files = new();

    // NOTE: The performance of this benchmark is highly influenced by what's currently in your temp directory.
    [Benchmark]
    public void GetTempFileName()
    {
        for (int i = 0; i < 1000; i++) _files.Add(Path.GetTempFileName());
    }

    [IterationCleanup]
    public void Cleanup()
    {
        foreach (string path in _files) File.Delete(path);
        _files.Clear();
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| GetTempFileName | .NET 7.0 | 1,947.8 ms | 1.00 |
| GetTempFileName | .NET 8.0 | 276.5 ms | 0.34 |
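
The “pick a random name, retry on conflict” approach is easy to picture in code. What follows is only a sketch of the general idea, not the runtime’s implementation: CreateTempFile, the alphabet, and the tmpXXXXXX.tmp naming are all assumptions I’ve made for illustration. The key point is that FileMode.CreateNew makes the existence check and the creation a single atomic step, so a name collision simply shows up as an exception and another name gets tried.

```csharp
using System;
using System.IO;

// Hypothetical helper (not the runtime's code): creates an empty, uniquely named
// file in the given directory and returns its full path.
static string CreateTempFile(string directory)
{
    const string Alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"; // assumed character set
    Span<char> suffix = stackalloc char[6];                         // the six "X"s

    while (true)
    {
        for (int i = 0; i < suffix.Length; i++)
        {
            suffix[i] = Alphabet[Random.Shared.Next(Alphabet.Length)];
        }

        string path = Path.Combine(directory, "tmp" + new string(suffix) + ".tmp");
        try
        {
            // CreateNew fails if the file already exists, so creation doubles as the conflict check.
            using (new FileStream(path, FileMode.CreateNew)) { }
            return path;
        }
        catch (IOException) when (File.Exists(path))
        {
            // The name was already taken: loop around and try another random suffix.
        }
    }
}

string dir = Directory.CreateTempSubdirectory().FullName;
string file = CreateTempFile(dir);
Console.WriteLine(File.Exists(file)); // True
Directory.Delete(dir, recursive: true);
```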

Path.GetFileName is another on the list of methods that improves, thanks to making use of IndexOf methods. Here, dotnet/runtime#75318 uses LastIndexOf (on Unix, where the only directory separator is '/') or LastIndexOfAny (on Windows, where both '/' and '\' can be a directory separator) to search for the beginning of the file name.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private string _path = Path.Join(Path.GetTempPath(), "SomeFileName.cs");

    [Benchmark]
    public ReadOnlySpan<char> GetFileName() => Path.GetFileName(_path.AsSpan());
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| GetFileName | .NET 7.0 | 9.465 ns | 1.00 |
| GetFileName | .NET 8.0 | 4.733 ns | 0.50 |
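
The shape of that search is simple enough to sketch. FileNameOf below is a hypothetical stand-in I wrote for illustration, not the actual Path.GetFileName source, and it deliberately ignores other details a real implementation has to care about (such as volume separators); it just shows a single vectorizable LastIndexOfAny call replacing a hand-written backwards loop.

```csharp
using System;

// Hypothetical helper: returns everything after the last '/' or '\' in the path.
static ReadOnlySpan<char> FileNameOf(ReadOnlySpan<char> path)
{
    // Returns -1 when there is no separator, in which case Slice(0) yields the whole input.
    int lastSeparator = path.LastIndexOfAny('/', '\\');
    return path.Slice(lastSeparator + 1);
}

Console.WriteLine(FileNameOf(@"C:\temp/SomeFileName.cs").ToString()); // SomeFileName.cs
Console.WriteLine(FileNameOf("SomeFileName.cs").ToString());          // SomeFileName.cs
```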

Related to File and Path, various methods on Environment also return paths. Microsoft.Extensions.Hosting.HostingHostBuilderExtensions had been using Environment.GetFolderPath(Environment.SpecialFolder.System) to get the system path, but this was leading to noticeable overhead when starting up an ASP.NET application. dotnet/runtime#83564 changed this to use Environment.SystemDirectory directly, which on Windows takes advantage of a much more efficient path (and results in simpler code), but then dotnet/runtime#83593 also fixed Environment.GetFolderPath(Environment.SpecialFolder.System) on Windows to use Environment.SystemDirectory, such that its performance accrues to the higher-level uses as well.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    [Benchmark]
    public string GetFolderPath() => Environment.GetFolderPath(Environment.SpecialFolder.System);
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- |
| GetFolderPath | .NET 7.0 | 1,560.87 ns | 1.00 | 88 B | 1.00 |
| GetFolderPath | .NET 8.0 | 45.76 ns | 0.03 | 64 B | 0.73 |

dotnet/runtime#73983 improves DirectoryInfo and FileInfo, making the FileSystemInfo.Name property lazy. Previously when constructing the info object if only the full name existed (and not the name of just the directory or file itself), the constructor would promptly create the Name string, even if the info object is never used (as is often the case when it’s returned from a method like CreateDirectory). Now, that Name string is lazily created on first use of the Name property.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly string _path = Environment.CurrentDirectory;

    [Benchmark]
    public DirectoryInfo Create() => new DirectoryInfo(_path);
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- |
| Create | .NET 7.0 | 225.0 ns | 1.00 | 240 B | 1.00 |
| Create | .NET 8.0 | 170.1 ns | 0.76 | 200 B | 0.83 |

File.Copy has gotten a whole lot faster on macOS, thanks to dotnet/runtime#79243 from @hamarb123. File.Copy now employs the OS’s clonefile function (if available) to perform the copy, and if both the source and destination are on the same volume, clonefile creates a copy-on-write clone of the file in the destination; this makes the copy at the OS level much faster, with the majority of the cost of actually duplicating the data incurred only if one of the files is subsequently written to.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "Min", "Max")]
public class Tests
{
    private string _source;
    private string _dest;

    [GlobalSetup]
    public void Setup()
    {
        _source = Path.GetTempFileName();
        File.WriteAllBytes(_source, Enumerable.Repeat((byte)42, 1_000_000).ToArray());
        _dest = Path.GetRandomFileName();
    }

    [Benchmark]
    public void FileCopy() => File.Copy(_source, _dest, overwrite: true);

    [GlobalCleanup]
    public void Cleanup()
    {
        File.Delete(_source);
        File.Delete(_dest);
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| FileCopy | .NET 7.0 | 1,624.8 us | 1.00 |
| FileCopy | .NET 8.0 | 366.7 us | 0.23 |

Some more specialized changes have been incorporated as well. TextWriter is a core abstraction for writing text to an arbitrary destination, but sometimes you want that destination to be nowhere, a la /dev/null on Linux. For this, TextWriter provides the TextWriter.Null property, which returns a TextWriter instance that nops on all of its members. Or, at least that’s the visible behavior. In practice, only a subset of its members were actually overridden, which meant that although nothing would end up being output, some work might still be incurred and then the fruits of that labor thrown away. dotnet/runtime#83293 ensures that all of the writing methods are overridden in order to do away with all of that wasted work.

Further, one of the places TextWriter ends up being used is in Console, where Console.SetOut allows you to replace stdout with your own writer, at which point all of the writing methods on Console output to that TextWriter instead. In order to provide thread-safety of writes, Console synchronizes access to the underlying writer, but if the writer is doing nops anyway, there’s no need for that synchronization. dotnet/runtime#83296 does away with it in that case, such that if you want to temporarily silence Console, you can simply set its output to go to TextWriter.Null, and the overhead of operations on Console will be minimized.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly string _value = "42";

    [GlobalSetup]
    public void Setup() => Console.SetOut(TextWriter.Null);

    [Benchmark]
    public void WriteLine() => Console.WriteLine("The value was {0}", _value);
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- |
| WriteLine | .NET 7.0 | 80.361 ns | 1.00 | 56 B | 1.00 |
| WriteLine | .NET 8.0 | 1.743 ns | 0.02 | - | 0.00 |
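
If you do want to temporarily silence Console in your own code, one way to go about it looks something like the following. RunSilenced is a hypothetical helper name of my own; the only APIs it relies on are Console.Out, Console.SetOut, and TextWriter.Null.

```csharp
using System;
using System.IO;

// Hypothetical helper: runs an action with Console output discarded,
// then restores whatever writer was in place before.
static void RunSilenced(Action action)
{
    TextWriter previous = Console.Out;
    Console.SetOut(TextWriter.Null);
    try
    {
        action();
    }
    finally
    {
        Console.SetOut(previous);
    }
}

RunSilenced(() => Console.WriteLine("The value was {0}", 42)); // discarded
Console.WriteLine("Output is back");                           // written as usual
```

Restoring the writer in a finally block keeps the console usable even if the silenced action throws.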

Networking

Networking is the heart and soul of most modern services and applications, which makes it all the more important that .NET’s networking stack shine.

Networking Primitives

Let’s start at the bottom of the networking stack, looking at some primitives. Most of these improvements are around formatting, parsing, and manipulation as bytes. Take dotnet/runtime#75872, for example, which improved the performance of various such operations on IPAddress. IPAddress stores a uint that’s used as the address when it’s representing an IPv4 address, and it stores a ushort[8] that’s used when it’s representing an IPv6 address. A ushort is two bytes, so a ushort[8] is 16 bytes, or 128 bits. “128 bits” is a very convenient number when performing certain operations, as such a value can be manipulated as a Vector128<> (accelerating computation on systems that accelerate it, which is most). This PR takes advantage of that to optimize common operations with an IPAddress. The IPAddress constructor, for example, is handed a ReadOnlySpan<byte> for an IPv6 address, which it needs to read into its ushort[8]; previously that was done with a loop over the input, but now it’s handled with a single vector: load the single vector, possibly reverse the endianness (which can be done in just three instructions: OR together the vector shifted left by one byte and shifted right by one byte), and store it.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly IPAddress _addr = IPAddress.Parse("2600:141b:13:781::356e");
    private readonly byte[] _ipv6Bytes = IPAddress.Parse("2600:141b:13:781::356e").GetAddressBytes();

    [Benchmark] public IPAddress NewIPv6() => new IPAddress(_ipv6Bytes, 0);
    [Benchmark] public bool WriteBytes() => _addr.TryWriteBytes(_ipv6Bytes, out _);
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| NewIPv6 | .NET 7.0 | 36.720 ns | 1.00 |
| NewIPv6 | .NET 8.0 | 16.715 ns | 0.45 |
| WriteBytes | .NET 7.0 | 14.443 ns | 1.00 |
| WriteBytes | .NET 8.0 | 2.036 ns | 0.14 |
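
The endianness reversal is worth seeing on its own. This is a small sketch of the shift-and-OR idea using the cross-platform Vector128 helpers; SwapBytes is a hypothetical name of mine, and the actual IPAddress code may be structured differently. Each of the eight 16-bit lanes is shifted left by one byte and right by one byte, and ORing the two results swaps the bytes within every lane.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper: swaps the two bytes inside each of the eight 16-bit lanes.
static Vector128<ushort> SwapBytes(Vector128<ushort> v) =>
    Vector128.ShiftLeft(v, 8) | Vector128.ShiftRightLogical(v, 8);

// The eight 16-bit groups of 2600:141b:13:781::356e.
ushort[] groups = { 0x2600, 0x141b, 0x0013, 0x0781, 0, 0, 0, 0x356e };
Vector128<ushort> swapped = SwapBytes(Vector128.Create(groups));

Console.WriteLine($"{swapped.GetElement(0):x4} {swapped.GetElement(7):x4}"); // 0026 6e35
```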

IPAddress now also implements ISpanFormattable and IUtf8SpanFormattable, thanks to dotnet/runtime#82913 and dotnet/runtime#84487. That means, for example, that using an IPAddress as part of string interpolation no longer needs to allocate an intermediate string. As part of this, some changes were made to IPAddress formatting to streamline it. It’s a bit harder to measure these changes, though, because IPAddress caches a string it creates, such that subsequent ToString calls just return the previous string created. To work around that, we can use private reflection to null out the field (never do this in real code; private reflection against the core libraries is very much unsupported).

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net;
using System.Reflection;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private IPAddress _address;
    private FieldInfo _toStringField;

    [GlobalSetup]
    public void Setup()
    {
        _address = IPAddress.Parse("123.123.123.123");
        _toStringField = typeof(IPAddress).GetField("_toString", BindingFlags.NonPublic | BindingFlags.Instance);
    }

    [Benchmark]
    public string NonCachedToString()
    {
        _toStringField.SetValue(_address, null);
        return _address.ToString();
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| NonCachedToString | .NET 7.0 | 92.63 ns | 1.00 |
| NonCachedToString | .NET 8.0 | 75.53 ns | 0.82 |

Unfortunately, such use of reflection has a non-trivial amount of overhead associated with it, which then decreases the perceived benefit from the improvement. Instead, we can use reflection emit either directly or via System.Linq.Expressions to emit a custom helper that makes it less expensive to null out that private field.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Linq.Expressions;
using System.Net;
using System.Reflection;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private IPAddress _address;
    private Action<IPAddress, string> _setter;

    [GlobalSetup]
    public void Setup()
    {
        _address = IPAddress.Parse("123.123.123.123");
        _setter = BuildSetter<IPAddress, string>(typeof(IPAddress).GetField("_toString", BindingFlags.NonPublic | BindingFlags.Instance));
    }

    [Benchmark]
    public string NonCachedToString()
    {
        _setter(_address, null);
        return _address.ToString();
    }

    private static Action<TSource, TArg> BuildSetter<TSource, TArg>(FieldInfo field)
    {
        ParameterExpression target = Expression.Parameter(typeof(TSource));
        ParameterExpression value = Expression.Parameter(typeof(TArg));
        return Expression.Lambda<Action<TSource, TArg>>(
            Expression.Assign(Expression.Field(target, field), value),
            target,
            value).Compile();
    }
}

| Method | Runtime | Mean | Ratio |
| --- | --- | --- | --- |
| NonCachedToString | .NET 7.0 | 48.39 ns | 1.00 |
| NonCachedToString | .NET 8.0 | 36.30 ns | 0.75 |

But .NET 8 actually includes a feature that streamlines this; the feature’s primary purpose is in support of scenarios like source generators with Native AOT, but it’s useful for this kind of benchmarking, too. The new UnsafeAccessor attribute (introduced in and supported by dotnet/runtime#86932, dotnet/runtime#88626, and dotnet/runtime#88925) lets you define an extern method that bypasses visibility. In this case, I’ve used it to get a ref to the private field, at which point I can just assign null through the ref.

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly IPAddress _address = IPAddress.Parse("123.123.123.123");

    [Benchmark]
    public string NonCachedToString()
    {
        _toString(_address) = null;
        return _address.ToString();

        [UnsafeAccessor(UnsafeAccessorKind.Field, Name = "_toString")]
        extern static ref string _toString(IPAddress c);
    }
}

| Method | Mean |
| --- | --- |
| NonCachedToString | 34.42 ns |

Uri is another networking primitive that saw multiple improvements. dotnet/runtime#80469 removed a variety of allocations, primarily around substrings that were instead replaced by spans. dotnet/runtime#90087 replaced unsafe code as part of scheme parsing with safe span-based code, making it both safer and faster. But dotnet/runtime#88012 is more interesting, as it made Uri implement ISpanFormattable. That means that when, for example, a Uri is used as an argument to an interpolated string, the Uri can now format itself directly to the underlying buffer rather than needing to allocate a temporary string that’s then added in. This can be particularly useful for reducing the costs of logging and other forms of telemetry. It’s a little difficult to isolate just the formatting aspect of a Uri for benchmarking purposes, as Uri caches information gathered in the process, but even with constructing a new one each time you can see gains:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    [Benchmark]
    public string Interpolate() => $"Uri: {new Uri("http://dot.net")}";
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- |
| Interpolate | .NET 7.0 | 356.3 ns | 1.00 | 296 B | 1.00 |
| Interpolate | .NET 8.0 | 278.4 ns | 0.78 | 240 B | 0.81 |
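
String interpolation picks up the span-based formatting automatically, but you can also call it directly. As a small sketch (the 64-character buffer size is an arbitrary choice for this example), here is a Uri formatting itself into a stack-allocated buffer via the TryFormat method that comes with the ISpanFormattable support in .NET 8:

```csharp
using System;

Uri uri = new Uri("http://dot.net");

// Format directly into caller-provided storage; no intermediate string is needed
// to get the characters into the buffer.
Span<char> buffer = stackalloc char[64];
bool formatted = uri.TryFormat(buffer, out int charsWritten);

// ToString() here is only for displaying the result of the sketch.
Console.WriteLine(formatted);
Console.WriteLine(buffer.Slice(0, charsWritten).ToString());
```

TryFormat returns false rather than throwing if the destination is too small, so real code would fall back to a larger buffer in that case.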

Other networking primitives improved in other ways. dotnet/runtime#82095 reduced the overhead of the GetHashCode methods of several networking types, like Cookie. Cookie.GetHashCode was previously allocating and is now allocation-free. Same for DnsEndPoint.GetHashCode.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly Cookie _cookie = new Cookie("Cookie", "Monster");
    private readonly DnsEndPoint _dns = new DnsEndPoint("localhost", 80);

    [Benchmark]
    public int CookieHashCode() => _cookie.GetHashCode();

    [Benchmark]
    public int DnsHashCode() => _dns.GetHashCode();
}

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- |
| CookieHashCode | .NET 7.0 | 105.30 ns | 1.00 | 160 B | 1.00 |
| CookieHashCode | .NET 8.0 | 22.51 ns | 0.21 | - | 0.00 |
| DnsHashCode | .NET 7.0 | 136.78 ns | 1.00 | 192 B | 1.00 |
| DnsHashCode | .NET 8.0 | 12.92 ns | 0.09 | - | 0.00 |

And HttpUtility improved in dotnet/runtime#78240. This is a quintessential example of code doing its own manual looping looking for something (in this case, the four characters that require encoding) when it could have instead just used a well-placed IndexOfAny.

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    using System.Web;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    public class Tests
    {
        // The input contains nothing that needs encoding, so this measures the search itself.
        [Benchmark]
        public string HtmlAttributeEncode() =>
            HttpUtility.HtmlAttributeEncode("To encode, or not to encode: that is the question");
    }

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| HtmlAttributeEncode | .NET 7.0 | 32.688 ns | 1.00 |
| HtmlAttributeEncode | .NET 8.0 | 6.734 ns | 0.21 |
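The manual-loop-versus-IndexOfAny contrast can be sketched as follows. This is an illustration rather than the actual HttpUtility code: the helper names are hypothetical, and the set of four characters (double quote, apostrophe, ampersand, less-than) is my assumption about which characters attribute encoding cares about.

```csharp
using System;

// Hand-written scan: one branchy comparison chain per character.
int IndexOfFirstToEncodeManual(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        char c = s[i];
        if (c == '"' || c == '\'' || c == '&' || c == '<')
            return i;
    }
    return -1;
}

// Same search expressed with IndexOfAny, which the runtime can vectorize.
int IndexOfFirstToEncode(string s) => s.AsSpan().IndexOfAny("\"'&<");

string clean = "To encode, or not to encode: that is the question";
string dirty = "a < b";

Console.WriteLine(IndexOfFirstToEncode(clean));
Console.WriteLine(IndexOfFirstToEncode(dirty));
```

When the search returns -1, the caller can hand back the original string without allocating, which is the common case the benchmark above exercises.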

Moving up the stack to System.Net.Sockets, there are some nice improvements in .NET 8 here as well.

Sockets

dotnet/runtime#86524 and dotnet/runtime#89808 are for Windows only because the problem they address doesn’t manifest on other operating systems, due to how asynchronous operations are implemented on the various platforms.

On Unix operating systems, the typical approach to asynchrony is to put the socket into non-blocking mode. Issuing an operation like recv (Socket.Receive{Async}) when there’s nothing to receive then fails immediately with an errno value of EWOULDBLOCK or EAGAIN, informing the caller that no data was available to receive yet and it’s not going to wait for said data because it’s been told not to. At that point, the caller can choose how it wants to wait for data to become available. Socket does what many other systems do, which is to use epoll (on Linux) or kqueues (on macOS). These mechanisms allow for a single thread to wait efficiently for any number of registered file descriptors to signal that something has changed. As such, Socket has one or more dedicated threads that sit in a wait loop, waiting on the epoll/kqueue to signal that there’s something to do, and when there is, queueing off the associated work, and then looping around to wait for the next notification. In the case of a ReceiveAsync, that queued work will end up reissuing the recv, which will now succeed as data will be available. The interesting thing here is that during that interim period while waiting for data to become available, there was no pending call from .NET to recv or anything else that would require a managed buffer (e.g. an array) be available. That’s not the case on Windows…

On Windows, the OS provides dedicated asynchronous APIs (“overlapped I/O”), with ReceiveAsync being a thin wrapper around the Win32 WSARecv function. WSARecv accepts a pointer to the buffer to write into and a pointer to a callback that will be invoked when the operation has completed. That means that while waiting for data to be available, WSARecv actually needs a pointer to the buffer it’ll write the data into (unless 0 bytes have been requested, which we’ll talk more about in a bit). In .NET world, buffers are typically on the managed heap, which means they can be moved around by the GC, and thus in order to pass a pointer to such a buffer down to WSARecv, that buffer needs to be “pinned,” telling the GC “do not move this.” For synchronous operations, such pinning is best accomplished with the C# fixed keyword; for asynchronous operations, GCHandle or something that wraps it (like Memory.Pin and MemoryHandle) are the answers. So, on Windows, Socket uses a GCHandle for any buffers it supplies to the OS to span an asynchronous operation’s lifetime.

For the last 20 years, though, it’s been overaggressive in doing so. There’s a buffer passed to various Win32 methods, including WSAConnect (Socket.ConnectAsync), to represent the target IP address. Even though these are asynchronous operations, it turns out that data is only required as part of the synchronous part of the call to these APIs; only a ReceiveFromAsync operation (which is typically only used with connectionless protocols, and in particular UDP) that receives not only payload data but also the sender’s address actually needs the address buffer pinned over the lifetime of the operation. Socket was pinning the buffer using a GCHandle, and in fact doing so for the lifetime of the Socket, even though a GCHandle wasn’t actually needed at all for these calls, and a fixed would suffice around just the Win32 call itself. The first PR fixed that, the net effect of which is that a GCHandle that was previously pinning a buffer for the lifetime of every Socket on Windows then only did so for Sockets issuing ReceiveFromAsync calls. The second PR then fixed ReceiveFromAsync, using a native buffer instead of a managed one that would need to be permanently pinned. The primary benefit of these changes is that it helps to avoid a lot of fragmentation that can result at scale in the managed heap. We can see this most easily by looking at the runtime’s tracing, which I consume in this example via an EventListener:

    // dotnet run -c Release -f net7.0
    // dotnet run -c Release -f net8.0

    using System.Net;
    using System.Net.Sockets;
    using System.Diagnostics.Tracing;

    using var setCountListener = new GCHandleListener();
    Thread.Sleep(1000);

    using Socket listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
    listener.Listen();

    // Create and connect 10,000 client sockets, one at a time.
    for (int i = 0; i < 10_000; i++)
    {
        using Socket client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        await client.ConnectAsync(listener.LocalEndPoint!);
        listener.Accept().Dispose();
    }

    Thread.Sleep(1000);
    Console.WriteLine($"{Environment.Version} GCHandle count: {setCountListener.SetGCHandleCount}");

    sealed class GCHandleListener : EventListener
    {
        public int SetGCHandleCount = 0;

        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            // 0x2 is the runtime provider's GCHandle keyword.
            if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
                EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x2);
        }

        protected override void OnEventWritten(EventWrittenEventArgs eventData)
        {
            // https://learn.microsoft.com/dotnet/fundamentals/diagnostics/runtime-garbage-collection-events#setgchandle-event
            // Event 30 is SetGCHandle; a handle kind of 3 means "pinned".
            if (eventData.EventId == 30 && eventData.Payload![2] is (uint)3)
                Interlocked.Increment(ref SetGCHandleCount);
        }
    }

When I run this on .NET 7 on Windows, I get this:

7.0.9 GCHandle count: 10000

When I run this on .NET 8, I get this:

8.0.0 GCHandle count: 0

Nice.
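For readers who haven’t used the pinning mechanism being counted here, the following is a minimal sketch of pinning a managed buffer with GCHandle for the duration of an operation. It is not Socket’s implementation; the Marshal.WriteByte call is just a stand-in I’m using for native code that would write through the raw address at some later point.

```csharp
using System;
using System.Runtime.InteropServices;

byte[] buffer = new byte[16];

// Pin the array so the GC cannot relocate it while something holds its address.
GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
try
{
    IntPtr address = handle.AddrOfPinnedObject();

    // Stand-in for an OS call that writes into the buffer through the pointer.
    Marshal.WriteByte(address, 0, 42);
}
finally
{
    // Unpin as soon as the operation completes; a long-lived pinned handle
    // is what leaves immovable objects scattered through the heap.
    handle.Free();
}

Console.WriteLine(buffer[0]);
```

Each GCHandle.Alloc with GCHandleType.Pinned is the kind of event the listener above counts, which is why keeping the pin short-lived, or avoiding it entirely, matters at scale.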

I mentioned UDP above, with ReceiveFromAsync. We’ve invested a lot over the last several years in making the networking stack in .NET very efficient… for TCP. While most of the improvements there accrue to UDP as well, UDP has additional costs that hadn’t been addressed and that made it suboptimal from a performance perspective. The primary issues there are now addressed in .NET 8, thanks to dotnet/runtime#88970 and dotnet/runtime#90086. The key problem here with the UDP-related APIs, namely SendTo{Async} and ReceiveFrom{Async}, is that the API is based on EndPoint but the core implementation is based on SocketAddress. Every call to SendToAsync, for example, would accept the provided EndPoint and then call EndPoint.Serialize to produce a SocketAddress, which internally has its own byte[]; that byte[] contains the address actually passed down to the underlying OS APIs. The inverse happens on the ReceiveFromAsync side: the received data includes an address that would be deserialized into an EndPoint which is then returned to the consumer. You can see these allocations show up by profiling a simple repro:

    using System.Net;
    using System.Net.Sockets;

    var client = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
    var server = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);

    EndPoint endpoint = new IPEndPoint(IPAddress.Loopback, 12345);
    server.Bind(endpoint);

    Memory<byte> buffer = new byte[1];
    for (int i = 0; i < 10_000; i++)
    {
        // Start the receive first, then send the datagram that completes it.
        ValueTask<SocketReceiveFromResult> result = server.ReceiveFromAsync(buffer, endpoint);
        await client.SendToAsync(buffer, endpoint);
        await result;
    }

The .NET allocation profiler in Visual Studio shows this:

[Image: Allocations in a UDP benchmark in .NET 7]

So for each send/receive pair, we see three SocketAddresses which in turn leads to three byte[]s, and an IPEndPoint which in turn leads to an IPAddress. These costs are very difficult to address efficiently purely in implementation, as they’re directly related to what’s surfaced in the corresponding APIs. Even so, with the exact same code, it does improve a bit in .NET 8:

[Image: Allocations in a UDP benchmark in .NET 8]

So with zero code changes, we’ve managed to eliminate one of the SocketAddress allocations and its associated byte[], and to shrink the size of the remaining instances (in part due to dotnet/runtime#78860). But, we can do much better…

.NET 8 introduces a new set of overloads. In .NET 7, we had these:

    public int SendTo(byte[] buffer, int offset, int size, SocketFlags socketFlags, EndPoint remoteEP);
    public int ReceiveFrom(byte[] buffer, int offset, int size, SocketFlags socketFlags, ref EndPoint remoteEP);
    public ValueTask<int> SendToAsync(ReadOnlyMemory<byte> buffer, SocketFlags socketFlags, EndPoint remoteEP, CancellationToken cancellationToken = default);
    public ValueTask<SocketReceiveFromResult> ReceiveFromAsync(Memory<byte> buffer, SocketFlags socketFlags, EndPoint remoteEndPoint, CancellationToken cancellationToken = default);

and now in .NET 8 we also have these:

    public int SendTo(ReadOnlySpan<byte> buffer, SocketFlags socketFlags, SocketAddress socketAddress);
    public int ReceiveFrom(Span<byte> buffer, SocketFlags socketFlags, SocketAddress receivedAddress);
    public ValueTask<int> SendToAsync(ReadOnlyMemory<byte> buffer, SocketFlags socketFlags, SocketAddress socketAddress, CancellationToken cancellationToken = default);
    public ValueTask<int> ReceiveFromAsync(Memory<byte> buffer, SocketFlags socketFlags, SocketAddress receivedAddress, CancellationToken cancellationToken = default);

Key things to note:

  • The new APIs no longer work in terms of EndPoint. They now operate on SocketAddress directly. That means the implementation no longer needs to call EndPoint.Serialize to produce a SocketAddress and can just use the provided one directly.
  • There’s no more ref EndPoint argument in the synchronous ReceiveFrom and no more SocketReceiveFromResult in the asynchronous ReceiveFromAsync. Both of these existed in order to pass back an IPEndPoint that represented the address of the received data’s sender. SocketAddress, however, is just a strongly-typed wrapper around a byte[] buffer, which means these methods can just mutate that provided instance, avoiding needing to instantiate anything to represent the received address.
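Since the new overloads hand back the sender’s address only as a mutated SocketAddress, a caller that occasionally needs a real EndPoint can convert on demand. The following is a small sketch under that assumption; it uses only EndPoint.Serialize and EndPoint.Create and doesn’t touch the network, and the commented-out receive call marks where the address would be overwritten in a real program.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

var endpoint = new IPEndPoint(IPAddress.Loopback, 12345);

// One allocation up front; the same instance can be passed to every receive call.
SocketAddress address = endpoint.Serialize();

// ... server.ReceiveFrom(buffer, SocketFlags.None, address) would overwrite
// the contents of 'address' in place with the sender's address ...

// Materialize an EndPoint only when one is actually needed (e.g. for logging).
EndPoint sender = endpoint.Create(address);
Console.WriteLine($"{address.Family}: {sender}");
```

The design trade-off is that the allocation of an IPEndPoint and IPAddress moves from every receive to only those call sites that ask for it.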

Let’s change our code sample to use these new APIs:

    using System.Net;
    using System.Net.Sockets;

    var client = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
    var server = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);

    EndPoint endpoint = new IPEndPoint(IPAddress.Loopback, 12345);
    server.Bind(endpoint);

    Memory<byte> buffer = new byte[1];

    // Serialize once up front; the receive address is overwritten in place on each receive.
    SocketAddress receiveAddress = endpoint.Serialize();
    SocketAddress sendAddress = endpoint.Serialize();

    for (int i = 0; i < 10_000; i++)
    {
        ValueTask<int> result = server.ReceiveFromAsync(buffer, SocketFlags.None, receiveAddress);
        await client.SendToAsync(buffer, SocketFlags.None, sendAddress);
        await result;
    }

When I profile that, and again look for objects created at least once per iteration, I now see this:

[Image: Allocations in a UDP benchmark in .NET 8 with new overloads]

That’s not a mistake; I didn’t accidentally crop the screenshot incorrectly. It’s empty because there are no allocations per iteration; the whole program incurs only three SocketAddress allocations as part of the up-front setup. We can see that more clearly with a standard BenchmarkDotNet repro:

    // dotnet run -c Release -f net8.0 --filter "*"

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    using System.Net;
    using System.Net.Sockets;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        private readonly Memory<byte> _buffer = new byte[1];
        SocketAddress _sendAddress, _receiveAddress;
        IPEndPoint _ep;
        Socket _client, _server;

        [GlobalSetup]
        public void Setup()
        {
            _client = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
            _server = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);

            _ep = new IPEndPoint(IPAddress.Loopback, 12345);
            _server.Bind(_ep);

            _sendAddress = _ep.Serialize();
            _receiveAddress = _ep.Serialize();
        }

        // Baseline: the existing EndPoint-based overloads.
        [Benchmark(OperationsPerInvoke = 1_000, Baseline = true)]
        public async Task ReceiveFromSendToAsync_EndPoint()
        {
            for (int i = 0; i < 1_000; i++)
            {
                var result = _server.ReceiveFromAsync(_buffer, SocketFlags.None, _ep);
                await _client.SendToAsync(_buffer, SocketFlags.None, _ep);
                await result;
            }
        }

        // The new SocketAddress-based overloads.
        [Benchmark(OperationsPerInvoke = 1_000)]
        public async Task ReceiveFromSendToAsync_SocketAddress()
        {
            for (int i = 0; i < 1_000; i++)
            {
                var result = _server.ReceiveFromAsync(_buffer, SocketFlags.None, _receiveAddress);
                await _client.SendToAsync(_buffer, SocketFlags.None, _sendAddress);
                await result;
            }
        }
    }

| Method | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|
| ReceiveFromSendToAsync_EndPoint | 32.48 us | 1.00 | 216 B | 1.00 |
| ReceiveFromSendToAsync_SocketAddress | 31.78 us | 0.98 | - | 0.00 |

TLS

Moving up the stack further, SslStream has received some love in this release. While in previous releases work was done to reduce allocation, .NET 8 sees it reduced further:

  • dotnet/runtime#74619 avoids some allocations related to ALPN. Application-Layer Protocol Negotiation is a mechanism that allows higher-level protocols to piggyback on the roundtrips already being performed as part of a TLS handshake. It’s used by an HTTP client and server to negotiate which HTTP version to use (e.g. HTTP/2 or HTTP/1.1). Previously, the implementation would end up allocating a byte[] for use with this HTTP version selection, but now with this PR, the implementation precomputes byte[]s for the most common protocol selections, avoiding the need to re-allocate those byte[]s on each new connection.
  • dotnet/runtime#81096 removes a delegate allocation by moving some code around between the main SslStream implementation and the Platform Abstraction Layer (PAL) that’s used to handle OS-specific code (everything in the SslStream layer is compiled into System.Net.Security.dll regardless of OS, and then depending on the target OS, a different version of the SslStreamPal class is compiled in).
  • dotnet/runtime#84690 from @am11 avoids a gigantic Dictionary<TlsCipherSuite, TlsCipherSuiteData> that was being created to enable querying for information about a particular cipher suite for use with TLS. Instead of a dictionary mapping a TlsCipherSuite enum to a TlsCipherSuiteData struct (which contained details like an ExchangeAlgorithmType enum value, a CipherAlgorithmType enum value, an int CipherAlgorithmStrength, etc.), a switch statement is used, mapping that same TlsCipherSuite enum to an int that’s packed with all the same information. This not only avoids the run-time costs associated with allocating that dictionary and populating it, it also shaves almost 20 KB off a published Native AOT binary, due to all of the code that was necessary to populate the dictionary. dotnet/runtime#84921 from @am11 uses a similar switch for well-known OIDs.
  • dotnet/runtime#86163 changed an internal ProtocolToken class into a struct, passing it around by ref instead.
  • dotnet/runtime#74695 avoids some SafeHandle allocation in interop as part of certificate handling on Linux. SafeHandles are a valuable reliability feature in .NET: they wrap a native handle / file descriptor, providing the finalizer that ensures the resource isn’t leaked, but also providing ref counting to ensure that the resource isn’t closed while it’s still being used, leading to use-after-free and handle recycling bugs. They’re particularly helpful when a handle or file descriptor needs to be passed around and shared between multiple components, often as part of some larger object model (e.g. a FileStream wraps a SafeFileHandle). However, in some cases they’re unnecessary overhead. If you have a pattern like:

        SafeHandle handle = GetResource();
        try { Use(handle); }
        finally { handle.Dispose(); }

    such that the resource is provably used and freed correctly, you can avoid the SafeHandle and instead just use the resource directly:

        IntPtr handle = GetResource();
        try { Use(handle); }
        finally { Free(handle); }

    thereby saving on the allocation of a finalizable object (which is more expensive than a normal allocation as synchronization is required to add that object to a finalization queue in the GC) as well as on ref-counting overhead associated with using a SafeHandle in interop.
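The dictionary-to-switch change for cipher suites described in the list above can be sketched roughly like this. The bit layout, the "kind" codes, and the helper names are all hypothetical and chosen only for illustration; the actual packing used in the runtime may differ. The TlsCipherSuite enum members themselves are real.

```csharp
using System;
using System.Net.Security;

// Hypothetical layout for this sketch:
//   bits 0-15  : cipher strength in bits
//   bits 16-23 : a made-up cipher kind code (1 = AES-GCM, 2 = ChaCha20-Poly1305)
int Pack(int kind, int strength) => (kind << 16) | strength;
int Kind(int packed) => (packed >> 16) & 0xFF;
int Strength(int packed) => packed & 0xFFFF;

// A switch replaces the dictionary lookup: no dictionary to allocate or
// populate at startup, and no per-entry struct instances.
int GetPackedData(TlsCipherSuite suite) => suite switch
{
    TlsCipherSuite.TLS_AES_128_GCM_SHA256 => Pack(kind: 1, strength: 128),
    TlsCipherSuite.TLS_AES_256_GCM_SHA384 => Pack(kind: 1, strength: 256),
    TlsCipherSuite.TLS_CHACHA20_POLY1305_SHA256 => Pack(kind: 2, strength: 256),
    _ => -1, // unknown suite in this sketch
};

int data = GetPackedData(TlsCipherSuite.TLS_AES_256_GCM_SHA384);
Console.WriteLine($"kind={Kind(data)}, strength={Strength(data)}");
```

In a production version the packed values would more likely be compile-time constants so that the whole switch compiles down to a jump table or lookup with no calls.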

This benchmark repeatedly creates new SslStreams and performs handshakes:

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    using System.Net;
    using System.Net.Security;
    using System.Net.Sockets;
    using System.Runtime.InteropServices;
    using System.Security.Authentication;
    using System.Security.Cryptography;
    using System.Security.Cryptography.X509Certificates;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        private NetworkStream _client, _server;
        private readonly SslServerAuthenticationOptions _options = new SslServerAuthenticationOptions
        {
            ServerCertificateContext = SslStreamCertificateContext.Create(GetCertificate(), null),
        };

        [GlobalSetup]
        public void Setup()
        {
            // Create a connected pair of loopback sockets to carry the TLS traffic.
            using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
            listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
            listener.Listen(1);

            var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
            client.Connect(listener.LocalEndPoint);

            Socket serverSocket = listener.Accept();
            serverSocket.NoDelay = true;

            _server = new NetworkStream(serverSocket, ownsSocket: true);
            _client = new NetworkStream(client, ownsSocket: true);
        }

        [GlobalCleanup]
        public void Cleanup()
        {
            _client.Dispose();
            _server.Dispose();
        }

        [Benchmark]
        public async Task Handshake()
        {
            using var client = new SslStream(_client, leaveInnerStreamOpen: true, delegate { return true; });
            using var server = new SslStream(_server, leaveInnerStreamOpen: true, delegate { return true; });

            await Task.WhenAll(
                client.AuthenticateAsClientAsync("localhost", null, SslProtocols.Tls12, checkCertificateRevocation: false),
                server.AuthenticateAsServerAsync(_options));
        }

        // Creates a self-signed server certificate for "localhost".
        private static X509Certificate2 GetCertificate()
        {
            X509Certificate2 cert;
            using (RSA rsa = RSA.Create())
            {
                var certReq = new CertificateRequest("CN=localhost", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
                certReq.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));
                certReq.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(new OidCollection { new Oid("1.3.6.1.5.5.7.3.1") }, false));
                certReq.CertificateExtensions.Add(new X509KeyUsageExtension(X509KeyUsageFlags.DigitalSignature, false));
                cert = certReq.CreateSelfSigned(DateTimeOffset.UtcNow.AddMonths(-1), DateTimeOffset.UtcNow.AddMonths(1));
                if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
                {
                    cert = new X509Certificate2(cert.Export(X509ContentType.Pfx));
                }
            }
            return cert;
        }
    }

It shows an ~13% reduction in overall allocation as part of the SslStream lifecycle:

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| Handshake | .NET 7.0 | 828.5 us | 1.00 | 7.07 KB | 1.00 |
| Handshake | .NET 8.0 | 769.0 us | 0.93 | 6.14 KB | 0.87 |

My favorite SslStream improvement in .NET 8, though, is dotnet/runtime#87563, which teaches SslStream to do “zero-byte reads” in order to minimize buffer use and pinning. This has been a long time coming, and is the result of multiple users of SslStream reporting significant heap fragmentation.

When a read is issued to SslStream, it in turn needs to issue a read on the underlying Stream; the data it reads has a header, which gets peeled off, and then the remaining data is decrypted and stored into the user’s buffer. Since there’s manipulation of the data read from the underlying Stream, including not giving all of it to the user, SslStream doesn’t just pass the user’s buffer to the underlying Stream, but instead passes its own buffer down. That means it needs a buffer to pass. With performance improvements in recent .NET releases, SslStream rents said buffer on demand from the ArrayPool and returns it as soon as that temporary buffer has been drained of all the data read into it. There are two issues with this, though. On Windows, a buffer is being provided to Socket, which needs to pin the buffer in order to give a pointer to that buffer to the Win32 overlapped I/O operation; that pinning means the GC can’t move the buffer on the heap, which can mean gaps end up being left on the heap that aren’t usable (aka “fragmentation”), and that in turn can lead to sporadic out-of-memory conditions. As noted earlier, the Socket implementation on Linux and macOS doesn’t need to do such pinning, however there’s still a problem here. Imagine you have a thousand open connections, or a million open connections, all of which are sitting in a read waiting for data; even if there’s no pinning, if each of those connections has an SslStream that’s rented a buffer of any meaningful size, that’s a whole lot of wasted memory just sitting there.

An answer to this that .NET has been making more and more use of over the last few years is “zero-byte reads.” If you need to read 100 bytes, rather than handing down your 100-byte buffer, at which point it needs to be pinned, you instead issue a read for 0 bytes, handing down an empty buffer, at which point nothing needs to be pinned. When there’s data available, that zero-byte read completes (without consuming anything), and then you issue the actual read for the 100 bytes, which is much more likely to be synchronously satisfiable at that point. As of .NET 6, SslStream is already capable of passing along zero-byte reads, e.g. if you do sslStream.ReadAsync(emptyBuffer) and it doesn’t have any data buffered already, it’ll in turn issue a zero-byte read on the underlying Stream. However, prior to this change, SslStream itself didn’t create zero-byte reads, e.g. if you did sslStream.ReadAsync(someNonEmptyBuffer) and it didn’t have enough data buffered, it in turn would issue a non-zero-byte read, and we’d be back to pinning per operation at the Socket layer, plus needing a buffer to pass down, which means renting one.

dotnet/runtime#87563 teaches SslStream how to create zero-byte reads. Now when you do sslStream.ReadAsync(someNonEmptyBuffer) and the SslStream doesn’t have enough data buffered, rather than immediately renting a buffer and passing that down, it instead issues a zero-byte read on the underlying Stream. Only once that operation completes does it then proceed to actually rent a buffer and issue another read, this time with the rented buffer. The primary downside to this is a bit more overhead, in that it can lead to an extra syscall; however, our measurements show that overhead to largely be in the noise, with very meaningful upside in reduced fragmentation, working set reduction, and ArrayPool stability.
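The two-step pattern just described can be sketched as a small helper. This is a simplified illustration, not SslStream’s code: ReadWithZeroByteProbeAsync is a hypothetical name, the copy into the destination stands in for the decrypt step, and the MemoryStream in the usage is only there to make the sample runnable (a zero-byte read on it completes immediately, whereas on a stream like NetworkStream it is expected to wait until data is available).

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper showing the shape of the technique.
async ValueTask<int> ReadWithZeroByteProbeAsync(Stream source, Memory<byte> destination)
{
    // Step 1: zero-byte read. No buffer is rented or pinned while waiting for data.
    await source.ReadAsync(Memory<byte>.Empty);

    // Step 2: only now rent a temporary buffer and issue the real read.
    byte[] rented = ArrayPool<byte>.Shared.Rent(destination.Length);
    try
    {
        int read = await source.ReadAsync(rented.AsMemory(0, destination.Length));

        // Stand-in for "decrypt into the user's buffer".
        rented.AsSpan(0, read).CopyTo(destination.Span);
        return read;
    }
    finally
    {
        // Return the buffer right away so idle connections hold nothing.
        ArrayPool<byte>.Shared.Return(rented);
    }
}

var stream = new MemoryStream(new byte[] { 1, 2, 3 });
byte[] target = new byte[8];
int bytesRead = await ReadWithZeroByteProbeAsync(stream, target);
Console.WriteLine(bytesRead);
```

The cost is the extra read call in step 1; the benefit is that a connection parked waiting for data owns no rented array at all.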

The GCHandle reduction on Windows is visible with this app, a variation of one shown earlier:

    // dotnet run -c Release -f net7.0
    // dotnet run -c Release -f net8.0

    using System.Net;
    using System.Net.Security;
    using System.Net.Sockets;
    using System.Runtime.InteropServices;
    using System.Security.Cryptography.X509Certificates;
    using System.Security.Cryptography;
    using System.Diagnostics.Tracing;

    // Create a connected pair of loopback sockets.
    var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
    listener.Listen();
    client.Connect(listener.LocalEndPoint!);
    Socket server = listener.Accept();
    listener.Dispose();

    // Create a self-signed server certificate for "localhost".
    X509Certificate2 cert;
    using (RSA rsa = RSA.Create())
    {
        var certReq = new CertificateRequest("CN=localhost", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
        certReq.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));
        certReq.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(new OidCollection { new Oid("1.3.6.1.5.5.7.3.1") }, false));
        certReq.CertificateExtensions.Add(new X509KeyUsageExtension(X509KeyUsageFlags.DigitalSignature, false));
        cert = certReq.CreateSelfSigned(DateTimeOffset.UtcNow.AddMonths(-1), DateTimeOffset.UtcNow.AddMonths(1));
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            cert = new X509Certificate2(cert.Export(X509ContentType.Pfx));
        }
    }

    var clientStream = new SslStream(new NetworkStream(client, ownsSocket: true), false, delegate { return true; });
    var serverStream = new SslStream(new NetworkStream(server, ownsSocket: true), false, delegate { return true; });

    await Task.WhenAll(
        clientStream.AuthenticateAsClientAsync("localhost", null, false),
        serverStream.AuthenticateAsServerAsync(cert, false, false));

    // Start counting only after the handshake, so only the reads below are measured.
    using var setCountListener = new GCHandleListener();

    Memory<byte> buffer = new byte[1];
    for (int i = 0; i < 100_000; i++)
    {
        ValueTask<int> read = clientStream.ReadAsync(buffer);
        await serverStream.WriteAsync(buffer);
        await read;
    }

    Thread.Sleep(1000);
    Console.WriteLine($"{Environment.Version} GCHandle count: {setCountListener.SetGCHandleCount:N0}");

    sealed class GCHandleListener : EventListener
    {
        public int SetGCHandleCount = 0;

        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            // 0x2 is the runtime provider's GCHandle keyword.
            if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
                EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x2);
        }

        protected override void OnEventWritten(EventWrittenEventArgs eventData)
        {
            // https://learn.microsoft.com/dotnet/fundamentals/diagnostics/runtime-garbage-collection-events#setgchandle-event
            // Event 30 is SetGCHandle; a handle kind of 3 means "pinned".
            if (eventData.EventId == 30 && eventData.Payload[2] is (uint)3)
                Interlocked.Increment(ref SetGCHandleCount);
        }
    }

On .NET 7, this outputs:

7.0.9 GCHandle count: 100,000

whereas on .NET 8, I now get:

8.0.0 GCHandle count: 0

So pretty.

HTTP

The primary consumer of SslStream in .NET itself is the HTTP stack, so let’s move up the stack now to HttpClient, which has seen important gains of its own in .NET 8. As with SslStream, there were a bunch of improvements here that all joined to make for a measurable end-to-end improvement (many of the opportunities here were found as part of improving YARP):

  • dotnet/runtime#74393 streamlined how HTTP/1.1 response headers are parsed, including making better use of IndexOfAny to speed up searching for various delimiters demarcating portions of the response.
  • dotnet/runtime#79525 and dotnet/runtime#79524 restructured buffer management for reading and writing on HTTP/1.1 connections.
  • dotnet/runtime#81251 reduced the size of HttpRequestMessage by 8 bytes and HttpRequestHeaders by 16 bytes (on 64-bit). HttpRequestMessage had a Boolean field that was replaced by using a bit from an existing int field that wasn’t using all of its bits; as the rest of the message’s fields fit neatly into a multiple of 8 bytes, that extra Boolean, even though only a byte in size, required the object to grow by 8 bytes. For HttpRequestHeaders, it already had an optimization where some uncommonly used headers were pushed off into a contingently-allocated array; there were additional rarely used fields that made more sense to be contingent.
  • dotnet/runtime#83640 shrunk the size of various strongly typed HeaderValue types. For example, ContentRangeHeaderValue has three public properties From, To, and Length, all of which are long? aka Nullable<long>. Each of these properties was backed by a Nullable<long> field. Because of packing and alignment, Nullable<long> ends up consuming 16 bytes, 8 bytes for the long and then 8 bytes for the bool indicating whether the nullable has a value (bool is stored as a single byte, but because of alignment and packing, it’s rounded up to 8). Instead of storing these as Nullable<long>, they can just be long, using whether they contain a negative value to indicate whether they were initialized, reducing the size of the object from 72 bytes down to 48 bytes. Similar improvements were made to six other such HeaderValue types.
  • dotnet/runtime#81253 tweaked how “Transfer-Encoding: chunked” is stored internally, special-casing it to avoid several allocations.
  • When Activity is in use in order to enable the correlation of tracing information across end-to-end usage, every HTTP request ends up creating a new Activity.Id, which incurs not only the string for that ID, but also, in the making of it, a temporary string and a temporary string[6] array. dotnet/runtime#86685 removes both of those intermediate allocations by making better use of spans.
  • dotnet/runtime#79484 is specific to HTTP/2 and applies to it similar changes to what was discussed for SslStream: it now rents buffers from the ArrayPool on demand, returning those buffers when idle, and it issues zero-byte reads to the underlying transport Stream. The net result of these changes is it can reduce the memory usage of an idle HTTP/2 connection by up to 80 KB.
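The Nullable<long> observation from the HeaderValue item above is easy to check for yourself. The sketch below prints the two sizes and shows the negative-sentinel encoding; Encode and Decode are hypothetical helper names for illustration, and the approach assumes, as with content ranges and lengths, that valid values are never negative. On typical 64-bit runtimes I’d expect the sizes to print as 8 and 16.

```csharp
using System;
using System.Runtime.CompilerServices;

// Size of the raw value versus the nullable wrapper (value plus has-value flag, padded).
Console.WriteLine($"long:  {Unsafe.SizeOf<long>()} bytes");
Console.WriteLine($"long?: {Unsafe.SizeOf<long?>()} bytes");

// Sentinel encoding: store a plain long, with any negative value meaning "not set".
long Encode(long? value) => value ?? -1;
long? Decode(long raw) => raw >= 0 ? raw : (long?)null;

long storedLength = Encode(1024);
long storedMissing = Encode(null);

Console.WriteLine(Decode(storedLength));             // the original value
Console.WriteLine(Decode(storedMissing) is null);    // no value was set
```

The public property can keep its long? type and do the Decode in its getter, so the saving is invisible to consumers of the API.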

We can use the following simple GET-request benchmark to show how some of these changes accrue to reduced overheads with HttpClient:

    // dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    using System.Net;
    using System.Net.Sockets;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        private static readonly Socket s_listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        private static readonly HttpMessageInvoker s_client = new(new SocketsHttpHandler());
        private static Uri s_uri;

        [Benchmark]
        public async Task HttpGet()
        {
            var m = new HttpRequestMessage(HttpMethod.Get, s_uri);
            using (HttpResponseMessage r = await s_client.SendAsync(m, default))
            using (Stream s = r.Content.ReadAsStream())
                await s.CopyToAsync(Stream.Null);
        }

        [GlobalSetup]
        public void CreateSocketServer()
        {
            s_listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
            s_listener.Listen(int.MaxValue);
            var ep = (IPEndPoint)s_listener.LocalEndPoint;
            s_uri = new Uri($"http://{ep.Address}:{ep.Port}/");

            // Minimal in-process HTTP/1.1 server: reads until the end of the
            // request headers, then writes back a fixed response.
            Task.Run(async () =>
            {
                while (true)
                {
                    Socket s = await s_listener.AcceptAsync();
                    _ = Task.Run(() =>
                    {
                        using (var ns = new NetworkStream(s, true))
                        {
                            byte[] buffer = new byte[1024];
                            int totalRead = 0;
                            while (true)
                            {
                                int read = ns.Read(buffer, totalRead, buffer.Length - totalRead);
                                if (read == 0) return;
                                totalRead += read;
                                if (buffer.AsSpan(0, totalRead).IndexOf("\r\n\r\n"u8) == -1)
                                {
                                    if (totalRead == buffer.Length) Array.Resize(ref buffer, buffer.Length * 2);
                                    continue;
                                }

                                ns.Write("HTTP/1.1 200 OK\r\nDate: Sun, 05 Jul 2020 12:00:00 GMT \r\nServer: Example\r\nContent-Length: 5\r\n\r\nHello"u8);
                                totalRead = 0;
                            }
                        }
                    });
                }
            });
        }
    }

| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| HttpGet | .NET 7.0 | 151.7 us | 1.00 | 1.52 KB | 1.00 |
| HttpGet | .NET 8.0 | 136.0 us | 0.90 | 1.41 KB | 0.93 |

WebSocket also sees improvements in .NET 8. With dotnet/runtime#87329, ManagedWebSocket (the implementation that’s used by ClientWebSocket and that’s returned from WebSocket.CreateFromStream) gets in on the zero-byte reads game. In .NET 7, you could perform a zero-byte ReceiveAsync on ManagedWebSocket, but doing so would still issue a ReadAsync to the underlying stream with the receive header buffer. That in turn could cause the underlying Stream to rent and/or pin a buffer. By special-casing zero-byte reads now in .NET 8, ClientWebSocket can take advantage of any special-casing in the base stream, and hopefully make it so that when the actual read is performed, the data necessary to satisfy it synchronously is already available.

And with dotnet/runtime#75025, allocation with ClientWebSocket.ConnectAsync is reduced. This is a nice example of really needing to pay attention to defaults. ClientWebSocket has an optimization where it maintains a shared singleton HttpMessageInvoker that it reuses between ClientWebSocket instances. However, it can only reuse that invoker when the settings of the ClientWebSocket match the settings of the shared singleton; by default ClientWebSocketOptions.Proxy is set, and that’s enough to knock it off the path that lets it use the shared handler. This PR adds a second shared singleton for when Proxy is set, such that requests using the default proxy can now use a shared handler rather than creating a new one.

JSON

A significant focus for System.Text.Json in .NET 8 was on improving support for trimming and source-generated JsonSerializer implementations, as its usage ends up on critical code paths in a multitude of services and applications, including those that are a primary focus area for Native AOT. Thus, a lot of work went into adding features to the source generator whose absence might otherwise prevent a developer from choosing to use it. dotnet/runtime#79828, for example, added support for required and init properties in C#, dotnet/runtime#83631 added support for “unspeakable” types (such as the compiler-generated types used to implement iterator methods), and dotnet/runtime#84768 added better support for boxed values. dotnet/runtime#79397 also added support for weakly-typed but trimmer-safe Serialize/Deserialize methods, taking JsonTypeInfo, that make it possible for ASP.NET and other such consumers to cache JSON contract metadata appropriately. All of these improvements are functionally valuable on their own, but also accrue to the overall goals of reducing deployed binary size, improving startup time, and generally being able to be successful with Native AOT and gaining the benefits it brings.
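
To make those source-generator features concrete, here is a minimal sketch, assuming a .NET 8 project (so the System.Text.Json source generator runs at build time). The Person and PersonContext types are hypothetical names invented for illustration; the sketch shows a type with required and init properties being serialized through the generated context, first with the strongly-typed JsonTypeInfo<T> and then with the weakly-typed JsonTypeInfo overloads.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;
using System.Text.Json.Serialization.Metadata;

var person = new Person { Name = "Ada", Age = 36 };

// Strongly-typed: the generated JsonTypeInfo<Person> carries the contract,
// so the serializer does not need reflection to discover the type's shape.
string json = JsonSerializer.Serialize(person, PersonContext.Default.Person);

// Weakly-typed but still trimmer-safe: look the contract up by Type and
// pass the non-generic JsonTypeInfo to the overloads added in .NET 8.
JsonTypeInfo info = PersonContext.Default.GetTypeInfo(typeof(Person))!;
string sameJson = JsonSerializer.Serialize((object)person, info);
object? roundTripped = JsonSerializer.Deserialize(json, info);

Console.WriteLine(json);
Console.WriteLine(sameJson == json);
Console.WriteLine(roundTripped is Person { Name: "Ada", Age: 36 });

// Hypothetical model type using C# required and init members.
public class Person
{
    public required string Name { get; init; }
    public int Age { get; init; }
}

// Hypothetical context; the generator fills in the partial class at build time.
[JsonSerializable(typeof(Person))]
public partial class PersonContext : JsonSerializerContext { }
```

A consumer that only knows the Type at run time (as a web framework typically does) can cache the JsonTypeInfo it gets back from GetTypeInfo and reuse it for every request.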

Even with that focus, however, there were still some nice throughput-focused improvements that made their way into .NET 8. In particular, a key improvement in .NET 8 is that the JsonSerializer is now able to utilize generated “fast-path” methods even when streaming.

One of the main things the JSON source generator does is generate at build-time all of the things JsonSerializer would otherwise need reflection to access at run-time, e.g. discovering the shape of a type, all of its members, their names, attributes that control their serialization, and so on. With just that, however, the serializer would still be using generic routines to perform operations like serialization, just doing so without needing to use reflection. Instead, the source generator can emit a customized serialization routine specific to the data in question, in order to optimize writing it out. For example, given the following types:

public class Rectangle
{
    public int X, Y, Width, Height;
    public Color Color;
}

public struct Color
{
    public byte R, G, B, A;
}

[JsonSerializable(typeof(Rectangle))]
[JsonSourceGenerationOptions(IncludeFields = true)]
private partial class JsonContext : JsonSerializerContext { }

the source generator will include the following serialization routines in the generated code:

private void RectangleSerializeHandler(global::System.Text.Json.Utf8JsonWriter writer, global::Tests.Rectangle? value)
{
    if (value == null)
    {
        writer.WriteNullValue();
        return;
    }

    writer.WriteStartObject();
    writer.WriteNumber(PropName_X, ((global::Tests.Rectangle)value).X);
    writer.WriteNumber(PropName_Y, ((global::Tests.Rectangle)value).Y);
    writer.WriteNumber(PropName_Width, ((global::Tests.Rectangle)value).Width);
    writer.WriteNumber(PropName_Height, ((global::Tests.Rectangle)value).Height);
    writer.WritePropertyName(PropName_Color);
    ColorSerializeHandler(writer, ((global::Tests.Rectangle)value).Color);
    writer.WriteEndObject();
}

private void ColorSerializeHandler(global::System.Text.Json.Utf8JsonWriter writer, global::Tests.Color value)
{
    writer.WriteStartObject();
    writer.WriteNumber(PropName_R, value.R);
    writer.WriteNumber(PropName_G, value.G);
    writer.WriteNumber(PropName_B, value.B);
    writer.WriteNumber(PropName_A, value.A);
    writer.WriteEndObject();
}

The serializer can then just invoke these routines to write the data directly to the Utf8JsonWriter.

However, in the past these routines weren’t used when serializing with one of the streaming routines (e.g. all of the SerializeAsync methods), in part because of the complexity of refactoring the implementation to accommodate them, but in larger part out of concern that an individual instance being serialized might need to write more data than should be buffered; these fast paths are synchronous-only today, and so can’t perform asynchronous flushes efficiently. This is particularly unfortunate because these streaming overloads are the primary ones used by ASP.NET, which means ASP.NET wasn’t benefiting from these fast paths. Thanks to dotnet/runtime#78646, in .NET 8 they now do benefit. The PR does the necessary refactoring internally and also puts in place various heuristics to minimize chances of over-buffering. The net result is these existing optimizations now kick in for a much broader array of use cases, including the primary ones higher in the stack, and the wins are significant.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Json;
using System.Text.Json.Serialization;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public partial class Tests
{
    private readonly Rectangle _data = new()
    {
        X = 1, Y = 2,
        Width = 3, Height = 4,
        Color = new Color { R = 5, G = 6, B = 7, A = 8 }
    };

    [Benchmark]
    public void Serialize() => JsonSerializer.Serialize(Stream.Null, _data, JsonContext.Default.Rectangle);

    [Benchmark]
    public Task SerializeAsync() => JsonSerializer.SerializeAsync(Stream.Null, _data, JsonContext.Default.Rectangle);

    public class Rectangle
    {
        public int X, Y, Width, Height;
        public Color Color;
    }

    public struct Color
    {
        public byte R, G, B, A;
    }

    [JsonSerializable(typeof(Rectangle))]
    [JsonSourceGenerationOptions(IncludeFields = true)]
    private partial class JsonContext : JsonSerializerContext { }
}

| Method         | Runtime  | Mean     | Ratio | Allocated | Alloc Ratio |
|----------------|----------|----------|-------|-----------|-------------|
| Serialize      | .NET 7.0 | 613.3 ns | 1.00  | 488 B     | 1.00        |
| Serialize      | .NET 8.0 | 205.9 ns | 0.34  | -         | 0.00        |
| SerializeAsync | .NET 7.0 | 654.2 ns | 1.00  | 488 B     | 1.00        |
| SerializeAsync | .NET 8.0 | 259.6 ns | 0.40  | 32 B      | 0.07        |

The fast-path routines are better leveraged in additional scenarios now, as well. Another case where they weren’t used, even when not streaming, was when combining multiple source-generated contexts: if you have your JsonSerializerContext-derived type for your own types to be serialized, and someone passes to you another JsonSerializerContext-derived type for a type they’re giving you to serialize, you need to combine those contexts together into something you can give to Serialize. In doing so, however, the fast paths could get lost. dotnet/runtime#80741 adds additional APIs and support to enable the fast paths to still be used.
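
As an illustration of what “combining contexts” looks like in calling code, here is a minimal sketch using JsonTypeInfoResolver.Combine, which is one way to merge two generated contexts into a single JsonSerializerOptions. It is not necessarily the API the PR added, and the Order, Customer, OrderContext, and CustomerContext names are hypothetical; the sketch assumes a project where the JSON source generator runs.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;
using System.Text.Json.Serialization.Metadata;

// One options instance whose resolver consults both generated contexts in order.
var options = new JsonSerializerOptions
{
    TypeInfoResolver = JsonTypeInfoResolver.Combine(OrderContext.Default, CustomerContext.Default)
};

string orderJson = JsonSerializer.Serialize(new Order { Id = 1 }, options);
string customerJson = JsonSerializer.Serialize(new Customer { Name = "Ada" }, options);

Console.WriteLine(orderJson);
Console.WriteLine(customerJson);

// Hypothetical types: imagine Order is yours and Customer comes from another library.
public class Order { public int Id { get; set; } }
public class Customer { public string Name { get; set; } = ""; }

[JsonSerializable(typeof(Order))]
public partial class OrderContext : JsonSerializerContext { }

[JsonSerializable(typeof(Customer))]
public partial class CustomerContext : JsonSerializerContext { }
```

As with any JsonSerializerOptions, the combined instance is best created once and reused, since it also caches the metadata the serializer builds up.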

Beyond JsonSerializer, there have been several other performance improvements. In dotnet/runtime#88194, for example, JsonNode’s implementation is streamlined, including avoiding allocating a delegate while setting values into the node, and in dotnet/runtime#85886, JsonNode.To is improved via a one-line change that stops unnecessarily calling Memory<byte>.ToArray() in order to pass it to a method that accepts a ReadOnlySpan<byte>: Memory<byte>.Span can and should be used instead, saving on a potentially large array allocation and copy.
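
The ToArray-versus-Span difference is easy to reproduce outside of JsonNode. The following is a small sketch with a hypothetical CountQuotes helper standing in for any method that accepts a ReadOnlySpan<byte>; both calls compute the same answer, but only the first allocates and copies a new array.

```csharp
using System;
using System.Text;

// Hypothetical consumer that only needs to read the bytes.
static int CountQuotes(ReadOnlySpan<byte> utf8)
{
    int count = 0;
    foreach (byte b in utf8)
    {
        if (b == (byte)'"') count++;
    }
    return count;
}

Memory<byte> buffer = Encoding.UTF8.GetBytes("{ \"Name\": \"Stephen\" }");

// Wasteful: ToArray() allocates a new byte[] and copies the contents,
// only for the array to be implicitly converted to a span anyway.
int viaCopy = CountQuotes(buffer.ToArray());

// Preferred: Span exposes the existing memory with no allocation or copy.
int viaSpan = CountQuotes(buffer.Span);

Console.WriteLine($"{viaCopy} {viaSpan}"); // prints "4 4"
```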

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Json.Nodes;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly JsonNode _node = JsonNode.Parse("""{ "Name": "Stephen" }"""u8);

    [Benchmark]
    public string ToJsonString() => _node.ToString();
}

| Method       | Runtime  | Mean     | Ratio | Allocated | Alloc Ratio |
|--------------|----------|----------|-------|-----------|-------------|
| ToJsonString | .NET 7.0 | 244.5 ns | 1.00  | 272 B     | 1.00        |
| ToJsonString | .NET 8.0 | 189.6 ns | 0.78  | 224 B     | 0.82        |

Lastly on the JSON front, there’s the new CA1869 analyzer added in dotnet/roslyn-analyzers#6850.

CA1869

The JsonSerializerOptions type looks like something that should be relatively cheap to allocate, just a small options type you could allocate on each call to JsonSerializer.Serialize or JsonSerializer.Deserialize with little ramification:

T value = JsonSerializer.Deserialize<T>(source, new JsonSerializerOptions { AllowTrailingCommas = true });

That’s not the case, however. JsonSerializer may need to use reflection to analyze the type being serialized or deserialized in order to learn about its shape and then potentially even use reflection emit to generate custom processing code for using that type. The JsonSerializerOptions instance is then used not only as a simple bag for options information, but also as a place to store all of that state the serializer built up, enabling it to be shared from call to call. Prior to .NET 7, this meant that passing a new JsonSerializerOptions instance to each call resulted in a massive performance cliff. In .NET 7, the caching scheme was improved to combat the problems here, but even with those mitigations, there’s still significant overhead to using a new JsonSerializerOptions instance each time. Instead, a JsonSerializerOptions instance should be cached and reused.

// dotnet run -c Release -f net8.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Json;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly string _json = """{ "Title":"Performance Improvements in .NET 8", "Author":"Stephen Toub", }""";
    private readonly JsonSerializerOptions _options = new JsonSerializerOptions { AllowTrailingCommas = true };

    [Benchmark(Baseline = true)]
    public BlogData Deserialize_New() => JsonSerializer.Deserialize<BlogData>(_json, new JsonSerializerOptions { AllowTrailingCommas = true });

    [Benchmark]
    public BlogData Deserialize_Cached() => JsonSerializer.Deserialize<BlogData>(_json, _options);

    public struct BlogData
    {
        public string Title { get; set; }
        public string Author { get; set; }
    }
}

| Method             | Mean     | Ratio | Allocated | Alloc Ratio |
|--------------------|----------|-------|-----------|-------------|
| Deserialize_New    | 736.5 ns | 1.00  | 358 B     | 1.00        |
| Deserialize_Cached | 290.2 ns | 0.39  | 176 B     | 0.49        |

Cryptography

Cryptography in .NET 8 sees a smattering of improvements, a few large ones and a bunch of smaller ones that contribute to removing some overhead across the system.

One of the larger improvements, specific to Windows because it’s about switching what functionality is employed from the underlying OS, comes from dotnet/runtime#76277. Windows CNG (“Next Generation”) provides two libraries: bcrypt.dll and ncrypt.dll. The former provides support for “ephemeral” operations, ones where the cryptographic key is in-memory only and generated on the fly as part of an operation. The latter supports both ephemeral and persisted-key operations, and as a result much of the .NET support has been based on ncrypt.dll since it’s more universal. This, however, can add unnecessary expense, as all of the operations are handled out-of-process by the lsass.exe service, and thus require remote procedure calls, which add overhead. This PR switches RSA ephemeral operations over to using bcrypt instead of ncrypt, and the results are noteworthy (in the future, we expect other algorithms to also switch).

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
[MemoryDiagnoser(displayGenColumns: false)]
[SkipLocalsInit]
public class Tests
{
    private static readonly RSA s_rsa = RSA.Create();
    private static readonly byte[] s_signed = s_rsa.SignHash(new byte[256 / 8], HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
    private static readonly byte[] s_encrypted = s_rsa.Encrypt(new byte[3], RSAEncryptionPadding.OaepSHA256);
    private static readonly X509Certificate2 s_cert = new X509Certificate2(Convert.FromBase64String("""
        MIIE7DCCA9SgAwIBAgITMwAAALARrwqL0Duf3QABAAAAsDANBgkqhkiG9w0BAQUFADB5MQswCQYDVQQGEwJVUzETMBEGA1UECBMKV2FzaGluZ3RvbjEQMA4GA1UEBxMH
        UmVkbW9uZDEeMBwGA1UEChMVTWljcm9zb2Z0IENvcnBvcmF0aW9uMSMwIQYDVQQDExpNaWNyb3NvZnQgQ29kZSBTaWduaW5nIFBDQTAeFw0xMzAxMjQyMjMzMzlaFw0x
        NDA0MjQyMjMzMzlaMIGDMQswCQYDVQQGEwJVUzETMBEGA1UECBMKV2FzaGluZ3RvbjEQMA4GA1UEBxMHUmVkbW9uZDEeMBwGA1UEChMVTWljcm9zb2Z0IENvcnBvcmF0
        aW9uMQ0wCwYDVQQLEwRNT1BSMR4wHAYDVQQDExVNaWNyb3NvZnQgQ29ycG9yYXRpb24wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDor1yiIA34KHy8BXt/
        re7rdqwoUz8620B9s44z5lc/pVEVNFSlz7SLqT+oN+EtUO01Fk7vTXrbE3aIsCzwWVyp6+HXKXXkG4Unm/P4LZ5BNisLQPu+O7q5XHWTFlJLyjPFN7Dz636o9UEVXAhl
        HSE38Cy6IgsQsRCddyKFhHxPuRuQsPWj/ov0DJpOoPXJCiHiquMBNkf9L4JqgQP1qTXclFed+0vUDoLbOI8S/uPWenSIZOFixCUuKq6dGB8OHrbCryS0DlC83hyTXEmm
        ebW22875cHsoAYS4KinPv6kFBeHgD3FN/a1cI4Mp68fFSsjoJ4TTfsZDC5UABbFPZXHFAgMBAAGjggFgMIIBXDATBgNVHSUEDDAKBggrBgEFBQcDAzAdBgNVHQ4EFgQU
        WXGmWjNN2pgHgP+EHr6H+XIyQfIwUQYDVR0RBEowSKRGMEQxDTALBgNVBAsTBE1PUFIxMzAxBgNVBAUTKjMxNTk1KzRmYWYwYjcxLWFkMzctNGFhMy1hNjcxLTc2YmMw
        NTIzNDRhZDAfBgNVHSMEGDAWgBTLEejK0rQWWAHJNy4zFha5TJoKHzBWBgNVHR8ETzBNMEugSaBHhkVodHRwOi8vY3JsLm1pY3Jvc29mdC5jb20vcGtpL2NybC9wcm9k
        dWN0cy9NaWNDb2RTaWdQQ0FfMDgtMzEtMjAxMC5jcmwwWgYIKwYBBQUHAQEETjBMMEoGCCsGAQUFBzAChj5odHRwOi8vd3d3Lm1pY3Jvc29mdC5jb20vcGtpL2NlcnRz
        L01pY0NvZFNpZ1BDQV8wOC0zMS0yMDEwLmNydDANBgkqhkiG9w0BAQUFAAOCAQEAMdduKhJXM4HVncbr+TrURE0Inu5e32pbt3nPApy8dmiekKGcC8N/oozxTbqVOfsN
        4OGb9F0kDxuNiBU6fNutzrPJbLo5LEV9JBFUJjANDf9H6gMH5eRmXSx7nR2pEPocsHTyT2lrnqkkhNrtlqDfc6TvahqsS2Ke8XzAFH9IzU2yRPnwPJNtQtjofOYXoJto
        aAko+QKX7xEDumdSrcHps3Om0mPNSuI+5PNO/f+h4LsCEztdIN5VP6OukEAxOHUoXgSpRm3m9Xp5QL0fzehF1a7iXT71dcfmZmNgzNWahIeNJDD37zTQYx2xQmdKDku/
        Og7vtpU6pzjkJZIIpohmgg==
        """));

    [Benchmark]
    public void Encrypt()
    {
        Span<byte> src = stackalloc byte[3];
        Span<byte> dest = stackalloc byte[s_rsa.KeySize >> 3];
        s_rsa.TryEncrypt(src, dest, RSAEncryptionPadding.OaepSHA256, out _);
    }

    [Benchmark]
    public void Decrypt()
    {
        Span<byte> dest = stackalloc byte[s_rsa.KeySize >> 3];
        s_rsa.TryDecrypt(s_encrypted, dest, RSAEncryptionPadding.OaepSHA256, out _);
    }

    [Benchmark]
    public void Verify()
    {
        Span<byte> hash = stackalloc byte[256 >> 3];
        s_rsa.VerifyHash(hash, s_signed, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
    }

    [Benchmark]
    public void VerifyFromCert()
    {
        using RSA rsa = s_cert.GetRSAPublicKey();
        Span<byte> sig = stackalloc byte[rsa.KeySize >> 3];
        ReadOnlySpan<byte> hash = sig.Slice(0, 256 >> 3);
        rsa.VerifyHash(hash, sig, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
    }
}

| Method         | Runtime  | Mean      | Ratio | Allocated | Alloc Ratio |
|----------------|----------|-----------|-------|-----------|-------------|
| Encrypt        | .NET 7.0 | 132.79 us | 1.00  | 56 B      | 1.00        |
| Encrypt        | .NET 8.0 | 19.72 us  | 0.15  | -         | 0.00        |
| Decrypt        | .NET 7.0 | 653.77 us | 1.00  | 57 B      | 1.00        |
| Decrypt        | .NET 8.0 | 538.25 us | 0.82  | -         | 0.00        |
| Verify         | .NET 7.0 | 94.92 us  | 1.00  | 56 B      | 1.00        |
| Verify         | .NET 8.0 | 16.09 us  | 0.17  | -         | 0.00        |
| VerifyFromCert | .NET 7.0 | 525.78 us | 1.00  | 721 B     | 1.00        |
| VerifyFromCert | .NET 8.0 | 31.60 us  | 0.06  | 696 B     | 0.97        |

For cases where implementations are still using ncrypt, there are, however, ways we can still avoid some of the remote procedure calls. dotnet/runtime#89599 does so by caching some information (in particular the key size) that doesn’t change but that still otherwise results in these remote procedure calls.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Security.Cryptography;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly byte[] _emptyDigest = new byte[256 / 8];
    private byte[] _rsaSignedHash, _ecdsaSignedHash;
    private RSACng _rsa;
    private ECDsaCng _ecdsa;

    [GlobalSetup]
    public void Setup()
    {
        _rsa = new RSACng(2048);
        _rsaSignedHash = _rsa.SignHash(_emptyDigest, HashAlgorithmName.SHA256, RSASignaturePadding.Pss);

        _ecdsa = new ECDsaCng(256);
        _ecdsaSignedHash = _ecdsa.SignHash(_emptyDigest);
    }

    [Benchmark]
    public bool Rsa_VerifyHash() => _rsa.VerifyHash(_emptyDigest, _rsaSignedHash, HashAlgorithmName.SHA256, RSASignaturePadding.Pss);

    [Benchmark]
    public bool Ecdsa_VerifyHash() => _ecdsa.VerifyHash(_emptyDigest, _ecdsaSignedHash);
}

| Method           | Toolchain | Mean      | Ratio |
|------------------|-----------|-----------|-------|
| Rsa_VerifyHash   | .NET 7.0  | 130.27 us | 1.00  |
| Rsa_VerifyHash   | .NET 8.0  | 75.30 us  | 0.58  |
| Ecdsa_VerifyHash | .NET 7.0  | 400.23 us | 1.00  |
| Ecdsa_VerifyHash | .NET 8.0  | 343.69 us | 0.86  |

The System.Formats.Asn1 library provides the support used for encoding various data structures used in cryptographic protocols. For example, AsnWriter is used as part of CertificateRequest to create the byte[] that’s handed off to the X509Certificate2’s constructor. As part of this, it relies heavily on OIDs (object identifiers) used to uniquely identify things like specific cryptographic algorithms. dotnet/runtime#75485 imbues AsnReader and AsnWriter with knowledge of the most-commonly used OIDs, making reading and writing with them significantly faster.

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Formats.Asn1;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly AsnWriter _writer = new AsnWriter(AsnEncodingRules.DER);

    [Benchmark]
    public void Write()
    {
        _writer.Reset();
        _writer.WriteObjectIdentifier("1.2.840.10045.4.3.3"); // ECDsa with SHA384
    }
}

| Method | Runtime  | Mean      | Ratio |
|--------|----------|-----------|-------|
| Write  | .NET 7.0 | 608.50 ns | 1.00  |
| Write  | .NET 8.0 | 33.69 ns  | 0.06  |

Interestingly, this PR does most of its work in two large switch statements. The first is a nice example of using C# list patterns to switch over a span of bytes and efficiently match to a case. The second is a great example of the C# compiler optimization mentioned earlier around switches and length bucketing. The internal WellKnownOids.GetContents function this adds to do the lookup is based on a giant switch with ~100 cases. The C# compiler ends up generating a switch over the length of the supplied OID string, and then in each length bucket, it either does a sequential scan through the small number of keys in that bucket, or it does a secondary switch over the character at a specific offset into the input, due to all of the keys having a discriminating character at that position.
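
For readers who haven’t used list patterns over spans, here is a small sketch of the two switch shapes described above. It is not the WellKnownOids code from the PR; GetFriendlyName and GetFriendlyNameFromText are hypothetical helpers covering just three OIDs, with the byte sequences being the DER content encodings of those OIDs.

```csharp
using System;

// Shape 1: a list-pattern switch over the encoded bytes of an OID.
static string? GetFriendlyName(ReadOnlySpan<byte> encodedOid) => encodedOid switch
{
    [0x2A, 0x86, 0x48, 0xCE, 0x3D, 0x04, 0x03, 0x02] => "ECDSA with SHA-256",       // 1.2.840.10045.4.3.2
    [0x2A, 0x86, 0x48, 0xCE, 0x3D, 0x04, 0x03, 0x03] => "ECDSA with SHA-384",       // 1.2.840.10045.4.3.3
    [0x60, 0x86, 0x48, 0x01, 0x65, 0x03, 0x04, 0x02, 0x01] => "SHA-256",            // 2.16.840.1.101.3.4.2.1
    _ => null,
};

// Shape 2: a switch over the dotted-decimal string; how the compiler lowers
// this (length buckets, discriminating characters) is left to the compiler.
static string? GetFriendlyNameFromText(string oid) => oid switch
{
    "1.2.840.10045.4.3.2" => "ECDSA with SHA-256",
    "1.2.840.10045.4.3.3" => "ECDSA with SHA-384",
    "2.16.840.1.101.3.4.2.1" => "SHA-256",
    _ => null,
};

byte[] ecdsaSha384 = new byte[] { 0x2A, 0x86, 0x48, 0xCE, 0x3D, 0x04, 0x03, 0x03 };
Console.WriteLine(GetFriendlyName(ecdsaSha384));                     // ECDSA with SHA-384
Console.WriteLine(GetFriendlyNameFromText("1.2.840.10045.4.3.3"));   // ECDSA with SHA-384
```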

Another interesting change comes in RandomNumberGenerator, which is the cryptographically-secure RNG in System.Security.Cryptography (as opposed to the non-cryptographically secure System.Random). RandomNumberGenerator provides a GetNonZeroBytes method, which is the same as GetBytes but which promises not to yield any 0 values. It does so by using GetBytes, finding any produced 0s, removing them, and then calling GetBytes again to replace all of the 0 values (if that call happens to produce any 0s, then the process repeats). The previous implementation of GetNonZeroBytes was nicely using the vectorized IndexOf((byte)0) to search for a 0. Once it found one, however, it would shift down one at a time the rest of the bytes until the next zero. Since we expect 0s to be rare (on average, they should only occur once every 256 generated bytes), it’s much more efficient to search for the next 0 using a vectorized operation, and then shift everything down using a vectorized memory move operation. And that’s exactly what dotnet/runtime#81340 does.
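
The search-then-bulk-move idea can be sketched as follows. This is an illustrative reimplementation under my own assumptions, not the code from the PR: Compact and FillNonZero are hypothetical helpers, with Span<T>.IndexOf providing the vectorized search and Span<T>.CopyTo (which handles overlapping ranges) providing the bulk move.

```csharp
using System;
using System.Security.Cryptography;

// Removes every zero byte from `data` in place and returns how many bytes remain.
static int Compact(Span<byte> data)
{
    int write = data.IndexOf((byte)0);          // position of the first zero, if any
    if (write < 0) return data.Length;          // nothing to remove

    int read = write + 1;                       // skip past that zero
    while (read < data.Length)
    {
        Span<byte> rest = data.Slice(read);
        int nextZero = rest.IndexOf((byte)0);   // vectorized search for the next zero
        int run = nextZero < 0 ? rest.Length : nextZero;

        rest.Slice(0, run).CopyTo(data.Slice(write)); // move the whole non-zero run at once
        write += run;
        read += run + 1;                        // skip the run and the zero that ended it
    }
    return write;
}

// Fills `data` with random non-zero bytes by refilling only the tail freed up by Compact.
static void FillNonZero(Span<byte> data)
{
    while (!data.IsEmpty)
    {
        RandomNumberGenerator.Fill(data);
        data = data.Slice(Compact(data));
    }
}

byte[] bytes = new byte[1024];
FillNonZero(bytes);
Console.WriteLine(bytes.AsSpan().IndexOf((byte)0)); // -1: no zeros remain
```

Because zeros are rare, each refill after the first typically covers only a handful of bytes, so the loop usually finishes within two or three passes.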

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Security.Cryptography;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly RandomNumberGenerator s_rng = RandomNumberGenerator.Create();
    private readonly byte[] _bytes = new byte[1024];

    [Benchmark]
    public void GetNonZeroBytes() => s_rng.GetNonZeroBytes(_bytes);
}

| Method          | Runtime  | Mean       | Ratio |
|-----------------|----------|------------|-------|
| GetNonZeroBytes | .NET 7.0 | 1,115.8 ns | 1.00  |
| GetNonZeroBytes | .NET 8.0 | 650.8 ns   | 0.58  |

Finally, a variety of changes went in to reduce allocation:

  • AsnWriter now also has a constructor that lets a caller presize its internal buffer, thanks to dotnet/runtime#73535. That new constructor is then used in dotnet/runtime#81626 to improve throughput on other operations.
  • dotnet/runtime#75138 removes a string allocation as part of reading certificates on Linux. Stack allocation and spans are used along with Encoding.ASCII.GetChars(ReadOnlySpan<byte>, Span<char>) instead of Encoding.ASCII.GetString(byte[]) that produces a string.
  • ECDsa’s LegalKeySizes don’t change. The property hands back a KeySizes[] array, and out of precaution the property needs to return a new array on each access, however the actual KeySizes instances are immutable. dotnet/runtime#76156 caches these KeySizes instances.
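
The stack-allocation pattern from the second bullet can be sketched like this. AsciiEquals is a hypothetical helper (it is not taken from the PR) that decodes ASCII bytes into a stack buffer with Encoding.ASCII.GetChars and compares the result, so no intermediate string is created for small inputs.

```csharp
using System;
using System.Text;

// Decodes ASCII bytes into a temporary char buffer and compares against `expected`
// without ever materializing a string for the decoded text.
static bool AsciiEquals(ReadOnlySpan<byte> asciiBytes, ReadOnlySpan<char> expected)
{
    const int StackLimit = 256; // assumed threshold; larger inputs fall back to the heap

    Span<char> chars = asciiBytes.Length <= StackLimit
        ? stackalloc char[StackLimit]
        : new char[asciiBytes.Length];

    int written = Encoding.ASCII.GetChars(asciiBytes, chars);
    return chars.Slice(0, written).SequenceEqual(expected);
}

Console.WriteLine(AsciiEquals("CERTIFICATE"u8, "CERTIFICATE")); // True
Console.WriteLine(AsciiEquals("CERTIFICATE"u8, "PRIVATE KEY")); // False
```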

Logging

Logging, along with telemetry, is the lifeblood of any service. The more logging one incorporates, the more information is available to diagnose issues. But of course the more logging one incorporates, the more resources are possibly spent on logging, and thus it’s desirable for logging-related code to be as efficient as possible.

One issue that’s plagued some applications is in Microsoft.Extensions.Logging’s LoggerFactory.CreateLogger method. Some libraries are passed an ILoggerFactory, call CreateLogger once, and then store and use that logger for all subsequent interactions; in such cases, the overhead of CreateLogger isn’t critical. However, other code paths, including some from ASP.NET, end up needing to “create” a logger on demand each time they need to log. That puts significant stress on CreateLogger, incurring its overhead as part of every logging operation. To reduce these overheads, LoggerFactory.CreateLogger has long maintained a Dictionary<TKey, TValue> cache of all logger instances it’s created: pass in the same categoryName, get back the same ILogger instance (hence why I put “create” in quotes a few sentences back). However, that cache is also protected by a lock. That not only means every CreateLogger call is incurring the overhead of acquiring and releasing a lock, but if that lock is contended (meaning others are trying to access it at the same time), that contention can dramatically increase the costs associated with the cache. This is the perfect use case for a ConcurrentDictionary<TKey, TValue>, which is optimized with lock-free support for reads, and that’s exactly how dotnet/runtime#87904 improves performance here. We still want to perform some work atomically when there’s a cache miss, so the change uses “double-checked locking”: it performs a read on the dictionary, and only if the lookup fails does it then fall back to taking the lock, after which it checks the dictionary again, and only if that second read fails does it proceed to create the new logger and store it. The primary benefit of ConcurrentDictionary<TKey, TValue> here is it enables us to have that up-front read, which might execute concurrently with another thread mutating the dictionary; that’s not safe with Dictionary<,> but is with ConcurrentDictionary<,>. This measurably lowers the cost of even uncontended access, but dramatically reduces the overhead when there’s significant contention.
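
The double-checked locking shape described above can be sketched as follows. This is a simplified stand-in rather than the LoggerFactory source: GetOrCreate is a hypothetical function, plain objects stand in for loggers, and the created counter exists only to show that the factory work runs once per key.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var cache = new ConcurrentDictionary<string, object>(StringComparer.Ordinal);
var sync = new object();
int created = 0;

object GetOrCreate(string name)
{
    // First check: a lock-free read, safe even while another thread is writing.
    if (!cache.TryGetValue(name, out object? item))
    {
        lock (sync)
        {
            // Second check: another thread may have added the entry while we waited.
            if (!cache.TryGetValue(name, out item))
            {
                item = new object();   // stand-in for the expensive creation work
                created++;             // only ever touched while holding the lock
                cache[name] = item;
            }
        }
    }
    return item;
}

Parallel.For(0, 1000, _ => GetOrCreate("test"));
Console.WriteLine(created); // 1: the creation path ran exactly once
```

The lock is still needed here because creation must happen at most once per key; when duplicate creation is acceptable, ConcurrentDictionary's GetOrAdd is the simpler choice.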

// dotnet run -c Release -f net7.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using Microsoft.Extensions.Logging;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithNuGet("Microsoft.Extensions.Logging", "7.0.0").AsBaseline())
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet("Microsoft.Extensions.Logging", "8.0.0-rc.1.23419.4"));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
public class Tests
{
    private readonly LoggerFactory _factory = new();

    [Benchmark]
    public void Serial() => _factory.CreateLogger("test");

    [Benchmark]
    public void Concurrent()
    {
        Parallel.ForEach(Enumerable.Range(0, Environment.ProcessorCount), (i, ct) =>
        {
            for (int j = 0; j < 1_000_000; j++)
            {
                _factory.CreateLogger("test");
            }
        });
    }
}

| Method     | Runtime  | Mean               | Ratio |
|------------|----------|--------------------|-------|
| Serial     | .NET 7.0 | 32.775 ns          | 1.00  |
| Serial     | .NET 8.0 | 7.734 ns           | 0.24  |
| Concurrent | .NET 7.0 | 509,271,719.571 ns | 1.00  |
| Concurrent | .NET 8.0 | 21,613,226.316 ns  | 0.04  |

(The same double-checked locking approach is also employed in dotnet/runtime#73893 from @Daniel-Svensson, in that case for the Data Contract Serialization library. Similarly, dotnet/runtime#82536 replaces a locked Dictionary<,> with a ConcurrentDictionary<,>, there in System.ComponentModel.DataAnnotations. In that case, it just uses ConcurrentDictionary<,>’s GetOrAdd method, which provides optimistic concurrency; the supplied delegate could be invoked multiple times in the case of contention to initialize a value for a given key, but only one such value will ever be published for all to consume.)

Also related to CreateLogger, there’s a CreateLogger(this ILoggerFactory factory, Type type) extension method and a CreateLogger<T>(this ILoggerFactory factory) extension method, both of which infer the category to use from the specified type, using its pretty-printed name. Previously that pretty-printing involved always allocating both a StringBuilder to build up the name and the resulting string. However, those are only necessary for more complex types, e.g. generic types, array types, and generic type parameters. For the common case, dotnet/runtime#79325 from @benaadams avoids those overheads, which were incurred even when the request for the logger could be satisfied from the cache, because the name was necessary to even perform the cache lookup.

// dotnet run -c Release -f net7.0 --filter "*"

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using Microsoft.Extensions.Logging;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithNuGet("Microsoft.Extensions.Logging", "7.0.0").AsBaseline())
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet("Microsoft.Extensions.Logging", "8.0.0-rc.1.23419.4"));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
[MemoryDiagnoser(displayGenColumns: false)]
public class Tests
{
    private readonly LoggerFactory _factory = new();

    [Benchmark]
    public ILogger CreateLogger() => _factory.CreateLogger<Tests>();
}

| Method       | Runtime  | Mean      | Ratio | Allocated | Alloc Ratio |
|--------------|----------|-----------|-------|-----------|-------------|
| CreateLogger | .NET 7.0 | 156.77 ns | 1.00  | 160 B     | 1.00        |
| CreateLogger | .NET 8.0 | 70.82 ns  | 0.45  | 24 B      | 0.15        |

There are also changes in .NET 8 to reduce overheads when logging actually does occur, and one such change makes use of a new .NET 8 feature we’ve already talked about: CompositeFormat. CompositeFormat isn’t currently used in many places throughout the core libraries, as most of the formatting they do is either with strings known at build time (in which case they use interpolated strings) or are on exceptional code paths (in which case we generally don’t want to regress working set or startup in order to optimize error conditions). However, there is one key place CompositeFormat is now used: in LoggerMessage.Define. This method is similar in concept to CompositeFormat: rather than having to redo work every time you want to log something, instead spend some more resources to frontload and cache that work, in order to optimize subsequent usage… that’s what LoggerMessage.Define does, just for logging. Define returns a strongly-typed delegate that can then be used any time logging should be performed. As of the same PR that introduced CompositeFormat, LoggerMessage.Define now also constructs a CompositeFormat under the covers, and uses that instance to perform any formatting work necessary based on the log message pattern provided (previously it would just call string.Format as part of every log operation that needed it).
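
As a reminder of the parse-once, format-many pattern that CompositeFormat enables, here is a minimal sketch targeting .NET 8. The Render function is a hypothetical helper; it is not how LoggerMessage.Define is implemented internally, only an illustration of the same frontloading idea.

```csharp
using System;
using System.Globalization;
using System.Text;

// Parse the composite format string once, up front, and cache the result.
CompositeFormat format = CompositeFormat.Parse("The value is {0}.");

// Each call reuses the pre-parsed format instead of re-parsing the string.
string Render(int value) => string.Format(CultureInfo.InvariantCulture, format, value);

Console.WriteLine(Render(42)); // The value is 42.
Console.WriteLine(Render(7));  // The value is 7.
```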

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.Extensions.Logging;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private readonly Action<ILogger, int, Exception> _message = LoggerMessage.Define<int>(LogLevel.Critical, 1, "The value is {0}.");
    private readonly ILogger _logger = new MyLogger();

    [Benchmark]
    public void Format() => _message(_logger, 42, null);

    sealed class MyLogger : ILogger
    {
        public IDisposable BeginScope<TState>(TState state) => null;
        public bool IsEnabled(LogLevel logLevel) => true;
        public void Log<TState>(LogLevel logLevel, EventId eventId, TState state, Exception exception, Func<TState, Exception, string> formatter) => formatter(state, exception);
    }
}

| Method | Runtime  | Mean      | Ratio |
|--------|----------|-----------|-------|
| Format | .NET 7.0 | 127.04 ns | 1.00  |
| Format | .NET 8.0 | 91.78 ns  | 0.72  |

LoggerMessage.Define is used as part of the logging source generator, so the benefits there implicitly accrue not only to direct usage of LoggerMessage.Define but also to any use of the generator. We can see that in this benchmark here:

// For this test, you'll also need to add:
//     <PackageReference Include="Microsoft.Extensions.Logging.Abstractions" Version="7.0.0" />
// to the benchmarks.csproj's <ItemGroup>.
// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.Extensions.Logging;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private readonly ILogger _logger = new MyLogger();

    [Benchmark]
    public void Log() => LogValue(42);

    [LoggerMessage(1, LogLevel.Critical, "The value is {Value}.")]
    private partial void LogValue(int value);

    sealed class MyLogger : ILogger
    {
        public IDisposable BeginScope<TState>(TState state) => null;
        public bool IsEnabled(LogLevel logLevel) => true;
        public void Log<TState>(LogLevel logLevel, EventId eventId, TState state, Exception exception, Func<TState, Exception, string> formatter) => formatter(state, exception);
    }
}

Note the LogValue method, which is declared as a partial method with the LoggerMessage attribute applied to it. The generator will see that and inject into my application the following implementation (the only changes I’ve made to this copied code are removing the fully-qualified names, for readability), which as is visible here uses LoggerMessage.Define:

    partial class Tests
    {
        [GeneratedCode("Microsoft.Extensions.Logging.Generators", "7.0.0")]
        private static readonly Action<ILogger, Int32, Exception?> __LogValueCallback =
            LoggerMessage.Define<Int32>(LogLevel.Critical, new EventId(1, nameof(LogValue)), "The value is {Value}.", new LogDefineOptions() { SkipEnabledCheck = true });

        [GeneratedCode("Microsoft.Extensions.Logging.Generators", "7.0.0")]
        private partial void LogValue(Int32 value)
        {
            if (_logger.IsEnabled(LogLevel.Critical))
            {
                __LogValueCallback(_logger, value, null);
            }
        }
    }

When running the benchmark, then, we can see that the improvements from using CompositeFormat end up translating nicely:

| Method | Runtime  | Mean     | Ratio |
|--------|----------|----------|-------|
| Log    | .NET 7.0 | 94.10 ns | 1.00  |
| Log    | .NET 8.0 | 74.68 ns | 0.79  |

Other changes have also gone into reducing overheads in logging. Here’s the same LoggerMessage.Define benchmark as before, but I’ve tweaked two things:

  1. I’ve added [MemoryDiagnoser] so that allocation is more visible.
  2. I’ve explicitly controlled which NuGet package version is used for which run.

The Microsoft.Extensions.Logging.Abstractions package carries with it multiple “assets”; the v7.0.0 package, even though it’s “7.0.0,” carries with it a build for net7.0, for net6.0, for netstandard2.0, etc. Similarly, the v8.0.0 package, even though it’s “8.0.0,” carries with it a build for net8.0, for net7.0, and so on. Each of those is created from compiling the source for that Target Framework Moniker (TFM). Changes that are specific to a particular TFM, such as the change to use CompositeFormat, are only compiled into that build, but other improvements that aren’t specific to a particular TFM end up in all of them. As such, to be able to see improvements that have gone into the general code in the last year, we need to actually compare the two different NuGet packages, and can’t just compare the net8.0 vs net7.0 assets in the same package version.

    // dotnet run -c Release -f net7.0 --filter "*"

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Configs;
    using BenchmarkDotNet.Environments;
    using BenchmarkDotNet.Jobs;
    using BenchmarkDotNet.Running;
    using Microsoft.Extensions.Logging;

    var config = DefaultConfig.Instance
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithNuGet("Microsoft.Extensions.Logging", "7.0.0").AsBaseline())
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet("Microsoft.Extensions.Logging", "8.0.0-rc.1.23419.4"));
    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

    [HideColumns("Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public class Tests
    {
        private readonly Action<ILogger, int, Exception> _message = LoggerMessage.Define<int>(LogLevel.Critical, 1, "The value is {0}.");
        private readonly ILogger _logger = new MyLogger();

        [Benchmark]
        public void Format() => _message(_logger, 42, null);

        sealed class MyLogger : ILogger
        {
            public IDisposable BeginScope<TState>(TState state) => null;
            public bool IsEnabled(LogLevel logLevel) => true;
            public void Log<TState>(LogLevel logLevel, EventId eventId, TState state, Exception exception, Func<TState, Exception, string> formatter) => formatter(state, exception);
        }
    }

| Method | Runtime  | Mean     | Ratio | Allocated | Alloc Ratio |
|--------|----------|----------|-------|-----------|-------------|
| Format | .NET 7.0 | 96.44 ns | 1.00  | 80 B      | 1.00        |
| Format | .NET 8.0 | 46.75 ns | 0.48  | 56 B      | 0.70        |

Notice that throughput has increased and allocation has dropped. That’s primarily due to dotnet/runtime#88560, which avoids boxing value type arguments as they’re being passed through the formatting logic.
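To make the cost of boxing concrete, here is a minimal sketch that is unrelated to the actual logging code in that PR. The `Sink` class and its `ViaObject` and `ViaGeneric` methods are hypothetical names invented for this illustration. It measures how many bytes the current thread allocates when an `int` is passed as `object` versus through a generic parameter:

```csharp
using System;
using System.Runtime.CompilerServices;

// Warm up both paths so one-time JIT work doesn't skew the measurement.
Sink.ViaObject(42);
Sink.ViaGeneric(42);

long before = GC.GetAllocatedBytesForCurrentThread();
for (int i = 0; i < 1_000; i++) Sink.ViaObject(i + 1_000);
long boxedBytes = GC.GetAllocatedBytesForCurrentThread() - before;

before = GC.GetAllocatedBytesForCurrentThread();
for (int i = 0; i < 1_000; i++) Sink.ViaGeneric(i + 1_000);
long genericBytes = GC.GetAllocatedBytesForCurrentThread() - before;

Console.WriteLine($"object path:  {boxedBytes} bytes");
Console.WriteLine($"generic path: {genericBytes} bytes");

// Hypothetical helper type, for illustration only.
static class Sink
{
    public static int Total;

    // An int passed as object has to be boxed, which is a heap allocation per call.
    // NoInlining keeps the JIT from optimizing the box away in this tiny example.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void ViaObject(object value) => Total += value.GetHashCode();

    // The generic method is specialized for int, so the value is passed without a box.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void ViaGeneric<T>(T value) where T : notnull => Total += value.GetHashCode();
}
```

The exact byte counts depend on the runtime and platform, but the object path should report allocation proportional to the number of calls while the generic path should not.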

dotnet/runtime#89160 is another interesting example, not because it’s a significant savings (it ends up saving an allocation per HTTP request made using an HttpClient created from an HttpClientFactory), but because of why the allocation is there in the first place. Consider this C# class:

    public class C
    {
        public void M(int value)
        {
            Console.WriteLine(value);
            LocalFunction();

            void LocalFunction() => Console.WriteLine(value);
        }
    }

We’ve got a method M that contains a local function LocalFunction that “closes over” M’s int value argument. How does value find its way into that LocalFunction? Let’s look at a decompiled version of the IL the compiler generates:

    public class C
    {
        public void M(int value)
        {
            <>c__DisplayClass0_0 <>c__DisplayClass0_ = default(<>c__DisplayClass0_0);
            <>c__DisplayClass0_.value = value;
            Console.WriteLine(<>c__DisplayClass0_.value);
            <M>g__LocalFunction|0_0(ref <>c__DisplayClass0_);
        }

        [StructLayout(LayoutKind.Auto)]
        [CompilerGenerated]
        private struct <>c__DisplayClass0_0
        {
            public int value;
        }

        [CompilerGenerated]
        private static void <M>g__LocalFunction|0_0(ref <>c__DisplayClass0_0 P_0)
        {
            Console.WriteLine(P_0.value);
        }
    }

So, the compiler has emitted the LocalFunction as a static method, and it’s passed the state it needs by reference, with all of the state in a separate type (which the compiler refers to as a “display class”). Note that a) the instance of this type is constructed in M in order to store the value argument, and that all references to value, whether in M or in LocalFunction, are to the shared value on the display class, and b) that “class” is actually declared as a struct. That means we’re not going to incur any allocation as part of that data sharing. But now, let’s add a single keyword to our repro: add async to LocalFunction (I’ve elided some irrelevant code here for clarity):

    public class C
    {
        public void M(int value)
        {
            <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0();
            <>c__DisplayClass0_.value = value;
            Console.WriteLine(<>c__DisplayClass0_.value);
            <>c__DisplayClass0_.<M>g__LocalFunction|0();
        }

        [CompilerGenerated]
        private sealed class <>c__DisplayClass0_0
        {
            [StructLayout(LayoutKind.Auto)]
            private struct <<M>g__LocalFunction|0>d : IAsyncStateMachine { ... }

            public int value;

            [AsyncStateMachine(typeof(<<M>g__LocalFunction|0>d))]
            internal void <M>g__LocalFunction|0()
            {
                <<M>g__LocalFunction|0>d stateMachine = default(<<M>g__LocalFunction|0>d);
                stateMachine.<>t__builder = AsyncVoidMethodBuilder.Create();
                stateMachine.<>4__this = this;
                stateMachine.<>1__state = -1;
                stateMachine.<>t__builder.Start(ref stateMachine);
            }
        }
    }

The code for M looks almost the same, but there’s a key difference: instead of default(<>c__DisplayClass0_0), it has new <>c__DisplayClass0_0(). That’s because the display class now actually is a class rather than being a struct, and that’s because the state can no longer live on the stack; it’s being passed to an asynchronous method, which may need to continue to use it even after the stack has unwound. And that means it becomes more important to avoid these kinds of implicit closures when dealing with local functions that are asynchronous.

In this particular case, LoggingHttpMessageHandler (and LoggingScopeHttpMessageHandler) had a SendCoreAsync method that looked like this:

    private Task<HttpResponseMessage> SendCoreAsync(HttpRequestMessage request, bool useAsync, CancellationToken cancellationToken)
    {
        ThrowHelper.ThrowIfNull(request);
        return Core(request, cancellationToken);

        async Task<HttpResponseMessage> Core(HttpRequestMessage request, CancellationToken cancellationToken)
        {
            ...
            HttpResponseMessage response = useAsync ? ... : ...;
            ...
        }
    }

Based on the previous discussion, you likely see the problem here: useAsync is being implicitly closed over by the local function, resulting in this allocating a display class to pass that state in. The cited PR changed the code to instead be:

    private Task<HttpResponseMessage> SendCoreAsync(HttpRequestMessage request, bool useAsync, CancellationToken cancellationToken)
    {
        ThrowHelper.ThrowIfNull(request);
        return Core(request, useAsync, cancellationToken);

        async Task<HttpResponseMessage> Core(HttpRequestMessage request, bool useAsync, CancellationToken cancellationToken)
        {
            ...
            HttpResponseMessage response = useAsync ? ... : ...;
            ...
        }
    }

and, voila, the allocation is gone.
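One way to guard against such a capture creeping back in is to mark the local function static, a C# 8 feature that makes the compiler reject any use of the enclosing method’s locals, parameters, or this. This is a suggestion of mine and not something the cited PR is stated to do. The `Example` type and `SendAsync` method below are hypothetical stand-ins, sketched only to show the shape:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

Console.WriteLine(await Example.SendAsync("ping", useAsync: true, CancellationToken.None));

// Hypothetical stand-in for a method shaped like SendCoreAsync.
static class Example
{
    public static Task<string> SendAsync(string request, bool useAsync, CancellationToken cancellationToken)
    {
        ArgumentNullException.ThrowIfNull(request);
        return Core(request, useAsync, cancellationToken);

        // 'static' forbids captures: using the outer method's 'useAsync' here without
        // passing it in as a parameter would be a compile-time error, not a hidden allocation.
        static async Task<string> Core(string request, bool useAsync, CancellationToken cancellationToken)
        {
            await Task.Yield();
            cancellationToken.ThrowIfCancellationRequested();
            return useAsync ? $"{request}: async" : $"{request}: sync";
        }
    }
}
```

With the modifier in place, all state has to flow through parameters, which is exactly the shape the fixed code above ended up with.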

EventSource is another logging mechanism in .NET that’s lower-level and which is used by the core libraries for their logging needs. The runtime itself publishes its events for things like the GC and the JIT via an EventSource, something I relied on earlier in this post when tracking how many GCHandles were created (search above for GCHandleListener). When eventing is enabled for a particular source, that EventSource publishes a manifest describing the possible events and the shape of the data associated with each. While in the future, we aim to use a source generator to create that manifest at build time, today it’s all generated at run-time, using reflection to analyze the events defined on the EventSource-derived type and to dynamically build up the description. That unfortunately has some cost, which can measurably impact startup. Thankfully, one of the main contributors here is the manifest for that runtime source, NativeRuntimeEventSource, as it’s ever present, but it’s not actually necessary, since tools that consume this information already know about the well-documented schema. As such, dotnet/runtime#78213 stopped emitting the manifest for NativeRuntimeEventSource such that it doesn’t send a large amount of data across to the consumer that will subsequently ignore it. That prevented it from being sent, but it was still being created. dotnet/runtime#86850 from @n77y addressed a large chunk of that by reducing the costs of that generation. The effect of this is obvious if we do a .NET allocation profile of a simple nop console application.

class Program { static void Main() { } }

On .NET 7, we observe this:

[Figure: Allocation from the NativeRuntimeEventSource on .NET 7]

And on .NET 8, that reduces to this:

[Figure: Allocation from the NativeRuntimeEventSource on .NET 8]

(In the future, hopefully this whole thing will go away due to precomputing the manifest.)

EventSource also relies heavily on interop, and as part of that it’s historically used delegate marshaling as part of implementing callbacks from native code. dotnet/runtime#79970 switches it over to using function pointers, which is not only more efficient, it eliminates this as one of the last uses of delegate marshaling in the core libraries. That means for Native AOT, all of the code associated with supporting delegate marshaling can typically now be trimmed away, reducing application size further.

Configuration

Configuration support is critical for many services and applications, such that information necessary to the execution of the code can be extracted from the code, whether that be into a JSON file, environment variables, Azure Key Vault, wherever. This information then needs to be loaded into the application in a convenient manner, typically at startup but also potentially any time the configuration is seen to change. It’s thus not a typical candidate for throughput-focused optimization, but it is still valuable to drive associated costs down, especially to help with startup performance.

With Microsoft.Extensions.Configuration, configuration is handled primarily with a ConfigurationBuilder, an IConfiguration, and a “binder.” Using a ConfigurationBuilder, you add in the various sources of your configuration information (e.g. AddEnvironmentVariables, AddAzureKeyVault, etc.), and then you publish that as an IConfiguration instance. In typical use, you then extract from that IConfiguration the data you want by “binding” it to an object, meaning a Bind method populates the provided object with data from the configuration based on the shape of the object. Let’s measure the cost of that Bind specifically:

    // For this test, you'll also need to add:
    //     <EnableConfigurationBindingGenerator>true</EnableConfigurationBindingGenerator>
    //     <Features>$(Features);InterceptorsPreview</Features>
    // to the PropertyGroup in the benchmarks.csproj file, and add:
    //    <PackageReference Include="Microsoft.Extensions.Configuration" Version="7.0.0" />
    //    <PackageReference Include="Microsoft.Extensions.Configuration.EnvironmentVariables" Version="7.0.0" />
    //    <PackageReference Include="Microsoft.Extensions.Configuration.Binder" Version="8.0.0-rc.1.23419.4" Condition="'$(TargetFramework)'=='net8.0'" />
    // to the ItemGroup.
    // dotnet run -c Release -f net7.0 --filter "*"

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Configs;
    using BenchmarkDotNet.Environments;
    using BenchmarkDotNet.Jobs;
    using BenchmarkDotNet.Running;
    using Microsoft.Extensions.Configuration;

    var config = DefaultConfig.Instance
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithNuGet("Microsoft.Extensions.Configuration", "7.0.0").AsBaseline())
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80)
            .WithNuGet("Microsoft.Extensions.Configuration", "8.0.0-rc.1.23419.4")
            .WithNuGet("Microsoft.Extensions.Configuration.Binder", "8.0.0-rc.1.23419.4"));
    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

    [HideColumns("Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
    [MemoryDiagnoser(displayGenColumns: false)]
    public partial class Tests
    {
        private readonly MyConfigSection _data = new();
        private IConfiguration _config;

        [GlobalSetup]
        public void Setup()
        {
            Environment.SetEnvironmentVariable("MyConfigSection__Message", "Hello World!");
            _config = new ConfigurationBuilder()
                .AddEnvironmentVariables()
                .Build();
        }

        [Benchmark]
        public void Load() => _config.Bind("MyConfigSection", _data);

        internal sealed class MyConfigSection
        {
            public string Message { get; set; }
        }
    }

| Method | Runtime  | Mean        | Ratio | Allocated | Alloc Ratio |
|--------|----------|-------------|-------|-----------|-------------|
| Load   | .NET 7.0 | 1,747.15 ns | 1.00  | 1328 B    | 1.00        |
| Load   | .NET 8.0 | 73.45 ns    | 0.04  | 112 B     | 0.08        |

Whoa.

Much of that cost in .NET 7 comes from what I alluded to earlier when I said “based on the shape of the object.” That Bind call is using this extension method defined in the Microsoft.Extensions.Configuration.ConfigurationBinder type:

public static void Bind(this IConfiguration configuration, string key, object? instance)

How does it know what data to extract from the configuration and where on the object to store it? Reflection, of course. That means that every Bind call is using reflection to walk the supplied object’s type information, and is using reflection to store the configuration data onto the instance. That’s not cheap.

What changes then in .NET 8? The mention of “EnableConfigurationBindingGenerator” in the benchmark code above probably gives it away, but the answer is there’s a new source generator for configuration in .NET 8. This source generator was initially introduced in dotnet/runtime#82179 and was then improved upon in a multitude of PRs like dotnet/runtime#84154, dotnet/runtime#86076, dotnet/runtime#86285, and dotnet/runtime#86365. The crux of the idea behind the configuration source generator is to emit a replacement for that Bind method, one that knows exactly what type is being populated and can do all the examination of its shape at build-time rather than at run-time via reflection.

“Replacement.” For anyone familiar with C# source generators, this might be setting off alarm bells in your head. Source generators plug into the compiler and are handed all the data the compiler has about the code being compiled; the source generator is then able to augment that data, generating additional code into separate files that the compiler then also compiles into the same assembly. Source generators are able to add code but they can’t rewrite the code. This is why you see source generators like the Regex source generator or the LibraryImport source generator or the LoggerMessage source generator relying on partial methods: the developer writes the partial method declaration for the method they then consume in their code, and then separately the generator emits a partial method definition to supply the implementation for that method. How then is this new configuration generator able to replace a call to an existing method? I’m glad you asked! It takes advantage of a new preview feature of the C# compiler, added primarily in dotnet/roslyn#68564: interceptors.

Consider this program, defined in a /home/stoub/benchmarks/Program.cs file (and where the associated .csproj contains <Features>$(Features);InterceptorsPreview</Features> to enable the preview feature):

    // dotnet run -c Release -f net8.0

    using System.Runtime.CompilerServices;

    Console.WriteLine("Hello World!");

    // ----------------------------------

    internal static class Helpers
    {
        [InterceptsLocation(@"/home/stoub/benchmarks/Program.cs", 5, 9)]
        internal static void NotTheRealWriteLine(string message) =>
            Console.WriteLine($"The message was '{message}'.");
    }

    namespace System.Runtime.CompilerServices
    {
        [AttributeUsage(AttributeTargets.Method, AllowMultiple = true)]
        file sealed class InterceptsLocationAttribute : Attribute
        {
            public InterceptsLocationAttribute(string filePath, int line, int column) { }
        }
    }

This is a “hello world” application, except not quite the one-liner you’re used to. There’s a call to Console.WriteLine, but there’s also a method decorated with InterceptsLocation. That method has the same signature as the Console.WriteLine being used, and the attribute is pointing to the WriteLine method call in Program.cs’s line 5 column 9. When the compiler sees this, it will change that call from Console.WriteLine("Hello World!") to instead be Helpers.NotTheRealWriteLine("Hello World!"), allowing this other method in the same compilation unit to intercept the original call. This interceptor needn’t be in the same file, so a source generator can analyze the code handed to it, find a call it wants to intercept, and augment the compilation unit with such an interceptor.

[Figure: Decompiled "Hello World" with Interceptors]

That’s exactly what the configuration source generator does. In this benchmark, for example, the core of what the source generator emits is here (I’ve elided stuff that’s not relevant to this discussion):

    [InterceptsLocationAttribute(@".../LoggerFilterConfigureOptions.cs", 21, 35)]
    public static void Bind_TestsMyConfigSection(this IConfiguration configuration, string key, object? obj)
    {
        ...
        var typedObj = (Tests.MyConfigSection)obj;
        BindCore(configuration.GetSection(key), ref typedObj, binderOptions: null);
    }

    public static void BindCore(IConfiguration configuration, ref Tests.MyConfigSection obj, BinderOptions? binderOptions)
    {
        ...
        obj.Message = configuration["Message"]!;
    }

We can see the generated Bind_TestsMyConfigSection method is strongly typed for my MyConfigSection type, and the generated BindCore method it invokes extracts the "Message" value from the configuration and stores it directly into the property. No reflection anywhere in sight.
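To contrast the two approaches outside of the real library, here is a minimal, self-contained sketch. It uses a plain dictionary as a stand-in for the configuration, and the `Binders` class with its `BindWithReflection` and `BindStronglyTyped` methods is hypothetical, written only for illustration; the actual binder and generator handle far more (nested sections, collections, type conversion, and so on):

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// A dictionary stands in for the configuration source in this sketch.
var settings = new Dictionary<string, string> { ["Message"] = "Hello World!" };

var viaReflection = new MyConfigSection();
Binders.BindWithReflection(settings, viaReflection);

var viaGeneratedStyle = new MyConfigSection();
Binders.BindStronglyTyped(settings, viaGeneratedStyle);

Console.WriteLine(viaReflection.Message);
Console.WriteLine(viaGeneratedStyle.Message);

sealed class MyConfigSection
{
    public string Message { get; set; } = "";
}

// Hypothetical helpers, for illustration only.
static class Binders
{
    // Roughly what a reflection-based binder must do on every call: discover the
    // properties at run-time, match each by name, and set it via reflection.
    public static void BindWithReflection(IReadOnlyDictionary<string, string> source, object instance)
    {
        foreach (PropertyInfo property in instance.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            if (property.CanWrite &&
                property.PropertyType == typeof(string) &&
                source.TryGetValue(property.Name, out var value))
            {
                property.SetValue(instance, value);
            }
        }
    }

    // The shape of the type is known at build time, so the lookup and the
    // assignment can be written out directly, as generated code would do.
    public static void BindStronglyTyped(IReadOnlyDictionary<string, string> source, MyConfigSection instance)
    {
        if (source.TryGetValue("Message", out var value))
        {
            instance.Message = value;
        }
    }
}
```

Both calls populate the object identically; the difference is that the second does no type discovery at run-time and references the property setter directly, which is also what makes it friendly to trimming.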

This is obviously great for throughput, but that actually wasn’t the primary goal for this particular source generator. Rather, it was in support of Native AOT and trimming. Without direct use of various portions of the object model for the bound object, the trimmer could see portions of it as being unused and trim them away (such as setters for properties that are only read by the application), at which point that data would not be available (because the deserializer would see the properties as being get-only). By having everything strongly typed in the generated source, that issue goes away. And as a bonus, if there isn’t other use of the reflection stack keeping it rooted, the trimmer can get rid of that, too.

Bind isn’t the only method that’s replaceable. ConfigurationBinder provides other methods consumers can use, like GetValue, which just retrieves the value associated with a specific key, and the configuration source generator can emit replacements for those as well. dotnet/runtime#87935 modified Microsoft.Extensions.Logging.Configuration to employ the config generator for this purpose, as it uses GetValue in its LoadDefaultConfigValues method:

    private void LoadDefaultConfigValues(LoggerFilterOptions options)
    {
        if (_configuration == null)
        {
            return;
        }

        options.CaptureScopes = _configuration.GetValue(nameof(options.CaptureScopes), options.CaptureScopes);
        ...
    }

And if we look at what’s in the compiled binary (via ILSpy), we see this:

[Figure: ILSpy decompilation of LoadDefaultConfigValues]

So, the code looks the same, but the actual target of the GetValue is the intercepting method emitted by the source generator. When that change merged, it knocked ~640 KB off the size of the ASP.NET app being used as an exemplar to track Native AOT app size!

Once data has been loaded from the configuration system into some kind of model, often the next step is to validate that the supplied data meets requirements. Whether a data model is populated once from configuration or per request for user input, a typical approach for achieving such validation is via the System.ComponentModel.DataAnnotations namespace. This namespace supplies attributes that can be applied to members of a type to indicate constraints the data must satisfy, such as [Required] to indicate the data must be supplied or [MinLength(...)] to indicate a minimum length for a string, and .NET 8 adds additional attributes via dotnet/runtime#82311, for example [Base64String]. On top of this, Microsoft.Extensions.Options.DataAnnotationValidateOptions provides an implementation of the IValidateOptions<TOptions> interface (an implementation of which is typically retrieved via DI) for validating models based on data annotations, and as you can probably guess, it does so via reflection. As is a trend you’re probably picking up on, for many such areas involving reflection, .NET has been moving to add source generators that can do at build-time what would have otherwise been done at run-time; that’s the case here as well. As of dotnet/runtime#87587, the Microsoft.Extensions.Options package in .NET 8 now includes a source generator that creates an implementation of IValidateOptions<TOptions> for a specific TOptions type.
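As a reminder of what attribute-based validation looks like on its own, here is a small sketch using only the in-box System.ComponentModel.DataAnnotations APIs. The `ServerOptions` type is a hypothetical model invented for this illustration; `Validator.TryValidateObject` discovers the attributes via reflection at run-time, which is the work a source generator can move to build-time:

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

// Both properties deliberately violate their constraints.
var model = new ServerOptions { Name = "", Path = "ab" };
var results = new List<ValidationResult>();

// validateAllProperties: true checks every attribute, not just [Required].
bool valid = Validator.TryValidateObject(model, new ValidationContext(model), results, validateAllProperties: true);

Console.WriteLine($"Valid: {valid}");
foreach (ValidationResult result in results)
{
    Console.WriteLine(result.ErrorMessage);
}

// Hypothetical options type, for illustration only.
sealed class ServerOptions
{
    [Required]
    public string Name { get; set; } = "";

    [MinLength(3)]
    public string Path { get; set; } = "";
}
```

Here the empty Name fails [Required] and the two-character Path fails [MinLength(3)], so validation reports failure with one result per property.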

For example, consider this benchmark:

    // For this test, you'll also need to add these:
    //  <PackageReference Include="Microsoft.Extensions.Options" Version="8.0.0-rc.1.23419.4" />
    //  <PackageReference Include="Microsoft.Extensions.Options.DataAnnotations" Version="8.0.0-rc.1.23419.4" />
    // to the benchmarks.csproj's <ItemGroup>.
    // dotnet run -c Release -f net8.0 --filter "*"

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    using Microsoft.Extensions.Options;
    using System.ComponentModel.DataAnnotations;

    BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

    [HideColumns("Error", "StdDev", "Median", "RatioSD")]
    public partial class Tests
    {
        private readonly DataAnnotationValidateOptions<MyOptions> _davo = new DataAnnotationValidateOptions<MyOptions>(null);
        private readonly MyOptionsValidator _ov = new();
        private readonly MyOptions _options = new() { Path = "1234567890", Address = "http://localhost/path", PhoneNumber = "555-867-5309" };

        [Benchmark(Baseline = true)]
        public ValidateOptionsResult WithReflection() => _davo.Validate(null, _options);

        [Benchmark]
        public ValidateOptionsResult WithSourceGen() => _ov.Validate(null, _options);

        public sealed class MyOptions
        {
            [Length(1, 10)]
            public string Path { get; set; }

            [Url]
            public string Address { get; set; }

            [Phone]
            public string PhoneNumber { get; set; }
        }

        [OptionsValidator]
        public partial class MyOptionsValidator : IValidateOptions<MyOptions> { }
    }

Note the [OptionsValidator] at the end. It’s applied to a partial class that implements IValidateOptions<MyOptions>, which tells the source generator to emit the implementation for this interface in order to validate MyOptions. It ends up emitting code like this (which I’ve simplified a tad, e.g. removing fully-qualified namespaces, for the purposes of this post):

    [GeneratedCode("Microsoft.Extensions.Options.SourceGeneration", "8.0.8.41903")]
    public ValidateOptionsResult Validate(string? name, MyOptions options)
    {
        var context = new ValidationContext(options);
        var validationResults = new List<ValidationResult>();
        var validationAttributes = new List<ValidationAttribute>(2);
        ValidateOptionsResultBuilder? builder = null;

        context.MemberName = "Path";
        context.DisplayName = string.IsNullOrEmpty(name) ? "MyOptions.Path" : $"{name}.Path";
        validationAttributes.Add(__OptionValidationStaticInstances.__Attributes.A1);
        validationAttributes.Add(__OptionValidationStaticInstances.__Attributes.A2);
        if (!Validator.TryValidateValue(options.Path, context, validationResults, validationAttributes))
            (builder ??= new()).AddResults(validationResults);

        context.MemberName = "Address";
        context.DisplayName = string.IsNullOrEmpty(name) ? "MyOptions.Address" : $"{name}.Address";
        validationResults.Clear();
        validationAttributes.Clear();
        validationAttributes.Add(__OptionValidationStaticInstances.__Attributes.A3);
        if (!Validator.TryValidateValue(options.Address, context, validationResults, validationAttributes))
            (builder ??= new()).AddResults(validationResults);

        context.MemberName = "PhoneNumber";
        context.DisplayName = string.IsNullOrEmpty(name) ? "MyOptions.PhoneNumber" : $"{name}.PhoneNumber";
        validationResults.Clear();
        validationAttributes.Clear();
        validationAttributes.Add(__OptionValidationStaticInstances.__Attributes.A4);
        if (!Validator.TryValidateValue(options.PhoneNumber, context, validationResults, validationAttributes))
            (builder ??= new()).AddResults(validationResults);

        return builder is not null ? builder.Build() : ValidateOptionsResult.Success;
    }

eliminating the need to use reflection to discover the relevant properties and their attribution. The benchmark results highlight the benefits:

| Method         | Mean       | Ratio |
|----------------|------------|-------|
| WithReflection | 2,926.2 ns | 1.00  |
| WithSourceGen  | 403.5 ns   | 0.14  |

Peanut Butter

In every .NET release, there are a multitude of welcome PRs that make small improvements. These changes on their own typically don’t “move the needle,” don’t on their own make very measurable end-to-end changes. However, an allocation removed here, an unnecessary bounds check removed there, it all adds up. Constantly working to remove this “peanut butter,” as we often refer to it (a thin smearing of overhead across everything), helps improve the performance of the platform in the aggregate.

Here are some examples from the last year:

  • dotnet/runtime#77832. The MemoryStream type provides a convenient ToArray() method that gives you all the stream’s data as a new byte[]. But while convenient, it’s a potentially large allocation and copy. The lesser known GetBuffer and TryGetBuffer methods give one access to the MemoryStream’s buffer directly, without incurring an allocation or copy. This PR replaced uses of ToArray in System.Private.Xml and in System.Reflection.Metadata that were better served by GetBuffer(). Not only did it remove unnecessary allocation, as a bonus it also resulted in less code.
  • dotnet/runtime#80523 and dotnet/runtime#80389 removed string allocations from the System.ComponentModel.Annotations library. CreditCardAttribute was making two calls to string.Replace to remove '-' and ' ' characters, but it was then looping over every character in the input… rather than creating new strings without those characters, the loop can simply skip over them. Similarly, PhoneAttribute contained 6 string.Substring calls, all of which could be replaced with simple ReadOnlySpan<char> slices.
  • dotnet/runtime#82041, dotnet/runtime#87479, and dotnet/runtime#80386 changed several hundred lines across dotnet/runtime to avoid various array and string allocations. In some cases that meant using stackalloc, in others ArrayPool, in others simply deleting arrays that were never used, and in others using ReadOnlySpan<char> and slicing.
  • dotnet/runtime#82411 from @xtqqczze and dotnet/runtime#82456 from @xtqqczze do a similar optimization to one discussed previously in the context of SslStream. Here, they’re removing SafeHandle allocations in places where a simple try/finally with the raw IntPtr for the handle suffices.
  • dotnet/runtime#82096 and dotnet/runtime#83138 decreased some costs by using newer constructs: string interpolation instead of concatenation so as to avoid some intermediary string allocations, and u8 instead of Encoding.UTF8.GetBytes to avoid the transcoding overhead.
  • dotnet/runtime#75850 removed some allocations as part of initializing a Dictionary<,>. The dictionary in TypeConverter gets populated with a fixed set of predetermined items, and as such it’s provided with a capacity so as to presize its internal arrays to avoid intermediate allocations as part of growing. However, the provided capacity was smaller than the number of items actually being added. This PR simply fixed the number, and voila, less allocation.
  • dotnet/runtime#81036 from @xtqqczze and dotnet/runtime#81039 from @xtqqczze helped eliminate some bounds checking in various components across the core libraries. Today the JIT compiler recognizes the pattern for (int i = 0; i < arr.Length; i++) Use(arr[i]);, understanding that the i can’t ever be negative nor greater than the arr’s length, and thus eliminates the bounds check it would have otherwise emitted on arr[i]. However, the compiler doesn’t currently recognize the same thing for for (int i = 0; i != arr.Length; i++) Use(arr[i]);. These PRs primarily replaced !=s with <s in order to help in some such cases (it also makes the code more idiomatic, and so was welcomed even in cases where it wasn’t actually helping with bounds checks).
  • dotnet/runtime#89030 fixed a case where a Dictionary<T, T> was being used as a set. Changing it to instead be HashSet<T> saves on the internal storage for the values that end up being identical to the keys.
  • dotnet/runtime#78741 replaces a bunch of Unsafe.SizeOf<T>() with sizeof(T) and Unsafe.As<TFrom, TTo> with pointer manipulation. Most of these are with managed Ts, such that it used to not be possible to do. However, as of C# 11, more of these operations are possible, with conditions that were previously always errors now being downgraded to warnings (which can then be suppressed) in an unsafe context. Such replacements generally won’t improve throughput, but they do make the binaries a bit smaller and require less work for the JIT, which can in turn help with startup time. dotnet/runtime#78914 takes advantage of this as well, this time to be able to pass a span as input to a string.Create call.
  • dotnet/runtime#78737 from @Poppyto and dotnet/runtime#79345 from @Poppyto remove some char[] allocations from Microsoft.Win32.Registry by replacing some code that was using List<string> to build up a result and then ToArray it at the end to get back a string[]. In the majority case, we know the exact required size ahead of time, and can avoid the extra allocations and copy by just using an array from the get-go.
  • dotnet/runtime#82598 from @huoyaoyuan also tweaked Registry, taking advantage of a Win32 function that was added after the original code was written, in order to reduce the number of system calls required to delete a subtree.
  • Multiple changes went into System.Xml and System.Runtime.Serialization.Xml to streamline away peanut butter related to strings and arrays. dotnet/runtime#75452 from @TrayanZapryanov replaces multiple string.Trim calls with span trimming and slicing, taking advantage of the C# language’s recently added support for using switch over ReadOnlySpan<char>. dotnet/runtime#75946 removes some use of ToCharArray (these days, there’s almost always a better alternative than string.ToCharArray), while dotnet/runtime#82006 replaces some new char[] with spans and stackalloc char[]. dotnet/runtime#85534 removed an unnecessary dictionary lookup, replacing a use of ContainsKey followed by the indexer with just TryGetValue. dotnet/runtime#84888 from @mla-alm removed some synchronous I/O from the asynchronous code paths in XsdValidatingReader. dotnet/runtime#74955 from @TrayanZapryanov deleted the internal XmlConvert.StrEqual helper, which was comparing the two inputs character by character, in favor of just using SequenceEqual and StartsWith. dotnet/runtime#75812 from @jlennox replaced some manual UTF8 encoding with "..."u8. dotnet/runtime#76436 from @TrayanZapryanov removed intermediate string allocation when writing primitive types as part of XML serialization. And dotnet/runtime#73336 from @Daniel-Svensson and dotnet/runtime#71478 from @Daniel-Svensson improved XmlDictionaryWriter by using Encoding.UTF8 for UTF8 encoding and by doing more efficient writing using spans.
  • dotnet/runtime#87905 makes a tiny tweak to the ArrayPool, but one that can lead to very measurable gains. The ArrayPool<T> instance returned from ArrayPool<T>.Shared currently is a multi-layered cache. The first layer is in thread-local storage. If renting can’t be satisfied by that layer, it falls through to the next layer, where there’s a “partition” per array size per core (by default). Each partition is an array of arrays. By default, this T[][] could store 8 arrays. Now with this PR, it can store 32 arrays, decreasing the chances that code will need to spend additional cycles searching other partitions. With dotnet/runtime#86109, that 32 value can also be changed, by setting the DOTNET_SYSTEM_BUFFERS_SHAREDARRAYPOOL_MAXARRAYSPERPARTITION environment variable to the desired maximum capacity. The DOTNET_SYSTEM_BUFFERS_SHAREDARRAYPOOL_MAXPARTITIONCOUNT environment variable can also be used to control how many partitions are employed.
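
For anyone who hasn't used the shared pool directly, here is a minimal sketch of the rent/return pattern that the ArrayPool change above benefits. The SumOfSquares helper is hypothetical and exists only for illustration; the only real APIs used are ArrayPool<T>.Shared, Rent, and Return. Note that Rent may hand back an array longer than requested, so the code only touches the first count elements.

```csharp
using System;
using System.Buffers;

// Hypothetical helper (not from the runtime): fills a pooled buffer with
// squares and sums them, returning the buffer to the pool when done.
static int SumOfSquares(int count)
{
    // Rent returns an array of at least 'count' elements; it may be longer.
    int[] buffer = ArrayPool<int>.Shared.Rent(count);
    try
    {
        for (int i = 0; i < count; i++)
        {
            buffer[i] = i * i;
        }

        // Only the first 'count' elements hold our data.
        int sum = 0;
        foreach (int value in buffer.AsSpan(0, count))
        {
            sum += value;
        }

        return sum;
    }
    finally
    {
        // Hand the array back so that subsequent renters can reuse it.
        ArrayPool<int>.Shared.Return(buffer);
    }
}

Console.WriteLine(SumOfSquares(10)); // prints 285
```

Nothing in code like this needs to change to pick up the improvement; the larger per-partition capacity is an internal detail of the pool that simply makes it more likely a Rent call finds a cached array.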

What’s Next?

Whew! That was… a lot! So, what’s next?

The .NET 8 Release Candidate is now available, and I encourage you to download it and take it for a spin. As you can likely sense from my enthusiasm throughout this post, I’m thrilled about the potential .NET 8 has to improve your system’s performance just by upgrading, and I’m thrilled about new features .NET 8 offers to help you tweak your code to be even more efficient. We’re eager to hear from you about your experiences in doing so, and if you find something that can be improved even further, we’d love for you to make it better by contributing to the various .NET repos, whether it be issues with your thoughts or PRs with your coded improvements. Your efforts will benefit not only you but every other .NET developer around the world!

Thanks for reading, and happy coding!

Author

Stephen Toub - MSFT
Partner Software Engineer

Stephen Toub is a developer on the .NET team at Microsoft.
