Performance Improvements in .NET 7

A year ago, I published Performance Improvements in .NET 6, following on the heels of similar posts for .NET 5, .NET Core 3.0, .NET Core 2.1, and .NET Core 2.0. I enjoy writing these posts and love reading developers’ responses to them. One comment in particular last year resonated with me. The commenter cited the Die Hard movie quote, “‘When Alexander saw the breadth of his domain, he wept for there were no more worlds to conquer’,” and questioned whether .NET performance improvements were similar. Has the well run dry? Are there no more “[performance] worlds to conquer”? I’m a bit giddy to say that, even with how fast .NET 6 is, .NET 7 definitively highlights how much more can be and has been done.
As with previous versions of .NET, performance is a key focus that pervades the entire stack, whether it be features created explicitly for performance or non-performance-related features that are still designed and implemented with performance keenly in mind. And now that a .NET 7 release candidate is just around the corner, it’s a good time to discuss much of it. Over the course of the last year, every time I’ve reviewed a PR that might positively impact performance, I’ve copied that link to a journal I maintain for the purposes of writing this post. When I sat down to write this a few weeks ago, I was faced with a list of almost 1000 performance-impacting PRs (out of more than 7000 PRs that went into the release), and I’m excited to share almost 500 of them here with you.
One thought before we dive in. In past years, I’ve received the odd piece of negative feedback about the length of some of my performance-focused write-ups, and while I disagree with the criticism, I respect the opinion. So, this year, consider this a “choose your own adventure.” If you’re here just looking for a super short adventure, one that provides the top-level summary and a core message to take away from your time here, I’m happy to oblige:
TL;DR: .NET 7 is fast. Really fast. A thousand performance-impacting PRs went into runtime and core libraries this release, never mind all the improvements in ASP.NET Core and Windows Forms and Entity Framework and beyond. It’s the fastest .NET ever. If your manager asks you why your project should upgrade to .NET 7, you can say “in addition to all the new functionality in the release, .NET 7 is super fast.”
Or, if you prefer a slightly longer adventure, one filled with interesting nuggets of performance-focused data, consider skimming through the post, looking for the small code snippets and corresponding tables showing a wealth of measurable performance improvements. At that point, you, too, may walk away with your head held high and my thanks.
Both noted paths achieve one of my primary goals for spending the time to write these posts, to highlight the greatness of the next release and to encourage everyone to give it a try. But, I have other goals for these posts, too. I want everyone interested to walk away from this post with an upleveled understanding of how .NET is implemented, why various decisions were made, tradeoffs that were evaluated, techniques that were employed, algorithms that were considered, and valuable tools and approaches that were utilized to make .NET even faster than it was previously. I want developers to learn from our own learnings and find ways to apply this new-found knowledge to their own codebases, thereby further increasing the overall performance of code in the ecosystem. I want developers to take an extra beat, think about reaching for a profiler the next time they’re working on a gnarly problem, think about looking at the source for the component they’re using in order to better understand how to work with it, and think about revisiting previous assumptions and decisions to determine whether they’re still accurate and appropriate. And I want developers to be excited at the prospect of submitting PRs to improve .NET not only for themselves but for every developer around the globe using .NET. If any of that sounds interesting, then I encourage you to choose the last adventure: prepare a carafe of your favorite hot beverage, get comfortable, and please enjoy.
Oh, and please don’t print this to paper. “Print to PDF” tells me it would take a third of a ream. If you would like a nicely formatted PDF, one is available for download here.
Table of Contents
- Setup
- JIT
- GC
- Native AOT
- Mono
- Reflection
- Interop
- Threading
- Primitive Types and Numerics
- Arrays, Strings, and Spans
- Regex
- Collections
- LINQ
- File I/O
- Compression
- Networking
- JSON
- XML
- Cryptography
- Diagnostics
- Exceptions
- Registry
- Analyzers
- What’s Next?
Setup
The microbenchmarks throughout this post utilize benchmarkdotnet. To make it easy for you to follow along with your own validation, I have a very simple setup for the benchmarks I use. Create a new C# project:
dotnet new console -o benchmarks
cd benchmarks
Your new benchmarks directory will contain a benchmarks.csproj file and a Program.cs file. Replace the contents of benchmarks.csproj with this:
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net7.0;net6.0</TargetFrameworks>
    <LangVersion>Preview</LangVersion>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    <ServerGarbageCollection>true</ServerGarbageCollection>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="benchmarkdotnet" Version="0.13.2" />
  </ItemGroup>

</Project>
and the contents of Program.cs with this:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.Win32;
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.IO.MemoryMappedFiles;
using System.IO.Pipes;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Net.Security;
using System.Net.Sockets;
using System.Numerics;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Security.Authentication;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;
using System.Text;
using System.Text.Json;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using System.Xml;

[MemoryDiagnoser(displayGenColumns: false)]
[DisassemblyDiagnoser]
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public partial class Program
{
    static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    // ... copy [Benchmark]s here
}
For each benchmark included in this write-up, you can then just copy and paste the code into this test class, and run the benchmarks. For example, to run a benchmark comparing performance on .NET 6 and .NET 7, do:
dotnet run -c Release -f net6.0 --filter '**' --runtimes net6.0 net7.0
This command says “build the benchmarks in release configuration targeting the .NET 6 surface area, and then run all of the benchmarks on both .NET 6 and .NET 7.” Or to run just on .NET 7:
dotnet run -c Release -f net7.0 --filter '**' --runtimes net7.0
which instead builds targeting the .NET 7 surface area and then only runs once against .NET 7. You can do this on any of Windows, Linux, or macOS. Unless otherwise called out (e.g. where the improvements are specific to Unix and I run the benchmarks on Linux), the results I share were recorded on Windows 11 64-bit but aren’t Windows-specific and should show similar relative differences on the other operating systems as well.
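For illustration only, here’s a hypothetical benchmark (the field and method names are mine and don’t correspond to any example later in the post) that could be pasted where the // ... copy [Benchmark]s here comment sits:

private readonly string _text = "Hello, .NET 7!";

[Benchmark]
public bool ContainsSeven() => _text.Contains('7');

Running the earlier dotnet run command would then produce a results table comparing that method on both target frameworks.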
The release of the first .NET 7 release candidate is right around the corner. All of the measurements in this post were gathered with a recent daily build of .NET 7 RC1.
Also, my standard caveat: These are microbenchmarks. It is expected that different hardware, different versions of operating systems, and the way in which the wind is currently blowing can affect the numbers involved. Your mileage may vary.
JIT
I’d like to kick off a discussion of performance improvements in the Just-In-Time (JIT) compiler by talking about something that itself isn’t actually a performance improvement. Being able to understand exactly what assembly code is generated by the JIT is critical when fine-tuning lower-level, performance-sensitive code. There are multiple ways to get at that assembly code. The online tool sharplab.io is incredibly useful for this (thanks to @ashmind for this tool); however it currently only targets a single release, so as I write this I’m only able to see the output for .NET 6, which makes it difficult to use for A/B comparisons. godbolt.org is also valuable for this, with C# support added in compiler-explorer/compiler-explorer#3168 from @hez2010, with similar limitations. The most flexible solutions involve getting at that assembly code locally, as it enables comparing whatever versions or local builds you desire with whatever configurations and switches set that you need.
One common approach is to use the [DisassemblyDiagnoser] in benchmarkdotnet. Simply slap the [DisassemblyDiagnoser] attribute onto your test class: benchmarkdotnet will find the assembly code generated for your tests and some depth of functions they call, and dump out the found assembly code in a human-readable form. For example, if I run this test:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;

[DisassemblyDiagnoser]
public partial class Program
{
    static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private int _a = 42, _b = 84;

    [Benchmark]
    public int Min() => Math.Min(_a, _b);
}
with:
dotnet run -c Release -f net7.0 --filter '**'
in addition to doing all of its normal test execution and timing, benchmarkdotnet also outputs a Program-asm.md file that contains this:
; Program.Min()
       mov eax,[rcx+8]
       mov edx,[rcx+0C]
       cmp eax,edx
       jg short M00_L01
       mov edx,eax
M00_L00:
       mov eax,edx
       ret
M00_L01:
       jmp short M00_L00
; Total bytes of code 17
Pretty neat. This support was recently improved further in dotnet/benchmarkdotnet#2072, which allows passing a filter list on the command-line to benchmarkdotnet to tell it exactly which methods’ assembly code should be dumped.
If you can get your hands on a “debug” or “checked” build of the .NET runtime (“checked” is a build that has optimizations enabled but also still includes asserts), and specifically of clrjit.dll, another valuable approach is to set an environment variable that causes the JIT itself to spit out a human-readable description of all of the assembly code it emits. This can be used with any kind of application, as it’s part of the JIT itself rather than part of any specific tool or other environment, it supports showing the code the JIT generates each time it generates code (e.g. if it first compiles a method without optimization and then later recompiles it with optimization), and overall it’s the most accurate picture of the assembly code as it comes “straight from the horse’s mouth,” as it were. The (big) downside of course is that it requires a non-release build of the runtime, which typically means you need to build it yourself from the sources in the dotnet/runtime repo.
… until .NET 7, that is. As of dotnet/runtime#73365, this assembly dumping support is now available in release builds as well, which means it’s simply part of .NET 7 and you don’t need anything special to use it. To see this, try creating a simple “hello world” app like:
using System;

class Program
{
    public static void Main() => Console.WriteLine("Hello, world!");
}
and building it (e.g. dotnet build -c Release). Then, set the DOTNET_JitDisasm environment variable to the name of the method we care about, in this case “Main” (the exact syntax allowed is more permissive and allows for some use of wildcards, optional namespace and class names, etc.). As I’m using PowerShell, that means:
$env:DOTNET_JitDisasm="Main"
and then running the app. You should see code like this output to the console:
; Assembly listing for method Program:Main()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       4883EC20             sub      rsp, 32
       488D6C2420           lea      rbp, [rsp+20H]
G_M000_IG02:                ;; offset=000AH
       48B9D820400A8E010000 mov      rcx, 0x18E0A4020D8
       488B09               mov      rcx, gword ptr [rcx]
       FF1583B31000         call     [Console:WriteLine(String)]
       90                   nop
G_M000_IG03:                ;; offset=001EH
       4883C420             add      rsp, 32
       5D                   pop      rbp
       C3                   ret
; Total bytes of code 36

Hello, world!
This is immeasurably helpful for performance analysis and tuning, even for questions as simple as “did my function get inlined” or “is this code I expected to be optimized away actually getting optimized away.” Throughout the rest of this post, I’ll include assembly snippets generated by one of these two mechanisms, in order to help exemplify concepts.
Note that it can sometimes be a little confusing figuring out what name to specify as the value for DOTNET_JitDisasm, especially when the method you care about is one that the C# compiler names or name mangles (since the JIT only sees the IL and metadata, not the original C#), e.g. the name of the entry point method for a program with top-level statements, the names of local functions, etc. To both help with this and to provide a really valuable top-level view of the work the JIT is doing, .NET 7 also supports the new DOTNET_JitDisasmSummary environment variable (introduced in dotnet/runtime#74090). Set that to “1”, and it’ll result in the JIT emitting a line every time it compiles a method, including the name of that method, which is copy/pasteable with DOTNET_JitDisasm. This feature is useful in-and-of-itself, as it can quickly highlight for you what’s being compiled, when, and with what settings. For example, if I set the environment variable and then run a “hello, world” console app, I get this output:
   1: JIT compiled CastHelpers:StelemRef(Array,long,Object) [Tier1, IL size=88, code size=93]
   2: JIT compiled CastHelpers:LdelemaRef(Array,long,long):byref [Tier1, IL size=44, code size=44]
   3: JIT compiled SpanHelpers:IndexOfNullCharacter(byref):int [Tier1, IL size=792, code size=388]
   4: JIT compiled Program:Main() [Tier0, IL size=11, code size=36]
   5: JIT compiled ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long [Tier0, IL size=490, code size=1187]
Hello, world!
We can see for “hello, world” there are only 5 methods that actually get JIT compiled. There are of course many more methods that get executed as part of a simple “hello, world,” but almost all of them have precompiled native code available as part of the “Ready To Run” (R2R) images of the core libraries. The first three in the above list (StelemRef, LdelemaRef, and IndexOfNullCharacter) don’t because they explicitly opted-out of R2R via use of the [MethodImpl(MethodImplOptions.AggressiveOptimization)] attribute (despite the name, this attribute should almost never be used, and is only used for very specific reasons in a few very specific places in the core libraries). Then there’s our Main method. And lastly there’s the NarrowUtf16ToAscii method, which doesn’t have R2R code, either, due to using the variable-width Vector<T> (more on that later). Every other method that’s run doesn’t require JIT’ing. If we instead first set the DOTNET_ReadyToRun environment variable to 0, the list is much longer, and gives you a very good sense of what the JIT needs to do on startup (and why technologies like R2R are important for startup time). Note how many methods get compiled before “hello, world” is output:
1: JIT compiled CastHelpers:StelemRef(Array,long,Object) [Tier1, IL size=88, code size=93] 2: JIT compiled CastHelpers:LdelemaRef(Array,long,long):byref [Tier1, IL size=44, code size=44] 3: JIT compiled AppContext:Setup(long,long,int) [Tier0, IL size=68, code size=275] 4: JIT compiled Dictionary`2:.ctor(int):this [Tier0, IL size=9, code size=40] 5: JIT compiled Dictionary`2:.ctor(int,IEqualityComparer`1):this [Tier0, IL size=102, code size=444] 6: JIT compiled Object:.ctor():this [Tier0, IL size=1, code size=10] 7: JIT compiled Dictionary`2:Initialize(int):int:this [Tier0, IL size=56, code size=231] 8: JIT compiled HashHelpers:GetPrime(int):int [Tier0, IL size=83, code size=379] 9: JIT compiled HashHelpers:.cctor() [Tier0, IL size=24, code size=102] 10: JIT compiled HashHelpers:GetFastModMultiplier(int):long [Tier0, IL size=9, code size=37] 11: JIT compiled Type:GetTypeFromHandle(RuntimeTypeHandle):Type [Tier0, IL size=8, code size=14] 12: JIT compiled Type:op_Equality(Type,Type):bool [Tier0, IL size=38, code size=143] 13: JIT compiled NonRandomizedStringEqualityComparer:GetStringComparer(Object):IEqualityComparer`1 [Tier0, IL size=39, code size=170] 14: JIT compiled NonRandomizedStringEqualityComparer:.cctor() [Tier0, IL size=46, code size=232] 15: JIT compiled EqualityComparer`1:get_Default():EqualityComparer`1 [Tier0, IL size=6, code size=36] 16: JIT compiled EqualityComparer`1:.cctor() [Tier0, IL size=26, code size=125] 17: JIT compiled ComparerHelpers:CreateDefaultEqualityComparer(Type):Object [Tier0, IL size=235, code size=949] 18: JIT compiled CastHelpers:ChkCastClass(long,Object):Object [Tier0, IL size=22, code size=72] 19: JIT compiled RuntimeHelpers:GetMethodTable(Object):long [Tier0, IL size=11, code size=33] 20: JIT compiled CastHelpers:IsInstanceOfClass(long,Object):Object [Tier0, IL size=97, code size=257] 21: JIT compiled GenericEqualityComparer`1:.ctor():this [Tier0, IL size=7, code size=31] 22: JIT compiled EqualityComparer`1:.ctor():this [Tier0, IL size=7, code size=31] 23: JIT compiled CastHelpers:ChkCastClassSpecial(long,Object):Object [Tier0, IL size=87, code size=246] 24: JIT compiled OrdinalComparer:.ctor(IEqualityComparer`1):this [Tier0, IL size=8, code size=39] 25: JIT compiled NonRandomizedStringEqualityComparer:.ctor(IEqualityComparer`1):this [Tier0, IL size=14, code size=52] 26: JIT compiled StringComparer:get_Ordinal():StringComparer [Tier0, IL size=6, code size=49] 27: JIT compiled OrdinalCaseSensitiveComparer:.cctor() [Tier0, IL size=11, code size=71] 28: JIT compiled OrdinalCaseSensitiveComparer:.ctor():this [Tier0, IL size=8, code size=33] 29: JIT compiled OrdinalComparer:.ctor(bool):this [Tier0, IL size=14, code size=43] 30: JIT compiled StringComparer:.ctor():this [Tier0, IL size=7, code size=31] 31: JIT compiled StringComparer:get_OrdinalIgnoreCase():StringComparer [Tier0, IL size=6, code size=49] 32: JIT compiled OrdinalIgnoreCaseComparer:.cctor() [Tier0, IL size=11, code size=71] 33: JIT compiled OrdinalIgnoreCaseComparer:.ctor():this [Tier0, IL size=8, code size=36] 34: JIT compiled OrdinalIgnoreCaseComparer:.ctor(IEqualityComparer`1):this [Tier0, IL size=8, code size=39] 35: JIT compiled CastHelpers:ChkCastAny(long,Object):Object [Tier0, IL size=38, code size=115] 36: JIT compiled CastHelpers:TryGet(long,long):int [Tier0, IL size=129, code size=308] 37: JIT compiled CastHelpers:TableData(ref):byref [Tier0, IL size=7, code size=31] 38: JIT compiled MemoryMarshal:GetArrayDataReference(ref):byref [Tier0, IL size=7, code size=24] 39: JIT compiled 
CastHelpers:KeyToBucket(byref,long,long):int [Tier0, IL size=38, code size=87] 40: JIT compiled CastHelpers:HashShift(byref):int [Tier0, IL size=3, code size=16] 41: JIT compiled BitOperations:RotateLeft(long,int):long [Tier0, IL size=17, code size=23] 42: JIT compiled CastHelpers:Element(byref,int):byref [Tier0, IL size=15, code size=33] 43: JIT compiled Volatile:Read(byref):int [Tier0, IL size=6, code size=16] 44: JIT compiled String:Ctor(long):String [Tier0, IL size=57, code size=155] 45: JIT compiled String:wcslen(long):int [Tier0, IL size=7, code size=31] 46: JIT compiled SpanHelpers:IndexOfNullCharacter(byref):int [Tier1, IL size=792, code size=388] 47: JIT compiled String:get_Length():int:this [Tier0, IL size=7, code size=17] 48: JIT compiled Buffer:Memmove(byref,byref,long) [Tier0, IL size=59, code size=102] 49: JIT compiled RuntimeHelpers:IsReferenceOrContainsReferences():bool [Tier0, IL size=2, code size=8] 50: JIT compiled Buffer:Memmove(byref,byref,long) [Tier0, IL size=480, code size=678] 51: JIT compiled Dictionary`2:Add(__Canon,__Canon):this [Tier0, IL size=11, code size=55] 52: JIT compiled Dictionary`2:TryInsert(__Canon,__Canon,ubyte):bool:this [Tier0, IL size=675, code size=2467] 53: JIT compiled OrdinalComparer:GetHashCode(String):int:this [Tier0, IL size=7, code size=37] 54: JIT compiled String:GetNonRandomizedHashCode():int:this [Tier0, IL size=110, code size=290] 55: JIT compiled BitOperations:RotateLeft(int,int):int [Tier0, IL size=17, code size=20] 56: JIT compiled Dictionary`2:GetBucket(int):byref:this [Tier0, IL size=29, code size=90] 57: JIT compiled HashHelpers:FastMod(int,int,long):int [Tier0, IL size=20, code size=70] 58: JIT compiled Type:get_IsValueType():bool:this [Tier0, IL size=7, code size=39] 59: JIT compiled RuntimeType:IsValueTypeImpl():bool:this [Tier0, IL size=54, code size=158] 60: JIT compiled RuntimeType:GetNativeTypeHandle():TypeHandle:this [Tier0, IL size=12, code size=48] 61: JIT compiled TypeHandle:.ctor(long):this [Tier0, IL size=8, code size=25] 62: JIT compiled TypeHandle:get_IsTypeDesc():bool:this [Tier0, IL size=14, code size=38] 63: JIT compiled TypeHandle:AsMethodTable():long:this [Tier0, IL size=7, code size=17] 64: JIT compiled MethodTable:get_IsValueType():bool:this [Tier0, IL size=20, code size=32] 65: JIT compiled GC:KeepAlive(Object) [Tier0, IL size=1, code size=10] 66: JIT compiled Buffer:_Memmove(byref,byref,long) [Tier0, IL size=25, code size=279] 67: JIT compiled Environment:InitializeCommandLineArgs(long,int,long):ref [Tier0, IL size=75, code size=332] 68: JIT compiled Environment:.cctor() [Tier0, IL size=11, code size=163] 69: JIT compiled StartupHookProvider:ProcessStartupHooks() [Tier-0 switched to FullOpts, IL size=365, code size=1053] 70: JIT compiled StartupHookProvider:get_IsSupported():bool [Tier0, IL size=18, code size=60] 71: JIT compiled AppContext:TryGetSwitch(String,byref):bool [Tier0, IL size=97, code size=322] 72: JIT compiled ArgumentException:ThrowIfNullOrEmpty(String,String) [Tier0, IL size=16, code size=53] 73: JIT compiled String:IsNullOrEmpty(String):bool [Tier0, IL size=15, code size=58] 74: JIT compiled AppContext:GetData(String):Object [Tier0, IL size=64, code size=205] 75: JIT compiled ArgumentNullException:ThrowIfNull(Object,String) [Tier0, IL size=10, code size=42] 76: JIT compiled Monitor:Enter(Object,byref) [Tier0, IL size=17, code size=55] 77: JIT compiled Dictionary`2:TryGetValue(__Canon,byref):bool:this [Tier0, IL size=39, code size=97] 78: JIT compiled 
Dictionary`2:FindValue(__Canon):byref:this [Tier0, IL size=391, code size=1466] 79: JIT compiled EventSource:.cctor() [Tier0, IL size=34, code size=80] 80: JIT compiled EventSource:InitializeIsSupported():bool [Tier0, IL size=18, code size=60] 81: JIT compiled RuntimeEventSource:.ctor():this [Tier0, IL size=55, code size=184] 82: JIT compiled Guid:.ctor(int,short,short,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):this [Tier0, IL size=86, code size=132] 83: JIT compiled EventSource:.ctor(Guid,String):this [Tier0, IL size=11, code size=90] 84: JIT compiled EventSource:.ctor(Guid,String,int,ref):this [Tier0, IL size=58, code size=187] 85: JIT compiled EventSource:get_IsSupported():bool [Tier0, IL size=6, code size=11] 86: JIT compiled TraceLoggingEventHandleTable:.ctor():this [Tier0, IL size=20, code size=67] 87: JIT compiled EventSource:ValidateSettings(int):int [Tier0, IL size=37, code size=147] 88: JIT compiled EventSource:Initialize(Guid,String,ref):this [Tier0, IL size=418, code size=1584] 89: JIT compiled Guid:op_Equality(Guid,Guid):bool [Tier0, IL size=10, code size=39] 90: JIT compiled Guid:EqualsCore(byref,byref):bool [Tier0, IL size=132, code size=171] 91: JIT compiled ActivityTracker:get_Instance():ActivityTracker [Tier0, IL size=6, code size=49] 92: JIT compiled ActivityTracker:.cctor() [Tier0, IL size=11, code size=71] 93: JIT compiled ActivityTracker:.ctor():this [Tier0, IL size=7, code size=31] 94: JIT compiled RuntimeEventSource:get_ProviderMetadata():ReadOnlySpan`1:this [Tier0, IL size=13, code size=91] 95: JIT compiled ReadOnlySpan`1:.ctor(long,int):this [Tier0, IL size=51, code size=115] 96: JIT compiled RuntimeHelpers:IsReferenceOrContainsReferences():bool [Tier0, IL size=2, code size=8] 97: JIT compiled ReadOnlySpan`1:get_Length():int:this [Tier0, IL size=7, code size=17] 98: JIT compiled OverrideEventProvider:.ctor(EventSource,int):this [Tier0, IL size=22, code size=68] 99: JIT compiled EventProvider:.ctor(int):this [Tier0, IL size=46, code size=194] 100: JIT compiled EtwEventProvider:.ctor():this [Tier0, IL size=7, code size=31] 101: JIT compiled EventProvider:Register(EventSource):this [Tier0, IL size=48, code size=186] 102: JIT compiled MulticastDelegate:CtorClosed(Object,long):this [Tier0, IL size=23, code size=70] 103: JIT compiled EventProvider:EventRegister(EventSource,EtwEnableCallback):int:this [Tier0, IL size=53, code size=154] 104: JIT compiled EventSource:get_Name():String:this [Tier0, IL size=7, code size=18] 105: JIT compiled EventSource:get_Guid():Guid:this [Tier0, IL size=7, code size=41] 106: JIT compiled EtwEventProvider:System.Diagnostics.Tracing.IEventProvider.EventRegister(EventSource,EtwEnableCallback,long,byref):int:this [Tier0, IL size=19, code size=71] 107: JIT compiled Advapi32:EventRegister(byref,EtwEnableCallback,long,byref):int [Tier0, IL size=53, code size=374] 108: JIT compiled Marshal:GetFunctionPointerForDelegate(__Canon):long [Tier0, IL size=17, code size=54] 109: JIT compiled Marshal:GetFunctionPointerForDelegate(Delegate):long [Tier0, IL size=18, code size=53] 110: JIT compiled EventPipeEventProvider:.ctor():this [Tier0, IL size=18, code size=41] 111: JIT compiled EventListener:get_EventListenersLock():Object [Tier0, IL size=41, code size=157] 112: JIT compiled List`1:.ctor(int):this [Tier0, IL size=47, code size=275] 113: JIT compiled Interlocked:CompareExchange(byref,__Canon,__Canon):__Canon [Tier0, IL size=9, code size=50] 114: JIT compiled NativeRuntimeEventSource:.cctor() [Tier0, IL size=11, code size=71] 115: JIT compiled 
NativeRuntimeEventSource:.ctor():this [Tier0, IL size=63, code size=184] 116: JIT compiled Guid:.ctor(int,ushort,ushort,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):this [Tier0, IL size=88, code size=132] 117: JIT compiled NativeRuntimeEventSource:get_ProviderMetadata():ReadOnlySpan`1:this [Tier0, IL size=13, code size=91] 118: JIT compiled EventPipeEventProvider:System.Diagnostics.Tracing.IEventProvider.EventRegister(EventSource,EtwEnableCallback,long,byref):int:this [Tier0, IL size=44, code size=118] 119: JIT compiled EventPipeInternal:CreateProvider(String,EtwEnableCallback):long [Tier0, IL size=43, code size=320] 120: JIT compiled Utf16StringMarshaller:GetPinnableReference(String):byref [Tier0, IL size=13, code size=50] 121: JIT compiled String:GetPinnableReference():byref:this [Tier0, IL size=7, code size=24] 122: JIT compiled EventListener:AddEventSource(EventSource) [Tier0, IL size=175, code size=560] 123: JIT compiled List`1:get_Count():int:this [Tier0, IL size=7, code size=17] 124: JIT compiled WeakReference`1:.ctor(__Canon):this [Tier0, IL size=9, code size=42] 125: JIT compiled WeakReference`1:.ctor(__Canon,bool):this [Tier0, IL size=15, code size=60] 126: JIT compiled List`1:Add(__Canon):this [Tier0, IL size=60, code size=124] 127: JIT compiled String:op_Inequality(String,String):bool [Tier0, IL size=11, code size=46] 128: JIT compiled String:Equals(String,String):bool [Tier0, IL size=36, code size=114] 129: JIT compiled ReadOnlySpan`1:GetPinnableReference():byref:this [Tier0, IL size=23, code size=57] 130: JIT compiled EventProvider:SetInformation(int,long,int):int:this [Tier0, IL size=38, code size=131] 131: JIT compiled ILStubClass:IL_STUB_PInvoke(long,int,long,int):int [FullOpts, IL size=62, code size=170] 132: JIT compiled Program:Main() [Tier0, IL size=11, code size=36] 133: JIT compiled Console:WriteLine(String) [Tier0, IL size=12, code size=59] 134: JIT compiled Console:get_Out():TextWriter [Tier0, IL size=20, code size=113] 135: JIT compiled Console:.cctor() [Tier0, IL size=11, code size=71] 136: JIT compiled Volatile:Read(byref):__Canon [Tier0, IL size=6, code size=21] 137: JIT compiled Console:<get_Out>g__EnsureInitialized|26_0():TextWriter [Tier0, IL size=63, code size=209] 138: JIT compiled ConsolePal:OpenStandardOutput():Stream [Tier0, IL size=34, code size=130] 139: JIT compiled Console:get_OutputEncoding():Encoding [Tier0, IL size=72, code size=237] 140: JIT compiled ConsolePal:get_OutputEncoding():Encoding [Tier0, IL size=11, code size=200] 141: JIT compiled NativeLibrary:LoadLibraryCallbackStub(String,Assembly,bool,int):long [Tier0, IL size=63, code size=280] 142: JIT compiled EncodingHelper:GetSupportedConsoleEncoding(int):Encoding [Tier0, IL size=53, code size=186] 143: JIT compiled Encoding:GetEncoding(int):Encoding [Tier0, IL size=340, code size=1025] 144: JIT compiled EncodingProvider:GetEncodingFromProvider(int):Encoding [Tier0, IL size=51, code size=232] 145: JIT compiled Encoding:FilterDisallowedEncodings(Encoding):Encoding [Tier0, IL size=29, code size=84] 146: JIT compiled LocalAppContextSwitches:get_EnableUnsafeUTF7Encoding():bool [Tier0, IL size=16, code size=46] 147: JIT compiled LocalAppContextSwitches:GetCachedSwitchValue(String,byref):bool [Tier0, IL size=22, code size=76] 148: JIT compiled LocalAppContextSwitches:GetCachedSwitchValueInternal(String,byref):bool [Tier0, IL size=46, code size=168] 149: JIT compiled LocalAppContextSwitches:GetSwitchDefaultValue(String):bool [Tier0, IL size=32, code size=98] 150: JIT compiled 
String:op_Equality(String,String):bool [Tier0, IL size=8, code size=39] 151: JIT compiled Encoding:get_Default():Encoding [Tier0, IL size=6, code size=49] 152: JIT compiled Encoding:.cctor() [Tier0, IL size=12, code size=73] 153: JIT compiled UTF8EncodingSealed:.ctor(bool):this [Tier0, IL size=8, code size=40] 154: JIT compiled UTF8Encoding:.ctor(bool):this [Tier0, IL size=14, code size=43] 155: JIT compiled UTF8Encoding:.ctor():this [Tier0, IL size=12, code size=36] 156: JIT compiled Encoding:.ctor(int):this [Tier0, IL size=42, code size=152] 157: JIT compiled UTF8Encoding:SetDefaultFallbacks():this [Tier0, IL size=64, code size=212] 158: JIT compiled EncoderReplacementFallback:.ctor(String):this [Tier0, IL size=110, code size=360] 159: JIT compiled EncoderFallback:.ctor():this [Tier0, IL size=7, code size=31] 160: JIT compiled String:get_Chars(int):ushort:this [Tier0, IL size=29, code size=61] 161: JIT compiled Char:IsSurrogate(ushort):bool [Tier0, IL size=17, code size=43] 162: JIT compiled Char:IsBetween(ushort,ushort,ushort):bool [Tier0, IL size=12, code size=52] 163: JIT compiled DecoderReplacementFallback:.ctor(String):this [Tier0, IL size=110, code size=360] 164: JIT compiled DecoderFallback:.ctor():this [Tier0, IL size=7, code size=31] 165: JIT compiled Encoding:get_CodePage():int:this [Tier0, IL size=7, code size=17] 166: JIT compiled Encoding:get_UTF8():Encoding [Tier0, IL size=6, code size=49] 167: JIT compiled UTF8Encoding:.cctor() [Tier0, IL size=12, code size=76] 168: JIT compiled Volatile:Write(byref,__Canon) [Tier0, IL size=6, code size=32] 169: JIT compiled ConsolePal:GetStandardFile(int,int,bool):Stream [Tier0, IL size=50, code size=183] 170: JIT compiled ConsolePal:get_InvalidHandleValue():long [Tier0, IL size=7, code size=41] 171: JIT compiled IntPtr:.ctor(int):this [Tier0, IL size=9, code size=25] 172: JIT compiled ConsolePal:ConsoleHandleIsWritable(long):bool [Tier0, IL size=26, code size=68] 173: JIT compiled Kernel32:WriteFile(long,long,int,byref,long):int [Tier0, IL size=46, code size=294] 174: JIT compiled Marshal:SetLastSystemError(int) [Tier0, IL size=7, code size=40] 175: JIT compiled Marshal:GetLastSystemError():int [Tier0, IL size=6, code size=34] 176: JIT compiled WindowsConsoleStream:.ctor(long,int,bool):this [Tier0, IL size=37, code size=90] 177: JIT compiled ConsoleStream:.ctor(int):this [Tier0, IL size=31, code size=71] 178: JIT compiled Stream:.ctor():this [Tier0, IL size=7, code size=31] 179: JIT compiled MarshalByRefObject:.ctor():this [Tier0, IL size=7, code size=31] 180: JIT compiled Kernel32:GetFileType(long):int [Tier0, IL size=27, code size=217] 181: JIT compiled Console:CreateOutputWriter(Stream):TextWriter [Tier0, IL size=50, code size=230] 182: JIT compiled Stream:.cctor() [Tier0, IL size=11, code size=71] 183: JIT compiled NullStream:.ctor():this [Tier0, IL size=7, code size=31] 184: JIT compiled EncodingExtensions:RemovePreamble(Encoding):Encoding [Tier0, IL size=25, code size=118] 185: JIT compiled UTF8EncodingSealed:get_Preamble():ReadOnlySpan`1:this [Tier0, IL size=24, code size=99] 186: JIT compiled UTF8Encoding:get_PreambleSpan():ReadOnlySpan`1 [Tier0, IL size=12, code size=87] 187: JIT compiled ConsoleEncoding:.ctor(Encoding):this [Tier0, IL size=14, code size=52] 188: JIT compiled Encoding:.ctor():this [Tier0, IL size=8, code size=33] 189: JIT compiled Encoding:SetDefaultFallbacks():this [Tier0, IL size=23, code size=65] 190: JIT compiled EncoderFallback:get_ReplacementFallback():EncoderFallback [Tier0, IL size=6, code size=49] 191: 
JIT compiled EncoderReplacementFallback:.cctor() [Tier0, IL size=11, code size=71] 192: JIT compiled EncoderReplacementFallback:.ctor():this [Tier0, IL size=12, code size=44] 193: JIT compiled DecoderFallback:get_ReplacementFallback():DecoderFallback [Tier0, IL size=6, code size=49] 194: JIT compiled DecoderReplacementFallback:.cctor() [Tier0, IL size=11, code size=71] 195: JIT compiled DecoderReplacementFallback:.ctor():this [Tier0, IL size=12, code size=44] 196: JIT compiled StreamWriter:.ctor(Stream,Encoding,int,bool):this [Tier0, IL size=201, code size=564] 197: JIT compiled Task:get_CompletedTask():Task [Tier0, IL size=6, code size=49] 198: JIT compiled Task:.cctor() [Tier0, IL size=76, code size=316] 199: JIT compiled TaskFactory:.ctor():this [Tier0, IL size=7, code size=31] 200: JIT compiled Task`1:.ctor(bool,VoidTaskResult,int,CancellationToken):this [Tier0, IL size=21, code size=75] 201: JIT compiled Task:.ctor(bool,int,CancellationToken):this [Tier0, IL size=70, code size=181] 202: JIT compiled <>c:.cctor() [Tier0, IL size=11, code size=71] 203: JIT compiled <>c:.ctor():this [Tier0, IL size=7, code size=31] 204: JIT compiled TextWriter:.ctor(IFormatProvider):this [Tier0, IL size=36, code size=124] 205: JIT compiled TextWriter:.cctor() [Tier0, IL size=26, code size=108] 206: JIT compiled NullTextWriter:.ctor():this [Tier0, IL size=7, code size=31] 207: JIT compiled TextWriter:.ctor():this [Tier0, IL size=29, code size=103] 208: JIT compiled String:ToCharArray():ref:this [Tier0, IL size=52, code size=173] 209: JIT compiled MemoryMarshal:GetArrayDataReference(ref):byref [Tier0, IL size=7, code size=24] 210: JIT compiled ConsoleStream:get_CanWrite():bool:this [Tier0, IL size=7, code size=18] 211: JIT compiled ConsoleEncoding:GetEncoder():Encoder:this [Tier0, IL size=12, code size=57] 212: JIT compiled UTF8Encoding:GetEncoder():Encoder:this [Tier0, IL size=7, code size=63] 213: JIT compiled EncoderNLS:.ctor(Encoding):this [Tier0, IL size=37, code size=102] 214: JIT compiled Encoder:.ctor():this [Tier0, IL size=7, code size=31] 215: JIT compiled Encoding:get_EncoderFallback():EncoderFallback:this [Tier0, IL size=7, code size=18] 216: JIT compiled EncoderNLS:Reset():this [Tier0, IL size=24, code size=92] 217: JIT compiled ConsoleStream:get_CanSeek():bool:this [Tier0, IL size=2, code size=12] 218: JIT compiled StreamWriter:set_AutoFlush(bool):this [Tier0, IL size=25, code size=72] 219: JIT compiled StreamWriter:CheckAsyncTaskInProgress():this [Tier0, IL size=19, code size=47] 220: JIT compiled Task:get_IsCompleted():bool:this [Tier0, IL size=16, code size=40] 221: JIT compiled Task:IsCompletedMethod(int):bool [Tier0, IL size=11, code size=25] 222: JIT compiled StreamWriter:Flush(bool,bool):this [Tier0, IL size=272, code size=1127] 223: JIT compiled StreamWriter:ThrowIfDisposed():this [Tier0, IL size=15, code size=43] 224: JIT compiled Encoding:get_Preamble():ReadOnlySpan`1:this [Tier0, IL size=12, code size=70] 225: JIT compiled ConsoleEncoding:GetPreamble():ref:this [Tier0, IL size=6, code size=27] 226: JIT compiled Array:Empty():ref [Tier0, IL size=6, code size=49] 227: JIT compiled EmptyArray`1:.cctor() [Tier0, IL size=12, code size=52] 228: JIT compiled ReadOnlySpan`1:op_Implicit(ref):ReadOnlySpan`1 [Tier0, IL size=7, code size=79] 229: JIT compiled ReadOnlySpan`1:.ctor(ref):this [Tier0, IL size=33, code size=81] 230: JIT compiled MemoryMarshal:GetArrayDataReference(ref):byref [Tier0, IL size=7, code size=24] 231: JIT compiled ConsoleEncoding:GetMaxByteCount(int):int:this [Tier0, IL 
size=13, code size=63] 232: JIT compiled UTF8EncodingSealed:GetMaxByteCount(int):int:this [Tier0, IL size=20, code size=50] 233: JIT compiled Span`1:.ctor(long,int):this [Tier0, IL size=51, code size=115] 234: JIT compiled ReadOnlySpan`1:.ctor(ref,int,int):this [Tier0, IL size=65, code size=147] 235: JIT compiled Encoder:GetBytes(ReadOnlySpan`1,Span`1,bool):int:this [Tier0, IL size=44, code size=234] 236: JIT compiled MemoryMarshal:GetNonNullPinnableReference(ReadOnlySpan`1):byref [Tier0, IL size=30, code size=54] 237: JIT compiled ReadOnlySpan`1:get_Length():int:this [Tier0, IL size=7, code size=17] 238: JIT compiled MemoryMarshal:GetNonNullPinnableReference(Span`1):byref [Tier0, IL size=30, code size=54] 239: JIT compiled Span`1:get_Length():int:this [Tier0, IL size=7, code size=17] 240: JIT compiled EncoderNLS:GetBytes(long,int,long,int,bool):int:this [Tier0, IL size=92, code size=279] 241: JIT compiled ArgumentNullException:ThrowIfNull(long,String) [Tier0, IL size=12, code size=45] 242: JIT compiled Encoding:GetBytes(long,int,long,int,EncoderNLS):int:this [Tier0, IL size=57, code size=187] 243: JIT compiled EncoderNLS:get_HasLeftoverData():bool:this [Tier0, IL size=35, code size=105] 244: JIT compiled UTF8Encoding:GetBytesFast(long,int,long,int,byref):int:this [Tier0, IL size=33, code size=119] 245: JIT compiled Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int [Tier0, IL size=1446, code size=3208] 246: JIT compiled Math:Min(int,int):int [Tier0, IL size=8, code size=28] 247: JIT compiled ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long [Tier0, IL size=490, code size=1187] 248: JIT compiled WindowsConsoleStream:Flush():this [Tier0, IL size=26, code size=56] 249: JIT compiled ConsoleStream:Flush():this [Tier0, IL size=1, code size=10] 250: JIT compiled TextWriter:Synchronized(TextWriter):TextWriter [Tier0, IL size=28, code size=121] 251: JIT compiled SyncTextWriter:.ctor(TextWriter):this [Tier0, IL size=14, code size=52] 252: JIT compiled SyncTextWriter:WriteLine(String):this [Tier0, IL size=13, code size=140] 253: JIT compiled StreamWriter:WriteLine(String):this [Tier0, IL size=20, code size=110] 254: JIT compiled String:op_Implicit(String):ReadOnlySpan`1 [Tier0, IL size=31, code size=171] 255: JIT compiled String:GetRawStringData():byref:this [Tier0, IL size=7, code size=24] 256: JIT compiled ReadOnlySpan`1:.ctor(byref,int):this [Tier0, IL size=15, code size=39] 257: JIT compiled StreamWriter:WriteSpan(ReadOnlySpan`1,bool):this [Tier0, IL size=368, code size=1036] 258: JIT compiled MemoryMarshal:GetReference(ReadOnlySpan`1):byref [Tier0, IL size=8, code size=17] 259: JIT compiled Buffer:MemoryCopy(long,long,long,long) [Tier0, IL size=21, code size=83] 260: JIT compiled Unsafe:ReadUnaligned(long):long [Tier0, IL size=10, code size=17] 261: JIT compiled ASCIIUtility:AllCharsInUInt64AreAscii(long):bool [Tier0, IL size=16, code size=38] 262: JIT compiled ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long) [Tier0, IL size=107, code size=171] 263: JIT compiled Unsafe:WriteUnaligned(byref,int) [Tier0, IL size=11, code size=22] 264: JIT compiled Unsafe:ReadUnaligned(long):int [Tier0, IL size=10, code size=16] 265: JIT compiled ASCIIUtility:AllCharsInUInt32AreAscii(int):bool [Tier0, IL size=11, code size=25] 266: JIT compiled ASCIIUtility:NarrowTwoUtf16CharsToAsciiAndWriteToBuffer(byref,int) [Tier0, IL size=24, code size=35] 267: JIT compiled Span`1:Slice(int,int):Span`1:this [Tier0, IL size=39, code size=135] 268: JIT compiled Span`1:.ctor(byref,int):this 
[Tier0, IL size=15, code size=39] 269: JIT compiled Span`1:op_Implicit(Span`1):ReadOnlySpan`1 [Tier0, IL size=19, code size=90] 270: JIT compiled ReadOnlySpan`1:.ctor(byref,int):this [Tier0, IL size=15, code size=39] 271: JIT compiled WindowsConsoleStream:Write(ReadOnlySpan`1):this [Tier0, IL size=35, code size=149] 272: JIT compiled WindowsConsoleStream:WriteFileNative(long,ReadOnlySpan`1,bool):int [Tier0, IL size=107, code size=272] 273: JIT compiled ReadOnlySpan`1:get_IsEmpty():bool:this [Tier0, IL size=10, code size=24]Hello, world! 274: JIT compiled AppContext:OnProcessExit() [Tier0, IL size=43, code size=161] 275: JIT compiled AssemblyLoadContext:OnProcessExit() [Tier0, IL size=101, code size=442] 276: JIT compiled EventListener:DisposeOnShutdown() [Tier0, IL size=150, code size=618] 277: JIT compiled List`1:.ctor():this [Tier0, IL size=18, code size=133] 278: JIT compiled List`1:.cctor() [Tier0, IL size=12, code size=129] 279: JIT compiled List`1:GetEnumerator():Enumerator:this [Tier0, IL size=7, code size=162] 280: JIT compiled Enumerator:.ctor(List`1):this [Tier0, IL size=39, code size=64] 281: JIT compiled Enumerator:MoveNext():bool:this [Tier0, IL size=81, code size=159] 282: JIT compiled Enumerator:get_Current():__Canon:this [Tier0, IL size=7, code size=22] 283: JIT compiled WeakReference`1:TryGetTarget(byref):bool:this [Tier0, IL size=24, code size=66] 284: JIT compiled List`1:AddWithResize(__Canon):this [Tier0, IL size=39, code size=85] 285: JIT compiled List`1:Grow(int):this [Tier0, IL size=53, code size=121] 286: JIT compiled List`1:set_Capacity(int):this [Tier0, IL size=86, code size=342] 287: JIT compiled CastHelpers:StelemRef_Helper(byref,long,Object) [Tier0, IL size=34, code size=104] 288: JIT compiled CastHelpers:StelemRef_Helper_NoCacheLookup(byref,long,Object) [Tier0, IL size=26, code size=111] 289: JIT compiled Enumerator:MoveNextRare():bool:this [Tier0, IL size=57, code size=80] 290: JIT compiled Enumerator:Dispose():this [Tier0, IL size=1, code size=14] 291: JIT compiled EventSource:Dispose():this [Tier0, IL size=14, code size=54] 292: JIT compiled EventSource:Dispose(bool):this [Tier0, IL size=124, code size=236] 293: JIT compiled EventProvider:Dispose():this [Tier0, IL size=14, code size=54] 294: JIT compiled EventProvider:Dispose(bool):this [Tier0, IL size=90, code size=230] 295: JIT compiled EventProvider:EventUnregister(long):this [Tier0, IL size=14, code size=50] 296: JIT compiled EtwEventProvider:System.Diagnostics.Tracing.IEventProvider.EventUnregister(long):int:this [Tier0, IL size=7, code size=181] 297: JIT compiled GC:SuppressFinalize(Object) [Tier0, IL size=18, code size=53] 298: JIT compiled EventPipeEventProvider:System.Diagnostics.Tracing.IEventProvider.EventUnregister(long):int:this [Tier0, IL size=13, code size=187]
With that out of the way, let’s move on to actual performance improvements, starting with on-stack replacement.
On-Stack Replacement
On-stack replacement (OSR) is one of the coolest features to hit the JIT in .NET 7. But to really understand OSR, we first need to understand tiered compilation, so a quick recap…
One of the issues a managed environment with a JIT compiler has to deal with is tradeoffs between startup and throughput. Historically, the job of an optimizing compiler is to, well, optimize, in order to enable the best possible throughput of the application or service once running. But such optimization takes analysis, takes time, and performing all of that work then leads to increased startup time, as all of the code on the startup path (e.g. all of the code that needs to be run before a web server can serve the first request) needs to be compiled. So a JIT compiler needs to make tradeoffs: better throughput at the expense of longer startup time, or better startup time at the expense of decreased throughput. For some kinds of apps and services, the tradeoff is an easy call, e.g. if your service starts up once and then runs for days, several extra seconds of startup time doesn’t matter, or if you’re a console application that’s going to do a quick computation and exit, startup time is all that matters. But how can the JIT know which scenario it’s in, and do we really want every developer having to know about these kinds of settings and tradeoffs and configure every one of their applications accordingly? One answer to this has been ahead-of-time compilation, which has taken various forms in .NET. For example, all of the core libraries are “crossgen”‘d, meaning they’ve been run through a tool that produces the previously mentioned R2R format, yielding binaries that contain assembly code that needs only minor tweaks to actually execute; not every method can have code generated for it, but enough that it significantly reduces startup time. Of course, such approaches have their own downsides, e.g. one of the promises of a JIT compiler is it can take advantage of knowledge of the current machine / process in order to best optimize, so for example the R2R images have to assume a certain baseline instruction set (e.g. what vectorizing instructions are available) whereas the JIT can see what’s actually available and use the best. “Tiered compilation” provides another answer, one that’s usable with or without these other ahead-of-time (AOT) compilation solutions.
Tiered compilation enables the JIT to have its proverbial cake and eat it, too. The idea is simple: allow the JIT to compile the same code multiple times. The first time, the JIT can use as few optimizations as make sense (a handful of optimizations can actually make the JIT’s own throughput faster, so those still make sense to apply), producing fairly unoptimized assembly code but doing so really quickly. And when it does so, it can add some instrumentation into the assembly to track how often the methods are called. As it turns out, many functions used on a startup path are invoked once or maybe only a handful of times, and it would take more time to optimize them than it does to just execute them unoptimized. Then, when the method’s instrumentation triggers some threshold, for example a method having been executed 30 times, a work item gets queued to recompile that method, but this time with all the optimizations the JIT can throw at it. This is lovingly referred to as “tiering up.” Once that recompilation has completed, call sites to the method are patched with the address of the newly highly optimized assembly code, and future invocations will then take the fast path. So, we get faster startup and faster sustained throughput. At least, that’s the hope.
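To make the “tiering up” idea concrete, here’s a deliberately simplified sketch of the call-counting concept. This is an analogy of my own in plain C#, using a delegate swap and an invented threshold; the real mechanism lives inside the runtime and patches generated machine code rather than delegates:

using System;
using System.Threading;

// Toy model of tiered compilation: begin with a quickly-produced implementation,
// count invocations, and swap in an "optimized" implementation once a threshold
// is crossed. Names, threshold, and delegate-based dispatch are all illustrative.
public class TieredMethod
{
    private const int Threshold = 30;          // stand-in for the call-count trigger
    private int _calls;
    private volatile bool _tieredUp;
    private Func<int, int> _impl;

    public TieredMethod() => _impl = Tier0;

    public int Invoke(int x)
    {
        if (!_tieredUp && Interlocked.Increment(ref _calls) >= Threshold)
        {
            _impl = Tier1;                     // "recompile" and patch the call site
            _tieredUp = true;
        }
        return _impl(x);
    }

    private static int Tier0(int x) => x * 2;  // cheap to produce, unoptimized
    private static int Tier1(int x) => x << 1; // same semantics, "optimized"
}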
A problem, however, is methods that don’t fit this mold. While it’s certainly the case that many performance-sensitive methods are relatively quick and executed many, many, many times, there’s also a large number of performance-sensitive methods that are executed just a handful of times, or maybe even only once, but that take a very long time to execute, maybe even the duration of the whole process: methods with loops. As a result, by default tiered compilation hasn’t applied to loops, though it can be enabled by setting the DOTNET_TC_QuickJitForLoops environment variable to 1. We can see the effect of this by trying this simple console app with .NET 6. With the default settings, run this app:
class Program
{
    static void Main()
    {
        var sw = new System.Diagnostics.Stopwatch();
        while (true)
        {
            sw.Restart();
            for (int trial = 0; trial < 10_000; trial++)
            {
                int count = 0;
                for (int i = 0; i < char.MaxValue; i++)
                    if (IsAsciiDigit((char)i))
                        count++;
            }
            sw.Stop();
            Console.WriteLine(sw.Elapsed);
        }

        static bool IsAsciiDigit(char c) => (uint)(c - '0') <= 9;
    }
}
I get numbers printed out like:
00:00:00.5734352
00:00:00.5526667
00:00:00.5675267
00:00:00.5588724
00:00:00.5616028
Now, try setting DOTNET_TC_QuickJitForLoops to 1. When I then run it again, I get numbers like this:
00:00:01.2841397
00:00:01.2693485
00:00:01.2755646
00:00:01.2656678
00:00:01.2679925
In other words, with DOTNET_TC_QuickJitForLoops enabled, it’s taking 2.5x as long as without (the default in .NET 6). That’s because this main function never gets optimizations applied to it. By setting DOTNET_TC_QuickJitForLoops to 1, we’re saying “JIT, please apply tiering to methods with loops as well,” but this method with a loop is only ever invoked once, so for the duration of the process it ends up remaining at “tier-0,” aka unoptimized. Now, let’s try the same thing with .NET 7. Regardless of whether that environment variable is set, I again get numbers like this:
00:00:00.5528889
00:00:00.5562563
00:00:00.5622086
00:00:00.5668220
00:00:00.5589112
but importantly, this method was still participating in tiering. In fact, we can get confirmation of that by using the aforementioned DOTNET_JitDisasmSummary=1 environment variable. When I set that and run again, I see these lines in the output:
   4: JIT compiled Program:Main() [Tier0, IL size=83, code size=319]
...
   6: JIT compiled Program:Main() [Tier1-OSR @0x27, IL size=83, code size=380]
highlighting that Main was indeed compiled twice. How is that possible? On-stack replacement.
The idea behind on-stack replacement is a method can be replaced not just between invocations but even while it’s executing, while it’s “on the stack.” In addition to the tier-0 code being instrumented for call counts, loops are also instrumented for iteration counts. When the iterations surpass a certain limit, the JIT compiles a new highly optimized version of that method, transfers all the local/register state from the current invocation to the new invocation, and then jumps to the appropriate location in the new method. We can see this in action by using the previously discussed DOTNET_JitDisasm environment variable. Set that to Program:* in order to see the assembly code generated for all of the methods in the Program class, and then run the app again. You should see output like the following:
; Assembly listing for method Program:Main(); Emitting BLENDED_CODE for X64 CPU with AVX - Windows; Tier-0 compilation; MinOpts code; rbp based frame; partially interruptibleG_M000_IG01: ;; offset=0000H 55 push rbp 4881EC80000000 sub rsp, 128 488DAC2480000000 lea rbp, [rsp+80H] C5D857E4 vxorps xmm4, xmm4 C5F97F65B0 vmovdqa xmmword ptr [rbp-50H], xmm4 33C0 xor eax, eax 488945C0 mov qword ptr [rbp-40H], raxG_M000_IG02: ;; offset=001FH 48B9002F0B50FC7F0000 mov rcx, 0x7FFC500B2F00 E8721FB25F call CORINFO_HELP_NEWSFAST 488945B0 mov gword ptr [rbp-50H], rax 488B4DB0 mov rcx, gword ptr [rbp-50H] FF1544C70D00 call [Stopwatch:.ctor():this] 488B4DB0 mov rcx, gword ptr [rbp-50H] 48894DC0 mov gword ptr [rbp-40H], rcx C745A8E8030000 mov dword ptr [rbp-58H], 0x3E8G_M000_IG03: ;; offset=004BH 8B4DA8 mov ecx, dword ptr [rbp-58H] FFC9 dec ecx 894DA8 mov dword ptr [rbp-58H], ecx 837DA800 cmp dword ptr [rbp-58H], 0 7F0E jg SHORT G_M000_IG05G_M000_IG04: ;; offset=0059H 488D4DA8 lea rcx, [rbp-58H] BA06000000 mov edx, 6 E8B985AB5F call CORINFO_HELP_PATCHPOINTG_M000_IG05: ;; offset=0067H 488B4DC0 mov rcx, gword ptr [rbp-40H] 3909 cmp dword ptr [rcx], ecx FF1585C70D00 call [Stopwatch:Restart():this] 33C9 xor ecx, ecx 894DBC mov dword ptr [rbp-44H], ecx 33C9 xor ecx, ecx 894DB8 mov dword ptr [rbp-48H], ecx EB20 jmp SHORT G_M000_IG08G_M000_IG06: ;; offset=007FH 8B4DB8 mov ecx, dword ptr [rbp-48H] 0FB7C9 movzx rcx, cx FF152DD40B00 call [Program:<Main>g__IsAsciiDigit|0_0(ushort):bool] 85C0 test eax, eax 7408 je SHORT G_M000_IG07 8B4DBC mov ecx, dword ptr [rbp-44H] FFC1 inc ecx 894DBC mov dword ptr [rbp-44H], ecxG_M000_IG07: ;; offset=0097H 8B4DB8 mov ecx, dword ptr [rbp-48H] FFC1 inc ecx 894DB8 mov dword ptr [rbp-48H], ecxG_M000_IG08: ;; offset=009FH 8B4DA8 mov ecx, dword ptr [rbp-58H] FFC9 dec ecx 894DA8 mov dword ptr [rbp-58H], ecx 837DA800 cmp dword ptr [rbp-58H], 0 7F0E jg SHORT G_M000_IG10G_M000_IG09: ;; offset=00ADH 488D4DA8 lea rcx, [rbp-58H] BA23000000 mov edx, 35 E86585AB5F call CORINFO_HELP_PATCHPOINTG_M000_IG10: ;; offset=00BBH 817DB800CA9A3B cmp dword ptr [rbp-48H], 0x3B9ACA00 7CBB jl SHORT G_M000_IG06 488B4DC0 mov rcx, gword ptr [rbp-40H] 3909 cmp dword ptr [rcx], ecx FF1570C70D00 call [Stopwatch:get_ElapsedMilliseconds():long:this] 488BC8 mov rcx, rax FF1507D00D00 call [Console:WriteLine(long)] E96DFFFFFF jmp G_M000_IG03; Total bytes of code 222; Assembly listing for method Program:<Main>g__IsAsciiDigit|0_0(ushort):bool; Emitting BLENDED_CODE for X64 CPU with AVX - Windows; Tier-0 compilation; MinOpts code; rbp based frame; partially interruptibleG_M000_IG01: ;; offset=0000H 55 push rbp 488BEC mov rbp, rsp 894D10 mov dword ptr [rbp+10H], ecxG_M000_IG02: ;; offset=0007H 8B4510 mov eax, dword ptr [rbp+10H] 0FB7C0 movzx rax, ax 83C0D0 add eax, -48 83F809 cmp eax, 9 0F96C0 setbe al 0FB6C0 movzx rax, alG_M000_IG03: ;; offset=0019H 5D pop rbp C3 ret
A few relevant things to notice here. First, the comments at the top highlight how this code was compiled:
; Tier-0 compilation
; MinOpts code
So, we know this is the initial version (“Tier-0”) of the method compiled with minimal optimization (“MinOpts”). Second, note this line of the assembly:
FF152DD40B00 call [Program:<Main>g__IsAsciiDigit|0_0(ushort):bool]
Our IsAsciiDigit helper method is trivially inlineable, but it’s not getting inlined; instead, the assembly has a call to it, and indeed we can see below the generated code (also “MinOpts”) for IsAsciiDigit. Why? Because inlining is an optimization (a really important one) that’s disabled as part of tier-0 (because the analysis for doing inlining well is also quite costly). Third, we can see the code the JIT is outputting to instrument this method. This is a bit more involved, but I’ll point out the relevant parts. First, we see:
C745A8E8030000 mov dword ptr [rbp-58H], 0x3E8
That 0x3E8 is the hex value for the decimal 1,000, which is the default number of iterations a loop needs to iterate before the JIT will generate the optimized version of the method (this is configurable via the DOTNET_TC_OnStackReplacement_InitialCounter environment variable). So we see 1,000 being stored into this stack location. Then a bit later in the method we see this:
G_M000_IG03:                ;; offset=004BH
       8B4DA8               mov      ecx, dword ptr [rbp-58H]
       FFC9                 dec      ecx
       894DA8               mov      dword ptr [rbp-58H], ecx
       837DA800             cmp      dword ptr [rbp-58H], 0
       7F0E                 jg       SHORT G_M000_IG05
G_M000_IG04:                ;; offset=0059H
       488D4DA8             lea      rcx, [rbp-58H]
       BA06000000           mov      edx, 6
       E8B985AB5F           call     CORINFO_HELP_PATCHPOINT
G_M000_IG05:                ;; offset=0067H
The generated code is loading that counter into the ecx register, decrementing it, storing it back, and then seeing whether the counter dropped to 0. If it didn’t, the code skips to G_M000_IG05, which is the label for the actual code in the rest of the loop. But if the counter did drop to 0, the JIT proceeds to store relevant state into the rcx and edx registers and then calls the CORINFO_HELP_PATCHPOINT helper method. That helper is responsible for triggering the creation of the optimized method if it doesn’t yet exist, fixing up all appropriate tracking state, and jumping to the new method. And indeed, if you look again at your console output from running the program, you’ll see yet another output for the Main method:
; Assembly listing for method Program:Main(); Emitting BLENDED_CODE for X64 CPU with AVX - Windows; Tier-1 compilation; OSR variant for entry point 0x23; optimized code; rsp based frame; fully interruptible; No PGO data; 1 inlinees with PGO data; 8 single block inlinees; 0 inlinees without PGO dataG_M000_IG01: ;; offset=0000H 4883EC58 sub rsp, 88 4889BC24D8000000 mov qword ptr [rsp+D8H], rdi 4889B424D0000000 mov qword ptr [rsp+D0H], rsi 48899C24C8000000 mov qword ptr [rsp+C8H], rbx C5F877 vzeroupper 33C0 xor eax, eax 4889442428 mov qword ptr [rsp+28H], rax 4889442420 mov qword ptr [rsp+20H], rax 488B9C24A0000000 mov rbx, gword ptr [rsp+A0H] 8BBC249C000000 mov edi, dword ptr [rsp+9CH] 8BB42498000000 mov esi, dword ptr [rsp+98H]G_M000_IG02: ;; offset=0041H EB45 jmp SHORT G_M000_IG05 align [0 bytes for IG06]G_M000_IG03: ;; offset=0043H 33C9 xor ecx, ecx 488B9C24A0000000 mov rbx, gword ptr [rsp+A0H] 48894B08 mov qword ptr [rbx+08H], rcx 488D4C2428 lea rcx, [rsp+28H] 48B87066E68AFD7F0000 mov rax, 0x7FFD8AE66670G_M000_IG04: ;; offset=0060H FFD0 call rax ; Kernel32:QueryPerformanceCounter(long):int 488B442428 mov rax, qword ptr [rsp+28H] 488B9C24A0000000 mov rbx, gword ptr [rsp+A0H] 48894310 mov qword ptr [rbx+10H], rax C6431801 mov byte ptr [rbx+18H], 1 33FF xor edi, edi 33F6 xor esi, esi 833D92A1E55F00 cmp dword ptr [(reloc 0x7ffcafe1ae34)], 0 0F85CA000000 jne G_M000_IG13G_M000_IG05: ;; offset=0088H 81FE00CA9A3B cmp esi, 0x3B9ACA00 7D17 jge SHORT G_M000_IG09G_M000_IG06: ;; offset=0090H 0FB7CE movzx rcx, si 83C1D0 add ecx, -48 83F909 cmp ecx, 9 7702 ja SHORT G_M000_IG08G_M000_IG07: ;; offset=009BH FFC7 inc ediG_M000_IG08: ;; offset=009DH FFC6 inc esi 81FE00CA9A3B cmp esi, 0x3B9ACA00 7CE9 jl SHORT G_M000_IG06G_M000_IG09: ;; offset=00A7H 488B6B08 mov rbp, qword ptr [rbx+08H] 48899C24A0000000 mov gword ptr [rsp+A0H], rbx 807B1800 cmp byte ptr [rbx+18H], 0 7436 je SHORT G_M000_IG12G_M000_IG10: ;; offset=00B9H 488D4C2420 lea rcx, [rsp+20H] 48B87066E68AFD7F0000 mov rax, 0x7FFD8AE66670G_M000_IG11: ;; offset=00C8H FFD0 call rax ; Kernel32:QueryPerformanceCounter(long):int 488B4C2420 mov rcx, qword ptr [rsp+20H] 488B9C24A0000000 mov rbx, gword ptr [rsp+A0H] 482B4B10 sub rcx, qword ptr [rbx+10H] 4803E9 add rbp, rcx 833D2FA1E55F00 cmp dword ptr [(reloc 0x7ffcafe1ae34)], 0 48899C24A0000000 mov gword ptr [rsp+A0H], rbx 756D jne SHORT G_M000_IG14G_M000_IG12: ;; offset=00EFH C5F857C0 vxorps xmm0, xmm0 C4E1FB2AC5 vcvtsi2sd xmm0, rbp C5FB11442430 vmovsd qword ptr [rsp+30H], xmm0 48B9F04BF24FFC7F0000 mov rcx, 0x7FFC4FF24BF0 BAE7070000 mov edx, 0x7E7 E82E1FB25F call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE C5FB10442430 vmovsd xmm0, qword ptr [rsp+30H] C5FB5905E049F6FF vmulsd xmm0, xmm0, qword ptr [(reloc 0x7ffc4ff25720)] C4E1FB2CD0 vcvttsd2si rdx, xmm0 48B94B598638D6C56D34 mov rcx, 0x346DC5D63886594B 488BC1 mov rax, rcx 48F7EA imul rdx:rax, rdx 488BCA mov rcx, rdx 48C1E93F shr rcx, 63 48C1FA0B sar rdx, 11 4803CA add rcx, rdx FF1567CE0D00 call [Console:WriteLine(long)] E9F5FEFFFF jmp G_M000_IG03G_M000_IG13: ;; offset=014EH E8DDCBAC5F call CORINFO_HELP_POLL_GC E930FFFFFF jmp G_M000_IG05G_M000_IG14: ;; offset=0158H E8D3CBAC5F call CORINFO_HELP_POLL_GC EB90 jmp SHORT G_M000_IG12; Total bytes of code 351
Here, again, we notice a few interesting things. First, in the header we see this:
; Tier-1 compilation; OSR variant for entry point 0x23; optimized code
so we know this is both optimized “tier-1” code and the “OSR variant” for this method. Second, notice there’s no longer a call to the IsAsciiDigit helper. Instead, where that call would have been, we see this:
G_M000_IG06:  ;; offset=0090H
       0FB7CE    movzx rcx, si
       83C1D0    add ecx, -48
       83F909    cmp ecx, 9
       7702      ja SHORT G_M000_IG08
This is loading a value into rcx, subtracting 48 from it (48 is the decimal ASCII value of the '0' character), and comparing the resulting value to 9. Sounds an awful lot like our IsAsciiDigit implementation ((uint)(c - '0') <= 9), doesn’t it? That’s because it is. The helper was successfully inlined in this now-optimized code.
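For reference, the helper being discussed is just a one-liner; reconstructing it from the (uint)(c - '0') <= 9 expression cited above, it presumably looks something like this:

static bool IsAsciiDigit(char c) => (uint)(c - '0') <= 9;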
Great, so now in .NET 7, we can largely avoid the tradeoffs between startup and throughput, as OSR enables tiered compilation to apply to all methods, even those that are long-running. A multitude of PRs went into enabling this, including many over the last few years, but all of the functionality was disabled in the shipping bits. Thanks to improvements like dotnet/runtime#62831 which implemented support for OSR on Arm64 (previously only x64 support was implemented), and dotnet/runtime#63406 and dotnet/runtime#65609 which revised how OSR imports and epilogs are handled, dotnet/runtime#65675 enables OSR (and as a result DOTNET_TC_QuickJitForLoops) by default.
But, tiered compilation and OSR aren’t just about startup (though they’re of course very valuable there). They’re also about further improving throughput. Even though tiered compilation was originally envisioned as a way to optimize startup while not hurting throughput, it’s become much more than that. There are various things the JIT can learn about a method during tier-0 that it can then use for tier-1. For example, the very fact that the tier-0 code executed means that any statics accessed by the method will have been initialized, and that means that any readonly statics will not only have been initialized by the time the tier-1 code executes but their values won’t ever change. And that in turn means that any readonly statics of primitive types (e.g. bool, int, etc.) can be treated like consts instead of static readonly fields, and during tier-1 compilation the JIT can optimize them just as it would have optimized a const. For example, try running this simple program after setting DOTNET_JitDisasm to Program:Test:
using System.Runtime.CompilerServices;

class Program
{
    static readonly bool Is64Bit = Environment.Is64BitProcess;

    static int Main()
    {
        int count = 0;
        for (int i = 0; i < 1_000_000_000; i++)
            if (Test())
                count++;
        return count;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static bool Test() => Is64Bit;
}
When I do so, I get this output:
; Assembly listing for method Program:Test():bool; Emitting BLENDED_CODE for X64 CPU with AVX - Windows; Tier-0 compilation; MinOpts code; rbp based frame; partially interruptibleG_M000_IG01: ;; offset=0000H 55 push rbp 4883EC20 sub rsp, 32 488D6C2420 lea rbp, [rsp+20H]G_M000_IG02: ;; offset=000AH 48B9B8639A3FFC7F0000 mov rcx, 0x7FFC3F9A63B8 BA01000000 mov edx, 1 E8C220B25F call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE 0FB60545580C00 movzx rax, byte ptr [(reloc 0x7ffc3f9a63ea)]G_M000_IG03: ;; offset=0025H 4883C420 add rsp, 32 5D pop rbp C3 ret; Total bytes of code 43; Assembly listing for method Program:Test():bool; Emitting BLENDED_CODE for X64 CPU with AVX - Windows; Tier-1 compilation; optimized code; rsp based frame; partially interruptible; No PGO dataG_M000_IG01: ;; offset=0000HG_M000_IG02: ;; offset=0000H B801000000 mov eax, 1G_M000_IG03: ;; offset=0005H C3 ret; Total bytes of code 6
Note, again, we see two outputs for Program:Test. First, we see the “Tier-0” code, which is accessing a static (note the call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE instruction). But then we see the “Tier-1” code, where all of that overhead has vanished and is instead replaced simply by mov eax, 1. Since the “Tier-0” code had to have executed in order for it to tier up, the “Tier-1” code was generated knowing that the value of the static readonly bool Is64Bit field was true (1), and so the entirety of this method is storing the value 1 into the eax register used for the return value.
This is so useful that components are now written with tiering in mind. Consider the new Regex source generator, which is discussed later in this post (Roslyn source generators were introduced a couple of years ago; just as Roslyn analyzers are able to plug into the compiler and surface additional diagnostics based on all of the data the compiler learns from the source code, Roslyn source generators are able to analyze that same data and then further augment the compilation unit with additional source). The Regex source generator applies a technique based on this in dotnet/runtime#67775. Regex supports setting a process-wide timeout that gets applied to Regex instances that don’t explicitly set a timeout. That means, even though it’s super rare for such a process-wide timeout to be set, the Regex source generator still needs to output timeout-related code just in case it’s needed. It does so by outputting some helpers like this:
static class Utilities
{
    internal static readonly TimeSpan s_defaultTimeout =
        AppContext.GetData("REGEX_DEFAULT_MATCH_TIMEOUT") is TimeSpan timeout ? timeout : Timeout.InfiniteTimeSpan;

    internal static readonly bool s_hasTimeout = s_defaultTimeout != Timeout.InfiniteTimeSpan;
}
which it then uses at call sites like this:
if (Utilities.s_hasTimeout)
{
    base.CheckTimeout();
}
In tier-0, these checks will still be emitted in the assembly code, but in tier-1, where throughput matters, if the relevant AppContext switch hasn’t been set, then s_defaultTimeout will be Timeout.InfiniteTimeSpan, at which point s_hasTimeout will be false. And since s_hasTimeout is a static readonly bool, the JIT will be able to treat it as a const, and all conditions like if (Utilities.s_hasTimeout) will be treated as equivalent to if (false) and be eliminated from the assembly code entirely as dead code.
But, this is somewhat old news. The JIT has been able to do such an optimization since tiered compilation was introduced in .NET Core 3.0. Now in .NET 7, though, with OSR it’s also able to do so by default for methods with loops (and thus enable cases like the regex one). However, the real magic of OSR comes into play when combined with another exciting feature: dynamic PGO.
PGO
I wrote about profile-guided optimization (PGO) in my Performance Improvements in .NET 6 post, but I’ll cover it again here as it’s seen a multitude of improvements for .NET 7.
PGO has been around for a long time, in any number of languages and compilers. The basic idea is you compile your app, asking the compiler to inject instrumentation into the application to track various pieces of interesting information. You then put your app through its paces, running through various common scenarios, causing that instrumentation to “profile” what happens when the app is executed, and the results of that are then saved out. The app is then recompiled, feeding those instrumentation results back into the compiler, and allowing it to optimize the app for exactly how it’s expected to be used. This approach to PGO is referred to as “static PGO,” as the information is all gleaned ahead of actual deployment, and it’s something .NET has been doing in various forms for years. From my perspective, though, the really interesting development in .NET is “dynamic PGO,” which was introduced in .NET 6, but off by default.
Dynamic PGO takes advantage of tiered compilation. I noted that the JIT instruments the tier-0 code to track how many times the method is called, or in the case of loops, how many times the loop executes. It can instrument it for other things as well. For example, it can track exactly which concrete types are used as the target of an interface dispatch, and then in tier-1 specialize the code to expect the most common types (this is referred to as “guarded devirtualization,” or GDV). You can see this in this little example. Set the DOTNET_TieredPGO environment variable to 1, and then run this on .NET 7:
class Program
{
    static void Main()
    {
        IPrinter printer = new Printer();
        for (int i = 0; ; i++)
        {
            DoWork(printer, i);
        }
    }

    static void DoWork(IPrinter printer, int i)
    {
        printer.PrintIfTrue(i == int.MaxValue);
    }

    interface IPrinter
    {
        void PrintIfTrue(bool condition);
    }

    class Printer : IPrinter
    {
        public void PrintIfTrue(bool condition)
        {
            if (condition) Console.WriteLine("Print!");
        }
    }
}
The tier-0 code for DoWork ends up looking like this:
G_M000_IG01: ;; offset=0000H 55 push rbp 4883EC30 sub rsp, 48 488D6C2430 lea rbp, [rsp+30H] 33C0 xor eax, eax 488945F8 mov qword ptr [rbp-08H], rax 488945F0 mov qword ptr [rbp-10H], rax 48894D10 mov gword ptr [rbp+10H], rcx 895518 mov dword ptr [rbp+18H], edxG_M000_IG02: ;; offset=001BH FF059F220F00 inc dword ptr [(reloc 0x7ffc3f1b2ea0)] 488B4D10 mov rcx, gword ptr [rbp+10H] 48894DF8 mov gword ptr [rbp-08H], rcx 488B4DF8 mov rcx, gword ptr [rbp-08H] 48BAA82E1B3FFC7F0000 mov rdx, 0x7FFC3F1B2EA8 E8B47EC55F call CORINFO_HELP_CLASSPROFILE32 488B4DF8 mov rcx, gword ptr [rbp-08H] 48894DF0 mov gword ptr [rbp-10H], rcx 488B4DF0 mov rcx, gword ptr [rbp-10H] 33D2 xor edx, edx 817D18FFFFFF7F cmp dword ptr [rbp+18H], 0x7FFFFFFF 0F94C2 sete dl 49BB0800F13EFC7F0000 mov r11, 0x7FFC3EF10008 41FF13 call [r11]IPrinter:PrintIfTrue(bool):this 90 nopG_M000_IG03: ;; offset=0062H 4883C430 add rsp, 48 5D pop rbp C3 ret
and most notably, you can see the call [r11]IPrinter:PrintIfTrue(bool):this doing the interface dispatch. But then look at the code generated for tier-1. We still see the call [r11]IPrinter:PrintIfTrue(bool):this, but we also see this:
G_M000_IG02: ;; offset=0020H 48B9982D1B3FFC7F0000 mov rcx, 0x7FFC3F1B2D98 48390F cmp qword ptr [rdi], rcx 7521 jne SHORT G_M000_IG05 81FEFFFFFF7F cmp esi, 0x7FFFFFFF 7404 je SHORT G_M000_IG04G_M000_IG03: ;; offset=0037H FFC6 inc esi EBE5 jmp SHORT G_M000_IG02G_M000_IG04: ;; offset=003BH 48B9D820801A24020000 mov rcx, 0x2241A8020D8 488B09 mov rcx, gword ptr [rcx] FF1572CD0D00 call [Console:WriteLine(String)] EBE7 jmp SHORT G_M000_IG03
That first block is checking the concrete type of the IPrinter (stored in rdi) and comparing it against the known type for Printer (0x7FFC3F1B2D98). If they’re different, it just jumps to the same interface dispatch it was doing in the unoptimized version. But if they’re the same, it then jumps directly to an inlined version of Printer.PrintIfTrue (you can see the call to Console:WriteLine right there in this method). Thus, the common case (the only case in this example) is super efficient at the expense of a single comparison and branch.
That all existed in .NET 6, so why are we talking about it now? Several things have improved. First, PGO now works with OSR, thanks to improvements like dotnet/runtime#61453. That’s a big deal, as it means hot long-running methods that do this kind of interface dispatch (which are fairly common) can get these kinds of devirtualization/inlining optimizations. Second, while PGO isn’t currently enabled by default, we’ve made it much easier to turn on. Between dotnet/runtime#71438 and dotnet/sdk#26350, it’s now possible to simply put <TieredPGO>true</TieredPGO> into your .csproj (a minimal example is shown below), and it’ll have the same effect as if you set DOTNET_TieredPGO=1 prior to every invocation of the app, enabling dynamic PGO (note that it doesn’t disable use of R2R images, so if you want the entirety of the core libraries also employing dynamic PGO, you’ll also need to set DOTNET_ReadyToRun=0). Third, dynamic PGO has been taught how to instrument and optimize additional things.
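To make that second point concrete, here’s a minimal sketch of an SDK-style .csproj with the property set (the surrounding elements are just illustrative project boilerplate, not specific to PGO):

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
    <!-- Same effect as setting DOTNET_TieredPGO=1 before every invocation of the app -->
    <TieredPGO>true</TieredPGO>
  </PropertyGroup>
</Project>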
PGO already knew how to instrument virtual dispatch. Now in .NET 7, thanks in large part to dotnet/runtime#68703, it can do so for delegates as well (at least for delegates to instance methods). Consider this simple console app:
using System.Runtime.CompilerServices;

class Program
{
    static int[] s_values = Enumerable.Range(0, 1_000).ToArray();

    static void Main()
    {
        for (int i = 0; i < 1_000_000; i++)
            Sum(s_values, i => i * 42);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Sum(int[] values, Func<int, int> func)
    {
        int sum = 0;
        foreach (int value in values)
            sum += func(value);
        return sum;
    }
}
Without PGO enabled, I get optimized assembly like this:
; Assembly listing for method Program:Sum(ref,Func`2):int; Emitting BLENDED_CODE for X64 CPU with AVX - Windows; Tier-1 compilation; optimized code; rsp based frame; partially interruptible; No PGO dataG_M000_IG01: ;; offset=0000H 4156 push r14 57 push rdi 56 push rsi 55 push rbp 53 push rbx 4883EC20 sub rsp, 32 488BF2 mov rsi, rdxG_M000_IG02: ;; offset=000DH 33FF xor edi, edi 488BD9 mov rbx, rcx 33ED xor ebp, ebp 448B7308 mov r14d, dword ptr [rbx+08H] 4585F6 test r14d, r14d 7E16 jle SHORT G_M000_IG04G_M000_IG03: ;; offset=001DH 8BD5 mov edx, ebp 8B549310 mov edx, dword ptr [rbx+4*rdx+10H] 488B4E08 mov rcx, gword ptr [rsi+08H] FF5618 call [rsi+18H]Func`2:Invoke(int):int:this 03F8 add edi, eax FFC5 inc ebp 443BF5 cmp r14d, ebp 7FEA jg SHORT G_M000_IG03G_M000_IG04: ;; offset=0033H 8BC7 mov eax, ediG_M000_IG05: ;; offset=0035H 4883C420 add rsp, 32 5B pop rbx 5D pop rbp 5E pop rsi 5F pop rdi 415E pop r14 C3 ret; Total bytes of code 64
Note the call [rsi+18H]Func`2:Invoke(int):int:this in there that’s invoking the delegate. Now with PGO enabled:
; Assembly listing for method Program:Sum(ref,Func`2):int; Emitting BLENDED_CODE for X64 CPU with AVX - Windows; Tier-1 compilation; optimized code; optimized using profile data; rsp based frame; fully interruptible; with Dynamic PGO: edge weights are valid, and fgCalledCount is 5628; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO dataG_M000_IG01: ;; offset=0000H 4157 push r15 4156 push r14 57 push rdi 56 push rsi 55 push rbp 53 push rbx 4883EC28 sub rsp, 40 488BF2 mov rsi, rdxG_M000_IG02: ;; offset=000FH 33FF xor edi, edi 488BD9 mov rbx, rcx 33ED xor ebp, ebp 448B7308 mov r14d, dword ptr [rbx+08H] 4585F6 test r14d, r14d 7E27 jle SHORT G_M000_IG05G_M000_IG03: ;; offset=001FH 8BC5 mov eax, ebp 8B548310 mov edx, dword ptr [rbx+4*rax+10H] 4C8B4618 mov r8, qword ptr [rsi+18H] 48B8A0C2CF3CFC7F0000 mov rax, 0x7FFC3CCFC2A0 4C3BC0 cmp r8, rax 751D jne SHORT G_M000_IG07 446BFA2A imul r15d, edx, 42G_M000_IG04: ;; offset=003CH 4103FF add edi, r15d FFC5 inc ebp 443BF5 cmp r14d, ebp 7FD9 jg SHORT G_M000_IG03G_M000_IG05: ;; offset=0046H 8BC7 mov eax, ediG_M000_IG06: ;; offset=0048H 4883C428 add rsp, 40 5B pop rbx 5D pop rbp 5E pop rsi 5F pop rdi 415E pop r14 415F pop r15 C3 retG_M000_IG07: ;; offset=0055H 488B4E08 mov rcx, gword ptr [rsi+08H] 41FFD0 call r8 448BF8 mov r15d, eax EBDB jmp SHORT G_M000_IG04
I chose the 42 constant in i => i * 42 to make it easy to see in the assembly, and sure enough, there it is:
G_M000_IG03:  ;; offset=001FH
       8BC5                  mov eax, ebp
       8B548310              mov edx, dword ptr [rbx+4*rax+10H]
       4C8B4618              mov r8, qword ptr [rsi+18H]
       48B8A0C2CF3CFC7F0000  mov rax, 0x7FFC3CCFC2A0
       4C3BC0                cmp r8, rax
       751D                  jne SHORT G_M000_IG07
       446BFA2A              imul r15d, edx, 42
This is loading the target address from the delegate into r8 and loading the address of the expected target into rax. If they’re the same, it then simply performs the inlined operation (imul r15d, edx, 42), and otherwise it jumps to G_M000_IG07, which calls the function in r8. The effect of this is obvious if we run this as a benchmark:
static int[] s_values = Enumerable.Range(0, 1_000).ToArray();

[Benchmark]
public int DelegatePGO() => Sum(s_values, i => i * 42);

static int Sum(int[] values, Func<int, int>? func)
{
    int sum = 0;
    foreach (int value in values)
    {
        sum += func(value);
    }
    return sum;
}
With PGO disabled, we get the same performance throughput for .NET 6 and .NET 7:
Method | Runtime | Mean | Ratio |
---|---|---|---|
DelegatePGO | .NET 6.0 | 1.665 us | 1.00 |
DelegatePGO | .NET 7.0 | 1.659 us | 1.00 |
But the picture changes when we enable dynamic PGO (DOTNET_TieredPGO=1). .NET 6 gets ~14% faster, but .NET 7 gets ~3x faster!
Method | Runtime | Mean | Ratio |
---|---|---|---|
DelegatePGO | .NET 6.0 | 1,427.7 ns | 1.00 |
DelegatePGO | .NET 7.0 | 539.0 ns | 0.38 |
dotnet/runtime#70377 is another valuable improvement with dynamic PGO, which enables PGO to play nicely with loop cloning and invariant hoisting. To understand this better, a brief digression into what those are. Loop cloning is a mechanism the JIT employs to avoid various overheads in the fast path of a loop. Consider the Test method in this example:
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        int[] array = new int[10_000_000];
        for (int i = 0; i < 1_000_000; i++)
        {
            Test(array);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool Test(int[] array)
    {
        for (int i = 0; i < 0x12345; i++)
        {
            if (array[i] == 42)
            {
                return true;
            }
        }
        return false;
    }
}
The JIT doesn’t know whether the passed-in array is of sufficient length that all accesses to array[i] inside the loop will be in bounds, and thus it would need to inject bounds checks for every access. While it’d be nice to simply do the length check up front and throw an exception early if the array wasn’t long enough, doing so could also change behavior (imagine the method were writing into the array as it went, or otherwise mutating some shared state). Instead, the JIT employs “loop cloning.” It essentially rewrites this Test method to be more like this:
if (array is not null && array.Length >= 0x12345)
{
    for (int i = 0; i < 0x12345; i++)
    {
        if (array[i] == 42) // no bounds checks emitted for this access :-)
        {
            return true;
        }
    }
}
else
{
    for (int i = 0; i < 0x12345; i++)
    {
        if (array[i] == 42) // bounds checks emitted for this access :-(
        {
            return true;
        }
    }
}
return false;
That way, at the expense of some code duplication, we get our fast loop without bounds checks and only pay for the bounds checks in the slow path. You can see this in the generated assembly (if you can’t already tell, DOTNET_JitDisasm is one of my favorite features in .NET 7):
; Assembly listing for method Program:Test(ref):bool; Emitting BLENDED_CODE for X64 CPU with AVX - Windows; Tier-1 compilation; optimized code; rsp based frame; fully interruptible; No PGO dataG_M000_IG01: ;; offset=0000H 4883EC28 sub rsp, 40G_M000_IG02: ;; offset=0004H 33C0 xor eax, eax 4885C9 test rcx, rcx 7429 je SHORT G_M000_IG05 81790845230100 cmp dword ptr [rcx+08H], 0x12345 7C20 jl SHORT G_M000_IG05 0F1F40000F1F840000000000 align [12 bytes for IG03]G_M000_IG03: ;; offset=0020H 8BD0 mov edx, eax 837C91102A cmp dword ptr [rcx+4*rdx+10H], 42 7429 je SHORT G_M000_IG08 FFC0 inc eax 3D45230100 cmp eax, 0x12345 7CEE jl SHORT G_M000_IG03G_M000_IG04: ;; offset=0032H EB17 jmp SHORT G_M000_IG06G_M000_IG05: ;; offset=0034H 3B4108 cmp eax, dword ptr [rcx+08H] 7323 jae SHORT G_M000_IG10 8BD0 mov edx, eax 837C91102A cmp dword ptr [rcx+4*rdx+10H], 42 7410 je SHORT G_M000_IG08 FFC0 inc eax 3D45230100 cmp eax, 0x12345 7CE9 jl SHORT G_M000_IG05G_M000_IG06: ;; offset=004BH 33C0 xor eax, eaxG_M000_IG07: ;; offset=004DH 4883C428 add rsp, 40 C3 retG_M000_IG08: ;; offset=0052H B801000000 mov eax, 1G_M000_IG09: ;; offset=0057H 4883C428 add rsp, 40 C3 retG_M000_IG10: ;; offset=005CH E81FA0C15F call CORINFO_HELP_RNGCHKFAIL CC int3; Total bytes of code 98
That G_M000_IG02 section is doing the null check and the length check, jumping to the G_M000_IG05 block if either fails. If both succeed, it’s then executing the loop (block G_M000_IG03) without bounds checks:
G_M000_IG03:  ;; offset=0020H
       8BD0        mov edx, eax
       837C91102A  cmp dword ptr [rcx+4*rdx+10H], 42
       7429        je SHORT G_M000_IG08
       FFC0        inc eax
       3D45230100  cmp eax, 0x12345
       7CEE        jl SHORT G_M000_IG03
with the bounds checks only showing up in the slow-path block:
G_M000_IG05:  ;; offset=0034H
       3B4108      cmp eax, dword ptr [rcx+08H]
       7323        jae SHORT G_M000_IG10
       8BD0        mov edx, eax
       837C91102A  cmp dword ptr [rcx+4*rdx+10H], 42
       7410        je SHORT G_M000_IG08
       FFC0        inc eax
       3D45230100  cmp eax, 0x12345
       7CE9        jl SHORT G_M000_IG05
That’s “loop cloning.” What about “invariant hoisting”? Hoisting means pulling something out of a loop to be before the loop, and invariants are things that don’t change. Thus invariant hoisting is pulling something out of a loop to before the loop in order to avoid recomputing, on every iteration, an answer that won’t change. Effectively, the previous example already showed invariant hoisting, in that the bounds check is moved to be before the loop rather than in the loop, but a more concrete example would be something like this:
[MethodImpl(MethodImplOptions.NoInlining)]
private static bool Test(int[] array)
{
    for (int i = 0; i < 0x12345; i++)
    {
        if (array[i] == array.Length - 42)
        {
            return true;
        }
    }
    return false;
}
Note that the value of array.Length - 42 doesn’t change on each iteration of the loop, so it’s “invariant” to the loop iteration and can be lifted out, which the generated code does:
G_M000_IG02: ;; offset=0004H 33D2 xor edx, edx 4885C9 test rcx, rcx 742A je SHORT G_M000_IG05 448B4108 mov r8d, dword ptr [rcx+08H] 4181F845230100 cmp r8d, 0x12345 7C1D jl SHORT G_M000_IG05 4183C0D6 add r8d, -42 0F1F4000 align [4 bytes for IG03]G_M000_IG03: ;; offset=0020H 8BC2 mov eax, edx 4439448110 cmp dword ptr [rcx+4*rax+10H], r8d 7433 je SHORT G_M000_IG08 FFC2 inc edx 81FA45230100 cmp edx, 0x12345 7CED jl SHORT G_M000_IG03
Here again we see the array being tested for null (test rcx, rcx) and the array’s length being checked (mov r8d, dword ptr [rcx+08H] then cmp r8d, 0x12345), but then, with the array’s length in r8d, we see this up-front block subtracting 42 from the length (add r8d, -42), and that’s before we continue into the fast-path loop in the G_M000_IG03 block. This keeps that additional set of operations out of the loop, thereby avoiding the overhead of recomputing the value per iteration.
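In C# terms, the effect is roughly as if the method had been written as follows (illustrative only; the JIT performs this transformation on its internal representation, and the local name target here is mine, not something the JIT produces):

[MethodImpl(MethodImplOptions.NoInlining)]
private static bool Test(int[] array)
{
    int target = array.Length - 42; // the invariant, computed once before the loop
    for (int i = 0; i < 0x12345; i++)
    {
        if (array[i] == target)
        {
            return true;
        }
    }
    return false;
}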
Ok, so how does this apply to dynamic PGO? Remember that PGO is able to avoid interface/virtual dispatch by doing a type check to see whether the type in use is the most common type; if it is, it uses a fast path that calls directly to that type’s method (and in doing so that call is then potentially inlined), and if it isn’t, it falls back to normal interface/virtual dispatch. That check can be invariant to a loop. So when a method is tiered up and PGO kicks in, the type check can now be hoisted out of the loop, making it even cheaper to handle the common case. Consider this variation of our original example:
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        IPrinter printer = new BlankPrinter();
        while (true)
        {
            DoWork(printer);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void DoWork(IPrinter printer)
    {
        for (int j = 0; j < 123; j++)
        {
            printer.Print(j);
        }
    }

    interface IPrinter
    {
        void Print(int i);
    }

    class BlankPrinter : IPrinter
    {
        public void Print(int i)
        {
            Console.Write("");
        }
    }
}
When we look at the optimized assembly generated for this with dynamic PGO enabled, we see this:
; Assembly listing for method Program:DoWork(IPrinter); Emitting BLENDED_CODE for X64 CPU with AVX - Windows; Tier-1 compilation; optimized code; optimized using profile data; rsp based frame; partially interruptible; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 12187; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO dataG_M000_IG01: ;; offset=0000H 57 push rdi 56 push rsi 4883EC28 sub rsp, 40 488BF1 mov rsi, rcxG_M000_IG02: ;; offset=0009H 33FF xor edi, edi 4885F6 test rsi, rsi 742B je SHORT G_M000_IG05 48B9982DD43CFC7F0000 mov rcx, 0x7FFC3CD42D98 48390E cmp qword ptr [rsi], rcx 751C jne SHORT G_M000_IG05G_M000_IG03: ;; offset=001FH 48B9282040F948020000 mov rcx, 0x248F9402028 488B09 mov rcx, gword ptr [rcx] FF1526A80D00 call [Console:Write(String)] FFC7 inc edi 83FF7B cmp edi, 123 7CE6 jl SHORT G_M000_IG03G_M000_IG04: ;; offset=0039H EB29 jmp SHORT G_M000_IG07G_M000_IG05: ;; offset=003BH 48B9982DD43CFC7F0000 mov rcx, 0x7FFC3CD42D98 48390E cmp qword ptr [rsi], rcx 7521 jne SHORT G_M000_IG08 48B9282040F948020000 mov rcx, 0x248F9402028 488B09 mov rcx, gword ptr [rcx] FF15FBA70D00 call [Console:Write(String)]G_M000_IG06: ;; offset=005DH FFC7 inc edi 83FF7B cmp edi, 123 7CD7 jl SHORT G_M000_IG05G_M000_IG07: ;; offset=0064H 4883C428 add rsp, 40 5E pop rsi 5F pop rdi C3 retG_M000_IG08: ;; offset=006BH 488BCE mov rcx, rsi 8BD7 mov edx, edi 49BB1000AA3CFC7F0000 mov r11, 0x7FFC3CAA0010 41FF13 call [r11]IPrinter:Print(int):this EBDE jmp SHORT G_M000_IG06; Total bytes of code 127
We can see in the G_M000_IG02 block that it’s doing the type check on the IPrinter instance and jumping to G_M000_IG05 if the check fails (mov rcx, 0x7FFC3CD42D98 then cmp qword ptr [rsi], rcx then jne SHORT G_M000_IG05), otherwise falling through to G_M000_IG03, which is a tight fast-path loop containing the inlined BlankPrinter.Print, with no type checks in sight!
Interestingly, improvements like this can bring with them their own challenges. PGO leads to a significant increase in the number of type checks, since call sites that specialize for a given type need to compare against that type. However, common subexpression elimination (CSE) hasn’t historically worked for such type handles (CSE is a compiler optimization where duplicate expressions are eliminated by computing the result once and then storing it for subsequent use rather than recomputing it each time). dotnet/runtime#70580 fixes this by enabling CSE for such constant handles. For example, consider this method:
[Benchmark]
[Arguments("", "", "", "")]
public bool AllAreStrings(object o1, object o2, object o3, object o4) =>
    o1 is string && o2 is string && o3 is string && o4 is string;
On .NET 6, the JIT produced this assembly code:
; Program.AllAreStrings(System.Object, System.Object, System.Object, System.Object) test rdx,rdx je short M00_L01 mov rax,offset MT_System.String cmp [rdx],rax jne short M00_L01 test r8,r8 je short M00_L01 mov rax,offset MT_System.String cmp [r8],rax jne short M00_L01 test r9,r9 je short M00_L01 mov rax,offset MT_System.String cmp [r9],rax jne short M00_L01 mov rax,[rsp+28] test rax,rax je short M00_L00 mov rdx,offset MT_System.String cmp [rax],rdx je short M00_L00 xor eax,eaxM00_L00: test rax,rax setne al movzx eax,al retM00_L01: xor eax,eax ret; Total bytes of code 100
Note the C# has four tests for string and the assembly code has four loads with mov rax,offset MT_System.String. Now on .NET 7, the load is performed just once:
; Program.AllAreStrings(System.Object, System.Object, System.Object, System.Object) test rdx,rdx je short M00_L01 mov rax,offset MT_System.String cmp [rdx],rax jne short M00_L01 test r8,r8 je short M00_L01 cmp [r8],rax jne short M00_L01 test r9,r9 je short M00_L01 cmp [r9],rax jne short M00_L01 mov rdx,[rsp+28] test rdx,rdx je short M00_L00 cmp [rdx],rax je short M00_L00 xor edx,edxM00_L00: xor eax,eax test rdx,rdx setne al retM00_L01: xor eax,eax ret; Total bytes of code 69
Bounds Check Elimination
One of the things that makes .NET attractive is its safety. The runtime guards access to arrays, strings, and spans such that you can’t accidentally corrupt memory by walking off either end; if you do, rather than reading/writing arbitrary memory, you’ll get exceptions. Of course, that’s not magic; it’s done by the JIT inserting bounds checks every time one of these data structures is indexed. For example, this:
[MethodImpl(MethodImplOptions.NoInlining)]
static int Read0thElement(int[] array) => array[0];
results in:
G_M000_IG01:  ;; offset=0000H
       4883EC28    sub rsp, 40
G_M000_IG02:  ;; offset=0004H
       83790800    cmp dword ptr [rcx+08H], 0
       7608        jbe SHORT G_M000_IG04
       8B4110      mov eax, dword ptr [rcx+10H]
G_M000_IG03:  ;; offset=000DH
       4883C428    add rsp, 40
       C3          ret
G_M000_IG04:  ;; offset=0012H
       E8E9A0C25F  call CORINFO_HELP_RNGCHKFAIL
       CC          int3
The array is passed into this method in the rcx register, pointing to the method table pointer in the object, and the length of an array is stored in the object just after that method table pointer (which is 8 bytes in a 64-bit process). Thus the cmp dword ptr [rcx+08H], 0 instruction is reading the length of the array and comparing the length to 0; that makes sense, since the length can’t be negative, and we’re trying to access the 0th element, so as long as the length isn’t 0, the array has enough elements for us to access its 0th element. In the event that the length was 0, the code jumps to the end of the function, which contains call CORINFO_HELP_RNGCHKFAIL; that’s a JIT helper function that throws an IndexOutOfRangeException. If the length was sufficient, however, it then reads the int stored at the beginning of the array’s data, which on 64-bit is 16 bytes (0x10) past the pointer (mov eax, dword ptr [rcx+10H]).
While these bounds checks in and of themselves aren’t super expensive, do a lot of them and their costs add up. So while the JIT needs to ensure that “safe” accesses don’t go out of bounds, it also tries to prove that certain accesses won’t, in which case it needn’t emit the bounds check that it knows will be superfluous. In every release of .NET, more and more cases have been added to find places these bounds checks can be eliminated, and .NET 7 is no exception.
For example, dotnet/runtime#61662 from @anthonycanino enabled the JIT to understand various forms of binary operations as part of range checks. Consider this method:
[MethodImpl(MethodImplOptions.NoInlining)]
private static ushort[]? Convert(ReadOnlySpan<byte> bytes)
{
    if (bytes.Length != 16)
    {
        return null;
    }

    var result = new ushort[8];
    for (int i = 0; i < result.Length; i++)
    {
        result[i] = (ushort)(bytes[i * 2] * 256 + bytes[i * 2 + 1]);
    }
    return result;
}
It’s validating that the input span is 16 bytes long and then creating a new ushort[8] where each ushort in the array combines two of the input bytes. To do that, it’s looping over the output array, and indexing into the bytes array using i * 2 and i * 2 + 1 as the indices. On .NET 6, each of those indexing operations would result in a bounds check, with assembly like:
cmp r8d,10
jae short G_M000_IG04
movsxd r8,r8d
where that G_M000_IG04 is the call CORINFO_HELP_RNGCHKFAIL we’re now familiar with. But on .NET 7, we get this assembly for the method:
G_M000_IG01: ;; offset=0000H 56 push rsi 4883EC20 sub rsp, 32G_M000_IG02: ;; offset=0005H 488B31 mov rsi, bword ptr [rcx] 8B4908 mov ecx, dword ptr [rcx+08H] 83F910 cmp ecx, 16 754C jne SHORT G_M000_IG05 48B9302F542FFC7F0000 mov rcx, 0x7FFC2F542F30 BA08000000 mov edx, 8 E80C1EB05F call CORINFO_HELP_NEWARR_1_VC 33D2 xor edx, edx align [0 bytes for IG03]G_M000_IG03: ;; offset=0026H 8D0C12 lea ecx, [rdx+rdx] 448BC1 mov r8d, ecx FFC1 inc ecx 458BC0 mov r8d, r8d 460FB60406 movzx r8, byte ptr [rsi+r8] 41C1E008 shl r8d, 8 8BC9 mov ecx, ecx 0FB60C0E movzx rcx, byte ptr [rsi+rcx] 4103C8 add ecx, r8d 0FB7C9 movzx rcx, cx 448BC2 mov r8d, edx 6642894C4010 mov word ptr [rax+2*r8+10H], cx FFC2 inc edx 83FA08 cmp edx, 8 7CD0 jl SHORT G_M000_IG03G_M000_IG04: ;; offset=0056H 4883C420 add rsp, 32 5E pop rsi C3 retG_M000_IG05: ;; offset=005CH 33C0 xor rax, raxG_M000_IG06: ;; offset=005EH 4883C420 add rsp, 32 5E pop rsi C3 ret; Total bytes of code 100
No bounds checks, which is most easily seen by the lack of the telltale call CORINFO_HELP_RNGCHKFAIL at the end of the method. With this PR, the JIT is able to understand the impact of certain multiplication and shift operations and their relationships to the bounds of the data structure. Since it can see that the result array’s length is 8 and the loop is iterating from 0 to that exclusive upper bound, it knows that i will always be in the range [0, 7], which means that i * 2 will always be in the range [0, 14] and i * 2 + 1 will always be in the range [0, 15]. As such, it’s able to prove that the bounds checks aren’t needed.
dotnet/runtime#61569 and dotnet/runtime#62864 also help to eliminate bounds checks when dealing with constant strings and spans initialized from RVA statics (“Relative Virtual Address” static fields, basically a static field that lives in a module’s data section). For example, consider this benchmark:
[Benchmark]
[Arguments(1)]
public char GetChar(int i)
{
    const string Text = "hello";
    return (uint)i < Text.Length ? Text[i] : '\0';
}
On .NET 6, we get this assembly:
; Program.GetChar(Int32) sub rsp,28 mov eax,edx cmp rax,5 jl short M00_L00 xor eax,eax add rsp,28 retM00_L00: cmp edx,5 jae short M00_L01 mov rax,2278B331450 mov rax,[rax] movsxd rdx,edx movzx eax,word ptr [rax+rdx*2+0C] add rsp,28 retM00_L01: call CORINFO_HELP_RNGCHKFAIL int 3; Total bytes of code 56
The beginning of this makes sense: the JIT was obviously able to see that the length of Text is 5, so it’s implementing the (uint)i < Text.Length check by doing cmp rax,5, and if i as an unsigned value is greater than or equal to 5, it’s then zero’ing out the return value (to return the '\0') and exiting. If the index is less than 5 (in which case it’s also at least 0 due to the unsigned comparison), it then jumps to M00_L00 to read the value from the string… but we then see another cmp against 5, this time as part of a range check. So even though the JIT knew the index was in bounds, it wasn’t able to remove the bounds check. Now it is; in .NET 7, we get this:
; Program.GetChar(Int32)
       cmp edx,5
       jb short M00_L00
       xor eax,eax
       ret
M00_L00:
       mov rax,2B0AF002530
       mov rax,[rax]
       mov edx,edx
       movzx eax,word ptr [rax+rdx*2+0C]
       ret
; Total bytes of code 29
So much nicer.
dotnet/runtime#67141 is a great example of how evolving ecosystem needs drive specific optimizations into the JIT. The Regex compiler and source generator handle some cases of regular expression character classes by using a bitmap lookup stored in strings. For example, to determine whether a char c is in the character class "[A-Za-z0-9_]" (which will match an underscore or any ASCII letter or digit), the implementation ends up generating an expression like the body of the following method:
[Benchmark]
[Arguments('a')]
public bool IsInSet(char c) =>
    c < 128 && ("\0\0\0\u03FF\uFFFE\u87FF\uFFFE\u07FF"[c >> 4] & (1 << (c & 0xF))) != 0;
The implementation is treating an 8-character string as a 128-bit lookup table. If the character is known to be in range (such that it’s effectively a 7-bit value), it’s then using the top 3 bits of the value to index into the 8 elements of the string, and the bottom 4 bits to select one of the 16 bits in that element, giving us an answer as to whether this input character is in the set or not. In .NET 6, even though we know the character is in range of the string, the JIT couldn’t see through either the length comparison or the bit shift.
; Program.IsInSet(Char) sub rsp,28 movzx eax,dx cmp eax,80 jge short M00_L00 mov edx,eax sar edx,4 cmp edx,8 jae short M00_L01 mov rcx,299835A1518 mov rcx,[rcx] movsxd rdx,edx movzx edx,word ptr [rcx+rdx*2+0C] and eax,0F bt edx,eax setb al movzx eax,al add rsp,28 retM00_L00: xor eax,eax add rsp,28 retM00_L01: call CORINFO_HELP_RNGCHKFAIL int 3; Total bytes of code 75
The previously mentioned PR takes care of the length check. And this PR takes care of the bit shift. So in .NET 7, we get this loveliness:
; Program.IsInSet(Char) movzx eax,dx cmp eax,80 jge short M00_L00 mov edx,eax sar edx,4 mov rcx,197D4800608 mov rcx,[rcx] mov edx,edx movzx edx,word ptr [rcx+rdx*2+0C] and eax,0F bt edx,eax setb al movzx eax,al retM00_L00: xor eax,eax ret; Total bytes of code 51
Note the distinct lack of a call CORINFO_HELP_RNGCHKFAIL. And as you might guess, this check can happen a lot in a Regex, making this a very useful addition.
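If you’re curious where a bitmap string like that comes from, here’s a minimal sketch (an illustration, not the actual Regex source generator code) that computes the same 8-character table for the [A-Za-z0-9_] set, using the indexing scheme described above (element index = c >> 4, bit index = c & 0xF):

using System;
using System.Linq;

ushort[] map = new ushort[8];
for (int c = 0; c < 128; c++)
{
    // [A-Za-z0-9_]: underscore or any ASCII letter or digit
    bool inSet = c == '_' ||
                 (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                 (c >= '0' && c <= '9');
    if (inSet)
    {
        map[c >> 4] |= (ushort)(1 << (c & 0xF));
    }
}

// Prints \u0000\u0000\u0000\u03FF\uFFFE\u87FF\uFFFE\u07FF
Console.WriteLine(string.Concat(map.Select(v => $"\\u{v:X4}")));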
Bounds checks are an obvious source of overhead when talking about array access, but they’re not the only ones. There’s also the need to use the cheapest instructions possible. In .NET 6, with a method like:
[MethodImpl(MethodImplOptions.NoInlining)]
private static int Get(int[] values, int i) => values[i];
assembly code like the following would be generated:
; Program.Get(Int32[], Int32)
       sub rsp,28
       cmp edx,[rcx+8]
       jae short M01_L00
       movsxd rax,edx
       mov eax,[rcx+rax*4+10]
       add rsp,28
       ret
M01_L00:
       call CORINFO_HELP_RNGCHKFAIL
       int 3
; Total bytes of code 27
This should look fairly familiar from our previous discussion; the JIT is loading the array’s length ([rcx+8]) and comparing that with the value of i (in edx), and then jumping to the end to throw an exception if i is out of bounds. Immediately after that jump we see a movsxd rax, edx instruction, which is taking the 32-bit value of i from edx and moving it into the 64-bit register rax. And as part of moving it, it’s sign-extending it; that’s the “sxd” part of the instruction name (sign-extending means the upper 32 bits of the new 64-bit value will be set to the value of the upper bit of the 32-bit value, so that the number retains its signed value). The interesting thing is, though, we know that the Length of an array and of a span is non-negative, and since we just bounds-checked i against the Length, we also know that i is non-negative. That makes such sign-extension useless, since the upper bit is guaranteed to be 0. Since the mov instruction that zero-extends is a tad cheaper than movsxd, we can simply use that instead. And that’s exactly what dotnet/runtime#57970 from @pentp does for both arrays and spans (dotnet/runtime#70884 also similarly avoids some signed casts in other situations). Now on .NET 7, we get this:
; Program.Get(Int32[], Int32)
       sub rsp,28
       cmp edx,[rcx+8]
       jae short M01_L00
       mov eax,edx
       mov eax,[rcx+rax*4+10]
       add rsp,28
       ret
M01_L00:
       call CORINFO_HELP_RNGCHKFAIL
       int 3
; Total bytes of code 26
That’s not the only source of overhead with array access, though. In fact, there’s a very large category of array access overhead that’s been there forever, but that’s so well known there are even old FxCop rules and newer Roslyn analyzers that warn against it: multidimensional array accesses. The overhead in the case of a multidimensional array isn’t just an extra branch on every indexing operation, or additional math required to compute the location of the element, but rather that they currently pass through the JIT’s optimization phases largely unmodified. dotnet/runtime#70271 improves the state of the world here by doing an expansion of a multidimensional array access early in the JIT’s pipeline, such that later optimization phases can improve multidimensional accesses as they would other code, including CSE and loop invariant hoisting. The impact of this is visible in a simple benchmark that sums all the elements of a multidimensional array:
private int[,] _square;

[Params(1000)]
public int Size { get; set; }

[GlobalSetup]
public void Setup()
{
    int count = 0;
    _square = new int[Size, Size];
    for (int i = 0; i < Size; i++)
    {
        for (int j = 0; j < Size; j++)
        {
            _square[i, j] = count++;
        }
    }
}

[Benchmark]
public int Sum()
{
    int[,] square = _square;
    int sum = 0;
    for (int i = 0; i < Size; i++)
    {
        for (int j = 0; j < Size; j++)
        {
            sum += square[i, j];
        }
    }
    return sum;
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
Sum | .NET 6.0 | 964.1 us | 1.00 |
Sum | .NET 7.0 | 674.7 us | 0.70 |
This previous example assumes you know the size of each dimension of the multidimensional array (it’s referring to the Size directly in the loops). That’s obviously not always (or maybe even rarely) the case. In such situations, you’d be more likely to use the Array.GetUpperBound method, and, because multidimensional arrays can have a non-zero lower bound, Array.GetLowerBound. That would lead to code like this:
private int[,] _square;

[Params(1000)]
public int Size { get; set; }

[GlobalSetup]
public void Setup()
{
    int count = 0;
    _square = new int[Size, Size];
    for (int i = 0; i < Size; i++)
    {
        for (int j = 0; j < Size; j++)
        {
            _square[i, j] = count++;
        }
    }
}

[Benchmark]
public int Sum()
{
    int[,] square = _square;
    int sum = 0;
    for (int i = square.GetLowerBound(0); i < square.GetUpperBound(0); i++)
    {
        for (int j = square.GetLowerBound(1); j < square.GetUpperBound(1); j++)
        {
            sum += square[i, j];
        }
    }
    return sum;
}
In .NET 7, thanks to dotnet/runtime#60816, those GetLowerBound and GetUpperBound calls become JIT intrinsics. An “intrinsic” to a compiler is something the compiler has intrinsic knowledge of, such that rather than relying solely on a method’s defined implementation (if it even has one), the compiler can substitute in something it considers to be better. There are literally thousands of methods in .NET known in this manner to the JIT, with GetLowerBound and GetUpperBound being two of the most recent. Now as intrinsics, when they’re passed a constant value (e.g. 0 for the 0th rank), the JIT can substitute the necessary assembly instructions to read directly from the memory location that houses the bounds. Here’s what the assembly code for this benchmark looked like with .NET 6; the main thing to see here is all of the calls out to GetLowerBound and GetUpperBound:
; Program.Sum() push rdi push rsi push rbp push rbx sub rsp,28 mov rsi,[rcx+8] xor edi,edi mov rcx,rsi xor edx,edx cmp [rcx],ecx call System.Array.GetLowerBound(Int32) mov ebx,eax mov rcx,rsi xor edx,edx call System.Array.GetUpperBound(Int32) cmp eax,ebx jle short M00_L03M00_L00: mov rcx,[rsi] mov ecx,[rcx+4] add ecx,0FFFFFFE8 shr ecx,3 cmp ecx,1 jbe short M00_L05 lea rdx,[rsi+10] inc ecx movsxd rcx,ecx mov ebp,[rdx+rcx*4] mov rcx,rsi mov edx,1 call System.Array.GetUpperBound(Int32) cmp eax,ebp jle short M00_L02M00_L01: mov ecx,ebx sub ecx,[rsi+18] cmp ecx,[rsi+10] jae short M00_L04 mov edx,ebp sub edx,[rsi+1C] cmp edx,[rsi+14] jae short M00_L04 mov eax,[rsi+14] imul rax,rcx mov rcx,rdx add rcx,rax add edi,[rsi+rcx*4+20] inc ebp mov rcx,rsi mov edx,1 call System.Array.GetUpperBound(Int32) cmp eax,ebp jg short M00_L01M00_L02: inc ebx mov rcx,rsi xor edx,edx call System.Array.GetUpperBound(Int32) cmp eax,ebx jg short M00_L00M00_L03: mov eax,edi add rsp,28 pop rbx pop rbp pop rsi pop rdi retM00_L04: call CORINFO_HELP_RNGCHKFAILM00_L05: mov rcx,offset MT_System.IndexOutOfRangeException call CORINFO_HELP_NEWSFAST mov rsi,rax call System.SR.get_IndexOutOfRange_ArrayRankIndex() mov rdx,rax mov rcx,rsi call System.IndexOutOfRangeException..ctor(System.String) mov rcx,rsi call CORINFO_HELP_THROW int 3; Total bytes of code 219
Now here’s what it is for .NET 7:
; Program.Sum() push r14 push rdi push rsi push rbp push rbx sub rsp,20 mov rdx,[rcx+8] xor eax,eax mov ecx,[rdx+18] mov r8d,ecx mov r9d,[rdx+10] lea ecx,[rcx+r9+0FFFF] cmp ecx,r8d jle short M00_L03 mov r9d,[rdx+1C] mov r10d,[rdx+14] lea r10d,[r9+r10+0FFFF]M00_L00: mov r11d,r9d cmp r10d,r11d jle short M00_L02 mov esi,r8d sub esi,[rdx+18] mov edi,[rdx+10]M00_L01: mov ebx,esi cmp ebx,edi jae short M00_L04 mov ebp,[rdx+14] imul ebx,ebp mov r14d,r11d sub r14d,[rdx+1C] cmp r14d,ebp jae short M00_L04 add ebx,r14d add eax,[rdx+rbx*4+20] inc r11d cmp r10d,r11d jg short M00_L01M00_L02: inc r8d cmp ecx,r8d jg short M00_L00M00_L03: add rsp,20 pop rbx pop rbp pop rsi pop rdi pop r14 retM00_L04: call CORINFO_HELP_RNGCHKFAIL int 3; Total bytes of code 130
Importantly, note there are no more calls (other than for the bounds check exception at the end). For example, instead of that first GetUpperBound call:
call System.Array.GetUpperBound(Int32)
we get:
mov r9d,[rdx+1C]
mov r10d,[rdx+14]
lea r10d,[r9+r10+0FFFF]
and it ends up being much faster:
Method | Runtime | Mean | Ratio |
---|---|---|---|
Sum | .NET 6.0 | 2,657.5 us | 1.00 |
Sum | .NET 7.0 | 676.3 us | 0.25 |
Loop Hoisting and Cloning
We previously saw how PGO interacts with loop hoisting and cloning, and those optimizations have seen other improvements, as well.
Historically, the JIT’s support for hoisting has been limited to lifting an invariant out one level. Consider this example:
[Benchmark]
public void Compute()
{
    for (int thousands = 0; thousands < 10; thousands++)
    {
        for (int hundreds = 0; hundreds < 10; hundreds++)
        {
            for (int tens = 0; tens < 10; tens++)
            {
                for (int ones = 0; ones < 10; ones++)
                {
                    int n = ComputeNumber(thousands, hundreds, tens, ones);
                    Process(n);
                }
            }
        }
    }
}

static int ComputeNumber(int thousands, int hundreds, int tens, int ones) =>
    (thousands * 1000) + (hundreds * 100) + (tens * 10) + ones;

[MethodImpl(MethodImplOptions.NoInlining)]
static void Process(int n) { }
At first glance, you might look at this and say “what could be hoisted? The computation of n requires all of the loop inputs, and all of that computation is in ComputeNumber.” But from a compiler’s perspective, the ComputeNumber function is inlineable and thus logically can be part of its caller; the computation of n is actually split into multiple pieces, and each of those pieces can be hoisted to different levels, e.g. the tens computation can be hoisted out one level, the hundreds out two levels, and the thousands out three levels. Here’s what [DisassemblyDiagnoser] outputs for .NET 6:
; Program.Compute() push r14 push rdi push rsi push rbp push rbx sub rsp,20 xor esi,esiM00_L00: xor edi,ediM00_L01: xor ebx,ebxM00_L02: xor ebp,ebp imul ecx,esi,3E8 imul eax,edi,64 add ecx,eax lea eax,[rbx+rbx*4] lea r14d,[rcx+rax*2]M00_L03: lea ecx,[r14+rbp] call Program.Process(Int32) inc ebp cmp ebp,0A jl short M00_L03 inc ebx cmp ebx,0A jl short M00_L02 inc edi cmp edi,0A jl short M00_L01 inc esi cmp esi,0A jl short M00_L00 add rsp,20 pop rbx pop rbp pop rsi pop rdi pop r14 ret; Total bytes of code 84
We can see that some hoisting has happened here. After all, the innermost loop (tagged M00_L03) is only five instructions: increment ebp (which at this point is the ones counter value), and if it’s still less than 0xA (10), jump back to M00_L03, which adds whatever is in r14 to ones. Great, so we’ve hoisted all of the unnecessary computation out of the inner loop, being left only with adding the ones position to the rest of the number. Let’s go out a level. M00_L02 is the label for the tens loop. What do we see there? Trouble. The two instructions imul ecx,esi,3E8 and imul eax,edi,64 are performing the thousands * 1000 and hundreds * 100 operations, highlighting that these operations, which could have been hoisted out further, were left stuck in the next-to-innermost loop. Now, here’s what we get for .NET 7, where this was improved in dotnet/runtime#68061:
; Program.Compute() push r15 push r14 push r12 push rdi push rsi push rbp push rbx sub rsp,20 xor esi,esiM00_L00: xor edi,edi imul ebx,esi,3E8M00_L01: xor ebp,ebp imul r14d,edi,64 add r14d,ebxM00_L02: xor r15d,r15d lea ecx,[rbp+rbp*4] lea r12d,[r14+rcx*2]M00_L03: lea ecx,[r12+r15] call qword ptr [Program.Process(Int32)] inc r15d cmp r15d,0A jl short M00_L03 inc ebp cmp ebp,0A jl short M00_L02 inc edi cmp edi,0A jl short M00_L01 inc esi cmp esi,0A jl short M00_L00 add rsp,20 pop rbx pop rbp pop rsi pop rdi pop r12 pop r14 pop r15 ret; Total bytes of code 99
Notice now where those imul instructions live. There are four labels, each one corresponding to one of the loops, and we can see the outermost loop has the imul ebx,esi,3E8 (for the thousands computation) and the next loop has the imul r14d,edi,64 (for the hundreds computation), highlighting that these computations were hoisted out to the appropriate levels (the tens and ones computations are still in the right places).
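In source terms, the .NET 7 code is roughly equivalent to having written the loop nest like this (purely illustrative; the JIT does this on its internal representation after ComputeNumber is inlined, the local names here are mine, and Process is the same no-inlining helper from the benchmark above):

public void Compute()
{
    for (int thousands = 0; thousands < 10; thousands++)
    {
        int fromThousands = thousands * 1000;                  // hoisted out three levels
        for (int hundreds = 0; hundreds < 10; hundreds++)
        {
            int fromHundreds = fromThousands + hundreds * 100; // hoisted out two levels
            for (int tens = 0; tens < 10; tens++)
            {
                int fromTens = fromHundreds + tens * 10;       // hoisted out one level
                for (int ones = 0; ones < 10; ones++)
                {
                    Process(fromTens + ones);                  // only the ones addition remains in the innermost loop
                }
            }
        }
    }
}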
More improvements have gone in on the cloning side. Previously, loop cloning would only apply for loops iterating by 1 from a low to a high value. With dotnet/runtime#60148, the comparison against the upper value can be <= rather than just <. And with dotnet/runtime#67930, loops that iterate downward can also be cloned, as can loops that have increments and decrements larger than 1. Consider this benchmark:
private int[] _values = Enumerable.Range(0, 1000).ToArray();

[Benchmark]
[Arguments(0, 0, 1000)]
public int LastIndexOf(int arg, int offset, int count)
{
    int[] values = _values;
    for (int i = offset + count - 1; i >= offset; i--)
        if (values[i] == arg)
            return i;
    return 0;
}
Without loop cloning, the JIT can’t assume that offset through offset+count are in range, and thus every access to the array needs to be bounds checked. With loop cloning, the JIT could generate one version of the loop without bounds checks and only use that when it knows all accesses will be valid. That’s exactly what happens now in .NET 7. Here’s what we got with .NET 6:
; Program.LastIndexOf(Int32, Int32, Int32) sub rsp,28 mov rcx,[rcx+8] lea eax,[r8+r9+0FFFF] cmp eax,r8d jl short M00_L01 mov r9d,[rcx+8] nop word ptr [rax+rax]M00_L00: cmp eax,r9d jae short M00_L03 movsxd r10,eax cmp [rcx+r10*4+10],edx je short M00_L02 dec eax cmp eax,r8d jge short M00_L00M00_L01: xor eax,eax add rsp,28 retM00_L02: add rsp,28 retM00_L03: call CORINFO_HELP_RNGCHKFAIL int 3; Total bytes of code 72
Notice how in the core loop, at label M00_L00, there’s a bounds check (cmp eax,r9d and jae short M00_L03, which jumps to a call CORINFO_HELP_RNGCHKFAIL). And here’s what we get with .NET 7:
; Program.LastIndexOf(Int32, Int32, Int32) sub rsp,28 mov rax,[rcx+8] lea ecx,[r8+r9+0FFFF] cmp ecx,r8d jl short M00_L02 test rax,rax je short M00_L01 test ecx,ecx jl short M00_L01 test r8d,r8d jl short M00_L01 cmp [rax+8],ecx jle short M00_L01M00_L00: mov r9d,ecx cmp [rax+r9*4+10],edx je short M00_L03 dec ecx cmp ecx,r8d jge short M00_L00 jmp short M00_L02M00_L01: cmp ecx,[rax+8] jae short M00_L04 mov r9d,ecx cmp [rax+r9*4+10],edx je short M00_L03 dec ecx cmp ecx,r8d jge short M00_L01M00_L02: xor eax,eax add rsp,28 retM00_L03: mov eax,ecx add rsp,28 retM00_L04: call CORINFO_HELP_RNGCHKFAIL int 3; Total bytes of code 98
Notice how the code size is larger, and how there are now two variations of the loop: one at M00_L00 and one at M00_L01. The second one, M00_L01, has a branch to that same call CORINFO_HELP_RNGCHKFAIL, but the first one doesn’t, because that loop will only end up being used after proving that the offset, count, and _values.Length are such that the indexing will always be in bounds.
Other changes also improved loop cloning. dotnet/runtime#59886 enables the JIT to choose different forms for how to emit the conditions for choosing the fast or slow loop path, e.g. whether to emit all the conditions, & them together, and then branch (if (!(cond1 & cond2)) goto slowPath), or whether to emit each condition on its own (if (!cond1) goto slowPath; if (!cond2) goto slowPath). dotnet/runtime#66257 enables loop cloning to kick in when the loop variable is initialized to more kinds of expressions (e.g. for (int fromindex = lastIndex - lengthToClear; ...)). And dotnet/runtime#70232 increases the JIT’s willingness to clone loops with bodies that do a broader set of operations.
Folding, Propagation, and Substitution
Constant folding is an optimization where a compiler computes the value of an expression involving only constants at compile-time rather than generating the code to compute the value at run-time. There are multiple levels of constant folding in .NET, with some constant folding performed by the C# compiler and some constant folding performed by the JIT compiler. For example, given the C# code:
[Benchmark]
public int A() => 3 + (4 * 5);

[Benchmark]
public int B() => A() * 2;
the C# compiler will generate IL for these methods like the following:
.method public hidebysig instance int32 A () cil managed
{
    .maxstack 8
    IL_0000: ldc.i4.s 23
    IL_0002: ret
}

.method public hidebysig instance int32 B () cil managed
{
    .maxstack 8
    IL_0000: ldarg.0
    IL_0001: call instance int32 Program::A()
    IL_0006: ldc.i4.2
    IL_0007: mul
    IL_0008: ret
}
You can see that the C# compiler has computed the value of 3 + (4*5), as the IL for method A simply contains the equivalent of return 23;. However, method B contains the equivalent of return A() * 2;, highlighting that the constant folding performed by the C# compiler was intramethod only. Now here’s what the JIT generates:
; Program.A()
       mov eax,17
       ret
; Total bytes of code 6

; Program.B()
       mov eax,2E
       ret
; Total bytes of code 6
The assembly for method A isn’t particularly interesting; it’s just returning that same value 23 (hex 0x17). But method B is more interesting. The JIT has inlined the call from B to A, exposing the contents of A to B, such that the JIT effectively sees the body of B as the equivalent of return 23 * 2;. At that point, the JIT can do its own constant folding, and it transforms the body of B to simply return 46 (hex 0x2e). Constant propagation is intricately linked to constant folding and is essentially just the idea that you can substitute a constant value (typically one computed via constant folding) into further expressions, at which point they may also be able to be folded.
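As a tiny, hypothetical illustration of folding and propagation working together (the method and values here are mine, just for demonstration):

static int Area()
{
    int side = 2 + 3;    // constant folding: 2 + 3 becomes 5 (the C# compiler already does this much)
    return side * side;  // constant propagation feeds the 5 into the multiply, which can then fold to 25
}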
The JIT has long performed constant folding, but it improves further in .NET 7. One of the ways constant folding can improve is by exposing more values to be folded, which often means more inlining. dotnet/runtime#55745 helped the inliner to understand that a method call like M(constant + constant) (noting that those constants might be the result of some other method call) is itself passing a constant to M, and a constant being passed to a method call is a hint to the inliner that it should consider being more aggressive about inlining, since exposing that constant to the body of the callee can potentially significantly reduce the amount of code required to implement the callee. The JIT might have previously inlined such a method anyway, but when it comes to inlining, the JIT is all about heuristics and generating enough evidence that it’s worthwhile to inline something; this contributes to that evidence. This pattern shows up, for example, in the various FromXx methods on TimeSpan. For example, TimeSpan.FromSeconds is implemented as:
public static TimeSpan FromSeconds(double value) => Interval(value, TicksPerSecond); // TicksPerSecond is a constant
and, eschewing argument validation for the purposes of this example, Interval is:
private static TimeSpan Interval(double value, double scale) =>
    IntervalFromDoubleTicks(value * scale);

private static TimeSpan IntervalFromDoubleTicks(double ticks) =>
    ticks == long.MaxValue ? TimeSpan.MaxValue : new TimeSpan((long)ticks);
which, if everything gets inlined, means FromSeconds is essentially:
public static TimeSpan FromSeconds(double value)
{
    double ticks = value * 10_000_000;
    return ticks == long.MaxValue ? TimeSpan.MaxValue : new TimeSpan((long)ticks);
}
and if value is a constant, let’s say 5, that whole thing can be constant folded (with dead code elimination on the ticks == long.MaxValue branch) to simply:
return new TimeSpan(50_000_000);
I’ll spare you the .NET 6 assembly for this, but on .NET 7 with a benchmark like:
[Benchmark]
public TimeSpan FromSeconds() => TimeSpan.FromSeconds(5);
we now get the simple and clean:
; Program.FromSeconds()
       mov eax,2FAF080
       ret
; Total bytes of code 6
Another change improving constant folding was dotnet/runtime#57726 from @SingleAccretion, which unblocked constant folding in a particular scenario that sometimes manifests when doing field-by-field assignment of structs being returned from method calls. As a small example, consider this trivial property, which accesses the Color.DarkOrange property, which in turn does new Color(KnownColor.DarkOrange):
[Benchmark]
public Color DarkOrange() => Color.DarkOrange;
In .NET 6, the JIT generated this:
; Program.DarkOrange()
       mov eax,1
       mov ecx,39
       xor r8d,r8d
       mov [rdx],r8
       mov [rdx+8],r8
       mov [rdx+10],cx
       mov [rdx+12],ax
       mov rax,rdx
       ret
; Total bytes of code 32
The interesting thing here is that some constants (39, which is the value of KnownColor.DarkOrange, and 1, which is a private StateKnownColorValid constant) are being loaded into registers (mov eax, 1 then mov ecx, 39) and then later being stored into the relevant location for the Color struct being returned (mov [rdx+12],ax and mov [rdx+10],cx). In .NET 7, it now generates:
; Program.DarkOrange()
       xor eax,eax
       mov [rdx],rax
       mov [rdx+8],rax
       mov word ptr [rdx+10],39
       mov word ptr [rdx+12],1
       mov rax,rdx
       ret
; Total bytes of code 25
with direct assignment of these constant values into their destination locations (mov word ptr [rdx+12],1 and mov word ptr [rdx+10],39). Other changes contributing to constant folding included dotnet/runtime#58171 from @SingleAccretion and dotnet/runtime#57605 from @SingleAccretion.
However, a large category of improvement came from an optimization related to propagation, that of forward substitution. Consider this silly benchmark:
[Benchmark]
public int Compute1() => Value + Value + Value + Value + Value;

[Benchmark]
public int Compute2() => SomethingElse() + Value + Value + Value + Value + Value;

private static int Value => 16;

[MethodImpl(MethodImplOptions.NoInlining)]
private static int SomethingElse() => 42;
If we look at the assembly code generated for Compute1 on .NET 6, it looks like what we’d hope for. We’re adding Value five times; Value is trivially inlined and returns a constant value 16, and so we’d hope that the assembly code generated for Compute1 would effectively just be returning the value 80 (hex 0x50), which is exactly what happens:
; Program.Compute1()
       mov eax,50
       ret
; Total bytes of code 6
But Compute2 is a bit different. The structure of the code is such that the additional call to SomethingElse ends up slightly perturbing something about the JIT’s analysis, and .NET 6 ends up with this assembly code:
; Program.Compute2()
       sub       rsp,28
       call      Program.SomethingElse()
       add       eax,10
       add       eax,10
       add       eax,10
       add       eax,10
       add       eax,10
       add       rsp,28
       ret
; Total bytes of code 29
Rather than a single mov eax,50 to put the value 0x50 into the return register, we have 5 separate add eax,10 instructions to build up that same 0x50 (80) value. That’s… not ideal.
It turns out that many of the JIT’s optimizations operate on the tree data structures created as part of parsing the IL. In some cases, optimizations can do better when they’re exposed to more of the program, in other words when the tree they’re operating on is larger and contains more to be analyzed. However, various operations can break up these trees into smaller, individual ones, such as with temporary variables created as part of inlining, and in doing so can inhibit these operations. Something is needed in order to effectively stitch these trees back together, and that’s forward substitution. You can think of forward substitution almost like an inverse of CSE; rather than trying to find duplicate expressions and eliminate them by computing the value once and storing it into a temporary, forward substitution eliminates that temporary and effectively moves the expression tree into its use site. Obviously you don’t want to do this if it would then negate CSE and result in duplicate work, but for expressions that are defined once and used once, this kind of forward propagation is valuable. dotnet/runtime#61023 added an initial limited version of forward substitution, and then dotnet/runtime#63720 added a more robust generalized implementation. Subsequently, dotnet/runtime#70587 expanded it to also cover some SIMD vectors, and then dotnet/runtime#71161 improved it further to enable substitutions into more places (in this case into call arguments). And with those, our silly benchmark now produces the following on .NET 7:
; Program.Compute2()
       sub       rsp,28
       call      qword ptr [7FFCB8DAF9A8]
       add       eax,50
       add       rsp,28
       ret
; Total bytes of code 18
Vectorization
SIMD, or Single Instruction Multiple Data, is a kind of processing in which one instruction applies to multiple pieces of data at the same time. You’ve got a list of numbers and you want to find the index of a particular value? You could walk the list comparing one element at a time, and that would be fine functionally. But what if in the same amount of time it takes you to read and compare one element, you could instead read and compare two elements, or four elements, or 32 elements? That’s SIMD, and the art of utilizing SIMD instructions is lovingly referred to as “vectorization,” where operations are applied to all of the elements in a “vector” at the same time.
.NET has long had support for vectorization in the form of Vector<T>, which is an easy-to-use type with first-class JIT support to enable a developer to write vectorized implementations. One of Vector<T>’s greatest strengths is also one of its greatest weaknesses. The type is designed to adapt to whatever width vector instructions are available in your hardware. If the machine supports 256-bit width vectors, great, that’s what Vector<T> will target. If not, if the machine supports 128-bit width vectors, great, that’s what Vector<T> targets. But that flexibility comes with various downsides, at least today; for example, the operations you can perform on a Vector<T> end up needing to be agnostic to the width of the vectors used, since the width is variable based on the hardware on which the code actually runs. And that means the operations that can be exposed on Vector<T> are limited, which in turn limits the kinds of operations that can be vectorized with it. Also, because it’s only ever a single size in a given process, some data set sizes that fall in between 128 bits and 256 bits might not be processed as well as you’d hope. You write your Vector<byte>-based algorithm, and you run it on a machine with support for 256-bit vectors, which means it can process 32 bytes at a time, but then you feed it an input with 31 bytes. Had Vector<T> mapped to 128-bit vectors, it could have been used to improve the processing of that input, but as its vector size is larger than the input data size, the implementation ends up falling back to one that’s not accelerated. There are also issues related to R2R and Native AOT, since ahead-of-time compilation needs to know in advance what instructions should be used for Vector<T> operations. You already saw this earlier when discussing the output of DOTNET_JitDisasmSummary; we saw that the NarrowUtf16ToAscii method was one of only a few methods that was JIT compiled in a “hello, world” console app, and that this was because it lacked R2R code due to its use of Vector<T>.
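To make that width variability concrete, here’s a tiny sketch (mine, not from the post) you can run on different machines; the printed values are whatever the JIT selected for the current hardware, e.g. 32 bytes per Vector<byte> with AVX2 versus 16 with only SSE2 or on typical Arm64:
using System;
using System.Numerics;

// Vector<T> adapts to the hardware: these values differ from machine to machine,
// which is exactly why Vector<T>-based algorithms can't assume a fixed width.
Console.WriteLine(Vector.IsHardwareAccelerated); // e.g. True
Console.WriteLine(Vector<byte>.Count);           // e.g. 32 on AVX2, 16 on SSE2 or Arm64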
Starting in .NET Core 3.0, .NET gained literally thousands of new “hardware intrinsics” methods, most of which are .NET APIs that map down to one of these SIMD instructions. These intrinsics enable an expert to write an implementation tuned to a specific instruction set, and if done well, get the best possible performance, but it also requires the developer to understand each instruction set and to implement their algorithm for each instruction set that might be relevant, e.g. an AVX2 implementation if it’s supported, or an SSE2 implementation if it’s supported, or an ArmBase implementation if it’s supported, and so on.
.NET 7 has introduced a middle ground. Previous releases saw the introduction of the Vector128<T> and Vector256<T> types, but purely as the vehicle by which data moved in and out of the hardware intrinsics, since they’re all tied to specific width vectors. Now in .NET 7, exposed via dotnet/runtime#53450, dotnet/runtime#63414, dotnet/runtime#60094, and dotnet/runtime#68559, a very large set of cross-platform operations is defined over these types as well, e.g. Vector128<T>.ExtractMostSignificantBits, Vector256.ConditionalSelect, and so on. A developer who wants or needs to go beyond what the high-level Vector<T> offers can choose to target one or more of these two types. Typically this would amount to a developer writing one code path based on Vector128<T>, as that has the broadest reach and achieves a significant amount of the gains from vectorization, and then, if motivated to do so, adding a second path for Vector256<T> in order to potentially double throughput further on platforms that have 256-bit width vectors. Think of these types and methods as a platform-abstraction layer: you code to these methods, and then the JIT translates them into the most appropriate instructions for the underlying platform. Consider this simple code as an example:
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal class Program
{
    private static void Main()
    {
        Vector128<byte> v = Vector128.Create((byte)123);
        while (true)
        {
            WithIntrinsics(v);
            WithVector(v);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int WithIntrinsics(Vector128<byte> v) => Sse2.MoveMask(v);

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static uint WithVector(Vector128<byte> v) => v.ExtractMostSignificantBits();
}
I have two functions: one that directly uses the Sse2.MoveMask hardware intrinsic and one that uses the new Vector128<T>.ExtractMostSignificantBits method. Using DOTNET_JitDisasm=Program.*, here’s what the optimized tier-1 code for these looks like on my x64 Windows machine:
; Assembly listing for method Program:WithIntrinsics(Vector128`1):int
G_M000_IG01:              ;; offset=0000H
       C5F877               vzeroupper
G_M000_IG02:              ;; offset=0003H
       C5F91001             vmovupd  xmm0, xmmword ptr [rcx]
       C5F9D7C0             vpmovmskb eax, xmm0
G_M000_IG03:              ;; offset=000BH
       C3                   ret
; Total bytes of code 12

; Assembly listing for method Program:WithVector(Vector128`1):int
G_M000_IG01:              ;; offset=0000H
       C5F877               vzeroupper
G_M000_IG02:              ;; offset=0003H
       C5F91001             vmovupd  xmm0, xmmword ptr [rcx]
       C5F9D7C0             vpmovmskb eax, xmm0
G_M000_IG03:              ;; offset=000BH
       C3                   ret
; Total bytes of code 12
Notice anything? The code for the two methods is identical, both resulting in a vpmovmskb (Move Byte Mask) instruction. Yet the former code will only work on a platform that supports SSE2, whereas the latter code will work on any platform with support for 128-bit vectors, including Arm64 and WASM (and any future platforms on-boarded that also support SIMD); it’ll just result in different instructions being emitted on those platforms.
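To see why that matters for maintainability, here’s a sketch (mine, not from the post) of what the same operation looks like if you target the hardware intrinsics directly and still want it to work beyond x86/x64: each instruction set needs its own guarded path, whereas the cross-platform method is a single code path.
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static uint MostSignificantBits(Vector128<byte> v)
{
    if (Sse2.IsSupported)
    {
        // x86/x64-specific path
        return (uint)Sse2.MoveMask(v);
    }

    // An AdvSimd-based path would be needed here for Arm64, and so on for other
    // ISAs; the cross-platform API below covers all of them with one line.
    return v.ExtractMostSignificantBits();
}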
To explore this a bit more, let’s take a simple example and vectorize it. We’ll implement a Contains method, where we want to search a span of bytes for a specific value and return whether it was found:
static bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    for (int i = 0; i < haystack.Length; i++)
    {
        if (haystack[i] == needle)
        {
            return true;
        }
    }

    return false;
}
How would we vectorize this with Vector<T>? First things first, we need to check whether it’s even supported, and fall back to our existing implementation if it’s not (Vector.IsHardwareAccelerated). We also need to fall back if the length of the input is less than the size of a vector (Vector<byte>.Count).
static bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector.IsHardwareAccelerated && haystack.Length >= Vector<byte>.Count)
    {
        // ...
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
Now that we know we have enough data, we can get to coding our vectorized loop. In this loop, we’ll be searching for the needle, which means we need a vector that contains that value for every element; the Vector<T> constructor provides that (new Vector<byte>(needle)). And we need to be able to slice off a vector’s width of data at a time; for a bit more efficiency, I’ll use pointers. We need a current iteration pointer, and we need to iterate until the point where we couldn’t form another vector because we’re too close to the end, and a straightforward way to do that is to get a pointer that’s exactly one vector’s width from the end; that way, we can just iterate until our current pointer is equal to or greater than that threshold. And finally, in our loop body, we need to compare our current vector with the target vector to see if any elements are the same (Vector.EqualsAny), returning true if any are, and if not, bumping our current pointer to the next location. At this point we have:
static unsafe bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector.IsHardwareAccelerated && haystack.Length >= Vector<byte>.Count)
    {
        fixed (byte* haystackPtr = &MemoryMarshal.GetReference(haystack))
        {
            Vector<byte> target = new Vector<byte>(needle);
            byte* current = haystackPtr;
            byte* endMinusOneVector = haystackPtr + haystack.Length - Vector<byte>.Count;
            do
            {
                if (Vector.EqualsAny(target, *(Vector<byte>*)current))
                {
                    return true;
                }

                current += Vector<byte>.Count;
            }
            while (current < endMinusOneVector);

            // ...
        }
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
And we’re almost done. The last issue to handle is we may still have a few elements at the end we haven’t searched. There are a couple of ways we could handle that. One would be to just continue with our fall back implementation and process each of the remaining elements one at a time. Another would be to employ a trick that’s common when vectorizing idempotent operations. Our operation isn’t mutating anything, which means it doesn’t matter if we compare the same element multiple times, which means we can just do one final vector compare for the last vector in the search space; that might or might not overlap with elements we’ve already looked at, but it won’t hurt anything if it does. And with that, our implementation is complete:
static unsafe bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector.IsHardwareAccelerated && haystack.Length >= Vector<byte>.Count)
    {
        fixed (byte* haystackPtr = &MemoryMarshal.GetReference(haystack))
        {
            Vector<byte> target = new Vector<byte>(needle);
            byte* current = haystackPtr;
            byte* endMinusOneVector = haystackPtr + haystack.Length - Vector<byte>.Count;
            do
            {
                if (Vector.EqualsAny(target, *(Vector<byte>*)current))
                {
                    return true;
                }

                current += Vector<byte>.Count;
            }
            while (current < endMinusOneVector);

            if (Vector.EqualsAny(target, *(Vector<byte>*)endMinusOneVector))
            {
                return true;
            }
        }
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
Congratulations, we’ve vectorized this operation, and fairly decently at that. We can throw this into benchmarkdotnet and see really nice speedups:
private byte[] _data = Enumerable.Repeat((byte)123, 999).Append((byte)42).ToArray();

[Benchmark(Baseline = true)]
[Arguments((byte)42)]
public bool Find(byte value) => Contains(_data, value); // just the fallback path in its own method

[Benchmark]
[Arguments((byte)42)]
public bool FindVectorized(byte value) => Contains_Vectorized(_data, value); // the implementation we just wrote
Method | Mean | Ratio |
---|---|---|
Find | 484.05 ns | 1.00 |
FindVectorized | 20.21 ns | 0.04 |
A 24x speedup! Woo hoo, victory, all your performance are belong to us!
You deploy this in your service, and you see Contains being called on your hot path, but you don’t see the improvements you were expecting. You dig in a little more, and you discover that while you tested this with an input array with 1000 elements, typical inputs had more like 30 elements. What happens if we change our benchmark to have just 30 elements? That’s not long enough to form a vector, so we fall back to the one-at-a-time path, and we don’t get any speedups at all.
One thing we can now do is switch from using Vector<T> to Vector128<T>. That will then lower the threshold from 32 bytes to 16 bytes, such that inputs in that range will still have some amount of vectorization applied. As these Vector128<T> and Vector256<T> types have been designed very recently, they also utilize all the cool new toys, and thus we can use refs instead of pointers. Other than that, we can keep the shape of our implementation almost the same, substituting Vector128 where we were using Vector, and using some methods on Unsafe to manipulate our refs instead of pointer arithmetic on the span we fixed.
static unsafe bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector128.IsHardwareAccelerated && haystack.Length >= Vector128<byte>.Count)
    {
        ref byte current = ref MemoryMarshal.GetReference(haystack);
        Vector128<byte> target = Vector128.Create(needle);
        ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector128<byte>.Count);
        do
        {
            if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref current)))
            {
                return true;
            }

            current = ref Unsafe.Add(ref current, Vector128<byte>.Count);
        }
        while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

        if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref endMinusOneVector)))
        {
            return true;
        }
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
With that in hand, we can now try it on our smaller 30 element data set:
private byte[] _data = Enumerable.Repeat((byte)123, 29).Append((byte)42).ToArray();

[Benchmark(Baseline = true)]
[Arguments((byte)42)]
public bool Find(byte value) => Contains(_data, value);

[Benchmark]
[Arguments((byte)42)]
public bool FindVectorized(byte value) => Contains_Vectorized(_data, value);
Method | Mean | Ratio |
---|---|---|
Find | 15.388 ns | 1.00 |
FindVectorized | 1.747 ns | 0.11 |
Woo hoo, victory, all your performance are belong to us… again!
What about on the larger data set again? Previously with Vector<T> we had a 24x speedup, but now:
Method | Mean | Ratio |
---|---|---|
Find | 484.25 ns | 1.00 |
FindVectorized | 32.92 ns | 0.07 |
… closer to 15x. Nothing to sneeze at, but it’s not the 24x we previously saw. What if we want to have our cake and eat it, too? Let’s also add a Vector256<T> path. To do that, we literally copy/paste our Vector128<T> code, search/replace all references to Vector128 in the copied code with Vector256, and just put it into an additional condition that uses the Vector256<T> path if it’s supported and there are enough elements to utilize it.
static unsafe bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector128.IsHardwareAccelerated && haystack.Length >= Vector128<byte>.Count)
    {
        ref byte current = ref MemoryMarshal.GetReference(haystack);

        if (Vector256.IsHardwareAccelerated && haystack.Length >= Vector256<byte>.Count)
        {
            Vector256<byte> target = Vector256.Create(needle);
            ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector256<byte>.Count);
            do
            {
                if (Vector256.EqualsAny(target, Vector256.LoadUnsafe(ref current)))
                {
                    return true;
                }

                current = ref Unsafe.Add(ref current, Vector256<byte>.Count);
            }
            while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

            if (Vector256.EqualsAny(target, Vector256.LoadUnsafe(ref endMinusOneVector)))
            {
                return true;
            }
        }
        else
        {
            Vector128<byte> target = Vector128.Create(needle);
            ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector128<byte>.Count);
            do
            {
                if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref current)))
                {
                    return true;
                }

                current = ref Unsafe.Add(ref current, Vector128<byte>.Count);
            }
            while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

            if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref endMinusOneVector)))
            {
                return true;
            }
        }
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
And, boom, we’re back:
Method | Mean | Ratio |
---|---|---|
Find | 484.53 ns | 1.00 |
FindVectorized | 20.08 ns | 0.04 |
We now have an implementation that is vectorized on any platform with either 128-bit or 256-bit vector instructions (x86, x64, Arm64, WASM, etc.), that can use either based on the input length, and that can be included in an R2R image if that’s of interest.
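As a quick sanity check of the helper we just built (arbitrary values; this assumes the Contains above is in scope):
using System;
using System.Linq;

byte[] small = Enumerable.Repeat((byte)123, 29).Append((byte)42).ToArray();  // 30 bytes: takes the Vector128 path on AVX2 hardware
byte[] large = Enumerable.Repeat((byte)123, 999).Append((byte)42).ToArray(); // 1000 bytes: takes the Vector256 path where available

Console.WriteLine(Contains(small, 42)); // True
Console.WriteLine(Contains(large, 17)); // False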
There are many factors that impact which path you go down, and I expect we’ll have guidance forthcoming to help navigate all the factors and approaches. But the capabilities are all there, and whether you choose to use Vector<T>, Vector128<T> and/or Vector256<T>, or the hardware intrinsics directly, there are some amazing performance opportunities ready for the taking.
I already mentioned several PRs that exposed the new cross-platform vector support, but that only scratches the surface of the work done to actually enable these operations and to enable them to produce high-quality code. As just one example of a category of such work, a set of changes went in to help ensure that zero vector constants are handled well, such as dotnet/runtime#63821 that “morphed” (changed) Vector128/256<T>.Create(default) into Vector128/256<T>.Zero, which then enables subsequent optimizations to focus only on Zero; dotnet/runtime#65028 that enabled constant propagation of Vector128/256<T>.Zero; dotnet/runtime#68874 and dotnet/runtime#70171 that add first-class knowledge of vector constants to the JIT’s intermediate representation; and dotnet/runtime#62933, dotnet/runtime#65632, dotnet/runtime#55875, dotnet/runtime#67502, and dotnet/runtime#64783 that all improve the code quality of instructions generated for zero vector comparisons.
Inlining
Inlining is one of the most important optimizations the JIT can do. The concept is simple: instead of making a call to some method, take the code from that method and bake it into the call site. This has the obvious advantage of avoiding the overhead of a method call, but except for really small methods on really hot paths, that’s often on the smaller side of the wins inlining brings. The bigger wins are due to the callee’s code being exposed to the caller’s code, and vice versa. So, for example, if the caller is passing a constant as an argument to the callee, if the method isn’t inlined, the compilation of the callee has no knowledge of that constant, but if the callee is inlined, all of the code in the callee is then aware of its argument being a constant value, and can do all of the optimizations possible with such a constant, like dead code elimination, branch elimination, constant folding and propagation, and so on. Of course, if it were all rainbows and unicorns, everything possible to be inlined would be inlined, and that’s obviously not happening. Inlining brings with it the cost of potentially increased binary size. If the code being inlined would result in the same amount or less assembly code in the caller than it takes to call the callee (and if the JIT can quickly determine that), then inlining is a no-brainer. But if the code being inlined would increase the size of the caller non-trivially, now the JIT needs to weigh that increase in code size against the throughput benefits that could come from it. That code size increase can itself result in throughput regressions, due to increasing the number of distinct instructions to be executed and thereby putting more pressure on the instruction cache. As with any cache, the more times you need to read from memory to populate it, the less effective the cache will be. If you have a function that gets inlined into 100 different call sites, every one of those call sites’ copies of the callee’s instructions are unique, and calling each of those 100 functions could end up thrashing the instruction cache; in contrast, if all of those 100 functions “shared” the same instructions by simply calling the single instance of the callee, it’s likely the instruction cache would be much more effective and lead to fewer trips to memory.
All that is to say, inlining is really important, it’s important that the “right” things be inlined and that it not overinline, and as such every release of .NET in recent memory has seen nice improvements around inlining. .NET 7 is no exception.
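None of this is new to .NET 7, but as a reminder of the knobs involved (and ones you’ll see sprinkled through the benchmarks in this post), MethodImplOptions lets you nudge these heuristics in either direction when you know better than the JIT; a small sketch of mine, best used sparingly:
using System;
using System.Runtime.CompilerServices;

static class InliningHints
{
    // Asks the JIT to inline even if its size-based heuristics would normally decline.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static int Double(int x) => x * 2;

    // Prevents inlining, e.g. to keep a rarely-executed path out of a hot method's body
    // (it's also how benchmarks in this post keep tiny methods from being folded away).
    [MethodImpl(MethodImplOptions.NoInlining)]
    internal static void ThrowInvalidOperation() => throw new InvalidOperationException();
}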
One really interesting improvement around inlining is dotnet/runtime#64521, and it might be surprising. Consider the Boolean.ToString method; here’s its full implementation:
public override string ToString()
{
    if (!m_value) return "False";
    return "True";
}
Pretty simple, right? You’d expect something this trivial to be inlined. Alas, on .NET 6, this benchmark:
private bool _value = true;

[Benchmark]
public int BoolStringLength() => _value.ToString().Length;
produces this assembly code:
; Program.BoolStringLength()
       sub       rsp,28
       cmp       [rcx],ecx
       add       rcx,8
       call      System.Boolean.ToString()
       mov       eax,[rax+8]
       add       rsp,28
       ret
; Total bytes of code 23
Note the call System.Boolean.ToString(). The reason for this is, historically, the JIT has been unable to inline methods across assembly boundaries if those methods contain string literals (like the “False” and “True” in that Boolean.ToString implementation). This restriction had to do with string interning and the possibility that such inlining could lead to visible behavioral differences. Those concerns are no longer valid, and so this PR removes the restriction. As a result, that same benchmark on .NET 7 now produces this:
; Program.BoolStringLength()
       cmp       byte ptr [rcx+8],0
       je        short M00_L01
       mov       rax,1DB54800D20
       mov       rax,[rax]
M00_L00:
       mov       eax,[rax+8]
       ret
M00_L01:
       mov       rax,1DB54800D18
       mov       rax,[rax]
       jmp       short M00_L00
; Total bytes of code 38
No more call System.Boolean.ToString().
dotnet/runtime#61408 made two changes related to inlining. First, it taught the inliner how to better see what methods were being called in an inlining candidate, and in particular when tiered compilation is disabled or when a method would bypass tier-0 (such as a method with loops before OSR existed or with OSR disabled); by understanding what methods are being called, it can better understand the cost of the method, e.g. if those method calls are actually hardware intrinsics with a very low cost. Second, it enabled CSE in more cases with SIMD vectors.
dotnet/runtime#71778 also impacted inlining, and in particular in situations where a typeof() could be propagated to the callee (e.g. via a method argument). In previous releases of .NET, various members on Type like IsValueType were turned into JIT intrinsics, such that the JIT could substitute a constant value for calls where it could compute the answer at compile time. For example, this:
[Benchmark]
public bool IsValueType() => IsValueType<int>();

private static bool IsValueType<T>() => typeof(T).IsValueType;
results in this assembly code on .NET 6:
; Program.IsValueType()
       mov       eax,1
       ret
; Total bytes of code 6
However, change the benchmark slightly:
[Benchmark]
public bool IsValueType() => IsValueType(typeof(int));

private static bool IsValueType(Type t) => t.IsValueType;
and it’s no longer as simple:
; Program.IsValueType()
       sub       rsp,28
       mov       rcx,offset MT_System.Int32
       call      CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPE
       mov       rcx,rax
       mov       rax,[7FFCA47C9560]
       cmp       [rcx],ecx
       add       rsp,28
       jmp       rax
; Total bytes of code 38
Effectively, as part of inlining the JIT loses the notion that the argument is a constant and fails to propagate it. This PR fixes that, such that on .NET 7, we now get what we expect:
; Program.IsValueType()
       mov       eax,1
       ret
; Total bytes of code 6
Arm64
A huge amount of effort in .NET 7 went into making code gen for Arm64 as good as or better than its x64 counterpart. I’ve already discussed a bunch of PRs that are relevant regardless of architecture, and others that are specific to Arm, but there are plenty more. To rattle off some of them:
- Addressing modes. “Addressing mode” is the term used to refer to how the operands of instructions are specified. It could be the actual value, it could be the address from where a value should be loaded, it could be the register containing the value, and so on. Arm supports a “scaled” addressing mode, typically used for indexing into an array, where the size of each element is supplied and the instruction “scales” the provided offset by the specified scale. dotnet/runtime#60808 enables the JIT to utilize this addressing mode. More generally, dotnet/runtime#70749 enables the JIT to use addressing modes when accessing elements of managed arrays. dotnet/runtime#66902 improves the use of addressing modes when the element type is byte. dotnet/runtime#65468 improves addressing modes used for floating point. And dotnet/runtime#67490 implements addressing modes for SIMD vectors, specifically for loads with unscaled indices.
- Better instruction selection. Various techniques go into ensuring that the best instructions are selected to represent input code. dotnet/runtime#61037 teaches the JIT how to recognize the pattern (a * b) + c with integers and fold that into a single madd or msub instruction (see the sketch after this list), while dotnet/runtime#66621 does the same for a - (b * c) and msub. dotnet/runtime#61045 enables the JIT to recognize certain constant bit shift operations (either explicit in the code or implicit to various forms of managed array access) and emit bfiz/ubfiz instructions. dotnet/runtime#70599, dotnet/runtime#66407, and dotnet/runtime#65535 all handle various forms of optimizing a % b. dotnet/runtime#61847 from @SeanWoo removes an unnecessary movi emitted as part of setting a dereferenced pointer to a constant value. dotnet/runtime#57926 from @SingleAccretion enables computing a 64-bit result as the multiplication of two 32-bit integers to be done with smull/umull. And dotnet/runtime#61549 folds adds with sign extension or zero extension into a single add instruction with uxtw/sxtw/lsl, while dotnet/runtime#62630 drops redundant zero extensions after an ldr instruction.
- Vectorization. dotnet/runtime#64864 adds new AdvSimd.LoadPairVector64/AdvSimd.LoadPairVector128 hardware intrinsics.
- Zeroing. Lots of operations require state to be set to zero, such as initializing all reference locals in a method to zero as part of the method’s prologue (so that the GC doesn’t see and try to follow garbage references). While such functionality was previously vectorized, dotnet/runtime#63422 enables this to be implemented using 128-bit width vector instructions on Arm. And dotnet/runtime#64481 changes the instruction sequences used for zeroing in order to avoid unnecessary zeroing, free up additional registers, and enable the CPU to recognize various instruction sequences and better optimize.
- Memory model. dotnet/runtime#62895 enables store barriers to be used wherever possible instead of full barriers, and uses one-way barriers for volatile variables. dotnet/runtime#67384 enables volatile reads/writes to be implemented with the ldapr instruction, while dotnet/runtime#64354 uses a cheaper instruction sequence to handle volatile indirections. There’s dotnet/runtime#70600, which enables LSE Atomics to be used for Interlocked operations; dotnet/runtime#71512, which enables using the atomics instruction on Unix machines; and dotnet/runtime#70921, which enables the same but on Windows.
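To make the instruction-selection item above concrete, here’s a minimal sketch of mine (not from the post) of the multiply-add pattern mentioned there; on .NET 7 Arm64, the JIT can now emit a single madd for the multiply and the add rather than separate instructions:
[Benchmark]
[Arguments(3, 5, 7)]
public int MultiplyAdd(int a, int b, int c) => (a * b) + c;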
JIT helpers
While logically part of the runtime, the JIT is actually isolated from the rest of the runtime, only interacting with it through an interface that enables communication between the JIT and the rest of the VM (Virtual Machine). There’s a large amount of VM functionality then that the JIT relies on for good performance.
dotnet/runtime#65738 rewrote various “stubs” to be more efficient. Stubs are tiny bits of code that serve to perform some check and then redirect execution somewhere else. For example, when an interface dispatch call site is expected to only ever be used with a single implementation of that interface, the JIT might employ a “dispatch stub” that compares the type of the object against the single one it’s cached, and if they’re equal simply jumps to the right target. You know you’re in the corest of the core areas of the runtime when a PR contains lots of assembly code for every architecture the runtime targets. And it paid off; there’s a virtual group of folks from around .NET that review performance improvements and regressions in our automated performance test suites, and attribute these back to the PRs likely to be the cause (this is mostly automated but requires some human oversight). It’s always nice then when, a few days after a PR is merged and performance information has stabilized, you see a rash of comments like there were on this PR.
For anyone familiar with generics and interested in performance, you may have heard the refrain that generic virtual methods are relatively expensive. They are, comparatively. For example on .NET 6, this code:
private Example _example = new Example();

[Benchmark(Baseline = true)]
public void GenericNonVirtual() => _example.GenericNonVirtual<Example>();

[Benchmark]
public void GenericVirtual() => _example.GenericVirtual<Example>();

class Example
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public void GenericNonVirtual<T>() { }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public virtual void GenericVirtual<T>() { }
}
results in:
Method | Mean | Ratio |
---|---|---|
GenericNonVirtual | 0.4866 ns | 1.00 |
GenericVirtual | 6.4552 ns | 13.28 |
dotnet/runtime#65926 eases the pain a tad. Some of the cost comes from looking up some cached information in a hash table in the runtime, and as is the case with many map implementations, this one involves computing a hash code and using a mod operation to map to the right bucket. Other hash table implementations around dotnet/runtime, including Dictionary<,>, HashSet<>, and ConcurrentDictionary<,>, previously switched to a “fastmod” implementation; this PR does the same for this EEHashtable, which is used as part of the CORINFO_GENERIC_HANDLE JIT helper function employed:
Method | Runtime | Mean | Ratio |
---|---|---|---|
GenericVirtual | .NET 6.0 | 6.475 ns | 1.00 |
GenericVirtual | .NET 7.0 | 6.119 ns | 0.95 |
Not enough of an improvement for us to start recommending people use them, but a 5% improvement takes a bit of the edge off the sting.
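For the curious, “fastmod” here refers to the Lemire-style trick of replacing a relatively expensive division instruction with multiplies and shifts using a magic number precomputed for a known divisor. A minimal sketch of the general technique, mirroring the shape of the shared helper used elsewhere in dotnet/runtime (this is not the EEHashtable code itself):
// Precompute once per divisor (e.g. per hash table size)...
static ulong GetFastModMultiplier(uint divisor) => ulong.MaxValue / divisor + 1;

// ...then each lookup maps value % divisor to multiplies and shifts.
// Intended for divisors no larger than int.MaxValue.
static uint FastMod(uint value, uint divisor, ulong multiplier) =>
    (uint)(((((multiplier * value) >> 32) + 1) * divisor) >> 32);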
Grab Bag
It’s near impossible to cover every performance change that goes into the JIT, and I’m not going to try. But there were so many more PRs, I couldn’t just leave them all unsung, so here’s a few more quickies:
- dotnet/runtime#58727 from @benjamin-hodgson. Given an expression like (byte)x | (byte)y, that can be morphed into (byte)(x | y), which can optimize away some movs.

private int _x, _y;

[Benchmark]
public int Test() => (byte)_x | (byte)_y;

; *** .NET 6 ***
; Program.Test(Int32, Int32)
       movzx     eax,dl
       movzx     edx,r8b
       or        eax,edx
       ret
; Total bytes of code 10

; *** .NET 7 ***
; Program.Test(Int32, Int32)
       or        edx,r8d
       movzx     eax,dl
       ret
; Total bytes of code 7
- dotnet/runtime#67182. On a machine with support for BMI2, 64-bit shifts can be performed with the shlx, sarx, and shrx instructions.

[Benchmark]
[Arguments(123, 1)]
public ulong Shift(ulong x, int y) => x << y;

; *** .NET 6 ***
; Program.Shift(UInt64, Int32)
       mov       ecx,r8d
       mov       rax,rdx
       shl       rax,cl
       ret
; Total bytes of code 10

; *** .NET 7 ***
; Program.Shift(UInt64, Int32)
       shlx      rax,rdx,r8
       ret
; Total bytes of code 6
- dotnet/runtime#69003 from @SkiFoD. The pattern ~x + 1 can be changed into a two’s-complement negation.

[Benchmark]
[Arguments(42)]
public int Neg(int i) => ~i + 1;

; *** .NET 6 ***
; Program.Neg(Int32)
       mov       eax,edx
       not       eax
       inc       eax
       ret
; Total bytes of code 7

; *** .NET 7 ***
; Program.Neg(Int32)
       mov       eax,edx
       neg       eax
       ret
; Total bytes of code 5
- dotnet/runtime#61412 from @SkiFoD. An expression X & 1 == 1 to test whether the bottom bit of a number is set can be changed to the cheaper X & 1 (which isn’t actually expressible without a following != 0 in C#).

[Benchmark]
[Arguments(42)]
public bool BitSet(int x) => (x & 1) == 1;

; *** .NET 6 ***
; Program.BitSet(Int32)
       test      dl,1
       setne     al
       movzx     eax,al
       ret
; Total bytes of code 10

; *** .NET 7 ***
; Program.BitSet(Int32)
       mov       eax,edx
       and       eax,1
       ret
; Total bytes of code 6
- dotnet/runtime#63545 from @Wraith2. The expression x & (x - 1) can be lowered to the blsr instruction.

[Benchmark]
[Arguments(42)]
public int ResetLowestSetBit(int x) => x & (x - 1);

; *** .NET 6 ***
; Program.ResetLowestSetBit(Int32)
       lea       eax,[rdx+0FFFF]
       and       eax,edx
       ret
; Total bytes of code 6

; *** .NET 7 ***
; Program.ResetLowestSetBit(Int32)
       blsr      eax,edx
       ret
; Total bytes of code 6
- dotnet/runtime#62394. / and % by a vector’s .Count wasn’t recognizing that Count can be treated as unsigned, but doing so leads to better code gen.

[Benchmark]
[Arguments(42u)]
public long DivideByVectorCount(uint i) => i / Vector<byte>.Count;

; *** .NET 6 ***
; Program.DivideByVectorCount(UInt32)
       mov       eax,edx
       mov       rdx,rax
       sar       rdx,3F
       and       rdx,1F
       add       rax,rdx
       sar       rax,5
       ret
; Total bytes of code 21

; *** .NET 7 ***
; Program.DivideByVectorCount(UInt32)
       mov       eax,edx
       shr       rax,5
       ret
; Total bytes of code 7
- dotnet/runtime#60787. Loop alignment in .NET 6 provides a very nice exploration of why and how the JIT handles loop alignment. This PR extends that further by trying to “hide” an emitted align instruction behind an unconditional jmp that might already exist, in order to minimize the impact of the processor having to fetch and decode nops.
GC
“Regions” is a feature of the garbage collector (GC) that’s been in the works for multiple years. It’s enabled by default in 64-bit processes in .NET 7 as of dotnet/runtime#64688, but as with other multi-year features, a multitude of PRs went into making it a reality. At a 30,000 foot level, “regions” replaces the current “segments” approach to managing memory on the GC heap; rather than having a few gigantic segments of memory (e.g. each 1GB), often associated 1:1 with a generation, the GC instead maintains many, many smaller regions (e.g. each 4MB) as their own entity. This enables the GC to be more agile with regards to operations like repurposing regions of memory from one generation to another. For more information on regions, the blog post Put a DPAD on that GC! from the primary developer on the GC is still the best resource.
Native AOT
To many people, the word “performance” in the context of software is about throughput. How fast does something execute? How much data per second can it process? How many requests per second can it process? And so on. But there are many other facets to performance. How much memory does it consume? How fast does it start up and get to the point of doing something useful? How much space does it consume on disk? How long would it take to download? And then there are related concerns. In order to achieve these goals, what dependencies are required? What kinds of operations does it need to perform to achieve these goals, and are all of those operations permitted in the target environment? If any of this paragraph resonates with you, you are the target audience for the Native AOT support now shipping in .NET 7.
.NET has long had support for AOT code generation. For example, .NET Framework had it in the form of ngen, and .NET Core has it in the form of crossgen. Both of those solutions involve a standard .NET executable that has some of its IL already compiled to assembly code, but not all methods will have assembly code generated for them, various things can invalidate the assembly code that was generated, external .NET assemblies without any native assembly code can be loaded, and so on, and in all of those cases, the runtime continues to utilize a JIT compiler. Native AOT is different. It’s an evolution of CoreRT, which itself was an evolution of .NET Native, and it’s entirely free of a JIT. The binary that results from publishing a build is a completely standalone executable in the target platform’s platform-specific file format (e.g. COFF on Windows, ELF on Linux, Mach-O on macOS) with no external dependencies other than ones standard to that platform (e.g. libc). And it’s entirely native: no IL in sight, no JIT, no nothing. All required code is compiled and/or linked in to the executable, including the same GC that’s used with standard .NET apps and services, and a minimal runtime that provides services around threading and the like. All of that brings great benefits: super fast startup time, small and entirely self-contained deployment, and the ability to run in places JIT compilers aren’t allowed (e.g. because memory pages that were writable can’t then be executable). It also brings limitations: no JIT means no dynamic loading of arbitrary assemblies (e.g. Assembly.LoadFile) and no reflection emit (e.g. DynamicMethod), everything compiled and linked in to the app means the more functionality that’s used (or might be used) the larger your deployment, etc. Even with those limitations, for a certain class of application, Native AOT is an incredibly exciting and welcome addition to .NET 7.
Too many PRs to mention have gone into bringing up the Native AOT stack, in part because it’s been in the works for years (as part of the archived dotnet/corert project and then as part of dotnet/runtimelab/feature/NativeAOT) and in part because there have been over a hundred PRs just in dotnet/runtime that have gone into bringing Native AOT up to a shippable state since the code was originally brought over from dotnet/runtimelab in dotnet/runtime#62563 and dotnet/runtime#62611. Between that and there not being a previous version to compare its performance to, instead of focusing PR by PR on improvements, let’s just look at how to use it and the benefits it brings.
Today, Native AOT is focused on console applications, so let’s create a console app:
dotnet new console -o nativeaotexample
We now have our nativeaotexample directory containing a nativeaotexample.csproj and a “hello, world” Program.cs. To enable publishing the application with Native AOT, edit the .csproj to include this in the existing <PropertyGroup>...</PropertyGroup>.
<PublishAot>true</PublishAot>
And then… actually, that’s it. Our app is now fully configured to be able to target Native AOT. All that’s left is to publish. As I’m currently writing this on my Windows x64 machine, I’ll target that:
dotnet publish -r win-x64 -c Release
I now have my generated executable in the output publish directory:
Directory: C:\nativeaotexample\bin\Release\net7.0\win-x64\publish

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---           8/27/2022  6:18 PM        3648512 nativeaotexample.exe
-a---           8/27/2022  6:18 PM       14290944 nativeaotexample.pdb
That ~3.5MB .exe is the executable, and the .pdb next to it is debug information, which needn’t actually be deployed with the app. I can now copy that nativeaotexample.exe to any 64-bit Windows machine, regardless of what .NET may or may not be installed anywhere on the box, and my app will run. Now, if what you really care about is size, and 3.5MB is too big for you, you can start making more tradeoffs. There are a bunch of switches you can pass to the Native AOT compiler (ILC) and to the trimmer that impact what code gets included in the resulting image. Let me turn the dial up a bit:
<PublishAot>true</PublishAot>
<InvariantGlobalization>true</InvariantGlobalization>
<UseSystemResourceKeys>true</UseSystemResourceKeys>
<IlcOptimizationPreference>Size</IlcOptimizationPreference>
<IlcGenerateStackTraceData>false</IlcGenerateStackTraceData>
<DebuggerSupport>false</DebuggerSupport>
<EnableUnsafeBinaryFormatterSerialization>false</EnableUnsafeBinaryFormatterSerialization>
<EventSourceSupport>false</EventSourceSupport>
<HttpActivityPropagationSupport>false</HttpActivityPropagationSupport>
<MetadataUpdaterSupport>false</MetadataUpdaterSupport>
I republish, and now I have:
Directory: C:\nativeaotexample\bin\Release\net7.0\win-x64\publish

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---           8/27/2022  6:19 PM        2061824 nativeaotexample.exe
-a---           8/27/2022  6:19 PM       14290944 nativeaotexample.pdb
so ~2MB instead of ~3.5MB. Of course, for that significant reduction I’ve given up some things:
- Setting InvariantGlobalization to true means I’m now not respecting culture information and am instead using a set of invariant data for most globalization operations.
- Setting UseSystemResourceKeys to true means nice exception messages are stripped away.
- Setting IlcGenerateStackTraceData to false means I’m going to get fairly poor stack traces should I need to debug an exception.
- Setting DebuggerSupport to false… good luck debugging things.
- … you get the idea.
One of the potentially mind-boggling aspects of Native AOT for a developer used to .NET is that, as it says on the tin, it really is native. After publishing the app, there is no IL involved, and there’s no JIT that could even process it. This makes some of the other investments in .NET 7 all the more valuable, for example everywhere investments are happening in source generators. Code that previously relied on reflection emit for good performance will need another scheme. We can see that, for example, with Regex. Historically for optimal throughput with Regex, it’s been recommended to use RegexOptions.Compiled, which uses reflection emit at run-time to generate an optimized implementation of the specified pattern. But if you look at the implementation of the Regex constructor, you’ll find this nugget:
if (RuntimeFeature.IsDynamicCodeCompiled)
{
    factory = Compile(pattern, tree, options, matchTimeout != InfiniteMatchTimeout);
}
With the JIT, IsDynamicCodeCompiled is true. But with Native AOT, it’s false. Thus, with Native AOT and Regex, there’s no difference between specifying RegexOptions.Compiled and not, and another mechanism is required to get the throughput benefits promised by RegexOptions.Compiled. Enter [GeneratedRegex(...)], which, along with the new regex source generator shipping in the .NET 7 SDK, emits C# code into the assembly using it. That C# code takes the place of the reflection emit that would have happened at run-time, and is thus able to work successfully with Native AOT.
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

private Regex _interpreter = new Regex(@"^.*elementary.*$", RegexOptions.Multiline);
private Regex _compiled = new Regex(@"^.*elementary.*$", RegexOptions.Compiled | RegexOptions.Multiline);

[GeneratedRegex(@"^.*elementary.*$", RegexOptions.Multiline)]
private partial Regex SG();

[Benchmark(Baseline = true)]
public int Interpreter() => _interpreter.Count(s_haystack);

[Benchmark]
public int Compiled() => _compiled.Count(s_haystack);

[Benchmark]
public int SourceGenerator() => SG().Count(s_haystack);
Method | Mean | Ratio |
---|---|---|
Interpreter | 9,036.7 us | 1.00 |
Compiled | 9,064.8 us | 1.00 |
SourceGenerator | 426.1 us | 0.05 |
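The same feature check Regex uses is available to any library that wants to adapt its own strategy. As a small sketch of mine (the CreateAdder helper is hypothetical, not an API from the post), compiling a specialized delegate when dynamically compiled code is available and falling back to a plain closure otherwise:
using System;
using System.Linq.Expressions;
using System.Runtime.CompilerServices;

static Func<int, int> CreateAdder(int addend)
{
    if (RuntimeFeature.IsDynamicCodeCompiled)
    {
        // A JIT is available: build and compile a specialized delegate at run-time.
        ParameterExpression x = Expression.Parameter(typeof(int), "x");
        return Expression.Lambda<Func<int, int>>(Expression.Add(x, Expression.Constant(addend)), x).Compile();
    }

    // No dynamically compiled code (e.g. Native AOT): use a simple closure instead.
    return value => value + addend;
}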
So, yes, there are some constraints associated with Native AOT, but there are also solutions for working with those constraints. And further, those constraints can actually bring further benefits. Consider dotnet/runtime#64497. Remember how we talked about “guarded devirtualization” in dynamic PGO, where via instrumentation the JIT can determine the most likely type to be used at a given call site and special-case it? With Native AOT, the entirety of the program is known at compile time, with no support for Assembly.LoadFrom or the like. That means at compile time, the compiler can do whole-program analysis to determine what types implement what interfaces. If a given interface only has a single type that implements it, then every call site through that interface can be unconditionally devirtualized, without any type-check guards.
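As an illustrative sketch (hypothetical types, not from the post): if whole-program analysis proves that Square below is the only IShape implementation linked into the app, a Native AOT compilation can treat the interface call in TotalArea as a direct call to Square.Area (and then potentially inline it), with no guard needed.
public interface IShape
{
    double Area();
}

public sealed class Square : IShape
{
    public double Side;
    public double Area() => Side * Side;
}

public static class Geometry
{
    // With only one IShape implementation in the whole program, shape.Area()
    // doesn't need virtual/interface dispatch at all.
    public static double TotalArea(IShape[] shapes)
    {
        double sum = 0;
        foreach (IShape shape in shapes)
        {
            sum += shape.Area();
        }
        return sum;
    }
}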
This is a really exciting space, one we expect to see flourish in coming releases.
Mono
Up until now I’ve referred to “the JIT,” “the GC,” and “the runtime,” but in reality there are actually multiple runtimes in .NET. I’ve been talking about “coreclr,” which is the runtime that’s recommended for use on Linux, macOS, and Windows. However, there’s also “mono,” which powers Blazor wasm applications, Android apps, and iOS apps. It’s also seen significant improvements in .NET 7.
Just as with coreclr (which can JIT compile, AOT compile partially with JIT fallback, and fully Native AOT compile), mono has multiple ways of actually executing code. One of those ways is an interpreter, which enables mono to execute .NET code in environments that don’t permit JIT’ing and without requiring ahead-of-time compilation or incurring any limitations it may bring. Interestingly, though, the interpreter is itself almost a full-fledged compiler, parsing the IL, generating its own intermediate representation (IR) for it, and doing one or more optimization passes over that IR; it’s just that at the end of the pipeline when a compiler would normally emit code, the interpreter instead saves off that data for it to interpret when the time comes to run. As such, the interpreter has a very similar conundrum to the one we discussed with coreclr’s JIT: the time it takes to optimize vs the desire to start up quickly. And in .NET 7, the interpreter employs a similar solution: tiered compilation. dotnet/runtime#68823 adds the ability for the interpreter to initially compile with minimal optimization of that IR, and then once a certain threshold of call counts has been hit, take the time to do as much optimization on the IR as possible for all future invocations of that method. This yields the same benefits as it does for coreclr: improved startup time while also having efficient sustained throughput. When this merged, we saw Blazor wasm app startup time improve by 10-20%. Here’s one example from an app being tracked in our benchmarking system.
The interpreter isn’t just used for entire apps, though. Just as how coreclr can use the JIT when an R2R image doesn’t contain code for a method, mono can use the interpreter when there’s no AOT code for a method. One such case that occurred on mono was with generic delegate invocation, where the presence of a generic delegate being invoked would trigger falling back to the interpreter; for .NET 7, that gap was addressed with dotnet/runtime#70653. A more impactful case, however, is dotnet/runtime#64867. Previously, any methods with catch or filter exception handling clauses couldn’t be AOT compiled and would fall back to being interpreted. With this PR, the method is now able to be AOT compiled, and it only falls back to using the interpreter when an exception actually occurs, switching over to the interpreter for the remainder of that method call’s execution. Since many methods contain such clauses, this can make a big difference in throughput and CPU consumption. In the same vein, dotnet/runtime#63065 enabled methods with finally exception handling clauses to be AOT compiled; just the finally block gets interpreted rather than the entire method being interpreted.
Beyond such backend improvements, another class of improvement came from further unification between coreclr and mono. Years ago, coreclr and mono had their own entire library stack built on top of them. Over time, as .NET was open sourced, portions of mono’s stack got replaced by shared components, bit by bit. Fast forward to today, all of the core .NET libraries above System.Private.CoreLib are the same regardless of which runtime is being employed. In fact, the source for CoreLib itself is almost entirely shared, with ~95% of the source files being compiled into the CoreLib that’s built for each runtime, and just a few percent of the source specialized for each (this means that the vast majority of the performance improvements discussed in the rest of this post apply equally whether running on mono or coreclr). Even so, every release now we try to chip away at that few remaining percent, for reasons of maintainability, but also because the source used for coreclr’s CoreLib has generally had more attention paid to it from a performance perspective. dotnet/runtime#71325, for example, moves mono’s generic sorting utility class used for array and span sorting over to the more efficient implementation used by coreclr.
One of the biggest categories of improvements, however, is in vectorization. This comes in two pieces. First, Vector<T> and Vector128<T> are now fully accelerated on both x64 and Arm64, thanks to PRs like dotnet/runtime#64961, dotnet/runtime#65086, dotnet/runtime#65128, dotnet/runtime#66317, dotnet/runtime#66391, dotnet/runtime#66409, dotnet/runtime#66512, dotnet/runtime#66586, dotnet/runtime#66589, dotnet/runtime#66597, dotnet/runtime#66476, and dotnet/runtime#67125; that significant amount of work means all the code that gets vectorized using these abstractions will light up on mono and coreclr alike. Second, thanks primarily to dotnet/runtime#70086, mono now knows how to translate Vector128<T> operations to WASM’s SIMD instruction set, such that code vectorized with Vector128<T> will also be accelerated when running in Blazor wasm applications and anywhere else WASM might be executed.
Reflection
Reflection is one of those areas you either love or hate (I find it a bit humorous to be writing this section immediately after writing the Native AOT section). It’s immensely powerful, providing the ability to query all of the metadata for code in your process and for arbitrary assemblies you might encounter, to invoke arbitrary functionality dynamically, and even to emit dynamically-generated IL at run-time. It’s also difficult to handle well in the face of tooling like a linker or a solution like Native AOT that needs to be able to determine at build time exactly what code will be executed, and it’s generally quite expensive at run-time; thus it’s both something we strive to avoid when possible but also invest in reducing the costs of, as it’s so popular in so many different kinds of applications because it is incredibly useful. As with most releases, it’s seen some nice improvements in .NET 7.
One of the most impacted areas is reflection invoke. Available via MethodBase.Invoke, this functionality lets you take a MethodBase (e.g. MethodInfo) object that represents some method for which the caller previously queried, and call it, with arbitrary arguments that the runtime needs to marshal through to the callee, and with an arbitrary return value that needs to be marshaled back. If you know the signature of the method ahead of time, the best way to optimize invocation speed is to create a delegate from the MethodBase via CreateDelegate<T> and then use that delegate for all future invocations. But in some circumstances, you don’t know the signature at compile time, and thus can’t easily rely on delegates with known matching signatures. To address this, some libraries have taken to using reflection emit to generate code at run-time specific to the target method. This is extremely complicated and it’s not something we want apps to have to do. Instead, in .NET 7 via dotnet/runtime#66357, dotnet/runtime#69575, and dotnet/runtime#74614, Invoke will itself use reflection emit (in the form of DynamicMethod) to generate a delegate that is customized for invoking the target, and then future invocation via that MethodInfo will utilize that generated method. This gives developers most of the performance benefits of a custom reflection emit-based implementation but without having the complexity or challenges of such an implementation in their own code base.
private MethodInfo _method;

[GlobalSetup]
public void Setup() => _method = typeof(Program).GetMethod("MyMethod", BindingFlags.NonPublic | BindingFlags.Static);

[Benchmark]
public void MethodInfoInvoke() => _method.Invoke(null, null);

private static void MyMethod() { }
Method | Runtime | Mean | Ratio |
---|---|---|---|
MethodInfoInvoke | .NET 6.0 | 43.846 ns | 1.00 |
MethodInfoInvoke | .NET 7.0 | 8.078 ns | 0.18 |
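And when you do know the signature up front, the CreateDelegate<T> approach mentioned earlier remains the fastest option; a quick sketch of mine (Square is just a stand-in method):
using System;
using System.Reflection;

public static class Example
{
    private static int Square(int x) => x * x;

    public static void Run()
    {
        MethodInfo method = typeof(Example).GetMethod("Square", BindingFlags.NonPublic | BindingFlags.Static)!;

        // Pay the reflection cost once to bind a strongly-typed delegate...
        Func<int, int> square = method.CreateDelegate<Func<int, int>>();

        // ...then each call is just a delegate invocation, far cheaper than MethodInfo.Invoke.
        Console.WriteLine(square(12)); // 144
    }
}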
Reflection also involves lots of manipulation of objects that represent types, methods, properties, and so on, and tweaks here and there can add up to a measurable difference when using these APIs. For example, I’ve talked in past performance posts about how, potentially counterintuitively, one of the ways we’ve achieved performance boosts is by porting native code from the runtime back into managed C#. There are a variety of ways in which doing so can help performance, but one is that there is some overhead associated with calling from managed code into the runtime, and eliminating such hops avoids that overhead. This can be seen in full effect in dotnet/runtime#71873, which moves several of these “FCalls” related to Type, RuntimeType (the Type-derived class used by the runtime to represent its types), and Enum out of native into managed.
[Benchmark]
public Type GetUnderlyingType() => Enum.GetUnderlyingType(typeof(DayOfWeek));
Method | Runtime | Mean | Ratio |
---|---|---|---|
GetUnderlyingType | .NET 6.0 | 27.413 ns | 1.00 |
GetUnderlyingType | .NET 7.0 | 5.115 ns | 0.19 |
Another example of this phenomenon comes in dotnet/runtime#62866, which moved much of the underlying support for AssemblyName out of native runtime code into managed code in CoreLib. That in turn has an impact on anything that uses it, such as when using Activator.CreateInstance overloads that take assembly names that need to be parsed.
private readonly string _assemblyName = typeof(MyClass).Assembly.FullName;
private readonly string _typeName = typeof(MyClass).FullName;

public class MyClass { }

[Benchmark]
public object CreateInstance() => Activator.CreateInstance(_assemblyName, _typeName);
Method | Runtime | Mean | Ratio |
---|---|---|---|
CreateInstance | .NET 6.0 | 3.827 us | 1.00 |
CreateInstance | .NET 7.0 | 2.276 us | 0.60 |
Other changes contributed to Activator.CreateInstance improvements as well. dotnet/runtime#67148 removed several array and list allocations from inside of the RuntimeType.CreateInstanceImpl method that’s used by CreateInstance (using Type.EmptyTypes instead of allocating a new Type[0], avoiding unnecessarily turning a builder into an array, etc.), resulting in less allocation and faster throughput.
[Benchmark]
public void CreateInstance() => Activator.CreateInstance(typeof(MyClass), BindingFlags.NonPublic | BindingFlags.Instance, null, Array.Empty<object>(), null);

internal class MyClass
{
    internal MyClass() { }
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
CreateInstance | .NET 6.0 | 167.8 ns | 1.00 | 320 B | 1.00 |
CreateInstance | .NET 7.0 | 143.4 ns | 0.85 | 200 B | 0.62 |
And since we were talking about AssemblyName, other PRs improved it in other ways as well. dotnet/runtime#66750, for example, updated the computation of AssemblyName.FullName to use stack-allocated memory and ArrayPool<char> instead of using a StringBuilder:
private AssemblyName[] _names = AppDomain.CurrentDomain.GetAssemblies().Select(a => new AssemblyName(a.FullName)).ToArray();

[Benchmark]
public int Names()
{
    int sum = 0;
    foreach (AssemblyName name in _names)
    {
        sum += name.FullName.Length;
    }
    return sum;
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
Names | .NET 6.0 | 3.423 us | 1.00 | 9.14 KB | 1.00 |
Names | .NET 7.0 | 2.010 us | 0.59 | 2.43 KB | 0.27 |
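That stackalloc-for-small-inputs-with-ArrayPool-fallback pattern shows up throughout the libraries; here’s a generic sketch of the idea (mine, with illustrative names and an arbitrary 256-char threshold), not the actual AssemblyName code:
using System;
using System.Buffers;

static string JoinWithComma(ReadOnlySpan<char> left, ReadOnlySpan<char> right)
{
    int length = left.Length + 2 + right.Length;

    // Small inputs use stack memory; larger ones rent (and later return) a pooled array,
    // avoiding both a StringBuilder and a throwaway char[] allocation.
    char[]? rented = null;
    Span<char> buffer = length <= 256 ?
        stackalloc char[256] :
        (rented = ArrayPool<char>.Shared.Rent(length));

    left.CopyTo(buffer);
    buffer[left.Length] = ',';
    buffer[left.Length + 1] = ' ';
    right.CopyTo(buffer.Slice(left.Length + 2));

    string result = new string(buffer.Slice(0, length));

    if (rented is not null)
    {
        ArrayPool<char>.Shared.Return(rented);
    }

    return result;
}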
More reflection-related operations have also been turned into JIT intrinsics, as discussed earlier enabling the JIT to compute answers to various questions at JIT compile time rather than at run-time. This was done, for example, for Type.IsByRefLike in dotnet/runtime#67852.
[Benchmark]public bool IsByRefLike() => typeof(ReadOnlySpan<char>).IsByRefLike;
Method | Runtime | Mean | Ratio | Code Size |
---|---|---|---|---|
IsByRefLike | .NET 6.0 | 2.1322 ns | 1.000 | 31 B |
IsByRefLike | .NET 7.0 | 0.0000 ns | 0.000 | 6 B |
That the .NET 7 version is so close to zero is called out in a warning by benchmarkdotnet:
// * Warnings *ZeroMeasurement Program.IsByRefLike: Runtime=.NET 7.0, Toolchain=net7.0 -> The method duration is indistinguishable from the empty method duration
and it’s so indistinguishable from an empty method because that’s effectively what it is, as we can see from the disassembly:
; Program.IsByRefLike() mov eax,1 ret; Total bytes of code 6
There are also improvements that are hard to see but that remove overheads as part of populating reflection's caches, which end up reducing the work typically done on startup paths, helping apps launch faster. dotnet/runtime#66825, dotnet/runtime#66912, and dotnet/runtime#67149 all fall into this category, removing unnecessary or duplicative array allocations as part of gathering data on parameters, properties, and events.
Interop
.NET has long had great support for interop, enabling .NET applications to consume huge amounts of functionality written in other languages and/or exposed by the underlying operating system. The bedrock of this support has been “Platform Invoke,” or “P/Invoke,” represented in code by[DllImport(...)]
applied to methods. TheDllImportAttribute
enables declaring a method that can be called like any other .NET method but that actually represents some external method that the runtime should call when this managed method is invoked. The DllImport specifies details about in what library the function lives, what its actual name is in the exports from that library, high-level details about marshalling of input arguments and return values, and so on, and the runtime ensures all the right things happen. This mechanism works on all operating systems. For example, Windows has a methodCreatePipe
for creating an anonymous pipe:
BOOL CreatePipe( [out] PHANDLE hReadPipe, [out] PHANDLE hWritePipe, [in, optional] LPSECURITY_ATTRIBUTES lpPipeAttributes, [in] DWORD nSize);
If I want to call this function from C#, I can declare a [DllImport(...)] counterpart to it, which I can then invoke as I would any other managed method:
[DllImport("kernel32", SetLastError = true)][return: MarshalAs(UnmanagedType.Bool)]private static unsafe extern bool CreatePipe( out SafeFileHandle hReadPipe, out SafeFileHandle hWritePipe, void* lpPipeAttributes, uint nSize);
There are several interesting things to note here. Several of the arguments are directly blittable with the same representation on the managed and native side of the equation, e.g.lpPipeAttributes
is a pointer andnSize
is a 32-bit integer. But what about the return value? Thebool
type in C# (System.Boolean
) is a one-byte type, but theBOOL
type in the native signature is four bytes; thus code calling this managed method can’t just directly invoke the native function somehow, as there needs to be some “marshalling” logic that converts the four-byte returnBOOL
into the one-byte returnbool
. Similarly, the native function has two out pointers for hReadPipe
andhWritePipe
, but the managed signature declares twoSafeFileHandle
s (aSafeHandle
is a .NET type that wraps a pointer and provides a finalizer andDispose
method for ensuring that pointer is appropriately cleaned up when it’s no longer being used). Some logic needs to take the output handles generated by the native function and wrap them into theseSafeFileHandles
to be output from the managed method. And what about thatSetLastError = true
? .NET has methods likeMarshal.GetLastPInvokeError()
, and some code somewhere needs to take any error produced by this method and ensure it’s available for consumption via a subsequentGetLastPInvokeError()
.
If there’s no marshalling logic required, such that the managed signature and native signature are for all intents and purposes the same, all arguments blittable, all return values blittable, no additional logic required around the invocation of the method, etc., then a[DllImport(...)]
ends up being a simple passthrough with the runtime needing to do very little work to implement it. If, however, the[DllImport(...)]
involves any of this marshalling work, the runtime needs to generate a “stub,” creating a dedicated method that’s called when the[DllImport(...)]
is called, that handles fixing up all inputs, that delegates to the actual native function, and that fixes up all of the outputs. That stub is generated at execution time, with the runtime effectively doing reflection emit, generating IL dynamically that’s then JIT’d.
There are a variety of downsides to this. First, it takes time to generate all that marshalling code, time which can then negatively impact user experience for things like startup. Second, the nature of its implementation inhibits various optimizations, such as inlining. Third, there are platforms that don’t allow for JIT’ing due to the security exposure of allowing for dynamically generated code to then be executed (or in the case of Native AOT, where there isn’t a JIT at all). And fourth, it’s all hidden away making it more challenging for a developer to really understand what’s going on.
But what if that logic could all be generated at build time rather than at run time? The cost of generating the code would be incurred only at build time and not in every process execution. The code would effectively just end up being user code that has all of the C# compiler’s and runtime’s optimizations available to it. The code, which then would just be part of the app, would be able to be ahead-of-time compiled using whatever AOT system is desirable, whether it be crossgen or Native AOT or some other system. And the code would be inspectable, viewable by users to understand exactly what work is being done on their behalf. Sounds pretty desirable. Sounds magical. Sounds like a job for a Roslyn source generator, mentioned earlier.
.NET 6 included several source generators in the .NET SDK, and .NET 7 doubles down on this effort including several more. One of these is the brand new LibraryImport generator, which provides exactly the magical, desirable solution we were just discussing.
Let’s return to our previousCreatePipe
example. We’ll make two small tweaks. We change the attribute fromDllImport
toLibraryImport
, and we change theextern
keyword to bepartial
:
[LibraryImport("kernel32", SetLastError = true)][return: MarshalAs(UnmanagedType.Bool)]private static unsafe partial bool CreatePipe( out SafeFileHandle hReadPipe, out SafeFileHandle hWritePipe, void* lpPipeAttributes, uint nSize);
Now if you’re following along at home in Visual Studio, try right-clicking on CreatePipe and selecting Go to Definition. That might seem a little strange. “Go to Definition? Isn’t this the definition?” This is a partial method, which is a way of declaring something that another partial definition fills in, and in this case, a source generator in .NET 7 SDK has noticed this method with the[LibraryImport]
attribute and fully generated the entire marshalling stub code in C# that’s built directly into the assembly. While by default that code isn’t persisted, Visual Studio still enables you to browse it (and you can opt-in to having it persisted on disk by adding a<EmitCompilerGeneratedFiles>true</EmitCompilerGeneratedFiles>
property into your .csproj). Here’s what it currently looks like for that method:
[System.CodeDom.Compiler.GeneratedCodeAttribute("Microsoft.Interop.LibraryImportGenerator", "7.0.6.42316")][System.Runtime.CompilerServices.SkipLocalsInitAttribute]private static unsafe partial bool CreatePipe(out global::Microsoft.Win32.SafeHandles.SafeFileHandle hReadPipe, out global::Microsoft.Win32.SafeHandles.SafeFileHandle hWritePipe, void* lpPipeAttributes, uint nSize){ int __lastError; bool __invokeSucceeded = default; System.Runtime.CompilerServices.Unsafe.SkipInit(out hReadPipe); System.Runtime.CompilerServices.Unsafe.SkipInit(out hWritePipe); System.IntPtr __hReadPipe_native = default; System.IntPtr __hWritePipe_native = default; bool __retVal; int __retVal_native = default; // Setup - Perform required setup. global::Microsoft.Win32.SafeHandles.SafeFileHandle hReadPipe__newHandle = new global::Microsoft.Win32.SafeHandles.SafeFileHandle(); global::Microsoft.Win32.SafeHandles.SafeFileHandle hWritePipe__newHandle = new global::Microsoft.Win32.SafeHandles.SafeFileHandle(); try { { System.Runtime.InteropServices.Marshal.SetLastSystemError(0); __retVal_native = __PInvoke(&__hReadPipe_native, &__hWritePipe_native, lpPipeAttributes, nSize); __lastError = System.Runtime.InteropServices.Marshal.GetLastSystemError(); } __invokeSucceeded = true; // Unmarshal - Convert native data to managed data. __retVal = __retVal_native != 0; } finally { if (__invokeSucceeded) { // GuaranteedUnmarshal - Convert native data to managed data even in the case of an exception during the non-cleanup phases. System.Runtime.InteropServices.Marshal.InitHandle(hWritePipe__newHandle, __hWritePipe_native); hWritePipe = hWritePipe__newHandle; System.Runtime.InteropServices.Marshal.InitHandle(hReadPipe__newHandle, __hReadPipe_native); hReadPipe = hReadPipe__newHandle; } } System.Runtime.InteropServices.Marshal.SetLastPInvokeError(__lastError); return __retVal; // Local P/Invoke [System.Runtime.InteropServices.DllImportAttribute("kernel32", EntryPoint = "CreatePipe", ExactSpelling = true)] static extern unsafe int __PInvoke(System.IntPtr* hReadPipe, System.IntPtr* hWritePipe, void* lpPipeAttributes, uint nSize);}
With this, you can read exactly the marshalling work that's being performed. Two SafeHandle instances are allocated, and then later, after the native function completes, the Marshal.InitHandle method is used to store the resulting handles into those instances (the allocations happen before the native function call, as performing them after the native handles have already been produced increases the chances of a leak if the SafeHandle allocation fails due to an out-of-memory situation). The BOOL to bool conversion happens via a != 0 comparison. And the error information is captured by calling Marshal.GetLastSystemError() just after the native function call and then Marshal.SetLastPInvokeError(int) just prior to returning. The actual native function call is still implemented with a [DllImport(...)], but now that P/Invoke is blittable and doesn't require any stub to be generated by the runtime, as all of that work has been handled in this C# code.
A sheer ton of work went into enabling this. I touched on some of it last year in Performance Improvements in .NET 6, but a significant amount of additional effort has gone into .NET 7 to polish the design and make the implementation robust, roll it out across all of dotnet/runtime and beyond, and expose the functionality for all C# developers to use:
- The LibraryImport generator started its life as an experiment indotnet/runtimelab. When it was ready,dotnet/runtime#59579 brought 180 commits spanning years of effort into thedotnet/runtime main branch.
- In .NET 6, there were almost 3000 [DllImport] uses throughout the core .NET libraries. As of my writing this, in .NET 7 there are… let me search… wait for it… 7 (I was hoping I could say 0, but there are just a few stragglers, mostly related to COM interop, still remaining). That's not a transformation that happens overnight. A multitude of PRs went library by library converting from the old to the new, such as dotnet/runtime#62295 and dotnet/runtime#61640 for System.Private.CoreLib, dotnet/runtime#61742 and dotnet/runtime#62309 for the cryptography libraries, dotnet/runtime#61765 for networking, dotnet/runtime#61996 and dotnet/runtime#61638 for most of the other I/O-related libraries, and a long tail of additional porting in dotnet/runtime#61975, dotnet/runtime#61389, dotnet/runtime#62353, dotnet/runtime#61990, dotnet/runtime#61949, dotnet/runtime#61805, dotnet/runtime#61741, dotnet/runtime#61184, dotnet/runtime#54290, dotnet/runtime#62365, dotnet/runtime#61609, dotnet/runtime#61532, and dotnet/runtime#54236.
- Such porting is significantly easier when there's a tool to help automate it. dotnet/runtime#72819 enables the analyzer and fixer for performing these transformations.
There were plenty of other PRs that went into making the LibraryImport generator a reality for .NET 7. To highlight just a few more,dotnet/runtime#63320 introduces a new[DisabledRuntimeMarshalling]
attribute that can be specified at the assembly level to disable all of the runtime’s built-in marshalling; at that point, the only marshalling performed as part of interop is the marshaling done in the user’s code, e.g. that which is generated by[LibraryImport]
. Other PRs likedotnet/runtime#67635 anddotnet/runtime#68173 added new marshaling types that encompass common marshaling logic and can be referenced from[LibraryImport(...)]
use to customize how marshaling is performed (the generator is pattern-based and allows for customization of marshalling by providing types that implement the right shape, which these types do in support of the most common marshalling needs). Really usefully,dotnet/runtime#71989 added support for marshaling{ReadOnly}Span<T>
, such that spans can be used directly in[LibraryImport(...)]
method signatures, just as arrays can be (examples indotnet/runtime are available indotnet/runtime#73256). Anddotnet/runtime#69043 consolidated logic to be shared between the runtime’s marshalling support in[DllImport]
and the generator's support with [LibraryImport].
One more category of interop-related changes that I think are worth talking about are to do withSafeHandle
cleanup. As a reminder,SafeHandle
exists to mitigate various issues around managing native handles and file descriptors. A native handle or file descriptor is just a memory address or number that refers to some owned resource and which must be cleaned up / closed when done with it. ASafeHandle
at its core is just a managed object that wraps such a value and provides aDispose
method and a finalizer for closing it. That way, if you neglect toDispose
of theSafeHandle
in order to close the resource, the resource will still be cleaned up when theSafeHandle
is garbage collected and its finalizer eventually run.SafeHandle
then also provides some synchronization around that closure, trying to minimize the possibility that the resource is closed while it’s still in use. It providesDangerousAddRef
andDangerousRelease
methods that increment and decrement a ref count, respectively, and ifDispose
is called while the ref count is above zero, the actual releasing of the handle triggered byDispose
is delayed until the ref count goes back to 0. When you pass aSafeHandle
into a P/Invoke, the generated code for that P/Invoke handles callingDangerousAddRef
andDangerousRelease
(and due to the wonders of LibraryImport I’ve already extolled, you can easily see that being done, such as in the previous generated code example). Our code tries hard to clean up afterSafeHandle
s deterministically, but it’s quite easy to accidentally leave some for finalization.
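To make that ref counting concrete, here's a small, illustrative sketch (not the actual generated marshalling code) of what interop does around a SafeHandle argument; the UseHandle helper is hypothetical:

```csharp
using System;
using System.Runtime.InteropServices;

static class SafeHandleInterop
{
    // Illustrative sketch of the ref counting the marshalling code performs
    // around a SafeHandle argument passed to native code.
    static void UseHandle(SafeHandle handle)
    {
        bool addedRef = false;
        try
        {
            // Increment the ref count so the underlying handle can't be released mid-call.
            handle.DangerousAddRef(ref addedRef);
            IntPtr raw = handle.DangerousGetHandle();
            // ... pass 'raw' to the native function here ...
        }
        finally
        {
            if (addedRef)
            {
                // Decrement the ref count; if Dispose was called in the meantime,
                // the actual release happens once the count is back to zero.
                handle.DangerousRelease();
            }
        }
    }
}
```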
dotnet/runtime#71854 added some debug-only tracking code toSafeHandle
to make it easier for developers working indotnet/runtime (or more specifically, developers using a checked build of the runtime) to find such issues. When theSafeHandle
is constructed, it captures the current stack trace, and if theSafeHandle
is finalized, it dumps that stack trace to the console, making it easy to see whereSafeHandles
that do end up getting finalized were created, in order to track them down and ensure they’re being disposed of. As is probably evident from that PR touching over 150 files and almost 1000 lines of code, there were quite a few places that benefited from clean up. Now to be fair, many of these are on exceptional code paths. For example, consider a hypothetical P/Invoke like:
[LibraryImport("SomeLibrary", SetLastError = true)]internal static partial SafeFileHandle CreateFile();
and code that uses it like:
SafeFileHandle handle = Interop.CreateFile();if (handle.IsInvalid){ throw new UhOhException(Marshal.GetLastPInvokeError());}return handle;
Seems straightforward enough. Except this code will actually leave aSafeHandle
for finalization on the failure path. It doesn’t matter thatSafeHandle
has an invalid handle in it, it’s still a finalizable object. To deal with that, this code would have been more robustly written as:
SafeFileHandle handle = Interop.CreateFile();if (handle.IsInvalid){ int lastError = Marshal.GetLastPInvokeError(); handle.Dispose(); // or handle.SetHandleAsInvalid() throw new UhOhException(lastError);}return handle;
That way, thisSafeHandle
won’t create finalization pressure even in the case of failure. Note, as well, that as part of adding in theDispose
call, I also moved theMarshal.GetLastPInvokeError()
up. That’s because callingDispose
on aSafeHandle
may end up invoking theSafeHandle
‘sReleaseHandle
method, which the developer of theSafeHandle
-derived type will have overridden to close the resource, which typically involves making another P/Invoke. And if that P/Invoke hasSetLastError=true
on it, it can overwrite the very error code for which we’re about to throw. Hence, we access and store the last error immediately after the interop call once we know it failed, then clean up, and only then throw. All that said, there were a myriad of places in that PR whereSafeHandle
s were being left for finalization even on the success path. And that PR wasn’t alone.dotnet/runtime#71991,dotnet/runtime#71854,dotnet/runtime#72116,dotnet/runtime#72189,dotnet/runtime#72222,dotnet/runtime#72203, anddotnet/runtime#72279 all found and fixed many occurrences ofSafeHandle
s being left for finalization (many thanks to the diagnostics put in place in the earlier mentioned PR).
Other PRs also accrued to improved interop performance.dotnet/runtime#70000 from@huoyaoyuan rewrote several delegate-related “FCalls” from being implemented in native code to instead being managed, resulting in less overhead when invoking these operations that are commonly involved in scenarios involvingMarshal.GetDelegateForFunctionPointer
.dotnet/runtime#68694 also moved some trivial functionality from native to managed, as part of relaxing argument validation on the use of pinning handles. This in turn measurably reduced the overhead involved with usingGCHandle.Alloc
for such pinning handles:
private byte[] _buffer = new byte[1024];[Benchmark]public void PinUnpin(){ GCHandle.Alloc(_buffer, GCHandleType.Pinned).Free();}
Method | Runtime | Mean | Ratio | Code Size |
---|---|---|---|---|
PinUnpin | .NET 6.0 | 37.11 ns | 1.00 | 353 B |
PinUnpin | .NET 7.0 | 32.17 ns | 0.87 | 232 B |
Threading
Threading is one of those cross-cutting concerns that impacts every application, such that changes in the threading space can have a wide-spread impact. This release sees two very substantial changes to theThreadPool
itself; dotnet/runtime#64834 switches the "IO pool" over to using an entirely managed implementation (whereas previously the IO pool was still in native code even though the worker pool had been moved entirely to managed in previous releases), and dotnet/runtime#71864 similarly switches the timer implementation from one based in native code to one entirely in managed code. Those two changes can impact performance, and the former was demonstrated to do so on larger hardware, but for the most part that wasn't their primary goal. Instead, other PRs have been focused on improving throughput.
One in particular isdotnet/runtime#69386. TheThreadPool
has a “global queue” that any thread can queue work into, and then each thread in the pool has its own “local queue” (which any thread can dequeue from but only the owning thread can enqueue into). When a worker needs another piece of work to process, it first checks its own local queue, then it checks the global queue, and then only if it couldn’t find work in either of those two places, it goes and checks all of the other threads’ local queues to see if it can help lighten their load. As machines scale up to have more and more cores, and more and more threads, there’s more and more contention on these shared queues, and in particular on the global queue. This PR addresses this for such larger machines by introducing additional global queues once the machine reaches a certain threshold (32 processors today). This helps to partition accesses across multiple queues, thereby decreasing contention.
Another is dotnet/runtime#57885. In order to coordinate threads, when work items were enqueued and dequeued, the pool was issuing requests to its threads to let them know that there was work available to do. This, however, often resulted in oversubscription, where more threads than necessary would race to try to get work items, especially when the system wasn't at full load. That in turn would manifest as a throughput regression. This change overhauls how threads are requested, such that only one additional thread is requested at a time; after that thread has dequeued its first work item, it can issue a request for an additional thread if there's work remaining, then that one can issue an additional request, and so on. Here's one of the tests from our performance test suite (I've simplified it down to remove a bunch of configuration options, but it still accurately represents one of those configurations). At first glance you might think, "hey, this is a performance test about ArrayPool, why is it showing up in a threading discussion?" And, you'd be right, this is a performance test that was written focused on ArrayPool. However, as mentioned earlier, threading impacts everything, and in this case, that await Task.Yield() in the middle there causes the remainder of the method to be queued to the ThreadPool for execution. And because of how the test is structured, doing "real work" that competes for CPU cycles with thread pool threads all racing to get their next task, it shows a measurable improvement when moving to .NET 7.
private readonly byte[][] _nestedArrays = new byte[8][];private const int Iterations = 100_000;private static byte IterateAll(byte[] arr){ byte ret = default; foreach (byte item in arr) ret = item; return ret;}[Benchmark(OperationsPerInvoke = Iterations)]public async Task MultipleSerial(){ for (int i = 0; i < Iterations; i++) { for (int j = 0; j < _nestedArrays.Length; j++) { _nestedArrays[j] = ArrayPool<byte>.Shared.Rent(4096); _nestedArrays[j].AsSpan().Clear(); } await Task.Yield(); for (int j = _nestedArrays.Length - 1; j >= 0; j--) { IterateAll(_nestedArrays[j]); ArrayPool<byte>.Shared.Return(_nestedArrays[j]); } }}
Method | Runtime | Mean | Ratio |
---|---|---|---|
MultipleSerial | .NET 6.0 | 14.340 us | 1.00 |
MultipleSerial | .NET 7.0 | 9.262 us | 0.65 |
There have been improvements outside ofThreadPool
, as well. One notable change is in the handling ofAsyncLocal<T>
s, indotnet/runtime#68790.AsyncLocal<T>
is integrated tightly withExecutionContext
; in fact, in .NET Core,ExecutionContext
isentirely about flowingAsyncLocal<T>
instances. AnExecutionContext
instance maintains a single field, a map data structure, that stores the data for allAsyncLocal<T>
with data present in that context. EachAsyncLocal<T>
has an object it uses as a key, and any gets or sets on thatAsyncLocal<T>
manifest as getting the currentExecutionContext
, looking up thatAsyncLocal<T>
‘s key in the context’s dictionary, and then either returning whatever data it finds, or in the case of a setter, creating a newExecutionContext
with an updated dictionary and publishing that back. This dictionary thus needs to be very efficient for reads and writes, as developers expectAsyncLocal<T>
access to be as fast as possible, often treating it as if it were any other local. So, to optimize these lookups, the representation of that dictionary changes based on how manyAsyncLocal<T>
s are represented in this context. For up to three items, dedicated implementations with fields for each of the three keys and values were used. Above that up to around 16 elements, an array of key/value pairs was used. And above that, aDictionary<,>
was used. For the most part, this has worked well, with the majority ofExecutionContext
s being able to represent many flows with one of the first three types. However, it turns out that four activeAsyncLocal<T>
instances is really common, especially in ASP.NET, where the ASP.NET infrastructure itself uses a couple. So, this PR took the complexity hit to add a dedicated type for four key/value pairs, in order to optimize from one to four of them rather than one to three. While this improves throughput a bit, its main intent was to improve allocation, which it does over .NET 6 by ~20%.
private AsyncLocal<int> asyncLocal1 = new AsyncLocal<int>();private AsyncLocal<int> asyncLocal2 = new AsyncLocal<int>();private AsyncLocal<int> asyncLocal3 = new AsyncLocal<int>();private AsyncLocal<int> asyncLocal4 = new AsyncLocal<int>();[Benchmark(OperationsPerInvoke = 4000)]public void Update(){ for (int i = 0; i < 1000; i++) { asyncLocal1.Value++; asyncLocal2.Value++; asyncLocal3.Value++; asyncLocal4.Value++; }}
Method | Runtime | Mean | Ratio | Code Size | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|
Update | .NET 6.0 | 61.96 ns | 1.00 | 1,272 B | 176 B | 1.00 |
Update | .NET 7.0 | 61.92 ns | 1.00 | 1,832 B | 144 B | 0.82 |
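For a bit of context on what's being measured, here's a tiny example of AsyncLocal<T> in action: each set stores the value into the current ExecutionContext's map, and that context (and thus the value) flows across the await to the continuation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class AsyncLocalExample
{
    private static readonly AsyncLocal<int> s_requestId = new AsyncLocal<int>();

    static async Task Main()
    {
        await Task.WhenAll(HandleAsync(1), HandleAsync(2));
    }

    static async Task HandleAsync(int id)
    {
        s_requestId.Value = id;               // stored in the current ExecutionContext's map
        await Task.Yield();                   // the context flows to the continuation
        Console.WriteLine(s_requestId.Value); // still prints the id set above
    }
}
```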
Another valuable fix comes for locking indotnet/runtime#70165. This particular improvement is a bit harder to demonstrate with benchmarkdotnet, so just try running this program, first on .NET 6 and then on .NET 7:
using System.Diagnostics;var rwl = new ReaderWriterLockSlim();var tasks = new Task[100];int count = 0;DateTime end = DateTime.UtcNow + TimeSpan.FromSeconds(10);while (DateTime.UtcNow < end){ for (int i = 0; i < 100; ++i) { tasks[i] = Task.Run(() => { var sw = Stopwatch.StartNew(); rwl.EnterReadLock(); rwl.ExitReadLock(); sw.Stop(); if (sw.ElapsedMilliseconds >= 10) { Console.WriteLine(Interlocked.Increment(ref count)); } }); } Task.WaitAll(tasks);}
This is simply spinning up 100 tasks, each of which enters and exits a read-write lock, waits for them all, and then does the whole process over again, for 10 seconds. It also times how long it takes to enter and exit the lock, and writes to the console whenever doing so took at least 10 milliseconds. When I run this on .NET 6, I get ~100 occurrences of it taking >= 10 ms to enter/exit the lock. On .NET 7, I get 0 occurrences. Why the difference? The implementation of ReaderWriterLockSlim
has its own spin loop implementation, and that spin loop tries to mix together various things to do as it spins, ranging from callingThread.SpinWait
toThread.Sleep(0)
toThread.Sleep(1)
. The issue lies in theThread.Sleep(1)
. That’s saying “put this thread to sleep for 1 millisecond”; however, the operating system has the ultimate say on such timings, and on Windows, by default that sleep is going to be closer to 15 milliseconds (on Linux it’s a bit lower but still quite high). Thus, every time there was enough contention on the lock to force it to callThread.Sleep(1)
, we’d incur a delay of at least 15 milliseconds, if not more. The aforementioned PR fixed this by eliminating use ofThread.Sleep(1)
.
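A quick way to see the underlying problem is to time a single Thread.Sleep(1); on Windows with default timer resolution the elapsed time will typically be far closer to 15ms than to the 1ms requested (the exact number depends on the OS and timer settings):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

var sw = Stopwatch.StartNew();
Thread.Sleep(1); // ask for 1ms...
sw.Stop();
Console.WriteLine($"Requested 1ms, actually slept ~{sw.Elapsed.TotalMilliseconds:F1}ms");
```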
One final threading-related change to call out: dotnet/runtime#68639. This one is Windows-specific. Windows has the concept of processor groups, each of which can have up to 64 cores in it, and by default when a process runs, it's assigned a specific processor group and can only use the cores in that group. With .NET 7, the runtime changes that default so that it will try to use all processor groups if possible.
Primitive Types and Numerics
We’ve looked at code generation and GC, at threading and vectorization, at interop… let’s turn our attention to some of the fundamental types in the system. Primitives likeint
andbool
anddouble
, core types likeGuid
andDateTime
, they form the backbone on which everything is built, and every release it’s exciting to see the improvements that find their way into these types.
float
anddouble
got a very nice boost in their implementation of parsing (e.g.double.Parse
,float.TryParse
, etc.).dotnet/runtime#62301 from@CarlVerret significantly improvesdouble.Parse
andfloat.Parse
for parsing UTF16 text into floating-point values. This is particularly neat because it's based on some relatively recent research from @lemire and @CarlVerret, who used C# with .NET 5 to create a very fast implementation for parsing floating-point numbers, and that implementation has now found its way into .NET 7!
private string[] _valuesToParse;[GlobalSetup]public void Setup(){ using HttpClient hc = new HttpClient(); string text = hc.GetStringAsync("https://raw.githubusercontent.com/CarlVerret/csFastFloat/1d800237275f759b743b86fcce6680d072c1e834/Benchmark/data/canada.txt").Result; var lines = new List<string>(); foreach (ReadOnlySpan<char> line in text.AsSpan().EnumerateLines()) { ReadOnlySpan<char> trimmed = line.Trim(); if (!trimmed.IsEmpty) { lines.Add(trimmed.ToString()); } } _valuesToParse = lines.ToArray();}[Benchmark]public double ParseAll(){ double total = 0; foreach (string s in _valuesToParse) { total += double.Parse(s); } return total;}
Method | Runtime | Mean | Ratio |
---|---|---|---|
ParseAll | .NET 6.0 | 26.84 ms | 1.00 |
ParseAll | .NET 7.0 | 12.63 ms | 0.47 |
bool.TryParse
andbool.TryFormat
were also improved.dotnet/runtime#64782 streamlined these implementations by usingBinaryPrimitives
to perform fewer writes and reads. For example, instead ofTryFormat
writing out “True” by doing:
destination[0] = 'T';destination[1] = 'r';destination[2] = 'u';destination[3] = 'e';
which requires four writes, it can instead implement the same operation in a single write by doing:
BinaryPrimitives.WriteUInt64LittleEndian(MemoryMarshal.AsBytes(destination), 0x65007500720054); // "True"
That0x65007500720054
is the numerical value of the four characters in memory as a singleulong
. You can see the impact of these changes with a microbenchmark:
private bool _value = true;private char[] _chars = new char[] { 'T', 'r', 'u', 'e' };[Benchmark] public bool ParseTrue() => bool.TryParse(_chars, out _);[Benchmark] public bool FormatTrue() => _value.TryFormat(_chars, out _);
Method | Runtime | Mean | Ratio |
---|---|---|---|
ParseTrue | .NET 6.0 | 7.347 ns | 1.00 |
ParseTrue | .NET 7.0 | 2.327 ns | 0.32 |
FormatTrue | .NET 6.0 | 3.030 ns | 1.00 |
FormatTrue | .NET 7.0 | 1.997 ns | 0.66 |
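If you want to convince yourself where that 0x65007500720054 constant comes from, it's just the four UTF-16 code units of "True" laid out in little-endian order:

```csharp
using System;

// 'T' = 0x0054, 'r' = 0x0072, 'u' = 0x0075, 'e' = 0x0065, packed little-endian into a ulong.
ulong packed = 0x0054UL | (0x0072UL << 16) | (0x0075UL << 32) | (0x0065UL << 48);
Console.WriteLine(packed == 0x65007500720054); // True
Console.WriteLine($"0x{packed:X}");            // 0x65007500720054
```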
Enum
gets several performance boosts, as well. For example, when performing an operation likeEnum.IsDefined
,Enum.GetName
, orEnum.ToString
, the implementation consults a cache of all of the values defined on the enum. This cache includes the string name and the value for every defined enumeration in theEnum
. It’s also sorted by value in an array, so when one of these operations is performed, the code usesArray.BinarySearch
to find the index of the relevant entry. The issue with that is one of overheads. When it comes to algorithmic complexity, a binary search is faster than a linear search; after all, a binary search isO(log N)
whereas a linear search isO(N)
. However, there’s also less overhead for every step of the algorithm in a linear search, and so for smaller values ofN
, it can be much faster to simply do the simple thing. That’s whatdotnet/runtime#57973 does for enums. For enums with less than or equal to 32 defined values, the implementation now just does a linear search via the internalSpanHelpers.IndexOf
(the worker routine behindIndexOf
on spans, strings, and arrays), and for enums with more than that, it does aSpanHelpers.BinarySearch
(which is the implementation forArray.BinarySearch
).
private DayOfWeek[] _days = Enum.GetValues<DayOfWeek>();[Benchmark]public bool AllDefined(){ foreach (DayOfWeek day in _days) { if (!Enum.IsDefined(day)) { return false; } } return true;}
Method | Runtime | Mean | Ratio |
---|---|---|---|
AllDefined | .NET 6.0 | 159.28 ns | 1.00 |
AllDefined | .NET 7.0 | 94.86 ns | 0.60 |
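The underlying idea is simple threshold-based algorithm selection. A minimal sketch of that idea (illustrative only, with a hypothetical FindIndex helper, not the actual Enum internals):

```csharp
using System;

static class ThresholdSearch
{
    // Below a small threshold, a linear scan's lower per-step overhead beats
    // binary search's better asymptotic complexity.
    static int FindIndex(ulong[] sortedValues, ulong value)
    {
        const int LinearSearchThreshold = 32;
        if (sortedValues.Length <= LinearSearchThreshold)
        {
            for (int i = 0; i < sortedValues.Length; i++)
            {
                if (sortedValues[i] == value) return i;
            }
            return -1;
        }

        int index = Array.BinarySearch(sortedValues, value);
        return index >= 0 ? index : -1;
    }
}
```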
Enum
s also get a boost in conjunction withNullable<T>
andEqualityComparer<T>.Default
.EqualityComparer<T>.Default
caches a singleton instance of anEqualityComparer<T>
instance returned from all accesses toDefault
. That singleton is initialized based on theT
in question, with the implementation choosing from a multitude of different internal implementations, for example aByteArrayComparer
specialized for bytes, aGenericEqualityComparer<T>
forT
s that implementIComparable<T>
, and so on. The catch-all, for arbitrary types, is anObjectEqualityComparer<T>
. As it happens, nullable enums would end up hitting this catch-all path, which means that everyEquals
call would box the arguments.dotnet/runtime#68077 fixes this by ensuring nullable enums get mapped to (an existing) specialized comparer forNullable<T>
and simply tweaks its definition to ensure it can play nicely with enums. The results highlight just how much unnecessary overhead there was previously.
private DayOfWeek?[] _enums = Enum.GetValues<DayOfWeek>().Select(e => (DayOfWeek?)e).ToArray();[Benchmark][Arguments(DayOfWeek.Saturday)]public int FindEnum(DayOfWeek value) => IndexOf(_enums, value);private static int IndexOf<T>(T[] values, T value){ for (int i = 0; i < values.Length; i++) { if (EqualityComparer<T>.Default.Equals(values[i], value)) { return i; } } return -1;}
Method | Runtime | Mean | Ratio |
---|---|---|---|
FindEnum | .NET 6.0 | 421.608 ns | 1.00 |
FindEnum | .NET 7.0 | 5.466 ns | 0.01 |
Not to be left out,Guid
‘s equality operations also get faster, thanks todotnet/runtime#66889 from@madelson. The previous implementation ofGuid
split the data into four 32-bit values and performed four int comparisons. With this change, if the current hardware has 128-bit SIMD support, the implementation loads the data from the two guids as two vectors and simply does a single comparison.
private Guid _guid1 = Guid.Parse("0aa2511d-251a-4764-b374-4b5e259b6d9a");private Guid _guid2 = Guid.Parse("0aa2511d-251a-4764-b374-4b5e259b6d9a");[Benchmark]public bool GuidEquals() => _guid1 == _guid2;
Method | Runtime | Mean | Ratio | Code Size |
---|---|---|---|---|
GuidEquals | .NET 6.0 | 2.119 ns | 1.00 | 90 B |
GuidEquals | .NET 7.0 | 1.354 ns | 0.64 | 78 B |
DateTime
equality is also improved.dotnet/runtime#59857 shaves some overhead off ofDateTime.Equals
.DateTime
is implemented with a singleulong _dateData
field, where the majority of the bits store a ticks offset from 1/1/0001 12:00am and where each tick is 100 nanoseconds, and where the top two bits describe theDateTimeKind
. Thus the publicTicks
property returns the value of_dateData
but with the top two bits masked out, e.g._dateData & 0x3FFFFFFFFFFFFFFF
. The equality operators were all then just comparing oneDateTime
‘sTicks
against the others, such that we effectively get(dt1._dateData & 0x3FFFFFFFFFFFFFFF) == (dt2._dateData & 0x3FFFFFFFFFFFFFFF)
. However, as a micro-optimization that can instead be expressed more efficiently as((dt1._dateData ^ dt2._dateData) << 2) == 0
. It’s difficult to measure the difference in such tiny operations, but you can see it simply from the number of instructions involved, where on .NET 6 this produces:
; Program.DateTimeEquals() mov rax,[rcx+8] mov rdx,[rcx+10] mov rcx,3FFFFFFFFFFFFFFF and rax,rcx and rdx,rcx cmp rax,rdx sete al movzx eax,al ret; Total bytes of code 34
and on .NET 7 this produces:
; Program.DateTimeEquals() mov rax,[rcx+8] mov rdx,[rcx+10] xor rax,rdx shl rax,2 sete al movzx eax,al ret; Total bytes of code 22
so instead of a mov, and, and, and cmp, we get just an xor and a shl.
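If the equivalence isn't obvious, shifting left by two discards the top two (kind) bits of the xor result, and the xor of the remaining bits is zero exactly when those bits are equal, which is the same thing the masked comparison checks. A small sketch to verify the two formulations agree:

```csharp
using System;

// Mask that clears the top two DateTimeKind bits, leaving just the ticks.
const ulong TicksMask = 0x3FFF_FFFF_FFFF_FFFF;

static bool EqualsViaMask(ulong a, ulong b) => (a & TicksMask) == (b & TicksMask);
static bool EqualsViaXorShift(ulong a, ulong b) => ((a ^ b) << 2) == 0;

var rng = new Random(42);
for (int i = 0; i < 1_000_000; i++)
{
    ulong a = (ulong)rng.NextInt64();
    ulong sameTicksDifferentKind = a ^ (0b11UL << 62); // flip only the two kind bits
    ulong different = (ulong)rng.NextInt64();

    if (EqualsViaMask(a, sameTicksDifferentKind) != EqualsViaXorShift(a, sameTicksDifferentKind) ||
        EqualsViaMask(a, different) != EqualsViaXorShift(a, different))
    {
        throw new Exception("mismatch");
    }
}
Console.WriteLine("The two formulations agree.");
```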
Other operations on DateTime also become more efficient, thanks to dotnet/runtime#72712 from @SergeiPavlov and dotnet/runtime#73277 from @SergeiPavlov. In another case of .NET benefiting from recent advancements in research, these PRs implemented the algorithm from Neri and Schneider's "Euclidean Affine Functions and Applications to Calendar Algorithms" in order to improve DateTime.Day
,DateTime.DayOfYear
,DateTime.Month
, andDateTime.Year
, as well as the internal helperDateTime.GetDate()
that’s used by a bunch of other methods likeDateTime.AddMonths
,Utf8Formatter.TryFormat(DateTime, ...)
,DateTime.TryFormat
, andDateTime.ToString
.
private DateTime _dt = DateTime.UtcNow;private char[] _dest = new char[100];[Benchmark] public int Day() => _dt.Day;[Benchmark] public int Month() => _dt.Month;[Benchmark] public int Year() => _dt.Year;[Benchmark] public bool TryFormat() => _dt.TryFormat(_dest, out _, "r");
Method | Runtime | Mean | Ratio |
---|---|---|---|
Day | .NET 6.0 | 5.2080 ns | 1.00 |
Day | .NET 7.0 | 2.0549 ns | 0.39 |
Month | .NET 6.0 | 4.1186 ns | 1.00 |
Month | .NET 7.0 | 2.0945 ns | 0.51 |
Year | .NET 6.0 | 3.1422 ns | 1.00 |
Year | .NET 7.0 | 0.8200 ns | 0.26 |
TryFormat | .NET 6.0 | 27.6259 ns | 1.00 |
TryFormat | .NET 7.0 | 25.9848 ns | 0.94 |
So, we’ve touched on improvements to a few types, but the pièce de résistance around primitive types in this release is “generic math,” which impacts almost every primitive type in .NET. There are significant improvements here, some which have been in the making for literally over a decade.
There’s an excellent blog post from June dedicated just togeneric math, so I won’t go into much depth here. At a high level, however, there are now over 30 new interfaces that utilize the new C# 11 static abstract interface methods functionality, exposing wide-ranging operations from exponentiation functions to trigonometric functions to standard numerical operators, all available via generics, such that you can write one implementation that operates over these interfaces generically and have your code applied to any types that implement the interfaces… which all of the numerical types in .NET 7 do (including not just the primitives but also, for example,BigInteger
andComplex
). A preview version of this feature, including necessary runtime support, language syntax, C# compiler support, generic interfaces, and interface implementations all shipped in .NET 6 and C# 10, but it wasn’t supported for production use, and you had to download an experimental reference assembly in order to get access. Withdotnet/runtime#65731, all of this support moved into .NET 7 as supported functionality.dotnet/runtime#66748,dotnet/runtime#67453,dotnet/runtime#69391,dotnet/runtime#69582,dotnet/runtime#69756, anddotnet/runtime#71800 all updated the design and implementation based on feedback from usage in .NET 6 and .NET 7 previews as well as a proper API review with our API review team (a process every new API in .NET goes through before it’s shipped publicly).dotnet/runtime#67714 added support for user-definedchecked
operators, a new C# 11 feature that enables bothunchecked
andchecked
variations of operators to be exposed, with the compiler picking the right one based on thechecked
context.dotnet/runtime#68096 also added support for the new C# 11 unsigned right shift operator (>>>
). Anddotnet/runtime#69651,dotnet/runtime#67939,dotnet/runtime#73274,dotnet/runtime#71033,dotnet/runtime#71010,dotnet/runtime#68251,dotnet/runtime#68217, anddotnet/runtime#68094 all added large swaths of new public surface area for various operations, all with highly-efficient managed implementations, in many cases based on the open sourceAMD Math Library.
While this support is all primarily intended for external consumers, the core libraries do consume some of it internally. You can see how these APIs clean up consuming code even while maintaining performance in PRs like dotnet/runtime#68226 and dotnet/runtime#68183, which use the interfaces to deduplicate a bunch of LINQ code in Enumerable.Sum/Average/Min/Max. There are multiple overloads of these methods for int, long, float, double, and decimal. The GitHub summary of the diffs tells the story of how much code was able to be deleted.
Another simple example comes from the newSystem.Formats.Tar
library in .NET 7, which as the name suggests is used for reading and writing archives in any of multipletar file formats. The tar file formats include integer values in octal representation, so theTarReader
class needs to parse octal values. Some of these values are 32-bit integers, and some are 64-bit integers. Rather than have two separateParseOctalAsUInt32
andParseOctalAsUInt64
methods,dotnet/runtime#74281] consolidated the methods into a singleParseOctal<T>
with the constraintwhere T : struct, INumber<T>
. The implementation is then entirely in terms ofT
and can be used for either of these types (plus any other types meeting the constraints, should that ever be needed). What’s particularly interesting about this example is theParseOctal<T>
method includes use ofchecked
, e.g.value = checked((value * octalFactor) + T.CreateTruncating(digit));
. This is only possible because C# 11 includes the aforementioned support foruser-definedchecked
operators, enabling the generic math interfaces to support both the normal and checked varieties, e.g. theIMultiplyOperators<,,>
interface contains these methods:
static abstract TResult operator *(TSelf left, TOther right);static virtual TResult operator checked *(TSelf left, TOther right) => left * right;
and the compiler will pick the appropriate one based on the context.
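For a flavor of what such generic code looks like, here's a simplified sketch under my own assumptions (not the actual System.Formats.Tar implementation; for example, it assumes the input contains only valid octal digits):

```csharp
using System;
using System.Numerics;

static class Octal
{
    // Parses an octal string into any integer-like T via the generic math interfaces.
    static T ParseOctal<T>(ReadOnlySpan<char> digits) where T : struct, INumber<T>
    {
        T eight = T.CreateTruncating(8);
        T value = T.Zero;
        foreach (char c in digits)
        {
            // checked(...) binds to the user-defined checked operators, so overflow throws.
            value = checked((value * eight) + T.CreateTruncating(c - '0'));
        }
        return value;
    }

    static void Main()
    {
        Console.WriteLine(ParseOctal<uint>("777"));   // 511
        Console.WriteLine(ParseOctal<ulong>("1750")); // 1000
    }
}
```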
In addition to all the existing types that get these interfaces, there are also new types.dotnet/runtime#69204 adds the newInt128
andUInt128
types. As these types implement all of the relevant generic math interfaces, they come complete with a huge number of methods, over 100 each, all of which are implemented efficiently in managed code. In the future, the aim is that some set of these will be optimized further by the JIT and to take advantage of hardware acceleration.
Several PRs moved native implementations of these kinds of math operations to managed code.dotnet/runtime#63881 from@am11 did so forMath.Abs
andMath.AbsF
(absolute value), anddotnet/runtime#56236 from@alexcovington did so forMath.ILogB
andMathF.ILogB
(base 2 integer logarithm). The latter’s implementation is based on the MUSL libc implementation of the same algorithm, and in addition to improving performance (in part by avoiding the transition between managed and native code, in part by the actual algorithm employed), it also enabled deleting two distinct implementations from native code, one from the coreclr side and one from the mono side, which is always a nice win from a maintainability perspective.
[Benchmark][Arguments(12345.6789)]public int ILogB(double arg) => Math.ILogB(arg);
Method | Runtime | arg | Mean | Ratio |
---|---|---|---|---|
ILogB | .NET 6.0 | 12345.6789 | 4.056 ns | 1.00 |
ILogB | .NET 7.0 | 12345.6789 | 1.059 ns | 0.26 |
Other math operations were also improved in various ways.Math{F}.Truncate
was improved indotnet/runtime#65014 from@MichalPetryka by making it into a JIT intrinsic, such that on Arm64 the JIT could directly emit afrintz
instruction.dotnet/runtime#65584 did the same forMax
andMin
so that the Arm-specificfmax
andfmin
instructions could be used. And severalBitConverter
APIs were also turned into intrinsics indotnet/runtime#71567 in order to enable better code generation in some generic math scenarios.
dotnet/runtime#55121 from@key-moon also improves parsing, but forBigInteger
, and more specifically for really, really bigBigIntegers
. The algorithm previously employed for parsing a string into aBigInteger
wasO(N^2)
whereN
is the number of digits, but while a larger algorithmic complexity than we’d normally like, it has a low constant overhead and so is still reasonable for reasonably-sized values. In contrast, an alternative algorithm is available that runs inO(N * (log N)^2)
time, but with a much higher constant factor involved. That makes it so that it's really only worth switching for really big numbers, which is what this PR does. It implements the alternative algorithm and switches over to it when the input is at least 20,000 digits (so, yes, big). But for such large numbers, it makes a significant difference.
private string _input = string.Concat(Enumerable.Repeat("1234567890", 100_000)); // "One miiilliiiion digits"[Benchmark]public BigInteger Parse() => BigInteger.Parse(_input);
Method | Runtime | Mean | Ratio |
---|---|---|---|
Parse | .NET 6.0 | 3.474 s | 1.00 |
Parse | .NET 7.0 | 1.672 s | 0.48 |
Also related toBigInteger
(and not just for really big ones),dotnet/runtime#35565 from@sakno overhauled much of the internals ofBigInteger
to be based on spans rather than arrays. That in turn enabled a fair amount of use of stack allocation and slicing to avoid allocation overheads, while also improving reliability and safety by moving some code away from unsafe pointers to safe spans. The primary performance impact is visible in allocation numbers, and in particular for operations related to division.
private BigInteger _bi1 = BigInteger.Parse(string.Concat(Enumerable.Repeat("9876543210", 100)));private BigInteger _bi2 = BigInteger.Parse(string.Concat(Enumerable.Repeat("1234567890", 100)));private BigInteger _bi3 = BigInteger.Parse(string.Concat(Enumerable.Repeat("12345", 10)));[Benchmark]public BigInteger ModPow() => BigInteger.ModPow(_bi1, _bi2, _bi3);
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
ModPow | .NET 6.0 | 1.527 ms | 1.00 | 706 B | 1.00 |
ModPow | .NET 7.0 | 1.589 ms | 1.04 | 50 B | 0.07 |
Arrays, Strings, and Spans
While there are many forms of computation that can consume resources in applications, some of the most common include processing of data stored in arrays, strings, and now spans. Thus you see a focus in every .NET release on removing as much overhead as possible from such scenarios, while also finding ways to further optimize the concrete operations developers are commonly performing.
Let’s start with some new APIs that can help make writing more efficient code easier. When examining string parsing/processing code, it’s very common to see characters examined for their inclusion in various sets. For example, you might see a loop looking for characters that are ASCII digits:
while (i < str.Length){ if (str[i] >= '0' && str[i] <= '9') { break; } i++;}
or that are ASCII letters:
while (i < str.Length){ if ((str[i] >= 'a' && str[i] <= 'z') || (str[i] >= 'A' && str[i] <= 'Z')) { break; } i++;}
or other such groups. Interestingly, there’s wide-spread variation in how such checks are coded, often depending on how much effort a developer put in to optimizing them, or in some cases likely not even recognizing that some amount of performance was being left on the table. For example, that same ASCII letter check could instead be written as:
while (i < str.Length){ if ((uint)((c | 0x20) - 'a') <= 'z' - 'a') { break; } i++;}
which while more “intense” is also much more concise and more efficient. It’s taking advantage of a few tricks. First, rather than having two comparisons to determine whether the character is greater than or equal to the lower bound and less than or equal to the upper bound, it’s doing a single comparison based on the distance between the character and the lower bound ((uint)(c - 'a')
). If'c'
is beyond'z'
, then'c' - 'a'
will be larger than 25, and the comparison will fail. If'c'
is earlier than'a'
, then'c' - 'a'
will be negative, and casting it touint
will then cause it to wrap around to a massive number, also larger than 25, again causing the comparison to fail. Thus, we're able to pay a single additional subtraction to avoid an entire additional comparison and branch, which is almost always a good deal. The second trick is that | 0x20
. The ASCII table has some well-thought-out relationships, including that upper-case'A'
and lower-case'a'
differ by only a single bit ('A'
is0b1000001
and'a'
is0b1100001
). To go from any lowercase ASCII letter to its uppercase ASCII equivalent, we thus need only to& ~0x20
(to turn off that bit), and to go in the opposite direction from any uppercase ASCII letter to its lowercase ASCII equivalent, we need only to| 0x20
(to turn on that bit). We can take advantage of this in our range check, then, by normalizing our charc
to be lowercase, such that for the low cost of a bit twiddle, we can achieve both the lowercase and uppercase range checks. Of course, those tricks aren’t something we want every developer to have to know and write on each use. Instead, .NET 7 exposes a bunch of new helpers onSystem.Char
to encapsulate these common checks, done in an efficient manner.char
already had methods likeIsDigit
andIsLetter
, which provided the more comprehensive Unicode meaning of those monikers (e.g. there are ~320 Unicode characters categorized as “digits”). Now in .NET 7, there are also these helpers:
- IsAsciiDigit
- IsAsciiHexDigit
- IsAsciiHexDigitLower
- IsAsciiHexDigitUpper
- IsAsciiLetter
- IsAsciiLetterLower
- IsAsciiLetterUpper
- IsAsciiLetterOrDigit
These methods were added bydotnet/runtime#69318, which also employed them in dozens of locations where such checks were being performed acrossdotnet/runtime (many of them using less-efficient approaches).
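For instance, the earlier hand-rolled range checks can now be written as self-describing one-liners (IsIdentifierStart and IsTwoHexDigits here are just hypothetical examples of consuming code):

```csharp
using System;

static class AsciiChecks
{
    // Equivalent to the ((uint)((c | 0x20) - 'a') <= 'z' - 'a') trick, but self-describing:
    static bool IsIdentifierStart(char c) => char.IsAsciiLetter(c) || c == '_';

    // ASCII hex digit checks without manual range comparisons:
    static bool IsTwoHexDigits(ReadOnlySpan<char> s) =>
        s.Length == 2 && char.IsAsciiHexDigit(s[0]) && char.IsAsciiHexDigit(s[1]);
}
```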
Another new API focused on encapsulating a common pattern is the newMemoryExtensions.CommonPrefixLength
method, introduced bydotnet/runtime#67929. This accepts either twoReadOnlySpan<T>
instances or aSpan<T>
and aReadOnlySpan<T>
, and an optionalIEqualityComparer<T>
, and returns the number of elements that are the same at the beginning of each input span. This is useful when you want to know the first place that two inputs differ.dotnet/runtime#68210 from@gfoidl then utilized the newVector128
functionality to provide a basic vectorization of the implementation. As it’s comparing two sequences and looking for the first place they differ, this implementation uses a neat trick, which is to have a single method implemented to compare the sequences as bytes. If theT
being compared is bitwise-equatable and no custom equality comparer is supplied, then it reinterpret-casts the refs from the spans asbyte
refs, and uses the single shared implementation.
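Usage is straightforward; for example, finding where two paths first diverge:

```csharp
using System;

ReadOnlySpan<char> a = "/home/user/projects/app/src";
ReadOnlySpan<char> b = "/home/user/projects/lib/src";

int common = a.CommonPrefixLength(b); // 20: both start with "/home/user/projects/"
Console.WriteLine(a.Slice(0, common).ToString());
```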
Yet another new set of APIs are theIndexOfAnyExcept
andLastIndexOfAnyExcept
methods, introduced bydotnet/runtime#67941 and used in a variety of additional call sites bydotnet/runtime#71146 anddotnet/runtime#71278. While somewhat of a mouthful, these methods are quite handy. They do what their name suggests: whereasIndexOf(T value)
searches for the first occurrence ofvalue
in the input, and whereasIndexOfAny(T value0, T value1, ...)
searches for the first occurrence of any ofvalue0
,value1
, etc. in the input,IndexOfAnyExcept(T value)
searches for the first occurrence of something that’snot equal tovalue
, and similarlyIndexOfAnyExcept(T value0, T value1, ...)
searches for the first occurrence of something that’snot equal tovalue0
,value1
, etc. For example, let’s say you wanted to know whether an array of integers was entirely 0. You can now write that as:
bool allZero = array.AsSpan().IndexOfAnyExcept(0) < 0;
dotnet/runtime#73488 vectorizes this overload, as well.
private byte[] _zeros = new byte[1024];[Benchmark(Baseline = true)]public bool OpenCoded(){ foreach (byte b in _zeros) { if (b != 0) { return false; } } return true;}[Benchmark]public bool IndexOfAnyExcept() => _zeros.AsSpan().IndexOfAnyExcept((byte)0) < 0;
Method | Mean | Ratio |
---|---|---|
OpenCoded | 370.47 ns | 1.00 |
IndexOfAnyExcept | 23.84 ns | 0.06 |
Of course, while new “index of” variations are helpful, we already have a bunch of such methods, and it’s important that they are as efficient as possible. These coreIndexOf{Any}
methods are used in huge numbers of places, many of which are performance-sensitive, and so every release they get additional tender-loving care. While PRs likedotnet/runtime#67811 got gains by paying very close attention to the assembly code being generated (in this case, tweaking some of the checks used on Arm64 inIndexOf
andIndexOfAny
to achieve better utilization), the biggest improvements here come in places where either vectorization was added and none was previously employed, or where the vectorization scheme was overhauled for significant gain. Let’s start withdotnet/runtime#63285, which yields huge improvements for many uses ofIndexOf
andLastIndexOf
for “substrings” ofbyte
s andchar
s. Previously, given a call likestr.IndexOf("hello")
, the implementation would essentially do the equivalent of repeatedly searching for the ‘h’, and when an ‘h’ was found, then performing aSequenceEqual
to match the remainder. As you can imagine, however, it’s very easy to run into cases where the first character being searched for is very common, such that you frequently have to break out of the vectorized loop in order to do the full string comparison. Instead, the PR implements an algorithm based onSIMD-friendly algorithms for substring searching. Rather than just searching for the first character, it can instead vectorize a search for both the first and last character at appropriate distances from each other. In our “hello” example, in any given input, it’s much more likely to find an ‘h’ than it is to find an ‘h’ followed four characters later by an ‘o’, and thus this implementation is able to stay within the vectorized loop a lot longer, garnering many fewer false positives that force it down theSequenceEqual
route. The implementation also handles cases where the two characters selected are equal, in which case it’ll quickly look for another character that’s not equal in order to maximize the efficiency of the search. We can see the impact of all of this with a couple of examples:
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;[Benchmark][Arguments("Sherlock")][Arguments("elementary")]public int Count(string needle){ ReadOnlySpan<char> haystack = s_haystack; int count = 0, pos; while ((pos = haystack.IndexOf(needle)) >= 0) { haystack = haystack.Slice(pos + needle.Length); count++; } return count;}
This is pulling down the text to “The Adventures of Sherlock Holmes” from Project Gutenberg and then benchmarking usingIndexOf
to count the occurrences of “Sherlock” and “elementary” in the text. On my machine, I get results like this:
Method | Runtime | needle | Mean | Ratio |
---|---|---|---|---|
Count | .NET 6.0 | Sherlock | 43.68 us | 1.00 |
Count | .NET 7.0 | Sherlock | 48.33 us | 1.11 |
Count | .NET 6.0 | elementary | 1,063.67 us | 1.00 |
Count | .NET 7.0 | elementary | 56.04 us | 0.05 |
For “Sherlock”, the performance is actually a bit worse in .NET 7 than in .NET 6; not much, but a measurable 10%. That’s because there are very few capital'S'
characters in the source text, 841 to be exact, out of 593,836 characters in the document. At only 0.1% density of the starting character, the new algorithm doesn’t bring much benefit, as the existing algorithm that searched for the first character alone captures pretty much all of the possible vectorization gains to be had, and we do pay a bit of overhead in doing a search for both the'S'
and the'k'
, whereas previously we’d have only searched for the'S'
. In contrast, though, there are 54,614'e'
characters in the document, so almost 10% of the source. In that case, .NET 7 is 20x faster than .NET 6, taking 53us on .NET 7 to count all the'e'
‘s vs 1084us on .NET 6. In this case, the new scheme yields immense gains, by vectorizing a search for both the'e'
and a'y'
at the specific distance away, a combination that is much, much less frequent. This is one of those situations where overall there are on average huge observed gains even though we can see small regressions for some specific inputs.
Another example of significantly changing the algorithm employed isdotnet/runtime#67758, which enables some amount of vectorization to be applied toIndexOf("...", StringComparison.OrdinalIgnoreCase)
. Previously, this operation was implemented with a fairly typical substring search, walking the input string and at every location doing an inner loop to compare the target string, except performing aToUpper
on every character in order to do it in a case-insensitive manner. Now with this PR, which is based on approaches previously used byRegex
, if the target string begins with an ASCII character, the implementation can useIndexOf
(if the character isn’t an ASCII letter) orIndexOfAny
(if the character is an ASCII letter) to quickly jump ahead to the first possible location of a match. Let’s take the exact same benchmark as we just looked at, but tweaked to useOrdinalIgnoreCase
:
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;[Benchmark][Arguments("Sherlock")][Arguments("elementary")]public int Count(string needle){ ReadOnlySpan<char> haystack = s_haystack; int count = 0, pos; while ((pos = haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase)) >= 0) { haystack = haystack.Slice(pos + needle.Length); count++; } return count;}
Here, both words are about 4x faster on .NET 7 than they were on .NET 6:
Method | Runtime | needle | Mean | Ratio |
---|---|---|---|---|
Count | .NET 6.0 | Sherlock | 2,113.1 us | 1.00 |
Count | .NET 7.0 | Sherlock | 467.3 us | 0.22 |
Count | .NET 6.0 | elementary | 2,325.6 us | 1.00 |
Count | .NET 7.0 | elementary | 638.8 us | 0.27 |
as we're now doing a vectorized IndexOfAny('S', 's') or IndexOfAny('E', 'e') rather than manually walking each character and comparing it. (dotnet/runtime#73533 now uses the same approach for handling IndexOf(char, StringComparison.OrdinalIgnoreCase).)
Another example comes from dotnet/runtime#67492 from @gfoidl. It updates MemoryExtensions.Contains with the approach we discussed earlier for handling the leftover elements at the end of a vectorized operation: process one last vector's worth of data, even if it means duplicating some work already done. This particularly helps for smaller inputs, where the processing time might otherwise be dominated by the serial handling of those leftovers.
private byte[] _data = new byte[95];

[Benchmark]
public bool Contains() => _data.AsSpan().Contains((byte)1);
Method | Runtime | Mean | Ratio |
---|---|---|---|
Contains | .NET 6.0 | 15.115 ns | 1.00 |
Contains | .NET 7.0 | 2.557 ns | 0.17 |
dotnet/runtime#60974 from @alexcovington broadens the impact of IndexOf. Prior to this PR, IndexOf was vectorized for one- and two-byte sized primitive types, but this PR extends it to four- and eight-byte sized primitives as well. As with most of the other vectorized implementations, it checks whether the T is bitwise-equatable, which is important for the vectorization as it's only looking at the bits in memory and not paying attention to any Equals implementation that might be defined on the type. In practice today, that means this is limited to just a handful of types of which the runtime has intimate knowledge (Boolean, Byte, SByte, UInt16, Int16, Char, UInt32, Int32, UInt64, Int64, UIntPtr, IntPtr, Rune, and enums), but in theory it could be extended in the future.
private int[] _data = new int[1000];

[Benchmark]
public int IndexOf() => _data.AsSpan().IndexOf(42);
Method | Runtime | Mean | Ratio |
---|---|---|---|
IndexOf | .NET 6.0 | 252.17 ns | 1.00 |
IndexOf | .NET 7.0 | 78.82 ns | 0.31 |
One final interesting IndexOf-related optimization. string has long had IndexOf/IndexOfAny/LastIndexOf/LastIndexOfAny, and obviously for string it's all about processing chars. When ReadOnlySpan<T> and Span<T> came on the scene, MemoryExtensions was added to provide extension methods for spans and friends, including such IndexOf/IndexOfAny/LastIndexOf/LastIndexOfAny methods. But for spans, this is about more than just char, and so MemoryExtensions grew its own set of implementations largely separate from string's. Over the years, the MemoryExtensions implementations have specialized more and more types, in particular byte and char, such that over time string's implementations have mostly been replaced by delegation into the same implementations MemoryExtensions uses. However, IndexOfAny and LastIndexOfAny had been unification holdouts, each in its own direction. string.IndexOfAny did delegate to the same implementation as MemoryExtensions.IndexOfAny for 1-5 values being searched for, but for more than 5 values, string.IndexOfAny used a "probabilistic map," essentially a Bloom filter. It creates a 256-bit table and quickly sets bits in that table based on the values being searched for (essentially hashing them, but with a trivial hash function). Then it iterates through the input, and rather than checking every input character against every one of the target values, it first looks up the input character in the table. If the corresponding bit isn't set, it knows the input character doesn't match any of the target values. If the corresponding bit is set, it proceeds to compare the input character against each of the target values, with a high probability of it being one of them. MemoryExtensions.IndexOfAny lacked such a filter for more than 5 values. Conversely, string.LastIndexOfAny didn't provide any vectorization for multiple target values, whereas MemoryExtensions.LastIndexOfAny vectorized searches for two and three target values. As of dotnet/runtime#63817, all of these are now unified, such that both string and MemoryExtensions get the best of what the other had.
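For illustration, here's a rough scalar sketch of the probabilistic-map idea; the actual implementation uses a cheaper, vectorizable scheme and handles non-ASCII values more carefully, and the names here are made up:

static int ProbabilisticIndexOfAny(ReadOnlySpan<char> text, ReadOnlySpan<char> values)
{
    // 256-bit table: a set bit means "one of the target values *might* map here."
    Span<uint> table = stackalloc uint[8];
    table.Clear();
    foreach (char v in values)
    {
        int bucket = v & 0xFF;                       // trivial "hash": the low byte
        table[bucket >> 5] |= 1u << (bucket & 31);
    }

    for (int i = 0; i < text.Length; i++)
    {
        int bucket = text[i] & 0xFF;
        if ((table[bucket >> 5] & (1u << (bucket & 31))) != 0 &&
            values.Contains(text[i]))                // the filter only says "maybe", so confirm
        {
            return i;
        }
    }

    return -1;
}

A cleared bit definitively rules a character out with a single lookup, so for typical inputs the vast majority of characters never reach the per-value comparison loop.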
private readonly char[] s_target = new[] { 'z', 'q' };

const string Sonnet = """
    Shall I compare thee to a summer's day?
    Thou art more lovely and more temperate:
    Rough winds do shake the darling buds of May,
    And summer's lease hath all too short a date;
    Sometime too hot the eye of heaven shines,
    And often is his gold complexion dimm'd;
    And every fair from fair sometime declines,
    By chance or nature's changing course untrimm'd;
    But thy eternal summer shall not fade,
    Nor lose possession of that fair thou ow'st;
    Nor shall death brag thou wander'st in his shade,
    When in eternal lines to time thou grow'st:
    So long as men can breathe or eyes can see,
    So long lives this, and this gives life to thee.
    """;

[Benchmark]
public int LastIndexOfAny() => Sonnet.LastIndexOfAny(s_target);

[Benchmark]
public int CountLines()
{
    int count = 0;
    foreach (ReadOnlySpan<char> _ in Sonnet.AsSpan().EnumerateLines())
    {
        count++;
    }
    return count;
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
LastIndexOfAny | .NET 6.0 | 443.29 ns | 1.00 |
LastIndexOfAny | .NET 7.0 | 31.79 ns | 0.07 |
CountLines | .NET 6.0 | 1,689.66 ns | 1.00 |
CountLines | .NET 7.0 | 1,461.64 ns | 0.86 |
That same PR also cleans up uses of the IndexOf family, in particular uses that check for containment rather than needing the actual index of a result. The IndexOf family of methods returns a non-negative value when an element is found, and otherwise returns -1. That means when checking whether an element was found, code can use either >= 0 or != -1, and when checking whether an element wasn't found, code can use either < 0 or == -1. It turns out that the code generated for comparisons against 0 is ever so slightly more efficient than the code generated for comparisons against -1, and this isn't something the JIT can substitute itself without the IndexOf methods being intrinsics such that the JIT understands the semantics of the return value. Thus, for consistency and a small perf gain, all relevant call sites were switched to compare against 0 instead of against -1.
Speaking of call sites, one of the great things about having highly optimized IndexOf methods is using them in all the places that can benefit, removing the maintenance impact of open-coded replacements while also reaping the perf wins. dotnet/runtime#63913 used IndexOf inside of StringBuilder.Replace to speed up the search for the next character to be replaced:
private StringBuilder _builder = new StringBuilder(Sonnet);

[Benchmark]
public void Replace()
{
    _builder.Replace('?', '!');
    _builder.Replace('!', '?');
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
Replace | .NET 6.0 | 1,563.69 ns | 1.00 |
Replace | .NET 7.0 | 70.84 ns | 0.04 |
dotnet/runtime#60463 from @nietras used IndexOfAny in StringReader.ReadLine to search for the '\r' and '\n' line-ending characters, which results in some substantial throughput gains even with the allocation and copy that are inherent to the method's design:
[Benchmark]
public void ReadAllLines()
{
    var reader = new StringReader(Sonnet);
    while (reader.ReadLine() != null) ;
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
ReadAllLines | .NET 6.0 | 947.8 ns | 1.00 |
ReadAllLines | .NET 7.0 | 385.7 ns | 0.41 |
And dotnet/runtime#70176 cleaned up a plethora of additional uses.
Finally on the IndexOf front, as noted, a lot of time and energy over the years has gone into optimizing these methods. In previous releases, some of that energy has been in the form of using hardware intrinsics directly, e.g. having an SSE2 code path and an AVX2 code path and an AdvSimd code path. Now that we have Vector128<T> and Vector256<T>, many such uses can be simplified (e.g. avoiding the duplication between an SSE2 implementation and an AdvSimd implementation) while still maintaining as good or even better performance, and while automatically supporting vectorization on other platforms with their own intrinsics, like WebAssembly. dotnet/runtime#73481, dotnet/runtime#73556, dotnet/runtime#73368, dotnet/runtime#73364, dotnet/runtime#73064, and dotnet/runtime#73469 all contributed here, in some cases incurring meaningful throughput gains:
[Benchmark]
public int IndexOfAny() => Sonnet.AsSpan().IndexOfAny("!.<>");
Method | Runtime | Mean | Ratio |
---|---|---|---|
IndexOfAny | .NET 6.0 | 52.29 ns | 1.00 |
IndexOfAny | .NET 7.0 | 40.17 ns | 0.77 |
The IndexOf family is just one of many on string/MemoryExtensions that has seen dramatic improvements. Another is the SequenceEqual family, including Equals, StartsWith, and EndsWith. One of my favorite changes in the whole release is dotnet/runtime#65288, and it's squarely in this area. It's very common to see calls to methods like StartsWith with a constant string argument, e.g. value.StartsWith("https://"), value.SequenceEqual("Key"), etc. These methods are now recognized by the JIT, which can automatically unroll the comparison and compare more than one char at a time, e.g. doing a single read of four chars as a long and a single comparison of that long against the expected combination of those four chars. The result is beautiful. Making it even better is dotnet/runtime#66095, which adds to this support for OrdinalIgnoreCase. Remember those ASCII bit-twiddling tricks discussed a bit earlier with char.IsAsciiLetter and friends? The JIT now employs the same trick as part of this unrolling, so if you do that same value.StartsWith("https://") but instead as value.StartsWith("https://", StringComparison.OrdinalIgnoreCase), it will recognize that the whole comparison string is ASCII and will OR in the appropriate mask on both the comparison constant and on the data read from the input in order to perform the comparison in a case-insensitive manner.
private string _value = "https://dot.net";

[Benchmark]
public bool IsHttps_Ordinal() => _value.StartsWith("https://", StringComparison.Ordinal);

[Benchmark]
public bool IsHttps_OrdinalIgnoreCase() => _value.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
Method | Runtime | Mean | Ratio |
---|---|---|---|
IsHttps_Ordinal | .NET 6.0 | 4.5634 ns | 1.00 |
IsHttps_Ordinal | .NET 7.0 | 0.4873 ns | 0.11 |
IsHttps_OrdinalIgnoreCase | .NET 6.0 | 6.5654 ns | 1.00 |
IsHttps_OrdinalIgnoreCase | .NET 7.0 | 0.5577 ns | 0.08 |
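To make the case-insensitivity trick a bit more tangible, here's an illustrative scalar version of the masking idea. This is not the code the JIT generates (the JIT applies the mask to multi-character reads, e.g. comparing several chars at a time as a single long), it's just a sketch of the principle:

static bool StartsWithHttpsIgnoreCaseAscii(ReadOnlySpan<char> value)
{
    const string Prefix = "https://";
    if (value.Length < Prefix.Length) return false;

    for (int i = 0; i < Prefix.Length; i++)
    {
        char expected = Prefix[i];

        // ASCII upper- and lowercase letters differ only in the 0x20 bit, so OR-ing
        // that bit into both sides makes the comparison case-insensitive for letters.
        // Non-letter characters (':' and '/') are compared exactly.
        int mask = char.IsAsciiLetter(expected) ? 0x20 : 0x00;
        if ((value[i] | mask) != (expected | mask)) return false;
    }

    return true;
}

The JIT can fold the masked constant at compile time, so at run-time the comparison is just "read, OR, compare" on whole chunks of the input.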
Interestingly, since .NET 5 the code generated by RegexOptions.Compiled would perform similar unrolling when comparing sequences of multiple characters, and when the source generator was added in .NET 7, it also learned how to do this. However, the source generator has problems with such an optimization, due to endianness. The constants being compared against are subject to byte-ordering issues, such that the source generator would need to emit code that could handle running on either little-endian or big-endian machines. The JIT has no such problem, as it's generating the code on the same machine on which the code will execute (and in scenarios where it's being used to generate code ahead of time, the entirety of that code is already tied to a particular architecture). By moving this optimization into the JIT, the corresponding code could be deleted from RegexOptions.Compiled and the regex source generator, which then also benefits from producing much easier-to-read code utilizing StartsWith that's just as fast (dotnet/runtime#65222 and dotnet/runtime#66339). Wins all around. (This could only be removed from RegexOptions.Compiled after dotnet/runtime#68055, which fixed the ability of the JIT to recognize these string literals in DynamicMethods, which RegexOptions.Compiled uses with reflection emit to spit out the IL for the regex being compiled.)
StartsWith and EndsWith have improved in other ways. dotnet/runtime#63734 (improved further by dotnet/runtime#64530) added another really interesting JIT-based optimization, but to understand it, we need to understand string's internal layout. string is essentially represented in memory as an int length followed by that many chars plus a null terminator char. The actual System.String class represents this in C# as an int _stringLength field followed by a char _firstChar field, such that _firstChar indeed lines up with the first character of the string, or with the null terminator if the string is empty. Internally in System.Private.CoreLib, and in particular in methods on string itself, code will often refer to _firstChar directly when the first character needs to be consulted, as it's typically faster to do that than to use str[0], in particular because there are no bounds checks involved and the string's length generally needn't be consulted. Now, consider a method like public bool StartsWith(char value) on string. In .NET 6, the implementation was:
return Length != 0 && _firstChar == value;
which, given what I just described, makes sense: if the Length is 0, then the string doesn't begin with the specified character, and if Length is not 0, then we can just compare the value against _firstChar. But why is that Length check needed at all? Couldn't we just do return _firstChar == value;? That would avoid the additional comparison and branch, and it would work just fine... unless the target character is itself '\0', in which case we could get false positives on the result. Now to this PR. The PR introduces an internal JIT intrinsic, RuntimeHelpers.IsKnownConstant, which the JIT will substitute with true if the containing method is inlined and the argument passed to IsKnownConstant is then seen to be a constant. In such cases, the implementation can rely on other JIT optimizations kicking in and optimizing various code in the method, effectively enabling a developer to write two different implementations: one for when the argument is known to be a constant and one for when it's not. With that in hand, the PR is able to optimize StartsWith as follows:
public bool StartsWith(char value)
{
    if (RuntimeHelpers.IsKnownConstant(value) && value != '\0')
        return _firstChar == value;

    return Length != 0 && _firstChar == value;
}
If the value parameter isn't a constant, then IsKnownConstant will be substituted with false, the entire starting if block will be eliminated, and the method will be left exactly as it was before. But if this method gets inlined and the value was actually a constant, then the value != '\0' condition will also be evaluatable at JIT-compile-time. If the value is in fact '\0', well, again, that whole if block will be eliminated and we're no worse off. But in the common case where the value isn't '\0', the entire method will end up being compiled as if it were:

return _firstChar == ConstantValue;

and we've saved ourselves a read of the string's length, a comparison, and a branch. dotnet/runtime#69038 then employs a similar technique for EndsWith.
private string _value = "https://dot.net";

[Benchmark]
public bool StartsWith() =>
    _value.StartsWith('a') || _value.StartsWith('b') || _value.StartsWith('c') || _value.StartsWith('d') ||
    _value.StartsWith('e') || _value.StartsWith('f') || _value.StartsWith('g') || _value.StartsWith('i') ||
    _value.StartsWith('j') || _value.StartsWith('k') || _value.StartsWith('l') || _value.StartsWith('m') ||
    _value.StartsWith('n') || _value.StartsWith('o') || _value.StartsWith('p');
Method | Runtime | Mean | Ratio |
---|---|---|---|
StartsWith | .NET 6.0 | 8.130 ns | 1.00 |
StartsWith | .NET 7.0 | 1.653 ns | 0.20 |
(Another example of IsKnownConstant being used comes from dotnet/runtime#64016, which uses it to improve Math.Round when a MidpointRounding mode is specified. Call sites almost always explicitly specify the enum value as a constant, which then allows the JIT to specialize the code generation for the method to the specific mode being used; that in turn, for example, enables a Math.Round(..., MidpointRounding.AwayFromZero) call on Arm64 to be lowered to a single frinta instruction.)
EndsWith was also improved in dotnet/runtime#72750, specifically for when StringComparison.OrdinalIgnoreCase is specified. This simple PR just switched which internal helper method was used to implement the operation, taking advantage of one that is sufficient for the needs of this method and that has lower overheads.
[Benchmark]
[Arguments("System.Private.CoreLib.dll", ".DLL")]
public bool EndsWith(string haystack, string needle) => haystack.EndsWith(needle, StringComparison.OrdinalIgnoreCase);
Method | Runtime | Mean | Ratio |
---|---|---|---|
EndsWith | .NET 6.0 | 10.861 ns | 1.00 |
EndsWith | .NET 7.0 | 5.385 ns | 0.50 |
Finally, dotnet/runtime#67202 and dotnet/runtime#73475 employ Vector128<T> and Vector256<T> to replace direct hardware intrinsics usage, just as was previously shown for various IndexOf methods, but here for SequenceEqual and SequenceCompareTo, respectively.
Another method that's seen some attention in .NET 7 is MemoryExtensions.Reverse (and Array.Reverse, as it shares the same implementation), which performs an in-place reversal of the target span. dotnet/runtime#64412 from @alexcovington provides a vectorized implementation via direct use of AVX2 and SSSE3 hardware intrinsics, with dotnet/runtime#72780 from @SwapnilGaikwad following up to add an AdvSimd intrinsics implementation for Arm64. (There was an unintended regression introduced by the original vectorization change, but that was fixed by dotnet/runtime#70650.)
private char[] text = "Free. Cross-platform. Open source.\r\nA developer platform for building all your apps.".ToCharArray();

[Benchmark]
public void Reverse() => Array.Reverse(text);
Method | Runtime | Mean | Ratio |
---|---|---|---|
Reverse | .NET 6.0 | 21.352 ns | 1.00 |
Reverse | .NET 7.0 | 9.536 ns | 0.45 |
String.Split also saw vectorization improvements in dotnet/runtime#64899 from @yesmey. As with some of the previously discussed PRs, it switched the existing usage of SSE2 and SSSE3 hardware intrinsics over to the new Vector128<T> helpers, which improved upon the existing implementation while also implicitly adding vectorization support for Arm64.
Converting between various formats of strings is something many applications and services do, whether that's converting UTF8 bytes to and from string or formatting and parsing hex values. Such operations have also improved in a variety of ways in .NET 7. Base64 encoding, for example, is a way of representing arbitrary binary data (think byte[]) across mediums that only support text, encoding bytes into one of 64 different ASCII characters. Multiple APIs in .NET implement this encoding. For converting between binary data represented as ReadOnlySpan<byte> and UTF8 (actually ASCII) encoded data also represented as ReadOnlySpan<byte>, the System.Buffers.Text.Base64 type provides EncodeToUtf8 and DecodeFromUtf8 methods. These were vectorized several releases ago, but they were further improved in .NET 7 via dotnet/runtime#70654 from @a74nh, which converted the SSSE3-based implementation to use Vector128<T> (which in turn implicitly enabled vectorization on Arm64). However, for converting between arbitrary binary data represented as ReadOnlySpan<byte>/byte[] and ReadOnlySpan<char>/char[]/string, the System.Convert type exposes multiple methods, e.g. Convert.ToBase64String, and these methods historically were not vectorized. That changes in .NET 7, where dotnet/runtime#71795 and dotnet/runtime#73320 vectorize the ToBase64String, ToBase64CharArray, and TryToBase64Chars methods. The way they do this is interesting. Rather than effectively duplicating the vectorized implementation from Base64.EncodeToUtf8, they instead layer on top of EncodeToUtf8, calling it to encode the input byte data into an output Span<byte>. Then they "widen" those bytes into chars (remember, Base64-encoded data is a set of ASCII chars, so going from these bytes to chars entails just adding a 0 high byte onto each element). That widening can itself easily be done in a vectorized manner. The other interesting thing about this layering is that it doesn't actually require separate intermediate storage for the encoded bytes. The implementation can perfectly compute the number of resulting characters for encoding X bytes into Y Base64 characters (there's a formula), and it can either allocate that final space (e.g. in the case of ToBase64CharArray) or ensure the provided space is sufficient (e.g. in the case of TryToBase64Chars). And since we know the initial byte encoding will occupy exactly half as much space as the final chars, we can encode into that same space (with the destination span reinterpreted as a byte span rather than a char span) and then widen "in place": walk backwards from the end of the encoded bytes and the end of the char space, copying each byte into its destination char.
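Here's a rough sketch of that layering built only from public APIs (Base64.EncodeToUtf8 and MemoryMarshal.AsBytes); the real Convert implementation is more involved and vectorizes the widening step rather than using a scalar loop:

using System;
using System.Buffers.Text;
using System.Runtime.InteropServices;

static bool TryToBase64CharsSketch(ReadOnlySpan<byte> data, Span<char> destination, out int charsWritten)
{
    int needed = Base64.GetMaxEncodedToUtf8Length(data.Length); // exact length for a full encode
    if (destination.Length < needed)
    {
        charsWritten = 0;
        return false;
    }

    // Reinterpret the char destination as bytes and encode directly into its first half.
    Span<byte> byteDest = MemoryMarshal.AsBytes(destination);
    Base64.EncodeToUtf8(data, byteDest, out _, out int bytesWritten);

    // Widen in place, walking backwards so no source byte is overwritten before it's read.
    for (int i = bytesWritten - 1; i >= 0; i--)
    {
        destination[i] = (char)byteDest[i];
    }

    charsWritten = bytesWritten;
    return true;
}

Walking backwards is what makes the in-place trick safe: the char being written at position i occupies bytes 2i and 2i+1, which are always at or beyond the byte still waiting to be read.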
private byte[] _data = Encoding.UTF8.GetBytes("""
    Shall I compare thee to a summer's day?
    Thou art more lovely and more temperate:
    Rough winds do shake the darling buds of May,
    And summer's lease hath all too short a date;
    Sometime too hot the eye of heaven shines,
    And often is his gold complexion dimm'd;
    And every fair from fair sometime declines,
    By chance or nature's changing course untrimm'd;
    But thy eternal summer shall not fade,
    Nor lose possession of that fair thou ow'st;
    Nor shall death brag thou wander'st in his shade,
    When in eternal lines to time thou grow'st:
    So long as men can breathe or eyes can see,
    So long lives this, and this gives life to thee.
    """);

private char[] _encoded = new char[1000];

[Benchmark]
public bool TryToBase64Chars() => Convert.TryToBase64Chars(_data, _encoded, out _);
Method | Runtime | Mean | Ratio |
---|---|---|---|
TryToBase64Chars | .NET 6.0 | 623.25 ns | 1.00 |
TryToBase64Chars | .NET 7.0 | 81.82 ns | 0.13 |
Just as widening can be used to go from bytes to chars, narrowing can be used to go from chars to bytes, in particular if the chars are actually ASCII and thus have a 0 upper byte. Such narrowing can be vectorized, and the internal NarrowUtf16ToAscii utility helper does exactly that, used as part of methods like Encoding.ASCII.GetBytes. While this method was previously vectorized, its primary fast path utilized SSE2 and thus didn't apply to Arm64; thanks to dotnet/runtime#70080 from @SwapnilGaikwad, that path was changed over to be based on the cross-platform Vector128<T>, enabling the same level of optimization across supported platforms. Similarly, dotnet/runtime#71637 from @SwapnilGaikwad adds Arm64 vectorization to the GetIndexOfFirstNonAsciiChar internal helper that's used by methods like Encoding.UTF8.GetByteCount. (And in the same vein, dotnet/runtime#67192 changed the internal HexConverter.EncodeToUtf16 method from using SSSE3 intrinsics to instead use Vector128<T>, automatically providing an Arm64 implementation.)
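As a rough illustration of what such a cross-platform narrowing loop can look like (this is a simplified sketch, not the actual NarrowUtf16ToAscii helper; it assumes the input has already been validated as ASCII and that the destination is at least as long as the source):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static void NarrowAsciiCharsToBytes(ReadOnlySpan<char> source, Span<byte> destination)
{
    int i = 0;

    if (Vector128.IsHardwareAccelerated && source.Length >= Vector128<ushort>.Count * 2)
    {
        ref ushort src = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(source));
        ref byte dst = ref MemoryMarshal.GetReference(destination);

        int lastBlockStart = source.Length - Vector128<ushort>.Count * 2;
        for (; i <= lastBlockStart; i += Vector128<ushort>.Count * 2)
        {
            // Load 16 chars (two vectors of ushort) and pack them into one vector of 16 bytes.
            Vector128<ushort> lower = Vector128.LoadUnsafe(ref src, (nuint)i);
            Vector128<ushort> upper = Vector128.LoadUnsafe(ref src, (nuint)(i + Vector128<ushort>.Count));
            Vector128.Narrow(lower, upper).StoreUnsafe(ref dst, (nuint)i);
        }
    }

    for (; i < source.Length; i++)
    {
        destination[i] = (byte)source[i]; // scalar handling of any leftover chars
    }
}

Because it's written against Vector128<T> rather than SSE2 or AdvSimd directly, the same loop can be accelerated on x64, Arm64, and WebAssembly.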
Encoding.UTF8 was also improved a bit. In particular, dotnet/runtime#69910 streamlined the implementations of GetMaxByteCount and GetMaxCharCount, making them small enough to be commonly inlined when used directly off of Encoding.UTF8, such that the JIT is able to devirtualize the calls.
[Benchmark]
public int GetMaxByteCount() => Encoding.UTF8.GetMaxByteCount(Sonnet.Length);
Method | Runtime | Mean | Ratio |
---|---|---|---|
GetMaxByteCount | .NET 6.0 | 1.7442 ns | 1.00 |
GetMaxByteCount | .NET 7.0 | 0.4746 ns | 0.27 |
Arguably the biggest improvement around UTF8 in .NET 7 is the new C# 11 support for UTF8 literals. Initially implemented in the C# compiler in dotnet/roslyn#58991, with follow-on work in dotnet/roslyn#59390, dotnet/roslyn#61532, and dotnet/roslyn#62044, UTF8 literals enable the compiler to perform the UTF8 encoding into bytes at compile-time. Rather than writing a normal string, e.g. "hello", a developer simply appends the new u8 suffix onto the string literal, e.g. "hello"u8. At that point, this is no longer a string. Rather, the natural type of this expression is a ReadOnlySpan<byte>. If you write:
public static ReadOnlySpan<byte> Text => "hello"u8;
the C# compiler will compile that equivalent to if you wrote:
public static ReadOnlySpan<byte> Text => new ReadOnlySpan<byte>(new byte[] { (byte)'h', (byte)'e', (byte)'l', (byte)'l', (byte)'o', (byte)'\0' }, 0, 5);
In other words, the compiler is doing the equivalent of Encoding.UTF8.GetBytes at compile-time and hardcoding the resulting bytes, saving the cost of performing that encoding at run-time. At first glance, that array allocation might look terribly inefficient. However, looks can be deceiving, and they are in this case. For several releases now, when the C# compiler sees a byte[] (or sbyte[] or bool[]) being initialized with a constant length and constant values and immediately cast to or used to construct a ReadOnlySpan<byte>, it optimizes away the byte[] allocation. Instead, it blits the data for that span into the assembly's data section and then constructs a span that points directly to that data in the loaded assembly. This is the actual generated IL for the above property:
IL_0000: ldsflda valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=6' '<PrivateImplementationDetails>'::F3AEFE62965A91903610F0E23CC8A69D5B87CEA6D28E75489B0D2CA02ED7993C
IL_0005: ldc.i4.5
IL_0006: newobj instance void valuetype [System.Runtime]System.ReadOnlySpan`1<uint8>::.ctor(void*, int32)
IL_000b: ret
This means we not only save on the encoding costs at run-time, and we not only avoid whatever managed allocation might otherwise be required to store the resulting data, we also benefit from the JIT being able to see information about the encoded data, like its length, enabling knock-on optimizations. You can see this clearly by examining the assembly generated for a method like:
public static int M() => Text.Length;
for which the JIT produces:
; Program.M()
       mov eax,5
       ret
; Total bytes of code 6
The JIT inlines the property access, sees that the span is being constructed with a length of 5, and so rather than emitting any array allocations or span constructions or anything even resembling that, it simply outputs mov eax, 5 to return the known length of the span.
Thanks primarily to dotnet/runtime#70568, dotnet/runtime#69995, dotnet/runtime#70894, dotnet/runtime#71417 from @am11, dotnet/runtime#71292, dotnet/runtime#70513, and dotnet/runtime#71992, u8 is now used more than 2100 times throughout dotnet/runtime. Hardly a fair comparison, but the following benchmark demonstrates how little work is actually being performed for u8 at execution time:
[Benchmark(Baseline = true)]
public ReadOnlySpan<byte> WithEncoding() => Encoding.UTF8.GetBytes("test");

[Benchmark]
public ReadOnlySpan<byte> Withu8() => "test"u8;
Method | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|
WithEncoding | 17.3347 ns | 1.000 | 32 B | 1.00 |
Withu8 | 0.0060 ns | 0.000 | – | 0.00 |
Like I said, not fair, but it proves the point 🙂
Encoding is of course just one mechanism for creating string instances. Others have also improved in .NET 7. Take the super common long.ToString, for example. Previous releases improved int.ToString, but there were enough differences between the 32-bit and 64-bit algorithms that long didn't see all of the same gains. Now, thanks to dotnet/runtime#68795, the 64-bit formatting code paths are much more similar to the 32-bit ones, resulting in faster performance.
You can also see improvements in string.Format and StringBuilder.AppendFormat, as well as in other helpers that layer on top of these (like TextWriter.AppendFormat). dotnet/runtime#69757 overhauls the core routines inside Format to avoid unnecessary bounds checking, favor expected cases, and generally clean up the implementation. It also, however, utilizes IndexOfAny to search for the next interpolation hole that needs to be filled in, and if the ratio of non-hole characters to holes is high (e.g. a long format string with few holes), it can be way faster than before.
private StringBuilder _sb = new StringBuilder();

[Benchmark]
public void AppendFormat()
{
    _sb.Clear();
    _sb.AppendFormat("There is already one outstanding '{0}' call for this WebSocket instance." +
        "ReceiveAsync and SendAsync can be called simultaneously, but at most one " +
        "outstanding operation for each of them is allowed at the same time.",
        "ReceiveAsync");
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
AppendFormat | .NET 6.0 | 338.23 ns | 1.00 |
AppendFormat | .NET 7.0 | 49.15 ns | 0.15 |
Speaking of StringBuilder, it's seen additional improvements beyond the aforementioned changes to AppendFormat. One interesting change is dotnet/runtime#64405, which achieved two related things. The first was to remove pinning as part of formatting operations. As an example, StringBuilder has an Append(char* value, int valueCount) overload which copies the specified number of characters from the specified pointer into the StringBuilder, and other APIs were implemented in terms of this method; for example, the Append(string? value, int startIndex, int count) method was essentially implemented as:
fixed (char* ptr = value)
{
    Append(ptr + startIndex, count);
}
That fixed statement translates into a "pinning pointer." Normally the GC is free to move managed objects around on the heap, which it might do in order to compact the heap (to, for example, avoid small, unusable fragments of memory between objects). But if the GC can move objects around, a normal native pointer into that memory would be terribly unsafe and unreliable, as without notice the data being pointed to could move and your pointer could now be pointing to garbage or to some other object that was shifted to this location. There are two ways of dealing with this. The first is a "managed pointer," otherwise known as a "reference" or "ref," as that's exactly what you get when you use the "ref" keyword in C#; it's a pointer that the runtime will update with the correct value when it moves the object being pointed into. The second is to prevent the pointed-to object from being moved, "pinning" it in place. And that's what the "fixed" keyword does, pinning the referenced object for the duration of the fixed block, during which time it's safe to use the supplied pointer. Thankfully, pinning is cheap when no GC occurs; when a GC does occur, however, pinned objects aren't able to be moved around, and thus pinning can have a global impact on the performance of the application (and on GCs themselves). There are also various optimizations inhibited by pinning. With all of the advances in C# around being able to use ref in many more places (e.g. ref locals, ref returns, and now in C# 11, ref fields), and with all of the new APIs in .NET for manipulating refs (e.g. Unsafe.Add, Unsafe.AreSame), it's now possible to rewrite code that was using pinning pointers to instead use managed pointers, thereby avoiding the problems that come from pinning. Which is what this PR did. Rather than implementing all of the Append methods in terms of an Append(char*, int) helper, they're now all implemented in terms of an Append(ref char, int) helper. So, for example, instead of the previously shown Append(string? value, int startIndex, int count) implementation, it's now akin to:

Append(ref Unsafe.Add(ref value.GetRawStringData(), startIndex), count);

where that string.GetRawStringData method is just an internal version of the public string.GetPinnableReference method, returning a ref instead of a ref readonly. This means that all of the high-performance code inside of StringBuilder that had been using pointers to avoid bounds checking and the like can continue to do so, but now also does so without pinning all of the inputs.
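A condensed sketch of the shape of that change, using only public APIs (MiniBuilder and its members are illustrative, not the actual StringBuilder internals): all the public overloads funnel into one ref-based helper, and nothing needs to be pinned.

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

class MiniBuilder
{
    private char[] _chars = new char[16];
    private int _length;

    // The single ref-based workhorse; the GC can still relocate the source freely.
    private void Append(ref char value, int valueCount)
    {
        if (_length + valueCount > _chars.Length)
            Array.Resize(ref _chars, Math.Max(_chars.Length * 2, _length + valueCount));

        MemoryMarshal.CreateReadOnlySpan(ref value, valueCount).CopyTo(_chars.AsSpan(_length));
        _length += valueCount;
    }

    public void Append(string? value, int startIndex, int count)
    {
        if (!string.IsNullOrEmpty(value) && count > 0)
            Append(ref Unsafe.Add(ref Unsafe.AsRef(in value.GetPinnableReference()), startIndex), count);
    }

    public void Append(ReadOnlySpan<char> value)
    {
        if (!value.IsEmpty)
            Append(ref MemoryMarshal.GetReference(value), value.Length);
    }

    public override string ToString() => new string(_chars, 0, _length);
}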
The second thing this StringBuilder change did was unify an optimization that was present for string inputs so that it also applies to char[] inputs and ReadOnlySpan<char> inputs. Specifically, because it's so common to append string instances to a StringBuilder, a special code path was long ago put in place to optimize for that input, and specifically for the case where there's already enough room in the StringBuilder to hold the whole input, at which point an efficient copy can be used. With a shared Append(ref char, int) helper, though, this optimization can be moved down into that helper, such that it not only helps out string but also any other type that calls into the same helper. The effects of this are visible in a simple microbenchmark:
private StringBuilder _sb = new StringBuilder();

[Benchmark]
public void AppendSpan()
{
    _sb.Clear();
    _sb.Append("this".AsSpan());
    _sb.Append("is".AsSpan());
    _sb.Append("a".AsSpan());
    _sb.Append("test".AsSpan());
    _sb.Append(".".AsSpan());
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
AppendSpan | .NET 6.0 | 35.98 ns | 1.00 |
AppendSpan | .NET 7.0 | 17.59 ns | 0.49 |
One of the great things about improving things low in the stack is that they have a multiplicative effect; they not only help improve the performance of user code that directly relies on the improved functionality, they can also help improve the performance of other code in the core libraries, which then further helps dependent apps and services. You can see this, for example, with DateTimeOffset.ToString, which depends on StringBuilder:
private DateTimeOffset _dto = DateTimeOffset.UtcNow;

[Benchmark]
public string DateTimeOffsetToString() => _dto.ToString();
Method | Runtime | Mean | Ratio |
---|---|---|---|
DateTimeOffsetToString | .NET 6.0 | 340.4 ns | 1.00 |
DateTimeOffsetToString | .NET 7.0 | 289.4 ns | 0.85 |
StringBuilder itself was then further updated by dotnet/runtime#64922 from @teo-tsirpanis, which improves the Insert methods. It used to be that the Append(primitive) methods on StringBuilder (e.g. Append(int)) would call ToString on the value and then append the resulting string. With the advent of ISpanFormattable, as a fast path those methods now try to format the value directly into the StringBuilder's internal buffer, and only if there's not enough room remaining do they take the old path as a fallback. Insert wasn't improved in this way at the time, because it can't just format into the space at the end of the builder; the insert location could be anywhere in the builder. This PR addresses that by formatting into some temporary stack space and then delegating to the existing internal ref-based helper from the previously discussed PR to insert the resulting characters at the right location (it also falls back to ToString when there's not enough stack space for the ISpanFormattable.TryFormat, but that only happens in rare corner cases, like a floating-point value that formats to hundreds of digits).
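The same pattern is easy to apply in your own code; here's a hedged sketch using only public APIs (the helper name and buffer size are mine, and the real StringBuilder.Insert internals differ):

static void InsertFormatted<T>(StringBuilder sb, int index, T value) where T : ISpanFormattable
{
    // Format into stack space first, so the common case never allocates a string.
    Span<char> buffer = stackalloc char[64];
    if (value.TryFormat(buffer, out int charsWritten, format: default, provider: null))
    {
        sb.Insert(index, buffer.Slice(0, charsWritten));
    }
    else
    {
        sb.Insert(index, value.ToString(null, null)); // rare fallback for very long formats
    }
}

Used as, for example, InsertFormatted(sb, 0, 12345), this mirrors the allocation-free behavior the PR gives the built-in Insert overloads.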
private StringBuilder _sb = new StringBuilder();

[Benchmark]
public void Insert()
{
    _sb.Clear();
    _sb.Insert(0, 12345);
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
Insert | .NET 6.0 | 30.02 ns | 1.00 | 32 B | 1.00 |
Insert | .NET 7.0 | 25.53 ns | 0.85 | – | 0.00 |
Other minor improvements to StringBuilder have also been made, like dotnet/runtime#60406, which removed a small int[] allocation from the Replace method. Even with all these improvements, though, the fastest use of StringBuilder is no use at all; dotnet/runtime#68768 removed a bunch of uses of StringBuilder that would have been better served with other string-creation mechanisms. For example, the legacy DataView type had some code that created a sorting specification as a string:
private static string CreateSortString(PropertyDescriptor property, ListSortDirection direction)
{
    var resultString = new StringBuilder();
    resultString.Append('[');
    resultString.Append(property.Name);
    resultString.Append(']');
    if (ListSortDirection.Descending == direction)
    {
        resultString.Append(" DESC");
    }
    return resultString.ToString();
}
We don't actually need the StringBuilder here, as in the worst case we're just concatenating three strings, and string.Concat has a dedicated overload for that exact operation with the best possible implementation (and if we ever found a better way, that overload would be improved accordingly). So we can just use that:
private static string CreateSortString(PropertyDescriptor property, ListSortDirection direction) => direction == ListSortDirection.Descending ? $"[{property.Name}] DESC" : $"[{property.Name}]";
Note that I've expressed that concatenation via an interpolated string, but the C# compiler will "lower" this interpolated string to a call to string.Concat, so the IL for this is indistinguishable from if I'd instead written:
private static string CreateSortString(PropertyDescriptor property, ListSortDirection direction) => direction == ListSortDirection.Descending ? string.Concat("[", property.Name, "] DESC") : string.Concat("[", property.Name, "]");
As an aside, the expanded string.Concat version highlights that this method could have been written to produce a bit less IL if it were instead written as:
private static string CreateSortString(PropertyDescriptor property, ListSortDirection direction) => string.Concat("[", property.Name, direction == ListSortDirection.Descending ? "] DESC" : "]");
but this doesn't meaningfully affect performance, and here clarity and maintainability were more important than shaving off a few bytes.
[Benchmark(Baseline = true)]
[Arguments("SomeProperty", ListSortDirection.Descending)]
public string WithStringBuilder(string name, ListSortDirection direction)
{
    var resultString = new StringBuilder();
    resultString.Append('[');
    resultString.Append(name);
    resultString.Append(']');
    if (ListSortDirection.Descending == direction)
    {
        resultString.Append(" DESC");
    }
    return resultString.ToString();
}

[Benchmark]
[Arguments("SomeProperty", ListSortDirection.Descending)]
public string WithConcat(string name, ListSortDirection direction) =>
    direction == ListSortDirection.Descending ? $"[{name}] DESC" : $"[{name}]";
Method | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|
WithStringBuilder | 68.34 ns | 1.00 | 272 B | 1.00 |
WithConcat | 20.78 ns | 0.31 | 64 B | 0.24 |
There are also places where StringBuilder was still applicable but was being used on hot-enough paths that previous releases of .NET saw the StringBuilder instance being cached. Several of the core libraries, including System.Private.CoreLib, have an internal StringBuilderCache type which caches a StringBuilder instance in a [ThreadStatic] field, meaning every thread could end up having such an instance. There are several issues with this, including that the buffers employed by the StringBuilder aren't usable for anything else while the StringBuilder isn't in use, and, because of that, StringBuilderCache places a limit on the capacity of the StringBuilder instances that can be cached; attempts to cache ones larger than that limit result in them being thrown away. It'd be better instead to use cached arrays that aren't length-limited and that everyone has access to for sharing. Many of the core .NET libraries have an internal ValueStringBuilder type for this purpose, a ref struct-based type that can use stackalloc'd memory to start and then, if necessary, grow into ArrayPool<char> arrays. And with dotnet/runtime#64522 and dotnet/runtime#69683, many of the remaining uses of StringBuilderCache have been replaced. I'm hopeful we can entirely remove StringBuilderCache in the future.
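To give a flavor of what such a type looks like, here's a heavily condensed sketch; this is not the actual internal ValueStringBuilder, which has many more members and optimizations:

using System;
using System.Buffers;

ref struct ValueStringBuilderSketch
{
    private char[]? _arrayFromPool;
    private Span<char> _chars;
    private int _length;

    // Starts with caller-provided (typically stackalloc'd) space.
    public ValueStringBuilderSketch(Span<char> initialBuffer)
    {
        _arrayFromPool = null;
        _chars = initialBuffer;
        _length = 0;
    }

    public void Append(scoped ReadOnlySpan<char> value)
    {
        if (_length > _chars.Length - value.Length) Grow(value.Length);
        value.CopyTo(_chars.Slice(_length));
        _length += value.Length;
    }

    // Grows into pooled arrays, so nothing is thread-affinitized or capacity-capped.
    private void Grow(int additional)
    {
        char[] newArray = ArrayPool<char>.Shared.Rent(Math.Max(_chars.Length * 2, _length + additional));
        _chars.Slice(0, _length).CopyTo(newArray);
        char[]? toReturn = _arrayFromPool;
        _chars = _arrayFromPool = newArray;
        if (toReturn is not null) ArrayPool<char>.Shared.Return(toReturn);
    }

    public override string ToString()
    {
        string result = _chars.Slice(0, _length).ToString();
        if (_arrayFromPool is not null) ArrayPool<char>.Shared.Return(_arrayFromPool);
        return result;
    }
}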
In the same vein of not doing unnecessary work, there's a fairly common pattern that shows up with methods like string.Substring and span.Slice:

span = span.Slice(offset, span.Length - offset);
The relevant thing to recognize here is that these methods have overloads that take just the starting offset. Since the length being specified is the remainder after the specified offset, the call can instead be simplified to:

span = span.Slice(offset);

which is not only more readable and maintainable, it also has some small efficiency benefits, e.g. on 64-bit the Slice(int, int) overload performs an extra addition compared to Slice(int), and on 32-bit the Slice(int, int) overload incurs an additional comparison and branch. It's thus beneficial both for code maintenance and for performance to simplify these calls, which dotnet/runtime#68937 does for all found occurrences of the pattern. This is then made more impactful by dotnet/runtime#73882, which streamlines string.Substring to remove unnecessary overheads, e.g. it condenses four argument-validation checks down to a single fast-path comparison (in 64-bit processes).
Ok, enough about string. What about spans? One of the coolest features in C# 11 is the new support for ref fields. What is a ref field? You're familiar with refs in C# in general, and we've already discussed how they're essentially managed pointers, i.e. pointers that the runtime can update at any time when the object they reference gets moved on the heap. These references can point to the beginning of an object, or they can point somewhere inside the object, in which case they're referred to as "interior pointers." ref has existed in C# since 1.0, but at that time it was primarily about passing arguments by reference to method calls, e.g.
class Data
{
    public int Value;
}
...
void Add(ref int i)
{
    i++;
}
...
var d = new Data { Value = 42 };
Add(ref d.Value);
Debug.Assert(d.Value == 43);
Later versions of C# added the ability to have local refs, e.g.
void Add(ref int i)
{
    ref int j = ref i;
    j++;
}
and even to have ref returns, e.g.
ref int Add(ref int i)
{
    ref int j = ref i;
    j++;
    return ref j;
}
These facilities are more advanced, but they're used liberally throughout high-performance code bases, and many of the optimizations in .NET in recent years are possible in large part due to these ref-related capabilities.
Span<T> and ReadOnlySpan<T> themselves are heavily based on refs. For example, the indexer on many older collection types is implemented as a get/set property, e.g.
private T[] _items;
...
public T this[int i]
{
    get => _items[i];
    set => _items[i] = value;
}
But not so for spans: Span<T>'s indexer looks more like this:
public ref T this[int index]
{
    get
    {
        if ((uint)index >= (uint)_length)
            ThrowHelper.ThrowIndexOutOfRangeException();

        return ref Unsafe.Add(ref _reference, index);
    }
}
Note there's only a getter and no setter; that's because it returns a ref T to the actual storage location. It's a writable ref, so you can assign to it, e.g. you can write:

span[i] = value;

but rather than that being equivalent to calling some setter:

span.set_Item(i, value);

it's actually equivalent to using the getter to retrieve the ref and then writing a value through that ref, e.g.

ref T item = ref span.get_Item(i);
item = value;
That's all well and good, but what's that _reference in the getter definition? Well, Span<T> is really just a tuple of two fields: a reference (to the start of the memory being referred to) and a length (how many elements from that reference are included in the span). In the past, the runtime had to hack this together with an internal type (ByReference<T>) specially recognized by the runtime to be a reference. But as of C# 11 and .NET 7, ref structs can now contain ref fields, which means Span<T> today is literally defined as follows:
public readonly ref struct Span<T>
{
    internal readonly ref T _reference;
    private readonly int _length;
    ...
}
The rollout of ref fields throughout dotnet/runtime was done in dotnet/runtime#71498, following the C# language gaining this support primarily in dotnet/roslyn#62155, which itself was the culmination of many PRs first made into a feature branch. ref fields alone don't automatically improve performance, but they do simplify code significantly, and they allow for both new custom code that uses ref fields and new APIs that take advantage of them, both of which can help with performance (and specifically performance without sacrificing safety). One such example of a new API is the new constructors on ReadOnlySpan<T> and Span<T>:

public Span(ref T reference);
public ReadOnlySpan(in T reference);

added in dotnet/runtime#67447 (and then made public and used more broadly in dotnet/runtime#71589). This may beg the question: why does ref field support enable two new constructors that take refs, considering spans were already able to store a ref? After all, the MemoryMarshal.CreateSpan(ref T reference, int length) and corresponding CreateReadOnlySpan methods have existed for as long as spans have, and these new constructors are equivalent to calling those methods with a length of 1. The answer is: safety.
Imagine if you could willy-nilly call this constructor. You’d be able to write code like this:
public Span<int> RuhRoh()
{
    int i = 42;
    return new Span<int>(ref i);
}
At this point the caller of this method is handed a span that refers to garbage; that’s bad in code that’s intended to be safe. You can already accomplish the same thing by using pointers:
public Span<int> RuhRoh()
{
    unsafe
    {
        int i = 42;
        return new Span<int>(&i, 1);
    }
}
but at that point you've taken on the risk of using unsafe code and pointers, and any resulting problems are on you. With C# 11, if you now try to write the above code using the ref-based constructor, you'll be greeted with an error like this:
error CS8347: Cannot use a result of 'Span<int>.Span(ref int)' in this context because it may expose variables referenced by parameter 'reference' outside of their declaration scope
In other words, the compiler now understands that Span<int>, as a ref struct, could be storing the passed-in ref, and if it does store it (which Span<T> does), this is akin to passing a ref to a local out of the method, which is bad. Hence how this relates to ref fields: because ref fields are now a thing, the compiler's rules for safe handling of refs have been updated, which in turn enables us to expose the aforementioned constructors on {ReadOnly}Span<T>.
As is often the case, addressing one issue kicks the can down the road a bit and exposes another. The compiler now believes that a ref passed to a method on a ref struct could enable that ref struct instance to store the ref (note that this was already the case with ref structs passed to methods on ref structs), but what if we don't want that? What if we want to be able to say "this ref is not storable and should not escape the calling scope"? From a caller's perspective, we want the compiler to allow passing in such refs without complaining about potential extension of lifetime, and from a callee's perspective, we want the compiler to prevent the method from doing what it's not supposed to do. Enter scoped. The new C# keyword does exactly what we just wished for: put it on a ref or ref struct parameter, and the compiler both will guarantee (short of using unsafe code) that the method can't stash away the argument and will then enable the caller to write code that relies on that guarantee. For example, consider this program:
var writer = new SpanWriter(stackalloc char[128]);
Append(ref writer, 123);
writer.Write(".");
Append(ref writer, 45);
Console.WriteLine(writer.AsSpan().ToString());

static void Append(ref SpanWriter builder, byte value)
{
    Span<char> tmp = stackalloc char[3];
    value.TryFormat(tmp, out int charsWritten);
    builder.Write(tmp.Slice(0, charsWritten));
}

ref struct SpanWriter
{
    private readonly Span<char> _chars;
    private int _length;

    public SpanWriter(Span<char> destination) => _chars = destination;

    public Span<char> AsSpan() => _chars.Slice(0, _length);

    public void Write(ReadOnlySpan<char> value)
    {
        if (_length > _chars.Length - value.Length)
        {
            throw new InvalidOperationException("Not enough remaining space");
        }

        value.CopyTo(_chars.Slice(_length));
        _length += value.Length;
    }
}
We have a ref struct SpanWriter that takes a Span<char> in its constructor and allows for writing to it by copying in additional content and then updating the stored length. The Write method accepts a ReadOnlySpan<char>. And we then have a helper Append method which formats a byte into some stackalloc'd temporary space and passes the resulting formatted chars to Write. Straightforward. Except, this doesn't compile:
error CS8350: This combination of arguments to 'SpanWriter.Write(ReadOnlySpan<char>)' is disallowed because it may expose variables referenced by parameter 'value' outside of their declaration scope
What do we do? The Write method doesn't actually store the value parameter and won't ever need to, so we can change the signature of the method to annotate the parameter as scoped:
public void Write(scoped ReadOnlySpan<char> value)
If Write were then to try to store value, the compiler would balk:
error CS8352: Cannot use variable 'ReadOnlySpan<char>' in this context because it may expose referenced variables outside of their declaration scope
But as it's not trying to do so, everything now compiles successfully. You can see examples of how this is utilized in the aforementioned dotnet/runtime#71589.
There's also the other direction: some things are implicitly scoped, like the this reference on a struct. Consider this code:
public struct SingleItemList
{
    private int _value;

    public ref int this[int i]
    {
        get
        {
            if (i != 0) throw new IndexOutOfRangeException();
            return ref _value;
        }
    }
}
This produces a compiler error:
error CS8170: Struct members cannot return 'this' or other instance members by reference
Effectively, that's because this is implicitly scoped (even though that keyword wasn't previously available). What if we want to enable such a ref to be returned? Enter [UnscopedRef]. The need for this is rare enough that it doesn't get its own C# language keyword, but the C# compiler does recognize the new [UnscopedRef] attribute. It can be put onto relevant parameters but also onto methods and properties, in which case it applies to the this reference for that member. As such, we can modify our previous code example to be:
[UnscopedRef]
public ref int this[int i]
and now the code will compile successfully. Of course, this also places demands on callers of such a member. For a call site, the compiler sees the [UnscopedRef] on the member being invoked, knows that the returned ref might reference something from that struct, and thus assigns to the returned ref the same lifetime as the struct. So, if that struct were a local living on the stack, the ref would also be limited to that same method.
Another impactful span-related change comes in dotnet/runtime#70095 from @teo-tsirpanis. System.HashCode's goal is to provide a fast, easy-to-use implementation for producing high-quality hash codes. In its current incarnation, it incorporates a random process-wide seed and is an implementation of the xxHash32 non-cryptographic hash algorithm. In a previous release, HashCode saw the addition of an AddBytes method, which accepts a ReadOnlySpan<byte> and is useful for incorporating sequences of data that should be part of a type's hash code, e.g. BigInteger.GetHashCode includes all the data that makes up the BigInteger. The xxHash32 algorithm works by accumulating 4 32-bit unsigned integers and then combining them into the hash code; thus if you call HashCode.Add(int), the first three times you call it the values are just stored separately into the instance, and then the fourth time you call it all of those values are combined into the hash code (and there's a separate process that incorporates any remaining values if the number of 32-bit values added wasn't an exact multiple of 4). Previously, AddBytes was simply implemented to repeatedly read the next 4 bytes from the input span and call Add(int) with those bytes as an integer. But those Add calls have overhead. Instead, this PR skips the Add calls and directly handles the accumulation and combining of 16 bytes at a time. Interestingly, it still has to deal with the possibility that previous calls to Add may have left some state queued, which means (with the current implementation, at least) that if there are multiple pieces of state to include in the hash code, say a ReadOnlySpan<byte> and an additional int, it's more efficient to add the span first and then the int rather than the other way around. So, for example, when dotnet/runtime#71274 from @huoyaoyuan changed BigInteger.GetHashCode to use HashCode.AddBytes, it coded the method to first call AddBytes with the BigInteger's _bits and then call Add with the _sign.
private byte[] _data = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();

[Benchmark]
public int AddBytes()
{
    HashCode hc = default;
    hc.AddBytes(_data);
    return hc.ToHashCode();
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
AddBytes | .NET 6.0 | 159.11 ns | 1.00 |
AddBytes | .NET 7.0 | 42.11 ns | 0.26 |
Another span-related change: dotnet/runtime#72727 refactored a bunch of code paths to eliminate some cached arrays. Why avoid cached arrays? After all, isn't it desirable to cache an array once and reuse it over and over again? It is, if that's the best option, but sometimes there are better options. For example, one of the changes took code like:
private static readonly char[] s_pathDelims = { ':', '\\', '/', '?', '#' };
...
int index = value.IndexOfAny(s_pathDelims);
and replaced it with code like:
int index = value.AsSpan().IndexOfAny(@":\/?#");
This has a variety of benefits. There’s the usability benefit of keeping the tokens being searched close to the use site, and the usability benefit of the list being immutable such that some code somewhere won’t accidentally replace a value in the array. But there are also performance benefits. We don’t need an extra field to store the array. We don’t need to allocate the array as part of this type’s static constructor. And loading/using the string is slightly faster.
private static readonly char[] s_pathDelims = { ':', '\\', '/', '?', '#' };
private static readonly string s_value = "abcdefghijklmnopqrstuvwxyz";

[Benchmark]
public int WithArray() => s_value.IndexOfAny(s_pathDelims);

[Benchmark]
public int WithString() => s_value.AsSpan().IndexOfAny(@":\/?#");
Method | Mean | Ratio |
---|---|---|
WithArray | 8.601 ns | 1.00 |
WithString | 6.949 ns | 0.81 |
Another example from that PR took code along the lines of:
private static readonly char[] s_whitespaces = new char[] { ' ', '\t', '\n', '\r' };
...
switch (attr.Value.Trim(s_whitespaces))
{
    case "preserve": return Preserve;
    case "default": return Default;
}
and replaced it with code like:
switch (attr.Value.AsSpan().Trim(" \t\n\r"))
{
    case "preserve": return Preserve;
    case "default": return Default;
}
In this case, not only have we avoided the char[], but if the text did require any trimming of whitespace, the new version (which trims a span instead of the original string) will save an allocation for the trimmed string. This is taking advantage of the new C# 11 feature that supports switching on ReadOnlySpan<char>s just as you can switch on strings, added in dotnet/roslyn#44388 from @YairHalberstadt. dotnet/runtime#68831 also took advantage of this in several additional places.
Of course, in some cases the arrays are entirely unnecessary. In that same PR, there were several cases like this:
private static readonly char[] WhiteSpaceChecks = new char[] { ' ', '\u00A0' };
...
int wsIndex = target.IndexOfAny(WhiteSpaceChecks, targetPosition);
if (wsIndex < 0)
{
    return false;
}
By switching to use spans, again, we can instead write it like this:
int wsIndex = target.AsSpan(targetPosition).IndexOfAny(' ', '\u00A0');
if (wsIndex < 0)
{
    return false;
}
wsIndex += targetPosition;
MemoryExtensions.IndexOfAny has dedicated overloads for two and three values, at which point we don't need the array at all (these overloads also happen to be faster; when passing an array of two chars, the implementation would extract the two chars from the array and pass them off to the same two-argument implementation). Multiple other PRs similarly removed array allocations. dotnet/runtime#60409 removed a single-char array that was cached in order to pass it to string.Split and replaced it with usage of the Split overload that directly accepts a single char.
Finally, dotnet/runtime#59670 from @NewellClark got rid of even more arrays. We saw earlier how the C# compiler special-cases byte[]s constructed with a constant length and constant elements and immediately cast to a ReadOnlySpan<byte>. Thus, it can be beneficial any time there's such a byte[] being cached to instead expose it as a ReadOnlySpan<byte>. As I discussed in the .NET 6 post, this avoids even the one-time array allocation you'd get for a cached array, results in much more efficient access, and supplies the JIT compiler with more information that enables it to optimize more heavily... goodness all around. This PR removed even more arrays in this manner, as did dotnet/runtime#60411, dotnet/runtime#72743, dotnet/runtime#73115 from @vcsjones, and dotnet/runtime#70665.
Regex
Back in May, I shared a fairly detailed post about the improvements coming to Regular Expressions in .NET 7. As a recap, prior to .NET 5, Regex's implementation had largely been untouched for quite some time. In .NET 5, we brought it back up to be on par with or better than multiple other industry implementations from a performance perspective. .NET 7 takes some significant leaps forward from that. If you haven't read the post yet, please go ahead and do so now; I'll wait...
Welcome back. With that context, I'll avoid duplicating content here and instead focus on how exactly these improvements came about and the PRs that did so.
RegexOptions.NonBacktracking
Let’s start with one of the larger new features in Regex, the new RegexOptions.NonBacktracking implementation. As discussed in the previous post, RegexOptions.NonBacktracking switches the processing of Regex over to using a new engine based on finite automata. It has two primary modes of execution, one that relies on DFAs (deterministic finite automata) and one that relies on NFAs (non-deterministic finite automata). Both implementations provide a very valuable guarantee: processing time is linear in the length of the input. Whereas a backtracking engine (which is what Regex uses if NonBacktracking isn’t specified) can hit a situation known as “catastrophic backtracking,” where problematic expressions combined with problematic input can result in processing time that’s exponential in the length of the input, NonBacktracking guarantees it’ll only ever do an amortized-constant amount of work per character in the input. In the case of a DFA, that constant is very small. With an NFA, that constant can be much larger, based on the complexity of the pattern, but for any given pattern the work is still linear in the length of the input.
A significant number of years of development went into the NonBacktracking implementation, which was initially added into dotnet/runtime in dotnet/runtime#60607. However, the original research and implementation for it actually came from Microsoft Research (MSR), and was available as an experimental package in the form of the Symbolic Regex Matcher (SRM) library published by MSR. You can still see vestiges of this in the current code now in .NET 7, but it’s evolved significantly, in tight collaboration between developers on the .NET team and the researchers at MSR (prior to being integrated into dotnet/runtime, it was incubated for over a year in dotnet/runtimelab, where the original SRM code was brought in via dotnet/runtimelab#588 from @veanes).
This implementation is based on the notion of regular expression derivatives, a concept that’s been around for decades (the term was originally coined in a paper by Janusz Brzozowski in the 1960s) and which has been significantly advanced for this implementation. Regex derivatives form the basis for how the automata (think “graph”) used to process input are constructed. The idea at its core is fairly simple: take a regex and process a single character… what is the new regex you get to describe what remains after processing that one character? That’s the derivative. For example, given the regex \w{3} to match three word characters, if you apply it to the next input character 'a', that will strip off the first \w, leaving us with the derivative \w{2}. Simple, right? How about something more complicated, like the expression .*(the|he). What happens if the next character is a t? Well, it’s possible that t could be consumed by the .* at the beginning of the pattern, in which case the remaining regex would be exactly the same as the starting one (.*(the|he)), since after matching t we could still match exactly the same input as without the t. But the t could have also been part of matching the, and applied to the, we’d strip off the t and be left with he, so now our derivative is .*(the|he)|he. Then what about the he in the original alternation? t doesn’t match h, so the derivative would be nothing, which we’ll express here as an empty character class, giving us .*(the|he)|he|[]. Of course, as part of an alternation, that “nothing” at the end is a nop, and so we can simplify the whole derivative to just .*(the|he)|he… done. That was all when applying the original pattern against a next t. What if it was against an h instead? Following the same logic as for the t, this time we end up with .*(the|he)|e. And so on. What if we instead start with the h derivative and the next character is an e? Then we’re taking the pattern .*(the|he)|e and applying it to e. Against the left side of the alternation, it can be consumed by the .* (but doesn’t match either t or h), and so we just end up with that same subexpression. But against the right side of the alternation, e matches e, leaving us with the empty string (): .*(the|he)|(). At the point where a pattern is “nullable” (it can match the empty string), that can be considered a match. We can visualize this whole thing as a graph, with transitions for every input character to the derivative that comes from applying it.
Looks an awful lot like a DFA, doesn’t it? It should. And that’s exactly how NonBacktracking constructs the DFAs it uses to process input. For every regex construct (concatenations, alternations, loops, etc.), the engine knows how to derive the next regex based on the character being evaluated. This application is done lazily, so we have an initial starting state (the original pattern), and then when we evaluate the next character in the input, it looks to see whether there’s already a derivative available for that transition: if there is, it follows it, and if there isn’t, it dynamically/lazily derives the next node in the graph. At its core, that’s how it works.
Of course, the devil is in the details, and there’s a ton of complication and engineering smarts that go into making the engine efficient. One such example is a tradeoff between memory consumption and throughput. Given the ability to have any char as input, you could have effectively ~65K transitions out of every node (e.g. every node could need a ~65K element table); that would significantly increase memory consumption. However, if you actually had that many transitions, it’s very likely a significant majority of them would point to the same target node. Thus, NonBacktracking maintains its own groupings of characters into what it calls “minterms.” If two characters will have exactly the same transition, they’re part of the same minterm. The transitions are then constructed in terms of minterms, with at most one transition per minterm out of a given node. When the next input character is read, it’s mapped to a minterm ID, and then the appropriate transition for that ID is found; one additional level of indirection in order to save a potentially huge amount of memory. That mapping is handled via an array bitmap for ASCII and an efficient data structure known as a Binary Decision Diagram (BDD) for everything above 0x7F.
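For illustration only (the type names below are hypothetical, and the real implementation uses precomputed bitmaps and BDDs as just described), the indirection looks roughly like this: a classifier collapses a char to a small minterm ID, and each DFA state stores one lazily populated transition per minterm rather than one per character.

// Illustrative sketch of the minterm indirection; not the actual implementation.
sealed class DfaState
{
    // One slot per minterm rather than one per char; entries are filled in lazily
    // by deriving the target state the first time that transition is taken.
    public readonly DfaState?[] Transitions;

    public DfaState(int mintermCount) => Transitions = new DfaState?[mintermCount];
}

sealed class MintermClassifier
{
    private readonly byte[] _asciiLookup;        // direct table for chars < 128
    private readonly Func<char, int> _nonAscii;  // the real code uses a BDD for this part

    public MintermClassifier(byte[] asciiLookup, Func<char, int> nonAscii)
    {
        _asciiLookup = asciiLookup;
        _nonAscii = nonAscii;
    }

    public int GetMintermId(char c) => c < 128 ? _asciiLookup[c] : _nonAscii(c);
}

At match time, the inner loop is then essentially state = state.Transitions[classifier.GetMintermId(c)], with a missing entry triggering lazy derivation of the target state.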
As noted, the non-backtracking engine is linear in the length of the input. But that doesn’t mean it always looks at each input character exactly once. If you call Regex.IsMatch, it does; after all, IsMatch only needs to determine whether there is a match and doesn’t need to compute any additional information, such as where the match actually starts or ends, any information on captures, etc. Thus, the engine can simply employ its automata to walk along the input, transitioning from node to node in the graph until it comes to a final state or runs out of input. Other operations, however, do require it to gather more information. Regex.Match needs to compute everything, and that can actually entail multiple walks over the input. In the initial implementation, the equivalent of Match would always take three passes: match forwards to find the end of a match, then match a reversed copy of the pattern in reverse from that ending location in order to find where the match actually starts, and then once more walk forwards from that known starting position to find the actual ending position. However, with dotnet/runtime#68199 from @olsaarik, unless captures are required, it can now be done in only two passes: once forward to find the guaranteed ending location of the match, and then once in reverse to find its starting location. And dotnet/runtime#65129 from @olsaarik added captures support, which the original implementation also didn’t have. This captures support adds back a third pass, such that once the bounds of the match are known, the engine runs the forward pass one more time, but this time with an NFA-based “simulation” that is able to record “capture effects” on transitions. All of this enables the non-backtracking implementation to have the exact same semantics as the backtracking engines, always producing the same matches in the same order with the same capture information. The only difference in this regard is that, whereas with the backtracking engines capture groups inside of loops will store all values captured in every iteration of the loop, only the last iteration is stored with the non-backtracking implementation. On top of that, there are a few constructs the non-backtracking implementation simply doesn’t support, such that attempting to use any of those will fail when trying to construct the Regex, e.g. backreferences and lookarounds.
Even after its progress as a standalone library from MSR, more than 100 PRs went into making RegexOptions.NonBacktracking what it is now in .NET 7, including optimizations like dotnet/runtime#70217 from @olsaarik that tries to streamline the tight inner matching loop at the heart of the DFA (e.g. read the next input character, find the appropriate transition to take, move to the next node, and check information about the node like whether it’s a final state), and optimizations like dotnet/runtime#65637 from @veanes that optimized the NFA mode to avoid superfluous allocations, caching and reusing list and set objects to make the handling of the lists of states amortized allocation-free.
There’s one more set of PRs of performance interest for NonBacktracking. The Regex implementation for taking patterns and turning them into something processable, regardless of which of the multiple engines is being used, is essentially a compiler, and as with many compilers, it naturally lends itself to recursive algorithms. In the case of Regex, those algorithms involve walking around trees of regular expression constructs. Recursion ends up being a very handy way of expressing these algorithms, but recursion also suffers from the possibility of stack overflow; essentially it’s using stack space as scratch space, and if it ends up using too much, things go badly. One common approach to dealing with this is turning the recursive algorithm into an iterative one, which typically involves using an explicit stack of state rather than the implicit one. The nice thing about this is the amount of state you can store is limited only by how much memory you have, as opposed to being limited by your thread’s stack space. The downsides, however, are that it’s typically much less natural to write the algorithms in this manner, and it typically requires allocating heap space for the stack, which then leads to additional complications if you want to avoid that allocation, such as various kinds of pooling. dotnet/runtime#60385 introduces a different approach for Regex, which is then used by dotnet/runtime#60786 from @olsaarik specifically in the NonBacktracking implementation. It still uses recursion, and thus benefits from the expressiveness of the recursive algorithm as well as being able to use stack space and thus avoid additional allocation in the most common cases, but to avoid stack overflows, it issues explicit checks to ensure we’re not too deep on the stack (.NET has long provided the helpers RuntimeHelpers.EnsureSufficientExecutionStack and RuntimeHelpers.TryEnsureSufficientExecutionStack for this purpose). If it detects it’s too deep on the stack, it forks off continued execution into another thread. Hitting this condition is expensive, but it’s very rarely if ever actually hit in practice (e.g. the only time it’s hit in our vast functional tests is in the tests explicitly written to stress it), it keeps the code simple, and it keeps the typical cases fast. A similar approach is used in other areas of dotnet/runtime, such as in System.Linq.Expressions.
As was mentioned in my previous blog post about regular expressions, both the backtracking implementations and the non-backtracking implementation have their place. The main benefit of the non-backtracking implementation is predictability: because of the linear processing guarantee, once you’ve constructed the regex, you don’t need to worry about malicious inputs causing worst-case behavior in the processing of your potentially susceptible expressions. This doesn’t mean RegexOptions.NonBacktracking is always the fastest; in fact, it’s frequently not. In exchange for reduced best-case performance, it provides the best worst-case performance, and for some kinds of applications, that’s a really worthwhile and valuable tradeoff.
New APIs
Regex gets several new methods in .NET 7, all of which enable improved performance. The simplicity of the new APIs likely also belies how much work was necessary to enable them, in particular because the new APIs all support ReadOnlySpan<char> inputs into the regex engines.
dotnet/runtime#65473 brings Regex into the span-based era of .NET, overcoming a significant limitation in Regex since spans were introduced back in .NET Core 2.1. Regex has historically been based on processing System.String inputs, and that fact pervades the Regex design and implementation, including the APIs exposed for the extensibility model Regex.CompileToAssembly relied on in .NET Framework (CompileToAssembly is now obsoleted and has never been functional in .NET Core). One subtlety that relies on the nature of string as the input is how match information is returned to callers. Regex.Match returns a Match object that represents the first match in the input, and that Match object exposes a NextMatch method that enables moving to the next match. That means the Match object needs to store a reference to the input, so that it can be fed back into the matching engine as part of such a NextMatch call. If that input is a string, great, no problem. But if that input is a ReadOnlySpan<char>, that span as a ref struct can’t be stored on the class Match object, since ref structs can only live on the stack and not the heap. That alone would make it a challenge to support spans, but the problem is even more deeply rooted. All of the regex engines rely on a RegexRunner, a base class that stores on it all of the state necessary to feed into the FindFirstChar and Go methods that compose the actual matching logic for the regular expressions (these methods contain all of the core code for performing the match, with FindFirstChar being an optimization to skip past input positions that couldn’t possibly start a match and then Go performing the actual matching logic). If you look at the internal RegexInterpreter type, which is the engine you get when you construct a new Regex(...) without the RegexOptions.Compiled or RegexOptions.NonBacktracking flags, it derives from RegexRunner. Similarly, when you use RegexOptions.Compiled, it hands off the dynamic methods it reflection-emits to a type derived from RegexRunner, RegexOptions.NonBacktracking has a SymbolicRegexRunnerFactory that produces types derived from RegexRunner, and so on. Most relevant here, RegexRunner is public, because the types generated by Regex.CompileToAssembly (and now the regex source generator) include ones derived from this RegexRunner. Those FindFirstChar and Go methods are thus abstract and protected, and parameterless, because they pick up all the state they need from protected members on the base class. That includes the string input to process. So what about spans? We could of course have just called ToString() on an input ReadOnlySpan<char>. That would have been functionally correct, but would have completely defeated the purpose of accepting spans, and worse, would have been so unexpected as to likely cause consuming apps to perform worse than they would have without the APIs. Instead, we needed a new approach and new APIs.
First, we made FindFirstChar and Go virtual instead of abstract. The design that splits these methods is largely antiquated, and in particular the forced separation between a stage of processing where you find the next possible location of a match and then a stage where you actually perform the match at that location doesn’t align well with all engines, like the one used by NonBacktracking (which initially implemented FindFirstChar as a nop and had all its logic in Go). Then we added a new virtual Scan method which, importantly, takes a ReadOnlySpan<char> as a parameter; the span can’t be exposed from the base RegexRunner and must be passed in. We then implemented FindFirstChar and Go in terms of Scan, and made them “just work.” Then, all of the engines are implemented in terms of that span; they no longer need to access the protected RegexRunner.runtext, RegexRunner.runtextbeg, and RegexRunner.runtextend members that surface the input; they’re just handed the span, already sliced to the input region, and process that. One of the neat things about this from a performance perspective is it enables the JIT to do a better job at shaving off various overheads, in particular around bounds checking. When the logic is implemented in terms of string, in addition to the input string itself the engine is also handed the beginning and end of the region of the input to process (since the developer could have called a method like Regex.Match(string input, int beginning, int length) in order to only process a substring). Obviously the engine matching logic is way more complicated than this, but simplifying, imagine the entirety of the engine was just a loop over the input. With the input, beginning, and length, that would look like:
[Benchmark]
[Arguments("abc", 0, 3)]
public void Scan(string input, int beginning, int length)
{
    for (int i = beginning; i < length; i++)
    {
        Check(input[i]);
    }
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private void Check(char c) { }
That will result in the JIT generating assembly code along the lines of this:
; Program.Scan(System.String, Int32, Int32)
       sub       rsp,28
       cmp       r8d,r9d
       jge       short M00_L01
       mov       eax,[rdx+8]
M00_L00:
       cmp       r8d,eax
       jae       short M00_L02
       inc       r8d
       cmp       r8d,r9d
       jl        short M00_L00
M00_L01:
       add       rsp,28
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 36
In contrast, if we’re dealing with a span, which already factors in the bounds, then we can write a more canonical loop like this:
[Benchmark]
[Arguments("abc")]
public void Scan(ReadOnlySpan<char> input)
{
    for (int i = 0; i < input.Length; i++)
    {
        Check(input[i]);
    }
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private void Check(char c) { }
And when it comes to compilers, something in a canonical form is really good, because the more common the shape of the code, the more likely it is to be heavily optimized:
; Program.Scan(System.ReadOnlySpan`1<Char>)
       mov       rax,[rdx]
       mov       edx,[rdx+8]
       xor       ecx,ecx
       test      edx,edx
       jle       short M00_L01
M00_L00:
       mov       r8d,ecx
       movsx     r8,word ptr [rax+r8*2]
       inc       ecx
       cmp       ecx,edx
       jl        short M00_L00
M00_L01:
       ret
; Total bytes of code 27
So even without all the other benefits that come from operating in terms of span, we immediately get low-level code generation benefits from performing all the logic in terms of spans. While the above example was made up (obviously the matching logic does more than a simple for loop), here’s a real example. When a regex contains a \b, as part of evaluating the input against that \b the backtracking engines call a RegexRunner.IsBoundary helper method which checks whether the character at the current position is a word character and whether the character before it is a word character (factoring in the bounds of the input as well). Here’s what the IsBoundary method based on string looked like (the runtext it’s using is the name of the string field on RegexRunner that stores the input):
[Benchmark]
[Arguments(0, 0, 26)]
public bool IsBoundary(int index, int startpos, int endpos)
{
    return (index > startpos && IsBoundaryWordChar(runtext[index - 1])) !=
           (index < endpos && IsBoundaryWordChar(runtext[index]));
}

[MethodImpl(MethodImplOptions.NoInlining)]
private bool IsBoundaryWordChar(char c) => false;
and here’s what the span version looks like:
[Benchmark]
[Arguments("abcdefghijklmnopqrstuvwxyz", 0)]
public bool IsBoundary(ReadOnlySpan<char> inputSpan, int index)
{
    int indexM1 = index - 1;
    return ((uint)indexM1 < (uint)inputSpan.Length && IsBoundaryWordChar(inputSpan[indexM1])) !=
           ((uint)index < (uint)inputSpan.Length && IsBoundaryWordChar(inputSpan[index]));
}

[MethodImpl(MethodImplOptions.NoInlining)]
private bool IsBoundaryWordChar(char c) => false;
And here’s the resulting assembly:
; Program.IsBoundary(Int32, Int32, Int32)
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,28
       mov       rdi,rcx
       mov       esi,edx
       mov       ebx,r9d
       cmp       esi,r8d
       jle       short M00_L00
       mov       rcx,rdi
       mov       rcx,[rcx+8]
       lea       edx,[rsi-1]
       cmp       edx,[rcx+8]
       jae       short M00_L04
       mov       edx,edx
       movzx     edx,word ptr [rcx+rdx*2+0C]
       mov       rcx,rdi
       call      qword ptr [Program.IsBoundaryWordChar(Char)]
       jmp       short M00_L01
M00_L00:
       xor       eax,eax
M00_L01:
       mov       ebp,eax
       cmp       esi,ebx
       jge       short M00_L02
       mov       rcx,rdi
       mov       rcx,[rcx+8]
       cmp       esi,[rcx+8]
       jae       short M00_L04
       mov       edx,esi
       movzx     edx,word ptr [rcx+rdx*2+0C]
       mov       rcx,rdi
       call      qword ptr [Program.IsBoundaryWordChar(Char)]
       jmp       short M00_L03
M00_L02:
       xor       eax,eax
M00_L03:
       cmp       ebp,eax
       setne     al
       movzx     eax,al
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       ret
M00_L04:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 117

; Program.IsBoundary(System.ReadOnlySpan`1<Char>, Int32)
       push      r14
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,20
       mov       rdi,rcx
       mov       esi,r8d
       mov       rbx,[rdx]
       mov       ebp,[rdx+8]
       lea       edx,[rsi-1]
       cmp       edx,ebp
       jae       short M00_L00
       mov       edx,edx
       movzx     edx,word ptr [rbx+rdx*2]
       mov       rcx,rdi
       call      qword ptr [Program.IsBoundaryWordChar(Char)]
       jmp       short M00_L01
M00_L00:
       xor       eax,eax
M00_L01:
       mov       r14d,eax
       cmp       esi,ebp
       jae       short M00_L02
       mov       edx,esi
       movzx     edx,word ptr [rbx+rdx*2]
       mov       rcx,rdi
       call      qword ptr [Program.IsBoundaryWordChar(Char)]
       jmp       short M00_L03
M00_L02:
       xor       eax,eax
M00_L03:
       cmp       r14d,eax
       setne     al
       movzx     eax,al
       add       rsp,20
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       ret
; Total bytes of code 94
The most interesting thing to notice here is the:
call      CORINFO_HELP_RNGCHKFAIL
int       3
at the end of the first version that doesn’t exist at the end of the second. As we saw earlier, this is what the generated assembly looks like when the JIT is emitting the code to throw an index out of range exception for an array, string, or span. It’s at the end because it’s considered to be “cold,” rarely executed. It exists in the first because the JIT can’t prove based on local analysis of that function that the runtext[index - 1] and runtext[index] accesses will be in range of the string (it can’t know or trust any implied relationship between startpos, endpos, and the bounds of runtext). But in the second, the JIT can know and trust that the ReadOnlySpan<char>’s lower bound is 0 and upper bound (exclusive) is the span’s Length, and with how the method is constructed, it can then prove that the span accesses are always in bounds. As such, it doesn’t need to emit any bounds checks in the method, and the method then lacks the tell-tale signature of the index out of range throw. You can see more examples of taking advantage of spans now being at the heart of all of the engines in dotnet/runtime#66129, dotnet/runtime#66178, and dotnet/runtime#72728, all of which clean up unnecessary checks against bounds that are then always 0 and span.Length.
Ok, so the engines are now able to be handed span inputs and process them, great, what can we do with that? Well, Regex.IsMatch is easy: it’s not encumbered by needing to perform multiple matches, and thus doesn’t need to worry about how to store that input ReadOnlySpan<char> for the next match. Similarly, the new Regex.Count, which provides an optimized implementation for counting how many matches there are in the input, can bypass using Match or MatchCollection, and thus can easily operate over spans as well; dotnet/runtime#64289 added string-based overloads, and dotnet/runtime#66026 added span-based overloads. We can optimize Count further by passing additional information into the engines to let them know how much information they actually need to compute. For example, I noted previously that NonBacktracking is fairly pay-for-play in how much work it needs to do relative to what information it needs to gather. It’s cheapest to just determine whether there is a match, as it can do that in a single forward pass through the input. If it also needs to compute the actual starting and ending bounds, that requires another reverse pass through some of the input. And if it then also needs to compute capture information, that requires yet another forward pass based on an NFA (even if the other two were DFA-based). Count needs the bounds information, as it needs to know where to start looking for the next match, but it doesn’t need the capture information, since none of that capture information is handed back to the caller. dotnet/runtime#68242 updates the engines to receive this additional information, such that methods like Count can be made more efficient.
So, IsMatch and Count can work with spans. But we still don’t have a method that lets you actually get back that match information. Enter the new EnumerateMatches method, added by dotnet/runtime#67794. EnumerateMatches is very similar to Match, except instead of handing back a Match class instance, it hands back a ref struct enumerator:
public ref struct ValueMatchEnumerator
{
    private readonly Regex _regex;
    private readonly ReadOnlySpan<char> _input;
    private ValueMatch _current;
    private int _startAt;
    private int _prevLen;
    ...
}
Being a ref struct, the enumerator is able to store a reference to the input span, and is thus able to iterate through matches, which are represented by the ValueMatch ref struct. Notably, today ValueMatch doesn’t provide capture information, which also enables it to partake in the optimizations previously mentioned for Count. Even if you have an input string, EnumerateMatches is thus a way to have amortized allocation-free enumeration of all matches in the input. In .NET 7, though, there isn’t a way to have such allocation-free enumeration if you also need all the capture data. That’s something we’ll investigate designing in the future if/as needed.
TryFindNextPossibleStartingPosition
As noted earlier, the core of all of the engines is a Scan(ReadOnlySpan<char>) method that accepts the input text to match, combines that with positional information from the base instance, and exits when it either finds the location of the next match or exhausts the input without finding another. For the backtracking engines, the implementation of that method is logically as follows:
protected override void Scan(ReadOnlySpan<char> inputSpan)
{
    while (!TryMatchAtCurrentPosition(inputSpan) &&
           base.runtextpos != inputSpan.Length)
    {
        base.runtextpos++;
    }
}
We try to match the input at the current position, and if we’re successful in doing so, that’s it, we exit. If the current position doesn’t match, however, then if there’s any input remaining we “bump” the position and start the process over. In regex engine terminology, this is often referred to as a “bumpalong loop.” However, if we actually ran the full matching process at every input character, that could be unnecessarily slow. For many patterns, there’s something about the pattern that would enable us to be more thoughtful about where we perform full matches, quickly skipping past locations that couldn’t possibly match, and only spending our time and resources on locations that have a real chance of matching. To elevate that concept to a first-class one, the backtracking engines’ “bumpalong loop” is typically more like the following (I say “typically” because in some cases the compiled and source generated regexes are able to generate something even better).
protected override void Scan(ReadOnlySpan<char> inputSpan)
{
    while (TryFindNextPossibleStartingPosition(inputSpan) &&
           !TryMatchAtCurrentPosition(inputSpan) &&
           base.runtextpos != inputSpan.Length)
    {
        base.runtextpos++;
    }
}
As with FindFirstChar previously, TryFindNextPossibleStartingPosition has the responsibility of searching as quickly as possible for the next place to match (or determining that nothing else could possibly match, in which case it returns false and the loop exits). And as FindFirstChar was, it’s imbued with multiple ways of doing its job; in .NET 7, TryFindNextPossibleStartingPosition learns many more and improved ways of helping the engine be fast.
In .NET 6, the interpreter engine had effectively two ways of implementing TryFindNextPossibleStartingPosition: a Boyer-Moore substring search if the pattern began with a string (potentially case-insensitive) of at least two characters, and a linear scan for a character class known to be the set of all possible chars that could begin a match. For the latter case, the interpreter had eight different implementations for matching, based on a combination of whether RegexOptions.RightToLeft was set or not, whether the character class required case-insensitive comparison or not, and whether the character class contained only a single character or more than one character. Some of these were more optimized than others, e.g. a left-to-right, case-sensitive, single-char search would use an IndexOf(char) to search for the next location, an optimization added in .NET 5. However, every time this operation was performed, the engine would need to recompute which case it was in. dotnet/runtime#60822 improved this, introducing an internal enum of the strategies used by TryFindNextPossibleStartingPosition to find the next opportunity, adding a switch to TryFindNextPossibleStartingPosition to quickly jump to the right strategy, and precomputing which strategy to use when the interpreter was constructed. This not only made the interpreter’s implementation at match time faster, it made it effectively free (in terms of runtime overhead at match time) to add additional strategies.
dotnet/runtime#60888 then added the first additional strategy. The implementation was already capable of using IndexOf(char), but as mentioned previously in this post, the implementation of IndexOf(ReadOnlySpan<char>) got way better in .NET 7 in many cases, to the point where it ends up being significantly better than Boyer-Moore in all but the most corner of corner cases. So this PR enables a new IndexOf(ReadOnlySpan<char>) strategy to be used to search for a prefix string in the case where the string is case-sensitive.
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

private Regex _regex = new Regex(@"\belementary\b", RegexOptions.Compiled);

[Benchmark]
public int Count() => _regex.Matches(s_haystack).Count;
Method | Runtime | Mean | Ratio |
---|---|---|---|
Count | .NET 6.0 | 377.32 us | 1.00 |
Count | .NET 7.0 | 55.44 us | 0.15 |
dotnet/runtime#61490 then removed Boyer-Moore entirely. This wasn’t done in the previously mentioned PR because of the lack of a good way to handle case-insensitive matches. However, this PR also special-cased ASCII letters to teach the optimizer how to turn an ASCII case-insensitive match into a set of both casings of that letter (excluding the few known to be a problem, like i and k, which can both be impacted by the employed culture and which might map case-insensitively to more than two values). With enough of the common cases covered, rather than use Boyer-Moore to perform a case-insensitive search, the implementation just uses IndexOfAny(char, char, ...) to search for the starting set, and the vectorization employed by IndexOfAny ends up outpacing the old implementation handily in real-world cases. This PR goes further than that, such that it doesn’t just discover the “starting set,” but is able to find all of the character classes that could match a pattern at a fixed offset from the beginning; that then gives the analyzer the ability to choose the set that’s expected to be least common and issue a search for it instead of whatever happens to be at the beginning. The PR goes even further, too, motivated in large part by the non-backtracking engine. The non-backtracking engine’s prototype implementation also used IndexOfAny(char, char, ...) when it arrived at a starting state and was thus able to quickly skip through input text that wouldn’t have a chance of pushing it to the next state. We wanted all of the engines to share as much logic as possible, in particular around this speed-ahead, and so this PR unified the interpreter with the non-backtracking engine to have them share the exact same TryFindNextPossibleStartingPosition routine (which the non-backtracking engine just calls at an appropriate place in its graph traversal loop). Since the non-backtracking engine was already using IndexOfAny in this manner, initially not doing so popped as a significant regression on a variety of patterns we measure, and this caused us to invest in using it everywhere. This PR also introduced the first special-casing for case-insensitive comparisons into the compiled engine, e.g. if we found a set that was [Ee], rather than emitting a check akin to c == 'E' || c == 'e', we’d instead emit a check akin to (c | 0x20) == 'e' (those fun ASCII tricks discussed earlier coming into play again).
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

private Regex _regex = new Regex(@"\belementary\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);

[Benchmark]
public int Count() => _regex.Matches(s_haystack).Count;
Method | Runtime | Mean | Ratio |
---|---|---|---|
Count | .NET 6.0 | 499.3 us | 1.00 |
Count | .NET 7.0 | 177.7 us | 0.35 |
The previous PR started turning IgnoreCase pattern text into sets, in particular for ASCII, e.g. (?i)a would become [Aa]. That PR hacked in the support for ASCII knowing that something more complete would be coming along, as it did in dotnet/runtime#67184. Rather than hardcoding the case-insensitive sets that just the ASCII characters map to, this PR essentially hardcodes the sets for every possible char. Once that’s done, we no longer need to know about case-insensitivity at match time and can instead just double-down on efficiently matching sets, which we already need to be able to do well. Now, I said it encodes the sets for every possible char; that’s not entirely true. If it were true, that would take up a large amount of memory, and in fact, most of that memory would be wasted, because the vast majority of characters don’t participate in case conversion… there are only ~2,000 characters that we need to handle. As such, the implementation employs a three-tier table scheme. The first table has 64 elements, dividing the full range of chars into 64 groupings; of those 64 groups, 54 of them have no characters that participate in case conversion, so if we hit one of those entries, we can immediately stop the search. For the remaining 10 that do have at least one character in their range participating, the character and the value from the first table are used to compute an index into the second table; there, too, the majority of entries say that nothing participates in case conversion. It’s only if we get a legitimate hit in the second table that we then get an index into the third table, at which location we can find all of the characters considered case-equivalent with the first.
dotnet/runtime#63477 (and then later improved in dotnet/runtime#66572) proceeded to add another searching strategy, this one inspired by nim-regex’s literal optimizations. There are a multitude of regexes we track from a performance perspective to ensure we’re not regressing in common cases and to help guide investments. One is the set of patterns in the mariomka/regex-benchmark languages benchmark. One of those patterns is for URIs: @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?". This pattern defies the thus-far enabled strategies for finding a next good location, as it’s guaranteed to begin with a “word character” (\w), which includes ~50,000 of the ~65,000 possible characters; we don’t have a good way of vectorizing a search for such a character class. However, this pattern is interesting in that it begins with a loop, and not only that, it’s an upper-unbounded loop which our analysis will determine is atomic, because the character guaranteed to immediately follow the loop is a ':', which is itself not a word character, and thus there’s nothing the loop could match and give up as part of backtracking that would match ':'. That all lends itself to a different approach to vectorization: rather than trying to search for the \w character class, we can instead search for the substring "://", and then once we find it, we can match backwards through as many [\w]s as we can find; in this case, the only constraint is that we need to match at least one. This PR added that strategy, for a literal after an atomic loop, to all of the engines.
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

private Regex _regex = new Regex(@"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?", RegexOptions.Compiled);

[Benchmark]
public bool IsMatch() => _regex.IsMatch(s_haystack); // Uri's in Sherlock Holmes? "Most unlikely."
Method | Runtime | Mean | Ratio |
---|---|---|---|
IsMatch | .NET 6.0 | 4,291.77 us | 1.000 |
IsMatch | .NET 7.0 | 42.40 us | 0.010 |
Of course, as has been talked about elsewhere, the best optimizations aren’t ones that make something faster but rather ones that make something entirely unnecessary. That’s what dotnet/runtime#64177 does, in particular in relation to anchors. The .NET regex implementation has long had optimizations for patterns with a starting anchor: if the pattern begins with ^, for example (and RegexOptions.Multiline wasn’t specified), the pattern is rooted to the beginning, meaning it can’t possibly match at any position other than 0; as such, with such an anchor, TryFindNextPossibleStartingPosition won’t do any searching at all. The key here, though, is being able to detect whether the pattern begins with such an anchor. In some cases, like ^abc$, that’s trivial. In other cases, like ^abc|^def, the existing analysis had trouble seeing through that alternation to find the guaranteed starting ^ anchor. This PR fixes that. It also adds a new strategy based on discovering that a pattern has an ending anchor like $. If the analysis engine can determine a maximum number of characters for any possible match, and the pattern has such an anchor, then it can simply jump to that distance from the end of the string and bypass even looking at anything before then.
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

private Regex _regex = new Regex(@"^abc|^def", RegexOptions.Compiled);

[Benchmark]
public bool IsMatch() => _regex.IsMatch(s_haystack); // Why search _all_ the text?!
Method | Runtime | Mean | Ratio |
---|---|---|---|
IsMatch | .NET 6.0 | 867,890.56 ns | 1.000 |
IsMatch | .NET 7.0 | 33.55 ns | 0.000 |
dotnet/runtime#67732 is another PR related to improving anchor handling. It’s always fun when a bug fix or code simplification refactoring turns into a performance improvement. The PR’s primary purpose was to simplify some complicated code that was computing the set of characters that could possibly start a match. It turns out that complication was hiding a logic bug which manifested in it missing some opportunities to report valid starting character classes, the impact of which is that some searches which could have been vectorized weren’t. By simplifying the implementation, the bug was fixed, exposing more performance opportunities.
By this point, the engines are able to use IndexOf(ReadOnlySpan<char>) to find a substring at the beginning of a pattern. But sometimes the most valuable substring isn’t at the beginning, but somewhere in the middle or even at the end. As long as it’s at a fixed offset from the beginning of the pattern, we can search for it and then just back off by the offset to the position where we should actually try running the match. dotnet/runtime#67907 does exactly that.
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

private Regex _regex = new Regex(@"looking|feeling", RegexOptions.Compiled);

[Benchmark]
public int Count() => _regex.Matches(s_haystack).Count; // will search for "ing"
Method | Runtime | Mean | Ratio |
---|---|---|---|
Count | .NET 6.0 | 444.2 us | 1.00 |
Count | .NET 7.0 | 122.6 us | 0.28 |
Loops and Backtracking
Loop handling in the compiled and source generated engines has been significantly improved, both with respect to processing them faster and with respect to backtracking less.
With regular greedy loops (e.g. c*), there are two directions to be concerned about: how quickly we can consume all the elements that match the loop, and how quickly we can give back elements that might be necessary as part of backtracking for the remainder of the expression to match. And with lazy loops, we’re primarily concerned with backtracking, which is the forward direction (since lazy loops consume as part of backtracking rather than giving back as part of backtracking). With PRs dotnet/runtime#63428, dotnet/runtime#68400, dotnet/runtime#64254, and dotnet/runtime#73910, in both the compiler and source generator we now make full use of effectively all of the variants of IndexOf, IndexOfAny, LastIndexOf, LastIndexOfAny, IndexOfAnyExcept, and LastIndexOfAnyExcept in order to speed along these searches. For example, in a pattern like .*abc, the forward direction of that loop entails consuming every character until the next newline, which we can optimize with an IndexOf('\n'). Then as part of backtracking, rather than giving up one character at a time, we can LastIndexOf("abc") in order to find the next viable location that could possibly match the remainder of the pattern. Or, for example, in a pattern like [^a-c]*def, the loop will initially greedily consume everything other than 'a', 'b', or 'c', so we can use IndexOfAnyExcept('a', 'b', 'c') to find the initial end of the loop. And so on. This can yield huge performance gains, and with the source generator, it also makes the generated code more idiomatic and easier to understand.
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

private Regex _regex = new Regex(@"^.*elementary.*$", RegexOptions.Compiled | RegexOptions.Multiline);

[Benchmark]
public int Count() => _regex.Matches(s_haystack).Count;
Method | Runtime | Mean | Ratio |
---|---|---|---|
Count | .NET 6.0 | 3,369.5 us | 1.00 |
Count | .NET 7.0 | 430.2 us | 0.13 |
Sometimes optimizations are well-intended but slightly miss the mark. dotnet/runtime#63398 fixes such an issue with an optimization introduced in .NET 5; the optimization was valuable, but only for a subset of the scenarios it was intended to cover. While TryFindNextPossibleStartingPosition’s primary raison d’être is to update the bumpalong position, it’s also possible for TryMatchAtCurrentPosition to do so. One of the occasions on which it’ll do so is when the pattern begins with an upper-unbounded single-character greedy loop. Since processing starts with the loop having fully consumed everything it could possibly match, subsequent trips through the scan loop don’t need to reconsider any starting position within that loop; doing so would just be duplicating work done in a previous iteration of the scan loop. And as such, TryMatchAtCurrentPosition can update the bumpalong position to the end of the loop. The optimization added in .NET 5 was dutifully doing this, and it did so in a way that fully handled atomic loops. But with greedy loops, the stored position was getting updated every time we backtracked, meaning it started going backwards, when it should have remained at the end of the loop. This PR fixes that, yielding significant savings in the additional covered cases.
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

private Regex _regex = new Regex(@".*stephen", RegexOptions.Compiled);

[Benchmark]
public int Count() => _regex.Matches(s_haystack).Count;
Method | Runtime | Mean | Ratio |
---|---|---|---|
Count | .NET 6.0 | 103,962.8 us | 1.000 |
Count | .NET 7.0 | 336.9 us | 0.003 |
As mentioned elsewhere, the best optimizations are those that make work entirely vanish rather than just making work faster. dotnet/runtime#68989, dotnet/runtime#63299, and dotnet/runtime#63518 do exactly that by improving the pattern analyzer’s ability to find and eliminate more unnecessary backtracking, a process the analyzer refers to as “auto-atomicity” (automatically making loops atomic). For example, in the pattern a*?b, we have a lazy loop of 'a's followed by a b. That loop can only match 'a's, and 'a' doesn’t overlap with 'b'. So let’s say the input is "aaaaaaaab". The loop is lazy, so we’ll start out by trying to match just 'b'. It won’t match, so we’ll backtrack into the lazy loop and try to match "ab". That won’t match, so we’ll backtrack into the lazy loop and try to match "aab". And so on, until we’ve consumed all the 'a's such that the rest of the pattern has a chance of matching the rest of the input. That’s exactly what an atomic greedy loop does, so we can transform the pattern a*?b into (?>a*)b, which is much more efficiently processed. In fact, we can see exactly how it’s processed just by looking at the source-generated implementation of this pattern:
private bool TryMatchAtCurrentPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    int matchStart = pos;
    ReadOnlySpan<char> slice = inputSpan.Slice(pos);

    // Match 'a' atomically any number of times.
    {
        int iteration = slice.IndexOfAnyExcept('a');
        if (iteration < 0)
        {
            iteration = slice.Length;
        }

        slice = slice.Slice(iteration);
        pos += iteration;
    }

    // Advance the next matching position.
    if (base.runtextpos < pos)
    {
        base.runtextpos = pos;
    }

    // Match 'b'.
    if (slice.IsEmpty || slice[0] != 'b')
    {
        return false; // The input didn't match.
    }

    // The input matched.
    pos++;
    base.runtextpos = pos;
    base.Capture(0, matchStart, pos);
    return true;
}
(Note that those comments aren’t ones I added for this blog post; the source generator itself is emitting commented code.)
When a regular expression is input, it’s parsed into a tree-based form. The “auto-atomicity” analysis discussed above is one form of analysis that walks around this tree looking for opportunities to transform portions of the tree into a behaviorally equivalent alternative that will be more efficient to execute. Several PRs introduced additional such transformations. dotnet/runtime#63695, for example, looks for “empty” and “nothing” nodes in the tree that can be removed. An “empty” node is something that matches the empty string, so for example in the alternation abc|def||ghi, the third branch of that alternation is empty. A “nothing” node is something that can’t match anything, so for example in the concatenation abc(?!)def, that (?!) in the middle is a negative lookahead around an empty, which can’t possibly match anything, as it’s saying the expression won’t match if it’s followed by an empty string, which everything is. These constructs often arise as a result of other transformations rather than being something a developer typically writes by hand, just as there are optimizations in the JIT where you might look at them and say “why on earth is that something a developer would write?” but they end up being valuable optimizations anyway because inlining might transform perfectly reasonable code into something that matches the target pattern. Thus, for example, if you did have abc(?!)def, since that concatenation requires the (?!) to match in order to be successful, the concatenation itself can simply be replaced by a “nothing.” You can see this easily if you try this with the source generator:
[GeneratedRegex(@"abc(?!)def")]
as it will produce a Scan method like this (comment and all):
protected override void Scan(ReadOnlySpan<char> inputSpan)
{
    // The pattern never matches anything.
}
Another set of transformations was introduced in dotnet/runtime#59903, specifically around alternations (which, beyond loops, are the other source of backtracking). This introduced two main optimizations. First, it enables rewriting alternations into alternations of alternations, e.g. transforming axy|axz|bxy|bxz into ax(?:y|z)|bx(?:y|z), which is then further reduced into ax[yz]|bx[yz]. This can enable the backtracking engines to more efficiently process alternations due to fewer branches and thus less potential backtracking. The PR also enabled limited reordering of branches in an alternation. Generally branches can’t be reordered, as the order can impact exactly what’s matched and what’s captured, but if the engine can prove there’s no effect on ordering, then it’s free to reorder. One key place that ordering isn’t a factor is if the alternation is atomic due to it being wrapped in an atomic group (and the auto-atomicity analysis will add such groups implicitly in some situations). Reordering the branches then enables other optimizations, like the one previously mentioned from this PR. And then once those optimizations have kicked in, if we’re left with an atomic alternation where every branch begins with a different letter, that can enable further optimizations in terms of how the alternation is lowered; this PR teaches the source generator how to emit a switch statement, which leads to both more efficient and more readable code. (The detection of whether nodes in the tree are atomic, and other such properties such as whether they perform captures or introduce backtracking, turned out to be valuable enough that dotnet/runtime#65734 added dedicated support for it.)
Code generation
The .NET 7 regex implementation has no fewer than four engines: the interpreter (what you get if you don’t explicitly choose another engine), the compiler (what you get with RegexOptions.Compiled), the non-backtracking engine (what you get with RegexOptions.NonBacktracking), and the source generator (what you get with [GeneratedRegex(...)]). The interpreter and the non-backtracking engine don’t require any kind of code generation; they’re both based on creating in-memory data structures that represent how to match input against the pattern. The other two, though, both generate code specific to the pattern; the generated code attempts to mimic what you might write if you weren’t using Regex at all and were instead writing code to perform a similar match directly. The source generator spits out C# that’s compiled directly into your assembly, and the compiler spits out IL at run-time via reflection emit. The fact that these are generating code specific to the pattern means there’s a ton of opportunity to optimize.
dotnet/runtime#59186 provided the initial implementation of the source generator. This was a direct port of the compiler, effectively a line-by-line translation of IL into C#; the result is C# akin to what you’d get if you were to run the generated IL through a decompiler like ILSpy. A bunch of PRs then proceeded to iterate on and tweak the source generator, but the biggest improvements came from changes that changed the compiler and the source generator together. Prior to .NET 5, the compiler spit out IL that was very similar to what the interpreter would do. The interpreter is handed a series of instructions that it walks through one by one and interprets, and the compiler, handed that same series of instructions, would just emit the IL for processing each. It had some opportunity for being more efficient, e.g. loop unrolling, but a lot of value was left on the table. In .NET 5, an alternate path was added in support of patterns without backtracking; this code path was based on being handed the parsed node tree rather than the series of instructions, and that higher-level form enabled the compiler to derive more insights about the pattern that it could then use to generate more efficient code. In .NET 7, support for all regex features was incrementally added in, over the course of multiple PRs, in particular dotnet/runtime#60385 for backtracking single-char loops, dotnet/runtime#61698 for backtracking single-char lazy loops, dotnet/runtime#61784 for other backtracking lazy loops, and dotnet/runtime#61906 for other backtracking loops as well as backreferences and conditionals. At that point, the only features missing were support for RegexOptions.RightToLeft and lookbehinds (which are implemented in terms of right-to-left), and we decided, based on the relatively little use of these features, that we needn’t keep around the old compiler code just to enable them. So, dotnet/runtime#62318 deleted the old implementation. But, even though these features are relatively rare, it’s a lot easier to tell a story that “all patterns are supported” than one that requires special callouts and exceptions, so dotnet/runtime#66127 and dotnet/runtime#66280 added full lookbehind and RightToLeft support such that there were no takebacks. At this point, both the compiler and source generator supported everything the compiler previously did, but with the more modernized code generation. This code generation is in turn what enables many of the optimizations previously discussed, e.g. it provides the opportunity to use APIs like LastIndexOf as part of backtracking, which would have been near impossible with the previous approach.
One of the great things about the source generator emitting idiomatic C# is that it makes it easy to iterate. Every time you put in a pattern and see what the generator emits, it’s like being asked to do a code review of someone else’s code, and you very frequently see something “new” worthy of comment, or in this case, of improving the generator to address the issue. And so a bunch of PRs originated from reviewing what the generator emitted and then tweaking the generator to do better (and since the compiler was effectively entirely rewritten along with the source generator, they maintain the same structure, and it’s easy to port improvements from one to the other). For example, dotnet/runtime#68846 and dotnet/runtime#69198 tweaked how some comparisons were being performed in order for them to convey enough information to the JIT that it can eliminate some subsequent bounds checking, and dotnet/runtime#68490 recognized a variety of conditions being emitted that could never happen in some statically observable situations and was able to elide all that code gen. It also became obvious that some patterns didn’t need the full expressivity of the scan loop, and a more compact and customized Scan implementation could be used. dotnet/runtime#68560 does that, such that, for example, a simple pattern like hello won’t emit a loop at all and will instead have a simpler Scan implementation like:
protected override void Scan(ReadOnlySpan<char> inputSpan)
{
    if (TryFindNextPossibleStartingPosition(inputSpan))
    {
        // The search in TryFindNextPossibleStartingPosition performed the entire match.
        int start = base.runtextpos;
        int end = base.runtextpos = start + 5;
        base.Capture(0, start, end);
    }
}
The compiler and source generator were also updated to take advantage of newer features. dotnet/runtime#63277, for example, teaches the source generator how to determine whether unsafe code is allowed, and if it is, it emits a [SkipLocalsInit] for the core logic; the matching routine can result in many locals being emitted, and SkipLocalsInit can make it cheaper to call the function due to less zeroing being necessary. Then there’s the issue of where the code is generated; we want helper functions (like the \w IsWordChar helper introduced in dotnet/runtime#62620) that can be shared amongst multiple generated regexes, and we want to be able to share the exact same regex implementation if the same pattern/options/timeout combination is used in multiple places in the same assembly (dotnet/runtime#66747), but doing so then exposes this implementation detail to user code in the same assembly. To still be able to get the perf benefits of such code sharing while avoiding the resulting complications, dotnet/runtime#66432 and then dotnet/runtime#71765 teach the source generator to use the new file-local types feature in C# 11 (dotnet/roslyn#62375).
One last and interesting code generation aspect is in optimizations around character class matching. Matching character classes, whether ones explicitly written by the developer or ones implicitly created by the engine (e.g. as part of finding the set of all characters that can begin the expression), can be one of the more time-consuming aspects of matching; if you imagine having to evaluate this logic for every character in the input, then how many instructions need to be executed as part of matching a character class directly correlates with how long it takes to perform the overall match. We thus spend some time trying to ensure we generate optimal matching code for as many categories of character classes as possible. dotnet/runtime#67365, for example, improved a bunch of cases found to be common in real-world use, like specially recognizing sets such as [\d\D], [\s\S], and [\w\W] as meaning “match anything” (just as is the case for . in RegexOptions.Singleline mode), in which case existing optimizations around the handling of “match anything” can kick in.
private static readonly string s_haystack = new string('a', 1_000_000);
private Regex _regex = new Regex(@"([\s\S]*)", RegexOptions.Compiled);

[Benchmark]
public Match Match() => _regex.Match(s_haystack);
Method | Runtime | Mean | Ratio |
---|---|---|---|
Match | .NET 6.0 | 1,934,393.69 ns | 1.000 |
Match | .NET 7.0 | 91.80 ns | 0.000 |
Or dotnet/runtime#68924, which taught the source generator how to use all of the new char ASCII helper methods, like char.IsAsciiLetterOrDigit, as well as some existing helpers it didn't yet know about, in the generated output; for example, this:
[GeneratedRegex(@"[A-Za-z][A-Z][a-z][0-9][A-Za-z0-9][0-9A-F][0-9a-f][0-9A-Fa-f]\p{Cc}\p{L}[\p{L}\d]\p{Ll}\p{Lu}\p{N}\p{P}\p{Z}\p{S}")]
now produces this in the core matching logic emitted by the source generator:
if ((uint)slice.Length < 17 ||
    !char.IsAsciiLetter(slice[0]) ||        // Match a character in the set [A-Za-z].
    !char.IsAsciiLetterUpper(slice[1]) ||   // Match a character in the set [A-Z].
    !char.IsAsciiLetterLower(slice[2]) ||   // Match a character in the set [a-z].
    !char.IsAsciiDigit(slice[3]) ||         // Match '0' through '9'.
    !char.IsAsciiLetterOrDigit(slice[4]) || // Match a character in the set [0-9A-Za-z].
    !char.IsAsciiHexDigitUpper(slice[5]) || // Match a character in the set [0-9A-F].
    !char.IsAsciiHexDigitLower(slice[6]) || // Match a character in the set [0-9a-f].
    !char.IsAsciiHexDigit(slice[7]) ||      // Match a character in the set [0-9A-Fa-f].
    !char.IsControl(slice[8]) ||            // Match a character in the set [\p{Cc}].
    !char.IsLetter(slice[9]) ||             // Match a character in the set [\p{L}].
    !char.IsLetterOrDigit(slice[10]) ||     // Match a character in the set [\p{L}\d].
    !char.IsLower(slice[11]) ||             // Match a character in the set [\p{Ll}].
    !char.IsUpper(slice[12]) ||             // Match a character in the set [\p{Lu}].
    !char.IsNumber(slice[13]) ||            // Match a character in the set [\p{N}].
    !char.IsPunctuation(slice[14]) ||       // Match a character in the set [\p{P}].
    !char.IsSeparator(slice[15]) ||         // Match a character in the set [\p{Z}].
    !char.IsSymbol(slice[16]))              // Match a character in the set [\p{S}].
{
    return false; // The input didn't match.
}
Other changes impacting character class code generation included dotnet/runtime#72328, which improved the handling of character classes that involve character class subtraction; dotnet/runtime#72317 from @teo-tsirpanis, which enabled additional cases where the generator could avoid emitting a bitmap lookup; dotnet/runtime#67133, which added a tighter bounds check when it does emit such a lookup table; and dotnet/runtime#61562, which enables better normalization of character classes in the engine's internal representation, thus leading to downstream optimizations recognizing more character classes.
Finally, with all of these improvements to Regex, a multitude of PRs fixed up regexes used across dotnet/runtime in various ways. dotnet/runtime#66142, dotnet/runtime#66179 from @Clockwork-Muse, and dotnet/runtime#62325 from @Clockwork-Muse all converted Regex usage over to using [GeneratedRegex(...)]. dotnet/runtime#68961 optimized other usage in various ways. The PR replaced several Match(...).Success calls with IsMatch(...), as IsMatch has less overhead due to not needing to construct a Match instance and due to being able to avoid more expensive phases in the non-backtracking engine that compute exact bounds and capture information. The PR also replaced some Match/Match.NextMatch usage with EnumerateMatches, in order to avoid Match object allocations. And the PR entirely removed at least one regex usage that was just as easily done with a cheaper IndexOf. dotnet/runtime#68766 also removed a use of RegexOptions.CultureInvariant. Specifying CultureInvariant changes the behavior of IgnoreCase by altering which casing tables are employed; if IgnoreCase isn't specified and there are no inline case-insensitivity options ((?i)), then specifying CultureInvariant is a nop. But a potentially expensive one. For any code that's size conscious, the Regex implementation is structured to be as trimmer friendly as possible. If you only ever do new Regex(pattern), we'd really like to be able to statically determine that the compiler and non-backtracking implementations aren't needed, such that the trimmer can remove them without a visible and meaningful negative impact. However, the trimmer analysis isn't yet sophisticated enough to see exactly which options are used and only keep the additional engines linked in if RegexOptions.Compiled or RegexOptions.NonBacktracking is used; instead, any use of an overload that takes a RegexOptions will result in that code continuing to be referenced. By getting rid of the options, we increase the chances that no code in the app is using this constructor, which would in turn enable this constructor, the compiler, and the non-backtracking implementation to be trimmed away.
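As a reminder of what such a conversion looks like, moving from a constructed Regex to the source generator is typically a small, mechanical change; a minimal sketch (the pattern and names here are illustrative):

using System.Text.RegularExpressions;

public static partial class Validation
{
    // Before: constructed and compiled at run time.
    // private static readonly Regex s_hex = new Regex("^[0-9A-Fa-f]+$", RegexOptions.Compiled);

    // After: the implementation is emitted at build time by the regex source generator.
    [GeneratedRegex("^[0-9A-Fa-f]+$")]
    private static partial Regex HexRegex();

    public static bool IsHex(string text) => HexRegex().IsMatch(text);
}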
Collections
System.Collections hasn't seen as much investment in .NET 7 as it has in previous releases, though many of the lower-level improvements have a trickle-up effect into collections as well. For example, Dictionary<,>'s code hasn't changed between .NET 6 and .NET 7, but even so, this benchmark focused on dictionary lookups:
private Dictionary<int, int> _dictionary = Enumerable.Range(0, 10_000).ToDictionary(i => i);

[Benchmark]
public int Sum()
{
    Dictionary<int, int> dictionary = _dictionary;
    int sum = 0;

    for (int i = 0; i < 10_000; i++)
    {
        if (dictionary.TryGetValue(i, out int value))
        {
            sum += value;
        }
    }

    return sum;
}
shows a measurable improvement in throughput between .NET 6 and .NET 7:
Method | Runtime | Mean | Ratio | Code Size |
---|---|---|---|---|
Sum | .NET 6.0 | 51.18 us | 1.00 | 431 B |
Sum | .NET 7.0 | 43.44 us | 0.85 | 413 B |
Beyond that, there have been explicit improvements elsewhere in collections. ImmutableArray<T>, for example. As a reminder, ImmutableArray<T> is a very thin struct-based wrapper around a T[] that hides the mutability of T[]; unless you're using unsafe code, neither the length nor the shallow contents of an ImmutableArray<T> will ever change (by shallow, I mean the data stored directly in that array can't be mutated, but if there are mutable reference types stored in the array, those instances themselves may still have their data mutated). As a result, ImmutableArray<T> also has an associated "builder" type, which does support mutation: you create the builder, populate it, and then transfer its contents to an ImmutableArray<T>, which is frozen forevermore. In dotnet/runtime#70850 from @grbell-ms, the builder's Sort method was changed to use a span, which in turn avoids an IComparer<T> allocation and a Comparison<T> allocation, while also speeding up the sort itself by removing several layers of indirection from every comparison.
private ImmutableArray<int>.Builder _builder = ImmutableArray.CreateBuilder<int>();

[GlobalSetup]
public void Setup()
{
    _builder.AddRange(Enumerable.Range(0, 1_000));
}

[Benchmark]
public void Sort()
{
    _builder.Sort((left, right) => right.CompareTo(left));
    _builder.Sort((left, right) => left.CompareTo(right));
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
Sort | .NET 6.0 | 86.28 us | 1.00 |
Sort | .NET 7.0 | 67.17 us | 0.78 |
dotnet/runtime#61196 from @lateapexearlyspeed brings ImmutableArray<T> into the span-based era, adding around 10 new methods to ImmutableArray<T> that interoperate with Span<T> and ReadOnlySpan<T>. These are valuable from a performance perspective because they mean that if you have your data in a span, you can get it into an ImmutableArray<T> without incurring any allocation beyond the one the ImmutableArray<T> itself will create. dotnet/runtime#66550 from @RaymondHuy also adds a bunch of new methods to the immutable collection builders, which provide efficient implementations for operations like replacing elements and adding, inserting, and removing ranges.
SortedSet<T> also saw some improvements in .NET 7. For example, SortedSet<T> uses a red/black tree as its internal data structure, and it uses a Log2 operation to determine the maximum depth the tree could be for a given node count. Previously, that operation was implemented as a loop, but thanks to dotnet/runtime#58793 from @teo-tsirpanis, that implementation is now simply a call to BitOperations.Log2, which is in turn implemented trivially in terms of one of multiple hardware intrinsics if they're supported (e.g. Lzcnt.LeadingZeroCount, ArmBase.LeadingZeroCount, X86Base.BitScanReverse).
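BitOperations.Log2 is worth knowing about for your own code as well; a trivial illustration (the exact depth formula SortedSet<T> uses isn't shown here):

using System.Numerics;

// floor(log2(value)), computed with a single hardware instruction where available
// (lzcnt/bsr on x86, clz on Arm) instead of a software loop.
int log2 = BitOperations.Log2(1_000u); // 9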
And dotnet/runtime#56561 from @johnthcall improves SortedSet<T> copy performance by streamlining how the iteration through the nodes in the tree is handled.
[Params(100)]
public int Count { get; set; }

private static SortedSet<string> _set;

[GlobalSetup]
public void GlobalSetup()
{
    _set = new SortedSet<string>(StringComparer.OrdinalIgnoreCase);
    for (int i = 0; i < Count; i++)
    {
        _set.Add(Guid.NewGuid().ToString());
    }
}

[Benchmark]
public SortedSet<string> SortedSetCopy()
{
    return new SortedSet<string>(_set, StringComparer.OrdinalIgnoreCase);
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
SortedSetCopy | .NET 6.0 | 2.397 us | 1.00 |
SortedSetCopy | .NET 7.0 | 2.090 us | 0.87 |
One last PR to look at in collections: dotnet/runtime#67923. ConditionalWeakTable<TKey, TValue> is a collection most developers haven't used, but when you need it, you need it. It's used primarily for two purposes: to associate additional state with some object, and to maintain a weak collection of objects. Essentially, it's a thread-safe dictionary that doesn't maintain strong references to anything it stores but ensures that the value associated with a key will remain rooted as long as the associated key is rooted. It exposes many of the same APIs as ConcurrentDictionary<,>, but for adding items to the collection, it historically only had an Add method. That meant that if the design of the consuming code entailed trying to use the collection as a set, where duplicates were common, it would also be common to experience exceptions when trying to Add an item that already existed in the collection. Now in .NET 7, it has a TryAdd method, which enables such usage without potentially incurring the cost of those exceptions (and without needing to add try/catch blocks to defend against them).
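A minimal sketch of the difference (the table and values here are illustrative):

using System.Runtime.CompilerServices;

var extraState = new ConditionalWeakTable<object, string>();
object key = new object();

// Previously, avoiding an exception on a duplicate key meant wrapping Add in try/catch:
try { extraState.Add(key, "metadata"); } catch (ArgumentException) { /* already present */ }

// With .NET 7, the failure case is just a boolean result:
bool added = extraState.TryAdd(key, "metadata"); // false here, since the key was already added above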
LINQ
Let's move on to Language-Integrated Query (LINQ). LINQ is a productivity feature that practically every .NET developer uses. It enables otherwise complicated operations to be trivially expressed, whether via language-integrated query comprehension syntax or via direct use of the methods on System.Linq.Enumerable. That productivity and expressivity, however, comes at a bit of an overhead cost. In the vast majority of situations, those costs (such as delegate and closure allocations, delegate invocations, use of interface methods on arbitrary enumerables vs direct access to indexers and Length/Count properties, etc.) don't have a significant impact, but for really hot paths, they can and do show up in a meaningful way. This leads some folks to declare LINQ as being broadly off-limits in their codebases. From my perspective, that's misguided; LINQ is extremely useful and has its place. In .NET itself, we use LINQ; we're just practical and thoughtful about where, avoiding it in code paths we've optimized to be lightweight and fast due to expectations that such code paths could matter to consumers. And so, while LINQ itself may not perform as fast as a hand-rolled solution, we still care a lot about the performance of LINQ's implementation, so that it can be used in more and more places, and so that where it's used there's as little overhead as possible. There are also differences between operations in LINQ; with over 200 overloads providing various kinds of functionality, some of these overloads benefit from more performance tuning than others, based on their expected usage.
dotnet/runtime#64470 is the result of analyzing various real-world code bases for use of Enumerable.Min and Enumerable.Max, and seeing that it's very common to use these with arrays, often ones that are quite large. This PR updates the Min<T>(IEnumerable<T>) and Max<T>(IEnumerable<T>) overloads, when the input is an int[] or long[], to vectorize the processing using Vector<T>. The net effect of this is significantly faster execution time for larger arrays, but still improved performance even for short arrays (because the implementation is now able to access the array directly rather than going through the enumerable, leading to less allocation and interface dispatch and more applicable optimizations like inlining).
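To give a feel for the shape of that kind of vectorization, here's a simplified Vector<T>-based Min over an int[]; this is just a sketch of the technique, not the actual Enumerable.Min implementation:

using System.Numerics;

static int Min(int[] values)
{
    if (values.Length == 0) throw new InvalidOperationException("Sequence contains no elements");

    int i = 0;
    int min = int.MaxValue;

    if (Vector.IsHardwareAccelerated && values.Length >= Vector<int>.Count)
    {
        // Track a vector of running minimums, processing Vector<int>.Count elements per iteration.
        var mins = new Vector<int>(int.MaxValue);
        for (; i <= values.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            mins = Vector.Min(mins, new Vector<int>(values, i));
        }

        // Reduce the vector of minimums down to a single scalar.
        for (int j = 0; j < Vector<int>.Count; j++)
        {
            min = Math.Min(min, mins[j]);
        }
    }

    // Handle any remaining elements (or the whole array if it was too small to vectorize).
    for (; i < values.Length; i++)
    {
        min = Math.Min(min, values[i]);
    }

    return min;
}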
[Params(4, 1024)]
public int Length { get; set; }

private IEnumerable<int> _source;

[GlobalSetup]
public void Setup() => _source = Enumerable.Range(1, Length).ToArray();

[Benchmark]
public int Min() => _source.Min();

[Benchmark]
public int Max() => _source.Max();
Method | Runtime | Length | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|
Min | .NET 6.0 | 4 | 26.167 ns | 1.00 | 32 B | 1.00 |
Min | .NET 7.0 | 4 | 4.788 ns | 0.18 | – | 0.00 |
Max | .NET 6.0 | 4 | 25.236 ns | 1.00 | 32 B | 1.00 |
Max | .NET 7.0 | 4 | 4.234 ns | 0.17 | – | 0.00 |
Min | .NET 6.0 | 1024 | 3,987.102 ns | 1.00 | 32 B | 1.00 |
Min | .NET 7.0 | 1024 | 101.830 ns | 0.03 | – | 0.00 |
Max | .NET 6.0 | 1024 | 3,798.069 ns | 1.00 | 32 B | 1.00 |
Max | .NET 7.0 | 1024 | 100.279 ns | 0.03 | – | 0.00 |
One of the more interesting aspects of the PR, however, is one line that’s meant to help with the non-array cases. In performance optimization, and in particular when adding “fast paths” to better handle certain cases, there’s almost always a winner and a loser: the winner is the case the optimization is intended to help, and the loser is every other case that’s penalized by whatever checks are necessary to determine whether to take the improved path. An optimization that special-cases arrays might normally look like:
if (source is int[] array)
{
    ProcessArray(array);
}
else
{
    ProcessEnumerable(source);
}
However, if you look at the PR, you'll see the if condition is actually:
if (source.GetType() == typeof(int[]))
How come? Well, at this point in the code flow, we know that source isn't null, so we don't need the extra null check that is will bring. However, that's minor compared to the real impact here: support for array covariance. It might surprise you to learn that there are types beyond int[] that will satisfy a source is int[] check… try running Console.WriteLine((object)new uint[42] is int[]); and you'll find it prints out True. (This is also a rare case where the .NET runtime and C# the language disagree on aspects of the type system. If you change that Console.WriteLine((object)new uint[42] is int[]); to instead be Console.WriteLine(new uint[42] is int[]);, i.e. remove the (object) cast, you'll find it starts printing out False instead of True. That's because the C# compiler believes it's impossible for a uint[] to ever be an int[], and thus optimizes the check away entirely to be a constant false.) Thus the runtime has to do more work as part of the type check than just a simple comparison against the known type identity of int[]. We can see this by looking at the assembly generated for these two methods (the latter assumes we've already null-checked the input, which is the case in these LINQ methods):
public IEnumerable<object> Inputs { get; } = new[] { new object() };

[Benchmark]
[ArgumentsSource(nameof(Inputs))]
public bool M1(object o) => o is int[];

[Benchmark]
[ArgumentsSource(nameof(Inputs))]
public bool M2(object o) => o.GetType() == typeof(int[]);
This results in:
; Program.M1(System.Object)
       sub      rsp,28
       mov      rcx,offset MT_System.Int32[]
       call     qword ptr [System.Runtime.CompilerServices.CastHelpers.IsInstanceOfAny(Void*, System.Object)]
       test     rax,rax
       setne    al
       movzx    eax,al
       add      rsp,28
       ret
; Total bytes of code 34

; Program.M2(System.Object)
       mov      rax,offset MT_System.Int32[]
       cmp      [rdx],rax
       sete     al
       movzx    eax,al
       ret
; Total bytes of code 20
Note that the former involves a call to the JIT's CastHelpers.IsInstanceOfAny helper method, and that it's not inlined. That in turn impacts performance:
private IEnumerable<int> _source = (int[])(object)new uint[42];

[Benchmark(Baseline = true)]
public bool WithIs() => _source is int[];

[Benchmark]
public bool WithTypeCheck() => _source.GetType() == typeof(int[]);
Method | Mean | Ratio | Code Size |
---|---|---|---|
WithIs | 1.9246 ns | 1.000 | 215 B |
WithTypeCheck | 0.0013 ns | 0.001 | 24 B |
Of course, these two operations aren't semantically equivalent, so if this were something that required the semantics of the former, we couldn't use the latter. But in the case of this LINQ performance optimization, we can choose to only optimize the int[] case, forgo the super rare case of the int[] actually being a uint[] (or e.g. DayOfWeek[]), and minimize the performance penalty of the optimization for IEnumerable<int> inputs other than int[] to just a few quick instructions.
This improvement was built upon further in dotnet/runtime#64624, which expands the input types supported and the operations that take advantage. First, it introduced a private helper for extracting a ReadOnlySpan<T> from certain kinds of IEnumerable<T> inputs, namely (today) those inputs that are actually either a T[] or a List<T>; as with the previous PR, it uses the GetType() == typeof(T[]) form to avoid significantly penalizing other inputs. Both of these types enable extracting a ReadOnlySpan<T> for the actual storage, in the case of T[] via a cast and in the case of List<T> via the CollectionsMarshal.AsSpan method that was introduced in .NET 5. Once we have that span, we can do a few interesting things (a rough sketch of what such a span-extraction helper can look like follows the list below). This PR:
- Expands the previous Min<T>(IEnumerable<T>) and Max<T>(IEnumerable<T>) optimizations to apply not only to int[] and long[] but also to List<int> and List<long>.
- Uses direct span access for Average<T>(IEnumerable<T>) and Sum<T>(IEnumerable<T>) for T being int, long, float, double, or decimal, all for arrays and lists.
- Similarly uses direct span access for Min<T>(IEnumerable<T>) and Max<T>(IEnumerable<T>) for T being float, double, and decimal.
- Vectorizes Average<int>(IEnumerable<int>) for arrays and lists.
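Here's a rough sketch of what such a span-extraction helper can look like; the name and the exact shape are illustrative rather than the actual internal LINQ code:

using System.Collections.Generic;
using System.Runtime.InteropServices;

static bool TryGetSpan<T>(IEnumerable<T> source, out ReadOnlySpan<T> span)
{
    // GetType() == typeof(...) checks deliberately ignore covariant cases (e.g. a uint[]
    // masquerading as an int[]), keeping the check to a cheap type-identity comparison.
    if (source.GetType() == typeof(T[]))
    {
        span = (T[])source;
        return true;
    }

    if (source.GetType() == typeof(List<T>))
    {
        // Exposes the list's backing array as a span; the list must not be mutated while in use.
        span = CollectionsMarshal.AsSpan((List<T>)source);
        return true;
    }

    span = default;
    return false;
}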
The effect of that is evident in microbenchmarks, e.g.
private static float[] CreateRandom()
{
    var r = new Random(42);
    var results = new float[10_000];
    for (int i = 0; i < results.Length; i++)
    {
        results[i] = (float)r.NextDouble();
    }
    return results;
}

private IEnumerable<float> _floats = CreateRandom();

[Benchmark]
public float Sum() => _floats.Sum();

[Benchmark]
public float Average() => _floats.Average();

[Benchmark]
public float Min() => _floats.Min();

[Benchmark]
public float Max() => _floats.Max();
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
Sum | .NET 6.0 | 39.067 us | 1.00 | 32 B | 1.00 |
Sum | .NET 7.0 | 14.349 us | 0.37 | – | 0.00 |
Average | .NET 6.0 | 41.232 us | 1.00 | 32 B | 1.00 |
Average | .NET 7.0 | 14.378 us | 0.35 | – | 0.00 |
Min | .NET 6.0 | 45.522 us | 1.00 | 32 B | 1.00 |
Min | .NET 7.0 | 9.668 us | 0.21 | – | 0.00 |
Max | .NET 6.0 | 41.178 us | 1.00 | 32 B | 1.00 |
Max | .NET 7.0 | 9.210 us | 0.22 | – | 0.00 |
The previous LINQ PRs were examples of making existing operations faster. But sometimes performance improvements come about from new APIs that can be used in place of previous ones in certain situations. One such example comes from new APIs introduced in dotnet/runtime#70525 from @deeprobin, which were then improved in dotnet/runtime#71564. One of the most popular methods in LINQ is Enumerable.OrderBy (and its inverse, OrderByDescending), which enables creating a sorted copy of the input enumerable. To do so, the caller passes a Func<TSource, TKey> key selector to OrderBy, which OrderBy uses to extract the comparison key for each item. However, it's relatively common to want to sort items with themselves as the keys; this is, after all, the default for methods like Array.Sort, and in such cases callers of OrderBy end up passing in an identity function, e.g. OrderBy(x => x). To eliminate that cruft, .NET 7 introduces the new Order and OrderDescending methods, which, in the spirit of pairs like Distinct and DistinctBy, perform that same sorting operation, just with an implicit x => x done on behalf of the caller. Beyond the cleaner call site, a nice performance benefit is that the implementation then knows the keys will all be the same as the inputs, so it no longer needs to invoke a callback for each item to retrieve its key, nor allocate a new array to store those keys. Thus, if you find yourself using LINQ and reaching for OrderBy(x => x), consider instead using Order() and reaping the (primarily allocation) benefits:
[Params(1024)]
public int Length { get; set; }

private int[] _arr;

[GlobalSetup]
public void Setup() => _arr = Enumerable.Range(1, Length).Reverse().ToArray();

[Benchmark(Baseline = true)]
public void OrderBy()
{
    foreach (int _ in _arr.OrderBy(x => x)) { }
}

[Benchmark]
public void Order()
{
    foreach (int _ in _arr.Order()) { }
}
Method | Length | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
OrderBy | 1024 | 68.74 us | 1.00 | 12.3 KB | 1.00 |
Order | 1024 | 66.24 us | 0.96 | 8.28 KB | 0.67 |
File I/O
.NET 6 saw some huge file I/O improvements, in particular a complete rewrite of FileStream. While .NET 7 doesn't have any single change on that scale, it does have a significant number of improvements that measurably "move the needle," and in a variety of ways.
One form of performance improvement that also masquerades as a reliability improvement is increasing responsiveness to cancellation requests. The faster something can be canceled, the sooner the system is able to give back valuable resources in use, and the sooner things waiting for that operation to complete are able to be unblocked. There have been several improvements of this ilk in .NET 7.
In some cases, it comes from adding cancelable overloads where things weren't previously cancelable at all. That's the case for dotnet/runtime#61898 from @bgrainger, which added new cancelable overloads of TextReader.ReadLineAsync and TextReader.ReadToEndAsync, including overrides of these methods on StreamReader and StringReader; dotnet/runtime#64301 from @bgrainger then overrode these methods (and others missing overrides) on the NullStreamReader type returned from TextReader.Null and StreamReader.Null (interestingly, these were unnecessarily defined as two different types, and so this PR also unified them to both use the StreamReader variant, as it satisfies the required types of both). You can see this put to good use in dotnet/runtime#66492 from @lateapexearlyspeed, which adds a new File.ReadLinesAsync method. This produces an IAsyncEnumerable<string> of the lines in the file, is based on a simple loop around the new StreamReader.ReadLineAsync overload, and is thus itself fully cancelable.
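Usage is as simple as you'd hope; a minimal example (the file name and timeout are illustrative):

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));

await foreach (string line in File.ReadLinesAsync("access.log", cts.Token))
{
    Console.WriteLine(line);
}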
From my perspective, though, a more interesting form of this is when an existing overload is purportedly cancelable but isn't actually. For example, the base Stream.ReadAsync method just wraps the Stream.BeginRead/EndRead methods, which aren't cancelable, so if a Stream-derived type doesn't override ReadAsync, attempts to cancel a call to its ReadAsync will be minimally effective. It does an up-front check for cancellation, such that if cancellation was requested prior to the call being made, it will be immediately canceled, but after that check, the supplied CancellationToken is effectively ignored. Over time we've tried to stamp out all remaining such cases, but a few stragglers have remained. One pernicious case has been with pipes. For this discussion, there are two relevant kinds of pipes, anonymous and named, which are represented in .NET as pairs of streams: AnonymousPipeClientStream/AnonymousPipeServerStream and NamedPipeClientStream/NamedPipeServerStream. Also, on Windows, the OS makes a distinction between handles opened for synchronous I/O and handles opened for overlapped I/O (aka asynchronous I/O), and this is reflected in the .NET API: you can open a named pipe for synchronous or overlapped I/O based on the PipeOptions.Asynchronous option specified at construction. And, on Unix, named pipes, contrary to their naming, are actually implemented on top of Unix domain sockets. Now some history:
- .NET Framework 4.8: No cancellation support. The pipe Stream-derived types didn't even override ReadAsync or WriteAsync, so all they got was the default up-front check for cancellation, after which the token was ignored.
- .NET Core 1.0: On Windows, with a named pipe opened for asynchronous I/O, cancellation was fully supported. The implementation would register with the CancellationToken, and upon a cancellation request, would use CancelIoEx for the NativeOverlapped* associated with the asynchronous operation. On Unix, with named pipes implemented in terms of sockets, if the pipe was opened with PipeOptions.Asynchronous, the implementation would simulate cancellation via polling: rather than simply issuing the Socket.ReceiveAsync/Socket.SendAsync (which weren't cancelable at the time), it would queue a work item to the ThreadPool, and that work item would run a polling loop, making Socket.Poll calls with a small timeout, checking the token, and then looping around to do it again until either the Poll indicated the operation would succeed or cancellation was requested. On both Windows and Unix, other than for a named pipe opened with Asynchronous, once the operation was initiated, cancellation was a nop.
- .NET Core 2.1: On Unix, the implementation was improved to avoid the polling loop, but it still lacked a truly cancelable Socket.ReceiveAsync/Socket.SendAsync. Instead, by this point Socket.ReceiveAsync supported zero-byte reads, where a caller could pass a zero-length buffer to ReceiveAsync and use that as notification for data being available to consume without actually consuming it. The Unix implementation for asynchronous named pipe streams then changed to issue zero-byte reads and to await a Task.WhenAny of both that operation's task and a task that would be completed when cancellation was requested. Better, but still far from ideal.
- .NET Core 3.0: On Unix, Socket got truly cancelable ReceiveAsync and SendAsync methods, which asynchronous named pipes were updated to utilize. At this point, the Windows and Unix implementations were effectively on par with regards to cancellation: good for asynchronous named pipes, and merely paying lip service to it for everything else.
- .NET 5: On Unix, SafeSocketHandle was exposed, and it became possible to create a Socket for an arbitrary supplied SafeSocketHandle, which enabled creating a Socket that actually referred to an anonymous pipe. This in turn enabled every PipeStream on Unix to be implemented in terms of Socket, which enabled ReceiveAsync/SendAsync to be fully cancelable for both anonymous and named pipes, regardless of how they were opened.
So by .NET 5, the problem was addressed on Unix, but it was still an issue on Windows. Until now. In .NET 7, we've made the rest of the operations fully cancelable on Windows as well, thanks to dotnet/runtime#72503 (and a subsequent tweak in dotnet/runtime#72612). Windows doesn't support overlapped I/O for anonymous pipes today, so for anonymous pipes and for named pipes opened for synchronous I/O, the Windows implementation would just delegate to the base Stream implementation, which would queue a work item to the ThreadPool to invoke the synchronous counterpart, just on another thread. The implementations still queue that work item, but instead of just calling the synchronous method, the work item does some pre- and post- work that registers for cancellation, passing in the thread ID of the thread that's about to perform the I/O. If cancellation is requested, the implementation then uses CancelSynchronousIo to interrupt it. There's a race condition here, in that the moment the thread registers for cancellation, cancellation could be requested, such that CancelSynchronousIo could be called before the operation is actually initiated. So, there's a small spin loop employed: if cancellation is requested between the time registration occurs and the time the synchronous I/O is actually performed, the cancellation thread will spin until the I/O is initiated, but this condition is expected to be exceedingly rare. There's also a race condition on the other side, that of CancelSynchronousIo being requested after the I/O has already completed; to address that race, the implementation relies on the guarantees made by CancellationTokenRegistration.Dispose, which promises that the associated callback will either never be invoked or will already have fully completed executing by the time Dispose returns. Not only does this implementation complete the puzzle such that all asynchronous read/write operations on both anonymous and named pipes, on both Windows and Unix, are cancelable, it also actually improves normal throughput.
private Stream _server;
private Stream _client;
private byte[] _buffer = new byte[1];
private CancellationTokenSource _cts = new CancellationTokenSource();

[Params(false, true)]
public bool Cancelable { get; set; }

[Params(false, true)]
public bool Named { get; set; }

[GlobalSetup]
public void Setup()
{
    if (Named)
    {
        string name = Guid.NewGuid().ToString("N");
        var server = new NamedPipeServerStream(name, PipeDirection.Out);
        var client = new NamedPipeClientStream(".", name, PipeDirection.In);
        Task.WaitAll(server.WaitForConnectionAsync(), client.ConnectAsync());
        _server = server;
        _client = client;
    }
    else
    {
        var server = new AnonymousPipeServerStream(PipeDirection.Out);
        var client = new AnonymousPipeClientStream(PipeDirection.In, server.ClientSafePipeHandle);
        _server = server;
        _client = client;
    }
}

[GlobalCleanup]
public void Cleanup()
{
    _server.Dispose();
    _client.Dispose();
}

[Benchmark(OperationsPerInvoke = 1000)]
public async Task ReadWriteAsync()
{
    CancellationToken ct = Cancelable ? _cts.Token : default;
    for (int i = 0; i < 1000; i++)
    {
        ValueTask<int> read = _client.ReadAsync(_buffer, ct);
        await _server.WriteAsync(_buffer, ct);
        await read;
    }
}
Method | Runtime | Cancelable | Named | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|
ReadWriteAsync | .NET 6.0 | False | False | 22.08 us | 1.00 | 400 B | 1.00 |
ReadWriteAsync | .NET 7.0 | False | False | 12.61 us | 0.76 | 192 B | 0.48 |
ReadWriteAsync | .NET 6.0 | False | True | 38.45 us | 1.00 | 400 B | 1.00 |
ReadWriteAsync | .NET 7.0 | False | True | 32.16 us | 0.84 | 220 B | 0.55 |
ReadWriteAsync | .NET 6.0 | True | False | 27.11 us | 1.00 | 400 B | 1.00 |
ReadWriteAsync | .NET 7.0 | True | False | 13.29 us | 0.52 | 193 B | 0.48 |
ReadWriteAsync | .NET 6.0 | True | True | 38.57 us | 1.00 | 400 B | 1.00 |
ReadWriteAsync | .NET 7.0 | True | True | 33.07 us | 0.86 | 214 B | 0.54 |
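Stepping back to the CancellationTokenRegistration.Dispose guarantee mentioned above: it's broadly useful whenever a cancellation callback might race with the completion of the work it's meant to interrupt. A minimal sketch of the pattern (the operation type and its methods here are hypothetical, not the pipe implementation itself):

using System.Threading;

void RunInterruptibly(CancellationToken cancellationToken)
{
    CancellationTokenRegistration reg = cancellationToken.UnsafeRegister(
        static state => ((MyInterruptibleOperation)state!).Interrupt(), _operation);

    try
    {
        _operation.RunSynchronously(); // may be interrupted by the callback on another thread
    }
    finally
    {
        // Dispose guarantees the callback either never ran or has fully finished by the time
        // this returns, so code after this point can't race with Interrupt().
        reg.Dispose();
    }
}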
The rest of the performance-focused changes around I/O in .NET 7 were primarily focused on one of two things: reducing syscalls, and reducing allocation.
Several PRs went into reducing syscalls on Unix as part of copying files, e.g. File.Copy and FileInfo.CopyTo. dotnet/runtime#59695 from @tmds reduced overheads in several ways. The code had been performing a stat call in order to determine up front whether the source was actually a directory, in which case the operation would error out. Instead, the PR simply tries to open the source file, which it would need to do anyway for the copy operation, and only performs that stat if opening the file fails. If opening the file succeeds, the code was already performing an fstat to gather data on the file, such as whether it was seekable; with this change, it now also extracts the source file size from the results of that single fstat, which it then threads through to the core copy routine, which is then itself able to avoid an fstat syscall it had been performing in order to get the size. Saving those syscalls is great, in particular for very small files where the overhead of setting up the copy can actually be more expensive than the actual copy of the bytes. But the biggest benefit of this PR is that it takes advantage of IOCTL-FICLONERANGE on Linux. Some Linux file systems, like XFS and Btrfs, support "copy-on-write," which means that rather than copying all of the data to a new file, the file system simply notes that there are two different files pointing to the same data, sharing the underlying storage. This makes the "copy" super fast, since nothing actually needs to be copied and instead the file system just needs to update some bookkeeping; plus, less space is consumed on disk, since there's just a single store of the data. The file system then only needs to actually copy data that's overwritten in one of the files. This PR uses ioctl and FICLONE to perform the copy as copy-on-write if the source and destination are on the same file system and the file system supports the operation. In a similar vein, dotnet/runtime#64264 from @tmds further improves File.Copy/FileInfo.CopyTo by utilizing copy_file_range on Linux if it's supported (and only if the kernel is new enough to address some issues the function had in previous releases). Unlike a typical read/write loop that reads the data from the source and then writes it to the destination, copy_file_range is implemented to stay entirely in kernel mode, without having to transition to user space for each read and write.
Another example of avoiding syscalls comes from the File.WriteXx and File.AppendXx methods on Unix. The implementation of these methods opens a FileStream or a SafeFileHandle directly, and it was specifying FileOptions.SequentialScan. SequentialScan is primarily relevant for reading data from a file, hinting to OS caching that data is expected to be read from the file sequentially rather than randomly. However, these write/append methods don't read, they only write, and the implementation of FileOptions.SequentialScan on Unix requires an additional syscall via posix_fadvise (passing in POSIX_FADV_SEQUENTIAL); thus, we were paying for a syscall and not benefiting from it. This situation is akin to the famous Henny Youngman joke: "The patient says, 'Doctor, it hurts when I do this'; the doctor says, 'Then don't do that!'" Here, too, the answer is "don't do that," and so dotnet/runtime#59247 from @tmds simply stops passing SequentialScan in places where it won't help but may hurt.
Directory handling has also seen reduced syscalls across the directory lifecycle, especially on Unix. dotnet/runtime#58799 from @tmds speeds up directory creation on Unix. Previously, the implementation of directory creation would first check to see if the directory already existed, which involves a syscall. In the expected minority case where it already existed, the code could exit early. But in the expected more common case where the directory didn't exist, it would then parse the file path to find all of the directories in it, walk up the directory list until it found one that did exist, and then try to create all of the subdirectories back down through the target one. However, the most common case is that the parent directories already exist and the child directory doesn't, in which case we were still paying for all that parsing when we could have just created the target directory. This PR addresses that by changing the up-front existence check to instead simply try to mkdir the target directory; if it succeeds, great, we're done, and if it fails, the error code from the failure can be used instead of the existence check to know whether mkdir failed because it had no work to do. dotnet/runtime#61777 then takes this a step further and avoids string allocations while creating directories by using stack memory for the paths temporarily needed to pass to mkdir.
dotnet/runtime#63675 then improves the performance of moving directories, on both Unix and Windows, removing several syscalls. The shared code for Directory.Move and DirectoryInfo.MoveTo was doing explicit directory existence checks for the source and destination locations, but on Windows, the Win32 API called to perform the move does such checks itself, so they're not needed preemptively. On Unix, we can similarly avoid the existence check for the source directory, as the rename function called will simply fail if the source doesn't exist (with an appropriate error that lets us deduce what went wrong so the right exception can be thrown), and for the destination, the code had been issuing separate existence checks for whether the destination existed as a directory or as a file, but a single stat call suffices for both.
private string _path1;
private string _path2;

[GlobalSetup]
public void Setup()
{
    _path1 = Path.GetTempFileName();
    _path2 = Path.GetTempFileName();
    File.Delete(_path1);
    File.Delete(_path2);
    Directory.CreateDirectory(_path1);
}

[Benchmark]
public void Move()
{
    Directory.Move(_path1, _path2);
    Directory.Move(_path2, _path1);
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
Move | .NET 6.0 | 31.70 us | 1.00 | 256 B | 1.00 |
Move | .NET 7.0 | 26.31 us | 0.83 | – | 0.00 |
And then, also on Unix, dotnet/runtime#59520 from @tmds improves directory deletion, and in particular recursive deletion (deleting a directory and everything it contains and everything they contain and so on), by utilizing the information already provided by the file system enumeration to avoid a secondary existence check.
Syscalls were also reduced as part of support for memory-mapped files. dotnet/runtime#63754 takes advantage of special-casing to do so while opening a MemoryMappedFile. When MemoryMappedFile.CreateFromFile was called, one of the first things it would do is call File.Exists to determine whether the specified file already exists; that's because later in the method, as part of dealing with errors and exceptions, the implementation needs to know whether to delete the file that might by then exist; the implementation constructs a FileStream, and doing so might will the specified file into existence. However, that only happens for some FileMode values, which is configurable via an argument passed by callers of CreateFromFile. The common and default value of FileMode is FileMode.Open, which requires that the file exist, such that constructing the FileStream will throw if it doesn't. That means we only actually need to call File.Exists if the FileMode is something other than Open or CreateNew, which means we can trivially avoid the extra system call in the majority case. dotnet/runtime#63790 also helps here, in two ways. First, throughout the CreateFromFile operation, the implementation might access the FileStream's Length multiple times, but each call results in a syscall to read the underlying length of the file. We can instead read it once and use that one value for all of the various checks performed. Second, .NET 6 introduced the File.OpenHandle method, which enables opening a file handle / file descriptor directly into a SafeFileHandle, rather than having to go through FileStream to do so. The use of the FileStream in MemoryMappedFile is actually quite minimal, and so it makes sense to just use the SafeFileHandle directly rather than also constructing the superfluous FileStream and its supporting state. This helps to reduce allocations.
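File.OpenHandle and the companion RandomAccess APIs are public, so the same trick is available to application code that doesn't need a full FileStream; a minimal example (the file name and buffer size are illustrative):

using Microsoft.Win32.SafeHandles;

using SafeFileHandle handle = File.OpenHandle("data.bin", FileMode.Open, FileAccess.Read);

long length = RandomAccess.GetLength(handle);        // no FileStream, no FileStream buffer
byte[] buffer = new byte[Math.Min(length, 4096)];
int bytesRead = RandomAccess.Read(handle, buffer, fileOffset: 0);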
Finally, there's dotnet/runtime#63794, which recognizes that a MemoryMappedViewAccessor or MemoryMappedViewStream opened for read-only access can't have been written to. That sounds obvious, but the practical implication is that closing either needn't bother flushing, since a read-only view can't have changed any data, and flushing a view can be relatively expensive, especially for larger views. Thus, a simple change to avoid flushing if the view isn't writable yields a measurable improvement to MemoryMappedViewAccessor/MemoryMappedViewStream's Dispose.
private string _path;

[GlobalSetup]
public void Setup()
{
    _path = Path.GetTempFileName();
    File.WriteAllBytes(_path, Enumerable.Range(0, 10_000_000).Select(i => (byte)i).ToArray());
}

[GlobalCleanup]
public void Cleanup()
{
    File.Delete(_path);
}

[Benchmark]
public void MMF()
{
    using var mmf = MemoryMappedFile.CreateFromFile(_path, FileMode.Open, null);
    using var s = mmf.CreateViewStream(0, 10_000_000, MemoryMappedFileAccess.Read);
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
MMF | .NET 6.0 | 315.7 us | 1.00 | 488 B | 1.00 |
MMF | .NET 7.0 | 227.1 us | 0.68 | 336 B | 0.69 |
Beyond system calls, there have also been a plethora of improvements around reducing allocation. One such change is dotnet/runtime#58167, which improved the performance of the commonly-used File.WriteAllText{Async} and File.AppendAllText{Async} methods. The PR recognizes two things: one, that these operations are common enough that it's worth avoiding the small-but-measurable overhead of going through a FileStream and instead just going directly to the underlying SafeFileHandle, and, two, that since the methods are passed the entirety of the payload to output, the implementation can use that knowledge (in particular the length) to do better than the StreamWriter that was previously employed. In doing so, the implementation avoids the overheads (primarily in allocation) of the streams and writers and temporary buffers.
private string _path;

[GlobalSetup]
public void Setup() => _path = Path.GetRandomFileName();

[GlobalCleanup]
public void Cleanup() => File.Delete(_path);

[Benchmark]
public void WriteAllText() => File.WriteAllText(_path, Sonnet);
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
WriteAllText | .NET 6.0 | 488.5 us | 1.00 | 9944 B | 1.00 |
WriteAllText | .NET 7.0 | 482.9 us | 0.99 | 392 B | 0.04 |
dotnet/runtime#61519 similarly updates File.ReadAllBytes{Async} to use SafeFileHandle (and RandomAccess) directly rather than going through FileStream, shaving some allocation off of each use. It also makes the same SequentialScan change as mentioned earlier. While this case is about reading (whereas the previous change saw SequentialScan as pure overhead with no benefit), ReadAllBytes{Async} is very frequently used to read smaller files, where the overhead of the additional syscall can amount to as much as 10% of the total cost (and for larger files, modern kernels are pretty good about caching even without a sequentiality hint, so there's little downside measured there).
Another such change is dotnet/runtime#68662, which improved Path.Join's handling of null or empty path segments. Path.Join has overloads that accept strings and overloads that accept ReadOnlySpan<char>s, but all of the overloads produce strings. The string-based overloads just wrapped each string in a span and delegated to the span-based overloads. However, in the event that the join operation is a nop (e.g. there are two path segments and the second is empty, so the join should just return the first), the span-based implementation still needs to create a new string (there's no way for the ReadOnlySpan<char>-based overloads to extract a string from the span). As such, the string-based overloads can do a little better when one of the inputs is null or empty: they can do the same thing the Path.Combine overloads do, which is to have the M-argument overload delegate to the (M-1)-argument overload, filtering out a null or empty segment, and, in the base case of the overload with two arguments, if a segment is null or empty, the other segment (or an empty string) can just be returned directly.
Beyond that, there are a multitude of allocation-focused PRs, such as dotnet/runtime#69335 from @pedrobsaila, which adds a stack-allocation-based fast path to the internal ReadLink helper that's used on Unix anywhere we need to follow symlinks; or dotnet/runtime#68752, which updates NamedPipeClientStream.ConnectAsync to remove a delegate allocation (by passing state into a Task.Factory.StartNew call explicitly); or dotnet/runtime#69412, which adds an optimized Read(Span<byte>) override to the Stream returned from Assembly.GetManifestResourceStream.
But my personal favorite improvement in this area comes from dotnet/runtime#69272, which adds a few new helpers to Stream:
public void ReadExactly(byte[] buffer, int offset, int count);
public void ReadExactly(Span<byte> buffer);
public ValueTask ReadExactlyAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken = default);
public ValueTask ReadExactlyAsync(Memory<byte> buffer, CancellationToken cancellationToken = default);
public int ReadAtLeast(Span<byte> buffer, int minimumBytes, bool throwOnEndOfStream = true);
public ValueTask<int> ReadAtLeastAsync(Memory<byte> buffer, int minimumBytes, bool throwOnEndOfStream = true, CancellationToken cancellationToken = default);
In fairness, these are more about usability than they are about performance, but in this case there's a tight correlation between the two. It's very common to write these helpers one's self (the aforementioned PR deleted many open-coded loops for this functionality from across the core libraries), as the functionality is frequently needed, and it's unfortunately easy to get them wrong in ways that negatively impact performance, such as by using a Stream.ReadAsync overload that needs to allocate a returned Task<int>, or by requesting fewer bytes per read call than could have been requested. These implementations are correct and efficient.
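For example, instead of hand-writing a read loop to fill a fixed-size header, the intent can now be expressed directly (the stream and sizes here are illustrative):

byte[] header = new byte[16];

// Throws EndOfStreamException if the stream ends before 16 bytes have been read.
stream.ReadExactly(header);

// Reads at least 4 bytes (or fewer only if the stream ends, since throwOnEndOfStream is false),
// returning how many bytes were actually read, up to the buffer's length.
byte[] buffer = new byte[1024];
int read = stream.ReadAtLeast(buffer, minimumBytes: 4, throwOnEndOfStream: false);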
Compression
.NET Core 2.1 added support for the Brotli compression algorithm, surfacing it in two ways: BrotliStream, and the pair of BrotliEncoder/BrotliDecoder structs that BrotliStream is itself built on top of. For the most part, these types just provide wrappers around a native C implementation from google/brotli, and so while the .NET layer has the opportunity to improve how data is moved around, managed allocation, and so on, the speed and quality of the compression itself are largely at the mercy of the C implementation and the intricacies of the Brotli algorithm.
As with many compression algorithms, Brotli provides a knob that allows a quintessential tradeoff to be made between compression speed (how fast data can be compressed) and compression quality/ratio (how small the compressed output can be made). The hand-wavy idea is that the more time the algorithm spends looking for opportunities, the more space can be saved. Many algorithms expose this as a numerical dial, in Brotli's case going from 0 (fastest speed, least compression) to 11 (spend as much time as is needed to minimize the output size). But while BrotliEncoder surfaces that same range, BrotliStream's surface area is simpler: most use just specifies that compression should be performed (e.g. new BrotliStream(destination, CompressionMode.Compress)), and the only knob available is the CompressionLevel enum (e.g. new BrotliStream(destination, CompressionLevel.Fastest)), which provides just a few options: CompressionLevel.NoCompression, CompressionLevel.Fastest, CompressionLevel.Optimal, and CompressionLevel.SmallestSize. This means the BrotliStream implementation needs to select a default value when no CompressionLevel is specified and needs to map CompressionLevel to an underlying numerical value when one is.
For better or worse (and I'm about to argue "much worse"), the native C implementation itself defines the default to be 11 (google/brotli#encode.h), and so that's what BrotliStream ended up using when no CompressionLevel was explicitly specified. Further, the CompressionLevel.Optimal enum value is poorly named. It's intended to represent a good default that's a balanced tradeoff between speed and quality; that's exactly what it means for DeflateStream, GZipStream, and ZLibStream. But for BrotliStream, as the default it similarly got translated to mean the underlying native library's default, which is 11. This means that when constructing a BrotliStream with either CompressionMode.Compress or CompressionLevel.Optimal, rather than getting a nice balanced default, you're getting the dial turned all the way up to 11.
Is that so bad? Maybe compression quality is the most important thing? For example, reducing the size of data can make it faster to then transmit it over a wire, and with a slow connection, size then meaningfully translates into end-to-end throughput.
The problem is just how much this extra effort costs. Compression speed and ratio are highly dependent on the data being compressed, so take this example with a small grain of salt as it's not entirely representative of all use, but it's good enough for our purposes. Consider this code, which uses BrotliEncoder to compress The Complete Works of William Shakespeare from Project Gutenberg at varying levels of compression:
using System.Buffers;
using System.Diagnostics;
using System.IO.Compression;
using System.Text;

using var hc = new HttpClient();
byte[] data = await hc.GetByteArrayAsync("https://www.gutenberg.org/ebooks/100.txt.utf-8");
Console.WriteLine(data.Length);

var compressed = new MemoryStream();
var sw = new Stopwatch();

for (int level = 0; level <= 11; level++)
{
    const int Trials = 10;

    compressed.Position = 0;
    Compress(level, data, compressed);

    sw.Restart();
    for (int i = 0; i < Trials; i++)
    {
        compressed.Position = 0;
        Compress(level, data, compressed);
    }
    sw.Stop();

    Console.WriteLine($"{level},{sw.Elapsed.TotalMilliseconds / Trials},{compressed.Position}");

    static void Compress(int level, byte[] data, Stream destination)
    {
        var encoder = new BrotliEncoder(quality: level, window: 22);
        Write(ref encoder, data, destination, false);
        Write(ref encoder, Array.Empty<byte>(), destination, true);
        encoder.Dispose();

        static void Write(ref BrotliEncoder encoder, byte[] data, Stream destination, bool isFinalBlock)
        {
            byte[] output = ArrayPool<byte>.Shared.Rent(4096);
            OperationStatus lastResult = OperationStatus.DestinationTooSmall;
            ReadOnlySpan<byte> buffer = data;
            while (lastResult == OperationStatus.DestinationTooSmall)
            {
                lastResult = encoder.Compress(buffer, output, out int bytesConsumed, out int bytesWritten, isFinalBlock);
                if (lastResult == OperationStatus.InvalidData) throw new InvalidOperationException();
                if (bytesWritten > 0) destination.Write(output.AsSpan(0, bytesWritten));
                if (bytesConsumed > 0) buffer = buffer.Slice(bytesConsumed);
            }
            ArrayPool<byte>.Shared.Return(output);
        }
    }
}
The code is measuring how long it takes to compress the input data at each of the levels (doing a warmup and then averaging several iterations), timing how long it takes and capturing the resulting compressed data size. For the size, I get values like this:
Level | Size (bytes) |
---|---|
0 | 2,512,855.00 |
1 | 2,315,466.00 |
2 | 2,224,638.00 |
3 | 2,218,328.00 |
4 | 2,027,153.00 |
5 | 1,964,810.00 |
6 | 1,923,456.00 |
7 | 1,889,927.00 |
8 | 1,863,988.00 |
9 | 1,846,685.00 |
10 | 1,741,561.00 |
11 | 1,702,214.00 |
That's a fairly linear progression from least to most compression. That's not the problem. This is the problem:
Level | Time (ms) |
---|---|
0 | 24.11 |
1 | 36.67 |
2 | 64.13 |
3 | 73.72 |
4 | 146.41 |
5 | 257.12 |
6 | 328.54 |
7 | 492.81 |
8 | 702.38 |
9 | 892.08 |
10 | 4,830.32 |
11 | 10,634.88 |
This chart shows an almost exponential increase in processing time as we near the upper end of the dial, with quality level 11 compressing ~33% better than quality level 0 but taking ~440x as long to achieve that. If that's what a developer wants, they can specify CompressionLevel.SmallestSize, but as the cost of the default and of the supposedly balanced CompressionLevel.Optimal, it's far out of whack.
dotnet/runtime#72266 fixes that. A very small change, it simply makes CompressionMode.Compress and CompressionLevel.Optimal for Brotli map to quality level 4, which across many kinds of inputs represents a fairly balanced trade-off between size and speed.
private byte[] _data = new HttpClient().GetByteArrayAsync("https://www.gutenberg.org/ebooks/100.txt.utf-8").Result;
private Stream _output = new MemoryStream();

[Benchmark]
public void Compress()
{
    _output.Position = 0;
    using var brotli = new BrotliStream(_output, CompressionMode.Compress, leaveOpen: true);
    brotli.Write(_data);
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
Compress | .NET 6.0 | 9,807.0 ms | 1.00 |
Compress | .NET 7.0 | 133.1 ms | 0.01 |
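If you actually do want to spend whatever time is needed for maximal compression, that remains a one-line opt-in (destination here is an assumed writable Stream):

// Explicitly dial quality all the way up, trading time for size.
using var smallest = new BrotliStream(destination, CompressionLevel.SmallestSize, leaveOpen: true);

// Or explicitly prioritize speed over size.
using var fastest = new BrotliStream(destination, CompressionLevel.Fastest, leaveOpen: true);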
Other improvements have gone into compression as well, such as dotnet/runtime#69439, which updates the internal ZipHelper.AdvanceToPosition function used by ZipArchive to reuse a buffer on every iteration of a loop rather than allocating a new buffer for each iteration; dotnet/runtime#66764, which uses spans judiciously to avoid a bunch of superfluous string and string[] allocations in System.IO.Packaging; and dotnet/runtime#73082, which updates the zlib implementations shipped as part of .NET from v1.2.11 (released in January 2017) to v1.2.12 (released in March 2022).
Networking
Networking is the life-blood of almost every service, with performance being critical to success. In previous releases, a lot of effort was focused on the lower layers of the networking stack, e.g. .NET 5 saw a significant investment in improving the performance of sockets on Linux. In .NET 7, much of the effort is above sockets.
That said, there were some interesting performance improvements in sockets itself for .NET 7. One of the more interesting is dotnet/runtime#64770, which revamped how some synchronization is handled inside of SocketAsyncEventArgs. As background, in the early days of networking in .NET Framework, asynchrony was enabled via Begin/End methods (the "APM" pattern). This pattern is not only complicated to use well, it's relatively inefficient, resulting in allocation for every single operation performed (at a minimum for the IAsyncResult object that's returned from the BeginXx method). To help make networking operations more efficient, SocketAsyncEventArgs was introduced. SocketAsyncEventArgs is a reusable class you allocate to hold all of the state associated with asynchronous operations: allocate one, pass it to various async methods (e.g. ReceiveAsync), and then completion events are raised on the SocketAsyncEventArgs instance when the operation completes. It can be quite efficient when used correctly, but it's also complicated to use correctly. In subsequent releases, Task-based and ValueTask-based APIs were released; these have the efficiency of SocketAsyncEventArgs and the ease-of-use of async/await, and are the recommended starting point for all Socket-based asynchronous programming today. They have the efficiency of SocketAsyncEventArgs because they're actually implemented as a thin veneer on top of it under the covers, and so while most code these days isn't written to use SocketAsyncEventArgs directly, it's still very relevant from a performance perspective.
SocketAsyncEventArgs
on Windows is implemented to use winsock and overlapped I/O. When you call an async method likeValueTask<Socket> Socket.AcceptAsync(CancellationToken)
, that grabs an internalSocketAsyncEventArgs
and issues anAcceptAsync
on it, which in turn gets aNativeOverlapped*
from theThreadPoolBoundHandle
associated with the socket, and uses it to issue the nativeAcceptEx
call. When that handle is initially created, we set theFILE_SKIP_COMPLETION_PORT_ON_SUCCESS
completion notification mode on the socket; use of this was introduced in earlier releases of .NET Core, and it enables a significant number of socket operations, in particular sends and receives, to complete synchronously, which in turn saves unnecessary trips through the thread pool, unnecessary unwinding of async state machines, and so on. But it also causes a condundrum. There are some operations we want to perform associated with asynchronous operation but that have additional overhead, such as registering for the cancellation of those operations, and we don’t want to pay the cost of doing them if the operation is going to complete synchronously. That means we really want to delay performing such registration until after we’ve made the native call and discovered the operation didn’t complete synchronously… but at that point we’ve already initiated the operation, so if itdoesn’t complete synchronously, then we’re now in a potential race condition, where our code that’s still setting up the asynchronous operation is racing with it potentially completing in a callback on another thread. Fun.SocketAsyncEventArgs
handled this race condition with a spin lock; the theory was that contention would be incredibly rare, as the vast majority cases would either be the operation completing synchronously (in which case there’s no other thread involved) or asynchronously with enough of a delay that the small amount of additional work performed by the initiating thread would have long ago completed by the time the asynchronous operation completed. And for the most part, that was true. However, it turns out that it’s actually much more common than expected for certain kinds of operations, like Accepts. Accepts end up almost always completing asynchronously, but if there’s already a pending connection, completing asynchronously almost immediately, which then induces this race condition to happen more frequently and results in more contention on the spin locks. Contention on a spin lock is something you really want to avoid. And in fact, for a particular benchmark, this spin lock showed up as the cause for an almost 300% slowdown in requests-per-second (RPS) for a benchmark that used a dedicated connection per request (e.g. with every response setting “Connection: close”).dotnet/runtime#64770 changed the synchronization mechanism to no longer involve a spin lock; instead, it maintains a simple gate implemented as anInterlocked.CompareExchange
. If the initiating thread gets to the gate first, from that point on the operation is considered asynchronous and any additional work is handled by the completing callback. Conversely, if the callback gets to the gate first, the initiating thread treats the operation as if it completed synchronously. This not only avoids one of the threads spinning while waiting for the other to make forward progress, it also increases the number of operations that end up being handled as synchronous, which in turn reduces other costs (e.g. the codeawait
ing the task returned from this operation doesn’t need to hook up a callback and exit, and can instead itself continue executing synchronously). The impact of this is difficult to come up with a microbenchmark for, but it can have meaningful impact for loaded Windows servers that end up accepting significant numbers of connections in steady state.
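To make the gate concrete, here is a minimal sketch of the idea; this is illustrative only, not the actual SocketAsyncEventArgs code, and all names are invented. Whichever of the two parties transitions the gate away from its pending state first determines how the operation is treated.

using System.Threading;

// Illustrative gate: if the initiating thread wins, the operation is handled as
// asynchronous and the completion callback does the remaining work; if the
// callback wins, the initiating thread treats the operation as if it completed
// synchronously and handles the completion itself.
internal sealed class CompletionGate
{
    private const int Pending = 0, InitiatorWon = 1, CallbackWon = 2;
    private int _state = Pending;

    // Called by the thread that issued the native operation, after it has finished
    // any extra setup (e.g. cancellation registration). Returns true if the
    // operation should be treated as asynchronous from this point on.
    public bool TryMarkAsynchronous() =>
        Interlocked.CompareExchange(ref _state, InitiatorWon, Pending) == Pending;

    // Called from the I/O completion callback. Returns true if the callback should
    // finish the operation itself (the initiating thread already reached the gate);
    // false means the initiating thread will observe the completion as synchronous.
    public bool ShouldCompleteAsynchronously() =>
        Interlocked.CompareExchange(ref _state, CallbackWon, Pending) != Pending;
}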
A more-easily quantifiable change around sockets isdotnet/runtime#71090, which improves the performance ofSocketAddress.Equals
. ASocketAddress
is the serialized form of anEndPoint
, with abyte[]
containing the sequence of bytes that represent the address. ItsEquals
method, used to determine whether two SocketAddress
instances are the same, looped over thatbyte[]
byte-by-byte. Not only is such code gratuitous when there are now helpers available likeSequenceEqual
for comparing spans, doing it byte-by-byte is also much less efficient than the vectorized implementation inSequenceEqual
. Thus, this PR simply replaced the open-coded comparison loop with a call toSequenceEqual
.
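For illustration, here is roughly what such a change looks like in the abstract; the names are invented, but the shape mirrors replacing an open-coded, byte-by-byte loop with the vectorized SequenceEqual.

using System;

static class BufferCompare
{
    // Before: a manual loop, processed one byte at a time.
    public static bool EqualsOld(byte[] left, byte[] right, int size)
    {
        for (int i = 0; i < size; i++)
        {
            if (left[i] != right[i])
            {
                return false;
            }
        }
        return true;
    }

    // After: the same comparison expressed via spans, which SequenceEqual vectorizes.
    public static bool EqualsNew(byte[] left, byte[] right, int size) =>
        left.AsSpan(0, size).SequenceEqual(right.AsSpan(0, size));
}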
private SocketAddress _addr = new IPEndPoint(IPAddress.Parse("123.123.123.123"), 80).Serialize();private SocketAddress _addr_same = new IPEndPoint(IPAddress.Parse("123.123.123.123"), 80).Serialize();[Benchmark]public bool Equals_Same() => _addr.Equals(_addr_same);
Method | Runtime | Mean | Ratio |
---|---|---|---|
Equals_Same | .NET 6.0 | 57.659 ns | 1.00 |
Equals_Same | .NET 7.0 | 4.435 ns | 0.08 |
Let’s move up to some more interesting changes in the layers aboveSockets
, starting withSslStream
.
One of the more impactful changes toSslStream
on .NET 7 is in support for TLS resumption on Linux. When a TLS connection is established, the client and server engage in a handshake protocol where they collaborate to decide on a TLS version and cipher suites to use, authenticate and validate each other’s identity, and create symmetric encryption keys for use after the handshake. This represents a significant portion of the time required to establish a new connection. For a client that might disconnect from a server and then reconnect later, as is fairly common in distributed applications, TLS resumption allows a client and server to essentially pick up where they left off, with the client and/or server storing some amount of information about recent connections and using that information to resume. Windows SChannel provides default support for TLS resumption, and thus the Windows implementation ofSslStream
(which is built on SChannel) has long had support for TLS resumption. But OpenSSL’s model requires additional code to enable TLS resumption, and such code wasn’t present in the Linux implementation ofSslStream
. Withdotnet/runtime#57079 anddotnet/runtime#63030, .NET 7 adds server-side support for TLS resumption (using the variant that doesn’t require storing recent connection state on the server), and withdotnet/runtime#64369, .NET 7 adds client-side support (which does require storing additional state). The effect of this is significant, in particular for a benchmark like the following that repeatedly establishes new TLS sessions between a client and server.
private NetworkStream _client, _server;private readonly byte[] _buffer = new byte[1];private readonly SslServerAuthenticationOptions _options = new SslServerAuthenticationOptions{ ServerCertificateContext = SslStreamCertificateContext.Create(GetCertificate(), null),};[GlobalSetup]public void Setup(){ using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp); listener.Bind(new IPEndPoint(IPAddress.Loopback, 0)); listener.Listen(1); var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp); client.Connect(listener.LocalEndPoint); _server = new NetworkStream(listener.Accept(), ownsSocket: true); _client = new NetworkStream(client, ownsSocket: true);}[GlobalCleanup]public void Cleanup(){ _client.Dispose(); _server.Dispose();}[Benchmark]public async Task Handshake(){ using var client = new SslStream(_client, leaveInnerStreamOpen: true, delegate { return true; }); using var server = new SslStream(_server, leaveInnerStreamOpen: true, delegate { return true; }); await Task.WhenAll( client.AuthenticateAsClientAsync("localhost", null, SslProtocols.None, checkCertificateRevocation: false), server.AuthenticateAsServerAsync(_options)); await client.WriteAsync(_buffer); await server.ReadAsync(_buffer); await server.WriteAsync(_buffer); await client.ReadAsync(_buffer);}private static X509Certificate2 GetCertificate() => new X509Certificate2( Convert.FromBase64String("MIIUmgIBAzCCFFYGCSqGSIb3DQEHAaCCFEcEghRDMIIUPzCCCiAGCSqGSIb3DQEHAaCCChEEggoNMIIKCTCCCgUGCyqGSIb3DQEMCgECoIIJfjCCCXowHAYKKoZIhvcNAQwBAzAOBAhCAauyUWggWwICB9AEgglYefzzX/jx0b+BLU/TkAVj1KBpojf0o6qdTXV42drqIGhX/k1WwF1ypVYdHeeuDfhH2eXHImwPTw+0bACY0dSiIHKptm0sb/MskoGI8nlOtHWLi+QBirJ9LSUZcBNOLwoMeYLSFEWWBT69k/sWrc6/SpDoVumkfG4pZ02D9bQgs1+k8fpZjZGoZp1jput8CQXPE3JpCsrkdSdiAbWdbNNnYAy4C9Ej/vdyXJVdBTEsKzPYajAzo6Phj/oS/J3hMxxbReMtj2Z0QkoBBVMc70d+DpAK5OY3et872D5bZjvxhjAYh5JoVTCLTLjbtPRn1g7qh2dQsIpfQ5KrdgqdImshHvxgL92ooC1eQVqQffMnZ0/LchWNb2rMDa89K9CtAefEIF4ve2bOUZUNFqQ6dvd90SgKq6jNfwQf/1u70WKE86+vChXMMcHFeKso6hTE9+/zuUPNVmbRefYAtDd7ng996S15FNVdxqyVLlmfcihX1jGhTLi//WuMEaOfXJ9KiwYUyxdUnMp5QJqO8X/tiwnsuhlFe3NKMXY77jUe8F7I+dv5cjb9iKXAT+q8oYx1LcWu2mj1ER9/b2omnotp2FIaJDwI40Tts6t4QVH3bUNE9gFIfTMK+WMgKBz/JAGvC1vbPSdFsWIqwhl7mEYWx83HJp/+Uqp5f+d8m4phSan2rkHEeDjkUaoifLWHWDmL94SZBrgU6yGVK9dU82kr7jCSUTrnga8qDYsHwpQ22QZtu0aOJGepSwZU7NZNMiyX6QR2hI0CNMjvTK2VusHFB+qnvw+19DzaDT6P0KNPxwBwp07KMQm3HWTRNt9u6gKUmo5FHngoGte+TZdY66dAwCl0Pt+p1v18XlOB2KOQZKLXnhgikjOwYQxFr3oTb2MjsP6YqnSF9EpYpmiNySXiYmrYxVinHmK+5JBqoQCN2C3N24slZkYq+AYUTnNST7Ib2We3bBICOFdVUgtFITRW40T+0XZnIv8G1Kbaq/1avfWI/ieKKxyiYp/ZNXaxc+ycgpsSsAJEuhb83bUkSBpGg9PvFEF0DXm4ah67Ja1SSTmvrCnrOsWZXIpciexMWRGoKrdvd7Yzj9E8hiu+CGTC4T6+7FxVXJrjCg9zU9G2U6g7uxzoyjGj1wqkhxgvl9pPbz6/KqDRLOHCEwRF4qlWXhsJy4levxGtifFt6n7DWaNSsOUf8Nwpi+d4fd7LQ7B5tW/y+/vVZziORueruCWO4LnfPhpJ70g18uyN7KyzrWy29rpE46rfjZGGt0WDZYahObPbw6HjcqSOuzwRoJMxamQb2qsuQnaBS6Bhb5PAnY4SEA045odf/u9uC7mLom2KGNHHz6HrgEPas2UHoJLuxYvY1pza/29akuVQZQUvMA5yMFHHGYZLtTKtCGdVGwX0+QS6ovpV93xux4I/5TrD5U8z9RmTdAx03R3MUhkHF7Zbv5egDNsVar+41YWG4VkV1ZXtsZRKJf0hvKNvrpH0e7fVKBdXljm5PXOSg2VdtkhhOpnKKSMcv6MbGWVi/svWLnc7Qim4A4MDaz+bFVZmh3oGJ7WHvRQhWIcHUL+YJx+064+4IKXZJ/2a/+b2o7C8mJ3GGSBx831ADogg6MRWZx3UY19OZ8YMvpzmZEBRZZnm4KgNpj+SQnf6pGzD2cmnRhzG60LSNPb17iKbdoUAEMkgt2tlMKXpnt1r7qwsIoTt407cAdCEsUH7OU/AjfFmSkKJZ7vC5HweqZPnhgJgZ6LYHlfiRzUR1xeDg8JG0nb0vb7LUE4nGPy39/TxIGos7WNwGpG1QVL/8pKjFdjwREaR8e5CSTlQ7gxHV+G3FFvFGpA1p8cRFzlgE6khDLrSJIUkhkHMA3oFwwAzBNIKVXjToyxCogDqxWya0E1Hw5rVCS/zOCS1De2XQbXs//g46TW0wTJwvgNbs0xLShf3XB+23meeEsMTCR
0+igtMMMsh5K/vBUGcJA27ru/KM9qEBcseb/tqCkhhsdj1dnH0HDmpgFf5DfVrjm+P6ickcF2b+Ojr9t7XHgFszap3COpEPGmeJqNOUTuU53tu/O774IBgqINMWvvG65yQwsEO06jRrFPRUGb0eH6UM4vC7wbKajnfDuI/EXSgvuOSZ9wE8DeoeK/5We4pN7MSWoDl39gI/LBoNDKFYEYuAw/bhGp8nOwDKki4a16aYcBGRClpN3ymrdurWsi7TjyFHXfgW8fZe4jXLuKRIk19lmL1gWyD+3bT3mkI2cU2OaY2C0fVHhtiBVaYbxBV8+kjK8q0Q70zf0r+xMHnewk9APFqUjguPguTdpCoH0VAQST9Mmriv/J12+Y+fL6H+jrtDY2zHPxTF85pA4bBBnLA7Qt9TKCe6uuWu5yBqxOV3w2Oa4Pockv1gJzFbVnwlEUWnIjbWVIyo9vo4LBd03uJHPPIQbUp9kCP/Zw+Zblo42/ifyY+a+scwl1q1dZ7Y0L92yJCKm9Qf6Q+1PBK+uU9pcuVTg/Imqcg5T7jFO5QCi88uwcorgQp+qoeFi0F9tnUecfDl6d0PSgAPnX9XA0ny3bPwSiWOA8+uW73gesxnGTsNrtc1j85tail8N6m6S2tHXwOmM65J4XRZlzzeM4D/Rzzh13xpRA9kzm9T2cSHsXEYmSW1X7WovrmYhdOh9K3DPwSyG4tD58cvC7X79UbOB+d17ieo7ZCj+NSLVQO1BqTK0QfErdoVHGKfQG8Lc/ERQRqj132Mhi2/r5Ca7AWdqD7/3wgRdQTJSFXt/akpM44xu5DMTCISEFOLWiseSOBtzT6ssaq2Q35dCkXp5wVbWxkXAD7Gm34FFXXyZrJWAx45Y40wj/0KDJoEzXCuS4Cyiskx1EtYNNOtfDC5wngywmINFUnnW0NkdKSxmDJvrT6HkRKN8ftik7tP4ZvTaTS28Z0fDmWJ+RjvZW+vtF6mrIzYgGOgdpZwG0ZOSKrXKrY3xpMO16fXyawFfBosLzCty7uA57niPS76UXdbplgPanIGFyceTg1MsNDsd8vszXd4KezN2VMaxvw+93s0Uk/3Mc+5MAj+UhXPi5UguXMhNo/CU7erzyxYreOlAI7ZzGhPk+oT9g/MqWa5RpA2IBUaK/wgaNaHChfCcDj/J1qEl6YQQboixxp1IjQxiV9bRQzgwf31Cu2m/FuHTTkPCdxDK156pyFdhcgTpTNy7RPLDF0MBMGCSqGSIb3DQEJFTEGBAQBAAAAMF0GCSsGAQQBgjcRATFQHk4ATQBpAGMAcgBvAHMAbwBmAHQAIABTAHQAcgBvAG4AZwAgAEMAcgB5AHAAdABvAGcAcgBhAHAAaABpAGMAIABQAHIAbwB2AGkAZABlAHIwggoXBgkqhkiG9w0BBwagggoIMIIKBAIBADCCCf0GCSqGSIb3DQEHATAcBgoqhkiG9w0BDAEGMA4ECH63Q8xWHKhqAgIH0ICCCdDAo9x82rwRM6s16wMo01glVedahn1COCP1FKmP6lQ3kjcHruIWlcKW+eCUpt41qs0LM3iFcPQj5x7675DeLL0AC2Ebu7Jhg0FGMJZwHLbmJLyG0VSb1WhX2UfxNSdLrdZv8pmejB7DYdV3xAj8DBCRGfwwnbTQjFH9wUPga5U79Dvpqq+YVvUEEci1N6tTPu32LOOEvjoEtpskrHoKyqLGV7sSgM6xMIDcfVWbLb8fDcVS1JQRHbeOdGClFMDjwzr+eGWd+OyOZ6BydUGjIKAZpRp0YTk5jjYUMNRbvBP1VPq9ASIh8pJnt/Kq1nqfj7EPatXJJUZAH35E6bSbLBnP0+5+xim1l4HsB8066c4B3aTUXnLepPRyMIn6Xh5ev0pF3aUc4ZlWgar57TzKUFBTkcH5OCbqZloQ7ZCDNc4C3WKVLSUOKLj3QOxJPrb6/nyXZHjki1tGKisb9RLv4dkeMdRjsSwNRn6Cfdlk2qHWUCiWLlsLXFyMSM12qrSSfIIBRo0wbn1SEJagHqUmlF9UR5A6b5OODIbDq3cXH/q6U09zVX/BxqxyZqEfeSAcvXjqImLWnZzbIgm0QH7jOtti/vEfvdzypdWH9V64PzQj/5B8P4ZpbQyWUgzKEIdx24WhTOcdwNivkaEkGFTra3qw2dKO0RTVtx3bSgesHCumQDuDf8yafLfchWuqihYV7zvqW9BWrsa0W7yKNXLNqdlSz8KvuTnFffOOHrJQwBs+JKdMcKX5IR222RH3fp8Dp17y8hFEaPp4AqpuhHGALXOCwmuPtlUjuHRCUluh3BjaPPLNwLmSGfe0piOVh4rTyJCfN4rlz0lWBAAfIHi47J9sTnSgEJgkTuemPJXssQ3Z/trcYdfhlYjelOBtS/5DW3wFmjNDilwVBQT66li5xUvcWvZPx/scXgbgpsMThqguJWtiPLR1SzusKCN4q7bVQ8D8ErHh5uMb5NmNRIZ/xNeqslqTU9A4bi0TE0FjEu28F0Wg4CxiwqNM58xik9eni85t+S0Uo9wPV1V2Vdhe9LkO3PeoSTCau4D189DoViL44WPDQ+TCSvlPP7SFEwaBvUlGBWjxJWVb81lkgRsol1bllUvIzN13V0LSiA0Nks9w9H8cQ17ZRe2r7SpDDR6Rn5oLb9G98AyvlcgJfyUe1iZCUAUZGEU247KwePtXYAlO47HbAJe0bOtM9zp7KyWxbImKCfxsPWv6CR6PH+ooHDBO9kXVpKaJCYWeYybSMuPufy/u/rMcIVO4oXVsdnjh4jAxpQOXowCAcN2+Q+XnqtiCr9Mzd0q5ee7jsYuJF6LQRdNP04wIpwjpdggKyB7zURPeTXlV8vIjUs25+CoCxp+fCXfXKqe2xxdbQ2zFbpKSbJdpbWad3F6MsFBGOKTdyK8EZODGApBtlo71kY6uOxKiBwJKd76zTMsPEQWOZphi2khpTxIVYONrmPKjSO8zc4dTC8SW+d1kmCt4UYblwoeDCAYp2RiDHpgC+5yuBDCooT/6fG6GQpa1X0PiH2oUCpltZz2M4+1bH2HdTeBfc1Mtj/hniLL8VdH0qcpS0KYPUxJFEg6IxxrWw1OBreY//6pJLm76nKiflzhz+Mt0RbQZqkPP/K9BxzQw//bW9Kh4iRQ37D9HNQG/GtrCEcbH4V4uUjbj34sEo0FC7gVvDob0Bik8l/c901zQZEydqe0DgHtGbY2xIZ2qqsQy4LDVfHNHqSLiNssL8BJtxUyvnhiwHD7jmyCB6cWyFGtibRBehQzleioS16xvLph88CMGV3IH9By5QtXpDIB4vjhibE6coPkTmpDCB9xlTE3TV4GBt5JLttkjfOkXAAx0xD523Adcy6FVe5QYuY1O8170O6l88YptozyWi5jVfDh+aDg9pjsw/aZ1hCURe9KDaB4gIlW4ZEGKsf5e/xU+vuVxw374te/Y2aCChSj93XyC+Fjxe06s4yifVAYA0+HtLMGNHe/X0kPXvRnoa5kIu0yHrzViQrBb/4Sbms617Gg1BFONks1JO2G0zIt8CouTqVmdtuH7tV0JZV/Nm
g7NQ1X59XDC/JH2i4jOu8OhnmIZFlTysS6e1qnqsGt/0XcUyzPia8+UIAynXmyi8sWlUjy37w6YqapAfcs7B3TezqIwn7RgRasJpNBi7eQQqg5YLe6EYTxctKNkGpzeTBUiXNXM4Gv3tIaMbzwlhUNbYWuNBsi/7XJPM5jMycINRbdPwYy19gRBs3pm0FoP2Lhl5mVAJ2R8a40Lo5g73wvt9Th+uB9/yc196RryQe280yfgKiwUoFFcDnL6SoQTRCTl95mF8zw1f3Hc7QImhubgcLntXEndzSNN7ZIDSAB8HiDSR6CGYPNiCNAC4hj+jUswoWIE257h+deWFTUvjTZmXH+XMoN6trqjdeCH0hePdmrIWVdr1uTIoO16TR6mFNm6Utzc0t5vVrcpnEh3w6amVHw5xmweW4S75ncN6vSPxGjtfuQ6c2RTG5NXZuWpnhXwOxgoBN4q/h99zVRvwwsF32Eyzx6GOYLmORgCkzke9eXjjXWY83oysXx/aE9WCqt3en8zzRzzA1aO9Yi88uv1OOqTvWEoGrf4e7SgjXO6hNjYE6EEvK+mMz6a9F3xSWsUlMsZPIIBe8CEgNEhXKsa6xw7ljSx8Nz7zYG+u5rgXKFmSNvWvwasZyIfRXkccqODl17BaevbWp/ir3rJ/b9mOiV0UW8qIJ3zC6b1lXU5pNuOODjqhKkjIHPGXiql+uBPVlfUy8Zbi4AntZAeNIB7HtUavVKX6CF7k9AFtRHIWK70+cFEw4yMZiQjaWeB3dt16Fz6LZ8+c17kuB2wFuZQqYQkf3quWQVPwKj41gFYoFSwFfJ8L6TBcNHI2u3avtVp9ZbP9zArT8An9Ryri/PwTSbPLTcaz549b60/0k4c/qV4XRMuFsi29CXcMnLSCPpPKs71LTvsRXK6QUJd4fX/KnTiWargbS6tT6lR/bBqY/gFU1xWyKQ8xij97vlQjffSKdcbj5JsnjSr8xAh9idfJ2FWZZUJReR9EU1twK7slyUivNLVY7bqroE6CzYaEDecRqfwIrFrzmH+gJoM88waGRC0JTvm8GpBX0eTb5bnMxJKPtH1GIffgyQLERO1jwjApr6SJEB4yV7x48CZPod9wE51OxUY2hEdAA5l7DBTJysg5gn/nhY6ZzL0llb39yVyDEcZdmrji0ncEMdBDioGBV3mNz1DL398ZLdjG+xkneI3sgyzgm3cZZ1+/A2kloIEmOKJSe0k/B1cyMB5QRnXpObF1vWXjauMVIKm0wlLY3YQ9I1vfr6y1o2DN+Vy0sumbIQrjDKqMDswHzAHBgUrDgMCGgQUHEWyD7i5PbatVl3k0+S9WV3ZJRAEFFd7xcvfj1HpkOawyGnJdtcQ0KWPAgIH0A=="), "testcertificate", X509KeyStorageFlags.DefaultKeySet);
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
Handshake | .NET 6.0 | 4.647 ms | 1.00 | 19.27 KB | 1.00 |
Handshake | .NET 7.0 | 2.314 ms | 0.50 | 9.56 KB | 0.50 |
Another significant improvement forSslStream
in .NET 7 is support for OCSP stapling. When a client handshakes with the server and the server shares its certificate, a client that cares about validating it’s talking to exactly who it intended to talk to needs to validate that certificate. In the days of yore, such validation was done with certificate revocation lists (CRL), where periodically the client would download a giant list of certificates known to be revoked. Online Certificate Status Protocol (OCSP) is a newer protocol and mechanism that enables a client to get real-time information about a certificate; while the client handshakes with the server and the server sends the client its certificate, the client then connects to an “OCSP responder” and sends it a request to determine whether the certificate is considered good. OCSP has multiple issues of its own, however. In particular, it places a significant load on these OCSP responder servers, with every client making a real-time request to it about every certificate encountered, and also potentially significantly increasing the time it takes the client to establish a connection. OCSP stapling offers a solution to this. Rather than a client issuing a request to the OCSP responder, the server itself contacts the OCSP responder and gets a signed ticket from the OCSP responder stating that the server’s certificate is good and will be for some period of time. When a client handshakes with the server, the server can then “staple” (include) this signed ticket as part of its response to the client, giving the validation to the client directly rather than the client needing to make a separate roundtrip to the OCSP responder. This reduces overheads for everyone involved.dotnet/runtime#67011 adds support for OCSP stapling toSslStream
client usage on Linux, withdotnet/runtime#69833 adding the Linux server-side counterpart, anddotnet/runtime#71570 adds client-side support for Windows.
The aforementioned changes are primarily about the performance of opening a connection. Additional work has been done to improve that further in other ways.dotnet/runtime#69527 gets rid of allocations associated with severalSafeHandle
instances that were being created unnecessarily on Linux as part of establishing a TLS connection. This highlights the benefits of doing profiling on multiple platforms, as while theseSafeHandle
s were necessary in the Windows implementation, they were fairly meaningless in the Linux implementation (due to differences between SChannel and OpenSSL), and were only brought along for the ride because of how the platform-abstraction layer (PAL) was defined to reuse most of theSslStream
code across platforms. Anddotnet/runtime#68188 avoids several collections allocated as part of the TLS handshake. This one is particularly interesting as it’s come up multiple times in the past in various libraries. Imagine you have a lazily initialized property like this:
private List<T>? _items;
public List<T> Items => _items ??= new List<T>();
And then some code in the same implementation comes along and wants to read the contents of these items. That code might look like:
if (Items.Count > 0) { ... }
but the very act of accessingItems
just to check its count forces the collection into existence (with a 0Count
). If the code instead checks:
if (_items is List<T> items && items.Count > 0) { ... }
It can save that unnecessary collection allocation. The approach is made even simpler with C# pattern matching:
if (_items is { Count: > 0 } items) { ... }
This is one of those things that’s incredibly obvious once you “see” it and realize what’s happening, but that you often miss until it jumps out at you in a profiler.
dotnet/runtime#69098 is another good example of how profiling can lead to insights about allocations that can be removed. Application-Layer Protocol Negotiation (ALPN) allows code establishing a TLS connection to piggy-back on the roundtrips that are being used for the TLS handshake anyway to negotiate some higher-level protocol that will end up being used as well. A very common use-case, for example, is for an HTTPS client/server to negotiate which version of HTTP should be used. This information is exposed fromSslStream
as anSslApplicationProtocol
struct returned from itsNegotiatedApplicationProtocol
property, but as the actual negotiated protocol can be arbitrary data,SslApplicationProtocol
just wraps abyte[]
. The implementation had been dutifully allocating abyte[]
to hold the bytes passed around as part of ALPN, since we need such abyte[]
to store in theSslApplicationProtocol
. But while the byte datacan be arbitrary, in practice by far the most common byte sequences are equivalent to “http/1.1” for HTTP/1.1, “h2” for HTTP/2, and “h3” for HTTP/3. Thus, it makes sense to special-case those values and use a reusable cachedbyte[]
singleton when one of those values is needed. IfSslApplicationProtocol
exposed the underlyingbyte[]
directly to consumers, we’d be hesitant to use such singletons, as doing so would mean that if code wrote into thebyte[]
it would potentially be changing the value for other consumers in the same process. However,SslApplicationProtocol
exposes it as aReadOnlyMemory<byte>
, which is only mutable via unsafe code (using theMemoryMarshal.TryGetArray
method), and once you’re employing unsafe code to do “bad” things, all bets are off anyway.dotnet/runtime#63674 also removes allocations related to ALPN, in this case avoiding the need for abyte[]
allocation on Linux when setting the negotiated protocol on a clientSslStream
. It uses stack memory instead of an array allocation for protocols up to 256 bytes in length, which is way larger than any in known use, and thus doesn’t bother to do anything fancy for the fallback path, which will never be used in practice. Anddotnet/runtime#69103 further avoids ALPN-related allocations and work on Windows by entirely skipping some unnecessary code paths: various methods can be invoked multiple times during a TLS handshake, but even though the ALPN-related work only needed to happen once the first time, the code wasn’t special-casing it and was instead repeating the work over and over.
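As a rough sketch of the special-casing idea (names and exact shape here are assumptions for illustration, not the actual SslStream internals), the cached singletons might look something like this:

using System;
using System.Text;

// For the handful of ALPN values seen in practice, hand back a cached byte[]
// singleton rather than allocating a new array per negotiated connection.
// Exposing the result only as ReadOnlyMemory<byte> keeps the shared arrays
// safe from mutation by safe code.
static class AlpnCache
{
    private static readonly byte[] s_http11 = Encoding.ASCII.GetBytes("http/1.1");
    private static readonly byte[] s_http2 = Encoding.ASCII.GetBytes("h2");
    private static readonly byte[] s_http3 = Encoding.ASCII.GetBytes("h3");

    public static ReadOnlyMemory<byte> GetProtocol(ReadOnlySpan<byte> negotiated) =>
        negotiated.SequenceEqual(s_http11) ? s_http11 :
        negotiated.SequenceEqual(s_http2) ? s_http2 :
        negotiated.SequenceEqual(s_http3) ? s_http3 :
        negotiated.ToArray(); // uncommon protocols still allocate
}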
Everything discussed thus far was about establishing connections. What about the performance of reading and writing on that connection? Improvements have been made there, too, in particular around memory management and asynchrony. But first we need some context.
Whenasync/await
were first introduced,Task
andTask<TResult>
were the only game in town; while the pattern-based mechanism the compiler supports for arbitrary “task-like” types enabledasync
methods to return other types, in practice it was only tasks (which also followed our guidance). We soon realized, however, that a significant number of calls to a significant number of commonly-used async APIs would actually complete synchronously. Consider, for example, a method likeMemoryStream.ReadAsync
:MemoryStream
is backed entirely by an in-memory buffer, so even though the operation is “async,” every call to it completes synchronously, as the operation can be performed without doing any potentially long-running I/O. Or considerFileStream.ReadAsync
. By defaultFileStream
employs its own internal buffer. If you issue a call toFileStream.ReadAsync
with your own buffer and ask for only, say, 16 bytes, under the coversFileStream.ReadAsync
will issue the actual native call with its own much larger buffer, which by default is 4K. The first time you issue your 16-byte read, actual I/O will be required and the operation is likely to complete asynchronously. But the next 255 calls you make could simply end up draining the remainder of the data read into that 4K buffer, in which case 255 of the 256 “async” operations actually complete synchronously. If the method returns aTask<int>
, every one of those 255 synchronously-completing calls could still end up allocating aTask<int>
, just to hand back theint
that’s already known. Various techniques were devised to minimize this, e.g. if theint
is one of a few well-known values (e.g. -1 through 8), then the async method infrastructure will hand back a pre-allocated and cachedTask<int>
instance for that value, and various stream implementations (includingFileStream
) would cache the previously-returnedTask<int>
and hand it back for the next call as well if the next call yielded exactly the same number of bytes. But those optimizations don’t fully mitigate the issue. Instead, we introduced theValueTask<TResult>
struct and provided the necessary “builder” to allowasync
methods to return them.ValueTask<TResult>
was simply a discriminated union between a TResult
andTask<TResult>
. If an async method completed asynchronously (or if it failed synchronously), well, it would simply allocate theTask<TResult>
as it otherwise would have and return that task wrapped in aValueTask<TResult>
. But if the method actually completed synchronously and successfully, it would create aValueTask<TResult>
that just wrapped the resultingTResult
, which then eliminates all allocation overhead for the synchronously-completing case. Yay, everyone’s happy. Well, almost everyone. For really hot paths, especially those lower down in the stack that many other code paths build on top of, it can also be beneficial to avoid the allocations even for the asynchronously completing case. To address that, .NET Core 2.1 saw the introduction of theIValueTaskSource<TResult>
interface along with enablingValueTask<TResult>
to wrap an instance of that interface in addition to aTResult
or aTask<TResult>
(at which point it also became meaningful to introduce a non-genericValueTask
and the associatedIValueTaskSource
). Someone can implement this interface with whatever behaviors they want, although we codified the typical implementation of the core async logic into theManualResetValueTaskSourceCore
helper struct, which is typically embedded into some object, with the interface methods delegating to corresponding helpers on the struct. Why would someone want to do this? Most commonly, it’s to be able to reuse the same instance implementing this interface over and over and over. So, for example,Socket
exposes aValueTask<int> ReceiveAsync
method, and it caches a single instance of anIValueTaskSource<int>
implementation for use with such receives. As long as you only ever have one receive pending on a given socket at a time (which is the 99.999% case), everyReceiveAsync
call will either return aValueTask<int>
wrapped around anint
value or aValueTask<int>
wrapped around that reusableIValueTaskSource<int>
, making all use ofReceiveAsync
amortized allocation-free (there is another instance used for SendAsync
, such that you can have a concurrent read and write on the socket and still avoid allocations). However, implementing this support is still non-trivial, and can be super hard when dealing with an operation that’s composed of multiple suboperations, which is exactly whereasync/await
shine. Thus, C# 10 added support for overriding the default builder that’s used on an individual async method (e.g. such that someone could provide their own builder for aValueTask<int>
-returning method instead of the one that allocatesTask<int>
instances for asynchronous completion) and .NET 6 included the newPoolingAsyncValueTaskMethodBuilder
andPoolingAsyncValueTaskMethodBuilder<>
types. With those, anasync
method like:
public async ValueTask<int> ReadAsync(Memory<byte> buffer) { ... }
can be changed to be:
[AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
public async ValueTask<int> ReadAsync(Memory<byte> buffer) { ... }
which will cause the C# compiler to emit the implementation of this method usingPoolingAsyncValueTaskMethodBuilder<int>
instead of the defaultAsyncValueTaskMethodBuilder<int>
. The implementation ofPoolingAsyncValueTaskMethodBuilder<TResult>
is true to its name; it employs pooling to avoidmost of the allocation asynchronous completion would otherwise experience (I say “most” because the pooling by design tries to balance all the various costs involved and may still sometimes allocate), and makes it easy for methods implemented withasync
/await
to reap those benefits. So, if this was all introduced in the last release, why am I talking about it now? Pooling isn’t free. There are various tradeoffs involved in its usage, and while it can make microbenchmarks look really good, it can also negatively impact real-world usage, e.g. by increasing the cost of the garbage collections that do occur, due to an increased number of Gen2-to-Gen0 references. As such, while the functionality is valuable, we’ve been methodical in where and how we use it, choosing to do so more slowly and only employing it after sufficient analysis deems it worthwhile.
Such is the case withSslStream
. Withdotnet/runtime#69418, two core and hotasync
methods onSslStream
‘s read path were annotated to use pooling. A microbenchmark shows what I mean when I wrote this can make microbenchmarks look really good (focus on the allocation columns). This benchmark is repeatedly issuing a read (that will be forced to complete asynchronously because there’s no available data to satisfy it), then issuing a write to enable that read to complete, and thenawait
ing the read’s completion; every read thus completes asynchronously.
private SslStream _sslClient, _sslServer;private readonly byte[] _buffer = new byte[1];private readonly SslServerAuthenticationOptions _options = new SslServerAuthenticationOptions{ ServerCertificateContext = SslStreamCertificateContext.Create(GetCertificate(), null),};[GlobalSetup]public void Setup(){ using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp); listener.Bind(new IPEndPoint(IPAddress.Loopback, 0)); listener.Listen(1); var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp); client.Connect(listener.LocalEndPoint); _sslClient = new SslStream(new NetworkStream(client, ownsSocket: true), leaveInnerStreamOpen: true, delegate { return true; }); _sslServer = new SslStream(new NetworkStream(listener.Accept(), ownsSocket: true), leaveInnerStreamOpen: true, delegate { return true; }); Task.WaitAll( _sslClient.AuthenticateAsClientAsync("localhost", null, SslProtocols.None, checkCertificateRevocation: false), _sslServer.AuthenticateAsServerAsync(_options));}[GlobalCleanup]public void Cleanup(){ _sslClient.Dispose(); _sslServer.Dispose();}[Benchmark]public async Task ReadWriteAsync(){ for (int i = 0; i < 1000; i++) { ValueTask<int> read = _sslClient.ReadAsync(_buffer); await _sslServer.WriteAsync(_buffer); await read; }}
Method | Runtime | Mean | Ratio | Code Size | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|
ReadWriteAsync | .NET 6.0 | 68.34 ms | 1.00 | 510 B | 336404 B | 1.000 |
ReadWriteAsync | .NET 7.0 | 69.60 ms | 1.02 | 514 B | 995 B | 0.003 |
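As an aside, the reusable IValueTaskSource pattern described earlier, the manual alternative to relying on the pooling builder, looks roughly like the following. This is a minimal sketch built on ManualResetValueTaskSourceCore; the names are illustrative, and it is not the actual Socket implementation.

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// One instance is handed out over and over; ManualResetValueTaskSourceCore
// supplies the completion/continuation plumbing and the "version" token that
// guards against misuse of a stale ValueTask.
internal sealed class ReusableValueTaskSource : IValueTaskSource<int>
{
    private ManualResetValueTaskSourceCore<int> _core; // mutable struct field; don't make it readonly

    // Hands out a ValueTask<int> tied to the current version of this instance.
    public ValueTask<int> RentTask() => new ValueTask<int>(this, _core.Version);

    // Called by whatever completes the pending operation.
    public void SetResult(int result) => _core.SetResult(result);

    int IValueTaskSource<int>.GetResult(short token)
    {
        try
        {
            return _core.GetResult(token);
        }
        finally
        {
            _core.Reset(); // make the instance reusable for the next operation
        }
    }

    ValueTaskSourceStatus IValueTaskSource<int>.GetStatus(short token) => _core.GetStatus(token);

    void IValueTaskSource<int>.OnCompleted(Action<object?> continuation, object? state, short token, ValueTaskSourceOnCompletedFlags flags) =>
        _core.OnCompleted(continuation, state, token, flags);
}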
One final change related to reading and writing performance on anSslStream
. I find this one particularly interesting, as it highlights a new and powerful C# 11 and .NET 7 feature: static abstract members in interfaces.SslStream
, as with everyStream
, exposes both synchronous and asynchronous methods for reading and writing. And as you may be aware, the code withinSslStream
for implementing reads and writes is not particularly small. Thus, we really want to avoid having to duplicate all of the code paths, once for synchronous work and once for asynchronous work, when in reality the only place that bifurcation is needed is at the leaves where calls into the underlyingStream
are made to perform the actual I/O. Historically, we’ve had two different mechanisms we’ve employed indotnet/runtime for handling such unification. One is to make all methodsasync
, but with an additionalbool useAsync
parameter that gets fed through the call chain, then branching based on it at the leaves, e.g.
public static void Work(Stream s) => A(s, useAsync: false).GetAwaiter().GetResult(); // GetResult() to propagate any exceptions
public static Task WorkAsync(Stream s) => A(s, useAsync: true);

internal static async Task A(Stream s, bool useAsync)
{
    ...
    await B(s, useAsync);
    ...
}

private static async Task B(Stream s, bool useAsync)
{
    ...
    int bytesRead = useAsync ?
        await s.ReadAsync(buffer) :
        s.Read(buffer.Span);
    ...
}
This way most of the logic and code is shared, and when useAsync is false, everything completes synchronously and so we don’t pay for allocation that might otherwise be associated with theasync
-ness. The other approach is similar in spirit, but instead of abool
parameter, taking advantage of generic specialization and interface-implementing structs. Consider an interface like:
interface IReader
{
    ValueTask<int> ReadAsync(Stream s, Memory<byte> buffer);
}
We can then declare two implementations of this interface:
struct SyncReader : IReader
{
    public ValueTask<int> ReadAsync(Stream s, Memory<byte> buffer) => new ValueTask<int>(s.Read(buffer.Span));
}

struct AsyncReader : IReader
{
    public ValueTask<int> ReadAsync(Stream s, Memory<byte> buffer) => s.ReadAsync(buffer);
}
Then we can redeclare our earlier example as:
public static void Work(Stream s) => A(s, default(SyncReader)).GetAwaiter().GetResult(); // to propagate any exceptions
public static Task WorkAsync(Stream s) => A(s, default(AsyncReader));

internal static async Task A<TReader>(Stream s, TReader reader) where TReader : IReader
{
    ...
    await B(s, reader);
    ...
}

private static async Task B<TReader>(Stream s, TReader reader) where TReader : IReader
{
    ...
    int bytesRead = await reader.ReadAsync(s, buffer);
    ...
}
Note that the generic constraint on theTReader
parameter here allows the implementation to invoke the interface methods, and passing the structs as a generic avoids boxing. One code path supporting both sync and async implementations.
This latter generic approach is howSslStream
has historically handled the unification of its sync and async implementations. It gets better in .NET 7 with C# 11 now that we have static abstract methods in interfaces. We can instead declare our interface as (note thestatic abstract
addition):
interface IReader
{
    static abstract ValueTask<int> ReadAsync(Stream s, Memory<byte> buffer);
}
our types as (note thestatic
addition):
struct SyncReader : IReader
{
    public static ValueTask<int> ReadAsync(Stream s, Memory<byte> buffer) => new ValueTask<int>(s.Read(buffer.Span));
}

struct AsyncReader : IReader
{
    public static ValueTask<int> ReadAsync(Stream s, Memory<byte> buffer) => s.ReadAsync(buffer);
}
and our consuming methods as (note the removal of the parameter and the switch to calling static methods on the type parameter):
public static void Work(Stream s) => A<SyncReader>(s).GetAwaiter().GetResult(); // to propagate any exceptions
public static Task WorkAsync(Stream s) => A<AsyncReader>(s);

internal static async Task A<TReader>(Stream s) where TReader : IReader
{
    ...
    await B<TReader>(s);
    ...
}

private static async Task B<TReader>(Stream s) where TReader : IReader
{
    ...
    int bytesRead = await TReader.ReadAsync(s, buffer);
    ...
}
Not only is this cleaner, but from a performance perspective we no longer need to pass around a dummy struct argument, which is general goodness; for an async method it’s particularly beneficial because the state machine type ends up storing all parameters as fields, which means every parameter can increase the amount of allocation incurred by an async method if the method ends up completing asynchronously.dotnet/runtime#65239 flippedSslStream
(andNegotiateStream
) to follow this approach. It’s also used in multiple other places now throughout dotnet/runtime.dotnet/runtime#69278 from@teo-tsirpanis changed theRandomAccess
class’ implementation for Windows and theThreadPool
‘s mechanism for invoking work items to use the same approach. Further,dotnet/runtime#63546 did the same in theRegex
implementation, and in particular in the newRegexOptions.NonBacktracking
implementation, as a way to abstract over DFA and NFA-based operations using the same code (this technique was since further utilized inNonBacktracking
, such as bydotnet/runtime#71234 from@olsaarik). And potentially most impactfully,dotnet/runtime#73768 did so withIndexOfAny
to abstract away the differences betweenIndexOfAny
andIndexOfAnyExcept
(also for theLast
variants). With the introduction of the{Last}IndexOfAnyExcept
variations previously mentioned, we now have four different variants ofIndexOfAny
with essentially the same functionality: searching forward or backwards, and with equality or inequality. While more challenging to try to unify the directional aspect, this PR utilized this same kind of generic specialization to hide behind an interface the ability to negate the comparison; the core implementations of these methods can then be implemented once and passed either aNegate
orDontNegate
implementation of the interface. The net result is not only that the newExcept
varieties immediately gained all of the optimizations of the non-Except
varieties, but also the goal of trying to make everything consistent resulted in finding places where we were missing optimization opportunities in existing methods (gaps that the PR also rectified).
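To illustrate the negation technique, here is a rough sketch; the names are invented for illustration and the real code is considerably more elaborate, but the idea is that the core scan is written once and the JIT specializes away the negation decision per generic instantiation.

using System;

interface INegator
{
    static abstract bool NegateIfNeeded(bool equals);
}

struct DontNegate : INegator
{
    public static bool NegateIfNeeded(bool equals) => equals;
}

struct Negate : INegator
{
    public static bool NegateIfNeeded(bool equals) => !equals;
}

static class Scanner
{
    // One implementation serves both the "any" and "any except" flavors.
    public static int IndexOfAnyCore<TNegator>(ReadOnlySpan<char> span, char value0, char value1)
        where TNegator : struct, INegator
    {
        for (int i = 0; i < span.Length; i++)
        {
            if (TNegator.NegateIfNeeded(span[i] == value0 || span[i] == value1))
            {
                return i;
            }
        }
        return -1;
    }

    public static int IndexOfAny(ReadOnlySpan<char> span, char v0, char v1) =>
        IndexOfAnyCore<DontNegate>(span, v0, v1);

    public static int IndexOfAnyExcept(ReadOnlySpan<char> span, char v0, char v1) =>
        IndexOfAnyCore<Negate>(span, v0, v1);
}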
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;[Benchmark]public int LastIndexOfAny() => s_haystack.AsSpan().LastIndexOfAny(';', '_');
Method | Runtime | Mean | Ratio |
---|---|---|---|
LastIndexOfAny | .NET 6.0 | 9.977 us | 1.00 |
LastIndexOfAny | .NET 7.0 | 1.172 us | 0.12 |
Let’s move up the stack to HTTP. Most of the folks focusing on networking in .NET 7 were focused on taking the preview support for HTTP/3 that shipped in .NET 6 and making it a first-class supported feature in .NET 7. That included functional improvements, reliability and correctness fixes, and performance improvements, such that HTTP/3 can now be used viaHttpClient
on both Windows and Linux (it depends on an underlying QUIC implementation in the msquic component, which isn’t currently available for macOS). However, there were significant improvements throughout the HTTP stack, beyond HTTP/3.
One aspect ofHttpClient
that cuts across all versions of HTTP is support for handling and representing headers. While significant improvements went into previous releases to trim down the size of the data structures used to store header information, further work on this front was done for .NET 7.dotnet/runtime#62981, for example, improves the data structure used to store headers. One of the thingsHttpHeaders
needs to deal with is that there’s no defined limit to the number of headers that can be sent with an HTTP request or response (though in order to mitigate possible denial of service attacks, the implementation has a configurable limit for how many bytes of headers are accepted from the server), and thus it needs to be able to handle an arbitrary number of them and to do so with efficient access. As such, for the longest timeHttpHeaders
has used aDictionary<,>
to provideO(1)
lookup into these headers. However, while it’s valid to have large numbers of headers, it’s most common to only have a handful, and for only a few items, the overheads involved in a hash table likeDictionary<>
can be more than just storing the elements in an array and doing anO(N)
lookup by doing a linear search through all the elements (algorithmic complexity ignores the “constants” involved, so for a smallN
, anO(N)
algorithm might be much faster and lighter weight than anO(1)
). This PR takes advantage of that and teachesHttpHeaders
how to use either an array or a dictionary; for common numbers of headers (the current threshold is 64), it just uses an array, and in the rare case where that threshold is exceeded, it graduates into a dictionary. This reduces the allocation in HttpHeaders
in all but the most niche cases while also making it faster for lookups.
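A sketch of that array-or-dictionary idea, with invented names and a much simplified shape rather than the actual HttpHeaders code, might look like this; the threshold of 64 mirrors the one mentioned above.

using System;
using System.Collections.Generic;

internal sealed class SmallCountMap<TKey, TValue> where TKey : notnull
{
    private const int ArrayThreshold = 64;
    private KeyValuePair<TKey, TValue>[] _entries = new KeyValuePair<TKey, TValue>[4];
    private int _count;
    private Dictionary<TKey, TValue>? _dictionary;

    public void Add(TKey key, TValue value)
    {
        if (_dictionary is null && _count < ArrayThreshold)
        {
            if (_count == _entries.Length)
            {
                Array.Resize(ref _entries, _entries.Length * 2);
            }
            _entries[_count++] = new KeyValuePair<TKey, TValue>(key, value);
            return;
        }

        if (_dictionary is null)
        {
            // Crossed the threshold: graduate the array contents into a dictionary.
            _dictionary = new Dictionary<TKey, TValue>(ArrayThreshold * 2);
            for (int i = 0; i < _count; i++)
            {
                _dictionary.Add(_entries[i].Key, _entries[i].Value);
            }
        }

        _dictionary.Add(key, value);
    }

    public bool TryGetValue(TKey key, out TValue? value)
    {
        if (_dictionary is not null)
        {
            return _dictionary.TryGetValue(key, out value);
        }

        for (int i = 0; i < _count; i++) // O(N), but N is small by construction
        {
            if (EqualityComparer<TKey>.Default.Equals(_entries[i].Key, key))
            {
                value = _entries[i].Value;
                return true;
            }
        }

        value = default;
        return false;
    }
}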
Another header-related size reduction comes indotnet/runtime#64105. The internal representation of headers involves aHeaderDescriptor
that enables “known headers” (headers defined in the HTTP specifications or that we’re otherwise aware of and want to optimize) to share common data, e.g. if a response header matches one of these known headers, we can just use the header name string singleton rather than allocating a new string for that header each time we receive it. ThisHeaderDescriptor
accommodated both known headers and custom headers by having two fields, one for known header data (which would be null for custom headers) and one for the header name. Instead, this PR employs a relatively common technique of having a single object
field that then stores either the known header information or the name, since the known header information itself includes the name, and thus we don’t need the duplication. At the expense of a type check when we need to look up information from that field, we cut the number of fields in half. And while thisHeaderDescriptor
is itself a struct, it’s stored in header collections, and thus by cutting the size of theHeaderDescriptor
in half, we can significantly reduce the size of those collections, especially when many custom headers are involved.
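The single-field technique looks roughly like this; it's a simplified, hypothetical sketch rather than the actual HeaderDescriptor, but it shows the shape of trading a type check for half the fields.

// The descriptor stores either a rich "known header" object or just a string
// name in a single object-typed field, and recovers the name with a type check.
internal sealed class KnownHeaderInfo // hypothetical stand-in for the known-header data
{
    public KnownHeaderInfo(string name) => Name = name;
    public string Name { get; }
}

internal readonly struct HeaderDescriptorSketch
{
    private readonly object _descriptor; // either a KnownHeaderInfo or a string

    public HeaderDescriptorSketch(KnownHeaderInfo knownHeader) => _descriptor = knownHeader;
    public HeaderDescriptorSketch(string customHeaderName) => _descriptor = customHeaderName;

    public string Name => _descriptor is KnownHeaderInfo known ? known.Name : (string)_descriptor;
    public KnownHeaderInfo? KnownHeader => _descriptor as KnownHeaderInfo;
}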
private readonly string[] _strings = new[] { "Access-Control-Allow-Credentials", "Access-Control-Allow-Origin", "Cache-Control", "Connection", "Date", "Server" };[Benchmark]public HttpResponseHeaders GetHeaders(){ var headers = new HttpResponseMessage().Headers; foreach (string s in _strings) { headers.TryAddWithoutValidation(s, s); } return headers;}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
GetHeaders | .NET 6.0 | 334.4 ns | 1.00 | 664 B | 1.00 |
GetHeaders | .NET 7.0 | 213.9 ns | 0.64 | 360 B | 0.54 |
Similarly focused on allocation,dotnet/runtime#63057 removes two fields from theHttpHeaderValueCollection<T>
collection type, which provides the concrete implementation forICollection<T>
properties likeHttpContentHeaders.ContentEncoding
,HttpRequestHeaders.UserAgent
, andHttpResponseHeaders.Server
. The initial design and implementation of this type were overly flexible, with a mechanism for custom validation of values, which entailed multiple fields for storing things like anAction<>
callback to use for validation. But as it turns out in practice, that validation was only used by one specific consumer, and so rather than making everyone pay for the extra space that wasn’t typically used, the validation was instead extracted out to just the call sites where it was required.
A more focused allocation reduction comes indotnet/runtime#63641. The shared internal utility methodHttpRuleParser.GetHostLength
was usingstring.Substring
in order to hand back the parsed host information, but only some of the callers needed this. Rather than making everyone pay for something that not everyone needed, this logic was moved into only the call sites that needed it.
Other small allocation improvements were also made outside of headers. For example, when new HTTP/1 and HTTP/2 connections are created, the implementation queues a work item to the thread pool to handle the actual creation, primarily to escape locks that might be held higher in the call stack. To do so, it usedTask.Run
. And while normallyTask.Run
is a fine thing to use, in this case there were two issues: the resultingTask
was being ignored, such that any unexpected exceptions would just be eaten, and the lambda being passed toTask.Run
was closing overthis
and a local, which means the C# compiler will have generated code to allocate both a “display class” (an object to store the state being passed in) for the closure and then also a delegate to a method on that display class. Instead,dotnet/runtime#68750 switches it to useThreadPool.QueueUserWorkItem
, using the overload that takes a genericTState
, and passing in a tuple of all required state in order to avoid both superfluous allocations.
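A sketch of the difference, with invented names and placeholder work, looks like this:

using System;
using System.Threading;
using System.Threading.Tasks;

static class ConnectionQueueing
{
    // Before: the lambda captures 'pool' and 'target', so the compiler allocates a
    // display class plus a delegate, and the resulting Task is simply ignored.
    public static void QueueWithTaskRun(object pool, Uri target) =>
        _ = Task.Run(() => CreateConnection(pool, target));

    // After: all state flows through the generic TState argument as a tuple, and the
    // static lambda can't capture anything, so no closure is allocated.
    public static void QueueWithThreadPool(object pool, Uri target) =>
        ThreadPool.QueueUserWorkItem(
            static state => CreateConnection(state.pool, state.target),
            (pool, target),
            preferLocal: false);

    private static void CreateConnection(object pool, Uri target)
    {
        // stand-in for the actual connection-establishment work
    }
}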
Folks using HTTP often need to go through a proxy server, and in .NET the ability to go through an HTTP proxy is represented via theIWebProxy
interface; it has three members,GetProxy
for getting theUri
of the proxy to use for a given destinationUri
, theIsBypassed
method which says whether a givenUri
should go through a proxy or not, and then aCredentials
property to be used when accessing the target proxy. The canonical implementation ofIWebProxy
provided in the core libraries is the aptly namedWebProxy
.WebProxy
is fairly simple: you give it a proxyUri
, and then calls toGetProxy
return that proxyUri
if the destination isn’t to be bypassed. Whether aUri
should be bypassed is determined by two things (assuming a non-null proxyUri
was provided): did the constructor of theWebProxy
specify that “local” destinations should be bypassed (and if so, is this destination local), or does this destination address match any of the regular expressions provided. As it turns out, this latter aspect has been relatively slow and allocation-heavy in all previous releases of .NET, for two reasons: every call to check whether an address was bypassed was recreating aRegex
instance for every supplied regular expression, and every call to check whether an address was bypassed was deriving a newstring
from theUri
to use to match against theRegex
. In .NET 7, both of those issues have been fixed, yielding significant improvements if you rely on this regular expression functionality.dotnet/runtime#73803 from@onehourlate changed the handling of the collection of theseRegex
instances. The problem was thatWebProxy
exposes anArrayList
(this type goes back to the beginning of .NET and was created pre-generics), which the consumer could modify, and soWebProxy
had to assume the collection was modified between uses and addressed that by simply creating newRegex
instances on every use; not good. Instead, this PR creates a customArrayList
-derived type that can track all relevant mutations, and then only if the collection is changed (which is incredibly rare, bordering on never) do theRegex
instances need to be recreated. Anddotnet/runtime#73807 takes advantage of stack allocation and theMemoryExtensions.TryWrite
method with string interpolation to format the text into stack memory, avoiding the string allocation. This, combined with the newRegex.IsMatch(ReadOnlySpan<char>)
overload that enables us to match against thatstackalloc
‘d span, makes that aspect of the operation allocation-free as well. Altogether, drastic improvements:
private WebProxy _proxy = new WebProxy("http://doesntexist", BypassOnLocal: false, new[] { @"\.microsoft.com", @"\.dot.net", @"\.bing.com" });private Uri _destination = new Uri("https://docs.microsoft.com/dotnet/api/system.net.webproxy");[Benchmark]public bool IsBypassed() => _proxy.IsBypassed(_destination);
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
IsBypassed | .NET 6.0 | 5,343.2 ns | 1.00 | 7528 B | 1.00 |
IsBypassed | .NET 7.0 | 205.5 ns | 0.04 | – | 0.00 |
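The formatting-into-stack-memory pattern described above looks roughly like this; the exact format string and the fallback behavior here are assumptions for illustration rather than the actual WebProxy code.

using System;
using System.Text.RegularExpressions;

static class BypassMatching
{
    public static bool MatchesAny(Regex[] bypassRules, string scheme, string host, int port)
    {
        // Format the text to match into stack memory with MemoryExtensions.TryWrite...
        Span<char> stackBuffer = stackalloc char[256];
        ReadOnlySpan<char> input;
        if (stackBuffer.TryWrite($"{scheme}://{host}:{port}", out int written))
        {
            input = stackBuffer.Slice(0, written);
        }
        else
        {
            input = $"{scheme}://{host}:{port}"; // rare fallback for very long inputs
        }

        // ...then use the new Regex.IsMatch(ReadOnlySpan<char>) overload, so no string
        // needs to be allocated per check.
        foreach (Regex rule in bypassRules)
        {
            if (rule.IsMatch(input))
            {
                return true;
            }
        }

        return false;
    }
}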
Also related to HTTP,WebUtility
‘sHtmlDecode
method has been improved for .NET 7. The implementation had been manually iterating through each character in the input looking for a '&'
to be unescaped. Any time you see such an open-coded loop looking for one or more specific characters, it’s a red flag thatIndexOf
should be strongly considered.dotnet/runtime#70700 deletes the entire searching function and replaces it withIndexOf
, yielding simpler and much faster code (you can see other improvements to useIndexOf
variants in networking, such asdotnet/runtime#71137, which usedIndexOfAny
inHttpListener
‘sHandleAuthentication
to search a header for certain kinds of whitespace):
private string _encoded = WebUtility.HtmlEncode(""" Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Condimentum vitae sapien pellentesque habitant. Vitae auctor eu augue ut lectus. Augue lacus viverra vitae congue eu. Tempus quam pellentesque nec nam aliquam sem. Urna nec tincidunt praesent semper feugiat nibh sed. Amet tellus cras adipiscing enim eu. Duis ultricies lacus sed turpis tincidunt. Et sollicitudin ac orci phasellus egestas tellus rutrum tellus pellentesque. """);[Benchmark]public string HtmlDecode() => WebUtility.HtmlDecode(_encoded);
Method | Runtime | Mean | Ratio |
---|---|---|---|
HtmlDecode | .NET 6.0 | 245.54 ns | 1.00 |
HtmlDecode | .NET 7.0 | 19.66 ns | 0.08 |
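The general shape of such a change, replacing an open-coded scanning loop with IndexOf, looks like this (illustrative names only):

using System;

static class EntitySearch
{
    // Before: a manual character-by-character scan.
    public static int FindAmpersandOld(ReadOnlySpan<char> value)
    {
        for (int i = 0; i < value.Length; i++)
        {
            if (value[i] == '&')
            {
                return i;
            }
        }
        return -1;
    }

    // After: IndexOf, which is vectorized under the covers.
    public static int FindAmpersandNew(ReadOnlySpan<char> value) => value.IndexOf('&');
}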
There have been a myriad of other performance-related improvements in networking as well, such asdotnet/runtime#67881 which removed the use ofTcpClient
fromFtpWebRequest
;dotnet/runtime#68745 inWebSocket
which removed a parameter from one of the core async methods (and since parameters end up on the state machine, if the async method yields this results in fewer allocated bytes); anddotnet/runtime#70866 anddotnet/runtime#70900, which replaced all remaining use ofMarshal.PtrToStructure
in the core networking code with more efficient marshaling (e.g. just performing casts). WhileMarshal.PtrToStructure
is valuable when custom marshaling directives are used and the runtime needs to be involved in the conversion, it’s also much more heavyweight than just casting, which can be done when the native and managed layouts are bit-for-bit compatible. As with theu8
example earlier, this comparison is hardly fair, but that’s exactly the point:
private IntPtr _mem;[GlobalSetup]public void Setup(){ _mem = Marshal.AllocHGlobal(8); Marshal.StructureToPtr(new SimpleType { Value1 = 42, Value2 = 84 }, _mem, false);}[GlobalCleanup]public void Cleanup() => Marshal.FreeHGlobal(_mem);public struct SimpleType{ public int Value1; public int Value2;}[Benchmark(Baseline = true)]public SimpleType PtrToStructure() => Marshal.PtrToStructure<SimpleType>(_mem);[Benchmark]public unsafe SimpleType Cast() => *(SimpleType*)_mem;
Method | Mean | Ratio |
---|---|---|
PtrToStructure | 26.6593 ns | 1.000 |
Cast | 0.0736 ns | 0.003 |
For folks usingNegotiateStream
,dotnet/runtime#71280 from@filipnavara will also be very welcome (this comes as part of a larger effort, primarily indotnet/runtime#71777 from@filipnavara anddotnet/runtime#70720 from@filipnavara, to expose the newNegotiateAuthentication
class). It removes a significant amount of allocation from a typical NTLM handshake by reusing a buffer rather than reallocating a new buffer for each of multiple phases of the handshake:
private NetworkStream _client, _server;[GlobalSetup]public void Setup(){ using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp); listener.Bind(new IPEndPoint(IPAddress.Loopback, 0)); listener.Listen(1); var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp); client.Connect(listener.LocalEndPoint); Socket server = listener.Accept(); _client = new NetworkStream(client, ownsSocket: true); _server = new NetworkStream(server, ownsSocket: true);}[Benchmark]public async Task Handshake(){ using NegotiateStream client = new NegotiateStream(_client, leaveInnerStreamOpen: true); using NegotiateStream server = new NegotiateStream(_server, leaveInnerStreamOpen: true); await Task.WhenAll(client.AuthenticateAsClientAsync(), server.AuthenticateAsServerAsync());}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
Handshake | .NET 6.0 | 1.905 ms | 1.00 | 240.5 KB | 1.00 |
Handshake | .NET 7.0 | 1.913 ms | 1.00 | 99.28 KB | 0.41 |
JSON
System.Text.Json
was introduced in .NET Core 3.0, and has seen a significant amount of investment in each release since. .NET 7 is no exception. New features in .NET 7 include support forcustomizing contracts,polymorphic serialization,support for required members,support forDateOnly
/TimeOnly
,support forIAsyncEnumerable<T>
andJsonDocument
in source generation, andsupport for configuringMaxDepth
inJsonWriterOptions
. However, there have also been new features specifically focused on performance, and other changes about improving performance of JSON handling in a variety of scenarios.
One of the biggest performance pitfalls we’ve seen developers face withSystem.Text.Json
has to do with how the library caches data. In order to achieve good serialization and deserialization performance when the source generator isn’t used,System.Text.Json
uses reflection emit to generate custom code for reading/writing members of the types being processed. Instead of then having to pay reflection invoke costs on every access, the library incurs a much larger one-time cost per type to perform this code generation, but then all subsequent handling of these types is very fast… assuming the generated code is available for use. These generated handlers need to be stored somewhere, and the location that’s used for storing them is JsonSerializerOptions
. The idea was intended to be that developers would instantiate an options instance once and pass it around to all of their serialization/deserialization calls; thus, state like these generated handlers could be cached on them. And that works well when developers follow the recommended model. But when they don’t, performance falls off a cliff, and hard. Instead of “just” paying for the reflection invoke costs, each use of a newJsonSerializerOptions
ends up re-generating via reflection emit those handlers, skyrocketing the cost of serialization and deserialization. A super simple benchmark makes this obvious:
private JsonSerializerOptions _options = new JsonSerializerOptions();private MyAmazingClass _instance = new MyAmazingClass();[Benchmark(Baseline = true)]public string ImplicitOptions() => JsonSerializer.Serialize(_instance);[Benchmark]public string WithCached() => JsonSerializer.Serialize(_instance, _options);[Benchmark]public string WithoutCached() => JsonSerializer.Serialize(_instance, new JsonSerializerOptions());public class MyAmazingClass{ public int Value { get; set; }}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
ImplicitOptions | .NET 6.0 | 170.3 ns | 1.00 | 200 B | 1.00 |
WithCached | .NET 6.0 | 163.8 ns | 0.96 | 200 B | 1.00 |
WithoutCached | .NET 6.0 | 100,440.6 ns | 592.48 | 7393 B | 36.97 |
In .NET 7, this was fixed indotnet/runtime#64646 (and subsequently tweaked indotnet/runtime#66248) by adding a global cache of the type information separate from the options instances. AJsonSerializerOptions
still has a cache, but when new handlers are generated via reflection emit, those are also cached at the global level (with appropriate removal when no longer used in order to avoid unbounded leaks).
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
ImplicitOptions | .NET 6.0 | 170.3 ns | 1.00 | 200 B | 1.00 |
ImplicitOptions | .NET 7.0 | 166.8 ns | 0.98 | 48 B | 0.24 |
WithCached | .NET 6.0 | 163.8 ns | 0.96 | 200 B | 1.00 |
WithCached | .NET 7.0 | 168.3 ns | 0.99 | 48 B | 0.24 |
WithoutCached | .NET 6.0 | 100,440.6 ns | 592.48 | 7393 B | 36.97 |
WithoutCached | .NET 7.0 | 590.1 ns | 3.47 | 337 B | 1.69 |
As can be seen here, it’s still more expensive to create a newJsonSerializerOptions
instance on each call, and the recommended approach is “don’t do that.” But if someone does do it, in this example they’re only paying roughly 3.5x the cost rather than almost 600x the cost, a huge improvement.dotnet/runtime#61434 also now exposes theJsonSerializerOptions.Default
instance that’s used by default if no options are explicitly provided.
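The recommended usage pattern, then, is simply to create the options once and reuse them everywhere, along these lines (a trivial sketch):

using System.Text.Json;

static class JsonDefaults
{
    // Create the options a single time and reuse them, so the cached type
    // metadata and generated handlers are reused across all calls.
    private static readonly JsonSerializerOptions s_options = new JsonSerializerOptions
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
    };

    public static string Serialize<T>(T value) => JsonSerializer.Serialize(value, s_options);
}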
Another change toJsonSerializer
came indotnet/runtime#72510, which slightly improved the performance of serialization when using the source generator. The source generator emits helpers for performing the serialization/deserialization work, and these are then invoked byJsonSerializer
via delegates (as part of abstracting away all the different implementation strategies for how to get and set members on the types being serialized and deserialized). Previously, these helpers were being emitted as static methods, which in turn meant that the delegates were being created to static methods. Delegates to instance methods are a bit faster to invoke than delegates to static methods, so this PR made a simple few-line change for the source generator to emit these as instance methods instead.
Yet another forJsonSerializer
comes indotnet/runtime#73338, which improves allocation with how it utilizesUtf8JsonWriter
.Utf8JsonWriter
is a class, and every timeJsonSerializer
would write out JSON, it would allocate a newUtf8JsonWriter
instance. In turn,Utf8JsonWriter
needs something to write to, and although the serializer was using anIBufferWriter
implementation that pooled the underlyingbyte[]
instances employed, the implementation ofIBufferWriter
itself is a class thatJsonSerializer
would allocate. A typicalSerialize
call would then end up allocating a few extra objects and an extra couple of hundred bytes just for these helper data structures. To address that, this PR takes advantage of[ThreadStatic]
, which can be put onto static fields to make them per-thread rather than per-process. From whatever thread is performing the (synchronous)Serialize
operation, it then ensures the current thread has aUtf8JsonWriter
andIBufferWriter
instance it can use, and uses them; for the most part this is straightforward, but it needs to ensure that the serialization operation itself doesn’t try to recursively serialize, in which case these objects could end up being used erroneously while already in use. It also needs to make sure that the pooledIBufferWriter
doesn’t hold on to any of itsbyte[]
s while it’s not being used. That instance gets its arrays fromArrayPool<T>
, and we want those arrays to be usable in the meantime by anyone else making use of the pool, not sequestered off in this cachedIBufferWriter
implementation. This optimization is also only really meaningful for small object graphs being serialized, and only applies to the synchronous operations (asynchronous operations would require a more complicated pooling mechanism, since the operation isn’t tied to a specific thread, and the overhead of such complication would likely outweigh the modest gain this optimization provides).
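A simplified sketch of the [ThreadStatic] caching idea might look like the following; the real implementation also rents and returns ArrayPool arrays and guards against reentrant use, neither of which this sketch attempts.

using System.Buffers;
using System.Text.Json;

static class CachedWriter
{
    [ThreadStatic]
    private static ArrayBufferWriter<byte>? t_buffer;
    [ThreadStatic]
    private static Utf8JsonWriter? t_writer;

    public static byte[] SerializeToUtf8<T>(T value, JsonSerializerOptions? options = null)
    {
        // Lazily create one writer and one buffer per thread, then reuse them.
        ArrayBufferWriter<byte> buffer = t_buffer ??= new ArrayBufferWriter<byte>(256);
        Utf8JsonWriter writer = t_writer ??= new Utf8JsonWriter(buffer);

        buffer.Clear();
        writer.Reset(buffer);
        JsonSerializer.Serialize(writer, value, options);
        writer.Flush();

        return buffer.WrittenSpan.ToArray();
    }
}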
private byte[] _data = new byte[] { 1, 2, 3, 4, 5 };[Benchmark]public string SerializeToString() => JsonSerializer.Serialize(_data);
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
SerializeToString | .NET 6.0 | 146.4 ns | 1.00 | 200 B | 1.00 |
SerializeToString | .NET 7.0 | 137.5 ns | 0.94 | 48 B | 0.24 |
Utf8JsonWriter
andUtf8JsonReader
also saw several improvements directly.dotnet/runtime#69580 adds a few new performance-focused members, theValueIsEscaped
property (which exposes already tracked information and enables consumers to avoid the expense of re-checking) and theCopyString
method (which provides a non-allocating mechanism to get access to a string value from the reader). It then also uses the added support internally to speed up certain operations onUtf8JsonReader
. Anddotnet/runtime#63863,dotnet/runtime#71534, anddotnet/runtime#61746 fix how some exception checks and throws were being handled so as to not slow down the non-exceptional fast paths.
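As a small example of the new CopyString method, here is a sketch of copying a JSON string token's value into stack memory instead of allocating with GetString(); the buffer sizing and the skip-if-too-long behavior are just illustrative choices.

using System;
using System.Text.Json;

static class JsonStringReading
{
    public static void VisitStrings(ReadOnlySpan<byte> utf8Json)
    {
        var reader = new Utf8JsonReader(utf8Json);
        Span<char> scratch = stackalloc char[128];

        while (reader.Read())
        {
            // The unescaped UTF-16 form is never longer than the UTF-8 value span.
            if (reader.TokenType == JsonTokenType.String && reader.ValueSpan.Length <= scratch.Length)
            {
                int written = reader.CopyString(scratch);
                ReadOnlySpan<char> value = scratch.Slice(0, written);
                // ... operate on 'value' without having allocated a string
            }
        }
    }
}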
XML
System.Xml is used by a huge number of applications and services, but ever since JSON hit the scene and has been all the rage, XML has taken a back seat and thus hasn’t seen a lot of investment from either a functionality or performance perspective. Thankfully, System.Xml gets a bit of performance love in .NET 7, in particular around reducing allocation on some commonly used code paths.
Sometimes a performance fix is as easy as changing a single number. That’s the case withdotnet/runtime#63459 from@chrisdcmoore, which addresses a long-standing issue with the asynchronous methods on the popularXmlReader
. WhenXmlReader
was originally written, whoever developed it chose a fairly common buffer size to be used for read operations, namely 4K or 8Kchar
s depending on various conditions. WhenXmlReader
later gained asynchronous reading functionality, for whatever reason a much, much larger buffer size of 64Kchar
s was selected (presumably in hopes of minimizing the number of asynchronous operations that would need to be employed, but the actual rationale is lost to history). A key problem with such a buffer size, beyond it leading to a lot of allocation, is that the allocation it produces typically ends up on the Large Object Heap (LOH). By default, under the expectation that really large objects are long-lived, objects greater than 85K bytes are allocated into the LOH, which is treated as part of Gen 2, and that makes such allocation, if not long-lived, even more expensive in terms of overall impact on the system. Well, 64Kchar
s is 128K bytes, which puts it squarely above that threshold. This PR lowers the size from 64Kchar
s to 32Kchar
s, putting it below the threshold (and generally reducing allocation pressure, how much memory needs to be zero’d, etc). While it’s still a very large allocation, and in the future we could look at pooling the buffer or employing a smaller one (e.g. no different from what’s done for the synchronous APIs), this simple one-number change alone makes a substantial difference for shorter input documents (while not perceivably negatively impacting larger ones).
private readonly XmlReaderSettings _settings = new XmlReaderSettings { Async = true };private MemoryStream _stream;[Params(10, 1_000_000)]public int ItemCount;[GlobalSetup]public void Setup(){ _stream = new MemoryStream(); using XmlWriter writer = XmlWriter.Create(_stream); writer.WriteStartElement("Items"); for (var i = 0; i < ItemCount; i++) { writer.WriteStartElement($"Item{i}"); writer.WriteEndElement(); } writer.WriteEndElement();}[Benchmark]public async Task XmlReader_ReadAsync(){ _stream.Position = 0; using XmlReader reader = XmlReader.Create(_stream, _settings); while (await reader.ReadAsync());}
Method | Runtime | ItemCount | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|
XmlReader_ReadAsync | .NET 6.0 | 10 | 42.344 us | 1.00 | 195.94 KB | 1.00 |
XmlReader_ReadAsync | .NET 7.0 | 10 | 9.992 us | 0.23 | 99.94 KB | 0.51 |
XmlReader_ReadAsync | .NET 6.0 | 1000000 | 340,382.953 us | 1.00 | 101790.34 KB | 1.00 |
XmlReader_ReadAsync | .NET 7.0 | 1000000 | 333,417.347 us | 0.98 | 101804.45 KB | 1.00 |
XmlReader and XmlWriter saw other allocation-related improvements as well. dotnet/runtime#60076 from @kronic improved the ReadOnlyTernaryTree internal type that’s used when XmlOutputMethod.Html is specified in the XmlWriterSettings. This included using a ReadOnlySpan<byte> initialized from an RVA static instead of a large byte[] array that would need to be allocated. And dotnet/runtime#60057 from @kronic converted ~400 string creations in the System.Private.Xml assembly to use interpolated strings. Many of these cases were stylistic, converting something like string1 + ":" + string2 into $"{string1}:{string2}"; I say stylistic here because the C# compiler will generate the exact same code for both of those, a call to string.Concat(string1, ":", string2), given that there’s a Concat overload that accepts three strings. However, some of the changes do impact allocation. For example, the private XmlTextWriter.GeneratePrefix method had the code:
return "d" + _top.ToString("d", CultureInfo.InvariantCulture) + "p" + temp.ToString("d", CultureInfo.InvariantCulture);
where_top
andtemp
are bothint
s. This will result in allocating two temporary strings and then concatenating those with the two constant strings. Instead, the PR changed it to:
return string.Create(CultureInfo.InvariantCulture, $"d{_top:d}p{temp:d}");
which while shorter is also more efficient, avoiding the intermediate string allocations, as the custom interpolated string handler used bystring.Create
will format those into a pooled buffer rather than allocating intermediate temporaries.
XmlSerializer
is also quite popular and also gets a (small) allocation reduction, in particular for deserialization.XmlSerializer
has two modes for generating serialization/deserialization routines: using reflection emit to dynamically generate IL at run-time tuned to the specific shape of the types being serialized/deserialized, and the XML Serializer Generator Tool (sgen), which generates a .dll containing the same support, just ahead-of-time (a sort-of precursor to the Roslyn source generators we love today). In both cases, when deserializing, the generated code wants to track which properties of the object being deserialized have already been set, and to do that, it uses a bool[]
as a bit array. Every time an object is deserialized, it allocates abool[]
with enough elements to track every member of the type. But in common usage, the vast majority of types being deserialized only have a relatively small number of properties, which means we can easily use stack memory to track this information rather than heap memory. That’s whatdotnet/runtime#66914 does. It updates both of the code generators tostackalloc
into aSpan<bool>
for less than or equal to 32 values, and otherwise fall back to the old approach of heap-allocating thebool[]
(which can also then be stored into aSpan<bool>
so that the subsequent code paths simply use a span instead of an array). You can see this quite easily in the .NET Object Allocation Tracking tool in Visual Studio. For this console app (which, as an aside, shows how lovely the new raw string literals feature in C# is for working with XML):
using System.Text;using System.Xml.Serialization;var serializer = new XmlSerializer(typeof(Release[]));var stream = new MemoryStream(Encoding.UTF8.GetBytes( """ <?xml version="1.0" encoding="utf-8"?> <ArrayOfRelease xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <Release><Major>1</Major><Minor>0</Minor></Release> <Release><Major>1</Major><Minor>1</Minor></Release> <Release><Major>2</Major><Minor>0</Minor></Release> <Release><Major>2</Major><Minor>1</Minor></Release> <Release><Major>2</Major><Minor>2</Minor></Release> <Release><Major>3</Major><Minor>0</Minor></Release> <Release><Major>3</Major><Minor>1</Minor></Release> <Release><Major>5</Major><Minor>0</Minor></Release> <Release><Major>6</Major><Minor>0</Minor></Release> <Release><Major>7</Major><Minor>0</Minor></Release> </ArrayOfRelease> """));for (int i = 0; i < 1000; i++){ stream.Position = 0; serializer.Deserialize(stream);}public class Release{ public int Major; public int Minor; public int Build; public int Revision;}
Here’s what I see when I run this under .NET 6:
We’re running a thousand deserializations, each of which will deserialize 10Release
instances, and so we expect to see 10,000Release
objects being allocated, which we do… but we also see 10,000bool[]
being allocated. Now with .NET 7 (note the distinct lack of the per-objectbool[]
):
Other allocation reductions went into the creation of the serializer/deserializer itself, such as with dotnet/runtime#68738 avoiding allocating strings to escape text that didn’t actually need escaping, dotnet/runtime#66915 using stack allocation for building up small text instead of using a StringBuilder
,dotnet/runtime#66797 avoiding delegate and closure allocations in accessing the cache of serializers previously created,dotnet/runtime#67001 from@TrayanZapryanov caching an array used withstring.Split
, anddotnet/runtime#67002 from@TrayanZapryanov that changed some parsing code to avoid astring.ToCharArray
invocation.
For folks using XML schema,dotnet/runtime#66908 replaces someHashtable
s in the implementation where those collections were storingint
s as the value. Given thatHashtable
is a non-generic collection, every one of thoseint
s was getting boxed, resulting in unnecessary allocation overhead; these were fixed by replacing theseHashtable
s withDictionary<..., int>
instances. (As an aside, this is a fairly common performance-focused replacement to do, but you need to be careful asHashtable
has a few behavioral differences fromDictionary<,>
; beyond the obvious difference ofHashtable
returningnull
from its indexer when a key isn’t found andDictionary<,>
throwing in that same condition,Hashtable
is thread-safe for use with not only multiple readers but multiple readers concurrent with a single writer, andDictionary<,>
is not.)dotnet/runtime#67045 reduces allocation ofXmlQualifiedName
instances in the implementation ofXsdBuilder.ProcessElement
andXsdBuilder.ProcessAttribute
. Anddotnet/runtime#64868 from@TrayanZapryanov uses stack-based memory and pooling to avoid temporary string allocation in the implementation of the internalXsdDateTime
andXsdDuration
types, which are used by the publicXmlConvert
.
private TimeSpan _ts = TimeSpan.FromMilliseconds(12345);[Benchmark]public string XmlConvertToString() => XmlConvert.ToString(_ts);
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
XmlConvertToString | .NET 6.0 | 90.70 ns | 1.00 | 184 B | 1.00 |
XmlConvertToString | .NET 7.0 | 59.21 ns | 0.65 | 40 B | 0.22 |
XML pops up in other areas as well, as in the XmlWriterTraceListener type. While the System.Diagnostics.Trace type isn’t the recommended tracing mechanism for new code, it’s widely used in existing applications, and XmlWriterTraceListener lets you plug in to that mechanism to write out XML logs for traced information. dotnet/runtime#66762 avoids a bunch of string allocation occurring as part of this tracing, by formatting much of the header information into a span and then writing that out rather than ToString()’ing each individual piece of data.
[GlobalSetup]public void Setup(){ Trace.Listeners.Clear(); Trace.Listeners.Add(new XmlWriterTraceListener(Stream.Null));}[Benchmark]public void TraceWrite(){ Trace.WriteLine("Something important");}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
TraceWrite | .NET 6.0 | 961.9 ns | 1.00 | 288 B | 1.00 |
TraceWrite | .NET 7.0 | 772.2 ns | 0.80 | 64 B | 0.22 |
Cryptography
Some fairly significant new features came toSystem.Security.Cryptography
in .NET 7, including the support necessary to enable the previously discussed OCSP stapling and support forbuilding certificate revocation lists, but there was also a fair amount of effort put into making existing support faster and more lightweight.
One fairly substantial change in .NET 7 is split acrossdotnet/runtime#61025,dotnet/runtime#61137, anddotnet/runtime#64307. These PRs don’t change any code materially, but instead consolidate all of the various cryptography-related assemblies in the core libraries into a singleSystem.Security.Cryptography
assembly. When .NET Core was first envisioned, a goal was to make it extremely modular, and large swaths of code were teased apart to create many smaller assemblies. For example, cryptographic functionality was split betweenSystem.Security.Cryptography.Algorithms.dll
,System.Security.Cryptography.Cng.dll
,System.Security.Cryptography.Csp.dll
,System.Security.Cryptography.Encoding.dll
,System.Security.Cryptography.OpenSsl.dll
,System.Security.Cryptography.Primitives.dll
, andSystem.Security.Cryptography.X509Certificates.dll
. You can see this if you look in your shared framework folder for a previous release, e.g. here’s mine for .NET 6:
These PRs move all of that code into a single System.Security.Cryptography.dll assembly. This has several benefits. First, crypto is used in a huge number of applications and services, and most apps would end up requiring multiple (or even most) of these assemblies; every assembly that’s loaded adds overhead. Second, a variety of helper files had to be compiled into each assembly, leading to an overall larger amount of compiled code to be distributed. And third, we weren’t able to implement everything as optimally as we’d have otherwise liked due to functionality in one assembly not being exposed to another (and we avoid using InternalsVisibleTo, as it hampers maintainability and impedes other analysis and optimizations). Now in .NET 7, the shared framework looks more like this:
Interestingly, you still see a bunch of assemblies there, but all except for System.Security.Cryptography.dll are tiny; that’s because these are simple facades. Because we need to support binaries built for .NET 6 and earlier running on .NET 7, we need to be able to handle binaries that refer to types in these assemblies, but in .NET 7, those types actually live in System.Security.Cryptography.dll. .NET provides a solution for this in the form of the [TypeForwardedTo(...)] attribute, which enables one assembly to say “hey, if you’re looking for type X, it now lives over there.” And if you crack open one of these assemblies in a tool like ILSpy, you can see they’re essentially empty except for a bunch of these attributes:
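In source form, such a facade amounts to little more than a list of assembly-level attributes along these lines (the specific types shown are just examples):

using System.Runtime.CompilerServices;

// A facade like System.Security.Cryptography.Algorithms.dll boils down to
// assembly-level attributes redirecting each public type to its new home.
[assembly: TypeForwardedTo(typeof(System.Security.Cryptography.Aes))]
[assembly: TypeForwardedTo(typeof(System.Security.Cryptography.SHA256))]
[assembly: TypeForwardedTo(typeof(System.Security.Cryptography.RandomNumberGenerator))]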
In addition to the startup and maintenance wins this provides, this has also enabled further subsequent optimization. For example, there’s a lot of object cloning that goes on in the innards of this library. Various objects are used to wrap native handles to OS cryptographic resources, and to handle lifetime semantics and ownership appropriately, there are many cases where a native handle is duplicated and then wrapped in one or more new managed objects. In some cases, however, the original resource is then destroyed because it’s no longer needed, and the whole operation could have been made more efficient if the original resource just had its ownership transferred to the new objects rather than being duplicated and destroyed. This kind of ownership transfer typically is hard to do between assemblies as it generally requires public API that’s not focused on such usage patterns, but with internals access, this can be overcome.dotnet/runtime#72120 does this, for example, to reduce allocation of various resources inside theRSACng
,DSACng
,ECDsaCng
, andECDiffieHellmanCng
public types.
In terms of actual code improvements, there are many. One category of improvements is around “one-shot” operations. With many forms of data processing, all of the data needn’t be processed in one operation. A block of data can be processed, then another, then another, until finally there’s no more data to be processed. In such usage, there’s often some kind of state carried over from the processing of one block to the processing of the next, and then the processing of the last block is special, as it needn’t carry over anything and instead needs to perform whatever work is required to end the whole operation, e.g. outputting any final footer or checksum that might be required as part of the format. Thus, APIs that are able to handle an arbitrary number of blocks of data are often a bit more expensive in one way, shape, or form than APIs that only support a single input; this latter category is known as “one-shot” operations, because they do everything in “one shot.” In some cases, one-shot operations can be significantly cheaper, and in other cases they merely avoid some allocations that would have been necessary to transfer state from the processing of one block of data to the next. dotnet/runtime#58270 from @vcsjones and dotnet/runtime#65725 from @vcsjones both improved the performance of various one-shot operations on “symmetric” cryptographic algorithms (algorithms that use the same key information to both encrypt and decrypt), like AES. The former does so by refactoring the implementations to avoid some reset work that’s not necessary in the case of one-shots because the relevant state is about to go away anyway, and that in turn also allows the implementation to store less of certain kinds of state. The latter does so for decryption one-shots by decrypting directly into the destination buffer whenever possible, using stack space if possible when going directly into the user’s buffer isn’t feasible, etc.
private byte[] _plaintext = Encoding.UTF8.GetBytes("This is a test. This is only a test. Nothing to see here.");private byte[] _iv = Enumerable.Range(0, 16).Select(i => (byte)i).ToArray();private Aes _aes = Aes.Create();private byte[] _output = new byte[1000];[Benchmark]public bool OneShot() => _aes.TryEncryptCfb(_plaintext, _iv, _output, out _);
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
OneShot | .NET 6.0 | 1.828 us | 1.00 | 336 B | 1.00 |
OneShot | .NET 7.0 | 1.770 us | 0.97 | 184 B | 0.55 |
In addition to making one-shots lighter weight, other PRs have then used these one-shot operations in more places in order to simplify their code and benefit from the increased performance, e.g. dotnet/runtime#70639 from @vcsjones, dotnet/runtime#70857 from @vcsjones, dotnet/runtime#64005 from @vcsjones, and dotnet/runtime#64174 from @vcsjones.
There’s also a large number of PRs that have focused on removing allocations from around the crypto stack:
- Stack allocation. As has been seen in many other PRs referenced throughout this post, using
stackalloc
is a very effective way to get rid of array allocations in many situations. It’s used effectively in multiple crypto PRs to avoid either temporary or pooled array allocations, such as in dotnet/runtime#64584 from @vcsjones, dotnet/runtime#69831 from @vcsjones, dotnet/runtime#70173 from @vcsjones, dotnet/runtime#69812 from @vcsjones, and dotnet/runtime#69448 from @vcsjones. Sometimes this is used when calling an API that has multiple overloads, including one taking an array and one taking a span. Other times it’s used with P/Invokes that often just pass out a small amount of data. Sometimes it’s used to avoid temporary array allocations, and sometimes it’s used in places where pooling was used previously, but the data is often small enough to avoid even the overheads of pooling. - Avoiding double copies. Most of the crypto APIs that accept
byte[]
s and store them end up making defensive copies of those arrays rather than storing the original. This is fairly common throughout .NET, but it’s especially common in the crypto stack, where being able to trust that the data is as you expect (and to validate it) is paramount. In some cases, though, code ends up allocating a temporary byte[]
just to pass data into one of these APIs that copies and re-allocates, anyway.dotnet/runtime#71102 from@vcsjones,dotnet/runtime#69024 from@vcsjones,dotnet/runtime#71015 from@vcsjones, anddotnet/runtime#69534 from@vcsjones deal with that duplication in some cases by extracting a span to the original data instead of creating a temporarybyte[]
; when that span is passed into the target API, the target API still makes a copy, but we’ve avoided the first one and thus cut the array allocation for these operations effectively in half.dotnet/runtime#71888 from@vcsjones is a variation on this theme, improving the internals ofRfc2898DeriveBytes
to support spans such that its constructors that accept spans can then do the more efficient thing. - Replacing
O(1)
data structures.O(1)
lookup data structures likeDictionary<,>
andHashSet<>
are the lifeblood of most applications and services, but sometimes algorithmic complexity is misleading. Yes, these provide very efficient searching, but there’s still overhead associated with computing a hash code, mapping that hash code to a location in the data structure, and so on. If there’s only ever a handful of items (i.e. theN
in the complexity is really, really small), it can be much faster to just do a linear search, and ifN
is sufficiently small, a data structure may not even be needed at all: the search can just be open-coded as a waterfall of if/elseif/else constructs. That’s the case in a PR like dotnet/runtime#71341 from @vcsjones, where the 99.999% case involves just five strings (names of hash algorithms); it’s cheaper to just compare against each than it is to do a HashSet<>.Contains
, especially since the JIT now unrolls and vectorizes the comparison against the constant string names. - Simply avoiding unnecessary work. The best optimizations are ones where you simply stop doing work you don’t have to do.dotnet/runtime#68553 from@vcsjones is a good example of this. This code was performing a hash of some data in order to determine the length of resulting hashes for that particular configuration, but we actually know ahead of time exactly how long a hash for a given algorithm is going to be, and we already have in this code a cascading if/elseif/else that’s checking for each known algorithm, so we can instead just hardcode the length for each.dotnet/runtime#70589 from@vcsjones is another good example, in the same spirit of the ownership transfer example mentioned earlier (but this one didn’t previously span assembly boundaries). Rather than in several places taking an
X509Extension
, serializing it to abyte[]
, and passing that temporarybyte[]
to something else that in turn makes a defensive copy, we can instead provide an internal pathway for ownership transfer, bypassing all of the middle stages. Another good one isdotnet/runtime#70618 from@vcsjones, as it’s an example of how it pays to really understand your dependencies. The implementation of symmetric encryption on macOS uses the CommonCrypto library. One of the functions it exposes isCCCryptorFinal
, which is used at the end of the encryption/decryption process. However, there are several cases called out in the docs where it’s unnecessary (“superfluous,” according to the docs), and so our dutifully calling it even in those situations is wasteful. The fix? Stop doing unnecessary work. - New APIs. A bunch of new APIs were introduced for cryptography in .NET 7. Most are focused on easing scenarios that were difficult to do correctly before, likedotnet/runtime#66509 from@vcsjones that provides an
X500DistinguishedNameBuilder
. But some are focused squarely on performance.dotnet/runtime#57835 from@vcsjones, for example, exposes a newRawDataMemory
property onX509Certificate2
. Whereas the existingRawData
property returns a newbyte[]
on every call (again a defensive copy to avoid having to deal with the possibility that the consumer mucked with the raw data), this new RawDataMemory
returns aReadOnlyMemory<byte>
around the internalbyte[]
. Since the only way to access and mutate that underlyingbyte[]
via aReadOnlyMemory<byte>
is via unsafe interop code (namely via theSystem.Runtime.InteropServices.MemoryMarshal
type), it doesn’t create a defensive copy and enables accessing this data freely without additional allocation.
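To illustrate that last point, here’s roughly how the two properties compare in use (the certificate load path is illustrative):

using System.Security.Cryptography.X509Certificates;

X509Certificate2 cert = new X509Certificate2("cert.cer"); // hypothetical file path

byte[] copyEveryCall = cert.RawData;              // allocates a fresh byte[] defensive copy on each access
ReadOnlyMemory<byte> noCopy = cert.RawDataMemory; // wraps the internal bytes, no allocation
ReadOnlySpan<byte> bytes = noCopy.Span;           // read the raw certificate bytes freely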
Diagnostics
Let’s turn our attention to System.Diagnostics, which encompasses types ranging from process management to tracing.
TheProcess
class is used for a variety of purposes, including querying information about running processes, interacting with other processes (e.g. being notified of their exiting), and launching processes. The performance of querying for information in particular had some notable improvements in .NET 7.Process
provides several APIs for querying for process information, one of the most common beingProcess.GetProcessesByName
: apps that know the name of the process they’re interested in can pass that toGetProcessesByName
and get back aProcess[]
containing aProcess
for each. It turns out that previous releases of .NET were loading the full information (e.g. all of its threads) about everyProcess
on the machine in order to filter down to just those with the target name.dotnet/runtime#68705 fixes that by only loading the name for a process rather than all of the information for it. While this helps a bit with throughput, it helps a ton with allocation:
[Benchmark]public void GetProcessesByName(){ foreach (Process p in Process.GetProcessesByName("dotnet.exe")) p.Dispose();}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
GetProcessesByName | .NET 6.0 | 2.287 ms | 1.00 | 447.86 KB | 1.000 |
GetProcessesByName | .NET 7.0 | 2.086 ms | 0.90 | 2.14 KB | 0.005 |
Accessing various pieces of information from aProcess
has also improved. If you load aProcess
object via theProcess.GetProcesses
orProcess.GetProcessesByName
methods, by design they load all information about theProcess
being retrieved; internally their state will be populated such that subsequent accesses to members of theProcess
instance will be very fast. But, if you access aProcess
viaProcess.GetProcessById
orProcess.GetCurrentProcess
(which is effectivelyGetProcessById
for the current process’ id), no information other than the process’ ID is prepopulated, and the state for theProcess
instance is queried on-demand. In most cases, accessing a single member of one of those lazy-loadedProcess
instances triggers loading all of the data for it, as the information is all available as part of the same native operation, e.g. on Windows usingNtQuerySystemInformation
and on Linux reading from/proc/pid/stat
and/proc/pid/status
. But in some cases we can be more fine-grained about it, using APIs that serve up a subset of the data much more quickly.dotnet/runtime#59672 from@SteveDunn provides one such optimization, using theQueryFullProcessImageName
on Windows to read the process name in response toProcess.ProcessName
being used. If all you care about reading is the process’ name, it’s a huge boost in throughput, and even if you subsequently go on to read additional state from theProcess
and force it to load everything else, accessing the process name is so fast that it doesn’t add meaningful overhead to the all-up operation. This is visible in this benchmark:
[Benchmark]public string GetCurrentProcessName(){ using Process current = Process.GetCurrentProcess(); return current.ProcessName;}[Benchmark]public string GetCurrentProcessNameAndWorkingSet(){ using Process current = Process.GetCurrentProcess(); return $"{current.ProcessName} {current.WorkingSet64}";}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
GetCurrentProcessName | .NET 6.0 | 3,070.54 us | 1.00 | 3954 B | 1.00 |
GetCurrentProcessName | .NET 7.0 | 32.30 us | 0.01 | 456 B | 0.12 |
GetCurrentProcessNameAndWorkingSet | .NET 6.0 | 3,055.70 us | 1.00 | 4010 B | 1.00 |
GetCurrentProcessNameAndWorkingSet | .NET 7.0 | 3,149.92 us | 1.03 | 4186 B | 1.04 |
Interestingly, this PR had a small deficiency we didn’t initially catch, which is that theQueryFullProcessImageName
API we switched to didn’t work in the case of elevated/privileged processes. To accommodate those, dotnet/runtime#70073 from @schuettecarsten updated the code to keep both the new and old implementations, starting with the new one and then only falling back to the old if operating on an incompatible process.
Several additional PRs helped out theProcess
class. When launching processes withProcess.Start
on Unix, the implementation was usingEncoding.UTF8.GetBytes
as part of argument handling, resulting in a temporary array being allocated per argument;dotnet/runtime#71279 removes that per-argument allocation, instead usingEncoding.UTF8.GetByteCount
to determine how large a space is needed and then using theEncoding.UTF8.GetBytes
overload that accepts a span to encode directly into the native memory already being allocated.dotnet/runtime#71136 simplifies and streamlines the code involved in getting the “short name” of a process on Windows for use in comparing process names. Anddotnet/runtime#45690 replaces a custom cache with use ofArrayPool
in the Windows implementation of getting all process information, enabling effective reuse of the array that ends up being used rather than having it sequestered off in theProcess
implementation forever.
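The count-then-encode-into-a-span pattern that argument-handling change uses looks roughly like this (a simplified sketch, with a pooled array standing in for the natively allocated argv memory the real code writes into):

using System.Buffers;
using System.Text;

static void EncodeArgument(string arg)
{
    int byteCount = Encoding.UTF8.GetByteCount(arg);          // measure first
    byte[] rented = ArrayPool<byte>.Shared.Rent(byteCount + 1);
    try
    {
        // Encode directly into memory we already own; no temporary byte[] per argument.
        int written = Encoding.UTF8.GetBytes(arg.AsSpan(), rented.AsSpan());
        rented[written] = 0; // null-terminate, as native exec-style APIs expect
        // ... hand the encoded bytes off to the native call ...
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(rented);
    }
}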
Another area of performance investment has been inDiagnosticSource
, and in particular around enumerating through data fromActivity
instances. This work translates into faster integration and interoperability viaOpenTelemetry, in order to be able to export data from .NETActivity
information faster.dotnet/runtime#67012 from@CodeBlanch, for example, improved the performance of the internalDiagLinkedList<T>.DiagEnumerator
type that’s the enumerator returned when enumeratingActivity.Links
andActivity.Events
by avoiding a copy of eachT
value:
private readonly Activity _activity;public Program(){ using ActivitySource activitySource = new ActivitySource("Perf7Source"); ActivitySource.AddActivityListener(new ActivityListener { ShouldListenTo = s => s == activitySource, Sample = (ref ActivityCreationOptions<ActivityContext> o) => ActivitySamplingResult.AllDataAndRecorded }); _activity = activitySource.StartActivity( "TestActivity", ActivityKind.Internal, parentContext: default, links: Enumerable.Range(0, 1024).Select(_ => new ActivityLink(default)).ToArray()); _activity.Stop();}[Benchmark(Baseline = true)]public ActivityLink EnumerateActivityLinks(){ ActivityLink last = default; foreach (ActivityLink link in _activity.Links) last = link; return last;}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
EnumerateActivityLinks | .NET 6.0 | 19.62 us | 1.00 | 64 B | 1.00 |
EnumerateActivityLinks | .NET 7.0 | 13.72 us | 0.70 | 32 B | 0.50 |
Thendotnet/runtime#67920 from@CodeBlanch anddotnet/runtime#68933 from@CodeBlanch added newEnumerateTagObjects
,EnumerateEvents
, andEnumerateLinks
enumeration methods that return a struct-based enumerator that has aref T
-returningCurrent
to avoid yet another layer of copy.
private readonly Activity _activity;public Program(){ using ActivitySource activitySource = new ActivitySource("Perf7Source"); ActivitySource.AddActivityListener(new ActivityListener { ShouldListenTo = s => s == activitySource, Sample = (ref ActivityCreationOptions<ActivityContext> o) => ActivitySamplingResult.AllDataAndRecorded }); _activity = activitySource.StartActivity( "TestActivity", ActivityKind.Internal, parentContext: default, links: Enumerable.Range(0, 1024).Select(_ => new ActivityLink(default)).ToArray()); _activity.Stop();}[Benchmark(Baseline = true)]public ActivityLink EnumerateActivityLinks_Old(){ ActivityLink last = default; foreach (ActivityLink link in _activity.Links) last = link; return last;}[Benchmark]public ActivityLink EnumerateActivityLinks_New(){ ActivityLink last = default; foreach (ActivityLink link in _activity.EnumerateLinks()) last = link; return last;}
Method | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|
EnumerateActivityLinks_Old | 13.655 us | 1.00 | 32 B | 1.00 |
EnumerateActivityLinks_New | 2.380 us | 0.17 | – | 0.00 |
Of course, when it comes to diagnostics, anyone who’s ever done anything with regards to timing and measurements is likely familiar with good ol’Stopwatch
.Stopwatch
is a simple type that’s very handy for getting precise measurements and is thus used all over the place. But for folks that are really cost-sensitive, the fact thatStopwatch
is a class can be prohibitive, e.g. writing:
Stopwatch sw = Stopwatch.StartNew();...;TimeSpan elapsed = sw.Elapsed;
is easy, but allocates a new object just to measure. To address this,Stopwatch
has for years exposed the staticGetTimestamp()
method which avoids that allocation, but consuming and translating the resultinglong
value is complicated, requiring a formula involving usingStopwatch.Frequency
andTimeSpan.TicksPerSecond
in the right incantation. To make this pattern easy,dotnet/runtime#66372 adds a staticGetElapsedTime
method that handles that conversion, such that someone who wants that last mile of performance can write:
long timestamp = Stopwatch.GetTimestamp();...TimeSpan elapsed = Stopwatch.GetElapsedTime(timestamp);
which avoids the allocation and saves a few cycles:
[Benchmark(Baseline = true)]public TimeSpan Old(){ Stopwatch sw = Stopwatch.StartNew(); return sw.Elapsed;}[Benchmark]public TimeSpan New(){ long timestamp = Stopwatch.GetTimestamp(); return Stopwatch.GetElapsedTime(timestamp);}
Method | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|
Old | 32.90 ns | 1.00 | 40 B | 1.00 |
New | 26.30 ns | 0.80 | – | 0.00 |
Exceptions
It might be odd to see the subject of “exceptions” in a post on performance improvements. After all, exceptions are by their very nature meant to be “exceptional” (in the “rare” sense), and thus wouldn’t typically contribute to fast-path performance. Which is a good thing, because fast-paths that throw exceptions in the common case are no longer fast: throwing exceptions is quite expensive.
Instead, one of the things wedo concern ourselves with is how to minimize the impact of checking for exceptional conditions: the actual exception throwing may be unexpected and slow, but it’s super common to need to check for those unexpected conditions, and that checking should be very fast. We also want such checking to minimally impact binary size, especially if we’re going to have many such checks all over the place, in generic code for which we end up with many copies due to generic specialization, in functions that might be inlined, and so on. Further, we don’t want such checks to impede other optimizations; for example, if I have a small function that wants to do some argument validation and would otherwise be inlineable, I likely don’t want the presence of exception throwing to invalidate the possibility of inlining.
Because of all of that, high-performance libraries often come up with custom “throw helpers” they use to achieve their goals. There are a variety of patterns for this. Sometimes a library will just define its own static method that handles constructing and throwing an exception, and then call sites do the condition check and delegate to the method if throwing is needed:
if (arg is null) ThrowArgumentNullException(nameof(arg));...[DoesNotReturn]private static void ThrowArgumentNullException(string arg) => throw new ArgumentNullException(arg);
This keeps the IL associated with the throwing out of the calling function, minimizing the impact of the throw. That’s particularly valuable when additional work is needed to construct the exception, e.g.
private static void ThrowArgumentNullException(string arg) => throw new ArgumentNullException(arg, SR.SomeResourceMessage);
Other times, libraries will encapsulate both the checking and throwing. This is exactly what the ArgumentNullException.ThrowIfNull method that was added in .NET 6 does:
public static void ThrowIfNull([NotNull] object? argument, [CallerArgumentExpression("argument")] string? paramName = null){ if (argument is null) Throw(paramName);}[DoesNotReturn]internal static void Throw(string? paramName) => throw new ArgumentNullException(paramName);
With that, callers benefit from the concise call site:
public void M(string arg){ ArgumentNullException.ThrowIfNull(arg); ...}
the IL remains concise, and the assembly code the JIT generates will include the streamlined condition check from the inlined ThrowIfNull but won’t inline the Throw helper, resulting in effectively the same code as if you’d written the previously shown manual version with ThrowArgumentNullException yourself. Nice.
Whenever we introduce new public APIs in .NET, I’m particularly keen on seeing them used as widely as possible. Doing so serves multiple purposes, including helping to validate that the new API is usable and fully addresses the intended scenarios, and including the rest of the codebase benefiting from whatever that API is meant to provide, whether it be a performance improvement or just a reduction in routinely written code. In the case ofArgumentNullException.ThrowIfNull
, however, I purposefully put on the brakes. We used it in .NET 6 in several dozen call sites, but primarily just in place of customThrowIfNull
-like helpers that had sprung up in various libraries around the runtime, effectively deduplicating them. What we didn’t do, however, was replace the literally thousands of null checks we have with calls toArgumentNullException.ThrowIfNull
. Why? Because the new!!
C# feature was right around the corner, destined for C# 11.
For those unaware, the!!
feature enabled putting!!
onto parameter names in member signatures, e.g.
public void Process(string name!!){ ...}
The C# compiler then compiled that as equivalent to:
public void Process(string name){ ArgumentNullException.ThrowIfNull(name);}
(albeit using its ownThrowIfNull
helper injected as internal into the assembly). Armed with the new feature,dotnet/runtime#64720 anddotnet/runtime#65108 rolled out use of!!
acrossdotnet/runtime, replacing ~25,000 lines of code with ~5000 lines that used!!
. But, what’s the line from Kung Fu Panda, “One often meets his destiny on the road he takes to avoid it”? The presence of that initial PR kicked off an unprecedented debate about the!!
feature, with many folks liking the concept but a myriad of different opinions about exactly how it should be exposed, and in the end, the only common ground was to cut the feature. In response,dotnet/runtime#68178 undid all usage of!!
, replacing most of it withArgumentNullException.ThrowIfNull
. There are now ~5000 uses ofArgumentNullException.ThrowIfNull
acrossdotnet/runtime, making it one of our most popular APIs internally. Interestingly, while we expected a peanut-buttery effect of slight perf improvements in many places, our performance auto-analysis system flagged several performance improvements (e.g.dotnet/perf-autofiling-issues#3531) as stemming from these changes, in particular because it enabled the JIT’s inlining heuristics to flag more methods for inlining.
With the success ofArgumentNullException.ThrowIfNull
and along with its significant roll-out in .NET 7, .NET 7 also sees the introduction of several more such throw helpers.dotnet/runtime#61633, for example, adds an overload ofArgumentNullException.ThrowIfNull
that works with pointers.dotnet/runtime#64357 adds the newArgumentException.ThrowIfNullOrEmpty
helper as well as using it in several hundred places. Anddotnet/runtime#58684 from@Bibletoon adds the newObjectDisposedException.ThrowIf
helper (tweaked bydotnet/runtime#71544 to help ensure it’s inlineable), which is then used at over a hundred additional call sites bydotnet/runtime#71546.
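Used together, these helpers keep each validation down to a single line at the call site; for example (the member and fields shown here are illustrative):

private bool _disposed;

public void Write(string name, Stream destination)
{
    ArgumentException.ThrowIfNullOrEmpty(name);       // new in .NET 7; throws for null or ""
    ArgumentNullException.ThrowIfNull(destination);
    ObjectDisposedException.ThrowIf(_disposed, this); // new in .NET 7; throws if already disposed
    // ... actual work ...
}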
Registry
On Windows, the Registry is a database provided by the OS for applications and the system itself to load and store configuration settings. Practically every application accesses the registry. I just tried a simple console app:
Console.WriteLine("Hello, world");
built it as release, and then ran the resulting .exe. That execution alone triggered 64RegQueryValue
operations (as visible via SysInternals’Process Monitor tool). The core .NET libraries even access the registry for a variety of purposes, such as for gathering data forTimeZoneInfo
, gathering data for various calendars likeHijriCalendar
andJapaneseCalendar
, or for serving up environment variables as part ofEnvironment.GetEnvironmentVariable(EnvironmentVariableTarget)
withEnvironmentVariableTarget.User
orEnvironmentVariableTarget.Machine
.
It’s thus beneficial to streamline access to registry data on Windows, in particular for reducing overheads in startup paths where the registry is frequently accessed.dotnet/runtime#66918 does just that. Previously, callingRegistryKey.GetValue
would make a call toRegQueryValueEx
with a null buffer; this tells theRegQueryValueEx
method that the caller wants to know how big a buffer is required in order to store the value for the key. The implementation would then allocate a buffer of the appropriate size and callRegQueryValueEx
again, and for values that are to be returned as strings, would then allocate a string based on the data in that buffer. This PR instead recognizes that the vast majority of data returned from calls to the registry is relatively small. It starts with astackalloc
‘d buffer of 512 bytes, and uses that buffer as part of the initial call toRegQueryValueEx
. If the buffer was sufficiently large, we no longer have to make a second system call to retrieve the actual data: we already got it. If the buffer was too small, we rent anArrayPool
buffer of sufficient size and use that pooled buffer for the subsequentRegQueryValueEx
call. Except in situations where we actually need to return abyte[]
array to the caller (e.g. the type of the key isREG_BINARY
), this avoids the need for the allocatedbyte[]
. And for keys that return strings (e.g. the type of the key isREG_SZ
), previously the old implementation would have allocated a temporarychar[]
to use as the buffer passed toRegQueryValueEx
, but we can instead just reinterpret cast (e.g.MemoryMarshal.Cast
) the original buffer (whether astackalloc
‘d span or the rented buffer as aSpan<char>
), and use that to construct the resulting string.
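In sketch form (illustrative only, not the actual RegistryKey.GetValue code; the Query local function stands in for the RegQueryValueEx P/Invoke), the pattern looks something like this:

using System.Buffers;
using System.Runtime.InteropServices;

static string ReadRegistryStringValue()
{
    Span<byte> buffer = stackalloc byte[512]; // optimistic first attempt on the stack
    byte[]? rented = null;
    try
    {
        int size = Query(buffer);
        if (size > buffer.Length)
        {
            // Too small: rent a sufficiently large buffer and retry once.
            rented = ArrayPool<byte>.Shared.Rent(size);
            buffer = rented;
            size = Query(buffer);
        }

        // For REG_SZ values, reinterpret the bytes as UTF-16 chars rather than allocating a char[].
        ReadOnlySpan<char> chars = MemoryMarshal.Cast<byte, char>(buffer.Slice(0, size));
        return new string(chars.TrimEnd('\0'));
    }
    finally
    {
        if (rented is not null)
            ArrayPool<byte>.Shared.Return(rented);
    }

    // Stand-in for the RegQueryValueEx call: fills the buffer when it's large enough and
    // always reports the size in bytes that the value requires.
    static int Query(Span<byte> destination) => throw new NotImplementedException();
}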
private static readonly RegistryKey s_netFramework = Registry.LocalMachine.OpenSubKey(@"SOFTWARE\Microsoft\.NETFramework");[Benchmark] public string RegSz() => (string)s_netFramework.GetValue("InstallRoot");
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
RegSz | .NET 6.0 | 6.266 us | 1.00 | 200 B | 1.00 |
RegSz | .NET 7.0 | 3.182 us | 0.51 | 96 B | 0.48 |
Analyzers
The ability to easily plug custom code, whether for analyzers or source generators, into the Roslyn compiler is one of my favorite features in all of C#. It means the developers working on C# don’t need to be solely responsible for highlighting every possible thing you might want to diagnose in your code. Instead, library authors can write their own analyzers, ship them either in dedicated nuget packages or as side-by-side in nuget packages with APIs, and those analyzers augment the compiler’s own analysis to help developers write better code. We ship a large number of analyzer rules in the .NET SDK, many of which are focused on performance, and we augment that set with more and more analyzers every release. We also work to apply more and more of those rules against our own codebases in every release. .NET 7 is no exception.
One of my favorite new analyzers was added indotnet/roslyn-analyzers#5594 from@NewellClark (and tweaked indotnet/roslyn-analyzers#5972). In my.NET 6 performance post, I talked about some of the overheads possible when types aren’t sealed:
- Virtual calls are more expensive than regular non-virtual invocation and generally can’t be inlined, since the JIT doesn’t know what is the actual type of the instance and thus the actual target of the invocation (at least not without assistance from PGO). But if the JIT can see that a virtual method is being invoked on a sealed type, it can devirtualize the call and potentially even inline it.
- If a type check (e.g. something is SomeType) is performed where SomeType is sealed, that check can be implemented along the lines of something is not null && something.GetType() == typeof(SomeType). In contrast, if SomeType is not sealed, the check is going to be more along the lines of CastHelpers.IsInstanceOfClass(typeof(SomeType), something), where IsInstanceOfClass is a non-trivial (and today non-inlined) call into a JIT helper method in Corelib that not only checks for null and for direct equality with the specified type, but also linearly walks the parent hierarchy of the type of the object being tested to see if it might derive from the specified type (a small benchmark sketch follows this list). - Arrays in .NET are covariant, which means if types
B
andC
both derive from typeA
, you can have a variable typed asA[]
that’s storing aB[]
. SinceC
derives fromA
, it’s valid to treat aC
as anA
, but if theA[]
is actually aB[]
, storing aC
into that array would mean storing aC
into aB[]
, which is invalid. Thus, every time you store an object reference into an array of reference types, additional validation may need to be performed to ensure the reference being written is compatible with the concrete type of the array in question. But, ifA
in this example were sealed, nothing could derive from it, so storing objects into it doesn’t require such covariance checks. - Spans shift this covariance check to their constructor; rather than performing the covariance check on every write into the array, the check is performed when a span is being constructed from an array, such that if you try to create a
new Span<A>(bArray)
, the ctor will throw an exception. IfA
is sealed, the JIT is able to elide such a check as well.
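To see that type-check difference for yourself, a benchmark along these lines is all it takes (my own illustration in the same BenchmarkDotNet style as the other snippets in this post; I’m deliberately not quoting numbers, as they vary by machine and JIT):

private object _sealedInstance = new SealedType();
private object _unsealedInstance = new UnsealedType();

[Benchmark(Baseline = true)]
public bool CheckUnsealed() => _unsealedInstance is UnsealedType; // JIT helper call, walks type hierarchy

[Benchmark]
public bool CheckSealed() => _sealedInstance is SealedType;       // simple null check plus exact type compare

public class UnsealedType { }
public sealed class SealedType { }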
It effectively would be impossible for an analyzer to be able to safely recommend sealing public types. After all, it has no knowledge of the type’s purpose, how it’s intended to be used, and whether anyone outside of the assembly containing the type actually derives from it. But internal and private types are another story. An analyzer can actually see every possible type that could be deriving from a private type, since the analyzer has access to the whole compilation unit containing that type, and it needn’t worry about compatibility because anything that could derive from such a type necessarily must also be non-public and would be recompiled right along with the base type. Further, with the exception of assemblies annotated as InternalsVisibleTo, an analyzer can have the same insight into internal types. Thus, this PR adds CA1852, an analyzer that flags, in non-InternalsVisibleTo assemblies, all private and internal types that aren’t sealed and that have no types deriving from them, and recommends they be sealed. (Due to some current limitations in the infrastructure around fixers and how this analyzer had to be written in order to be able to see all of the types in the assembly, the analyzer for CA1852 doesn’t show up in Visual Studio. It can, however, be applied using the dotnet format
tool. And if you bump up the level of the rule from info to warning or error, it’ll show up as part of builds as well.)
In .NET 6, we sealed over 2300 types, but even with that, this analyzer ended up finding more to seal. dotnet/runtime#59941 from @NewellClark sealed another ~70 types, and dotnet/runtime#68268, which enabled the rule as a warning in dotnet/runtime (which builds with warnings-as-errors), sealed another ~100 types. As a larger example of the rule in use, ASP.NET hadn’t done much in the way of sealing types in previous releases, but with CA1852 now in the .NET SDK, dotnet/aspnetcore#41457 enabled the analyzer and sealed more than 1100 types.
Another new analyzer, CA1854, was added indotnet/roslyn-analyzers#4851 from@CollinAlpert and then enabled indotnet/runtime#70157. This analyzer looks for the surprisingly common pattern where aDictionary<TKey, TValue>
‘sContainsKey
is used to determine whether a dictionary contains a particular entry, and then if it does, the dictionary’s indexer is used to retrieve the value associated with the key, e.g.
if (_dictionary.ContainsKey(key)){ var value = _dictionary[key]; Use(value);}
Dictionary’sTryGetValue
method already combines both of these operations, both looking up the key and retrieving its value if it exists, doing so as a single operation:
if (_dictionary.TryGetValue(key, out var value)){ Use(value);}
A benefit of this, in addition to arguably being simpler, is that it’s also faster. WhileDictionary<TKey, TValue>
provides very fast lookups, and while the performance of those lookups has gotten faster over time, doing fast work is still more expensive than doing no work, and if we can do one lookup instead of two, that can result in a meaningful performance boost, in particular if it’s being performed on a fast path. And we can see from this simple benchmark that looks up a word in a dictionary that, for this operation, making distinct calls toContainsKey
and the indexer does indeed double the cost of using the dictionary, almost exactly:
private readonly Dictionary<string, int> _counts = Regex.Matches( new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/100/pg100.txt").Result, @"\b\w+\b") .Cast<Match>() .GroupBy(word => word.Value, StringComparer.OrdinalIgnoreCase) .ToDictionary(word => word.Key, word => word.Count(), StringComparer.OrdinalIgnoreCase);private string _word = "the";[Benchmark(Baseline = true)]public int Lookup1(){ if (_counts.ContainsKey(_word)) { return _counts[_word]; } return -1;}[Benchmark]public int Lookup2(){ if (_counts.TryGetValue(_word, out int count)) { return count; } return -1;}
Method | Mean | Ratio |
---|---|---|
Lookup1 | 28.20 ns | 1.00 |
Lookup2 | 14.12 ns | 0.50 |
Somewhat ironically, even as I write this example, the analyzer and its auto-fixer are helpfully trying to get me to change my benchmark code:
Similarly,dotnet/roslyn-analyzers#4836 from@chucker added CA1853, which looks for cases where aRemove
call on a dictionary is guarded by aContainsKey
call. It seems it’s fairly natural for developers to only callRemove
on a dictionary once they’re sure the dictionary contains the thing being removed; maybe they thinkRemove
will throw an exception if the specified key doesn’t exist. However,Remove
actually allows this as a first-class scenario, with its returnBoolean
value indicating whether the key was in the dictionary (and thus successfully removed) or not. An example of this comes fromdotnet/runtime#68724, where CA1853 was enabled for dotnet/runtime. TheEventPipeEventDispatcher
type’sRemoveEventListener
method had code like this:
if (m_subscriptions.ContainsKey(listener)){ m_subscriptions.Remove(listener);}
which the analyzer flagged and which its auto-fixer replaced with just:
m_subscriptions.Remove(listener);
Nice and simple. And faster, since as with the TryGetValue case, this is now doing a single dictionary lookup rather than two.
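Relatedly, when the removed value is also needed, Dictionary<,> has a Remove overload that hands it back, still with just a single lookup (the field and type names here are illustrative):

// One lookup both removes the entry and returns the value that was removed, if any.
if (_cache.Remove(key, out CachedItem? removed))
{
    removed.Dispose();
}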
Another nice analyzer added in dotnet/roslyn-analyzers#5907 and dotnet/roslyn-analyzers#5910 is CA1851, which looks for code that iterates through some kinds of enumerables multiple times. Enumerating an enumerable, whether directly or via helper methods like those in LINQ, can have non-trivial cost. Calling GetEnumerator typically allocates an enumerator object, and every item yielded typically involves two interface calls, one to MoveNext and one to Current. If something can be done via a single pass over the enumerable rather than multiple passes, that can save such costs. In some cases, seeing places this analyzer fires can also inspire changes that avoid any use of enumerators. For example, dotnet/runtime#67292 enabled CA1851 for dotnet/runtime, and in doing so, it fixed several diagnostics issued by the analyzer (even in a code base that’s already fairly stringent about enumerator and LINQ usage). As an example, this is a function in System.ComponentModel.Composition that was flagged by the analyzer:
private void InitializeTypeCatalog(IEnumerable<Type> types){ foreach (Type type in types) { if (type == null) { throw ExceptionBuilder.CreateContainsNullElement(nameof(types)); } else if (type.Assembly.ReflectionOnly) { throw new ArgumentException(SR.Format(SR.Argument_ElementReflectionOnlyType, nameof(types)), nameof(types)); } } _types = types.ToArray();}
The method’s purpose is to convert the enumerable into an array to be stored, but also to validate that the contents are all non-null and non-“ReflectionOnly.” To achieve that, the method is first using a foreach to iterate through the enumerable, validating each element along the way, and then once it’s done so, it calls ToArray() to convert the enumerable into an array. There are multiple problems with this. First, it’s incurring the expense of iterating through the enumerable twice, once for the foreach and once for the ToArray()
, which internally needs to enumerate it if it can’t do something special like cast toICollection<Type>
andCopyTo
the data out of it. Second, it’s possible the caller’sIEnumerable<Type>
changes on each iteration, so any validation done in the first iteration isn’t actually ensuring there aren’t nulls in the resulting array, for example. Since the expectation of the method is that all inputs are valid and we don’t need to optimize for the failure cases, the better approach is tofirst callToArray()
and then validate the contents of that array, which is exactly what that PR fixes it to do:
private void InitializeTypeCatalog(IEnumerable<Type> types){ Type[] arr = types.ToArray(); foreach (Type type in arr) { if (type == null) { throw ExceptionBuilder.CreateContainsNullElement(nameof(types)); } if (type.Assembly.ReflectionOnly) { throw new ArgumentException(SR.Format(SR.Argument_ElementReflectionOnlyType, nameof(types)), nameof(types)); } } _types = arr;}
With that, we only ever iterate it once (and possibly zero times if ToArray can special-case it), and as a bonus, we validate the copy rather than the mutable original.
Yet another helpful analyzer is the new CA1850 introduced indotnet/roslyn-analyzers#4797 from@wzchua. It used to be that if you wanted to cryptographically hash some data in .NET, you would create an instance of a hash algorithm and call its ComputeHash method, e.g.
public byte[] Hash(byte[] data){ using (SHA256 h = SHA256.Create()) { return h.ComputeHash(data); }}
However, .NET 5 introduced new “one-shot” hashing methods, which obviates the need to create a newHashAlgorithm
instance, providing a static method that performs the whole operation.
public byte[] Hash(byte[] data){ return SHA256.HashData(data);}
CA1850 finds occurrences of the former pattern and recommends changing them to the latter.
The result is not only simpler, it’s also faster:
private readonly byte[] _data = RandomNumberGenerator.GetBytes(128);[Benchmark(Baseline = true)]public byte[] Hash1(){ using (SHA256 h = SHA256.Create()) { return h.ComputeHash(_data); }}[Benchmark]public byte[] Hash2(){ return SHA256.HashData(_data);}
Method | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|
Hash1 | 1,212.9 ns | 1.00 | 240 B | 1.00 |
Hash2 | 950.8 ns | 0.78 | 56 B | 0.23 |
The .NET 7 SDK also includes new analyzers around[GeneratedRegex(...)]
(dotnet/runtime#68976) and the already mentioned ones for LibraryImport, all of which help to move your code forwards to more modern patterns that have better performance characteristics.
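For example, the [GeneratedRegex(...)] analyzer steers code away from constructing a Regex at run-time and toward the compile-time source-generated form (the pattern and names here are just an example):

using System.Text.RegularExpressions;

public static partial class Validators
{
    [GeneratedRegex(@"^\d{4}-\d{2}-\d{2}$")]
    private static partial Regex DatePattern(); // implementation is emitted by the source generator

    public static bool IsIsoDate(string text) => DatePattern().IsMatch(text);
}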
This release also sawdotnet/runtime turn on a bunch of additional IDEXXXX code style rules and make a huge number of code changes in response. Most of the resulting changes are purely about simplifying the code, but in almost every case some portion of the changes also have a functional and performance impact.
Let’s start with IDE0200, which is about removing unnecessary lambdas. Consider a setup like this:
public class C{ public void CallSite() => M(i => Work(i)); public void M(Action<int> action) { } private static void Work(int value) { }}
Here we have a methodCallSite
that’s invoking a methodM
and passing a lambda to it. MethodM
accepts anAction<int>
, and the call site is passing a lambda that takes the suppliedInt32
and passes it off to some static functionality. For this code, the C# compiler is going to generate something along the lines of this:
public class C{ [CompilerGenerated] private sealed class <>c { public static readonly <>c <>9 = new <>c(); public static Action<int> <>9__0_0; internal void <CallSite>b__0_0(int i) => Work(i); } public void CallSite() => M(<>c.<>9__0_0 ??= new Action<int>(<>c.<>9.<CallSite>b__0_0)); public void M(Action<int> action) { } private static void Work(int value) { }}
The most important aspect of this is that<>9__0_0
field the compiler emitted. That field is a cache for the delegate created inCallSite
. The first timeCallSite
is invoked, it’ll allocate a new delegate for the lambda and store it into that field. For all subsequent invocations, however, it’ll find the field is non-null and will just reuse the same delegate. Thus, this lambda only ever results in a single allocation for the whole process (ignoring any race conditions on the initial lazy initialization such that multiple threads all racing to initialize the field might end up producing a few additional unnecessary allocations). It’s important to recognize this caching only happens because the lambda doesn’t access any instance state and doesn’t close over any locals; if it did either of those things, such caching wouldn’t happen. Secondarily, it’s interesting to note the pattern the compiler uses for the lambda itself. Note that generated<CallSite>b__0_0
method is generated as an instance method, and the call site refers to that method of a singleton instance that’s used to initialize a<>9
field. That’s done because delegates to static methods use something called a “shuffle thunk” to move arguments into the right place for the target method invocation, making delegates to statics ever so slightly more expensive to invoke than delegates to instance methods.
private Action _instance = new C().InstanceMethod;private Action _static = C.StaticMethod;[Benchmark(Baseline = true)]public void InvokeInstance() => _instance();[Benchmark]public void InvokeStatic() => _static(); private sealed class C{ public static void StaticMethod() { } public void InstanceMethod() { }}
Method | Mean | Ratio |
---|---|---|
InvokeInstance | 0.8858 ns | 1.00 |
InvokeStatic | 1.3979 ns | 1.58 |
So, the compiler is able to cache references to lambdas, great. What about method groups, i.e. where you just name the method directly? Previously, if I changed my code to:
public class C{ public void CallSite() => M(Work); public void M(Action<int> action) { } private static void Work(int value) { }}
the compiler would generate the equivalent of:
public class C{ public void CallSite() => M(new Action<int>(Work)); public void M(Action<int> action) { } private static void Work(int value) { }}
which has the unfortunate effect of allocating a new delegate on every invocation, even though we’re still dealing with the exact same static method. Thanks todotnet/roslyn#58288 from@pawchen, the compiler will now generate the equivalent of:
public class C{ [CompilerGenerated] private static class <>O { public static Action<int> <0>__Work; } public void CallSite() => M(<>O.<0>__Work ??= new Action<int>(Work)); public void M(Action<int> action) { } private static void Work(int value) { }}
Note that we again have a caching field used to allocate the delegate once and then reuse it. That means places where code was using a lambda purely to get this caching can now switch back to the cleaner and simpler method group way of expressing the desired functionality. One interesting difference to be cognizant of: since there's no lambda for which the compiler needs to emit a new method, we're now creating a delegate directly to the static method, with the slightly higher "shuffle thunk" invocation cost discussed earlier. However, that minor difference in thunk overhead is typically made up for by the fact that there's no intermediate method to invoke; in the common case where the static helper being invoked isn't inlinable (because it's not super tiny, because it has exception handling, etc.), we previously would have incurred the cost of the delegate invocation plus the non-inlinable method call, and now we just have the cost of an ever-so-slightly more expensive delegate invocation; on the whole, it's typically a wash.
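To see the caching in action, here's a minimal console sketch (not from the PR; the program is invented purely for illustration) that performs the same static method group conversion on every loop iteration and checks whether the same delegate instance comes back. Compiled with the .NET 7 SDK, each iteration should observe the single cached delegate, whereas older compilers allocate a fresh delegate every time.

using System;

public static class MethodGroupCachingDemo
{
    public static void Main()
    {
        Action<int>? previous = null;
        for (int i = 0; i < 3; i++)
        {
            // The same static method group conversion at the same call site each iteration.
            Action<int> current = Work;

            // With the delegate caching added in dotnet/roslyn#58288, 'current' is the same
            // instance every time, so this prints True on every iteration.
            Console.WriteLine(previous is null || ReferenceEquals(previous, current));
            previous = current;
        }
    }

    private static void Work(int value) { }
}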
And that brings us to IDE0200, which recognizes lambda expressions that can be removed. dotnet/runtime#71011 enabled the analyzer for dotnet/runtime, resulting in more than 100 call sites changing accordingly. However, IDE0200 does more than just this mostly stylistic change. It also recognizes some patterns that can make a more substantial impact. Consider this code that was changed as part of that PR:
Action disposeAction;
IDisposable? disposable = null;
...
if (disposable != null)
{
    disposeAction = () => disposable.Dispose();
}
That delegate closes over the disposable local, which means this method needs to allocate a display class. But IDE0200 recognizes that instead of closing over disposable, we can create the delegate directly to the Dispose method:
Action disposeAction;
IDisposable? disposable = null;
...
if (disposable != null)
{
    disposeAction = disposable.Dispose;
}
We still get a delegate allocation, but we avoid the display class allocation, and as a bonus we save on the additional metadata required for the synthesized display class and method generated for the lambda.
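To get a feel for the difference, here's a hedged sketch in the same BenchmarkDotNet style as the other snippets in this post (the field and setup are mine, not from the PR); with [MemoryDiagnoser] attached, the closure version shows two allocations per call (display class plus delegate), while the method group version shows just the delegate.

private IDisposable _disposable = new System.IO.MemoryStream();

[Benchmark(Baseline = true)]
public Action WithClosure()
{
    // Capturing the local forces a display class allocation in addition to the delegate.
    IDisposable disposable = _disposable;
    return () => disposable.Dispose();
}

[Benchmark]
public Action WithMethodGroup()
{
    // Binding directly to the instance method allocates only the delegate.
    return _disposable.Dispose;
}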
IDE0020 is another good example of an analyzer that is primarily focused on making code cleaner, more maintainable, and more modern, but that can also lead to removing overhead from many different places. The analyzer looks for code performing unnecessary duplicative casts and recommends using C# pattern matching syntax instead. For example, dotnet/runtime#70523 enabled the analyzer and switched more than 250 locations from code like:
if (value is SqlDouble)
{
    SqlDouble i = (SqlDouble)value;
    return CompareTo(i);
}
to instead be like:
if (value is SqlDouble i)
{
    return CompareTo(i);
}
In addition to being cleaner, this ends up saving a cast operation, which can add measurable overhead if the JIT is unable to remove it:
private object _value = new List<string>();

[Benchmark(Baseline = true)]
public List<string> WithCast()
{
    object value = _value;
    return value is List<string> ? (List<string>)value : null;
}

[Benchmark]
public List<string> WithPattern()
{
    object value = _value;
    return value is List<string> list ? list : null;
}
Method | Mean | Ratio |
---|---|---|
WithCast | 2.602 ns | 1.00 |
WithPattern | 1.886 ns | 0.73 |
Then there’s IDE0031, which promotes using null propagation features of C#. This analyzer typically manifests as recommending changing snippets like:
return _value != null ? _value.Property : null;
into code that’s instead like:
return _value?.Property;
Nice, concise, and primarily about cleaning up the code and making it simpler and more maintainable by utilizing newer C# syntax. However, there can also be a small performance advantage in some situations. For example, consider this snippet:
public class C
{
    private C _value;

    public int? Get1() => _value != null ? _value.Prop : null;
    public int? Get2() => _value?.Prop;

    public int Prop => 42;
}
The C# compiler lowers these expressions to the equivalent of this:
public Nullable<int> Get1()
{
    if (_value == null)
        return null;
    return _value.Prop;
}

public Nullable<int> Get2()
{
    C value = _value;
    if (value == null)
        return null;
    return value.Prop;
}
for which the JIT then generates:
; Program.Get1()
       push      rax
       mov       rdx,[rcx+8]
       test      rdx,rdx
       jne       short M00_L00
       xor       eax,eax
       add       rsp,8
       ret
M00_L00:
       cmp       [rdx],dl
       mov       dword ptr [rsp+4],2A
       mov       byte ptr [rsp],1
       mov       rax,[rsp]
       add       rsp,8
       ret
; Total bytes of code 40

; Program.Get2()
       push      rax
       mov       rax,[rcx+8]
       test      rax,rax
       jne       short M00_L00
       xor       eax,eax
       add       rsp,8
       ret
M00_L00:
       mov       dword ptr [rsp+4],2A
       mov       byte ptr [rsp],1
       mov       rax,[rsp]
       add       rsp,8
       ret
; Total bytes of code 38
Note how the Get1 variant has an extra cmp instruction (cmp [rdx],dl) in the otherwise identical assembly to Get2 (other than register selection). That cmp instruction in Get1 is the JIT forcing a null check on the second read of _value prior to accessing its Prop, whereas in Get2 the null check against the local means the JIT doesn't need to add an additional null check on the second use of the local, since nothing could have changed it. dotnet/runtime#70965 rolled out additional use of the null propagation operator via auto-fixing IDE0031, resulting in ~120 uses being improved.
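The reason the JIT can't elide that check in Get1 is that the field is read twice and could change in between the reads; here's a small illustrative sketch (hypothetical, not from the runtime) of the hazard the extra null check guards against:

public class C
{
    private C? _value;
    public int Prop => 42;

    public void Set(C value) => _value = value;

    // If another thread runs this between Get1's "_value != null" check and its second read
    // of _value, that second read observes null, which is why the JIT must keep a null check
    // on the re-read. Get2's copy into a local is immune: a local can't be changed by another thread.
    public void Clear() => _value = null;
}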
Another interesting example is IDE0060, which finds unused parameters and recommends removing them. This was done for non-public members in System.Private.CoreLib in dotnet/runtime#63015. As with some of the other mentioned rules, it's primarily about good hygiene. There can be some small additional cost associated with passing extra parameters (the overhead of reading the values at the call site, putting them into the right register or stack location, etc., plus the metadata size associated with the additional parameter information), but the larger benefit comes from auditing all of the cited violations and finding places where work is simply being performed unnecessarily. For example, that PR made some updates to the TimeZoneInfo type's implementation for Unix. In that implementation is a TZif_ParseRaw method, which is used to extract some information from a time zone data file. Amongst many input and output parameters, it had out bool[] StandardTime, out bool[] GmtTime, which the implementation was dutifully filling in by allocating and populating new arrays for each. The call site for TZif_ParseRaw was then taking those arrays and feeding them into another method, TZif_GenerateAdjustmentRules, which ignored them! Thus, not only was this PR able to remove those parameters from TZif_GenerateAdjustmentRules, it was able to update TZif_ParseRaw to no longer need to allocate and populate those arrays at all, which obviously yields a much larger gain.
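To illustrate the shape of that win with a simplified before/after sketch (the names and signatures here are invented; this is not the actual TimeZoneInfo code): once the unused out parameters are flagged and removed, the allocations that existed only to feed them disappear as well.

// Before: the only caller ignores standardTime and gmtTime, yet they're allocated and filled.
private static void ParseRaw(byte[] data, out int[] offsets, out bool[] standardTime, out bool[] gmtTime)
{
    offsets = new int[data.Length];
    standardTime = new bool[data.Length];
    gmtTime = new bool[data.Length];
}

// After: removing the unused out parameters also removes the wasted allocations.
private static void ParseRaw(byte[] data, out int[] offsets)
{
    offsets = new int[data.Length];
}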
One final example of peanut-buttery performance improvements from applying an analyzer comes from dotnet/runtime#70896 and dotnet/runtime#71361, which applied IDE0029 across dotnet/runtime. IDE0029 flags cases where null coalescing can be used, e.g. flagging:
return message != null ? message : string.Empty;
and recommending it be converted to:
return message ?? string.Empty;
As with some of the previously discussed rules, that change in and of itself doesn't make a meaningful performance improvement; it's about clarity and simplicity. In various cases, however, it can make a difference. For example, the aforementioned PRs contained an example like:
null != foundColumns[i] ? foundColumns[i] : DBNull.Value;
which is rewritten to:
foundColumns[i] ?? DBNull.Value
This avoids an unnecessary re-access to an array. Or again from those PRs the expression:
entry.GetKey(_thisCollection) != null ? entry.GetKey(_thisCollection) : "key"
being changed to:
entry.GetKey(_thisCollection) ?? "key"
and avoiding an unnecessary table lookup.
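Here's a stand-alone sketch of that same pitfall (the type and lookup method are invented for illustration): the ternary form re-evaluates the lookup whenever the result is non-null, while the coalescing form evaluates it exactly once.

using System.Collections.Generic;

public class NamesCache
{
    private readonly Dictionary<int, string> _names = new();

    // Ternary form: Lookup executes twice whenever the name is present.
    public string Ternary(int id) => Lookup(id) != null ? Lookup(id)! : "unknown";

    // IDE0029's suggested form: the same result with a single lookup.
    public string Coalesce(int id) => Lookup(id) ?? "unknown";

    private string? Lookup(int id) => _names.TryGetValue(id, out string? name) ? name : null;
}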
What’s Next?
Whew! That was a lot. Congrats on getting through it all.
The next step is on you. Download the latest .NET 7 bits and take them for a spin. Upgrade your apps. Write and share your own benchmarks. Provide feedback, positive and critical. Find something you think can be better? Open an issue, or better yet, submit a PR with the fix. We’re excited to work with you to polish .NET 7 to be the best .NET release yet; meanwhile, we’re getting going on .NET 8 🙂
Until next time…
Happy coding!
Author

Stephen Toub is a developer on the .NET team at Microsoft.
154 comments
Caner Güreşci: it was a great article. thanks Stephen
Akash Bagh: With so many performance improvements in each new release, I'm wondering if you can get faster than the Java runtime, which was said to be much faster than .NET ten years ago. Where are we today? I can't find any information about it.
Dan Friedman: You are my hero! Thank you.
Adryzz6: These posts are awesome as always. The longer the better.
Nimrod Cohen: I love the focus on performance! Had a great read, keep up the good work! .NET is just getting better and better 🙂
Dean Jackson: I wanted to thank you for your old post in 2011 about AsyncLazy<T>. I needed a ConcurrentDictionary with values being async and lazy, because my valueFactory for the dictionary items needs to only run once (using GetOrAdd). I found a blog by someone that linked to your post. Thanks not only for the code, but for explaining in detail how it works compared to regular Lazy!! I'm using it now in an important project.
BellarmineHead: Re. sealed classes, this analyzer: https://github.com/dotnet/roslyn-analyzers/pull/5594 and the comment:
"But internal and private types are another story. An analyzer can actually see every possible type that could be deriving from a private type, since the analyzer has access to the whole compilation unit containing that type, and it needn't worry about compatibility because anything that could derive from such a type necessarily must also be non-public and would be recompiled right along with the base type."
In these web-oriented days, all my assemblies form either a Web API or a Blazor WebAssembly application. Nobody derives from one of my classes, except for my own code (and in fairly rare cases).
But I've grown lazy, and make all my classes public by default. That's necessary for types in the Shared project of a Blazor Wasm app, but not necessary for types in the Client and Server projects. But who cares if all the types in Client and Server are public? I don't… Client and Server are marked as console applications, and nothing on the outside is going to derive from these classes. But maybe I should care if this sealed analyzer is only going to flag up internal and private types…
Or… can the analyzer tell that your assembly is a console app, and not a class library, and so flag up public types for sealing?
Thoughts?
Stephen Toub - MSFT: "can the analyzer tell that your assembly is a console app, and not a class library, and so flag up public types for sealing?" You can actually add a reference to both and use anything publicly exposed, so the analyzer can't really special-case one or the other.
"But maybe I should care" Personally, I think you should 🙂 but that's your call to make.
Laszlo Deak (edited): Excellent article. I wonder about this: what is the benefit of having RuntimeHelpers.IsKnownConstant? My understanding of dynamic PGO and constant folding would suggest that it should be handled the same way even without the RuntimeHelpers.IsKnownConstant check when Tier-1 JIT happens. For example, StartsWith could be:

public bool StartsWith(char value)
{
    if (value != '\0')
        return _firstChar == value;
    return Length != 0 && _firstChar == value;
}

I would expect that for constant input values, only the corresponding branch of this method would be inlined. Is that something not supported today? Otherwise, how is this case different, such that RuntimeHelpers.IsKnownConstant is required?
Stephen Toub - MSFT: The goal here is to avoid extra branches and comparisons. If the value isn't a constant '\0', then we don't want to increase the number of comparisons and code size by adding an upfront check that compares the input to '\0'. If we can prove that the input character is a constant non-'\0', then we want the code gen to be just _firstChar == value, otherwise we want the code gen to be Length != 0 && _firstChar == value. In other words, the benefit of the IsKnownConstant check is that it lets us optimize without negatively impacting the non-constant case.
_ _: Small hint: "i * 2 + 1 will always be in the range [0, 15]" — this is technically true, but I think it would be better to be more precise and say "i * 2 + 1 will always be in the range [1, 15]".
COVET Pierre-Antoine: Thank you for this amazing post, or rather book 🙂. I just started reading it and I have already learned a lot. Great writing, and of course great job to the dotnet team coming up with all this. Hats off to you all.