Revamp caching scheme in PoolingAsyncValueTaskMethodBuilder #55955


Merged

stephentoub merged 2 commits into dotnet:main from stephentoub:tlsprocpool on Jul 20, 2021

Conversation

@stephentoub (Member) commented Jul 19, 2021 (edited)

The current scheme caches one instance per thread in a ThreadStatic and backs that with a locked stack that all threads contend on; to avoid blocking a thread while accessing the cache, locking is done with TryEnter rather than Enter, simply skipping the cache if there is any contention. The locked stack is capped by default at ProcessorCount * 4 objects.
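The scheme just described can be sketched as follows. This is a minimal illustration with hypothetical names (`OldStyleCache`, `Rent`, `Return`); the real cache lives inside the builder's internal state-machine-box plumbing:

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch of the *current* scheme: one instance per thread,
// plus a shared lock-guarded stack that is skipped under contention.
public sealed class OldStyleCache<T> where T : class, new()
{
    [ThreadStatic] private static T? t_local;      // per-thread fast path

    private readonly Stack<T> _shared = new();     // cross-thread cache
    private readonly int _maxSize = Environment.ProcessorCount * 4;

    public T Rent()
    {
        T? item = t_local;
        if (item != null) { t_local = null; return item; }

        // TryEnter rather than Enter: if another thread holds the lock,
        // skip the cache entirely instead of blocking.
        if (Monitor.TryEnter(_shared))
        {
            try { if (_shared.Count > 0) return _shared.Pop(); }
            finally { Monitor.Exit(_shared); }
        }
        return new T();                            // cache miss: allocate
    }

    public void Return(T item)
    {
        if (t_local == null) { t_local = item; return; }

        if (Monitor.TryEnter(_shared))
        {
            try { if (_shared.Count < _maxSize) _shared.Push(item); }
            finally { Monitor.Exit(_shared); }
        }
        // If contended or full, drop the object and let the GC reclaim it.
    }
}
```

Under heavy load the TryEnter fast-bail means many rents and returns miss the shared stack entirely, which is the behavior the new scheme is designed to avoid.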

The new scheme is simpler: one instance per thread, one instance per core. This ends up meaning fewer objects may be cached, but it also almost entirely eliminates contention between threads trying to rent/return objects. As a result, under heavy load it can actually do a better job of using pooled objects as it doesn't bail on using the cache in the face of contention. It also reduces concerns about larger machines being more negatively impacted by the caching. Under lighter load, since we don't cache as many objects, it does mean we may end up allocating a bit more, but generally not much more (and the object we do allocate is one reference field smaller).
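A sketch of the per-thread-plus-per-core idea, again with hypothetical names (`NewStyleCache`); the per-core slot uses an atomic exchange on rent, so a losing race just falls through to allocation rather than blocking:

```csharp
#nullable enable
using System;
using System.Threading;

// Hypothetical sketch of the *new* scheme: one instance per thread plus
// one instance per core, with no locked stack to contend on.
public sealed class NewStyleCache<T> where T : class, new()
{
    [ThreadStatic] private static T? t_local;      // per-thread slot

    private readonly T?[] _perCore = new T?[Environment.ProcessorCount];

    public T Rent()
    {
        T? item = t_local;
        if (item != null) { t_local = null; return item; }

        // Atomically take whatever is in the current core's slot. A stale
        // core id is harmless: we just touch a different core's slot.
        int core = Thread.GetCurrentProcessorId() % _perCore.Length;
        item = Interlocked.Exchange(ref _perCore[core], null);
        return item ?? new T();                    // miss on both: allocate
    }

    public void Return(T item)
    {
        if (t_local == null) { t_local = item; return; }

        int core = Thread.GetCurrentProcessorId() % _perCore.Length;
        _perCore[core] = item;  // plain store: losing a race only drops an object
    }
}
```

The worst case for a race is an extra allocation or a dropped object, never a blocked thread, which is why contention all but disappears.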

This is on my 12-logical-core box:

| Method     | Toolchain         | Mean    | Error    | StdDev   | Ratio | Gen 0        | Gen 1       | Allocated     |
|------------|-------------------|---------|----------|----------|-------|--------------|-------------|---------------|
| NonPooling | \main\CoreRun.exe | 4.314 s | 0.0795 s | 0.1005 s | 1.00  | 1933000.0000 | 483000.0000 | 11,800,056 KB |
| NonPooling | \pr\corerun.exe   | 4.284 s | 0.0188 s | 0.0167 s | 0.99  | 1933000.0000 | 483000.0000 | 11,800,063 KB |
| Pooling    | \main\CoreRun.exe | 3.010 s | 0.0452 s | 0.0423 s | 1.00  | -            | -           | 323 KB        |
| Pooling    | \pr\corerun.exe   | 2.874 s | 0.0452 s | 0.0423 s | 0.95  | -            | -           | 203 KB        |
```csharp
using System.Linq;
using System.Threading.Tasks;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Diagnosers;

[MemoryDiagnoser]
public class Program
{
    public static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private const int Concurrency = 256;
    private const int Iters = 100_000;

    [Benchmark]
    public Task NonPooling()
    {
        return Task.WhenAll(
            from i in Enumerable.Range(0, Concurrency)
            select Task.Run(async delegate
            {
                for (int j = 0; j < Iters; j++)
                    await A().ConfigureAwait(false);
            }));

        static async ValueTask A() => await B().ConfigureAwait(false);
        static async ValueTask B() => await C().ConfigureAwait(false);
        static async ValueTask C() => await D().ConfigureAwait(false);
        static async ValueTask D() => await Task.Yield();
    }

    [Benchmark]
    public Task Pooling()
    {
        return Task.WhenAll(
            from i in Enumerable.Range(0, Concurrency)
            select Task.Run(async delegate
            {
                for (int j = 0; j < Iters; j++)
                    await A().ConfigureAwait(false);
            }));

        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask A() => await B().ConfigureAwait(false);
        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask B() => await C().ConfigureAwait(false);
        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask C() => await D().ConfigureAwait(false);
        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask D() => await Task.Yield();
    }
}
```

@stephentoub added this to the 6.0.0 milestone Jul 19, 2021
@ghost

Tagging subscribers to this area: @dotnet/area-system-threading-tasks
See info in area-owners.md if you want to be subscribed.

Author: stephentoub
Assignees: -
Labels: area-System.Threading.Tasks, tenet-performance
Milestone: 6.0.0

@adamsitnik (Member) left a comment
LGTM!

> It also reduces concerns about larger machines being more negatively impacted by the caching

To validate that, you could use this template, modify it, and run the benchmarks with and without your changes using the AMD (32 cores), ARM (48 cores), and Mono (56 cores) machines.

@stephentoub merged commit 776053f into dotnet:main Jul 20, 2021
@stephentoub deleted the tlsprocpool branch July 20, 2021 22:06
@ghost locked as resolved and limited conversation to collaborators Aug 19, 2021

Reviewers

@adamsitnik approved these changes
