
Port yield normalization from CoreCLR to Native AOT#103675


Merged

eduardo-vp merged 23 commits into dotnet:main from eduardo-vp:port-yield-norm-to-aot

Jul 17, 2024


Conversation

eduardo-vp (Member) commented Jun 18, 2024 (edited)

Porting the current way yield normalization is done to Native AOT.

The CoreCLR implementation was moved to src/coreclr/vm/yieldprocessornormalizedshared.cpp.

Both the CoreCLR file (src/coreclr/vm/yieldprocessornormalized.cpp) and the Native AOT file (src/coreclr/nativeaot/Runtime/yieldprocessornormalized.cpp) now share the same implementation.

Initial commit

392c652

ghost added the needs-area-label label

Jun 18, 2024

dotnet-policy-service bot assigned eduardo-vp

Jun 18, 2024

eduardo-vp added area-System.Threading and removed needs-area-label labels

Jun 18, 2024

dotnet-policy-service bot (Contributor) commented Jun 18, 2024

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

This was referenced Jun 19, 2024

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

GC/Regressions/v2.0-beta2/452950 failed in CI #103494

Closed

LibraryTests (mostly) timing out #103674

Closed

Eduardo Manuel Velarde Polar added 3 commits

June 19, 2024 14:08

Use PalGetTickCount64

9237b9f

Add limits.h

830d8d0

Declare g_pFinalizerThread for Windows only

d0a884c

jkotas reviewed

Jun 19, 2024

src/coreclr/inc/yieldprocessornormalized.h (outdated, resolved)

jkotas reviewed

Jun 19, 2024

src/coreclr/vm/finalizerthread.h (outdated, resolved)

jkotas reviewed

Jun 19, 2024

src/coreclr/vm/finalizerthread.h (outdated, resolved)

Eduardo Manuel Velarde Polar added 4 commits

June 19, 2024 16:19

PR comments

165bbb9

Fix build/x86

b089bac

Remove finalizer thread from native aot

f4ed8e8

Remove unnecessary code

b646568

build-analysis bot mentioned this pull request

Jun 22, 2024

System.Numerics.Tensors.Tests.TensorSpanTests test failure #103525

Closed

jkotas reviewed

Jun 22, 2024

src/coreclr/inc/yieldprocessornormalized.h (outdated, resolved)

src/coreclr/nativeaot/Runtime/FinalizerHelpers.cpp

		@@ -46,9 +46,6 @@ uint32_t WINAPI FinalizerStart(void* pContext)

		g_pFinalizerThread = PTR_Thread(pThread);

		// We have some time until the first finalization request - use the time to calibrate normalized waits.
		EnsureYieldProcessorNormalizedInitialized();

Member

How is the measurement going to be triggered when this is deleted?

Member, Author

I'm still trying to figure this out. I'm not very familiar with Native AOT in general, so I'd appreciate any suggestions.

Member

It looks like we would need to call YieldProcessorNormalization::PerformMeasurement() from here or add an EnsureYieldProcessorNormalizedInitialized() entry point to the new code that simply calls YieldProcessorNormalization::PerformMeasurement().

eduardo-vp (Member, Author), Jun 24, 2024

Do you happen to know if this function is called every ~4 seconds or faster than that? Currently we let YieldProcessorNormalization::PerformMeasurement() run every ~4 s, so if that's the case, I believe we may add here the same call as in CoreCLR:

    if (YieldProcessorNormalization::IsMeasurementScheduled())
    {
        GCX_PREEMP();
        YieldProcessorNormalization::PerformMeasurement();
    }

jkotas (Member), Jun 24, 2024

The FinalizerStart function is called once per process. It is the equivalent of the FinalizerThreadStart function in regular CoreCLR.

I think you want to follow the same structure as in regular CoreCLR: trigger the measurement from ScheduleMeasurementIfNecessary by calling RhEnableFinalization (it is the equivalent of FinalizerThread::EnableFinalization in regular CoreCLR), and then add the measurement to the loop in ProcessFinalizers().
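
The structure described in this comment can be modeled roughly as follows. This is only a single-threaded sketch, not the runtime's code: the names ScheduleMeasurementIfNecessary, IsMeasurementScheduled and PerformMeasurement are taken from the thread, while the `sketch` namespace, the flags, the explicit tick parameters, the `EnableFinalization` stand-in for RhEnableFinalization, and `ProcessFinalizersOnce` are all hypothetical. The ~4 s period is the figure mentioned earlier in the conversation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; none of these are the runtime's actual definitions.
namespace sketch
{
    constexpr std::uint64_t MeasurementPeriodMs = 4000; // "~4 s" from the discussion

    std::atomic<bool> s_measurementScheduled{false};
    std::atomic<bool> s_finalizerWakeRequested{false};
    std::uint64_t s_lastMeasurementTickMs = 0;
    int s_measurementsPerformed = 0;

    // Stand-in for RhEnableFinalization: ask the finalizer thread to wake up.
    void EnableFinalization() { s_finalizerWakeRequested.store(true); }

    // Called from spin-wait paths: a cheap check that schedules at most one
    // pending measurement and delegates the work to the finalizer thread.
    void ScheduleMeasurementIfNecessary(std::uint64_t nowMs)
    {
        assert(nowMs >= s_lastMeasurementTickMs); // ticks are assumed monotonic
        if (nowMs - s_lastMeasurementTickMs < MeasurementPeriodMs)
            return; // measured recently enough
        if (s_measurementScheduled.exchange(true))
            return; // a measurement is already pending
        EnableFinalization();
    }

    bool IsMeasurementScheduled() { return s_measurementScheduled.load(); }

    void PerformMeasurement(std::uint64_t nowMs)
    {
        ++s_measurementsPerformed; // real code would time a short burst of yields here
        s_lastMeasurementTickMs = nowMs;
        s_measurementScheduled.store(false);
    }

    // One iteration of a finalizer-style loop.
    void ProcessFinalizersOnce(std::uint64_t nowMs)
    {
        if (!s_finalizerWakeRequested.exchange(false))
            return; // nothing woke the thread
        if (IsMeasurementScheduled())
            PerformMeasurement(nowMs);
        // ... finalization work would follow here ...
    }
}
```

In this model, a call such as `sketch::ScheduleMeasurementIfNecessary(5000)` followed by `sketch::ProcessFinalizersOnce(5000)` performs exactly one measurement, and further scheduling calls are no-ops until another period has elapsed.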

VSadov (Member), Jun 24, 2024 (edited)

"Do you happen to know if this function is called every ~4 seconds or faster than that?"

I am not sure. The whole deal with measuring the duration of something that is proportional to the CPU cycle is not very precise, since the CPU cycle can change drastically, many times per second, and will be different for every core. Unless the machine is configured with the High Performance power plan, every measurement is a bit of a coin toss and will produce the same result with the same error margins.

The main purpose of calibration is to continue using historically hard-coded spin counts in numerous places where we spin-wait, while allowing that to work on systems with vastly different pause durations (e.g. on post-Skylake Intel CPUs pause takes ~140 cycles, pre-Skylake it is about ~10 cycles). For such a purpose the calibration is precise enough.

I am not sure about the value of redoing the measurement over and over.
Perhaps to support scenarios where a VM is migrated between pre/post skylake machines.
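
To make the purpose of calibration concrete, here is a minimal sketch of how a measured pause duration could be turned into a scaling factor for hard-coded spin counts. It is not the runtime's implementation: the function names and the 40 ns target are assumptions chosen purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Illustrative target duration of one "normalized" yield.
// This value is an assumption for the sketch, not taken from the thread.
constexpr double kTargetNsPerNormalizedYield = 40.0;

// How many raw pause/yield instructions make up one normalized yield,
// given the measured duration of a single raw yield.
unsigned YieldsPerNormalizedYield(double measuredNsPerYield)
{
    assert(measuredNsPerYield > 0.0);
    double ratio = kTargetNsPerNormalizedYield / measuredNsPerYield;
    // Never go below one raw yield, even on CPUs with a slow pause.
    return std::max(1u, static_cast<unsigned>(std::lround(ratio)));
}

// Converts a historical hard-coded spin count into raw yield instructions,
// so the spin lasts roughly the same wall-clock time on different CPUs.
unsigned long long RawYieldCount(unsigned normalizedSpinCount, double measuredNsPerYield)
{
    return static_cast<unsigned long long>(normalizedSpinCount) *
           YieldsPerNormalizedYield(measuredNsPerYield);
}
```

With these assumed numbers, a CPU whose pause takes about 4 ns would execute ten raw yields per normalized yield, while a CPU whose pause already takes 40 ns or more would execute one, so the same hard-coded spin count covers a similar amount of time on both.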

VSadov (Member), Jun 24, 2024

I guess we can add a periodic call to PerformMeasurement in NativeAOT and see what happens.

My guess - nothing will change, just a bit more time spent in PerformMeasurement.

VSadov (Member), Jun 24, 2024 (edited)

There is value in having the same behavior though.
If the re-measuring (or the whole calibration deal) could be somehow avoided or improved, it would make sense to do for both runtimes.

eduardo-vp (Member, Author), Jun 24, 2024

IIRC there's a good reason to keep re-doing measurements, so keeping this behaviour in Native AOT would probably be better. I believe @kouvel or @mangod9 may elaborate better.

kouvel (Contributor), Jul 2, 2024

The measurements done are very short and can be perturbed by CPU activity; the rolling min helps to stabilize them over time.
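
As an illustration of the rolling-min idea, the sketch below keeps the most recent N samples in a ring buffer and reports the smallest one, so a single measurement inflated by unrelated CPU activity does not raise the result while a lower sample is still in the window. The `RollingMin` class and its members are hypothetical and are not the runtime's data structure.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Keeps the last N samples and reports the smallest one.
template <std::size_t N>
class RollingMin
{
public:
    // Records a new sample, overwriting the oldest one once the window is full.
    void Add(double nsPerYield)
    {
        assert(nsPerYield > 0.0);
        m_samples[m_next] = nsPerYield;
        m_next = (m_next + 1) % N;
        m_count = std::min(m_count + 1, N);
    }

    bool HasSamples() const { return m_count != 0; }

    // Smallest sample currently in the window.
    double Min() const
    {
        assert(m_count != 0);
        return *std::min_element(m_samples.begin(), m_samples.begin() + m_count);
    }

private:
    std::array<double, N> m_samples{};
    std::size_t m_next = 0;   // slot that the next sample will occupy
    std::size_t m_count = 0;  // number of valid samples, at most N
};
```

A noisy sample only affects the reported value once every lower sample has aged out of the window, which is why repeating short measurements over time tends to settle on a stable figure.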

src/coreclr/nativeaot/Runtime/yieldprocessornormalized.cpp (outdated, resolved)

jkotas requested review from VSadov and kouvel

June 22, 2024 01:34

Eduardo Manuel Velarde Polar added 2 commits

June 21, 2024 19:23

PR comments + Fix InterlockedExchange

4127158

Add TODOs

a862782

jkotas reviewed

Jun 24, 2024

src/coreclr/vm/yieldprocessornormalizedshared.cpp (outdated, resolved)

jkotas reviewed

Jun 24, 2024

src/coreclr/vm/yieldprocessornormalizedshared.cpp (outdated, resolved)

jkotas reviewed

Jun 24, 2024

src/coreclr/inc/yieldprocessornormalized.h (outdated, resolved)

Eduardo Manuel Velarde Polar added 2 commits

June 24, 2024 22:49

Use max/min and RhEnableFinalization

73d3d71

Remove TODO

0d226da

jkotas reviewed

Jun 27, 2024

src/coreclr/nativeaot/Runtime/FinalizerHelpers.cpp (resolved)

Move PerformMeasurement

6519b0b

build-analysis bot mentioned this pull request

Jun 28, 2024

The job running on agent NetCore-Public ran longer than the maximum time #104044

Closed

jkotas reviewed

Jun 29, 2024

src/coreclr/nativeaot/Runtime/windows/PalRedhawkInline.h (outdated, resolved)

jkotas reviewed

Jun 29, 2024

src/coreclr/nativeaot/Runtime/windows/PalRedhawkInline.h (outdated, resolved)

eduardo-vp force-pushed the port-yield-norm-to-aot branch from af5ceaa to 6519b0b

July 2, 2024 18:48

kouvel reviewed

Jul 2, 2024

src/coreclr/vm/yieldprocessornormalizedshared.cpp (outdated, resolved)

src/coreclr/vm/synch.h (outdated, resolved)

src/coreclr/nativeaot/Runtime/FinalizerHelpers.cpp (outdated, resolved)

src/coreclr/vm/yieldprocessornormalizedshared.cpp (outdated, resolved)

src/coreclr/utilcode/yieldprocessornormalized.cpp (outdated, resolved)

src/coreclr/nativeaot/Runtime/MiscHelpers.cpp (outdated, resolved)

src/coreclr/nativeaot/Runtime/startup.cpp (resolved)

jkotas reviewed

Jul 2, 2024

src/coreclr/nativeaot/Runtime/windows/PalRedhawkInline.h (outdated, resolved)

This was referenced Jul 2, 2024

Build failure: Static graph-based restore failed with exit code .* but did not log an error. #103526

Open

Build failure: Static graph-based restore failed with exit code .* but did not log an error. dotnet/dnceng#3139

Closed

Eduardo Manuel Velarde Polar added 2 commits

July 2, 2024 14:40

Fix PalInterlockedExchange64

9606eb9

PR comments

234d61b

eduardo-vp marked this pull request as ready for review

July 2, 2024 23:09

eduardo-vp requested a review from MichalStrehovsky as a code owner

July 2, 2024 23:09

Eduardo Manuel Velarde Polar added 2 commits

July 2, 2024 17:04

Fix build

51c4573

Fix PalInterlocked

e8e1290

eduardo-vp requested review from jkotas, kouvel and VSadov

July 3, 2024 21:30

kouvel approved these changes

Jul 3, 2024

kouvel (Contributor) left a comment

LGTM, thanks!

jkotas (Member) commented Jul 4, 2024

What kind of testing have you done on the change to validate that it works as expected? Do we expect improvements in any perf benchmarks?

kouvel (Contributor) commented Jul 4, 2024 (edited)

I don't think there would be any changes to benchmarks. I would expect the CPU time spent in the measurements during startup to be a lot less (the new scheme measures lazily, and in narrower windows), and that's about it. It would be good to measure that.

eduardo-vp (Member, Author) commented Jul 17, 2024

I tested the following snippet and checked that the 8 initial measurements were done and that subsequent measurements were done every ~4 seconds as well.

    using System;
    using System.Threading;

    int minutesToSpin = 10;
    int startTicks = Environment.TickCount;
    while (Environment.TickCount - startTicks < minutesToSpin * 60 * 1000)
    {
        Thread.SpinWait(1000);
        Thread.Sleep(2000);
    }