Reimplement stubs to improve performance #65738


Merged

janvorli merged 11 commits into dotnet:main from janvorli:new-stubs on Mar 17, 2022

Conversation

@janvorli (Member):

This change implements FixupPrecodeStub, PrecodeStub, CallCountingStub and the VSD stubs LookupStub, DispatchStub and ResolveStub using a new mechanism with fixed code and separate RW data. The LoaderHeap was updated to support a new kind of allocation using interleaved code and data pages to support this mechanism.
The JIT now generates code that uses an indirection slot to jump to methods using FixupPrecode, improving performance of the ASP.NET plaintext benchmark by 3-4% depending on the target platform (measured on x64 Windows / Linux and arm64 Linux).
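To make the interleaved code/data page idea concrete, here is a rough sketch (not the actual runtime code; the data layout, the helper and the fixed 4k page size are assumptions for the example) of how a stub instance can find its writable data at a constant offset from its own code address, because the RW data page is mapped right after the RX code page:

#include <cstddef>
#include <cstdint>

// Assumed data layout for the sketch; one instance per stub on the data page.
struct FixupPrecodeData
{
    void* Target;             // entry point the stub jumps through
    void* MethodDesc;
    void* PrecodeFixupThunk;
};

constexpr size_t kStubCodePageSize = 0x1000;   // assumption: 4k pages

// The stub's data lives exactly one page after its code, so no per-stub
// relocation of the code is needed - the same code bytes work for every slot.
inline FixupPrecodeData* GetData(uint8_t* stubCodeAddress)
{
    return reinterpret_cast<FixupPrecodeData*>(stubCodeAddress + kStubCodePageSize);
}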

I have also removed the Holders, as the stubs are naturally properly aligned due to the way they are allocated.

There is now only a single variant of each stub; the long / short variants are no longer needed because the indirect jumps we use now are not range limited.

Most of the stub code is now target agnostic, and the originally per-target implementations now live in a single place for all targets. Only a few constants are defined as target specific.

The code for the stubs is no longer generated as bytes by C++ code, but rather written in asm and compiled. These precompiled templates are then used as the source to copy the code from. x86 is a bit more complex because it doesn't support PC-relative indirect addressing, so we need to relocate all accesses to the data slots when generating the code pages.
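For illustration, a sketch of what populating a code page from a precompiled template could look like, including the extra x86 relocation step (the helper name, the recorded-offset table and the operand encoding are assumptions for the example, not the actual implementation):

#include <cstddef>
#include <cstdint>
#include <cstring>

// x86-specific path: copy the precompiled stub template into a freshly mapped
// code page and rebase every absolute data-slot operand to the data page that
// was mapped right after this code page. On targets with PC-relative indirect
// addressing the plain copy would be enough.
void PopulateX86CodePage(uint8_t* codePage, const uint8_t* codeTemplate, size_t pageSize,
                         const uint32_t* dataSlotOperandOffsets, size_t operandCount)
{
    memcpy(codePage, codeTemplate, pageSize);

    uint8_t* dataPage = codePage + pageSize;   // interleaved RW data page
    for (size_t i = 0; i < operandCount; i++)
    {
        // Each recorded offset marks a 32-bit absolute operand in the template
        // that refers to a slot on the data page; turn it into a real address.
        uint32_t* operand = reinterpret_cast<uint32_t*>(codePage + dataSlotOperandOffsets[i]);
        *operand = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(dataPage + *operand));
    }
}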

As a further improvement, we could generate just a single page of the code and then just map it many times. This is left for future work.

ARM64 Unix differs from the other targets / platforms in that various page sizes are in use. The asm templates are therefore generated for page sizes from 4k to 64k, and the right variant is picked at runtime based on the page size reported by the OS.
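As a rough sketch of that runtime selection (the table contents, names and supported sizes here are illustrative, not the actual implementation):

#include <unistd.h>
#include <cstddef>

struct StubTemplateSet
{
    size_t      pageSize;    // page size the template was assembled for
    const void* codeStart;   // start of the precompiled template code
};

// Hypothetical table: one entry per page-size variant compiled into the binary.
static const StubTemplateSet g_stubTemplateSets[] =
{
    { 0x1000,  nullptr },    // 4k
    { 0x4000,  nullptr },    // 16k
    { 0x10000, nullptr },    // 64k
};

const StubTemplateSet* SelectStubTemplates()
{
    size_t osPageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    for (const StubTemplateSet& s : g_stubTemplateSets)
    {
        if (s.pageSize == osPageSize)
            return &s;
    }
    return nullptr;   // page size not covered by any precompiled variant
}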

This also removes a lot of writeable mappings created for modifications of the stub code when W^X is enabled; in the plaintext benchmark they were reduced by 75%. That results in a significant reduction of .NET application startup time with W^X enabled.

I think the LoaderHeap would benefit from some refactoring, but I'd prefer to leave it for a follow-up. For the sake of the review, it seems better to keep it as is.

The change also implements logging of the number of mappings and their exact locations. This helped me drive the work and I am planning to use it for further changes. It can be removed once we reach a final state.

There are still opportunities for improvement, but these stubs allowed me to scrape off the most significant portion of the mappings.

@janvorli (Member, Author):

Performance improvements - plaintext benchmark server-side start time and client-side read throughput (which is linearly proportional to requests/s). Please take the results with a grain of salt; they vary a lot, and these are averages of 7 runs with the lowest outlier removed. But the trend is stable.

Win x64

|                        | Main, W^X off | Main, W^X on | This PR, W^X off | This PR, W^X on |
|------------------------|---------------|--------------|------------------|-----------------|
| Start time [ms]        | 324           | 386          | 324              | 357             |
| Read throughput [MB/s] | 662.27        | 640.35       | 674.22           | 679.10          |

Linux Intel x64

|                        | Main, W^X off | Main, W^X on | This PR, W^X off | This PR, W^X on |
|------------------------|---------------|--------------|------------------|-----------------|
| Start time [ms]        | 194           | 251          | 197              | 212             |
| Read throughput [MB/s] | 532.87        | 485.32       | 556.29           | 549.11          |

Linux AMD x64

|                        | Main, W^X off | Main, W^X on | This PR, W^X off | This PR, W^X on |
|------------------------|---------------|--------------|------------------|-----------------|
| Start time [ms]        | 143           | 182          | 143              | 155             |
| Read throughput [MB/s] | 933.55        | 882.35       | 960.85           | 921.60          |

@janvorli (Member, Author):

I've measured similar trends on arm64 Linux in the past, but I need to re-measure the stuff after the recent cleanup and code unification changes.

@janvorli (Member, Author):

It seems disabling the mapping logging has broken the build; I am looking into it.

jmp QWORD PTR [DATA_SLOT(LookupStub, ResolveWorkerTarget)]
LEAF_END_MARKED LookupStubCode, _TEXT

LEAF_ENTRY DispatchStubCode, _TEXT
Member:

I am surprised that you were able to get away with the extra indirection in DispatchStubCode. It will be interesting to see the results of the microbenchmark runs.

@janvorli (Member, Author):

I am not sure what indirection you mean. This was the original code:

BYTE   _entryPoint[2];  // 48 B8                    mov rax,
size_t _expectedMT;     // xx xx xx xx xx xx xx xx  64-bit address
BYTE   part1[3];        // 48 39 XX                 cmp [THIS_REG], rax
BYTE   nopOp;           // 90                       nop ; 1-byte nop to align _implTarget

BYTE   part1[2];        // 48 B8                    mov rax,
size_t _implTarget;     // xx xx xx xx xx xx xx xx  64-bit address
BYTE   part2[1];        // 75                       jne
BYTE   _failDispl;      // xx                       failLabel
BYTE   part3[2];        // FF E0                    jmp rax
                        //                          failLabel:
BYTE   part4[2];        // 48 B8                    mov rax,
size_t _failTarget;     // xx xx xx xx xx xx xx xx  64-bit address
BYTE   part5[2];        // FF E0                    jmp rax

Member:

I meant the performance overhead from the extra indirections introduced by this refactoring. The original code had two immediates, the new code has two indirections.

@adamsitnik What's the best way to do a dotnet/performance benchmark run for a PR like this one to see what improved/regressed on different target platforms?

@janvorli (Member, Author):

Ah, got it, I thought you meant it the other way around.

Member:

> What's the best way to do a dotnet/performance benchmark run for a PR like this one to see what improved/regressed on different target platforms?

We have two options:

  1. Run the benchmarks using two different coreruns (before & after) and store the results in dedicated folders, then compare them using the ResultsComparer tool: https://github.com/dotnet/performance/blob/ef3edbc52d92c6b30ba1c316082f46865e5ff1d6/docs/benchmarking-workflow-dotnet-runtime.md#preventing-regressions It's the best solution if the implementation might change and you might need to re-run the benchmarks using an updated corerun.
  2. Run the benchmarks using two different coreruns without storing the results in dedicated folders, and use BDN to perform the statistical test on the fly: https://github.com/dotnet/performance/blob/03207f183f042f6fc6b9f341df7a0e36b7175f5d/src/benchmarks/micro/README.md#private-runtime-builds It's good if you have only a few benchmarks or don't intend to change the implementation.

@@ -8804,7 +8804,15 @@ void CEEInfo::getFunctionEntryPoint(CORINFO_METHOD_HANDLE ftnHnd,
// Resolve methodImpl.
ftn = ftn->GetMethodTable()->MapMethodDeclToMethodImpl(ftn);

ret = (void *)ftn->TryGetMultiCallableAddrOfCode(accessFlags);
if (!ftn->IsFCall() && ftn->MayHavePrecode() && ftn->GetPrecodeType() == PRECODE_FIXUP)
Member:

I think this condition is not right for the tiered compilation disabled case (not the default, but still used by some first parties). Does this need to call into the code version manager?

@janvorli (Member, Author):

It is possible (although all the tests pass with tiered compilation disabled except for one where the failure looks unrelated to this). What do you think is the specific problem here?

Member:

I think that the tiered compilation disabled case will go through unnecessary indirection when the target method is JITed already.

@janvorli (Member, Author):

I've tried adding !ftn->HasStableEntryPoint() to the condition. It severely degraded the plaintext benchmark performance. I've also thought that !ftn->IsPointingToStableNativeCode() might be needed, but that function seems to never report true together with the preexisting condition (based on my testing: I've added an assert and none of the coreclr pri1 tests hit it, with or without tiered compilation enabled).

@jkotas (Member) commented Feb 24, 2022 (edited):

I think the logic should be something like this:

if (ftn->IsVersionable())
{
    IAT_PVALUE // must go via indirection to enable versioning
}
else
{
    if (the target has final code)
        IAT_VALUE for the final code // this includes FCalls and methods that are already JITed (w/o tiering)
    else
        IAT_PVALUE // must go via indirection to pick up the final code once it is ready
}
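A self-contained sketch of that selection logic (the MethodInfo struct and its fields are illustrative stand-ins rather than the real MethodDesc API; only the IAT_VALUE / IAT_PVALUE distinction mirrors the actual enum):

enum InfoAccessType { IAT_VALUE, IAT_PVALUE };

struct MethodInfo                 // stand-in for the real MethodDesc queries
{
    bool  isVersionable;          // tiering/rejit may publish new code later
    bool  hasFinalCode;           // FCall or already JITed without tiering
    void* finalCode;              // direct entry point when final
    void* indirectionCell;        // slot that is updated when code is published
};

InfoAccessType ChooseEntryPoint(const MethodInfo& m, void** pAddr)
{
    if (m.isVersionable || !m.hasFinalCode)
    {
        // Must go through the indirection so the call site picks up the
        // final (or newer) code version without being repatched.
        *pAddr = m.indirectionCell;
        return IAT_PVALUE;
    }

    // The code will never change; call it directly.
    *pAddr = m.finalCode;
    return IAT_VALUE;
}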

@janvorli (Member, Author):

I will give it a try

@EgorBo (Member):

@janvorli since you seem to be refactoring a lot of stuff, is it possible now to get an understanding of where a currently jitted method will end up in memory, precisely (or at least approximately)? Basically, an exposed AllocPtr of a LoaderCodeHeap, to be able to rely on it for "relocs" in jitted code.

@janvorli (Member, Author):

@EgorBo this change only changes the allocation of stubs; it doesn't change in any way where jitted code will end up. But I am not sure I fully understood what you are after.

* Change the CallCountingStub to not use the return address of an unbalanced call as a stub identifying token. The counter address is used instead on all targets.
* Fix some tabs instead of spaces
* Fix getTargetMethodDesc - in some cases, we get the address of the start of the FixupPrecode too.
* Remove a leftover comment
The assembler was generating 32-bit conditional relative jumps instead of ones with an 8-bit displacement. I've found that the presence of a global label between the jump site and the destination makes the assembler do that. Changing the PATCH_LABEL macro fixed it.
{
Precode* pPrecode = Precode::GetPrecodeFromEntryPoint((PCODE)hlpDynamicFuncTable[dynamicFtnNum].pfnHelper);
_ASSERTE(pPrecode->GetType() == PRECODE_FIXUP);
*ppIndirection = ((FixupPrecode*)pPrecode)->GetTargetSlot();
Member:

What guarantees that we have the final code of the method by this point?

If this is a valid optimization, we should be doing it when filling hlpDynamicFuncTable.

@janvorli (Member, Author):

We don't need to have the final code. This just passes the fixup precode indirection slot to the JIT. If we don't have the final code, the slot points to the fixup part of the stub. When we have the JITted code, the slot points to it. Without this optimization, the JIT would jump to the beginning of the fixup stub and the indirect jump in there would jump through the slot, so we save one jump with this. I found this using the performance repo microbenchmarks, where the casting benchmarks were consistently slower with my change. With this fix, they are now consistently faster.
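To illustrate the saved jump, here is a tiny sketch (illustrative types and names, not the runtime's): the slot either points into the precode's fixup path or at the final JITted code, so a call site that goes through the slot directly skips the extra hop through the stub:

#include <atomic>

using CodePtr = void (*)();

struct FixupPrecodeSlotSketch
{
    std::atomic<CodePtr> target;   // fixup path at first, JITted code once published
};

// Call site that received the slot (IAT_PVALUE): a single indirect call.
// Without the optimization the JIT would instead call the precode stub,
// which performs this same indirect jump itself - one extra hop.
inline void CallThroughSlot(FixupPrecodeSlotSketch* slot)
{
    slot->target.load(std::memory_order_acquire)();
}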

@janvorli (Member, Author):

We can modify the hlpDynamicFuncTable entry to be able to carry either the indirection or the target address, plus a flag indicating which one it is. I was going back and forth in my head on whether to do that or to do it the way I've ended up doing it.
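For illustration, a sketch of what such a dual-purpose entry could look like (purely hypothetical; this is not what the change ends up doing, and the names are made up):

#include <cstdint>

// Hypothetical table entry that can carry either a direct target or an
// indirection cell, with the low bit used as the discriminator.
struct DynamicHelperEntrySketch
{
    uintptr_t value;   // address, with the low bit set when it is an indirection cell

    void*  GetDirectTarget() const { return (value & 1) ? nullptr : reinterpret_cast<void*>(value); }
    void** GetIndirection()  const { return (value & 1) ? reinterpret_cast<void**>(value & ~(uintptr_t)1) : nullptr; }
};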

However, there is an _AddrIsJITHelper function on 32-bit Windows in the debugger controller code that compares the addresses of all the helpers with an address the debugger has stepped into, to avoid the debugger stopping in unmanaged runtime code. While I believe that code should not be hit for precode stubs (we should get a TRACE_STUB trace type), I need to double check that. I am talking about the case when the helper was not JITted yet, so the indirection would go to the middle of the fixup stub and the code there would not detect it was in a helper.

Member:

> This just passes the fixup precode indirection slot to the JIT. If we don't have the final code, the slot points to the fixup part of the stub.

Ah ok, I had missed that.

@janvorli (Member, Author):

I have removed the VSD stubs change; it was causing steady-state performance degradation in more complex ASP.NET benchmarks like Orchard or Fortunes. I am planning to re-add the Lookup stub in a follow-up change, but for this PR it was safer to remove it all.

Extensive benchmarking has shown that the new model for VSD stubs is causing steady-state performance degradation for code using a lot of virtual calls. It wasn't showing up in the plaintext benchmark I have used as the driver of the change. This change loses part of the startup perf improvements, but I am planning to add back the Lookup stubs in a follow-up change. Those should not be perf critical, and they are likely the ones the startup improvements come from.
@janvorli (Member, Author):

Before merging this change, I would like to share some new details on the performance. It turned out that while the plaintext and JSON benchmarks I was using as a driver were showing gains all over the place, Orchard and Fortunes, which I've tested after this PR was created, are showing a 2.5-4.5% regression in read throughput on Citrine Intel Windows and Linux. But on Citrine AMD Linux, the same metric shows a 12-17% gain. The AMD machine has more CPU cores, which might be one of the reasons.
I haven't been able to pinpoint the source of the regression yet; I'll keep investigating after the PR is merged.

janvorli merged commit eb8460f into dotnet:main on Mar 17, 2022
@EgorBo (Member):

[benchmark results graph]

I assume this perf improvement is because of this PR 👍 (according to the graph, the commit range is 05cb7f5...110cb9f)



radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request on Mar 30, 2022
@ghost locked as resolved and limited conversation to collaborators on Apr 23, 2022
@jozkee (Member):

@janvorli we detected a slight x64 regression for the System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches(input: List) benchmark in the 7.0 RC2 vs 6.0 perf report that seems to be related to this commit range: c032e0d...731d936. Do you suspect this PR could be to blame?


System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches(input: List)

| Result | Base    | Diff    | Ratio | Alloc Delta | Operating System    | Bit   | Processor Name                                | Modality |
|--------|---------|---------|-------|-------------|---------------------|-------|-----------------------------------------------|----------|
| Same   | 2740.76 | 2694.02 | 1.02  | +0          | ubuntu 18.04        | Arm64 | Unknown processor                             |          |
| Same   | 621.78  | 599.29  | 1.04  | +0          | Windows 11          | Arm64 | Unknown processor                             |          |
| Same   | 956.34  | 949.61  | 1.01  | +0          | Windows 11          | Arm64 | Microsoft SQ1 3.0 GHz                         |          |
| Same   | 1009.64 | 987.95  | 1.02  | +0          | Windows 11          | Arm64 | Microsoft SQ1 3.0 GHz                         |          |
| Same   | 561.31  | 561.70  | 1.00  | +0          | macOS Monterey 12.6 | Arm64 | Apple M1                                      |          |
| Same   | 546.88  | 545.49  | 1.00  | +0          | macOS Monterey 12.6 | Arm64 | Apple M1 Max                                  |          |
| Slower | 917.99  | 1120.40 | 0.82  | +0          | Windows 10          | X64   | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) |          |
| Slower | 732.11  | 826.22  | 0.89  | +0          | Windows 11          | X64   | AMD Ryzen Threadripper PRO 3945WX 12-Cores    |          |
| Slower | 499.89  | 572.45  | 0.87  | +0          | Windows 11          | X64   | AMD Ryzen 9 5900X                             |          |
| Same   | 455.45  | 463.50  | 0.98  | +0          | Windows 11          | X64   | AMD Ryzen 9 7950X                             |          |
| Same   | 716.52  | 775.36  | 0.92  | +0          | Windows 11          | X64   | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)  |          |
| Slower | 699.97  | 829.86  | 0.84  | +0          | debian 11           | X64   | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)  |          |
| Same   | 496.60  | 540.29  | 0.92  | +0          | ubuntu 18.04        | X64   | AMD Ryzen 9 5900X                             |          |
| Slower | 757.61  | 911.97  | 0.83  | +0          | ubuntu 18.04        | X64   | Intel Xeon CPU E5-1650 v4 3.60GHz             |          |
| Slower | 497.21  | 559.91  | 0.89  | +0          | ubuntu 20.04        | X64   | AMD Ryzen 9 5900X                             |          |
| Same   | 1078.91 | 1143.66 | 0.94  | +0          | ubuntu 20.04        | X64   | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) |          |
| Same   | 755.07  | 779.39  | 0.97  | +0          | ubuntu 20.04        | X64   | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)  |          |
| Slower | 1031.91 | 1159.22 | 0.89  | +0          | macOS Big Sur 11.7  | X64   | Intel Core i5-4278U CPU 2.60GHz (Haswell)     |          |
| Slower | 866.00  | 981.43  | 0.88  | +0          | macOS Monterey 12.6 | X64   | Intel Core i7-4870HQ CPU 2.50GHz (Haswell)    |          |

@janvorli (Member, Author):

@jozkee I am sorry for missing your message before. Yes, I believe the regression was caused by that change. There are a few corner cases where the new stubs implementation has a slight negative effect on micro-benchmarks. Some delegate calls are one of those cases, and this test was one of those where I saw a visible impact when I was developing the change.

