Reimplement stubs to improve performance #65738


Merged

janvorli merged 11 commits into dotnet:main from janvorli:new-stubs on Mar 17, 2022

Conversation

@janvorli (Member):

This change implements FixupPrecodeStub, PrecodeStub, CallCountingStub and the VSD stubs LookupStub, DispatchStub and ResolveStub using a new mechanism with fixed code and separate RW data. The LoaderHeap was updated to support a new kind of allocation using interleaved code and data pages to support this mechanism.
The JIT now generates code that uses an indirection slot to jump to methods using FixupPrecode, improving performance of the ASP.NET plaintext benchmark by 3-4% depending on the target platform (measured on x64 Windows / Linux and arm64 Linux).
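To make the interleaved code/data page idea concrete, here is a rough sketch (not the actual runtime code; the data layout, the helper and the fixed 4k page size are assumptions for the example) of how a stub instance can find its writable data at a constant offset from its own code address, because the RW data page is mapped right after the RX code page:

#include <cstddef>
#include <cstdint>

// Assumed data layout for the sketch; one instance per stub on the data page.
struct FixupPrecodeData
{
    void* Target;             // entry point the stub jumps through
    void* MethodDesc;
    void* PrecodeFixupThunk;
};

constexpr size_t kStubCodePageSize = 0x1000;   // assumption: 4k pages

// The stub's data lives exactly one page after its code, so no per-stub
// relocation of the code is needed - the same code bytes work for every slot.
inline FixupPrecodeData* GetData(uint8_t* stubCodeAddress)
{
    return reinterpret_cast<FixupPrecodeData*>(stubCodeAddress + kStubCodePageSize);
}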

I have also removed the Holders, as the stubs are naturally properly aligned due to the way they are allocated.

There is now only a single variant of each stub; the long / short variants are no longer needed because the indirect jumps we use now are not range limited.

Most of the stub code is now target agnostic, and the originally per-target implementations now live in a single place for all targets. Only a few constants are defined as target specific.

The code for the stubs is no longer generated as bytes by C++ code, but rather written in asm and compiled. These precompiled templates are then used as the source to copy the code from. x86 is a bit more complex because it doesn't support PC-relative indirect addressing, so we need to relocate all accesses to the data slots when generating the code pages.
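For illustration, a sketch of what populating a code page from a precompiled template could look like, including the extra x86 relocation step (the helper name, the recorded-offset table and the operand encoding are assumptions for the example, not the actual implementation):

#include <cstddef>
#include <cstdint>
#include <cstring>

// x86-specific path: copy the precompiled stub template into a freshly mapped
// code page and rebase every absolute data-slot operand to the data page that
// was mapped right after this code page. On targets with PC-relative indirect
// addressing the plain copy would be enough.
void PopulateX86CodePage(uint8_t* codePage, const uint8_t* codeTemplate, size_t pageSize,
                         const uint32_t* dataSlotOperandOffsets, size_t operandCount)
{
    memcpy(codePage, codeTemplate, pageSize);

    uint8_t* dataPage = codePage + pageSize;   // interleaved RW data page
    for (size_t i = 0; i < operandCount; i++)
    {
        // Each recorded offset marks a 32-bit absolute operand in the template
        // that refers to a slot on the data page; turn it into a real address.
        uint32_t* operand = reinterpret_cast<uint32_t*>(codePage + dataSlotOperandOffsets[i]);
        *operand = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(dataPage + *operand));
    }
}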

As a further improvement, we could generate just a single page of the code and then just map it many times. This is left for future work.

ARM64 Unix differs from the other targets / platforms in that various page sizes are in use. The asm templates are therefore generated for page sizes from 4k to 64k, and the right variant is picked at runtime based on the page size reported by the OS.
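As a rough sketch of that runtime selection (the table contents, names and supported sizes here are illustrative, not the actual implementation):

#include <unistd.h>
#include <cstddef>

struct StubTemplateSet
{
    size_t      pageSize;    // page size the template was assembled for
    const void* codeStart;   // start of the precompiled template code
};

// Hypothetical table: one entry per page-size variant compiled into the binary.
static const StubTemplateSet g_stubTemplateSets[] =
{
    { 0x1000,  nullptr },    // 4k
    { 0x4000,  nullptr },    // 16k
    { 0x10000, nullptr },    // 64k
};

const StubTemplateSet* SelectStubTemplates()
{
    size_t osPageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    for (const StubTemplateSet& s : g_stubTemplateSets)
    {
        if (s.pageSize == osPageSize)
            return &s;
    }
    return nullptr;   // page size not covered by any precompiled variant
}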

This also removes a lot of writeable mappings created for modifications of the stub code when W^X is enabled; in the plaintext benchmark they were reduced by 75%. That results in a significant reduction of .NET application startup time with W^X enabled.

I think the LoaderHeap would benefit from some refactoring, but I'd prefer to leave it for a follow-up. For the sake of the review, it seems better to keep it as is.

The change also implements logging of the number of mappings and their exact locations. This helped me drive the work and I am planning to use it for further changes. It can be removed once we reach a final state.

There are still opportunities for improvement, but these stubs allowed me to scrape off the most significant portion of the mappings.

@janvorli (Member, Author):

Performance improvements - plaintext benchmark server-side start time and client-side read throughput (which is linearly proportional to requests/s). Please take the results with a grain of salt; they vary a lot, and these are averages of 7 runs with the lowest outlier removed. But the trend is stable.

Win x64

|                        | Main, W^X off | Main, W^X on | This PR, W^X off | This PR, W^X on |
|------------------------|---------------|--------------|------------------|-----------------|
| Start time [ms]        | 324           | 386          | 324              | 357             |
| Read throughput [MB/s] | 662.27        | 640.35       | 674.22           | 679.10          |

Linux Intel x64

|                        | Main, W^X off | Main, W^X on | This PR, W^X off | This PR, W^X on |
|------------------------|---------------|--------------|------------------|-----------------|
| Start time [ms]        | 194           | 251          | 197              | 212             |
| Read throughput [MB/s] | 532.87        | 485.32       | 556.29           | 549.11          |

Linux AMD x64

|                        | Main, W^X off | Main, W^X on | This PR, W^X off | This PR, W^X on |
|------------------------|---------------|--------------|------------------|-----------------|
| Start time [ms]        | 143           | 182          | 143              | 155             |
| Read throughput [MB/s] | 933.55        | 882.35       | 960.85           | 921.60          |

@janvorli (Member, Author):

I've measured similar trends on arm64 Linux in the past, but I need to re-measure the stuff after the recent cleanup and code unification changes.

@janvorli (Member, Author):

It seems disabling the mapping logging has broken the build; I am looking into it.

jmp QWORD PTR [DATA_SLOT(LookupStub, ResolveWorkerTarget)]
LEAF_END_MARKED LookupStubCode, _TEXT

LEAF_ENTRY DispatchStubCode, _TEXT
Member:

I am surprised that you were able to get away with the extra indirection in DispatchStubCode. It will be interesting to see the results of the microbenchmark runs.

@janvorli (Member, Author):

I am not sure what indirection you mean. This was the original code:

BYTE   _entryPoint[2];  // 48 B8                    mov rax,
size_t _expectedMT;     // xx xx xx xx xx xx xx xx  64-bit address
BYTE   part1[3];        // 48 39 XX                 cmp [THIS_REG], rax
BYTE   nopOp;           // 90                       nop ; 1-byte nop to align _implTarget

BYTE   part1[2];        // 48 B8                    mov rax,
size_t _implTarget;     // xx xx xx xx xx xx xx xx  64-bit address
BYTE   part2[1];        // 75                       jne
BYTE   _failDispl;      // xx                       failLabel
BYTE   part3[2];        // FF E0                    jmp rax
                        //                          failLabel:
BYTE   part4[2];        // 48 B8                    mov rax,
size_t _failTarget;     // xx xx xx xx xx xx xx xx  64-bit address
BYTE   part5[2];        // FF E0                    jmp rax

Member:

I meant the performance overhead from the extra indirections introduced by this refactoring. The original code had two immediates, the new code has two indirections.

@adamsitnik What's the best way to do a dotnet/performance benchmark run for a PR like this one to see what improved/regressed on different target platforms?

@janvorli (Member, Author):

Ah, got it, I thought you meant it the other way around.

Member:

> What's the best way to do a dotnet/performance benchmark run for a PR like this one to see what improved/regressed on different target platforms?

We have two options:

  1. Run the benchmarks using two different coreruns (before & after) and store the results in dedicated folders, then compare them using the ResultsComparer tool: https://github.com/dotnet/performance/blob/ef3edbc52d92c6b30ba1c316082f46865e5ff1d6/docs/benchmarking-workflow-dotnet-runtime.md#preventing-regressions It's the best solution if the implementation might change and you might need to re-run the benchmarks using an updated corerun.
  2. Run the benchmarks using two different coreruns without storing the results in dedicated folders, and use BDN to perform the statistical test on the fly: https://github.com/dotnet/performance/blob/03207f183f042f6fc6b9f341df7a0e36b7175f5d/src/benchmarks/micro/README.md#private-runtime-builds It's good if you have only a few benchmarks or don't intend to change the implementation.

@@ -8804,7 +8804,15 @@ void CEEInfo::getFunctionEntryPoint(CORINFO_METHOD_HANDLE ftnHnd,
// Resolve methodImpl.
ftn = ftn->GetMethodTable()->MapMethodDeclToMethodImpl(ftn);

ret = (void *)ftn->TryGetMultiCallableAddrOfCode(accessFlags);
if (!ftn->IsFCall() && ftn->MayHavePrecode() && ftn->GetPrecodeType() == PRECODE_FIXUP)
Member:

I think this condition is not right for the tiered compilation disabled case (not the default, but still used by some first parties). Does this need to call into the code version manager?

@janvorli (Member, Author):

It is possible (although all the tests pass with tiered compilation disabled except for one where the failure looks unrelated to this). What do you think is the specific problem here?

Member:

I think that the tiered compilation disabled case will go through unnecessary indirection when the target method is JITed already.

@janvorli (Member, Author):

I've tried adding !ftn->HasStableEntryPoint() to the condition. It severely degraded the plaintext benchmark performance. I've also thought that !ftn->IsPointingToStableNativeCode() might be needed, but that function seems to never report true together with the preexisting condition (based on my testing: I've added an assert and none of the coreclr pri1 tests hit it, with or without tiered compilation enabled).

@jkotas (Member) commented Feb 24, 2022 (edited):

I think the logic should be something like this:

if (ftn->IsVersionable())
{
    IAT_PVALUE // must go via indirection to enable versioning
}
else
{
    if (the target has final code)
        IAT_VALUE for the final code // this includes FCalls and methods that are already JITed (w/o tiering)
    else
        IAT_PVALUE // must go via indirection to pick up the final code once it is ready
}
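A self-contained sketch of that selection logic (the MethodInfo struct and its fields are illustrative stand-ins rather than the real MethodDesc API; only the IAT_VALUE / IAT_PVALUE distinction mirrors the actual enum):

enum InfoAccessType { IAT_VALUE, IAT_PVALUE };

struct MethodInfo                 // stand-in for the real MethodDesc queries
{
    bool  isVersionable;          // tiering/rejit may publish new code later
    bool  hasFinalCode;           // FCall or already JITed without tiering
    void* finalCode;              // direct entry point when final
    void* indirectionCell;        // slot that is updated when code is published
};

InfoAccessType ChooseEntryPoint(const MethodInfo& m, void** pAddr)
{
    if (m.isVersionable || !m.hasFinalCode)
    {
        // Must go through the indirection so the call site picks up the
        // final (or newer) code version without being repatched.
        *pAddr = m.indirectionCell;
        return IAT_PVALUE;
    }

    // The code will never change; call it directly.
    *pAddr = m.finalCode;
    return IAT_VALUE;
}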

@janvorli (Member, Author):

I will give it a try

@EgorBo (Member):

@janvorli since you seem to be refactoring a lot of stuff, is it possible now to get an understanding of where a currently jitted method will end up in memory, precisely (or at least approximately)? Basically, an exposed AllocPtr of a LoaderCodeHeap, to be able to rely on it for "relocs" in jitted code.

@janvorli (Member, Author):

@EgorBo this change only changes the allocation of stubs; it doesn't change in any way where jitted code will end up. But I am not sure I fully understood what you are after.

* Change the CallCountingStub to not use the return address of an unbalanced call as a stub identifying token. The counter address is used instead on all targets.
* Fix some tabs instead of spaces
* Fix getTargetMethodDesc - in some cases, we get the address of the start of the FixupPrecode too.
* Remove a leftover comment
The assembler was generating 32-bit conditional relative jumps instead of ones with an 8-bit displacement. I've found that the presence of a global label between the jump site and the destination makes the assembler do that. Changing the PATCH_LABEL macro fixed it.
{
Precode* pPrecode = Precode::GetPrecodeFromEntryPoint((PCODE)hlpDynamicFuncTable[dynamicFtnNum].pfnHelper);
_ASSERTE(pPrecode->GetType() == PRECODE_FIXUP);
*ppIndirection = ((FixupPrecode*)pPrecode)->GetTargetSlot();
Member:

What guarantees that we have the final code of the method by this point?

If this is a valid optimization, we should be doing it when filling hlpDynamicFuncTable.

@janvorli (Member, Author):

We don't need to have the final code. This just passes the fixup precode indirection slot to the JIT. If we don't have the final code, the slot points to the fixup part of the stub. When we have the JITted code, the slot points to it. Without this optimization, the JIT would jump to the beginning of the fixup stub and the indirect jump in there would jump through the slot, so we save one jump with this. I found this using the performance repo microbenchmarks, where the casting benchmarks were consistently slower with my change. With this fix, they are now consistently faster.
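To illustrate the saved jump, here is a tiny sketch (illustrative types and names, not the runtime's): the slot either points into the precode's fixup path or at the final JITted code, so a call site that goes through the slot directly skips the extra hop through the stub:

#include <atomic>

using CodePtr = void (*)();

struct FixupPrecodeSlotSketch
{
    std::atomic<CodePtr> target;   // fixup path at first, JITted code once published
};

// Call site that received the slot (IAT_PVALUE): a single indirect call.
// Without the optimization the JIT would instead call the precode stub,
// which performs this same indirect jump itself - one extra hop.
inline void CallThroughSlot(FixupPrecodeSlotSketch* slot)
{
    slot->target.load(std::memory_order_acquire)();
}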

@janvorli (Member, Author):

We can modify the hlpDynamicFuncTable entry to be able to carry either the indirection or the target address, plus a flag indicating which one it is. I was going back and forth in my head on whether to do that or to do it the way I've ended up doing it.
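For illustration, a sketch of what such a dual-purpose entry could look like (purely hypothetical; this is not what the change ends up doing, and the names are made up):

#include <cstdint>

// Hypothetical table entry that can carry either a direct target or an
// indirection cell, with the low bit used as the discriminator.
struct DynamicHelperEntrySketch
{
    uintptr_t value;   // address, with the low bit set when it is an indirection cell

    void*  GetDirectTarget() const { return (value & 1) ? nullptr : reinterpret_cast<void*>(value); }
    void** GetIndirection()  const { return (value & 1) ? reinterpret_cast<void**>(value & ~(uintptr_t)1) : nullptr; }
};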

However, there is an _AddrIsJITHelper function on 32-bit Windows in the debugger controller code that compares the addresses of all the helpers with an address the debugger has stepped into, to avoid the debugger stopping in unmanaged runtime code. While I believe that code should not be hit for precode stubs (we should get a TRACE_STUB trace type), I need to double check that. I am talking about the case when the helper was not JITted yet, so the indirection would go to the middle of the fixup stub and the code there would not detect it was in a helper.

Member:

> This just passes the fixup precode indirection slot to the JIT. If we don't have the final code, the slot points to the fixup part of the stub.

Ah ok, I had missed that.

@janvorli (Member, Author):

I have removed the VSD stubs change; it was causing steady-state performance degradation in more complex ASP.NET benchmarks like Orchard or Fortunes. I am planning to re-add the Lookup stub in a follow-up change, but for this PR it was safer to remove it all.

Extensive benchmarking has shown that the new model for VSD stubs is causing steady-state performance degradation for code using a lot of virtual calls. It wasn't showing up in the plaintext benchmark I have used as the driver of the change. This change loses part of the startup perf improvements, but I am planning to add back the Lookup stubs in a follow-up change. Those should not be perf critical, and they are likely the ones the startup improvements come from.
@janvorli (Member, Author):

Before merging this change, I would like to share some new details on the performance. It turned out that while the plaintext and JSON benchmarks I was using as a driver were showing gains all over the place, Orchard and Fortunes, which I've tested after this PR was created, are showing a 2.5-4.5% regression in read throughput on Citrine Intel Windows and Linux. But on Citrine AMD Linux, the same metric shows a 12-17% gain. The AMD machine has more CPU cores, which might be one of the reasons.
I haven't been able to pinpoint the source of the regression yet; I'll keep investigating after the PR is merged.

janvorli merged commit eb8460f into dotnet:main on Mar 17, 2022
@EgorBo (Member):

[benchmark results graph]

I assume this perf improvement is because of this PR 👍 (according to the graph, the commit range is 05cb7f5...110cb9f)



radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request on Mar 30, 2022
@ghost locked as resolved and limited conversation to collaborators on Apr 23, 2022
@jozkee (Member):

@janvorli we detected a slight x64 regression for the System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches(input: List) benchmark in the 7.0 RC2 vs 6.0 perf report that seems to be related to this commit range: c032e0d...731d936. Do you suspect this PR could be to blame?


System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches(input: List)

| Result | Base    | Diff    | Ratio | Alloc Delta | Operating System    | Bit   | Processor Name                                | Modality |
|--------|---------|---------|-------|-------------|---------------------|-------|-----------------------------------------------|----------|
| Same   | 2740.76 | 2694.02 | 1.02  | +0          | ubuntu 18.04        | Arm64 | Unknown processor                             |          |
| Same   | 621.78  | 599.29  | 1.04  | +0          | Windows 11          | Arm64 | Unknown processor                             |          |
| Same   | 956.34  | 949.61  | 1.01  | +0          | Windows 11          | Arm64 | Microsoft SQ1 3.0 GHz                         |          |
| Same   | 1009.64 | 987.95  | 1.02  | +0          | Windows 11          | Arm64 | Microsoft SQ1 3.0 GHz                         |          |
| Same   | 561.31  | 561.70  | 1.00  | +0          | macOS Monterey 12.6 | Arm64 | Apple M1                                      |          |
| Same   | 546.88  | 545.49  | 1.00  | +0          | macOS Monterey 12.6 | Arm64 | Apple M1 Max                                  |          |
| Slower | 917.99  | 1120.40 | 0.82  | +0          | Windows 10          | X64   | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) |          |
| Slower | 732.11  | 826.22  | 0.89  | +0          | Windows 11          | X64   | AMD Ryzen Threadripper PRO 3945WX 12-Cores    |          |
| Slower | 499.89  | 572.45  | 0.87  | +0          | Windows 11          | X64   | AMD Ryzen 9 5900X                             |          |
| Same   | 455.45  | 463.50  | 0.98  | +0          | Windows 11          | X64   | AMD Ryzen 9 7950X                             |          |
| Same   | 716.52  | 775.36  | 0.92  | +0          | Windows 11          | X64   | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)  |          |
| Slower | 699.97  | 829.86  | 0.84  | +0          | debian 11           | X64   | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)  |          |
| Same   | 496.60  | 540.29  | 0.92  | +0          | ubuntu 18.04        | X64   | AMD Ryzen 9 5900X                             |          |
| Slower | 757.61  | 911.97  | 0.83  | +0          | ubuntu 18.04        | X64   | Intel Xeon CPU E5-1650 v4 3.60GHz             |          |
| Slower | 497.21  | 559.91  | 0.89  | +0          | ubuntu 20.04        | X64   | AMD Ryzen 9 5900X                             |          |
| Same   | 1078.91 | 1143.66 | 0.94  | +0          | ubuntu 20.04        | X64   | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) |          |
| Same   | 755.07  | 779.39  | 0.97  | +0          | ubuntu 20.04        | X64   | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)  |          |
| Slower | 1031.91 | 1159.22 | 0.89  | +0          | macOS Big Sur 11.7  | X64   | Intel Core i5-4278U CPU 2.60GHz (Haswell)     |          |
| Slower | 866.00  | 981.43  | 0.88  | +0          | macOS Monterey 12.6 | X64   | Intel Core i7-4870HQ CPU 2.50GHz (Haswell)    |          |

@janvorli (Member, Author):

@jozkee I am sorry for missing your message before. Yes, I believe the regression was caused by that change. There are a few corner cases where the new stubs implementation has a slight negative effect on micro-benchmarks. Some delegate calls are one of those cases, and this test was one of those where I saw a visible impact when I was developing the change.

