Move memset/memcpy helpers to managed impl #98623
Conversation
src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs (outdated; resolved)
...reclr/nativeaot/System.Private.CoreLib/src/Internal/Runtime/CompilerHelpers/MemoryHelpers.cs (outdated; resolved)
...reclr/nativeaot/System.Private.CoreLib/src/Internal/Runtime/CompilerHelpers/MemoryHelpers.cs (outdated; resolved)
jkotas commented Feb 18, 2024 (edited)

We have two places in CoreLib that call
EgorBo commented Feb 18, 2024 (edited)

The problem no longer happens if we change it to call
@jkotas we don't use memset/memcpy helpers in the JIT for x86 and always use
src/coreclr/tools/aot/ILCompiler.Compiler/IL/ILImporter.Scanner.cs (outdated; resolved)
…teMemOps.cs (Co-authored-by: Jan Kotas <jkotas@microsoft.com>)
EgorBo commented Feb 25, 2024 (edited)
Btw, I found and fixed an existing bug: Memmove used to not throw an NRE when both src and dest are null and the length is non-zero. For example, Buffer.MemoryCopy(null, null, 10, 10); does not fail (this can likely be reproduced through other APIs that use Memmove under the hood).
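To illustrate the fix being discussed, here is a minimal C sketch (not the actual CoreLib code; the helper name and the probe trick are assumptions): a copy helper that deterministically faults on null buffers when the length is non-zero, instead of leaving the behavior undefined, while still treating a zero-length copy as a no-op.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: fault on null src/dest when len != 0, mirroring the
 * managed Memmove fix described above. Reading one byte of each buffer up
 * front turns a null pointer into a deterministic access violation rather
 * than undefined behavior inside the bulk copy. */
static void checked_memmove(void *dest, const void *src, size_t len)
{
    if (len != 0) {
        volatile char probe_d = *(volatile char *)dest; /* faults if dest == NULL */
        volatile char probe_s = *(volatile char *)src;  /* faults if src == NULL */
        (void)probe_d;
        (void)probe_s;
        memmove(dest, src, len);
    }
    /* len == 0: no-op, null pointers are tolerated */
}
```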
This is not a bug that needs fixing. Passing an invalid buffer to Memmove is UB.
EgorBo commented Feb 25, 2024 (edited)
If I don't fix it, then:

void Test(MyStruct* a, MyStruct* b) => *a = *b; // or its ref version

won't fail when both are null (or the behavior will depend on the optimization level).
Ok, I agree that it makes sense to fix this for internal JIT use of memory copy.
EgorBo commented Feb 25, 2024 (edited)
Done. I wrote some benchmarks.

NOTE: PGO doesn't unroll these unknown-size memops because the optimization I landed for this currently requires R2R=0, and I didn't disable it. Overall it looks good to me, IMO. It's possible that in some cases the helper calls are always indirect because they're promoted to Tier1 together with the benchmarks; for the cast helpers we did this trick, but I am not sure we want to add the whole of SpanHelpers here as well. CopyBlock4000/InitBlock4000 likely suffer from going to native memset/memmove from the managed helpers anyway.
@@ -10343,18 +10343,34 @@ void Compiler::impImportBlockCode(BasicBlock* block)
// TODO: enable for X86 as well, it currently doesn't support memset/memcpy helpers
x86 supports memset/memcpy helpers now (this can be a follow up change)
Yes, it's part of a follow-up PR because it involves removal of the GT_STORE_DYN_BLK op; I definitely didn't want to do it as part of this PR.
#else
private const nuint MemmoveNativeThreshold = 2048;
#endif
// TODO: Determine optimal value
// TODO: Determine optimal value
Anything TODO here?
Um, do you mean it needs a link? I haven't spent enough time testing this threshold on different CPUs/platforms yet. I presume it also can't be too big, to avoid regressing NativeAOT compiled for a generic CPU.
If you believe that somebody should spend time on this, it should have a link.
It is complicated to pick one universal threshold, so I am not sure whether it would be time well spent; you end up picking your winners and losers. Anything in the 1 kB to 10 kB range sounds about right to me. The point of the threshold is to amortize the cost of the PInvoke transition.
If there is something to look at here, I think it would be the infinite threshold for the Arm64 memcpy. I would expect that the unmanaged memcpy has more tricks than we have here. The threshold for Arm64 was set years ago, when Arm64 optimizations in the overall ecosystem were just starting to show up.
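The threshold idea being debated can be sketched as follows. This is an illustrative C model, not the CoreLib implementation: the function name is hypothetical, and only the 2048-byte cutoff is taken from the snippet above. Copies at or below the threshold stay on the "managed" path (here, a plain byte loop that handles overlap); larger copies pay the one-time transition cost to the platform memmove, amortized over the larger copy.

```c
#include <stddef.h>
#include <string.h>

/* Assumed cutoff, taken from the MemmoveNativeThreshold constant above. */
#define MEMMOVE_NATIVE_THRESHOLD 2048

/* Hypothetical dispatch sketch: small copies use an inline loop (standing in
 * for the managed SpanHelpers path), large copies call the native memmove
 * (standing in for the PInvoke side). */
static void memmove_dispatch(unsigned char *dest, const unsigned char *src, size_t len)
{
    if (len <= MEMMOVE_NATIVE_THRESHOLD) {
        /* Copy direction chosen so overlapping regions are handled correctly. */
        if (dest < src) {
            for (size_t i = 0; i < len; i++)
                dest[i] = src[i];
        } else {
            for (size_t i = len; i > 0; i--)
                dest[i - 1] = src[i - 1];
        }
    } else {
        /* Large copy: the transition cost is amortized over many bytes. */
        memmove(dest, src, len);
    }
}
```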
I wasn't able to find any regressions from that infinite threshold on modern hardware in microbenchmarks, but it's very possible the picture is different in the real world due to those non-temporal stores etc.
I agree it won't hurt to change that threshold to some big value like 32 kB.
I'll remove this comment as part of the GT_STORE_DYN_BLK clean-up PR to avoid spinning CI for a comment change, if you don't mind.
Where would you expect it to help with the current state of the PR? The wrappers for JIT helpers are not there anymore, so the JIT helpers are not affected by this.
This sounds like a general problem with the ordering of tiered compilation. For direct calls, it would be preferable to compile the bottom-layer methods earlier than upper-layer methods. I do not think we have anything in place to help with that (correct me if I am wrong). If there are a few more bottom-layer types that we care about in particular, I do not see a problem with adding them to this workaround.
Is this left for a follow-up?
Looks great. Thank you!
For ordinary managed calls to Clear/Memmove inside void methods, for example:

void Test(Span<byte> span) => span.Clear();

^ the codegen is expected to perform a tail call to
Overall I agree, it's just that I'd rather want
Yep. We can probably ask our ARM64/Intel contributors for help here, although I am not sure non-zero memset is a popular operation (very few hits across SPMI).
Closes #98620
Closes #95517