
Merged stores: Fix alignment-related issues and enable SIMD where possible #92939


Merged

EgorBo merged 15 commits into dotnet:main from EgorBo:merge-stores-simd on Oct 5, 2023

Conversation

@EgorBo (Member) commented Oct 3, 2023 (edited)

Adjust the rules for when we can use unaligned stores for merged ones. Also, enable 2xLONG/REF -> SIMD, and 2xSIMD -> wider SIMD.

Wider scalar stores for naturally aligned primitive data (>1B):

| Target memory | Crosses cache-line boundary? | x64* | arm64 |
|---|---|---|---|
| Global memory | Yes | 🚫 | 🚫 |
| Global memory | No | ✅ | 🚫 |
| Local memory (not exposed) | Yes | ✅ | ✅ |
| Local memory (not exposed) | No | ✅ | ✅ |

SIMD for naturally aligned primitive data (>1B):

| Target memory | Known alignment | x64* | arm64 |
|---|---|---|---|
| Global memory | 1B (aka unknown) | 🚫 | 🚫 |
| Global memory | 8B (most common) | 🚫 | ✅ |
| Global memory | 16B (rare**) | ✅ (AVX+) | ✅ |
| Local memory (not exposed) | 1B | ✅ | ✅ |
| Local memory (not exposed) | 8B | ✅ | ✅ |
| Local memory (not exposed) | 16B | ✅ | ✅ |

* both Intel and AMD
** it's very unlikely the JIT can assume 16-byte alignment currently anyhow
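To make the cache-line rule above concrete, here is a small C sketch (illustrative only — these helper names are hypothetical, not JIT code) of the predicate a compiler could evaluate before coalescing two adjacent naturally aligned 4-byte stores into one 8-byte store:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* typical for current x64 and arm64 cores */

/* True when a `size`-byte access starting at `addr` straddles a cache-line
   boundary; on x64 such a merged store is no longer guaranteed atomic. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}

/* x64 column of the table: merging 2x INT into 1x LONG store to global
   memory is only acceptable when the 8B store stays inside one line. */
static bool can_merge_two_ints_x64(uintptr_t addr)
{
    return !crosses_cache_line(addr, 8);
}
```

On arm64 the same merged 8B store would additionally need 8-byte alignment to stay single-copy atomic, which is why the arm64 column is stricter.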

PS: Merged stores are conservatively disabled on LA64 and RISC-V

Per "Arm Architecture Reference Manual":

* Writes from SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory are treated as a pair of single-copy atomic 64-bit writes.

@tannergooding said that x64 with AVX promises atomicity for 16B stores to 16B-aligned data - so far that seems to be the only thing x64 can guarantee us.

Related issues: #76503, #51638

@ghost assigned EgorBo on Oct 3, 2023
@ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Oct 3, 2023
@ghost

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Merge two consecutive SIMD stores (e.g. 2x Vector256 into 1x Vector512).
It's safe to do since we take an existing SIMD store that doesn't promise any guarantees about atomicity and convert it to a bigger SIMD store.

But I am still trying to build a mental model for the "multiple scalar stores -> SIMD store" case (we currently don't do it).
I came up with this (only for 64-bit, for simplicity):

| Target memory | Known alignment | x64 | arm64 |
|---|---|---|---|
| Heap | 1B (aka unknown) | 🚫 | 🚫 |
| Heap | 8B | 🚫 | ✅ |
| Stack | 1B | ✅* | ✅* |
| Stack | 8B | ✅* | ✅* |
| Unmanaged | 1B | | |
| Unmanaged | 8B | | |

* - only if the target (e.g. a struct) is known not to contain GC handles
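The table above reads as a decision procedure. A hypothetical C sketch for the arm64 column (the names and the exact conditions are my paraphrase of the table, not runtime code; the unmanaged rows are omitted since their cells are still open above):

```c
#include <stdbool.h>

typedef enum { MEM_HEAP, MEM_STACK } mem_kind;

/* arm64 column of the table: stack targets (not exposed to other threads)
   are fine at any alignment as long as the target holds no GC handles
   (the ✅* footnote); heap targets need at least 8B known alignment so a
   16B SIMD store decomposes into single-copy atomic 64-bit writes. */
static bool can_merge_scalars_to_simd_arm64(mem_kind kind,
                                            unsigned known_align_bytes,
                                            bool target_has_gc_handles)
{
    if (kind == MEM_STACK)
        return !target_has_gc_handles;
    return known_align_bytes >= 8;
}
```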

So far, it seems that x86/AMD64 doesn't officially offer any kind of atomicity guarantee for SIMD stores (not even per component).
At the same time, per "Arm Architecture Reference Manual":

* Writes from SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory are treated as a pair of single-copy atomic 64-bit writes.

Related issues: #76503, #51638

Author: EgorBo
Assignees: EgorBo
Labels: area-CodeGen-coreclr
Milestone: -

@tannergooding (Member)

> @tannergooding said that x64 with AVX promises atomicity for 16B stores to 16B-aligned data - so far it seems to be the only thing x64 can guarantee to us.

Note this is from 9.1.1 "Guaranteed Atomic Operations" in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (3A, 3B, 3C & 3D): System Programming Guide


@tannergooding (Member)

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
always be carried out atomically:
• Reading or writing a byte.
• Reading or writing a word aligned on a 16-bit boundary.
• Reading or writing a doubleword aligned on a 32-bit boundary.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations
will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary.
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that the following additional memory operation
will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee
that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
• MOVAPD, MOVAPS, and MOVDQA.
• VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
• VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking
disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be
atomic by the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium,
and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and
P6 family processors provide bus control signals that permit external memory subsystems to make split accesses
atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be
avoided.

Except as noted above, an x87 instruction or an SSE instruction that accesses data larger than a quadword may be
implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may
complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g., due to a
page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible
to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section
4.10.4.4), such page faults may occur even if all accesses are to the same page.
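The AVX clause quoted above boils down to a simple predicate — a sketch, assuming AVX enumeration plus a 16-byte-aligned linear address are the only conditions (as the SDM excerpt states for the MOVAPS-family stores):

```c
#include <stdbool.h>
#include <stdint.h>

/* Per SDM 9.1.1: with CPUID.01H:ECX.AVX[bit 28] set, 16-byte stores done by
   MOVAPD/MOVAPS/MOVDQA (and their VEX.128 / EVEX.128+k0 forms) are atomic,
   provided the memory operand's linear address is 16-byte aligned. */
static bool x64_16b_store_is_atomic(bool cpu_has_avx, uintptr_t linear_addr)
{
    return cpu_has_avx && (linear_addr % 16 == 0);
}
```

Note that the aligned-move encodings themselves fault on misaligned operands, so in practice the JIT would have to prove 16-byte alignment before it could pick these instructions at all.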

EgorBo and others added 2 commits on October 3, 2023 23:24
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
@EgorBo marked this pull request as ready for review on October 3, 2023 21:54
@EgorBo changed the title from "JIT: Merge SIMD stores into wider SIMDs" to "Merged stores: Fix alignment-related issues and enable SIMD where possible" on Oct 4, 2023
@EgorBo mentioned this pull request on Oct 4, 2023
@EgorBo (Member, Author) commented Oct 4, 2023 (edited)

@jakobbotsch @dotnet/jit-contrib PTAL. Diffs (regressions are expected because this made the whole #92852 algorithm more conservative, but the initial diffs were -400kb, so most wins are expected to remain; obviously, most base addresses are TYP_REF, like Jakob predicted).

Wins on ARM64 are due to better SIMD guarantees.

@EgorBo (Member, Author) commented Oct 5, 2023 (edited)

Improved Diffs on arm64

@kunalspathak (Contributor)

> Improved Diffs on arm64

seems there are more regressions on linux/windows x64. Do we know why?


@EgorBo (Member, Author)

> seems there are more regressions on linux/windows x64. Do we know why?

these are reverted improvements from #92852 because they turned out not to be legal (but fortunately, most improvements remained)

@EgorBo (Member, Author)

x86 SPMI jobs failed with timeout/"no space left", I'll check other runs

@EgorBo merged commit ce655e3 into dotnet:main on Oct 5, 2023
@EgorBo deleted the merge-stores-simd branch on October 5, 2023 19:16
@ghost locked as resolved and limited conversation to collaborators on Nov 5, 2023

Reviewers

@jakobbotsch approved these changes

+1 more reviewer

@SingleAccretion left review comments

Reviewers whose approvals may not affect merge requirements

Assignees

@EgorBo

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

5 participants

@EgorBo @tannergooding @kunalspathak @jakobbotsch @SingleAccretion
