
Merged stores: Fix alignment-related issues and enable SIMD where possible #92939


Merged

EgorBo merged 15 commits into dotnet:main from EgorBo:merge-stores-simd on Oct 5, 2023

Conversation

@EgorBo (Member) commented Oct 3, 2023 (edited)

Adjust the rules for when we can use unaligned stores for merged ones. Also, enable 2xLONG/REF -> SIMD, and 2xSIMD -> wider SIMD.

Wider scalar stores for naturally aligned primitive data (>1B):

| Target memory | Crosses cache-line boundary? | x64* | arm64 |
|---|---|---|---|
| Global memory | Yes | 🚫 | 🚫 |
| Global memory | No | ✅ | 🚫 |
| Local memory (not exposed) | Yes | ✅ | ✅ |
| Local memory (not exposed) | No | ✅ | ✅ |

SIMD for naturally aligned primitive data (>1B):

| Target memory | Known alignment | x64* | arm64 |
|---|---|---|---|
| Global memory | 1B (aka unknown) | 🚫 | 🚫 |
| Global memory | 8B (most common) | 🚫 | ✅ |
| Global memory | 16B (rare**) | ✅ (AVX+) | ✅ |
| Local memory (not exposed) | 1B | ✅ | ✅ |
| Local memory (not exposed) | 8B | ✅ | ✅ |
| Local memory (not exposed) | 16B | ✅ | ✅ |

* both Intel and AMD
** it's very unlikely the JIT can assume 16-byte alignment currently anyhow
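To make the cache-line rule above concrete, here is a small C sketch (illustrative only — these helper names are hypothetical, not JIT code) of the predicate a compiler could evaluate before coalescing two adjacent naturally aligned 4-byte stores into one 8-byte store:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* typical for current x64 and arm64 cores */

/* True when a `size`-byte access starting at `addr` straddles a cache-line
   boundary; on x64 such a merged store is no longer guaranteed atomic. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}

/* x64 column of the table: merging 2x INT into 1x LONG store to global
   memory is only acceptable when the 8B store stays inside one line. */
static bool can_merge_two_ints_x64(uintptr_t addr)
{
    return !crosses_cache_line(addr, 8);
}
```

On arm64 the same merged 8B store would additionally need 8-byte alignment to stay single-copy atomic, which is why the arm64 column is stricter.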

PS: Merged stores are conservatively disabled on LA64 and RISC-V

Per "Arm Architecture Reference Manual":

* Writes from SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory are treated as a pair of single-copy atomic 64-bit writes.

@tannergooding said that x64 with AVX promises atomicity for 16B stores to 16B-aligned data - so far that seems to be the only thing x64 can guarantee us.

Related issues: #76503, #51638

@ghost assigned EgorBo on Oct 3, 2023
@ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Oct 3, 2023
@ghost

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Merge two consecutive SIMD stores (e.g. 2x Vector256 into 1x Vector512).
It's safe to do since we take an existing SIMD store that doesn't promise any guarantees about atomicity and convert it to a bigger SIMD store.

But I am still trying to build a mental model for the "multiple scalar stores -> SIMD store" case (we currently don't do it).
I came up with this (only for 64-bit, for simplicity):

| Target memory | Known alignment | x64 | arm64 |
|---|---|---|---|
| Heap | 1B (aka unknown) | 🚫 | 🚫 |
| Heap | 8B | 🚫 | ✅ |
| Stack | 1B | ✅* | ✅* |
| Stack | 8B | ✅* | ✅* |
| Unmanaged | 1B | | |
| Unmanaged | 8B | | |

* - only if the target (e.g. a struct) is known not to contain GC handles
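The table above reads as a decision procedure. A hypothetical C sketch for the arm64 column (the names and the exact conditions are my paraphrase of the table, not runtime code; the unmanaged rows are omitted since their cells are still open above):

```c
#include <stdbool.h>

typedef enum { MEM_HEAP, MEM_STACK } mem_kind;

/* arm64 column of the table: stack targets (not exposed to other threads)
   are fine at any alignment as long as the target holds no GC handles
   (the ✅* footnote); heap targets need at least 8B known alignment so a
   16B SIMD store decomposes into single-copy atomic 64-bit writes. */
static bool can_merge_scalars_to_simd_arm64(mem_kind kind,
                                            unsigned known_align_bytes,
                                            bool target_has_gc_handles)
{
    if (kind == MEM_STACK)
        return !target_has_gc_handles;
    return known_align_bytes >= 8;
}
```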

So far, it seems that x86/AMD64 doesn't officially offer any kind of atomicity guarantee for SIMD stores (not even per component).
At the same time, per "Arm Architecture Reference Manual":

* Writes from SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory are treated as a pair of single-copy atomic 64-bit writes.

Related issues: #76503, #51638

Author: EgorBo
Assignees: EgorBo
Labels: area-CodeGen-coreclr
Milestone: -

@tannergooding (Member)

> @tannergooding said that x64 with AVX promises atomicity for 16B stores to 16B-aligned data - so far it seems to be the only thing x64 can guarantee to us.

Note this is from 9.1.1 "Guaranteed Atomic Operations" in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (3A, 3B, 3C & 3D): System Programming Guide


@tannergooding (Member)

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
always be carried out atomically:
• Reading or writing a byte.
• Reading or writing a word aligned on a 16-bit boundary.
• Reading or writing a doubleword aligned on a 32-bit boundary.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations
will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary.
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that the following additional memory operation
will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee
that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
• MOVAPD, MOVAPS, and MOVDQA.
• VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
• VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking
disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be
atomic by the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium,
and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and
P6 family processors provide bus control signals that permit external memory subsystems to make split accesses
atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be
avoided.

Except as noted above, an x87 instruction or an SSE instruction that accesses data larger than a quadword may be
implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may
complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g., due to a
page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible
to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section
4.10.4.4), such page faults may occur even if all accesses are to the same page.
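The AVX clause quoted above boils down to a simple predicate — a sketch, assuming AVX enumeration plus a 16-byte-aligned linear address are the only conditions (as the SDM excerpt states for the MOVAPS-family stores):

```c
#include <stdbool.h>
#include <stdint.h>

/* Per SDM 9.1.1: with CPUID.01H:ECX.AVX[bit 28] set, 16-byte stores done by
   MOVAPD/MOVAPS/MOVDQA (and their VEX.128 / EVEX.128+k0 forms) are atomic,
   provided the memory operand's linear address is 16-byte aligned. */
static bool x64_16b_store_is_atomic(bool cpu_has_avx, uintptr_t linear_addr)
{
    return cpu_has_avx && (linear_addr % 16 == 0);
}
```

Note that the aligned-move encodings themselves fault on misaligned operands, so in practice the JIT would have to prove 16-byte alignment before it could pick these instructions at all.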

EgorBo and others added 2 commits on October 3, 2023 23:24
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
@EgorBo marked this pull request as ready for review on October 3, 2023 21:54
@EgorBo changed the title from "JIT: Merge SIMD stores into wider SIMDs" to "Merged stores: Fix alignment-related issues and enable SIMD where possible" on Oct 4, 2023
@EgorBo mentioned this pull request on Oct 4, 2023
@EgorBo (Member, Author) commented Oct 4, 2023 (edited)

@jakobbotsch @dotnet/jit-contrib PTAL. Diffs (regressions are expected because this made the whole #92852 algorithm more conservative, but the initial diffs were -400kb, so most wins are expected to remain; obviously, most base addresses are TYP_REF, like Jakob predicted).

Wins on ARM64 are due to better SIMD guarantees.

@EgorBo (Member, Author) commented Oct 5, 2023 (edited)

Improved Diffs on arm64

@kunalspathak (Contributor)

> Improved Diffs on arm64

seems there are more regressions on linux/windows x64. Do we know why?


@EgorBo (Member, Author)

> seems there are more regressions on linux/windows x64. Do we know why?

these are reverted improvements from #92852 because they turned out not to be legal (but fortunately, most improvements remained)

@EgorBo (Member, Author)

x86 SPMI jobs failed with timeout/"no space left", I'll check other runs

@EgorBo merged commit ce655e3 into dotnet:main on Oct 5, 2023
@EgorBo deleted the merge-stores-simd branch on October 5, 2023 19:16
@ghost locked as resolved and limited conversation to collaborators on Nov 5, 2023

Reviewers

@jakobbotsch approved these changes

+1 more reviewer

@SingleAccretion left review comments

Reviewers whose approvals may not affect merge requirements

Assignees

@EgorBo

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

5 participants

@EgorBo @tannergooding @kunalspathak @jakobbotsch @SingleAccretion
