[zmath] Replace swizzles with shuffles & remove some unnecessary math complexity to increase perf. #637

Draft

dmurph wants to merge 6 commits into zig-gamedev:main from dmurph:perf-improvements

dmurph (Contributor) commented Jul 15, 2024 (edited)

These changes fix issue zig-gamedev/zmath#5 by changing swizzles to the builtin `@shuffle`, which generates smaller code.

The Zig compiler may eventually be able to fully optimize a swizzle call, but that isn't the case right now.

Other changes:

- `dot2` and `dot4` have been simplified
- `all` and `any` now use `@reduce` where appropriate (which should offer SIMD speed improvements) as a comptime decision, and now actually support float types by falling back to looping
  - (added tests for float support)

Perf results from M1 mac:

                   matrix mul benchmark (AOS) - scalar version: 1.0043s, zmath version: 0.9783s
          cross3, scale, bias benchmark (AOS) - scalar version: 0.6268s, zmath version: 0.6478s
    cross3, dot3, scale, bias benchmark (AOS) - scalar version: 0.9808s, zmath version: 0.9543s
               quaternion mul benchmark (AOS) - scalar version: 0.9863s, zmath version: 0.7783s
                         wave benchmark (SOA) - scalar version: 3.4083s, zmath version: 1.0393s

(Notice how the cross3, dot3, scale, bias benchmark is now faster with zmath.) Other benchmarks seem faster too, but it's hard to know for sure.

I attempted to make a 'more efficient' swizzle that used i32s instead of the enum, but somehow that still generated pushes for the arguments, sadly. So I'm just using `@shuffle` directly, which should be more future-proof anyway.
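For readers unfamiliar with the two styles, here is a minimal sketch of the difference, assuming Zig 0.13. The names `Component`, `swizzle`, and `swapPairs` are hypothetical stand-ins for illustration and are not copied from zmath; only the `@shuffle` and `@reduce` builtins are real.

```zig
const std = @import("std");

const F32x4 = @Vector(4, f32);

// Hypothetical component enum; zmath's real declaration may differ.
const Component = enum(i32) { x, y, z, w };

// Wrapper style: lane choices arrive as comptime enum arguments and are
// turned into a shuffle mask inside the function.
inline fn swizzle(
    v: F32x4,
    comptime a: Component,
    comptime b: Component,
    comptime c: Component,
    comptime d: Component,
) F32x4 {
    const mask = @Vector(4, i32){
        @intFromEnum(a), @intFromEnum(b), @intFromEnum(c), @intFromEnum(d),
    };
    // The second operand is unused, so it is passed as `undefined`.
    return @shuffle(f32, v, undefined, mask);
}

// Direct style: the call site hands a comptime mask straight to @shuffle.
fn swapPairs(v: F32x4) F32x4 {
    return @shuffle(f32, v, undefined, @Vector(4, i32){ 1, 0, 3, 2 });
}

pub fn main() void {
    const v = F32x4{ 1.0, 2.0, 3.0, 4.0 };
    const via_wrapper = swizzle(v, .y, .x, .w, .z);
    const direct = swapPairs(v);
    // Both forms select lanes y, x, w, z.
    std.debug.assert(@reduce(.And, via_wrapper == direct));
    std.debug.print("{d} {d} {d} {d}\n", .{ direct[0], direct[1], direct[2], direct[3] });
}
```

Both functions describe the same lane permutation; whether the compiler emits the same code for them depends on the build mode, which is what the rest of this thread explores.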

dmurph added 2 commits

July 14, 2024 16:55

Replace swizzles with shuffles, remove unnecessary math complexity

c40e647

Upgrade all swizzles

c2c705a

dmurph commented Jul 15, 2024

libs/zmath/src/zmath.zig (outdated, resolved)

michal-z closed this

Jul 15, 2024

michal-z reopened this

Jul 15, 2024

dmurph added 3 commits

July 15, 2024 07:45

using std.simd.iota

351559a

updated benchmark data

0bdda7d

whoops

51db2cc

dmurph commented Jul 15, 2024

libs/zmath/src/benchmark.zig

		@@ -22,13 +22,13 @@
		 // wave benchmark (SOA) - scalar version: 3.6598s, zmath version: 0.4231s
		 //
		 // -------------------------------------------------------------------------------------------------
		-// 'Apple M1Max', macOS Version 12.4, Zig 0.10.0-dev.2657+74442f350, ReleaseFast
		+// 'Apple M1Pro', macOS Version 12.5, Zig 0.13.0, ReleaseFast

Happy to revert this if you like, or you can retry on an M1 Max and do a follow-up patch.

hazeycode (Member) commented Jul 15, 2024 (edited)

M3 Pro, macOS 14.5 results sample:

`-Doptimize=Debug`

| Benchmark | Before (s) | After (s) |
| --- | --- | --- |
| matrix mul benchmark (AOS) | 32.6926 | 26.2363 |
| cross3, scale, bias benchmark (AOS) | 12.9573 | 10.0015 |
| cross3, dot3, scale, bias benchmark (AOS) | 14.2043 | 11.5713 |
| quaternion mul benchmark (AOS) | 20.4624 | 14.0007 |
| wave benchmark (SOA) | 6.5604 | 6.9920 |

`-Doptimize=ReleaseFast`

| Benchmark | Before (s) | After (s) |
| --- | --- | --- |
| matrix mul benchmark (AOS) | 0.7679 | 0.7680 |
| cross3, scale, bias benchmark (AOS) | 0.4885 | 0.5564 |
| cross3, dot3, scale, bias benchmark (AOS) | 0.8333 | 0.8264 |
| quaternion mul benchmark (AOS) | 0.6551 | 0.6535 |
| wave benchmark (SOA) | 0.7284 | 0.7302 |

dmurph (Contributor, Author) commented Jul 15, 2024 (edited)

I tried to create a minimal godbolt to show the generated output:
https://godbolt.org/z/8oP6v7jvx
(you'll have to search for the methods in the generated output)

What's interesting is that the compiler is now able to fully optimize the swizzle in the `main` function. In the old example, which uses the `dot4Old` function in this godbolt, swapping `swizzle` for `@shuffle` generated fewer instructions.

Anyway, `dot4`, `dot2`, `any`, and `all` are clearly much better instruction-wise. I can remove the `@shuffle` changes if you want.

dot4 old:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 160
        vmovaps xmmword ptr [rbp - 128], xmm0
        vmovaps xmmword ptr [rbp - 112], xmm1
        vmulps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vmovaps xmmword ptr [rbp - 64], xmm0
        mov     byte ptr [rbp - 36], 1
        mov     byte ptr [rbp - 35], 0
        mov     byte ptr [rbp - 34], 3
        mov     byte ptr [rbp - 33], 2
        vpermilps       xmm0, xmm0, 177
        vmovaps xmmword ptr [rbp - 144], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 144]
        vmovaps xmmword ptr [rbp - 80], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vmovaps xmm1, xmmword ptr [rbp - 80]
        vaddps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 80], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 80]
        vmovaps xmmword ptr [rbp - 32], xmm0
        mov     byte ptr [rbp - 4], 3
        mov     byte ptr [rbp - 3], 2
        mov     byte ptr [rbp - 2], 1
        mov     byte ptr [rbp - 1], 0
        vpermilps       xmm0, xmm0, 27
        vmovaps xmmword ptr [rbp - 160], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 160]
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vaddps  xmm0, xmm0, xmmword ptr [rbp - 80]
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        add     rsp, 160
        pop     rbp
        ret

new dot4:

    example.dot4:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 48
        vmovaps xmmword ptr [rbp - 48], xmm0
        vmovaps xmmword ptr [rbp - 32], xmm1
        vmulps  xmm1, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 16], xmm1
        vmovaps xmm0, xmm1
        vmovshdup       xmm2, xmm1
        vaddss  xmm0, xmm0, xmm2
        vpermilpd       xmm2, xmm1, 1
        vaddss  xmm0, xmm0, xmm2
        vpermilps       xmm1, xmm1, 255
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0
        add     rsp, 48
        pop     rbp
        ret

Sorry, more info: I changed `swizzle` to `@shuffle` in that `dot4Old` example (changed code here), and it now generates fewer instructions there:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 64
        vmovaps xmmword ptr [rbp - 64], xmm0
        vmovaps xmmword ptr [rbp - 48], xmm1
        vmulps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vpermilps       xmm0, xmm0, 177
        vmovaps xmmword ptr [rbp - 16], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vmovaps xmm1, xmmword ptr [rbp - 16]
        vaddps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 16], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 16]
        vpermilps       xmm0, xmm0, 27
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vaddps  xmm0, xmm0, xmmword ptr [rbp - 16]
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        add     rsp, 64
        pop     rbp
        ret

I have no idea why. You can see it no longer does the argument moves. So I propose we make the `@shuffle` changes here too.

hazeycode (Member) commented Jul 15, 2024

@shuffle looks good to me!

Overall the improvements are very obvious in Debug mode benchmarks.

dmurph (Contributor, Author) commented Jul 15, 2024

Cool - I don't have write permission so someone else is free to squash+commit

hazeycode requested a review from michal-z

July 16, 2024 21:06

michal-z (Collaborator) commented Jul 17, 2024 (edited)

@dmurph You must use `-O ReleaseFast` in godbolt to enable an optimized build (`-DReleaseFast` doesn't work and produces debug code).

I've written the `dot()` functions very carefully to ensure that they do not touch the stack (in optimized code). In general, indexing SIMD registers (`xmm1[2]`) can cause them to spill to the stack, so my versions of `dot()` do not use indexing.
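As an illustration of the shuffle-only approach described here, the following is a minimal sketch assuming Zig 0.13. `dot4Shuffle` is a hypothetical name, and the body is reconstructed from the two permute immediates (177 and 27) visible in the `dot4 old` listing earlier in the thread, so it may not match zmath's source line for line.

```zig
const std = @import("std");

const F32x4 = @Vector(4, f32);

// Horizontal sum using only whole-vector shuffles and adds, with no lane
// indexing. Immediate 177 selects lanes (1, 0, 3, 2); 27 selects (3, 2, 1, 0).
fn dot4Shuffle(a: F32x4, b: F32x4) F32x4 {
    const prod = a * b; // | x0*x1 | y0*y1 | z0*z1 | w0*w1 |
    const swapped = @shuffle(f32, prod, undefined, @Vector(4, i32){ 1, 0, 3, 2 });
    const pairs = prod + swapped; // | x+y | x+y | z+w | z+w |
    const reversed = @shuffle(f32, pairs, undefined, @Vector(4, i32){ 3, 2, 1, 0 });
    return pairs + reversed; // every lane holds x+y+z+w
}

pub fn main() void {
    const a = F32x4{ 1, 2, 3, 4 };
    const b = F32x4{ 5, 6, 7, 8 };
    const r = dot4Shuffle(a, b);
    // 1*5 + 2*6 + 3*7 + 4*8 is replicated into all four lanes.
    std.debug.print("{d}\n", .{r[0]});
}
```

The result stays in vector form throughout, so nothing in the source asks for a single lane to be extracted.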

dmurph (Contributor, Author) commented Jul 17, 2024

Very cool! thanks.

Ok, after that, here are the new things for dots:

dot4Old:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vshufps xmm1, xmm0, xmm0, 177
        vaddps  xmm0, xmm0, xmm1
        vshufps xmm1, xmm0, xmm0, 27
        vaddps  xmm0, xmm0, xmm1

dot4:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm1, xmm0, xmm1
        vshufpd xmm2, xmm0, xmm0, 1
        vaddss  xmm1, xmm2, xmm1
        vshufps xmm0, xmm0, xmm0, 255
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

dot2Old:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

dot2:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

The shuffle and swizzle versions also fully optimize to the same thing.

I wrote incorrect code for `all` and `any` - I'm going to fix that up and see what the difference is.

Here is my current compiler explorer: https://godbolt.org/z/7E9YW8oqv

michal-z (Collaborator) commented Jul 17, 2024

Also, by default the Zig compiler compiles code for your native CPU. In this case it compiles for the AVX2 instruction set. It is also a good idea to look at the code compiled for a regular x86_64 CPU that has only the SSE2 instruction set. You can use the `-mcpu x86_64` option to force this.

dmurph (Contributor, Author) commented Jul 18, 2024

Cool - after fixing the `all` function, I can say that it certainly reduced the instruction count:

before, len 3 of size 4 vector - 10

        cmp     dword ptr [rsp + 64], 0
        setne   cl
        cmp     dword ptr [rsp + 68], 0
        setne   dl
        and     dl, cl
        cmp     dword ptr [rsp + 72], 0
        setne   cl
        and     cl, dl
        mov     byte ptr [rsp + 11], cl
        lea     rcx, [rsp + 11]

after, worst case - 7

        vpbroadcastq    xmm0, qword ptr [rsp + 72]
        vpand   xmm0, xmm0, xmmword ptr [rsp + 64]
        vpsrlq  xmm1, xmm0, 32
        vpand   xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        test    eax, eax
        setne   byte ptr [rsp + 11]

(When len=3) - 4

        mov     eax, dword ptr [rsp + 64]
        and     eax, dword ptr [rsp + 68]
        test    dword ptr [rsp + 72], eax
        setne   byte ptr [rsp + 11]

setting the cpu to x86_64 - 8

        movdqa  xmm0, xmmword ptr [rsp + 64]
        pshufd  xmm1, xmm0, 238
        movd    eax, xmm1
        pshufd  xmm0, xmm0, 85
        movd    edx, xmm0
        and     edx, dword ptr [rsp + 64]
        test    edx, eax
        setne   byte ptr [rsp + 11]

https://godbolt.org/z/d9W4hMco9

I'm going to remove the dot changes and keep the 'any' and 'all' changes. Let me know if you want me to keep the shuffle changes, as they seem to affect debug builds but not release.
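To make the `all`/`any` idea concrete, here is a minimal sketch of reducing the first `len` lanes of a bool vector with `@reduce`, assuming Zig 0.13. `allFirst` and `anyFirst` are hypothetical names with simplified signatures; this is not the code from the PR, and it leaves out the int and float handling discussed above.

```zig
const std = @import("std");

const Boolx4 = @Vector(4, bool);

// Returns true if every one of the first `len` lanes is true.
fn allFirst(vb: Boolx4, comptime len: u32) bool {
    if (len == 0 or len > 4) @compileError("len must be in 1..4");
    const lanes: [4]bool = vb; // a vector coerces to an array of equal length
    // A comptime-bounded slice yields a pointer to a [len]bool array,
    // which coerces back to a shorter vector.
    const head: @Vector(len, bool) = lanes[0..len].*;
    return @reduce(.And, head);
}

// Returns true if at least one of the first `len` lanes is true.
fn anyFirst(vb: Boolx4, comptime len: u32) bool {
    if (len == 0 or len > 4) @compileError("len must be in 1..4");
    const lanes: [4]bool = vb;
    const head: @Vector(len, bool) = lanes[0..len].*;
    return @reduce(.Or, head);
}

pub fn main() void {
    const v = Boolx4{ true, true, true, false };
    // The fourth lane only matters when len is 4.
    std.debug.print("{} {}\n", .{ allFirst(v, 3), allFirst(v, 4) });
}
```

Because `len` is comptime-known, the choice of how many lanes to reduce is made at compile time rather than in a runtime loop.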

revert and fix

c18ed5b

dmurph (Contributor, Author) commented Jul 18, 2024

I guess let me know if you want any of this - happy to just close this request as my original changes weren't actually more performant lol.

hazeycode (Member) commented Jul 20, 2024

Debug perf is important and we should consider changes to improve it carefully.

dmurph (Contributor, Author) commented Jul 22, 2024

I've gotten a bit busy - some thoughts:

I plan on splitting this up to discuss separately:
- `all`/`any` change (not sure if that's wanted): adds 'fast' support for int & bool calls. Float is identical. This 'new' support is probably not really important or needed. Happy to abandon that.
- swizzle -> shuffle change in zmath.zig to help debug builds

Since this won't be for a bit - feel free to take over this patch if you like if you feel inspired or have free time. Otherwise I'll likely revisit in a week or so.