[zmath] Replace swizzles with shuffles & remove some unnecessary math complexity to increase perf. #637

Draft
dmurph wants to merge 6 commits into zig-gamedev:main from dmurph:perf-improvements

dmurph (Contributor) commented Jul 15, 2024 (edited)

These changes fix issue zig-gamedev/zmath#5 by changing swizzles to the builtin @shuffle, which generates smaller code.

The Zig compiler may eventually be able to fully optimize a swizzle call, but that isn't the case right now.

Other changes:

  • dot2 and dot4 have been simplified
  • all and any now use @reduce where appropriate, chosen at comptime (which should offer SIMD speed improvements), and now actually support float types by falling back to looping.
    • (added tests for float support)
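
As a rough illustration of the all/any change described above, here is a minimal sketch of collapsing a vector into a single bool with the @reduce builtin. The names Boolx4, allTrue, anyTrue and allNonZero are hypothetical and are not zmath's actual API. The float case here compares against zero and then reduces, which is a different approach from the looping fallback mentioned in this PR.

```zig
const std = @import("std");

const Boolx4 = @Vector(4, bool);
const F32x4 = @Vector(4, f32);

/// Hypothetical helper (not zmath's signature): true if every lane is true.
fn allTrue(mask: Boolx4) bool {
    return @reduce(.And, mask);
}

/// Hypothetical helper: true if at least one lane is true.
fn anyTrue(mask: Boolx4) bool {
    return @reduce(.Or, mask);
}

/// Float lanes cannot be AND-reduced directly, so compare against zero
/// first to obtain a bool vector, then reduce that.
fn allNonZero(v: F32x4) bool {
    const zero: F32x4 = @splat(0.0);
    return @reduce(.And, v != zero);
}

test "reduce-based all/any sketch" {
    try std.testing.expect(allTrue(Boolx4{ true, true, true, true }));
    try std.testing.expect(!allTrue(Boolx4{ true, false, true, true }));
    try std.testing.expect(anyTrue(Boolx4{ false, false, true, false }));
    try std.testing.expect(allNonZero(F32x4{ 1.0, -2.0, 3.0, 0.5 }));
}
```

Saved to a file, the sketch can be exercised with zig test; the fixed 4-lane types keep it independent of type reflection APIs that have changed between Zig releases.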

Perf results from M1 mac:

    matrix mul benchmark (AOS) - scalar version: 1.0043s, zmath version: 0.9783s
    cross3, scale, bias benchmark (AOS) - scalar version: 0.6268s, zmath version: 0.6478s
    cross3, dot3, scale, bias benchmark (AOS) - scalar version: 0.9808s, zmath version: 0.9543s
    quaternion mul benchmark (AOS) - scalar version: 0.9863s, zmath version: 0.7783s
    wave benchmark (SOA) - scalar version: 3.4083s, zmath version: 1.0393s

(Notice how the cross3, dot3, scale, bias benchmark is now faster with zmath: 0.9543s vs. 0.9808s for the scalar version.) Other benchmarks seem faster too, but it's hard to know for sure.

I attempted to make a 'more efficient' swizzle that used i32s instead of the enum, but somehow that still generated pushes for the arguments, sadly. So I'm just using @shuffle directly, which should be more future-proof anyway.
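
For readers unfamiliar with the builtin, here is a minimal sketch of expressing one fixed swizzle with @shuffle. swizzleYXWZ is a hypothetical name used only for illustration; whether this produces smaller code than a swizzle call depends on the compiler version and optimization mode.

```zig
const std = @import("std");

const F32x4 = @Vector(4, f32);

/// Hypothetical stand-in for a swizzle(v, .y, .x, .w, .z) call. The lane
/// order is a comptime-known mask, so no lane arguments exist at runtime.
fn swizzleYXWZ(v: F32x4) F32x4 {
    // The second operand is unused because every mask index is non-negative.
    return @shuffle(f32, v, undefined, [4]i32{ 1, 0, 3, 2 });
}

test "shuffle reorders lanes" {
    const out: [4]f32 = swizzleYXWZ(F32x4{ 1.0, 2.0, 3.0, 4.0 });
    try std.testing.expectEqualSlices(f32, &[_]f32{ 2.0, 1.0, 4.0, 3.0 }, &out);
}
```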

@@ -22,13 +22,13 @@
 // wave benchmark (SOA) - scalar version: 3.6598s, zmath version: 0.4231s
 //
 // -------------------------------------------------------------------------------------------------
-// 'Apple M1Max', macOS Version 12.4, Zig 0.10.0-dev.2657+74442f350, ReleaseFast
+// 'Apple M1Pro', macOS Version 12.5, Zig 0.13.0, ReleaseFast

dmurph (Contributor, Author) commented:

Happy to revert this if you like, or you can retry with an M1 Max and do a follow-up patch.

hazeycode (Member) commented Jul 15, 2024 (edited)

M3 Pro, macOS 14.5 results sample:

-Doptimize=Debug

| Benchmark | Before (s) | After (s) |
|---|---|---|
| matrix mul benchmark (AOS) | 32.6926 | 26.2363 |
| cross3, scale, bias benchmark (AOS) | 12.9573 | 10.0015 |
| cross3, dot3, scale, bias benchmark (AOS) | 14.2043 | 11.5713 |
| quaternion mul benchmark (AOS) | 20.4624 | 14.0007 |
| wave benchmark (SOA) | 6.5604 | 6.9920 |

-Doptimize=ReleaseFast

| Benchmark | Before (s) | After (s) |
|---|---|---|
| matrix mul benchmark (AOS) | 0.7679 | 0.7680 |
| cross3, scale, bias benchmark (AOS) | 0.4885 | 0.5564 |
| cross3, dot3, scale, bias benchmark (AOS) | 0.8333 | 0.8264 |
| quaternion mul benchmark (AOS) | 0.6551 | 0.6535 |
| wave benchmark (SOA) | 0.7284 | 0.7302 |

dmurph (Contributor, Author) commented Jul 15, 2024 (edited)

I tried to create a minimal godbolt to show the generated output
https://godbolt.org/z/8oP6v7jvx
(you'll have to search for the methods in the generated output)

What's interesting is that the compiler is now able to fully optimize the swizzle in the main function. For the dot4Old function in this godbolt, swapping swizzle for @shuffle generated fewer instructions.

Anyway, dot4, dot2, any, and all are clearly much better instruction-wise. I can remove the @shuffle changes if you want.

dot4 old:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 160
        vmovaps xmmword ptr [rbp - 128], xmm0
        vmovaps xmmword ptr [rbp - 112], xmm1
        vmulps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vmovaps xmmword ptr [rbp - 64], xmm0
        mov     byte ptr [rbp - 36], 1
        mov     byte ptr [rbp - 35], 0
        mov     byte ptr [rbp - 34], 3
        mov     byte ptr [rbp - 33], 2
        vpermilps       xmm0, xmm0, 177
        vmovaps xmmword ptr [rbp - 144], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 144]
        vmovaps xmmword ptr [rbp - 80], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vmovaps xmm1, xmmword ptr [rbp - 80]
        vaddps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 80], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 80]
        vmovaps xmmword ptr [rbp - 32], xmm0
        mov     byte ptr [rbp - 4], 3
        mov     byte ptr [rbp - 3], 2
        mov     byte ptr [rbp - 2], 1
        mov     byte ptr [rbp - 1], 0
        vpermilps       xmm0, xmm0, 27
        vmovaps xmmword ptr [rbp - 160], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 160]
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vaddps  xmm0, xmm0, xmmword ptr [rbp - 80]
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        add     rsp, 160
        pop     rbp
        ret

new dot4:

example.dot4:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 48
        vmovaps xmmword ptr [rbp - 48], xmm0
        vmovaps xmmword ptr [rbp - 32], xmm1
        vmulps  xmm1, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 16], xmm1
        vmovaps xmm0, xmm1
        vmovshdup       xmm2, xmm1
        vaddss  xmm0, xmm0, xmm2
        vpermilpd       xmm2, xmm1, 1
        vaddss  xmm0, xmm0, xmm2
        vpermilps       xmm1, xmm1, 255
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0
        add     rsp, 48
        pop     rbp
        ret

Sorry, more info: I changed swizzle to @shuffle in that dot4Old example (changed code here), and it now generates fewer instructions:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 64
        vmovaps xmmword ptr [rbp - 64], xmm0
        vmovaps xmmword ptr [rbp - 48], xmm1
        vmulps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vpermilps       xmm0, xmm0, 177
        vmovaps xmmword ptr [rbp - 16], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vmovaps xmm1, xmmword ptr [rbp - 16]
        vaddps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 16], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 16]
        vpermilps       xmm0, xmm0, 27
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vaddps  xmm0, xmm0, xmmword ptr [rbp - 16]
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        add     rsp, 64
        pop     rbp
        ret

I have no idea why. You can see it no longer does the argument moves (the byte stores of the lane indices are gone). So I propose we make the @shuffle changes here too.

hazeycode reacted with thumbs up emoji

hazeycode (Member) commented:

@shuffle looks good to me!

Overall the improvements are very obvious in Debug mode benchmarks.

dmurph reacted with thumbs up emoji

dmurph (Contributor, Author) commented:

Cool - I don't have write permission so someone else is free to squash+commit

hazeycode requested a review from michal-z on July 16, 2024 21:06

michal-z (Collaborator) commented Jul 17, 2024 (edited)

@dmurph You must use -O ReleaseFast in godbolt to enable an optimized build (-DReleaseFast doesn't work and produces debug code).

I've written the dot() functions very carefully to ensure that they do not touch the stack (in optimized code). In general, indexing SIMD registers (xmm1[2]) can cause them to spill to the stack, so my versions of dot() do not use indexing.
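
To illustrate the distinction being drawn here, this is a sketch of two ways to write a 4-lane dot product: one that uses only whole-vector shuffles and adds, and one that indexes individual lanes. Both function names are hypothetical, and the shuffle version is not necessarily identical to zmath's dot4. The two masks correspond to the shuffle immediates 177 (lanes 1, 0, 3, 2) and 27 (lanes 3, 2, 1, 0) that appear in the disassembly in this thread.

```zig
const std = @import("std");

const F32x4 = @Vector(4, f32);

/// Shuffle-based reduction: every step is a whole-vector operation.
fn dot4Shuffle(a: F32x4, b: F32x4) F32x4 {
    const p = a * b; // | x | y | z | w | (per-lane products)
    // | x+y | x+y | z+w | z+w |
    const s = p + @shuffle(f32, p, undefined, [4]i32{ 1, 0, 3, 2 });
    // every lane: x+y+z+w
    return s + @shuffle(f32, s, undefined, [4]i32{ 3, 2, 1, 0 });
}

/// Indexing-based version for contrast: reads individual lanes as scalars,
/// then broadcasts the sum back into a vector.
fn dot4Indexed(a: F32x4, b: F32x4) F32x4 {
    const p = a * b;
    return @splat(p[0] + p[1] + p[2] + p[3]);
}

test "both dot4 sketches agree on a simple input" {
    const a = F32x4{ 1.0, 2.0, 3.0, 4.0 };
    const b = F32x4{ 5.0, 6.0, 7.0, 8.0 };
    const expected = [_]f32{ 70.0, 70.0, 70.0, 70.0 };
    const s: [4]f32 = dot4Shuffle(a, b);
    const i: [4]f32 = dot4Indexed(a, b);
    try std.testing.expectEqualSlices(f32, &expected, &s);
    try std.testing.expectEqualSlices(f32, &expected, &i);
}
```

The two versions add the products in a different order, so results can differ by rounding for inputs that are not exactly representable. Which version avoids stack traffic depends on the target CPU and optimization mode, so the generated code is worth checking in each configuration.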

hazeycode reacted with eyes emoji

dmurph (Contributor, Author) commented:

Very cool! thanks.

OK, with that flag, here is the new output for the dot functions:

dot4Old:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vshufps xmm1, xmm0, xmm0, 177
        vaddps  xmm0, xmm0, xmm1
        vshufps xmm1, xmm0, xmm0, 27
        vaddps  xmm0, xmm0, xmm1

dot4:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm1, xmm0, xmm1
        vshufpd xmm2, xmm0, xmm0, 1
        vaddss  xmm1, xmm2, xmm1
        vshufps xmm0, xmm0, xmm0, 255
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

dot2Old:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

dot2:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

The shuffle and swizzle versions also fully optimize to the same code.

I wrote incorrect code for .all and .any - I'm going to fix that up and see what the difference is.

Here is my current compiler explorer: https://godbolt.org/z/7E9YW8oqv

michal-z (Collaborator) commented:

Also, by default the Zig compiler compiles code for your native CPU. In this case it compiles for the AVX2 instruction set. It is also a good idea to look at the code compiled for a regular x86_64 CPU that has only the SSE2 instruction set. You can use the -mcpu x86_64 option to force this.

hazeycode reacted with thumbs up emoji

dmurph (Contributor, Author) commented:

Cool - after fixing the all function, I can say that it certainly reduced the instruction count:

Before, len 3 of a size-4 vector: 10 instructions

        cmp     dword ptr [rsp + 64], 0
        setne   cl
        cmp     dword ptr [rsp + 68], 0
        setne   dl
        and     dl, cl
        cmp     dword ptr [rsp + 72], 0
        setne   cl
        and     cl, dl
        mov     byte ptr [rsp + 11], cl
        lea     rcx, [rsp + 11]

After, worst case: 7 instructions

        vpbroadcastq    xmm0, qword ptr [rsp + 72]
        vpand   xmm0, xmm0, xmmword ptr [rsp + 64]
        vpsrlq  xmm1, xmm0, 32
        vpand   xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        test    eax, eax
        setne   byte ptr [rsp + 11]

After, when len=3: 4 instructions

        mov     eax, dword ptr [rsp + 64]
        and     eax, dword ptr [rsp + 68]
        test    dword ptr [rsp + 72], eax
        setne   byte ptr [rsp + 11]

After, setting the CPU to x86_64: 8 instructions

        movdqa  xmm0, xmmword ptr [rsp + 64]
        pshufd  xmm1, xmm0, 238
        movd    eax, xmm1
        pshufd  xmm0, xmm0, 85
        movd    edx, xmm0
        and     edx, dword ptr [rsp + 64]
        test    edx, eax
        setne   byte ptr [rsp + 11]

https://godbolt.org/z/d9W4hMco9

I'm going to remove the dot changes and keep the 'any' and 'all' changes. Let me know if you want me to keep the shuffle changes, as they seem to affect debug builds but not release.

dmurph (Contributor, Author) commented:

I guess let me know if you want any of this - happy to just close this request as my original changes weren't actually more performant lol.

hazeycode (Member) commented:

Debug perf is important and we should consider changes to improve it carefully.

dmurph (Contributor, Author) commented:

I've gotten a bit busy - some thoughts:

  • I'll plan on splitting this up to discuss separately
    • all/any change (not sure if that's wanted): adds 'fast' support for int & bool calls. Float is identical. This 'new' support is probably not really important or needed, so I'm happy to abandon it.
  • swizzle -> shuffle change in zmath.zig to help debug builds

Since I won't get to this for a bit, feel free to take over this patch if you feel inspired or have free time. Otherwise I'll likely revisit it in a week or so.

hazeycode reacted with thumbs up emoji

Reviewers: michal-z (awaiting requested review)


3 participants: @dmurph, @hazeycode, @michal-z
