Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Don't send commas to stage 2, avoid clmul in most cases#2049

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Draft
jkeiser wants to merge28 commits intomaster
base:master
Choose a base branch
Loading
fromjkeiser/comma

Conversation

@jkeiser
Copy link
Member

@jkeiserjkeiser commentedAug 15, 2023
edited
Loading

The algorithm detects all missing/extra separator errors in stage 1, and then doesn't send commas.

lemire and lin72h reacted with rocket emoji
borrow_out = result >= value1;
return result;
#else
return__builtin_subcll(value1, value2, borrow, &borrow);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

At a glance, it looks like __builtin_subcll is LLVM specific?

It might be worth guarding its usage:
https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fbuiltin.html

It might worth examining alternatives:

https://godbolt.org/z/1WT9nPv6M

#include<cstdint>usingborrow_t =unsignedlonglong;uint64_tsubtract_borrow(constuint64_t value1,constuint64_t value2,borrow_t& borrow)noexcept {return__builtin_subcll(value1, value2, borrow, &borrow);}uint64_tsubtract_borrow_manual(constuint64_t value1,constuint64_t value2,borrow_t& borrow)noexcept {uint64_t result = value1 - value2 - borrow;  borrow = result >= value1;return result;}#if defined(_M_X64) || defined(__amd64__)#include<x86intrin.h>// visual studio has _subborrow_u64 in <intrin.h>// https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170//uint64_tsubtract_borrow_intel(constuint64_t value1,constuint64_t value2,uint8_t& borrow) {uint64_t result;    borrow =_subborrow_u64(borrow, value2, value1, (unsignedlonglong *)&result);return result;}#endif

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It's entirely possible; I did construct these to be analogues of theadd_overflow() implementation for the given architecture (i.e. pulling from the same libraries and using the same #ifdefs)

@jkeiser
Copy link
MemberAuthor

jkeiser commentedAug 30, 2023
edited
Loading

@lemire I made some more variantsin this Godbolt. Just based on manual inspection of the assembly, if I had to choose asubtract_borrow implementation, I would choose the clang one, because:

  • The manual version is a tiny bit longer (though neither seems particularly bad).
  • __builtin_subcll_overflow(a, b + borrow) produces significantly shorter code than__builtin_subcll(a, b, borrow). This is very strange.
  • Storing the overflow in abool is universally shorter, in part because they are guaranteed to be 0 and 1 and therefore can be set to a flag; and in part because flags can be quickly moved to bools but not to 64-bit values.
  • Other than that, the builtins are slightly shorter than manual, but really not by very much.
uint64_tsubtract_borrow_using_overflow_bool(constuint64_t value1,constuint64_t value2,bool& borrow) {unsignedlonglong result;  borrow =__builtin_usubll_overflow(value1, value2 + borrow, &result);return result;}uint64_tsubtract_borrow_intel_bool(constuint64_t value1,constuint64_t value2,bool& borrow) {unsignedlonglong result;    borrow =_subborrow_u64(borrow, value1, value2, (unsignedlonglong *)&result);return result;}uint64_tsubtract_borrow_manual_bool(constuint64_t value1,constuint64_t value2,bool& borrow)noexcept {uint64_t result = value1 - value2 - borrow;  borrow = result >= value1;return result;}
subtract_borrow_using_overflow_bool(unsigned long, unsigned long, bool&):        # result = value1- value2- overflowmovrax,rdi # value1movzxecx, byte ptr[rdx] # overflowaddrcx,rsi # overflow+ value2subrax,rcx # value1- (overflow+ value2)        # overflow = result >= value1        setb    byte ptr[rdx]subtract_borrow_intel_bool(unsigned long, unsigned long, bool&):        # result = value1- value2- overflowmovrax,rdi # value1movzxecx, byte ptr[rdx] # overflowaddcl,-1 #cl = overflow-1 ???sbbrax,rsi # value1- value2- overflow        # overflow = result >= value1        setb    byte ptr[rdx]subtract_borrow_manual_bool(unsigned long, unsigned long, bool&):        # result = value1- (value2+ overflow)movzxecx, byte ptr[rdx] # overflowaddrcx,rsi # overflow+ value2subrax,rcx # value1- (value2+ overflow)        # overflow = (result >= value1)cmprax,rdi        setae   byte ptr[rdx]

@jkeiser
Copy link
MemberAuthor

Added some comments in the assembly for easier following.

Bottom line on Ice Lake, once you usebool overflow:

  1. value1 - value2 - borrow; overflow = result >= value1 is the best. It produces the same number of instructions as the others, but two of them are purely for computing overflow--only 3 instructions (with lower latency) are required to compute the result.
  2. __builtin_usubll_overflow(value1, value2 + borrow) is the second best, with the same number of instructions but a longer chain to calculate the result.
  3. _subborrow_u64(borrow, value1, value2) is the worst, acting similarly to__builtin_usubll_overflow but usingSBB, which runs on only 2 ports instead ofSUB's 4.

Of course,running these in a performance test is the only way to know for sure, since the processor does some minor JIT-ish activities :)

lin72h reacted with eyes emoji

@jkeiser
Copy link
MemberAuthor

On ARM, it looks like__builtin_usubll_overflow(value1, value2 + borrow) is the winner, with__builtin_subcll once again producing more instructions, and the manual version producing alot more instructions.

@jkeiser
Copy link
MemberAuthor

And on GCC 13.2 Intel, everything pretty much looks the same as each other._subborrow_u64(0, value1, value2 + borrow) wins by not producing an extra AND._subborrow_u64(borrow, value1, value2) loses because ofSBB, as with clang.

subtract_borrow_using_overflow_bool(unsigned long, unsigned long, bool&):        movzx   ecx, BYTE PTR [rdx]        mov     rax, rdi        add     rcx, rsi        sub     rax, rcx        setb    BYTE PTR [rdx]        and     BYTE PTR [rdx], 1subtract_borrow_manual_bool(unsigned long, unsigned long, bool&):        movzx   ecx, BYTE PTR [rdx]        mov     rax, rdi        sub     rax, rsi        sub     rax, rcx        cmp     rax, rdi        setnb   BYTE PTR [rdx]subtract_borrow_intel_bool(unsigned long, unsigned long, bool&):        movzx   ecx, BYTE PTR [rdx]        mov     rax, rdi        add     cl, -1        sbb     rax, rsi        setc    BYTE PTR [rdx]subtract_borrow_using_overflow_intel_bool(unsigned long, unsigned long, bool&):        movzx   ecx, BYTE PTR [rdx]        mov     rax, rdi        add     rcx, rsi        sub     rax, rcx        setb    BYTE PTR [rdx]

@jkeiser
Copy link
MemberAuthor

Changingsubtract_borrow to the manual version on Ice Lake brings stage 1 down from 93.2353 -> 93.0503 instructions/block and 28.4788 -> 27.7167 cycles/block, a nearly 3% speed improvement. I'll see if it can rescue the speculative string parser, as well.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@lemirelemirelemire left review comments

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

2 participants

@jkeiser@lemire

[8]ページ先頭

©2009-2025 Movatter.jp