Don't send commas to stage 2, avoid clmul in most cases #2049

Draft

jkeiser wants to merge 28 commits into master from jkeiser/comma

jkeiser commented Aug 15, 2023 (edited)

The algorithm detects all missing/extra separator errors in stage 1, and then doesn't send commas.

jkeiser force-pushed the jkeiser/comma branch from 3f1b0c3 to 407f6f8

August 29, 2023 22:22

jkeiser added 17 commits

August 29, 2023 18:25

Make simd constexpr

78b4c0a

Define single simd_t for easier copy/paste

a88ad51

Add eq_any()

075bfb1

Use new eq_any for classification

a8635c9

Add byte_classifier abstraction to make lookup tables readable.

c6e43a6

Initial speculative parsing scanner

54cbebf

Move bitmask methods into namespace, add subtract_borrow_out

f594a49

Add no_bits_set to simd8x64

70674d2

Fix scanner to use actual bitmask/simd methods

32afd34

More betterer simd

4b11483

Better bitmask subtraction

589ef23

Update latencies

9dc70e8

Update numbers more

5d762fb

Send commas

52b2414

Make it compile

76dd137

Fix a few bugs

a7b7bc2

Restore UTF-8 algorithm to before lookup table

b524e29

jkeiser force-pushed the jkeiser/comma branch from 407f6f8 to ab8a27e

August 29, 2023 22:25

lemire reviewed Aug 30, 2023

include/simdjson/ppc64/bitmask.h Outdated

		borrow_out = result >= value1;
		return result;
		#else
		return __builtin_subcll(value1, value2, borrow, &borrow);

lemire commented Aug 30, 2023

At a glance, it looks like __builtin_subcll is LLVM specific?

It might be worth guarding its usage:
https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fbuiltin.html

It might be worth examining alternatives:

https://godbolt.org/z/1WT9nPv6M

    #include <cstdint>

    using borrow_t = unsigned long long;

    uint64_t subtract_borrow(const uint64_t value1, const uint64_t value2, borrow_t& borrow) noexcept {
      return __builtin_subcll(value1, value2, borrow, &borrow);
    }

    uint64_t subtract_borrow_manual(const uint64_t value1, const uint64_t value2, borrow_t& borrow) noexcept {
      uint64_t result = value1 - value2 - borrow;
      borrow = result >= value1;
      return result;
    }

    #if defined(_M_X64) || defined(__amd64__)
    #include <x86intrin.h>
    // visual studio has _subborrow_u64 in <intrin.h>
    // https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170
    //
    uint64_t subtract_borrow_intel(const uint64_t value1, const uint64_t value2, uint8_t& borrow) {
      uint64_t result;
      borrow = _subborrow_u64(borrow, value2, value1, (unsigned long long *)&result);
      return result;
    }
    #endif
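The `__has_builtin` guard suggested in this comment might look roughly like the sketch below. The function name `subtract_borrow_guarded` and the fallback branch are illustrative assumptions, not code from the PR; the fallback uses a general-purpose borrow test rather than the comparison in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Compilers without the __has_builtin operator get a conservative "no".
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Hypothetical helper (not the PR's code): returns value1 - value2 - borrow
// and writes the outgoing borrow (0 or 1) back into `borrow`.
inline uint64_t subtract_borrow_guarded(uint64_t value1, uint64_t value2,
                                        unsigned long long& borrow) noexcept {
#if __has_builtin(__builtin_subcll)
  unsigned long long borrow_out;
  const unsigned long long result =
      __builtin_subcll(value1, value2, borrow, &borrow_out);
  borrow = borrow_out;
  return result;
#else
  // Portable fallback: a borrow occurs if either partial subtraction wraps.
  const uint64_t partial = value1 - value2;
  const uint64_t result = partial - borrow;
  borrow = (value1 < value2) | (partial < borrow);
  return result;
#endif
}

// Small usage check: 0 - 1 wraps around and reports a borrow.
inline void subtract_borrow_guarded_demo() {
  unsigned long long borrow = 0;
  const uint64_t r = subtract_borrow_guarded(0, 1, borrow);
  assert(r == UINT64_MAX && borrow == 1);
}
```

Either branch should produce the same result and borrow for every input, so the guard only affects which instructions the compiler is likely to emit.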

jkeiser commented Aug 30, 2023

It's entirely possible; I did construct these to be analogues of the add_overflow() implementation for the given architecture (i.e. pulling from the same libraries and using the same #ifdefs).

jkeiser commented Aug 30, 2023 (edited)

@lemire I made some more variants in this Godbolt. Just based on manual inspection of the assembly, if I had to choose a subtract_borrow implementation, I would choose the clang one, because:

The manual version is a tiny bit longer (though neither seems particularly bad).
__builtin_usubll_overflow(a, b + borrow) produces significantly shorter code than __builtin_subcll(a, b, borrow). This is very strange.
Storing the overflow in a bool is universally shorter, in part because a bool is guaranteed to be 0 or 1 and can therefore be set directly from a flag, and in part because flags can be quickly moved to bools but not to 64-bit values.
Other than that, the builtins are slightly shorter than manual, but really not by very much.

    uint64_t subtract_borrow_using_overflow_bool(const uint64_t value1, const uint64_t value2, bool& borrow) {
      unsigned long long result;
      borrow = __builtin_usubll_overflow(value1, value2 + borrow, &result);
      return result;
    }

    uint64_t subtract_borrow_intel_bool(const uint64_t value1, const uint64_t value2, bool& borrow) {
      unsigned long long result;
      borrow = _subborrow_u64(borrow, value1, value2, (unsigned long long *)&result);
      return result;
    }

    uint64_t subtract_borrow_manual_bool(const uint64_t value1, const uint64_t value2, bool& borrow) noexcept {
      uint64_t result = value1 - value2 - borrow;
      borrow = result >= value1;
      return result;
    }

    subtract_borrow_using_overflow_bool(unsigned long, unsigned long, bool&):
            # result = value1 - value2 - overflow
            mov     rax, rdi                 # value1
            movzx   ecx, byte ptr [rdx]      # overflow
            add     rcx, rsi                 # overflow + value2
            sub     rax, rcx                 # value1 - (overflow + value2)
            # overflow = result >= value1
            setb    byte ptr [rdx]
    subtract_borrow_intel_bool(unsigned long, unsigned long, bool&):
            # result = value1 - value2 - overflow
            mov     rax, rdi                 # value1
            movzx   ecx, byte ptr [rdx]      # overflow
            add     cl, -1                   # cl = overflow - 1 ???
            sbb     rax, rsi                 # value1 - value2 - overflow
            # overflow = result >= value1
            setb    byte ptr [rdx]
    subtract_borrow_manual_bool(unsigned long, unsigned long, bool&):
            # result = value1 - (value2 + overflow)
            movzx   ecx, byte ptr [rdx]      # overflow
            add     rcx, rsi                 # overflow + value2
            sub     rax, rcx                 # value1 - (value2 + overflow)
            # overflow = (result >= value1)
            cmp     rax, rdi
            setae   byte ptr [rdx]
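To illustrate why a `bool` borrow is convenient, here is a minimal sketch of chaining one borrow through several 64-bit words. The names `subtract_borrow_sketch` and `subtract_words` are hypothetical and not part of simdjson. The sketch deliberately uses a general-purpose borrow test instead of the `result >= value1` comparison from the variants above, because that comparison also reports a borrow when both `value2` and the incoming borrow are zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns value1 - value2 - borrow and writes the
// outgoing borrow back into `borrow`.
inline uint64_t subtract_borrow_sketch(uint64_t value1, uint64_t value2,
                                       bool& borrow) noexcept {
  const uint64_t partial = value1 - value2;
  const uint64_t result = partial - borrow;
  // A borrow occurs if either of the two partial subtractions wrapped.
  borrow = (value1 < value2) || (partial < static_cast<uint64_t>(borrow));
  return result;
}

// Subtracts b from a word by word (least significant word first), carrying
// the borrow from each word into the next one.
template <std::size_t N>
std::array<uint64_t, N> subtract_words(const std::array<uint64_t, N>& a,
                                       const std::array<uint64_t, N>& b,
                                       bool& borrow) noexcept {
  std::array<uint64_t, N> out{};
  for (std::size_t i = 0; i < N; i++) {
    out[i] = subtract_borrow_sketch(a[i], b[i], borrow);
  }
  return out;
}

// Usage: 2^64 - 1 across two little-endian words.
inline void subtract_words_demo() {
  bool borrow = false;
  const std::array<uint64_t, 2> a{0, 1};  // 2^64
  const std::array<uint64_t, 2> b{1, 0};  // 1
  const auto d = subtract_words(a, b, borrow);
  assert(d[0] == UINT64_MAX && d[1] == 0 && !borrow);
}
```

Because the borrow is a single `bool`, the value left over after the last word can be passed straight into the next call, which is the pattern a block-by-block scanner would need.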

jkeiser commented Aug 30, 2023

Added some comments in the assembly for easier following.

Bottom line on Ice Lake, once you use bool overflow:

value1 - value2 - borrow; overflow = result >= value1 is the best. It produces the same number of instructions as the others, but two of them are purely for computing overflow--only 3 instructions (with lower latency) are required to compute the result.
__builtin_usubll_overflow(value1, value2 + borrow) is the second best, with the same number of instructions but a longer chain to calculate the result.
_subborrow_u64(borrow, value1, value2) is the worst, acting similarly to __builtin_usubll_overflow but using SBB, which runs on only 2 ports instead of SUB's 4.

Of course, running these in a performance test is the only way to know for sure, since the processor does some minor JIT-ish activities :)

jkeiser commented Aug 30, 2023

On ARM, it looks like __builtin_usubll_overflow(value1, value2 + borrow) is the winner, with __builtin_subcll once again producing more instructions, and the manual version producing a lot more instructions.

jkeiser commented Aug 30, 2023

And on GCC 13.2 Intel, everything pretty much looks the same as each other. _subborrow_u64(0, value1, value2 + borrow) wins by not producing an extra AND. _subborrow_u64(borrow, value1, value2) loses because of SBB, as with clang.

    subtract_borrow_using_overflow_bool(unsigned long, unsigned long, bool&):
            movzx   ecx, BYTE PTR [rdx]
            mov     rax, rdi
            add     rcx, rsi
            sub     rax, rcx
            setb    BYTE PTR [rdx]
            and     BYTE PTR [rdx], 1
    subtract_borrow_manual_bool(unsigned long, unsigned long, bool&):
            movzx   ecx, BYTE PTR [rdx]
            mov     rax, rdi
            sub     rax, rsi
            sub     rax, rcx
            cmp     rax, rdi
            setnb   BYTE PTR [rdx]
    subtract_borrow_intel_bool(unsigned long, unsigned long, bool&):
            movzx   ecx, BYTE PTR [rdx]
            mov     rax, rdi
            add     cl, -1
            sbb     rax, rsi
            setc    BYTE PTR [rdx]
    subtract_borrow_using_overflow_intel_bool(unsigned long, unsigned long, bool&):
            movzx   ecx, BYTE PTR [rdx]
            mov     rax, rdi
            add     rcx, rsi
            sub     rax, rcx
            setb    BYTE PTR [rdx]

jkeiser commented Aug 30, 2023

Changing subtract_borrow to the manual version on Ice Lake brings stage 1 down from 93.2353 -> 93.0503 instructions/block and 28.4788 -> 27.7167 cycles/block, a nearly 3% speed improvement. I'll see if it can rescue the speculative string parser, as well.
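For reference, the relative change behind these figures can be worked out as below. The helper `percent_reduction` is a hypothetical name used only for this illustration; the inputs are the per-block numbers quoted in the comment.

```cpp
#include <cassert>
#include <cmath>

// Relative reduction going from `before` to `after`, in percent.
constexpr double percent_reduction(double before, double after) {
  return (before - after) / before * 100.0;
}

// Figures quoted in the comment (Ice Lake, stage 1, per block).
inline void check_reported_numbers() {
  const double cycles = percent_reduction(28.4788, 27.7167);        // about 2.68
  const double instructions = percent_reduction(93.2353, 93.0503);  // about 0.20
  assert(std::fabs(cycles - 2.68) < 0.01);
  assert(std::fabs(instructions - 0.20) < 0.01);
}
```

The cycle count drops by about 2.7% while the instruction count drops by only about 0.2%, which suggests most of the gain comes from lower latency per instruction rather than from executing fewer instructions.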

jkeiser added 5 commits

August 30, 2023 14:59

Fix subtraction to use fewer instructions

61ef447

Don't send or receive commas

5b7acd2

Move input reading near other input reading

d2d1525

Expose and use real quote, reducing register pressure

319b10a

Centralize classification

eac04c7

jkeiser added 3 commits

August 30, 2023 19:26

Move classifications into a struct

6a05439

Consolidate classification methods

e0f906d

Just use the one struct

3960880

jkeiser force-pushed the jkeiser/comma branch from ab8a27e to 573ff46

August 31, 2023 04:23

jkeiser added 3 commits

August 31, 2023 00:46

Fix ctrl character detection

04ab4bb

Actually use new algorithm for borrows

0e28a83

Revert string parsing to old algorithm

8149284

jkeiser force-pushed the jkeiser/comma branch from 573ff46 to 8149284

August 31, 2023 05:40
