
Allow progressively building bitmask #2236


Draft

jkeiser wants to merge 1 commit into master from jkeiser/simd-sized-bitmask

Conversation

@jkeiser
Member

This PR adds a simd_bitmask64_builder class, which lets you progressively build a 64-bit bitmask from simd8<bool> (the results of == and friends on a SIMD register). The PR uses it in all simd8x64<bool>.to_bitmask() implementations. It is designed to have performance identical to to_bitmask() today, but to allow it to be called after each SIMD register is processed, throwing away whatever intermediate state the platform allows along the way.

For Intel platforms, this just means shifting the bits and or'ing them into a register. For other platforms such as ARM, it's something like a reduce operation, progressively smushing registers together with 16-to-8-bit adds until they end up in a single SIMD-sized register.
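
To make the Intel path concrete, here is a minimal sketch of the shift-and-or accumulation on Haswell (AVX2). The struct name and interface below are illustrative placeholders, not the actual simd_bitmask64_builder API from this PR:

```cpp
#include <cstdint>
#include <immintrin.h>

// Hypothetical sketch only: build a 64-bit bitmask for 64 bytes processed as
// two 256-bit comparison results, one register at a time.
struct bitmask64_builder_sketch {
  uint64_t bits = 0;
  int chunks = 0;  // number of 32-byte chunks appended so far (expected: 0..2)

  // Append the per-byte results of one 256-bit comparison (0x00/0xFF lanes).
  void append(__m256i cmp) {
    uint64_t m = static_cast<uint32_t>(_mm256_movemask_epi8(cmp));
    bits |= m << (32 * chunks);  // shift into place and or into the accumulator
    chunks++;
  }

  uint64_t finish() const { return bits; }
};
```

Calling append() once per 32-byte comparison result and finish() after the second call yields the same 64-bit mask a one-shot to_bitmask() would, without holding all the comparison registers to the end.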

The purpose of this is to move towards a world where we can process one SIMD-sized chunk of bytes at a time, before upscaling to 64-bit masks and then to SIMD-sized masks. We're going to run out of registers if we keep just stashing everything in a simd8x64 or simd8x512 and processing it all at once at the end. This lets us be explicit and try to process everything we reasonably can as we process each SIMD register. The next step on that journey will be similar utilities to build a SIMD-sized bitmask progressively, over 8 steps--the savings from doing reduction become much more apparent there, requiring a maximum of 4 registers per bitmask if you do as few operations as possible with as much parallelism as possible (as we do today in to_bitmask()).

I have another draft built on this which allows the UTF-8 checking algorithm to run one SIMD register at a time as well.

DRAFT: I have not yet tested many of the platforms--just Haswell, which was running at the same performance as today on my machine.

@jkeiser
Member (Author)

The failures appear to be due to my deleting one of the two apparently redundant copies of json_path_to_pointer_conversion, since they were causing compile errors in clangd ... I'm not sure which one was intended, though perhaps the fact that the one I did preserve didn't compile at first should have been a clue :)

@lemire
Member

Woohoo!

@Validark
Contributor

I haven't looked at simdjson internals as much as I should, but I strongly believe the goal should be to use interleaved vectors from LD4 for everything on ARM (two LD4's on 32-bit ARM, if you support that).

In simdutf/simdutf#428, I show that 3 ext's per 16-byte vector can be reduced to 3 ext's per 64-byte chunk. That's 16 total vectors reduced to 7. I thought of this technique for the Accelerated-Zig-Parser, which does a single LD4 per 64-byte chunk and uses that for everything (not all my updates are public yet, though). I also believe that 4 sri's and a shrn are the best movemask routine compared to vpadds. It's only 5 instructions, requires no constants, matches (I believe) the speed of vpadd techniques (although I'd like to hear better ideas), and works on 32-bit ARM (no vpadd's allowed). It also seems to me that for high-performance ARM CPUs you want to run at least 4 16-byte vector ops at once (or pipelined, starting a new one each cycle). If your goal is just to change the order in which instructions are emitted in the hopes of less spilling to the stack (have you looked at the assembly?), then maybe it doesn't hurt, but in general I think moving to checking one 16-byte chunk at a time is a step backwards.

I'm just addressing your stated eventual goal in your paragraph by the way, not your PR per se.

I'd be really interested to hear if you disagree, since I'd like to get better performance on ARM as well!
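
As a point of reference, here is one possible reconstruction of the "4 sri's and a shrn" movemask described above. It is a sketch, not code taken from simdutf#428 or the Accelerated-Zig-Parser, and it assumes v0..v3 are 0x00/0xFF comparison results in LD4 (deinterleaved) lane order:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch: combine four 0x00/0xFF NEON comparison vectors (LD4 lane order,
// i.e. v0 holds bytes 0,4,8,..., v1 holds 1,5,9,..., etc.) into a 64-bit
// bitmask with 4 sri's, 1 shrn, and a move to a general-purpose register.
inline uint64_t movemask64_sri_shrn(uint8x16_t v0, uint8x16_t v1,
                                    uint8x16_t v2, uint8x16_t v3) {
  uint8x16_t t0 = vsriq_n_u8(v1, v0, 1);  // per byte: bit 7 = v1, bits 6..0 = v0
  uint8x16_t t1 = vsriq_n_u8(v3, v2, 1);  // per byte: bit 7 = v3, bits 6..0 = v2
  uint8x16_t t2 = vsriq_n_u8(t1, t0, 2);  // per byte: bits 7..4 = v3 v2 v1 v0
  uint8x16_t t3 = vsriq_n_u8(t2, t2, 4);  // both nibbles now hold v3 v2 v1 v0
  uint8x8_t  n  = vshrn_n_u16(vreinterpretq_u16_u8(t3), 4);  // 8 mask bits per 16-bit lane
  return vget_lane_u64(vreinterpret_u64_u8(n), 0);
}
```

With that lane order, the bits land in original byte order: bit 4i+k of the returned mask corresponds to lane i of vk, i.e. byte 4i+k of the 64-byte chunk.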

@jkeiser force-pushed the jkeiser/simd-sized-bitmask branch 3 times, most recently from f4f4a87 to 7725eb6 on August 18, 2024 02:40
@jkeiser
Member (Author)

That definitely sounds interesting! I haven't quite grokked it yet, but UTF-8 may well work better in 64 bytes on ARM. The main reason I brought ARM up is that its algorithm is already designed to generate 128-bit bitmasks, but right now we're doing a redundant op to stop at 64 bits and another redundant op to finish up and extract the bitmasks. I don't actually have an ARM machine handy to test on, but that's part of what has me interested.

Part of the reason to put this up right now is to do the experiment and see what effect it actually has :) The thing that's really caught my fancy is the idea of doing a decent portion of simdjson's parsing in 128-512-bit masks representing booleans for 128-512 bytes. That obviously doesn't apply to UTF-8, where the algorithm is heavily byte-focused, so there I'm mainly trying to validate whether incrementally building bitmasks can be done without losing performance vs. the current algorithm, so we can tune the pressure+parallelism up and down at will.

@jkeiser force-pushed the jkeiser/simd-sized-bitmask branch from 7725eb6 to 09081f7 on August 18, 2024 03:14
@Validark
Contributor

I haven't quite grokked it yet

I think you or other people might be overthinking that issue? All I'm describing is a reordering.

Normal byte order is:

0 1 2 3 4 5 6 7 8 9 A B C D E F

We want to compare (or do some operation, doesn't matter) each byte to the byte that came before. What bytes came before those (at each position)? Well, these:

-1 0 1 2 3 4 5 6 7 8 9 A B C D E

(Column-wise.) You just subtract one from each byte's index; that's the position you need to compare to. LD4 loads data like so:

0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60

Therefore, to reproduce the normal pairing (0 => -1, 1 => 0, 2 => 1, 3 => 2, 4 => 3), we subtract 1 from each of these to get the position of the bytes to compare against.

-1 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59

That happens to be almost the same as one of the vectors LD4 loads in:

3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63

Hope this explanation helps
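
Restated in intrinsics, the lineup looks roughly like the sketch below (an illustration with a hypothetical chunk parameter, not code from either project); three of the four byte-vs-previous-byte pairings need no shuffle at all:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch: in.val[0] holds bytes 0,4,...,60 of a 64-byte chunk, in.val[1]
// holds 1,5,...,61, in.val[2] holds 2,6,...,62, in.val[3] holds 3,7,...,63.
inline void compare_with_prev_sketch(const uint8_t chunk[64]) {
  uint8x16x4_t in = vld4q_u8(chunk);

  // Bytes 4i+1, 4i+2 and 4i+3 already sit in the same lane as their
  // predecessors (bytes 4i, 4i+1, 4i+2), so no shuffle is needed:
  uint8x16_t eq1 = vceqq_u8(in.val[1], in.val[0]);
  uint8x16_t eq2 = vceqq_u8(in.val[2], in.val[1]);
  uint8x16_t eq3 = vceqq_u8(in.val[3], in.val[2]);

  // Byte 4i is preceded by byte 4(i-1)+3, i.e. in.val[3] shifted down one
  // lane; lane 0 must come from the previous chunk (see the follow-up below).
  (void)eq1; (void)eq2; (void)eq3;  // illustration only
}
```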

@jkeiser force-pushed the jkeiser/simd-sized-bitmask branch 3 times, most recently from d6cd8a2 to 13e33cc on August 18, 2024 18:03
@jkeiser
Member (Author)

@Validark it's a really interesting idea and has some promise! I haven't had the time to really sit down with it, which is the main reason I don't fully grok it. I get how the prev is automatically loaded up for a lot of the masks, but as you noted, it's only almost loaded (the 2 bottom bitmasks are missing some stuff that exists only in the previous chunk). I'm sure you already have a solution, my brain is just a little full at this exact moment :)

@jkeiser force-pushed the jkeiser/simd-sized-bitmask branch from 13e33cc to 94aa162 on August 18, 2024 18:27
@jkeiser changed the base branch from master to jkeiser/simdjson-vscode on August 18, 2024 18:27
@jkeiser force-pushed the jkeiser/simd-sized-bitmask branch from 94aa162 to 2726f45 on August 18, 2024 21:29
@Validark
Contributor

Validark commented Aug 18, 2024 (edited)

@Validark it's a really interesting idea and has some promise! I haven't had the time to really sit down with it, which is the main reason I don't fully grok it. I get how the prev is automatically loaded up for a lot of the masks, but as you noted, it's only almost loaded (the 2 bottom bitmasks are missing some stuff that exists only in the previous chunk). I'm sure you already have a solution, my brain is just a little full at this exact moment :)

At the end of each iteration you hold on to the latter 3 vectors from LD4, and on your next LD4, you shift in the upper element from each of those into the bottom byte of the new latter 3 vectors. This is 3 ext instructions, all parallel, and you can overwrite the stored vectors, so you still only have 7 vectors in play at once, with which you can line up any combination of prev0 (current vector), prev1, prev2, or prev3 for whatever you want!

I guess I didn't spell this part out before but when you move on to the next 64-byte chunk (0, 1, 2, 3, ...), the old 63 becomes your -1 byte, the old 62 is your -2, and the old 61 is your -3.
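
A hedged sketch of that carry, using hypothetical names (prev and cur stand for the vld4q_u8 results of the previous and current 64-byte chunks):

```cpp
#include <arm_neon.h>
#include <cstdint>

// "byte at 4i-1", "byte at 4i-2", "byte at 4i-3" lined up against cur.val[0].
struct prev_lanes {
  uint8x16_t prev1, prev2, prev3;
};

// Three ext's shift the top lane of each of prev.val[1..3] into lane 0 of the
// corresponding shifted-by-one-lane current vector.
inline prev_lanes carry_in(const uint8x16x4_t& prev, const uint8x16x4_t& cur) {
  prev_lanes out;
  out.prev1 = vextq_u8(prev.val[3], cur.val[3], 15);  // old byte 63 becomes "-1"
  out.prev2 = vextq_u8(prev.val[2], cur.val[2], 15);  // old byte 62 becomes "-2"
  out.prev3 = vextq_u8(prev.val[1], cur.val[1], 15);  // old byte 61 becomes "-3"
  // prev1/prev2/prev3 of cur.val[1..3] are covered by cur.val[0..2] and the
  // three vectors above, so only 7 vectors are in play per iteration.
  return out;
}
```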

Base automatically changed from jkeiser/simdjson-vscode to master on August 19, 2024 05:04
@Validark
Contributor

Hey @jkeiser, I just moved my blog over to a new framework and wrote an article on the subject. Maybe this helps?

https://validark.github.io/posts/interleaved-vectors-on-arm/
