Allow progressively building bitmask #2236
Conversation
jkeiser commented Aug 17, 2024
The failures appear to be due to my deleting one of the two apparently redundant copies of
lemire commented Aug 17, 2024
Woohoo!
Validark commented Aug 17, 2024
I haven't looked at simdjson internals as much as I should, but I strongly believe the goal should be to use interleaved vectors from LD4 for everything on ARM (two LD4s on 32-bit ARM, if you support that). In simdutf/simdutf#428, I show that 3 exts per 16-byte vector can be reduced to 3 exts per 64-byte chunk. That's 16 total vectors reduced to 7. I thought of this technique for the Accelerated-Zig-Parser, which does a single LD4 per 64-byte chunk and uses that for everything (not all my updates are public yet, though).

I also believe that 4 sris and a shrn are the best movemask routine compared to vpadds. It's only 5 instructions, requires no constants, I believe it matches the speed of vpadd techniques (although I'd like to hear better ideas), and it works on 32-bit ARM (no vpadds allowed).

It also seems to me that for high-performance ARM CPUs you want to run at least 4 16-byte vector ops at once (or pipelined, starting a new one each cycle). If your goal is just to change the order in which instructions are emitted in the hope of less spilling to the stack (have you looked at the assembly?), then maybe it doesn't hurt, but in general I think moving to checking one 16-byte chunk at a time is a step backwards.

I'm just addressing your stated eventual goal in your paragraph, by the way, not your PR per se. I'd be really interested to hear if you disagree, since I'd like to get better performance on ARM as well!
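For context, a 4-sri-plus-shrn movemask of the kind described above can be written roughly as in the following sketch. This is an illustration of the technique, not code from this PR or from simdjson; it assumes the 64 bytes were loaded with an interleaved vld4q_u8 and that each predicate byte is 0x00 or 0xFF:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch: turn 64 bytes of 0x00/0xFF predicates (from an interleaved vld4q_u8
// load) into a 64-bit bitmask using 4 SRI instructions and 1 SHRN.
// cmp.val[k] holds the predicate for original bytes k, k+4, k+8, ..., k+60.
uint64_t to_bitmask64(uint8x16x4_t cmp) {
  // Pack two predicates per byte: bit 7 from val[1]/val[3], bit 6 from val[0]/val[2].
  uint8x16_t t0 = vsriq_n_u8(cmp.val[1], cmp.val[0], 1);
  uint8x16_t t1 = vsriq_n_u8(cmp.val[3], cmp.val[2], 1);
  // Pack four predicates into the top nibble of each byte.
  uint8x16_t t2 = vsriq_n_u8(t1, t0, 2);
  // Duplicate that nibble into the low half of each byte as well.
  uint8x16_t t3 = vsriq_n_u8(t2, t2, 4);
  // Narrowing shift: each 16-bit lane yields one byte holding 8 predicate bits.
  uint8x8_t packed = vshrn_n_u16(vreinterpretq_u16_u8(t3), 4);
  return vget_lane_u64(vreinterpret_u64_u8(packed), 0);
}
```

The result has bit k set exactly when original byte k matched, so the interleaved load does not change the bit order of the final mask.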
jkeiser commented Aug 18, 2024
That definitely sounds interesting! I haven't quite grokked it yet, but UTF-8 may well work better in 64-byte chunks on ARM.

The main reason I brought ARM up is that its algorithm is already designed to generate 128-bit bitmasks, but we're doing a redundant op and stopping at 64 bits right now, and then doing an extra redundant op to finish it up and extract the bitmasks. I don't actually have an ARM machine handy to test on, but that's part of what has me interested. Part of the reason to put this up right now is to do the experiment and see what effect it actually has :)

The thing that's really caught my fancy is the idea of doing a decent portion of simdjson's parsing in 128- to 512-bit masks representing booleans for 128-512 bytes. That obviously doesn't apply to UTF-8, where the algorithm is heavily byte-focused, so there I'm mainly trying to validate whether incrementally building bitmasks can be done without losing performance vs. the current algorithm, so we can tune the pressure and parallelism up and down at will.
Validark commented Aug 18, 2024
I think you or other people might be overthinking that issue? All I'm describing is a reordering. Normal byte order is 0, 1, 2, 3, 4, ..., 63. We want to compare (or do some operation, doesn't matter) each byte to the byte that came before it. What bytes came before those (at each position)? -1, 0, 1, 2, 3, ..., 62: you just subtract one from each byte index, and that's the position you need to compare to.

LD4 loads data column-wise, so the four vectors hold bytes (0, 4, 8, ..., 60), (1, 5, 9, ..., 61), (2, 6, 10, ..., 62), and (3, 7, 11, ..., 63). Therefore, to reproduce the normal pairing (0 => -1, 1 => 0, 2 => 1, 3 => 2, 4 => 3), we subtract 1 from each of these to get the positions of the bytes to compare against, and that happens to be almost the same as one of the vectors LD4 loads in: three of the four "previous byte" vectors are exactly other LD4 vectors, and the remaining one is an LD4 vector shifted by a single lane.

Hope this explanation helps
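As a concrete sketch of that pairing (my reconstruction, assuming the vld4q_u8/vextq_u8/vceqq_u8 intrinsics; not code from this PR or from simdutf), comparing every byte of a 64-byte chunk against its predecessor needs only one ext within the chunk:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch: compare every byte of a 64-byte chunk against the byte before it,
// using the de-interleaved vectors from a single LD4.
//   chunk.val[0] = bytes 0, 4, 8, ..., 60   (previous bytes: -1, 3, 7, ..., 59)
//   chunk.val[1] = bytes 1, 5, 9, ..., 61   (previous bytes: val[0])
//   chunk.val[2] = bytes 2, 6, 10, ..., 62  (previous bytes: val[1])
//   chunk.val[3] = bytes 3, 7, 11, ..., 63  (previous bytes: val[2])
// Only val[0]'s predecessors need an ext, pulling byte 63 of the prior chunk.
uint8x16x4_t equals_prev(const uint8_t *ptr, uint8x16_t prior_chunk_val3) {
  uint8x16x4_t chunk = vld4q_u8(ptr);
  uint8x16_t prev0 = vextq_u8(prior_chunk_val3, chunk.val[3], 15);
  uint8x16x4_t eq;
  eq.val[0] = vceqq_u8(chunk.val[0], prev0);
  eq.val[1] = vceqq_u8(chunk.val[1], chunk.val[0]);
  eq.val[2] = vceqq_u8(chunk.val[2], chunk.val[1]);
  eq.val[3] = vceqq_u8(chunk.val[3], chunk.val[2]);
  return eq;
}
```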
jkeiser commented Aug 18, 2024
@Validark it's a really interesting idea and has some promise! I haven't had the time to really sit down with it, which is the main reason I don't fully grok it. I get how the prev is automatically loaded up for a lot of the masks, but as you noted, it's only almost loaded (the 2 bottom bitmasks are missing some stuff that exists only in the previous chunk). I'm sure you already have a solution, my brain is just a little full at this exact moment :)
Validark commented Aug 18, 2024 (edited)
At the end of each iteration you hold on to the latter 3 vectors from LD4, and on your next LD4 you shift the upper element from each of those into the bottom byte of the new latter 3 vectors. This is 3 exts total. I guess I didn't spell this part out before, but when you move on to the next 64-byte chunk (0, 1, 2, 3, ...), the old 63 becomes your -1 byte, the old 62 is your -2, and the old 61 is your -3.
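A sketch of that carry, following my reading of the comment (hypothetical helper names; not simdjson or simdutf code): keep the last three de-interleaved vectors from the previous iteration, and three exts per 64-byte chunk are enough to form prev1, prev2, and prev3 for every byte, for 7 distinct vectors in total:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch: carry the latter three LD4 vectors across iterations so that
// prev1/prev2/prev3 of every byte need only three ext instructions per chunk.
struct carried_state {
  // val[1], val[2], val[3] of the previous 64-byte chunk
  // (their top lanes are the old bytes 61, 62, and 63).
  uint8x16_t old1, old2, old3;
};

void process_chunk(const uint8_t *ptr, carried_state &state) {
  uint8x16x4_t c = vld4q_u8(ptr);

  // Three exts: the old bytes 61/62/63 slide into lane 0 of each result.
  uint8x16_t e1 = vextq_u8(state.old1, c.val[1], 15); // bytes -3, 1, 5, ..., 57
  uint8x16_t e2 = vextq_u8(state.old2, c.val[2], 15); // bytes -2, 2, 6, ..., 58
  uint8x16_t e3 = vextq_u8(state.old3, c.val[3], 15); // bytes -1, 3, 7, ..., 59

  // prev1 of (val[0], val[1], val[2], val[3]) is (e3, val[0], val[1], val[2]);
  // prev2 is (e2, e3, val[0], val[1]); prev3 is (e1, e2, e3, val[0]).
  // ... UTF-8 style checks against prev1/prev2/prev3 would go here ...

  state.old1 = c.val[1];
  state.old2 = c.val[2];
  state.old3 = c.val[3];
  (void)e1; (void)e2; (void)e3;
}
```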
Validark commented Sep 3, 2024
Hey @jkeiser, I just moved my blog over to a new framework and wrote an article on the subject. Maybe this helps? https://validark.github.io/posts/interleaved-vectors-on-arm/
This PR adds a simd_bitmask64_builder class, which lets you progressively build a 64-bit bitmask from simd8<bool> (the results of == and friends on a SIMD register). It uses it in all simd8x64<bool>.to_bitmask() implementations. It is designed to have identical performance to to_bitmask() today, but allows it to be called after each SIMD register is processed, throwing away whatever it can where the platform allows. For Intel platforms, this just means shifting the bits and or'ing them into a register. For other platforms such as ARM, it's something like a reduce operation, progressively smushing together registers with 16-to-8-bit adds until they end up in a single SIMD-sized register.
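To illustrate the general shape such a builder can take (a hypothetical sketch in the spirit of the description above, not the PR's actual implementation; the struct and member names are assumptions), the Intel-style version really can be just a shift and an OR per register:

```cpp
#include <cstdint>

// Hypothetical sketch of a progressive 64-bit bitmask builder for a platform
// whose per-register movemask yields 16 bits (e.g. 16-byte SSE-style registers).
// Each add() folds one register's worth of predicate bits into the running mask.
struct bitmask64_builder_sketch {
  uint64_t bits = 0;
  int registers_added = 0;

  // 'chunk_bits' is the movemask of one simd8<bool> (16 bits for 16-byte registers).
  // In real code the shift count would typically be a compile-time constant.
  void add(uint16_t chunk_bits) {
    bits |= static_cast<uint64_t>(chunk_bits) << (16 * registers_added);
    ++registers_added;
  }

  // After four 16-byte registers, all 64 bytes are represented.
  uint64_t finish() const { return bits; }
};
```

On ARM, the same interface would instead hold SIMD registers and reduce them pairwise (for example with narrowing adds, or the sri/shrn trick discussed above), only extracting to a scalar at the end.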
The purpose of this is to move towards a world where we can process one SIMD-sized chunk of bytes at a time, before upscaling to 64-bit masks and then to SIMD-sized masks. We're going to run out of registers if we keep just stashing everything in a simd8x64 or simd8x512 and processing it all at once at the end. This lets us be explicit and try to process everything we reasonably can as we process each SIMD register. The next step on that journey will be similar utilities to build a SIMD-sized bitmask progressively, over 8 steps; the savings from doing reduction become much more apparent there, requiring at most 4 registers per bitmask if you are doing as few operations as possible with as much parallelism as possible (as we do today in to_bitmask()). I have another draft built on this which allows the UTF-8 checking algorithm to run one SIMD register at a time as well.
DRAFT: I have not yet tested many of the platforms--just Haswell, which was running at the same performance as today on my machine.