Allow progressively building bitmask #2236
Conversation
jkeiser commented Aug 17, 2024
The failures appear to be due to my deleting one of the two apparently redundant copies of
lemire commented Aug 17, 2024
Woohoo!
Validark commented Aug 17, 2024
I haven't looked at simdjson internals as much as I should, but I strongly believe the goal should be to use interleaved vectors from LD4 for everything on ARM (two LD4s on 32-bit ARM, if you support that). In simdutf/simdutf#428, I show that 3 exts per 16-byte vector can be reduced to 3 exts per 64-byte chunk. That's 16 total vectors reduced to 7. I thought of this technique for the Accelerated-Zig-Parser, which does a single LD4 per 64-byte chunk and uses that for everything (not all my updates are public yet, though).

I also believe that 4 sris and a shrn are the best movemask routine compared to vpadds. It's only 5 instructions, requires no constants, I believe it matches the speed of vpadd techniques (although I'd like to hear better ideas), and it works on 32-bit ARM (no vpadds allowed).

It also seems to me that for high-performance ARM CPUs you want to run at least 4 16-byte vector ops at once (or pipelined, starting a new one each cycle). If your goal is just to change the order in which instructions are emitted in the hope of less spilling to the stack (have you looked at the assembly?), then maybe it doesn't hurt, but in general I think moving to checking one 16-byte chunk at a time is a step backwards.

I'm just addressing your stated eventual goal in your paragraph, by the way, not your PR per se. I'd be really interested to hear if you disagree, since I'd like to get better performance on ARM as well!
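For context, a 4-sri-plus-shrn movemask of the kind described above can be written roughly as in the following sketch. This is an illustration of the technique, not code from this PR or from simdjson; it assumes the 64 bytes were loaded with an interleaved vld4q_u8 and that each predicate byte is 0x00 or 0xFF:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch: turn 64 bytes of 0x00/0xFF predicates (from an interleaved vld4q_u8
// load) into a 64-bit bitmask using 4 SRI instructions and 1 SHRN.
// cmp.val[k] holds the predicate for original bytes k, k+4, k+8, ..., k+60.
uint64_t to_bitmask64(uint8x16x4_t cmp) {
  // Pack two predicates per byte: bit 7 from val[1]/val[3], bit 6 from val[0]/val[2].
  uint8x16_t t0 = vsriq_n_u8(cmp.val[1], cmp.val[0], 1);
  uint8x16_t t1 = vsriq_n_u8(cmp.val[3], cmp.val[2], 1);
  // Pack four predicates into the top nibble of each byte.
  uint8x16_t t2 = vsriq_n_u8(t1, t0, 2);
  // Duplicate that nibble into the low half of each byte as well.
  uint8x16_t t3 = vsriq_n_u8(t2, t2, 4);
  // Narrowing shift: each 16-bit lane yields one byte holding 8 predicate bits.
  uint8x8_t packed = vshrn_n_u16(vreinterpretq_u16_u8(t3), 4);
  return vget_lane_u64(vreinterpret_u64_u8(packed), 0);
}
```

The result has bit k set exactly when original byte k matched, so the interleaved load does not change the bit order of the final mask.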
jkeiser commented Aug 18, 2024
That definitely sounds interesting! I haven't quite grokked it yet, but UTF-8 may well work better in 64-byte chunks on ARM.

The main reason I brought ARM up is that its algorithm is already designed to generate 128-bit bitmasks, but we're doing a redundant op and stopping at 64 bits right now, and then doing an extra redundant op to finish it up and extract the bitmasks. I don't actually have an ARM machine handy to test on, but that's part of what has me interested. Part of the reason to put this up right now is to do the experiment and see what effect it actually has :)

The thing that's really caught my fancy is the idea of doing a decent portion of simdjson's parsing in 128- to 512-bit masks representing booleans for 128-512 bytes. That obviously doesn't apply to UTF-8, where the algorithm is heavily byte-focused, so there I'm mainly trying to validate whether incrementally building bitmasks can be done without losing performance vs. the current algorithm, so we can tune the pressure and parallelism up and down at will.
Validark commented Aug 18, 2024
I think you or other people might be overthinking that issue? All I'm describing is a reordering. Normal byte order is 0, 1, 2, 3, 4, ..., 63. We want to compare (or do some operation, doesn't matter) each byte to the byte that came before it. What bytes came before those (at each position)? -1, 0, 1, 2, 3, ..., 62: you just subtract one from each byte index, and that's the position you need to compare to.

LD4 loads data column-wise, so the four vectors hold bytes (0, 4, 8, ..., 60), (1, 5, 9, ..., 61), (2, 6, 10, ..., 62), and (3, 7, 11, ..., 63). Therefore, to reproduce the normal pairing (0 => -1, 1 => 0, 2 => 1, 3 => 2, 4 => 3), we subtract 1 from each of these to get the positions of the bytes to compare against, and that happens to be almost the same as one of the vectors LD4 loads in: three of the four "previous byte" vectors are exactly other LD4 vectors, and the remaining one is an LD4 vector shifted by a single lane.

Hope this explanation helps
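As a concrete sketch of that pairing (my reconstruction, assuming the vld4q_u8/vextq_u8/vceqq_u8 intrinsics; not code from this PR or from simdutf), comparing every byte of a 64-byte chunk against its predecessor needs only one ext within the chunk:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch: compare every byte of a 64-byte chunk against the byte before it,
// using the de-interleaved vectors from a single LD4.
//   chunk.val[0] = bytes 0, 4, 8, ..., 60   (previous bytes: -1, 3, 7, ..., 59)
//   chunk.val[1] = bytes 1, 5, 9, ..., 61   (previous bytes: val[0])
//   chunk.val[2] = bytes 2, 6, 10, ..., 62  (previous bytes: val[1])
//   chunk.val[3] = bytes 3, 7, 11, ..., 63  (previous bytes: val[2])
// Only val[0]'s predecessors need an ext, pulling byte 63 of the prior chunk.
uint8x16x4_t equals_prev(const uint8_t *ptr, uint8x16_t prior_chunk_val3) {
  uint8x16x4_t chunk = vld4q_u8(ptr);
  uint8x16_t prev0 = vextq_u8(prior_chunk_val3, chunk.val[3], 15);
  uint8x16x4_t eq;
  eq.val[0] = vceqq_u8(chunk.val[0], prev0);
  eq.val[1] = vceqq_u8(chunk.val[1], chunk.val[0]);
  eq.val[2] = vceqq_u8(chunk.val[2], chunk.val[1]);
  eq.val[3] = vceqq_u8(chunk.val[3], chunk.val[2]);
  return eq;
}
```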
jkeiser commented Aug 18, 2024
@Validark it's a really interesting idea and has some promise! I haven't had the time to really sit down with it, which is the main reason I don't fully grok it. I get how the prev is automatically loaded up for a lot of the masks, but as you noted, it's only almost loaded (the 2 bottom bitmasks are missing some stuff that exists only in the previous chunk). I'm sure you already have a solution, my brain is just a little full at this exact moment :)
Validark commented Aug 18, 2024 (edited)
At the end of each iteration you hold on to the latter 3 vectors from LD4, and on your next LD4 you shift the upper element from each of those into the bottom byte of the new latter 3 vectors. This is 3 exts total. I guess I didn't spell this part out before, but when you move on to the next 64-byte chunk (0, 1, 2, 3, ...), the old 63 becomes your -1 byte, the old 62 is your -2, and the old 61 is your -3.
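A sketch of that carry, following my reading of the comment (hypothetical helper names; not simdjson or simdutf code): keep the last three de-interleaved vectors from the previous iteration, and three exts per 64-byte chunk are enough to form prev1, prev2, and prev3 for every byte, for 7 distinct vectors in total:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch: carry the latter three LD4 vectors across iterations so that
// prev1/prev2/prev3 of every byte need only three ext instructions per chunk.
struct carried_state {
  // val[1], val[2], val[3] of the previous 64-byte chunk
  // (their top lanes are the old bytes 61, 62, and 63).
  uint8x16_t old1, old2, old3;
};

void process_chunk(const uint8_t *ptr, carried_state &state) {
  uint8x16x4_t c = vld4q_u8(ptr);

  // Three exts: the old bytes 61/62/63 slide into lane 0 of each result.
  uint8x16_t e1 = vextq_u8(state.old1, c.val[1], 15); // bytes -3, 1, 5, ..., 57
  uint8x16_t e2 = vextq_u8(state.old2, c.val[2], 15); // bytes -2, 2, 6, ..., 58
  uint8x16_t e3 = vextq_u8(state.old3, c.val[3], 15); // bytes -1, 3, 7, ..., 59

  // prev1 of (val[0], val[1], val[2], val[3]) is (e3, val[0], val[1], val[2]);
  // prev2 is (e2, e3, val[0], val[1]); prev3 is (e1, e2, e3, val[0]).
  // ... UTF-8 style checks against prev1/prev2/prev3 would go here ...

  state.old1 = c.val[1];
  state.old2 = c.val[2];
  state.old3 = c.val[3];
  (void)e1; (void)e2; (void)e3;
}
```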
Validark commented Sep 3, 2024
Hey @jkeiser, I just moved my blog over to a new framework and wrote an article on the subject. Maybe this helps? https://validark.github.io/posts/interleaved-vectors-on-arm/
This PR adds a simd_bitmask64_builder class, which lets you progressively build a 64-bit bitmask from simd8<bool> (the results of == and friends on a SIMD register). It uses it in all simd8x64<bool>.to_bitmask() implementations. It is designed to have identical performance to to_bitmask() today, but allows it to be called after each SIMD register is processed, throwing away whatever it can where the platform allows. For Intel platforms, this just means shifting the bits and or'ing them into a register. For other platforms such as ARM, it's something like a reduce operation, progressively smushing together registers with 16-to-8-bit adds until they end up in a single SIMD-sized register.
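To illustrate the general shape such a builder can take (a hypothetical sketch in the spirit of the description above, not the PR's actual implementation; the struct and member names are assumptions), the Intel-style version really can be just a shift and an OR per register:

```cpp
#include <cstdint>

// Hypothetical sketch of a progressive 64-bit bitmask builder for a platform
// whose per-register movemask yields 16 bits (e.g. 16-byte SSE-style registers).
// Each add() folds one register's worth of predicate bits into the running mask.
struct bitmask64_builder_sketch {
  uint64_t bits = 0;
  int registers_added = 0;

  // 'chunk_bits' is the movemask of one simd8<bool> (16 bits for 16-byte registers).
  // In real code the shift count would typically be a compile-time constant.
  void add(uint16_t chunk_bits) {
    bits |= static_cast<uint64_t>(chunk_bits) << (16 * registers_added);
    ++registers_added;
  }

  // After four 16-byte registers, all 64 bytes are represented.
  uint64_t finish() const { return bits; }
};
```

On ARM, the same interface would instead hold SIMD registers and reduce them pairwise (for example with narrowing adds, or the sri/shrn trick discussed above), only extracting to a scalar at the end.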
The purpose of this is to move towards a world where we can process one SIMD-sized chunk of bytes at a time, before upscaling to 64-bit masks and then to SIMD-sized masks. We're going to run out of registers if we keep just stashing everything in a simd8x64 or simd8x512 and processing it all at once at the end. This lets us be explicit and try to process everything we reasonably can as we process each SIMD register. The next step on that journey will be similar utilities to build a SIMD-sized bitmask progressively, over 8 steps; the savings from doing reduction become much more apparent there, requiring at most 4 registers per bitmask if you are doing as few operations as possible with as much parallelism as possible (as we do today in to_bitmask()). I have another draft built on this which allows the UTF-8 checking algorithm to run one SIMD register at a time as well.
DRAFT: I have not yet tested many of the platforms--just Haswell, which was running at the same performance as today on my machine.