ENH: Implement np.strings.slice as a gufunc #27789
Conversation
Didn't have a real look, but looks thorough! Mainly want to link that there is some ancient PR that introduces it for old strings in gh-20694.
ArvidJB commented Nov 21, 2024 • edited
Thanks @seberg for taking a look! I had a couple of questions:
We can always start with only positional and relax it if someone feels strongly (I also wondered briefly if we should accept slices, but probably better not). I am hoping @lysnikolaou or @ngoldbaum will have time to dive in (but I would not expect this to happen very quickly necessarily).
Neat, thanks for the ping. I'd like to spend some time looking closely at this. Do you mind if I push directly to this branch with fixes? pypy has a neat optimization that allows O(1) indexing into UTF-8 strings by effectively storing a reverse index into the string. There are also some clever tricks to bring the memory overhead down. Adopting something like that would help performance on operations like this that assume indexing into strings is cheap.
Sure, that sounds fine to me!
Oh, that's clever! Curious to see the details. We should probably add some benchmarks? It's probably worth having a dedicated codepath for …
So there's a repo here: https://github.com/pypy/fast-utf8-methods, and the pypy UTF-8 code also lives in pypy here: https://github.com/pypy/pypy/tree/6742499bbf3fc0aa63702fe4aa27147e11050c74/rpython/rlib/fastutf8. @mattip might know if there's a technical explanation somewhere of how this works. I'm not aware of a blog post or paper. This is more of a long-term idea. I'm not sure if the memory overhead of building the index is worth it - not everyone wants to do fast string indexing by character.
I don't think there is a technical summary of the SIMD utf8 code. The code actually was never merged into PyPy; it was an experiment that one developer tried to push forward but was never completed. PyPy does build an index of every 4th codepoint, which is only allocated and built on demand when a utf-8 string is sliced.
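For reference, a minimal sketch of that scheme (illustrative only, not PyPy's actual code): record the byte offset of every 4th codepoint once, lazily, and locating codepoint n becomes one table lookup plus a scan over at most three characters.

```
#include <cstddef>
#include <vector>

// Record the byte offset of every 4th codepoint. A byte begins a
// codepoint unless it is a continuation byte of the form 0b10xxxxxx.
static std::vector<size_t>
build_every_4th_index(const unsigned char *buf, size_t len)
{
    std::vector<size_t> index;
    size_t codepoint = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80) {
            if (codepoint % 4 == 0) {
                index.push_back(i);
            }
            codepoint++;
        }
    }
    return index;
}

// Byte offset of codepoint n: jump to the nearest recorded offset,
// then walk forward over at most three characters. Assumes valid
// UTF-8 and n smaller than the total codepoint count.
static size_t
byte_offset_of_codepoint(const unsigned char *buf,
                         const std::vector<size_t> &index, size_t n)
{
    size_t i = index[n / 4];
    for (size_t remaining = n % 4; remaining > 0;) {
        i++;
        if ((buf[i] & 0xC0) != 0x80) {
            remaining--;
        }
    }
    return i;
}
```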
Reminder to myself to look through this sometime next week.
I didn't go through the C++ code yet, but I spotted two issues in the Python layer on my first pass.
ngoldbaum left a comment • edited
I think it is necessary to do this in two passes for UTF-8, but you only need one pass over the whole string if you allocate another temporary buffer to build and save a byte index from the input string into the output string on the first pass.
I don't think it makes sense to build and cache an index for the whole string, but there's no need to calculate the index of each character in the output string twice.
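A rough sketch of that bookkeeping, assuming `codepoint_offsets` holds byte offsets into the input (`npy_intp`, `num_bytes_for_utf8_character`, and `codepoint_offsets` are the PR's names; the span structure, `in`, and `out` are illustrative, and this fragment is meant to slot into the existing loop):

```
#include <cstring>
#include <vector>
#include <numpy/ndarraytypes.h>  // npy_intp

// One pass over the selected characters computes the output size and
// records where each character lands in the output, so the copy loop
// never recomputes an index.
struct CopySpan {
    npy_intp src_offset;  // where the character starts in the input
    npy_intp dst_offset;  // where it will start in the output
    npy_intp num_bytes;   // its byte length
};

std::vector<CopySpan> spans;
npy_intp outsize = 0;
for (npy_intp i = start; step > 0 ? i < stop : i > stop; i += step) {
    npy_intp n = num_bytes_for_utf8_character(codepoint_offsets[i]);
    spans.push_back({codepoint_offsets[i], outsize, n});
    outsize += n;
}
// ...allocate `out` with room for `outsize` bytes, then:
for (const CopySpan &s : spans) {
    memcpy(out + s.dst_offset, in + s.src_offset, s.num_bytes);
}
```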
This also needs a release note.
Just a heads-up: I'm on Christmas break right now. I'm …
@lysnikolaou is there any chance I can get you to do a high-level once-over on this PR? There's no need to go through the new C++ code in detail, I just did that.
@mhvk could you give this one a once-over if you have a little time?
This looks very nice! I did still have a few comments, see inline.
```
// compute outsize
npy_intp outsize = 0;
for (int i = start; step > 0 ? i < stop : i > stop; i += step) {
    outsize += num_bytes_for_utf8_character(codepoint_offsets[i]);
}
```
This seems weird - you already calculated the number of bytes - can one not just take a difference between the relevant offsets? (Alternatively, perhaps there should also be a `std::vector<unsigned char> code_point_lengths`?)
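Roughly, assuming `codepoint_offsets` also stored a sentinel offset one past the end of the string (an assumption on my part; the PR may not keep one), the quoted loop would become:

```
// Byte length of character i is the distance to the next character's
// offset; no second decode of the lead byte is needed.
for (int i = start; step > 0 ? i < stop : i > stop; i += step) {
    outsize += codepoint_offsets[i + 1] - codepoint_offsets[i];
}
```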
I didn't want to add the extra allocation because this function is so simple it seemed better to just call it twice. That said, the allocation is amortized over the array, so perhaps it doesn't matter.
What about making `num_bytes_for_utf8_character` a `static inline` function? It can also be made branchless, I think, with some bit math...
Yes, you're right, it should be possible to count the number of bytes in UTF8 by bit math; after all, all the information about a character's length is in the first byte (e.g., http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html).
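A minimal sketch of such a helper, assuming a valid UTF-8 lead byte (the name mirrors the PR's helper, but the body here is illustrative, not the actual implementation):

```
#include <cstdint>

// Branchless byte length of a UTF-8 character from its lead byte: the
// top 4 bits fully determine the sequence length. Continuation bytes
// (0b10xxxxxx) map to 0; assumes otherwise valid UTF-8.
static inline int
num_bytes_for_utf8_character(unsigned char lead)
{
    static const uint8_t lengths[16] = {
        1, 1, 1, 1, 1, 1, 1, 1,  // 0b0xxx....: ASCII
        0, 0, 0, 0,              // 0b10xx....: continuation bytes
        2, 2,                    // 0b110x....: two-byte sequence
        3,                       // 0b1110....: three-byte sequence
        4,                       // 0b1111....: four-byte sequence
    };
    return lengths[lead >> 4];
}
```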
p.s. Totally fine to leave this for a subsequent PR!
There's also https://github.com/simdutf/simdutf. But I will leave all of this as an exercise for the reader...
Here's a link to the codepoint counting implementation:
https://github.com/simdutf/simdutf/blob/c4c8e0c09c8b65d4d729ae7192f62ae1eac24b4d/src/generic/utf8.h#L10
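The scalar idea behind that SIMD kernel is compact: every UTF-8 character contains exactly one byte that is not a continuation byte, so counting codepoints is counting such bytes. A sketch of the scalar equivalent (not the simdutf code itself):

```
#include <cstddef>

// Count codepoints by counting non-continuation bytes; the SIMD
// version applies the same 0b10xxxxxx test to many bytes at once.
static size_t
count_codepoints(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        count += (buf[i] & 0xC0) != 0x80;
    }
    return count;
}
```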
Two utter nitpicks, and one query as I'm confused why the `asanyarray` would be needed - but clearly I'm missing something too...
None of this is important, so I'll approve now (though we'll probably need to squash-merge given the large number of commits).
This commit adds a `np.strings.slice` function which vectorizes slicing of string arrays. For example:

```
>>> a = np.array(['hello', 'world'])
>>> np.strings.slice(a, 2)
array(['he', 'wo'], dtype='<U5')
```

This supports fixed-length and variable-length string dtypes. It also supports broadcasting the start, stop, step args:

```
>>> b = np.array(['hello world', 'γεια σου κόσμε', '你好世界', '👋 🌍️'], dtype=np.dtypes.StringDType())
>>> np.strings.slice(b, [3, 0, 2, 1], -1)
array(['lo worl', 'γεια σου κόσμ', '世', ' 🌍'], dtype=StringDType())
```

Closes numpy#8579
I rebased on main and squashed the commits. Is this ready to be merged?
I'll have Nathan take another look, but likely. The CircleCI failure is real and we should address it (a tiny thing): the documentation is slightly wrong somewhere.
Ah, looks like doctest running was fixed in one of the commits included in the rebase! We had decided not to add … Everything is passing now.
Thanks for your patience on getting this reviewed and merged, @ArvidJB. If you spot other bugs or missing features in …
Merged commit 2e700c6 into numpy:main.
Thanks, @ArvidJB, really nice!