ENH: Implement np.strings.slice as a gufunc #27789
Conversation
Didn't have a real look, but looks thorough! Mainly want to link that there is some ancient PR that introduces it for old strings in gh-20694.
ArvidJB commented Nov 21, 2024 • edited
Thanks @seberg for taking a look! I had a couple of questions:
We can always start with only positional and relax it if someone feels strongly (I also wondered briefly if we should accept slices, but probably better not). I am hoping @lysnikolaou or @ngoldbaum will have time to dive in (but I would not expect this to happen very quickly necessarily).
Neat, thanks for the ping. I'd like to spend some time looking closely at this. Do you mind if I push directly to this branch with fixes? pypy has a neat optimization that allows O(1) indexing into UTF-8 strings by effectively storing a reverse index into the string. There are also some clever tricks to bring the memory overhead down. Adopting something like that would help performance on operations like this that assume indexing into strings is cheap.
Sure, that sounds fine to me!
Oh, that's clever! Curious to see the details. We should probably add some benchmarks? It's probably worth having a dedicated codepath for …
So there's a repo here: https://github.com/pypy/fast-utf8-methods, and the pypy UTF-8 code also lives in pypy here: https://github.com/pypy/pypy/tree/6742499bbf3fc0aa63702fe4aa27147e11050c74/rpython/rlib/fastutf8. @mattip might know if there's a technical explanation somewhere of how this works. I'm not aware of a blog post or paper. This is more of a long-term idea. I'm not sure if the memory overhead of building the index is worth it - not everyone wants to do fast string indexing by character.
I don't think there is a technical summary of the SIMD utf8 code. The code actually was never merged into PyPy; it was an experiment that one developer tried to push forward but was never completed. PyPy does build an index of every 4th codepoint, which is only allocated and built on demand when a utf-8 string is sliced.
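For reference, a minimal sketch of that scheme (illustrative only, not PyPy's actual code): record the byte offset of every 4th codepoint once, lazily, and locating codepoint n becomes one table lookup plus a scan over at most three characters.

```
#include <cstddef>
#include <vector>

// Record the byte offset of every 4th codepoint. A byte begins a
// codepoint unless it is a continuation byte of the form 0b10xxxxxx.
static std::vector<size_t>
build_every_4th_index(const unsigned char *buf, size_t len)
{
    std::vector<size_t> index;
    size_t codepoint = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80) {
            if (codepoint % 4 == 0) {
                index.push_back(i);
            }
            codepoint++;
        }
    }
    return index;
}

// Byte offset of codepoint n: jump to the nearest recorded offset,
// then walk forward over at most three characters. Assumes valid
// UTF-8 and n smaller than the total codepoint count.
static size_t
byte_offset_of_codepoint(const unsigned char *buf,
                         const std::vector<size_t> &index, size_t n)
{
    size_t i = index[n / 4];
    for (size_t remaining = n % 4; remaining > 0;) {
        i++;
        if ((buf[i] & 0xC0) != 0x80) {
            remaining--;
        }
    }
    return i;
}
```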
Reminder to myself to look through this sometime next week.
I didn't go through the C++ code yet, but I spotted two issues in the Python layer on my first pass.
ngoldbaum left a comment • edited
I think it is necessary to do this in two passes for UTF-8, but you only need one pass over the whole string if you allocate another temporary buffer to build and save a byte index from the input string into the output string on the first pass.
I don't think it makes sense to build and cache an index for the whole string, but there's no need to calculate the index of each character in the output string twice.
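A rough sketch of that bookkeeping, assuming `codepoint_offsets` holds byte offsets into the input (`npy_intp`, `num_bytes_for_utf8_character`, and `codepoint_offsets` are the PR's names; the span structure, `in`, and `out` are illustrative, and this fragment is meant to slot into the existing loop):

```
#include <cstring>
#include <vector>
#include <numpy/ndarraytypes.h>  // npy_intp

// One pass over the selected characters computes the output size and
// records where each character lands in the output, so the copy loop
// never recomputes an index.
struct CopySpan {
    npy_intp src_offset;  // where the character starts in the input
    npy_intp dst_offset;  // where it will start in the output
    npy_intp num_bytes;   // its byte length
};

std::vector<CopySpan> spans;
npy_intp outsize = 0;
for (npy_intp i = start; step > 0 ? i < stop : i > stop; i += step) {
    npy_intp n = num_bytes_for_utf8_character(codepoint_offsets[i]);
    spans.push_back({codepoint_offsets[i], outsize, n});
    outsize += n;
}
// ...allocate `out` with room for `outsize` bytes, then:
for (const CopySpan &s : spans) {
    memcpy(out + s.dst_offset, in + s.src_offset, s.num_bytes);
}
```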
This also needs a release note.
Just a heads-up: I'm on Christmas break right now. I'm …
@lysnikolaou is there any chance I can get you to do a high-level once-over on this PR? There's no need to go through the new C++ code in detail, I just did that.
@mhvk could you give this one a once-over if you have a little time?
This looks very nice! I did still have a few comments, see inline.
```
// compute outsize
npy_intp outsize = 0;
for (int i = start; step > 0 ? i < stop : i > stop; i += step) {
    outsize += num_bytes_for_utf8_character(codepoint_offsets[i]);
}
```
This seems weird - you already calculated the number of bytes - can one not just take a difference between the relevant offsets? (Alternatively, perhaps there should also be a `std::vector<unsigned char> code_point_lengths`?)
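Roughly, assuming `codepoint_offsets` also stored a sentinel offset one past the end of the string (an assumption on my part; the PR may not keep one), the quoted loop would become:

```
// Byte length of character i is the distance to the next character's
// offset; no second decode of the lead byte is needed.
for (int i = start; step > 0 ? i < stop : i > stop; i += step) {
    outsize += codepoint_offsets[i + 1] - codepoint_offsets[i];
}
```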
I didn't want to add the extra allocation because this function is so simple it seemed better to just call it twice. That said, the allocation is amortized over the array, so perhaps it doesn't matter.
What about making `num_bytes_for_utf8_character` a `static inline` function? It can also be made branchless, I think, with some bit math...
Yes, you're right, it should be possible to count the number of bytes in UTF8 by bit math; after all, all the information about a character's length is in the first byte (e.g., http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html).
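A minimal sketch of such a helper, assuming a valid UTF-8 lead byte (the name mirrors the PR's helper, but the body here is illustrative, not the actual implementation):

```
#include <cstdint>

// Branchless byte length of a UTF-8 character from its lead byte: the
// top 4 bits fully determine the sequence length. Continuation bytes
// (0b10xxxxxx) map to 0; assumes otherwise valid UTF-8.
static inline int
num_bytes_for_utf8_character(unsigned char lead)
{
    static const uint8_t lengths[16] = {
        1, 1, 1, 1, 1, 1, 1, 1,  // 0b0xxx....: ASCII
        0, 0, 0, 0,              // 0b10xx....: continuation bytes
        2, 2,                    // 0b110x....: two-byte sequence
        3,                       // 0b1110....: three-byte sequence
        4,                       // 0b1111....: four-byte sequence
    };
    return lengths[lead >> 4];
}
```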
p.s. Totally fine to leave this for a subsequent PR!
There's also https://github.com/simdutf/simdutf. But I will leave all of this as an exercise for the reader...
Here's a link to the codepoint counting implementation:
https://github.com/simdutf/simdutf/blob/c4c8e0c09c8b65d4d729ae7192f62ae1eac24b4d/src/generic/utf8.h#L10
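The scalar idea behind that SIMD kernel is compact: every UTF-8 character contains exactly one byte that is not a continuation byte, so counting codepoints is counting such bytes. A sketch of the scalar equivalent (not the simdutf code itself):

```
#include <cstddef>

// Count codepoints by counting non-continuation bytes; the SIMD
// version applies the same 0b10xxxxxx test to many bytes at once.
static size_t
count_codepoints(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        count += (buf[i] & 0xC0) != 0x80;
    }
    return count;
}
```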
Two utter nitpicks, and one query as I'm confused why the `asanyarray` would be needed - but clearly I'm missing something too...
None of this is important, so I'll approve now (though we'll probably need to squash-merge given the large number of commits).
This commit adds a `np.strings.slice` function which vectorizes slicing of string arrays. For example:

```
>>> a = np.array(['hello', 'world'])
>>> np.strings.slice(a, 2)
array(['he', 'wo'], dtype='<U5')
```

This supports fixed-length and variable-length string dtypes. It also supports broadcasting the start, stop, step args:

```
>>> b = np.array(['hello world', 'γεια σου κόσμε', '你好世界', '👋 🌍️'], dtype=np.dtypes.StringDType())
>>> np.strings.slice(b, [3, 0, 2, 1], -1)
array(['lo worl', 'γεια σου κόσμ', '世', ' 🌍'], dtype=StringDType())
```

Closes numpy#8579
I rebased on main and squashed the commits. Is this ready to be merged?
I'll have Nathan take another look, but likely. The CircleCI failure is real and we should address it (a tiny thing): the documentation is slightly wrong somewhere.
Ah, looks like doctest running was fixed in one of the commits included in the rebase! We had decided not to add … Everything is passing now.
Thanks for your patience on getting this reviewed and merged, @ArvidJB. If you spot other bugs or missing features in …
Merged commit 2e700c6 into numpy:main.
Thanks, @ArvidJB, really nice!