Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

ENH: Implement np.strings.slice as a gufunc#27789

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
ngoldbaum merged 2 commits intonumpy:mainfromArvidJB:string_slice_gufunc
Jan 8, 2025

Conversation

ArvidJB
Copy link
Contributor

This commit adds anp.strings.slice function which vectorizes slicing of string arrays. For example

    >>> a = np.array(['hello', 'world'])    >>> np.char.slice(a, 2)    array(['he', 'wo'], dtype='<U5')

This supports fixed-length and variable-length string dtypes. It also supports broadcasting the start, stop, step args:

    >>> b = np.array(['hello world', 'γεια σου κόσμε', '你好世界', '👋 🌍️'], dtype=np.dtypes.StringDType())    >>> np.strings.slice(b, [3, 0, 2, 1], -1)    array(['lo worl', 'γεια σου κόσμ', '世', ' 🌍'], dtype=StringDType())

Closes#8579

@seberg
Copy link
Member

Didn't have a real look, but looks thorough! Mainly to link that there is some ancient PR that introduces it for old strings ingh-20694.
I think this ufunc version (which copies) is better though (both implementation and behavior wise, of course implementing it as a ufunc would have been hard back then).

@ArvidJB
Copy link
ContributorAuthor

ArvidJB commentedNov 21, 2024
edited
Loading

Thanks@seberg for taking a look!

I had a couple of questions:

  1. is the API design acceptable? I intentionally madeslice only take positional arguments to mimicbuiltins.slice's behavior. Or should we support keyword arguments, likenp.strings.slice(a, step=-1)?
  2. For variable-length strings we end up doing three passes over the string: (1) to computenum_codepoints, (2) to computeoutsize, (3) to copy each codepoint. I cannot see a way to avoid this, but maybe someone has an idea?
  3. Negative steps, where "step < 1" are tricky, because we could potentially read past the beginning of the string by accident. I would be thankful if someone could double-check the logic I wrote there.

@seberg
Copy link
Member

We can always start with only positional and relax it if someone feels strongly (I also wondered briefly if we should accept slices, but probably better not).

I am hoping@lysnikolaou or@ngoldbaum will have time to dive in (but I would not expect this to happen very quickly necessarily).
In general, I would think you could at least bring it down to 2 passes (merge the first two), although it may need a custom new method on the buffer helper implementation?
(I'll assume that small, usually one, steps are typically, and we shouldn't worry about optimizing large steps much.)

@ngoldbaum
Copy link
Member

Neat, thanks for the ping. I'd like to spend some time looking closely at this. Do you mind if I push directly to this branch with fixes?

pypy has a neat optimization that allows O(1) indexing into UTF-8 strings by storing effectively a reverse index into the string. There are also some clever tricks to bring the memory overhead down. Adopting something like that would help for performance on operations like this that assume indexing into strings is cheap.

@ArvidJB
Copy link
ContributorAuthor

Neat, thanks for the ping. I'd like to spend some time looking closely at this. Do you mind if I push directly to this branch with fixes?

Sure, that sounds fine to me!

pypy has a neat optimization that allows O(1) indexing into UTF-8 strings by storing effectively a reverse index into the string. There are also some clever tricks to bring the memory overhead down. Adopting something like that would help for performance on operations like this that assume indexing into strings is cheap.

Oh, that's clever! Curious to see the details. We should probably add some benchmarks? It's probably worth to have a dedicated codepath forstep==1.

@ngoldbaum
Copy link
Member

So there's a repo here:https://github.com/pypy/fast-utf8-methods, the pypy UTF-8 code also lives in pypy here:https://github.com/pypy/pypy/tree/6742499bbf3fc0aa63702fe4aa27147e11050c74/rpython/rlib/fastutf8

@mattip might know if there's a technical explanation somewhere of how this works. I'm not aware of a blog post or paper.

This is more of a long-term idea. I'm not sure if the memory overhead of building the index is worth it - not everyone wants to do fast string indexing by character.

@mattip
Copy link
Member

I'm not sure if the memory overhead of building the index is worth it - not everyone wants to do fast string indexing by character.

I don't think there is a technical summary of the SIMD utf8 code. The code actually was never merged into PyPy, it was an experiment that one developer tried to push forward but was never completed. PyPy doesbuild an index of every 4th codepoint which is only allocated and built on-demand when a utf-8 string is sliced.

ngoldbaum reacted with thumbs up emoji

@ngoldbaum
Copy link
Member

Reminder to myself to look through this sometime next week.

@ngoldbaumngoldbaum self-assigned thisDec 8, 2024
Copy link
Member

@ngoldbaumngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I didn't go through the C++ code yet but I spotted two issues in the python layer on my first pass

Copy link
Member

@ngoldbaumngoldbaum left a comment
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think it is necessary to do this in two passes for UTF-8 but I think you only need to do one pass over the whole string if you allocate another temporary buffer to build and save a byte index from the input string into the output string on the first pass.

I don't think it makes sense to build and cache an index for thewhole string, but there's no need to calculate the index of each character in the output string twice.

@ngoldbaum
Copy link
Member

This also needs a release note.

ArvidJB reacted with thumbs up emoji

@ngoldbaumngoldbaum added the 56 - Needs Release Note.Needs an entry in doc/release/upcoming_changes labelDec 12, 2024
@ngoldbaum
Copy link
Member

Just a head's up I'm on Christmas break right now. I'm
planning to look at this in the new year.

ArvidJB reacted with thumbs up emoji

@ngoldbaum
Copy link
Member

@lysnikolaou is there any chance I can get you to do a high-level once-over on this PR? There's no need to go through the new C++ code in detail, I just did that.

@ngoldbaum
Copy link
Member

@mhvk could you give this one a once-over if you have a little time?

Copy link
Contributor

@mhvkmhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This looks very nice! I did still have a few comments, see in-line.

// compute outsize
npy_intp outsize = 0;
for (int i = start; step > 0 ? i < stop : i > stop; i += step) {
outsize += num_bytes_for_utf8_character(codepoint_offsets[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This seems weird - you already calculated the number of bytes - can one not just take a difference between the relevant offsets? (Alternatively, perhaps there should also be astd::vector::<unsigned char> code_point_lengths?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I didn't want to add the extra allocation because this function is so simple it seemed better to just call it twice. That said the allocation is amortized over the array so perhaps it doesn't matter.

What about makingnum_bytes_for_utf8_character astatic inline function? It's can also make it branchless, I think, with some bitmath...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, you're right, it should be possible to count the number of bytes in UTF8 by bitmath; after all, all information on a charcter's length is in the first byte (e.g.,http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

p.s. Totally fine to leave this for a subsequent PR!

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

There's alsohttps://github.com/simdutf/simdutf. But I will leave all of this as an exercise for the reader....

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@ArvidJBArvidJB requested a review frommhvkJanuary 7, 2025 02:47
Copy link
Contributor

@mhvkmhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Two utter nitpicks, and one query as I'm confused why theasanyarray would be needed - but clearly I'm missing something too...

None of this is important, so I'll approve now (though probably need to squash-merge given the large number of commits).

This commit adds a `np.strings.slice` function whichvectorizes slicing of string arrays. For example```    >>> a = np.array(['hello', 'world'])    >>> np.char.slice(a, 2)    array(['he', 'wo'], dtype='<U5')```This supports fixed-length and variable-length string dtypes.It also supports broadcasting the start, stop, step args:```    >>> b = np.array(['hello world', 'γεια σου κόσμε', '你好世界', '👋 🌍️'], dtype=np.dtypes.StringDType())    >>> np.strings.slice(b, [3, 0, 2, 1], -1)    array(['lo worl', 'γεια σου κόσμ', '世', ' 🌍'], dtype=StringDType())```Closesnumpy#8579
@ArvidJB
Copy link
ContributorAuthor

I rebased on main and squashed the commits. Is this ready to be merged?

@seberg
Copy link
Member

I rebased on main and squashed the commits. Is this ready to be merged?

I'll have Nathan have another look over, but likely. The CircleCI failure is real and we should address it (a tiny thing): The documentation is slightly wrong somewhere:

1669     Examples1670     --------1671     >>> import numpy as np1672     >>> a = np.array(['hello', 'world'])1673     >>> np.char.slice(a, 2)UNEXPECTED EXCEPTION: AttributeError("module 'numpy.char' has no attribute 'slice'")

@ngoldbaumngoldbaum removed the 56 - Needs Release Note.Needs an entry in doc/release/upcoming_changes labelJan 8, 2025
@ArvidJB
Copy link
ContributorAuthor

I rebased on main and squashed the commits. Is this ready to be merged?

I'll have Nathan have another look over, but likely. The CircleCI failure is real and we should address it (a tiny thing): The documentation is slightly wrong somewhere:

1669     Examples1670     --------1671     >>> import numpy as np1672     >>> a = np.array(['hello', 'world'])1673     >>> np.char.slice(a, 2)UNEXPECTED EXCEPTION: AttributeError("module 'numpy.char' has no attribute 'slice'")

Ah, looks like doctest running was fixed in one of the commits included in the rebase!

We had decided to not addnp.char.slice since it's supposedly a deprecated module, but I had not fixed the doctests.

Everything is passing now.

@ngoldbaum
Copy link
Member

Thanks for your patience on getting this reviewed and merged@ArvidJB. If you spot other bugs or missing features innp.strings please don’t hesitate to follow up with more issues or PRs!

seberg reacted with thumbs up emoji

@ngoldbaumngoldbaum merged commit2e700c6 intonumpy:mainJan 8, 2025
67 checks passed
@mwtoewsmwtoews mentioned this pull requestJan 8, 2025
@mhvk
Copy link
Contributor

mhvk commentedJan 8, 2025

Thanks,@ArvidJB, really nice!

ganesh-k13 added a commit to ganesh-k13/numpy that referenced this pull requestMay 16, 2025
ganesh-k13 added a commit to ganesh-k13/numpy that referenced this pull requestMay 16, 2025
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@sebergsebergseberg left review comments

@mhvkmhvkmhvk approved these changes

@ngoldbaumngoldbaumngoldbaum left review comments

Assignees

@ngoldbaumngoldbaum

Projects
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

Feature request: vectorized character slice
5 participants
@ArvidJB@seberg@ngoldbaum@mattip@mhvk

[8]ページ先頭

©2009-2025 Movatter.jp