Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

ENH: np.unique: support hash based unique for string dtype#28767

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
ngoldbaum merged 88 commits intonumpy:mainfrommath-hiyoko:feature/#28364
Jun 20, 2025

Conversation

math-hiyoko
Copy link
Contributor

@math-hiyokomath-hiyoko commentedApr 18, 2025
edited
Loading

Description

This PR introduces hash-based uniqueness extraction support for NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer data types, has been generalized to accommodate additional data types including string-related ones. Minor refactoring was also performed to improve maintainability and readability.

Benchmark Results

The following benchmark demonstrates significant performance improvement from the new implementation.
The test scenario (1 billion strings array) follows the experimental setup described#26018 (comment)

importrandomimportstringimporttimeimportnumpyasnpimportpolarsasplchars=string.ascii_letters+string.digitsarr=np.array(    [''.join(random.choices(chars,k=random.randint(5,10)))for_inrange(1_000)    ]*1_000_000,dtype='T',)np.random.shuffle(arr)time_start=time.perf_counter()print("unique count (hash based): ",len(np.unique(arr)))time_elapsed= (time.perf_counter()-time_start)print ("%5.3f secs"% (time_elapsed))time_start=time.perf_counter()print("unique count (polars): ",len(pl.Series(arr).unique()))time_elapsed= (time.perf_counter()-time_start)print ("%5.3f secs"% (time_elapsed))

Result

unique count (hash based):  100033.583 secsunique count (numpy main):  1000498.011 secsunique count (polars):  100074.023 secs

close#28364

adrinjalali reacted with heart emojingoldbaum reacted with rocket emoji
@math-hiyokomath-hiyoko marked this pull request as draftApril 18, 2025 11:14
@math-hiyoko
Copy link
ContributorAuthor

@seberg@ngoldbaum
I've addressed all comments received so far.

Copy link
Member

@ngoldbaumngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think the implementation of fnv-1a in this PR isn't correct. Maybe we should just be using the (public-domain licensed) reference implementation:https://github.com/lcn2/fnv.

I didn't look closely at the rest after I noticed this issue.

template<typename T>
// function to caluculate the hash of a string
template <typename T>
size_t str_hash(const T *str, npy_intp num_chars) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Note thatnpy_ucs4 is four bytes and fnv-1a operates on octets of data (e.g. individual bytes).

The reference implementation takes avoid * pointer and immediately casts it tounsigned char *. You could do similar.

We could also add the reference implementation as a vendored dependency (e.g. a git submodule). The license is compatible.

math-hiyoko reacted with eyes emoji
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks for the suggestion.
If we go the submodule route for lcn2/fnv, where in the NumPy tree would you like the submodule to live? Let me know your preferred path and I’ll add it accordingly.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Vendoring would work for me as well (e.g. copying a single header the way stlab does inadobe/fnv.hpp).
If we go with a vendored file instead of a submodule, which path in the NumPy tree would you prefer for it to live?

Copy link
Member

@ngoldbaumngoldbaumMay 26, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I would probably vendor it as a git submodule innumpy/_core/src/common, next topythoncapi-compat. Maybe take a look at the PR addingpythoncapi-compat as a vendered dependency to see how that header-only vendored dependency is integrated into the numpy build system.

I only have a slight preference for a git submodule, since it makes updating the vendored code marginally easier. If you feel like it's easier to structure it as just a copy/pasted new header file (that includes a note about the original copyright and license), I'd probably just put that header innumpy/_core/src/multiarray.

math-hiyoko reacted with thumbs up emoji
Copy link
ContributorAuthor

@math-hiyokomath-hiyokoMay 27, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Tried the submodule approach, but ran into two blockers:

  • the reference source expects to be built with a plainmake step that is not available on every platform/toolchain we target.
  • the code still uses legacy BSD typedefs such asu_int32_t; that builds fine on Linux/macOS but fails to compile on WASM, Windows/Clang, etc.

Because of these two issues the submodule route looks impractical for NumPy at the moment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Fair enough, thanks for checking. Vendoring a somewhat adapted version makes sense.

math-hiyoko reacted with thumbs up emoji
Copy link
ContributorAuthor

@math-hiyokomath-hiyokoJun 1, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I’ve re-implemented the FNV-1a hash function based on the reference implementation fromlcn2/fnv, with appropriate modifications. To keep the scope and visibility of the function well-managed, I split the implementation into a header and a .c source file.

After this change, the benchmarked runtime improved slightly from 35.1 sec to 33.5 sec.

@ngoldbaum
Copy link
Member

Sorry this didn't quite make it in time for NumPy 2.3, but i'm going to try to get this merged ASAP so we have time to iterate in-tree as needed. I'll try to give this another once-over sometime this week.

I also want to say that this is a really impressive contribution and any delay on my part is because I have to focus on other stuff and it takes a lot of time and attention to properly review CPython C API code rather than me being unenthusiastic about getting this improvement into NumPy.

math-hiyoko reacted with thumbs up emojimath-hiyoko reacted with heart emoji

Copy link
Member

@ngoldbaumngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I see some minor issues that should be addressed but this is ready to merge then. I looked all the C++ code over in detail and didn't spot any issues besides these.

math-hiyoko reacted with thumbs up emoji
@ngoldbaum
Copy link
Member

Also you might be interested in#29229

@math-hiyoko
Copy link
ContributorAuthor

Thanks so much for the thorough review! This was my first PR to NumPy, and your detailed feedback really helped me bring it to completion.

I'll take a look at#29229 as you suggested — I'm also interested in#28363, which I think I can complete in a similar way to this PR.

NpyString_load(in_allocator, packed_string, &unpacked_strings[i]);
int is_null = NpyString_load(in_allocator, packed_string, &unpacked_strings[i]);
if (is_null == -1) {
// failed to load string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think you need to explicitly set an error indicator with e.g.npy_gil_error.NpyString_load doesn't do it for you.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Added the error indicator PyErr_SetString in the failure branch

if (is_null == -1) {PyErr_SetString(PyExc_RuntimeError,"Failed to load string from packed static string.");returnNULL;        }

@@ -243,7 +243,8 @@ unique_vstring(PyArrayObject *self, npy_bool equal_nan)
npy_packed_static_string *packed_string = (npy_packed_static_string *)idata;
int is_null = NpyString_load(in_allocator, packed_string, &unpacked_strings[i]);
if (is_null == -1) {
// failed to load string
PyErr_SetString(PyExc_RuntimeError,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

you need to have the GIL acquired before you call this - that’s why I suggestednpy_gil_error, since it handles that dance

math-hiyoko reacted with eyes emoji
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thank you for pointing this out.
SwitchedPyErr_SetString tonpy_gil_error.

@ngoldbaum
Copy link
Member

Let's bring this in, thanks@math-hiyoko! Looking forward to future contributions.

math-hiyoko reacted with heart emoji

@ngoldbaumngoldbaum merged commit20d034f intonumpy:mainJun 20, 2025
76 checks passed
@github-project-automationgithub-project-automationbot moved this fromPending authors' response toCompleted inNumPy first-time contributor PRsJun 20, 2025
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@sebergsebergseberg left review comments

@ngoldbaumngoldbaumngoldbaum approved these changes

Assignees
No one assigned
Projects
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

np.unique: support string dtypes
3 participants
@math-hiyoko@seberg@ngoldbaum

[8]ページ先頭

©2009-2025 Movatter.jp