Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork10.9k
ENH: np.unique: support hash based unique for string dtype#28767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Conversation
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
@seberg@ngoldbaum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think the implementation of fnv-1a in this PR isn't correct. Maybe we should just be using the (public-domain licensed) reference implementation:https://github.com/lcn2/fnv.
I didn't look closely at the rest after I noticed this issue.
template<typename T> | ||
// function to caluculate the hash of a string | ||
template <typename T> | ||
size_t str_hash(const T *str, npy_intp num_chars) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Note thatnpy_ucs4
is four bytes and fnv-1a operates on octets of data (e.g. individual bytes).
The reference implementation takes avoid *
pointer and immediately casts it tounsigned char *
. You could do similar.
We could also add the reference implementation as a vendored dependency (e.g. a git submodule). The license is compatible.
Uh oh!
There was an error while loading.Please reload this page.
Description
This PR introduces hash-based uniqueness extraction support for NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer data types, has been generalized to accommodate additional data types including string-related ones. Minor refactoring was also performed to improve maintainability and readability.
Benchmark Results
The following benchmark demonstrates significant performance improvement from the new implementation.
The test scenario (1 billion strings array) follows the experimental setup described#26018 (comment)
Result
close#28364