Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

ENH: np.unique: support hash based unique for string dtype#28767

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
math-hiyoko wants to merge78 commits intonumpy:main
base:main
Choose a base branch
Loading
frommath-hiyoko:feature/#28364

Conversation

math-hiyoko
Copy link

@math-hiyokomath-hiyoko commentedApr 18, 2025
edited
Loading

Description

This PR introduces hash-based uniqueness extraction support for NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer data types, has been generalized to accommodate additional data types including string-related ones. Minor refactoring was also performed to improve maintainability and readability.

Benchmark Results

The following benchmark demonstrates significant performance improvement from the new implementation.
The test scenario (1 billion strings array) follows the experimental setup described#26018 (comment)

importrandomimportstringimporttimeimportnumpyasnpimportpolarsasplchars=string.ascii_letters+string.digitsarr=np.array(    [''.join(random.choices(chars,k=random.randint(5,10)))for_inrange(1_000)    ]*1_000_000,dtype='T',)np.random.shuffle(arr)time_start=time.perf_counter()print("unique count (hash based): ",len(np.unique(arr)))time_elapsed= (time.perf_counter()-time_start)print ("%5.3f secs"% (time_elapsed))time_start=time.perf_counter()print("unique count (polars): ",len(pl.Series(arr).unique()))time_elapsed= (time.perf_counter()-time_start)print ("%5.3f secs"% (time_elapsed))

Result

unique count (hash based):  100035.127 secsunique count (numpy main):  1000498.011 secsunique count (polars):  100074.023 secs

close#28364

adrinjalali reacted with heart emojingoldbaum reacted with rocket emoji
@math-hiyokomath-hiyoko marked this pull request as draftApril 18, 2025 11:14
@math-hiyoko
Copy link
Author

@seberg@ngoldbaum
I've addressed all comments received so far.

Copy link
Member

@ngoldbaumngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think the implementation of fnv-1a in this PR isn't correct. Maybe we should just be using the (public-domain licensed) reference implementation:https://github.com/lcn2/fnv.

I didn't look closely at the rest after I noticed this issue.

template<typename T>
// function to caluculate the hash of a string
template <typename T>
size_t str_hash(const T *str, npy_intp num_chars) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Note thatnpy_ucs4 is four bytes and fnv-1a operates on octets of data (e.g. individual bytes).

The reference implementation takes avoid * pointer and immediately casts it tounsigned char *. You could do similar.

We could also add the reference implementation as a vendored dependency (e.g. a git submodule). The license is compatible.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@sebergsebergseberg left review comments

@ngoldbaumngoldbaumngoldbaum left review comments

Assignees
No one assigned
Projects
Status: Pending authors' response
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

np.unique: support string dtypes
3 participants
@math-hiyoko@seberg@ngoldbaum

[8]ページ先頭

©2009-2025 Movatter.jp