Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

ENH add hash based unique#26018

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
seberg merged 53 commits intonumpy:mainfromadrinjalali:unique-cpp
Feb 27, 2025
Merged

ENH add hash based unique#26018

seberg merged 53 commits intonumpy:mainfromadrinjalali:unique-cpp
Feb 27, 2025

Conversation

adrinjalali
Copy link
Contributor

@adrinjalaliadrinjalali commentedMar 14, 2024
edited
Loading

This calculates unique values with an unordered hashset in C++ for certain dtypes.

Towards#11136

@ngoldbaumngoldbaum marked this pull request as draftMarch 15, 2024 14:58
@ngoldbaum
Copy link
Member

I marked this as draft in the github UI, hope you don't mind.

@adrinjalali
Copy link
ContributorAuthor

Thanks, I had intended to have it as draft, but forgot.

@adrinjalali
Copy link
ContributorAuthor

adrinjalali commentedMar 19, 2024
edited
Loading

Still got work to do regarding figuring out dtypes / type lengths supported on all platforms and all, and figuring out a way to filter dtypes that should be included in the hash based unique, but for now, this is giving me a somewhat 10x speedup on 100_000_000 to 1_000_000_000 data points.

EDIT: the speedup is more of 2-3x on these numbers with compiler optimization enabled.

@adrinjalali
Copy link
ContributorAuthor

Example times on 1,000,000,000 samples:

unique count (hashmap):  10007.815 secsunique count (numpy main):  100021.436 secsunique count (polars):  100016.768 secs

Now that the code is pretty small, is this something we'd like to have in numpy?

cc@seberg ,@rgommers since you've been involved in this.

I still need to fix the compiler issues for all different compilers, and maybe figure out the macro defs to add 96 and 128 bit data, but for now I think we can discuss if this is the right thing to do.

Also, the script I use to benchmark:

importtimeimportnumpyasnpfromnumpy._core.uniqueimportunique_hashimportpolarsasplarr=np.random.randint(0,1000,1000_000_000)time_start=time.perf_counter()print("unique count (hashmap): ",len(unique_hash(arr)))time_elapsed= (time.perf_counter()-time_start)print ("%5.3f secs"% (time_elapsed))time_start=time.perf_counter()print("unique count (numpy main): ",len(np.unique(arr)))time_elapsed= (time.perf_counter()-time_start)print ("%5.3f secs"% (time_elapsed))time_start=time.perf_counter()print("unique count (polars): ",len(pl.Series(arr).unique()))time_elapsed= (time.perf_counter()-time_start)print ("%5.3f secs"% (time_elapsed))

Haven't included pandas since I guess they still need to do a release with numpy2 first.

rgommers reacted with thumbs up emoji

@adrinjalaliadrinjalali changed the title[DRAFT] [IGNORE] ENH add hash based unique[DRAFT] ENH add hash based uniqueMar 21, 2024
@ngoldbaum
Copy link
Member

This seems like a reasonable amount of code to maintain; it's more or less a wrapper around the C++ standard libraryunordered_map.

rgommers reacted with thumbs up emojiadrinjalali reacted with heart emoji

@ngoldbaum
Copy link
Member

Also totally a style thing, but to keep things consistent it probably makes more sense to add theunique_hash function to the mainnumpy._core._multiarray_umath module that most other numpy core functionality lives in than to define a newnp._core.unique module.

@rgommers
Copy link
Member

Now that the code is pretty small, is this something we'd like to have in numpy?

Nice work. I agree with what@ngoldbaum said above. This seems like a very clean and concise implementation.

adrinjalali reacted with heart emoji

@jorisvandenbossche
Copy link
Contributor

jorisvandenbossche commentedMar 22, 2024
edited
Loading

Haven't included pandas since I guess they still need to do a release with numpy2 first.

You could install the nightly wheels from the scientific python channel if you want (I don't think there is any (compiled) library that already has a 2.0 compatible release given there is not yet a stable ABI ?)

I tried your benchmark case locally with pandas. I didn't fetch this branch to compare with, but also ran with released numpy as a point of comparison:

%time len(pd.unique(arr))CPU times: user 2.74 s, sys: 0 ns, total: 2.74 sWall time: 2.74 s%time len(np.unique(arr))CPU times: user 39.7 s, sys: 1.15 s, total: 40.8 sWall time: 40.9 s

Related tostd::unordered_map, you find quite some content "on the internet" complaining about how slow it is compared to other implementations (I remember from looking a bit around some time ago for pandas, wondering if we should consider using a different implementations; pandas uses khash). But a lost of those posts are also quite some years old, and C++ compilers might have improved their implementation. It's definitely convenient to use)

@adrinjalali
Copy link
ContributorAuthor

@ngoldbaum

Also totally a style thing, but to keep things consistent it probably makes more sense to add theunique_hash function to the mainnumpy._core._multiarray_umath module that most other numpy core functionality lives in than to define a newnp._core.unique module.

I tried and failed. One is C++ the other C, and they also have different macros related to compilation. So not sure how to merge them.

@seberg

  • bools and floats: good points. Will make this PR "less smart" and dtype specific code for those types. Although I think those can also be a separate PR to keep this one small.
  • refactoring: I'll try
  • Could you please elaborate onOwnedRef please? I'll checkout the pybind thingy to see how exceptions should be handled in the meantime.
  • randomized hash: they're compiler specific and AFAIK at least for gcc they might be anything by randomized. Pandas useshttps://github.com/veorq/SipHash which is certainly cryptographically secure. But there are other randomized hash functions which are faster, but not as secure. We could also consider adding that / changing to it in a separate PR maybe? Or would you say having a terrible hash function as the one in this PR is too bad to be merged into numpy?

@jorisvandenbossche
Yes, pandas is really fast, and last I compared, for unique, it was faster than polars. But it's also a huge piece of code on the pandas side. Porting that in here was the first thing I tried (#25596).

@seberg
Copy link
Member

I tried and failed. One is C++ the other C

Should work, I think but I guess you could defer that for now. You will have to export anextern "C" { function to be called from multiarray-module.

Could you please elaborate onOwnedRef please?

In C, you do cleanup with agoto error: /* cleanup code */. In C++ you don't usegoto error: but rely on exceptions, when an exception gets triggered (which could be in many places e.g. when out of memory), you need to run that cleanup code.
In particular, if you havePyObject * you must decref them. There are three solutions to that:

  1. You use C++ only in a core function that uses only borrowed references.
  2. You wrap everything in atry/catch...
  3. You introduce a newOwnedRef which which does nothing but store aPyObject * but defines a dealloc. That way C++ does the right thing.

In this case, I am not sure what is better, it might be 1 (have a simple C entry-point and just call into some C++ code that maybe doesn't even need objects at all) or 3 if you want to work with objects in C++.

having a terrible hash function as the one in this PR is too bad to be merged into numpy?

I doubt it matters much, there is nothing cryptographic here, the issue would be denial of service by creating a dataset with a huge number of unique values which all hash to the same thing. Note that e.g. Python only randomizes the hash for strings (unless dicts randomize them for non-strings?).

@adrinjalali
Copy link
ContributorAuthor

Doesn't seem like the failure is related to this PR

===================================FAILURES===================================___________TestStringDiscovery.test_nested_arrays_stringlength[1.2]___________self=<test_array_coercion.TestStringDiscoveryobjectat0x3cd6fb17d010>obj=1.2@pytest.mark.parametrize("obj",            [object(),1.2,10**43,None,"string"],ids=["object","1.2","10**43","None","string"])deftest_nested_arrays_stringlength(self,obj):length=len(str(obj))expected=np.dtype(f"S{length}")arr=np.array(obj,dtype="O")>assertnp.array([arr,arr],dtype="S").dtype==expectedERuntimeWarning:invalidvalueencounteredincastarr=array(1.2,dtype=object)expected=dtype('S3')length=3obj=1.2self=<test_array_coercion.TestStringDiscoveryobjectat0x3cd6fb17d010>

@ngoldbaum
Copy link
Member

The fix for that was just merged

sebergand others added2 commitsFebruary 25, 2025 12:51
@sebergseberg removed the 56 - Needs Release Note.Needs an entry in doc/release/upcoming_changes labelFeb 25, 2025
Copy link
Member

@sebergseberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I added a small commit to:

  • Usesorted=False forunique_values (and a release note)
  • Some smaller maintenance, partially just small preferences, though.
  • I removed the 0-size special case. I think it is just nice to not need it (and I don't consider 0-size to be important to optimize)

This LGTM, would be nice if someone could have a quick look over my changes.

adrinjalali reacted with hooray emoji
NPY_ALLOW_C_API;
PyArray_Descr *descr = PyArray_DESCR(self);
Py_INCREF(descr);
PyObject *res_obj = PyArray_NewFromDescr(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

If there was a C++ exception after this, we would leak it... But, I don't think that is possible, so let's not worry about it.

@adrinjalali
Copy link
ContributorAuthor

There's a segfault and the docs need updating since output is not sorted now@seberg

@seberg
Copy link
Member

seberg commentedFeb 25, 2025
edited
Loading

There's a segfault and the docs need updating since output is not sorted now

Whoops, need to not try to iterate (I think the setup is OK though).

The output being sorted is not documented forunique_values, which is the one I changed.

EDIT: 🤦 sorry, the examples do need updating of course, oops.

(let's pretend it may not even be stable!)
@seberg
Copy link
Member

seberg commentedFeb 25, 2025
edited
Loading

Just to note the observation: The s390xbuild segfaulted once here. (i.e. not our test, but the compilation)

@seberg
Copy link
Member

Let's give this a shot, thanks@adrinjalali, we discussed it briefly yesterday and I don't think it is helpful to try to iterate here more.

There are a lot of smaller and larger follow-ups (larger e.g. thinking about using another hash-map, maybe even one with some randomization, etc.).

adrinjalali reacted with hooray emojiadrinjalali reacted with heart emoji

@sebergseberg merged commit9e557eb intonumpy:mainFeb 27, 2025
68 checks passed
@adrinjalali
Copy link
ContributorAuthor

Thank y'all for all the patience, all the reviews, and all the help. This was a steep learning curve for me, doing C++ after about 12 years, and never having had done cpython bindings using its barebone API 😅

I loved every bit of this experience, and look forward to more contributions on this little corner ❤️

rgommers, astrojuanlu, and thatgeeman reacted with heart emojilorentzenchr and ngoldbaum reacted with rocket emoji

@adrinjalaliadrinjalali deleted the unique-cpp branchFebruary 27, 2025 10:52
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@ngoldbaumngoldbaumngoldbaum left review comments

@JensWehnerJensWehnerJensWehner left review comments

@lorentzenchrlorentzenchrlorentzenchr left review comments

@sebergsebergseberg approved these changes

Assignees
No one assigned
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

7 participants
@adrinjalali@ngoldbaum@rgommers@jorisvandenbossche@seberg@JensWehner@lorentzenchr

[8]ページ先頭

©2009-2025 Movatter.jp