ENH add hash based unique #26018

Merged

seberg merged 53 commits into numpy:main from adrinjalali:unique-cpp

Feb 27, 2025

Contributor

adrinjalali commented Mar 14, 2024 (edited)

This calculates unique values with an unordered hashset in C++ for certain dtypes.

Towards #11136
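As a rough illustration of the idea (not the PR's C++ code), the difference between a sort-based and a hash-based unique can be sketched in pure Python. The function names `unique_sorted` and `unique_hashed` are made up for this sketch, and Python's built-in `set` stands in for the C++ unordered hash set.

```python
def unique_sorted(values):
    """Sort-based unique: sort, then drop adjacent duplicates (O(n log n))."""
    result = []
    for v in sorted(values):
        if not result or result[-1] != v:
            result.append(v)
    return result


def unique_hashed(values):
    """Hash-based unique: a single pass with a hash set (expected O(n)).

    The result order is whatever the hash set yields, so callers should
    not rely on it being sorted.
    """
    seen = set()
    for v in values:
        seen.add(v)
    return list(seen)


data = [3, 1, 2, 3, 1, 1, 2]
print(unique_sorted(data))          # sorted by construction
print(sorted(unique_hashed(data)))  # sort only if an ordered result is needed
```

The hash-based variant avoids the sort entirely, which is where a speedup on large inputs with few distinct values would be expected to come from; the trade-off is that the output order is no longer guaranteed.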

ENH add hash based unique

2933930

adrinjalali mentioned this pull request

Mar 14, 2024

DOC update C-API docs #26019

Open

getting closer

8da2e72

ngoldbaum marked this pull request as draft

March 15, 2024 14:58

Member

ngoldbaum commented Mar 15, 2024

I marked this as draft in the github UI, hope you don't mind.

Contributor, Author

adrinjalali commented Mar 15, 2024

Thanks, I had intended to have it as draft, but forgot.

adrinjalali added 9 commits

March 15, 2024 22:50

trying to expose as a module

961ef5b

trying to create a module

a6b1847

fix build

1f1c36c

...

0bc43c3

Merge remote-tracking branch 'upstream/main' into unique-cpp

f94cf89

segfault fix, imported numpy

a37b151

getting unique back, refcount issues exist

8f42b0b

unique works

f56634f

remove header

bce7534

Contributor, Author

adrinjalali commented Mar 19, 2024 (edited)

I still have work to do: figuring out which dtypes / type lengths are supported on all platforms, and finding a way to filter the dtypes that should go through the hash based unique. But for now, this is giving me roughly a 10x speedup on 100_000_000 to 1_000_000_000 data points.

EDIT: the speedup is more of 2-3x on these numbers with compiler optimization enabled.

adrinjalali added 2 commits

March 21, 2024 14:17

cleanups and comments

0c6c588

change type

9b7d6f6

Contributor, Author

adrinjalali commented Mar 21, 2024

Example times on 1,000,000,000 samples:

    unique count (hashmap):  1000
    7.815 secs
    unique count (numpy main):  1000
    21.436 secs
    unique count (polars):  1000
    16.768 secs

Now that the code is pretty small, is this something we'd like to have in numpy?

cc @seberg, @rgommers since you've been involved in this.

I still need to fix the compiler issues for all different compilers, and maybe figure out the macro defs to add 96 and 128 bit data, but for now I think we can discuss if this is the right thing to do.

Also, the script I use to benchmark:

    import time

    import numpy as np
    from numpy._core.unique import unique_hash
    import polars as pl

    arr = np.random.randint(0, 1000, 1000_000_000)

    time_start = time.perf_counter()
    print("unique count (hashmap): ", len(unique_hash(arr)))
    time_elapsed = (time.perf_counter() - time_start)
    print("%5.3f secs" % (time_elapsed))

    time_start = time.perf_counter()
    print("unique count (numpy main): ", len(np.unique(arr)))
    time_elapsed = (time.perf_counter() - time_start)
    print("%5.3f secs" % (time_elapsed))

    time_start = time.perf_counter()
    print("unique count (polars): ", len(pl.Series(arr).unique()))
    time_elapsed = (time.perf_counter() - time_start)
    print("%5.3f secs" % (time_elapsed))

Haven't included pandas since I guess they still need to do a release with numpy2 first.

adrinjalali changed the title from ~~[DRAFT] [IGNORE] ENH add hash based unique~~ to [DRAFT] ENH add hash based unique

Mar 21, 2024

Member

ngoldbaum commented Mar 21, 2024

This seems like a reasonable amount of code to maintain; it's more or less a wrapper around the C++ standard library `unordered_map`.

Member

ngoldbaum commented Mar 21, 2024

Also totally a style thing, but to keep things consistent it probably makes more sense to add the `unique_hash` function to the main `numpy._core._multiarray_umath` module that most other numpy core functionality lives in than to define a new `np._core.unique` module.

Member

rgommers commented Mar 21, 2024

> Now that the code is pretty small, is this something we'd like to have in numpy?

Nice work. I agree with what @ngoldbaum said above. This seems like a very clean and concise implementation.

adrinjalali added 2 commits

March 21, 2024 22:16

fix for initialization issue

85cf692

trying to move module

3db3349

seberg reviewed

Mar 22, 2024

numpy/_core/src/multiarray/unique.cpp (resolved)

Contributor

jorisvandenbossche commented Mar 22, 2024 (edited)

> Haven't included pandas since I guess they still need to do a release with numpy2 first.

You could install the nightly wheels from the scientific python channel if you want (I don't think any compiled library already has a 2.0-compatible release, given there is not yet a stable ABI?).

I tried your benchmark case locally with pandas. I didn't fetch this branch to compare with, but also ran with released numpy as a point of comparison:

    %time len(pd.unique(arr))
    CPU times: user 2.74 s, sys: 0 ns, total: 2.74 s
    Wall time: 2.74 s

    %time len(np.unique(arr))
    CPU times: user 39.7 s, sys: 1.15 s, total: 40.8 s
    Wall time: 40.9 s

Related to `std::unordered_map`: you find quite some content "on the internet" complaining about how slow it is compared to other implementations (I remember looking around a bit some time ago for pandas, wondering if we should consider using a different implementation; pandas uses khash). But a lot of those posts are also quite some years old, and C++ compilers might have improved their implementations. It's definitely convenient to use.

goruha mentioned this pull request

Mar 23, 2024

Follow this PRs intervals-mining-lab/foapy#18

Open

Contributor, Author

adrinjalali commented May 13, 2024

@ngoldbaum

> Also totally a style thing, but to keep things consistent it probably makes more sense to add the `unique_hash` function to the main `numpy._core._multiarray_umath` module that most other numpy core functionality lives in than to define a new `np._core.unique` module.

I tried and failed. One is C++ the other C, and they also have different macros related to compilation. So not sure how to merge them.

@seberg

- bools and floats: good points. I'll make this PR "less smart" and use dtype-specific code for those types. Although I think those can also be a separate PR to keep this one small.
- refactoring: I'll try.
- Could you please elaborate on `OwnedRef` please? I'll check out the pybind thingy to see how exceptions should be handled in the meantime.
- randomized hash: they're compiler specific and AFAIK at least for gcc they might be anything but randomized. Pandas uses https://github.com/veorq/SipHash, which is certainly cryptographically secure. There are other randomized hash functions which are faster, but not as secure. We could also consider adding that / changing to it in a separate PR maybe? Or would you say having a terrible hash function as the one in this PR is too bad to be merged into numpy?

@jorisvandenbossche
Yes, pandas is really fast, and last I compared, for unique, it was faster than polars. But it's also a huge piece of code on the pandas side. Porting that in here was the first thing I tried (#25596).

Member

seberg commented May 13, 2024

> I tried and failed. One is C++ the other C

It should work, I think, but I guess you could defer that for now. You will have to export an `extern "C" {` function to be called from the multiarray module.

> Could you please elaborate on `OwnedRef` please?

In C, you do cleanup with a `goto error: /* cleanup code */`. In C++ you don't use `goto error:` but rely on exceptions; when an exception gets triggered (which could be in many places, e.g. when out of memory), you need to run that cleanup code.
In particular, if you have `PyObject *` you must decref them. There are three solutions to that:

1. You use C++ only in a core function that uses only borrowed references.
2. You wrap everything in a `try/catch`...
3. You introduce a new `OwnedRef` which does nothing but store a `PyObject *` but defines a dealloc. That way C++ does the right thing.

In this case, I am not sure what is better, it might be 1 (have a simple C entry-point and just call into some C++ code that maybe doesn't even need objects at all) or 3 if you want to work with objects in C++.

> having a terrible hash function as the one in this PR is too bad to be merged into numpy?

I doubt it matters much; there is nothing cryptographic here. The issue would be denial of service by creating a dataset with a huge number of unique values which all hash to the same thing. Note that e.g. Python only randomizes the hash for strings (unless dicts randomize them for non-strings?).
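To make the denial-of-service concern concrete, here is a small Python sketch (Python's `set` standing in for the C++ hash map, so only an analogy). `CountingKey` and `comparisons_needed` are hypothetical names invented for this illustration; the key type lets us force a chosen hash value and count how many equality checks the set performs.

```python
class CountingKey:
    """Hypothetical key that records how often it is compared for equality."""

    comparisons = 0  # shared counter, reset before each experiment

    def __init__(self, value, forced_hash):
        self.value = value
        self.forced_hash = forced_hash

    def __hash__(self):
        return self.forced_hash

    def __eq__(self, other):
        CountingKey.comparisons += 1
        return self.value == other.value


def comparisons_needed(keys):
    """Insert all keys into a set; return (unique count, equality checks)."""
    CountingKey.comparisons = 0
    unique = set(keys)
    return len(unique), CountingKey.comparisons


n = 500
# Every key has its own hash value: lookups rarely need an equality check.
spread = comparisons_needed([CountingKey(i, i) for i in range(n)])
# Every key hashes to 0: each insert is compared against stored keys.
colliding = comparisons_needed([CountingKey(i, 0) for i in range(n)])
print("distinct hashes: ", spread)
print("identical hashes:", colliding)
```

With identical hashes, each insertion has to be compared against keys that are already stored, so the work grows roughly quadratically with the number of unique values. For reference, in CPython the `hash()` of a small integer is the integer itself in every process, while string hashes vary with `PYTHONHASHSEED`.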

This was referenced Feb 20, 2025

np.unique: support float dtypes #28363

Open

np.unique: support string dtypes #28364

Closed

np.unique: support masked arrays #28366

Open

Hash based np.unique to support return_index, return_inverse, and return_counts #28374

Open

meta-issue: np.unique enhancements #28375

Open

use macro to return notimplemented

ab86574

Contributor, Author

adrinjalali commented Feb 21, 2025

Doesn't seem like the failure is related to this PR

    =================================== FAILURES ===================================
    ___________ TestStringDiscovery.test_nested_arrays_stringlength[1.2] ___________

    self = <test_array_coercion.TestStringDiscovery object at 0x3cd6fb17d010>
    obj = 1.2

        @pytest.mark.parametrize("obj",
                [object(), 1.2, 10**43, None, "string"],
                ids=["object", "1.2", "10**43", "None", "string"])
        def test_nested_arrays_stringlength(self, obj):
            length = len(str(obj))
            expected = np.dtype(f"S{length}")
            arr = np.array(obj, dtype="O")
    >       assert np.array([arr, arr], dtype="S").dtype == expected
    E       RuntimeWarning: invalid value encountered in cast

    arr        = array(1.2, dtype=object)
    expected   = dtype('S3')
    length     = 3
    obj        = 1.2
    self       = <test_array_coercion.TestStringDiscovery object at 0x3cd6fb17d010>

Member

ngoldbaum commented Feb 21, 2025

The fix for that was just merged

Merge remote-tracking branch 'upstream/main' into unique-cpp

08d7d62

lorentzenchr reviewed

Feb 22, 2025

doc/release/upcoming_changes/26018.performance.rst (outdated, resolved)

seberg and others added 2 commits

February 25, 2025 12:51

Apply suggestions from code review

e1e2ddf

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

MAINT,ENH: Smaller reorgs/maint and use sorted=False for `unique_values`

f96411a

seberg removed the "56 - Needs Release Note." label (needs an entry in doc/release/upcoming_changes)

Feb 25, 2025

seberg approved these changes

Feb 25, 2025

Member

seberg left a comment

I added a small commit to:

- Use `sorted=False` for `unique_values` (and a release note)
- Some smaller maintenance, partially just small preferences, though.
- I removed the 0-size special case. I think it is just nice to not need it (and I don't consider 0-size to be important to optimize)

This LGTM, would be nice if someone could have a quick look over my changes.

numpy/_core/src/multiarray/unique.cpp

		NPY_ALLOW_C_API;
		PyArray_Descr *descr = PyArray_DESCR(self);
		Py_INCREF(descr);
		PyObject *res_obj = PyArray_NewFromDescr(

Member

seberg commented Feb 25, 2025

If there was a C++ exception after this, we would leak it... But, I don't think that is possible, so let's not worry about it.

Contributor, Author

adrinjalali commented Feb 25, 2025

There's a segfault and the docs need updating since output is not sorted now @seberg

Ensure we don't iterate if iterator is empty (also change thread state slightly)

2319947

Member

seberg commented Feb 25, 2025 (edited)

> There's a segfault and the docs need updating since output is not sorted now

Whoops, need to not try to iterate (I think the setup is OK though).

The output being sorted is not documented for `unique_values`, which is the one I changed.

EDIT: 🤦 sorry, the examples do need updating of course, oops.

DOC: unique_values doc examples may have different order now

e188bf3

(let's pretend it may not even be stable!)

Member

seberg commented Feb 25, 2025 (edited)

Just to note the observation: The s390x build segfaulted once here. (i.e. not our test, but the compilation)

Member

seberg commented Feb 27, 2025

Let's give this a shot, thanks @adrinjalali. We discussed it briefly yesterday and I don't think it is helpful to try to iterate here more.

There are a lot of smaller and larger follow-ups (larger e.g. thinking about using another hash-map, maybe even one with some randomization, etc.).

seberg merged commit 9e557eb into numpy:main

Feb 27, 2025

68 checks passed

Contributor, Author

adrinjalali commented Feb 27, 2025

Thank y'all for all the patience, all the reviews, and all the help. This was a steep learning curve for me, doing C++ after about 12 years, and never having done CPython bindings using its bare-bones API 😅

I loved every bit of this experience, and look forward to more contributions on this little corner ❤️