This PR introduces hash-based uniqueness extraction support for NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer data types, has been generalized to accommodate additional data types including string-related ones. Minor refactoring was also performed to improve maintainability and readability.
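
As a rough illustration of the idea (a pure-Python sketch, not the C++ code in this PR; the function names `unique_sort_based` and `unique_hash_based` are hypothetical), the difference between sort-based and hash-based uniqueness extraction could look like this:

```python
from itertools import groupby


def unique_sort_based(values):
    """Sort first, then drop adjacent duplicates (O(n log n) comparisons)."""
    return [key for key, _ in groupby(sorted(values))]


def unique_hash_based(values):
    """Insert every value into a hash set (expected O(n) time, arbitrary order)."""
    seen = set()
    for value in values:
        seen.add(value)
    return list(seen)


words = ["pear", "apple", "pear", "fig", "apple"]
print(unique_sort_based(words))
# The hash-based result is sorted here only to make the comparison easy.
print(sorted(unique_hash_based(words)))
```

Both functions return the same set of elements; the hash-based variant avoids the sort, which is presumably where most of the speedup for large arrays with few distinct values comes from.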

Benchmark Results

The following benchmark shows a significant performance improvement with the new implementation.
The test scenario (an array of 1 billion strings) follows the experimental setup described in #26018 (comment).

    import random
    import string
    import time

    import numpy as np
    import polars as pl

    chars = string.ascii_letters + string.digits
    arr = np.array(
        [
            ''.join(random.choices(chars, k=random.randint(5, 10)))
            for _ in range(1_000)
        ] * 1_000_000,
        dtype='T',
    )
    np.random.shuffle(arr)

    time_start = time.perf_counter()
    print("unique count (hash based): ", len(np.unique(arr)))
    time_elapsed = (time.perf_counter() - time_start)
    print("%5.3f secs" % (time_elapsed))

    time_start = time.perf_counter()
    print("unique count (polars): ", len(pl.Series(arr).unique()))
    time_elapsed = (time.perf_counter() - time_start)
    print("%5.3f secs" % (time_elapsed))

Result

    unique count (hash based):  1000
    33.583 secs
    unique count (numpy main):  1000
    498.011 secs
    unique count (polars):  1000
    74.023 secs
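
To put the three timings in relation to each other, a small sketch such as the following can compute the ratios. The numbers are copied from the result above; the dictionary keys are only labels chosen for this illustration.

```python
# Wall-clock timings in seconds, copied from the benchmark result.
timings = {"hash based": 33.583, "numpy main": 498.011, "polars": 74.023}

baseline = timings["hash based"]
for name, secs in timings.items():
    # Ratio of each timing to the hash-based timing.
    print(f"{name:>10}: {secs:8.3f} s  ({secs / baseline:.1f}x the hash-based time)")
```

On these figures the hash-based path is roughly 15 times faster than numpy main and roughly twice as fast as polars for this particular workload.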

close #28364

math-hiyoko added 13 commits

April 16, 2025 01:20

Support NPY_STRING, NPY_UNICODE

f620f3b

unique for NPY_STRING and NPY_UNICODE

20ccefe

fix construct array

38626b9

remove unneccessary include

56bd858

refactor

f79736a

refactoring

c4e5438

comment

7c51049

feature: unique for NPY_VSTRING

bd70552

refactoring

cc8ece6

remove unneccessary include

f7b20a0

add test

d0170ed

add error message

dbb140f

linter

49ed502

math-hiyoko marked this pull request as draft

April 18, 2025 11:14

github-actions bot added the 01 - Enhancement label

Apr 18, 2025

math-hiyoko added 15 commits

April 18, 2025 20:16

linter

0238cee

reserve bucket

6905978

remove emoji from testcase

2fc1378

fix testcase

1ad6d6c

remove error

b478e15

fix testcase

95bc405

fix testcase name

3f1811b

use basic_string

99e3662

fix testcase

b99542a

add ValueError

2589dd7

fix testcase

3f40cdc

fix memory error

68d5a7b

remove multibyte char

d38c3e3

refactoring

8cf2c63

add multibyte char

0165d6a

math-hiyoko added 3 commits

May 3, 2025 08:31

bool -> npy_bool

2a1bd41

FIX: cast

8b632f2

34sec -> 35.1sec

a7bfc08

Contributor Author

math-hiyoko commented May 4, 2025

@seberg @ngoldbaum
I've addressed all comments received so far.

math-hiyoko and others added 2 commits

May 21, 2025 22:11

Merge branch 'main' into feature/numpy#28364

dd0d8f5

fix: lint

9fc9ce3

ngoldbaum reviewed

May 22, 2025

Member

ngoldbaum left a comment

I think the implementation of fnv-1a in this PR isn't correct. Maybe we should just be using the (public-domain licensed) reference implementation: https://github.com/lcn2/fnv.

I didn't look closely at the rest after I noticed this issue.

numpy/_core/src/multiarray/unique.cpp Outdated

		template<typename T>
		// function to calculate the hash of a string
		template <typename T>
		size_t str_hash(const T *str, npy_intp num_chars) {

Member

ngoldbaum May 22, 2025

Note that npy_ucs4 is four bytes and fnv-1a operates on octets of data (i.e. individual bytes).

The reference implementation takes a void * pointer and immediately casts it to unsigned char *. You could do something similar.

We could also add the reference implementation as a vendored dependency (e.g. a git submodule). The license is compatible.
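
To make the octet point concrete, here is a minimal Python sketch of 64-bit FNV-1a using the published offset basis and prime. The helper `fnv1a_64_per_code_unit` is hypothetical: it illustrates one way a byte-oriented hash can be misapplied to 4-byte characters, and it is only an assumption that this resembles what the PR did at this stage.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """FNV-1a, 64-bit: XOR in one octet, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for octet in data:
        h ^= octet
        h = (h * FNV64_PRIME) & MASK64
    return h


def fnv1a_64_per_code_unit(text: str) -> int:
    """Hypothetical variant that XORs in whole code points (not byte-wise FNV-1a)."""
    h = FNV64_OFFSET_BASIS
    for ch in text:
        h ^= ord(ch)
        h = (h * FNV64_PRIME) & MASK64
    return h


text = "a"
# Four octets per character, comparable to a little-endian npy_ucs4 buffer.
as_ucs4_bytes = text.encode("utf-32-le")

print(hex(fnv1a_64(b"a")))                # hash of the single octet 0x61
print(hex(fnv1a_64(as_ucs4_bytes)))       # hash of the four octets 61 00 00 00
print(hex(fnv1a_64_per_code_unit(text)))  # one step per character instead of four
```

Hashing the raw bytes of the buffer, as the reference implementation does after its cast to unsigned char *, processes every octet; the per-code-unit variant performs only one XOR-and-multiply step per character and therefore yields a different value for the same buffer.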

Contributor Author

math-hiyoko May 25, 2025

Thanks for the suggestion.
If we go the submodule route for lcn2/fnv, where in the NumPy tree would you like the submodule to live? Let me know your preferred path and I’ll add it accordingly.

Contributor Author

math-hiyoko May 26, 2025

Vendoring would work for me as well (e.g. copying a single header the way stlab does in adobe/fnv.hpp).
If we go with a vendored file instead of a submodule, which path in the NumPy tree would you prefer for it to live?

Member

ngoldbaum May 26, 2025 • edited

I would probably vendor it as a git submodule in numpy/_core/src/common, next to pythoncapi-compat. Maybe take a look at the PR adding pythoncapi-compat as a vendored dependency to see how that header-only vendored dependency is integrated into the numpy build system.

I only have a slight preference for a git submodule, since it makes updating the vendored code marginally easier. If you feel like it's easier to structure it as just a copy/pasted new header file (that includes a note about the original copyright and license), I'd probably just put that header in numpy/_core/src/multiarray.

Contributor Author

math-hiyoko May 27, 2025 • edited

Tried the submodule approach, but ran into two blockers:

1. The reference source expects to be built with a plain make step that is not available on every platform/toolchain we target.
2. The code still uses legacy BSD typedefs such as u_int32_t; that builds fine on Linux/macOS but fails to compile on WASM, Windows/Clang, etc.

Because of these two issues, the submodule route looks impractical for NumPy at the moment.

Member

ngoldbaum May 27, 2025

Fair enough, thanks for checking. Vendoring a somewhat adapted version makes sense.

Contributor Author

math-hiyoko Jun 1, 2025 • edited

I’ve re-implemented the FNV-1a hash function based on the reference implementation from lcn2/fnv, with appropriate modifications. To keep the scope and visibility of the function well-managed, I split the implementation into a header and a .c source file.

After this change, the benchmarked runtime improved slightly from 35.1 sec to 33.5 sec.

melissawm added this to NumPy first-time contributor PRs

May 22, 2025

melissawm moved this to Pending authors' response in NumPy first-time contributor PRs

May 22, 2025

fix: cast using const void *

998ca00

math-hiyoko force-pushed the feature/#28364 branch from b6394ed to 998ca00

May 30, 2025 15:17

math-hiyoko added 3 commits

June 1, 2025 17:00

fix: fix fnv1a hash

3dd2667

fix: lint

94926cb

35.1sec -> 33.5sec

a711635

Member

ngoldbaum commented Jun 9, 2025

Sorry this didn't quite make it in time for NumPy 2.3, but I'm going to try to get this merged ASAP so we have time to iterate in-tree as needed. I'll try to give this another once-over sometime this week.

I also want to say that this is a really impressive contribution and any delay on my part is because I have to focus on other stuff and it takes a lot of time and attention to properly review CPython C API code rather than me being unenthusiastic about getting this improvement into NumPy.

Merge branch 'main' into feature/numpy#28364

ccccc44

ngoldbaum approved these changes

Jun 18, 2025

Member

ngoldbaum left a comment

I see some minor issues that should be addressed but this is ready to merge then. I looked all the C++ code over in detail and didn't spot any issues besides these.

numpy/_core/src/multiarray/unique.cpp Outdated (resolved)

numpy/_core/src/multiarray/unique.cpp (resolved)

numpy/_core/src/multiarray/unique.cpp Outdated (resolved)

Member

ngoldbaum commented Jun 18, 2025

Also you might be interested in #29229

math-hiyoko added 3 commits

June 20, 2025 02:10

enh: define macro HASH_TABLE_INITIAL_BUCKETS

2b6b9b5

enh: error handling of NpyString_load

e92a387

enh: delete comments on GIL

397a594

Contributor Author

math-hiyoko commented Jun 19, 2025

Thanks so much for the thorough review! This was my first PR to NumPy, and your detailed feedback really helped me bring it to completion.

I'll take a look at #29229 as you suggested. I'm also interested in #28363, which I think I can complete in a similar way to this PR.

ngoldbaum reviewed

Jun 19, 2025

numpy/_core/src/multiarray/unique.cpp Outdated

		NpyString_load(in_allocator, packed_string, &unpacked_strings[i]);
		int is_null = NpyString_load(in_allocator, packed_string, &unpacked_strings[i]);
		if (is_null == -1) {
		// failed to load string

Member

ngoldbaum Jun 19, 2025

I think you need to explicitly set an error indicator with e.g. npy_gil_error. NpyString_load doesn't do it for you.

Contributor Author

math-hiyoko Jun 19, 2025

Added the error indicator PyErr_SetString in the failure branch:

		if (is_null == -1) {
		    PyErr_SetString(PyExc_RuntimeError,
		                    "Failed to load string from packed static string.");
		    return NULL;
		}

fix: PyErr_SetString when NpyString_load failed

425a166

ngoldbaum reviewed

Jun 19, 2025

numpy/_core/src/multiarray/unique.cpp Outdated

		@@ -243,7 +243,8 @@ unique_vstring(PyArrayObject *self, npy_bool equal_nan)
		npy_packed_static_string *packed_string = (npy_packed_static_string *)idata;
		int is_null = NpyString_load(in_allocator, packed_string, &unpacked_strings[i]);
		if (is_null == -1) {
		// failed to load string
		PyErr_SetString(PyExc_RuntimeError,

Member

ngoldbaum Jun 19, 2025

You need to have the GIL acquired before you call this. That’s why I suggested npy_gil_error, since it handles that dance.

Contributor Author

math-hiyoko Jun 19, 2025

Thank you for pointing this out.
Switched PyErr_SetString to npy_gil_error.

fix: PyErr_SetString -> npy_gil_error

12eb788

Member

ngoldbaum commented Jun 20, 2025

Let's bring this in, thanks @math-hiyoko! Looking forward to future contributions.

ngoldbaum merged commit 20d034f into numpy:main

Jun 20, 2025

76 checks passed

github-project-automation bot moved this from Pending authors' response to Completed in NumPy first-time contributor PRs

Jun 20, 2025

math-hiyoko mentioned this pull request

Jun 21, 2025

np.unique: support float dtypes #28363

Open

Labels

01 - Enhancement

3 participants

ENH: np.unique: support hash based unique for string dtype #28767

math-hiyoko commented Apr 18, 2025 • edited
