Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

ENH, API: New sorting mechanism for DType API#28516

Open
MaanasArora wants to merge23 commits intonumpy:main
base:main
Choose a base branch
Loading
fromMaanasArora:enh/faster-string-sorting

Conversation

MaanasArora
Copy link
Contributor

@MaanasAroraMaanasArora commentedMar 14, 2025
edited
Loading

Resolves#26510.

Allocates the lock for the StringDType array before sort and releases after.

I noticed the sorting algorithms independently get the compare function from the descriptor, so I have created new helper functions instringdtype/dtype.c but not sure if that's the right place. Changes have been made only inquicksort.cpp (but will add others later), so this is a draft but would appreciate feedback.

Posting simple benchmarks in a comment below. Thank you for reviewing!

@MaanasArora
Copy link
ContributorAuthor

Running this script on both branches:

importnumpyimportrandomimporttimeitprint(numpy.__version__)# '2.0.0rc2'options= ["a","bb","ccc","dddd"]lst=random.choices(options,k=1000)arr_s=numpy.fromiter(lst,dtype="T",count=len(lst))print(timeit.timeit(lambda:numpy.unique(arr_s),number=10000))

produces on master:

2.3.0.dev0+git20250310.c275e253.481267879999905

and on this branch:

2.3.0.dev0+git20250314.0eb0c8e1.1663875859994732

@seberg
Copy link
Member

@ngoldbaum just to let you know, I'll let you decide on whether you want this. I am starting to think it is time to implement aget_sortfunction slot but that doesn't mean we can't do this in the mean-time as it's a pretty big speed advantage.

@ngoldbaum
Copy link
Member

I am starting to think it is time to implement a get_sortfunction slot

Agreed.@MaanasArora as you said this was a draft, would you be up for a bigger refactor? IMO this functionality deserves support in the new DType system without relying on the ArrFuncs baggage.

but that doesn't mean we can't do this in the mean-time as it's a pretty big speed advantage.

Also agreed if you don't want to take this further.

@MaanasArora
Copy link
ContributorAuthor

MaanasArora commentedMar 25, 2025
edited
Loading

Yes, agreed, and willing to do a larger refactor! I actually began by considering special casing array sorting for strings overall, but wondered what the preferred approach would be. I think the sorting routines are not very flexible and could use an overhaul.

PS just to clarify, adding a slot to the dtype will mean we will restructure the sorting to be more generic and allow replacing or extending compare etc.? I'll look further into this but would appreciate pointers.

@ngoldbaum
Copy link
Member

ngoldbaum commentedMar 25, 2025
edited
Loading

adding a slot to the dtype will mean we will restructure the sorting to be more generic and allow replacing or extending compare etc.?

Take a look atnumpy/_core/src/multiarray/dtypemeta.h - I'm talking about adding a new entry inNPY_DType_Slots that handles comparison. Adding a new slot takes some ceremony - there are some magic constants doing offsets on structs elsewhere in NumPy that need to be updated alongside any changes - but there are comments that should hopefully guide you along your way.

We already have entries for getitem and setitem as well as the legacy arrfuncs slots. You'd be migrating from usingNPY_DT_SLOTS(dtype)->f->compare to some new api likeNPY_DT_COMPARE(dtype) that uses its own slot and allows per-dtype setup for sorting.

@seberg
Copy link
Member

PS just to clarify, adding a slot to the dtype will mean we will restructure the sorting to be more generic and allow replacing or extending compare etc.? I'll look further into this but would appreciate pointers.

Yes, if you look around, you will find for exampleget_clear_loop and the way users can specify it. I think sorting would look similar.
(I.e. the core sort loop probably always works on a contiguous chunk of memory that is aligned -- that part may be simpler would have to check. I assume it would work in-place, but we will also need an argsort.)
Theget_sort_function -- or what we name it -- would then get the desired sort-kind passed in, that way we will also have an easier time of adding new sort methods in the future.
(We may want to provision for an ascending/descending flag even if we don't use it).


I see nathan has some other pointers, I don't expect mine to be quite enough, so please ask!

@MaanasArora
Copy link
ContributorAuthor

Thank you both, this was helpful! Starting to plan this now and will surely clarify if needed.

@MaanasArora
Copy link
ContributorAuthor

I've added the slots and done some patchy work around using it, and the stringdtype integration. Looking into how to better relate to the array funcs. WIP, but hopefully this is in the right direction!

ngoldbaum reacted with eyes emoji

@ngoldbaum
Copy link
Member

Yeah, I think you see we still have a lot of other functionality that we should have slots for. Definitely nonzero, at least.

@MaanasArora
Copy link
ContributorAuthor

Yes, nonzero and some other arrayfuncs could definitely use a slot!

Thank you for the guidance--I've completed most of the missing pieces I think. I assume we would deprecate some of the earlier uses more gradually, so I've fallen back on array funcs as defaults in some places. I think this should be ready for a first pass!

}

return 0;
}
Copy link
ContributorAuthor

@MaanasAroraMaanasAroraMar 27, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is probably a lot of somewhat redundant code, but I added it here as a 'test' use which provides a boilerplate to replace the routines with a more efficient special-case indirect sort in a future PR. As a "bonus" it allows us to at least temporarily do the allocation this PR was intended for

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nice! I'll let@seberg evaluate whether this API needs adjustments to fit in with the broader DType API but at a first glance it looks reasonable to me, especially if the common case with no specializations can be done with less boilerplate.

That said - there definitely are specialized string sorting implementations we could be using here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is good, the thing that I would like to change is thePyArray_(Arg)SortFunc itself, so that it gets a context and auxdata instead of the "array". And this function would getNpyAuxData *out_auxdata in also.
(Although, maybe for sorting this is less interesting as it is not as common to sort many arrays in one.)

The unfortunate thing is that you need to wrap the existing functions if you do this or have a second path for the old function.

I am also considering if we should have a return-2 or so to indicate that the sort-kind is not supported (no error set), to allow NumPy to fall back to a different one.
(But I am not sure we need it, it is useful only for somtehing like mergesort/stablesort, explicitly.)
@charris may have a thought on that.

Copy link
ContributorAuthor

@MaanasAroraMaanasAroraMar 27, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Looking into passing context now! It looks like a good idea, will try to implement. Thanks.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This change would be really nice; unfortunately both thePyArray_CompareFunc and the sort functions use thePyArrayObject right now, which we probably (!) shouldn't get from the context.

If I'm thinking right, we can define a new type such asPyArray_SortCompareFunc that uses the descr instead of the arrayand makenew sort functions that do not use the array somehow (as we can no longer interchange theSortCompareFunc and theCompareFunc!), but we would still probably need the old functions to use with the older compare functions; I think the duplication will be quite complicated.

At the same time, this would be a missed opportunity to have the newCompareFunc type if we do deprecate laterand want to go down this route...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

There are two possible approaches here:

  1. We deal with it in the sort function, and just have to different calls depending on whether it is an old or new sort function.
  2. You ignore the fact that an array is currently passed (effectively). We do that in some other places as well, due to how terrible it is.
    That is, we wrap it into a dummy object for which basically the only valid field isarr->descr (and maybearr->flags, don't recall). (Seeget_dummy_stack_array, yes this is terrible and even reviewers stumble over it, but...)
    If you do that, you can write a short function that wraps the old call into a function taking the new one.

I did the second for ufuncs (not sure if that was the easier!), so I suspect the first is likely the simplest here.


Allowing to set a compare function, seems like a nice idea (also to have a simple default).

It would be nice to move that into a default slot function. I.e. rather than setting it forStringDType here, auto-fill the slot with the function that tries to use theSortCompareFunc (If that slot is undefined, we can keep the slot filled withNULL).
That also removes the second check later.


About thisSortCompareFunc, it may make sense to keep it "light-weight" (i.e. a single function even if that may not be ideal if you have to inspect thedtype to do the comparisons -- for example structured has to do this).

But I would like to think about what we need to sort things like NaNs if possible. Unfortunately, I am not immediately sure, i.e.<=> in C++ can return a partial order, which means that:

  • For us probably an "error" is a valid return (right now we can't propagate errors!).
  • "unordered" is a valid return, although I am not sure how to deal with it. If we have "compare(a, b) == unordered" (i.e. one or both are NaN), we don't yet know how to swap them. That may be possible to resolve withcompare(b, b) andcompare(a, a).
    But the only way I am quite sure how to resolve possible reversed sorting, etc. might be to haveunordered_left,unordered_both,unordered_right.

Or we should keep it roughly as is, and accept that this function doesn't exist for all dtypes... A neat thing about having a clear order with "unordered" is that we could also use it from the comparison (u)funcs.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'll try the first approach I think! I think it will help isolate the new API in a way that makes deprecation easier later too. I will also do this default logic for both compare and sort compare funcs.

As for allowing partial order: yes that could break the symmetry, I suppose. But it could be useful to make sorting more precise in the long run too, and so "unordered" does seem a better way to allow for those kinds of extensions. And then we can use this machinery as the go-to for anywhere comparison decisions need to be made as you said.

npy_intp, int, PyArray_SortFunc **);
typedef int *(PyArrayDTypeMeta_GetArgSortFunction)(PyArray_Descr *,
npy_intp, int, PyArray_ArgSortFunc **);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

New stuff in the public API needs new API docs as well as a release note describing the new features.

Maybe also as a proof-of-concept, it looks like both quaddtype and mpfdtype in numpy-user-dtypes implement sorting - would you be willing to update them to use the new API in a PR to numpy-user-dtypes that depends on this PR to numpy? That should give you a feeling for whether this API is helpful for someone writing a new user dtype. It'll also be a form of documentation - we don't have great docs for writing user dtypes besides the examples in numpy-user-dtypes.

Copy link
Member

@ngoldbaumngoldbaumMar 27, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Also what should we do about the flags that got added before we made the dtype API public, e.g.NPY_DT_PyArray_ArrFuncs_compare? I guess we can deprecate them although I don't know how hard it would be to generate deprecation warnings if those are used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It's easy to generate a deprecation warning during registration (a bit tedious maybe, as you need explicit check).

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Sure, I'll add API docs and a release note, and willing to make a PR to numpy-user-dtypes! Will look into that.

Just to be clear,NPY_DT_PyArray_ArrFuncs_compare is still needed right? We can move it to a new slot rather than an arrayfunc but it's going to be different from the sort comparison for now if I'm thinking right (as it is user-facing rather than used in the sorting). Do we need to do this another way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We can't change slot numbers (unless they are guarded as private)! So the numbers are fixed (until they have not been used for a bit at least).
So yeah, I think we should keep it the old slot for now, maybe easier to make the deprecation a follow up.[^depr]

So, we just have to live with the numbering we got, I half thought I asked for an offset for theNPY_DT_PyArray_ArrFuncs slots, but maybe I didn't bother.
(It's not a big issue, the only thing is the convenience ifslot numbers == slot offset so you don't need to translate it.)

[^depr] I think this is as simple as asking users to compile with the new NumPy, and then addingPyArray_RUNTIME_VERSION, but this PR is complicated enough due to API decisions for the new loops, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

So, we just have to live with the numbering we got, I half thought I asked for an offset for the NPY_DT_PyArray_ArrFuncs slots, but maybe I didn't bother.

There is an offset,_NPY_DT_ARRFUNCS_OFFSET:

#defineNPY_DT_MAX_ARRFUNCS_SLOT \
NPY_NUM_DTYPE_PYARRAY_ARRFUNCS_SLOTS + _NPY_DT_ARRFUNCS_OFFSET

seberg reacted with thumbs up emoji
}

return 0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nice! I'll let@seberg evaluate whether this API needs adjustments to fit in with the broader DType API but at a first glance it looks reasonable to me, especially if the common case with no specializations can be done with less boilerplate.

That said - there definitely are specialized string sorting implementations we could be using here.

}

static inline PyArray_CompareFunc *
PyArray_SortCompare(PyArray_Descr *descr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'd call thisPyArray_GetSortCompareFunction

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, made this change. Thanks

@@ -44,7 +44,7 @@
* the below code implements this converted to an iteration and as an
* additional minor optimization skips the recursion depth checking on the
* smaller partition as it is always less than half of the remaining data and
* will thus terminate fast enough
* will thus terminate fast enough`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think this was added by mistake?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, sorry!

@MaanasAroraMaanasArora changed the titleENH: Allocate lock only once in StringDType quicksortENH, API: New sorting mechanism for DType APIMar 28, 2025
Copy link
Member

@ngoldbaumngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks for iterating so quickly :)

I think the new error path needs a little more thought, sorry for the back-and-forth.

sort-kind and order.

Additionally, the new `NPY_DT_sort_compare` slot can be used to provide a comparison function for
sorting, which will replace the default comparison function for the dtype in sorting functions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

maybe a note that the old arrfuncs slots may be deprecated in the future.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Added, thanks!

#define NPY_DT_sort_compare 11
#define NPY_DT_get_clear_loop 12
#define NPY_DT_get_fill_zero_loop 13
#define NPY_DT_finalize_descr 14
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

can you re-order these so the slots that were already in the struct keep their old values? I don't know offhand if changing this order is problematic but it seems more consistent to not change the old values even if it's fine.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, done!

Copy link
Member

@sebergseberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Sorry, long comments, and I realize this is becoming a lot more complex than it may have looked initially. But, I would really like to see the context passed in/a new signature.

That also probably includes filling in/returningARRAY_METHODFLAGS, even if the only useful flag is "requires GIL".

I also still tend to think it may make sense to have a magic return for "unsupported sort method", although should maybe ask Chuck once in a meeting about that.
(in principle I agree we usually just need stable and not-stable, but if we want users to be able to choose more precisely, I think it may make sense to allow us to fallback here. We could still use something like "no error indicated, butfunc == NULL for it even, but maybe a special return is easier.)


If needed, maybe we have to talk briefly about it synchronously? Or maybe just write the docs/signatures first that we want for the public API.

@@ -0,0 +1 @@
* `PyArray_GetSortFunction`, `PyArray_GetArgSortFunction`, and `PyArray_GetSortCompareFunction` have been added to the C-API. These functions return the sorting, argsorting, and sort comparison functions if provided for a given dtype in new slots.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

You did not actually add them to the public C-API. Which is totally fine, though.

(I might start with adding aSortBuffer() function or so.)

}

return 0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

There are two possible approaches here:

  1. We deal with it in the sort function, and just have to different calls depending on whether it is an old or new sort function.
  2. You ignore the fact that an array is currently passed (effectively). We do that in some other places as well, due to how terrible it is.
    That is, we wrap it into a dummy object for which basically the only valid field isarr->descr (and maybearr->flags, don't recall). (Seeget_dummy_stack_array, yes this is terrible and even reviewers stumble over it, but...)
    If you do that, you can write a short function that wraps the old call into a function taking the new one.

I did the second for ufuncs (not sure if that was the easier!), so I suspect the first is likely the simplest here.


Allowing to set a compare function, seems like a nice idea (also to have a simple default).

It would be nice to move that into a default slot function. I.e. rather than setting it forStringDType here, auto-fill the slot with the function that tries to use theSortCompareFunc (If that slot is undefined, we can keep the slot filled withNULL).
That also removes the second check later.


About thisSortCompareFunc, it may make sense to keep it "light-weight" (i.e. a single function even if that may not be ideal if you have to inspect thedtype to do the comparisons -- for example structured has to do this).

But I would like to think about what we need to sort things like NaNs if possible. Unfortunately, I am not immediately sure, i.e.<=> in C++ can return a partial order, which means that:

  • For us probably an "error" is a valid return (right now we can't propagate errors!).
  • "unordered" is a valid return, although I am not sure how to deal with it. If we have "compare(a, b) == unordered" (i.e. one or both are NaN), we don't yet know how to swap them. That may be possible to resolve withcompare(b, b) andcompare(a, a).
    But the only way I am quite sure how to resolve possible reversed sorting, etc. might be to haveunordered_left,unordered_both,unordered_right.

Or we should keep it roughly as is, and accept that this function doesn't exist for all dtypes... A neat thing about having a clear order with "unordered" is that we could also use it from the comparison (u)funcs.

NPY_SORTKIND which, int descending, PyArray_SortFunc **out_sort)
{
if (NPY_DT_SLOTS(NPY_DTYPE(descr))->get_sort_function == NULL) {
return -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This needs to set an error (TypeError or DTypeTypeError, which is defined somewhere I think.)

(An error here will make sense after you move the fallback logic into a default slot filling. Or you could have the default slot raise an error, that is also completely fine.)

@MaanasArora
Copy link
ContributorAuthor

No worries; thank you for the detailed feedback actually! It's nice to be able to iron out the direction for the API. I'll address the docs and public API changes and keep working away at the SortFunc changes. Happy to have a synchronous chat if it seems useful.

@@ -477,4 +481,18 @@ typedef PyArray_Descr *(PyArrayDTypeMeta_FinalizeDescriptor)(PyArray_Descr *dtyp
typedef int(PyArrayDTypeMeta_SetItem)(PyArray_Descr *, PyObject *, char *);
typedef PyObject *(PyArrayDTypeMeta_GetItem)(PyArray_Descr *, char *);

typedef int (PyArray_CompareFuncWithDescr)(const void *, const void *,
PyArray_Descr *);
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The naming is a bit weird here, but I didn't want to disturb the original type as it's used a lot. I think theSortCompareFunc should still be a unique type so will do that (even if only a clone of this type).

@MaanasArora
Copy link
ContributorAuthor

MaanasArora commentedApr 5, 2025
edited
Loading

Sorry for the bit of delay, I was thinking through this and essentially ended up with separating more the legacy sorting machinery from this API. This way, the new signatures can freely use context-related features and we do not have to create some sort of empty array or refactor a very large number of sorting-related files. Aside from how nice the feature is, I think this separation is actually a plus and even rolling back the stringdtype integration was worth it (in any case, user dtypes are not to define sorting with the internal, now legacy functions, so may be best to add a specialized routine).

We also have the newcompare slot which defaults tosort_compare and related features now, though they're not used yet. Hopefully this is in the right direction. If it is, we can gradually move the older sorting machinery to the context signatures, thus converging. Thank you!

@ngoldbaum
Copy link
Member

Sorry for not getting to this yet. I'm going to try to make sure to give this a once-over next week.

I think you can fix the test failures by rebasing?

@MaanasArora
Copy link
ContributorAuthor

No worries, I have some things to address as well.

Just rebased--sorry not sure if things went perfectly smoothly.

@MaanasArora
Copy link
ContributorAuthor

MaanasArora commentedApr 11, 2025
edited
Loading

Just brought this implementation with the new signature to parity with the previous one, including the usage in StringDType and ensuring use cases for the new and old sort functions are handled properly in the handlers. There is repetitive code but I guess we will phase out the legacy slots. Now we can make the context nicer if needed, and incorporate the auxdata!

@MaanasAroraMaanasAroraforce-pushed theenh/faster-string-sorting branch fromd7cd9ed to0e8b6a5CompareApril 11, 2025 21:29
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@sebergsebergseberg left review comments

@ngoldbaumngoldbaumngoldbaum left review comments

Assignees
No one assigned
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

ENH: np.unique and sorting is slow for StringDType
3 participants
@MaanasArora@seberg@ngoldbaum

[8]ページ先頭

©2009-2025 Movatter.jp