Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork11.1k
ENH, API: New sorting mechanism for DType API#28516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Running this script on both branches: importnumpyimportrandomimporttimeitprint(numpy.__version__)# '2.0.0rc2'options= ["a","bb","ccc","dddd"]lst=random.choices(options,k=1000)arr_s=numpy.fromiter(lst,dtype="T",count=len(lst))print(timeit.timeit(lambda:numpy.unique(arr_s),number=10000)) produces on master:
and on this branch:
|
@ngoldbaum just to let you know, I'll let you decide on whether you want this. I am starting to think it is time to implement a |
Agreed.@MaanasArora as you said this was a draft, would you be up for a bigger refactor? IMO this functionality deserves support in the new DType system without relying on the ArrFuncs baggage.
Also agreed if you don't want to take this further. |
MaanasArora commentedMar 25, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Yes, agreed, and willing to do a larger refactor! I actually began by considering special casing array sorting for strings overall, but wondered what the preferred approach would be. I think the sorting routines are not very flexible and could use an overhaul. PS just to clarify, adding a slot to the dtype will mean we will restructure the sorting to be more generic and allow replacing or extending compare etc.? I'll look further into this but would appreciate pointers. |
ngoldbaum commentedMar 25, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Take a look at We already have entries for getitem and setitem as well as the legacy arrfuncs slots. You'd be migrating from using |
Yes, if you look around, you will find for example I see nathan has some other pointers, I don't expect mine to be quite enough, so please ask! |
Thank you both, this was helpful! Starting to plan this now and will surely clarify if needed. |
I've added the slots and done some patchy work around using it, and the stringdtype integration. Looking into how to better relate to the array funcs. WIP, but hopefully this is in the right direction! |
Yeah, I think you see we still have a lot of other functionality that we should have slots for. Definitely nonzero, at least. |
Yes, nonzero and some other arrayfuncs could definitely use a slot! Thank you for the guidance--I've completed most of the missing pieces I think. I assume we would deprecate some of the earlier uses more gradually, so I've fallen back on array funcs as defaults in some places. I think this should be ready for a first pass! |
} | ||
return 0; | ||
} |
MaanasAroraMar 27, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This is probably a lot of somewhat redundant code, but I added it here as a 'test' use which provides a boilerplate to replace the routines with a more efficient special-case indirect sort in a future PR. As a "bonus" it allows us to at least temporarily do the allocation this PR was intended for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Nice! I'll let@seberg evaluate whether this API needs adjustments to fit in with the broader DType API but at a first glance it looks reasonable to me, especially if the common case with no specializations can be done with less boilerplate.
That said - there definitely are specialized string sorting implementations we could be using here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This is good, the thing that I would like to change is thePyArray_(Arg)SortFunc
itself, so that it gets a context and auxdata instead of the "array". And this function would getNpyAuxData *out_auxdata
in also.
(Although, maybe for sorting this is less interesting as it is not as common to sort many arrays in one.)
The unfortunate thing is that you need to wrap the existing functions if you do this or have a second path for the old function.
I am also considering if we should have a return-2
or so to indicate that the sort-kind is not supported (no error set), to allow NumPy to fall back to a different one.
(But I am not sure we need it, it is useful only for somtehing like mergesort/stablesort, explicitly.)
@charris may have a thought on that.
MaanasAroraMar 27, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Looking into passing context now! It looks like a good idea, will try to implement. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This change would be really nice; unfortunately both thePyArray_CompareFunc
and the sort functions use thePyArrayObject
right now, which we probably (!) shouldn't get from the context.
If I'm thinking right, we can define a new type such asPyArray_SortCompareFunc
that uses the descr instead of the arrayand makenew sort functions that do not use the array somehow (as we can no longer interchange theSortCompareFunc
and theCompareFunc
!), but we would still probably need the old functions to use with the older compare functions; I think the duplication will be quite complicated.
At the same time, this would be a missed opportunity to have the newCompareFunc
type if we do deprecate laterand want to go down this route...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
There are two possible approaches here:
- We deal with it in the sort function, and just have to different calls depending on whether it is an old or new sort function.
- You ignore the fact that an array is currently passed (effectively). We do that in some other places as well, due to how terrible it is.
That is, we wrap it into a dummy object for which basically the only valid field isarr->descr
(and maybearr->flags
, don't recall). (Seeget_dummy_stack_array
, yes this is terrible and even reviewers stumble over it, but...)
If you do that, you can write a short function that wraps the old call into a function taking the new one.
I did the second for ufuncs (not sure if that was the easier!), so I suspect the first is likely the simplest here.
Allowing to set a compare function, seems like a nice idea (also to have a simple default).
It would be nice to move that into a default slot function. I.e. rather than setting it forStringDType
here, auto-fill the slot with the function that tries to use theSortCompareFunc
(If that slot is undefined, we can keep the slot filled withNULL
).
That also removes the second check later.
About thisSortCompareFunc
, it may make sense to keep it "light-weight" (i.e. a single function even if that may not be ideal if you have to inspect thedtype
to do the comparisons -- for example structured has to do this).
But I would like to think about what we need to sort things like NaNs if possible. Unfortunately, I am not immediately sure, i.e.<=>
in C++ can return a partial order, which means that:
- For us probably an "error" is a valid return (right now we can't propagate errors!).
- "unordered" is a valid return, although I am not sure how to deal with it. If we have "compare(a, b) == unordered" (i.e. one or both are NaN), we don't yet know how to swap them. That may be possible to resolve with
compare(b, b)
andcompare(a, a)
.
But the only way I am quite sure how to resolve possible reversed sorting, etc. might be to haveunordered_left
,unordered_both
,unordered_right
.
Or we should keep it roughly as is, and accept that this function doesn't exist for all dtypes... A neat thing about having a clear order with "unordered" is that we could also use it from the comparison (u)funcs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'll try the first approach I think! I think it will help isolate the new API in a way that makes deprecation easier later too. I will also do this default logic for both compare and sort compare funcs.
As for allowing partial order: yes that could break the symmetry, I suppose. But it could be useful to make sorting more precise in the long run too, and so "unordered" does seem a better way to allow for those kinds of extensions. And then we can use this machinery as the go-to for anywhere comparison decisions need to be made as you said.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Just an update: after some thought, I think it might be quite nice to go withunordered_left
,unordered_both
,unordered_right
, mainly because it saves us any issues with reverse sorting down the line, might as well get everything in, especially as you mentioned dataframes and that's clearly a very important use case. Working on this! I'll try to draft an API that can easily fill in 'defaults' somehow, so that the user-facing side can be used at different levels of complexity / customization.
MaanasAroraJun 1, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I've pushed a draft for this! Includesreversed
(currently not functional, but working on that) andnan_position
added into the context.
I've also restructured the context / legacy wrappers and updated theSortCompareFunc
to use the sorting context instead of thedescr
; it should allow richer features down the line.
npy_intp, int, PyArray_SortFunc **); | ||
typedef int *(PyArrayDTypeMeta_GetArgSortFunction)(PyArray_Descr *, | ||
npy_intp, int, PyArray_ArgSortFunc **); | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
New stuff in the public API needs new API docs as well as a release note describing the new features.
Maybe also as a proof-of-concept, it looks like both quaddtype and mpfdtype in numpy-user-dtypes implement sorting - would you be willing to update them to use the new API in a PR to numpy-user-dtypes that depends on this PR to numpy? That should give you a feeling for whether this API is helpful for someone writing a new user dtype. It'll also be a form of documentation - we don't have great docs for writing user dtypes besides the examples in numpy-user-dtypes.
ngoldbaumMar 27, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Also what should we do about the flags that got added before we made the dtype API public, e.g.NPY_DT_PyArray_ArrFuncs_compare
? I guess we can deprecate them although I don't know how hard it would be to generate deprecation warnings if those are used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It's easy to generate a deprecation warning during registration (a bit tedious maybe, as you need explicit check).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Sure, I'll add API docs and a release note, and willing to make a PR to numpy-user-dtypes! Will look into that.
Just to be clear,NPY_DT_PyArray_ArrFuncs_compare
is still needed right? We can move it to a new slot rather than an arrayfunc but it's going to be different from the sort comparison for now if I'm thinking right (as it is user-facing rather than used in the sorting). Do we need to do this another way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
We can't change slot numbers (unless they are guarded as private)! So the numbers are fixed (until they have not been used for a bit at least).
So yeah, I think we should keep it the old slot for now, maybe easier to make the deprecation a follow up.[^depr]
So, we just have to live with the numbering we got, I half thought I asked for an offset for theNPY_DT_PyArray_ArrFuncs
slots, but maybe I didn't bother.
(It's not a big issue, the only thing is the convenience ifslot numbers == slot offset
so you don't need to translate it.)
[^depr] I think this is as simple as asking users to compile with the new NumPy, and then addingPyArray_RUNTIME_VERSION
, but this PR is complicated enough due to API decisions for the new loops, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
So, we just have to live with the numbering we got, I half thought I asked for an offset for the NPY_DT_PyArray_ArrFuncs slots, but maybe I didn't bother.
There is an offset,_NPY_DT_ARRFUNCS_OFFSET
:
numpy/numpy/_core/src/multiarray/dtypemeta.h
Lines 94 to 95 in9389862
#defineNPY_DT_MAX_ARRFUNCS_SLOT \ | |
NPY_NUM_DTYPE_PYARRAY_ARRFUNCS_SLOTS + _NPY_DT_ARRFUNCS_OFFSET |
Uh oh!
There was an error while loading.Please reload this page.
} | ||
return 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Nice! I'll let@seberg evaluate whether this API needs adjustments to fit in with the broader DType API but at a first glance it looks reasonable to me, especially if the common case with no specializations can be done with less boilerplate.
That said - there definitely are specialized string sorting implementations we could be using here.
} | ||
static inline PyArray_CompareFunc * | ||
PyArray_SortCompare(PyArray_Descr *descr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'd call thisPyArray_GetSortCompareFunction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Yes, made this change. Thanks
@@ -44,7 +44,7 @@ | |||
* the below code implements this converted to an iteration and as an | |||
* additional minor optimization skips the recursion depth checking on the | |||
* smaller partition as it is always less than half of the remaining data and | |||
* will thus terminate fast enough | |||
* will thus terminate fast enough` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think this was added by mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Yes, sorry!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks for iterating so quickly :)
I think the new error path needs a little more thought, sorry for the back-and-forth.
sort-kind and order. | ||
Additionally, the new `NPY_DT_sort_compare` slot can be used to provide a comparison function for | ||
sorting, which will replace the default comparison function for the dtype in sorting functions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
maybe a note that the old arrfuncs slots may be deprecated in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Added, thanks!
#define NPY_DT_sort_compare 11 | ||
#define NPY_DT_get_clear_loop 12 | ||
#define NPY_DT_get_fill_zero_loop 13 | ||
#define NPY_DT_finalize_descr 14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
can you re-order these so the slots that were already in the struct keep their old values? I don't know offhand if changing this order is problematic but it seems more consistent to not change the old values even if it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Yes, done!
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Sorry, long comments, and I realize this is becoming a lot more complex than it may have looked initially. But, I would really like to see the context passed in/a new signature.
That also probably includes filling in/returningARRAY_METHODFLAGS
, even if the only useful flag is "requires GIL".
I also still tend to think it may make sense to have a magic return for "unsupported sort method", although should maybe ask Chuck once in a meeting about that.
(in principle I agree we usually just need stable and not-stable, but if we want users to be able to choose more precisely, I think it may make sense to allow us to fallback here. We could still use something like "no error indicated, butfunc == NULL
for it even, but maybe a special return is easier.)
If needed, maybe we have to talk briefly about it synchronously? Or maybe just write the docs/signatures first that we want for the public API.
@@ -0,0 +1 @@ | |||
* `PyArray_GetSortFunction`, `PyArray_GetArgSortFunction`, and `PyArray_GetSortCompareFunction` have been added to the C-API. These functions return the sorting, argsorting, and sort comparison functions if provided for a given dtype in new slots. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
You did not actually add them to the public C-API. Which is totally fine, though.
(I might start with adding aSortBuffer()
function or so.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This comment is still valid, I think we might as well remove this for now.
} | ||
return 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
There are two possible approaches here:
- We deal with it in the sort function, and just have to different calls depending on whether it is an old or new sort function.
- You ignore the fact that an array is currently passed (effectively). We do that in some other places as well, due to how terrible it is.
That is, we wrap it into a dummy object for which basically the only valid field isarr->descr
(and maybearr->flags
, don't recall). (Seeget_dummy_stack_array
, yes this is terrible and even reviewers stumble over it, but...)
If you do that, you can write a short function that wraps the old call into a function taking the new one.
I did the second for ufuncs (not sure if that was the easier!), so I suspect the first is likely the simplest here.
Allowing to set a compare function, seems like a nice idea (also to have a simple default).
It would be nice to move that into a default slot function. I.e. rather than setting it forStringDType
here, auto-fill the slot with the function that tries to use theSortCompareFunc
(If that slot is undefined, we can keep the slot filled withNULL
).
That also removes the second check later.
About thisSortCompareFunc
, it may make sense to keep it "light-weight" (i.e. a single function even if that may not be ideal if you have to inspect thedtype
to do the comparisons -- for example structured has to do this).
But I would like to think about what we need to sort things like NaNs if possible. Unfortunately, I am not immediately sure, i.e.<=>
in C++ can return a partial order, which means that:
- For us probably an "error" is a valid return (right now we can't propagate errors!).
- "unordered" is a valid return, although I am not sure how to deal with it. If we have "compare(a, b) == unordered" (i.e. one or both are NaN), we don't yet know how to swap them. That may be possible to resolve with
compare(b, b)
andcompare(a, a)
.
But the only way I am quite sure how to resolve possible reversed sorting, etc. might be to haveunordered_left
,unordered_both
,unordered_right
.
Or we should keep it roughly as is, and accept that this function doesn't exist for all dtypes... A neat thing about having a clear order with "unordered" is that we could also use it from the comparison (u)funcs.
NPY_SORTKIND which, int descending, PyArray_SortFunc **out_sort) | ||
{ | ||
if (NPY_DT_SLOTS(NPY_DTYPE(descr))->get_sort_function == NULL) { | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This needs to set an error (TypeError or DTypeTypeError, which is defined somewhere I think.)
(An error here will make sense after you move the fallback logic into a default slot filling. Or you could have the default slot raise an error, that is also completely fine.)
No worries; thank you for the detailed feedback actually! It's nice to be able to iron out the direction for the API. I'll address the docs and public API changes and keep working away at the SortFunc changes. Happy to have a synchronous chat if it seems useful. |
@@ -477,4 +481,18 @@ typedef PyArray_Descr *(PyArrayDTypeMeta_FinalizeDescriptor)(PyArray_Descr *dtyp | |||
typedef int(PyArrayDTypeMeta_SetItem)(PyArray_Descr *, PyObject *, char *); | |||
typedef PyObject *(PyArrayDTypeMeta_GetItem)(PyArray_Descr *, char *); | |||
typedef int (PyArray_CompareFuncWithDescr)(const void *, const void *, | |||
PyArray_Descr *); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The naming is a bit weird here, but I didn't want to disturb the original type as it's used a lot. I think theSortCompareFunc
should still be a unique type so will do that (even if only a clone of this type).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I have slightly mixed feelings. On the one hand, I think this is the pragmatic thing to have.
On the other hand, we could also look this function from thenp.less_than
ornp.great_than
ufunc to implement sorting, I think.
(The problem there is still how to deal with unordered elements, a compare ufunc would work better...)
But, on the other hand, it seems pragmatic even if it won't work well e.g. for structured dtypes (performance issues), it will always work and provides an easy entry-point (we can also use this to define default comparison ufuncs).
So overall, I think I end up at just doing this, although I could imaging punting if we don't need it forStringDType
(I suspect we do, though).
Would like to hear if@ngoldbaum has an opinion.
(A neater future path would also be if this was more of a header-only code binding generator job with us making the sorting patterns available maybe. I.e. if this was defined in a C++ class and our sort code available, the DType could compile the full loop and avoid calling such a helper everywhere.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
IMO this is fine, if only because it exists right now 😄
MaanasArora commentedApr 5, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Sorry for the bit of delay, I was thinking through this and essentially ended up with separating more the legacy sorting machinery from this API. This way, the new signatures can freely use context-related features and we do not have to create some sort of empty array or refactor a very large number of sorting-related files. Aside from how nice the feature is, I think this separation is actually a plus and even rolling back the stringdtype integration was worth it (in any case, user dtypes are not to define sorting with the internal, now legacy functions, so may be best to add a specialized routine). We also have the new |
Sorry for not getting to this yet. I'm going to try to make sure to give this a once-over next week. I think you can fix the test failures by rebasing? |
No worries, I have some things to address as well. Just rebased--sorry not sure if things went perfectly smoothly. |
MaanasArora commentedApr 11, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Just brought this implementation with the new signature to parity with the previous one, including the usage in StringDType and ensuring use cases for the new and old sort functions are handled properly in the handlers. There is repetitive code but I guess we will phase out the legacy slots. Now we can make the context nicer if needed, and incorporate the auxdata! |
d7cd9ed
to0e8b6a5
CompareMaanasArora commentedMay 8, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Hello! Getting back to this, is there anything I need to address? Thinking of adding the functions to the public C-API if things look fine. Would we need to create a new C-API version (regenerate the hashes and such), and I guess it would come under 2.4, given how close 2.3 is? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
A few comments, yeah, this won't make 2.3, sorry. I think it might be good to discuss a bit in depth with@ngoldbaum some time (not next week, sorry).
Another thing that I would like addressed/discussed is the problem of reverse sorting.
I do think we need at least areverse=True
, Ithink it might make sense to also provision for anan_position
(if nan goes first or last).
(NULL/NA ordering is very important in dataframe world, and I am tempted to include this, even if we say that the value for now isalways "last").
doc/source/reference/c-api/array.rst Outdated
@@ -1873,6 +1873,29 @@ described below. | |||
pointer. Currently this is used for zero-filling and clearing arrays storing | |||
embedded references. | |||
.. c:type:: int (PyArray_SortFunc)( \ | |||
void *start, npy_intp num, PyArrayMethod_Context *context, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Let's move the context to the first spot just for similarity. I think I added a context for traversal functions, I am not sure that was smart, but since we have it, it may be a slightly better fit.
I might callstart
,data
(not that it matters).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This is done!
doc/source/reference/c-api/array.rst Outdated
NpyAuxData *auxdata, NpyAuxData **out_auxdata) | ||
A function to sort a buffer of data. The *start* is a pointer to the | ||
beginning of the buffer containing *num* elements. A function of this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
beginning of thebuffer containing *num* elements. A function of this | |
beginning of thecontiguous containing *num* elements. A function of this |
It also should be aligned, but I have to think whether we should allow indicating support for unaligned data here. (Which would require flags, for ufuncs "supports unaligned" is flagged beforeget_loop()
, although since here we always do contiguous, flagging it insideget_loop()
is OK also -- that is, becuaseget_loop()
is not passed any strides).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Makes sense! Committed (modulo typo).
NpyAuxData **out_auxdata) | ||
A function to arg-sort a buffer of data. The *start* is a pointer to the | ||
beginning of the buffer containing *num* elements. The *tosort* is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
@charris to confirm, even for argsorting it probably makes sense to always use a contiguous buffer for sorting?
doc/source/reference/c-api/array.rst Outdated
PyArrayMethod_Context *context, NpyAuxData *auxdata, \ | ||
NpyAuxData **out_auxdata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
PyArrayMethod_Context *context, NpyAuxData *auxdata, \ | |
NpyAuxData **out_auxdata) | |
PyArrayMethod_Context *context, NpyAuxData *auxdata) |
Theout_auxdata
belongs on theget_loop
function!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This is done, thank you :)
doc/source/reference/c-api/array.rst Outdated
.. c:macro:: NPY_DT_get_sort_function | ||
.. c:type:: int *(PyArrayDTypeMeta_GetSortFunction)(PyArray_Descr *, \ | ||
npy_intp sort_kind, int descending, PyArray_SortFunc **out_sort); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This needs*out_flags
, since we need the ability to indicate whether the GIL is required (I think we can ignore FPEs), but who knows if we'll have a reason for other flags eventually.
It also needs**out_auxdata
, since auxdata needs to come from somewhere :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Done, thanks!
@@ -1570,20 +1592,41 @@ PyArray_Sort(PyArrayObject *op, int axis, NPY_SORTKIND which) | |||
return -1; | |||
} | |||
sort = PyDataType_GetArrFuncs(PyArray_DESCR(op))->sort[which]; | |||
PyArray_GetSortFunction(PyArray_DESCR(op),which, 0, &sort); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Hmmm, let's just use< 0
to decide if it's an error. In which casesort != NULL
is assumed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Makes sense, this is done!
@@ -477,4 +481,18 @@ typedef PyArray_Descr *(PyArrayDTypeMeta_FinalizeDescriptor)(PyArray_Descr *dtyp | |||
typedef int(PyArrayDTypeMeta_SetItem)(PyArray_Descr *, PyObject *, char *); | |||
typedef PyObject *(PyArrayDTypeMeta_GetItem)(PyArray_Descr *, char *); | |||
typedef int (PyArray_CompareFuncWithDescr)(const void *, const void *, | |||
PyArray_Descr *); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I have slightly mixed feelings. On the one hand, I think this is the pragmatic thing to have.
On the other hand, we could also look this function from thenp.less_than
ornp.great_than
ufunc to implement sorting, I think.
(The problem there is still how to deal with unordered elements, a compare ufunc would work better...)
But, on the other hand, it seems pragmatic even if it won't work well e.g. for structured dtypes (performance issues), it will always work and provides an easy entry-point (we can also use this to define default comparison ufuncs).
So overall, I think I end up at just doing this, although I could imaging punting if we don't need it forStringDType
(I suspect we do, though).
Would like to hear if@ngoldbaum has an opinion.
(A neater future path would also be if this was more of a header-only code binding generator job with us making the sorting patterns available maybe. I.e. if this was defined in a C++ class and our sort code available, the DType could compile the full loop and avoid calling such a helper everywhere.)
MaanasArora commentedMay 10, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Thanks for the comments! And no worries, I was just making sure I wasn't missing something to do :) I think I need to think a bit more about the best way to adjust this for the extra features you mentioned, yes. Unordered elements is definitely something to consider at this stage, so I might try to draft something for that soon enough; that should hopefully create a clearer story around these features! |
39f5ef2
to237b7f0
CompareI want to call your attention to this suggestion:#28516 (comment). Did you ever take a look at numpy-user-dtypes? A worked example would help. |
Yes, sorry, I took a look actually--but was having some trouble with installing the dtype packages over the editable install of numpy. I'll push my draft anyway and try to address that. Thanks for the reminder. |
… with unordered returns, and restructure legacy wrappers
3570d70
to30b6c9c
CompareThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Sorry, forgot to submit the review. Please disregard if things are just odd/outdated!
doc/source/reference/c-api/array.rst Outdated
@@ -3588,15 +3636,19 @@ DType API slots but for now we have exposed the legacy | |||
.. c:macro:: NPY_DT_PyArray_ArrFuncs_sort | |||
An array ofPyArray_SortFunc of length ``NPY_NSORTS``. If set, allows | |||
An array ofPyArray_SortFuncWithContext of length ``NPY_NSORTS``. If set, allows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This seems wrong, should be unchanged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Yes, that's wrong, thanks for catching
doc/source/reference/c-api/array.rst Outdated
.. c:macro:: NPY_DT_PyArray_ArrFuncs_argsort | ||
An array ofPyArray_ArgSortFunc of length ``NPY_NSORTS``. If set, | ||
An array ofPyArray_ArgSortFuncWithContext of length ``NPY_NSORTS``. If set, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Same as above, should remain unchanged.
doc/source/reference/c-api/array.rst Outdated
An enum used to indicate the position of NaN values in sorting. | ||
.. c:enumerator:: NPY_SORT_NAN_FIRST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I would prefer_TO_END
and_TO_START
or_LAST
/_FIRST
or_AT_END
/_AT_START
.
I.e. make it sort order independent. first/last to me seems slightly less clear whether it is sort order dependent or not (i.e. does "first" mean that it's value is considered lower than others?).
(The problem is, that in practice both conventions are relatively common.)
MaanasAroraJul 3, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I've used_TO_END
and_TO_START
; it does make more sense, thanks!
PyArray_SortFunc *sort[NPY_NSORTS]; | ||
PyArray_ArgSortFunc *argsort[NPY_NSORTS]; | ||
PyArray_SortFuncWithContext *sort[NPY_NSORTS]; | ||
PyArray_ArgSortFuncWithContext *argsort[NPY_NSORTS]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
You really can't change this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Indeed, reverted!
.. c:member:: NPY_SORT_NAN_POSITION nan_position | ||
The position of NaN values in the sort order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
OK, makes sense to pass these as information, and I guess that means it makes sense to use a custom context and not try to rely on an existing one (the ufunc/casting one is more complex, and the traverse one too simple maybe.)
Strictly speaking, we could avoid the sortcompare slot here, but I think it's maybe pragmatic to just include it and leave it NULL.
} | ||
/* This should never happen, but just in case */ | ||
PyErr_SetString(PyExc_RuntimeError, "Unexpected comparison result in sort function"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This can't happen I guess, but if we set an error should probably grab the GIL.
Actually... we should have an error indicator on the enum anyway. And it's OK to just use the same error return for invalid values I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The old compare can't technically reutrn an error code. Should we just useNPY_MIN_INT
to indicate error?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Sure, this is done, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Just commenting before I forget: this is a change that we should add a release note for explicitly, I think. (I don't think anyone will see it, but it's a real C-API change in theory.)
(Didn't check, but also good to add a.. versionchanged
to the docs maybe, even if niche.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Where would the.. versionchanged
be, sorry? The old types haven't really changed AFAIK?
I've added a release note and also some missing documentation--looking to see if anything is missing. Thank you
*/ | ||
static inline int | ||
compare_from_context(const void *a, const void *b, void *context) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Could investigate templating, but seems also OK not to for now.
If defined, sets a custom comparison function for the DType for use in | ||
sorting, which will replace `NPY_DT_PyArray_ArrFuncs_compare`. Implements | ||
``PyArray_CompareFunc``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Add a.. versionadded::
here. We don't have to strictly hide it, since it'll at least fail gracefully on older versions (at least I hope).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Done, thanks
NPY_DT_SLOTS(DType)->get_sort_function = NULL; | ||
NPY_DT_SLOTS(DType)->get_argsort_function = NULL; | ||
NPY_DT_SLOTS(DType)->compare = NULL; | ||
NPY_DT_SLOTS(DType)->sort_compare = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
must come at the end to not break ABI!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Yes, this is done!
@@ -0,0 +1 @@ | |||
* `PyArray_GetSortFunction`, `PyArray_GetArgSortFunction`, and `PyArray_GetSortCompareFunction` have been added to the C-API. These functions return the sorting, argsorting, and sort comparison functions if provided for a given dtype in new slots. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This comment is still valid, I think we might as well remove this for now.
Uh oh!
There was an error while loading.Please reload this page.
Resolves#26510.
Allocates the lock for the StringDType array before sort and releases after.
I noticed the sorting algorithms independently get the compare function from the descriptor, so I have created new helper functions in
stringdtype/dtype.c
but not sure if that's the right place. Changes have been made only inquicksort.cpp
(but will add others later), so this is a draft but would appreciate feedback.Posting simple benchmarks in a comment below. Thank you for reviewing!