Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork10.9k
PERF: improve multithreaded ufunc scaling#27896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
ngoldbaum commentedDec 3, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Darn, this seems to break building on Windows:
If anyone happens to know what's causing that let me know, I'll try debugging on a Windows machine in the meantime... |
Uh oh!
There was an error while loading.Please reload this page.
if (res->buckets == NULL) { | ||
PyErr_NoMemory(); | ||
PyMem_Free(res); | ||
return NULL; | ||
} | ||
#ifdef Py_GIL_DISABLED | ||
res->mutex = new std::shared_mutex(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'm using a heap allocation because otherwise I'd need to convertufunc_object.c
- which includesnpy_hashtable.h
and sees the layout ofPyArrayIdentityHash
- to C++ and that seemed like too much to take on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Well... unfortunately, untilufunc_object.c
is convertedand all Python C-API facing functions are auto-wrapped in some exception handler, we need to catch exceptions here.
(We will also always have to do that for internal C/C++ API boundaries, i.e. ufunc inner loops calls, I guess. But we do it there, e.g. in the fft ones.)
Uh oh!
There was an error while loading.Please reload this page.
ngoldbaum commentedDec 3, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I think the Windows linker errors I ran into earlier were caused by missing |
/* Returns a borrowed ref of the second value in the matching info tuple */ | ||
PyObject * | ||
get_info_no_cast(PyUFuncObject *ufunc, PyArray_DTypeMeta *op_dtype, | ||
int ndtypes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
not totally clear to me if originally not including this function indispatching.h
and forward-declaring it here was an intentional decision for some reason
@charris do you think this is OK to backport to 2.1? IMO moving stuff to C++ like this feels risky for a bugfix release. |
Python multithreading is still in its early stages, so 2.2.0 is probably soon enough for those wanting to experiment with that feature, and it will come out at about the same time as 2.1.4. I would prefer to limit the 2.1 backports to clear bug fixes. |
Ah I see you added the backport-candidate label because it's a backport to 2.2. I was confused about the target branch. |
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I was curious and had a look, thinking it might be too much, but it is nice to see this actually is a small change! My review may not mean too much, but it seems all OK, and the only thing I have is a suggested name change...
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The_with_locking
is a much better name, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM, modulo the exception thing. I do wonder slightly whether allows re-entry from within the same thread?
Not sure that it matters, but I think it may be nice to have a recrusive mutex here. It doesn't seem implausible for a promotion function to call the promoter again to figure things out.
(But, it should also be rare enough that I don't care if we defer that to when someone runs into it.)
if (res->buckets == NULL) { | ||
PyErr_NoMemory(); | ||
PyMem_Free(res); | ||
return NULL; | ||
} | ||
#ifdef Py_GIL_DISABLED | ||
res->mutex = new std::shared_mutex(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Well... unfortunately, untilufunc_object.c
is convertedand all Python C-API facing functions are auto-wrapped in some exception handler, we need to catch exceptions here.
(We will also always have to do that for internal C/C++ API boundaries, i.e. ufunc inner loops calls, I guess. But we do it there, e.g. in the fft ones.)
@@ -163,7 +167,9 @@ check_and_adjust_axis(int *axis, int ndim) | |||
* <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52023>. | |||
* clang versions < 8.0.0 have the same bug. | |||
*/ | |||
#if (!defined __STDC_VERSION__ || __STDC_VERSION__ < 201112 \ | |||
#ifdef __cplusplus | |||
#define NPY_ALIGNOF(type) alignof(type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Wondering if we shouldn't just usealignof
, but let's go with this.
@@ -115,6 +122,9 @@ PyArrayIdentityHash_Dealloc(PyArrayIdentityHash *tb) | |||
{ | |||
PyMem_Free(tb->buckets); | |||
PyMem_Free(tb); | |||
#ifdef Py_GIL_DISABLED | |||
delete (std::shared_mutex *)tb->mutex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Seems OK to assume thatdelete
will never throw an exception.
"index %"NPY_INTP_FMT" is out of bounds " | ||
"for size %"NPY_INTP_FMT, *index, max_item); | ||
"index %"NPY_INTP_FMT" is out of bounds " | ||
"for size %"NPY_INTP_FMT, *index, max_item); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
We should just replace allNPY_INTP_FMT
with%zd
one of these days...
Unfortunately then we run into the issue of finding a portable implementation. My brief research indicates that there isn't a standardized shared recursive mutex because of the performance trade-offs, although the posix shared read-write lock is recursive. If someone does try to use this reentrantly, it will deadlock. Do you think that's problematic in practice? Thanks for the pointer about exceptions. I found |
Not right now. I could imagine it being meaningful, for example if you had a parametric dtype like
Yeah, that seems good, nice. (No idea if it would catch non memory errors, but I don't think those are possible). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks@ngoldbaum looks like this is ready.
7124091
intonumpy:mainUh oh!
There was an error while loading.Please reload this page.
* PERF: add a fast path to ufunc type resolution* MAINT: move dispatching.c to C++* MAINT: move npy_hashtable to C++ and use std::shared_mutex* MAINT: fix windows linking* MAINT: remove outdated comment* MAINT: only call try_promote on free-threaded buildConverts dispatching to cpp in order to use `std::shared_mutex` to improve free-threaded scaling.* MAINT: try to give new function a name indicating it uses a mutex* MAINT: only do complicated casting to get a mutex pointer once* MAINT: use std::nothrow to avoid dealing with exceptions* DOC: add changelog
PRnumpy#27896 was backported to NumPy 2.2.[skip ci]
return NULL; | ||
} | ||
return info; | ||
} | ||
#ifdef Py_GIL_DISABLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Sorry for the drive-by comment, butPy_GIL_DISABLED
means you're running a on free-threading enabled Python, not necessarily that the GIL is concretely disabled, right?
If you're callingmutex->lock
ormutex->lock_shared
with the GIL held, then there is a risk of deadlock if, for example,promote_and_get_info_and_ufuncimpl
released the GIL in another thread and is waiting to take the GIL back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
So you may want to wrap calls tomutex->lock[_shared]
in aPy_BEGIN_ALLOW_THREADS
/Py_END_ALLOW_THREADS
pair.
Uh oh!
There was an error while loading.Please reload this page.
Fixes#27786 and re-do of#27859.
Thanks to@seiko2plus for the suggestion to convert this code to C++ and use C++17 features. Let's see if any of the platforms we run CI on object to using this, it passes the tests on my Macbook Pro.
There is an earlier commit included in this PR that keeps things in C and uses the CPython-internal
_PyRWMutex
, which also seems to work, but using C++ lets us avoid the need to write a PEP or make an argument to the C API workgroup that making_PyRWMutex
public would be nice for C extensions.With the test script in the issue, I see the following scaling using
std::shared_mutex
: