Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

PERF: improve multithreaded ufunc scaling#27896

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
seberg merged 10 commits intonumpy:mainfromngoldbaum:fix-multithreaded-scaling
Dec 5, 2024

Conversation

ngoldbaum
Copy link
Member

@ngoldbaumngoldbaum commentedDec 3, 2024
edited
Loading

Fixes#27786 and re-do of#27859.

Thanks to@seiko2plus for the suggestion to convert this code to C++ and use C++17 features. Let's see if any of the platforms we run CI on object to using this, it passes the tests on my Macbook Pro.

There is an earlier commit included in this PR that keeps things in C and uses the CPython-internal_PyRWMutex, which also seems to work, but using C++ lets us avoid the need to write a PEP or make an argument to the C API workgroup that making_PyRWMutex public would be nice for C extensions.

With the test script in the issue, I see the following scaling usingstd::shared_mutex:

mflops_array_length_1000

@ngoldbaum
Copy link
MemberAuthor

ngoldbaum commentedDec 3, 2024
edited
Loading

Darn, this seems to break building on Windows:

FAILED: numpy/_core/_multiarray_umath.cp311-win_amd64.pyd numpy/_core/_multiarray_umath.cp311-win_amd64.pdb "link" @numpy/_core/_multiarray_umath.cp311-win_amd64.pyd.rsp   Creating library numpy\_core\_multiarray_umath.cp311-win_amd64.lib and object numpy\_core\_multiarray_umath.cp311-win_amd64.expsrc_umath_dispatching.cpp.obj : error LNK2001: unresolved external symbol "struct _typeobject PyArrayDTypeMeta_Type" (?PyArrayDTypeMeta_Type@@3U_typeobject@@A)  Hint on symbols that are defined and could potentially match:    PyArrayDTypeMeta_Typesrc_umath_dispatching.cpp.obj : error LNK2001: unresolved external symbol "struct npy_static_pydata_struct npy_static_pydata" (?npy_static_pydata@@3Unpy_static_pydata_struct@@A)  Hint on symbols that are defined and could potentially match:    npy_static_pydatanumpy\_core\_multiarray_umath.cp311-win_amd64.pyd : fatal error LNK1120: 2 unresolved externals

If anyone happens to know what's causing that let me know, I'll try debugging on a Windows machine in the meantime...

if (res->buckets == NULL) {
PyErr_NoMemory();
PyMem_Free(res);
return NULL;
}

#ifdef Py_GIL_DISABLED
res->mutex = new std::shared_mutex();
Copy link
MemberAuthor

@ngoldbaumngoldbaumDec 3, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm using a heap allocation because otherwise I'd need to convertufunc_object.c - which includesnpy_hashtable.h and sees the layout ofPyArrayIdentityHash - to C++ and that seemed like too much to take on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Well... unfortunately, untilufunc_object.c is convertedand all Python C-API facing functions are auto-wrapped in some exception handler, we need to catch exceptions here.

(We will also always have to do that for internal C/C++ API boundaries, i.e. ufunc inner loops calls, I guess. But we do it there, e.g. in the fft ones.)

@ngoldbaumngoldbaum added the 39 - free-threadingPRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703) labelDec 3, 2024
@ngoldbaumngoldbaum added this to the2.2.0 release milestoneDec 3, 2024
@ngoldbaum
Copy link
MemberAuthor

ngoldbaum commentedDec 3, 2024
edited
Loading

If anyone happens to know what's causing that let me know

I think the Windows linker errors I ran into earlier were caused by missingextern "C" blocks in some headers, looks like it's fixed now.

@ngoldbaumngoldbaum added the 56 - Needs Release Note.Needs an entry in doc/release/upcoming_changes labelDec 3, 2024
/* Returns a borrowed ref of the second value in the matching info tuple */
PyObject *
get_info_no_cast(PyUFuncObject *ufunc, PyArray_DTypeMeta *op_dtype,
int ndtypes);
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

not totally clear to me if originally not including this function indispatching.h and forward-declaring it here was an intentional decision for some reason

@charrischarris added the 09 - Backport-CandidatePRs tagged should be backported labelDec 3, 2024
@ngoldbaum
Copy link
MemberAuthor

@charris do you think this is OK to backport to 2.1? IMO moving stuff to C++ like this feels risky for a bugfix release.

@charris
Copy link
Member

do you think this is OK to backport to 2.1?

Python multithreading is still in its early stages, so 2.2.0 is probably soon enough for those wanting to experiment with that feature, and it will come out at about the same time as 2.1.4. I would prefer to limit the 2.1 backports to clear bug fixes.

seberg reacted with thumbs up emoji

@ngoldbaum
Copy link
MemberAuthor

Ah I see you added the backport-candidate label because it's a backport to 2.2. I was confused about the target branch.

Copy link
Contributor

@mhvkmhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I was curious and had a look, thinking it might be too much, but it is nice to see this actually is a small change! My review may not mean too much, but it seems all OK, and the only thing I have is a suggested name change...

Copy link
Contributor

@mhvkmhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The_with_locking is a much better name, thanks!

Copy link
Member

@sebergseberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM, modulo the exception thing. I do wonder slightly whether allows re-entry from within the same thread?

Not sure that it matters, but I think it may be nice to have a recrusive mutex here. It doesn't seem implausible for a promotion function to call the promoter again to figure things out.
(But, it should also be rare enough that I don't care if we defer that to when someone runs into it.)

if (res->buckets == NULL) {
PyErr_NoMemory();
PyMem_Free(res);
return NULL;
}

#ifdef Py_GIL_DISABLED
res->mutex = new std::shared_mutex();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Well... unfortunately, untilufunc_object.c is convertedand all Python C-API facing functions are auto-wrapped in some exception handler, we need to catch exceptions here.

(We will also always have to do that for internal C/C++ API boundaries, i.e. ufunc inner loops calls, I guess. But we do it there, e.g. in the fft ones.)

@@ -163,7 +167,9 @@ check_and_adjust_axis(int *axis, int ndim)
* <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52023>.
* clang versions < 8.0.0 have the same bug.
*/
#if (!defined __STDC_VERSION__ || __STDC_VERSION__ < 201112 \
#ifdef __cplusplus
#define NPY_ALIGNOF(type) alignof(type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Wondering if we shouldn't just usealignof, but let's go with this.

@@ -115,6 +122,9 @@ PyArrayIdentityHash_Dealloc(PyArrayIdentityHash *tb)
{
PyMem_Free(tb->buckets);
PyMem_Free(tb);
#ifdef Py_GIL_DISABLED
delete (std::shared_mutex *)tb->mutex;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Seems OK to assume thatdelete will never throw an exception.

"index %"NPY_INTP_FMT" is out of bounds "
"for size %"NPY_INTP_FMT, *index, max_item);
"index %"NPY_INTP_FMT" is out of bounds "
"for size %"NPY_INTP_FMT, *index, max_item);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We should just replace allNPY_INTP_FMT with%zd one of these days...

@ngoldbaum
Copy link
MemberAuthor

Not sure that it matters, but I think it may be nice to have a recrusive mutex here.

Unfortunately then we run into the issue of finding a portable implementation. My brief research indicates that there isn't a standardized shared recursive mutex because of the performance trade-offs, although the posix shared read-write lock is recursive.

If someone does try to use this reentrantly, it will deadlock. Do you think that's problematic in practice?

Thanks for the pointer about exceptions. I foundstd::nothrow which seems to be doing what you wanted with a little less boilerplate - is that OK? Do I need to worry about exceptions elsewhere in NumPy?

@ngoldbaumngoldbaum removed the 56 - Needs Release Note.Needs an entry in doc/release/upcoming_changes labelDec 4, 2024
@seberg
Copy link
Member

Do you think that's problematic in practice?

Not right now. I could imagine it being meaningful, for example if you had a parametric dtype likecategorical[dtype]. Such a dtype could wish to callufunc.resolve_dtypes to figure out the right resultcategorical[dtype].

which seems to be doing what you wanted

Yeah, that seems good, nice. (No idea if it would catch non memory errors, but I don't think those are possible).

ngoldbaum reacted with thumbs up emoji

Copy link
Member

@sebergseberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks@ngoldbaum looks like this is ready.

@sebergseberg merged commit7124091 intonumpy:mainDec 5, 2024
69 checks passed
charris pushed a commit to charris/numpy that referenced this pull requestDec 5, 2024
* PERF: add a fast path to ufunc type resolution* MAINT: move dispatching.c to C++* MAINT: move npy_hashtable to C++ and use std::shared_mutex* MAINT: fix windows linking* MAINT: remove outdated comment* MAINT: only call try_promote on free-threaded buildConverts dispatching to cpp in order to use `std::shared_mutex` to improve free-threaded scaling.* MAINT: try to give new function a name indicating it uses a mutex* MAINT: only do complicated casting to get a mutex pointer once* MAINT: use std::nothrow to avoid dealing with exceptions* DOC: add changelog
@charrischarris removed the 09 - Backport-CandidatePRs tagged should be backported labelDec 5, 2024
charris added a commit to charris/numpy that referenced this pull requestDec 5, 2024
PRnumpy#27896 was backported to NumPy 2.2.[skip ci]
@charrischarris mentioned this pull requestDec 5, 2024
return NULL;
}
return info;
}

#ifdef Py_GIL_DISABLED
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Sorry for the drive-by comment, butPy_GIL_DISABLED means you're running a on free-threading enabled Python, not necessarily that the GIL is concretely disabled, right?

If you're callingmutex->lock ormutex->lock_shared with the GIL held, then there is a risk of deadlock if, for example,promote_and_get_info_and_ufuncimpl released the GIL in another thread and is waiting to take the GIL back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

So you may want to wrap calls tomutex->lock[_shared] in aPy_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS pair.

ngoldbaum reacted with heart emoji
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@pitroupitroupitrou left review comments

@sebergsebergseberg approved these changes

@mhvkmhvkmhvk approved these changes

Assignees
No one assigned
Labels
03 - Maintenance39 - free-threadingPRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703)
Projects
None yet
Milestone
2.2.0 release
Development

Successfully merging this pull request may close these issues.

BUG: poor multithreaded performance scaling for small arrays (nogil)
5 participants
@ngoldbaum@charris@seberg@pitrou@mhvk

[8]ページ先頭

©2009-2025 Movatter.jp