ENH: Use openmp on x86-simd-sort to speed up np.sort and np.argsort #28619
Conversation
seiko2plus left a comment (edited)
Nice to see multi-threading support - well done. This PR should also include a release note and add documentation mentioning that sort operations now support multi-threading on x86, and that the number of threads can be controlled via the environment variable `OMP_NUM_THREADS`. Additionally, OpenMP flags should be disabled if the meson option `disable-threading` is enabled.
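As a quick illustration of the thread-count control mentioned above, here is a minimal sketch. It assumes a NumPy build with the OpenMP-enabled x86-simd-sort; on builds without OpenMP the variable is simply ignored and the result is identical.

```python
import os

# OMP_NUM_THREADS must be set before the OpenMP runtime initializes,
# i.e. before importing numpy (assuming an OpenMP-enabled build).
os.environ["OMP_NUM_THREADS"] = "4"

import numpy as np

rng = np.random.default_rng(42)
arr = rng.random(1_000_000)

# With OpenMP enabled, a large sort may use up to 4 threads;
# without it the call is single-threaded but bit-identical in result.
out = np.sort(arr)
assert (out[:-1] <= out[1:]).all()
```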
if omp.found()
  omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
endif
endif
Are we "all good" to use OpenMP in NumPy directly? I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues, etc. Maybe this is just for a custom local build rather than for activation in wheels.
If we are actually "ok" with that, I guess that in addition to the env variable Sayed mentioned there is also threadpoolctl, where one might need to modulate both with something like `with controller.limit(limits={"openblas": 2, "openmp": 4})`?
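For concreteness, the threadpoolctl pattern above could look roughly like the sketch below. threadpoolctl is an optional third-party package, so the sketch falls back to a plain sort when it is not installed; the limit values are illustrative.

```python
import numpy as np

try:
    from threadpoolctl import ThreadpoolController
    HAVE_TPCTL = True
except ImportError:  # threadpoolctl is an optional dependency
    HAVE_TPCTL = False

rng = np.random.default_rng(0)
arr = rng.random(1_000_000)

if HAVE_TPCTL:
    controller = ThreadpoolController()
    # Cap OpenBLAS at 2 threads and any OpenMP runtime at 4
    # for the duration of the context.
    with controller.limit(limits={"openblas": 2, "openmp": 4}):
        out = np.sort(arr)
else:
    out = np.sort(arr)

assert (out[:-1] <= out[1:]).all()
```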
Are we "all good" to use OpenMP in NumPy directly?
I am not familiar with the implications of using OpenMP and how it could potentially interact with other modules. I was hoping to get an answer to that via the pull request and everyone's input.
OpenBLAS usually manages OpenMP itself, which creates a bit of confusion when nesting it.
OpenBLAS now has:
https://github.com/OpenMathLib/OpenBLAS/blob/develop/driver/others/blas_server_callback.c
Which we can use if we have a central thread pool we want to re-use across multiple things?
@ogrisel I was wondering if you know anything about this (i.e. whether it is safe to use OpenMP in NumPy, or it would create issues).
Which we can use if we have a central thread pool we want to re-use across multiple things?
The libopenblas we ship inside our wheels is always built with pthreads, not openmp. Build scripts live at https://github.com/MacPython/openblas-libs/tree/main/tools.
scikit-learn does run into issues with running OpenMP's threadpool together with OpenBLAS: scikit-learn/scikit-learn#28883
There is a draft PR here to force OpenBLAS to use OpenMP (away from pthreads): scikit-learn/scikit-learn#29403
I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues
Yes, there are some quite ugly issues with multiple installed OpenMP libraries and segfaults depending on import order; see microsoft/LightGBM#6595
numpy/_core/tests/test_multiarray.py Outdated
@pytest.mark.parametrize("dtype", [np.float16, np.float32, np.float64])
def test_sort_largearrays(dtype):
    N = 1000000
    arr = np.random.rand(N)
Maybe we should pin the values down with modern `default_rng`?
One other thing I checked was how slow the test might be (do we want a `slow` marker for the large-array handling?). It didn't seem too bad (< 1 s for each case locally on an ARM Mac), though it is in the top 10 slowest for this module, for example.
I did pin down the values with a fixed seed.
r-devulap commented Apr 2, 2025

@rgommers pointed out this discussion in scipy that details the complications of using openmp: scipy/scipy#10239 (comment)
rgommers commented Apr 2, 2025

Also xref https://pypackaging-native.github.io/key-issues/native-dependencies/blas_openmp, which details a bunch of issues.
rgommers commented Apr 2, 2025

I suspect that if we start using OpenMP code, we should disable it in wheels and only let distro packagers enable it.
r-devulap commented Apr 9, 2025

@rgommers How does the interaction with PyTorch or scikit-learn work? Don't they vendor their own version of libgomp, which can potentially conflict with what the distro provides?
rgommers commented Apr 9, 2025

That is already the case - the answer for that one is: (a) please don't mix distro packages with wheels, and (b) … If you have PyTorch and scikit-learn both installed in the same environment and then imported, that usually works just fine, but it has given rise to a host of hard-to-debug issues in the past. Also …

You get the gist - this is a little painful. There's only one way to really do OpenMP right: build everything in a coherent fashion against the same OpenMP library. Distros usually get this right. With wheels you can't. And it gets worse when users do …

This PR looks nice and simple though, so if the performance gains are really large, maybe making it opt-in rather than use-if-detected could work.
rgommers commented Apr 9, 2025

For the design question of whether NumPy et al. should enable parallelism by default, please see https://thomasjpfan.github.io/parallelism-python-libraries-design/ for a good discussion. Cc @thomasjpfan for visibility.
r-devulap commented Apr 14, 2025

updated patch:
rgommers left a comment
Thanks for the update @r-devulap. The CI additions look fine to me; a few comments on build support.
numpy/_core/meson.build Outdated
if use_intel_sort and use_openmp
  omp = dependency('openmp', required: false)
  if omp.found()
    omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
Let's not have a separate variable for this. It's more idiomatic to use a dependency object:
- omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
+ omp_dep = declare_dependency(dependencies: omp, compile_args: ['-DXSS_USE_OPENMP'])
The -fopenmp flag shouldn't need to be added explicitly; it's already present in the omp dependency (e.g., see here).
Sounds good to me. I have also updated the dependency's `required` flag to be true. If someone is explicitly using `enable-openmp=true`, then it's reasonable to expect a build failure if OpenMP isn't available.
rgommers commented Apr 16, 2025

This will also need a release note in …
r-devulap commented Apr 16, 2025

Thanks for reviewing this! I have added two release notes: one for performance improvements and another highlighting general OpenMP build support.
r-devulap commented Apr 17, 2025

@thomasjpfan Thanks for the reference. Correct me if I am wrong, but from what I understand, the performance problem occurs when calling two functions back to back in a loop, where one uses pthreads and the other uses OpenMP to manage their respective threads. This causes resource contention when both functions try to use the available CPU cores with no visibility into what the other one is doing. Beyond standardizing the thread management library across the entire Python ecosystem, are there alternative approaches to solve this?
thomasjpfan commented Apr 17, 2025 (edited)
Yes, this is the underlying issue.
Standardizing the thread management layer is the only long term solution I see. The hard part is figuring out a way to standardize that works for most projects. Top of mind projects and how they multi-thread:
Some workarounds for threadpool specific issues:
r-devulap commented Apr 23, 2025

Adding the 2.3.0 label. Would like to have this in for that release, with or without the openmp support.
Pulls in 2 major changes:
(1) Fixes a performance regression on 16-bit dtype sorting (see numpy/x86-simd-sort#190)
(2) Adds openmp support for quicksort, which speeds up sorting arrays >100,000 elements by up to 3x. See: numpy/x86-simd-sort#179
Also adds a simple unit test to stress the openmp code paths
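A rough way to observe the effect locally is a simple timing sketch like the one below. This is not a rigorous benchmark: the reported ~3x speedup applies to x86 builds with OpenMP enabled and arrays above roughly 100,000 elements; on other builds this just times the single-threaded `np.sort`.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random(5_000_000).astype(np.float32)

t0 = time.perf_counter()
out = np.sort(arr)
elapsed = time.perf_counter() - t0

# On an OpenMP-enabled x86 build, try rerunning with
# OMP_NUM_THREADS=1 vs the default to compare wall times.
print(f"np.sort of 5e6 float32: {elapsed * 1e3:.1f} ms")
assert (out[:-1] <= out[1:]).all()
```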
r-devulap commented May 1, 2025

Rebased with main.
seiko2plus left a comment
LGTM, thank you! Massive performance gains for 16-bit sorting. Since OpenMP is disabled by default, I think it's fine to merge.
Merged commit 25d26e5 into numpy:main
charris commented May 14, 2025

Thanks @r-devulap.
gitboy16 commented May 20, 2025

Hi, thank you for the PR. Would it be possible to have wheels/packages with openmp enabled available somewhere, so that multithreaded sort can be used? Thank you!
rgommers commented May 20, 2025

That would be a lot of work, including vendoring …
Update x86-simd-sort module to latest to pull in 4 major changes:
- Fixes `np.argsort` perf regressions on sorted data (as reported in: BUG: Performance regression in argsort on sorted data #28714)

Benchmark numbers for `np.sort` on a TGL (sorting an array of 5 million numbers):
Benchmark numbers for `np.argsort` on a TGL (sorting an array of 500,000 numbers):