ENH: Use openmp on x86-simd-sort to speed up np.sort and np.argsort #28619
Conversation
seiko2plus left a comment (edited)
Nice to see multi-threading support - well done. This PR should also include a release note and add documentation mentioning that sort operations now support multi-threading on x86, and that the number of threads can be controlled via the environment variable `OMP_NUM_THREADS`. Additionally, OpenMP flags should be disabled if the meson option `disable-threading` is enabled.
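As a quick illustration of the thread-count control mentioned above, here is a minimal sketch. It assumes a NumPy build with the OpenMP-enabled x86-simd-sort; on builds without OpenMP the variable is simply ignored and the result is identical.

```python
import os

# OMP_NUM_THREADS must be set before the OpenMP runtime initializes,
# i.e. before importing numpy (assuming an OpenMP-enabled build).
os.environ["OMP_NUM_THREADS"] = "4"

import numpy as np

rng = np.random.default_rng(42)
arr = rng.random(1_000_000)

# With OpenMP enabled, a large sort may use up to 4 threads;
# without it the call is single-threaded but bit-identical in result.
out = np.sort(arr)
assert (out[:-1] <= out[1:]).all()
```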
if omp.found()
  omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
endif
endif
Are we "all good" to use OpenMP in NumPy directly? I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues, etc. Maybe this is just for a custom local build rather than for activation in wheels.
If we are actually "ok" with that, I guess that in addition to the env variable Sayed mentioned there is also threadpoolctl, where one might need to modulate both with something like `with controller.limit(limits={"openblas": 2, "openmp": 4})`?
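For concreteness, the threadpoolctl pattern above could look roughly like the sketch below. threadpoolctl is an optional third-party package, so the sketch falls back to a plain sort when it is not installed; the limit values are illustrative.

```python
import numpy as np

try:
    from threadpoolctl import ThreadpoolController
    HAVE_TPCTL = True
except ImportError:  # threadpoolctl is an optional dependency
    HAVE_TPCTL = False

rng = np.random.default_rng(0)
arr = rng.random(1_000_000)

if HAVE_TPCTL:
    controller = ThreadpoolController()
    # Cap OpenBLAS at 2 threads and any OpenMP runtime at 4
    # for the duration of the context.
    with controller.limit(limits={"openblas": 2, "openmp": 4}):
        out = np.sort(arr)
else:
    out = np.sort(arr)

assert (out[:-1] <= out[1:]).all()
```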
Are we "all good" to use OpenMP in NumPy directly?
I am not familiar with the implications of using OpenMP and how it could potentially interact with other modules. I was hoping to get an answer to that via the pull request and everyone's input.
OpenBLAS usually manages OpenMP itself, which creates a bit of confusion when nesting it.
OpenBLAS now has:
https://github.com/OpenMathLib/OpenBLAS/blob/develop/driver/others/blas_server_callback.c
Which we can use if we have a central thread pool we want to re-use across multiple things?
@ogrisel I was wondering if you know anything about this (i.e. whether it is safe to use OpenMP in NumPy, or it would create issues).
Which we can use if we have a central thread pool we want to re-use across multiple things?
The libopenblas we ship inside our wheels is always built with pthreads, not openmp. Build scripts live at https://github.com/MacPython/openblas-libs/tree/main/tools.
scikit-learn does run into issues with running OpenMP's threadpool together with OpenBLAS: scikit-learn/scikit-learn#28883
There is a draft PR here to force OpenBLAS to use OpenMP (away from pthreads): scikit-learn/scikit-learn#29403
I thought there were other ecosystem interaction concerns to consider, like cross-interactions with OpenBLAS or wheel-related issues
Yes, there are some quite ugly issues with multiple installed OpenMP libraries and segfaults depending on import order; see microsoft/LightGBM#6595
numpy/_core/tests/test_multiarray.py Outdated
@pytest.mark.parametrize("dtype", [np.float16, np.float32, np.float64])
def test_sort_largearrays(dtype):
    N = 1000000
    arr = np.random.rand(N)
Maybe we should pin the values down with modern `default_rng`?
One other thing I checked was how slow the test might be (do we want a `slow` marker for the large-array handling?). It didn't seem too bad (< 1 s for each case locally on an ARM Mac), though it is in the top 10 slowest for this module, for example.
I did pin down the values with a fixed seed.
r-devulap commented Apr 2, 2025

@rgommers pointed out this discussion in scipy that details the complications of using openmp: scipy/scipy#10239 (comment)
rgommers commented Apr 2, 2025

Also xref https://pypackaging-native.github.io/key-issues/native-dependencies/blas_openmp, which details a bunch of issues.
rgommers commented Apr 2, 2025

I suspect that if we start using OpenMP code, we should disable it in wheels and only let distro packagers enable it.
r-devulap commented Apr 9, 2025

@rgommers How does the interaction with PyTorch or scikit-learn work? Don't they vendor their own version of libgomp, which can potentially conflict with what the distro provides?
rgommers commented Apr 9, 2025

That is already the case - the answer for that one is: (a) please don't mix distro packages with wheels, and (b) … If you have PyTorch and scikit-learn both installed in the same environment and then imported, that usually works just fine, but it has given rise to a host of hard-to-debug issues in the past. Also …

You get the gist - this is a little painful. There's only one way to really do OpenMP right: build everything in a coherent fashion against the same OpenMP library. Distros usually get this right. With wheels you can't. And it gets worse when users do …

This PR looks nice and simple though, so if the performance gains are really large, maybe making it opt-in rather than use-if-detected could work.
rgommers commented Apr 9, 2025

For the design question of whether NumPy et al. should enable parallelism by default, please see https://thomasjpfan.github.io/parallelism-python-libraries-design/ for a good discussion. Cc @thomasjpfan for visibility.
r-devulap commented Apr 14, 2025

updated patch:
rgommers left a comment
Thanks for the update @r-devulap. The CI additions look fine to me; a few comments on build support.
numpy/_core/meson.build Outdated
if use_intel_sort and use_openmp
  omp = dependency('openmp', required: false)
  if omp.found()
    omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
Let's not have a separate variable for this. It's more idiomatic to use a dependency object:
- omp_cflags = ['-fopenmp', '-DXSS_USE_OPENMP']
+ omp_dep = declare_dependency(dependencies: omp, compile_args: ['-DXSS_USE_OPENMP'])
The -fopenmp flag shouldn't need to be added explicitly; it's already present in the omp dependency (e.g., see here).
Sounds good to me. I have also updated the dependency's `required` flag to be true. If someone is explicitly using `enable-openmp=true`, then it's reasonable to expect a build failure if OpenMP isn't available.
rgommers commented Apr 16, 2025

This will also need a release note in …
r-devulap commented Apr 16, 2025

Thanks for reviewing this! I have added two release notes: one for performance improvements and another highlighting general OpenMP build support.
r-devulap commented Apr 17, 2025

@thomasjpfan Thanks for the reference. Correct me if I am wrong, but from what I understand, the performance problem occurs when calling two functions back to back in a loop, where one uses pthreads and the other uses OpenMP to manage their respective threads. This causes resource contention when both functions try to use the available CPU cores with no visibility into what the other one is doing. Beyond standardizing the thread management library across the entire Python ecosystem, are there alternative approaches to solve this?
thomasjpfan commented Apr 17, 2025 (edited)
Yes, this is the underlying issue.
Standardizing the thread management layer is the only long term solution I see. The hard part is figuring out a way to standardize that works for most projects. Top of mind projects and how they multi-thread:
Some workarounds for threadpool specific issues:
r-devulap commented Apr 23, 2025

Adding the 2.3.0 label. Would like to have this in for that release, with or without the openmp support.
Pulls in 2 major changes:
(1) Fixes a performance regression on 16-bit dtype sorting (see numpy/x86-simd-sort#190)
(2) Adds openmp support for quicksort, which speeds up sorting arrays >100,000 elements by up to 3x. See: numpy/x86-simd-sort#179
Also adds a simple unit test to stress the openmp code paths
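A rough way to observe the effect locally is a simple timing sketch like the one below. This is not a rigorous benchmark: the reported ~3x speedup applies to x86 builds with OpenMP enabled and arrays above roughly 100,000 elements; on other builds this just times the single-threaded `np.sort`.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random(5_000_000).astype(np.float32)

t0 = time.perf_counter()
out = np.sort(arr)
elapsed = time.perf_counter() - t0

# On an OpenMP-enabled x86 build, try rerunning with
# OMP_NUM_THREADS=1 vs the default to compare wall times.
print(f"np.sort of 5e6 float32: {elapsed * 1e3:.1f} ms")
assert (out[:-1] <= out[1:]).all()
```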
r-devulap commented May 1, 2025

Rebased with main.
seiko2plus left a comment
LGTM, thank you! Massive performance gains for 16-bit sorting. Since OpenMP is disabled by default, I think it's fine to merge.
Merged commit 25d26e5 into numpy:main
charris commented May 14, 2025

Thanks @r-devulap.
gitboy16 commented May 20, 2025

Hi, thank you for the PR. Would it be possible to have wheels/packages with openmp enabled available somewhere, so that multithreaded sort can be used? Thank you!
rgommers commented May 20, 2025

That would be a lot of work, including vendoring …
Update x86-simd-sort module to latest to pull in 4 major changes:
- Fixes `np.argsort` perf regressions on sorted data (as reported in: BUG: Performance regression in argsort on sorted data #28714)

Benchmark numbers for `np.sort` on a TGL (sorting an array of 5 million numbers):
Benchmark numbers for `np.argsort` on a TGL (sorting an array of 500,000 numbers):