Description
As of today, the SIMD "baseline" that we compile for goes up to SSE3, and any higher features are opt-in and runtime dispatched. SSE3 has been the maximum assumed feature for quite a while. We haven't reviewed this choice recently. At some point in the past we determined a rule of thumb saying that we could drop support for a particular feature (or lack thereof) if support for it dropped below 0.5%. That seems to be the case now for systems without SSE4.1 and SSE4.2.
Here is the full list of dispatchable targets and the features we currently build for each one, in the format "header: enabled target list", e.g.:
Generating multi-targets for "_umath_tests.dispatch.h" Enabled targets: AVX2, SSE41, baseline
Full set of dispatchable targets:
Generating multi-targets for "_umath_tests.dispatch.h" Enabled targets: AVX2, SSE41, baseline
Generating multi-targets for "argfunc.dispatch.h" Enabled targets: AVX512_SKX, AVX2, SSE42, baseline
Generating multi-targets for "x86_simd_argsort.dispatch.h" Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort.dispatch.h" Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort_16bit.dispatch.h" Enabled targets: AVX512_SPR, AVX512_ICL
Generating multi-targets for "highway_qsort.dispatch.h" Enabled targets:
Generating multi-targets for "highway_qsort_16bit.dispatch.h" Enabled targets:
Generating multi-targets for "loops_arithm_fp.dispatch.h" Enabled targets: AVX2, baseline
Generating multi-targets for "loops_arithmetic.dispatch.h" Enabled targets: AVX512_SKX, AVX512F, AVX2, SSE41, baseline
Generating multi-targets for "loops_comparison.dispatch.h" Enabled targets: AVX512_SKX, AVX512F, AVX2, SSE42, baseline
Generating multi-targets for "loops_exponent_log.dispatch.h" Enabled targets: AVX512_SKX, AVX512F, AVX2, baseline
Generating multi-targets for "loops_hyperbolic.dispatch.h" Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_logical.dispatch.h" Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_minmax.dispatch.h" Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_modulo.dispatch.h" Enabled targets: baseline
Generating multi-targets for "loops_trigonometric.dispatch.h" Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_umath_fp.dispatch.h" Enabled targets: AVX512_SKX, baseline
Generating multi-targets for "loops_unary.dispatch.h" Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_unary_fp.dispatch.h" Enabled targets: SSE41, baseline
Generating multi-targets for "loops_unary_fp_le.dispatch.h" Enabled targets: SSE41, baseline
Generating multi-targets for "loops_unary_complex.dispatch.h" Enabled targets: AVX512F, AVX2, baseline
Generating multi-targets for "loops_autovec.dispatch.h" Enabled targets: AVX2, baseline
Generating multi-targets for "_simd.dispatch.h" Enabled targets: SSE42, AVX2, FMA3, AVX512F, AVX512_SKX, baseline
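To see which dispatch headers would lose a separately compiled variant, one could parse the build log and pick out entries that list SSE41 or SSE42. The sketch below is only an illustration: the helper names (`parse_targets`, `absorbed_by_baseline`) are hypothetical, the sample contains just a few lines copied from the log, and it assumes that a target at or below the new baseline no longer needs its own dispatched build.

```python
import re

# A few lines copied from the build log (not the full list).
LOG = '''\
Generating multi-targets for "_umath_tests.dispatch.h" Enabled targets: AVX2, SSE41, baseline
Generating multi-targets for "argfunc.dispatch.h" Enabled targets: AVX512_SKX, AVX2, SSE42, baseline
Generating multi-targets for "x86_simd_qsort.dispatch.h" Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "highway_qsort.dispatch.h" Enabled targets:
Generating multi-targets for "loops_unary_fp.dispatch.h" Enabled targets: SSE41, baseline
'''

LINE_RE = re.compile(r'Generating multi-targets for "([^"]+)" Enabled targets:(.*)')


def parse_targets(log):
    """Map each dispatch header to its list of enabled targets (hypothetical helper)."""
    result = {}
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            header, targets = match.groups()
            # An empty target list (e.g. highway_qsort) becomes [].
            result[header] = [t.strip() for t in targets.split(",") if t.strip()]
    return result


def absorbed_by_baseline(parsed, folded=("SSE41", "SSE42")):
    """Headers with targets that a baseline of SSE4.2 would absorb (assumption)."""
    return {
        header: [t for t in targets if t in folded]
        for header, targets in parsed.items()
        if any(t in folded for t in targets)
    }


parsed = parse_targets(LOG)
absorbed = absorbed_by_baseline(parsed)
for header, targets in absorbed.items():
    print(f"{header}: {', '.join(targets)} would fold into the baseline")
```

Running the same functions over the complete log would give the full set of affected headers.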
The most widely used data source for determining what hardware is out there is, I believe, https://store.steampowered.com/hwsurvey/?platform=combined. That currently says that SSE3 is at 100%, SSE4.1 at 99.78% and SSE4.2 at 99.70%, meaning that if we bump the baseline up to SSE4.2, we'd only be dropping support for ~0.3% of systems with really old CPUs.
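As a quick sanity check of the 0.5% rule of thumb against those survey numbers, here is a minimal sketch. The function names are hypothetical, and the percentages are simply the ones quoted above; they will drift as the survey updates.

```python
# Support percentages quoted above from the Steam hardware survey.
SUPPORT_PCT = {"SSE3": 100.0, "SSE4.1": 99.78, "SSE4.2": 99.70}

# Rule-of-thumb cutoff: a feature can be required once fewer than
# 0.5% of systems lack it.
THRESHOLD_PCT = 0.5


def share_without(feature):
    """Percentage of surveyed systems lacking the feature."""
    return round(100.0 - SUPPORT_PCT[feature], 2)


def can_require(feature):
    """True if the share of systems without the feature is under the cutoff."""
    return share_without(feature) < THRESHOLD_PCT


for feature in SUPPORT_PCT:
    print(f"{feature}: {share_without(feature):.2f}% without, "
          f"can require: {can_require(feature)}")
```

All three features clear the cutoff with these numbers, which is what motivates raising the baseline to SSE4.2 rather than stopping at SSE4.1.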
For more context, SSE4.2 was introduced in 2008, and even Windows 11 (v2024H2) now requires it (xref https://en.wikipedia.org/wiki/SSE4#SSE4.2).
Now the other side of this coin: what do we gain by making this change? I haven't quantified each item, but the basic answer is:
- Reduces build time on x86-64: 40% of build targets (206/517) on my 6-year-old Intel CPU with AVX512 are SIMD targets. We can trim off a decent fraction of those.
- Reduces binary size: `numpy/_core/_simd.so` currently is 3.1 MB out of 39.9 MB on disk for a Linux release build. Looking at the multi-targets list higher up, it looks like we can trim that a fair bit.
- Reduces the number of variations that should be tested in CI (`linux_simd.yml`). Given the current config, we can't actually drop a job, but we do make the test coverage higher (there are currently zero test configs for baseline + SSE4.1/2).
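The two shares quoted in the list can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the numbers from this issue; the variable names are made up for illustration.

```python
# Numbers quoted in the list above.
simd_targets, all_targets = 206, 517   # build targets on the author's machine
simd_so_mb, total_mb = 3.1, 39.9       # _simd.so size vs. total size on disk

build_share = 100 * simd_targets / all_targets
size_share = 100 * simd_so_mb / total_mb

print(f"SIMD build targets: {build_share:.1f}% of all targets")
print(f"_simd.so: {size_share:.1f}% of the on-disk size")
```

These are upper bounds on what the SIMD code contributes today, not the expected savings, since only the SSE4.1/SSE4.2 variants would go away.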
I'd suggest making the change in `main` this release cycle, meaning for numpy 2.3.0, which will probably be released in June 2025.
Hat tip to @itamarst for bringing up this topic (xref scientific-python/faster-scientific-python-ideas#11).