numpy/numpy


moving x86-64 feature baseline to SSE4.2? #27851


Open


Labels

component: SIMD
Issues in SIMD (fast instruction sets) code or machinery

Description

rgommers opened on Nov 25, 2024

As of today, the SIMD "baseline" that we compile for goes up to SSE3, and any higher features are opt-in and runtime dispatched. SSE3 has been the maximum assumed feature for quite a while, and we haven't reviewed this choice recently. At some point in the past we settled on a rule of thumb: we can require a particular feature in the baseline (i.e., drop support for systems lacking it) once the share of systems without it drops below 0.5%. That seems to be the case now for systems without SSE4.1 and SSE4.2.

Here is the full list of dispatchable targets and the features we currently build for each one, in the format "header: enabled target list", e.g.:

Generating multi-targets for "_umath_tests.dispatch.h"   Enabled targets: AVX2, SSE41, baseline

Full set of dispatchable targets:

Generating multi-targets for "_umath_tests.dispatch.h"   Enabled targets: AVX2, SSE41, baseline
Generating multi-targets for "argfunc.dispatch.h"   Enabled targets: AVX512_SKX, AVX2, SSE42, baseline
Generating multi-targets for "x86_simd_argsort.dispatch.h"   Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort.dispatch.h"   Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort_16bit.dispatch.h"   Enabled targets: AVX512_SPR, AVX512_ICL
Generating multi-targets for "highway_qsort.dispatch.h"   Enabled targets:
Generating multi-targets for "highway_qsort_16bit.dispatch.h"   Enabled targets:
Generating multi-targets for "loops_arithm_fp.dispatch.h"   Enabled targets: AVX2, baseline
Generating multi-targets for "loops_arithmetic.dispatch.h"   Enabled targets: AVX512_SKX, AVX512F, AVX2, SSE41, baseline
Generating multi-targets for "loops_comparison.dispatch.h"   Enabled targets: AVX512_SKX, AVX512F, AVX2, SSE42, baseline
Generating multi-targets for "loops_exponent_log.dispatch.h"   Enabled targets: AVX512_SKX, AVX512F, AVX2, baseline
Generating multi-targets for "loops_hyperbolic.dispatch.h"   Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_logical.dispatch.h"   Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_minmax.dispatch.h"   Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_modulo.dispatch.h"   Enabled targets: baseline
Generating multi-targets for "loops_trigonometric.dispatch.h"   Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_umath_fp.dispatch.h"   Enabled targets: AVX512_SKX, baseline
Generating multi-targets for "loops_unary.dispatch.h"   Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_unary_fp.dispatch.h"   Enabled targets: SSE41, baseline
Generating multi-targets for "loops_unary_fp_le.dispatch.h"   Enabled targets: SSE41, baseline
Generating multi-targets for "loops_unary_complex.dispatch.h"   Enabled targets: AVX512F, AVX2, baseline
Generating multi-targets for "loops_autovec.dispatch.h"   Enabled targets: AVX2, baseline
Generating multi-targets for "_simd.dispatch.h"   Enabled targets: SSE42, AVX2, FMA3, AVX512F, AVX512_SKX, baseline
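To get a rough idea of which dispatch targets would fold into the baseline, the build log above can be scanned for SSE41 and SSE42 entries. The following is a minimal sketch, not part of NumPy's build tooling: the helper names `parse_targets` and `droppable` are hypothetical, the log is only an excerpt of the list above, and it assumes that an SSE4.2 baseline makes exactly the SSE41 and SSE42 dispatch targets redundant.

```python
import re

# Excerpt of the build log quoted in this issue (not the full list).
LOG = """\
Generating multi-targets for "_umath_tests.dispatch.h"   Enabled targets: AVX2, SSE41, baseline
Generating multi-targets for "argfunc.dispatch.h"   Enabled targets: AVX512_SKX, AVX2, SSE42, baseline
Generating multi-targets for "highway_qsort.dispatch.h"   Enabled targets:
Generating multi-targets for "loops_unary_fp.dispatch.h"   Enabled targets: SSE41, baseline
Generating multi-targets for "loops_minmax.dispatch.h"   Enabled targets: AVX512_SKX, AVX2, baseline
"""

# Assumption: these are the targets an SSE4.2 baseline would already cover.
FOLDED_INTO_BASELINE = {"SSE41", "SSE42"}

LINE_RE = re.compile(r'Generating multi-targets for "([^"]+)"\s+Enabled targets:(.*)')


def parse_targets(log):
    """Map each dispatch header to its list of enabled targets."""
    result = {}
    # Match line by line so an empty target list cannot swallow the next line.
    for line in log.splitlines():
        match = LINE_RE.search(line)
        if match:
            header, targets = match.groups()
            result[header] = [t.strip() for t in targets.split(",") if t.strip()]
    return result


def droppable(targets_by_header):
    """Return, per header, the targets that the new baseline would absorb."""
    result = {}
    for header, targets in targets_by_header.items():
        hits = [t for t in targets if t in FOLDED_INTO_BASELINE]
        if hits:
            result[header] = hits
    return result


for header, hits in droppable(parse_targets(LOG)).items():
    print(f"{header}: {', '.join(hits)}")
```

Applied by hand to the full list above, seven of the 23 headers list SSE41 or SSE42 as a separate target.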

The most widely used data source for determining what hardware is out there is, I believe, https://store.steampowered.com/hwsurvey/?platform=combined. It currently reports SSE3 at 100%, SSE4.1 at 99.78% and SSE4.2 at 99.70%. So if we bump the baseline up to SSE4.2, we'd only be dropping support for ~0.3% of systems, those with really old CPUs.
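The comparison against the 0.5% rule of thumb is simple arithmetic. Here is a small sketch using only the survey figures quoted above; the names `THRESHOLD_PCT`, `lacking_pct` and `below_threshold` are made up for illustration.

```python
# Rule-of-thumb cutoff mentioned in this issue: a feature can move into the
# baseline once fewer than 0.5% of systems lack it.
THRESHOLD_PCT = 0.5

# Steam Hardware Survey figures quoted above (percent of systems WITH the feature).
support_pct = {"SSE3": 100.0, "SSE4.1": 99.78, "SSE4.2": 99.70}


def lacking_pct(feature):
    """Percent of surveyed systems without the given feature."""
    return round(100.0 - support_pct[feature], 2)


def below_threshold(feature):
    """True if the share of systems lacking the feature is under the cutoff."""
    return lacking_pct(feature) < THRESHOLD_PCT


for name in support_pct:
    print(name, lacking_pct(name), below_threshold(name))
```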

For more context, SSE4.2 was introduced in 2008, and even Windows 11 (v2024H2) now requires it (xref https://en.wikipedia.org/wiki/SSE4#SSE4.2).
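For anyone wanting to check whether a particular Linux x86-64 machine would still be covered, here is a rough sketch that inspects the `flags` line of `/proc/cpuinfo`. It is Linux-only, the function names are hypothetical, and the required flag set is an assumption about what an SSE4.2 baseline would imply (the kernel reports SSE3 as `pni`).

```python
# Assumption: kernel flag names for the features an SSE4.2 baseline would need.
REQUIRED_FLAGS = {"pni", "ssse3", "sse4_1", "sse4_2", "popcnt"}


def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_for_sse42_baseline(cpuinfo_text):
    """List the required flags that the CPU does not report, sorted by name."""
    return sorted(REQUIRED_FLAGS - cpu_flags(cpuinfo_text))


def check_this_machine(path="/proc/cpuinfo"):
    """Read cpuinfo from disk (Linux only) and report missing flags."""
    with open(path) as f:
        return missing_for_sse42_baseline(f.read())
```

An empty list from `check_this_machine()` would suggest the CPU reports everything assumed here; it says nothing about non-Linux systems.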

Now the other side of this coin is - what do we gain by making this change? I haven't quantified each item, but the basic answer is:

Reduces build time on x86-64: 40% of build targets (206/517) on my 6-year-old Intel CPU with AVX512 are SIMD targets. We can trim off a decent fraction of those.
Reduces binary size: numpy/_core/_simd.so is currently 3.1 MB out of 39.9 MB on disk for a Linux release build. Looking at the multi-targets list higher up, it looks like we can trim that a fair bit.
Reduces the number of variations that should be tested in CI (linux_simd.yml). Given the current config, we can't actually drop a job, but we do increase test coverage (there are currently zero test configs for baseline + SSE4.1/2).

I'd suggest making the change in main this release cycle, meaning for numpy 2.3.0, which will probably be released in June 2025.

Hat tip to @itamarst for bringing up this topic (xref scientific-python/faster-scientific-python-ideas#11).

