ENH: Add SIMD sin/cos implementation with numpy-simd-routines #29699
Conversation
numpy-simd-routines is added as a subrepo in the meson subprojects directory. The current FP configuration is static: ~1 ULP is used for double precision and ~4 ULP for single precision, with handling of floating-point errors, special cases, extended precision for large arguments, and subnormals all enabled by default.

numpy-simd-routines supports all SIMD extensions that are supported by Google Highway, including non-FMA extensions, and is fully independent from libm to guarantee unified results across all compilers and platforms.

Full benchmarks will be provided within the pull request. The following benchmark was run with clang-19 on an x86 CPU (Ryzen 7 7700X) with AVX512 enabled. Note that there was no SIMD optimization enabled for sin/cos for double precision before, only for single precision.

| Before | After | Ratio | Benchmark (Parameter) |
|---------------|-------------|--------|------------------------------------------|
| 713±6μs | 633±6μs | 0.89 | UnaryFP(<ufunc 'cos'>, 1, 2, 'f') |
| 717±9μs | 637±6μs | 0.89 | UnaryFP(<ufunc 'cos'>, 4, 1, 'f') |
| 705±3μs | 607±10μs | 0.86 | UnaryFP(<ufunc 'sin'>, 4, 1, 'f') |
| 714±10μs | 595±0.5μs | 0.83 | UnaryFP(<ufunc 'sin'>, 1, 2, 'f') |
| 370±0.3μs | 277±4μs | 0.75 | UnaryFP(<ufunc 'cos'>, 1, 1, 'f') |
| 373±2μs | 236±0.6μs | 0.63 | UnaryFP(<ufunc 'sin'>, 1, 1, 'f') |
| 1.06±0.01ms | 648±3μs | 0.61 | UnaryFP(<ufunc 'cos'>, 4, 2, 'f') |
| 1.06±0.01ms | 617±30μs | 0.58 | UnaryFP(<ufunc 'sin'>, 4, 2, 'f') |
| 5.06±0.06ms | 2.61±0.3ms | 0.52 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'd') |
| 1.48±0ms | 715±5μs | 0.48 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'f') |
| 1.50±0.01ms | 639±6μs | 0.43 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'f') |
| 5.15±0.1ms | 1.96±0.01ms | 0.38 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'd') |
| 5.72±0.02ms | 2.09±0.1ms | 0.37 | UnaryFP(<ufunc 'cos'>, 4, 2, 'd') |
| 5.76±0.01ms | 2.03±0.08ms | 0.35 | UnaryFP(<ufunc 'sin'>, 4, 2, 'd') |
| 5.07±0.08ms | 1.76±0.2ms | 0.35 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'd') |
| 6.04±0.04ms | 2.05±0.09ms | 0.34 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'd') |
| 5.79±0.03ms | 1.90±0.2ms | 0.33 | UnaryFP(<ufunc 'sin'>, 4, 1, 'd') |
| 2.29±0.1ms | 762±40μs | 0.33 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'f') |
| 5.72±0.1ms | 1.75±0.07ms | 0.31 | UnaryFP(<ufunc 'cos'>, 4, 1, 'd') |
| 6.04±0.03ms | 1.82±0.2ms | 0.30 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'd') |
| 2.49±0.1ms | 748±30μs | 0.30 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'f') |
| 2.23±0.1ms | 634±6μs | 0.28 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'f') |
| 1.31±0.03ms | 367±5μs | 0.28 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'f') |
| 2.55±0.09ms | 654±30μs | 0.26 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'f') |
| 4.97±0.03ms | 1.14±0ms | 0.23 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'd') |
| 5.67±0.01ms | 1.22±0.03ms | 0.22 | UnaryFP(<ufunc 'cos'>, 1, 2, 'd') |
| 5.76±0.03ms | 1.28±0.06ms | 0.22 | UnaryFP(<ufunc 'sin'>, 1, 2, 'd') |
| 1.26±0.01ms | 272±2μs | 0.22 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'f') |
| 7.03±0.02ms | 1.31±0.01ms | 0.19 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'd') |
| 5.67±0.01ms | 810±9μs | 0.14 | UnaryFP(<ufunc 'cos'>, 1, 1, 'd') |
| 5.71±0.01ms | 817±40μs | 0.14 | UnaryFP(<ufunc 'sin'>, 1, 1, 'd') |
| 7.05±0.03ms | 915±4μs | 0.13 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'd') |
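For a rough, ad-hoc reproduction of the shape of these numbers (the figures above come from NumPy's ASV benchmark suite), a sketch like the following times contiguous and strided sin calls; treating the two integer parameters in the table as input/output strides is an assumption made for this sketch, not something taken from the benchmark source:

```python
import timeit
import numpy as np

n = 10_000
for dtype in (np.float32, np.float64):
    for stride_in, stride_out in [(1, 1), (1, 2), (4, 1), (4, 2)]:
        # Build possibly strided views so both the contiguous fast path
        # and the strided fallback of the ufunc loop get exercised.
        x = np.ones(n * stride_in, dtype=dtype)[::stride_in]
        out = np.empty(n * stride_out, dtype=dtype)[::stride_out]
        t = timeit.timeit(lambda: np.sin(x, out=out), number=200) / 200
        print(f"sin {np.dtype(dtype).char} in-stride={stride_in} "
              f"out-stride={stride_out}: {t * 1e6:.1f} µs")
```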
Allow up to 3 ULP error for float32 sin/cos when native FMA is not available.
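As a rough illustration of that tolerance, a test along these lines could compare the float32 results against a rounded float64 reference with an FMA-dependent ULP bound. This is only a sketch: the `maxulp_f32` / `have_native_fma` names and the exact bounds are assumptions mirroring the PR description, not the PR's actual test code.

```python
import numpy as np
from numpy.testing import assert_array_max_ulp

# Hypothetical tolerance switch: up to 3 ULP for float32 sin/cos when
# native FMA is unavailable, a tighter bound otherwise.
have_native_fma = True   # placeholder; a real test would query the build
maxulp_f32 = 2 if have_native_fma else 3

rng = np.random.default_rng(0)
x_f32 = rng.uniform(-1e3, 1e3, size=10_000).astype(np.float32)
x_f64 = x_f32.astype(np.float64)   # exact float64 copies of the same inputs

# Compare the float32 SIMD results against the float64 reference rounded to float32.
assert_array_max_ulp(np.sin(x_f32), np.float32(np.sin(x_f64)), maxulp=maxulp_f32)
assert_array_max_ulp(np.cos(x_f32), np.float32(np.cos(x_f64)), maxulp=maxulp_f32)
```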
seiko2plus commented Dec 9, 2025 • edited
I have updated the benchmark to use GCC instead of Clang, as GCC is commonly the default compiler for wheels on Linux. At the moment, I only have access to bare-metal x86 hardware; perhaps I should try cloud services (e.g. AWS) for testing on other architectures. The ufunc inner implementation is optimized for size as much as possible and only increases […]
seiko2plus commented Dec 9, 2025 • edited
Ah, regarding the reverted #23399, as discussed in the thread /C6EYZZSR4EWGVKHAZXLE7IBILRMNVK7L/: the default precision for f64 is set up to 1 ULP, including for large arguments, with the ability to add a build-time option to change this behavior later: numpy/numpy/_core/src/common/simd/simd.hpp, lines 79 to 100 in 67d0274
cc: @mhvk
rgommers commented Dec 9, 2025
No more GCC compile farm access? AWS seems okay of course. If you want me to run this on macOS arm64 (M1), I can.
That's impressively small, glad to see that.
seberg commented Dec 9, 2025
@seiko2plus thanks a lot for this hard work! I don't want to discuss it on the PR here, but maybe you can send a very brief mail summarizing the precision changes for the float32 version? (I am slightly worried we might get into the same dance as with the 64-bit version and SVML before, so we should at least increase awareness.)
mhvk left a comment
@seiko2plus - looks very impressive! A few small comments inline, with the one about the iterator perhaps the most important (if it is possible; can be for follow-up). I also made a small comment over at your new npsr routines.
```cpp
// Note: the intrinsics of NumPy SIMD routines are inlined by default.
if (NPY_UNLIKELY(is_mem_overlap(args[0], sin, args[1], sout, len) ||
                 sin != sizeof(T) || sout != sizeof(T))) {
    // this for non-contiguous or overlapping case
```
I always have trouble knowing exactly what one can ask the iterator, but is it not possible to set some flag that ensures that for non-contiguous or overlapping data, a copy is made to a contiguous buffer? I ask since if so we do not have to do this inside the loop (with similar duplicated code presumably elsewhere).
This is already the case, I believe. But these paths get used by .accumulate unfortunately. Although, it may be that we can simplify the check based on that knowledge?
EDIT: Sorry, to be clear, I think that is the case, I am not 100% sure.
I don't think it can be used by .accumulate since this ufunc has nin=1... If a flag is indeed passed, maybe it is worth replacing this with an assert on the stride and seeing if anything breaks? Logically, if we are passing the iterator a flag that we require contiguous input and output, then we should never get anything else...
Ah, there is no flag to enforce a contiguous loop, but we could probably add one; if we need this more, I think that makes sense. The buffer machinery will have a lot of overhead unfortunately, but overall it is still likely better.
For overlap, I believe the rule should be (or am I missing something?!):
- If memory overlap exists it must be exactly the same memory (i.e. args[0] == args[1] and steps identical).
- Except for .accumulate which, you are right, doesn't apply here.
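To make the rule above concrete, here is a small illustrative Python sketch (not code from this PR) of the only overlap pattern that rule would permit for a nin=1 ufunc like sin: a fully in-place call where input and output are exactly the same buffer with identical strides.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1_000, dtype=np.float32)

# Allowed: exact self-overlap, i.e. args[0] == args[1] with identical strides.
np.sin(x, out=x)

# Partial overlap (same base buffer, shifted view) is what the ufunc machinery
# has to detect and resolve, e.g. via buffering, before the inner loop runs.
y = np.empty(1_001, dtype=np.float32)
a, b = y[:-1], y[1:]
print(np.shares_memory(a, b))   # True: overlapping but not identical memory
np.sin(a, out=b)                # still correct thanks to that overlap handling
```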
Ah, if there is no "needs contiguous" flag, it is out of scope here. But I think it does make sense to add the option to request buffering to a contiguous and aligned chunk to the iterator. I would think the overhead cannot be that much worse than it is here, since one would already be inside the iterator anyway typically for non-contiguous data. And it would certainly be nice to keep the underlying loops simple...
> I would think the overhead cannot be that much worse than it is here

Well, the buffering has a "huge" amount of one-time overhead. For very large arrays, yes, the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).
@seberg - you mentioned there is no flag, but is one needed? If sin/cos are defined as an array method, would things not work automatically if one filled only the contiguous_loop slot? Or is that at the moment only treated as something that will be used, if present, when data are contiguous?
> I would think the overhead cannot be that much worse than it is here
>
> Well, the buffering has a "huge" amount of one-time overhead. For very large arrays, yes, the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).
The other thing is that if we centralize this, there will be more of an incentive to speed it up. Isn't the buffer pre-allocated by the compiler here? It would make more sense to have just one scratch buffer in the iterator, perhaps using your mechanism that allocates if it is too small.
@mhvk no, not yet, the contiguous loop is an optimization only; only with that flag or a "non-contig to contig wrapper" could it be more. Dunno what is better, centralizing or not; I mostly like it because it removes unnecessary repetition both in code and binary, whether it is slower or faster.
(but yeah, we shouldn't worry about that here.)
So, basically it would be some way to tell the ufunc (in this case) that the NPY_ITER_CONTIG flag should be passed on to the iterator... Anyway, let's take the discussion out of this PR -- see #30413.
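For context on what such buffering does, the Python-level nditer exposes the same machinery. This illustrative sketch (not part of the PR) shows how the "buffered" and "contig" operand flags hand the inner loop contiguous chunks even for strided input:

```python
import numpy as np

a = np.arange(20.0)[::2]        # non-contiguous view (stride of 2 elements)
out = np.empty(10)

it = np.nditer(
    [a, out],
    flags=['external_loop', 'buffered'],
    op_flags=[['readonly', 'contig'], ['writeonly', 'contig']],
)
with it:
    for x, y in it:
        # x and y are contiguous buffers here, so a SIMD kernel could
        # process them directly without a strided fallback path.
        y[...] = np.sin(x)

print(np.allclose(out, np.sin(a)))   # True
```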
```cpp
/// npsr is tag free by design so we only include it within main namespace (np::simd)
namespace sr = npsr::HWY_NAMESPACE;
/// Default precision configurations for NumPy SIMD Routines
```
Might be good here to tell what low and high imply (or at least where to look it up)
```python
assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=2)
assert_array_max_ulp(np.sin(x_f32, out=x_f32), np.float32(np.sin(x_f64)), maxulp=maxulp_f32)
assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=maxulp_f32)
```
It may be good to add an explicit test that cos(0) == 1 for all floats, as it is the thing that threw our tests off in a previous iteration (and I think it is something you now explicitly capture).
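A minimal sketch of such an identity test, assuming plain pytest-style assertions rather than the PR's actual test layout:

```python
import numpy as np

def test_sin_cos_exact_identities():
    # cos(0) must be exactly 1 and sin(0) exactly 0 for every float dtype,
    # regardless of which SIMD path the build dispatches to.
    for dtype in (np.float32, np.float64):
        zero = np.zeros(64, dtype=dtype)   # long enough to reach the SIMD loop
        assert np.all(np.cos(zero) == dtype(1))
        assert np.all(np.sin(zero) == dtype(0))
```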
- Extend the C++ doc scope to better explain precision control, which is chosen based on the data type.
- Add test cases for sine/cosine facts, for example `cos(0) == 1`.
Benchmark
X86 Platform
The following benchmark was run with GCC 14.3.0 on an x86 CPU (Ryzen 7 7700X).
Environment:
- x86-64-v4
- x86-64-v3
- x86-64-v2
Binary size change: _multiarray_umath.cpython-312-x86_64-linux-gnu.so
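A quick way to check such a size delta locally; the build directories and paths below are illustrative placeholders, not taken from this PR:

```python
import os

# Hypothetical paths to the extension module built from main and from this branch.
before = "build-main/numpy/_core/_multiarray_umath.cpython-312-x86_64-linux-gnu.so"
after = "build-pr/numpy/_core/_multiarray_umath.cpython-312-x86_64-linux-gnu.so"

size_before = os.path.getsize(before)
size_after = os.path.getsize(after)
print(f"before: {size_before / 1024:.1f} KiB")
print(f"after:  {size_after / 1024:.1f} KiB")
print(f"delta:  {(size_after - size_before) / 1024:+.1f} KiB")
```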