ENH: Add SIMD sin/cos implementation with numpy-simd-routines #29699
Conversation
numpy-simd-routines is added as a subrepo in the meson subprojects directory. The current FP configuration is static: ~1 ULP is used for double precision and ~4 ULP for single precision, with handling of floating-point errors, special cases, extended precision for large arguments, and subnormals all enabled by default.

numpy-simd-routines supports all SIMD extensions that are supported by Google Highway, including non-FMA extensions, and is fully independent from libm to guarantee unified results across all compilers and platforms.

Full benchmarks will be provided within the pull request. The following benchmark was run with clang-19 on an x86 CPU (Ryzen 7 7700X) with AVX512 enabled. Note that there was no SIMD optimization enabled for sin/cos for double precision before, only for single precision.

| Before | After | Ratio | Benchmark (Parameter) |
|---------------|-------------|--------|------------------------------------------|
| 713±6μs | 633±6μs | 0.89 | UnaryFP(<ufunc 'cos'>, 1, 2, 'f') |
| 717±9μs | 637±6μs | 0.89 | UnaryFP(<ufunc 'cos'>, 4, 1, 'f') |
| 705±3μs | 607±10μs | 0.86 | UnaryFP(<ufunc 'sin'>, 4, 1, 'f') |
| 714±10μs | 595±0.5μs | 0.83 | UnaryFP(<ufunc 'sin'>, 1, 2, 'f') |
| 370±0.3μs | 277±4μs | 0.75 | UnaryFP(<ufunc 'cos'>, 1, 1, 'f') |
| 373±2μs | 236±0.6μs | 0.63 | UnaryFP(<ufunc 'sin'>, 1, 1, 'f') |
| 1.06±0.01ms | 648±3μs | 0.61 | UnaryFP(<ufunc 'cos'>, 4, 2, 'f') |
| 1.06±0.01ms | 617±30μs | 0.58 | UnaryFP(<ufunc 'sin'>, 4, 2, 'f') |
| 5.06±0.06ms | 2.61±0.3ms | 0.52 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'd') |
| 1.48±0ms | 715±5μs | 0.48 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'f') |
| 1.50±0.01ms | 639±6μs | 0.43 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'f') |
| 5.15±0.1ms | 1.96±0.01ms | 0.38 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'd') |
| 5.72±0.02ms | 2.09±0.1ms | 0.37 | UnaryFP(<ufunc 'cos'>, 4, 2, 'd') |
| 5.76±0.01ms | 2.03±0.08ms | 0.35 | UnaryFP(<ufunc 'sin'>, 4, 2, 'd') |
| 5.07±0.08ms | 1.76±0.2ms | 0.35 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'd') |
| 6.04±0.04ms | 2.05±0.09ms | 0.34 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'd') |
| 5.79±0.03ms | 1.90±0.2ms | 0.33 | UnaryFP(<ufunc 'sin'>, 4, 1, 'd') |
| 2.29±0.1ms | 762±40μs | 0.33 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'f') |
| 5.72±0.1ms | 1.75±0.07ms | 0.31 | UnaryFP(<ufunc 'cos'>, 4, 1, 'd') |
| 6.04±0.03ms | 1.82±0.2ms | 0.30 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'd') |
| 2.49±0.1ms | 748±30μs | 0.30 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'f') |
| 2.23±0.1ms | 634±6μs | 0.28 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'f') |
| 1.31±0.03ms | 367±5μs | 0.28 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'f') |
| 2.55±0.09ms | 654±30μs | 0.26 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'f') |
| 4.97±0.03ms | 1.14±0ms | 0.23 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'd') |
| 5.67±0.01ms | 1.22±0.03ms | 0.22 | UnaryFP(<ufunc 'cos'>, 1, 2, 'd') |
| 5.76±0.03ms | 1.28±0.06ms | 0.22 | UnaryFP(<ufunc 'sin'>, 1, 2, 'd') |
| 1.26±0.01ms | 272±2μs | 0.22 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'f') |
| 7.03±0.02ms | 1.31±0.01ms | 0.19 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'd') |
| 5.67±0.01ms | 810±9μs | 0.14 | UnaryFP(<ufunc 'cos'>, 1, 1, 'd') |
| 5.71±0.01ms | 817±40μs | 0.14 | UnaryFP(<ufunc 'sin'>, 1, 1, 'd') |
| 7.05±0.03ms | 915±4μs | 0.13 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'd') |
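For a rough, ad-hoc reproduction of the shape of these numbers (the figures above come from NumPy's ASV benchmark suite), a sketch like the following times contiguous and strided sin calls; treating the two integer parameters in the table as input/output strides is an assumption made for this sketch, not something taken from the benchmark source:

```python
import timeit
import numpy as np

n = 10_000
for dtype in (np.float32, np.float64):
    for stride_in, stride_out in [(1, 1), (1, 2), (4, 1), (4, 2)]:
        # Build possibly strided views so both the contiguous fast path
        # and the strided fallback of the ufunc loop get exercised.
        x = np.ones(n * stride_in, dtype=dtype)[::stride_in]
        out = np.empty(n * stride_out, dtype=dtype)[::stride_out]
        t = timeit.timeit(lambda: np.sin(x, out=out), number=200) / 200
        print(f"sin {np.dtype(dtype).char} in-stride={stride_in} "
              f"out-stride={stride_out}: {t * 1e6:.1f} µs")
```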
Allow up to 3 ULP error for float32 sin/cos when native FMA is not available.
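As a rough illustration of that tolerance, a test along these lines could compare the float32 results against a rounded float64 reference with an FMA-dependent ULP bound. This is only a sketch: the `maxulp_f32` / `have_native_fma` names and the exact bounds are assumptions mirroring the PR description, not the PR's actual test code.

```python
import numpy as np
from numpy.testing import assert_array_max_ulp

# Hypothetical tolerance switch: up to 3 ULP for float32 sin/cos when
# native FMA is unavailable, a tighter bound otherwise.
have_native_fma = True   # placeholder; a real test would query the build
maxulp_f32 = 2 if have_native_fma else 3

rng = np.random.default_rng(0)
x_f32 = rng.uniform(-1e3, 1e3, size=10_000).astype(np.float32)
x_f64 = x_f32.astype(np.float64)   # exact float64 copies of the same inputs

# Compare the float32 SIMD results against the float64 reference rounded to float32.
assert_array_max_ulp(np.sin(x_f32), np.float32(np.sin(x_f64)), maxulp=maxulp_f32)
assert_array_max_ulp(np.cos(x_f32), np.float32(np.cos(x_f64)), maxulp=maxulp_f32)
```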
seiko2plus commented Dec 9, 2025 • edited
I have updated the benchmark to use GCC instead of Clang, as GCC is commonly the default compiler for wheels on Linux. At the moment, I only have access to bare-metal x86 hardware; perhaps I should try cloud services (e.g. AWS) for testing on other architectures. The ufunc inner implementation is optimized for size as much as possible and only increases […]
seiko2plus commented Dec 9, 2025 • edited
Ah, regarding the reverted #23399, as discussed in the thread /C6EYZZSR4EWGVKHAZXLE7IBILRMNVK7L/: the default precision for f64 is set up to 1 ULP, including for large arguments, with the ability to add a build-time option to change this behavior later: numpy/numpy/_core/src/common/simd/simd.hpp, lines 79 to 100 in 67d0274
cc: @mhvk
rgommers commented Dec 9, 2025
No more GCC compile farm access? AWS seems okay of course. If you want me to run this on macOS arm64 (M1), I can.
That's impressively small, glad to see that.
seberg commented Dec 9, 2025
@seiko2plus thanks a lot for this hard work! I don't want to discuss it on the PR here, but maybe you can send a very brief mail summarizing the precision changes for the float32 version? (I am slightly worried we might get into the same dance as with the 64-bit version and SVML before, so we should at least increase awareness.)
mhvk left a comment
@seiko2plus - looks very impressive! A few small comments inline, with the one about the iterator perhaps the most important (if it is possible; can be for follow-up). I also made a small comment over at your new npsr routines.
```cpp
// Note: the intrinsics of NumPy SIMD routines are inlined by default.
if (NPY_UNLIKELY(is_mem_overlap(args[0], sin, args[1], sout, len) ||
                 sin != sizeof(T) || sout != sizeof(T))) {
    // this for non-contiguous or overlapping case
```
I always have trouble knowing exactly what one can ask the iterator, but is it not possible to set some flag that ensures that for non-contiguous or overlapping data, a copy is made to a contiguous buffer? I ask since if so we do not have to do this inside the loop (with similar duplicated code presumably elsewhere).
This is already the case, I believe. But these paths get used by .accumulate unfortunately. Although, it may be that we can simplify the check based on that knowledge?
EDIT: Sorry, to be clear, I think that is the case, I am not 100% sure.
I don't think it can be used by .accumulate since this ufunc has nin=1... If a flag is indeed passed, maybe it is worth replacing this with an assert on the stride and seeing if anything breaks? Logically, if we are passing the iterator a flag that we require contiguous input and output, then we should never get anything else...
Ah, there is no flag to enforce a contiguous loop, but we could probably add one; if we need this more, I think that makes sense. The buffer machinery will have a lot of overhead unfortunately, but overall it is still likely better.
For overlap, I believe the rule should be (or am I missing something?!):
- If memory overlap exists it must be exactly the same memory (i.e. args[0] == args[1] and steps identical).
- Except for .accumulate which, you are right, doesn't apply here.
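To make the rule above concrete, here is a small illustrative Python sketch (not code from this PR) of the only overlap pattern that rule would permit for a nin=1 ufunc like sin: a fully in-place call where input and output are exactly the same buffer with identical strides.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1_000, dtype=np.float32)

# Allowed: exact self-overlap, i.e. args[0] == args[1] with identical strides.
np.sin(x, out=x)

# Partial overlap (same base buffer, shifted view) is what the ufunc machinery
# has to detect and resolve, e.g. via buffering, before the inner loop runs.
y = np.empty(1_001, dtype=np.float32)
a, b = y[:-1], y[1:]
print(np.shares_memory(a, b))   # True: overlapping but not identical memory
np.sin(a, out=b)                # still correct thanks to that overlap handling
```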
Ah, if there is no "needs contiguous" flag, it is out of scope here. But I think it does make sense to add the option to request buffering to a contiguous and aligned chunk to the iterator. I would think the overhead cannot be that much worse than it is here, since one would already be inside the iterator anyway typically for non-contiguous data. And it would certainly be nice to keep the underlying loops simple...
> I would think the overhead cannot be that much worse than it is here

Well, the buffering has a "huge" amount of one-time overhead. For very large arrays, yes, the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).
@seberg - you mentioned there is no flag, but is one needed? If sin/cos are defined as an array method, would things not work automatically if one filled only the contiguous_loop slot? Or is that at the moment only treated as something that will be used, if present, when data are contiguous?
> I would think the overhead cannot be that much worse than it is here
>
> Well, the buffering has a "huge" amount of one-time overhead. For very large arrays, yes, the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).
The other thing is that if we centralize this, there will be more of an incentive to speed it up. Isn't the buffer pre-allocated by the compiler here? It would make more sense to have just one scratch buffer in the iterator, perhaps using your mechanism that allocates if it is too small.
@mhvk no, not yet, the contiguous loop is an optimization only; only with that flag or a "non-contig to contig wrapper" could it be more. Dunno what is better, centralizing or not; I mostly like it because it removes unnecessary repetition both in code and binary, whether it is slower or faster.
(but yeah, we shouldn't worry about that here.)
So, basically it would be some way to tell the ufunc (in this case) that the NPY_ITER_CONTIG flag should be passed on to the iterator... Anyway, let's take the discussion out of this PR -- see #30413.
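For context on what such buffering does, the Python-level nditer exposes the same machinery. This illustrative sketch (not part of the PR) shows how the "buffered" and "contig" operand flags hand the inner loop contiguous chunks even for strided input:

```python
import numpy as np

a = np.arange(20.0)[::2]        # non-contiguous view (stride of 2 elements)
out = np.empty(10)

it = np.nditer(
    [a, out],
    flags=['external_loop', 'buffered'],
    op_flags=[['readonly', 'contig'], ['writeonly', 'contig']],
)
with it:
    for x, y in it:
        # x and y are contiguous buffers here, so a SIMD kernel could
        # process them directly without a strided fallback path.
        y[...] = np.sin(x)

print(np.allclose(out, np.sin(a)))   # True
```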
```cpp
/// npsr is tag free by design so we only include it within main namespace (np::simd)
namespace sr = npsr::HWY_NAMESPACE;
/// Default precision configurations for NumPy SIMD Routines
```
Might be good here to tell what low and high imply (or at least where to look it up)
```python
assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=2)
assert_array_max_ulp(np.sin(x_f32, out=x_f32), np.float32(np.sin(x_f64)), maxulp=maxulp_f32)
assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=maxulp_f32)
```
It may be good to add an explicit test that cos(0) == 1 for all floats, as it is the thing that threw our tests off in a previous iteration (and I think it is something you now explicitly capture).
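A minimal sketch of such an identity test, assuming plain pytest-style assertions rather than the PR's actual test layout:

```python
import numpy as np

def test_sin_cos_exact_identities():
    # cos(0) must be exactly 1 and sin(0) exactly 0 for every float dtype,
    # regardless of which SIMD path the build dispatches to.
    for dtype in (np.float32, np.float64):
        zero = np.zeros(64, dtype=dtype)   # long enough to reach the SIMD loop
        assert np.all(np.cos(zero) == dtype(1))
        assert np.all(np.sin(zero) == dtype(0))
```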
- Extend the C++ doc scope to better explain precision control, which is chosen based on the data type.
- Add test cases for sine/cosine facts, for example `cos(0) == 1`.
Benchmark
X86 Platform
The following benchmark was run with GCC 14.3.0 on an x86 CPU (Ryzen 7 7700X).
Environment:
- x86-64-v4
- x86-64-v3
- x86-64-v2
Binary size change: _multiarray_umath.cpython-312-x86_64-linux-gnu.so
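A quick way to check such a size delta locally; the build directories and paths below are illustrative placeholders, not taken from this PR:

```python
import os

# Hypothetical paths to the extension module built from main and from this branch.
before = "build-main/numpy/_core/_multiarray_umath.cpython-312-x86_64-linux-gnu.so"
after = "build-pr/numpy/_core/_multiarray_umath.cpython-312-x86_64-linux-gnu.so"

size_before = os.path.getsize(before)
size_after = os.path.getsize(after)
print(f"before: {size_before / 1024:.1f} KiB")
print(f"after:  {size_after / 1024:.1f} KiB")
print(f"delta:  {(size_after - size_before) / 1024:+.1f} KiB")
```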