ENH: Add SIMD sin/cos implementation with numpy-simd-routines #29699


Open

seiko2plus wants to merge 4 commits into numpy:main from seiko2plus:brings_npsr

Conversation

seiko2plus (Member) commented Sep 6, 2025 (edited):
numpy-simd-routines is added as a subrepo in the Meson subprojects
directory. The current floating-point configuration is static: ~1 ULP is
used for double-precision and ~4 ULP for single-precision, with
floating-point error handling, extended-precision special cases for large
arguments, and subnormals enabled by default.

numpy-simd-routines supports all SIMD extensions that are supported
by Google Highway, including non-FMA extensions, and is fully independent
of libm, to guarantee unified results across all compilers and
platforms.

Note: there was previously no SIMD optimization enabled for sin/cos
for double-precision, only for single-precision.
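
For reference, a ULP bound like the ones above can be checked with a small standalone sketch such as the following (not part of this PR; the helper and its names are illustrative only):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Illustrative helper: counts how many representable floats lie between
// `computed` and the float rounding of a double-precision reference.
static int64_t ulp_distance(float computed, double reference) {
    const float ref = static_cast<float>(reference);
    int32_t a, b;
    std::memcpy(&a, &computed, sizeof a);
    std::memcpy(&b, &ref, sizeof b);
    // Map the sign-magnitude IEEE encoding onto a monotonic integer scale,
    // so adjacent finite floats differ by exactly one.
    const auto ordered = [](int32_t i) -> int64_t {
        return i >= 0 ? int64_t(i) : int64_t(INT32_MIN) - int64_t(i);
    };
    const int64_t d = ordered(a) - ordered(b);
    return d < 0 ? -d : d;
}

int main() {
    const double x = 1e6;  // a "large argument" for sin
    const float approx = std::sin(static_cast<float>(x));
    std::printf("ulp error vs. double reference: %lld\n",
                static_cast<long long>(ulp_distance(approx, std::sin(x))));
}
```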

Benchmark

X86 Platform

The following benchmarks were run with GCC 14.3.0 on an x86 CPU (Ryzen 7 7700X).

Environment
```
> uname -a
Linux seiko-pc 6.12.60 #1-NixOS SMP PREEMPT_DYNAMIC Mon Dec  1 10:43:41 UTC 2025 x86_64 GNU/Linux
> gcc --version
gcc (GCC) 14.3.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> python --version
Python 3.12.12
> lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      16
  On-line CPU(s) list:       0-15
Vendor ID:                   AuthenticAMD
  Model name:                AMD Ryzen 7 7700X 8-Core Processor
    CPU family:              25
    Model:                   97
    Thread(s) per core:      2
    Core(s) per socket:      8
    Socket(s):               1
    Stepping:                2
    Frequency boost:         enabled
    CPU(s) scaling MHz:      72%
    CPU max MHz:             5573.0000
    CPU min MHz:             545.0000
    BogoMIPS:                8982.62
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
```
x86-64-v4
```sh
export NPY_DISABLE_CPU_FEATURES=""
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"
```
| Change | Before [b70fc77] <brings_npsr~2> | After [c5fc8423] <brings_npsr> | Ratio | Benchmark (Parameter) |
| --- | --- | --- | --- | --- |
| - | 689±6μs | 640±30μs | 0.93 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f') |
| - | 683±0.9μs | 602±2μs | 0.88 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f') |
| - | 682±1μs | 603±1μs | 0.88 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f') |
| - | 695±0.7μs | 604±1μs | 0.87 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f') |
| - | 354±0.7μs | 231±0.4μs | 0.65 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f') |
| - | 363±0.2μs | 234±4μs | 0.64 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f') |
| - | 1.04±0.02ms | 632±20μs | 0.61 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f') |
| - | 1.02±0ms | 612±10μs | 0.6 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f') |
| - | 1.32±0.01ms | 795±2μs | 0.6 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f') |
| - | 1.72±0.03ms | 806±3μs | 0.47 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f') |
| - | 1.44±0.02ms | 603±0.3μs | 0.42 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f') |
| - | 5.05±0.1ms | 2.07±0.01ms | 0.41 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd') |
| - | 4.98±0.01ms | 1.88±0.03ms | 0.38 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd') |
| - | 5.95±0.02ms | 2.26±0.07ms | 0.38 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd') |
| - | 2.18±0.05ms | 792±1μs | 0.36 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f') |
| - | 1.78±0.03ms | 608±4μs | 0.34 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f') |
| - | 5.65±0.02ms | 1.87±0.2ms | 0.33 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd') |
| - | 5.65±0.01ms | 1.87±0.01ms | 0.33 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd') |
| - | 5.77±0.01ms | 1.90±0.01ms | 0.33 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd') |
| - | 6.00±0.04ms | 1.98±0.02ms | 0.33 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd') |
| - | 1.36±0.01ms | 443±2μs | 0.32 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f') |
| - | 5.75±0ms | 1.68±0.04ms | 0.29 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd') |
| - | 4.97±0ms | 1.46±0.04ms | 0.29 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd') |
| - | 2.42±0.2ms | 609±2μs | 0.25 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f') |
| - | 5.62±0.01ms | 1.25±0.02ms | 0.22 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd') |
| - | 5.72±0ms | 1.25±0.01ms | 0.22 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd') |
| - | 4.97±0ms | 1.08±0ms | 0.22 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd') |
| - | 7.08±0.01ms | 1.58±0.02ms | 0.22 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd') |
| - | 7.07±0ms | 1.20±0.01ms | 0.17 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd') |
| - | 1.49±0ms | 233±10μs | 0.16 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f') |
| - | 5.61±0ms | 828±0.4μs | 0.15 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd') |
| - | 5.73±0ms | 859±8μs | 0.15 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd') |
x86-64-v3
```sh
export NPY_DISABLE_CPU_FEATURES="X86_V4"
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"
```
| Change | Before [b70fc77] <brings_npsr~2> | After [c5fc8423] <brings_npsr> | Ratio | Benchmark (Parameter) |
| --- | --- | --- | --- | --- |
| + | 844±1μs | 939±50μs | 1.11 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f') |
| - | 4.98±0ms | 4.57±0.1ms | 0.92 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd') |
| - | 4.98±0ms | 4.20±0.01ms | 0.84 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd') |
| - | 6.03±0.06ms | 5.00±0.06ms | 0.83 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd') |
| - | 6.06±0.06ms | 4.78±0.04ms | 0.79 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd') |
| - | 1.37±0.01ms | 913±40μs | 0.67 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f') |
| - | 1.37±0.01ms | 901±10μs | 0.66 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f') |
| - | 1.45±0ms | 941±2μs | 0.65 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f') |
| - | 830±0.3μs | 511±2μs | 0.62 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f') |
| - | 7.07±0.01ms | 4.35±0.1ms | 0.61 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd') |
| - | 844±2μs | 496±5μs | 0.59 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f') |
| - | 1.46±0.01ms | 864±9μs | 0.59 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f') |
| - | 3.21±0.03ms | 1.83±0ms | 0.57 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f') |
| - | 7.08±0.01ms | 3.98±0.06ms | 0.56 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd') |
| - | 3.63±0.09ms | 1.84±0.01ms | 0.51 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f') |
| - | 5.65±0.01ms | 2.82±0.1ms | 0.5 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd') |
| - | 5.64±0ms | 2.74±0.3ms | 0.49 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd') |
| - | 5.75±0ms | 2.76±0.1ms | 0.48 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd') |
| - | 3.00±0.09ms | 1.44±0ms | 0.48 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f') |
| - | 3.78±0.1ms | 1.82±0.02ms | 0.48 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f') |
| - | 5.77±0.01ms | 2.68±0.02ms | 0.46 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd') |
| - | 5.73±0ms | 2.11±0.03ms | 0.37 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd') |
| - | 5.62±0ms | 2.03±0.01ms | 0.36 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd') |
| - | 5.62±0.02ms | 1.67±0ms | 0.3 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd') |
| - | 5.72±0ms | 1.73±0ms | 0.3 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd') |
| - | 3.23±0.02ms | 866±10μs | 0.27 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f') |
| - | 3.48±0.1ms | 856±3μs | 0.25 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f') |
| - | 3.68±0.06ms | 851±2μs | 0.23 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f') |
| - | 3.03±0.08ms | 497±4μs | 0.16 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f') |
x86-64-v2
```sh
export NPY_DISABLE_CPU_FEATURES="X86_V4 X86_V3"
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"
```
| Change | Before [b70fc77] <brings_npsr~2> | After [c5fc8423] <brings_npsr> | Ratio | Benchmark (Parameter) |
| --- | --- | --- | --- | --- |
| + | 4.98±0ms | 7.04±0.1ms | 1.41 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd') |
| + | 2.12±0ms | 2.88±0.05ms | 1.36 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f') |
| + | 4.97±0.01ms | 6.68±0.02ms | 1.34 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd') |
| + | 2.12±0ms | 2.84±0.01ms | 1.34 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f') |
| + | 2.13±0ms | 2.82±0ms | 1.32 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f') |
| + | 4.98±0.01ms | 6.19±0.05ms | 1.24 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd') |
| + | 4.97±0ms | 5.77±0.05ms | 1.16 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd') |
| + | 5.96±0.03ms | 6.90±0.2ms | 1.16 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd') |
| + | 2.13±0ms | 2.45±0ms | 1.15 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f') |
| + | 5.95±0.06ms | 6.75±0.03ms | 1.13 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd') |
| - | 7.07±0.01ms | 6.31±0.02ms | 0.89 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd') |
| - | 7.08±0ms | 5.86±0.03ms | 0.83 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd') |
| - | 5.64±0.01ms | 4.47±0.3ms | 0.79 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd') |
| - | 5.76±0.01ms | 4.15±0.1ms | 0.72 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd') |
| - | 5.64±0.02ms | 3.94±0.06ms | 0.7 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd') |
| - | 5.76±0.01ms | 4.01±0.1ms | 0.7 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd') |
| - | 2.01±0.01ms | 1.36±0.02ms | 0.67 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f') |
| - | 2.02±0.02ms | 1.35±0.03ms | 0.67 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f') |
| - | 2.02±0ms | 1.30±0ms | 0.64 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f') |
| - | 5.62±0.01ms | 3.54±0.08ms | 0.63 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd') |
| - | 5.73±0.01ms | 3.58±0.06ms | 0.62 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd') |
| - | 5.62±0.01ms | 3.11±0.07ms | 0.55 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd') |
| - | 5.72±0.01ms | 3.12±0.05ms | 0.54 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd') |
| - | 2.01±0ms | 926±3μs | 0.46 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f') |
| - | 3.29±0ms | 1.45±0.1ms | 0.44 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f') |
| - | 3.31±0ms | 1.41±0.04ms | 0.42 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f') |
| - | 3.31±0ms | 1.36±0.01ms | 0.41 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f') |
| - | 3.28±0ms | 1.30±0ms | 0.4 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f') |
| - | 3.29±0ms | 1.32±0.03ms | 0.4 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f') |
| - | 3.30±0ms | 1.32±0.03ms | 0.4 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f') |
| - | 3.30±0ms | 947±4μs | 0.29 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f') |
| - | 3.28±0ms | 925±2μs | 0.28 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f') |

Binary size change: _multiarray_umath.cpython-312-x86_64-linux-gnu.so

| stripped | size (bytes) |
| --- | --- |
| before | 9,895,936 |
| after | 10,014,720 |
| diff | +118,784 |

seiko2plus force-pushed the brings_npsr branch 4 times, most recently from 09414c8 to af5d98a (September 7, 2025 22:33)
rgommers added the component: SIMD (Issues in SIMD (fast instruction sets) code or machinery) label (Sep 10, 2025)
seiko2plus added this to the 2.4.0 release milestone (Sep 13, 2025)
seiko2plus force-pushed the brings_npsr branch 2 times, most recently from 2f4c0a6 to 9b9a438 (October 8, 2025 02:30)
seiko2plus marked this pull request as ready for review (December 8, 2025 15:46)
numpy-simd-routines added as subrepo in meson subprojects directory and the current FP configuration is static, ~1ulp used for double-precision, ~4ulp for single-precision with handling floating-point errors, special-cases extended precision for large arguments, subnormals are enabled by default too.

numpy-simd-routines supports all SIMD extensions that are supported by Google Highway including non-FMA extensions and is fully independent from libm to guarantee unified results across all compilers and platforms.

Full benchmarks will be provided within the pull request; the following benchmark was tested against clang-19 and an x86 CPU (Ryzen 7 7700X) with AVX512 enabled.

Note: that there was no SIMD optimization enabled for sin/cos for double-precision, only single-precision.

| Before | After | Ratio | Benchmark (Parameter) |
| --- | --- | --- | --- |
| 713±6μs | 633±6μs | 0.89 | UnaryFP(<ufunc 'cos'>, 1, 2, 'f') |
| 717±9μs | 637±6μs | 0.89 | UnaryFP(<ufunc 'cos'>, 4, 1, 'f') |
| 705±3μs | 607±10μs | 0.86 | UnaryFP(<ufunc 'sin'>, 4, 1, 'f') |
| 714±10μs | 595±0.5μs | 0.83 | UnaryFP(<ufunc 'sin'>, 1, 2, 'f') |
| 370±0.3μs | 277±4μs | 0.75 | UnaryFP(<ufunc 'cos'>, 1, 1, 'f') |
| 373±2μs | 236±0.6μs | 0.63 | UnaryFP(<ufunc 'sin'>, 1, 1, 'f') |
| 1.06±0.01ms | 648±3μs | 0.61 | UnaryFP(<ufunc 'cos'>, 4, 2, 'f') |
| 1.06±0.01ms | 617±30μs | 0.58 | UnaryFP(<ufunc 'sin'>, 4, 2, 'f') |
| 5.06±0.06ms | 2.61±0.3ms | 0.52 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'd') |
| 1.48±0ms | 715±5μs | 0.48 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'f') |
| 1.50±0.01ms | 639±6μs | 0.43 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'f') |
| 5.15±0.1ms | 1.96±0.01ms | 0.38 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'd') |
| 5.72±0.02ms | 2.09±0.1ms | 0.37 | UnaryFP(<ufunc 'cos'>, 4, 2, 'd') |
| 5.76±0.01ms | 2.03±0.08ms | 0.35 | UnaryFP(<ufunc 'sin'>, 4, 2, 'd') |
| 5.07±0.08ms | 1.76±0.2ms | 0.35 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'd') |
| 6.04±0.04ms | 2.05±0.09ms | 0.34 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'd') |
| 5.79±0.03ms | 1.90±0.2ms | 0.33 | UnaryFP(<ufunc 'sin'>, 4, 1, 'd') |
| 2.29±0.1ms | 762±40μs | 0.33 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'f') |
| 5.72±0.1ms | 1.75±0.07ms | 0.31 | UnaryFP(<ufunc 'cos'>, 4, 1, 'd') |
| 6.04±0.03ms | 1.82±0.2ms | 0.3 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'd') |
| 2.49±0.1ms | 748±30μs | 0.3 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'f') |
| 2.23±0.1ms | 634±6μs | 0.28 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'f') |
| 1.31±0.03ms | 367±5μs | 0.28 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'f') |
| 2.55±0.09ms | 654±30μs | 0.26 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'f') |
| 4.97±0.03ms | 1.14±0ms | 0.23 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'd') |
| 5.67±0.01ms | 1.22±0.03ms | 0.22 | UnaryFP(<ufunc 'cos'>, 1, 2, 'd') |
| 5.76±0.03ms | 1.28±0.06ms | 0.22 | UnaryFP(<ufunc 'sin'>, 1, 2, 'd') |
| 1.26±0.01ms | 272±2μs | 0.22 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'f') |
| 7.03±0.02ms | 1.31±0.01ms | 0.19 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'd') |
| 5.67±0.01ms | 810±9μs | 0.14 | UnaryFP(<ufunc 'cos'>, 1, 1, 'd') |
| 5.71±0.01ms | 817±40μs | 0.14 | UnaryFP(<ufunc 'sin'>, 1, 1, 'd') |
| 7.05±0.03ms | 915±4μs | 0.13 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'd') |
Allow up to 3 ULP error for float32 sin/cos when native FMA is not available.
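
(As a rough illustration of why accuracy can drop without FMA, and not part of this PR: the fused operation a*b + c rounds once, while the unfused form rounds the product first. The values below are contrived to make that visible.)

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Contrived single-precision inputs: a*a + c is tiny, so losing the
    // low-order bits of a*a to an intermediate rounding changes the result.
    const float a = 1.0f + 0x1.0p-12f;     // 1 + 2^-12
    const float c = -(1.0f + 0x1.0p-11f);  // -(1 + 2^-11)
    const float fused = std::fma(a, a, c); // exact product, one rounding
    const float prod = a * a;              // rounds the product first...
    const float unfused = prod + c;        // ...then rounds the sum
    // Typically the fused result is 2^-24 (~5.96e-8) while the unfused one
    // is 0 (assuming the compiler does not contract prod + c into an FMA).
    std::printf("fused   = %g\nunfused = %g\n", fused, unfused);
}
```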
seiko2plus (Member, Author) commented Dec 9, 2025 (edited):
I have updated the benchmark to use GCC instead of Clang, as GCC is commonly the default compiler for wheels on Linux. At the moment, I only have access to bare-metal x86 hardware; perhaps I should try cloud services (e.g., AWS) for testing on other architectures. The ufunc inner implementation is optimized for size as much as possible and only increases the binary by +118,784 B (stripped) on x86/gcc.

cc: @charris, @rgommers, @mattip, @seberg, @r-devulap, @Mousius

seiko2plus (Member, Author) commented Dec 9, 2025 (edited):
Ah, regarding the reverted #23399, as discussed in the thread /C6EYZZSR4EWGVKHAZXLE7IBILRMNVK7L/: the default precision for f64 is set to up to 1 ULP, including for large arguments, with the ability to add a build-time option to change this behavior later:

```cpp
using PreciseHigh = decltype(npsr::Precise{});
using PreciseLow = decltype(npsr::Precise{npsr::kLowAccuracy});
struct PresiceDummy {};
template <typename T>
struct PreciseByType {};
template <>
struct PreciseByType<float> { using Type = PreciseLow; };
template <>
struct PreciseByType<double> {
#if NPY_HWY_F64
    using Type = PreciseHigh;
#else
    // If float64 SIMD isn't available, use a dummy type.
    // The scalar path will run, but `Type` must still be defined.
    // The dummy is never passed; it only satisfies interfaces.
    // This also avoids spurious FP exceptions during RAII.
    using Type = PresiceDummy;
#endif
};
}  // namespace detail
template <typename T>
using Precise = typename detail::PreciseByType<T>::Type;
```
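
A minimal standalone sketch of the same type-selected-precision idea (hypothetical names, not the PR's actual types) would be:

```cpp
#include <type_traits>

// Hypothetical policy tags standing in for the high/low accuracy modes.
struct HighAccuracy {};
struct LowAccuracy {};

// Pick the precision policy from the element type: double gets the stricter
// (~1 ULP) policy and float the relaxed (~4 ULP) one, mirroring the defaults
// described in this PR.
template <typename T>
using PrecisionFor =
    std::conditional_t<std::is_same_v<T, double>, HighAccuracy, LowAccuracy>;

static_assert(std::is_same_v<PrecisionFor<double>, HighAccuracy>);
static_assert(std::is_same_v<PrecisionFor<float>, LowAccuracy>);
```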

cc: @mhvk

rgommers (Member) commented:

> At the moment, I only have access to bare-metal x86 hardware; perhaps I should try cloud services (e.g., AWS) for testing on other architectures.

No more GCC compile farm access? AWS seems okay of course. If you want me to run this on macOS arm64 (M1), I can.

> The ufunc inner implementation is optimized for size as much as possible and only increases the binary by +118,784 B (stripped) on x86/gcc.

That's impressively small, glad to see that.

seberg (Member) commented:

@seiko2plus thanks a lot for this hard work! I don't want to discuss it on the PR here, but maybe you can send a very brief mail summarizing the precision changes for the float32 version? (I am slightly worried we might get into the same dance as with the 64-bit version and SVML before, so we should at least increase awareness.)

mhvk (Contributor) left a comment:

@seiko2plus - looks very impressive! A few small comments inline, with the one about the iterator perhaps the most important (if it is possible; it can be for a follow-up). I also made a small comment over at your new npsr routines.

```cpp
// Note: the intrinsics of NumPy SIMD routines are inlined by default.
if (NPY_UNLIKELY(is_mem_overlap(args[0], sin, args[1], sout, len) ||
                 sin != sizeof(T) || sout != sizeof(T))) {
    // this for non-contiguous or overlapping case
```
mhvk (Contributor):

I always have trouble knowing exactly what one can ask of the iterator, but is it not possible to set some flag that ensures that, for non-contiguous or overlapping data, a copy is made to a contiguous buffer? I ask since, if so, we do not have to do this inside the loop (with similar duplicated code presumably elsewhere).

seberg (Member) commented Dec 9, 2025 (edited):

This is already the case, I believe. But these paths get used by .accumulate, unfortunately. Although, it may be that we can simplify the check based on that knowledge?

EDIT: Sorry, to be clear, I think that is the case, I am not 100% sure.

mhvk (Contributor):

I don't think it can be used by .accumulate since this ufunc has nin=1... If a flag is indeed passed, maybe it is worth replacing this with an assert on the stride and seeing if anything breaks? Logically, if we are passing the iterator a flag that we require contiguous input and output, then we should never get anything else...

seberg (Member) commented Dec 9, 2025 (edited):

Ah, there is no flag to enforce a contiguous loop, but we could probably add one; if we need this more often, I think that makes sense. The buffer machinery will have a lot of overhead unfortunately, but overall it is likely still better.

For overlap, I believe the rule should be (or am I missing something?!) as sketched below:

  • If memory overlap exists it must be exactly the same memory (i.e. args[0] == args[1] and steps identical).
  • Except for .accumulate, which, you are right, doesn't apply here.
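
A minimal sketch of the check this rule implies (illustrative only; it assumes the conventional ufunc inner-loop arguments and is not NumPy's actual helper):

```cpp
#include <cstddef>

// Under the rule above, an overlapping unary call is acceptable only when it
// is a plain in-place operation: same base pointer and identical strides.
static bool overlap_is_plain_inplace(char *const *args, const std::ptrdiff_t *steps) {
    return args[0] == args[1] && steps[0] == steps[1];
}
```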

mhvk (Contributor):

Ah, if there is no "needs contiguous" flag, it is out of scope here. But I think it does make sense to add to the iterator the option to request buffering into a contiguous and aligned chunk. I would think the overhead cannot be that much worse than it is here, since one would typically already be inside the iterator anyway for non-contiguous data. And it would certainly be nice to keep the underlying loops simple...

seberg (Member) commented Dec 9, 2025 (edited):

> I would think the overhead cannot be that much worse than it is here

Well, the buffering has a "huge" amount of one-time overhead. For very large arrays, yes, the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).

mhvk (Contributor):

@seberg - you mentioned there is no flag, but is one needed? If sin/cos are defined as an array method, would things not work automatically if one filled only the contiguous_loop slot? Or is that at the moment only treated as something that will be used, if present, when the data are contiguous?

mhvk (Contributor):

> > I would think the overhead cannot be that much worse than it is here
>
> Well, the buffering has a "huge" amount of one-time overhead. For very large arrays, yes, the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).

The other thing is that if we centralize this, there will be more of an incentive to speed it up. Isn't the buffer pre-allocated by the compiler here? It would make more sense to have just one scratch buffer in the iterator, perhaps using your mechanism that allocates if it is too small.

seberg (Member):

@mhvk no, not yet; the contiguous loop is an optimization only. Only with such a flag or a "non-contig to contig" wrapper could it be more. Dunno what is better, centralizing or not; I mostly like it because it removes unnecessary repetition both in code and binary, whether it is slower or faster.
(But yeah, we shouldn't worry about that here.)
mhvk (Contributor):

So, basically it would be some way to tell the ufunc (in this case) that the NPY_ITER_CONTIG flag should be passed on to the iterator... Anyway, let's take the discussion out of this PR -- see #30413.

```cpp
/// npsr is tag free by design so we only include it within the main namespace (np::simd)
namespace sr = npsr::HWY_NAMESPACE;

/// Default precision configurations for NumPy SIMD Routines
```
mhvk (Contributor):

Might be good here to tell what low and high imply (or at least where to look it up).

```python
assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=2)
assert_array_max_ulp(np.sin(x_f32, out=x_f32), np.float32(np.sin(x_f64)), maxulp=maxulp_f32)
assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=maxulp_f32)
```

mhvk (Contributor):

It may be good to add an explicit test that cos(0) == 1 for all floats, as that is the thing that threw our tests off in a previous iteration (and I think it is something you now explicitly capture).

- Extend the C++ doc scope to better explain precision control, which is chosen based on the data type.
- Add test cases for sine/cosine facts, for example `cos(0) == 1`.

Reviewers: seberg, mhvk (left review comments)

Labels: 01 - Enhancement, component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)

Milestone: 2.5.0 Release

Participants: @seiko2plus, @rgommers, @seberg, @mhvk, @charris
