SIMD: Replace raw SIMD of sin/cos with NPYV (universal intrinsics) #17587


Merged
mattip merged 3 commits into numpy:master from seiko2plus:to_npyv_sincos_f32
Dec 26, 2020

seiko2plus (Member) commented on Oct 19, 2020 (edited)

Merge after #17790, #17789

SIMD: Replace raw SIMD of sin/cos with NPYV

The new code improves performance when the output array is accessed
non-contiguously, without any performance regression elsewhere.
On PPC64LE performance increased by a factor of 2.0-3.0, and by 1.5-2.0 on aarch64.
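
To make "non-contiguous memory access for the output array" concrete, here is a minimal scalar sketch of a unary loop that reads and writes with independent element strides. It is pure Python and only illustrates the access pattern; it is not NumPy's implementation, and the helper name `strided_unary` is hypothetical.

```python
import math
from array import array


def strided_unary(func, src, in_stride, dst, out_stride, count):
    """Apply func to `count` elements, stepping through src and dst
    by independent element strides (hypothetical helper)."""
    for i in range(count):
        dst[i * out_stride] = func(src[i * in_stride])


n = 4
# Useful inputs sit at every 2nd slot; the 9.0 entries are never read.
src = array("f", [0.0, 9.0, 0.5, 9.0, 1.0, 9.0, 1.5, 9.0])
# Results are written to every 3rd slot; the other slots stay untouched.
dst = array("f", [0.0] * (n * 3))

strided_unary(math.sin, src, 2, dst, 3, n)
```

A SIMD version of the same loop has to gather its inputs and scatter its outputs whenever a stride differs from 1, which is why the strided cases are benchmarked separately from the contiguous ones.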

TODO:

Performance tests (ASV)

Args

--bench-compare master bench_ufunc_strides.Unary -- --sort name --cpu-affinity 1,5

X86

I had to rely on my local machine because I couldn't get stable ratios on AWS.
See the standalone benchmark for AVX512F.

CPU
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Stepping:            10
CPU MHz:             1800.344
CPU max MHz:         4000.0000
CPU min MHz:         400.0000
BogoMIPS:            3984.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx
OS
Linux ac6279ab1a82 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0

Benchmark

AVX2 & FMA3 - Changed only
       before           after         ratio
     [098a3b41]       [a0322ee9]
     <master>         <to_npyv_sincos_f32>
        259~3us       55.1~0.2us     0.21  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 2,'f')
        260~4us       56.2~0.2us     0.22  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 4,'f')
      334~0.8us      60.4~0.07us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 2,'f')
      335~0.9us       61.5~0.2us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 4,'f')
      337~0.4us       62.1~0.2us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 2,'f')
        339~2us       61.2~0.6us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 4,'f')
       266~10us       54.9~0.2us     0.21  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 2,'f')
       270~20us       55.6~0.2us     0.21  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 4,'f')
        331~3us       60.3~0.1us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 2,'f')
        332~2us       61.0~0.3us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 4,'f')
        336~1us       61.7~0.3us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 2,'f')
      335~0.2us       61.5~0.4us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 4,'f')
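
For readers unfamiliar with the ASV comparison output: the `ratio` column is the "after" time divided by the "before" time, so values below 1 mean the branch is faster. The short sketch below recomputes it for three rows copied from the table above; the helper name `asv_ratio` is illustrative only.

```python
def asv_ratio(before_us, after_us):
    """Ratio as shown in the comparison table: after / before (lower is better)."""
    return after_us / before_us


# (before, after) in microseconds, copied from the AVX2 & FMA3 rows above.
rows = {
    "cos, 1, 2": (259.0, 55.1),
    "cos, 2, 2": (334.0, 60.4),
    "sin, 4, 4": (335.0, 61.5),
}

for name, (before, after) in rows.items():
    ratio = asv_ratio(before, after)
    print(f"{name}: ratio={ratio:.2f}  speedup={1.0 / ratio:.1f}x")
```

Inverting the ratio gives the speedup factor, which is the form used by the standalone benchmark tables further down.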

Power little-endian

CPU
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable
processor   : 7
cpu         : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)
timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU         : Radix
OS
Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)

Benchmark

VSX2(ISA >= 2.07) - Changed only
       before           after         ratio
     [098a3b41]       [a0322ee9]
     <master>         <to_npyv_sincos_f32>
       120±0.2μs      44.7±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 1, 'f')
       121±0.5μs      48.9±0.04μs     0.40  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 2, 'f')
       121±0.3μs      49.1±0.04μs     0.41  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 4, 'f')
       121±0.2μs      48.7±0.02μs     0.40  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 1, 'f')
       121±0.1μs      52.4±0.04μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 2, 'f')
       121±0.1μs      52.5±0.05μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 4, 'f')
       121±0.2μs      48.8±0.06μs     0.40  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 1, 'f')
       122±0.6μs      52.6±0.04μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 2, 'f')
      122±0.09μs      53.0±0.01μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 4, 'f')
       126±0.6μs      44.1±0.01μs     0.35  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 1, 'f')
       131±0.5μs      48.2±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 2, 'f')
       130±0.7μs      48.4±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 4, 'f')
       131±0.6μs      47.9±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 1, 'f')
       131±0.5μs      51.4±0.04μs     0.39  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 2, 'f')
       131±0.6μs      51.6±0.02μs     0.39  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 4, 'f')
       130±0.7μs      48.1±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 1, 'f')
       131±0.2μs      51.7±0.02μs     0.39  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 2, 'f')
       131±0.4μs      52.0±0.05μs     0.40  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 4, 'f')

Performance tests (standalone #15987)

Args used within #15987

--filter "(sin|cos)::.*[f]" --strides 1 2 3 10 --msleep 1 --iteration 100

Note: --msleep 1 forces the running thread to sleep for 1 millisecond before collecting each sample
to undo any frequency reduction, since throttling seems to affect wall time when AVX512F is enabled.
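
The sampling idea described in the note can be sketched as follows: pause briefly before every sample, time one call, and summarize the samples with a geometric mean (the metric reported in the tables below). This is a minimal illustration, not the benchmark tool from #15987; the function name `sample_gmean_ms` and its parameters are assumptions that only mirror the `--msleep` and `--iteration` arguments.

```python
import math
import statistics
import time


def sample_gmean_ms(func, iterations=100, msleep=1):
    """Time `func` repeatedly and return the geometric mean in milliseconds.

    Sleeping before each sample gives the core a chance to recover from
    any frequency reduction caused by the previous run.
    """
    samples = []
    for _ in range(iterations):
        time.sleep(msleep / 1000.0)           # pause before collecting the sample
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.geometric_mean(samples)  # requires Python 3.8+


data = [i / 1024 for i in range(1024)]
result = sample_gmean_ms(lambda: [math.sin(x) for x in data], iterations=20)
print(f"gmean: {result:.4f} ms")
```

The geometric mean is less sensitive than the arithmetic mean to a few unusually slow samples, which fits timings that are occasionally disturbed by the scheduler or by throttling.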

X86

CPU
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:                        7
CPU MHz:                         3604.410
BogoMIPS:                        6000.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        2 MiB
L3 cache:                        35.8 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
OS
Linux ip-172-31-28-146 5.4.0-1025-aws
gcc version 7.5.0 (Ubuntu 7.5.0-6ubuntu2)

Benchmark

AVX512F - Contiguous only

metric: gmean, units: ms

| name of test | before_contig_avx512f | after_contig_avx512f | after_contig_avx512f vs before_contig_avx512f |
|---|---|---|---|
| cos::1024 f::1 -> f::1 | 0.0011 | 0.0011 | 1.01 |
| cos::2048 f::1 -> f::1 | 0.0018 | 0.0018 | 1.01 |
| cos::4096 f::1 -> f::1 | 0.0033 | 0.0029 | 1.13 |
| sin::1024 f::1 -> f::1 | 0.0011 | 0.0011 | 1.01 |
| sin::2048 f::1 -> f::1 | 0.0018 | 0.0018 | 0.98 |
| sin::4096 f::1 -> f::1 | 0.0032 | 0.0030 | 1.07 |

AVX512F

metric: gmean, units: ms

| name of test | before_avx512f | after_avx512f | after_avx512f vs before_avx512f |
|---|---|---|---|
| cos::1024 f::1 -> f::1 | 0.0011 | 0.0011 | 1.01 |
| cos::2048 f::1 -> f::1 | 0.0018 | 0.0018 | 0.99 |
| cos::4096 f::1 -> f::1 | 0.0034 | 0.0032 | 1.05 |
| cos::1024 f::1 -> f::2 | 0.0139 | 0.0010 | 14.02 |
| cos::2048 f::1 -> f::2 | 0.0278 | 0.0019 | 14.76 |
| cos::4096 f::1 -> f::2 | 0.0561 | 0.0046 | 12.17 |
| cos::1024 f::1 -> f::3 | 0.0140 | 0.0010 | 13.88 |
| cos::2048 f::1 -> f::3 | 0.0280 | 0.0019 | 14.76 |
| cos::4096 f::1 -> f::3 | 0.0565 | 0.0045 | 12.54 |
| cos::1024 f::1 -> f::10 | 0.0140 | 0.0012 | 12.03 |
| cos::2048 f::1 -> f::10 | 0.0280 | 0.0020 | 14.18 |
| cos::4096 f::1 -> f::10 | 0.0562 | 0.0046 | 12.13 |
| cos::1024 f::2 -> f::1 | 0.0010 | 0.0009 | 1.07 |
| cos::2048 f::2 -> f::1 | 0.0019 | 0.0017 | 1.08 |
| cos::4096 f::2 -> f::1 | 0.0038 | 0.0035 | 1.09 |
| cos::1024 f::2 -> f::2 | 0.0200 | 0.0012 | 16.48 |
| cos::2048 f::2 -> f::2 | 0.0400 | 0.0025 | 16.22 |
| cos::4096 f::2 -> f::2 | 0.0799 | 0.0048 | 16.82 |
| cos::1024 f::2 -> f::3 | 0.0200 | 0.0012 | 16.53 |
| cos::2048 f::2 -> f::3 | 0.0400 | 0.0024 | 16.88 |
| cos::4096 f::2 -> f::3 | 0.0800 | 0.0047 | 17.02 |
| cos::1024 f::2 -> f::10 | 0.0200 | 0.0013 | 15.8 |
| cos::2048 f::2 -> f::10 | 0.0400 | 0.0025 | 16.08 |
| cos::4096 f::2 -> f::10 | 0.0801 | 0.0050 | 16.02 |
| cos::1024 f::3 -> f::1 | 0.0010 | 0.0009 | 1.07 |
| cos::2048 f::3 -> f::1 | 0.0019 | 0.0017 | 1.09 |
| cos::4096 f::3 -> f::1 | 0.0039 | 0.0035 | 1.11 |
| cos::1024 f::3 -> f::2 | 0.0200 | 0.0012 | 16.6 |
| cos::2048 f::3 -> f::2 | 0.0400 | 0.0024 | 16.65 |
| cos::4096 f::3 -> f::2 | 0.0802 | 0.0048 | 16.53 |
| cos::1024 f::3 -> f::3 | 0.0200 | 0.0012 | 16.69 |
| cos::2048 f::3 -> f::3 | 0.0400 | 0.0024 | 16.92 |
| cos::4096 f::3 -> f::3 | 0.0802 | 0.0047 | 17.09 |
| cos::1024 f::3 -> f::10 | 0.0200 | 0.0013 | 15.8 |
| cos::2048 f::3 -> f::10 | 0.0400 | 0.0025 | 16.01 |
| cos::4096 f::3 -> f::10 | 0.0804 | 0.0050 | 16.23 |
| cos::1024 f::10 -> f::1 | 0.0011 | 0.0010 | 1.08 |
| cos::2048 f::10 -> f::1 | 0.0021 | 0.0019 | 1.11 |
| cos::4096 f::10 -> f::1 | 0.0042 | 0.0038 | 1.11 |
| cos::1024 f::10 -> f::2 | 0.0200 | 0.0013 | 15.14 |
| cos::2048 f::10 -> f::2 | 0.0400 | 0.0026 | 15.54 |
| cos::4096 f::10 -> f::2 | 0.0801 | 0.0052 | 15.46 |
| cos::1024 f::10 -> f::3 | 0.0200 | 0.0013 | 14.92 |
| cos::2048 f::10 -> f::3 | 0.0400 | 0.0026 | 15.54 |
| cos::4096 f::10 -> f::3 | 0.0799 | 0.0051 | 15.59 |
| cos::1024 f::10 -> f::10 | 0.0200 | 0.0014 | 14.02 |
| cos::2048 f::10 -> f::10 | 0.0400 | 0.0028 | 14.4 |
| cos::4096 f::10 -> f::10 | 0.0802 | 0.0055 | 14.49 |
| sin::1024 f::1 -> f::1 | 0.0011 | 0.0011 | 1.01 |
| sin::2048 f::1 -> f::1 | 0.0017 | 0.0017 | 1.03 |
| sin::4096 f::1 -> f::1 | 0.0033 | 0.0032 | 1.02 |
| sin::1024 f::1 -> f::2 | 0.0132 | 0.0013 | 10.26 |
| sin::2048 f::1 -> f::2 | 0.0264 | 0.0020 | 13.36 |
| sin::4096 f::1 -> f::2 | 0.0533 | 0.0046 | 11.55 |
| sin::1024 f::1 -> f::3 | 0.0132 | 0.0013 | 10.5 |
| sin::2048 f::1 -> f::3 | 0.0267 | 0.0020 | 13.49 |
| sin::4096 f::1 -> f::3 | 0.0532 | 0.0046 | 11.61 |
| sin::1024 f::1 -> f::10 | 0.0132 | 0.0014 | 9.35 |
| sin::2048 f::1 -> f::10 | 0.0264 | 0.0021 | 12.63 |
| sin::4096 f::1 -> f::10 | 0.0528 | 0.0047 | 11.31 |
| sin::1024 f::2 -> f::1 | 0.0012 | 0.0011 | 1.04 |
| sin::2048 f::2 -> f::1 | 0.0020 | 0.0019 | 1.06 |
| sin::4096 f::2 -> f::1 | 0.0038 | 0.0035 | 1.07 |
| sin::1024 f::2 -> f::2 | 0.0181 | 0.0015 | 12.21 |
| sin::2048 f::2 -> f::2 | 0.0361 | 0.0023 | 15.41 |
| sin::4096 f::2 -> f::2 | 0.0723 | 0.0047 | 15.53 |
| sin::1024 f::2 -> f::3 | 0.0181 | 0.0014 | 12.73 |
| sin::2048 f::2 -> f::3 | 0.0364 | 0.0023 | 15.76 |
| sin::4096 f::2 -> f::3 | 0.0723 | 0.0047 | 15.26 |
| sin::1024 f::2 -> f::10 | 0.0181 | 0.0015 | 12.2 |
| sin::2048 f::2 -> f::10 | 0.0362 | 0.0024 | 14.85 |
| sin::4096 f::2 -> f::10 | 0.0724 | 0.0049 | 14.82 |
| sin::1024 f::3 -> f::1 | 0.0012 | 0.0011 | 1.04 |
| sin::2048 f::3 -> f::1 | 0.0020 | 0.0019 | 1.05 |
| sin::4096 f::3 -> f::1 | 0.0038 | 0.0036 | 1.08 |
| sin::1024 f::3 -> f::2 | 0.0181 | 0.0015 | 12.45 |
| sin::2048 f::3 -> f::2 | 0.0362 | 0.0024 | 15.39 |
| sin::4096 f::3 -> f::2 | 0.0724 | 0.0047 | 15.44 |
| sin::1024 f::3 -> f::3 | 0.0181 | 0.0014 | 12.65 |
| sin::2048 f::3 -> f::3 | 0.0365 | 0.0023 | 15.71 |
| sin::4096 f::3 -> f::3 | 0.0724 | 0.0047 | 15.26 |
| sin::1024 f::3 -> f::10 | 0.0181 | 0.0015 | 12.29 |
| sin::2048 f::3 -> f::10 | 0.0362 | 0.0025 | 14.77 |
| sin::4096 f::3 -> f::10 | 0.0726 | 0.0049 | 14.92 |
| sin::1024 f::10 -> f::1 | 0.0013 | 0.0012 | 1.04 |
| sin::2048 f::10 -> f::1 | 0.0022 | 0.0021 | 1.08 |
| sin::4096 f::10 -> f::1 | 0.0042 | 0.0038 | 1.09 |
| sin::1024 f::10 -> f::2 | 0.0181 | 0.0015 | 11.79 |
| sin::2048 f::10 -> f::2 | 0.0361 | 0.0025 | 14.26 |
| sin::4096 f::10 -> f::2 | 0.0725 | 0.0051 | 14.28 |
| sin::1024 f::10 -> f::3 | 0.0181 | 0.0015 | 11.81 |
| sin::2048 f::10 -> f::3 | 0.0364 | 0.0025 | 14.37 |
| sin::4096 f::10 -> f::3 | 0.0723 | 0.0052 | 13.91 |
| sin::1024 f::10 -> f::10 | 0.0181 | 0.0016 | 11.17 |
| sin::2048 f::10 -> f::10 | 0.0362 | 0.0027 | 13.24 |
| sin::4096 f::10 -> f::10 | 0.0725 | 0.0055 | 13.3 |

AVX2 & FMA3 - Contiguous only

metric: gmean, units: ms

| name of test | before_contig_avx2_fma3 | after_contig_avx2_fma3 | after_contig_avx2_fma3 vs before_contig_avx2_fma3 |
|---|---|---|---|
| cos::1024 f::1 -> f::1 | 0.0015 | 0.0014 | 1.05 |
| cos::2048 f::1 -> f::1 | 0.0027 | 0.0026 | 1.05 |
| cos::4096 f::1 -> f::1 | 0.0053 | 0.0051 | 1.04 |
| sin::1024 f::1 -> f::1 | 0.0014 | 0.0014 | 1.02 |
| sin::2048 f::1 -> f::1 | 0.0026 | 0.0026 | 1.03 |
| sin::4096 f::1 -> f::1 | 0.0051 | 0.0050 | 1.03 |

AVX2 & FMA3

metric: gmean, units: ms

| name of test | before_avx2_fma3 | after_avx2_fma3 | after_avx2_fma3 vs before_avx2_fma3 |
|---|---|---|---|
| cos::1024 f::1 -> f::1 | 0.0015 | 0.0014 | 1.05 |
| cos::2048 f::1 -> f::1 | 0.0027 | 0.0026 | 1.05 |
| cos::4096 f::1 -> f::1 | 0.0052 | 0.0050 | 1.05 |
| cos::1024 f::1 -> f::2 | 0.0139 | 0.0019 | 7.24 |
| cos::2048 f::1 -> f::2 | 0.0279 | 0.0037 | 7.46 |
| cos::4096 f::1 -> f::2 | 0.0556 | 0.0073 | 7.59 |
| cos::1024 f::1 -> f::3 | 0.0139 | 0.0019 | 7.2 |
| cos::2048 f::1 -> f::3 | 0.0279 | 0.0037 | 7.51 |
| cos::4096 f::1 -> f::3 | 0.0552 | 0.0073 | 7.61 |
| cos::1024 f::1 -> f::10 | 0.0139 | 0.0019 | 7.39 |
| cos::2048 f::1 -> f::10 | 0.0279 | 0.0037 | 7.57 |
| cos::4096 f::1 -> f::10 | 0.0557 | 0.0072 | 7.72 |
| cos::1024 f::2 -> f::1 | 0.0018 | 0.0018 | 0.99 |
| cos::2048 f::2 -> f::1 | 0.0035 | 0.0035 | 0.99 |
| cos::4096 f::2 -> f::1 | 0.0066 | 0.0069 | 0.96 |
| cos::1024 f::2 -> f::2 | 0.0188 | 0.0023 | 8.2 |
| cos::2048 f::2 -> f::2 | 0.0376 | 0.0048 | 7.83 |
| cos::4096 f::2 -> f::2 | 0.0750 | 0.0088 | 8.55 |
| cos::1024 f::2 -> f::3 | 0.0188 | 0.0023 | 8.34 |
| cos::2048 f::2 -> f::3 | 0.0376 | 0.0044 | 8.52 |
| cos::4096 f::2 -> f::3 | 0.0751 | 0.0088 | 8.52 |
| cos::1024 f::2 -> f::10 | 0.0188 | 0.0023 | 8.28 |
| cos::2048 f::2 -> f::10 | 0.0376 | 0.0045 | 8.33 |
| cos::4096 f::2 -> f::10 | 0.0752 | 0.0090 | 8.32 |
| cos::1024 f::3 -> f::1 | 0.0018 | 0.0018 | 1.0 |
| cos::2048 f::3 -> f::1 | 0.0035 | 0.0035 | 1.0 |
| cos::4096 f::3 -> f::1 | 0.0067 | 0.0072 | 0.93 |
| cos::1024 f::3 -> f::2 | 0.0188 | 0.0022 | 8.43 |
| cos::2048 f::3 -> f::2 | 0.0375 | 0.0044 | 8.51 |
| cos::4096 f::3 -> f::2 | 0.0752 | 0.0092 | 8.15 |
| cos::1024 f::3 -> f::3 | 0.0188 | 0.0023 | 8.31 |
| cos::2048 f::3 -> f::3 | 0.0376 | 0.0044 | 8.54 |
| cos::4096 f::3 -> f::3 | 0.0750 | 0.0093 | 8.1 |
| cos::1024 f::3 -> f::10 | 0.0188 | 0.0024 | 7.93 |
| cos::2048 f::3 -> f::10 | 0.0375 | 0.0045 | 8.36 |
| cos::4096 f::3 -> f::10 | 0.0753 | 0.0094 | 8.04 |
| cos::1024 f::10 -> f::1 | 0.0019 | 0.0020 | 0.96 |
| cos::2048 f::10 -> f::1 | 0.0036 | 0.0037 | 0.96 |
| cos::4096 f::10 -> f::1 | 0.0072 | 0.0073 | 0.98 |
| cos::1024 f::10 -> f::2 | 0.0188 | 0.0025 | 7.66 |
| cos::2048 f::10 -> f::2 | 0.0375 | 0.0048 | 7.79 |
| cos::4096 f::10 -> f::2 | 0.0748 | 0.0096 | 7.8 |
| cos::1024 f::10 -> f::3 | 0.0188 | 0.0025 | 7.56 |
| cos::2048 f::10 -> f::3 | 0.0376 | 0.0048 | 7.78 |
| cos::4096 f::10 -> f::3 | 0.0750 | 0.0097 | 7.74 |
| cos::1024 f::10 -> f::10 | 0.0188 | 0.0025 | 7.52 |
| cos::2048 f::10 -> f::10 | 0.0375 | 0.0049 | 7.65 |
| cos::4096 f::10 -> f::10 | 0.0753 | 0.0098 | 7.69 |
| sin::1024 f::1 -> f::1 | 0.0015 | 0.0014 | 1.05 |
| sin::2048 f::1 -> f::1 | 0.0027 | 0.0025 | 1.05 |
| sin::4096 f::1 -> f::1 | 0.0051 | 0.0048 | 1.07 |
| sin::1024 f::1 -> f::2 | 0.0139 | 0.0018 | 7.5 |
| sin::2048 f::1 -> f::2 | 0.0277 | 0.0037 | 7.51 |
| sin::4096 f::1 -> f::2 | 0.0555 | 0.0071 | 7.8 |
| sin::1024 f::1 -> f::3 | 0.0138 | 0.0018 | 7.5 |
| sin::2048 f::1 -> f::3 | 0.0278 | 0.0037 | 7.6 |
| sin::4096 f::1 -> f::3 | 0.0556 | 0.0072 | 7.75 |
| sin::1024 f::1 -> f::10 | 0.0139 | 0.0018 | 7.56 |
| sin::2048 f::1 -> f::10 | 0.0277 | 0.0036 | 7.67 |
| sin::4096 f::1 -> f::10 | 0.0556 | 0.0071 | 7.88 |
| sin::1024 f::2 -> f::1 | 0.0018 | 0.0018 | 1.02 |
| sin::2048 f::2 -> f::1 | 0.0034 | 0.0034 | 0.99 |
| sin::4096 f::2 -> f::1 | 0.0065 | 0.0067 | 0.97 |
| sin::1024 f::2 -> f::2 | 0.0190 | 0.0022 | 8.48 |
| sin::2048 f::2 -> f::2 | 0.0382 | 0.0047 | 8.1 |
| sin::4096 f::2 -> f::2 | 0.0766 | 0.0086 | 8.95 |
| sin::1024 f::2 -> f::3 | 0.0190 | 0.0022 | 8.77 |
| sin::2048 f::2 -> f::3 | 0.0383 | 0.0043 | 8.84 |
| sin::4096 f::2 -> f::3 | 0.0762 | 0.0087 | 8.77 |
| sin::1024 f::2 -> f::10 | 0.0191 | 0.0022 | 8.68 |
| sin::2048 f::2 -> f::10 | 0.0382 | 0.0044 | 8.6 |
| sin::4096 f::2 -> f::10 | 0.0762 | 0.0087 | 8.72 |
| sin::1024 f::3 -> f::1 | 0.0018 | 0.0018 | 1.02 |
| sin::2048 f::3 -> f::1 | 0.0034 | 0.0034 | 0.99 |
| sin::4096 f::3 -> f::1 | 0.0066 | 0.0067 | 0.98 |
| sin::1024 f::3 -> f::2 | 0.0191 | 0.0022 | 8.77 |
| sin::2048 f::3 -> f::2 | 0.0382 | 0.0044 | 8.77 |
| sin::4096 f::3 -> f::2 | 0.0761 | 0.0086 | 8.87 |
| sin::1024 f::3 -> f::3 | 0.0191 | 0.0022 | 8.87 |
| sin::2048 f::3 -> f::3 | 0.0383 | 0.0043 | 8.87 |
| sin::4096 f::3 -> f::3 | 0.0760 | 0.0087 | 8.78 |
| sin::1024 f::3 -> f::10 | 0.0191 | 0.0022 | 8.76 |
| sin::2048 f::3 -> f::10 | 0.0383 | 0.0044 | 8.75 |
| sin::4096 f::3 -> f::10 | 0.0761 | 0.0088 | 8.66 |
| sin::1024 f::10 -> f::1 | 0.0019 | 0.0018 | 1.02 |
| sin::2048 f::10 -> f::1 | 0.0035 | 0.0035 | 1.0 |
| sin::4096 f::10 -> f::1 | 0.0068 | 0.0069 | 0.99 |
| sin::1024 f::10 -> f::2 | 0.0191 | 0.0022 | 8.63 |
| sin::2048 f::10 -> f::2 | 0.0381 | 0.0045 | 8.56 |
| sin::4096 f::10 -> f::2 | 0.0765 | 0.0088 | 8.74 |
| sin::1024 f::10 -> f::3 | 0.0192 | 0.0022 | 8.69 |
| sin::2048 f::10 -> f::3 | 0.0382 | 0.0044 | 8.6 |
| sin::4096 f::10 -> f::3 | 0.0765 | 0.0089 | 8.64 |
| sin::1024 f::10 -> f::10 | 0.0191 | 0.0022 | 8.52 |
| sin::2048 f::10 -> f::10 | 0.0382 | 0.0045 | 8.46 |
| sin::4096 f::10 -> f::10 | 0.0766 | 0.0090 | 8.55 |
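
In the standalone tables, unlike the ASV output, the last column appears to be a speedup: the "before" time divided by the "after" time, so values above 1 mean the branch is faster. Because the timings are printed with only four decimals, dividing the printed values will not reproduce the printed ratio exactly. The sketch below checks instead that each reported ratio lies inside the interval allowed by that rounding. The helper `speedup_bounds` is hypothetical, and the rows are copied from the x86 tables above.

```python
def speedup_bounds(before_ms, after_ms, half_step=0.00005):
    """Range of before/after values consistent with timings that were
    rounded to four decimal places (hypothetical helper)."""
    low = (before_ms - half_step) / (after_ms + half_step)
    high = (before_ms + half_step) / (after_ms - half_step)
    return low, high


# (test name, before, after, reported "after vs before")
rows = [
    ("AVX512F cos::1024 f::1 -> f::2", 0.0139, 0.0010, 14.02),
    ("AVX512F cos::4096 f::2 -> f::3", 0.0800, 0.0047, 17.02),
    ("AVX512F sin::1024 f::1 -> f::10", 0.0132, 0.0014, 9.35),
    ("AVX2    sin::4096 f::2 -> f::2", 0.0766, 0.0086, 8.95),
]

checks = []
for name, before, after, reported in rows:
    low, high = speedup_bounds(before, after)
    checks.append(low <= reported <= high)
    print(f"{name}: reported={reported}, consistent range=[{low:.2f}, {high:.2f}]")
```

Reading the column this way, the contiguous rows (`f::1 -> f::1`) stay close to 1.0 on x86, while rows with a strided output show the largest gains.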

ARM8 64-bit

CPU
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        243.75
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        4 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
OS
Linux ip-172-31-6-63 5.4.0-1024-aws #24-Ubuntu SMP Sat Sep 5 06:17:48 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
gcc-7 (Ubuntu/Linaro 7.5.0-6ubuntu2) 7.5.0

Benchmark

ASIMD - Contiguous only

metric: gmean, units: ms

| name of test | before_contig | after_contig | after_contig vs before_contig |
|---|---|---|---|
| cos::1024 f::1 -> f::1 | 0.0072 | 0.0037 | 1.93 |
| cos::2048 f::1 -> f::1 | 0.0149 | 0.0074 | 2.0 |
| cos::4096 f::1 -> f::1 | 0.0313 | 0.0149 | 2.11 |
| sin::1024 f::1 -> f::1 | 0.0072 | 0.0037 | 1.97 |
| sin::2048 f::1 -> f::1 | 0.0148 | 0.0073 | 2.03 |
| sin::4096 f::1 -> f::1 | 0.0305 | 0.0146 | 2.09 |

ASIMD

metric: gmean, units: ms

| name of test | before | after | after vs before |
|---|---|---|---|
| cos::1024 f::1 -> f::1 | 0.0057 | 0.0037 | 1.53 |
| cos::2048 f::1 -> f::1 | 0.0125 | 0.0074 | 1.68 |
| cos::4096 f::1 -> f::1 | 0.0260 | 0.0149 | 1.75 |
| cos::1024 f::1 -> f::2 | 0.0057 | 0.0042 | 1.37 |
| cos::2048 f::1 -> f::2 | 0.0124 | 0.0083 | 1.49 |
| cos::4096 f::1 -> f::2 | 0.0260 | 0.0166 | 1.56 |
| cos::1024 f::1 -> f::3 | 0.0057 | 0.0042 | 1.37 |
| cos::2048 f::1 -> f::3 | 0.0124 | 0.0083 | 1.49 |
| cos::4096 f::1 -> f::3 | 0.0260 | 0.0167 | 1.56 |
| cos::1024 f::1 -> f::10 | 0.0057 | 0.0042 | 1.37 |
| cos::2048 f::1 -> f::10 | 0.0127 | 0.0086 | 1.48 |
| cos::4096 f::1 -> f::10 | 0.0262 | 0.0167 | 1.57 |
| cos::1024 f::2 -> f::1 | 0.0060 | 0.0040 | 1.5 |
| cos::2048 f::2 -> f::1 | 0.0125 | 0.0080 | 1.56 |
| cos::4096 f::2 -> f::1 | 0.0261 | 0.0160 | 1.63 |
| cos::1024 f::2 -> f::2 | 0.0060 | 0.0044 | 1.36 |
| cos::2048 f::2 -> f::2 | 0.0125 | 0.0088 | 1.42 |
| cos::4096 f::2 -> f::2 | 0.0261 | 0.0177 | 1.47 |
| cos::1024 f::2 -> f::3 | 0.0061 | 0.0044 | 1.37 |
| cos::2048 f::2 -> f::3 | 0.0125 | 0.0088 | 1.41 |
| cos::4096 f::2 -> f::3 | 0.0262 | 0.0177 | 1.48 |
| cos::1024 f::2 -> f::10 | 0.0060 | 0.0044 | 1.36 |
| cos::2048 f::2 -> f::10 | 0.0126 | 0.0089 | 1.42 |
| cos::4096 f::2 -> f::10 | 0.0264 | 0.0177 | 1.49 |
| cos::1024 f::3 -> f::1 | 0.0057 | 0.0042 | 1.35 |
| cos::2048 f::3 -> f::1 | 0.0126 | 0.0084 | 1.51 |
| cos::4096 f::3 -> f::1 | 0.0265 | 0.0168 | 1.57 |
| cos::1024 f::3 -> f::2 | 0.0057 | 0.0047 | 1.22 |
| cos::2048 f::3 -> f::2 | 0.0126 | 0.0093 | 1.36 |
| cos::4096 f::3 -> f::2 | 0.0265 | 0.0187 | 1.42 |
| cos::1024 f::3 -> f::3 | 0.0057 | 0.0047 | 1.22 |
| cos::2048 f::3 -> f::3 | 0.0127 | 0.0093 | 1.36 |
| cos::4096 f::3 -> f::3 | 0.0266 | 0.0187 | 1.42 |
| cos::1024 f::3 -> f::10 | 0.0057 | 0.0047 | 1.22 |
| cos::2048 f::3 -> f::10 | 0.0128 | 0.0094 | 1.37 |
| cos::4096 f::3 -> f::10 | 0.0266 | 0.0187 | 1.43 |
| cos::1024 f::10 -> f::1 | 0.0060 | 0.0048 | 1.26 |
| cos::2048 f::10 -> f::1 | 0.0125 | 0.0095 | 1.31 |
| cos::4096 f::10 -> f::1 | 0.0263 | 0.0190 | 1.38 |
| cos::1024 f::10 -> f::2 | 0.0061 | 0.0051 | 1.2 |
| cos::2048 f::10 -> f::2 | 0.0125 | 0.0102 | 1.23 |
| cos::4096 f::10 -> f::2 | 0.0263 | 0.0204 | 1.29 |
| cos::1024 f::10 -> f::3 | 0.0061 | 0.0051 | 1.18 |
| cos::2048 f::10 -> f::3 | 0.0125 | 0.0102 | 1.22 |
| cos::4096 f::10 -> f::3 | 0.0263 | 0.0204 | 1.29 |
| cos::1024 f::10 -> f::10 | 0.0061 | 0.0052 | 1.16 |
| cos::2048 f::10 -> f::10 | 0.0126 | 0.0102 | 1.23 |
| cos::4096 f::10 -> f::10 | 0.0264 | 0.0206 | 1.28 |
| sin::1024 f::1 -> f::1 | 0.0073 | 0.0037 | 2.0 |
| sin::2048 f::1 -> f::1 | 0.0147 | 0.0073 | 2.01 |
| sin::4096 f::1 -> f::1 | 0.0300 | 0.0146 | 2.06 |
| sin::1024 f::1 -> f::2 | 0.0074 | 0.0041 | 1.79 |
| sin::2048 f::1 -> f::2 | 0.0146 | 0.0082 | 1.78 |
| sin::4096 f::1 -> f::2 | 0.0300 | 0.0164 | 1.83 |
| sin::1024 f::1 -> f::3 | 0.0073 | 0.0041 | 1.79 |
| sin::2048 f::1 -> f::3 | 0.0146 | 0.0082 | 1.78 |
| sin::4096 f::1 -> f::3 | 0.0300 | 0.0164 | 1.83 |
| sin::1024 f::1 -> f::10 | 0.0073 | 0.0041 | 1.78 |
| sin::2048 f::1 -> f::10 | 0.0147 | 0.0085 | 1.74 |
| sin::4096 f::1 -> f::10 | 0.0301 | 0.0164 | 1.83 |
| sin::1024 f::2 -> f::1 | 0.0072 | 0.0039 | 1.85 |
| sin::2048 f::2 -> f::1 | 0.0147 | 0.0078 | 1.89 |
| sin::4096 f::2 -> f::1 | 0.0301 | 0.0156 | 1.93 |
| sin::1024 f::2 -> f::2 | 0.0072 | 0.0044 | 1.65 |
| sin::2048 f::2 -> f::2 | 0.0147 | 0.0088 | 1.68 |
| sin::4096 f::2 -> f::2 | 0.0301 | 0.0176 | 1.71 |
| sin::1024 f::2 -> f::3 | 0.0073 | 0.0044 | 1.66 |
| sin::2048 f::2 -> f::3 | 0.0147 | 0.0088 | 1.68 |
| sin::4096 f::2 -> f::3 | 0.0302 | 0.0176 | 1.72 |
| sin::1024 f::2 -> f::10 | 0.0073 | 0.0044 | 1.66 |
| sin::2048 f::2 -> f::10 | 0.0148 | 0.0088 | 1.68 |
| sin::4096 f::2 -> f::10 | 0.0302 | 0.0175 | 1.72 |
| sin::1024 f::3 -> f::1 | 0.0073 | 0.0042 | 1.75 |
| sin::2048 f::3 -> f::1 | 0.0146 | 0.0083 | 1.76 |
| sin::4096 f::3 -> f::1 | 0.0299 | 0.0167 | 1.79 |
| sin::1024 f::3 -> f::2 | 0.0073 | 0.0046 | 1.59 |
| sin::2048 f::3 -> f::2 | 0.0146 | 0.0091 | 1.6 |
| sin::4096 f::3 -> f::2 | 0.0299 | 0.0183 | 1.63 |
| sin::1024 f::3 -> f::3 | 0.0073 | 0.0046 | 1.57 |
| sin::2048 f::3 -> f::3 | 0.0147 | 0.0091 | 1.6 |
| sin::4096 f::3 -> f::3 | 0.0299 | 0.0183 | 1.64 |
| sin::1024 f::3 -> f::10 | 0.0073 | 0.0046 | 1.59 |
| sin::2048 f::3 -> f::10 | 0.0147 | 0.0092 | 1.61 |
| sin::4096 f::3 -> f::10 | 0.0300 | 0.0183 | 1.64 |
| sin::1024 f::10 -> f::1 | 0.0073 | 0.0047 | 1.57 |
| sin::2048 f::10 -> f::1 | 0.0147 | 0.0094 | 1.57 |
| sin::4096 f::10 -> f::1 | 0.0301 | 0.0187 | 1.61 |
| sin::1024 f::10 -> f::2 | 0.0073 | 0.0050 | 1.45 |
| sin::2048 f::10 -> f::2 | 0.0146 | 0.0101 | 1.45 |
| sin::4096 f::10 -> f::2 | 0.0301 | 0.0201 | 1.5 |
| sin::1024 f::10 -> f::3 | 0.0073 | 0.0050 | 1.46 |
| sin::2048 f::10 -> f::3 | 0.0146 | 0.0101 | 1.45 |
| sin::4096 f::10 -> f::3 | 0.0301 | 0.0201 | 1.5 |
| sin::1024 f::10 -> f::10 | 0.0074 | 0.0051 | 1.45 |
| sin::2048 f::10 -> f::10 | 0.0147 | 0.0101 | 1.45 |
| sin::4096 f::10 -> f::10 | 0.0302 | 0.0201 | 1.5 |

Power little-endian

CPU
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable
processor   : 7
cpu         : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)
timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU         : Radix
OS
Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)

Benchmark

VSX2(ISA >= 2.07) - Contiguous only

metric: gmean, units: ms

| name of test | before_contig | after_contig | after_contig vs before_contig |
|---|---|---|---|
| cos::1024 f::1 -> f::1 | 0.0131 | 0.0044 | 2.94 |
| cos::2048 f::1 -> f::1 | 0.0265 | 0.0089 | 2.99 |
| cos::4096 f::1 -> f::1 | 0.0535 | 0.0176 | 3.03 |
| sin::1024 f::1 -> f::1 | 0.0134 | 0.0042 | 3.16 |
| sin::2048 f::1 -> f::1 | 0.0265 | 0.0085 | 3.13 |
| sin::4096 f::1 -> f::1 | 0.0542 | 0.0169 | 3.2 |

VSX2(ISA >= 2.07)

metric: gmean, units: ms

name of testbeforeafterafter vs before
cos::1024      f::1  ->  f::10.01270.00442.86
cos::2048      f::1  ->  f::10.02640.00882.99
cos::4096      f::1  ->  f::10.05380.01842.92
cos::1024      f::1  ->  f::20.01290.00482.7
cos::2048      f::1  ->  f::20.02680.00952.83
cos::4096      f::1  ->  f::20.05460.01892.88
cos::1024      f::1  ->  f::30.01290.00472.72
cos::2048      f::1  ->  f::30.02680.00952.82
cos::4096      f::1  ->  f::30.05470.01892.89
cos::1024      f::1  ->  f::100.01290.00472.72
cos::2048      f::1  ->  f::100.02690.00952.84
cos::4096      f::1  ->  f::100.05460.01902.87
cos::1024      f::2  ->  f::10.01310.00482.73
cos::2048      f::2  ->  f::10.02660.00952.79
cos::4096      f::2  ->  f::10.05470.01912.87
cos::1024      f::2  ->  f::20.01300.00512.55
cos::2048      f::2  ->  f::20.02660.01022.61
cos::4096      f::2  ->  f::20.05440.02102.58
cos::1024      f::2  ->  f::30.01310.00512.56
cos::2048      f::2  ->  f::30.02810.01032.74
cos::4096      f::2  ->  f::30.05440.02082.62
cos::1024      f::2  ->  f::100.01310.00512.55
cos::2048      f::2  ->  f::100.02660.01022.6
cos::4096      f::2  ->  f::100.05440.02052.65
cos::1024      f::3  ->  f::10.01300.00482.7
cos::2048      f::3  ->  f::10.02710.00962.84
cos::4096      f::3  ->  f::10.05430.01912.83
cos::1024      f::3  ->  f::2     0.0129    0.0051    2.53
cos::2048      f::3  ->  f::2     0.0271    0.0102    2.65
cos::4096      f::3  ->  f::2     0.0543    0.0205    2.65
cos::1024      f::3  ->  f::3     0.0130    0.0051    2.53
cos::2048      f::3  ->  f::3     0.0272    0.0102    2.66
cos::4096      f::3  ->  f::3     0.0543    0.0205    2.65
cos::1024      f::3  ->  f::10    0.0130    0.0053    2.46
cos::2048      f::3  ->  f::10    0.0280    0.0102    2.73
cos::4096      f::3  ->  f::10    0.0563    0.0204    2.75
cos::1024      f::10  ->  f::1    0.0133    0.0048    2.76
cos::2048      f::10  ->  f::1    0.0265    0.0096    2.77
cos::4096      f::10  ->  f::1    0.0551    0.0191    2.89
cos::1024      f::10  ->  f::2    0.0133    0.0051    2.6
cos::2048      f::10  ->  f::2    0.0266    0.0102    2.59
cos::4096      f::10  ->  f::2    0.0552    0.0205    2.7
cos::1024      f::10  ->  f::3    0.0133    0.0051    2.59
cos::2048      f::10  ->  f::3    0.0266    0.0102    2.59
cos::4096      f::10  ->  f::3    0.0552    0.0205    2.7
cos::1024      f::10  ->  f::10   0.0133    0.0051    2.58
cos::2048      f::10  ->  f::10   0.0265    0.0102    2.59
cos::4096      f::10  ->  f::10   0.0552    0.0205    2.7
sin::1024      f::1  ->  f::1     0.0134    0.0042    3.16
sin::2048      f::1  ->  f::1     0.0271    0.0085    3.2
sin::4096      f::1  ->  f::1     0.0535    0.0169    3.17
sin::1024      f::1  ->  f::2     0.0133    0.0046    2.9
sin::2048      f::1  ->  f::2     0.0268    0.0091    2.93
sin::4096      f::1  ->  f::2     0.0530    0.0183    2.9
sin::1024      f::1  ->  f::3     0.0133    0.0046    2.9
sin::2048      f::1  ->  f::3     0.0268    0.0091    2.94
sin::4096      f::1  ->  f::3     0.0530    0.0188    2.82
sin::1024      f::1  ->  f::10    0.0133    0.0047    2.83
sin::2048      f::1  ->  f::10    0.0268    0.0094    2.87
sin::4096      f::1  ->  f::10    0.0530    0.0183    2.9
sin::1024      f::2  ->  f::1     0.0131    0.0046    2.87
sin::2048      f::2  ->  f::1     0.0264    0.0092    2.89
sin::4096      f::2  ->  f::1     0.0530    0.0183    2.9
sin::1024      f::2  ->  f::2     0.0131    0.0050    2.65
sin::2048      f::2  ->  f::2     0.0264    0.0099    2.68
sin::4096      f::2  ->  f::2     0.0531    0.0198    2.68
sin::1024      f::2  ->  f::3     0.0131    0.0049    2.66
sin::2048      f::2  ->  f::3     0.0264    0.0099    2.68
sin::4096      f::2  ->  f::3     0.0530    0.0198    2.68
sin::1024      f::2  ->  f::10    0.0131    0.0050    2.65
sin::2048      f::2  ->  f::10    0.0265    0.0099    2.68
sin::4096      f::2  ->  f::10    0.0531    0.0197    2.69
sin::1024      f::3  ->  f::1     0.0130    0.0046    2.82
sin::2048      f::3  ->  f::1     0.0263    0.0092    2.86
sin::4096      f::3  ->  f::1     0.0532    0.0183    2.9
sin::1024      f::3  ->  f::2     0.0130    0.0050    2.61
sin::2048      f::3  ->  f::2     0.0263    0.0099    2.65
sin::4096      f::3  ->  f::2     0.0533    0.0198    2.69
sin::1024      f::3  ->  f::3     0.0136    0.0050    2.74
sin::2048      f::3  ->  f::3     0.0263    0.0099    2.65
sin::4096      f::3  ->  f::3     0.0533    0.0198    2.69
sin::1024      f::3  ->  f::10    0.0130    0.0050    2.61
sin::2048      f::3  ->  f::10    0.0263    0.0099    2.66
sin::4096      f::3  ->  f::10    0.0532    0.0198    2.69
sin::1024      f::10  ->  f::1    0.0128    0.0046    2.78
sin::2048      f::10  ->  f::1    0.0264    0.0092    2.88
sin::4096      f::10  ->  f::1    0.0537    0.0184    2.91
sin::1024      f::10  ->  f::2    0.0133    0.0050    2.67
sin::2048      f::10  ->  f::2    0.0264    0.0099    2.66
sin::4096      f::10  ->  f::2    0.0537    0.0198    2.72
sin::1024      f::10  ->  f::3    0.0128    0.0050    2.58
sin::2048      f::10  ->  f::3    0.0264    0.0099    2.67
sin::4096      f::10  ->  f::3    0.0537    0.0198    2.71
sin::1024      f::10  ->  f::10   0.0128    0.0050    2.58
sin::2048      f::10  ->  f::10   0.0264    0.0099    2.67
sin::4096      f::10  ->  f::10   0.0537    0.0198    2.71
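
For readers unfamiliar with the row labels, the sketch below shows the kind of access pattern a row such as `sin::1024  f::3 -> f::2` appears to describe. It assumes the label means float32 input read with an element stride of 3 and float32 output written with an element stride of 2; that reading is an assumption, not something stated in the benchmark output. The helper `strided_unary` is hypothetical and is a scalar reference only. It is not the NPYV kernel from this PR. In NumPy itself the equivalent call would be along the lines of `np.sin(x[::3], out=y[::2])`.

```python
import math
from array import array


def strided_unary(func, src, src_stride, dst, dst_stride, count):
    """Hypothetical scalar reference for a strided unary loop.

    Reads every `src_stride`-th element of `src`, applies `func`,
    and writes the result to every `dst_stride`-th element of `dst`.
    """
    for i in range(count):
        dst[i * dst_stride] = func(src[i * src_stride])


count = 4
src_stride, dst_stride = 3, 2  # assumed meaning of "f::3 -> f::2"

# array("f") holds C floats, in line with the f32-only scope of the PR.
src = array("f", [0.1 * i for i in range(count * src_stride)])
dst = array("f", [-1.0] * (count * dst_stride))

strided_unary(math.sin, src, src_stride, dst, dst_stride, count)

# Only the strided output slots are written; the gaps keep their old value.
print(list(dst))
```

The untouched gaps in `dst` are the point of the output-stride cases: the kernel must scatter results to non-contiguous addresses instead of issuing one contiguous store.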

@seiko2plus force-pushed the to_npyv_sincos_f32 branch 12 times, most recently from 7161e30 to e01dc6e (October 21, 2020 10:25)
@seiko2plus marked this pull request as ready for review (October 21, 2020 10:25)
@seiko2plus force-pushed the to_npyv_sincos_f32 branch 2 times, most recently from 518fd92 to 2a01e5f (November 1, 2020 16:39)
@charris added the labels "01 - Enhancement" and "component: SIMD" (Issues in SIMD (fast instruction sets) code or machinery) (Nov 2, 2020)
@seiko2plus force-pushed the to_npyv_sincos_f32 branch 2 times, most recently from 360472c to bb08eb2 (November 11, 2020 03:40)
@seiko2plus (Member, Author)

The new NPYV intrinsics have moved to separate pull requests: #17790, #17789

@seiko2plus force-pushed the to_npyv_sincos_f32 branch 7 times, most recently from b958d43 to a0322ee (December 26, 2020 10:48)
@seiko2plus (Member, Author)

ping @mattip

@@ -0,0 +1,230 @@
/*@targets
** $maxopt $werror baseline
Member, Author

Suggested change
- ** $maxopt $werror baseline
+ ** $maxopt baseline

Stop treating warnings as errors once the CI passes the tests.

Member

CI is passing

Member, Author

Done. I used this policy temporarily during development to detect any warnings.

@mattip (Member) commented Dec 26, 2020 (edited)

Nice speedups. Is this for 32-bit float only or also for 64-bit?

Edit: 32 bit only.

   The new code improves the performance of non-contiguous memory access
   for the output array without any reduction in performance.
   For PPC64LE the performance increased by 2-3.0, and 1.5-2.0 on aarch64.

   This test should not be exclusive to AVX. This patch also extends
   the unary test to cover different sets of output strides.
@seiko2plus (Member, Author)

@mattip, just replaced the raw SIMD code of f32 with NPYV.

@mattip merged commit ce82028 into numpy:master (Dec 26, 2020)
@mattip (Member)

Thanks @seiko2plus

Reviewers

@mattip left review comments
@Qiyu8 left review comments

Labels
01 - Enhancement
component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)

4 participants
@seiko2plus @mattip @Qiyu8 @charris
