ENH: Move dispatch-able umath fast-loops to the new dispatcher #16396
Conversation
3df132a to 486e6a8
LGTM. I know this is just a reorganization, but it would be prudent to run benchmarks to make sure nothing has changed.
Still waiting for a benchmark on this PR on a machine with AVX512.
I am on a machine with avx512f and avx512cd features. I benchmarked bench_ufunc and bench_avx and performance increased in all cases.
       before           after         ratio
     [58da484a]       [d96263c2]
-      21.6±0.1μs       20.5±0.1μs     0.95  bench_ufunc.UFunc.time_ufunc_types('bitwise_or')
-      13.8±0.3μs       12.8±0.2μs     0.93  bench_ufunc.UFunc.time_ufunc_types('invert')
-      21.6±0.4μs       19.9±0.3μs     0.92  bench_ufunc.UFunc.time_ufunc_types('bitwise_xor')
-        892±10ns         820±30ns     0.92  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), subok=True))
-        912±20ns         830±40ns     0.91  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), subok=True, where=True))
-      13.9±0.1μs       12.7±0.2μs     0.91  bench_ufunc.UFunc.time_ufunc_types('bitwise_not')
-        635±30ns         566±10ns     0.89  bench_ufunc.Scalar.time_add_scalar
-        914±40ns         810±30ns     0.89  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.)))
-      1.60±0.1μs      1.38±0.01μs     0.86  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0))
-      1.64±0.1μs      1.40±0.02μs     0.85  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0, None))
-      40.4±0.8μs         34.4±1μs     0.85  bench_ufunc.CustomInplace.time_char_or_temp
-        849±30ns         717±20ns     0.85  bench_ufunc.Scalar.time_add_scalar_conv
-        899±10ns         759±30ns     0.84  bench_ufunc.Scalar.time_add_scalar_conv_complex
-      1.64±0.1μs      1.37±0.01μs     0.84  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.])))
-      32.8±0.4μs         25.1±1μs     0.76  bench_ufunc.CustomInplace.time_int_or
-      26.3±0.3μs       19.3±0.3μs     0.73  bench_ufunc.CustomInplace.time_char_or
-      47.7±0.4μs       31.8±0.2μs     0.67  bench_ufunc.CustomInplace.time_double_add
-      93.8±0.6μs       25.0±0.3μs     0.27  bench_ufunc.UFunc.time_ufunc_types('right_shift')
-      96.6±0.4μs       22.3±0.3μs     0.23  bench_ufunc.UFunc.time_ufunc_types('left_shift')
-      5.45±0.1μs       5.11±0.1μs     0.94  bench_avx.AVX_cmplx_funcs.time_ufunc('conjugate', 1, 'D')
-     3.13±0.03μs      2.91±0.06μs     0.93  bench_avx.AVX_UFunc.time_ufunc('trunc', 1, 'd')
-     3.17±0.02μs      2.91±0.05μs     0.92  bench_avx.AVX_UFunc.time_ufunc('floor', 1, 'd')
-     6.35±0.06μs       5.12±0.1μs     0.81  bench_avx.AVX_UFunc.time_ufunc('frexp', 4, 'f')
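The largest wins in the table are the shift ufuncs. For readers who want to spot-check such a number outside the asv suite, here is a minimal timing sketch; the array size and repeat count are my own arbitrary choices, not the benchmark suite's settings:

```python
# Rough local spot-check of the left_shift timing rows above (the array
# size and repeat count here are arbitrary, not the asv suite's settings).
import timeit

import numpy as np

a = np.arange(100_000, dtype=np.int64)

per_call = timeit.timeit(lambda: np.left_shift(a, 2), number=200) / 200
print(f"left_shift over {a.size} int64 elements: {per_call * 1e6:.1f} us per call")

# Sanity check: shifting left by 2 is a multiplication by 4.
assert np.array_equal(np.left_shift(a, 2), a * 4)
```

Absolute numbers will of course differ from the asv output above, which was taken on a specific SKX machine; the point is only to observe the before/after ratio on the same hardware.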
seiko2plus commented Jul 11, 2020 (edited)
Benchmarking results (ufunc inner loops via #15987)

- Measuring before and after this patch. The target of the following benchmarks is to detect the auto-vectorization performance changes in the old
- Measuring the performance gain by enabling SKX in the new dispatcher, in other words comparing SKX (new dispatcher) vs AVX2 (new dispatcher).

Hardware & OS

AWS/EC2 optimized instance (c5d.xlarge), run on an Intel SKX chip (Skylake server).

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:              4
CPU MHz:               3402.876
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

Linux ip-172-31-8-210 5.3.0-1023-aws #25~18.04.1-Ubuntu SMP Fri Jun 5 15:18:30 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
seiko2plus commented Jul 11, 2020 (edited)
@mattip, I made two mistakes when I reported the binary size to @rgommers while reviewing this patch within #13516:
Now, what is the real cost?

# sizes in kbytes; the option '-Wl,--strip-debug' is passed to the linker during the build
# note: the following sizes are based on what GCC 7.5 generates; other versions or compilers may act differently
4864  _multiarray_umath_after.so
1436  _multiarray_umath_after.so.gz
4056  _multiarray_umath_after_noskx.so
1216  _multiarray_umath_after_noskx.so.gz
3548  _multiarray_umath_before.so
1084  _multiarray_umath_before.so.gz

And why did the binary size increase even without AVX512_SKX?

Any performance gain? I think we will need to replace the fast-loops macros with SIMD code to improve the performance of non-contiguous operations.
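The size figures above were gathered by hand from a local build. For reference, a small sketch of how comparable raw/gzipped numbers could be collected for an installed build; the glob pattern and the "largest match" heuristic are my own assumptions, meant to cover both NumPy 1.x's `numpy/core` and 2.x's `numpy/_core` layouts:

```python
# Report the on-disk and gzip-compressed size of the compiled umath
# extension, comparable to the kbyte figures quoted above.
import glob
import gzip
import os

import numpy as np

# Locate the extension module; the package dir is 'core' in NumPy 1.x and
# '_core' in 2.x, so match both and take the largest file (the shared lib).
pkg_dir = os.path.dirname(np.__file__)
candidates = glob.glob(os.path.join(pkg_dir, "*core", "_multiarray_umath*"))
so_path = max(candidates, key=os.path.getsize)

with open(so_path, "rb") as f:
    data = f.read()

print(f"{os.path.basename(so_path)}: "
      f"{os.path.getsize(so_path) // 1024} KB raw, "
      f"{len(gzip.compress(data)) // 1024} KB gzipped")
```

Note that an installed wheel is already stripped differently than the `-Wl,--strip-debug` build described above, so the absolute numbers will not match the table; the raw-vs-gzipped comparison is the transferable part.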
- add a new attribute `dispatch` to the umath generator, the same as the current `simd` attribute but for the new dispatcher; it requires the name of the dispatch-able source without its extension, plus the signature types
  Co-authored-by: Matti Picus <matti.picus@gmail.com>
- remove definitions and declarations of dispatch-able umath fast-loops from `loops.c.src` and `loops.h.src`
- re-implement the dispatch-able umath fast-loops to fit the requirements of the new dispatcher
- re-implement the workaround of disabling optimization for `*_right_shift`
- update the umath generator to use attribute `dispatch` instead of `simd`
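To make the mechanism concrete, here is a toy sketch of the dispatch idea in plain Python. This is not NumPy's actual generator or C infrastructure; all names here are invented for illustration. The point is that the best loop is resolved once from the CPU's feature set, so each ufunc call pays no per-call feature check:

```python
# Toy model of CPU-feature dispatch (illustration only, not NumPy code).
# Each loop below stands in for a separately compiled inner loop.
def add_loop_baseline(a, b):
    return [x + y for x, y in zip(a, b)]

def add_loop_avx2(a, b):          # stand-in for an AVX2-compiled loop
    return [x + y for x, y in zip(a, b)]

def add_loop_avx512_skx(a, b):    # stand-in for an AVX512_SKX-compiled loop
    return [x + y for x, y in zip(a, b)]

# Highest-priority target first, unconditional baseline last.
DISPATCH_TABLE = [
    (frozenset({"AVX512F", "AVX512CD", "AVX512VL", "AVX512BW", "AVX512DQ"}),
     add_loop_avx512_skx),
    (frozenset({"AVX2"}), add_loop_avx2),
    (frozenset(), add_loop_baseline),
]

def resolve(cpu_features):
    """Return the first loop whose required features are all present."""
    for required, loop in DISPATCH_TABLE:
        if required <= cpu_features:
            return loop

# Resolved once at load time; calls then go straight to the chosen loop.
add_loop = resolve({"SSE2", "AVX2"})   # selects add_loop_avx2 here
print(add_loop([1, 2], [3, 4]))        # [4, 6]
```

In the real dispatcher the equivalent of `DISPATCH_TABLE` is generated at build time from the `dispatch` attribute, and the loops are compiled per-target from the same dispatch-able source.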
I think the binary growth is worth it. The benchmarks seem to show this change is worthwhile in most cases. It would be nice if all the benchmarks were equal or faster; I wonder why AVX is sometimes better even for contiguous vectors with smaller data sizes (bytes).
I am not sure if these are related to the changes in this PR, but I am seeing regressions in some of the benchmarks on a SKX:
@r-devulap, which compiler/version did you use? It seems there's no way to achieve that by relying on auto-vectorization: I detected negative impacts with MSVC and old versions of GCC, and also weird behavior from gcc-9; not sure yet about the Intel compiler.
I'm going to replace the fast-loop macros with universal intrinsics; I see it as the only approach to achieve that.
gcc-9 (Ubuntu 9.2.1-17ubuntu1~18.04.1) 9.2.1 20191102
seiko2plus commented Jul 17, 2020 (edited)
I closed this PR in favor of replacing auto-vectorization with universal intrinsics, due to the impossibility of getting stable, equal performance across widely used compilers.
Perhaps a more gradual approach of replacing a single class of ufuncs at a time, say logical ops, might be more tractable.
Move dispatch-able umath fast-loops to the new dispatcher

Re-implement the dispatch-able umath fast-loops to fit the requirements of the new dispatcher; also, add a new attribute `dispatch` to the umath generator, similar to the current `simd` attribute but for the new dispatcher.