ENH: add neon intrinsics to fft radix2&4 #17231
Conversation
mattip commented Sep 3, 2020
It would be nice if this could be done with the universal intrinsics rather than redefining new macros.
DumbMice commented Sep 4, 2020 (edited)
I could try to use universal intrinsics in the way I learned to do in #16969 and compare their benchmark results. However, there are intrinsics, specifically …
mattip commented Sep 4, 2020
@seiko2plus do we have a plan for intrinsics that are CPU-specific? There are two alternatives (you probably already thought about this): …
Qiyu8 commented Sep 4, 2020
The best way is to provide a normal intrinsic when we meet CPU-specific intrinsics, as we can see in #17049, …
DumbMice commented Sep 4, 2020
Sorry, I forgot. BTW @mattip, these new macros use neon intrinsics with the prefix …
seiko2plus commented Sep 4, 2020 (edited)
@mattip, sure, we have special data types npyv_{type}x2 and npyv_{type}x3 to be used with the interleaving/deinterleaving intrinsics, and they're pretty compatible with neon intrinsics. We may later need to add a data type npyv_{type}x4 to handle four channels.
Here are tips to implement interleaving/deinterleaving intrinsics for two 64-bit channels:
NEON (native vld2q/vst2q de/interleave):

```c
#ifndef NPY_SIMD
    #error "Not a standalone header"
#endif
#ifndef _NPY_SIMD_NEON_INTERLEAVE_H
#define _NPY_SIMD_NEON_INTERLEAVE_H

#if NPY_SIMD_F64
NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{ return vld2q_f64(ptr); }

NPY_FINLINE void npyv_store_interleave_f64x2(double *ptr, npyv_f64x2 a)
{ vst2q_f64(ptr, a); }
#endif // NPY_SIMD_F64
#endif
```
Generic fallback for the other extensions, built on npyv_zip:

```c
#ifndef NPY_SIMD
    #error "Not a standalone header"
#endif
#include "memory.h"  // load*
#include "reorder.h" // npyv_zip*

#ifndef _NPY_SIMD_{EXTENSION_NAME}_INTERLEAVE_H
#define _NPY_SIMD_{EXTENSION_NAME}_INTERLEAVE_H

NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{
    npyv_f64 a = npyv_load_f64(ptr);
    npyv_f64 b = npyv_load_f64(ptr + npyv_nlanes_f64);
    return npyv_zip_f64(a, b);
}
NPY_FINLINE void npyv_store_interleave_f64x2(double *ptr, npyv_f64x2 a)
{
    npyv_f64x2 zip = npyv_zip_f64(a.val[0], a.val[1]);
    npyv_store_f64(ptr, zip.val[0]);
    npyv_store_f64(ptr + npyv_nlanes_f64, zip.val[1]);
}
#endif // _NPY_SIMD_{EXTENSION_NAME}_INTERLEAVE_H
```
AVX2:

```c
NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{
    npyv_f64 ab0 = npyv_load_f64(ptr);
    npyv_f64 ab1 = npyv_load_f64(ptr + npyv_nlanes_f64);
    npyv_f64x2 ab, swap_halves = npyv_combine_f64(ab0, ab1);
    ab.val[0] = _mm256_unpacklo_pd(swap_halves.val[0], swap_halves.val[1]);
    ab.val[1] = _mm256_unpackhi_pd(swap_halves.val[0], swap_halves.val[1]);
    return ab;
}
```
AVX512:

```c
NPY_FINLINE npyv_f64x2 npyv_load_deinterleave_f64x2(const double *ptr)
{
    npyv_f64 ab0 = npyv_load_f64(ptr);
    npyv_f64 ab1 = npyv_load_f64(ptr + npyv_nlanes_f64);
    npyv_f64x2 ab;
    ab.val[0] = _mm512_permutex2var_pd(ab0, _mm512_setr_epi64(0, 2, 4, 6, 8, 10, 12, 14), ab1);
    ab.val[1] = _mm512_permutex2var_pd(ab0, _mm512_setr_epi64(1, 3, 5, 7, 9, 11, 13, 15), ab1);
    return ab;
}
```

EDIT: forgot adding .val :), thanks @DumbMice!
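To make the intended use concrete, here is a minimal sketch of how these de/interleave intrinsics could be applied to an array of complex doubles stored as interleaved (re, im) pairs. The kernel, its name, and its assumptions (n is a multiple of 2*npyv_nlanes_f64, NPY_SIMD_F64 is available, numpy's simd headers are included) are illustrative, not part of this PR:

```c
// Illustrative sketch only: multiply an interleaved complex-double
// array of n doubles (n/2 complex values) by a real scalar.
static void scale_complex_f64(double *data, npy_intp n, double scalar)
{
    const npyv_f64 s = npyv_setall_f64(scalar);
    for (npy_intp i = 0; i < n; i += 2 * npyv_nlanes_f64) {
        // val[0] holds the real parts, val[1] the imaginary parts
        npyv_f64x2 ri = npyv_load_deinterleave_f64x2(data + i);
        ri.val[0] = npyv_mul_f64(ri.val[0], s);
        ri.val[1] = npyv_mul_f64(ri.val[1], s);
        npyv_store_interleave_f64x2(data + i, ri);
    }
}
```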
numpy/fft/_pocketfft.c (outdated)
```c
#ifdef NPY_HAVE_NEON
#include "lowlevel_strided_loops.h"
#include "arm_neon.h"

/* align var to alignment */
#define LOOP_BLOCK_ALIGN_VAR(var, type, alignment)\
    npy_intp i, peel = npy_aligned_block_offset(var, sizeof(type),\
                                                alignment, n);\
    for (i = 0; i < peel; i++)

#define LOOP_BLOCKED(type, vsize)\
    for (; i < npy_blocked_end(peel, sizeof(type), vsize, n);\
            i += (vsize / sizeof(type)))

#define LOOP_BLOCKED_END\
    for (; i < n; i++)
#endif
```
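For context, a sketch of how these three macros are meant to compose, using a simple element-doubling kernel; the kernel, function name, and the mixing-in of npyv intrinsics are illustrative assumptions, not code from the PR:

```c
// Illustrative sketch only: scalar peel until `src` is aligned,
// blocked vector body, then scalar tail. The macros declare `i` and
// `peel` and capture `n` from the enclosing scope.
static void double_all(double *dst, const double *src, npy_intp n)
{
    LOOP_BLOCK_ALIGN_VAR(src, npy_double, NPY_SIMD_WIDTH) {
        dst[i] = src[i] * 2.0;                 // scalar peel
    }
    LOOP_BLOCKED(npy_double, NPY_SIMD_WIDTH) { // vector body
        npyv_f64 v = npyv_load_f64(src + i);
        npyv_store_f64(dst + i, npyv_mul_f64(v, npyv_setall_f64(2.0)));
    }
    LOOP_BLOCKED_END {                         // scalar tail
        dst[i] = src[i] * 2.0;
    }
}
```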
No need to handle the offset of the non-aligned part, for the following reasons (a simplified sketch follows the list):
- You're dealing with 64-bit loads; it's not going to make a difference even on old machines, and it isn't worth adding new npyv intrinsics to handle aligned memory access for interleaving/deinterleaving.
- NEON/ASIMD doesn't have intrinsics for aligned loads.
- Except for stream writes (non-temporal) in the case of x86, modern machines and compilers can effectively handle non-aligned memory access, even with the lowest memory alignments (1/2/4).
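To make the suggested simplification concrete, here is the same illustrative kernel without the alignment peel, relying on unaligned npyv loads and stores (again a hedged sketch, not code from the PR):

```c
// Illustrative sketch only: no peel loop; a vector body followed by a
// scalar tail over the last n % npyv_nlanes_f64 elements.
static void double_all_nopeel(double *dst, const double *src, npy_intp n)
{
    npy_intp i = 0;
    for (; i + npyv_nlanes_f64 <= n; i += npyv_nlanes_f64) {
        npyv_f64 v = npyv_load_f64(src + i);   // unaligned load is fine
        npyv_store_f64(dst + i, npyv_mul_f64(v, npyv_setall_f64(2.0)));
    }
    for (; i < n; i++)
        dst[i] = src[i] * 2.0;                 // scalar tail
}
```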
DumbMice commented Sep 6, 2020 (edited)
@seiko2plus Thank you for your tips! I added interleaving/deinterleaving intrinsics to the universal intrinsics and used them in pocketfft. Only need to change …
There are some subtleties in adding multiply-subtract intrinsics. For neon intrinsics, …
seiko2plus left a comment (edited)
add new intrinsic for negating multiply add
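For reference, a hedged sketch of what a negated multiply-add intrinsic (computing -(a*b) + c) could look like per platform. The name npyv_nmuladd_f64 and the exact mappings below are assumptions for illustration; the real API landed with the FMA pull request mentioned later in this thread:

```c
// Illustrative sketch only: -(a*b) + c. Name and mappings are assumed.
#if defined(NPY_HAVE_NEON)
    // vfmsq_f64(c, a, b) computes c - a*b, i.e. -(a*b) + c, fused
    #define npyv_nmuladd_f64(a, b, c) vfmsq_f64(c, a, b)
#elif defined(NPY_HAVE_FMA3)
    // _mm256_fnmadd_pd(a, b, c) computes -(a*b) + c directly
    #define npyv_nmuladd_f64(a, b, c) _mm256_fnmadd_pd(a, b, c)
#else
    // unfused fallback built from existing universal intrinsics
    #define npyv_nmuladd_f64(a, b, c) npyv_sub_f64(c, npyv_mul_f64(a, b))
#endif
```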
seiko2plus commented Sep 6, 2020 (edited)
My bad, I wrote that comment in a hurry; thanks for correcting me.
seiko2plus commented Sep 6, 2020 (edited)
@DumbMice, please wait, let me review my suggestion.
seiko2plus commented Sep 6, 2020
@DumbMice, I'm going to release a pull request implementing fma3 intrinsics to remove any confusion; just give me several hours.
seiko2plus commented Sep 6, 2020 (edited)
DumbMice commented Sep 8, 2020 (edited)
@seiko2plus Thanks so much for your help. All macros expanded as raw universal intrinsics, as required.
Benchmark Comparison
Similar performance is achieved. I might update with benchmarks on x86 systems later.
DumbMice commented Sep 9, 2020
Benchmark on x86
Compared to the arm machine, performance changes on the x86 machine are insignificant. Not sure if this is supposed to happen. The builds and benchmark processes on both systems are:
System Info
Size Change
mattip commented Sep 9, 2020
If I understand the graphs correctly, SIMD on NEON only begins to really pay off for n > 1_000_000 (log2(n) > 20) or so? Performance on x86 before this PR used intrinsics, right? So I would hope there is no change.
DumbMice commented Sep 9, 2020 (edited)
Yes, I suppose this is expected. The whole array is iterated log(n) times, but each time it is dissected into smaller and smaller subarrays, which is where the SIMD intrinsics operate. I suppose this is the nature of fft's n·log(n) complexity.
I believe pocketfft is implemented in pure C; no intrinsics were used before this commit. I suppose an improvement is expected on x86?
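As a rough illustration of that pass structure (a textbook radix-2 decimation-in-time sketch, not pocketfft's actual code): the inner butterfly loop over j is what the SIMD intrinsics vectorize, and it only has many iterations once the sub-transform length len grows large, which matches the observation that SIMD pays off mainly for big n:

```c
#include <complex.h>
#include <stddef.h>

// Illustrative sketch only: one radix-2 pass over n complex values,
// combining sub-transforms of length len into length 2*len.
static void radix2_pass(double complex *a, size_t n, size_t len,
                        const double complex *tw)
{
    for (size_t start = 0; start < n; start += 2 * len) {
        for (size_t j = 0; j < len; j++) {       // SIMD-friendly loop
            double complex t = tw[j] * a[start + len + j];
            a[start + len + j] = a[start + j] - t;
            a[start + j]       = a[start + j] + t;
        }
    }
}
```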
mattip commented Sep 9, 2020
Perhaps SIMD is competing with the compiler.
charris commented Sep 9, 2020
Cache usage was historically the dominant factor in fft timings; not sure how that holds up on modern architectures. NumPy's fft is also float64; float32 might benefit more.
seiko2plus commented Sep 9, 2020 (edited)
Could you please provide bench_fft.py?
The compiler can easily auto-vectorize these kinds of loops. Plus, you're "not" using fused intrinsics, since you're using NPYV under the domain of the default baseline features (…). Could you please re-run your benchmarks on x86 with …
seiko2plus commented Sep 9, 2020 (edited)
especially complex float32, since the compiler will struggle to auto-vectorize interleaving/deinterleaving for single precision.
DumbMice commented Sep 10, 2020
Added bench_fft.py as required.
I used the following command for the benchmark: …
Maybe you are right? Compilers are smart enough to auto-vectorize operations on f64 arrays using x86 SIMD instructions?
Qiyu8 commented Sep 10, 2020
I think the correct command is: …
DumbMice commented Sep 10, 2020
@Qiyu8 Thanks for your advice, but the benchmark still shows no improvement on x86. In the build log, both EXT and CLIB OPTIMIZATION have … Not sure how to check …
By comparing _pocketfft.o.d in the build dir, I can confirm that the header files in the simd/avx folder are included when testing on the fft-neon branch, and that they are not included on the master branch. So I think this might prove the compiler's efficiency in optimizing x86 code.
seiko2plus commented Sep 10, 2020
Thank you for your report, this issue should be fixed by #17284.
To test:

```
python runtests.py --cpu-baseline="avx2 fma3" -t numpy/core/tests/test_fft.py
```

To benchmark:

```
python3 runtests.py --cpu-baseline="avx2 fma3" --bench-compare master bench_fft
```
DumbMice commented Sep 29, 2020 (edited)
To help see the results faster, I've applied the changes in #17284 in order to properly benchmark on my x86 machines locally. A time reduction of ~10% is shown when log2(N) is between 10 and 14; no improvement is seen elsewhere. Note: benchmarking bench_fft.py can be memory-consuming, and my laptop has 2x8G memory.
mattip commented Dec 16, 2020
We discussed this in the latest triage meeting, and the feeling was to close this in favor of SciPy's implementation. NumPy's FFT is a "textbook implementation"; if people want a faster FFT they should use SciPy. If you feel strongly that this position is wrong, let us know. One interesting thing to try would be to use the NumPy 1.20 universal SIMD framework in SciPy; maybe it would provide a good reason to break the USIMD code out into a separate library.




pocketfft
NumPy's fft uses pocketfft, and I added neon intrinsics to the radix 2 and radix 4 components (complex input) for better performance on arm machines. Although scipy's fft is often recommended as the more developed module, there might still be value in improving numpy's own fft.
Benchmarks
Simple benchmarks were done on a 1-d array of size N and a 2-d array of shape (N, 4), filled with 1.+1j. Compared to the master branch, time ratios from this commit are given below; the optimization is promising for large arrays.
Comments
If you think this is worth doing, I could proceed to the other components (radix 3, radix 5, ...) as well.