NotificationsYou must be signed in to change notification settings
Fork11.9k
Star31k

ENH: support float and longdouble in FFT using C++ pocketfft version#25711

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

seberg merged 4 commits intonumpy:mainfrommhvk:pocketfft-cpp

Feb 9, 2024

Merged

ENH: support float and longdouble in FFT using C++ pocketfft version#25711

seberg merged 4 commits intonumpy:mainfrommhvk:pocketfft-cpp

Feb 9, 2024

Conversation

Copy link

Contributor

mhvk commentedJan 28, 2024•
edited
Loading

Using the C++ version of pocketfft turned out to be not that much of a hassle. With it, we can support all thedtypes that it supports: float, double, and long double.

@asmeurer,@mreineck - comments are most welcome! (My C++ is non-existent so this really is just a C file with templates thrown in, mostly using trial & error...)

fixes#17801

EDIT: some things to decide:

How should we add the license? Right now just addedLICENSES.md in thepocketfft directory. (The licence wasn't there for the C version, but that seems to have been an oversight.)
Does it make most sense to include pocketfft as a git submodule? This would include the license (and a small demo programme)
Should we use the cache? -> No (@mreineck suggested not to given the extra memory usage; scipy does use 16)
Should we enable threading? (scipy exposes it but uses 1 thread by default, see itsmeson.build) -> leave for later (A quick check on my laptop shows it gives about a rather array-size dependent speed-up of about a factor 2 even when using all 12 cores, so likely memory bound)
Pocketfft will just bail if it cannot allocate memory (for its twiddle factors, and I guess a possible cache).~~If that is something we should catch, please advice how to interceptthrow...~~ dealt with usingtry/catch.
More tests? (push to this PR welcome.)

mhvk added 01 - Enhancement component: numpy.fft labels

Jan 28, 2024

mhvk added this to the2.0.0 release milestone

Jan 28, 2024

mhvk mentioned this pull request

Jan 28, 2024

np.fft.fft always returns np.complex128 regardless of input type#17801

Closed

mhvk force-pushed thepocketfft-cpp branch fromf96fe5b to7ef8282Compare

January 28, 2024 19:49

mhvk commented

Jan 28, 2024

View reviewed changes

numpy/fft/_pocketfft_umath.cppShow resolvedHide resolved

Copy link

Contributor

mreineck commentedJan 28, 2024

Ah, OK, this switches to the C++ version, but uses only its 1D transforms!
Makes a lot of sense for a quick fix of the data type issue, but it misses the chance of massive speedups (through vectorization and/or multithreading) in multi-D transforms.
Ifnumpy 2.0 is close that's probably the best option, though I would still urge to unlock the other features in later releases.
If only 1D FFTs are used, a lot of C++ code could be#if 0'd to reduce compile time. I can do this once the rest of the PR has converged.

Copy link

Member

seberg commentedJan 28, 2024

but it misses the chance of massive speedups (through vectorization and/or multithreading) in multi-D transforms.

That is actually a (small or not?) issue with using the gufuncs, it may be tricky to expand to arbitrary dimensions in a nice way for those (which was mentioned to me, I admit; although I didn't realize it might be an issue any time soon).

And yes, so long as this looks all fine, it seems fine to not do the N-D versions.

Copy link

ContributorAuthor

mhvk commentedJan 28, 2024

Indeed, I went the easy route, based on the gufunc I had just introduced (mostly to be able to allowout). Is the threading/vectorization specific to the multi-dimensional iterator, or something that makes the 1D transform faster? From a quick look, it looks like it is the latter, so that should be relatively straightforward, by just usingc2c, etc., in my calls. Indeed, that would mean no need for creating a buffer any more too...

mhvk force-pushed thepocketfft-cpp branch from7ef8282 to66c2baeCompare

January 28, 2024 20:44

Copy link

Contributor

mreineck commentedJan 28, 2024

Is the threading/vectorization specific to the multi-dimensional iterator, or something that makes the 1D transform faster?

No, threading/vectorization only affects 2D and higher transforms. I have played with applying this to 1D as well, but only inpocketfft's successor package.

seberg reviewed

Jan 28, 2024

View reviewed changes

numpy/fft/_pocketfft_umath.cpp OutdatedShow resolvedHide resolved

Copy link

ContributorAuthor

mhvk commentedJan 28, 2024

Just to be clear, do you mean that vectorization helps for FTs over multiple axes, or FTs over a single axis in higher-D arrays? From the code, it looks like the latter, which might still work with the ufunc, as we get an extra axis passed in.

No, threading/vectorization only affects 2D and higher transforms. I have played with applying this to 1D as well, but only inpocketfft's successor package.

OK, that seems reasonable for follow-up, then. More generally, it may well make sense to use your iterator too, and/or copy what scipy does... I definitely went for what seemed the fastest route...

asmeurer reviewed

Jan 28, 2024

View reviewed changes

doc/release/upcoming_changes/25536.new_feature.rst Outdated

		The various FFT routines in `numpy.fft` have gained an ``out``
		`numpy.fft` support for different precisions and in-place calculations
		----------------------------------------------------------------------
		The various FFT routines in `numpy.fft` now support calculations in float,

Copy link

Member

asmeurerJan 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

They have always supported single precision but have upcast the result. We should be clear that this is a breaking change.

Copy link

ContributorAuthor

mhvkJan 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Does this sound good?

The various FFT routines in `numpy.fft` now do their calculations natively infloat, double, or long double precision, depending on the input precision,instead of always calculating in double precision. Hence, the calculation willnow be less precise for single and more precise for long double precision.The data type of the output array will now be adjusted accordingly.Furthermore, all FFT routines have gained an ``out`` argument that can be usedfor in-place calculations.

Copy link

Member

sebergJan 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Maybe it makes sense to have a dedicated change note that points out the single precision change explicitly.

Copy link

Member

asmeurerJan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What's the numpy policy on usingversionchanged. Should that be added to the fft docstrings?

Copy link

Member

asmeurerJan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Maybe it makes sense to have a dedicated change note that points out the single precision change explicitly.

I don't know how the numpy release notes are amalgamated, but for 2.0 it would be good to highlight all the compatibility breaks somewhere in the release notes.

Copy link

Contributor

mreineck commentedJan 28, 2024

Just to be clear, do you mean that vectorization helps for FTs over multiple axes, or FTs over a single axis in higher-D arrays? From the code, it looks like the latter, which might still work with the ufunc, as we get an extra axis passed in.

Yes, it works for transforming just a single axis in a multi-D array too. The idea is to "bundle" a few of the 1D transforms into a single one of SIMD type.

Copy link

ContributorAuthor

mhvk commentedJan 28, 2024

Themacos_arm64_test failure seems unrelated...

Copy link

ContributorAuthor

mhvk commentedJan 28, 2024

@mreineck - OK, I'll look into usingc2c and friends...

Copy link

ContributorAuthor

mhvk commentedJan 29, 2024

Callingc2c, etc., using the outer dimension actually seems fairly straightforward and gives a nice speed-up. The only unfortunate thing is that pocketfft does not deal with the case that the input is shorter than the output, so in that case one still needs to buffer/copy. But the speed-up is worth the complexity, so I'll push a further refactoring tomorrow.

Copy link

Member

rgommers commentedJan 29, 2024

Nice!

Furthermore, I have not tried setting up a pocketfft cache, or done anything with threading. Scipy uses itsmeson.build to set a cache size of 16 and have compiler-dependent threading; happy to do the same here.

The cache size may help, assuming it's a useful setting in single-threaded mode (@mreineck?). I'd leave the threading along, we can't do much with that. In SciPy it's controlled by aworkers= keyword to opt in to multi-threading. NumPy doesn't have that, and everything is supposed to be single-threaded (underlying BLAS/LAPACK is an exception for [reasons]).

Should the license be added to LICENSES.txt? It wasn't there for the C version, but perhaps that was an oversight.

Probably an oversight indeed, since the SciPy licenses list has it (here). So +1 for adding it.

With it, we can support all thedtypes that it supports: float, double, and long double.

Addingfloat32 support is great and obviously desirable. I'm less sure aboutlongdouble support, since that's not really all that useful and something we may have more work to get rid of again later. We didn't get around to revisiting this topic for 2.0, but overall I'm still convinced thatlongdouble is overall more harmful than helpful for NumPy as a whole, so why add more support when I don't think anyone asked for it here?

Build time wise this PR in its current state looks fine - impact of C++ is okay overall, see quick profiling result below. Binary size is fine too, it's ~240 kb for a dev (debugoptimized) build. A note though that the SciPy pocketfft extension is more than 4x larger, so enabling all functionality that SciPy uses in the NumPy build is probably not the way to go (it'd be ~25% of the size of_multiarray_umath at that point).

Copy link

Contributor

mreineck commentedJan 29, 2024

The cache size may help, assuming it's a useful setting in single-threaded mode (@mreineck?)

It will work in single-threaded mode. However, my feeling is that the plan caching isn't really worth it: it may cause large memory consumption (see, e.g.,scipy/scipy#12297) for very little performance gain (unless you call very short 1D FFTs repeatedly).Scipy maintainers tend towards keeping the cache, however.

mhvk force-pushed thepocketfft-cpp branch 2 times, most recently from30eeaad to102ac99Compare

January 30, 2024 11:38

Copy link

ContributorAuthor

mhvk commentedJan 30, 2024

OK, I now have a version that calls the vectorized pocketfft version if possible. It still has the direct loops for the case thatnin < nout as well as for 1D transforms (where one can avoid copies in the probably fairly common case of very long 1D transforms with contiguous output).

Some remaining questions:

On the implementation: pocketfft vectorization properties are set by what is available at compile-time, not run-time. Is this a problem? Or can this be left for later optimization?
Would it make sense to include pocketfft as a submodule? That would automatically include the license (and not much more except for a small demo version). I think it is included as a submodule by scipy.

mhvk force-pushed thepocketfft-cpp branch from102ac99 toaa97df0Compare

January 30, 2024 11:52

mhvk changed the title~~ENH: support float and longdoubt in FFT using C++ pocketfft version~~ENH: support float and longdouble in FFT using C++ pocketfft version

Jan 30, 2024

Copy link

ContributorAuthor

mhvk commentedJan 30, 2024

Addingfloat32 support is great and obviously desirable. I'm less sure aboutlongdouble support, since that's not really all that useful and something we may have more work to get rid of again later.

@rgommers - I would love to have a properfloat128 instead oflongdouble, but on the other hand, the present behaviour is not thatlongdouble is not supported (in which case it would be fine to continue with that), but rather thatlongdouble is just converted todouble without any warning. Given that, my sense is that supporting it correctly is the least disruptive change.

Copy link

Member

rgommers commentedJan 30, 2024

the present behaviour is not thatlongdouble is not supported (in which case it would be fine to continue with that), but rather thatlongdouble is just converted todouble without any warning. Given that, my sense is that supporting it correctly is the least disruptive change.

I'm not really opposed, but it is a change, with technically only increased binary size as a penalty and no upsides for Windows and macOS arm64 users. Doing nothing and continuing to convert tofloat64 either internally like before or, as an improvement, use the float64 loops internally but cast back tolongdouble before returning the result, are therefore also options (even the former is zero change from now, so hard to label it as more disruptive).

I'm not sure how serious the plan was to include all the pocketfft functionality that SciPy needs into NumPy - but binary size does seem to become relevant then. If's it's <0.1 MB I'm not too worried, but I wouldn't really want to pay 0.25 MB or more for long double loops that no one has asked for.

Copy link

ContributorAuthor

mhvk commentedJan 30, 2024

ls -lh build/numpy/fft/_pocketfft_umath.cpython-311-x86_64-linux-gnu.so# 5.1M present PR# 4.1M without longdouble# 2.9M w/ longdouble but with POCKETFFT_NO_VECTORS# (which uses low-level routines only, but halves speed;# I checked that high-level routines are not compiled for longdouble)

For comparison,multiarray_umath is 73 MB, the simd stuff 15MB, and everything in random adds up to 43 MB.

Copy link

Member

rgommers commentedJan 31, 2024

Thanks for the comparison, that is helpful. Note that those aredebugoptimized numbers; what I had in mind was wheel sizes (I should have specified). Our wheel sizes are in the 13 - 21 MB range, and the new macOS arm64 ones even as small as 5 MB.

I did a quick test building a wheel from this branch; the pocketfft extension is 508 kb uncompressed, 208 kb compressed in it (total wheel size 7.8 MB;_multiarray_umath.so 2.6 MB compressed). So that should still be fine. If dst/dct transforms would be enabled, it may get a bit questionable. But that could still be taken care of mostly by disabling thelongdouble loops for platforms where it's an alias fordouble. Anyway, it doesn't seem too likely to become an issue at this moment. Let's not worry about it further.

Copy link

ContributorAuthor

mhvk commentedJan 31, 2024

OK, great! The other transforms may not be as bad as one would think, since they mostly use the same code. But anyway, not sure those are needed, and definitely not relevant here.

Copy link

Member

seiko2plus commentedJan 31, 2024

The relevant flags are here:

Its only limited to core only, I thought it was a global flag

Copy link

ContributorAuthor

mhvk commentedJan 31, 2024

Are we sure those flags are seen infft -- since they don't appear in my build log... I'm not sure howmeson exactly works, but does what happens insubdir(_core) influence `subdir(fft)? See

numpy/numpy/meson.build

Lines 398 to 401 in2634f80

	subdir('_core')
	subdir('fft')
	subdir('linalg')
	subdir('random')

Copy link

ContributorAuthor

mhvk commentedJan 31, 2024

OK, I just empirically confirmed that that flag does not get passed on. And with that, that I can catch C++ exceptions. I'll push that up in a bit.

Copy link

Member

seiko2plus commentedJan 31, 2024

And with that, that I can catch C++ exceptions

If its about memory allocations, I think we should use our custom allocations over malloc or new.

Copy link

ContributorAuthor

mhvk commentedJan 31, 2024

@seiko2plus - the problem is that the allocations (and possible other errors) happen insidepocketfft and ideally we do not start rewriting that...

My current solution is just to put everything in atry/catch using macros and convert C++ exceptions to python errors. I don't really know how to test memory errors, but I checked that if I purposely pass in an invalid axis number, the corresponding exception properly leads to aRuntimeError. See last commit,ab16ef4

Copy link

Contributor

mreineck commentedJan 31, 2024

If its about memory allocations, I think we should use our custom allocations over malloc or new.

I'd really recommend not doing that. Most likely these custom operations are C-style and therefore will silently return NULL on error, which goes against fundamental assumptions C++ codes are making. You'd have to patch every place in pocketfft where memory is allocated and add error handling code.

Copy link

ContributorAuthor

mhvk commentedJan 31, 2024

p.s. Note that I now on purpose usepocketfft allocators, since those throw an exception, hence I could remove my custom code.

Copy link

Contributor

mreineck commentedJan 31, 2024

My current solution is just to put everything in a try/catch using macros and convert C++ exceptions to python errors.

To me this looks exactly like the right solution. It's located directly at the interface, does not require any big patching and can be extended in case more detailed error handling is required.
If memory allocations behave in the standard C++ way, they will be caught here as well.

Copy link

Contributor

mreineck commentedJan 31, 2024

"-fno-exceptions" should certainly not be used when compilingpocketfft and the wrapper code.
To be honest, I'd consider removing-fno-exceptions -fno-rtti from your configuration altogether. This was a good idea 20 years ago, but today the overhead (both memory and runtime) should be pretty negligible.

mhvk force-pushed thepocketfft-cpp branch fromab16ef4 toce7a8dcCompare

January 31, 2024 20:10

seberg reviewed

Feb 3, 2024

View reviewed changes

Copy link

Member

seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Not a deep review (although I doubt that is needed at this point!).
Just commenting on two things so they don't get lost (one bug, one suggestion to try to have a more C++ pattern for catching the exceptions).

Submodule+license seem still two things that would be nice to do as well.

numpy/fft/_pocketfft_umath.cpp Outdated

		void *buff =NULL;
		if (step_out !=sizeof(std::complex<T>)) {
		buff =pocketfft::detail::aligned_alloc(sizeof(std::complex<T>),
		nout *sizeof(std::complex<T>));

Copy link

Member

sebergFeb 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

There is a small problem here thatbuff is avoid * and not apocketfft:detail::arr<std::complex<T>>. Which I think means that if theplan->exec fails below no cleanup will occur.
My C++ is a bit trial and error, but I think:

/* Create a temporary array to hold a buffer if op is not contiguous */bool contiguous_op = (step_out != sizeof(std::complex<T>));pocketfft::detail::arr<std::complex<T>> tmp_arr(contiguous_op ? 0 : nout);

should be an option (an array of length 0 doesn't do a malloc so that is fine).

And below:

std::complex<T> *op_or_buff = (contiguous_op ? (std::complex<T> *)op : tmp_arr.data());

(Clearly, the dealloc below then becomes unnecessary, which is the whole point.)

Copy link

ContributorAuthor

mhvkFeb 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

OK, that makes a lot of sense! I implemented it (though with an opposite bool,buffered).

numpy/fft/_pocketfft_umath.cppShow resolvedHide resolved

MAINT: Replace pocketfft with the C++ version and use it in ufuncs

a22dc48

Note that the C++ version is header only.  We just link to upstreamvia a git submodule.

mhvk force-pushed thepocketfft-cpp branch fromce7a8dc toc1b5e17Compare

February 3, 2024 21:41

Copy link

ContributorAuthor

mhvk commentedFeb 3, 2024

@seberg - thanks! Both suggestions were really nice, and make the code a lot cleaner. I pushed a revised version that uses them and that also replaces the addition of the C++ pocketfft header with a new submodule (which includes the license).

Copy link

ContributorAuthor

mhvk commentedFeb 3, 2024

p.s. In the checklist above, this just leaves the question of whether this is tested thoroughly enough. I could adjust more tests to try differentdtype, but I'm a bit wary of just increasing test run time for little benefit. Currently, the tests that I did change exercise the options to do longer or shorter FTs than the input (i.e., which setn), which has been the main pain point so far.

mhvk force-pushed thepocketfft-cpp branch fromc1b5e17 to055dd1dCompare

February 3, 2024 22:05

Copy link

ContributorAuthor

mhvk commentedFeb 3, 2024

p.p.s. Well the above about tests was not actually true, but now it is, so I ticked my "more tests" item...

mhvk commented

Feb 3, 2024

View reviewed changes

numpy/fft/tests/test_pocketfft.py

		assert_allclose(x_norm,
		np.linalg.norm(tmp),atol=1e-6)

		@pytest.mark.parametrize("dtype", [np.half,np.single,np.double,

Copy link

ContributorAuthor

mhvkFeb 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This one is now done much more thoroughly intest_identity_long_short andtest_identity_long_short_reversed

mhvk added3 commits

February 3, 2024 17:18

ENH: Use templating to add support float and longdouble.

22c56b5

Also adjust python side to allow multiple double types.

MAINT: use vectorized pocketfft routines when possible

a344ca3

Also make sure every part uses copy_input and copy_outputconsistently, and avoid compiling multithreading.

MAINT: handle possible c++ errors by wrapping the ufuncs

b8c020f

Also use internal pocketfft arr class to allocate the possibletemporary buffers, so we do not have to worry about deallocatingthem any more.

mhvk force-pushed thepocketfft-cpp branch from055dd1d tob8c020fCompare

February 3, 2024 22:19

Copy link

ContributorAuthor

mhvk commentedFeb 9, 2024

Is there anything more to do for this? (C++ pocketfft innp.fft)

seberg approved these changes

Feb 9, 2024

View reviewed changes

Copy link

Member

seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Not managing to concentrate very deeply on it, but those two C++ things were basically what I noticed and the rest looks fine.

Since nobody voiced other concerns, let's get this in. (It isn't vastly changed after all, since it is porting the C code.)

Thanks for this great new code!

seberg merged commit0291a81 intonumpy:main

Feb 9, 2024

mhvk deleted the pocketfft-cpp branch

February 9, 2024 17:14

rgommers mentioned this pull request

Feb 11, 2024

BLD: restore 'setup-args=-Duse-ilp64=true' in cibuildwheel [wheel build]#25803

Merged

mreineck mentioned this pull request

Feb 12, 2024

API: Support float32 in fft?#23522

Closed

mhvk mentioned this pull request

Feb 16, 2024

NEP: add NEP 56 on array API standard support in main namespace#25542

Merged

hexane360 mentioned this pull request

Mar 10, 2025

ENH: add optional argumentdtype to numpy.fft.fftfreq (#9126)#16496

Closed

Labels

01 - Enhancement component: numpy.fft

Movatterモバイル変換

Uh oh!

ENH: support float and longdouble in FFT using C++ pocketfft version#25711

ENH: support float and longdouble in FFT using C++ pocketfft version#25711

Uh oh!

Conversation

mhvk commentedJan 28, 2024• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

Uh oh!

mreineck commentedJan 28, 2024

Uh oh!

seberg commentedJan 28, 2024

Uh oh!

mhvk commentedJan 28, 2024

Uh oh!

mreineck commentedJan 28, 2024

Uh oh!

Uh oh!

mhvk commentedJan 28, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mreineck commentedJan 28, 2024

Uh oh!

mhvk commentedJan 28, 2024

Uh oh!

mhvk commentedJan 28, 2024

Uh oh!

mhvk commentedJan 29, 2024

Uh oh!

rgommers commentedJan 29, 2024

Uh oh!

mreineck commentedJan 29, 2024

Uh oh!

mhvk commentedJan 30, 2024

Uh oh!

mhvk commentedJan 30, 2024

Uh oh!

rgommers commentedJan 30, 2024

Uh oh!

mhvk commentedJan 30, 2024

Uh oh!

rgommers commentedJan 31, 2024

Uh oh!

mhvk commentedJan 31, 2024

Uh oh!

seiko2plus commentedJan 31, 2024

Uh oh!

mhvk commentedJan 31, 2024

Uh oh!

mhvk commentedJan 31, 2024

Uh oh!

seiko2plus commentedJan 31, 2024

Uh oh!

mhvk commentedJan 31, 2024

Uh oh!

mreineck commentedJan 31, 2024

Uh oh!

mhvk commentedJan 31, 2024

Uh oh!

mreineck commentedJan 31, 2024

Uh oh!

mreineck commentedJan 31, 2024

Uh oh!

seberg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mhvk commentedJan 28, 2024•
edited
Loading