Distributed Fast Fourier Transform#
Overview#
The distributed Fast Fourier Transform (FFT) module nvmath.distributed.fft in nvmath-python leverages the NVIDIA cuFFTMp library and provides a powerful suite of APIs that can be directly called from the host to efficiently perform discrete Fourier transformations on multi-node multi-GPU systems at scale. Both stateless function-form APIs and stateful class-form APIs are provided to support a spectrum of N-dimensional FFT operations. These include forward and inverse complex-to-complex (C2C) transformations, as well as complex-to-real (C2R) and real-to-complex (R2C) transforms:
- N-dimensional forward C2C FFT transform by nvmath.distributed.fft.fft()
- N-dimensional inverse C2C FFT transform by nvmath.distributed.fft.ifft()
- N-dimensional forward R2C FFT transform by nvmath.distributed.fft.rfft()
- N-dimensional inverse C2R FFT transform by nvmath.distributed.fft.irfft()
- All types of N-dimensional FFT by stateful nvmath.distributed.fft.FFT (a usage sketch follows the note below)
Note
The API fft() and related function-form APIs perform N-D FFT operations, similar to numpy.fft.fftn(). Currently, 2-D and 3-D FFTs are supported.
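The stateful nvmath.distributed.fft.FFT class is useful when the same transform is applied repeatedly, since planning is performed once and reused across executions. The following is a minimal sketch, assuming the stateful distributed API follows the same plan/execute pattern as the non-distributed nvmath.fft.FFT class and accepts the same distribution argument as the function-form APIs shown later in this section; refer to the API reference below for the exact signatures.

```python
import numpy as np
from nvmath.distributed.distribution import Slab

# Assumes the distributed runtime is initialized and `communicator` is the
# mpi4py communicator, as in the examples later in this section.
nranks = communicator.Get_size()

# Local Slab.X partition of a global (64, 256, 128) operand (even division assumed).
shape = 64 // nranks, 256, 128
a = np.random.rand(*shape) + 1j * np.random.rand(*shape)

# Hypothetical usage sketch: create the stateful object once, plan, then execute.
# The context manager releases internal resources on exit.
with nvmath.distributed.fft.FFT(a, distribution=Slab.X) as f:
    f.plan()
    b = f.execute()  # forward C2C transform of the local partition
```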
The distributed FFT APIs are similar to their non-distributed host API counterparts, with some key differences:
- The operands to the API are the local partition of the global operands and the user specifies the distribution (how the data is partitioned across processes). There are two types of distribution natively supported by FFT: Slab and custom Box.
- GPU operands need to be allocated on symmetric memory. Refer to Distributed API Utilities for examples and details of how to manage symmetric memory GPU operands. The nvmath.distributed.fft.allocate_operand() helper described below can also be used to allocate on symmetric memory.
- All distributed FFT operations (including R2C and C2R) are in-place (the result is stored in the same buffer as the input operand). This has special implications on the properties of the buffer and memory layout, due to the following: (i) in general, varying input and output distribution means that on a given process the input and output can have different shape (and size), particularly when global data does not divide evenly among processes; (ii) for R2C and C2R transformations, the shape and dtype of the input and output is different, and cuFFTMp has special requirements concerning buffer padding and strides. Due to the above, it is recommended to allocate FFT operands with nvmath.distributed.fft.allocate_operand() to ensure that the operand and its underlying buffer have the required characteristics for the given distributed FFT operation. This helper is described below.
Operand distribution#
To perform a distributed FFT operation you have to specify how the operand is distributed across processes. Distributed FFT natively supports the Slab and Box distributions. The distribution provided must be compatible with one of these.
Slab#
Tip
Slab (or compatible distribution) is the most optimized distribution to use with distributed FFT.
Currently, distributed FFT supports Slab decomposition on X or Y. Here is an example of a distributed FFT using Slab distribution:
Tip
Reminder to initialize the distributed context first as per Initializing the distributed runtime.
    from nvmath.distributed.distribution import Slab

    # Get number of processes from mpi4py communicator.
    nranks = communicator.Get_size()

    # The global 3-D FFT size is (64, 256, 128).
    # Here, the input data is distributed across processes according to the
    # Slab distribution on the Y axis.
    shape = 64, 256 // nranks, 128

    # Create NumPy ndarray (on the CPU) on each process, with the local shape.
    a = np.random.rand(*shape) + 1j * np.random.rand(*shape)

    # Forward FFT.
    # By default, the reshape option is True, which means that the output of the
    # distributed FFT will be re-distributed to retain the same distribution as
    # the input (in this case Slab.Y).
    b = nvmath.distributed.fft.fft(a, distribution=Slab.Y)
For the purposes of the transform with reshape=False, Slab.X and Slab.Y are considered complementary distributions. If reshape=False, the returned operand will use the complementary distribution. The following example illustrates this using GPU operands:
    from nvmath.distributed.distribution import Slab

    # The global 3-D FFT size is (512, 256, 512).
    # Here, the input data is distributed across processes according to the
    # Slab distribution on the X axis.
    shape = 512 // nranks, 256, 512

    # cuFFTMp uses the NVSHMEM PGAS model for distributed computation, which
    # requires GPU operands to be on the symmetric heap.
    a = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=cp.complex128)

    # a is a cupy ndarray and can be operated on using in-place cupy operations.
    with cp.cuda.Device(device_id):
        a[:] = cp.random.rand(*shape, dtype=cp.float64) + 1j * cp.random.rand(*shape, dtype=cp.float64)

    # Forward FFT.
    # Here, the forward FFT operand is distributed according to Slab.X distribution.
    # With reshape=False, the FFT result will be distributed according to Slab.Y distribution.
    b = nvmath.distributed.fft.fft(a, distribution=Slab.X, options={"reshape": False})

    # Now we can perform an inverse FFT with reshape=False and get the
    # result in Slab.X distribution (recall that `b` has Slab.Y distribution).
    c = nvmath.distributed.fft.ifft(b, distribution=Slab.Y, options={"reshape": False})

    # Synchronize the default stream.
    with cp.cuda.Device(device_id):
        cp.cuda.get_current_stream().synchronize()

    # All cuFFTMp operations are in-place (a, b, and c share the same memory buffer),
    # so we take care to only free the buffer once.
    nvmath.distributed.free_symmetric_memory(a)
Note
Distributed FFT operations are in-place, which needs to be taken into account when freeing the GPU operands on symmetric memory (as shown in the above example).
Custom box#
Distributed FFT also supports arbitrary data distributions in the form of 2D/3D boxes. Refer to Box for an overview.
To perform a distributed FFT using a custom Box distribution, each process specifies its own input and output box, which determines the distribution of the input and output operands, respectively (note that the input and output distributions can be the same or different).
With box distribution there is also the notion of complementary distribution: (input_box, output_box) and (output_box, input_box) are complementary (a sketch illustrating this follows the example below).
Here is an example of a distributed FFT across 4 GPUs using a custom pencil distribution:
    from nvmath.distributed.distribution import Box

    # Get process rank from mpi4py communicator.
    rank = communicator.Get_rank()

    # The global 3-D FFT size is (64, 256, 128).
    # The input data is distributed across 4 processes using a custom pencil
    # distribution.
    X, Y, Z = (64, 256, 128)
    shape = X // 2, Y // 2, Z  # pencil decomposition on X and Y axes

    # NumPy ndarray, on the CPU.
    a = np.random.rand(*shape) + 1j * np.random.rand(*shape)

    # Forward FFT.
    if rank == 0:
        input_box = Box((0, 0, 0), (32, 128, 128))
    elif rank == 1:
        input_box = Box((0, 128, 0), (32, 256, 128))
    elif rank == 2:
        input_box = Box((32, 0, 0), (64, 128, 128))
    else:
        input_box = Box((32, 128, 0), (64, 256, 128))

    # Use the same pencil distribution for the output.
    output_box = input_box

    b = nvmath.distributed.fft.fft(a, distribution=[input_box, output_box])
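To illustrate the complementary relationship between (input_box, output_box) and (output_box, input_box), here is a minimal sketch assuming 2 processes and the same global size of (64, 256, 128): the forward transform uses different input and output boxes, and the inverse transform swaps them so that the final result comes back in the original input distribution. The box coordinates below are illustrative and describe X-slabs and Y-slabs.

```python
import numpy as np
from nvmath.distributed.distribution import Box

# Assumes the distributed runtime is initialized and `communicator` is the
# mpi4py communicator; the box coordinates below assume exactly 2 processes.
rank = communicator.Get_rank()

# Forward FFT: input distributed as X-slabs, output as Y-slabs.
if rank == 0:
    input_box = Box((0, 0, 0), (32, 256, 128))     # lower half along X
    output_box = Box((0, 0, 0), (64, 128, 128))    # lower half along Y
else:
    input_box = Box((32, 0, 0), (64, 256, 128))    # upper half along X
    output_box = Box((0, 128, 0), (64, 256, 128))  # upper half along Y

shape = 32, 256, 128  # local X-slab shape on each process
a = np.random.rand(*shape) + 1j * np.random.rand(*shape)

# b is distributed according to output_box (Y-slabs).
b = nvmath.distributed.fft.fft(a, distribution=[input_box, output_box])

# Inverse FFT using the complementary distribution (boxes swapped):
# c is distributed according to input_box, i.e. like the original input a.
c = nvmath.distributed.fft.ifft(b, distribution=[output_box, input_box])
```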
Operand allocation helper#
The allocate_operand() helper can be used to allocate an operand that meets the requirements (in terms of buffer size, padding and strides) for the specified FFT operation. For GPU operands, the allocation will be done on the symmetric heap.
Important
Any memory on the symmetric heap that is owned by the user (including memory allocated with allocate_operand()) must be deleted explicitly using free_symmetric_memory(). Refer to Distributed API Utilities for more information.
To allocate an operand, each process specifies the local shape of its input, the array package, dtype, distribution and FFT type. For example:
    import cupy as cp
    from nvmath.distributed.fft import Slab

    # Get number of processes from mpi4py communicator.
    nranks = communicator.Get_size()

    # The global *real* 3-D FFT size is (512, 256, 512).
    # The input data is distributed across processes according to
    # the cuFFTMp Slab distribution on the X axis.
    shape = 512 // nranks, 256, 512

    # Allocate the operand on the symmetric heap with the required properties
    # for the specified distributed FFT R2C.
    a = nvmath.distributed.fft.allocate_operand(
        shape,
        cp,
        input_dtype=cp.float32,
        distribution=Slab.X,
        fft_type="R2C",
    )

    # a is a cupy ndarray and can be operated on using in-place cupy operations.
    with cp.cuda.Device(device_id):
        a[:] = cp.random.rand(*shape, dtype=cp.float32)

    # R2C (forward) FFT.
    # In this example, the R2C operand is distributed according to Slab.X distribution.
    # With reshape=False, the R2C result will be distributed according to Slab.Y distribution.
    b = nvmath.distributed.fft.rfft(a, distribution=Slab.X, options={"reshape": False})

    # Distributed FFT performs computations in-place. The result is stored in the same
    # buffer as operand a. Note, however, that operand b has a different dtype and shape
    # (because the output has complex dtype and Slab.Y distribution).

    # C2R (inverse) FFT.
    # The inverse FFT operand is distributed according to Slab.Y. With reshape=False,
    # the C2R result will be distributed according to Slab.X distribution.
    c = nvmath.distributed.fft.irfft(b, distribution=Slab.Y, options={"reshape": False})

    # Synchronize the default stream.
    with cp.cuda.Device(device_id):
        cp.cuda.get_current_stream().synchronize()

    # The shape of c is the same as a (due to Slab.X distribution). Once again, note that
    # a, b and c are sharing the same symmetric memory buffer (distributed FFT operations
    # are in-place).
    nvmath.distributed.free_symmetric_memory(a)
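The helper is not limited to GPU operands. Since the function-form APIs also accept CPU (NumPy) operands, a CPU operand can presumably be obtained by passing NumPy as the array package; the following is a hedged sketch under that assumption (the exact set of accepted packages is given in the API reference below):

```python
import numpy as np
from nvmath.distributed.fft import Slab

nranks = communicator.Get_size()
shape = 512 // nranks, 256, 512

# Sketch (assuming NumPy is an accepted array package): allocate a CPU operand
# whose underlying buffer has the padding and strides required for R2C/C2R.
a_cpu = nvmath.distributed.fft.allocate_operand(
    shape,
    np,
    input_dtype=np.float32,
    distribution=Slab.X,
    fft_type="R2C",
)
a_cpu[:] = np.random.rand(*shape).astype(np.float32)

# R2C transform on the CPU operand (no symmetric-memory freeing is needed here,
# since the buffer is ordinary host memory).
b_cpu = nvmath.distributed.fft.rfft(a_cpu, distribution=Slab.X)
```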
API Reference#
FFT support (nvmath.distributed.fft)#
| API | Description |
| --- | --- |
| allocate_operand() | Return uninitialized operand of the given shape and type, to use as input for distributed FFT. |
| fft() | Perform an N-D complex-to-complex (C2C) distributed FFT on the provided complex operand. |
| ifft() | Perform an N-D complex-to-complex (C2C) inverse FFT on the provided complex operand. |
| rfft() | Perform an N-D real-to-complex (R2C) distributed FFT on the provided real operand. |
| irfft() | Perform an N-D complex-to-real (C2R) distributed FFT on the provided complex operand. |
| FFT | Create a stateful object that encapsulates the specified distributed FFT computations and required resources. |
| FFTOptions | A data class for providing options to the distributed FFT operation. |
| FFTDirection | An IntEnum class specifying the direction of the transform. |