Tags: DTolm/VkFFT
v1.3.2
VkFFT v1.3.2 release
- Added double-double support in VkFFT. It requires CPU initialization in full quad precision, so it only supports gcc with the quadmath dependency for now. It may be possible to add full FP128 support or some other FP128 library (like mpir) in the future.
- Data has to be stored in double-double before VkFFT kernel calls (no fp128<->double-double conversion on the GPU yet).
- Full 1e-32 precision, but the same range as FP64. See Library for Double-Double and Quad-Double Arithmetic by Y Hida for more information on double-double.
- Double-double requires FMA contraction to be disabled (due to an ab-cd contraction rounding mismatch). It doesn't work on Vulkan, as I haven't found how to do that yet.
- Added DST I-IV support.
- Fixed warnings (#138)
- Added a proper check for app to be zero before the initializeVkFFT call, and zeroing on deletion (#134)
- Added an option to provide a staging buffer in the application and VkGPU handle (#129)
- Added guards for build type (#128)
- Changed the default innermost stride for real buffers in out-of-place R2C from size[0]+2 to size[0] (#139)
- Allow specifying the glslang version (#135)
- Improved instruction count and accuracy for radix-7.
- Fixed missing deallocation calls for the inverse Bluestein axes. Fixed the buffer layout size in Vulkan in some cases.
- Refactored the code generator and container struct layout for better handling of complex numbers (-5k loc).
- Added more precision tests and benchmarks.
v1.3.1
Version 1.3 update of VkFFT
- Major library design change: from a single-header to a multiple-header approach, which improves structure and maintainability. Now, instead of copying a single file, the user has to copy the vkFFT folder contents.
- VkFFT has been rewritten to follow the multiple-level platform structure described in the VkFFT whitepaper. All algorithms have been split into respective files, which should make the library design easier for everyone to understand. Several duplicated code paths have been restructured and unified (mainly the read/write part of kernels and pre/post-processing).
- All math operations and most variables have been abstracted to a union container approach that can contain either numbers or variable names. It is not a full compiler, but the generated code is close to machine-like. There are no math sprintf calls in the actual code generator now. More details can be found here: https://youtu.be/lHlFPqlOezo
- VkFFT now supports an arbitrary number of dimensions. By defining VKFFT_MAX_FFT_DIMENSIONS, it is now possible to mimic the fftw guru interface. The default is 4. The innermost stride is always fixed to be 1, but there can be an arbitrary number of outer strides. To achieve innermost batching, initialize an N+1 dimensional FFT and omit the innermost dimension using omitDimension[0] = 1.
- Enabled fp16 for all backends.
- Accuracy verification of the new version can be found here: vincefn/pyvkfft#25
- The new code structure will facilitate the implementation of many new features and performance improvements, so stay tuned.
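The innermost-batching note in the v1.3 release can be pictured as a memory layout question. The sketch below is not VkFFT code and calls no VkFFT function. It only illustrates, with the hypothetical helpers `innermost_batch_index` and `gather_sequence`, where the elements of one transformed sequence sit when the batch index is the fastest-varying (omitted) dimension, assuming a plain dense layout with no extra padding.

```c
#include <stddef.h>

/* Hypothetical helper: flat index of element i of sequence b when
   num_batches sequences are interleaved along the innermost axis.
   Dimension 0 (stride 1) is the batch index; dimension 1 is the FFT axis,
   whose stride is therefore num_batches. */
static size_t innermost_batch_index(size_t b, size_t i, size_t num_batches) {
    return b + num_batches * i;
}

/* Copy the n elements of logical sequence b out of the interleaved buffer. */
static void gather_sequence(const float *buf, size_t b, size_t n,
                            size_t num_batches, float *out) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = buf[innermost_batch_index(b, i, num_batches)];
    }
}
```

Under this reading, a batch of B length-N transforms would be described as a 2D problem with size[0] = B and size[1] = N, with dimension 0 omitted from the transform. The exact configuration fields beyond omitDimension[0] are not spelled out in the notes, so treat this mapping as an assumption.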
v1.2.31
Multi-upload performance improvements + bugfixes
- Improved multi-upload FFT algorithm performance in double precision on HPC GPUs
- Fixed double precision sincos computation. It is now possible to disable the LUT: useLUT has been switched to int64_t, where -1 disables the LUT, 0 leaves the decision to VkFFT, and 1 forces it. The LUT can also be disabled for the 4-step algorithm rotation only, via useLUT_4step
- Optimized swapTo3Stage4Step and switched it to a direct number value from the power of 2
- Bugfixes: fixed FP64 usage in FP32 when the number ending was not printed in kernels (important), fixed incorrect registerBoost writing, fixed #93
v1.2.30
Metal support in VkFFT
- This update adds an Apple Metal backend to VkFFT (VKFFT_BACKEND 5)
- The Metal backend has similar performance to the other backends (tested on M1 Pro 8c SoC)
- The Metal backend passes all VkFFT tests that the OpenCL backend passes (tested on M1 Pro 8c SoC)
- Current limitations of the Metal backend: no double precision, no saving/loading of binaries, forced 256 max threads, C++ bindings only, incomplete error handling.
- Bugfixes: Rader uint LUT offset not working in some cases, Mult Rader coalescing with <1024 threads, DCT-III reordering index issues with OpenCL on Intel/Apple GPUs.
- Slightly improved coalescing logic for Nvidia GPUs
- Added precision plots
v1.2.26
Radix 6, 8, 9, 10, 12, 14, 15, 16, 32 support + Bluestein tuning
- Added support for more composite radix kernels. This improves performance by reducing shared memory communications
- Added advanced tuning of the Bluestein sequence: it is now possible to specify the sequence to pad to. Added default tuned values for FP32 and FP64 for Nvidia A100 and AMD MI250 for sequences up to 4096
- Improved LUT usage: do not upload the first radix, coalesced upload to shared memory for small stage requests.
- Bugfixes: C2R check for big radix, matrix convolution coordinate assignment, specification of device_id in cuFFT/rocFFT scripts