Release v2.9
Key Features and Enhancements
- [PyTorch][Jax] Introduced recipe-agnostic functions and APIs to generalize to non-FP8 recipes. See the Deprecated Features section for a comprehensive list of affected APIs.
- [C][PyTorch][Jax] Added support for the clamped SwiGLU activation function.
- [C] Added precompiled wheels for CUDA 13 on PyPI.
- [PyTorch] Added support for custom training recipes in the `autocast` context (a usage sketch follows this list). Transformer Engine quantizers, quantized tensor classes, and storage dataclasses are now part of the public API.
- [PyTorch] Added CPU offload support for all attention layouts.
- [PyTorch] Added support for the FP8 block scaling recipe (as used in the DeepSeek v3 Technical Report) on NVIDIA Blackwell architecture (SM100 family).
- [PyTorch] Added support for gradient accumulation fusion when using FSDP.
- [PyTorch] Added support for CPU offloading when using `GroupedLinear` with the distributed optimizer.
- [PyTorch] Exposed the following utility functions as public API: `is_fp8_available`, `is_mxfp8_available`, `is_fp8_block_scaling_available`, `is_nvfp4_available`, `is_bf16_available`, `get_cudnn_version`, `get_device_compute_capability`, and `get_default_recipe`.
- [PyTorch] Added `max_logit` support for the MuonClip optimizer.
- [PyTorch][Jax] Improved the logic for selecting the attention backend, addressing various unsupported cases and preventing errors.
- [Jax] Added support for the NVFP4 training recipe.
- [Jax] Improved the performance of the current scaling recipes by enabling fused amax calculation in normalization and activation kernels.
- [Jax] Added support for a bottom-right causal mask for THD attention.
- Improved documentation and tutorials for the NVFP4 recipe.
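A minimal sketch combining the newly public availability utilities with the recipe-aware `autocast` context. The function names come from the notes above, but the top-level import path, the return types of the availability checks, and the retained `enabled` flag are assumptions, not confirmed API details.

```python
import torch
import transformer_engine.pytorch as te

# Assumption: the public utility functions are reachable from the top-level
# PyTorch module and return truthy values when the feature is supported.
recipe = te.get_default_recipe() if te.is_fp8_available() else None

layer = te.Linear(1024, 1024).cuda()
inp = torch.randn(64, 1024, device="cuda")

# `recipe` and `amax_reduction_group` replace the deprecated `fp8_recipe`
# and `fp8_group` arguments (see Deprecated Features below).
with te.autocast(enabled=recipe is not None, recipe=recipe):
    out = layer(inp)
out.sum().backward()
```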
Fixed Issues
- [Jax] Fixed a crash when using Context Parallelism with ring attention.
- [Jax] Fixed an issue with incorrect sharding when `get_all_mesh_axes` is used.
- [Jax] Fixed a numerical issue when using bias along with Tensor Parallelism.
- [PyTorch] Fixed an integer overflow issue in the Triton permute kernel.
- [PyTorch] Fixed the known issue from release v2.8 which resulted in worse performance for the FP8 current scaling recipe.
- Fixed a build issue when cuDNN is installed into a custom location or a Python virtual environment.
Known Issues in This Release
- [C][PyTorch] The cuDNN attention backend produces NaNs in the forward pass for cases using a non-causal mask with cuDNN 9.13 and cuDNN 9.14. As a workaround, set the `NVTE_FUSED_ATTN` environment variable to 0 when using this configuration (see the snippet after this list).
- [C][PyTorch] The backward pass of cuDNN attention is incompatible with CUDA graphs for BSHD inputs where the sequence (S) dimension is not divisible by 128 when used with a non-padding mask. The same `NVTE_FUSED_ATTN=0` workaround applies.
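A minimal sketch of applying the workaround from Python; the environment variable name is taken from the notes above, while the requirement to set it before Transformer Engine is imported is an assumption.

```python
import os

# Disable the cuDNN fused-attention backend (workaround from the notes above).
# Assumption: the variable must be set before Transformer Engine is imported,
# so that backend selection sees it.
os.environ["NVTE_FUSED_ATTN"] = "0"

import transformer_engine.pytorch as te  # noqa: E402
```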
Breaking Changes in This Release
There are no breaking changes in this release.
Deprecated Features
- [PyTorch] The function `fp8_autocast` is deprecated in favor of `autocast`. The new `autocast` function uses the arguments `recipe` and `amax_reduction_group` instead of `fp8_recipe` and `fp8_group`, respectively (a migration sketch follows this list).
- [PyTorch] The function `fp8_model_init` is deprecated in favor of `quantized_model_init`.
- [PyTorch] The arguments `fp8_enabled`, `fp8_calibrating`, `fp8_recipe`, `fp8_group`, and `fp8_weight_caching` in the function `make_graphed_callables` are deprecated in favor of `enabled`, `calibrating`, `recipe`, `amax_reduction_group`, and `cache_quantized_params`, respectively.
- [Jax] The function `fp8_autocast` is deprecated in favor of `autocast`.
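A minimal migration sketch for the PyTorch renames above. The deprecated `fp8_autocast` call reflects the pre-v2.9 API; the `autocast` call uses the renamed arguments from these notes. The top-level import path for `autocast` and the retained `enabled` flag are assumptions.

```python
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

recipe = DelayedScaling()  # existing FP8 recipe class

# Before (deprecated in v2.9):
# with te.fp8_autocast(enabled=True, fp8_recipe=recipe, fp8_group=None):
#     ...

# After: same behavior, recipe-agnostic argument names.
with te.autocast(enabled=True, recipe=recipe, amax_reduction_group=None):
    ...  # run forward passes of Transformer Engine modules here
```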
Miscellaneous
None