PyTorch 2.8.0 Release

Released by @jbschlosser on 06 Aug · commit ba56102

PyTorch 2.8.0 Release Notes

Highlights

Unstable

  • torch::stable::Tensor
  • High-performance quantized LLM inference on Intel CPUs with native PyTorch
  • Experimental Wheel Variant Support
  • Inductor CUTLASS backend support
  • Inductor Graph Partition for CUDAGraph
  • Control Flow Operator Library
  • HuggingFace SafeTensors support in PyTorch Distributed Checkpointing
  • SYCL support in PyTorch CPP Extension API
  • A16W4 on XPU Device
  • Hierarchical compilation with torch.compile
  • Intel GPU distributed backend (XCCL) support

For more details about these highlighted features, you can look at the release blog post.
Below are the full release notes for this release.

Tracked Regressions

Windows wheel builds with CUDA 12.9.1 stack overflow during build (#156181)

Due to a bug introduced in CUDA 12.9.1, we are unable to complete full Windows wheel builds with this
version, as compilation of torch.segment_reduce() crashes the build. Thus, we provide a wheel
without torch.segment_reduce() included in order to sidestep the issue. If you need support
for torch.segment_reduce(), please utilize a different version.

Backwards Incompatible Changes

CUDA Support

Removed support for Maxwell and Pascal architectures with CUDA 12.8 and 12.9 builds (#157517,#158478,#158744)

Due to binary size limitations, support for sm50 - sm60 architectures with CUDA 12.8 and 12.9 has
been dropped for the 2.8.0 release. If you need support for these architectures, please utilize
CUDA 12.6 instead.

Python Frontend

Calling an op with an input dtype that is unsupported now raises NotImplementedError instead of RuntimeError (#155470)

Please update exception handling logic to reflect this.

In 2.7.0

try:
    torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except RuntimeError:
    ...

In 2.8.0

try:
    torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except NotImplementedError:
    ...

Added missing in-place on view check to custom autograd.Function (#153094)

In 2.8.0, if a custom autograd.Function mutates a view of a leaf requiring grad,
it now properly raises an error. Previously, it would silently leak memory.

class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        inp.add_(1)
        ctx.mark_dirty(inp)
        return inp

    @staticmethod
    def backward(ctx, gO):
        pass

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a.view_as(a)
Func.apply(b)

Output:

Version 2.7.0

Runs without error, but leaks memory

Version 2.8.0

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation

An error is now properly thrown for the out variant of tensordot when called with a requires_grad=True tensor (#150270)

Please avoid passing an out tensor with requires_grad=True as gradients cannot be
computed for this tensor.

In 2.7.0

a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
# does not error, but gradients for c cannot be computed
torch.tensordot(a, b, dims=([1], [0]), out=c)

In 2.8.0

a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
torch.tensordot(a, b, dims=([1], [0]), out=c)
# RuntimeError: tensordot(): the 'out' tensor was specified and requires gradients, and
# its shape does not match the expected result. Either remove the 'out' argument, ensure
# it does not require gradients, or make sure its shape matches the expected output.

torch.compile

Specialization of a tensor shape with mark_dynamic applied now correctly errors (#152661)

Prior to 2.8, it was possible for a guard on a symbolic shape to be incorrectly
omitted if the symbolic shape evaluation was previously tested with guards
suppressed (this often happens within the compiler itself). This has been fixed
in 2.8 and usually will just silently "do the right thing" and add the correct
guard. However, if the new guard causes a tensor marked with mark_dynamic to become
specialized, this can result in an error. One workaround is to use
maybe_mark_dynamic instead of mark_dynamic.

See the discussion in issue #157921 for more
context.

Version 2.7.0

import torch

embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.mark_dynamic(x, 0)

@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()

f(embed, x)

Version 2.8.0

import torch

embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.maybe_mark_dynamic(x, 0)

@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()

f(embed, x)

Several config variables related to torch.compile have been renamed or removed; a short migration sketch follows the list below.

  • Dynamo config variable enable_cpp_framelocals_guard_eval has changed to no longer have any effect (#151008).
  • Inductor config variable rocm.n_max_profiling_configs is deprecated (#152341).
    Instead, use ck-tile based configs rocm.ck_max_profiling_configs and
    rocm.ck_tile_max_profiling_configs.
  • Inductor config variable autotune_fallback_to_aten is deprecated (#154331).
    Inductor will no longer silently fall back to ATen. Please add "ATEN" to
    max_autotune_gemm_backends for the old behavior.
  • Inductor config variables use_mixed_mm and mixed_mm_choice are deprecated (#152071). Inductor now supports prologue fusion, so there is no need for
    special cases now.
  • Inductor config setting descriptive_names = False is deprecated (#151481). Please use one of the other available
    options: "torch", "original_aten", or "inductor_node".
  • custom_op_default_layout_constraint has moved from inductor config to functorch config (#148104). Please reference it via
    torch._functorch.config.custom_op_default_layout_constraint instead of
    torch._inductor.config.custom_op_default_layout_constraint.
  • AOTI config variable emit_current_arch_binary is deprecated (#155768).
  • AOTI config variable aot_inductor.embed_cubin has been renamed to aot_inductor.embed_kernel_binary (#154412).
  • AOTI config variable aot_inductor.compile_wrapper_with_O0 has been renamed to compile_wrapper_opt_level (#148714).
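
A rough migration sketch for the settings above (the values shown are illustrative, not recommendations):

import torch._functorch.config as functorch_config
import torch._inductor.config as inductor_config

# Instead of autotune_fallback_to_aten, opt back into ATen explicitly by listing it
# among the autotuning backends.
inductor_config.max_autotune_gemm_backends = "ATEN,TRITON"

# custom_op_default_layout_constraint now lives under the functorch config namespace.
functorch_config.custom_op_default_layout_constraint = "needs_fixed_stride_order"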

Added a stricter aliasing/mutation check for HigherOrderOperators (e.g. cond), which will explicitly error out if alias/mutation among inputs and outputs is unsupported (#148953, #146658).

For affected HigherOrderOperators, add .clone() to aliased outputs to address this.

Version 2.7.0

import torch

@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x, lambda x: x + 1, [x])

fn(torch.ones(3))

Version 2.8.0

import torch

@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x.clone(), lambda x: x + 1, [x])

fn(torch.ones(3))

guard_or_x and definitely_x have been consolidated (#152463)

We removed definitely_true / definitely_false and associated APIs, replacing them with
guard_or_true / guard_or_false, which offer similar functionality and can be used to
achieve the same effect. Please migrate to the latter.

Version 2.7.0

from torch.fx.experimental.symbolic_shapes import definitely_false, definitely_true

...
if definitely_true(x):
    ...
if definitely_false(y):
    ...

Version 2.8.0

from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true

...
if guard_or_false(x):
    ...
# alternatively: if guard_or_false(torch.sym_not(y))
if not guard_or_true(y):
    ...

torch.export

torch.export.export_for_inference has been removed in favor of torch.export.export_for_training().run_decompositions() (#149078)

Version 2.7.0

import torch

...
exported_program = torch.export.export_for_inference(mod, args, kwargs)

Version 2.8.0

import torch

...
exported_program = torch.export.export_for_training(mod, args, kwargs).run_decompositions(decomp_table=decomp_table)

Switched default to strict=False in torch.export.export and export_for_training (#148790, #150941)

This differs from the previous release default of strict=True. To revert to the old default
behavior, please explicitly pass strict=True.

Version 2.7.0

import torch

# default behavior is strict=True
torch.export.export(...)
torch.export.export_for_training(...)

Version 2.8.0

import torch

# strict=True must be explicitly passed to get the old behavior
torch.export.export(..., strict=True)
torch.export.export_for_training(..., strict=True)

ONNX

Default opset in torch.onnx.export is now 18 (#156023)

When dynamo=False, the default ONNX opset version has been updated from 17 to 18. Users can set opset_version to explicitly select an opset version.

Version 2.7

# opset_version=17
torch.onnx.export(...)

Version 2.8

# To preserve the original behavior
torch.onnx.export(..., opset_version=17)

# New: opset_version=18
torch.onnx.export(...)

The JitTraceConvertStrategy has been removed (#152556)

Support for JIT traced and scripted modules in the ONNX exporter when dynamo=True has been removed. You are encouraged to export an nn.Module directly, or create an ExportedProgram using torch.export before exporting to ONNX.
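
As a minimal sketch of the recommended paths (the module and input shapes below are placeholders):

import torch

class MyModel(torch.nn.Module):
    def forward(self, x):
        return x.relu() + 1

model = MyModel()
args = (torch.randn(2, 3),)

# Export the nn.Module directly with the dynamo-based exporter...
onnx_program = torch.onnx.export(model, args, dynamo=True)

# ...or create an ExportedProgram first and convert that to ONNX
# (example inputs are not passed again for an ExportedProgram).
ep = torch.export.export(model, args)
onnx_program = torch.onnx.export(ep, dynamo=True)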

onnxscript>=0.3.1 is required for the dynamo=True option (#157017)

You must upgrade onnxscript to version 0.3.1 or higher for it to be compatible with PyTorch 2.8.

Build Frontend

Removed the torch/types.h include from Dispatcher.h (#149557)

This can cause build errors in C++ code that implicitly relies on this include (e.g. very old versions of torchvision).

Note that Dispatcher.h does not belong as an include from torch/types.h and was only present as a
short-term hack to appease torchvision. If you run into torchvision build errors, please
update to a more recent version of torchvision to resolve this.

Upgraded DLPack to 1.0 (#145000)

As part of the upgrade, some of the DLDeviceType enum values have been renamed. Please switch
to the new names.

Version 2.7.0

from torch.utils.dlpack import DLDeviceType

d1 = DLDeviceType.kDLGPU
d2 = DLDeviceType.kDLCPUPinned
...

Version 2.8.0

from torch.utils.dlpack import DLDeviceType

d1 = DLDeviceType.kDLCUDA  # formerly kDLGPU
d2 = DLDeviceType.kDLCUDAHost  # formerly kDLCPUPinned
...

NVTX3 code has been moved from cmake/public/cuda.cmake to cmake/Dependencies.cmake (#151583)

This is a BC-breaking change for the build system interface. Downstream projects that previously got NVTX3 through cmake/public/cuda.cmake
(i.e. calling find_package(TORCH REQUIRED)) will now need to explicitly configure NVTX3 support in the library itself (i.e. use USE_SYSTEM_NVTX=1).
The change fixes the broken behavior where downstream projects couldn't find NVTX3 anyway due to the PROJECT_SOURCE_DIR mismatch.

Version 2.7.0:

  • A downstream project using -DUSE_SYSTEM_NVTX would be able to find NVTX3 and torch::nvtx3 via PyTorch's cmake/public/cuda.cmake logic.
  • A downstream project NOT using -DUSE_SYSTEM_NVTX would encounter build errors with CUDA 12.8 or above.

Version 2.8.0:

  • A downstream project using -DUSE_SYSTEM_NVTX will not be able to find NVTX3 or torch::nvtx3 via PyTorch's cmake/public/cuda.cmake. The downstream project now needs to explicitly find NVTX3 and torch::nvtx3 by implementing the same logic in PyTorch's cmake/Dependencies.cmake.
  • A downstream project NOT using -DUSE_SYSTEM_NVTX will proceed building without NVTX unless another part of the build process re-enables NVTX.

Deprecations

MPS support for MacOS Ventura will be removed in 2.9

PyTorch 2.8 is the last release that will support GPU acceleration on MacOS Ventura. In the next
release (2.9), MacOS Sonoma (released in Sept. 2023) or above will be required to use the MPS
backend.

torch.ao.quantization is deprecated and will be removed in 2.10 (#153892)

To migrate:

  • Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic)
    • Weight-only and dynamic quantization: use torchao eager mode quantize_.
    • Static quantization: use torchao PT2E quantization.
  • FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx, torch.ao.quantization.quantize_fx.convert_fx): use torchao PT2E quantization (torchao.quantization.quantize_pt2e.prepare_pt2e, torchao.quantization.quantize_pt2e.convert_pt2e).

Note that PT2E quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e). See pytorch/ao#2259 and https://docs.pytorch.org/ao/main/quick_start.html#pytorch-2-export-quantization for more details.
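
As a rough migration sketch for an eager-mode weight-only flow, assuming a recent torchao that exposes quantize_ and int8_weight_only (see the torchao quick start linked above for the current API):

import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()

# Replaces eager-mode torch.ao.quantization weight-only quantization;
# int8_weight_only is one example config among several provided by torchao.
quantize_(model, int8_weight_only())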

The dynamo=False (current default) option for torch.onnx.export is deprecated (#152478, #155580)

The default will be dynamo=True starting from PyTorch 2.9. You are encouraged to migrate to use the dynamo=True option in torch.onnx.export. This flag makes torch.export.export the default export path, replacing TorchScript.

To maintain the old behavior, set dynamo=False explicitly. You are encouraged to also experiment with the fallback=True option that will make the exporter fall back to the dynamo=False path if there are errors.
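
For example (the model and output path below are placeholders):

import torch

model = torch.nn.Linear(3, 3)
args = (torch.randn(1, 3),)

# Opt in to the dynamo-based exporter now, falling back to the legacy path on errors.
torch.onnx.export(model, args, "model.onnx", dynamo=True, fallback=True)

# Keep the old TorchScript-based behavior explicitly while it remains available.
torch.onnx.export(model, args, "model.onnx", dynamo=False)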

New Features

CUDA

  • Support capture of event record and wait in CUDAGraphs for timing (#155372)

torch.compile

Dynamo

  • Added support for hierarchical compilation via nested_compile_region (#156449); a small sketch follows this list
  • Allow guards to be dropped with custom filter functions via guard_filter_fn (#150936)
  • Added dont_skip_tracing decorator to skip over most Dynamo skipfiles rules (#150586)
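
A minimal sketch of hierarchical compilation, assuming the decorator is exposed as torch.compiler.nested_compile_region as in the 2.8 torch.compiler docs:

import torch

# Marks a region whose compiled graph can be reused at each call site
# instead of being re-traced inline every time it appears.
@torch.compiler.nested_compile_region
def block(x):
    return torch.nn.functional.relu(x) * 2

@torch.compile
def model(x):
    for _ in range(3):
        x = block(x)
    return x

model(torch.randn(8))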

Inductor

  • Added support for mapping a Dynamo graph to multiple different Inductor graphs, which can be optimized separately (#147648,#147038)

torch.export

Ahead-Of-Time Inductor (AOTI)

  • Added support for TorchBind objects (#150196, #154265)
  • Added config variable aot_inductor.model_name_for_generated_files for specifying model name (#154129)

MPS

ONNX

  • Added new strategy draft_export (#147529, docs) to provide debugging information upon data-dependent / constraint errors when obtaining an ExportedProgram with torch.onnx.export

  • Added support for symbolic operators in the dynamo=True export path (#148905, #149678, #150038, docs). Two operators, torch.onnx.ops.symbolic and torch.onnx.ops.symbolic_multi_out, are defined to allow you to create symbolic ONNX operators directly in your PyTorch models. You can use them in a forward method:

def forward(self, x: torch.Tensor) -> torch.Tensor:
    # Optionally use is_in_onnx_export to control the behavior during onnx export
    if torch.onnx.is_in_onnx_export():
        # Create a symbolic ONNX operator with the name "CustomOp" in the "custom_domain" domain.
        # The output tensor will have the specified dtype and shape
        return torch.onnx.ops.symbolic(
            "custom_domain::CustomOp",
            (x,),
            dict(attr_key="attr_value"),
            dtype=x.dtype,
            shape=x.shape,
            version=1,
        )
    else:
        return x

Python Frontend

  • Added Generalized Pareto Distribution (GPD) (#135968)

Quantization

  • Introduced torch.float4_e2m1fn_x2 dtype (#148791)

XPU

  • Support Intel distributed backend (XCCL) (#141856)
  • Support SYCL kernels through C++ extension (#132945)

Improvements

Build Frontend

  • Removed outdated warning about TORCH_CUDA_ARCH_LIST (#152715, #155314)
  • Made Eigen an optional build dependency (#155955)
  • Updated CUTLASS to 3.9.2 (#152779)

Composability

  • Enhanced custom op support with serializable op profiles and fake registration overrides (#151817,#150807,#150806)

C++ Frontend

  • Exposed bicubic mode for torch::nn::functional::grid_sample (#150817)

CUDA

  • Introduced no_implicit_headers mode for load_inline() on custom CUDA extensions (#149480)
  • Support large batch sizes in SDPA memory-efficient attention backend (#154029,#154663)
  • Fixed invalid indexing in SDPA memory-efficient attention backward (#155397)
  • Support SDPA attention backends on sm121 (DGX Spark) (#152314)
  • Added FP8 row-wise scaled-mm for sm12x (GeForce Blackwell) (#155991)

cuDNN

  • Updated cuDNN frontend version to 1.12 (#153888)

Distributed

c10d

  • Enhanced TCPStore with clone and queuing features (#150966, #151045, #150969, #151485)
  • Added a collective time estimator for NCCL comms (#149343)
  • Made getDefaultBackend more fault tolerant without relying on exceptions (#149152)
  • Specified the default PyTorch Distributed backend for MPS (#149538)
  • Supported masterListenFd in TCPStoreLibUvBackend (#150215)
  • Used shared stores in gloo (#150230)
  • Improved FR dump robustness with all-watchdog broadcast wait, reduced dump timeout, and shrunk mutex range (#150652, #151329, #155949)
  • Added the record of each individual collective being coalesced in FR (#151238)
  • Implemented safer book-keeping of NCCL communicators (#150681)
  • Clarified behavior of TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK (#150682)
  • Registered also future allocations in mempool with NCCL (#150684)
  • Avoided computing global_rank when group_rank is used (#151373)
  • Exposed NCCL communicator from ProcessGroupNCCL via an unsafe API (#152496)
  • Added split sizes info dump for uneven all2all bw calculation (#151438)
  • Made FR vendor neutral so that other backends can use it and integrated it into gloo (#152585, #152563, #154929, #152614)
  • Added needs_contiguous_strides tag in functional collective (#153399, #153523)
  • Allowed split_group to work with non-NCCL backends (#152175)
  • Simplified new_subgroups() by using new_subgroups_by_enumeration() (#153843)
  • Made only the current thread allocate to the pool in ProcessGroupNCCL (#153990)
  • Enabled using c10::Half for gloo (#153862)
  • Released GIL in PG destructor (#154976)
  • Enhanced get_process_group_ranks() to accept group=None (#154902)
  • Skipped updating the default device distributed backend if already registered (#155320)
  • Enabled querying the build and runtime NCCL versions (#156305)
  • Disabled NCCL NVLS when using deterministic mode (#156381)
  • Made init_process_group support index-only device id (#156214)
  • Support enabling / disabling NaN detector per-ProcessGroup (#151723)
  • Added support for reduce_scatter and ReduceOp::AVG in ProcessGroupGloo (#149781, #149869)
  • Added FP8 support in ProcessGroupNCCL (#152706)
  • Added ibverbs backend in gloo and enabled gloo CUDA when used with a backend that supports GPUDirect (#153015, #153425, #153406)

DeviceMesh

  • Improved device selection logic (#150897)

DistributedDataParallel (DDP)

  • Added an option to allow skipping all-reduce of unused parameters (#151503)
  • Added check on received data to avoid segfault in the DDP reducer (#152143)
  • Propagated use_python_reducer to C++ reducer (#152735)

DistributedStateDict (DSD)

  • Supported non-tensor-data write_size in planner write items (#149699)
  • Skip popping meta device tensors (#153185)

DTensor

  • Made StridedShard support uneven sharding (#150490)
  • Added op support for torch.cumsum (#151071)
  • Added DTensor redistribute fwd/bwd datatype conversion to enable SimpleFSDP mixed precision training (#150740)
  • Added rich support to torch.distributed.tensor.debug.visualize_sharding (#152027)

FullyShardedDataParallel2 (FSDP2)

  • Added PrivateUse1 backend in FSDP collectives and device type to pre forward hook (#147260, #149487)
  • Added set_reshard_after_forward (#149103)
  • Allowed different dtypes for no grad model params (#154103)
  • Respected reshard_after_forward=True for root model and kept root unsharded when not specifying reshard_after_forward (#154704, #155319)
  • Allowed forcing FSDP2 to always use SUM reductions (#155915)
  • Made assert on all_reduce_event only if it's not CPU device (#150316)
  • Enabled NCCL zero-copy (user buffer registration) for FSDP2 (#150564)

Pipeline Parallelism

  • Added schedule visualizer (#150347)
  • Allowed unused kwargs in ZB path (#153498)
  • Added get_pipeline_order() for Gpipe and 1F1B (#155935)

ShardedTensor

  • Added support for 0-size ShardedTensor and recalculated metadata from all_gather (#152583)

TensorParallel

  • Added a ParallelStyle PrepareModuleInputOutput (#150372)

torchelastic

  • No shutdown of rendezvous on leaving workers (#152525)

torch.compile

Dynamo

Inductor

  • Added block sparse support for FlexAttention on CPU (#147196)
  • Introduced new config settings (a brief usage sketch follows this list):
    • aot_inductor.custom_ops_to_c_shims and aot_inductor.custom_op_libs: allow for specifying custom op C shim (#153968)
    • max_fusion_buffer_group_pairwise_attempts: limits fusions to specified node distance (#154688)
    • cuda.cutlass_enabled_ops: controls CUTLASS operation selection (#155770)
    • triton.cudagraph_capture_sizes: allows specifying certain shapes for which to capture CUDAGraphs; skips CUDAGraphs for other shapes (#156551)
    • use_static_cuda_launcher: enables launching compiled triton statically to improve cold start times (#148890)
    • assume_unaligned_fallback_output: allows inductor to track unaligned outputs (#150777)
    • cuda.cutlass_tma_only: controls whether or not to only use TMA-compatible kernels in CUTLASS (#152815)
    • static_launch_user_defined_triton_kernels: enables statically launching user defined triton kernels (#153725)
    • precompilation_timeout_seconds: controls the timeout on precompilation (#153788)
    • disable_decompose_k: disables new DecomposeK GEMM kernels (#154421)
    • min_num_split: sets the minimum number of splits in a split reduction (#155941)
    • max_autotune_flex_search_space: allows specifying the size of the search space for flex attention autotuning (#156307)
  • Introduced environment variable LOG_AUTOTUNE_RESULTS for autotune log (#156254)
  • Improved numerical stability of CPU Welford reduction for normalizations (#145061)
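
A brief sketch of toggling a couple of the boolean settings above before compiling (illustrative only; each knob takes effect only where the corresponding feature applies):

import torch
import torch._inductor.config as inductor_config

inductor_config.use_static_cuda_launcher = True  # launch compiled Triton kernels statically
inductor_config.disable_decompose_k = True       # opt out of the new DecomposeK GEMM kernels

@torch.compile
def f(x):
    return x @ x.t()

f(torch.randn(8, 8))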

torch.export

  • Improved handling of builtin ops (min, max, math.pow) (#151348)
  • Added min/max ranges for dim hints (#149590)
  • Allow registering normal classes to pytree.register_dataclass (#147752)
  • Allow specifying integer inputs as dynamic (#151842)
  • Inline jit.scripted functions in export (#155180)
  • Pretty printing for graph signature (#149710)

Ahead-Of-Time Inductor (AOTI)

  • Support for device-side TMA (#157241)
  • Added num_runners to AOTIModelPackageLoader (#149364)

FX

  • Updated codegen compare op to == (#150611)
  • Map names to operand indices when const folding submodules (#150692)
  • Improved stacktrace when tracing (#151029, #155486)
  • Support edge dialect ops in normalize_function (#143689)
  • Fixed path naming in minifier (#153130)
  • Added graph_code_verbose_log artifact for FX passes (#153775)
  • Improved cache key graph printing performance (#151928)
  • Added flag to fx.passes.split_module to normalize input names (#157793)

Linear Algebra Frontend

  • Add tensor overlap check for cross (#154999)

MPS

Nested Tensor (NJT)

  • Fixed contiguity in NJT string representation (#153529)

torch.nn

  • Added warning for module full backward hook when no input requires gradient (#155339)
  • Added Half support for weight_norm on CPU (#148878)

ONNX

  • Updated ONNX to 1.18 (#152200)
  • Added support for opsets (18-23) when dynamo=True (#149901, #154596)
  • Added float4 support (#151069, #156353)
  • Added support for ONNX operators Attention-23 and RotaryEmbedding-23 as native PyTorch ops (#156431, #156367, #154745)
  • Added support for torch.scan (#154513)
  • Added support for 0/1-sized example inputs on dynamic dimensions (#155717)
  • Added group_norm support from opset 21 (#152138)
  • Added asdict method to VerificationInfo class (#151024)
  • Support running bfloat16 models with ONNX Runtime (#149646)
  • Updated ONNX program doc formatting and improved robustness (#151623)
  • Updated dynamic_shapes behavior to use torch.export.dim.DYNAMIC (#153065)
  • Set the name of the producing node using the value name (#155413)
  • Improved support for symbolic operators sym_float, sym_not, sym_min, sym_max (#153200, #152111, #152196)

Optimizer

  • Added TensorLR variant for fused Adagrad on CPU (#153078); see the sketch after this list
  • Convert tensor lr to 0-dim as needed for the optimizer to work normally (#145674)
  • Added lr_lambda type check in MultiplicativeLR (#151973)
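
A small illustration of the tensor learning-rate path with fused CPU Adagrad (assuming a build where fused=True is supported; the model is a placeholder):

import torch

model = torch.nn.Linear(4, 4)
# A 0-dim tensor lr works with the fused CPU Adagrad implementation.
opt = torch.optim.Adagrad(model.parameters(), lr=torch.tensor(0.01), fused=True)

model(torch.randn(2, 4)).sum().backward()
opt.step()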

Profiler

  • Added support for on-demand memory snapshot (#150559)
  • Added PT2 compile context to visualizer (#152862)
  • Added PT2 to memory snapshot (#152707)
  • Added flag to toggle global and local callbacks for annotations (#154932)
  • Pass overload names to Kineto (#149333)
  • Set duration to -1 for unfinished CPU events (#150131)
  • Start at index with most events (#154571)

Python Frontend

  • Introduced torch.AcceleratorError (#152023)
  • Implemented Size.__radd__() (#152554)
  • Updated get_default_device() to also respect the torch.device context manager (#148621); see the example after this list
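
For example, the interaction with the context manager now behaves as follows (the device is chosen only for illustration):

import torch

print(torch.get_default_device())  # cpu, unless a default device was set elsewhere

with torch.device("meta"):
    # The torch.device context manager is now reflected here as well.
    print(torch.get_default_device())  # meta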

Quantization

  • Improved x86 PT2E quantization support with new uint8 ops (pointwise mul / add / add_relu and batch_norm2d), qconv1d-relu fusion, and lowering pass (#151112, #152411, #152811, #150751, #149708)
  • Support boolean tensor for torch.fused_moving_avg_obs_fake_quant on CUDA (#153699)

Release Engineering

ROCm

  • Allow user to override default flags for cpp_extension (#152432)
  • Enabled support for sparse compressed mm/bmm/addmm (#153262)

Sparse Frontend

  • Enabled sparse compressed tensor invariant checks for PrivateUse1 extension (#149374)

torch.func

  • Add batching rules for ops: torch.Tensor.scatter_add_ (#150543), torch.matrix_exp (#155202); see the example below
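
For example, torch.matrix_exp now composes with torch.vmap via the registered batching rule (shapes are illustrative):

import torch

batch = torch.randn(5, 3, 3)

# vmap maps matrix_exp over the leading batch dimension using the new batching rule.
out = torch.vmap(torch.matrix_exp)(batch)
print(out.shape)  # torch.Size([5, 3, 3])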

XPU

  • Support safe softmax, GQA, fp32 causal mask for SDP and increase maximum head dim from 256 to 576 on Intel GPU (#151999,#150992,#152091)
  • Add memory reporting to Memory Profiler for Intel GPU (#152842)
  • Support Intel GPU profiler toggle functionality (#155135)
  • Support distributed memory tracker integration for Intel GPU (#150703)
  • Improved error handling and reporting in Intel GPU CMake files (#149353)
  • Support embed_cubin and multi_arch_kernel_binary options in AOTI for Intel GPU (#154514, #153924)
  • Added generic and Intel GPU specific Stream and Event in UserDefineClass (#155787)
  • Support int4 WOQ GEMM on Intel GPU (#137566)

Bug Fixes

Build Frontend

  • Support builds with CMake-4.x (#150203)
  • Fixed fbgemm build with gcc-12+ (#150847)
  • Force build to conform to C++ standard on Windows by adding the /permissive- flag (#149035)

Composability

  • Fixed support for 1-element tuple returns from custom ops (#155447)
  • Avoid overflow in torch.norm for scalar input (#144073)

CPU (x86)

  • Fixed apparent copy-paste bug in log_softmax reduced-precision fp kernel (#156379)

CUDA

  • Fixed deterministic indexing with broadcast (#154296)
  • Fixed torch.backends.cuda.matmul.allow_fp16_accumulation crash when using cuBLASLt (#153083)
  • Enable AsyncMM on Blackwell (#153519)
  • Fixed torch.cuda.MemPool for multithreaded use-cases (#153356)
  • Fix to avoid calling sum() on a default-constructed gamma / beta in layer_norm (#156600)
  • Avoid hangs by erroring out for negative offsets or K=0 in grouped GEMMs (#153226)
  • Don't error out in empty_cache under mempool context (#158180)

Distributed

c10d

  • Fixed extra CUDA context created by barrier (#149144)
  • Fixed the logic to use group rank instead of global rank when possible (#149488)
  • Fixed ET trace collection of all_to_all (#149485)
  • Disabled start event recording for coalesced col and improved profile title (#150863)
  • Fixed connection reset in tcp store (#150987, #151052)
  • Fixed unused group input argument in new_subgroups() (#152765, #153798)
  • Fixed tcp init when using port 0 (#154156)
  • Adopted a vector to temporarily keep the reference to future object to avoid blocking inside Flight Recorder (#156653)

Distributed Checkpointing (DCP)

  • Fixed to use global coordinator rank in broadcast_object util function (#155912)

DistributedDataParallel (DDP)

  • Fixed DDPOptimizer issue on static tensor index (#155746)

DTensor

  • Fixed local_map with multi-threading (#149070)
  • Fixed the case where new_local_tensor in redistribute is None (#152303)
  • Fixed bug visualizing 1D Tensor using rich (#152871)

Pipeline Parallelism

  • Optimized memory usage by releasing output memory earlier (#153383)

RPC

  • Made torch importable if compiled without TensorPipe (#154382)

ShardedTensor

  • Fixed sharded tensor gather when a local tensor on certain ranks has zero elements (#150914)

TensorParallel

  • Turn async-TP applicability asserts back into silent skips (#158736)

torch.compile

Dynamo

  • Eliminated silent incorrectness issues in the Compiled Autograd initial trace (#149014,#155521,#155289,#149336)
  • Fixed various tracing errors involving einops, dict(mapping_proxy), and the FlexAttention HOP (#157754, #157515, #157519)
  • Fixed unpack hook semantics for memory savings in checkpointing and offloading for Compiled Autograd (#147242, #153300)
  • Fixed sources for dataclass defaults and the lru_cache method (#158689, #157308)
  • Fixed spammy errors when an invalid TORCH_LOGS argument is passed (#151678)

Inductor

  • Support special kwargs in AMD triton configs (#154605)
  • Fixed minifier when one has multiple Python runtimes (#155918)
  • Bug fix for int8 GEMM compensation epilogue (#152408)

torch.export

Ahead-Of-Time Inductor (AOTI)

  • Fixed AOTI update_constant_buffer issue (#149243)
  • Fixed a memory leak in model_package_loader (#152334)
  • Don't alloc weights in AOTIModel if they don't exist (#152692)
  • Fixed state of ConstantFolding (#153152)
  • Fixed index offset for optional tensor return (#155073)
  • Fixed float8 type printing for min/max value printing (#154466)

Linear Algebra Frontend

  • Fix to work around LAPACK workspace size being returned as a floating point value (#149682)
  • Fixed the accumulation type for dot and gemv (#152676)
  • Fixed torch.lobpcg to compute the same largest eigenvalue as scipy and np.linalg.eig (#152789)
  • Fixed 32-bit indexing overflows in ReducedPrecisionGemV (#150949)

MPS

  • Fixed various op support issues: unary/binary ops with 2**32+ element inputs, binary ops with inputs with different dtypes, ops with complex scalar inputs, cholesky decomp, floor_divide type promotion, index_kernel with large inputs, lerp with complex inputs, logit with half/bfloat16 inputs, SDPA memory leak, torch.special.entr, tri[ul], matrix inversion with N>1024, and where with non-contiguous cond (#152479, #155183, #149233, #151176, #151282, #158239, #152371, #149974, #158237, #146754, #158867, #155184, #152204)

torch.nn

  • Fixed load_state_dict behavior for nn.LazyLinear (#147599)

ONNX

  • Fixed bfloat16 support in onnx_program callable (#151121)
  • Produce correct dtypes for bf16/f8 in IR TorchTensor (#151259)
  • Preserve all legacy exporter params in fallback (#156659)
  • Fixed 4D tensor conversion for SDPA (#157509)

Optimizer

  • Fixed bug where lr_scheduler unexpectedly calls step() when init argument last_epoch > -1 (#149312)
  • Fixed CosineAnnealingWarmRestarts resetting T_cur (#151289)

Profiler

  • Fixed empty C call queue in python tracer (#150370)
  • Removed decref from python context in python tracer (#151625)
  • Enable all configured activities in CUPTI Range Profiler mode (#154749)

Python Frontend

  • Fixed segfault during numpy string tensor conversion (#155364)
  • Added checks for empty tensor list (#155383)
  • Fixed sample validation for MixtureSameFamily distribution (#151317)
  • Fixed bug where creating a second Wishart or Uniform distribution modifies constraints on the first (#154361)
  • Fix to properly export the torch::utils::tensor_to_numpy symbol (#154178)
  • Fixed torch.[con]cat[enate] to avoid crashing on empty inputs (#155460)
  • Unify torch.tensor and torch.ops.aten.scalar_tensor behavior (#158655)

Release Engineering

  • Checkout optional submodules when publishing a release tarball (#156615)
  • Fixed MacOS MP hang in Python-3.12+ (#155698)
  • Fixed static functions when using module in MSVC (#148675)
  • Fixed VS2022-caused AVX512 illegal instruction issue (#153480)

ROCm

  • Fixed build error for opportunistic fastatomics with newer compilers (#152841)

TunableOp

  • More TF32 support (#149088)
  • Fixed offline tuning for ScaledGEMM (#149677)
  • Fixed row-wise ScaledGEMM (#152403)
  • Support submatrices in offline tuning for ROCm (#151138)

Vulkan

  • Fixed torch.is_vulkan_available() on Mac (#155595)

XPU

  • Fixed matmul accuracy when offset > 0 (#154495)
  • Fixed torch.xpu.is_bf16_supported to correctly report presence of Intel GPU (#152317)
  • Fixed AOT compilation in SYCL C++ extension (#156364)

Performance

Autograd

CPU (AArch64)

  • Compute ELU(0) with the cheaper definition (#155765)

CUDA

Dataloader Frontend

  • Reduced memory usage of SubsetRandomSampler by iterating over list instead of tensor (#149126)

torch.compile

Inductor

  • Improved performance of GEMMs (#147315,#151530,#149373,#156174,#155444)
  • Added a config option cpp.use_small_dequant_buffer to use a small dequant buffer for WOQ int4 GEMM (#156395)
  • Support graph partitioning on custom ops (#149782)
  • Optimized the heuristics of parallel reduction on CPU (#149614)

torch.export

  • Cache unflattened graph module (#150030)

JIT

  • Improved Dead Code Elimination (DCE) compile times for large graphs (#153645)

Linear Algebra Frontend

  • Introduced fast path for torch.dot with float16/bfloat16 (#152799)

MPS

Python Frontend

  • Optimized SVE embedding performance (#150176)
  • Improved performance for torch.tensordot when contracting to a scalar (#145936)

ROCm

  • Improved performance of softmax, NLLLoss, in-place sum, max pooling backward / reductions on NHWC
    inputs, max pooling, multi-dimensional reductions, and non-vectorized elementwise kernels (#149076, #149779, #149548, #151230, #152267, #154522, #154619, #155806, #153184)
  • Improved scatter add performance on MI250X (#151724)
  • Extended vectorized elementwise kernel to more heterogeneous tensor types (#149738)
  • Use HipSparseLT to further accelerate semi-structured (e.g. 2:4) sparsity (#150578)

Sparse Frontend

  • Skip sparse tensor invariant validation when loading sparse Tensors from external storage (#154610,#154759,#154638)

XPU

  • Enabled post-op fusion for oneDNN convolution on Intel GPU (#150287)
  • Reduced host overhead for Intel GPU by eliminating meaningless API calls (#151111)
  • Improved INT4 WOQ GEMM for Intel GPU by introducing a cache mechanism to reduce the oneDNN integration overhead further (#147693)
  • Improved scalar tensor case handling in addmm, baddmm to reduce oneDNN integration overhead on Intel GPU (#153051)

Documentation

Autograd

  • Added more details on why ctx.save_for_backward is important in note about extending autograd (#153005)
  • Updated docs of torch.autograd.graph.saved_tensors_hooks to avoid refcycle (#153049)
  • Updated gradient behavior note in torch.amin and torch.amax (#155071)

CUDA

  • Fixed deprecated amp APIs in docs (#154553)
  • Documented device memory apis in correct module (#155126)
  • Documented non-pytorch CUDA memory allocation and how to query it (#150880)

Distributed

c10d

  • Documented object collectives limitations (#150815)
  • Updated NCCLConfig with QOS variable (#151821)
  • Documented get_default_backend_for_device (#158236)

FullyShardedDataParallel2 (FSDP2)

  • Updated ignored_params docstring and added unit tests (#149074)
  • Added pointer to torchtitan (#153079)
  • Added warning about incorrect grad results at world size 1 (#154928)

torch.export

  • Added mini tutorial for provenance tracking (#152211)
  • Updated docs for Dims and ExportGraphSignature (#156262, #156244)

Linear Algebra Frontend

  • Addressed ambiguity in docs for torch.linalg.norm()'s ord argument of +2 & -2 (#155148)

torch.nn

  • Improved documentation for transformer-related layers, nn.RNN, nn.functional loss functions, interpolate saturate cast behavior, ConvTranspose2d stride / output_size arguments, and register_full_backward_hook (#155123, #153620, #148436, #151304, #150819, #150609, #151785)
  • Fixed examples for nn.Sequential and nn.LazyModuleMixin (#147304, #150596)
  • Documented padding size limitations in nn.modules.padding and AvgPoolND (#155618, #152680)

ONNX

  • Convert .rst doc files to markdown (#155228, #155556)
  • Improved docstring of ONNX symbolic ops (#149668)
  • Added note for attention op symbolic function (#156441)
  • Added ONNX Dynamo metadata documentation (#155816)

Optimizer

  • Added scripts to generate plots of LRSchedulers (#149189)
  • Included other accelerators in capturable docstr for optimizers (#149770)
  • Updated SGD documentation to match implementation and document that dampening is skipped in SGD first step (#149884, #152833)
  • Fixed doc for CosineAnnealingLR to accurately reflect its recursive learning rate schedule (#152936)
  • Fixed incorrect citation of authors in Adafactor documentation (#145209)
  • Added load_state_dict hint doc about invocation order with lr_scheduler (#149942)

Python Frontend

  • Make torch.Library's kind have no default value to be consistent with the code (#149390)
  • Added 32-bit complex to the list of dtypes (#144590)
  • Clarified behavior when integer dtype is used with requires_grad=True in tensor.to() (#150913)
  • Optimized cdist param description (#151178)
  • Updated serialization docs (#153631)
  • Render Example: and not Example:: in docs (#153978)
  • Added docstring indicating undefined behavior for converting inf to int (#154781)
  • Updated as_strided() docs (#149146)
  • Fixed keepdim param optional description (#151197)
  • Clarify that x and dx are mutually exclusive in torch.trapezoid docs (#151190)
  • Documented out_dtype arg for torch GEMM operations (#151704)
  • Fixed the basic description of torch.min(), torch.max(), torch.all(), and torch.any() (#152658)
  • Added torch.triu_indices, torch.tril_indices dtype description (#150749)
  • Optimized torch.equal description (#149618)

Quantization

  • Fixed incorrect get_default_qat_qconfig in prepare_qat_fx docs (#155100)

Release Engineering

  • Migrated to new theme (#149331)

XPU

  • Improved "Getting Started on Intel GPU" hardware requirements and notes (#151886)

Developers

Distributed

c10d

  • Added param recording for uniqueID broadcasting and allgather (#149166)
  • Added logger config and more loggings, e.g.nccl_version and thread name/id, for flight record in PGNCCL (#150356, #150513, #151048, #152648, #155142, #155754)
  • Surfaced error type when we unlink and create named pipe for DumpPipe (#150648)
  • Improved the logs on remote shutdown of tcpstore (#153586)
  • Enhanced error logging in new_subgroups() for non-divisible world sizes (#154124)
  • Added a logger for all NCCL collectives with their time duration when completed (#156008)
  • Updated error message in get_backend() with more details (#141796)

FullyShardedDataParallel (FSDP1)

  • Print FQNs when debugging FlatParamHandle (#151336)

FullyShardedDataParallel2 (FSDP2)

  • Added FSDP2 logging (#155826)

RPC

  • Correctly pass exceptions raised from rpc_init to CPython (#154325)

torchelastic

  • Added the logging of start of torch elastic workers (#150849)
  • Passed event log handler to record function calls (#155457)
  • Added torch.distributed.run option to provide destination for event logging (#155268)

torch.export

  • Add TracingContext (#149294)
  • Monkeypatch fake mode so it errors on invalid custom ops (#149410)
  • Fixed torch export docs for preserve_module_call_signature (#151140)
  • Improved error message for deserializing custom triton op (#152029)
  • Better type annotation for lift_constants_pass (#152072)
  • Fixed bug in detect_attr_assignment (#151824)

Ahead-Of-Time Inductor (AOTI)

  • Refactor AOTInductor runtime API for Intel GPU (#153929)
  • Improve stable library APIs (#152040)
  • Add a basic shim and stable::Tensor is_contiguous API (#156228)

FX

  • Gracefully exit minimizer when there is no discrepancy in block mode (#154076)

Optimizer

  • Improve decorator typing for Optimizer subclasses (#153374)
  • Optimize typing in lr_scheduler.py (#151219)
  • Fixed the type hint of step() with default value (#153367)

Release Engineering

  • Added support for CUDA 12.9 in CI/CD (#154980, #156630, #155895, #155799, #155496, #155340, #155819, #156108)
  • Added support for ROCm 6.4 in CI/CD (#151236, #151345, #151355, #153253, #156112)
  • Moved CI from ubuntu 20.04 images to ubuntu 22.04 and 24.04 (#154437, #154153, #149142)
  • Moved CI to CUDA 12.8 (#154004, #152810, #155087, #148963)
  • Enabled CI on MI300 (#150667, #152133, #148394, #153134)
  • Enabled CI on H100 (#153900, #154562, #153170, #155861, #155719, #156429)
  • Enabled CD for Windows Arm64 (#150310, #152109, #149850, #152099)
  • Enabled testing of binary Docker builds in CI/CD (#151483, #151488, #151489, #151706)
  • Added smoke test to validate NCCL and cuDNN versions in PyPI packages (#149885, #150194)
  • Enabled monitoring for performance tests (#153452, #153453, #153454, #153456)
  • Improved benchmarking and performance testing on MacOS (#151721, #151747, #151748, #153897, #155493, #153897, #155493)
  • Use setup-python for Mac tests (#155698)
  • Removed CUDA 11.8 and 12.4 support in CI/CD (#155509, #154169, #152362, #155555, #154893)
  • Removed Anaconda support in CI/CD (#147789, #152338, #152431, #152377, #152433, #147476, #151035, #152860, #152702, #154303, #154309)