Releases: pytorch/pytorch

PyTorch 2.9.1 Release, bug fix release

12 Nov 19:27
d38164a

This release is meant to fix the following issues (regressions / silent correctness):

Tracked Regressions

Significant Memory Regression in F.conv3d with bfloat16 Inputs in PyTorch 2.9.0 (#166643)
This release provides a workaround for this issue. If you are impacted, please install the nvidia-cudnn package version 9.15+ from PyPI. (#166480) (#167111)

Torch.compile

Fix Inductor bug when compiling Gemma (#165601)
Fix InternalTorchDynamoError in bytecode_transformation (#166036)
Fix silent correctness error_on_graph_break bug where non-empty checkpoint results in unwanted graph break resumption (#166586)
Improve performance by avoiding recompilation with mark_static_address with cudagraphs (#162208)
Improve performance by caching get_free_symbol_uses in torch inductor (#166338)
Fix registration design for inductor graph partition for vLLM (#166458) (#165815) (#165514)
Fix warning spamming in torch.compile (#166993)
Fix exception related to uninitialized tracer_output variable (#163169)
Fix crash in torch.bmm and torch.compile with PyTorch release 2.9.0 (#166457)

Other

Fix warning spamming on new APIs to control TF32 behavior (#166956)
Fix distributed crash with non-contiguous gather inputs (#166181)
Fix indexing on large tensors causing an invalid configuration argument (#166974)
Fix numeric issue in CUDNN_ATTENTION (#166912) (#166570)
Fix symmetric memory issue with fused_scaled_matmul_reduce_scatter (#165086)
Improve libtorch stable ABI documentation (#163899)
Fix image display on pypi project description section (#166404)


2.9 Release Notes

15 Oct 17:12
0fabc3b

PyTorch 2.9.0 Release Notes

Highlights

Unstable (API-Unstable)
Updates to the stable libtorch ABI for third-party C++/CUDA extensions
Symmetric memory that enables easy programming of multi-GPU kernels
The ability to arbitrarily toggle error or resume on graph breaks in torch.compile
Expanded wheel variant support to include ROCm, XPU and CUDA 13
FlexAttention enablement on Intel GPUs
Flash decoding optimization based on FlexAttention on X86 CPU
ARM Platform improvements and optimizations
Enablement of Linux aarch64 binary wheel builds across all supported CUDA versions

For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.

Backwards Incompatible Changes

Min supported Python version is now 3.10 (#162310)

The minimum version of Python required for PyTorch 2.9.0 is 3.10. We also have 3.14 and 3.14t available as preview with this release.

Undefined behavior when an output of a custom operator shares storage with an input

This is a reminder that outputs of PyTorch custom operators (that are registered using the torch.library or TORCH_LIBRARY APIs) are not allowed to return Tensors that share storage with input tensors. The violation of this condition leads to undefined behavior: sometimes the result will be correct, sometimes it will be garbage.

After #163227, custom operators that violate this condition and previously returned correct results under torch.compile may now return silently incorrect results under torch.compile. Because this is changing the behavior of undefined behavior, we do not consider this to be a bug, but we are still documenting it in this section as a "potentially unexpected behavior change".

This is one of the conditions checked for by torch.library.opcheck and is mentioned in The Custom Operators Manual.

More details

PyTorch custom operators are not allowed to return Tensors that share storage with input tensors

For example, the following two custom operators are not valid custom operators:

@torch.library.custom_op("mylib::foo", mutates_args=())
def foo(x: torch.Tensor) -> torch.Tensor:
    # the result of `foo` must not directly be an input to foo.
    return x

@torch.library.custom_op("mylib::bar", mutates_args=())
def bar(x: torch.Tensor) -> torch.Tensor:
    # the result of bar must not be a view of an input of bar
    return x.view(-1)

The easiest workaround is to add an extra.clone() to the outputs:

@torch.library.custom_op("mylib::foo", mutates_args=())
def foo(x: torch.Tensor) -> torch.Tensor:
    return x.clone()

@torch.library.custom_op("mylib::bar", mutates_args=())
def bar(x: torch.Tensor) -> torch.Tensor:
    return x.view(-1).clone()

A common way to get into this situation is for a user to want to create a custom operator that sometimes mutates the input in-place and sometimes returns a new Tensor, like in the following example.

@torch.library.custom_op("mylib::baz", mutates_args=["x"])
def baz(x: torch.Tensor) -> torch.Tensor:
    if inplace:
        x.sin_()
        return x
    else:
        return x.sin()

This dynamism is not supported and leads to undefined behavior. The workaround is to split the custom operator into two custom operators, one that always mutates the input in-place, and another that always returns a new Tensor.

@torch.library.custom_op("mylib::baz_outplace", mutates_args=())
def baz_outplace(x: torch.Tensor) -> torch.Tensor:
    return x.sin()

@torch.library.custom_op("mylib::baz_inplace", mutates_args=["x"])
def baz_inplace(x: torch.Tensor) -> None:
    x.sin_()

def baz(x):
    if inplace:
        baz_inplace(x)
        return x
    else:
        return baz_outplace(x)

Build Metal kernels for MacOS-14+ and remove all pre-MacOS-14 specific logic; MacOS-14+ is required going forward (#159733, #159912)

PyTorch MPS is only supported on MacOS-14 or later. If you need to use MPS on MacOS Ventura, please avoid updating PyTorch to 2.9 or above.

Upgrade to DLPack 1.0 (#145000)

This upgrade makes the same BC-breaking changes as the DLPack 1.0 release. Objects in torch.utils.dlpack have been updated to reflect these changes, such as DLDeviceType.

See the PR for details on the exact changes and how to update your code.

Raise appropriate errors in torch.cat (#158249)

torch.cat now raises ValueError, IndexError, or TypeError where appropriate instead of the generic RuntimeError. If your code was catching these errors, update it to catch the new error types.
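
For example, a call site that previously caught RuntimeError can be broadened to cover the new error types; a minimal sketch (the exact type raised depends on the failure mode, and keeping RuntimeError in the tuple keeps the handler compatible with older releases):

```python
import torch

x = torch.ones(2, 3)
y = torch.ones(2, 4)

try:
    # sizes mismatch in a non-concatenated dimension
    torch.cat([x, y], dim=0)
except (ValueError, IndexError, TypeError, RuntimeError) as e:
    print(type(e).__name__, e)
```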

Default to dynamo=True for ONNX exporter (#159646, #162726)

Previously torch.onnx.export(...) used the legacy TorchScript exporter when the dynamo argument was not provided. The ONNX exporter now uses the newer torch.export.export pipeline by default (dynamo=True). This change improves graph fidelity and future-proofs exports, but may surface graph capture errors that were previously masked or handled differently.

Previously in torch 2.8.0:

# API calls the legacy exporter with dynamo=False
torch.onnx.export(...)

Now in torch 2.9.0:

# To preserve the original behavior
torch.onnx.export(..., dynamo=False)

# Export onnx model through torch.export.export
torch.onnx.export(...)

Recommendation: first try the new default; only fall back if you hit blocking issues and report them upstream.
Long term solution: fix the root cause instead of relying on fallback or TorchScript exporter.

Switch off runtime asserts by default in Export in favor of a shape guards function (#160111,#161178,#161794)

To enable runtime asserts, use export(..., prefer_deferred_runtime_asserts_over_guards=True). This also removes the allow_complex_guards_as_runtime_asserts flag, merging it into the former option.

Additionally, exported_program.module() will generate a call to a _guards_fn submodule that runs additional checks on inputs. Users who do not want this behavior can either remove this call from the graph, or call exported_program.module(check_guards=False) to avoid generating it.
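
A minimal sketch of the two options named above (runtime asserts instead of shape guards, and skipping the generated guards call); the toy module is hypothetical:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# opt back in to runtime asserts (option named in the note above)
ep = torch.export.export(
    M(),
    (torch.randn(4),),
    prefer_deferred_runtime_asserts_over_guards=True,
)

# skip generation of the _guards_fn input-checking call
mod = ep.module(check_guards=False)
print(mod(torch.randn(4)))
```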

Set default opset to 20 in ONNX (#158802)

Opset 20 enables newer operator definitions. If your tooling or downstream runtime only supports opset 18, pin it explicitly. For the latest ONNX operators, you can experiment with opset 23.

Previously in torch 2.8.0:

# opset_version=18
torch.onnx.export(...)

Now in torch 2.9.0:

# To preserve the original behavior
torch.onnx.export(..., opset_version=18)

# New: opset_version=20
torch.onnx.export(...)

# Use the latest supported opset: opset_version=23
torch.onnx.export(..., opset_version=23)

Drop draft_export in exporter API (#161454, #162225)

Remove implicit draft tracing from the default exporter path, achieving clearer behaviour and faster failures.
The expensive torch.export.draft_export diagnostic path is no longer auto-invoked (which could take hours on large models). You can still opt in for deep diagnostics:

Previously in torch 2.8.0:

# If both torch.export.export(..., strict=False) and
# torch.export.export(..., strict=True) fail to capture
# the model graph, torch.export.draft_export(...) will be triggered,
# and uses real tensors to trace/export the model.
#
# Inside export_to_onnx.py:
#   ... torch.onnx.export(..., dynamo=True)
python export_to_onnx.py

Now in torch 2.9.0:

# To trigger torch.export.draft_export once
# torch.export.export strict=False/True both
# fail:
TORCH_ONNX_ENABLE_DRAFT_EXPORT=True python export_to_onnx.py

Remove torch.onnx.dynamo_export and the onnxrt torch.compile backend (#158130, #158258)

torch.onnx.dynamo_export is removed. Please use torch.onnx.export instead.
The experimental ONNX Runtime compile backend (torch.compile(backend="onnxrt")) is no longer supported.

Remove torch.onnx.enable_fake_mode (#161222)

The dynamo=True mode uses FakeTensors by default, which is memory efficient.

Some public facing ONNX utility APIs for the TorchScript based exporter are now private (#161323)

Deprecated members in torch.onnx.verification are removed. Previously private torch.onnx.symbolic_opsets* functions will no longer be accessible. Consider making a copy of the source code if you need to access any private functions for compatibility with the TorchScript based exporter.

Remove torch.onnx.symbolic_caffe2 (#157102)

Support for caffe2 in the ONNX exporter has ended and is removed.

Remove /d2implyavx512upperregs flag that slows build (#159431)

Re-introduced AVX512 optimizations for Windows VS2022 builds; this may cause issues with specific versions of VS2022, see #145702.

Add ScalarType to shim conversion and stable::Tensor.scalar_type (#160557)

Before, user extensions could only in abstract...


PyTorch 2.8.0 Release

06 Aug 17:06
ba56102

PyTorch 2.8.0 Release Notes

Highlights

Unstable
torch::stable::Tensor
High-performance quantized LLM inference on Intel CPUs with native PyTorch
Experimental Wheel Variant Support
Inductor CUTLASS backend support
Inductor Graph Partition for CUDAGraph
Control Flow Operator Library
HuggingFace SafeTensors support in PyTorch Distributed Checkpointing
SYCL support in PyTorch CPP Extension API
A16W4 on XPU Device
Hierarchical compilation with torch.compile
Intel GPU distributed backend (XCCL) support

For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.

Tracked Regressions

Windows wheel builds with CUDA 12.9.1 stack overflow during build (#156181)

Due to a bug introduced in CUDA 12.9.1, we are unable to complete full Windows wheel builds with this
version, as compilation of torch.segment_reduce() crashes the build. Thus, we provide a wheel
without torch.segment_reduce() included in order to sidestep the issue. If you need support
for torch.segment_reduce(), please utilize a different version.

Backwards Incompatible Changes

CUDA Support

Removed support for Maxwell and Pascal architectures with CUDA 12.8 and 12.9 builds (#157517,#158478,#158744)

Due to binary size limitations, support for sm50 - sm60 architectures with CUDA 12.8 and 12.9 has
been dropped for the 2.8.0 release. If you need support for these architectures, please utilize
CUDA 12.6 instead.

Python Frontend

Calling an op with an input dtype that is unsupported now raises NotImplementedError instead of RuntimeError (#155470)

Please update exception handling logic to reflect this.

In 2.7.0

try:
    torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except RuntimeError:
    ...

In 2.8.0

try:
    torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except NotImplementedError:
    ...

Added missing in-place on view check to custom autograd.Function (#153094)

In 2.8.0, if a custom autograd.Function mutates a view of a leaf requiring grad,
it now properly raises an error. Previously, it would silently leak memory.

class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        inp.add_(1)
        ctx.mark_dirty(inp)
        return inp

    @staticmethod
    def backward(ctx, gO):
        pass

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a.view_as(a)
Func.apply(b)

Output:

Version 2.7.0

Runs without error, but leaks memory

Version 2.8.0

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation

An error is now properly thrown for the out variant of tensordot when called with a requires_grad=True tensor (#150270)

Please avoid passing an out tensor with requires_grad=True as gradients cannot be
computed for this tensor.

In 2.7.0

a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
# does not error, but gradients for c cannot be computed
torch.tensordot(a, b, dims=([1], [0]), out=c)

In 2.8.0

a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
torch.tensordot(a, b, dims=([1], [0]), out=c)
# RuntimeError: tensordot(): the 'out' tensor was specified and requires gradients, and
# its shape does not match the expected result. Either remove the 'out' argument, ensure
# it does not require gradients, or make sure its shape matches the expected output.

torch.compile

Specialization of a tensor shape with mark_dynamic applied now correctly errors (#152661)

Prior to 2.8, it was possible for a guard on a symbolic shape to be incorrectly
omitted if the symbolic shape evaluation was previously tested with guards
suppressed (this often happens within the compiler itself). This has been fixed
in 2.8 and usually will just silently "do the right thing" and add the correct
guard. However, if the new guard causes a tensor marked with mark_dynamic to become
specialized, this can result in an error. One workaround is to use
maybe_mark_dynamic instead of mark_dynamic.

See the discussion in issue #157921 for more
context.

Version 2.7.0

import torch

embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.mark_dynamic(x, 0)

@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()

f(embed, x)

Version 2.8.0

import torch

embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.maybe_mark_dynamic(x, 0)

@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()

f(embed, x)

Several config variables related to torch.compile have been renamed or removed

  • Dynamo config variable enable_cpp_framelocals_guard_eval has changed to no longer have any effect (#151008).
  • Inductor config variable rocm.n_max_profiling_configs is deprecated (#152341).
    Instead, use the ck-tile based configs rocm.ck_max_profiling_configs and
    rocm.ck_tile_max_profiling_configs.
  • Inductor config variable autotune_fallback_to_aten is deprecated (#154331).
    Inductor will no longer silently fall back to ATen. Please add "ATEN" to
    max_autotune_gemm_backends for the old behavior (see the sketch after this list).
  • Inductor config variables use_mixed_mm and mixed_mm_choice are deprecated (#152071). Inductor now supports prologue fusion, so there is no need for
    special cases now.
  • Inductor config setting descriptive_names = False is deprecated (#151481). Please use one of the other available
    options: "torch", "original_aten", or "inductor_node".
  • custom_op_default_layout_constraint has moved from inductor config to functorch config (#148104). Please reference it via
    torch._functorch.config.custom_op_default_layout_constraint instead of
    torch._inductor.config.custom_op_default_layout_constraint.
  • AOTI config variable emit_current_arch_binary is deprecated (#155768).
  • AOTI config variable aot_inductor.embed_cubin has been renamed to aot_inductor.embed_kernel_binary (#154412).
  • AOTI config variable aot_inductor.compile_wrapper_with_O0 has been renamed to compile_wrapper_opt_level (#148714).
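
A rough sketch of the renamed and relocated knobs from the list above; the values shown are illustrative, not recommendations:

```python
import torch

# old: torch._inductor.config.autotune_fallback_to_aten = True
# new: list "ATEN" explicitly among the candidate GEMM backends
torch._inductor.config.max_autotune_gemm_backends = "ATEN,TRITON"

# moved from torch._inductor.config to torch._functorch.config
torch._functorch.config.custom_op_default_layout_constraint = "needs_fixed_stride_order"
```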

Added a stricter aliasing/mutation check for HigherOrderOperators (e.g. cond), which will explicitly error out if alias/mutation among inputs and outputs is unsupported (#148953, #146658).

For affected HigherOrderOperators, add .clone() to aliased outputs to address this.

Version 2.7.0

import torch

@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x, lambda x: x + 1, [x])

fn(torch.ones(3))

Version 2.8.0

import torch

@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x.clone(), lambda x: x + 1, [x])

fn(torch.ones(3))

guard_or_x and definitely_x have been consolidated (#152463)

We removed definitely_true / definitely_false and associated APIs, replacing them with
guard_or_true / guard_or_false, which offer similar functionality and can be used to
achieve the same effect. Please migrate to the latter.

Version 2.7.0

from torch.fx.experimental.symbolic_shapes import definitely_false, definitely_true

...
if definitely_true(x):
    ...
if definitely_false(y):
    ...

Version 2.8.0

from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true

...
if guard_or_false(x):
    ...
# alternatively: if guard_or_false(torch.sym_not(y))
if not guard_or_true(y):
    ...

torch.export

torch.export.export_for_inference has been removed in favor of torch.export.export_for_training().run_decompositions() (#149078)

Version 2.7.0

import torch

...
exported_program = torch.export.export_for_inference(mod, args, kwargs)

Version 2.8.0

import torch

...
exported_program = torch.export.export_for_training(mod, args, kwargs).run_decompositions(decomp_table=decomp_table)

Switched default to strict=False in torch.export.export and export_for_training (#148790, #150941)

This differs from the previous release default of strict=True. To revert to the old default
behavior, please explicitly pass strict=True.

Version 2.7.0

import torch

# default behavior is strict=True
torch.export.export(...)
torch.export.export_for_training(...)

Version 2.8.0

import torch

# strict=True must be explicitly passed to get the old behavior
torch.export.export(..., strict=True)
torch.export.export_for_training(..., strict=True)

ONNX

Default opset in torch.onnx.export is now 18 (#156023)

When dynamo=False, th...


PyTorch 2.7.1 Release, bug fix release

04 Jun 18:13
e2d141d

This release is meant to fix the following issues (regressions / silent correctness):

Torch.compile

Fix Excessive cudagraph re-recording for HF LLM models (#152287)
Fix torch.compile on some HuggingFace models (#151154)
Fix crash due to Exception raised inside torch.autocast (#152503)
Improve Error logging in torch.compile (#149831)
Mark mutable custom operators as cacheable in torch.compile (#151194)
Implement a workaround for a graph break with older versions of einops (#153925)
Fix an issue with tensor.view(dtype).copy_(...) (#151598)

Flex Attention

Fix assertion error due to inductor permuting inputs to flex attention (#151959)
Fix performance regression on nanogpt speedrun (#152641)

Distributed

Fix extra CUDA context created by barrier (#149144)
Fix an issue related to Distributed Fused Adam in Rocm/APEX when using nccl_ub feature (#150010)
Add a workaround for a random hang in non-blocking API mode in NCCL 2.26 (#154055)

MacOS

Fix MacOS compilation error with Clang 17 (#151316)
Fix binary kernels producing incorrect results when one of the tensor arguments is a wrapped scalar on MPS devices (#152997)

Other

Reduce the PyTorch wheel size increase caused by the introduction of 128-bit vectorization (#148320) (#152396)
Fix fmsub function definition (#152075)
Fix Floating point exception in torch.mkldnn_max_pool2d (#151848)
Fix abnormal inference output with XPU:1 device (#153067)
Fix Illegal Instruction Caused by grid_sample on Windows (#152613)
Fix ONNX decomposition does not preserve custom CompositeImplicitAutograd ops (#151826)
Fix error with dynamic linking of libgomp library (#150084)
Fix segfault in profiler with Python 3.13 (#153848)


PyTorch 2.7.0 Release

23 Apr 16:16
1341794

PyTorch 2.7.0 Release Notes

Highlights

Beta
  • Torch.Compile support for Torch Function Modes
  • Mega Cache

Prototype
  • NVIDIA Blackwell Architecture Support
  • PyTorch Native Context Parallel
  • Enhancing Intel GPU Acceleration
  • FlexAttention LLM first token processing on X86 CPUs
  • FlexAttention LLM throughput mode optimization on X86 CPUs
  • Foreach Map
  • Flex Attention for Inference
  • Prologue Fusion Support in Inductor

For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.

Tracked Regressions

NCCL init hits CUDA failure 'invalid argument' on 12.2 driver

Some users with 12.2 CUDA driver (535 version) report seeing "CUDA driver error: invalid argument" during NCCL or Symmetric Memory initialization. This issue is currently under investigation, see #150852. If you build PyTorch from source, a known workaround is to rebuild PyTorch with the CUDA 12.2 toolkit. Otherwise, you can try upgrading the CUDA driver on your system.

Backwards Incompatible Changes

Dropped support for Triton < 2.2.0. Removed support for CUDA 12.4 and Anaconda in CI/CD.

C++ Extensions with py_limited_api=True are now built with -DPy_LIMITED_API (#145764)

We formally began respecting the py_limited_api=True kwarg in 2.6 and stopped linking libtorch_python.so when the flag was specified, as libtorch_python.so does not guarantee using APIs from the stable Python limited API. In 2.7, we go further by specifying the -DPy_LIMITED_API flag, which will enforce that the extension is buildable with the limited API. As a result of this enforcement, custom extensions that set py_limited_api=True but do not abide by the limited API may fail to build. For an example, see #152243.

This is strictly better behavior, as it is sketchy to claim CPython agnosticism without enforcing it with the flag. If you run into this issue, please ensure that the extension you are building does not use any APIs which are outside of the Python limited API, e.g., pybind.
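
A minimal setup.py sketch for a limited-API extension; the extension name, source file, and wheel tag below are hypothetical placeholders:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_ext",
    ext_modules=[
        # my_ext.cpp is assumed to only use the CPython limited API
        CppExtension("my_ext", ["my_ext.cpp"], py_limited_api=True),
    ],
    cmdclass={"build_ext": BuildExtension},
    # tag the wheel as limited-API so packaging tools record it correctly
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)
```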

Change torch.Tensor.new_tensor() to be on the given Tensor's device by default (#144958)

This function previously always created the new Tensor on the "cpu" device and will now use the same device as the current Tensor object. This behavior is now consistent with other .new_* methods.
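
For example (a minimal sketch, assuming a CUDA device is available):

```python
import torch

t = torch.ones(2, device="cuda")

t.new_tensor([1, 2])                # 2.7: created on t's device ("cuda"); before: always "cpu"
t.new_tensor([1, 2], device="cpu")  # pass device explicitly to keep the old behavior
```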

Use Manylinux 2.28 and CXX11_ABI=1 for future released Linux wheel builds.

With the migration to manylinux_2_28 (AlmaLinux 8 based), we can no longer support OS distros with glibc 2.26. These include the popular Amazon Linux 2 and CentOS 7. (#143423, #146200, #148028, #148135, #148195, #148129)

torch.onnx.dynamo_export now uses the ExportedProgram logic path (#137296)

Users using the torch.onnx.dynamo_export API may see some ExportOptions become
unsupported due to an internal switch to use torch.onnx.export(..., dynamo=True): diagnostic_options, fake_context and onnx_registry are removed/ignored by ExportOptions. Only dynamic_shapes is retained.

Users should move to use the dynamo=True option on torch.onnx.export as
torch.onnx.dynamo_export is now deprecated. Leverage the dynamic_shapes argument in torch.onnx.export for specifying dynamic shapes on the model.

Version 2.6.0

torch.onnx.dynamo_export(model, *args, **kwargs)

Version 2.7.0

torch.onnx.export(model, args, kwargs=kwargs, dynamo=True)

Finish deprecation of LRScheduler.print_lr() along with the verbose kwarg to the LRScheduler constructor. (#147301)

Both APIs have been deprecated since 2.2. Please use LRScheduler.get_last_lr() to access the learning rate instead. print_lr and verbose were confusing, not properly documented and were little used, as described in #99270, so we deprecated them in 2.2. Now, we complete the deprecation by removing them completely. To access and print the learning rate of a LRScheduler:

Version 2.6.0

optim = ...
lrsched = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, verbose=True)
# lrsched will internally call print_lr() and print the learning rate

Version 2.7.0

optim = ...
lrsched = torch.optim.lr_scheduler.ReduceLROnPlateau(optim)
print(lrsched.get_last_lr())

libtorch_python.so symbols are now invisible by default on all platforms except Apple (#142214)

Previously, the symbols in libtorch_python.so were exposed with default visibility. We have transitioned to being more intentional about what we expose as public symbols for our python API in C++. After #142214, public symbols will be marked explicitly while everything else will be hidden. Some extensions using private symbols will see linker failures with this change.

Please use torch.export.export instead of capture_pre_autograd_graph to export the model for pytorch 2 export quantization (#139505)

capture_pre_autograd_graph was a temporary API in torch.export. Now that the better long-term API, export, is available, we can deprecate it.

Version 2.6.0

from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = capture_pre_autograd_graph(m, *example_inputs)
m = prepare_pt2e(m, quantizer)

Version 2.7.0

from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
# please get xnnpack quantizer from executorch (https://github.com/pytorch/executorch/)
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = export(m, *example_inputs)
m = prepare_pt2e(m, quantizer)

New interface for torch.fx.passes.graph_transform_observer.GraphTransformObserver to enable node-level provenance tracking (#144277)

We now track a mapping between the nodes in the pre-grad and post-grad graph. See the issue for an example frontend to visualize the transformations. To update your GraphTransformObserver subclasses, instead of overriding on_node_creation and on_node_erase, there are new functions get_node_creation_hook, get_node_erase_hook, get_node_replace_hook and get_deepcopy_hook. These are registered on the GraphModule member of the GraphTransformObserver upon entry and exit of a with block.

Version 2.6.0

class MyPrintObserver(GraphTransformObserver):
    def on_node_creation(self, node: torch.fx.Node):
        print(node)

Version 2.7.0

class MyPrintObserver(GraphTransformObserver):
    def get_node_creation_hook(self):
        def hook(node: torch.fx.Node):
            print(node)
        return hook

torch.ao.quantization.pt2e.graph_utils.get_control_flow_submodules is no longer public (#141612)

We are planning to make all functions under torch.ao.quantization.pt2e.graph_utils private. This update marks get_control_flow_submodules as a private API. If you have to or want to continue using get_control_flow_submodules, please make a private call by using _get_control_flow_submodules.

Example:
Version 2.6:

>>> from torch.ao.quantization.pt2e.graph_utils import get_control_flow_submodules

Version 2.7:

>>> from torch.ao.quantization.pt2e.graph_utils import get_control_flow_submodules
ImportError: cannot import name 'get_control_flow_submodules' from 'torch.ao.quantization.pt2e.graph_utils'
>>> from torch.ao.quantization.pt2e.graph_utils import _get_control_flow_submodules  # Note: Use _get_control_flow_submodules for private access

Deprecations

torch.onnx.dynamo_export is deprecated (#146425, #146639, #146923)

Users should use the dynamo=True option on torch.onnx.export.

Version 2.6.0

torch.onnx.dynamo_export(model, *args, **kwargs)

Version 2.7.0

torch.onnx.export(model, args, kwargs=kwargs, dynamo=True)

XNNPACKQuantizer is deprecated in PyTorch and moved to ExecuTorch; please use it from executorch.backends.xnnpack.quantizer.xnnpack_quantizer instead of torch.ao.quantization.quantizer.xnnpack_quantizer. (#144940)

XNNPACKQuantizer is a quantizer for xnnpack that was added into pytorch/pytorch for initial development. Ho...


PyTorch 2.6.0 Release

29 Jan 17:18
1eba9b3

  • Highlights
  • Tracked Regressions
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation
  • Developers

Highlights

We are excited to announce the release of PyTorch® 2.6 (release notes)! This release features multiple improvements for PT2: torch.compile can now be used with Python 3.13; new performance-related knob torch.compiler.set_stance; several AOTInductor enhancements. Besides the PT2 improvements, another highlight is FP16 support on X86 CPUs.

NOTE: Starting with this release we are not going to publish on Conda, please see [Announcement] Deprecating PyTorch's official Anaconda channel for the details.

For this release the experimental Linux binaries shipped with CUDA 12.6.3 (as well as Linux Aarch64, Linux ROCm 6.2.4, and Linux XPU binaries) are built with CXX11_ABI=1 and are using the Manylinux 2.28 build platform. If you build PyTorch extensions with custom C++ or CUDA extensions, please update these builds to use CXX11_ABI=1 as well and report any issues you are seeing. For the next PyTorch 2.7 release we plan to switch all Linux builds to Manylinux 2.28 and CXX11_ABI=1, please see [RFC] PyTorch next wheel build platform: manylinux-2.28 for the details and discussion.

Also in this release, as an important security improvement measure, we have changed the default value for the weights_only parameter of torch.load. This is a backward compatibility-breaking change, please see this forum post for more details.
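
For example, a minimal sketch of the new default and the explicit opt-out for trusted checkpoints:

```python
import torch

torch.save({"w": torch.ones(3)}, "ckpt.pt")

# 2.6 default: weights_only=True, so only tensors and plain containers are unpickled
state = torch.load("ckpt.pt")

# for checkpoints from trusted sources that pickle arbitrary objects, opt out explicitly
state = torch.load("ckpt.pt", weights_only=False)
```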

This release is composed of 3892 commits from 520 contributors since PyTorch 2.5. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve PyTorch. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta
  • torch.compiler.set_stance
  • torch.library.triton_op
  • torch.compile support for Python 3.13
  • New packaging APIs for AOTInductor
  • AOTInductor: minifier
  • AOTInductor: ABI-compatible mode code generation
  • FP16 support for X86 CPUs

Prototype
  • Improved PyTorch user experience on Intel GPUs
  • FlexAttention support on X86 CPU for LLMs
  • Dim.AUTO
  • CUTLASS and CK GEMM/CONV Backends for AOTInductor
*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] torch.compiler.set_stance

This feature enables the user to specify different behaviors ("stances") that torch.compile can take between different invocations of compiled functions. One of the stances, for example, is "eager_on_recompile", which instructs PyTorch to run code eagerly when a recompile is necessary, reusing cached compiled code when possible.

For more information please refer to the set_stance documentation and the Dynamic Compilation Control with torch.compiler.set_stance tutorial.
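
A minimal sketch of switching stances between calls of a compiled function (the toy function is hypothetical):

```python
import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))          # first call compiles

# run eagerly instead of recompiling when the existing compiled code does not
# apply, reusing cached compiled code when possible
torch.compiler.set_stance("eager_on_recompile")
f(torch.randn(4))          # reuses the cached compiled code
f(torch.randn(4, 4))       # would normally trigger a recompile; runs eagerly instead
```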

[Beta] torch.library.triton_op

torch.library.triton_op offers a standard way of creating custom operators that are backed by user-defined triton kernels.

When users turn user-defined triton kernels into custom operators, torch.library.triton_op allows torch.compile to peek into the implementation, enabling torch.compile to optimize the triton kernel inside it.

For more information please refer to the triton_op documentation and the Using User-Defined Triton Kernels with torch.compile tutorial.

[Beta] torch.compile support for Python 3.13

torch.compile previously only supported Python up to version 3.12. Users can now optimize models with torch.compile in Python 3.13.

[Beta] New packaging APIs for AOTInductor

A new package format, “PT2 archive”, has been introduced. This essentially contains a zipfile of all the files that need to be used by AOTInductor, and allows users to send everything needed to other environments. There is also functionality to package multiple models into one artifact, and to store additional metadata inside of the package.

For more details please see the updated torch.export AOTInductor Tutorial for Python runtime.

[Beta] AOTInductor: minifier

If a user encounters an error while using AOTInductor APIs, AOTInductor Minifier allows creation of a minimal nn.Module that reproduces the error.

For more information please see the AOTInductor Minifier documentation.

[Beta] AOTInductor: ABI-compatible mode code generation

AOTInductor-generated model code has a dependency on PyTorch C++ libraries. As PyTorch evolves quickly, it's important to make sure previously AOTInductor-compiled models can continue to run on newer PyTorch versions, i.e. AOTInductor is backward compatible.

In order to guarantee application binary interface (ABI) backward compatibility, we have carefully defined a set of stable C interfaces in libtorch and make sure AOTInductor generates code that only refers to the specific set of APIs and nothing else in libtorch. We will keep the set of C APIs stable across Pytorch versions and thus provide backward compatibility guarantees for AOTInductor-compiled models.

[Beta] FP16 support for X86 CPUs (both eager and Inductor modes)

Float16 datatype is commonly used for reduced memory usage and faster computation in AI inference and training. CPUs like the recently launched Intel® Xeon® 6 with P-Cores support the Float16 datatype with the native accelerator AMX. Float16 support on X86 CPUs was introduced in PyTorch 2.5 as a prototype feature, and now it has been further improved for both eager mode and Torch.compile + Inductor mode, making it a Beta-level feature with both functionality and performance verified with a broad scope of workloads.

PROTOTYPE FEATURES

[Prototype] Improved PyTorch user experience on Intel GPUs

PyTorch user experience on Intel GPUs is further improved with simplified installation steps, Windows release binary distribution and expanded coverage of supported GPU models including the latest Intel® Arc™ B-Series discrete graphics. Application developers and researchers seeking to fine-tune, inference and develop with PyTorch models on Intel® Core™ Ultra AI PCs and Intel® Arc™ discrete graphics will now be able to directly install PyTorch with binary releases for Windows, Linux and Windows Subsystem for Linux 2.

  • Simplified Intel GPU software stack setup to enable one-click installation of the torch-xpu PIP wheels to run deep learning workloads in an out of the box fashion, eliminating the complexity of installing and activating Intel GPU development software bundles.
  • Windows binary releases for torch core, torchvision and torchaudio have been made available for Intel GPUs, and the supported GPU models have been expanded from Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics and Intel® Arc™ A-Series Graphics to the latest GPU hardware Intel® Arc™ B-Series graphics.
  • Further enhanced coverage of Aten operators on Intel GPUs with SYCL* kernels for smooth eager mode execution, as well as bug fixes and performance optimizations for torch.compile on Intel GPUs.

For more information regarding Intel GPU support, please refer to the Getting Started Guide.

[Prototype] FlexAttention support on X86 CPU for LLMs

FlexAttention was initially introduced in PyTorch 2.5 to provide optimized implementations for Attention variants with a flexible API. In PyTorch 2.6, X86 CPU support for FlexAttention was added through TorchInductor CPP backend. This new feature leverages and extends current CPP template abilities to support...


PyTorch 2.5.1: bug fix release

29 Oct 17:58
a8d6afb

This release is meant to fix the following regressions:

  • Wheels from PyPI are unusable out of the box on RPM-based Linux distributions: #138324
  • PyPI arm64 distribution logs cpuinfo error on import: #138333
  • Crash When Using torch.compile with Math scaled_dot_product_attention in AMP Mode: #133974
  • [MPS] Internal crash due to the invalid buffer size computation if sliced API is used: #137800
  • Several issues related to CuDNN Attention: #138522

Besides the regression fixes, the release includes several documentation updates.

See release tracker #132400 for additional information.


PyTorch 2.5.0 Release, SDPA CuDNN backend, Flex Attention

17 Oct 16:26
32f585d

PyTorch 2.5 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation
  • Developers
  • Security

Highlights

We are excited to announce the release of PyTorch® 2.5! This release features a new CuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. As well, regional compilation of torch.compile offers a way to reduce the cold start up time for torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Finally, TorchInductor CPP backend offers solid performance speedup with numerous enhancements like FP16 support, CPP wrapper, AOT-Inductor mode, and max-autotune mode.
This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
As well, please check out our new ecosystem projects releases with TorchRec and TorchFix.

Beta
  • CuDNN backend for SDPA
  • torch.compile regional compilation without recompilations
  • TorchDynamo added support for exception handling & MutableMapping types
  • TorchInductor CPU backend optimization

Prototype
  • FlexAttention
  • Compiled Autograd
  • Flight Recorder
  • Max-autotune Support on CPU with GEMM Template
  • TorchInductor on Windows
  • FP16 support on CPU path for both eager mode and TorchInductor CPP backend
  • Autoload Device Extension
  • Enhanced Intel GPU support

*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] CuDNN backend for SDPA

The cuDNN "Fused Flash Attention" backend was landed for torch.nn.functional.scaled_dot_product_attention. On NVIDIA H100 GPUs this can provide up to 75% speed-up over FlashAttentionV2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
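
For example, a minimal sketch that restricts SDPA to the cuDNN backend for a region of code (assuming a recent NVIDIA GPU and float16 inputs):

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# restrict SDPA to the cuDNN fused attention backend within this block
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```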

[Beta] torch.compile regional compilation without recompilations

Regional compilation without recompilations, via torch._dynamo.config.inline_inbuilt_nn_modules, which defaults to True in 2.5+. This option allows users to compile a repeated nn.Module (e.g. a transformer layer in an LLM) without recompilations. Compared to compiling the full model, this option can result in smaller compilation latencies, with 1%-5% performance degradation compared to full model compilation.

See the tutorial for more information.
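
A minimal sketch of regional compilation; Layer and Model below are hypothetical stand-ins for a repeated block such as a transformer layer:

```python
import torch
import torch.nn as nn

class Layer(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.lin(x))

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(Layer() for _ in range(12))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)  # every call reuses the single compiled region
        return x

m = Model()
for layer in m.layers:
    layer.compile()  # compile the repeated block instead of the full model

out = m(torch.randn(4, 64))
```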

[Beta] TorchInductor CPU backend optimization

This feature advances Inductor’s CPU backend optimization, including CPP backend code generation and FX fusions with customized CPU kernels. The Inductor CPU backend supports vectorization of common data types and all Inductor IR operations, along with the static and symbolic shapes. It is compatible with both Linux and Windows OS and supports the default Python wrapper, the CPP wrapper, and AOT-Inductor mode.

Additionally, it extends the max-autotune mode of the GEMM template (prototyped in 2.5), offering further performance gains. The backend supports various FX fusions, lowering to customized kernels such as oneDNN for Linear/Conv operations and SDPA. The Inductor CPU backend consistently achieves performance speedups across three benchmark suites—TorchBench, Hugging Face, and timms—outperforming eager mode in 97.5% of the 193 models tested.

PROTOTYPE FEATURES

[Prototype] FlexAttention

We've introduced a flexible API that enables implementing various attention mechanisms such as Sliding Window, Causal Mask, and PrefixLM with just a few lines of idiomatic PyTorch code. This API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. Additionally, we automatically generate the backwards pass using PyTorch's autograd machinery. Furthermore, our API can take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.

For more information and examples, please refer to the official blog post and Attention Gym.
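
For example, a minimal sketch of a causal mask expressed as a score_mod and compiled into a fused kernel (assuming a CUDA device):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # keep the score where the query position can attend to the key position
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

q, k, v = (torch.randn(1, 4, 256, 64, device="cuda") for _ in range(3))
out = torch.compile(flex_attention)(q, k, v, score_mod=causal)
```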

[Prototype] Compiled Autograd

Compiled Autograd is an extension to the PT2 stack allowing the capture of the entire backward pass. Unlike the backward graph traced by AOT dispatcher, Compiled Autograd tracing is deferred until backward execution time, which makes it impervious to forward pass graph breaks, and allows it to record backward hooks into the graph.

Please refer to the tutorial for more information.

[Prototype] Flight Recorder

Flight recorder is a new debugging tool that helps debug stuck jobs. The tool works by continuously capturing information about collectives as they run. Upon detecting a stuck job, the information can be used to quickly identify misbehaving ranks/machines along with code stack traces.

For more information please refer to the following tutorial.

[Prototype] Max-autotune Support on CPU with GEMM Template

Max-autotune mode for the Inductor CPU backend in torch.compile profiles multiple implementations of operations at compile time and selects the best-performing one. This is particularly beneficial for GEMM-related operations, using a C++ template-based GEMM implementation as an alternative to the ATen-based approach with oneDNN and MKL libraries. We support FP32, BF16, FP16, and INT8 with epilogue fusions for x86 CPUs. We’ve seen up to 7% geomean speedup on the dynamo benchmark suites and up to 20% boost in next-token latency for LLM inference.

For more information please refer to the tutorial.
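
For example, a minimal sketch that opts a GEMM-heavy function into max-autotune:

```python
import torch

def f(a, b):
    # GEMM-heavy function; max-autotune lets Inductor profile candidate
    # implementations at compile time and pick the fastest one
    return torch.relu(a @ b)

cf = torch.compile(f, mode="max-autotune")
out = cf(torch.randn(128, 256), torch.randn(256, 64))
```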

[Prototype] TorchInductor CPU on Windows

Inductor CPU backend in torch.compile now works on Windows. We support MSVC (cl), clang (clang-cl) and Intel compiler (icx-cl) for Windows inductor currently.

See the tutorial for more details.

[Prototype] FP16 support on CPU path for both eager mode and TorchInductor CPP backend

Float16 is a commonly used reduced floating point type for performance improvement in neural network inference/training. Since this release, float16 for both eager and TorchInductor is supported on the CPU path.

[Prototype] Autoload Device Extension

PyTorch now supports autoloading for out-of-tree device extensions, streamlining integration by eliminating the need for manual imports. This feature, enabled through the torch.backends entrypoint, simplifies usage by ensuring seamless extension loading, while allowing users to disable it via an environment variable if needed.

See the tutorial for more information.

[Prototype] Enhanced Intel GPU support

Intel GPU support enhancement is now available for both Intel® Data Center GPU Max Series and Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics and Intel® Arc™ Graphics for dGPU parts), making it easier to accelerate your Machine Learning workflows on Intel GPUs in the PyTorch 2.5 release. We also enabled the initial support of PyTorch on Windows for Intel® Client GPUs in this release.

  • Expanded PyTorch hardware backend support matrix to include both Intel Data Center and Client GPUs.  
  • The implementation of SYCL* kernels to enhance coverage and execution of Aten operators on Intel GPUs to boost performance in PyTorch eager mode.
  • Enhanced Intel GPU backend of torch.compile to improve inference and training performance for a wide range of deep learning workloads.

These features are available through PyTorch preview and nightly binary PIP wheels. For more information regarding Intel GPU support, please refer to the documentation.

Backwards Incompatible changes

Distributed

  • [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)

    • We released dispatchable collectives in 2.0, and Backend Options are now used for backend initialization, so the ProcessGroup options are no longer needed.
    • In 2.4 and before, users can do:
    # Users can pass in a basic option when creating an instance of ProcessGroup
    base_pg_options = ProcessGroup.Options(backend=str(backend))
    base_pg_options._timeout = timeout
    pg: ProcessGroup = ProcessGroup(store, rank, group_size, base_pg_options)

    # Users then need to create a backend option to create the comm backend (e.g., ProcessGroupNCCL)
    pg_options = ProcessGroupNCCL.Options()
    backend = ProcessGroupNCCL(store, rank, group_size, pg_options)
    • But from 2.5 onwards, users don’t need to pass in an option to create an instance of ProcessGroup, and users can still set a default backend for the pg since the code still tries to get the default backend:
    # No basic option is passed in when creating an instance of ProcessGroup
    pg: ProcessGroup = ProcessGroup(store, rank, group_size)
    pg._set_default_backend(...

PyTorch 2.4.1 Release, bug fix release

04 Sep 19:59
ee1b680

This release is meant to fix the following issues (regressions / silent correctness):

Breaking Changes:

  • The pytorch/pytorch docker image now installs the PyTorch package through pip and has switched its conda installation from Miniconda to Miniforge (#134274)

Windows:

  • Fix performance regression on Windows related to MKL static linking (#130619) (#130697)
  • Fix error during loading on Windows: [WinError 126] The specified module could not be found. (#131662) (#130697)

MPS:

  • Fix tensor.clamp produces wrong values (#130226)
  • Fix Incorrect result from batch norm with sliced inputs (#133610)

ROCM:

  • Fix for launching kernel invalid config error when calling embedding with large index (#130994)
  • Added a check and a warning when attempting to use hipBLASLt on an unsupported architecture (#128753)
  • Fix image corruption with Memory Efficient Attention when running HuggingFace Diffusers Stable Diffusion 3 pipeline (#133331)

Distributed:

  • Fix FutureWarning when using torch.load internally (#130663)
  • Fix FutureWarning when using torch.cuda.amp.autocast internally (#130660)

Torch.compile:

  • Fix exception with torch compile when onnxruntime-training and deepspeed packages are installed. (#131194)
  • Fix silent incorrectness with torch.library.custom_op with mutable inputs and torch.compile (#133452)
  • Fix SIMD detection on Linux ARM (#129075)
  • Do not use C++20 features in cpu_inductor code (#130816)

Packaging:

  • Fix for exposing statically linked libstdc++ CXX11 ABI symbols (#134494)
  • Fix error while building pytorch from source due to a missing QNNPACK module (#131864)
  • Make PyTorch buildable from source on PowerPC (#129736)
  • Fix XPU extension building (#132847)

Other:

  • Fix warning when using pickle on a nn.Module that contains tensor attributes (#130246)
  • Fix NaNs return in MultiheadAttention when need_weights=False (#130014)
  • Fix nested tensor MHA produces incorrect results (#130196)
  • Fix error when using torch.utils.flop_counter.FlopCounterMode (#134467)

Tracked Regressions:

  • The experimental remote caching feature for Inductor's autotuner (enabled via TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE) is known to still be broken in this release and is actively being worked on in main. The following error is generated: redis.exceptions.DataError: Invalid input of type: 'dict'. Please use nightlies if you need this feature (reported and fixed by PR #134032).

Release tracker #132400 contains all relevant pull requests related to this release as well as links to related issues.


PyTorch 2.4: Python 3.12, AOTInductor freezing, libuv backend for TCPStore

24 Jul 18:39
d990dad

PyTorch 2.4 Release Notes

  • Highlights
  • Tracked Regressions
  • Backward incompatible changes
  • Deprecations
  • New features
  • Improvements
  • Bug Fixes
  • Performance
  • Documentation
  • Developers
  • Security

Highlights

We are excited to announce the release of PyTorch® 2.4!
PyTorch 2.4 adds support for the latest version of Python (3.12) for torch.compile.
AOTInductor freezing gives developers running AOTInductor more performance-based optimizations by allowing the
serialization of MKLDNN weights. As well, a new default TCPStore server backend utilizing libuv has been introduced
which should significantly reduce initialization times for users running large-scale jobs.
Finally, a new Python Custom Operator API makes it easier than before to integrate custom kernels
into PyTorch, especially for torch.compile.

This release is composed of 3661 commits and 475 contributors since PyTorch 2.3. We want to sincerely thank our
dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we
improve 2.4. More information about how to get started with the PyTorch 2-series can be found at our
Getting Started page.

Beta:

  • Python 3.12 support for torch.compile
  • AOTInductor Freezing for CPU
  • New Higher-level Python Custom Operator API
  • Switching TCPStore's default server backend to libuv

Prototype:

  • FSDP2: DTensor-based per-parameter-sharding FSDP
  • torch.distributed.pipelining, simplified pipeline parallelism
  • Intel GPU is available through source build

Performance Improvements:

  • torch.compile optimizations for AWS Graviton (aarch64-linux) processors
  • BF16 symbolic shape optimization in TorchInductor
  • Performance optimizations for GenAI projects utilizing CPU devices

*To see a full list of public feature submissions click here.

Tracked Regressions

Subproc exception with torch.compile and onnxruntime-training

There is a reported issue (#131070) when using torch.compile if the onnxruntime-training library is
installed. The issue will be fixed (#131194) in v2.4.1. It can be worked around locally by setting the environment variable
TORCHINDUCTOR_WORKER_START=fork before executing the script.
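
For example, mirroring the style of the workaround shown below (script.py stands in for your own entry point):

TORCHINDUCTOR_WORKER_START=fork python script.py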

cu118 wheels will not work with pre-cuda12 drivers

It was also reported (#130684) that the new version of Triton uses CUDA features that are not compatible with pre-CUDA 12 drivers.
In this case, the workaround is to set
TRITON_PTXAS_PATH manually as follows (adapt the code according to the local installation path):

TRITON_PTXAS_PATH=/usr/local/lib/python3.10/site-packages/torch/bin/ptxas  python script.py

Backwards Incompatible Changes

Python frontend

Default ThreadPool size to number of physical cores (#125963)

Changed the default number of threads used for intra-op parallelism from the number of logical cores to the number of
physical cores. This should reduce core oversubscription when running CPU workloads and improve performance.
The previous behavior can be recovered by using torch.set_num_threads to set the number of threads to the desired value.
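
A minimal sketch of restoring the previous behavior (using the logical core count reported by Python; adjust the value to whatever you actually want):

import os
import torch

# Illustrative example: set intra-op parallelism back to the logical core count,
# which was the default before this change.
torch.set_num_threads(os.cpu_count())
print(torch.get_num_threads())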

Fix torch.quasirandom.SobolEngine.draw default dtype handling (#126781)

The default dtype value has been changed from torch.float32 to the current default dtype as given by
torch.get_default_dtype() to be consistent with other APIs.
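
A short sketch of the new behavior and of pinning a dtype explicitly (the dimension and sample counts are only illustrative):

import torch

torch.set_default_dtype(torch.float64)
engine = torch.quasirandom.SobolEngine(dimension=2)

samples = engine.draw(4)                          # now float64, following torch.get_default_dtype()
samples32 = engine.draw(4, dtype=torch.float32)   # pass dtype explicitly to keep the old float32 result
print(samples.dtype, samples32.dtype)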

Forbid subclassing torch._C._TensorBase directly (#125558)

This is an internal class that users could previously subclass to create an object that is almost a Tensor in Python, and it was
advertised as such in some tutorials. This is no longer allowed, to improve consistency; all users should
subclass torch.Tensor directly.
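
A minimal sketch of the supported pattern (MyTensor is a made-up name for illustration):

import torch

# Subclass torch.Tensor directly instead of torch._C._TensorBase
class MyTensor(torch.Tensor):
    pass

t = torch.randn(3).as_subclass(MyTensor)   # a regular Tensor viewed as the subclass
print(isinstance(t, torch.Tensor), type(t).__name__)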

Composability

Non-compositional usages of as_strided + mutation under torch.compile will raise an error (#122502)

The torch.compile flow involves functionalizing any mutations inside the region being compiled. torch.as_strided is
an existing view op that can be used non-compositionally: when you call x.as_strided(...), as_strided only
considers the underlying storage size of x and ignores its current size/stride/storage_offset when creating the new view.
This makes it difficult to safely functionalize mutations on views of as_strided that are created non-compositionally,
so we ban them rather than risk silent correctness issues under torch.compile.

An example of a non-compositional usage of as_strided followed by mutation that we will error on is below. You can avoid
this issue by re-writing your usage of as_strided so that it is compositional (for example: either use a different set
of view ops instead of as_strided, or call as_strided directly on the base tensor instead of an existing view of it).

@torch.compile
def foo(a):
    e = a.diagonal()
    # as_strided is being called on an existing view (e),
    # making it non-compositional. mutations to f under torch.compile
    # are not allowed, as we cannot easily functionalize them safely
    f = e.as_strided((2,), (1,), 0)
    f.add_(1.0)
    return a
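
One possible compositional rewrite of the snippet above, calling as_strided on the base tensor rather than on the diagonal view (a sketch of one of the suggested fixes, not the only option):

@torch.compile
def foo(a):
    # as_strided already operates on a's underlying storage, so calling it
    # directly on the base tensor produces the same view compositionally
    f = a.as_strided((2,), (1,), 0)
    f.add_(1.0)
    return a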

We now verify schemas of custom ops at registration time (#124520)

Previously, you could register a custom op through the operator registration APIs, but give it a schema that contained
types unknown to the PyTorch Dispatcher. This behavior came from TorchScript, where “unknown” types were implicitly
treated by the TorchScript interpreter as type variables. However, calling such a custom op through regular PyTorch
would result in an error later. As of 2.4, we raise an error at registration time, when you first register the
custom operator. You can get the old behavior back by constructing the schema with allow_typevars=true.

TORCH_LIBRARY(my_ns, m) {
  // this now raises an error at registration time: bar/baz are unknown types
  m.def("my_ns::foo(bar t) -> baz");
  // you can get back the old behavior with the below flag
  m.def(torch::schema("my_ns::foo(bar t) -> baz", /*allow_typevars*/ true));
}

Autograd frontend

Delete torch.autograd.function.traceable APIs (#122817)

The torch.autograd.function.traceable(...) API, which sets the is_traceable class attribute
on a torch.autograd.Function class, was deprecated in 2.3 and has now been deleted.
This API did not do anything and was only meant for internal purposes.
The following raised a warning in 2.3, and now errors because the API has been deleted:

@torch.autograd.function.traceable
class Func(torch.autograd.Function):
    ...
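
Since the decorator was a no-op, the migration is simply to drop it; a minimal sketch (the doubling forward/backward is made up for illustration):

import torch

class Func(torch.autograd.Function):   # no decorator needed in 2.4+
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * 2

y = Func.apply(torch.randn(3, requires_grad=True))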

Release engineering

  • Remove caffe2 db and distributed from build system (#125092)

Optim

  • Remove SparseAdam's weird allowance of raw Tensor input (#127081).

Distributed

DeviceMesh

Update get_group and add get_all_groups (#128097)
In 2.3 and before, users could do:

mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
mesh_2d.get_group()  # This will return all sub-pgs within the mesh
assert mesh_2d.get_group()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_group()[1] == mesh_2d.get_group(1)

But from 2.4 forward, if users call get_group without passing in the dim, they will get a RuntimeError.
Instead, they should use get_all_groups:

mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
mesh_2d.get_group()  # This will throw a RuntimeError
assert mesh_2d.get_all_groups()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_all_groups()[1] == mesh_2d.get_group(1)

Pipelining

Retire torch.distributed.pipeline (#127354)
In 2.3 and before, users could do:

import torch.distributed.pipeline  # warning saying that this will be removed and users need to migrate to torch.distributed.pipelining

But from 2.4 forward, the code above will raise a ModuleNotFoundError.
Instead, users should use torch.distributed.pipelining:

import torch.distributed.pipeline   # -> ModuleNotFoundError
import torch.distributed.pipelining

jit

  • Fix serialization/deepcopy behavior for tensors that are aliasing but not equal (#126126)

Fx

Complete revamp of float/promotion sympy handling (#126905)

ONNX

  • Remove caffe2 contrib and experiments (#125038)

Deprecations

Python frontend

  • User warning when using torch.load with the default weights_only=False value (#129239, #129396, #129509).
    A warning is now raised if the weights_only value is not specified during a call to torch.load, encouraging users to
    adopt the safest practice when loading weights (see the sketch after this list).
  • Deprecate device-specific autocast API (#126062)
    All the autocast APIs are unified under torch.amp, and it can be used as a drop-in replacement for torch.{device}.amp APIs
    (passing a device argument where applicable), as also shown in the sketch after this list.
  • Export torch.newaxis=None for Python Array API/Numpy consistency (#125026)
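
A minimal sketch of the recommended patterns from the two items above ("checkpoint.pt" is a placeholder path):

import torch

# torch.load: pass weights_only explicitly to silence the warning and load safely
state_dict = torch.load("checkpoint.pt", weights_only=True)

# autocast: the unified torch.amp API takes the device type as an argument
with torch.amp.autocast("cuda"):   # replaces torch.cuda.amp.autocast()
    pass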

Composability

  • Deprecate calling FakeTensor.data_ptr in eager-mode. FakeTensors are tensors without a valid data pointer, so in
    general their data pointer is not safe to access. This makes it easier for torch.compile to provide a nice error
    message when tracing custom ops into a graph that are not written in a PT2-friendly way (bec...