[user triton] AOT inductor support for device-side TMA #157241
Conversation
pytorch-bot commented Jun 29, 2025 • edited
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157241
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure — as of commit 18eea61 with merge base 3a7ff82
NEW FAILURE - The following job has failed:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
davidberard98 commented Jul 7, 2025
@pytorchbot rebase -b release/2.8
pytorchmergebot commented Jul 7, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/release/2.8. Check the current status here.
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`

Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes much better with things like autotuning). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (`tl.make_tensor_descriptor`), this is implemented using a "global scratch space": a tensor which is allocated beforehand and then passed in as an argument to the kernel.

To support this in AOTI, this PR:
* records the global scratch space needed (`triton_heuristics.py`), so that it can be used during AOTI codegen
* allocates global scratch, if needed (`cuda/device_op_overrides.py`)
* plumbs `device_idx_` into the Triton caller function, so that global scratch can be allocated on the right device
* updates tests to verify this works for dynamically shaped inputs

This PR should support both Inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined Triton kernels that contain device-side TMA (which is the test that was run to verify this works).

Note: this overrides any user-provided allocator function (typically, with eager Triton code, the user must provide their own custom allocator function that is used to allocate scratch space).

For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda`: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: #155896
Approved by: https://github.com/desertfire
(cherry picked from commit b6c00df)
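The allocator function mentioned in the note above follows Triton's scratch-allocator contract: a callable taking a size in bytes, an alignment, and a stream, and returning a device buffer (in real Triton code it would be registered with `triton.set_allocator` and return a CUDA tensor). As a minimal GPU-free sketch of that contract — `alloc_scratch` is a hypothetical stand-in, using an aligned host byte buffer in place of a `torch.empty(..., device=...)` allocation:

```python
import ctypes

def alloc_scratch(size: int, alignment: int, stream=None):
    """Mock scratch allocator: (size, alignment, stream) -> buffer.

    Over-allocates by `alignment` bytes, then returns the raw buffer
    (to keep it alive) together with the first address inside it that
    is a multiple of `alignment` -- the address a TMA descriptor
    would be written to.
    """
    raw = ctypes.create_string_buffer(size + alignment)
    addr = ctypes.addressof(raw)
    offset = (-addr) % alignment          # bytes to skip up to the boundary
    return raw, addr + offset

# The aligned address satisfies the alignment the kernel requires.
raw, aligned = alloc_scratch(128, alignment=64)
assert aligned % 64 == 0
```

In eager mode the user supplies such a callable themselves; per this PR, AOTI instead allocates the global scratch on the kernel's own device (hence plumbing `device_idx_` through) and overrides any user-provided allocator.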
pytorchmergebot commented Jul 7, 2025
Successfully rebased
Force-pushed 86aaf93 to 18eea61 (Compare)
Merged 6e08036 into release/2.8
Camyll commented Aug 1, 2025
Validated tests for release/2.8