[user triton] AOT inductor support for device-side TMA #157241


Merged

Conversation

@pytorchbot (Collaborator)
Stack from ghstack (oldest at bottom):

Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`

Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.
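As a rough mental model of the sizing involved, here is a pure-Python sketch (the helper names are illustrative, not Inductor APIs; the 128-byte figure is the size of a CUDA TMA descriptor, and the alignment choice is an assumption): the host must size the global scratch buffer for the entire launch grid, and each program instance writes its descriptor into its own slice.

```python
# Hypothetical sketch of global-scratch sizing for device-side TMA.
# Each program instance constructs its TMA descriptor in a private
# slice of one preallocated buffer passed to the kernel.

TMA_DESCRIPTOR_BYTES = 128  # a CUDA TMA descriptor (CUtensorMap) is 128 bytes


def global_scratch_size(grid_size: int,
                        per_program_bytes: int = TMA_DESCRIPTOR_BYTES,
                        alignment: int = 128) -> int:
    """Total scratch the host must allocate before launching the kernel."""
    # Round each program's slice up to the alignment boundary.
    per_program = -(-per_program_bytes // alignment) * alignment
    return grid_size * per_program


def program_slice(pid: int,
                  per_program_bytes: int = TMA_DESCRIPTOR_BYTES) -> tuple[int, int]:
    """Byte range that program instance `pid` may use for its descriptor."""
    start = pid * per_program_bytes
    return start, start + per_program_bytes
```

With a grid of 4 programs, the host would allocate 512 bytes up front and pass the buffer as an extra kernel argument.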

To support this in AOTI, this PR:

  • records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
  • allocates global scratch, if needed (cuda/device_op_overrides.py)
  • plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device
  • updates tests to verify this works for dynamically shaped inputs

This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined Triton kernels that contain device-side TMA (the latter is the test I ran to verify this works).

Note: this overrides any user-provided allocator function (with eager Triton code, the user must typically provide their own custom allocator function to allocate scratch space).
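For context, the eager-mode hook being bypassed looks roughly like the sketch below. `set_allocator` here is a stand-in for `triton.set_allocator`, which registers a callable receiving a size, an alignment, and a stream; the launch helper and the bytearray buffer are invented for illustration (real code would return a CUDA tensor).

```python
# Stand-in sketch of the eager-Triton allocator hook (illustrative only;
# the real hook is triton.set_allocator and the buffer would live on GPU).

_allocator = None


def set_allocator(fn):
    """Register the callable used to allocate TMA scratch space."""
    global _allocator
    _allocator = fn


def launch_with_device_tma(grid_size, per_program_bytes=128):
    """Hypothetical host-side launch path: obtain scratch from the hook."""
    if _allocator is None:
        raise RuntimeError(
            "device-side TMA requires an allocator for its scratch space"
        )
    scratch = _allocator(grid_size * per_program_bytes, 128, None)
    # ...the kernel would be launched here with `scratch` as an extra arg.
    return scratch


# Eager users register their own allocator like this; AOTI instead
# allocates the scratch itself, bypassing any user-registered hook.
set_allocator(lambda size, alignment, stream: bytearray(size))
```

The design point of the PR is exactly this difference: in the compiled artifact there is no Python process to register a hook, so the generated code must own the allocation.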

For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda`: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

Differential Revision: D77352139

@pytorch-bot (bot) commented on Jun 29, 2025 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157241

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 18eea61 with merge base 3a7ff82:

NEW FAILURE - The following job has failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot (bot) added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Jul 1, 2025
@davidberard98 (Contributor)

@pytorchbot rebase -b release/2.8


@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/release/2.8. Check the current status here.

Pull Request resolved: #155896
Approved by: https://github.com/desertfire
(cherry picked from commit b6c00df)
@pytorchmergebot (Collaborator)

Successfully rebased cherry-pick-155896-by-pytorch_bot_bot_ onto refs/remotes/origin/release/2.8, please pull locally before adding more changes (for example, via `git checkout cherry-pick-155896-by-pytorch_bot_bot_ && git pull --rebase`)

@pytorchmergebot force-pushed the cherry-pick-155896-by-pytorch_bot_bot_ branch from 86aaf93 to 18eea61 on July 7, 2025 19:36
@atalman merged commit 6e08036 into release/2.8 on Jul 11, 2025. 262 of 267 checks passed.
@Camyll (Contributor)

Validated tests for release/2.8

@github-actions (bot) deleted the cherry-pick-155896-by-pytorch_bot_bot_ branch on September 1, 2025 02:19

Reviewers

@malfet approved these changes

@atalman approved these changes

@jingsh approved these changes


8 participants

@pytorchbot @davidberard98 @pytorchmergebot @Camyll @jingsh @malfet @atalman
