[user triton] AOT inductor support for device-side TMA #157241
Conversation
pytorch-bot commented Jun 29, 2025 • edited
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157241
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure — as of commit 18eea61 with merge base 3a7ff82
NEW FAILURE - The following job has failed:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
davidberard98 commented Jul 7, 2025
@pytorchbot rebase -b release/2.8
pytorchmergebot commented Jul 7, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/release/2.8. Check the current status here.
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`

Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes much better with things like autotuning). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (`tl.make_tensor_descriptor`), this is implemented using a "global scratch space": a tensor which is allocated beforehand and then passed in as an argument to the kernel.

To support this in AOTI, this PR:
* records the global scratch space needed (`triton_heuristics.py`), so that it can be used during AOTI codegen
* allocates global scratch, if needed (`cuda/device_op_overrides.py`)
* plumbs `device_idx_` into the Triton caller function, so that global scratch can be allocated on the right device
* updates tests to verify this works for dynamically shaped inputs

This PR should support both Inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined Triton kernels that contain device-side TMA (which is the test that was run to verify this works).

Note: this overrides any user-provided allocator function (typically, with eager Triton code, the user must provide their own custom allocator function that is used to allocate scratch space).

For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda`: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: #155896
Approved by: https://github.com/desertfire
(cherry picked from commit b6c00df)
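The allocator function mentioned in the note above follows Triton's scratch-allocator contract: a callable taking a size in bytes, an alignment, and a stream, and returning a device buffer (in real Triton code it would be registered with `triton.set_allocator` and return a CUDA tensor). As a minimal GPU-free sketch of that contract — `alloc_scratch` is a hypothetical stand-in, using an aligned host byte buffer in place of a `torch.empty(..., device=...)` allocation:

```python
import ctypes

def alloc_scratch(size: int, alignment: int, stream=None):
    """Mock scratch allocator: (size, alignment, stream) -> buffer.

    Over-allocates by `alignment` bytes, then returns the raw buffer
    (to keep it alive) together with the first address inside it that
    is a multiple of `alignment` -- the address a TMA descriptor
    would be written to.
    """
    raw = ctypes.create_string_buffer(size + alignment)
    addr = ctypes.addressof(raw)
    offset = (-addr) % alignment          # bytes to skip up to the boundary
    return raw, addr + offset

# The aligned address satisfies the alignment the kernel requires.
raw, aligned = alloc_scratch(128, alignment=64)
assert aligned % 64 == 0
```

In eager mode the user supplies such a callable themselves; per this PR, AOTI instead allocates the global scratch on the kernel's own device (hence plumbing `device_idx_` through) and overrides any user-provided allocator.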
pytorchmergebot commented Jul 7, 2025
Successfully rebased
Force-pushed 86aaf93 to 18eea61 (Compare)
Merged 6e08036 into release/2.8
Camyll commented Aug 1, 2025
Validated tests for release/2.8