Hook StaticCudaLauncher up to torch.compile (cold start) #148890
Conversation
pytorch-bot (bot) commented Mar 10, 2025 (edited)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148890
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 260d3eb with merge base 1bf443e (BROKEN TRUNK): the following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
jamesjwu commented Mar 11, 2025
@jamesjwu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
jansel left a comment (edited)
Looks like there are some new failures (and nice speedups!) in that benchmark run; we should fix the failures.
Can we add support for a larger number of args (falling back to dynamic alloc), and the special stuff like hooks, shared memory, custom ops? I would really like to have only one code path, since I think that will be easier to maintain. I suspect stuff like args > 50 will have poor coverage in our test suite and is likely to get broken without us noticing.
jamesjwu commented Mar 18, 2025
I'm happy to implement these additions so that our coverage is as complete as possible, but I am very worried about trying to ship this without a config flag to turn it off initially or fall back on (especially internally, where rollouts can take weeks). So I think we should still be able to support the fallback to Triton for now. I do agree directionally that we should try to get to the case where all kernels are statically launchable. Re: falling back to dynamic alloc: what do you think about falling back to something like
I fixed a bug just now (when we have a global scratch, MAX_ARGS on the C++ side actually needs to be MAX_ARGS + 1, lol); will kick off a new benchmark run.
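For illustration, a config-based kill switch of the kind discussed here might be used like the sketch below. The flag name `use_static_cuda_launcher` is an assumption made for this example, not something confirmed in this thread.

```python
# Sketch only: `use_static_cuda_launcher` is an assumed flag name for illustration.
import torch
import torch._inductor.config as inductor_config

@torch.compile
def f(x):
    return (x * 2 + 1).relu()

x = torch.randn(1024, device="cuda")  # requires a CUDA device

# With the flag patched off, compiled kernels would take the existing Triton
# launcher path instead of StaticCudaLauncher.
with inductor_config.patch(use_static_cuda_launcher=False):  # assumed flag name
    print(f(x).sum())
```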
jamesjwu commented Mar 18, 2025
I realize there's a big list of features that we technically need to support to reach feature parity with Triton's launcher capabilities, so I wrote them all down in this issue: #149440, and will fix them one by one: #149442. The goal is that with
@jansel I do have a concern, though, with user-defined triton kernels, because they can essentially contain anything. So there's a bunch of features that PT2-generated kernels will never encounter (i.e.
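For context, below is a minimal user-defined Triton kernel of the kind being discussed (the standard vector-add pattern, not taken from this PR). Unlike PT2-generated kernels, user code like this can use any Triton feature.

```python
# A user-defined Triton kernel: the user, not inductor, controls which features it uses.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```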
jamesjwu commented Mar 19, 2025
(For posterity, base revision is f461654 and head revision is be6d18d.) Overall this looks better: no more accuracy failures or failures to run. One model, XGLMForCausalLM, seemed to have regressed runtime perf, but I couldn't repro it locally. I checked various other revisions (i.e. main, other base revisions I've run), and I think this particular base revision might have just gotten a high perf speedup due to variance (2.5x), though I will rerun the head revision one more time to confirm.
jansel commented Mar 20, 2025
This seems fine to me. For user-defined Triton kernels, maybe we just always fall back to the upstream launcher? That way the codepath will at least be well tested.
jamesjwu commented Mar 20, 2025
This seems good to me; I had thought your point was to get to where we don't have to keep both codepaths. But we can make it so only user-defined triton kernels use the old launcher system and everything else uses static launch.
jamesjwu commented Mar 20, 2025
One last round of benchmarks, showing things look good (this is before I flipped the config to default off). Will flip the config to default on once I land #149442 and some other in-progress PRs to reach closer feature parity with Triton's launcher.
jamesjwu commented Mar 20, 2025
@pytorchbot merge
pytorchmergebot commented Mar 20, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
No need to log.info every time someone runs with StaticCudaLauncher disabled.
Test plan: Run any benchmark and see that we don't spam the bypass message in logs.
Pull Request resolved: #149669
Approved by: https://github.com/oulgen, https://github.com/jansel
ghstack dependencies: #148890

Stack from ghstack (oldest at bottom):
This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default.
Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if:
- The kernel is a cuda kernel and inductor can find a cubin file associated with it
- The kernel takes less than 50 arguments
- The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
- The kernel is not user defined (to be supported in a later PR)
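Roughly, that selection reads like the sketch below; the helper and field names (`KernelInfo`, `can_use_static_cuda_launcher`, the 48 KB shared-memory bound) are illustrative assumptions, not inductor's actual internals.

```python
# Hypothetical sketch of the eligibility check described above; all names are illustrative.
from dataclasses import dataclass
from typing import Optional

MAX_STATIC_ARGS = 50  # the static launcher uses a fixed-size argument buffer on the C++ side

@dataclass
class KernelInfo:
    is_cuda: bool
    cubin_path: Optional[str]
    num_args: int
    uses_launch_hooks: bool
    shared_mem_bytes: int
    is_user_defined: bool

def can_use_static_cuda_launcher(k: KernelInfo, max_shared_mem: int = 48 * 1024) -> bool:
    if not (k.is_cuda and k.cubin_path):      # need a cubin file to launch directly
        return False
    if k.num_args >= MAX_STATIC_ARGS:         # too many args for the static buffer
        return False
    if k.uses_launch_hooks or k.shared_mem_bytes > max_shared_mem:
        return False                          # "special features" fall back to Triton's launcher
    return not k.is_user_defined              # user-defined kernels keep the old launcher

# A small PT2-generated kernel qualifies for static launch:
print(can_use_static_cuda_launcher(KernelInfo(True, "/tmp/kernel.cubin", 12, False, 0, False)))  # True
```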
We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.
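The constexpr filtering can be pictured with the toy helper below (names made up for the example): constexprs are already baked into the compiled cubin, so the static launcher's call signature only carries the runtime arguments.

```python
# Toy illustration of dropping constexprs from the generated launcher's signature.
def runtime_args(arg_names, constexpr_names):
    """Keep only the arguments that must be passed at kernel launch time."""
    return [a for a in arg_names if a not in constexpr_names]

# A PT2-style kernel signature: pointers and sizes are runtime args, while
# tuning parameters like XBLOCK are constexprs compiled into the cubin.
print(runtime_args(["in_ptr0", "out_ptr0", "xnumel", "XBLOCK"], {"XBLOCK"}))
# -> ['in_ptr0', 'out_ptr0', 'xnumel']
```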
Some key features of StaticTritonCompileResult:
- It is fully serializable
- It stores the minimum amount of stuff, so that later it can be cached easily
- It does not depend on any triton specific types (though it does have various triton metadata)
For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching means we don't have to pay the cost of removing constexprs at kernel runtime.
Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check compilation/runtime performance with these changes.
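Below is a minimal, self-contained sketch of the "generate a Python launcher and `exec` it" idea; inductor's real codegen (def_args, GridExpr, stream handling) is more involved, and these names are invented for the example.

```python
# Minimal sketch: build launcher source with the grid expression inlined, then exec it.
def build_launcher(launch, runtime_args, grid_expr):
    src = (
        f"def launcher({', '.join(runtime_args)}):\n"
        f"    grid = ({grid_expr},)\n"
        f"    return launch(grid, {', '.join(runtime_args)})\n"
    )
    scope = {"launch": launch}
    exec(src, scope)  # materialize the generated function in `scope`
    return scope["launcher"]

# Example with a fake launch function that just reports what it was given.
launcher = build_launcher(
    launch=lambda grid, xnumel: f"grid={grid}, xnumel={xnumel}",
    runtime_args=["xnumel"],
    grid_expr="(xnumel + 1023) // 1024",
)
print(launcher(4096))  # grid=(4,), xnumel=4096
```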
Fixes #149448
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov