Hook StaticCudaLauncher up to torch.compile (cold start) #148890


Closed
jamesjwu wants to merge 29 commits into gh/jamesjwu/119/base from gh/jamesjwu/119/head

Conversation

@jamesjwu (Contributor) commented Mar 10, 2025 (edited)

Stack from ghstack (oldest at bottom):

This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default.

Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if (a rough sketch of this check appears after the list):

  • The kernel is a CUDA kernel and Inductor can find a cubin file associated with it
  • The kernel takes fewer than 50 arguments
  • The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
  • The kernel is not user-defined (to be supported in a later PR)
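
For concreteness, here is a minimal sketch of what such an eligibility check could look like. The function name, the kernel metadata fields, and the MAX_ARGS constant are hypothetical stand-ins, not the actual Inductor/StaticCudaLauncher code:

```python
# Hypothetical sketch of the eligibility check described in the list above.
# The attribute names on `kernel` are illustrative only.
MAX_ARGS = 50  # assumed fixed argument limit of the static launcher

def can_statically_launch(kernel) -> bool:
    """Return True if a compiled triton kernel may use StaticCudaLauncher."""
    if kernel.device_type != "cuda" or kernel.cubin_path is None:
        return False  # need a CUDA kernel with a cubin file on disk
    if len(kernel.arg_names) >= MAX_ARGS:
        return False  # too many arguments for the fixed-size launcher
    if kernel.launch_enter_hook is not None or kernel.launch_exit_hook is not None:
        return False  # launch hooks are a special feature we don't support
    if kernel.shared_mem_bytes > kernel.max_static_shared_mem_bytes:
        return False  # large shared-memory requests are not supported
    if kernel.is_user_defined:
        return False  # user-defined kernels keep the regular triton launcher
    return True
```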

We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.
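
For illustration, the constexpr filtering could be as simple as the sketch below; the helper and argument names here are made up for this example, and the real def_args/call_args construction lives in the Inductor launcher codegen:

```python
# Hypothetical helper: given the full argument list and the set of constexpr
# parameter names, build the def_args/call_args used when codegen'ing the
# python launcher.  Filtering constexprs up front keeps the launcher body the
# same regardless of which triton version produced the kernel.
def build_launcher_args(arg_names, constexpr_names):
    def_args = [name for name in arg_names if name not in constexpr_names]
    call_args = list(def_args)  # constexprs are baked into the cubin, never passed
    return def_args, call_args

# Example with made-up names: XBLOCK is a constexpr, so it never reaches the launcher.
def_args, call_args = build_launcher_args(
    ["in_ptr0", "out_ptr0", "xnumel", "XBLOCK"], {"XBLOCK"}
)
assert def_args == ["in_ptr0", "out_ptr0", "xnumel"]
```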

Some key features of StaticTritonCompileResult:

  • It is fully serializable
  • It stores the minimum amount of stuff, so that later it can be cached easily
  • It does not depend on any triton specific types (though it does have various triton metadata).

For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching means we don't have to pay the cost of removing constexprs at kernel runtime.
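
As a rough illustration of the exec'd-launcher idea (the generated source and the grid expression below are invented for this example and are not the actual Inductor codegen):

```python
# Invented example of generating and exec'ing a python launcher.  The real
# codegen (and GridExpr) is more involved; this only shows why exec'ing a
# small, pre-built string keeps constexpr filtering off the launch hot path.
launcher_src = """
def launcher(in_ptr0, out_ptr0, xnumel, stream, kernel):
    grid_0 = (xnumel + 1023) // 1024  # grid expression baked in at codegen time
    # `kernel.run` is a stand-in for whatever launch entry point is used.
    kernel.run(in_ptr0, out_ptr0, xnumel, grid=(grid_0, 1, 1), stream=stream)
"""
scope = {}
exec(launcher_src, scope)
launcher = scope["launcher"]  # called at runtime with no per-call arg filtering
```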

Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check compilation/runtime performance with these changes.

Fixes #149448

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

[ghstack-poisoned]
@pytorch-bot (bot) commented Mar 10, 2025 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148890

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 260d3eb with merge base 1bf443e:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jamesjwu added a commit that referenced this pull request Mar 10, 2025
ghstack-source-id: 90d5baf. Pull Request resolved: #148890

@jamesjwu marked this pull request as draft March 10, 2025 16:13

[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Mar 11, 2025
ghstack-source-id: ecb797a. Pull Request resolved: #148890

[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Mar 11, 2025
ghstack-source-id: 6662157. Pull Request resolved: #148890

[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Mar 11, 2025
ghstack-source-id: 4f9fa9d. Pull Request resolved: #148890

[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Mar 11, 2025
ghstack-source-id: c8b0644. Pull Request resolved: #148890

@jamesjwu added the topic: not user facing label Mar 11, 2025

[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Mar 11, 2025
ghstack-source-id: c7aef25. Pull Request resolved: #148890

[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Mar 11, 2025
ghstack-source-id: f7bf8e8. Pull Request resolved: #148890

[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Mar 11, 2025
ghstack-source-id: 22fbcbb. Pull Request resolved: #148890

[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Mar 11, 2025
ghstack-source-id: bc5b873. Pull Request resolved: #148890
@jamesjwu (Contributor, Author)

@jamesjwu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Mar 11, 2025
@jamesjwu changed the title from "WIP launch from pytorch" to "Refactor TritonCompileResult with a new type, support Statically Launched TritonCompileResults" Mar 11, 2025
@jamesjwu (Contributor, Author)

@jamesjwu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

jamesjwu added a commit that referenced this pull request Mar 11, 2025
ghstack-source-id: 1b65af4. Pull Request resolved: #148890
jamesjwu added a commit that referenced this pull request Mar 18, 2025
…ched TritonCompileResults. ghstack-source-id: 355d8da. Pull Request resolved: #148890
@jansel (Contributor) left a comment (edited)


  1. Looks like there are some new failures (and nice speedups!) in that benchmark run; we should fix the failures.

  2. Can we add support for a larger number of args (fall back to dynamic alloc), and the special stuff like hooks, shared memory, custom ops? I would really like to have only one code path, since I think that will be easier to maintain. I suspect stuff like args > 50 will have poor coverage in our test suite and is likely to get broken without us noticing.

@jamesjwu (Contributor, Author)

Can we add support for a larger number of args (fall back to dynamic alloc), and the special stuff like hooks, shared memory, custom ops? I would really like to have only one code path, since I think that will be easier to maintain. I suspect stuff like args > 50 will have poor coverage in our test suite and is likely to get broken without us noticing.

I'm happy to implement these additions so that our coverage is as complete as possible, but I am very worried about trying to ship this without a config flag to turn it off initially or fall back on (especially internally, where rollouts can take weeks). So I think we should still be able to support the fallback to triton for now. I do agree directionally that we should try to get to the case where all kernels are statically launchable.

Re: falling back to dynamic alloc: what do you think about falling back to something like `alloca`, to just allocate it directly on the stack? I know it might not be super portable, but maybe it's possible to support.

I fixed a bug just now (when we have a global scratch, MAX_ARGS on the C++ side actually needs to be MAX_ARGS + 1, lol), and will kick off a new benchmark run.

@jamesjwu (Contributor, Author)

I realize there's a big list of features that we technically need to support to be at "feature parity" with triton's launcher capabilities, so I wrote them all down in this issue: #149440, and will fix them one by one: #149442.

The goal is that with config.static_cuda_launch=True, all triton kernels are statically launched (or else we hard error). This should decrease the complexity of maintaining the code base, because then either all of the CompileResults are triton launched or all of them are statically launched; there's no in-between.
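
A sketch of that "all or nothing" gating, using the flag name from this comment (the helper functions are placeholders, not real Inductor APIs):

```python
# Hypothetical gating: with the flag on, every kernel must be statically
# launchable or we raise, so coverage gaps surface immediately in tests.
STATIC_CUDA_LAUNCH = True  # stands in for config.static_cuda_launch

def launch(kernel, *args):
    if STATIC_CUDA_LAUNCH:
        if not can_statically_launch(kernel):  # eligibility check sketched earlier
            raise RuntimeError(f"{kernel.name} cannot be statically launched")
        return launch_statically(kernel, *args)   # placeholder helper
    return launch_with_triton(kernel, *args)      # placeholder helper
```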

@jansel I do have a concern, though, with user-defined triton kernels, because they can essentially contain anything. There's a bunch of features that PT2-generated kernels will never encounter (e.g. num_ctas > 1, CUTensorMaps, etc.) but that would need to be implemented in the static launcher just to support user-defined ones. We could handle the finite list of these today, but the bigger worry is that until we can fully upstream it to the triton side, there's always a risk of triton supporting a new feature that StaticCudaLauncher can't support, and then user-defined triton kernels would break. What do you think our options are here? Will we need to keep a fallback option around for those?

[ghstack-poisoned]
@jamesjwu (Contributor, Author)

(For posterity, the base revision is f461654 and the head revision is be6d18d.)

Overall it looks better: no more accuracy failures or fail-to-runs. One model, XGLMForCausalLM, seemed to have regressed runtime perf, but I couldn't repro it locally. I checked various other revisions (e.g. main and other base revisions I've run), and I think this particular base revision might have just gotten a high perf speedup due to variance (2.5x), though I will rerun the head revision one more time to confirm.

@jamesjwu requested a review from jansel March 19, 2025 15:31
@jamesjwu (Contributor, Author)

Ugh, there's one more fail_accuracy on timm_efficientnet on torchbench. But I checked main revisions, and it seems to have been failing accuracy until today?

[ghstack-poisoned]
[ghstack-poisoned]
@jansel (Contributor)

Re: falling back to dynamic alloc: what do you think about falling back to something like alloca, to just allocate it directly on the stack? I know it might not be super portable, but maybe it's possible to support.

This seems fine to me.

For user-defined Triton kernels, maybe we just always fall back to the upstream launcher? That way the codepath will at least be well tested.

@jamesjwu (Contributor, Author)

For user-defined Triton kernels, maybe we just always fall back to the upstream launcher? That way the codepath will at least be well tested.

This seems good to me; I had thought your point was to try to get to a place where we don't have to keep both codepaths. But we can make it so that only user-defined triton kernels use the old launcher system and everything else uses static launch.
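
In other words, the choice of compile result could be keyed off a single check, roughly like the sketch below (hypothetical dispatch; only the two class names come from this PR):

```python
# Hypothetical dispatch reflecting the agreement above: user-defined triton
# kernels keep the well-tested upstream launcher, while Inductor-generated
# kernels go through StaticCudaLauncher.
def make_compile_result(kernel, metadata):
    if metadata.is_user_defined:
        return TritonCompileResult(kernel, metadata)       # upstream launch path
    return StaticTritonCompileResult(kernel, metadata)     # static launch path
```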

[ghstack-poisoned]
@jamesjwu (Contributor, Author)

@pytorchbot merge


@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Mar 21, 2025
No need to log.info every time someone runs with StaticCudaLauncher disabled. Test plan: run any benchmark and see that we don't spam the bypass message in logs. Pull Request resolved: #149669. Approved by: https://github.com/oulgen, https://github.com/jansel. ghstack dependencies: #148890

svekars pushed a commit that referenced this pull request Mar 21, 2025
Pull Request resolved: #148890. Approved by: https://github.com/jansel

svekars pushed a commit that referenced this pull request Mar 21, 2025
Pull Request resolved: #149669. Approved by: https://github.com/oulgen, https://github.com/jansel. ghstack dependencies: #148890

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
Pull Request resolved: pytorch#148890. Approved by: https://github.com/jansel

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
Pull Request resolved: pytorch#149669. Approved by: https://github.com/oulgen, https://github.com/jansel. ghstack dependencies: pytorch#148890
@github-actions bot deleted the gh/jamesjwu/119/head branch April 22, 2025 02:15
