Hook StaticCudaLauncher up to torch.compile (cold start) #148890
Conversation
pytorch-bot (bot) commented Mar 10, 2025 (edited)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148890
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 260d3eb with merge base 1bf443e (BROKEN TRUNK): the following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
jamesjwu commented Mar 11, 2025
@jamesjwu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
jansel left a comment (edited)
Looks like there are some new failures (and nice speedups!) in that benchmark run; we should fix the failures.
Can we add support for a larger number of args (falling back to dynamic alloc), and the special stuff like hooks, shared memory, custom ops? I would really like to have only one code path, since I think that will be easier to maintain. I suspect stuff like args > 50 will have poor coverage in our test suite and is likely to get broken without us noticing.
jamesjwu commented Mar 18, 2025
I'm happy to implement these additions so that our coverage is as complete as possible, but I am very worried about trying to ship this without a config flag to turn it off initially or fall back on (especially internally, where rollouts can take weeks). So I think we should still be able to support the fallback to Triton for now. I do agree directionally that we should try to get to the case where all kernels are statically launchable. Re: falling back to dynamic alloc: what do you think about falling back to something like
I fixed a bug just now (when we have a global scratch, MAX_ARGS on the C++ side actually needs to be MAX_ARGS + 1, lol); will kick off a new benchmark run.
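For illustration, a config-based kill switch of the kind discussed here might be used like the sketch below. The flag name `use_static_cuda_launcher` is an assumption made for this example, not something confirmed in this thread.

```python
# Sketch only: `use_static_cuda_launcher` is an assumed flag name for illustration.
import torch
import torch._inductor.config as inductor_config

@torch.compile
def f(x):
    return (x * 2 + 1).relu()

x = torch.randn(1024, device="cuda")  # requires a CUDA device

# With the flag patched off, compiled kernels would take the existing Triton
# launcher path instead of StaticCudaLauncher.
with inductor_config.patch(use_static_cuda_launcher=False):  # assumed flag name
    print(f(x).sum())
```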
jamesjwu commented Mar 18, 2025
I realize there's a big list of features that we technically need to support to reach feature parity with Triton's launcher capabilities, so I wrote them all down in this issue: #149440, and will fix them one by one: #149442. The goal is that with
@jansel I do have a concern, though, with user-defined triton kernels, because they can essentially contain anything. So there's a bunch of features that PT2-generated kernels will never encounter (i.e.
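For context, below is a minimal user-defined Triton kernel of the kind being discussed (the standard vector-add pattern, not taken from this PR). Unlike PT2-generated kernels, user code like this can use any Triton feature.

```python
# A user-defined Triton kernel: the user, not inductor, controls which features it uses.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```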
jamesjwu commented Mar 19, 2025
(For posterity, base revision is f461654 and head revision is be6d18d.) Overall this looks better: no more accuracy failures or failures to run. One model, XGLMForCausalLM, seemed to have regressed runtime perf, but I couldn't repro it locally. I checked various other revisions (i.e. main, other base revisions I've run), and I think this particular base revision might have just gotten a high perf speedup due to variance (2.5x), though I will rerun the head revision one more time to confirm.
jansel commented Mar 20, 2025
This seems fine to me. For user-defined Triton kernels, maybe we just always fall back to the upstream launcher? That way the codepath will at least be well tested.
jamesjwu commented Mar 20, 2025
This seems good to me; I had thought your point was to get to where we don't have to keep both codepaths. But we can make it so only user-defined triton kernels use the old launcher system and everything else uses static launch.
jamesjwu commented Mar 20, 2025
One last round of benchmarks, showing things look good (this is before I flipped the config to default off). Will flip the config to default on once I land #149442 and some other in-progress PRs to reach closer feature parity with Triton's launcher.
jamesjwu commented Mar 20, 2025
@pytorchbot merge
pytorchmergebot commented Mar 20, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
No need to log.info every time someone runs with StaticCudaLauncher disabled.
Test plan: Run any benchmark and see that we don't spam the bypass message in logs.
Pull Request resolved: #149669
Approved by: https://github.com/oulgen, https://github.com/jansel
ghstack dependencies: #148890

Stack from ghstack (oldest at bottom):
This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default.
Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if:
- The kernel is a cuda kernel and inductor can find a cubin file associated with it
- The kernel takes less than 50 arguments
- The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
- The kernel is not user defined (to be supported in a later PR)
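Roughly, that selection reads like the sketch below; the helper and field names (`KernelInfo`, `can_use_static_cuda_launcher`, the 48 KB shared-memory bound) are illustrative assumptions, not inductor's actual internals.

```python
# Hypothetical sketch of the eligibility check described above; all names are illustrative.
from dataclasses import dataclass
from typing import Optional

MAX_STATIC_ARGS = 50  # the static launcher uses a fixed-size argument buffer on the C++ side

@dataclass
class KernelInfo:
    is_cuda: bool
    cubin_path: Optional[str]
    num_args: int
    uses_launch_hooks: bool
    shared_mem_bytes: int
    is_user_defined: bool

def can_use_static_cuda_launcher(k: KernelInfo, max_shared_mem: int = 48 * 1024) -> bool:
    if not (k.is_cuda and k.cubin_path):      # need a cubin file to launch directly
        return False
    if k.num_args >= MAX_STATIC_ARGS:         # too many args for the static buffer
        return False
    if k.uses_launch_hooks or k.shared_mem_bytes > max_shared_mem:
        return False                          # "special features" fall back to Triton's launcher
    return not k.is_user_defined              # user-defined kernels keep the old launcher

# A small PT2-generated kernel qualifies for static launch:
print(can_use_static_cuda_launcher(KernelInfo(True, "/tmp/kernel.cubin", 12, False, 0, False)))  # True
```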
We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.
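The constexpr filtering can be pictured with the toy helper below (names made up for the example): constexprs are already baked into the compiled cubin, so the static launcher's call signature only carries the runtime arguments.

```python
# Toy illustration of dropping constexprs from the generated launcher's signature.
def runtime_args(arg_names, constexpr_names):
    """Keep only the arguments that must be passed at kernel launch time."""
    return [a for a in arg_names if a not in constexpr_names]

# A PT2-style kernel signature: pointers and sizes are runtime args, while
# tuning parameters like XBLOCK are constexprs compiled into the cubin.
print(runtime_args(["in_ptr0", "out_ptr0", "xnumel", "XBLOCK"], {"XBLOCK"}))
# -> ['in_ptr0', 'out_ptr0', 'xnumel']
```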
Some key features of StaticTritonCompileResult:
- It is fully serializable
- It stores the minimum amount of stuff, so that later it can be cached easily
- It does not depend on any triton specific types (though it does have various triton metadata)
For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching means we don't have to pay the cost of removing constexprs at kernel runtime.
Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check compilation/runtime performance with these changes.
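Below is a minimal, self-contained sketch of the "generate a Python launcher and `exec` it" idea; inductor's real codegen (def_args, GridExpr, stream handling) is more involved, and these names are invented for the example.

```python
# Minimal sketch: build launcher source with the grid expression inlined, then exec it.
def build_launcher(launch, runtime_args, grid_expr):
    src = (
        f"def launcher({', '.join(runtime_args)}):\n"
        f"    grid = ({grid_expr},)\n"
        f"    return launch(grid, {', '.join(runtime_args)})\n"
    )
    scope = {"launch": launch}
    exec(src, scope)  # materialize the generated function in `scope`
    return scope["launcher"]

# Example with a fake launch function that just reports what it was given.
launcher = build_launcher(
    launch=lambda grid, xnumel: f"grid={grid}, xnumel={xnumel}",
    runtime_args=["xnumel"],
    grid_expr="(xnumel + 1023) // 1024",
)
print(launcher(4096))  # grid=(4,), xnumel=4096
```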
Fixes #149448
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov