[ROCm] Introduce AMD specific inductor gemm tuning #147315


Closed
jataylo wants to merge 7 commits into pytorch:main from jataylo:amd-gemm-retune-pr

Conversation

@jataylo (Collaborator) commented Feb 17, 2025 (edited)

Replaces #143286

Adds ROCm-specific MM configs for max-autotune, incorporating ROCm-specific Triton tuning kernargs such as waves_per_eu, kpack, and matrix_instr_nonkdim. This PR also introduces behavior to allow tuning GROUP_M in the Triton GEMM case.
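For illustration, here is a minimal sketch of what a GEMM autotune config carrying these AMD-specific kernargs can look like in Triton. The helper name and the particular values are hypothetical, not the configs this PR actually ships:

```python
# Minimal sketch, not the PR's implementation: ROCm-specific kernargs
# (waves_per_eu, matrix_instr_nonkdim, kpack) ride along in the kwargs dict
# of a triton.Config next to the usual block sizes, and GROUP_M joins the
# swept dimensions. Helper name and values are illustrative.
import triton


def example_rocm_mm_configs():
    configs = []
    for block_m, block_n, block_k, group_m in [
        (128, 128, 32, 8),
        (64, 128, 64, 4),  # GROUP_M is now a tuned dimension too
    ]:
        for waves_per_eu in (0, 2):
            configs.append(
                triton.Config(
                    {
                        "BLOCK_M": block_m,
                        "BLOCK_N": block_n,
                        "BLOCK_K": block_k,
                        "GROUP_M": group_m,
                        # AMD-specific kernel args consumed by the ROCm backend:
                        "waves_per_eu": waves_per_eu,
                        "matrix_instr_nonkdim": 16,
                        "kpack": 2,
                    },
                    num_warps=8,
                    num_stages=2,
                )
            )
    return configs
```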

Dynamo huggingface inference benchmarks:

```
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON" python huggingface.py --performance --inference --bfloat16 --backend=inductor
```

GEOMEAN speedup (before): 1.35x
GEOMEAN speedup (after): 1.42x

| name | Eager - abs latency | old - abs latency | old - speedup | new - abs latency | new - speedup |
| --- | --- | --- | --- | --- | --- |
| AlbertForMaskedLM | 26.22 | 26.52 | 98.86% | 24.58 | 106.67% |
| AlbertForQuestionAnswering | 25.96 | 26.40 | 98.33% | 24.10 | 107.73% |
| AllenaiLongformerBase | 21.03 | 10.65 | 197.50% | 10.49 | 200.58% |
| BartForCausalLM | 7.77 | 9.76 | 79.63% | 8.79 | 88.46% |
| BartForConditionalGeneration | 14.44 | 12.86 | 112.26% | 11.96 | 120.70% |
| BertForMaskedLM | 8.10 | 8.82 | 91.89% | 8.57 | 94.53% |
| BertForQuestionAnswering | 6.82 | 7.32 | 93.20% | 7.10 | 96.18% |
| BlenderbotForCausalLM | 10.97 | 11.39 | 96.34% | 10.10 | 108.65% |
| BlenderbotSmallForCausalLM | 5.91 | 5.44 | 108.72% | 4.82 | 122.67% |
| BlenderbotSmallForConditionalGeneration | 12.64 | 9.65 | 130.94% | 9.11 | 138.83% |
| CamemBert | 8.35 | 9.15 | 91.24% | 8.86 | 94.27% |
| DebertaForMaskedLM | 10.92 | 6.09 | 179.44% | 5.90 | 185.05% |
| DebertaForQuestionAnswering | 14.29 | 7.70 | 185.59% | 7.26 | 196.75% |
| DebertaV2ForMaskedLM | 15.47 | 10.22 | 151.32% | 9.34 | 165.55% |
| DebertaV2ForQuestionAnswering | 14.98 | 6.11 | 245.28% | 6.28 | 238.40% |
| DistilBertForMaskedLM | 8.37 | 8.70 | 96.30% | 8.22 | 101.92% |
| DistilBertForQuestionAnswering | 10.21 | 10.54 | 96.88% | 10.39 | 98.36% |
| DistillGPT2 | 8.77 | 6.78 | 129.40% | 6.31 | 138.88% |
| ElectraForCausalLM | 10.32 | 4.70 | 219.45% | 4.60 | 224.29% |
| ElectraForQuestionAnswering | 11.48 | 5.62 | 204.20% | 5.44 | 210.95% |
| GPT2ForSequenceClassification | 6.21 | 5.72 | 108.50% | 5.58 | 111.26% |
| GoogleFnet | 26.51 | 20.81 | 127.37% | 19.91 | 133.11% |
| LayoutLMForMaskedLM | 12.09 | 7.99 | 151.28% | 7.66 | 157.80% |
| LayoutLMForSequenceClassification | 10.62 | 6.49 | 163.67% | 6.25 | 169.95% |
| M2M100ForConditionalGeneration | 14.98 | 10.20 | 146.79% | 9.89 | 151.42% |
| MBartForCausalLM | 7.67 | 9.78 | 78.44% | 8.87 | 86.55% |
| MBartForConditionalGeneration | 13.45 | 12.69 | 105.99% | 12.03 | 111.82% |
| MT5ForConditionalGeneration | 19.96 | 5.32 | 375.37% | 5.08 | 393.01% |
| MegatronBertForCausalLM | 13.22 | 7.86 | 168.07% | 7.18 | 184.01% |
| MegatronBertForQuestionAnswering | 15.62 | 11.81 | 132.21% | 11.02 | 141.68% |
| MobileBertForMaskedLM | 26.63 | 10.82 | 245.99% | 11.95 | 222.73% |
| MobileBertForQuestionAnswering | 23.53 | 7.55 | 311.51% | 9.53 | 247.03% |
| OPTForCausalLM | 7.33 | 7.64 | 95.93% | 7.56 | 96.90% |
| PLBartForCausalLM | 8.73 | 7.63 | 114.40% | 7.37 | 118.58% |
| PLBartForConditionalGeneration | 10.46 | 8.50 | 122.98% | 8.16 | 128.13% |
| PegasusForCausalLM | 7.18 | 7.37 | 97.42% | 6.64 | 108.22% |
| PegasusForConditionalGeneration | 16.47 | 16.66 | 98.87% | 14.18 | 116.13% |
| RobertaForCausalLM | 10.30 | 9.95 | 103.52% | 9.52 | 108.25% |
| RobertaForQuestionAnswering | 6.37 | 7.13 | 89.28% | 6.79 | 93.87% |
| T5ForConditionalGeneration | 12.40 | 6.72 | 184.51% | 6.48 | 191.16% |
| T5Small | 12.02 | 6.66 | 180.55% | 6.32 | 190.33% |
| TrOCRForCausalLM | 14.12 | 13.31 | 106.11% | 12.45 | 113.41% |
| XGLMForCausalLM | 16.48 | 6.23 | 264.52% | 6.35 | 259.51% |
| XLNetLMHeadModel | 74.87 | 62.23 | 120.32% | 57.95 | 129.19% |
| YituTechConvBert | 20.21 | 10.50 | 192.48% | 9.97 | 202.72% |

We are also seeing a ~9% improvement on an internal addmm benchmark.

This PR also slightly reduces compilation time for AMD max-autotune: before this change, every config was assessed with matrix_instr_nonkdim in [0, 16]; with this update we drop that sweep and use 16 for all configs.
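As a back-of-the-envelope illustration of that pruning (hypothetical config counts, not the PR's actual config list):

```python
# Minimal sketch of the compile-time saving: sweeping matrix_instr_nonkdim
# over [0, 16] doubles the candidate count, so fixing it to 16 halves the
# number of kernels max-autotune has to compile and benchmark.
base = [(bm, bn, bk) for bm in (64, 128) for bn in (64, 128) for bk in (32, 64)]

before = [(cfg, nonkdim) for cfg in base for nonkdim in (0, 16)]  # 16 candidates
after = [(cfg, 16) for cfg in base]                               # 8 candidates

assert len(before) == 2 * len(after)
```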

There is currently no CI to test max-autotune performance, but it will be enabled via #148672, after which we can investigate further tuning updates and config pruning.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

@jataylo added the ciflow/rocm (Trigger "default" config CI on ROCm) label on Feb 17, 2025
@pytorch-bot commented Feb 17, 2025 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147315

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit ebf5a00 with merge base 24aadb4:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jataylo (Collaborator, Author) commented:

@pytorchbot rebase


@pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

@pytorchmergebot (Collaborator) commented:

Rebase failed: the command `git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/147315/head` returned non-zero exit code 1.

```
warning: skipped previously applied commit 221405d29d5
hint: use --reapply-cherry-picks to include skipped commits
hint: Disable this message with "git config set advice.skippedCherryPicks false"
Rebasing (1/3)
Auto-merging torch/_inductor/kernel/mm_common.py
Auto-merging torch/_inductor/select_algorithm.py
CONFLICT (modify/delete): torch/_inductor/template_heuristics.py deleted in HEAD and modified in 2ccab1039fc ([ROCm] Add ROCm specific tuning parameters and gemm retuning). Version 2ccab1039fc ([ROCm] Add ROCm specific tuning parameters and gemm retuning) of torch/_inductor/template_heuristics.py left in tree.
error: could not apply 2ccab1039fc... [ROCm] Add ROCm specific tuning parameters and gemm retuning
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 2ccab1039fc... [ROCm] Add ROCm specific tuning parameters and gemm retuning
```

Raised by https://github.com/pytorch/pytorch/actions/runs/13412914470

@jataylo added the ciflow/inductor-rocm (Trigger "inductor" config CI on ROCm) and ciflow/inductor-periodic labels on Feb 20, 2025
pytorchmergebot pushed a commit that referenced this pull request on Mar 7, 2025:

Splitting #147315 into two PRs. This PR adds general support for the kpack and waves_per_eu Triton kernel args for the AMD backend. More detail in the PR above. A follow-up PR will update the configs used by ROCm, but this requires #147452 to land first. Pull Request resolved: #148437. Approved by: https://github.com/eellison, https://github.com/jansel
@jataylo (Collaborator, Author) commented:

Will need a rebase once we finally reland #147452.

jataylo added a commit to jataylo/pytorch that referenced this pull request on Mar 27, 2025:

…8437) Splitting pytorch#147315 into two PRs. This PR adds general support for the kpack and waves_per_eu Triton kernel args for the AMD backend. More detail in the PR above. A follow-up PR will update the configs used by ROCm, but this requires pytorch#147452 to land first. Pull Request resolved: pytorch#148437. Approved by: https://github.com/eellison, https://github.com/jansel (cherry picked from commit 8059ead)
@jataylo (Collaborator, Author) commented:

Will get this one back up and ready now that the prerequisite PR is merged.

@jataylo (Collaborator, Author) commented:

@pytorchbot rebase


@pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

@pytorchmergebot (Collaborator) commented:

Successfully rebased amd-gemm-retune-pr onto refs/remotes/origin/viable/strict; please pull locally before adding more changes (for example, via `git checkout amd-gemm-retune-pr && git pull --rebase`).

@jataylo requested a review from eellison on April 8, 2025 at 10:01
@jataylo marked this pull request as ready for review on April 8, 2025 at 10:01
@bdhirsh added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label on Apr 8, 2025
```python
c.num_stages = self.default_num_stages
return configs

def _finalize_mm_configs(
```
Contributor commented:

Hmm, I wonder if some of the _finalize_mm_configs logic could be deduped/refactored.

@jataylo (Collaborator, Author) replied:

At the minute the base logic and the ROCm logic for this method are quite similar apart from the additional AMD Triton backend kernel args, but I do imagine they may start to diverge more strongly when we come to adding backend-specific optimisations here.

But yeah, I do think there is a potential refactor here. If it's okay, I'll merge this one as is and we can work on that going forward.
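(For context, one shape such a dedup could take: a minimal sketch assuming a shared base class with a backend-specific hook. The class and method names below are hypothetical, not PyTorch's actual template_heuristics structure.)

```python
# Hypothetical dedup sketch: the shared _finalize_mm_configs loop lives in
# the base heuristics class, and only the backend-specific kernarg injection
# is overridden. Illustrative names only.
class BaseMMHeuristics:
    def _finalize_mm_configs(self, configs):
        for c in configs:
            self._apply_backend_kwargs(c)  # per-backend hook
        return configs

    def _apply_backend_kwargs(self, config):
        pass  # no extra kernel args by default


class ROCmMMHeuristics(BaseMMHeuristics):
    def _apply_backend_kwargs(self, config):
        # Layer the AMD-specific Triton kernel args on top of shared configs.
        config.kwargs.update(
            waves_per_eu=0,
            matrix_instr_nonkdim=16,
            kpack=2,
        )
```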

@jataylo (Collaborator, Author) commented:

@pytorchbot merge


@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Apr 9, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 job has failed; the first few of them are: inductor-periodic / cuda12.6-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team: raised by workflow job.

@jataylo (Collaborator, Author) commented:

This failure is currently on the HUD alongside other benchmark failures:
https://hud.pytorch.org/pytorch/pytorch/commit/604467de208646f0c3b2663e45f2ff6a655a6716

@eellison are we okay if I force-push this?

@eellison (Contributor) commented:

@jataylo, yes, per the Dr. CI comment you should be good to land: #147315 (comment)


@eellison (Contributor) commented:

@pytorchbot merge -i



timocafe pushed a commit to timocafe/pytorch that referenced this pull request on Apr 16, 2025. The commit message mirrors the PR description above (Pull Request resolved: pytorch#147315; approved by https://github.com/jansel, https://github.com/eellison).
amathewc pushed a commit to amathewc/pytorch that referenced this pull request on Apr 17, 2025, with the same mirrored commit message.
jataylo added a commit to jataylo/pytorch that referenced this pull request on Aug 10, 2025, with the same mirrored commit message (cherry picked from commit 2299087).
jataylo added a commit to ROCm/pytorch that referenced this pull request on Sep 22, 2025, with the same mirrored commit message (cherry picked from commit 2299087).

Reviewers

@jansel approved these changes

@eellison approved these changes

Assignees

No one assigned

Labels

ciflow/inductor, ciflow/inductor-periodic, ciflow/inductor-rocm (Trigger "inductor" config CI on ROCm), ciflow/rocm (Trigger "default" config CI on ROCm), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: inductor, module: rocm (AMD GPU support for Pytorch), open source, release notes: rocm, mandatorylabel, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

6 participants

@jataylo @pytorchmergebot @eellison @jansel @bdhirsh @pytorchbot
