[ROCm] Introduce AMD specific inductor gemm tuning #147315
Conversation
jataylo commented Feb 19, 2025
@pytorchbot rebase
pytorchmergebot commented Feb 19, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
pytorchmergebot commented Feb 19, 2025
Rebase failed due to a command error. Raised by https://github.com/pytorch/pytorch/actions/runs/13412914470
Force-pushed from fc8e40b to 8173154.
Splitting #147315 into two PRs. This PR adds general support for the kpack and waves_per_eu triton kernel args for the AMD backend. More detail in the PR above. A follow-up PR will update the configs used by ROCm, but this requires #147452 to land first.
Pull Request resolved: #148437
Approved by: https://github.com/eellison, https://github.com/jansel
jataylo commented Mar 11, 2025
Will need a rebase once we finally reland #147452
…8437) Splitting pytorch#147315 into two PRs. This PR adds general support for the kpack and waves_per_eu triton kernel args for the AMD backend. More detail in the PR above. A follow-up PR will update the configs used by ROCm, but this requires pytorch#147452 to land first.
Pull Request resolved: pytorch#148437
Approved by: https://github.com/eellison, https://github.com/jansel
(cherry picked from commit 8059ead)
jataylo commented Mar 28, 2025
Will get this one back up and ready now the prerequisite PR is merged.
jataylo commented Apr 7, 2025
@pytorchbot rebase
pytorchmergebot commented Apr 7, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
pytorchmergebot commented Apr 7, 2025
Successfully rebased
Force-pushed from 188fd95 to ae51f64.
```python
c.num_stages = self.default_num_stages
return configs

def _finalize_mm_configs(
```
Hmm, I wonder if some of the _finalize_mm_configs logic could be deduped/refactored.
At the minute the Base logic and the ROCm logic for this method are quite similar besides the additional AMD triton backend kernel args, but I do imagine they may start to diverge more strongly when we come to adding backend-specific optimisations here.
But yeah, I do think there is potential for a refactor here (a rough sketch follows below). If it's okay I'll merge this one as-is and we can work on that going forward.
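For context, a minimal sketch of what such a dedupe could look like. All names here (BaseConfigHeuristic, ROCmConfigHeuristic, and the config fields) are hypothetical stand-ins, not the actual inductor classes: the shared finalize loop lives in the base class and only the AMD-specific kernel args sit in a backend hook.

```python
# Hypothetical sketch of the refactor discussed above, not the real classes.
class BaseConfigHeuristic:
    def _finalize_mm_configs(self, configs):
        # Shared logic: dedupe configs once, in the base class.
        seen = set()
        for c in configs:
            key = (c.block_m, c.block_n, c.block_k, c.num_stages, c.num_warps)
            if key in seen:
                continue
            seen.add(key)
            yield self._adjust_config(c)

    def _adjust_config(self, c):
        # Backend hook: the base implementation leaves the config unchanged.
        return c


class ROCmConfigHeuristic(BaseConfigHeuristic):
    def _adjust_config(self, c):
        # Only the AMD-specific Triton kernel args live in the override.
        c.kwargs.update(matrix_instr_nonkdim=16, waves_per_eu=0, kpack=2)
        return c
```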
jataylo commented Apr 9, 2025
@pytorchbot merge
pytorchmergebot commented Apr 9, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorchmergebot commented Apr 9, 2025
Merge failed. Reason: 1 job has failed: inductor-periodic / cuda12.6-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu). Details for Dev Infra team. Raised by workflow job.
jataylo commented Apr 9, 2025
This failure is currently on the HUD alongside other benchmark failures. @eellison, are we okay if I force push this?
eellison commented Apr 9, 2025
@jataylo, yes, per the Dr. CI comment you should be good to land: #147315 (comment)
eellison commented Apr 9, 2025
@pytorchbot merge -i
pytorchmergebot commented Apr 9, 2025
Merge started. Your change will be merged while ignoring the following 4 checks: inductor-periodic / rocm-py3_10-periodic-dynamo-benchmarks / test (dynamo_eager_torchbench, 2, 2, linux.rocm.gpu.mi300.2), inductor-periodic / rocm-py3_10-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.rocm.gpu.mi300.2), inductor-periodic / rocm-py3_10-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 2, 2, linux.rocm.gpu.mi300.2), and inductor-periodic / cuda12.6-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Replaces pytorch#143286

Adds ROCm-specific MM configs for max-autotune, incorporating ROCm-specific triton tuning kernargs such as waves_per_eu, kpack, and matrix_instr_nonkdim. This PR also introduces behavior to allow tuning GROUP_M in the triton gemm case.

Dynamo huggingface inference benchmarks:

`TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON" python huggingface.py --performance --inference --bfloat16 --backend=inductor`

GEOMEAN speedup (before): 1.35x
GEOMEAN speedup (after): 1.42x

| name | Eager - abs latency | old - abs_latency | old - speedup | new - abs_latency | new - speedup |
| -- | -- | -- | -- | -- | -- |
| AlbertForMaskedLM | 26.22 | 26.52 | 98.86% | 24.58 | 106.67% |
| AlbertForQuestionAnswering | 25.96 | 26.40 | 98.33% | 24.10 | 107.73% |
| AllenaiLongformerBase | 21.03 | 10.65 | 197.50% | 10.49 | 200.58% |
| BartForCausalLM | 7.77 | 9.76 | 79.63% | 8.79 | 88.46% |
| BartForConditionalGeneration | 14.44 | 12.86 | 112.26% | 11.96 | 120.70% |
| BertForMaskedLM | 8.10 | 8.82 | 91.89% | 8.57 | 94.53% |
| BertForQuestionAnswering | 6.82 | 7.32 | 93.20% | 7.10 | 96.18% |
| BlenderbotForCausalLM | 10.97 | 11.39 | 96.34% | 10.10 | 108.65% |
| BlenderbotSmallForCausalLM | 5.91 | 5.44 | 108.72% | 4.82 | 122.67% |
| BlenderbotSmallForConditionalGeneration | 12.64 | 9.65 | 130.94% | 9.11 | 138.83% |
| CamemBert | 8.35 | 9.15 | 91.24% | 8.86 | 94.27% |
| DebertaForMaskedLM | 10.92 | 6.09 | 179.44% | 5.90 | 185.05% |
| DebertaForQuestionAnswering | 14.29 | 7.70 | 185.59% | 7.26 | 196.75% |
| DebertaV2ForMaskedLM | 15.47 | 10.22 | 151.32% | 9.34 | 165.55% |
| DebertaV2ForQuestionAnswering | 14.98 | 6.11 | 245.28% | 6.28 | 238.40% |
| DistilBertForMaskedLM | 8.37 | 8.70 | 96.30% | 8.22 | 101.92% |
| DistilBertForQuestionAnswering | 10.21 | 10.54 | 96.88% | 10.39 | 98.36% |
| DistillGPT2 | 8.77 | 6.78 | 129.40% | 6.31 | 138.88% |
| ElectraForCausalLM | 10.32 | 4.70 | 219.45% | 4.60 | 224.29% |
| ElectraForQuestionAnswering | 11.48 | 5.62 | 204.20% | 5.44 | 210.95% |
| GPT2ForSequenceClassification | 6.21 | 5.72 | 108.50% | 5.58 | 111.26% |
| GoogleFnet | 26.51 | 20.81 | 127.37% | 19.91 | 133.11% |
| LayoutLMForMaskedLM | 12.09 | 7.99 | 151.28% | 7.66 | 157.80% |
| LayoutLMForSequenceClassification | 10.62 | 6.49 | 163.67% | 6.25 | 169.95% |
| M2M100ForConditionalGeneration | 14.98 | 10.20 | 146.79% | 9.89 | 151.42% |
| MBartForCausalLM | 7.67 | 9.78 | 78.44% | 8.87 | 86.55% |
| MBartForConditionalGeneration | 13.45 | 12.69 | 105.99% | 12.03 | 111.82% |
| MT5ForConditionalGeneration | 19.96 | 5.32 | 375.37% | 5.08 | 393.01% |
| MegatronBertForCausalLM | 13.22 | 7.86 | 168.07% | 7.18 | 184.01% |
| MegatronBertForQuestionAnswering | 15.62 | 11.81 | 132.21% | 11.02 | 141.68% |
| MobileBertForMaskedLM | 26.63 | 10.82 | 245.99% | 11.95 | 222.73% |
| MobileBertForQuestionAnswering | 23.53 | 7.55 | 311.51% | 9.53 | 247.03% |
| OPTForCausalLM | 7.33 | 7.64 | 95.93% | 7.56 | 96.90% |
| PLBartForCausalLM | 8.73 | 7.63 | 114.40% | 7.37 | 118.58% |
| PLBartForConditionalGeneration | 10.46 | 8.50 | 122.98% | 8.16 | 128.13% |
| PegasusForCausalLM | 7.18 | 7.37 | 97.42% | 6.64 | 108.22% |
| PegasusForConditionalGeneration | 16.47 | 16.66 | 98.87% | 14.18 | 116.13% |
| RobertaForCausalLM | 10.30 | 9.95 | 103.52% | 9.52 | 108.25% |
| RobertaForQuestionAnswering | 6.37 | 7.13 | 89.28% | 6.79 | 93.87% |
| T5ForConditionalGeneration | 12.40 | 6.72 | 184.51% | 6.48 | 191.16% |
| T5Small | 12.02 | 6.66 | 180.55% | 6.32 | 190.33% |
| TrOCRForCausalLM | 14.12 | 13.31 | 106.11% | 12.45 | 113.41% |
| XGLMForCausalLM | 16.48 | 6.23 | 264.52% | 6.35 | 259.51% |
| XLNetLMHeadModel | 74.87 | 62.23 | 120.32% | 57.95 | 129.19% |
| YituTechConvBert | 20.21 | 10.50 | 192.48% | 9.97 | 202.72% |

We are also seeing an improvement of ~9% on an internal addmm benchmark.

This PR will also slightly reduce compilation time for AMD max-autotune: before this change we assessed every config with matrix_instr_nonkdim in [0, 16], but with this update we remove that axis and use 16 for all configs.

There is currently no CI to test max-autotune perf, but this will be enabled via pytorch#148672, after which we can investigate more tuning updates and config pruning.

Pull Request resolved: pytorch#147315
Approved by: https://github.com/jansel, https://github.com/eellison
(cherry picked from commit 2299087)
Replaces #143286
Adds ROCm-specific MM configs for max-autotune, incorporating ROCm-specific triton tuning kernargs such as waves_per_eu, kpack, and matrix_instr_nonkdim. This PR also introduces behavior to allow tuning GROUP_M in the triton gemm case.
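As a rough illustration of what these kernargs look like at the Triton level (the block sizes and values here are made up for the example, and this assumes a ROCm build of Triton that accepts these keys as config kwargs; it is not the config list the PR actually adds):

```python
import triton

# Example ROCm gemm tuning config carrying the AMD-specific kernel args
# mentioned above; all numeric values are illustrative only.
rocm_gemm_config = triton.Config(
    {
        "BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64,
        "GROUP_M": 8,                # now tunable per this PR
        "matrix_instr_nonkdim": 16,  # MFMA instruction shape hint
        "waves_per_eu": 2,           # occupancy hint per execution unit
        "kpack": 2,                  # K-dimension packing factor
    },
    num_stages=2,
    num_warps=8,
)
```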
Dynamo huggingface inference benchmarks:
`TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON" python huggingface.py --performance --inference --bfloat16 --backend=inductor`

GEOMEAN speedup (before): 1.35x
GEOMEAN speedup (after): 1.42x
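For reference, the GEOMEAN figures above are geometric means of the per-model speedups; a quick sketch of the computation (with made-up values, not the actual 45-model results):

```python
from math import prod

# Geometric mean of per-model speedups, as reported by the GEOMEAN rows above.
speedups = [1.0667, 1.0773, 2.0058, 0.8846]  # illustrative values only
geomean = prod(speedups) ** (1 / len(speedups))
print(f"GEOMEAN speedup: {geomean:.2f}x")
```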
We are also seeing an improvement of ~9% on an internal addmm benchmark.
This PR will also slightly reduce compilation time for AMD max-autotune: before this change we assessed every config with matrix_instr_nonkdim in [0, 16], but with this update we remove that axis and use 16 for all configs (a short sketch of the saving follows).
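A small sketch of that compile-time saving (the candidate axes below are illustrative, not the real config lists): pinning matrix_instr_nonkdim to 16 halves the candidate set that max-autotune has to benchmark.

```python
from itertools import product

# Illustrative candidate axes; the real inductor config lists differ.
block_shapes = [(64, 64, 32), (128, 128, 64)]
group_m_options = [4, 8]

# Before: every config was crossed with matrix_instr_nonkdim in [0, 16].
before = list(product(block_shapes, group_m_options, [0, 16]))

# After: matrix_instr_nonkdim is fixed at 16 for all configs.
after = [(bs, gm, 16) for bs, gm in product(block_shapes, group_m_options)]

assert len(after) == len(before) // 2  # half as many candidates to benchmark
```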
There is currently no CI to test max-autotune perf, but this will be enabled via #148672, after which we can investigate more tuning updates and config pruning.
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov