[ROCm] Introduce AMD specific inductor gemm tuning #147315
Conversation
jataylo commented Feb 19, 2025
@pytorchbot rebase
pytorchmergebot commented Feb 19, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
pytorchmergebot commented Feb 19, 2025
Rebase failed due to a command error. Raised by https://github.com/pytorch/pytorch/actions/runs/13412914470
Force-pushed from fc8e40b to 8173154.
Splitting #147315 into two PRs. This PR adds general support for the kpack and waves_per_eu triton kernel args for the AMD backend. More detail in the PR above. A follow-up PR will update the configs used by ROCm, but this requires #147452 to land first.
Pull Request resolved: #148437
Approved by: https://github.com/eellison, https://github.com/jansel
jataylo commented Mar 11, 2025
Will need a rebase once we finally reland #147452
…8437) Splitting pytorch#147315 into two PRs. This PR adds general support for the kpack and waves_per_eu triton kernel args for the AMD backend. More detail in the PR above. A follow-up PR will update the configs used by ROCm, but this requires pytorch#147452 to land first.
Pull Request resolved: pytorch#148437
Approved by: https://github.com/eellison, https://github.com/jansel
(cherry picked from commit 8059ead)
jataylo commented Mar 28, 2025
Will get this one back up and ready now the prerequisite PR is merged.
jataylo commented Apr 7, 2025
@pytorchbot rebase
pytorchmergebot commented Apr 7, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
pytorchmergebot commented Apr 7, 2025
Successfully rebased
Force-pushed from 188fd95 to ae51f64.
```python
c.num_stages = self.default_num_stages
return configs

def _finalize_mm_configs(
```
Hmm, I wonder if some of the _finalize_mm_configs logic could be deduped/refactored.
At the minute the Base logic and the ROCm logic for this method are quite similar besides the additional AMD triton backend kernel args, but I do imagine they may start to diverge more strongly when we come to adding backend-specific optimisations here.
But yeah, I do think there is potential for a refactor here (a rough sketch follows below). If it's okay I'll merge this one as-is and we can work on that going forward.
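For context, a minimal sketch of what such a dedupe could look like. All names here (BaseConfigHeuristic, ROCmConfigHeuristic, and the config fields) are hypothetical stand-ins, not the actual inductor classes: the shared finalize loop lives in the base class and only the AMD-specific kernel args sit in a backend hook.

```python
# Hypothetical sketch of the refactor discussed above, not the real classes.
class BaseConfigHeuristic:
    def _finalize_mm_configs(self, configs):
        # Shared logic: dedupe configs once, in the base class.
        seen = set()
        for c in configs:
            key = (c.block_m, c.block_n, c.block_k, c.num_stages, c.num_warps)
            if key in seen:
                continue
            seen.add(key)
            yield self._adjust_config(c)

    def _adjust_config(self, c):
        # Backend hook: the base implementation leaves the config unchanged.
        return c


class ROCmConfigHeuristic(BaseConfigHeuristic):
    def _adjust_config(self, c):
        # Only the AMD-specific Triton kernel args live in the override.
        c.kwargs.update(matrix_instr_nonkdim=16, waves_per_eu=0, kpack=2)
        return c
```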
jataylo commented Apr 9, 2025
@pytorchbot merge
pytorchmergebot commented Apr 9, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorchmergebot commented Apr 9, 2025
Merge failed. Reason: 1 job has failed: inductor-periodic / cuda12.6-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu). Details for Dev Infra team. Raised by workflow job.
jataylo commented Apr 9, 2025
This failure is currently on the HUD alongside other benchmark failures. @eellison, are we okay if I force push this?
eellison commented Apr 9, 2025
@jataylo, yes, per the Dr. CI comment you should be good to land: #147315 (comment)
eellison commented Apr 9, 2025
@pytorchbot merge -i
pytorchmergebot commented Apr 9, 2025
Merge started. Your change will be merged while ignoring the following 4 checks: inductor-periodic / rocm-py3_10-periodic-dynamo-benchmarks / test (dynamo_eager_torchbench, 2, 2, linux.rocm.gpu.mi300.2), inductor-periodic / rocm-py3_10-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.rocm.gpu.mi300.2), inductor-periodic / rocm-py3_10-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 2, 2, linux.rocm.gpu.mi300.2), and inductor-periodic / cuda12.6-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Replaces pytorch#143286

Adds ROCm-specific MM configs for max-autotune, incorporating ROCm-specific triton tuning kernargs such as waves_per_eu, kpack, and matrix_instr_nonkdim. This PR also introduces behavior to allow tuning GROUP_M in the triton gemm case.

Dynamo huggingface inference benchmarks:

`TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON" python huggingface.py --performance --inference --bfloat16 --backend=inductor`

GEOMEAN speedup (before): 1.35x
GEOMEAN speedup (after): 1.42x

| name | Eager - abs latency | old - abs_latency | old - speedup | new - abs_latency | new - speedup |
| -- | -- | -- | -- | -- | -- |
| AlbertForMaskedLM | 26.22 | 26.52 | 98.86% | 24.58 | 106.67% |
| AlbertForQuestionAnswering | 25.96 | 26.40 | 98.33% | 24.10 | 107.73% |
| AllenaiLongformerBase | 21.03 | 10.65 | 197.50% | 10.49 | 200.58% |
| BartForCausalLM | 7.77 | 9.76 | 79.63% | 8.79 | 88.46% |
| BartForConditionalGeneration | 14.44 | 12.86 | 112.26% | 11.96 | 120.70% |
| BertForMaskedLM | 8.10 | 8.82 | 91.89% | 8.57 | 94.53% |
| BertForQuestionAnswering | 6.82 | 7.32 | 93.20% | 7.10 | 96.18% |
| BlenderbotForCausalLM | 10.97 | 11.39 | 96.34% | 10.10 | 108.65% |
| BlenderbotSmallForCausalLM | 5.91 | 5.44 | 108.72% | 4.82 | 122.67% |
| BlenderbotSmallForConditionalGeneration | 12.64 | 9.65 | 130.94% | 9.11 | 138.83% |
| CamemBert | 8.35 | 9.15 | 91.24% | 8.86 | 94.27% |
| DebertaForMaskedLM | 10.92 | 6.09 | 179.44% | 5.90 | 185.05% |
| DebertaForQuestionAnswering | 14.29 | 7.70 | 185.59% | 7.26 | 196.75% |
| DebertaV2ForMaskedLM | 15.47 | 10.22 | 151.32% | 9.34 | 165.55% |
| DebertaV2ForQuestionAnswering | 14.98 | 6.11 | 245.28% | 6.28 | 238.40% |
| DistilBertForMaskedLM | 8.37 | 8.70 | 96.30% | 8.22 | 101.92% |
| DistilBertForQuestionAnswering | 10.21 | 10.54 | 96.88% | 10.39 | 98.36% |
| DistillGPT2 | 8.77 | 6.78 | 129.40% | 6.31 | 138.88% |
| ElectraForCausalLM | 10.32 | 4.70 | 219.45% | 4.60 | 224.29% |
| ElectraForQuestionAnswering | 11.48 | 5.62 | 204.20% | 5.44 | 210.95% |
| GPT2ForSequenceClassification | 6.21 | 5.72 | 108.50% | 5.58 | 111.26% |
| GoogleFnet | 26.51 | 20.81 | 127.37% | 19.91 | 133.11% |
| LayoutLMForMaskedLM | 12.09 | 7.99 | 151.28% | 7.66 | 157.80% |
| LayoutLMForSequenceClassification | 10.62 | 6.49 | 163.67% | 6.25 | 169.95% |
| M2M100ForConditionalGeneration | 14.98 | 10.20 | 146.79% | 9.89 | 151.42% |
| MBartForCausalLM | 7.67 | 9.78 | 78.44% | 8.87 | 86.55% |
| MBartForConditionalGeneration | 13.45 | 12.69 | 105.99% | 12.03 | 111.82% |
| MT5ForConditionalGeneration | 19.96 | 5.32 | 375.37% | 5.08 | 393.01% |
| MegatronBertForCausalLM | 13.22 | 7.86 | 168.07% | 7.18 | 184.01% |
| MegatronBertForQuestionAnswering | 15.62 | 11.81 | 132.21% | 11.02 | 141.68% |
| MobileBertForMaskedLM | 26.63 | 10.82 | 245.99% | 11.95 | 222.73% |
| MobileBertForQuestionAnswering | 23.53 | 7.55 | 311.51% | 9.53 | 247.03% |
| OPTForCausalLM | 7.33 | 7.64 | 95.93% | 7.56 | 96.90% |
| PLBartForCausalLM | 8.73 | 7.63 | 114.40% | 7.37 | 118.58% |
| PLBartForConditionalGeneration | 10.46 | 8.50 | 122.98% | 8.16 | 128.13% |
| PegasusForCausalLM | 7.18 | 7.37 | 97.42% | 6.64 | 108.22% |
| PegasusForConditionalGeneration | 16.47 | 16.66 | 98.87% | 14.18 | 116.13% |
| RobertaForCausalLM | 10.30 | 9.95 | 103.52% | 9.52 | 108.25% |
| RobertaForQuestionAnswering | 6.37 | 7.13 | 89.28% | 6.79 | 93.87% |
| T5ForConditionalGeneration | 12.40 | 6.72 | 184.51% | 6.48 | 191.16% |
| T5Small | 12.02 | 6.66 | 180.55% | 6.32 | 190.33% |
| TrOCRForCausalLM | 14.12 | 13.31 | 106.11% | 12.45 | 113.41% |
| XGLMForCausalLM | 16.48 | 6.23 | 264.52% | 6.35 | 259.51% |
| XLNetLMHeadModel | 74.87 | 62.23 | 120.32% | 57.95 | 129.19% |
| YituTechConvBert | 20.21 | 10.50 | 192.48% | 9.97 | 202.72% |

We are also seeing an improvement of ~9% on an internal addmm benchmark.

This PR will also slightly reduce compilation time for AMD max-autotune: before this change we assessed every config with matrix_instr_nonkdim in [0, 16], but with this update we remove that axis and use 16 for all configs.

There is currently no CI to test max-autotune perf, but this will be enabled via pytorch#148672, after which we can investigate more tuning updates and config pruning.

Pull Request resolved: pytorch#147315
Approved by: https://github.com/jansel, https://github.com/eellison
(cherry picked from commit 2299087)
Replaces #143286
Adds ROCm-specific MM configs for max-autotune, incorporating ROCm-specific triton tuning kernargs such as waves_per_eu, kpack, and matrix_instr_nonkdim. This PR also introduces behavior to allow tuning GROUP_M in the triton gemm case.
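As a rough illustration of what these kernargs look like at the Triton level (the block sizes and values here are made up for the example, and this assumes a ROCm build of Triton that accepts these keys as config kwargs; it is not the config list the PR actually adds):

```python
import triton

# Example ROCm gemm tuning config carrying the AMD-specific kernel args
# mentioned above; all numeric values are illustrative only.
rocm_gemm_config = triton.Config(
    {
        "BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64,
        "GROUP_M": 8,                # now tunable per this PR
        "matrix_instr_nonkdim": 16,  # MFMA instruction shape hint
        "waves_per_eu": 2,           # occupancy hint per execution unit
        "kpack": 2,                  # K-dimension packing factor
    },
    num_stages=2,
    num_warps=8,
)
```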
Dynamo huggingface inference benchmarks:
`TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON" python huggingface.py --performance --inference --bfloat16 --backend=inductor`

GEOMEAN speedup (before): 1.35x
GEOMEAN speedup (after): 1.42x
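For reference, the GEOMEAN figures above are geometric means of the per-model speedups; a quick sketch of the computation (with made-up values, not the actual 45-model results):

```python
from math import prod

# Geometric mean of per-model speedups, as reported by the GEOMEAN rows above.
speedups = [1.0667, 1.0773, 2.0058, 0.8846]  # illustrative values only
geomean = prod(speedups) ** (1 / len(speedups))
print(f"GEOMEAN speedup: {geomean:.2f}x")
```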
We are also seeing an improvement of ~9% on an internal addmm benchmark.
This PR will also slightly reduce compilation time for AMD max-autotune: before this change we assessed every config with matrix_instr_nonkdim in [0, 16], but with this update we remove that axis and use 16 for all configs (a short sketch of the saving follows).
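A small sketch of that compile-time saving (the candidate axes below are illustrative, not the real config lists): pinning matrix_instr_nonkdim to 16 halves the candidate set that max-autotune has to benchmark.

```python
from itertools import product

# Illustrative candidate axes; the real inductor config lists differ.
block_shapes = [(64, 64, 32), (128, 128, 64)]
group_m_options = [4, 8]

# Before: every config was crossed with matrix_instr_nonkdim in [0, 16].
before = list(product(block_shapes, group_m_options, [0, 16]))

# After: matrix_instr_nonkdim is fixed at 16 for all configs.
after = [(bs, gm, 16) for bs, gm in product(block_shapes, group_m_options)]

assert len(after) == len(before) // 2  # half as many candidates to benchmark
```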
There is currently no CI to test max-autotune perf, but this will be enabled via #148672, after which we can investigate more tuning updates and config pruning.
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov