[CPU][Inductor] Improve A16W4 GEMM template performance by using block_n=32 #156174
Conversation
pytorch-bot (bot) commented Jun 17, 2025 (edited)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156174
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 9cfa7e8 with merge base 2625c70. This comment was automatically generated by Dr. CI and updates every 15 minutes.
leslie-fang-intel commented Jun 18, 2025 (edited)
The failure should be irrelevant. Will rebase this PR and test again.
leslie-fang-intel commented Jun 18, 2025
@pytorchbot rebase
pytorchmergebot commented Jun 18, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
pytorchmergebot commented Jun 18, 2025
Successfully rebased
…k_n=32
ghstack-source-id: 62af325
Pull-Request: pytorch#156174
Pull Request resolved: pytorch#156294
leslie-fang-intel commented Jun 18, 2025
@pytorchbot merge
pytorchmergebot commented Jun 18, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…156407)
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This issue was exposed by [PR #156174](#156174), which changed `block_N` from 64 to 32, increasing the likelihood of `Nc_block` being greater than 1 and thus making the issue easier to trigger. This PR fixes the accuracy issue.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```
Pull Request resolved: #156407
Approved by: https://github.com/CaoE
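The slicing bug described above can be illustrated with a small NumPy sketch. All shapes and variable names here are hypothetical, not the actual template code: the point is that when every `ni` sub-block reuses the same N slice instead of advancing by `block_n` columns, the blocked GEMM disagrees with the reference result.

```python
import numpy as np

# Hypothetical sizes: N is split into Nc_block sub-blocks of width block_n.
block_n, Nc_block = 32, 2
K, N = 64, block_n * Nc_block
rng = np.random.default_rng(0)
A = rng.standard_normal((4, K))
W = rng.standard_normal((K, N))

ref = A @ W  # reference (unblocked) GEMM

# Buggy blocking: every ni reuses the same N slice [0:block_n),
# mirroring the template taking n_start..n_start+n_size for each ni.
buggy = np.concatenate(
    [A @ W[:, 0:block_n] for _ in range(Nc_block)], axis=1
)

# Fixed blocking: each ni advances by block_n columns.
fixed = np.concatenate(
    [A @ W[:, ni * block_n:(ni + 1) * block_n] for ni in range(Nc_block)],
    axis=1,
)

assert np.allclose(fixed, ref)       # per-ni slices match the reference
assert not np.allclose(buggy, ref)   # shared slice is wrong once Nc_block > 1
```

With `Nc_block == 1` both variants coincide, which is why lowering `block_N` from 64 to 32 (and thereby making `Nc_block > 1` more common) surfaced the bug.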
Stack from ghstack (oldest at bottom):
Summary
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64`: cache locality is better, and parallelism improves when N is small and more cores are used. For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is more than 10% faster end-to-end for both first and next token.
Test plan
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov
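The parallelism argument in the Summary can be sketched with a toy calculation. The `N` value below is illustrative (not a measured Llama-3.1-8B shape); only the 43-core count comes from the PR. Halving `block_n` doubles the number of N blocks, so more cores receive at least one block of work when N is small:

```python
def n_blocks(N: int, block_n: int) -> int:
    """Number of N-dimension blocks for a given block width (ceil division)."""
    return (N + block_n - 1) // block_n

N, cores = 1024, 43  # N is a hypothetical example size; 43 cores as in the PR
for bn in (64, 32):
    blocks = n_blocks(N, bn)
    busy = min(cores, blocks)  # cores that get at least one block
    print(f"block_n={bn}: {blocks} N-blocks, {busy}/{cores} cores busy")
# block_n=64: 16 N-blocks, 16/43 cores busy
# block_n=32: 32 N-blocks, 32/43 cores busy
```

This is only the parallelism half of the argument; the cache-locality benefit of the narrower 32-wide tile is a separate effect that this sketch does not model.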