[CPU][Inductor] Improve A16W4 GEMM template performance by using block_n=32 #156174
Conversation
pytorch-bot (bot) commented Jun 17, 2025 (edited)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156174
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 9cfa7e8 with merge base 2625c70. This comment was automatically generated by Dr. CI and updates every 15 minutes.
leslie-fang-intel commented Jun 18, 2025 (edited)
The failure should be irrelevant. Will rebase this PR and test again.
leslie-fang-intel commented Jun 18, 2025
@pytorchbot rebase
pytorchmergebot commented Jun 18, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
pytorchmergebot commented Jun 18, 2025
Successfully rebased
…k_n=32
ghstack-source-id: 62af325
Pull-Request: pytorch#156174
Pull Request resolved: pytorch#156294
leslie-fang-intel commented Jun 18, 2025
@pytorchbot merge
pytorchmergebot commented Jun 18, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…156407)
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This issue was exposed by [PR #156174](#156174), which changed `block_N` from 64 to 32, increasing the likelihood of `Nc_block` being greater than 1 and thus making the issue easier to trigger. This PR fixes the accuracy issue.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```
Pull Request resolved: #156407
Approved by: https://github.com/CaoE
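The slicing bug described above can be illustrated with a small NumPy sketch. All shapes and variable names here are hypothetical, not the actual template code: the point is that when every `ni` sub-block reuses the same N slice instead of advancing by `block_n` columns, the blocked GEMM disagrees with the reference result.

```python
import numpy as np

# Hypothetical sizes: N is split into Nc_block sub-blocks of width block_n.
block_n, Nc_block = 32, 2
K, N = 64, block_n * Nc_block
rng = np.random.default_rng(0)
A = rng.standard_normal((4, K))
W = rng.standard_normal((K, N))

ref = A @ W  # reference (unblocked) GEMM

# Buggy blocking: every ni reuses the same N slice [0:block_n),
# mirroring the template taking n_start..n_start+n_size for each ni.
buggy = np.concatenate(
    [A @ W[:, 0:block_n] for _ in range(Nc_block)], axis=1
)

# Fixed blocking: each ni advances by block_n columns.
fixed = np.concatenate(
    [A @ W[:, ni * block_n:(ni + 1) * block_n] for ni in range(Nc_block)],
    axis=1,
)

assert np.allclose(fixed, ref)       # per-ni slices match the reference
assert not np.allclose(buggy, ref)   # shared slice is wrong once Nc_block > 1
```

With `Nc_block == 1` both variants coincide, which is why lowering `block_N` from 64 to 32 (and thereby making `Nc_block > 1` more common) surfaced the bug.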
Stack from ghstack (oldest at bottom):
Summary
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64`: cache locality is better, and parallelism improves when N is small and more cores are used. For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is more than 10% faster end-to-end for both first and next token.
Test plan
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov
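The parallelism argument in the Summary can be sketched with a toy calculation. The `N` value below is illustrative (not a measured Llama-3.1-8B shape); only the 43-core count comes from the PR. Halving `block_n` doubles the number of N blocks, so more cores receive at least one block of work when N is small:

```python
def n_blocks(N: int, block_n: int) -> int:
    """Number of N-dimension blocks for a given block width (ceil division)."""
    return (N + block_n - 1) // block_n

N, cores = 1024, 43  # N is a hypothetical example size; 43 cores as in the PR
for bn in (64, 32):
    blocks = n_blocks(N, bn)
    busy = min(cores, blocks)  # cores that get at least one block
    print(f"block_n={bn}: {blocks} N-blocks, {busy}/{cores} cores busy")
# block_n=64: 16 N-blocks, 16/43 cores busy
# block_n=32: 32 N-blocks, 32/43 cores busy
```

This is only the parallelism half of the argument; the cache-locality benefit of the narrower 32-wide tile is a separate effect that this sketch does not model.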