[CPU][Inductor] Improve A16W4 GEMM template performance by using block_n=32 #156174


Closed

Conversation

@Xia-Weiwen (Collaborator) commented Jun 17, 2025 (edited by pytorchmergebot)

Stack from ghstack (oldest at bottom):

Summary
We found that using block_n=32 yields better performance for A16W4 than block_n=64: cache locality improves, and when N is small, the larger number of N-blocks lets more cores be used in parallel.
For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, block_n=32 is more than 10% faster end-to-end for both first-token and next-token latency.
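The parallelism argument can be illustrated with a small back-of-the-envelope sketch (hypothetical shapes, not Inductor code): for a fixed GEMM width N, halving block_n doubles the number of N-blocks, so more cores receive at least one block of work.

```python
import math

def n_block_utilization(N: int, block_n: int, num_cores: int):
    """Number of N-blocks for a given block_n, and the fraction of cores
    that receive at least one block (a crude parallelism proxy)."""
    n_blocks = math.ceil(N / block_n)
    busy_cores = min(n_blocks, num_cores)
    return n_blocks, busy_cores / num_cores

# Hypothetical projection width N=1024 on 43 cores.
N, cores = 1024, 43
for block_n in (64, 32):
    blocks, util = n_block_utilization(N, block_n, cores)
    print(f"block_n={block_n}: {blocks} N-blocks, core utilization {util:.0%}")
```

With these assumed numbers, block_n=64 produces 16 blocks (well under 43 cores), while block_n=32 produces 32 blocks, roughly doubling the core utilization; the cache-locality effect is separate and not modeled here.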

Test plan

pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx

cc@voznesenskym@penguinwu@EikanWang@jgong5@Guobing-Chen@XiaobingSuper@zhuhaozhe@blzheng@wenzhe-nrv@jiayisunx@ipiszy@chenyang78@kadeng@muchulee8@amjames@chauhang@aakhundov

@pytorch-bot (bot) commented Jun 17, 2025 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156174

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9cfa7e8 with merge base 2625c70:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@leslie-fang-intel marked this pull request as ready for review June 18, 2025 00:28
@leslie-fang-intel (Collaborator) commented Jun 18, 2025 (edited)

The failure should be unrelated to this change. I will rebase this PR and run the tests again.

@leslie-fang-intel (Collaborator) commented:

@pytorchbot rebase


@pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

[ghstack-poisoned]
@pytorchmergebot (Collaborator) commented:

Successfully rebased gh/Xia-Weiwen/40/orig onto refs/remotes/origin/viable/strict; please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/156174).

pytorchmergebot pushed a commit that referenced this pull request Jun 18, 2025
leslie-fang-intel pushed a commit to leslie-fang-intel/pytorch that referenced this pull request Jun 18, 2025
…k_n=32
ghstack-source-id: 62af325
Pull-Request: pytorch#156174
Pull Request resolved: pytorch#156294
@leslie-fang-intel (Collaborator) commented:

@pytorchbot merge


pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Jun 18, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Jun 20, 2025
…156407)

**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This issue was exposed by PR #156174, which changes `block_N` from 64 to 32, increasing the likelihood of `Nc_block` being greater than 1 and thus of triggering the issue. This PR fixes the accuracy issue.

**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```
Pull Request resolved: #156407
Approved by: https://github.com/CaoE
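The slicing bug described in that follow-up commit can be sketched in plain Python (hypothetical sizes, not the actual Inductor Jinja template): when `Nc_block > 1`, each inner block `ni` must read its own Nr-wide sub-slice of the packed weights; reusing the same `n_start : n_start + n_size` slice for every `ni` makes all blocks read the first block's weights.

```python
# Hypothetical blocking parameters, chosen only for illustration.
Nr = 16        # register-level N blocking
Nc_block = 2   # Nr-wide blocks handled per template invocation
W = list(range(Nc_block * Nr))  # stand-in for one row of packed weights

n_start = 0
# Buggy pattern: every ni takes the identical slice.
buggy = [W[n_start:n_start + Nr] for _ in range(Nc_block)]
# Fixed pattern: each ni is offset by ni * Nr within the Nc block.
fixed = [W[n_start + ni * Nr : n_start + (ni + 1) * Nr]
         for ni in range(Nc_block)]

assert buggy[0] == buggy[1]   # both ni read the same weights -> wrong output
assert fixed[0] != fixed[1]   # each ni reads its own Nr-wide slice
```

With block_n=64 and Nr=16, `Nc_block` was usually 1 and the two patterns coincide, which is why the bug only surfaced once block_n=32 made `Nc_block > 1` common.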
github-actions bot deleted the gh/Xia-Weiwen/40/head branch July 20, 2025 02:20

Reviewers

@leslie-fang-intel approved these changes

@jgong5 Awaiting requested review from jgong5


5 participants

@Xia-Weiwen @leslie-fang-intel @pytorchmergebot @pytorchbot
