[Intel GPU] OneDNN primitive cache support for Int4 WOQ gemm on XPU #147693
Conversation
pytorch-bot bot commented Feb 23, 2025 • edited
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147693
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures. As of commit 562af36 with merge base 54f1f29. This comment was automatically generated by Dr. CI and updates every 15 minutes.
linux-foundation-easycla bot commented Feb 23, 2025 • edited
- const c10::optional<Tensor>& bias,
+ const std::optional<Tensor>& bias,
EikanWang commented Feb 24, 2025
@baodii, have you signed the EasyCLA?
(nit) Do you have plans to use src_desc(), ... queries elsewhere?
If not, why not call query_md() directly in the methods that create memory objects (like here)?
Hi, we use src_desc() the way the primitive_desc class does, to keep the code consistent with oneDNN.
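As a rough illustration of that convention, here is a minimal sketch (the wrapper name, members, and helper are assumptions, not the PR's actual primitive_ext code) of accessors that keep oneDNN's primitive_desc-style names and simply forward to query_md():

```cpp
#include <oneapi/dnnl/dnnl.hpp>
#include <utility>

// Hypothetical wrapper: keeps oneDNN's primitive_desc-style query names so
// call sites read like plain oneDNN code, while memory objects are always
// created from the queried descriptors.
struct primitive_ext_sketch {
  dnnl::matmul::primitive_desc pd;
  dnnl::matmul prim;

  explicit primitive_ext_sketch(dnnl::matmul::primitive_desc desc)
      : pd(std::move(desc)), prim(pd) {}

  // Each accessor is just query_md() underneath, exactly like dnnl::primitive_desc.
  dnnl::memory::desc src_desc() const { return pd.query_md(dnnl::query::src_md, 0); }
  dnnl::memory::desc weights_desc() const { return pd.query_md(dnnl::query::weights_md, 0); }
  dnnl::memory::desc dst_desc() const { return pd.query_md(dnnl::query::dst_md, 0); }

  // Memory objects built from the queried descriptors always match the primitive.
  dnnl::memory make_src(const dnnl::engine& eng, void* handle) const {
    return dnnl::memory(src_desc(), eng, handle);
  }
};
```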
(readability) maybe use a better name than m? (e.g. mem_arg_cache?)
Also, if you decide to hoist make_args calls out of the execute function, you might need to make this an unordered_map.
mem_arg_cache is better than m. I'll take it.
We adopted unordered_map at first, but when we tested on a BMG client with a low-end CPU, the unordered_map hurt performance. So we chose the simpler array.
What are slots 0 and 1 reserved for?
Since we only provide int4 gemm this time, slots 0 and 1 are reserved for the scale and zero point (zp) of the int4 gemm.
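A minimal sketch of what such a fixed-slot argument array could look like (the slot layout comes from this discussion; the struct name, array size, and helpers are assumptions): slots 0 and 1 always hold the weight scale and zero point, and the remaining exec args are appended after them, so the array can be consumed without building an unordered_map on every call.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
#include <array>
#include <cstddef>

// Hypothetical fixed-size exec-arg cache: cheaper than an unordered_map on a
// low-end host CPU, at the cost of a fixed slot convention.
struct int4_gemm_args_sketch {
  static constexpr std::size_t kScaleSlot = 0;  // int4 weight scales
  static constexpr std::size_t kZpSlot = 1;     // int4 weight zero points
  std::array<dnnl_exec_arg_t, 8> args{};
  std::size_t count = 2;  // slots 0 and 1 are always reserved for int4 gemm

  void set_scale(const dnnl::memory& m) {
    args[kScaleSlot] = {DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS, m.get()};
  }
  void set_zero_point(const dnnl::memory& m) {
    args[kZpSlot] = {DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS, m.get()};
  }
  void append(int dnnl_arg, const dnnl::memory& m) {
    // e.g. DNNL_ARG_SRC / DNNL_ARG_WEIGHTS / DNNL_ARG_DST
    args[count++] = {dnnl_arg, m.get()};
  }
};
```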
For that test to work, there is an assumption that the memory::desc is constant across multiple calls to execute. Just two notes on that:
- this will no longer be true if you start using the oneDNN runtime_dimension feature (in that case, you will have to call make_args with the proper shape on every execute call).
- if you don't use the oneDNN runtime_dimension feature, you can actually hoist the make_args calls out of this execute call (e.g. into the primitive_ext constructor) and only use set_data_handle here (see the sketch below).
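A sketch of that suggestion under the static-shape assumption (names and structure are illustrative, not the PR's code): the memory objects and the argument map are built once from the primitive_desc, and execute only swaps data handles.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
#include <unordered_map>

// Hypothetical: valid only when descriptors are static (no oneDNN runtime
// dimensions), so the make_args work happens once instead of on every execute.
struct cached_matmul_sketch {
  dnnl::matmul prim;
  dnnl::memory src_m, wei_m, dst_m;  // created once from the queried descriptors
  std::unordered_map<int, dnnl::memory> args;

  cached_matmul_sketch(const dnnl::matmul::primitive_desc& pd, const dnnl::engine& eng)
      : prim(pd),
        src_m(pd.src_desc(), eng, nullptr),
        wei_m(pd.weights_desc(), eng, nullptr),
        dst_m(pd.dst_desc(), eng, nullptr),
        args{{DNNL_ARG_SRC, src_m}, {DNNL_ARG_WEIGHTS, wei_m}, {DNNL_ARG_DST, dst_m}} {}

  void run(dnnl::stream& strm, void* src, void* wei, void* dst) {
    // Descriptors are fixed; only the underlying buffers change between calls.
    src_m.set_data_handle(src);
    wei_m.set_data_handle(wei);
    dst_m.set_data_handle(dst);
    prim.execute(strm, args);
  }
};
```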
Why not take the transpose and the types as parameters and hash them as part of the key?
That would allow a single cache instead of a separate cache for each combination (with separate caches it is harder to control the total cache size).
Also, do you plan to have a separate cache for each primitive?
Yes, the cache key relies only on shape information and nothing else. The other factors can be handled by switch cases, which greatly reduces hash conflicts and speeds up lookup.
Using an unordered_map in primitive_ext would help hide this implementation detail.
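For illustration, a shape-only key might look like the sketch below (the key fields, hash combine, and cache alias are assumptions; the PR's actual scheme may differ). Keeping one such map per transpose/dtype combination is what the switch-case approach amounts to, whereas folding those factors into the key would give a single, easier-to-bound cache.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <initializer_list>
#include <unordered_map>

// Hypothetical shape-only cache key for the int4 GEMM primitive cache.
struct gemm_shape_key {
  int64_t m, n, k, group_size;
  bool operator==(const gemm_shape_key& o) const {
    return m == o.m && n == o.n && k == o.k && group_size == o.group_size;
  }
};

struct gemm_shape_key_hash {
  std::size_t operator()(const gemm_shape_key& key) const {
    std::size_t h = std::hash<int64_t>{}(key.m);
    for (int64_t v : {key.n, key.k, key.group_size})
      h = h * 31 + std::hash<int64_t>{}(v);  // simple polynomial hash combine
    return h;
  }
};

// One cache per (transpose, dtype) combination; only the shape is hashed.
template <class CachedPrimitive>
using shape_keyed_cache =
    std::unordered_map<gemm_shape_key, CachedPrimitive, gemm_shape_key_hash>;
```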
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
a = a_bf16.to(dtype=dtype)
b = b_bf16.to(dtype=dtype)
b_scales = b_scales.to(dtype=dtype)
ref = torch.mm(a, b)
ZhiweiYan-96 Apr 24, 2025 • edited
Please do not change the test logic of the int4 gemm UT.
- GPTQ on XPU is not supported in AO for now; the UT is only intended for the RTN recipe.
- Using the dequantized weight to construct the reference is not recommended; we previously found that it would hide bugs in weight prepack.
- The UT follows the usage of the PyTorch-designed cases, and we want to keep the same logic.
ZhiweiYan-96 commented Apr 24, 2025 • edited
Overall, the PR looks like it deprecates the current in-tree int4 gemm and uses the ipex-style int4 gemm (used by HF and other frameworks) instead. I thought we wanted to introduce the primitive cache in the current int4 gemm, specifically intended for torchAO. Besides, maintaining two sets of int4 gemm may result in further maintenance effort.
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
guangyey commented May 28, 2025
@pytorchbot merge
pytorchmergebot commented May 28, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorchmergebot commented May 28, 2025
Merge failed. Reason: 1 job has failed; the first few of them are: xpu / linux-jammy-xpu-2025.1-py3.9 / test (default, 4, 6, linux.idc.xpu). Details for Dev Infra team: raised by workflow job.
ZhiweiYan-96 commented May 28, 2025
Hi @etaf, is this new error being tracked? Thanks.
etaf commented May 29, 2025
liangan1 commented May 30, 2025 • edited
@pytorchbot label ciflow/xpu
❌ 🤖 pytorchbot command failed: Try
liangan1 commented May 30, 2025
@pytorchbot label ciflow/xpu
liangan1 commented May 30, 2025
@pytorchbot merge
pytorchmergebot commented May 30, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ytorch#147693)
* add onednn primitive cache for int4 gemm for xpu
Pull Request resolved: pytorch#147693
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/guangyey, https://github.com/ZhiweiYan-96
Co-authored-by: Yan, Zhiwei <zhiwei.yan@intel.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal @xmfan @EikanWang @fengyuan14 @guangyey @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @mcarilli @ptrblck @leslie-fang-intel @voznesenskym @penguinwu @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @CaoZhongZ @rogerxfeng8