[Intel GPU] OneDNN primitive cache support for Int4 WOQ gemm on XPU#147693


Closed
baodii wants to merge 9 commits into pytorch:main from baodii:onednn_pri_cache

Conversation

@pytorch-bot
pytorch-bot bot commented Feb 23, 2025 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147693

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 562af36 with merge base 54f1f29:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the module: cpu (CPU specific problem, e.g., perf, algorithm) label Feb 23, 2025
@linux-foundation-easycla
linux-foundation-easycla bot commented Feb 23, 2025 (edited)

CLA Signed

The committers listed above are authorized under a signed CLA.

Collaborator

Suggested change
const c10::optional<Tensor>& bias,
const std::optional<Tensor>& bias,

@EikanWang
Collaborator

@baodii, have you signed the EasyCLA?

baodii changed the title Onednn pri cache → [Intel GPU] OneDNN primitive cache support for Int4 WOQ gemm on XPU Mar 26, 2025


(nit) Do you have plans to use src_desc(), ... queries elsewhere?
If not, why not call query_md() directly in methods that create memory objects (like here)?

Contributor (Author)

Hi, we use src_desc() to mirror the primitive_desc class interface and keep the code consistent with oneDNN.
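
To illustrate, a minimal sketch (the member layout is assumed, not the PR's exact code) of how such a wrapper can forward src_desc() to query_md(), mirroring how oneDNN's own primitive_desc exposes these queries:

#include <oneapi/dnnl/dnnl.hpp>

// Sketch only: a wrapper that mirrors oneDNN's primitive_desc interface
// so that call sites read the same as plain oneDNN code.
struct primitive_ext {
  dnnl::matmul::primitive_desc pd;

  // Thin forwards to query_md(), the same way dnnl::primitive_desc
  // implements its own src_desc()/weights_desc()/dst_desc() helpers.
  dnnl::memory::desc src_desc() const {
    return pd.query_md(dnnl::query::src_md, 0);
  }
  dnnl::memory::desc weights_desc() const {
    return pd.query_md(dnnl::query::weights_md, 0);
  }
  dnnl::memory::desc dst_desc() const {
    return pd.query_md(dnnl::query::dst_md, 0);
  }
};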


(readability) Maybe use a better name than m? (e.g., mem_arg_cache?)


Also, if you decide to hoist make_args calls out of the execute function, you might need to make this an unordered_map.

Contributor (Author)

mem_arg_cache is better than m; I'll take it.
We adopted unordered_map at first, but when we tested on a BMG client with a low-end CPU, unordered_map hurt performance, so we chose the simpler array.


What are slots 0 and 1 reserved for?

Contributor (Author)

Since we only provide int4 gemm this time, slots 0 and 1 are reserved for the scale and zp (zero point) of int4 gemm.
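
To make the slot layout concrete, a minimal sketch of the flat-array argument cache discussed above (the slot names and array size are assumptions for illustration, not the PR's exact code):

#include <oneapi/dnnl/dnnl.hpp>
#include <array>
#include <utility>

// Sketch only: a fixed-size argument cache. A flat array avoids the
// hashing overhead that unordered_map showed on a low-end client CPU.
constexpr int kScaleSlot = 0;  // reserved: int4 weight scales
constexpr int kZpSlot    = 1;  // reserved: int4 zero points
constexpr int kMaxArgs   = 8;  // assumed upper bound on execute arguments

struct mem_arg_cache_t {
  // Each entry pairs a DNNL_ARG_* id with its dnnl::memory object.
  std::array<std::pair<int, dnnl::memory>, kMaxArgs> slots;
  int size = 2;  // slots 0 and 1 are always occupied for int4 gemm
};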


For that test to work, there is an assumption that memory::desc is constant across multiple calls to execute. Just two notes on that:

  • this will no longer be true if you start using the oneDNN runtime_dimension feature (in that case, you will have to call make_args with the proper shape at every execute call).
  • if you don't use the oneDNN runtime_dimension feature, you can actually hoist the make_args calls out of this execute call (e.g. into the primitive_ext constructor) and only use set_data_handle here (see the sketch after this list).
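
A minimal sketch of that second option (assuming static shapes and hypothetical member names): the dnnl::memory objects are built once from the primitive_desc, and execute only repoints their data handles:

#include <oneapi/dnnl/dnnl.hpp>

// Sketch only: descs are constant across execute calls when runtime
// dimensions are not used, so the memory objects can be created once
// and per-call work reduces to repointing data handles.
struct cached_matmul {
  dnnl::matmul prim;
  dnnl::memory src_mem, wei_mem, dst_mem;

  cached_matmul(const dnnl::matmul::primitive_desc& pd, const dnnl::engine& eng)
      : prim(pd),
        src_mem(pd.src_desc(), eng, nullptr),
        wei_mem(pd.weights_desc(), eng, nullptr),
        dst_mem(pd.dst_desc(), eng, nullptr) {}

  void execute(dnnl::stream& strm, void* src, void* wei, void* dst) {
    src_mem.set_data_handle(src);  // no make_args / memory::desc rebuild
    wei_mem.set_data_handle(wei);
    dst_mem.set_data_handle(dst);
    prim.execute(strm, {{DNNL_ARG_SRC, src_mem},
                        {DNNL_ARG_WEIGHTS, wei_mem},
                        {DNNL_ARG_DST, dst_mem}});
  }
};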



Why not take transpose and types as parameters and hash them as part of the key?
This would allow a single cache instead of a separate cache for each combination (with separate caches it is harder to control the total cache size).
Also, do you plan to have a separate cache for each primitive?

Contributor

Yes, the cache key relies only on shape information and nothing else. Other factors can be handled by switch cases, which greatly reduces hash conflicts and speeds up lookup.
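
A minimal sketch of that scheme (the names are illustrative, not the PR's actual code): the key hashes shapes only, and the dtype/transpose combination selects one of several small caches through a switch, so those factors never enter the hash:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Sketch only: shape-only cache key for a GEMM primitive cache.
struct gemm_shape_key {
  int64_t m, n, k;
  bool operator==(const gemm_shape_key& o) const {
    return m == o.m && n == o.n && k == o.k;
  }
};

struct gemm_shape_hash {
  std::size_t operator()(const gemm_shape_key& key) const {
    std::size_t h = std::hash<int64_t>{}(key.m);
    h = h * 31 + std::hash<int64_t>{}(key.n);
    h = h * 31 + std::hash<int64_t>{}(key.k);
    return h;
  }
};

// One cache per (dtype, transpose) variant, chosen by a switch at the
// call site, so different variants can never collide in one table.
template <typename Primitive>
using shape_cache =
    std::unordered_map<gemm_shape_key, Primitive, gemm_shape_hash>;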


Using unordered_map in primitive_ext would help hide this implementation detail.

pytorch-bot bot added the module: amp (automated mixed precision), module: compiled autograd, module: dynamo, module: inductor, module: mkldnn (Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration), NNC, oncall: distributed (Add this issue/PR to distributed oncall triage queue), release notes: quantization, release notes: releng, release notes: distributed (checkpoint), and release notes: inductor (aoti) labels Apr 22, 2025
baodii marked this pull request as ready for review April 24, 2025 06:41
ZhiweiYan-96 marked this pull request as draft April 24, 2025 07:11
ZhiweiYan-96 added the ciflow/xpu (Run XPU CI tasks) label Apr 24, 2025
@pytorch-bot

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot removed the ciflow/xpu (Run XPU CI tasks) label Apr 24, 2025
# Cast the bf16 reference inputs to the target dtype and compute the
# reference result with a plain dense matmul.
a = a_bf16.to(dtype=dtype)
b = b_bf16.to(dtype=dtype)
b_scales = b_scales.to(dtype=dtype)
ref = torch.mm(a, b)
@ZhiweiYan-96 (Collaborator) Apr 24, 2025 (edited)

Please do not change the test logic of the int4 gemm UT.

  1. GPTQ XPU is not supported in AO now. The UT is only intended for the RTN recipe.
  2. Using the dq weight to construct ref is not recommended. We previously found that it would hide bugs in weight prepack.
  3. The UT follows the usage of PyTorch's designed cases; we want to keep the same logic.

@ZhiweiYan-96
Collaborator
ZhiweiYan-96 commented Apr 24, 2025 (edited)

Overall, the PR looks like it deprecates the current in-tree int4 gemm and uses the ipex-style int4 gemm (used in HF and other frameworks) instead. I thought we wanted to introduce the primitive cache in the current int4 gemm, specifically intended for torchAO. Besides, maintaining two sets of int4 gemm may result in further maintenance effort.

pytorch-bot bot removed the ciflow/xpu (Run XPU CI tasks) label May 28, 2025
guangyey added the ciflow/xpu (Run XPU CI tasks) and ciflow/trunk (Trigger trunk jobs on your pull request) labels May 28, 2025
@pytorch-bot

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot removed the ciflow/trunk (Trigger trunk jobs on your pull request) label May 28, 2025
guangyey added the ciflow/trunk (Trigger trunk jobs on your pull request) label May 28, 2025
@guangyey
Collaborator

@pytorchbot merge


@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: xpu / linux-jammy-xpu-2025.1-py3.9 / test (default, 4, 6, linux.idc.xpu)

Details for Dev Infra team. Raised by workflow job.

ZhiweiYan-96 added the keep-going (Don't stop on first failure, keep running tests until the end) label May 28, 2025
@ZhiweiYan-96
Collaborator

Hi @etaf, is this new error being tracked? Thanks.

@etaf
Collaborator

> Hi @etaf, is this new error being tracked? Thanks.

Yes: #154514

ZhiweiYan-96 added and then removed the keep-going (Don't stop on first failure, keep running tests until the end) label May 29, 2025
@liangan1
Contributor
liangan1 commented May 30, 2025 (edited)

@pytorchbot label ciflow/xpu

@pytorch-bot

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'ciflow/xpu' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')
usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@liangan1
Contributor

@pytorchbot label ciflow/xpu


@liangan1
Contributor

@pytorchbot merge


@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

github-project-automation bot moved this from Approved to Done in PyTorch Intel May 30, 2025
iupaikov-amd pushed a commit to ROCm/pytorch that referenced this pull request Jun 4, 2025

[Intel GPU] OneDNN primitive cache support for Int4 WOQ gemm on XPU (pytorch#147693)

* add onednn primitive cache for int4 gemm for xpu

Pull Request resolved: pytorch#147693
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/guangyey, https://github.com/ZhiweiYan-96
Co-authored-by: Yan, Zhiwei <zhiwei.yan@intel.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>

Reviewers

@guangyey approved these changes

@ZhiweiYan-96 approved these changes

@EikanWang approved these changes

@gujinghui Awaiting requested review from gujinghui (code owner)

+3 more reviewers

@CaoZhongZ left review comments

@mgouicem left review comments

@liangan1 approved these changes

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), ciflow/xpu (Run XPU CI tasks), keep-going (Don't stop on first failure, keep running tests until the end), Merged, module: cpu (CPU specific problem, e.g., perf, algorithm), module: mkldnn (Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration), module: xpu (Intel XPU related issues), open source, release notes: xpu (release notes category), topic: not user facing (topic category)

Projects

Status: Done

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

10 participants

@baodii @EikanWang @ZhiweiYan-96 @pytorchmergebot @liangan1 @guangyey @etaf @CaoZhongZ @mgouicem @pytorchbot
