[Intel GPU] Enable safe softmax for XPU SDPA #151999
Conversation
pytorch-bot commented Apr 23, 2025 • edited
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151999
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 3 Pending. As of commit 7ba3ce8 with merge base 9f5153b. NEW FAILURES: the following jobs have failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
guangyey commented Apr 24, 2025
Could you elaborate in the PR description on the issues we would encounter if this PR were not applied? And give a test case if possible.
LuFinch commented Apr 24, 2025
@guangyey Updated the PR description and added a UT.
guangyey commented Apr 25, 2025
Thanks for the update.
LuFinch commented Jun 4, 2025 • edited
@guangyey oneDNN has been upgraded to v3.8. This PR is ready to merge. Could you help review and trigger CI?
guangyey commented Jun 13, 2025
@pytorchbot merge
pytorchmergebot commented Jun 13, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorchmergebot commented Jun 13, 2025
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
guangyey commented Jun 13, 2025
@pytorchbot merge
pytorchmergebot commented Jun 13, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorchmergebot commented Jun 13, 2025
Merge failed. Reason: 3 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
guangyey commented Jun 13, 2025
@pytorchbot merge -f "lint is green, XPU CI pass, ignore unrelated failure and queuing rocm CI"
pytorchmergebot commented Jun 13, 2025
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fix intel/torch-xpu-ops#1432 (comment)

When one row of the Q*K attention score is masked with -inf, softmax(score) would output NaN for the whole row, which would corrupt the model output. With this new flag, it outputs 0 for the whole row, which is aligned with PyTorch CPU/CUDA behavior (see the sketch below).

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @gujinghui @fengyuan14 @guangyey
0for whole row which is aligned with Pytorch CPU/CUDA's behavior.cc@jgong5@mingfeima@XiaobingSuper@sanchitintel@ashokei@jingxu10@jerryzh168@voznesenskym@penguinwu@EikanWang@Guobing-Chen@zhuhaozhe@blzheng@wenzhe-nrv@jiayisunx@ipiszy@chenyang78@kadeng@muchulee8@amjames@chauhang@aakhundov@gujinghui@fengyuan14@guangyey