[FSDP2] allow different dtypes for no grad model params #154103


Closed
xuantengh wants to merge 6 commits into pytorch:main from xuantengh:fsdp-relax

Conversation

@xuantengh (Contributor) commented May 22, 2025 · edited by pytorch-bot
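The PR title and the linked issue describe relaxing FSDP2's uniform-dtype assertion so that parameters with requires_grad=False may use a different dtype from the trainable ones. A minimal sketch of that scenario, assuming torch.distributed is already initialized; the module shapes and dtypes here are illustrative, not taken from the PR:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 API

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))

# Freeze one submodule and keep it in a lower-precision dtype;
# the trainable parameters stay in float32.
model[0].requires_grad_(False)
model[0].to(torch.bfloat16)

# fully_shard previously asserted a single uniform dtype across the
# parameter group; with this change, a differing dtype is tolerated
# as long as those parameters have requires_grad=False.
fully_shard(model)
```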

@pytorch-bot (bot) commented May 22, 2025 · edited

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154103

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Cancelled Jobs

As of commit 1118893 with merge base 975bbc6:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) and release notes: distributed (fsdp) (release notes category) labels May 22, 2025
@awgu (Collaborator)

I think it's worth thinking about what the error/behavior will look like if someone unfreezes their parameters with different dtypes after init time and then tries to run backward.
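To make that failure mode concrete, a sketch continuing the snippet above (illustrative only; the manual casts keep the forward pass type-consistent):

```python
import torch

# Unfreeze the bfloat16 layer after fully_shard() has already run.
model[0].requires_grad_(True)

x = torch.randn(4, 16)
h = model[0](x.to(torch.bfloat16))  # layer that was frozen at init time
out = model[1](h.float())           # float32 layer; the cast keeps autograd intact
out.sum().backward()                # gradient reduction now sees bf16 and fp32 grads
```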


@Skylion007 (Collaborator) commented May 22, 2025 · edited

Also curious whether this is an issue in FSDPv1? I think it might be.

@weifengpy (Contributor)

> I think it's worth thinking about what the error/behavior will look like if someone unfreezes their parameters with different dtypes after init time and then tries to run backward.

Good catch! We probably need to check requires_grad again in lazy_init.
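A hypothetical sketch of what such a re-check could look like; the function and argument names are invented for illustration and are not FSDP2's actual internals:

```python
import torch

def recheck_requires_grad(params: list[torch.nn.Parameter],
                          reduce_dtype: torch.dtype) -> None:
    # Fail fast if a parameter that was frozen when dtypes were recorded
    # has since been unfrozen: its gradient would not match the dtype the
    # group's reduce-scatter expects.
    for p in params:
        if p.requires_grad and p.dtype != reduce_dtype:
            raise RuntimeError(
                f"parameter of dtype {p.dtype} now requires grad, but the "
                f"group reduces gradients in {reduce_dtype}; mixed dtypes "
                "are only supported for frozen parameters"
            )
```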

@weifengpy (Contributor) left a review comment · edited

remove unrelated import and check requires_grad again in lazy_init

@awgu (Collaborator)

@weifengpy I was also suggesting to understand what kind of error gets raised. Unless you check every time (which I do not necessarily recommend), there is always a chance a user changes the requires_grad compared to when you last checked it. It might be good to know what kind of error naturally gets raised in that case. If it hangs, then that might imply you want to do something differently.


@xuantengh (Contributor, Author)

> remove unrelated import

Done, I think it was unexpectedly introduced by my editor's linter 😂. Also fixed the existing test failures.

@xuantengh (Contributor, Author) commented May 23, 2025 · edited

I've tested mixed-dtype model backward locally; it raises an error here:
https://github.com/pytorch/pytorch/blob/ec04186d3ca4ec9baca6dbdfa2acaf5eabf0f33d/torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py#L375-L381

Edit: we may need to invoke the reduce op multiple times for grads with different dtypes.
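A hedged sketch of that idea (not what the PR implements; the helper is invented): bucket the unsharded gradients by dtype and issue one reduce-scatter per bucket instead of assuming a single uniform dtype:

```python
from collections import defaultdict

import torch
import torch.distributed as dist

def reduce_scatter_grads_by_dtype(grads: list[torch.Tensor],
                                  group: dist.ProcessGroup) -> None:
    # Group gradients by dtype, then reduce each group separately.
    buckets: dict[torch.dtype, list[torch.Tensor]] = defaultdict(list)
    for g in grads:
        buckets[g.dtype].append(g)
    for dtype, bucket in buckets.items():
        # One flat buffer per dtype; assumes numel divides evenly by the
        # group size, which FSDP's sharded layout arranges via padding.
        flat = torch.cat([g.flatten() for g in bucket])
        out = flat.new_empty(flat.numel() // group.size())
        dist.reduce_scatter_tensor(out, flat, op=dist.ReduceOp.AVG, group=group)
```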


@weifengpy (Contributor) left a review comment

Accepted. Agree that trainable_params would be a better name.

@xuantengh (Contributor, Author)

So in this PR, do we intend to handle the situation where the frozen params are activated again after fully_shard and the backward pass is run?

@awgu (Collaborator)

@xuantengh Personally, I think it's fine if that's not handled in this PR.


@weifengpy (Contributor)

> So in this PR, do we intend to handle the situation where the frozen params are activated again after fully_shard and the backward pass is run?

No need. As you mentioned, we get a clear runtime error instead of an NCCL hang; that's already enough.


@xuantengh (Contributor, Author)

Hi, are there any further steps needed to get this PR merged?

@weifengpy (Contributor)

@pytorchmergebot merge


@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label May 29, 2025
@weifengpy added the release notes: distributed (fsdp2) (release notes category) label and removed the release notes: distributed (fsdp) (release notes category) label May 29, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@xuantengh (Contributor, Author)

Seems like one test timed out.

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 job failed; the first few are: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4)

Details for Dev Infra team: raised by workflow job

@weifengpy (Contributor)

@pytorchmergebot -i

@pytorch-bot

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: -i
usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@weifengpy (Contributor)

@pytorchmergebot -i

@pytorch-bot

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: -i
usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@weifengpy (Contributor)

@pytorchmergebot merge -f "unrelated failure"


@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i / --ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.


Reviewers

@awgu left review comments

@weifengpy approved these changes

@mori360 (awaiting requested review)

@Skylion007 (awaiting requested review)

Assignees

No one assigned

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: fsdp, oncall: distributed (Add this issue/PR to distributed oncall triage queue), open source, release notes: distributed (fsdp2) (release notes category)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

[FSDP2] relax uniform dtype assertion for requires_grad=False

7 participants

@xuantengh @awgu @Skylion007 @weifengpy @pytorchmergebot @H-Huang @pytorchbot
