[FSDP2] allow different dtypes for no grad model params #154103
Conversation
pytorch-bot bot commented May 22, 2025 • edited
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154103
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 Cancelled Jobs as of commit 1118893 with merge base 975bbc6. The following jobs were cancelled; please retry.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
awgu commented May 22, 2025
I think it's worth thinking about what the error/behavior will look like if someone unfreezes their parameters with different dtypes after init time and then tries to run backward.
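For illustration, a minimal eager-mode sketch of that scenario (module names are hypothetical; the fully_shard call and distributed setup are omitted): a submodule kept frozen in bf16 at init time is unfrozen later and a backward pass is run.

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.frozen = nn.Linear(16, 16).to(torch.bfloat16)  # frozen branch in bf16
        self.trainable = nn.Linear(16, 16)                   # trainable branch in fp32

    def forward(self, x):
        h = self.frozen(x.to(torch.bfloat16)).float()
        return self.trainable(h)

model = Toy()
model.frozen.requires_grad_(False)   # frozen in a different dtype at init time

# fully_shard(model) would be called here (process-group setup omitted)

model.frozen.requires_grad_(True)    # user unfreezes after init

loss = model(torch.randn(4, 16)).sum()
loss.backward()  # under FSDP2, the bf16 grads would now have to join the fp32 reduce path
```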
Skylion007 commented May 22, 2025 • edited
Also curious whether this is an issue on FSDPv1 as well? I think it might be.
weifengpy commented May 22, 2025
Good catch! We probably need to check requires_grad again in lazy_init.
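Roughly, that re-check could look like the hypothetical helper below (illustrative only, not the actual FSDP2 lazy_init code): frozen params may keep any dtype, but params that still require grad must agree on one.

```python
import torch

def _check_trainable_dtypes(params: list[torch.nn.Parameter]) -> None:
    # Hypothetical re-validation at lazy init: only trainable params must share
    # a dtype, since their grads are flattened into a single reduce buffer.
    dtypes = {p.dtype for p in params if p.requires_grad}
    if len(dtypes) > 1:
        raise ValueError(
            f"expected a uniform dtype for trainable params, got {sorted(str(d) for d in dtypes)}"
        )
```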
weifengpy left a comment • edited
Remove the unrelated import and check requires_grad again in lazy_init.
awgu commented May 23, 2025
@weifengpy I was also suggesting that we understand what kind of error gets raised. Unless you check every time (which I do not necessarily recommend), there is always a chance a user changes the requires_grad state after lazy init.
xuantengh commented May 23, 2025
Done. I think it was unexpectedly introduced by my editor's linter 😂. Also fixed the existing test failures.
xuantengh commented May 23, 2025 • edited
I've tested mixed-dtype model backward locally; it raises an error due to:

Edit: we may need to invoke the reduce op multiple times for grads with different dtypes.
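As a rough sketch of that edit (hypothetical helper; assumes torch.distributed is already initialized): gradients of different dtypes cannot share one flat buffer, so the collective would have to run once per dtype bucket.

```python
from collections import defaultdict

import torch
import torch.distributed as dist

def reduce_grads_per_dtype(grads: list[torch.Tensor], group=None) -> None:
    # Bucket grads by dtype and issue one collective per bucket, since a single
    # flattened buffer cannot mix dtypes.
    buckets = defaultdict(list)
    for g in grads:
        buckets[g.dtype].append(g)
    for bucket in buckets.values():
        flat = torch.cat([g.reshape(-1) for g in bucket])
        dist.all_reduce(flat, group=group)  # one reduce call per dtype
        offset = 0
        for g in bucket:  # copy the reduced values back into each grad
            n = g.numel()
            g.copy_(flat[offset:offset + n].view_as(g))
            offset += n
```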
weifengpy left a comment
Accepted. Agree that trainable_params would be a better name.
xuantengh commented May 24, 2025
So in this PR, do we intend to handle the situation where frozen params are activated again after lazy_init?
awgu commented May 24, 2025
@xuantengh Personally I think it's fine if that is not handled in this PR.
weifengpy commented May 24, 2025
No need. As you mentioned, we get a clear runtime error instead of an NCCL hang; that's already enough.
xuantengh commented May 28, 2025
Hi, are there any further steps needed to get this PR merged?
weifengpy commented May 29, 2025
@pytorchmergebot merge
pytorchmergebot commented May 29, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorchmergebot commented May 29, 2025
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
xuantengh commented May 30, 2025
Seems like one test timed out.
pytorchmergebot commented May 30, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorchmergebot commented May 30, 2025
Merge failed. Reason: 1 job has failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4). Details for Dev Infra team: raised by workflow job.
weifengpy commented May 30, 2025
❌ 🤖 pytorchbot command failed. Try @pytorchbot --help for more info.
weifengpy commented May 30, 2025
❌ 🤖 pytorchbot command failed. Try @pytorchbot --help for more info.
weifengpy commented May 30, 2025
@pytorchmergebot merge -f "unrelated failure"
pytorchmergebot commented May 30, 2025
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes pytorch#154082
Pull Request resolved: pytorch#154103
Approved by: https://github.com/weifengpy
Fixes #154082
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360
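For context, a minimal sketch of the use case this change enables (illustrative module names; the fully_shard import path differs across PyTorch versions, and device-mesh/process-group setup is omitted): bf16 frozen parameters alongside fp32 trainable parameters under a single fully_shard call.

```python
import torch
import torch.nn as nn
# FSDP2 entry point; on some versions it lives under torch.distributed._composable.fsdp
from torch.distributed.fsdp import fully_shard

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(64, 64).to(torch.bfloat16)  # frozen, bf16
        self.head = nn.Linear(64, 10)                          # trainable, fp32

    def forward(self, x):
        return self.head(self.backbone(x.to(torch.bfloat16)).float())

model = Model()
model.backbone.requires_grad_(False)

# Previously this raised because the param group mixed dtypes; with this PR the
# differing dtype is accepted because those params do not require grad.
fully_shard(model)  # assumes init_process_group()/device mesh were set up beforehand
```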