[FSDP2] allow different dtypes for no grad model params #154103
Conversation
pytorch-bot bot commented May 22, 2025 • edited
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154103
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 Cancelled Jobs as of commit 1118893 with merge base 975bbc6. The following jobs were cancelled; please retry.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
awgu commented May 22, 2025
I think it's worth thinking about what the error/behavior will look like if someone unfreezes their parameters with different dtypes after init time and then tries to run backward.
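For illustration, a minimal eager-mode sketch of that scenario (module names are hypothetical; the fully_shard call and distributed setup are omitted): a submodule kept frozen in bf16 at init time is unfrozen later and a backward pass is run.

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.frozen = nn.Linear(16, 16).to(torch.bfloat16)  # frozen branch in bf16
        self.trainable = nn.Linear(16, 16)                   # trainable branch in fp32

    def forward(self, x):
        h = self.frozen(x.to(torch.bfloat16)).float()
        return self.trainable(h)

model = Toy()
model.frozen.requires_grad_(False)   # frozen in a different dtype at init time

# fully_shard(model) would be called here (process-group setup omitted)

model.frozen.requires_grad_(True)    # user unfreezes after init

loss = model(torch.randn(4, 16)).sum()
loss.backward()  # under FSDP2, the bf16 grads would now have to join the fp32 reduce path
```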
Skylion007 commented May 22, 2025 • edited
Also curious whether this is an issue on FSDPv1 as well? I think it might be.
weifengpy commented May 22, 2025
Good catch! We probably need to check requires_grad again in lazy_init.
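Roughly, that re-check could look like the hypothetical helper below (illustrative only, not the actual FSDP2 lazy_init code): frozen params may keep any dtype, but params that still require grad must agree on one.

```python
import torch

def _check_trainable_dtypes(params: list[torch.nn.Parameter]) -> None:
    # Hypothetical re-validation at lazy init: only trainable params must share
    # a dtype, since their grads are flattened into a single reduce buffer.
    dtypes = {p.dtype for p in params if p.requires_grad}
    if len(dtypes) > 1:
        raise ValueError(
            f"expected a uniform dtype for trainable params, got {sorted(str(d) for d in dtypes)}"
        )
```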
weifengpy left a comment • edited
Remove the unrelated import and check requires_grad again in lazy_init.
awgu commented May 23, 2025
@weifengpy I was also suggesting that we understand what kind of error gets raised. Unless you check every time (which I do not necessarily recommend), there is always a chance a user changes the requires_grad state after lazy init.
xuantengh commented May 23, 2025
Done. I think it was unexpectedly introduced by my editor's linter 😂. Also fixed the existing test failures.
xuantengh commented May 23, 2025 • edited
I've tested mixed-dtype model backward locally; it raises an error due to:

Edit: we may need to invoke the reduce op multiple times for grads with different dtypes.
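As a rough sketch of that edit (hypothetical helper; assumes torch.distributed is already initialized): gradients of different dtypes cannot share one flat buffer, so the collective would have to run once per dtype bucket.

```python
from collections import defaultdict

import torch
import torch.distributed as dist

def reduce_grads_per_dtype(grads: list[torch.Tensor], group=None) -> None:
    # Bucket grads by dtype and issue one collective per bucket, since a single
    # flattened buffer cannot mix dtypes.
    buckets = defaultdict(list)
    for g in grads:
        buckets[g.dtype].append(g)
    for bucket in buckets.values():
        flat = torch.cat([g.reshape(-1) for g in bucket])
        dist.all_reduce(flat, group=group)  # one reduce call per dtype
        offset = 0
        for g in bucket:  # copy the reduced values back into each grad
            n = g.numel()
            g.copy_(flat[offset:offset + n].view_as(g))
            offset += n
```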
weifengpy left a comment
Accepted. Agree that trainable_params would be a better name.
xuantengh commented May 24, 2025
So in this PR, do we intend to handle the situation where frozen params are activated again after lazy_init?
awgu commented May 24, 2025
@xuantengh Personally I think it's fine if that is not handled in this PR.
weifengpy commented May 24, 2025
No need. As you mentioned, we get a clear runtime error instead of an NCCL hang; that's already enough.
xuantengh commented May 28, 2025
Hi, are there any further steps needed to get this PR merged?
weifengpy commented May 29, 2025
@pytorchmergebot merge
pytorchmergebot commented May 29, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorchmergebot commented May 29, 2025
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
xuantengh commented May 30, 2025
Seems like one test timed out.
pytorchmergebot commented May 30, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorchmergebot commented May 30, 2025
Merge failed. Reason: 1 job has failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4). Details for Dev Infra team: raised by workflow job.
weifengpy commented May 30, 2025
❌ 🤖 pytorchbot command failed. Try @pytorchbot --help for more info.
weifengpy commented May 30, 2025
❌ 🤖 pytorchbot command failed. Try @pytorchbot --help for more info.
weifengpy commented May 30, 2025
@pytorchmergebot merge -f "unrelated failure"
pytorchmergebot commented May 30, 2025
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes pytorch#154082
Pull Request resolved: pytorch#154103
Approved by: https://github.com/weifengpy
Fixes #154082
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360
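For context, a minimal sketch of the use case this change enables (illustrative module names; the fully_shard import path differs across PyTorch versions, and device-mesh/process-group setup is omitted): bf16 frozen parameters alongside fp32 trainable parameters under a single fully_shard call.

```python
import torch
import torch.nn as nn
# FSDP2 entry point; on some versions it lives under torch.distributed._composable.fsdp
from torch.distributed.fsdp import fully_shard

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(64, 64).to(torch.bfloat16)  # frozen, bf16
        self.head = nn.Linear(64, 10)                          # trainable, fp32

    def forward(self, x):
        return self.head(self.backbone(x.to(torch.bfloat16)).float())

model = Model()
model.backbone.requires_grad_(False)

# Previously this raised because the param group mixed dtypes; with this PR the
# differing dtype is accepted because those params do not require grad.
fully_shard(model)  # assumes init_process_group()/device mesh were set up beforehand
```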