Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

[C10D] Document object collectives limitations#150815

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Closed
wconstab wants to merge6 commits intogh/wconstab/405/basefromgh/wconstab/405/head

Conversation

@wconstab
Copy link
Contributor

@wconstabwconstab commentedApr 8, 2025
edited
Loading

Stack fromghstack (oldest at bottom):

Adds louder warning labels in the doc page and docstring for object
collectives in hopes of raising awareness of several footgun issues
including accidental creation of cuda contexts by serializing and
sending 'device-local' gpu tensors over the object-* apis.

Preview:
image

addresses#150798

cc@H-Huang@awgu@wanchaol@fegin@fduwjj@wz337@d4l3k

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.[ghstack-poisoned]
@pytorch-bot
Copy link

🔗 Helpful Links

🧪 See artifacts and rendered test results athud.pytorch.org/pr/150815

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-botpytorch-botbot added oncall: distributedAdd this issue/PR to distributed oncall triage queue release notes: distributed (c10d)release notes category labelsApr 8, 2025
wconstab added a commit that referenced this pull requestApr 8, 2025
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.ghstack-source-id:30848a3Pull Requestresolved:#150815
@pytorch-bot
Copy link

pytorch-botbot commentedApr 8, 2025
edited
Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results athud.pytorch.org/pr/150815

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit87efa14 with merge base91d1826 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]
@wconstab
Copy link
ContributorAuthor

@pytorchbot help

@pytorch-bot
Copy link

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'help' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try@pytorchbot --help for more info.

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]
Copy link
Collaborator

@kwen2501kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks much for writing the paragraph! We need more foundational work like this!

Comment on lines 581 to 582
the sending rank(s) had to pickle. This has led to one class of issues where users report nccl watchdog timeouts caused
by one rank spending a really long time unpickling data.
Copy link
Collaborator

@kwen2501kwen2501Apr 9, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Just thinking out loud: why would the unpickle time lead to watchdog timeout? I thought the collective has finished at that point? Or are you saying that it may impact the next collective because this gathering rank would join late?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

yea, its probably always an unrelated collective that actually times out, caused by desync, cuased by unpickling

Comment on lines 584 to 593
**Inefficient tensor communication** - Tensors should be sent via regular collective APIs, not object collective APIs.
It is possible to send Tensors via object collective APIs, but they will be serialized and deserialized (including a
cpu-sync and device-to-host copy in the case of non-cpu tensors), and in almost every case other than debugging or
troubleshooting code, it would be worth the trouble to refactor the code to use non-object collectives instead.

**Unexpected tensor devices** - If you still want to send tensors via object collectives, there is another aspect
specific to cuda (and possibly other accelerators) tensors. If you pickle a tensor that is currently on `cuda:3`, and
then unpickle it, you will get another tensor on `cuda:3` *regardless of which process you are on, or which cuda device
is the 'default' device for that process*. With regular tensor collective APIs, 'output tensors' will always be on the
same, local device, which is generally what you'd expect.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

nit: maybe merge "Inefficient tensor communication" and "Unexpected tensor devices" into one section and name it "Not for tensors"

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

i merged the latter two instead. i think inefficient tensor comm applies to cpu and cuda tensors (any tensors) due to pickling overhead, and the latter sections apply to cuda specifically.


**Unintended cuda-context creation** - Unpickling a tensor will implicity activate a cuda context if it is the first
time a GPU is used by the process, which can waste significant amounts of GPU memory. This issue can be avoided by
moving tensors to CPU before passing them as inputs to an object collective.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It seems "Unintended cuda-context creation" is really related to "Unexpected tensor devices" above. So I recommend merging sections too.
Could the sentence "This issue can be avoided by moving tensors to CPU before passing them as inputs to an object collective" sound like we encourage the users to do so (while we actually don't)?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

well, i said "worked around" not "should". I think that the best I can do is outline all the risks, and if you still want to pass tensors in we can at least make it less bad.

Copy link
Collaborator

@awguawgu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

proof-reading-only pass

(some typos and python/cpu/cuda -> Python/CPU/CUDA)

wconstab reacted with heart emoji
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]
@wconstab
Copy link
ContributorAuthor

@pytorchbot merge

pytorch-bot[bot] reacted with thumbs up emoji

@pytorch-botpytorch-botbot added the ciflow/trunkTrigger trunk jobs on your pull request labelApr 10, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in thewiki.

Questions? Feedback? Please reach out to thePyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

timocafe pushed a commit to timocafe/pytorch that referenced this pull requestApr 16, 2025
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addressespytorch#150798Pull Requestresolved:pytorch#150815Approved by:https://github.com/kwen2501
amathewc pushed a commit to amathewc/pytorch that referenced this pull requestApr 17, 2025
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addressespytorch#150798Pull Requestresolved:pytorch#150815Approved by:https://github.com/kwen2501
Divigroup-RAP pushed a commit to Divigroup-RAP/PYTORCH that referenced this pull requestApr 22, 2025
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.ghstack-source-id:0945121Pull Requestresolved:pytorch/pytorch#150815
@github-actionsgithub-actionsbot deleted the gh/wconstab/405/head branchMay 16, 2025 02:19
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@svekarssvekarssvekars left review comments

@awguawguawgu left review comments

@kwen2501kwen2501kwen2501 approved these changes

Assignees

No one assigned

Labels

ciflow/trunkTrigger trunk jobs on your pull requestMergedoncall: distributedAdd this issue/PR to distributed oncall triage queuerelease notes: distributed (c10d)release notes category

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

6 participants

@wconstab@pytorchmergebot@svekars@kwen2501@awgu

[8]ページ先頭

©2009-2025 Movatter.jp