NotificationsYou must be signed in to change notification settings
Fork26.3k
Star96k

[C10D] Document object collectives limitations#150815

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Closed

wconstab wants to merge6 commits intogh/wconstab/405/basefromgh/wconstab/405/head

Closed

[C10D] Document object collectives limitations#150815

wconstab wants to merge6 commits intogh/wconstab/405/basefromgh/wconstab/405/head

Conversation

Copy link

Contributor

wconstab commentedApr 8, 2025•
edited
Loading

Stack fromghstack (oldest at bottom):

Adds louder warning labels in the doc page and docstring for object
collectives in hopes of raising awareness of several footgun issues
including accidental creation of cuda contexts by serializing and
sending 'device-local' gpu tensors over the object-* apis.

Preview:

addresses#150798

cc@H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k

[C10D] Document object collectives limitations

ba0d77d

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.[ghstack-poisoned]

Copy link

pytorch-botbot commentedApr 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results athud.pytorch.org/pr/150815

📄 PreviewPython docs built from this PR
📄 PreviewC++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit thebot commands wiki or ouroffice hours

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-botbot added oncall: distributed

Add this issue/PR to distributed oncall triage queue

release notes: distributed (c10d)release notes category labels

Apr 8, 2025

wconstab added a commit that referenced this pull request

Apr 8, 2025

[C10D] Document object collectives limitations

64bac90

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.ghstack-source-id:30848a3Pull Requestresolved:#150815

Copy link

pytorch-botbot commentedApr 8, 2025•
edited
Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results athud.pytorch.org/pr/150815

📄 PreviewPython docs built from this PR
📄 PreviewC++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit thebot commands wiki or ouroffice hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit87efa14 with merge base91d1826 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Update on "[C10D] Document object collectives limitations"

6a1ec12

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]

Copy link

ContributorAuthor

wconstab commentedApr 8, 2025

@pytorchbot help

Copy link

pytorch-botbot commentedApr 8, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'help' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try@pytorchbot --help for more info.

Update on "[C10D] Document object collectives limitations"

a50941d

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]

wconstab mentioned this pull request

Apr 8, 2025

Document non-pytorch CUDA memory allocation and how to query it#150880

Closed

Update on "[C10D] Document object collectives limitations"

9591625

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]

kwen2501 approved these changes

Apr 9, 2025

View reviewed changes

Copy link

Collaborator

kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks much for writing the paragraph! We need more foundational work like this!

docs/source/distributed.rst OutdatedShow resolvedHide resolved

docs/source/distributed.rst Outdated

Comment on lines 581 to 582

		the sending rank(s) had to pickle. This has led to one class of issues where users report nccl watchdog timeouts caused
		by one rank spending a really long time unpickling data.

Copy link

Collaborator

kwen2501Apr 9, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Just thinking out loud: why would the unpickle time lead to watchdog timeout? I thought the collective has finished at that point? Or are you saying that it may impact the next collective because this gathering rank would join late?

Copy link

ContributorAuthor

wconstabApr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

yea, its probably always an unrelated collective that actually times out, caused by desync, cuased by unpickling

docs/source/distributed.rst

Comment on lines 584 to 593

		Inefficient tensor communication - Tensors should be sent via regular collective APIs, not object collective APIs.
		It is possible to send Tensors via object collective APIs, but they will be serialized and deserialized (including a
		cpu-sync and device-to-host copy in the case of non-cpu tensors), and in almost every case other than debugging or
		troubleshooting code, it would be worth the trouble to refactor the code to use non-object collectives instead.

		Unexpected tensor devices - If you still want to send tensors via object collectives, there is another aspect
		specific to cuda (and possibly other accelerators) tensors. If you pickle a tensor that is currently on `cuda:3`, and
		then unpickle it, you will get another tensor on `cuda:3` *regardless of which process you are on, or which cuda device
		is the 'default' device for that process*. With regular tensor collective APIs, 'output tensors' will always be on the
		same, local device, which is generally what you'd expect.

Copy link

Collaborator

kwen2501Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

nit: maybe merge "Inefficient tensor communication" and "Unexpected tensor devices" into one section and name it "Not for tensors"

Copy link

ContributorAuthor

wconstabApr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

i merged the latter two instead. i think inefficient tensor comm applies to cpu and cuda tensors (any tensors) due to pickling overhead, and the latter sections apply to cuda specifically.

docs/source/distributed.rst


		Unintended cuda-context creation - Unpickling a tensor will implicity activate a cuda context if it is the first
		time a GPU is used by the process, which can waste significant amounts of GPU memory. This issue can be avoided by
		moving tensors to CPU before passing them as inputs to an object collective.

Copy link

Collaborator

kwen2501Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It seems "Unintended cuda-context creation" is really related to "Unexpected tensor devices" above. So I recommend merging sections too.
Could the sentence "This issue can be avoided by moving tensors to CPU before passing them as inputs to an object collective" sound like we encourage the users to do so (while we actually don't)?

Copy link

ContributorAuthor

wconstabApr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

well, i said "worked around" not "should". I think that the best I can do is outline all the risks, and if you still want to pass tensors in we can at least make it less bad.

awgu reviewed

Apr 9, 2025

View reviewed changes

Copy link

Collaborator

awgu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

proof-reading-only pass

(some typos and python/cpu/cuda -> Python/CPU/CUDA)

docs/source/distributed.rst OutdatedShow resolvedHide resolved

Update on "[C10D] Document object collectives limitations"

3e5bd4b

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]

svekars reviewed

Apr 10, 2025

View reviewed changes

torch/distributed/distributed_c10d.py OutdatedShow resolvedHide resolved

Update on "[C10D] Document object collectives limitations"

87efa14

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]

Copy link

ContributorAuthor

wconstab commentedApr 10, 2025

@pytorchbot merge

pytorch-botbot added the ciflow/trunkTrigger trunk jobs on your pull request label

Apr 10, 2025

pytorchmergebot added the merging label

Apr 10, 2025

Copy link

Collaborator

pytorchmergebot commentedApr 10, 2025

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in thewiki.

Questions? Feedback? Please reach out to thePyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot added the Merged label

Apr 10, 2025

pytorchmergebot closed this inc9a35c2

Apr 10, 2025

pytorchmergebot removed the merging label

Apr 10, 2025

wconstab mentioned this pull request

Apr 11, 2025

all_gather_object creates context for each gpu multiple times (leaks memory)#150798

Closed

timocafe pushed a commit to timocafe/pytorch that referenced this pull request

Apr 16, 2025

[C10D] Document object collectives limitations (pytorch#150815)

4c7c606

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addressespytorch#150798Pull Requestresolved:pytorch#150815Approved by:https://github.com/kwen2501

amathewc pushed a commit to amathewc/pytorch that referenced this pull request

Apr 17, 2025

[C10D] Document object collectives limitations (pytorch#150815)

799f033

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addressespytorch#150798Pull Requestresolved:pytorch#150815Approved by:https://github.com/kwen2501

Divigroup-RAP pushed a commit to Divigroup-RAP/PYTORCH that referenced this pull request

Apr 22, 2025

[C10D] Document object collectives limitations

610f614

Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.ghstack-source-id:0945121Pull Requestresolved:pytorch/pytorch#150815

github-actionsbot deleted the gh/wconstab/405/head branch

May 16, 2025 02:19

Labels

ciflow/trunk

Trigger trunk jobs on your pull request

Merged oncall: distributed

Add this issue/PR to distributed oncall triage queue

release notes: distributed (c10d)

release notes category

Movatterモバイル変換

[C10D] Document object collectives limitations#150815

[C10D] Document object collectives limitations#150815

Uh oh!

Conversation

wconstab commentedApr 8, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

pytorch-botbot commentedApr 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results athud.pytorch.org/pr/150815

Uh oh!

pytorch-botbot commentedApr 8, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results athud.pytorch.org/pr/150815

✅ No Failures

Uh oh!

wconstab commentedApr 8, 2025

Uh oh!

pytorch-botbot commentedApr 8, 2025

Uh oh!

kwen2501 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kwen2501Apr 9, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wconstabApr 9, 2025

Choose a reason for hiding this comment

Uh oh!

kwen2501Apr 9, 2025

Choose a reason for hiding this comment

Uh oh!

wconstabApr 9, 2025

Choose a reason for hiding this comment

Uh oh!

kwen2501Apr 9, 2025

Choose a reason for hiding this comment

Uh oh!

wconstabApr 9, 2025

Choose a reason for hiding this comment

Uh oh!

awgu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wconstab commentedApr 10, 2025

Uh oh!

pytorchmergebot commentedApr 10, 2025

Merge started

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

wconstab commentedApr 8, 2025•
edited
Loading

pytorch-botbot commentedApr 8, 2025•
edited
Loading

kwen2501Apr 9, 2025•
edited
Loading