- Notifications
You must be signed in to change notification settings - Fork26.3k
[C10D] Document object collectives limitations#150815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results athud.pytorch.org/pr/150815
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.ghstack-source-id:30848a3Pull Requestresolved:#150815
pytorch-botbot commentedApr 8, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
🔗 Helpful Links🧪 See artifacts and rendered test results athud.pytorch.org/pr/150815
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit87efa14 with merge base91d1826 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]
wconstab commentedApr 8, 2025
@pytorchbot help |
❌ 🤖 pytorchbot command failed: Try |
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]
kwen2501 left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks much for writing the paragraph! We need more foundational work like this!
Uh oh!
There was an error while loading.Please reload this page.
docs/source/distributed.rst Outdated
| the sending rank(s) had to pickle. This has led to one class of issues where users report nccl watchdog timeouts caused | ||
| by one rank spending a really long time unpickling data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Just thinking out loud: why would the unpickle time lead to watchdog timeout? I thought the collective has finished at that point? Or are you saying that it may impact the next collective because this gathering rank would join late?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
yea, its probably always an unrelated collective that actually times out, caused by desync, cuased by unpickling
| **Inefficient tensor communication** - Tensors should be sent via regular collective APIs, not object collective APIs. | ||
| It is possible to send Tensors via object collective APIs, but they will be serialized and deserialized (including a | ||
| cpu-sync and device-to-host copy in the case of non-cpu tensors), and in almost every case other than debugging or | ||
| troubleshooting code, it would be worth the trouble to refactor the code to use non-object collectives instead. | ||
| **Unexpected tensor devices** - If you still want to send tensors via object collectives, there is another aspect | ||
| specific to cuda (and possibly other accelerators) tensors. If you pickle a tensor that is currently on `cuda:3`, and | ||
| then unpickle it, you will get another tensor on `cuda:3` *regardless of which process you are on, or which cuda device | ||
| is the 'default' device for that process*. With regular tensor collective APIs, 'output tensors' will always be on the | ||
| same, local device, which is generally what you'd expect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
nit: maybe merge "Inefficient tensor communication" and "Unexpected tensor devices" into one section and name it "Not for tensors"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
i merged the latter two instead. i think inefficient tensor comm applies to cpu and cuda tensors (any tensors) due to pickling overhead, and the latter sections apply to cuda specifically.
| **Unintended cuda-context creation** - Unpickling a tensor will implicity activate a cuda context if it is the first | ||
| time a GPU is used by the process, which can waste significant amounts of GPU memory. This issue can be avoided by | ||
| moving tensors to CPU before passing them as inputs to an object collective. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It seems "Unintended cuda-context creation" is really related to "Unexpected tensor devices" above. So I recommend merging sections too.
Could the sentence "This issue can be avoided by moving tensors to CPU before passing them as inputs to an object collective" sound like we encourage the users to do so (while we actually don't)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
well, i said "worked around" not "should". I think that the best I can do is outline all the risks, and if you still want to pass tensors in we can at least make it less bad.
awgu left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
proof-reading-only pass
(some typos and python/cpu/cuda -> Python/CPU/CUDA)
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]
Uh oh!
There was an error while loading.Please reload this page.
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addresses#150798cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k[ghstack-poisoned]
wconstab commentedApr 10, 2025
@pytorchbot merge |
pytorchmergebot commentedApr 10, 2025
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in thewiki. Questions? Feedback? Please reach out to thePyTorch DevX Team |
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addressespytorch#150798Pull Requestresolved:pytorch#150815Approved by:https://github.com/kwen2501
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.Preview:<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />addressespytorch#150798Pull Requestresolved:pytorch#150815Approved by:https://github.com/kwen2501
Adds louder warning labels in the doc page and docstring for objectcollectives in hopes of raising awareness of several footgun issuesincluding accidental creation of cuda contexts by serializing andsending 'device-local' gpu tensors over the object-* apis.ghstack-source-id:0945121Pull Requestresolved:pytorch/pytorch#150815
Uh oh!
There was an error while loading.Please reload this page.
Stack fromghstack (oldest at bottom):
Adds louder warning labels in the doc page and docstring for object
collectives in hopes of raising awareness of several footgun issues
including accidental creation of cuda contexts by serializing and
sending 'device-local' gpu tensors over the object-* apis.
Preview:

addresses#150798
cc@H-Huang@awgu@wanchaol@fegin@fduwjj@wz337@d4l3k