Safer bookkeeping of NCCL communicators#150681
Conversation
pytorch-bot commented Apr 4, 2025 • edited

🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150681
Note: Links to docs will display an error until the docs builds have been completed. ⏳ 1 Pending, 1 Unrelated Failure as of commit 8b7c3bc with merge base 7ac8186 (FLAKY: the following job failed but was likely due to flakiness present on trunk).
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```cpp
    int numRanks,
    int rank,
    std::vector<ncclUniqueId>& commIds,
    at::DeviceIndex deviceIndex,
```
I don't think this constructor was deliberately missing the device?
lgtm. Yeah, the other "create" methods take `deviceIndex`.
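The point of this hunk is that every "create"-style constructor should record the device index, so that `getDeviceIndex()` is always reliable. Here is a minimal self-contained sketch of that invariant; the type and member names are illustrative stand-ins, not PyTorch's actual `NCCLComm` implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for at::DeviceIndex and ncclUniqueId.
using DeviceIndex = int8_t;
struct UniqueId {};

// Sketch: the constructor always stores the device index it was given,
// so callers can later query it without any "unset device" corner case.
class Comm {
 public:
  Comm(int numRanks, int rank, std::vector<UniqueId>& commIds,
       DeviceIndex deviceIndex)
      : numRanks_(numRanks), rank_(rank), deviceIndex_(deviceIndex) {
    (void)commIds;  // real code would use these to initialize NCCL
  }
  DeviceIndex getDeviceIndex() const { return deviceIndex_; }
  int rank() const { return rank_; }
  int numRanks() const { return numRanks_; }

 private:
  int numRanks_;
  int rank_;
  DeviceIndex deviceIndex_;
};
```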
```diff
-      auto& devIdx = it.second;
-      if (te.device_ == devIdx) {
+      for (auto& [ncclComm, _] : ncclCommDevIdxMap) {
+        if (te.device_ == ncclComm->getDeviceIndex()) {
```
Note that with this PR we query the device from the `NCCLComm` (the key of the hashmap) instead of using the value of the hashmap. After this PR, the values of the hashmap are never used anywhere. This is on purpose, since in a later PR I plan to repurpose that hashmap to store something different in it.
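The shape of the change can be sketched as follows: the device index is read off the map's key (the communicator) rather than its now-unused value. This is a hedged illustration with hypothetical names, not the actual PyTorch code.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Hypothetical communicator that knows its own device index.
struct Comm {
  int deviceIndex;
  int getDeviceIndex() const { return deviceIndex; }
};

// Hypothetical equivalent of ncclCommDevIdxMap: keyed by communicator.
// After the change described above, the mapped value is no longer read.
std::unordered_map<std::shared_ptr<Comm>, int> commDevIdxMap;

// Count communicators on a device by querying the KEY, not the value.
int countCommsOnDevice(int device) {
  int n = 0;
  for (auto& [comm, _] : commDevIdxMap) {
    if (comm->getDeviceIndex() == device) ++n;
  }
  return n;
}
```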
What do you use the `ncclCommDevIdxMap` for in your later PR? (Sorry, I haven't read those ones yet.) If that PR is going to come "much later", maybe it would look cleaner to defer the changes to `ncclCommDevIdxMap` to that PR? I don't have a very strong opinion here, though.
Yes, totally. I broke the entire change down into 4 chunks in order to make it incremental and "guide" the reviewers through it. If you prefer, I can squash them all together.
kwen2501 left a comment
LGTM. Left a minor comment.
lw commented Apr 8, 2025
@pytorchbot merge
pytorchmergebot commented Apr 8, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…50682) I still don't really understand the original purpose of that env var, but it appears that its usage is completely disconnected from MemPools and from `ncclMemAlloc`/`Free`. In fact, when that env var is set, we invoke `ncclCommRegister` for _all_ NCCL communicators for _all_ the memory segments managed by the allocator (both the global ones, allocated with `cudaMalloc`, and the ones in private MemPools), and we do that both for the segments that already exist when the PG is initialized and for all segments that will be allocated later. I'm reworking the code a bit, by using a few helper functions whose names should make this behavior clearer.
Pull Request resolved: #150682
Approved by: https://github.com/kwen2501
ghstack dependencies: #150681
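The "all communicators × all segments" behavior described in that commit message can be sketched with two helper functions. Everything here (the `Segment`/`Comm` types and both helper names) is hypothetical, chosen only to illustrate the two directions of registration; the real code calls `ncclCommRegister` on actual allocator segments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical allocator segment and communicator types.
struct Segment {
  void* ptr;
  size_t size;
};
struct Comm {
  std::vector<const Segment*> registered;
  void registerSegment(const Segment& s) { registered.push_back(&s); }
};

// Helper 1: a newly allocated segment is registered with every
// existing communicator.
void registerSegmentWithAllComms(std::vector<Comm>& comms,
                                 const Segment& seg) {
  for (auto& comm : comms) comm.registerSegment(seg);
}

// Helper 2: a newly created communicator registers every segment that
// already exists (both global and pool-backed, in the real code).
void registerAllSegmentsWithComm(Comm& comm,
                                 const std::vector<Segment>& segs) {
  for (auto& seg : segs) comm.registerSegment(seg);
}
```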
This consists mainly in two changes:
- ensure we can reliably obtain the device from a `NCCLComm` object (there was one constructor which didn't set the device)
- use a RAII pattern for acquiring the lock to the global dictionary of `NCCLComm`s (which ensures the lock is released in case of exceptions)
Pull Request resolved: pytorch#150681
Approved by: https://github.com/kwen2501
Stack from ghstack (oldest at bottom):

This consists mainly in two changes:
- ensure we can reliably obtain the device from a `NCCLComm` object (there was one constructor which didn't set the device)
- use a RAII pattern for acquiring the lock to the global dictionary of `NCCLComm`s (which ensures the lock is released in case of exceptions)

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k
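The second bullet (RAII locking of the global dictionary) can be illustrated with `std::lock_guard`: the mutex is released when the guard goes out of scope, on normal return and on exception alike. The registry type and function name below are hypothetical placeholders, not the actual PyTorch symbols.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for the global dictionary of NCCLComms.
std::mutex registryMutex;
std::unordered_map<int, int> commRegistry;  // commId -> deviceIndex

void registerComm(int commId, int deviceIndex) {
  // RAII: the mutex is unlocked by ~lock_guard on every exit path,
  // including stack unwinding after an exception thrown below.
  std::lock_guard<std::mutex> guard(registryMutex);
  commRegistry[commId] = deviceIndex;
}
```

Compared with manual `lock()`/`unlock()` pairs, the guard makes it impossible to leave the global dictionary's mutex held after an early return or a throw.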