Safer bookkeeping of NCCL communicators #150681


Closed · lw wants to merge 2 commits into gh/lw/11/base from gh/lw/11/head

Conversation

lw (Contributor) commented Apr 4, 2025 · edited by pytorch-bot bot

Stack from ghstack (oldest at bottom):

This consists mainly of two changes:

  • ensure we can reliably obtain the device from a NCCLComm object (there was one constructor which didn't set the device)
  • use an RAII pattern for acquiring the lock on the global dictionary of NCCLComms, which ensures the lock is released in case of exceptions (a sketch of both patterns follows this list)
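
For intuition, here is a minimal sketch of the two patterns, under stated assumptions: `DeviceIndex`, `gCommMapMutex`, `gNcclCommMap`, and `eraseComm` are illustrative stand-ins, not the identifiers the PR actually touches in the c10d sources.

```cpp
// Hedged sketch only; names below are hypothetical stand-ins.
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>

using DeviceIndex = int8_t; // stand-in for at::DeviceIndex

class NCCLComm {
 public:
  // Change 1: every constructor records the device, so getDeviceIndex()
  // always returns a meaningful value.
  explicit NCCLComm(DeviceIndex deviceIndex) : deviceIndex_(deviceIndex) {}

  DeviceIndex getDeviceIndex() const {
    return deviceIndex_;
  }

 private:
  DeviceIndex deviceIndex_;
};

std::mutex gCommMapMutex; // guards gNcclCommMap
std::map<std::string, std::shared_ptr<NCCLComm>> gNcclCommMap;

void eraseComm(const std::string& key) {
  // Change 2: RAII locking. The lock_guard's destructor releases the mutex
  // on every exit path, including exceptions, unlike a bare
  // lock()/unlock() pair.
  std::lock_guard<std::mutex> lock(gCommMapMutex);
  gNcclCommMap.erase(key);
}
```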

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

[ghstack-poisoned]
pytorch-bot bot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) and release notes: distributed (c10d) (release notes category) labels Apr 4, 2025
pytorch-bot bot commented Apr 4, 2025 · edited

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150681

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 1 Unrelated Failure

As of commit 8b7c3bc with merge base 7ac8186:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

```diff
     int numRanks,
     int rank,
     std::vector<ncclUniqueId>& commIds,
+    at::DeviceIndex deviceIndex,
```
lw (Contributor, Author) commented:

I don't think this constructor was deliberately missing the device?

kwen2501 (Collaborator) commented:

lgtm. Yeah, the other "create" methods take deviceIndex.

```diff
-    auto& devIdx = it.second;
-    if (te.device_ == devIdx) {
+    for (auto& [ncclComm, _] : ncclCommDevIdxMap) {
+      if (te.device_ == ncclComm->getDeviceIndex()) {
```
lw (Contributor, Author) commented:

Note that with this PR we query the device from the NCCLComm (the key of the hashmap) instead of using the value of the hashmap. After this PR, the values of the hashmap are never used anywhere. This is on purpose, since in a later PR I plan to repurpose that hashmap in order to store something different in it.
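
To make that key-based lookup concrete, here is a self-contained sketch; `TraceEntry`, `matchesAnyComm`, and the simplified types are hypothetical stand-ins for illustration only, not the real c10d types.

```cpp
// Illustrative sketch of the lookup in the diff above; all names here are
// simplified assumptions, not the actual PyTorch definitions.
#include <cstdint>
#include <map>
#include <memory>

using DeviceIndex = int8_t;

struct NCCLComm {
  DeviceIndex deviceIndex_{-1};
  DeviceIndex getDeviceIndex() const {
    return deviceIndex_;
  }
};

struct TraceEntry {
  DeviceIndex device_{-1};
};

// Keyed by the communicator; the int value is no longer consulted.
std::map<std::shared_ptr<NCCLComm>, int> ncclCommDevIdxMap;

bool matchesAnyComm(const TraceEntry& te) {
  // Structured bindings: take the device from the key, ignore the value.
  for (auto& [ncclComm, _] : ncclCommDevIdxMap) {
    if (te.device_ == ncclComm->getDeviceIndex()) {
      return true;
    }
  }
  return false;
}
```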

kwen2501 (Collaborator) commented Apr 4, 2025 · edited:

What do you use the ncclCommDevIdxMap for in your later PR? (Sorry, I haven't read those ones yet.) If that PR is going to come "much later", maybe it would look cleaner to defer the changes on ncclCommDevIdxMap to that PR? I don't have a very strong opinion here, though.

lw (Contributor, Author) commented:

Yes, totally. I broke the entire change down into 4 chunks in order to make it incremental and "guide" the reviewers through it. If you prefer, I can squash them all together.

lw marked this pull request as ready for review April 4, 2025 15:58
kwen2501 (Collaborator) left a comment:

LGTM. Left a minor comment.


[ghstack-poisoned]
lw (Contributor, Author) commented Apr 8, 2025:

@pytorchbot merge


pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Apr 8, 2025
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Apr 8, 2025

…50682)

I still don't really understand the original purpose of that env var, but it appears that its usage is completely disconnected from MemPools and from `ncclMemAlloc`/`Free`. In fact, when that env var is set, we invoke `ncclCommRegister` for _all_ NCCL communicators for _all_ the memory segments managed by the allocator (both the global ones, allocated with `cudaMalloc`, and the ones in private MemPools), and we do that both for the segments that already exist when the PG is initialized and for all segments that will be allocated later. I'm reworking the code a bit, by using a few helper functions, whose names should make this behavior clearer.

Pull Request resolved: #150682
Approved by: https://github.com/kwen2501
ghstack dependencies: #150681
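
As a rough illustration of the behavior that commit message describes (not the actual helper functions it introduces), the sketch below registers every allocator segment with every communicator via NCCL's `ncclCommRegister` (available since NCCL 2.19); everything except the NCCL call itself is a hypothetical placeholder.

```cpp
// Hedged sketch; `Segment`, `registerAllSegments`, and the vectors are
// assumed placeholders for the allocator's segment list and the
// communicator registry, not PyTorch's real data structures.
#include <cstddef>
#include <vector>

#include <nccl.h>

struct Segment {
  void* ptr;
  size_t size;
};

// Register every allocator segment with every communicator; this is what
// effectively happens when the env var is set.
void registerAllSegments(
    const std::vector<Segment>& allSegments,
    const std::vector<ncclComm_t>& allComms) {
  for (ncclComm_t comm : allComms) {
    for (const Segment& seg : allSegments) {
      void* handle = nullptr; // opaque handle, needed for deregistration
      ncclCommRegister(comm, seg.ptr, seg.size, &handle);
      // A real implementation would check the ncclResult_t and keep
      // `handle` so the buffer can be released via ncclCommDeregister.
    }
  }
}
```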
timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025 (pytorch#150681)

timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025 (pytorch#150682)

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025 (pytorch#150681)

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025 (pytorch#150682)
Divigroup-RAP pushed a commit to Divigroup-RAP/PYTORCH that referenced this pull request Apr 22, 2025
github-actions bot deleted the gh/lw/11/head branch May 15, 2025 02:16

Reviewers

@kwen2501 approved these changes

Assignees

No one assigned

Labels

ciflow/trunk (Trigger trunk jobs on your pull request) · Merged · oncall: distributed (Add this issue/PR to distributed oncall triage queue) · release notes: distributed (c10d) (release notes category)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

4 participants

@lw · @pytorchmergebot · @kwen2501
