Safer bookkeeping of NCCL communicators#150681
Conversation
pytorch-bot commented Apr 4, 2025 • edited

🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150681
Note: Links to docs will display an error until the docs builds have been completed. ⏳ 1 Pending, 1 Unrelated Failure as of commit 8b7c3bc with merge base 7ac8186 (FLAKY: the following job failed but was likely due to flakiness present on trunk).
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```cpp
    int numRanks,
    int rank,
    std::vector<ncclUniqueId>& commIds,
    at::DeviceIndex deviceIndex,
```
I don't think this constructor was deliberately missing the device?
lgtm. Yeah, the other "create" methods take `deviceIndex`.
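The point of this hunk is that every "create"-style constructor should record the device index, so that `getDeviceIndex()` is always reliable. Here is a minimal self-contained sketch of that invariant; the type and member names are illustrative stand-ins, not PyTorch's actual `NCCLComm` implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for at::DeviceIndex and ncclUniqueId.
using DeviceIndex = int8_t;
struct UniqueId {};

// Sketch: the constructor always stores the device index it was given,
// so callers can later query it without any "unset device" corner case.
class Comm {
 public:
  Comm(int numRanks, int rank, std::vector<UniqueId>& commIds,
       DeviceIndex deviceIndex)
      : numRanks_(numRanks), rank_(rank), deviceIndex_(deviceIndex) {
    (void)commIds;  // real code would use these to initialize NCCL
  }
  DeviceIndex getDeviceIndex() const { return deviceIndex_; }
  int rank() const { return rank_; }
  int numRanks() const { return numRanks_; }

 private:
  int numRanks_;
  int rank_;
  DeviceIndex deviceIndex_;
};
```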
```diff
-      auto& devIdx = it.second;
-      if (te.device_ == devIdx) {
+      for (auto& [ncclComm, _] : ncclCommDevIdxMap) {
+        if (te.device_ == ncclComm->getDeviceIndex()) {
```
Note that with this PR we query the device from the `NCCLComm` (the key of the hashmap) instead of using the value of the hashmap. After this PR, the values of the hashmap are never used anywhere. This is on purpose, since in a later PR I plan to repurpose that hashmap to store something different in it.
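The shape of the change can be sketched as follows: the device index is read off the map's key (the communicator) rather than its now-unused value. This is a hedged illustration with hypothetical names, not the actual PyTorch code.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Hypothetical communicator that knows its own device index.
struct Comm {
  int deviceIndex;
  int getDeviceIndex() const { return deviceIndex; }
};

// Hypothetical equivalent of ncclCommDevIdxMap: keyed by communicator.
// After the change described above, the mapped value is no longer read.
std::unordered_map<std::shared_ptr<Comm>, int> commDevIdxMap;

// Count communicators on a device by querying the KEY, not the value.
int countCommsOnDevice(int device) {
  int n = 0;
  for (auto& [comm, _] : commDevIdxMap) {
    if (comm->getDeviceIndex() == device) ++n;
  }
  return n;
}
```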
What do you use the `ncclCommDevIdxMap` for in your later PR? (Sorry, I haven't read those ones yet.) If that PR is going to come "much later", maybe it would look cleaner to defer the changes to `ncclCommDevIdxMap` to that PR? I don't have a very strong opinion here, though.
Yes, totally. I broke the entire change down into 4 chunks in order to make it incremental and "guide" the reviewers through it. If you prefer, I can squash them all together.
kwen2501 left a comment
LGTM. Left a minor comment.
lw commented Apr 8, 2025
@pytorchbot merge
pytorchmergebot commented Apr 8, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…50682) I still don't really understand the original purpose of that env var, but it appears that its usage is completely disconnected from MemPools and from `ncclMemAlloc`/`Free`. In fact, when that env var is set, we invoke `ncclCommRegister` for _all_ NCCL communicators for _all_ the memory segments managed by the allocator (both the global ones, allocated with `cudaMalloc`, and the ones in private MemPools), and we do that both for the segments that already exist when the PG is initialized and for all segments that will be allocated later. I'm reworking the code a bit, by using a few helper functions whose names should make this behavior clearer.
Pull Request resolved: #150682
Approved by: https://github.com/kwen2501
ghstack dependencies: #150681
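The "all communicators × all segments" behavior described in that commit message can be sketched with two helper functions. Everything here (the `Segment`/`Comm` types and both helper names) is hypothetical, chosen only to illustrate the two directions of registration; the real code calls `ncclCommRegister` on actual allocator segments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical allocator segment and communicator types.
struct Segment {
  void* ptr;
  size_t size;
};
struct Comm {
  std::vector<const Segment*> registered;
  void registerSegment(const Segment& s) { registered.push_back(&s); }
};

// Helper 1: a newly allocated segment is registered with every
// existing communicator.
void registerSegmentWithAllComms(std::vector<Comm>& comms,
                                 const Segment& seg) {
  for (auto& comm : comms) comm.registerSegment(seg);
}

// Helper 2: a newly created communicator registers every segment that
// already exists (both global and pool-backed, in the real code).
void registerAllSegmentsWithComm(Comm& comm,
                                 const std::vector<Segment>& segs) {
  for (auto& seg : segs) comm.registerSegment(seg);
}
```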
This consists mainly in two changes:
- ensure we can reliably obtain the device from a `NCCLComm` object (there was one constructor which didn't set the device)
- use a RAII pattern for acquiring the lock to the global dictionary of `NCCLComm`s (which ensures the lock is released in case of exceptions)
Pull Request resolved: pytorch#150681
Approved by: https://github.com/kwen2501
Stack from ghstack (oldest at bottom):

This consists mainly in two changes:
- ensure we can reliably obtain the device from a `NCCLComm` object (there was one constructor which didn't set the device)
- use a RAII pattern for acquiring the lock to the global dictionary of `NCCLComm`s (which ensures the lock is released in case of exceptions)

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k
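The second bullet (RAII locking of the global dictionary) can be illustrated with `std::lock_guard`: the mutex is released when the guard goes out of scope, on normal return and on exception alike. The registry type and function name below are hypothetical placeholders, not the actual PyTorch symbols.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for the global dictionary of NCCLComms.
std::mutex registryMutex;
std::unordered_map<int, int> commRegistry;  // commId -> deviceIndex

void registerComm(int commId, int deviceIndex) {
  // RAII: the mutex is unlocked by ~lock_guard on every exit path,
  // including stack unwinding after an exception thrown below.
  std::lock_guard<std::mutex> guard(registryMutex);
  commRegistry[commId] = deviceIndex;
}
```

Compared with manual `lock()`/`unlock()` pairs, the guard makes it impossible to leave the global dictionary's mutex held after an early return or a throw.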