c10d/Store: add nonblocking mode to queue_pop #151485
Conversation
pytorch-bot (bot) commented Apr 16, 2025 • edited
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151485
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure) As of commit 31825f4 with merge base 7f52875 (UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
fduwjj left a comment
LGTM
```cpp
PyModule_AddObject(
    module, "_DistQueueEmptyError", THPException_DistQueueEmptyError) ==
    0);
```
Is the purpose of this to register the exception as a Python exception object in PyTorch?
Yes -- we need to do this to surface it to Python. Though, honestly maybe we should move all of the PTD errors to use pybind instead of THP for exception translation. THP is just so painful to work with
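For reference, a minimal sketch of what pybind-based exception translation could look like. The names here are hypothetical and this is not the code in this PR; it only illustrates the alternative mentioned above:

```cpp
// Illustrative sketch only: names are hypothetical, not the actual c10d/THP code.
#include <pybind11/pybind11.h>
#include <stdexcept>

namespace py = pybind11;

// Hypothetical C++ exception that queue_pop-style code could throw.
struct QueueEmptyError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

PYBIND11_MODULE(_example_module, m) {
  // register_exception creates a Python exception class on the module and
  // installs a translator so throwing QueueEmptyError in C++ raises it in Python.
  py::register_exception<QueueEmptyError>(m, "QueueEmptyError");
}
```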
d4l3k commented Apr 17, 2025
@pytorchbot merge
pytorchmergebot commented Apr 17, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
pytorchmergebot commented Apr 17, 2025
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud
d4l3k commented Apr 17, 2025
@pytorchbot merge -i
pytorchmergebot commented Apr 17, 2025
Merge started. Your change will be merged while ignoring the following 7 checks: pull / unstable-linux-focal-cuda12.6-py3.10-gcc11-sm89-xfail / build, pull / linux-focal-py3.9-clang10 / test (default, 1, 5, lf.ephemeral.linux.4xlarge), pull / linux-focal-py3.9-clang10 / test (crossref, 2, 2, lf.ephemeral.linux.2xlarge), pull / linux-focal-py3.13-clang10 / test (default, 3, 5, lf.ephemeral.linux.4xlarge), pull / linux-focal-py3.13-clang10 / test (crossref, 2, 2, lf.ephemeral.linux.2xlarge), pull / linux-jammy-py3.9-gcc11 / test (default, 3, 5, lf.ephemeral.linux.2xlarge), pull / linux-jammy-py3.10-clang15-asan / test (default, 5, 6, lf.ephemeral.linux.4xlarge). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
pytorchmergebot commented Apr 17, 2025
Merge failed. Reason: 1 job has failed, first few of them are: trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable). Details for Dev Infra team: Raised by workflow job
d4l3k commented Apr 17, 2025
@pytorchbot merge -i
pytorchmergebot commented Apr 17, 2025
pytorchmergebot commented Apr 17, 2025
d4l3k commented Apr 17, 2025
@pytorchbot merge -i
pytorchmergebot commented Apr 17, 2025
pytorchmergebot commented Apr 17, 2025
d4l3k commented Apr 17, 2025
@pytorchbot rebase
pytorchmergebot commented Apr 17, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
pytorchmergebot commented Apr 17, 2025
Successfully rebased
9842ea7 to 31825f4 (compare)
d4l3k commented Apr 17, 2025
@pytorchmergebot merge
pytorchmergebot commented Apr 17, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
This adds a non-blocking mode to queue_pop, which lets workers poll whether work is ready without blocking the main loop. This is useful when you want a GPU to stay at maximum utilization while items are only periodically sent on the queue.
We also expose torch.distributed.QueueEmptyError so users can catch the error and handle it accordingly (see the usage sketch below).
Test plan:
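For illustration only (not part of the PR's test plan), a minimal polling sketch of the new behavior; the keyword name for the non-blocking mode is an assumption, so check the actual queue_pop signature in the diff:

```python
import torch.distributed as dist

# Hypothetical setup; host/port and the choice of TCPStore are placeholders.
store = dist.TCPStore("localhost", 29500, world_size=1, is_master=True)

def poll_once(store, key):
    try:
        # block=False is the assumed spelling of the new non-blocking mode.
        return store.queue_pop(key, block=False)
    except dist.QueueEmptyError:
        # Nothing queued yet; return control so the main loop (e.g. GPU work)
        # keeps running at full utilization.
        return None
```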
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab