c10d/Store: add nonblocking mode to queue_pop #151485
Conversation
pytorch-bot (bot) commented Apr 16, 2025 • edited
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151485
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure) As of commit 31825f4 with merge base 7f52875 (UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
fduwjj left a comment
LGTM
```cpp
PyModule_AddObject(
    module, "_DistQueueEmptyError", THPException_DistQueueEmptyError) ==
    0);
```
Is the purpose of this to register the exception as a Python exception object in PyTorch?
Yes -- we need to do this to surface it to Python. Though, honestly maybe we should move all of the PTD errors to use pybind instead of THP for exception translation. THP is just so painful to work with
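For reference, a minimal sketch of what pybind-based exception translation could look like. The names here are hypothetical and this is not the code in this PR; it only illustrates the alternative mentioned above:

```cpp
// Illustrative sketch only: names are hypothetical, not the actual c10d/THP code.
#include <pybind11/pybind11.h>
#include <stdexcept>

namespace py = pybind11;

// Hypothetical C++ exception that queue_pop-style code could throw.
struct QueueEmptyError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

PYBIND11_MODULE(_example_module, m) {
  // register_exception creates a Python exception class on the module and
  // installs a translator so throwing QueueEmptyError in C++ raises it in Python.
  py::register_exception<QueueEmptyError>(m, "QueueEmptyError");
}
```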
d4l3k commented Apr 17, 2025
@pytorchbot merge
pytorchmergebot commented Apr 17, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
pytorchmergebot commented Apr 17, 2025
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud
d4l3k commented Apr 17, 2025
@pytorchbot merge -i
pytorchmergebot commented Apr 17, 2025
Merge started. Your change will be merged while ignoring the following 7 checks: pull / unstable-linux-focal-cuda12.6-py3.10-gcc11-sm89-xfail / build, pull / linux-focal-py3.9-clang10 / test (default, 1, 5, lf.ephemeral.linux.4xlarge), pull / linux-focal-py3.9-clang10 / test (crossref, 2, 2, lf.ephemeral.linux.2xlarge), pull / linux-focal-py3.13-clang10 / test (default, 3, 5, lf.ephemeral.linux.4xlarge), pull / linux-focal-py3.13-clang10 / test (crossref, 2, 2, lf.ephemeral.linux.2xlarge), pull / linux-jammy-py3.9-gcc11 / test (default, 3, 5, lf.ephemeral.linux.2xlarge), pull / linux-jammy-py3.10-clang15-asan / test (default, 5, 6, lf.ephemeral.linux.4xlarge). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
pytorchmergebot commented Apr 17, 2025
Merge failed. Reason: 1 job has failed, first few of them are: trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable). Details for Dev Infra team: Raised by workflow job
d4l3k commented Apr 17, 2025
@pytorchbot merge -i
pytorchmergebot commented Apr 17, 2025
pytorchmergebot commented Apr 17, 2025
d4l3k commented Apr 17, 2025
@pytorchbot merge -i
pytorchmergebot commented Apr 17, 2025
pytorchmergebot commented Apr 17, 2025
d4l3k commented Apr 17, 2025
@pytorchbot rebase
pytorchmergebot commented Apr 17, 2025
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
pytorchmergebot commented Apr 17, 2025
Successfully rebased
9842ea7 to 31825f4 (compare)
d4l3k commented Apr 17, 2025
@pytorchmergebot merge
pytorchmergebot commented Apr 17, 2025
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
This adds a non-blocking mode to queue_pop, which lets workers poll whether work is ready without blocking the main loop. This is useful when you want a GPU to stay at maximum utilization while items are only periodically sent on the queue.
We also expose torch.distributed.QueueEmptyError so users can catch the error and handle it accordingly (see the usage sketch below).
Test plan:
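For illustration only (not part of the PR's test plan), a minimal polling sketch of the new behavior; the keyword name for the non-blocking mode is an assumption, so check the actual queue_pop signature in the diff:

```python
import torch.distributed as dist

# Hypothetical setup; host/port and the choice of TCPStore are placeholders.
store = dist.TCPStore("localhost", 29500, world_size=1, is_master=True)

def poll_once(store, key):
    try:
        # block=False is the assumed spelling of the new non-blocking mode.
        return store.queue_pop(key, block=False)
    except dist.QueueEmptyError:
        # Nothing queued yet; return control so the main loop (e.g. GPU work)
        # keeps running at full utilization.
        return None
```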
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab