- Notifications
You must be signed in to change notification settings - Fork26.3k
[c10d][tcp_store] Fix connection reset caused by wrong socket close#150987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
pytorch-botbot commentedApr 10, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
🔗 Helpful Links🧪 See artifacts and rendered test results athud.pytorch.org/pr/150987
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commitc173ffc with merge basef3cf3ec ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
fduwjj commentedApr 10, 2025
@fduwjj has imported this pull request. If you are a Meta employee, you can view this diffon Phabricator. |
d4l3k left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM, nice find!
facebook-github-bot commentedApr 10, 2025
@pytorchbot merge -i (Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally) |
pytorchmergebot commentedApr 10, 2025
Merge failedReason: Approvers from one of the following sets are needed:
|
fduwjj commentedApr 10, 2025
@pytorchbot merge -i |
pytorchmergebot commentedApr 10, 2025
Merge startedYour change will be merged while ignoring the following 0 checks: Learn more about merging in thewiki. Questions? Feedback? Please reach out to thePyTorch DevX Team |
We removed the wrong EOF case in#150987, and we added the correct one back in this PR. Since#150987 is a fix, so we merge that PR first and use this PR as a follow-up to further makes the logic more complete.Pull Requestresolved:#151052Approved by:https://github.com/XilunWu
…ytorch#150987)While fixing the memory leak inpytorch#145757, we accidentally close the socket for the case when nread == 0 and thought it is the case when connection is closed. This is not true. According to libuv doc:https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently.Added a unit test to test this case.Pull Requestresolved:pytorch#150987Approved by:https://github.com/d4l3k,https://github.com/XilunWu
We removed the wrong EOF case inpytorch#150987, and we added the correct one back in this PR. Sincepytorch#150987 is a fix, so we merge that PR first and use this PR as a follow-up to further makes the logic more complete.Pull Requestresolved:pytorch#151052Approved by:https://github.com/XilunWu
…ytorch#150987)While fixing the memory leak inpytorch#145757, we accidentally close the socket for the case when nread == 0 and thought it is the case when connection is closed. This is not true. According to libuv doc:https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently.Added a unit test to test this case.Pull Requestresolved:pytorch#150987Approved by:https://github.com/d4l3k,https://github.com/XilunWu
We removed the wrong EOF case inpytorch#150987, and we added the correct one back in this PR. Sincepytorch#150987 is a fix, so we merge that PR first and use this PR as a follow-up to further makes the logic more complete.Pull Requestresolved:pytorch#151052Approved by:https://github.com/XilunWu
Uh oh!
There was an error while loading.Please reload this page.
Stack fromghstack (oldest at bottom):
While fixing the memory leak in#145757, we accidentally close the socket for the case when nread == 0 and thought it is the case when connection is closed. This is not true. According to libuv doc:https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.
We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently.
Added a unit test to test this case.
cc@H-Huang@awgu@wanchaol@fegin@wz337@wconstab@d4l3k