[c10d][tcp_store] Fix connection reset caused by wrong socket close #150987


Closed
fduwjj wants to merge 1 commit into gh/fduwjj/122/base from gh/fduwjj/122/head

fduwjj (Contributor) commented Apr 10, 2025 (edited):

Stack from ghstack (oldest at bottom):

While fixing the memory leak in #145757, we accidentally started closing the socket when nread == 0, on the assumption that this meant the connection had been closed. That assumption is wrong. According to the libuv documentation (https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb):

> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).
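To make the three cases concrete, here is a minimal sketch of how a read callback could branch on nread. It is not the actual TCPStore code: `ReadAction`, `classifyRead`, `Connection`, and `kPlaceholderEof` are hypothetical names, and the block deliberately avoids including libuv so it compiles on its own. Real code would take `ssize_t nread` from `uv_read_cb` and compare against `UV_EOF`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical placeholder for a negative libuv status such as UV_EOF.
// The value is arbitrary; real code should use the constants from <uv.h>.
constexpr std::ptrdiff_t kPlaceholderEof = -1;

enum class ReadAction { Consume, Ignore, Close };

// Classify the nread value handed to a uv_read_cb-style callback.
// std::ptrdiff_t stands in for ssize_t here.
ReadAction classifyRead(std::ptrdiff_t nread) {
  if (nread > 0) {
    return ReadAction::Consume;  // nread bytes of data are available
  }
  if (nread == 0) {
    return ReadAction::Ignore;   // like EAGAIN: nothing to read right now
  }
  return ReadAction::Close;      // negative: EOF or another error
}

// Toy connection that records what the callback decided to do.
struct Connection {
  std::string buffered;
  bool open = true;

  void onRead(std::ptrdiff_t nread, const char* data) {
    switch (classifyRead(nread)) {
      case ReadAction::Consume:
        assert(data != nullptr);
        buffered.append(data, static_cast<std::size_t>(nread));
        break;
      case ReadAction::Ignore:
        // Keep the socket open; closing here is the bug described above.
        break;
      case ReadAction::Close:
        open = false;
        break;
    }
  }
};
```

With this shape, a sequence such as `onRead(3, "set")` followed by `onRead(0, nullptr)` leaves the connection open with the three bytes buffered, and only a negative nread marks it closed.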

We found this bug while debugging a broken pipe issue that occurs when users call set and then immediately wait for all keys on 128 ranks. It may also explain other broken pipe issues we have seen recently in production jobs.
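The failure mode can be illustrated outside of TCPStore with plain POSIX calls. The sketch below is an assumption-laden stand-in, not a reproduction of the 128-rank job: it uses an AF_UNIX socket pair instead of TCP and libuv, and `ReadState`, `probeRead`, `Outcome`, and `runDemo` are hypothetical names. It shows that an idle peer produces EAGAIN/EWOULDBLOCK rather than EOF, and that once one side closes early, a write from the other side typically fails with EPIPE. Over TCP the error may instead surface as a connection reset, and possibly only on a later write.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

// What a single non-blocking read(2) reported.
enum class ReadState { Data, WouldBlock, Eof, Error };

ReadState probeRead(int fd) {
  char byte;
  ssize_t n = read(fd, &byte, 1);
  if (n > 0) return ReadState::Data;
  if (n == 0) return ReadState::Eof;  // peer really closed
  if (errno == EAGAIN || errno == EWOULDBLOCK) return ReadState::WouldBlock;
  return ReadState::Error;
}

struct Outcome {
  ReadState idlePeer;    // peer connected but has sent nothing yet
  ReadState closedPeer;  // peer has closed its end
  int writeErrno;        // errno from writing to the closed peer (0 if none)
};

Outcome runDemo() {
  Outcome out{ReadState::Error, ReadState::Error, 0};
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return out;

  // Make the "client" end non-blocking so read(2) can return EAGAIN.
  int flags = fcntl(fds[0], F_GETFL, 0);
  assert(flags != -1);
  fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

  // Report a broken pipe through errno instead of terminating the process.
  signal(SIGPIPE, SIG_IGN);

  out.idlePeer = probeRead(fds[0]);    // nothing sent yet: not EOF

  close(fds[1]);                       // the "server" closes prematurely
  out.closedPeer = probeRead(fds[0]);  // now read(2) reports EOF

  const char msg[] = "set";
  if (write(fds[0], msg, sizeof msg) < 0) {
    out.writeErrno = errno;            // the client sees the broken pipe
  }
  close(fds[0]);
  return out;
}
```

Calling `runDemo()` and inspecting the three fields is enough to see the distinction: "no data yet" and "peer closed" are different states, and treating the first as the second pushes the error onto the client's next write.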

Added a unit test that covers this case.

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k

pytorch-bot (bot) added the "oncall: distributed" and "release notes: distributed (c10d)" labels Apr 10, 2025

pytorch-bot (bot) commented Apr 10, 2025 (edited):

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150987

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c173ffc with merge base f3cf3ec:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fduwjj added a commit that referenced this pull request Apr 10, 2025

fduwjj added the ciflow/trunk label Apr 10, 2025

fduwjj (Contributor, Author) commented:

@fduwjj has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

d4l3k (Member) left a comment:

LGTM, nice find!

fduwjj reacted with thumbs up emoji

facebook-github-bot (Contributor) commented:

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

pytorch-bot[bot] reacted with thumbs up emoji

pytorchmergebot (Collaborator) commented:

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

fduwjj (Contributor, Author) commented:

@pytorchbot merge -i

pytorch-bot[bot] reacted with thumbs up emoji

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged while ignoring the following 0 checks:

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Apr 11, 2025

We removed the wrong EOF case in #150987 and add the correct one back in this PR. Because #150987 is a fix, we merged that PR first and use this PR as a follow-up to make the logic more complete.

Pull Request resolved: #151052
Approved by: https://github.com/XilunWu

timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025

…ytorch#150987)

While fixing the memory leak in pytorch#145757, we accidentally started closing the socket when nread == 0, on the assumption that this meant the connection had been closed. That assumption is wrong. According to the libuv documentation (https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb):

> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).

We found this bug while debugging a broken pipe issue that occurs when users call set and then immediately wait for all keys on 128 ranks. It may also explain other broken pipe issues we have seen recently in production jobs.

Added a unit test that covers this case.

Pull Request resolved: pytorch#150987
Approved by: https://github.com/d4l3k, https://github.com/XilunWu

timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025

We removed the wrong EOF case in pytorch#150987 and add the correct one back in this PR. Because pytorch#150987 is a fix, we merged that PR first and use this PR as a follow-up to make the logic more complete.

Pull Request resolved: pytorch#151052
Approved by: https://github.com/XilunWu

amathewc pushed two commits to amathewc/pytorch that referenced this pull request Apr 17, 2025, carrying the same two commit messages as the timocafe commits above (pytorch#150987 and pytorch#151052).

github-actions (bot) deleted the gh/fduwjj/122/head branch May 16, 2025 02:18

Reviewers

d4l3k approved these changes

XilunWu approved these changes

Skylion007: awaiting requested review

Assignees

No one assigned

Labels

ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d)

Projects

None yet

Milestone

No milestone


6 participants

@fduwjj @facebook-github-bot @pytorchmergebot @d4l3k @XilunWu
