gh-130895: fix multiprocessing.Process join/wait/poll races #131440
Open
duaneg wants to merge 2 commits into python:main from duaneg:gh-130895
+108 −23
ghost commented Mar 19, 2025 • edited by ghost
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
This bug is caused by race conditions in the poll implementations (which are called by join/wait): if multiple threads try to reap the dead process, only one "wins" and gets the exit code, while the others get an error.
In the forkserver implementation the losing thread(s) set the code to an error, possibly overwriting the correct code set by the winning thread. This is relatively easy to fix: we can just take a lock before waiting for the process, since at that point we know the call should not block.
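The locking idea for the forkserver case can be illustrated with a toy model. This is only a sketch: `OnceOnlyStatus` and `LockedPoller` are hypothetical names that stand in for the real status channel and popen object, and the fallback value 255 is an illustrative error code, not a claim about the actual patch.

```python
import threading


class OnceOnlyStatus:
    """Hypothetical stand-in for a status channel that can be read only once."""

    def __init__(self, code):
        self._code = code
        self._guard = threading.Lock()
        self._consumed = False

    def read(self):
        with self._guard:
            if self._consumed:
                # A second reader gets an error, like a losing thread.
                raise OSError("status already consumed")
            self._consumed = True
            return self._code


class LockedPoller:
    """Reads the status under a lock so a loser cannot overwrite the code."""

    def __init__(self, status):
        self._status = status
        self._lock = threading.Lock()
        self.returncode = None

    def poll(self):
        # Assumes the caller already knows the child has exited, so the
        # read will not block while the lock is held.
        with self._lock:
            if self.returncode is None:
                try:
                    self.returncode = self._status.read()
                except OSError:
                    self.returncode = 255  # illustrative error code
            return self.returncode
```

Because the check of `returncode` and the read happen under the same lock, only the first thread ever reads the status; later threads see the stored code and never reach the error path.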
In the fork and spawn implementations the losers of the race return before the exit code is set, meaning the process may still report itself as alive after join returns. Fixing this is trickier as we have to support a mixture of blocking and non-blocking calls to poll, and we cannot have the latter waiting to take a lock held by the former.
The approach taken is to split the blocking and non-blocking call variants. The non-blocking variant does its work with the lock held: since it won't block this should be safe. The blocking variant releases the lock before making the blocking operating system call. It then retakes the lock and either sets the code if it wins or waits for a potentially racing thread to do so otherwise.
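A minimal sketch of this split, using a simulated child rather than a real process. All names here (`FakeChild`, `SplitPoller`, `poll_nonblocking`, `wait_blocking`) are hypothetical, and the use of `threading.Condition` for the "wait for the racing thread" step is an assumption; the actual change may coordinate the threads differently.

```python
import threading


class FakeChild:
    """Hypothetical child process: only one caller can reap its exit code."""

    def __init__(self, exit_code):
        self._exit_code = exit_code
        self._exited = threading.Event()
        self._guard = threading.Lock()
        self._reaped = False

    def exit(self):
        self._exited.set()

    def reap(self, block):
        """Return the exit code to the first reaper after exit, else None."""
        if block:
            self._exited.wait()
        elif not self._exited.is_set():
            return None  # still running
        with self._guard:
            if self._reaped:
                return None  # lost the race to another reaper
            self._reaped = True
            return self._exit_code


class SplitPoller:
    def __init__(self, child):
        self._child = child
        self._cond = threading.Condition()
        self.returncode = None

    def poll_nonblocking(self):
        # The whole operation runs with the lock held; it never blocks.
        with self._cond:
            if self.returncode is None:
                code = self._child.reap(block=False)
                if code is not None:
                    self.returncode = code
                    self._cond.notify_all()
            return self.returncode

    def wait_blocking(self):
        with self._cond:
            if self.returncode is not None:
                return self.returncode
        # Lock released here: the blocking wait must not stall other callers.
        code = self._child.reap(block=True)
        with self._cond:
            if code is not None:
                # This thread won the race and records the code.
                self.returncode = code
                self._cond.notify_all()
            else:
                # This thread lost; wait until the winner records the code.
                self._cond.wait_for(lambda: self.returncode is not None)
            return self.returncode
```

In this model every caller of `wait_blocking` returns the real exit code, whether or not it was the thread that reaped the child, while `poll_nonblocking` can still return `None` if it runs between the winner's reap and the winner's store.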
If a non-blocking call is racing with the unlocked part of a blocking call it may still "lose" the race, and return None instead of the exit code, even though the process is dead. However, as the process could be alive at the time the call is made but die immediately afterwards, this situation should already be handled by correctly written code.
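As an illustration of what "correctly written code" might look like on the caller's side, the sketch below treats a `None` exit code as "not known yet" and keeps joining. The helper name `wait_for_exitcode` and its parameters are hypothetical; it relies only on the documented `exitcode` attribute and `join(timeout)` method of `multiprocessing.Process`.

```python
def wait_for_exitcode(proc, poll_interval=0.05, max_polls=100):
    """Return proc's exit code, or None if it is still unknown after max_polls.

    `proc` is expected to behave like multiprocessing.Process: `exitcode`
    is None until the exit status has been collected, and `join(timeout)`
    waits up to `timeout` seconds for the process to finish.
    """
    for _ in range(max_polls):
        code = proc.exitcode
        if code is not None:
            return code
        # None means "no result yet", not "still alive for certain":
        # wait a little and check again instead of acting on it.
        proc.join(poll_interval)
    return None
```

The point of the loop is that a single `None` is never treated as final, so a non-blocking check that loses the race costs one extra iteration at most.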
To verify the behaviour, a test is added which reliably triggers failures for all three implementations. A work-around for this bug in a test added for gh-128041 is also reverted.
multiprocessing.Process.is_alive() can incorrectly return True after join() #130895