Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork34.3k
gh-146313: Fix multiprocessing ResourceTracker deadlock after os.fork()#146316
gh-146313: Fix multiprocessing ResourceTracker deadlock after os.fork()#146316gpshead wants to merge 1 commit intopython:mainfrom
Conversation
ProblemResourceTracker.__del__ (added inpythongh-88887) calls os.waitpid(pid, 0)which blocks indefinitely if a process created via os.fork() stillholds the tracker pipe's write end. The tracker never sees EOF, neverexits, and the parent hangs at interpreter shutdown.Root causeThree requirements conflict:-pythongh-88887 wants the parent to reap the tracker to prevent zombies-pythongh-80849 wants mp.Process(fork) children to reuse the parent's tracker via the inherited pipe fd-pythongh-146313 shows the parent can't block in waitpid() if a child holds the fd -- the tracker won't see EOF until all copies closeFixTwo layers:Timeout safety-net. _stop_locked() gains a wait_timeout parameter.When called from __del__, it polls with WNOHANG using exponentialbackoff for up to 1 second instead of blocking indefinitely.At-fork handler. An os.register_at_fork(after_in_child=...) handlercloses the inherited pipe fd in the child unless a preserve flag isset. popen_fork.Popen._launch() sets the flag before its fork somp.Process(fork) children keep the fd and reuse the parent's tracker(preservingpythongh-80849). Raw os.fork() children close the fd, lettingthe parent reap promptly.Result Scenario Before After Raw os.fork(), parent exits while child alive deadlock ~30ms reap mp.Process(fork), parent joins then exits ~30ms reap ~30ms reap mp.Process(fork), parent exits abnormally deadlock 1s bounded wait No fork (pythongh-88887 scenario) ~30ms reap ~30ms reapThe at-fork handler makes the timeout unreachable in all well-behavedpaths. The timeout remains as a safety net for abnormal shutdowns.
itamaro left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
mostly nits, overall looks good, thanks!
| os.close(self._fd) | ||
| except OSError: | ||
| pass | ||
| self._fd = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
setting this toNoneafter closing exposes us to race conditions?
| # - A raw os.fork() leaves the flag unset. We close the fd so | ||
| # the parent's __del__ can reap the tracker without waiting | ||
| # for us to exit. If we later need a tracker, ensure_running() | ||
| # will launch a fresh one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
| # - A raw os.fork() leaves the flag unset. We close the fd so | |
| # the parent's __del__ can reap the tracker without waiting | |
| # forus to exit. If we later need a tracker, ensure_running() | |
| # will launch a fresh one. | |
| # - A raw os.fork() leaves the flag unset. We close the fdin the child after forkingso | |
| # the parent's __del__ can reap the tracker without waiting | |
| # forthe child to exit. If we later need a tracker, ensure_running() | |
| # will launch a fresh one. |
| # - multiprocessing.Process with the 'fork' start method sets | ||
| # _fork_intent.preserve_fd before forking. The child keeps the | ||
| # fd and reuses the parent's tracker (gh-80849). This is safe | ||
| # because multiprocessing's atexit handler joins all children | ||
| # before the parent's __del__ runs, so by then the fd copies | ||
| # are gone and the parent can reap the tracker promptly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
it might besafe, but it's unclear to me why this isdesirable in themultiprocessing.Process scenario
why do we ever want forked children to share the resource tracker with the parent?
When you're done making the requested changes, leave the comment: |
Uh oh!
There was an error while loading.Please reload this page.
Problem
ResourceTracker.__del__(added ingh-88887) callsos.waitpid(pid, 0)which blocks indefinitely if a process created viaos.fork()still holds the tracker pipe's write end. The tracker never sees EOF, never exits, and the parent hangs at interpreter shutdown.Root cause
Three requirements conflict:
mp.Process(fork)children to reuse the parent's tracker via the inherited pipe fdwaitpid()if a child holds the fd -- the tracker won't see EOF until all copies closeFix
Two layers:
Timeout safety-net.
_stop_locked()gains await_timeoutparameter. When called from__del__, it polls withWNOHANGusing exponential backoff for up to 1 second instead of blocking indefinitely.At-fork handler. An
os.register_at_fork(after_in_child=...)handler closes the inherited pipe fd in the child unless a preserve flag is set.popen_fork.Popen._launch()sets the flag before its fork somp.Process(fork)children keep the fd and reuse the parent's tracker (preservinggh-80849). Rawos.fork()children close the fd, letting the parent reap promptly.Result
os.fork(), parent exits while child alivemp.Process(fork), parent joins then exitsmp.Process(fork), parent exits abnormallyThe at-fork handler makes the timeout unreachable in all well-behaved paths. The timeout remains as a safety net for abnormal shutdowns.