gh-66587: Fix deadlock from pool worker death without communication #16103
base: main
Conversation
…ueue; adds test for issue22393/issue38084.
This looks good to me, just a few remarks:
Also pinging @tomMoral
For mine, I think this fix seems more elegant than #10441, but the tests in that PR seem to have more coverage. I personally prefer to just have the task fail, and the pool continue. The current behaviour is that the broken worker is immediately replaced and other work continues, but if you wait on the failed task then it will never complete. Now it does complete (with a failure), which means robust code can re-queue it if appropriate. I don't see any reason to tear down the entire pool. A few comments on the PR incoming.
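For anyone coming from the tracker, a minimal standalone sketch of the failure mode under discussion (the worker function and the timeout are illustrative, not taken from this PR):

    import multiprocessing
    import os

    def crash(_):
        # Simulate a worker dying without ever reporting back to the pool.
        os._exit(1)

    if __name__ == '__main__':
        with multiprocessing.Pool(processes=2) as pool:
            result = pool.map_async(crash, range(4))
            # Without the fix the pool replaces the dead workers but never
            # resolves the lost tasks, so a bare result.get() hangs forever;
            # with the fix the tasks complete with a failure and get() raises.
            result.get(timeout=30)   # timeout only so this demo cannot hang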
Lib/multiprocessing/pool.py Outdated
    worker.join()
    cleaned = True
    if pid in job_assignments:
Suggested change:

    if pid in job_assignments:

becomes:

    job = job_assignments.pop(pid, None)
    if job:
        outqueue.put((job, i, (False, RuntimeError("Worker died"))))
And some additional simplification below, of course.
Here is a batch of comments.
I have to say that I like this solution, as it is the most robust way of handling this (a kind of scheduler). But it also comes with more complexity and increased communication needs -> more chances for deadlocks.
One of the main arguments for the fail-on-error design is that there is no way to know in the main process whether the worker that died held a lock on one of the communication queues. In that situation, the only way to recover the system and avoid a deadlock is to kill the Pool and re-spawn one.
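To make that hazard concrete, here is a rough standalone sketch (not code from this PR): multiprocessing queues serialize writers with an internal lock, and a lock held by a process that dies is never released, so every later writer blocks forever.

    import multiprocessing as mp
    import os

    def die_holding(lock):
        lock.acquire()   # stand-in for the queue's internal write lock
        os._exit(1)      # the worker dies without ever releasing it

    if __name__ == '__main__':
        lock = mp.Lock()
        p = mp.Process(target=die_holding, args=(lock,))
        p.start()
        p.join()
        # The lock is still held by the dead process and nothing will ever
        # release it, so any put() guarded by it would block indefinitely.
        print(lock.acquire(timeout=5))   # prints False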
        job_assignments[value] = job
    else:
        try:
            cache[job]._set(i, (task_info, value))
Why don't you remove the job from job_assignments here? It would avoid an unnecessary operation when a worker dies gracefully.
Co-Authored-By: Steve Dower <steve.dower@microsoft.com>
Additional tests would certainly be a good idea.
    # Issue22393: test fix of indefinite hang caused by worker processes
    # exiting abruptly (such as via os._exit()) without communicating
    # back to the pool at all.
    prog = (
This can be written much more clearly using a multi-line string. See for example a very similar case in test_shared_memory_cleaned_after_process_termination in this file.
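Something along these lines, for instance; the exact program text is illustrative, only the multi-line-string shape is the point:

    prog = '''if 1:
        import os
        from multiprocessing import Pool

        with Pool(2) as pool:
            # Each worker calls os._exit() and dies without reporting back.
            try:
                pool.map_async(os._exit, [1, 1, 1, 1]).get()
            except RuntimeError:
                pass   # with the fix, the lost tasks fail instead of hanging
        '''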
    # Only if there is a regression will this ever trigger a
    # subprocess.TimeoutExpired.
    completed_process = subprocess.run(
        [sys.executable, '-E', '-S', '-O', '-c', prog],
The '-O' flag probably shouldn't be used here, but '-S' and '-E' seem fine.
Also, consider calling test.support.script_helper.interpreter_requires_environment(), and only use the '-E' flag if that returns False, as done by the other Python-script-running utilities in test.support.script_helper. Or just use test.support.script_helper.run_python_until_end() instead of subprocess.run().
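For reference, a rough sketch of the second option (assuming the helper keeps its current behaviour of building the command line and deciding -E/-I itself):

    from test.support import script_helper

    # run_python_until_end() constructs the interpreter command line and only
    # adds -E (or -I) when the interpreter does not require environment
    # variables to run.
    res, cmd_line = script_helper.run_python_until_end('-c', prog)
    self.assertEqual(res.rc, 0, res.err)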
@applio, I'm not sure where this one is at, but I believe there are some comments that still need to be addressed. I don't know if it's waiting on anything else, but it would probably be nice to get this merged.
Closing and re-opening to re-trigger CI.
bedevere-bot commented Sep 23, 2021
This missed the boat for inclusion in Python 3.9, which accepts security fixes only as of today.
The following commit authors need to sign the Contributor License Agreement:
Adds tracking of which worker process in the pool takes which job from the queue.
When a worker process dies without communicating back, the task/job it was running is lost with it. Because the pool now tracks which job each worker took off the job queue, the parent process can, upon detecting the death, put an item on the result queue marking that task/job as failed.
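In outline, the bookkeeping looks roughly like this; the identifier names follow the review snippets above, but the surrounding structure is heavily simplified:

    job_assignments = {}   # worker pid -> job it pulled from the task queue

    def record_assignment(pid, job):
        # Result handler: a worker has announced which job it picked up.
        job_assignments[pid] = job

    def report_lost_task(pid, i, outqueue):
        # Worker supervision: a worker exited without sending its result.
        job = job_assignments.pop(pid, None)
        if job is not None:
            # Wake whoever is waiting on this job instead of hanging forever.
            outqueue.put((job, i, (False, RuntimeError("Worker died"))))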
To guard against a future regression, the supplied test runs the scenario in a subprocess constrained by a timeout, so an indefinite hang cannot stall the test run.
https://bugs.python.org/issue22393