
gh-66587: Fix deadlock from pool worker death without communication#16103


Open

applio wants to merge 6 commits into python:main from applio:fix_multiprocessing_worker_died_indefinite_hang

Conversation

@applio (Member) commented Sep 13, 2019 (edited by bedevere-app bot):

Adds tracking of which worker process in the pool takes which job from the queue.

When a worker process dies without communicating back, its task/job would otherwise be lost. Because the parent process now tracks which job each worker took off the job queue as its task, it can, upon detecting the death, put an item on the result queue indicating that the task/job failed.

To guard against a future regression, the supplied test runs in a subprocess constrained by a timeout, so an indefinite hang cannot stall the rest of the test suite.

https://bugs.python.org/issue22393
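As a rough illustration of the idea (not the PR's actual code: the real change lives inside multiprocessing.pool's handler threads, and only the job_assignments name and the result-tuple shape mirror the patch):

import queue

# Sketch: track which worker took which task, and synthesize a failed
# result when that worker dies without reporting back.
job_assignments = {}      # worker pid -> (job, i) it took off the job queue
outqueue = queue.Queue()  # stands in for the pool's result queue

def record_assignment(pid, job, i):
    # Called when a worker reports which task it has picked up.
    job_assignments[pid] = (job, i)

def handle_worker_death(pid):
    # The worker exited without sending a result: put a failure on the
    # result queue so a caller blocked on that job wakes up.
    assignment = job_assignments.pop(pid, None)
    if assignment is not None:
        job, i = assignment
        outqueue.put((job, i, (False, RuntimeError("Worker died"))))

record_assignment(4242, job=7, i=0)
handle_worker_death(4242)
print(outqueue.get())  # (7, 0, (False, RuntimeError('Worker died')))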

@pierreglaser (Contributor):

This looks good to me, just a few remarks.

Also pinging @tomMoral.

@zooba (Member):

For my part, I think this fix is more elegant than #10441, but the tests in that PR seem to have more coverage.

I personally prefer to just have the task fail, and the pool continue. The current behaviour is that the broken worker is immediately replaced and other work continues, but if you wait on the failed task then it will never complete. Now it does complete (with a failure), which means robust code can re-queue it if appropriate. I don't see any reason to tear down the entire pool.
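As a rough sketch of the robust-caller pattern this enables (double() and the retry policy are illustrative, not part of the PR):

import multiprocessing

def double(x):
    return x * 2

def run_with_retry(pool, func, arg, retries=3):
    # With the fix, a task whose worker died completes with an error
    # instead of hanging, so the caller can decide to resubmit it.
    for _ in range(retries):
        result = pool.apply_async(func, (arg,))
        try:
            return result.get(timeout=60)
        except RuntimeError:   # e.g. the synthesized "Worker died" error
            continue           # the dead worker was replaced; re-queue
    raise RuntimeError("task failed repeatedly; giving up")

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        print(run_with_retry(pool, double, 21))  # 42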

A few comments on the PR incoming.

worker.join()
cleaned = True
if pid in job_assignments:
@zooba (Member) commented Sep 19, 2019 (edited):

Suggested change:

-if pid in job_assignments:
+job = job_assignments.pop(pid, None)
+if job:
+    outqueue.put((job, i, (False, RuntimeError("Worker died"))))

And some additional simplification below, of course.

@tomMoral (Contributor) left a comment:

Here is a batch of comments.

I have to say that I like this solution, as it is the most robust way of handling this (a kind of scheduler). But it also comes with more complexity and increased communication needs, and hence more chances for deadlocks.

One of the main arguments for the fail-on-error design is that there is no way to know in the main process whether the worker that died held a lock on one of the communication queues. In that situation, the only way to recover the system and avoid a deadlock is to kill the Pool and re-spawn one.
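A rough sketch of that fail-on-error recovery strategy (illustrative only, not CPython code; pool size and retry count are arbitrary):

import multiprocessing

def add_one(x):
    return x + 1

def map_with_respawn(func, items, processes=2, attempts=2):
    # Because the parent cannot tell whether a dead worker held a lock
    # on a communication queue, a broken pool is never reused: tear it
    # down and re-spawn a fresh one.
    for _ in range(attempts):
        pool = multiprocessing.Pool(processes)
        try:
            result = pool.map(func, items)
            pool.close()
            pool.join()
            return result
        except RuntimeError:
            pool.terminate()  # pool state may be corrupt
            pool.join()
    raise RuntimeError("pool kept breaking; giving up")

if __name__ == "__main__":
    print(map_with_respawn(add_one, [1, 2, 3]))  # [2, 3, 4]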

    job_assignments[value] = job
else:
    try:
        cache[job]._set(i, (task_info, value))
@tomMoral (Contributor):

Why don't you remove the job from job_assignments here? It would avoid an unnecessary operation when a worker dies gracefully.

applio and others added 2 commits September 22, 2019 16:37
Co-Authored-By: Steve Dower <steve.dower@microsoft.com>
@taleinat (Contributor) left a comment:

Additional tests would certainly be a good idea.

# Issue22393: test fix of indefinite hang caused by worker processes
# exiting abruptly (such as via os._exit()) without communicating
# back to the pool at all.
prog = (
@taleinat (Contributor):

This can be written much more clearly using a multi-line string. See, for example, the very similar case in test_shared_memory_cleaned_after_process_termination in this file.
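For illustration, the multi-line form might look like this (the program body is a stand-in assuming the fork start method, not the PR's actual test code; the "if 1:" idiom lets the embedded program be indented inside the test):

prog = '''if 1:
    import os
    from multiprocessing import Pool

    def die(_):
        os._exit(1)  # exit abruptly, telling the pool nothing

    with Pool(2) as pool:
        pool.map(die, range(4))  # with the fix: fails fast instead of hanging
'''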

# Only if there is a regression will this ever trigger a
# subprocess.TimeoutExpired.
completed_process = subprocess.run(
[sys.executable, '-E', '-S', '-O', '-c', prog],
@taleinat:

The '-O' flag probably shouldn't be used here, but '-S' and '-E' seem fine.

@taleinat:

Also, consider calling test.support.script_helper.interpreter_requires_environment(), and only use the '-E' flag if that returns False, as done by the other Python-script-running utilities in test.support.script_helper.

Or just use test.support.script_helper.run_python_until_end() instead of subprocess.run().
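A rough sketch of the run_python_until_end() route (assuming the current behavior of test.support.script_helper, which decides about '-E' internally by consulting interpreter_requires_environment()):

from test.support.script_helper import run_python_until_end

res, cmd_line = run_python_until_end('-c', "print('ok')")
if res.rc != 0:
    res.fail(cmd_line)  # raises AssertionError with the captured output
print(res.out)          # stdout is captured as bytes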

@csabella (Contributor):

@applio, I'm not sure where this one is at, but I believe there are some comments that still need to be addressed. I don't know if it's waiting on anything else, but it would probably be nice to get this merged.


@ambv (Contributor):

Closing and re-opening to re-trigger CI.

@ambv closed this Sep 23, 2021
@ambv reopened this Sep 23, 2021
@ambv added the 🔨 test-with-buildbots (Test PR w/ buildbots; report in status section) label Sep 23, 2021
@bedevere-bot:

🤖 New build scheduled with the buildbot fleet by @ambv for commit 6459284 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot removed the 🔨 test-with-buildbots (Test PR w/ buildbots; report in status section) label Sep 23, 2021
@ambv added the 🔨 test-with-buildbots (Test PR w/ buildbots; report in status section) and needs backport to 3.10 (only security fixes) labels Sep 23, 2021
@bedevere-bot:

🤖 New build scheduled with the buildbot fleet by @ambv for commit 6459284 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot removed the 🔨 test-with-buildbots (Test PR w/ buildbots; report in status section) label Sep 23, 2021
@ambv (Contributor):

This missed the boat for inclusion in Python 3.9, which accepts only security fixes as of today.

@ambv removed the needs backport to 3.9 (only security fixes) label May 17, 2022
@serhiy-storchaka added the needs backport to 3.11 (only security fixes) label May 20, 2022
@serhiy-storchaka added the needs backport to 3.12 (only security fixes) and needs backport to 3.13 (bugs and security fixes) labels and removed the needs backport to 3.10 (only security fixes) and needs backport to 3.11 (only security fixes) labels May 9, 2024
@Yhg1s removed the needs backport to 3.12 (only security fixes) label Apr 8, 2025
@python-cla-bot:

The following commit authors need to sign the Contributor License Agreement:

CLA signed

@serhiy-storchaka added the needs backport to 3.14 (bugs and security fixes) label May 8, 2025
@gpshead changed the title from "bpo-22393: Fix deadlock from pool worker death without communication" to "gh-66587: Fix deadlock from pool worker death without communication" May 22, 2025
@gpshead self-assigned this May 22, 2025
@gpshead added the stdlib (Python modules in the Lib dir) and topic-multiprocessing labels May 22, 2025
@gpshead self-requested a review as a code owner May 22, 2025 18:02
Reviewers

@taleinat left review comments
@zooba left review comments
@tomMoral left review comments
@pablogsal awaiting requested review
@gpshead awaiting requested review (code owner)

Assignees

@gpshead

Labels
awaiting changes, needs backport to 3.13 (bugs and security fixes), needs backport to 3.14 (bugs and security fixes), stdlib (Python modules in the Lib dir), topic-multiprocessing
Projects
None yet
Milestone
No milestone

13 participants
@applio @pierreglaser @zooba @csabella @ambv @bedevere-bot @taleinat @tomMoral @gpshead @serhiy-storchaka @Yhg1s @the-knights-who-say-ni @ezio-melotti
