- Notifications
You must be signed in to change notification settings - Fork576
UN-3098 [FIX] Fixed graceful shutdown in tool container to ensure in-progress tasks complete#1708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Uh oh!
There was an error while loading.Please reload this page.
Conversation
coderabbitaibot commentedDec 17, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings. WalkthroughSignal handlers were removed and SIGINT/SIGTERM are now ignored at startup in the sidecar and SDK entrypoint; container start command no longer uses Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Organization UI Review profile: CHILL Plan: Pro Cache: Disabled due to Reviews > Disable Cache setting Knowledge base: Disabled due to 📒 Files selected for processing (1)
🧰 Additional context used🧬 Code graph analysis (1)runner/src/unstract/runner/runner.py (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
Thanks for usingCodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tool-sidecar/src/unstract/tool_sidecar/log_processor.py (1)
226-259:Critical: Monitoring loop never checks shutdown flag.The
monitor_logs()method'swhile Trueloop (starting at line 236) never checks the_shutdown_requestedflag. Even after the signal handler sets this flag, the monitoring loop will continue running indefinitely. For graceful shutdown to work, you must check the flag within the loop.Add a flag check in the monitoring loop:
# Monitor the file for new content with open(self.log_path) as f: while True:+ # Check if shutdown was requested+ if _shutdown_requested:+ logger.info("Shutdown requested - exiting monitoring loop gracefully")+ self._update_tool_execution_status(+ status=ToolExecutionStatus.STOPPED,+ error="Shutdown signal received"+ )+ break+ # Remember current position where = f.tell() line = f.readline()Note: You may also want to check the flag at line 249 (during the sleep) to ensure more responsive shutdown during idle periods.
🧹 Nitpick comments (1)
tool-sidecar/src/unstract/tool_sidecar/log_processor.py (1)
261-326:Optional: Consider logging shutdown reason after monitoring completes.While the signal handlers are correctly registered and the monitoring loop is started, consider adding a check after
processor.monitor_logs()completes to log whether the exit was due to graceful shutdown or normal completion. This improves observability.processor.monitor_logs()++ if _shutdown_requested:+ logger.info("Log processor exited due to shutdown signal")+ else:+ logger.info("Log processor completed normally")
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due toReviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
tool-sidecar/src/unstract/tool_sidecar/log_processor.py(1 hunks)unstract/sdk1/src/unstract/sdk1/tool/entrypoint.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
unstract/sdk1/src/unstract/sdk1/tool/entrypoint.py (1)
tool-sidecar/src/unstract/tool_sidecar/log_processor.py (1)
_signal_handler(31-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
Uh oh!
There was an error while loading.Please reload this page.
| # Global flag for graceful shutdown | ||
| _shutdown_requested=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash# Search for all usages of _shutdown_requested flag in the codebaseecho"=== Searching for _shutdown_requested usages ==="rg -n'_shutdown_requested' --type pyecho -e"\n=== Searching for conditional checks on shutdown flag ==="rg -n'if.*shutdown|while.*shutdown' --type py -A 2
Repository: Zipstack/unstract
Length of output: 1548
🏁 Script executed:
cat -n unstract/sdk1/src/unstract/sdk1/tool/entrypoint.py
Repository: Zipstack/unstract
Length of output: 2003
🏁 Script executed:
cat -n unstract/sdk1/src/unstract/sdk1/tool/executor.py
Repository: Zipstack/unstract
Length of output: 3547
Critical: Global flag defined and set but never checked.
The_shutdown_requested flag is set toTrue by the signal handler but is never checked anywhere in the codebase. Theexecute_run() method callsself.tool.run() at line 73, which blocks until completion. Without a check of the flag (such as in a loop or before/after the blocking call), the signal handler's attempt to initiate graceful shutdown has no effect—the tool will continue running regardless of the SIGTERM or SIGINT signal.
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
runner/src/unstract/runner/runner.py (1)
292-321:The generated shell script lacks signal handling for graceful shutdown.The
_get_container_commandmethod builds a shell script that runs the tool command, but it doesn't include signal forwarding or trap handlers. With the removal ofdumb-init(line 458) andexecinentrypoint.sh, this shell script needs explicit signal management to enable graceful shutdown.To support graceful shutdown after removing
dumb-init, add signal forwarding to the generated shell script:def _get_container_command( self, shared_log_dir: str, shared_log_file: str, settings: dict[str, Any]): """Returns the container command to run the tool.""" settings_json = json.dumps(settings).replace("'", "\\'") # Prepare the tool execution command tool_cmd = ( f"python main.py --command RUN " f"--settings '{settings_json}' --log-level DEBUG" ) if not self.sidecar_enabled: return tool_cmd # Shell script components+ signal_handling = (+ "trap 'kill -TERM $tool_pid 2>/dev/null' TERM INT; "+ "trap 'kill -HUP $tool_pid 2>/dev/null' HUP; "+ ) mkdir_cmd = f"mkdir -p {shared_log_dir}" run_tool_fn = ( "run_tool() { " f"{tool_cmd}; " "exit_code=$?; " f'echo "{LogFieldName.TOOL_TERMINATION_MARKER} with exit code $exit_code" ' f">> {shared_log_file}; " "return $exit_code; " "}" )- execute_cmd = f"run_tool > {shared_log_file} 2>&1"+ execute_cmd = f"run_tool > {shared_log_file} 2>&1 & tool_pid=$!; wait $tool_pid" # Combine all commands- shell_script = f"{mkdir_cmd} && {run_tool_fn}; {execute_cmd}"+ shell_script = f"{mkdir_cmd} && {signal_handling} {run_tool_fn}; {execute_cmd}" return shell_scriptThis ensures signals are forwarded to the Python process for proper graceful shutdown handling.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due toReviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
runner/src/unstract/runner/runner.py(1 hunks)tool-sidecar/entrypoint.sh(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
unstract/sdk1/src/unstract/sdk1/tool/entrypoint.py (1)
16-25:Critical: Ignoring SIGTERM/SIGINT breaks container orchestration and prevents graceful shutdown.The PR objectives state this change "Fixed graceful shutdown in tool container to ensure in-progress tasks complete", but setting
signal.SIG_IGNdoes the opposite—it tells the OS to ignore shutdown signals entirely. This creates several critical problems:
Container orchestration failure: Container platforms (Kubernetes, Docker) send SIGTERM to request graceful shutdown. If the process ignores SIGTERM, the platform waits for a grace period (typically 30s) then sends SIGKILL, forcefully terminating the container. This defeats the purpose of graceful shutdown.
Misleading behavior: The comment states "prevents the signal from interrupting the current operation," but the signal hasalready interrupted execution (that's why the handler was invoked). Setting
SIG_IGNonly affectsfuture signals, not the current one.No actual shutdown logic: After the handler returns, control goes back to the interrupted code with no shutdown mechanism. The process continues indefinitely until force-killed.
For proper graceful shutdown in a container environment, you should:
- Let the signal handler set a flag or trigger shutdown logic
- Allow the current operation to complete naturally
- Exit the process cleanly so the container terminates with exit code 0
Apply this diff to implement proper graceful shutdown:
+# Global flag for graceful shutdown+_shutdown_requested = False+ class ToolEntrypoint: """Class that contains methods for the entrypoint for a tool.""" @staticmethod def _signal_handler(signum: int, _frame: types.FrameType | None) -> None: """Handle SIGTERM and SIGINT signals."""+ global _shutdown_requested sig = signal.Signals(signum) signal_name = sig.name logger.warning(f"RECEIVED SIGNAL: {signal_name}")- logger.warning("Allowing tool to complete current operation...")-- # Ignore the signal to allow the process to continue- # This prevents the signal from interrupting the current operation- signal.signal(signum, signal.SIG_IGN)+ logger.warning("Initiating graceful shutdown after current operation...")+ _shutdown_requested = TrueThen check the flag after
executor.execute(parsed_args)completes:parsed_args = ToolArgsParser.parse_args(args) executor = ToolExecutor(tool=tool) executor.execute(parsed_args)++ # Exit cleanly if shutdown was requested+ if _shutdown_requested:+ logger.info("Shutdown completed gracefully")+ sys.exit(0)Additionally, the executor's long-running operations should periodically check
_shutdown_requestedand break/cleanup when set.
♻️ Duplicate comments (1)
tool-sidecar/src/unstract/tool_sidecar/log_processor.py (1)
31-40:Critical: Identical issue—ignoring shutdown signals breaks sidecar termination.This signal handler has the same critical flaw as in
entrypoint.py: settingsignal.SIG_IGNprevents the sidecar from responding to SIGTERM/SIGINT, which breaks container orchestration.Specific concerns for the sidecar:
Pod-level coordination: In Kubernetes, when a pod is terminated, all containers receive SIGTERM simultaneously. The sidecar should honor this signal and shut down gracefully alongside the main tool container.
Existing completion detection: The
monitor_logs()method already has proper completion detection at lines 241-244 (checking for the "completed" file). The signal handler should complement this, not ignore signals.Orphaned sidecar: If the main tool container exits but the sidecar ignores shutdown signals, the sidecar becomes orphaned and must be force-killed.
Apply this diff for proper graceful shutdown:
+# Global flag for graceful shutdown+_shutdown_requested = False++ def _signal_handler(signum: int, _frame: types.FrameType | None): """Handle shutdown signals gracefully."""+ global _shutdown_requested sig = signal.Signals(signum) signal_name = sig.name logger.warning(f"RECEIVED SIGNAL: {signal_name}")- logger.warning("Allowing sidecar to continue monitoring...")-- # Ignore the signal to allow the process to continue- # This prevents the signal from interrupting the monitoring operation- signal.signal(signum, signal.SIG_IGN)+ logger.warning("Initiating graceful shutdown of sidecar...")+ _shutdown_requested = TrueThen modify
monitor_logs()to check the flag:# Remember current position where = f.tell() line = f.readline() if not line:+ # Check for shutdown signal+ if _shutdown_requested:+ logger.info("Shutdown signal received, exiting monitoring loop")+ break+ # No new data, check if tool container is done if os.path.exists( os.path.join(os.path.dirname(self.log_path), "completed") ): break
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due toReviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
tool-sidecar/src/unstract/tool_sidecar/log_processor.py(1 hunks)unstract/sdk1/src/unstract/sdk1/tool/entrypoint.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
unstract/sdk1/src/unstract/sdk1/tool/executor.py (1)
79-85:Remove redundant traceback logging.The full traceback is already captured by
exc_info=Trueon line 80. Lines 82-85 duplicate this information unnecessarily.Apply this diff to remove the redundancy:
- msg = f"Error while running tool '{tool_name}': {type(e).__name__}: {str(e)}"- logger.error(msg, stack_info=True, exc_info=True)-- # Log more details about the exception- import traceback-- logger.error(f"Full traceback:\n{traceback.format_exc()}")+ msg = f"Error while running tool '{tool_name}': {type(e).__name__}: {str(e)}"+ logger.error(msg, stack_info=True, exc_info=True)The enhanced error message with exception type (line 79) is a good improvement and should be kept.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due toReviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
runner/src/unstract/runner/runner.py(1 hunks)unstract/sdk1/src/unstract/sdk1/tool/executor.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (1)
runner/src/unstract/runner/runner.py (1)
458-458:Architectural decision confirmed: Signal ignoring is intentional and works as designed.The review comment's concerns arevalid and confirmed. The codebase deliberately ignores SIGTERM and SIGINT at startup (entrypoint.py lines 27-28, log_processor.py lines 252-253), with explicit documentation stating this prevents "any interruption of tool operations." This is paired with direct shell invocation (
/bin/sh -c) without dumb-init, making the shell PID 1.Confirmed implications:
Kubernetes compatibility: Containers will not respond to graceful SIGTERM. The orchestrator will force-kill (SIGKILL) after the grace period, bypassing graceful shutdown. This aligns with the PR objective ("ensure in-progress tasks complete") but trades responsive termination for guaranteed completion.
PID 1 zombie process handling: Without an init system, shell processes won't reap child zombies. The codebase acknowledges this risk (health.py detects
psutil.STATUS_ZOMBIEprocesses) but only monitors, not prevents, accumulation.PR objective fulfilled: Tasks will complete without interruption, but at the cost of Kubernetes-standard graceful termination behavior.
Recommendation: This architectural decision should be explicitly documented with:
- Why signal ignoring is preferable to graceful shutdown for this use case
- Expected behavior when Kubernetes sends SIGTERM (acknowledge it won't trigger shutdown)
- Guidance on monitoring zombie process accumulation in production deployments
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due toReviews -> Disable Knowledge Base setting
📒 Files selected for processing (3)
runner/src/unstract/runner/runner.py(1 hunks)tool-sidecar/src/unstract/tool_sidecar/log_processor.py(1 hunks)unstract/sdk1/src/unstract/sdk1/tool/entrypoint.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Test ResultsSummary
Runner Tests - Full Report
SDK1 Tests - Full Report
|



What
Why
How
Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
Screenshots
Checklist
I have read and understood theContribution Guidelines.