Closed
Description
Sending SIGTERM to the coder server is supposed to trigger a graceful shutdown that drains build jobs before exiting. However, it seems that when a build job is running at the time SIGTERM is received, the job gets interrupted anyway:
Stop caught, waiting for provisioner jobs to complete and gracefully exiting. Use ctrl+\ to force quit
Shutting down API server...
2024-08-23 15:12:17.146 [info] provisionerd-40d0ef3f-5f61-40ea-838a-45d20073363d-3.runner: workspace provisioner job logged job_id=4b457a13-609f-413b-bf61-fd29bf86bebd template_name=workspace-v1 template_version=zealous_borg5 workspace_build_id=60a52a9c-e60b-4a0a-85f8-7eb3a1775151 workspace_id=d8b32732-8313-47a1-b12e-61a5be6ea289 workspace_name=[redacted] workspace_owner=[redacted] workspace_transition=start level=INFO workspace_build_id=60a52a9c-e60b-4a0a-85f8-7eb3a1775151 ... output= Interrupt received. Please wait for Terraform to exit or data loss may occur. Gracefully shutting down...
2024-08-23 15:12:17.146 [info] provisionerd-40d0ef3f-5f61-40ea-838a-45d20073363d-3.runner: workspace provisioner job logged job_id=4b457a13-609f-413b-bf61-fd29bf86bebd template_name=workspace-v1 template_version=zealous_borg5 workspace_build_id=60a52a9c-e60b-4a0a-85f8-7eb3a1775151 workspace_id=d8b32732-8313-47a1-b12e-61a5be6ea289 workspace_name=[redacted] workspace_owner=[redacted] workspace_transition=start level=INFO output="Stopping operation..." workspace_build_id=60a52a9c-e60b-4a0a-85f8-7eb3a1775151
2024-08-23 15:12:17.146 [info] provisionerd-40d0ef3f-5f61-40ea-838a-45d20073363d-3.runner: workspace provisioner job logged job_id=4b457a13-609f-413b-bf61-fd29bf86bebd template_name=workspace-v1 template_version=zealous_borg5 workspace_build_id=60a52a9c-e60b-4a0a-85f8-7eb3a1775151 workspace_id=d8b32732-8313-47a1-b12e-61a5be6ea289 workspace_name=[redacted] workspace_owner=[redacted] workspace_transition=start level=INFO output="netflix_ec2.dev: Modifications errored after 24s" workspace_build_id=60a52a9c-e60b-4a0a-85f8-7eb3a1775151
This happened with systemd configured to send the coder server SIGTERM and wait 10 minutes before following up with SIGKILL. However, the interrupt and "Stopping operation..." log messages appear to be immediate. The provider log also showed that its operation was cancelled partway through.
KillSignal=SIGTERM
SendSIGKILL=yes
TimeoutStopSec=10min
This is a high priority issue for us as it limits our ability to safely deploy updates.