Closed
Description
A customer's replica terminated due to the pubsub watchdog, but it hung and never shut down properly because `(*API).Close()` hung permanently waiting for websockets to drain.
We should:
- Have a shutdown timeout. We don't have one because we assume all customers run in Kubernetes, which kills pods if they don't shut down in time. This doesn't work when K8s didn't initiate the shutdown. This customer was using a custom solution with systemd.
- Kick agent RPC connections as soon as the API closes so they can find a new home quickly.
- Critical shutdowns due to the watchdog should kick all websockets immediately. Almost every websocket on Coder relies on pubsub to work properly.