Closed
Description
The prebuild system uses an active control loop that periodically reconciles prebuild state. The reconciliation loop creates new prebuilt workspaces as needed to meet a desired workspace count. This creation may fail. To allow operators to investigate such failures, we do not delete prebuilt workspaces that failed to build. As a system, prebuilds "never delete evidence".
Unfortunately, failed prebuilds might still create some resources successfully prior to their failure. These might incur infrastructure costs.
These facts, taken together, mean that prebuild failures over time risk becoming a significant expense both financially and in terms of platform management overhead.
To remedy this, we need four things:
- A configurable cap that stops creating new prebuilt workspaces for a preset if a certain number of failed prebuilt workspaces exist.
- A mechanism to notify operators of prebuild failures and of when the aforementioned cap is reached. This could be a Prometheus metric or a Coder inbox notification.
- A mechanism for operators to manually trigger a prebuild in order to troubleshoot.
- An automated mechanism to detect when prebuild reconciliation can resume, based on whether failed prebuilds have been cleaned up.
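As a rough illustration of how the cap and the automatic resume could interact, here is a minimal sketch in Go. It is not the actual prebuilds implementation: `PresetState`, `Decision`, `Reconcile`, and the `failureCap` parameter are hypothetical names, and treating a cap of `0` as "no cap" is an assumption. Because the decision is recomputed from current state on every reconciliation pass, creation resumes on its own once operators delete enough failed prebuilds.

```go
package main

import "fmt"

// PresetState is a hypothetical snapshot of one preset's prebuilt workspaces.
type PresetState struct {
	Desired int // desired number of prebuilt workspaces
	Running int // prebuilt workspaces that built successfully
	Failed  int // failed prebuilt workspaces that have not been deleted
}

// Decision is the hypothetical outcome of one reconciliation pass.
type Decision struct {
	Paused   bool // true when the failure cap blocks creation
	ToCreate int  // number of new prebuilt workspaces to create
}

// Reconcile decides how many prebuilds to create for a preset.
// Assumption: failureCap <= 0 disables the cap entirely.
func Reconcile(s PresetState, failureCap int) Decision {
	if failureCap > 0 && s.Failed >= failureCap {
		// Cap reached: create nothing until failed prebuilds are cleaned up.
		return Decision{Paused: true}
	}
	missing := s.Desired - s.Running
	if missing < 0 {
		missing = 0
	}
	return Decision{ToCreate: missing}
}

func main() {
	const failureCap = 2

	// Two failed prebuilds exist, so the cap is reached and creation pauses.
	state := PresetState{Desired: 3, Running: 1, Failed: 2}
	fmt.Printf("%+v\n", Reconcile(state, failureCap))

	// An operator deletes one failed prebuild; the next pass resumes creation.
	state.Failed = 1
	fmt.Printf("%+v\n", Reconcile(state, failureCap))
}
```

A notification or metric update would naturally hang off the transition into and out of `Paused`, and a manually triggered prebuild would bypass this check.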