docs: add troubleshooting steps for prebuilt workspaces #20231
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -247,6 +247,71 @@ When prebuilt workspaces are configured for an organization, Coder creates a "pr | ||||||
If a quota is exceeded, the prebuilt workspace will fail provisioning the same way other workspaces do. | ||||||
### Managing prebuild provisioning queues | ||||||
Prebuilt workspaces can overwhelm a Coder deployment, causing significant delays when users and template administrators attempt to create new workspaces or manage their templates. This can happen in two scenarios: | ||||||
1. **Organic overload**: Not enough provisioners to meet the deployment's needs | ||||||
2. **Broken template**: A template that mistakenly requests too many prebuilt workspaces | ||||||
Review comment on lines +254 to +255: I think the issue here is actually a combination of these two factors: there aren't enough resources to handle the high demand from prebuild-related provisioner jobs. This problem can be further amplified when those jobs take a long time to complete. Additionally, it might be worth explaining another scenario: when a user creates a new template version (a user-initiated job), once that job is processed and the prebuild reconciliation loop runs, the loop adds even more load by scheduling new prebuild-related jobs. This means the queue could now include jobs for both template version 1 and version 2. | ||||||
In the second case, it can be difficult to fix the situation because you cannot upload a corrected template version while the provisioners are overloaded. | ||||||
The troubleshooting steps below will help you resolve this situation: | ||||||
- Pause prebuilt workspace reconciliation to stop the problem from getting worse | ||||||
- Check how many prebuild jobs are clogging your provisioner queue | ||||||
- Cancel excess prebuild jobs to free up provisioners for human users | ||||||
- Fix any problematic templates that are causing the issue | ||||||
- Resume prebuilt reconciliation once everything is back to normal | ||||||
If your Coder deployment is exhibiting the above symptoms, follow these instructions to verify and then rectify the situation: | ||||||
First, run: | ||||||
Review comment: nit: maybe having this numbered would help? | ||||||
```bash | ||||||
coder prebuilds pause | ||||||
``` | ||||||
This prevents further pollution of your provisioner queues by stopping the prebuilt workspaces feature from scheduling new creation jobs. Jobs that have already been enqueued will still be processed. | ||||||
Review comment: nit: maybe worth adding a note that this will pause prebuilds system-wide, not just organization-wide. | ||||||
**Important**: Remember to run `coder prebuilds resume` once all impact has been mitigated (see the last step in this section). | ||||||
Next, run: | ||||||
```bash | ||||||
coder provisioner jobs list --status=pending --initiator=prebuilds | ||||||
``` | ||||||
This will show a list of all pending jobs that have been enqueued by the prebuilt workspace system. The length of this list indicates whether prebuilt workspaces have overwhelmed your Coder deployment. | ||||||
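As a rough way to turn "the length of this list" into a check, the sketch below compares a pending-job count against a threshold. The helper name `check_prebuild_backlog` and the default threshold of 50 are hypothetical and chosen only for illustration; the usage comment also assumes the list command can emit JSON via `--output json` and that `jq` is installed.

```bash
# Hypothetical helper (not part of the coder CLI): compare a pending
# prebuild job count against a threshold chosen by the operator.
# Returns 0 when the count is within the threshold, 1 otherwise.
check_prebuild_backlog() {
  local count="$1"
  local threshold="${2:-50}"   # illustrative default, tune per deployment
  if [ "$count" -gt "$threshold" ]; then
    echo "backlog: $count pending prebuild jobs (threshold $threshold)"
    return 1
  fi
  echo "ok: $count pending prebuild jobs (threshold $threshold)"
  return 0
}

# Example usage (assumes JSON output is available and jq is installed):
#   count=$(coder provisioner jobs list --status=pending --initiator=prebuilds --output json | jq 'length')
#   check_prebuild_backlog "$count" 50
```

What counts as "too many" depends on how many provisioners the deployment runs and how long each build takes, so the threshold is left as a parameter.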
Human-initiated jobs have priority over pending prebuild jobs, but running prebuild jobs cannot be preempted. A long list of pending prebuild jobs makes it more likely that all provisioners are already occupied when a user wants to create a workspace, so users are more likely to wait for the next available provisioner. | ||||||
Review comment: Nice 👍 | ||||||
To ensure that the next available provisioner will be given to a human-initiated job, run: | ||||||
Review comment: I'm not sure this sentence is entirely accurate. Since human-initiated jobs already have priority over prebuild-related jobs, the next available provisioner will automatically be assigned a human-initiated job if there is one. The purpose of this behavior is to help clear the queue and prevent situations where all provisioner daemons are occupied with prebuild-related jobs, which could delay human-initiated ones. | ||||||
```bash | ||||||
coder provisioner jobs list --status=pending --initiator=prebuilds --output json | jq -r '.[].id' | xargs -n1 -P2 -I{} coder provisioner jobs cancel {}
```
Review comment: AFAIU, this command won't actually print the list of jobs; it will pipe them directly into jq. I think it would be useful to show the list of jobs first, so users can review them before deciding to cancel. That way, they could choose to cancel only a subset of prebuilds if needed. Wouldn't it make more sense for `coder provisioner jobs cancel` to accept a list of job IDs? | ||||||
This will clear the provisioner queue of all jobs that were not initiated by a human being, which increases the probability that a provisioner will be available when the next human operator needs it. It does not cancel running provisioner jobs, so there may still be some delay in processing new provisioner jobs until a provisioner completes its current job. | ||||||
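For operators who would rather review the job list before cancelling anything, a dry-run wrapper is one option. The following is a minimal sketch, not part of the coder CLI: the function name `cancel_jobs`, the `DRY_RUN` variable, and the file name in the usage comment are all hypothetical, and the usage comment assumes the list command supports `--output json` and that `jq` is installed.

```bash
# Hypothetical helper: read provisioner job IDs from stdin, one per line.
# With DRY_RUN=1 (the default) it only prints what it would cancel;
# with DRY_RUN=0 it calls `coder provisioner jobs cancel` for each ID.
cancel_jobs() {
  local id
  while IFS= read -r id; do
    [ -n "$id" ] || continue   # skip blank lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would cancel $id"
    else
      # Redirect stdin so the CLI cannot consume the remaining IDs.
      coder provisioner jobs cancel "$id" </dev/null
    fi
  done
}

# Example usage (assumes JSON output is available and jq is installed):
#   coder provisioner jobs list --status=pending --initiator=prebuilds --output json \
#     | jq -r '.[].id' > pending-prebuild-jobs.txt
#   cancel_jobs < pending-prebuild-jobs.txt             # review the dry-run output
#   DRY_RUN=0 cancel_jobs < pending-prebuild-jobs.txt   # then cancel for real
```

Writing the IDs to a file first makes it possible to delete lines for any jobs that should be kept, so only a subset is cancelled.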
At this stage, most prebuild-related impact will have been mitigated. There may still be a broken template version, but it will no longer pollute provisioner queues with prebuilt workspace jobs. If the latest version of a template is also broken for reasons unrelated to prebuilds, users can still create workspaces using a previous template version. Some running jobs may have been initiated by the prebuild system, but these cannot be cancelled without potentially orphaning resources that have already been deployed by Terraform. Depending on your deployment and template provisioning times, it might be best to upload a new template version and wait for it to be processed organically. | ||||||
If you need to expedite the processing of human-related jobs at the cost of some infrastructure housekeeping, you can run: | ||||||
Review comment: I think it would be good to include a warning about the infrastructure housekeeping implications here, and clarify that this command should generally be used as a last resort. | ||||||
```bash | ||||||
coder provisioner jobs list --status=running --initiator=prebuilds --output json | jq -r '.[].id' | xargs -n1 -P2 -I{} coder provisioner jobs cancel {}
``` | ||||||
This will cancel running prebuild jobs (orphaning any resources that have already been deployed) and immediately make room for human-initiated jobs. | ||||||
Once the provisioner queue has been cleared and all templates have been fixed, resume prebuild reconciliation by running: | ||||||
```bash | ||||||
coder prebuilds resume | ||||||
``` | ||||||
This re-enables the prebuilt workspaces feature and allows the reconciliation loop to resume normal operation. The system will begin creating new prebuilt workspaces according to your template configurations. | ||||||
### Template configuration best practices | ||||||
#### Preventing resource replacement | ||||||