docs: add troubleshooting steps for prebuilt workspaces #20231
base: main
Conversation
Nice work 🚀 Should we also mention that users can tune the `CODER_PREBUILDS_RECONCILIATION_INTERVAL` to manage how frequently the prebuild reconciliation loop runs? That might help reduce the load from frequent reconciliations. Wdyt?
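A minimal sketch of the tuning suggested in this comment, assuming the interval is read from the server's environment at startup. The variable name is copied verbatim from the comment and the `5m` value is purely illustrative; both should be checked against the server configuration reference for your Coder version.

```shell
# Assumption: variable name taken verbatim from the review comment;
# 5m is an illustrative value, not a recommended setting.
export CODER_PREBUILDS_RECONCILIATION_INTERVAL=5m
```

A longer interval means the reconciliation loop schedules prebuild jobs less often, at the cost of slower replenishment of the prebuilt workspace pool.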
1. **Organic overload**: Not enough provisioners to meet the deployment's needs
2. **Broken template**: A template that mistakenly requests too many prebuilt workspaces
I think the issue here is actually a combination of these two factors: there aren’t enough resources to handle the high demand from prebuild-related provisioner jobs. This problem can be further amplified when those jobs take a long time to complete.
Additionally, it might be worth explaining one more scenario: when a user creates a new template version (a user-initiated job) and it has been processed, the next run of the prebuild reconciliation loop adds even more load by scheduling new prebuild-related jobs. This means the queue could now include jobs for both template version 1 and version 2.
If your Coder deployment is exhibiting the above symptoms, follow these instructions to verify and then rectify the situation:
First, run:
nit: maybe having this numbered would help?
Suggested change:
- First, run:
+ 1) Pause prebuilt workspace reconciliation
```bash
coder prebuilds pause
```
This prevents further pollution of your provisioner queues by stopping the prebuilt workspaces feature from scheduling new creation jobs. Jobs that have already been enqueued will still be processed.
nit: maybe worth adding a note that this will pause prebuilds system-wide, not just organization-wide
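One way the scope note could be surfaced to operators is a small wrapper that prints the reminder right after pausing. This is only a sketch: `pause_prebuilds` is a hypothetical helper name, and it assumes the `coder` CLI is installed, logged in with sufficient privileges, and provides `coder prebuilds resume` as the counterpart to `coder prebuilds pause`.

```shell
# Sketch: pause prebuild reconciliation and remind the operator of its scope.
# pause_prebuilds is a hypothetical helper, not part of the coder CLI.
pause_prebuilds() {
  # Stop here if the pause itself fails (e.g. not logged in).
  coder prebuilds pause || return 1
  echo "Prebuilds are paused for the entire deployment, not just one organization."
  echo "Run 'coder prebuilds resume' once the provisioner queue has recovered."
}
```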
This will show a list of all pending jobs that have been enqueued by the prebuilt workspace system. The length of this list indicates whether prebuilt workspaces have overwhelmed your Coder deployment.
Human-initiated jobs have priority over pending prebuild jobs, but running prebuild jobs cannot be preempted. A long list of pending prebuild jobs increases the likelihood that all provisioners are already occupied when a user wants to create a workspace. This increases the likelihood that users will experience delays waiting for the next available provisioner.
Nice 👍
To ensure that the next available provisioner will be given to a human-initiated job, run:
I’m not sure this sentence is entirely accurate. Since human-initiated jobs already have priority over prebuild-related jobs, the next available provisioner will automatically be assigned a human-initiated job if there is one. The purpose of this behavior is to help clear the queue and prevent situations where all provisioner daemons are occupied with prebuild-related jobs, which could delay human-initiated ones.
To ensure that the next available provisioner will be given to a human-initiated job, run:
```bash
coder provisioner jobs list --status=pending --initiator=prebuilds | jq -r '.[].id' | xargs -n1 -P2 -I{} coder provisioner jobs cancel {}
```
AFAIU, this command won’t actually print the list of jobs — it will pipe them directly into jq. I think it would be useful to show the list of jobs first, so users can review them before deciding to cancel. That way, they could choose to cancel only a subset of prebuilds if needed.
Wouldn’t it make more sense for coder provisioner jobs cancel to accept a list of job IDs?
Right now, we don’t support cancelling multiple jobs simultaneously (either through the CLI or the dashboard), so adding that capability would be a nice improvement.
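The review-before-cancel flow described in this comment could be sketched as a helper that reads job IDs from standard input and defaults to a dry run. `cancel_jobs` and the `DRY_RUN` variable are hypothetical names introduced for illustration; the only CLI call used is the `coder provisioner jobs cancel` command already discussed in this thread.

```shell
# Sketch: cancel provisioner jobs one ID at a time, read from stdin.
# cancel_jobs and DRY_RUN are hypothetical names, not part of the coder CLI.
# DRY_RUN defaults to 1, so nothing is cancelled unless DRY_RUN=0 is set.
cancel_jobs() {
  local dry_run="${DRY_RUN:-1}"
  local id
  while IFS= read -r id; do
    # Skip blank lines so a hand-edited ID file is tolerated.
    if [ -z "$id" ]; then
      continue
    fi
    if [ "$dry_run" = "1" ]; then
      echo "would cancel: $id"
    else
      coder provisioner jobs cancel "$id"
    fi
  done
}
```

An operator could save the IDs produced by the list and `jq` steps of the pipeline above into a file (say `pending.txt`, a hypothetical name), delete the lines for jobs that should keep running, check the dry-run output of `cancel_jobs < pending.txt`, and only then run `DRY_RUN=0 cancel_jobs < pending.txt`.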
At this stage, most prebuild related impact will have been mitigated. There may still be a bugged template version, but it will no longer pollute provisioner queues with prebuilt workspace jobs. If the latest version of a template is also broken for reasons unrelated to prebuilds, then users are able to create workspaces using a previous template version. Some running jobs may have been initiated by the prebuild system, but these cannot be cancelled without potentially orphaning resources that have already been deployed by Terraform. Depending on your deployment and template provisioning times, it might be best to upload a new template version and wait for it to be processed organically.
If you need to expedite the processing of human-related jobs at the cost of some infrastructure housekeeping, you can run:
I think it would be good to include a warning about the infrastructure housekeeping implications here, and clarify that this command should generally be used as a last resort.
This PR adds troubleshooting steps to guide Coder operators when they suspect that prebuilds might have overwhelmed their deployments.
Relates to #19490
This PR does not yet close the issue entirely. We still need to document how to detect and clean orphaned resources that occur when running prebuild jobs are cancelled.