Workspace Prebuilds #16969
Note: We invite your participation on this feature proposal. Please keep comments substantive. We'd especially love feedback on ways in which this feature may be useful to you and/or where you feel this RFC falls short.

Problem Statement

Customers often use public clouds for workspace provisioning, but these clouds can face resource constraints (especially for GPUs), causing slow builds. Startup scripts that clone large monorepos or install many dependencies also add delays. This poor first-touch experience (sometimes up to 15 minutes) hurts customers’ internal adoption and, subsequently, our sales. We need a way to pre-provision workspaces so provisioning time is reduced to seconds.

User Stories
Requirements

Initial Functional Requirements
Initial Non-functional Requirements
Basic Flow
UX & Design

    # existing template
    resource "coder_workspace_preset" "us-nix" {
      name = "Nix US"
      parameters = {
        (data.coder_parameter.region.name)     = "us-pittsburgh"
        (data.coder_parameter.image_type.name) = "codercom/oss-dogfood-nix:latest"
      }

      # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
      prebuilds = {
        instances = 2
        cache_invalidation = {
          # See the Invalidation section for more
          invalidate_after_secs = 86400
        }
        autoscaling = {
          ... # See the Autoscaling section for examples
        }
      }
    }
    ...

Integration with Workspace Presets

Workspace presets allow operators to specify sets of parameters to simplify workspace builds for users. In order to build a workspace, all required parameters MUST be provided.

If we piggy-back on Workspace Presets, we can use them to define which “flavors” of workspaces operators want to prebuild (e.g. small/medium/large, each with its own combination of parameters). Each preset can also have its own number of prebuilt instances; some presets might be more popular than others.

This has the nice property that presets can be used without prebuilds (i.e.

Persistence

The above

Prebuilds themselves can be stored in the

It’s important to note that presets are stored against a

Matching Logic

When a user requests a new workspace and a preset is chosen, the UUID of the chosen preset is compared against any available prebuilds that use the same preset UUID. A prebuild will ONLY be considered available if its

Invalidation

New workspaces always use the latest template version. Therefore, when a new template version is promoted to the active version, all existing prebuilds must be destroyed.

The proposed usage above shows that an

We could also expose an API to invalidate all prebuilds for a preset if operators need that degree of control; e.g. when a new AMI is built.

Provisioning

A nice property of our current design is that if no prebuilds are available, a new workspace will be provisioned synchronously.
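The matching and fallback behavior described above can be sketched roughly as follows. This is a minimal illustration and not the proposed implementation: the `Prebuild` type, the `available` flag (which stands in for the availability criteria), and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class Prebuild:
    workspace_id: UUID
    preset_id: UUID
    # Assumption: one boolean stands in for the RFC's availability criteria.
    available: bool = True


def match_prebuild(pool: list[Prebuild], preset_id: UUID) -> Optional[Prebuild]:
    """Return the first available prebuild built from the given preset, if any."""
    for prebuild in pool:
        if prebuild.available and prebuild.preset_id == preset_id:
            return prebuild
    return None


def create_workspace(pool: list[Prebuild], preset_id: UUID) -> str:
    """Claim a matching prebuild, or fall back to a normal synchronous build."""
    prebuild = match_prebuild(pool, preset_id)
    if prebuild is None:
        # Graceful degradation: existing provisioning behavior.
        return "provisioned"
    # A claimed prebuild must not be matched a second time.
    prebuild.available = False
    return "claimed"
```

With a pool holding a single prebuild for a preset, the first request for that preset is served by the prebuild and the second falls back to a regular build.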
Failing to build prebuilds will not block users; the system simply falls back to the existing behavior of imperative provisioning of workspace resources (graceful degradation).

Reconciliation Loop

We will build a reconciliation loop which will reconcile all templates’ prebuilds. This needs to be triggered under the following scenarios:
The control loop will invoke a reconciliation of template states on an interval, but can also be “nudged” when the above scenarios occur to reduce waiting time. Once the number of desired vs actual prebuilds for the given template is determined, this mechanism will enqueue a number of provisioner jobs to either create new, or destroy outdated/extraneous, prebuilds to satisfy the desired count.

NOTE:

Ownership

We will create a “Prebuild Owner” user and have it own all prebuilt workspaces.
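As a rough illustration of the desired-vs-actual calculation in the Reconciliation Loop, the sketch below computes how many create and destroy jobs a single pass would enqueue for one preset. The names `ReconcileActions` and `plan_reconciliation` are hypothetical. Counting failed prebuilds toward `current` is an assumption made here so that a failed prebuild is not automatically replaced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconcileActions:
    create: int  # prebuilds to provision
    delete: int  # outdated or extraneous prebuilds to destroy


def plan_reconciliation(desired: int, current: int, outdated: int) -> ReconcileActions:
    """Plan one reconciliation pass for a single preset.

    desired:  instances requested by the preset.
    current:  prebuilds on the active template version (assumed to include
              in-progress and failed builds).
    outdated: prebuilds built from older template versions.
    """
    if min(desired, current, outdated) < 0:
        raise ValueError("counts must be non-negative")
    create = max(desired - current, 0)
    extraneous = max(current - desired, 0)
    # Outdated prebuilds are always destroyed; extraneous ones only when
    # the active version has more prebuilds than desired.
    return ReconcileActions(create=create, delete=outdated + extraneous)
```

For example, after a new template version is promoted with two prebuilds desired, two existing prebuilds on the old version and none on the new one, the plan is to create two and destroy two.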
We will build a mechanism to “claim” a prebuild. No advisory lock is needed for this action;

Build Phases

Each workspace will have 3 workspace builds (“phases”).

1st phase: provisioning of the prebuild itself. This will require us to stub out identity datasources (see Constraints).
2nd phase: workspace build following the workspace creation request. If an available prebuild is matched (see Matching Logic), the ownership (i.e.
3rd phase: prepare the workspace using the new ownership identity. We will invoke another
Failure Modes

Should the 1st phase fail, the Reconciliation Loop will leave these prebuilds in their failed state. We don’t want to provision potentially many additional resources by retrying, so an operator will either need to manually restart the prebuild (via normal workspace controls) or delete it; the latter case will cause a new prebuild to be provisioned.

The 2nd phase will occur atomically; if it fails for whatever reason, the prebuild will still be available for claiming later.

If the 3rd phase fails, the workspace build will need to be manually retried; at this point it is technically no longer a prebuild, and will not be under the purview of the Reconciliation Loop.

Conditionalized Templates & Startup Scripts

Operators may require a way to conditionalize how a template behaves when it’s provisioning a prebuild vs a regular build. Currently we use a

For example, a template admin could choose to only execute a script on the prebuild:

    data "coder_workspace" "me" {}

    resource "coder_script" "script1" {
      # prebuild_count will only be 1 during prebuild provisioning
      count        = data.coder_workspace.me.prebuild_count
      agent_id     = coder_agent.dev1.id
      display_name = "Foobar Script 1"
      script       = "echo foobar 1"
      run_on_start = true
    }

Startup scripts can also be defined in the

Agent Reinitialization

The agent will need to reinitialize once it has been assigned a new identity (and possibly once some of its attributes, such as env or startup scripts, are updated). Once build phase 3 completes, the agent will need to be notified that its manifest has been updated. The agent API will be notified via pubsub (on a per-workspace channel), and will then push an update to the agent. Once the agent receives its new manifest, it will use it to reinitialize itself.

Observability

We should expose Prometheus metrics for (with partitioning in brackets):
Autoscaling

Given that prebuilt instances will be consuming (potentially very expensive) cloud resources, operators will need a mechanism to scale to 0 outside working hours.

For the initial phase, we will expose an autoscaling field under

    data "coder_workspace_preset" "us-nix" {
      ...
      prebuilds = {
        instances = 0 # default to 0 instances
        autoscaling = {
          timezone = "UTC" # only a single timezone may be used for simplicity

          # scale to 3 instances during the work week
          schedule {
            cron      = "* 8-18 * * 1-5" # from 8AM-6PM, Mon-Fri, UTC
            instances = 3                # scale to 3 instances
          }

          # scale to 1 instance on Saturdays for urgent support queries
          schedule {
            cron      = "* 8-14 * * 6" # from 8AM-2PM, Sat, UTC
            instances = 1              # scale to 1 instance
          }
        }
      }
    }

The solution above is designed to mirror the autostart scheduling. The crontab format will already be familiar to operators, and will be intuitive to understand.

The design above allows operators to specify a default number of instances, and then to scale that number dynamically based on one or more schedules. This can either be used to start from 0 and scale up (as the example above demonstrates), or the inverse; whichever the operator prefers.

This design would also allow for validation at template import time, where we will detect scheduling conflicts (i.e. if multiple schedules overlap and produce different values).

This will require a simple ticker to evaluate when the current time matches the crontab expression of a schedule, and to trigger the appropriate reconciliation in the Reconciliation Loop.

Constraints
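To make the schedule semantics from the Autoscaling section concrete, here is a minimal sketch of how a ticker might resolve the desired instance count for a given moment. It is an illustration under stated assumptions, not the proposed implementation. The function names are hypothetical, and the tiny matcher only supports `*`, single values, ranges and comma lists, and simply ANDs all five fields. Conflicts are detected here at evaluation time, whereas the RFC proposes detecting them at template import.

```python
from datetime import datetime, timezone


def field_matches(expr: str, value: int) -> bool:
    """Match one crontab field supporting '*', 'a', 'a-b' and comma lists."""
    for part in expr.split(","):
        if part == "*":
            return True
        low, _, high = part.partition("-")
        if int(low) <= value <= int(high or low):
            return True
    return False


def cron_matches(cron: str, now: datetime) -> bool:
    """Check a five-field crontab expression against a timestamp."""
    minute, hour, dom, month, dow = cron.split()
    return (
        field_matches(minute, now.minute)
        and field_matches(hour, now.hour)
        and field_matches(dom, now.day)
        and field_matches(month, now.month)
        # crontab uses 0 for Sunday; isoweekday() uses 7.
        and field_matches(dow, now.isoweekday() % 7)
    )


def desired_instances(default: int, schedules: list[tuple[str, int]], now: datetime) -> int:
    """Return the instance count of the matching schedule, else the default."""
    matched = {count for cron, count in schedules if cron_matches(cron, now)}
    if len(matched) > 1:
        raise ValueError("overlapping schedules produce different instance counts")
    return matched.pop() if matched else default
```

With the two schedules from the example, a Wednesday at 09:00 UTC resolves to 3 instances, a Saturday at 10:00 UTC to 1, and a Sunday to the default of 0. Note that under standard crontab semantics the hour range `8-18` matches every minute through 18:59, slightly past the "6PM" in the comment.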
In order to allow prebuilding of workspaces, we have to side-step constraint 2. Consider the following snippet from this template:

    resource "google_compute_instance" "dev" {
      ...
      count = data.coder_workspace.me.start_count
      name  = "coder-${lower(data.coder_workspace_owner.me.name)}-${lower(data.coder_workspace.me.name)}-root"
      ...

If we were to create a prebuilt workspace, what would we provide to the

To counteract this, we will:
Workarounds for existing templates:

Using

    resource "google_compute_instance" "dev" {
      count = data.coder_workspace.me.start_count
      name  = "coder-${lower(data.coder_workspace_owner.me.name)}-${lower(data.coder_workspace.me.name)}-root"
      ...
    }

The above would result in a replacement, but simply adding:

    resource "google_compute_instance" "dev" {
      lifecycle {
        ignore_changes = [name]
      }
      count = data.coder_workspace.me.start_count
      name  = "coder-${lower(data.coder_workspace_owner.me.name)}-${lower(data.coder_workspace.me.name)}-root"
      ...
    }

This will instruct Terraform to disregard changes to this attribute.

Onboarding

We added Workspace Build Timings, which provided insight into speed issues but did not offer any solutions to this particular problem. We could use the timings graph to prompt users to try prebuilds.

Infrastructure Cost Concerns

Prebuilds will add to infrastructure spend, and we have to make that trade-off known to customers. Initially we can just highlight this in the documentation, but later we might want to provide a calculator to determine if prebuilds are worth the cost.