coder/coder (Public)

Workspace Prebuilds #16969

dannykopping started this conversation in RFCs

dannykopping

Mar 17, 2025

Collaborator

Note

We invite your participation on this feature proposal. Please keep comments substantive. We'd especially love feedback on ways in which this feature may be useful to you and/or where you feel this RFC falls short.

Problem Statement

Customers often use public clouds for workspace provisioning, but these clouds can face resource constraints (especially for GPUs), causing slow builds. Startup scripts that clone large monorepos or install many dependencies also add delays. This poor first-touch experience - sometimes up to 15 minutes - hurts customers’ internal adoption and, subsequently, our sales.

We need a way to pre-provision workspaces so provisioning time is reduced to seconds.

User Stories

As a developer, I want to create workspaces near instantly, in order to start delivering value as soon as possible
As a developer, I want workspace creation to be fast, in order to have short-lived / ephemeral workspaces for quick experiments or code-reviews
As an operator, I want to provision workspaces preemptively so that developers can create workspaces within 60 seconds, to keep them in flow
As an operator, I am willing to trade off increased infrastructure spend to improve developers’ productivity, but I need to control this spend
As an operator, I want to view a template’s prebuilt workspaces for troubleshooting purposes
As an operator, I want my users to have a fast first experience with workspace provisioning, in order to reduce any inertia in their onboarding process
As an operator, I want metrics or other insights, in order to assess how prebuilds are being used

Requirements

Initial Functional Requirements

MUST accelerate workspace creation for net-new builds
- prebuilds WILL NOT work for rebuilding existing workspaces, because it requires creating workspaces from scratch
MUST provision a workspace synchronously if a prebuild is not available (graceful fallback)
MUST allow operators to configure how many prebuilt instances to create, to control costs
MUST NOT restrict any existing functionality of workspaces
MUST allow for configuring combinations of coder_parameter values to produce different prebuilt workspace “flavors” (see Workspace Presets #16304)
MUST warn template admins about incompatibilities with prebuilds at template import time
- see Constraints
MUST keep prebuilds in a running state when not in use, since the compute resources of workspaces are usually the slowest to provision
MUST support scaling prebuilds to 0 outside of working hours to control costs
MUST expose observability to enable introspection of prebuilds provisioning and usage
MUST require a Premium license

Initial Non-functional Requirements

MUST reduce workspace provisioning time to 60 seconds or less
- NOTE: provisioning time refers to the time taken to produce a workspace, but not for it to be fully operational (i.e. agent startup scripts have been run)
MUST NOT be slower than current workspace provisioning, if there is no prebuild available
MUST NOT require template admins to refactor their templates significantly
MUST NOT change workspace behavior or template semantics

Basic Flow

1. The template is configured by a template admin to have prebuilds enabled (see UX & Design)
2. n prebuilt workspaces are created (“first pass”) using terraform apply
   1. all prebuilds are owned by a special user
   2. the agent on each prebuilt workspace starts and connects to coderd
      1. startup scripts execute conditionally
      2. SSH and other non-essential services are disabled
3. A user requests a new workspace
4. A prebuild exists to satisfy the request (see Matching Logic)
5. The prebuild is marked locked
6. The prebuild’s ownership is transferred to the requesting user (“second pass”)
   1. the prebuild is now indistinguishable from a regular workspace
7. terraform apply is invoked again with new ownership metadata & parameters chosen in point 3 (“third pass”)
8. The agent is instructed to reconfigure itself with new metadata, including new ownership (see Agent Reinitialization)
9. The workspace is now ready for use!

UX & Design

    # existing template
    resource "coder_workspace_preset" "us-nix" {
      name = "Nix US"
      parameters = {
        (data.coder_parameter.region.name)     = "us-pittsburgh"
        (data.coder_parameter.image_type.name) = "codercom/oss-dogfood-nix:latest"
      }
      # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
      prebuilds = {
        instances = 2
        cache_invalidation = {
          # See the Invalidation section for more
          invalidate_after_secs = 86400
        }
        autoscaling = {
          ... # See the Autoscaling section for examples
        }
      }
    }
    ...

Integration with Workspace Presets

Workspace presets allow operators to specify sets of parameters to simplify workspace builds for users. In order to build a workspace, all required parameters MUST be provided.

If we piggy-back on Workspace Presets, we can use them to define which “flavors” of workspaces operators want to prebuild (i.e. small/medium/large - each with their own combination of parameters). Each preset can also have its own number of prebuilt instances; some presets might be more popular than others.

This has the nice property that presets can be used without prebuilds (i.e. instances=0), and enabling prebuilds is as simple as defining the number of instances.

Persistence

The above coder_workspace_preset resources will be captured during the template import process and inserted into the database. Each template version will have its own associated preset entries.

Prebuilds themselves can be stored in the workspaces table; they are workspaces after all. Prebuilds will be identified only by their ownership. If they are owned by the prebuilds user, then they are by definition a prebuild.

It’s important to note that presets are stored against a template_version.

Matching Logic

When a user requests a new workspace and a preset is chosen, the UUID of the chosen preset is used to compare against any available prebuilds which also use that preset UUID.

A prebuild will ONLY be considered available if its lifecycle_state is ready and its preset UUID matches.
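The matching rule above can be sketched as a simple filter. This is a minimal illustration, not the proposed implementation: the `Prebuild` struct and `findAvailable` function are hypothetical names, and the real lookup would presumably be a database query rather than an in-memory scan.

```go
package main

// Prebuild is a hypothetical, simplified view of a prebuilt workspace.
type Prebuild struct {
	ID             string
	PresetID       string // UUID of the preset this prebuild was built from
	LifecycleState string // agent lifecycle state, e.g. "starting" or "ready"
}

// findAvailable returns the first prebuild whose preset UUID matches and
// whose lifecycle state is "ready". The boolean is false when no prebuild
// qualifies, in which case the caller falls back to a regular build.
func findAvailable(prebuilds []Prebuild, presetID string) (Prebuild, bool) {
	for _, p := range prebuilds {
		if p.PresetID == presetID && p.LifecycleState == "ready" {
			return p, true
		}
	}
	return Prebuild{}, false
}
```

A prebuild that is still starting is skipped even if its preset matches, so a request never receives a workspace whose agent is not yet ready.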

Invalidation

New workspaces always use the latest template version. Therefore when a new template version is promoted to the active version, all existing prebuilds must be destroyed.

The proposed usage above shows that an invalidate_after_secs attribute can be set. The use-case for this is for workspaces which clone a monorepo: incremental updates (i.e. delta between prebuilt state and current state) will work up to a certain point, but after a certain period of time it might be preferable to just build a new prebuild.

We could also expose an API to invalidate all prebuilds for a preset if operators need that degree of control; e.g. when a new AMI is built.
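The two automatic invalidation rules (a new active template version, and an elapsed invalidate_after_secs) could be combined into a single check along these lines. This is a sketch under assumptions: `prebuildMeta` and `isInvalid` are hypothetical names, timestamps are plain Unix seconds, and a value of 0 is assumed to mean "no time-based invalidation".

```go
package main

// prebuildMeta is a hypothetical, simplified record of one prebuild.
type prebuildMeta struct {
	TemplateVersionID string // template version the prebuild was built from
	CreatedAtUnix     int64  // creation time, in Unix seconds
}

// isInvalid reports whether a prebuild should be destroyed and replaced.
// It is invalid when it was built from a non-active template version, or
// when it is at least invalidateAfterSecs old (0 disables the age check).
func isInvalid(p prebuildMeta, activeVersionID string, invalidateAfterSecs, nowUnix int64) bool {
	if p.TemplateVersionID != activeVersionID {
		return true
	}
	return invalidateAfterSecs > 0 && nowUnix-p.CreatedAtUnix >= invalidateAfterSecs
}
```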

Provisioning

A nice property of our current design is that if no prebuilds are available, a new workspace will be provisioned synchronously. Failing to build prebuilds will not block users; workspace creation will just fall back to the existing behavior of imperative provisioning of workspace resources (graceful degradation).

Reconciliation Loop

We will build a reconciliation loop which will reconcile all templates’ prebuilds.

This needs to be triggered under the following scenarios:

A new active template version is chosen, leading to existing prebuilds being invalidated
A workspace build completes (which may have used a prebuild)
A new Autoscaling schedule becomes active (i.e. now is within crontab expression)
An Invalidation event occurs
coderd startup
Periodically (i.e. every 15s)

The control loop will invoke a reconciliation of template states on an interval, but can also be “nudged” when the above scenarios occur to reduce waiting time.

Once the number of desired vs actual prebuilds for the given template is determined, this mechanism will enqueue a number of provisioner jobs to either create new, or destroy outdated/extraneous, prebuilds to satisfy the desired count.

NOTE:
We need to use an advisory lock (per template) when performing this reconciliation, because multiple coderd replicas could otherwise attempt to perform it simultaneously.
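The desired-versus-actual calculation at the heart of the loop can be sketched as follows. This is a minimal illustration with hypothetical names (`PresetState`, `Actions`, `reconcile`); it assumes in-progress prebuilds are counted as running so that repeated passes do not over-provision, and it leaves failed prebuilds out entirely.

```go
package main

// PresetState is a hypothetical snapshot of one preset's prebuilds.
type PresetState struct {
	Desired  int // instances configured on the preset
	Running  int // prebuilds on the active template version (running or in progress)
	Outdated int // prebuilds that were invalidated, e.g. built from an old version
}

// Actions is the number of provisioner jobs to enqueue.
type Actions struct {
	Create int
	Delete int
}

// reconcile computes how many prebuilds to create or destroy so that the
// number of valid prebuilds converges on the desired count. Outdated
// prebuilds are always destroyed.
func reconcile(s PresetState) Actions {
	a := Actions{Delete: s.Outdated}
	switch {
	case s.Running < s.Desired:
		a.Create = s.Desired - s.Running
	case s.Running > s.Desired:
		a.Delete += s.Running - s.Desired
	}
	return a
}
```

Because the function only depends on the current snapshot, running it periodically and on every "nudge" is safe: a pass that finds nothing to do returns zero actions.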

Ownership

We will create a “Prebuild Owner” user and have it own all prebuilt workspaces.

This user MUST be excluded from user listing APIs
This user’s workspaces (i.e. prebuilds) MUST be excluded from workspace listing APIs
- We will need specific APIs for prebuilds
This user MUST NOT count towards a license seat

We will build a mechanism to “claim” a prebuild.
Prebuilds are workspaces, except they are owned by the prebuilds user; in fact, this is all that defines a prebuild. Once a prebuild is matched, it will be atomically assigned to the requestor.

No advisory lock is needed for this action; SELECT ... FOR UPDATE SKIP LOCKED will protect a prebuild from being eligible for assignment to multiple users simultaneously.
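To make the claim invariant concrete, here is a rough sketch. The `claimQuery` constant shows one possible shape of the statement in PostgreSQL; apart from the workspaces table and its owner_id column, everything in it is an assumption, and a real query would also filter on preset and readiness. The `Pool` type is only an in-memory analogue (a mutex instead of row locks) that demonstrates the property the query must guarantee: each prebuild is handed to at most one user.

```go
package main

import "sync"

// claimQuery is a hypothetical shape for the claim statement. $1 is the
// requesting user, $2 the prebuilds user. Rows locked by a concurrent
// claim are skipped rather than waited on.
const claimQuery = `
UPDATE workspaces SET owner_id = $1
WHERE id = (
    SELECT id FROM workspaces
    WHERE owner_id = $2
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id`

const prebuildsUser = "prebuilds"

// Pool is an in-memory stand-in for the workspaces table.
type Pool struct {
	mu     sync.Mutex
	owners map[string]string // prebuild ID -> owner ID
}

// NewPool creates a pool in which every given prebuild is unclaimed.
func NewPool(ids ...string) *Pool {
	p := &Pool{owners: make(map[string]string)}
	for _, id := range ids {
		p.owners[id] = prebuildsUser
	}
	return p
}

// Claim atomically transfers one unclaimed prebuild to userID. It returns
// false when none is left, in which case a regular build is needed.
func (p *Pool) Claim(userID string) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for id, owner := range p.owners {
		if owner == prebuildsUser {
			p.owners[id] = userID
			return id, true
		}
	}
	return "", false
}

// claimAll lets every user attempt one claim concurrently and returns a
// map of prebuild ID -> winning user for the successful claims.
func claimAll(p *Pool, users []string) map[string]string {
	var wg sync.WaitGroup
	var mu sync.Mutex
	won := make(map[string]string)
	for _, u := range users {
		wg.Add(1)
		go func(user string) {
			defer wg.Done()
			if id, ok := p.Claim(user); ok {
				mu.Lock()
				won[id] = user
				mu.Unlock()
			}
		}(u)
	}
	wg.Wait()
	return won
}
```

With two prebuilds and five concurrent requests, exactly two requests win a prebuild and the other three fall back to a regular build.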

Build Phases

Each workspace will have 3 workspace builds (“phases”).

1st phase: provisioning of the prebuild itself. This will require us to stub out identity datasources (see Constraints).

This phase is entirely asynchronous and is not involved in the workspace creation process.
The Reconciliation Loop will reconcile the state, and at this point a new prebuild provisioning attempt will be triggered.

2nd phase: workspace build following the workspace creation request. If an available prebuild is matched (see Matching Logic), the ownership (i.e. owner_id field in the workspaces table) will be atomically changed to the initiator of the request.

This & the subsequent stage MUST occur synchronously in the workspace creation process

3rd phase: prepare the workspace using the new ownership identity. We will invoke another terraform apply but now the identity datasources will have legitimate values injected, which may cause some resources to be modified (see Failure Modes). Once the build succeeds, we will need to reinitialise the agent on the prebuilt workspace with a new (updated) manifest. See Agent Reinitialization for more details.

If this phase fails, the workspace build will need to be manually retried.
We MAY need an API and/or UI here to allow a workspace to have another start transition initiated, since we don’t really want to retry (i.e. stop → start), as this would destroy and recreate all workspace resources, obviating the point of prebuilds
The agent MUST be instructed to reinitialize whenever a start is initiated on an already running workspace.

Failure Modes

Should the 1st phase fail, the Reconciliation Loop will leave these prebuilds in their failed state. We don’t want to provision potentially many additional resources by retrying, so an operator will either need to manually restart the prebuild (via normal workspace controls) or delete it; the latter case will cause a new prebuild to be provisioned.

The 2nd phase will occur atomically; if it fails for whatever reason, the prebuild will still be available for claiming later.

If the 3rd phase fails, the workspace build will need to be manually retried; at this point it is technically no longer a prebuild, and will not be under the purview of the Reconciliation Loop.

Conditionalized Templates & Startup Scripts

Operators may require a way to conditionalize how a template behaves when it’s provisioning a prebuild vs a regular build.

Currently we use a start_count value on the coder_workspace datasource to discriminate between a start and stop transition. Similarly, we will expose a prebuild_count attribute on the coder_workspace datasource (remember, a prebuild is a workspace) which will be set to 1 when building the prebuild in phase 1.

For example, a template admin could choose to only execute a script on the prebuild:

    data "coder_workspace" "me" {}

    resource "coder_script" "script1" {
      # prebuild_count will only be 1 during prebuild provisioning
      count        = data.coder_workspace.me.prebuild_count
      agent_id     = coder_agent.dev1.id
      display_name = "Foobar Script 1"
      script       = "echo foobar 1"
      run_on_start = true
    }

Startup scripts can also be defined in the coder_agent resource, and these cannot take advantage of the count technique above. To ameliorate this limitation, we will need to support a new prebuild_startup_script field. We don’t need to define a prebuild_startup_script_behavior equivalent because SSH, which this behavior interacts with, will be disabled.

Agent Reinitialization

The agent will need to reinitialize once it has been assigned a new identity (and possibly some of its attributes are updated like env or startup scripts).

Once build phase 3 completes, the agent will need to be notified that its manifest has been updated. The agent API will be notified via pubsub (on a per-workspace channel), and will then push an update to the agent.

Once the agent receives its new manifest, it will use it to reinitialize itself.

Observability

We should expose Prometheus metrics for (with partitioning in brackets):

counter of prebuilds created (preset_name, template_name) → collected
gauge of desired prebuilds (preset_name, template_name) → collected
gauge of actual prebuilds (preset_name, template_name) → collected
counter of failed prebuilds (preset_name, template_name, reason) → collected
counter of claimed prebuilds (preset_name, template_name, user_id) → collected
counter of presets used (preset_name, template_name) → collected
counter of workspace builds which DID NOT match a prebuild, but could have (preset_name, template_name, user_id)
- i.e. there was no prebuild available at the time

Autoscaling

Given that prebuilt instances will be consuming (potentially very expensive) cloud resources, operators will need a mechanism to scale prebuilds to 0 outside working hours.

For the initial phase, we will expose an autoscaling field under coder_workspace_preset:

    data "coder_workspace_preset" "us-nix" {
      ...
      prebuilds = {
        instances = 0 # default to 0 instances
        autoscaling = {
          timezone = "UTC" # only a single timezone may be used, for simplicity

          # scale to 3 instances during the work week
          schedule {
            cron      = "* 8-18 * * 1-5" # from 8AM-6:59PM, Mon-Fri, UTC
            instances = 3                # scale to 3 instances
          }

          # scale to 1 instance on Saturdays for urgent support queries
          schedule {
            cron      = "* 8-14 * * 6" # from 8AM-2:59PM, Sat, UTC
            instances = 1              # scale to 1 instance
          }
        }
      }
    }

The solution above is designed to mirror the autostart scheduling.

The crontab format will already be familiar to operators, and will be intuitive to understand. The design above allows operators to specify a default number of instances, and then to scale that number dynamically based on one or more schedules. This can either be used to start from 0 and scale up (as the example above demonstrates), or the inverse; whichever the operator prefers.

This design would also allow for validation at template import time, where we will detect scheduling conflicts (i.e. if multiple schedules overlap and produce different values).

This will require a simple ticker to evaluate when the current time matches the crontab expression of a schedule, and to trigger the appropriate reconciliation in the Reconciliation Loop.
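The schedule evaluation that such a ticker would perform can be sketched as below. This is an illustration under stated assumptions, not a full cron implementation: `Schedule`, `fieldMatches`, `cronMatches`, `desiredInstances` and the `at` helper are hypothetical names, only `*`, single numbers, ranges and comma lists are supported, all five fields must match (standard cron treats day-of-month and day-of-week specially), and the caller is expected to convert the current time to the configured timezone first.

```go
package main

import (
	"strconv"
	"strings"
	"time"
)

// Schedule is one hypothetical autoscaling entry of a preset.
type Schedule struct {
	Cron      string // five fields: minute hour day-of-month month day-of-week
	Instances int
}

// fieldMatches reports whether v satisfies a single cron field.
func fieldMatches(field string, v int) bool {
	for _, part := range strings.Split(field, ",") {
		if part == "*" {
			return true
		}
		lo, hi, isRange := strings.Cut(part, "-")
		a, err := strconv.Atoi(lo)
		if err != nil {
			continue // unsupported syntax in this sketch
		}
		b := a
		if isRange {
			if b, err = strconv.Atoi(hi); err != nil {
				continue
			}
		}
		if v >= a && v <= b {
			return true
		}
	}
	return false
}

// cronMatches reports whether t falls within the crontab expression.
func cronMatches(expr string, t time.Time) bool {
	f := strings.Fields(expr)
	if len(f) != 5 {
		return false
	}
	return fieldMatches(f[0], t.Minute()) &&
		fieldMatches(f[1], t.Hour()) &&
		fieldMatches(f[2], t.Day()) &&
		fieldMatches(f[3], int(t.Month())) &&
		fieldMatches(f[4], int(t.Weekday())) // Sunday is 0, as in cron
}

// desiredInstances returns the instance count of the first schedule that
// is active at now, or the preset's default count if none is active.
func desiredInstances(def int, schedules []Schedule, now time.Time) int {
	for _, s := range schedules {
		if cronMatches(s.Cron, now) {
			return s.Instances
		}
	}
	return def
}

// at is a small helper that builds a UTC timestamp.
func at(year, month, day, hour, minute int) time.Time {
	return time.Date(year, time.Month(month), day, hour, minute, 0, 0, time.UTC)
}
```

With the two schedules from the example above and a default of 0, a Wednesday at 10:30 UTC yields 3 instances, a Saturday at 09:00 UTC yields 1, and a Sunday yields 0. Note that an hour range of `8-18` stays active through 18:59.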

Constraints

1. An infinite set of template configurations is possible with Terraform
2. Almost all templates use the coder_workspace and coder_workspace_owner (“identity”) data-sources, both of which rely on a workspace being owned by a user
3. Templates can be customised in non-deterministic ways through coder_parameters

In order to allow prebuilding of workspaces, we have to side-step constraint 2. Consider the following snippet from this template:

    resource "google_compute_instance" "dev" {
      ...
      count = data.coder_workspace.me.start_count
      name  = "coder-${lower(data.coder_workspace_owner.me.name)}-${lower(data.coder_workspace.me.name)}-root"
      ...

If we were to create a prebuilt workspace, what would we provide to the data.coder_workspace_owner.me.name and data.coder_workspace.me.name values? Changing this name attribute forces a replacement of the resource, and therefore makes the prebuild irrelevant.

To counteract this, we will:

Inject known “stub” values into the above data-sources before a real identity is associated with this workspace
- data.coder_workspace_owner.me.name: coder_prebuild_owner_${UUID}
- data.coder_workspace.me.name: coder_prebuild_${UUID}
  - …etc
- These values have to be human-readable since these resources will retain these values in their names, visible via the cloud console
Create/reuse a linter which can detect known-bad values for name, and show a warning to the template author
name is not the only attribute which can cause a replacement; each provider and each resource has its own behavior. Consequently, we will need to add provider-specific checks for other resource attributes to further assist template authors
- we likely just need to cover the major compute resources of the major cloud & orchestration (i.e. k8s/nomad) providers
- later on we can catch all possible cases: expand the template import process to detect when a resource will be replaced during the second build phase (i.e. once a workspace has been assigned to a user)
To achieve this, we could either use tfsec’s custom checks, or query the plan file using JMESPath expressions.
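As a rough illustration of the kind of warning such a linter would produce, here is a deliberately simple text-based heuristic. It is not the tfsec or plan-file approach mentioned above: it only scans template source line by line for a name attribute that interpolates an identity data-source, and it ignores lifecycle blocks entirely. The `identityRef` pattern and `lintIdentityNames` function are hypothetical.

```go
package main

import (
	"regexp"
	"strings"
)

// identityRef matches a line that assigns a `name` attribute from
// data.coder_workspace.<label>.name or data.coder_workspace_owner.<label>.name.
var identityRef = regexp.MustCompile(
	`^\s*name\s*=.*\bdata\.coder_workspace(_owner)?\.\w+\.name\b`)

// lintIdentityNames returns the 1-based line numbers in src on which a
// name attribute depends on an identity data-source. Changing the owner
// would change these values and may force the resource to be replaced.
func lintIdentityNames(src string) []int {
	var hits []int
	for i, line := range strings.Split(src, "\n") {
		if identityRef.MatchString(line) {
			hits = append(hits, i+1)
		}
	}
	return hits
}
```

A plan-based check would be more precise, since it can see which attributes actually force replacement for a given provider; a heuristic like this can only flag candidates for the template author to review.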

Workarounds for existing templates:

Using ignore_changes:

    resource "google_compute_instance" "dev" {
      count = data.coder_workspace.me.start_count
      name  = "coder-${lower(data.coder_workspace_owner.me.name)}-${lower(data.coder_workspace.me.name)}-root"
      ...
    }

The above would result in a replacement, but simply adding a lifecycle block avoids it:

    resource "google_compute_instance" "dev" {
      lifecycle {
        ignore_changes = [name]
      }
      count = data.coder_workspace.me.start_count
      name  = "coder-${lower(data.coder_workspace_owner.me.name)}-${lower(data.coder_workspace.me.name)}-root"
      ...
    }

This will instruct terraform to disregard changes to this attribute.

Onboarding

We added Workspace Build Timings, which provided insight into speed issues but didn’t offer any solutions to this particular problem.

We could use the timings graph to prompt users to try prebuilds.

Infrastructure Cost Concerns

Prebuilds will increase infrastructure spend, and we have to make that trade-off known to customers. Initially we can just highlight this in the documentation, but later we might want to provide a calculator to determine if prebuilds are worth the cost.
