Description
Problem
If a prebuilt workspace's template uses `ignore_changes` as recommended in the docs, its agent may not reconnect after a workspace template upgrade.
e.g.
    resource "docker_container" "workspace" {
      lifecycle {
        ignore_changes = all
      }
      count      = data.coder_workspace.me.start_count
      entrypoint = ["sh", "-c", coder_agent.main.init_script]
      env        = ["CODER_AGENT_TOKEN=${coder_agent.main.token}"]
      ...
    }
Details
A template upgrade kicks off a `start` build. `start` builds set `coder_workspace.start_count` to `1`, which is used in the `count` attribute of compute resources (see the example above). If the workspace is already started, Terraform will attempt to update in place any resources that already have `count=1`.
A `start` build causes the `coder_agent` to be recreated, which generates a new auth token. Normally, without `ignore_changes`, the `env` attribute above would be modified, since the token value changes. `env` is immutable (i.e. defined as `ForceNew`), so Terraform treats any change to this attribute as drift from the original and forces a replacement. This leads to the `docker_container` being recreated, and the agent starts afresh and connects to the control plane.
With `ignore_changes`, however, changes to these attributes are ignored in order for prebuilds to work. This means the template update for the workspace has no real effect at all, but the `coder_agent`'s token is still changed, so the agent can no longer connect to the control plane on behalf of the workspace. The running agent keeps using the previous token, even though the control plane will only accept the new one.
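The mismatch described above can be illustrated with a toy model. This is only a sketch of the reasoning, not Coder or Terraform code: `Container`, `ControlPlane`, and `start_build` are hypothetical names, and tokens are simple counters rather than real agent tokens.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical token source: each recreated agent gets the next token.
_token_ids = count(1)


def new_token() -> str:
    return f"token-{next(_token_ids)}"


@dataclass
class Container:
    """Stands in for docker_container: the token is baked into env at creation."""
    agent_token: str


@dataclass
class ControlPlane:
    """Stands in for the control plane: only the latest token is accepted."""
    accepted_token: str

    def agent_can_connect(self, container: Container) -> bool:
        return container.agent_token == self.accepted_token


def start_build(plane: ControlPlane, container: Container, ignore_changes: bool) -> Container:
    """Model a start build against a workspace that is already running."""
    # The coder_agent is recreated, so a new token is generated.
    token = new_token()
    plane.accepted_token = token
    if ignore_changes:
        # The env diff is ignored, so the existing container is left untouched.
        return container
    # env is ForceNew, so the diff forces a replacement carrying the new token.
    return Container(agent_token=token)


initial = new_token()
plane = ControlPlane(accepted_token=initial)
running = Container(agent_token=initial)

# Without ignore_changes: the container is replaced and the agent reconnects.
running = start_build(plane, running, ignore_changes=False)
connected_without_ignore = plane.agent_can_connect(running)

# With ignore_changes: the container keeps the stale token and is rejected.
running = start_build(plane, running, ignore_changes=True)
connected_with_ignore = plane.agent_can_connect(running)

print(connected_without_ignore, connected_with_ignore)
```

In this model the only difference between the two runs is whether the `env` diff is honoured, which is enough to reproduce the stale-token state.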
Workaround
Manually restarting the workspace will allow the agent to reconnect successfully.
Proposed Solution
Template updates should not be `start` builds, but rather a logical restart (i.e. successive `stop` and `start` builds) in order to guarantee the behaviour customers expect. This should apply to claimed prebuilt workspaces AND regular workspaces alike, to guarantee that the compute resource is created anew. I fear the current `start`-only mechanism is working by accident, because Terraform drift takes care of destroying and recreating the resource like a `stop` + `start` would.
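A rough way to see why a logical restart would help, even with `ignore_changes` in place, is to extend the same kind of toy model with a `stop` transition. Everything here is hypothetical (`Workspace`, `build`, `connected` are illustrative names), and it assumes only what is described above: a `stop` build removes the compute resource, a `start` build issues a new token, and `ignore_changes` affects updates to an existing resource but not its creation.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

# Hypothetical token source: each start build gets the next token.
_token_ids = count(1)


@dataclass
class Workspace:
    accepted_token: str = ""               # token the control plane accepts
    container_token: Optional[str] = None  # token in the running container, if any


def build(ws: Workspace, transition: str, ignore_changes: bool = True) -> None:
    """Model a single workspace build with the given transition."""
    if transition == "stop":
        # start_count is 0, so the compute resource is destroyed.
        ws.container_token = None
        return
    # start: the agent is recreated and a new token is generated.
    ws.accepted_token = f"token-{next(_token_ids)}"
    if ws.container_token is None or not ignore_changes:
        # The container is created (or force-replaced) with the new token.
        ws.container_token = ws.accepted_token
    # Otherwise the env diff is ignored and the stale token stays in place.


def connected(ws: Workspace) -> bool:
    return ws.container_token is not None and ws.container_token == ws.accepted_token


ws = Workspace()
build(ws, "start")
after_initial_start = connected(ws)

# Current behaviour: a template update is a single start build.
build(ws, "start")
after_start_only_update = connected(ws)

# Proposed behaviour: a template update is a stop followed by a start.
build(ws, "stop")
build(ws, "start")
after_restart_update = connected(ws)

print(after_initial_start, after_start_only_update, after_restart_update)
```

Because the `stop` build removes the container, the following `start` build creates it from scratch with the current token, so the outcome no longer depends on Terraform detecting drift.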