Feature Request: Improve Workspace Recovery when backup/restore covers only a limited subset #17260

bjornrobertsson started this conversation in Feature Requests

Understanding the Coder Agent Authentication Issue with Velero Backup/Restore

When a Workspace and its resources are restored, e.g. from a Velero backup, the Workspace is running but is not associated correctly and has lost its connection to coderd.

Velero supports different backup methods, e.g. a full backup:

    velero backup create coder-full-backup --include-namespaces coder
    velero restore create --from-backup coder-full-backup

Based on the guidance, Coder only suggests backing up the PostgreSQL database: https://coder.com/docs/admin/infrastructure/validated-architectures#disaster-recovery

For instance-level recovery, however, a smaller subset that includes only Workspaces is useful. You may want to restore just ONE Workspace (the reasons vary, but a partial backup is logical when addressing a limited problem):

    velero backup create $BACKUP_NAME --selector com.coder.workspace.name="$WORKSPACE_NAME" --wait
    velero restore create --from-backup $BACKUP_NAME --wait

Status seen in Workspace pod

    connecting to coderd
    run exited with error ...
    error= GET https://example.com/api/v2/workspaceagents/me/rpc?version=2.4: unexpected status code 401: Workspace agent not authorized.: Try logging in using 'coder login'.
    Error: The agent cannot authenticate until the workspace provision job has been completed. If the job is no longer running, this agent is invalid.
    connecting to coderd

The issue occurs when restoring a Coder workspace with Velero but is likely to happen with any partial restore.

Expecting to restart the Workspace is reasonable, but in the restored state the 'Retry' button is not enabled.

Root Cause Analysis (of the error and Coder GitHub code)

  1. Workspace agents authenticate using a unique token/secret
  2. The token is created and validated during the workspace provisioning job
  3. When restoring from backup, the agent has its old credentials, but the Coder control plane doesn't recognize them since the provision job is no longer active
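
The failure mode in the list above can be illustrated with a small, self-contained Go model. This is not Coder's code: `controlPlane`, `provision`, and `authenticate` are hypothetical names, and the sketch assumes that every build issues a fresh agent token and discards the previous one.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnauthorized stands in for the HTTP 401 that the restored agent sees.
var errUnauthorized = errors.New("workspace agent not authorized")

// controlPlane is a toy stand-in for coderd (an assumption for this sketch):
// it only remembers the agent token issued by the latest build of a workspace.
type controlPlane struct {
	tokenByWorkspace map[string]string // workspace name -> current agent token
	nextID           int
}

func newControlPlane() *controlPlane {
	return &controlPlane{tokenByWorkspace: make(map[string]string)}
}

// provision simulates a workspace build: it issues a fresh token and
// forgets the previous one.
func (c *controlPlane) provision(workspace string) string {
	c.nextID++
	token := fmt.Sprintf("token-%d", c.nextID)
	c.tokenByWorkspace[workspace] = token
	return token
}

// authenticate accepts a token only if it is the current one for the workspace.
func (c *controlPlane) authenticate(workspace, token string) error {
	if current, ok := c.tokenByWorkspace[workspace]; !ok || current != token {
		return errUnauthorized
	}
	return nil
}

func main() {
	cp := newControlPlane()
	backedUp := cp.provision("dev") // token present in the pod at backup time
	cp.provision("dev")             // a later build replaces the token

	// The restored pod still presents the token from the backup.
	fmt.Println(cp.authenticate("dev", backedUp)) // prints "workspace agent not authorized"
}
```

In this model the restored pod is healthy from Kubernetes' point of view, yet the control plane has no record of its credential, which matches the 401 shown in the pod status.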

The key files involved in agent authentication:

  1. coderd/workspaceagents.go - Handles agent authentication and RPC
  2. agent/agent.go - The agent connection logic
  3. provisionersdk/proto/provisioner.go - Defines the provisioning job states

Claude's Suggestions:

I recommend modifying the coderd/workspaceagents.go file to enable a "recovery mode" for agents that have been restored from backup:

    // Add to coderd/workspaceagents.go

    // authenticateAgent handles workspace agent authentication, with support for restored agents
    func (api *API) authenticateAgent(r *http.Request) (*database.WorkspaceAgent, error) {
        token := r.Header.Get(AgentAuthTokenHeader)
        if token == "" {
            return nil, xerrors.New("no agent auth token provided")
        }

        agent, err := api.Database.GetWorkspaceAgentByAuthToken(r.Context(), token)
        if err != nil {
            if !xerrors.Is(err, sql.ErrNoRows) {
                return nil, xerrors.Errorf("get agent by auth token: %w", err)
            }

            // Check if this is a restored agent trying to reconnect
            restoredAgent, restoredErr := api.tryAuthenticateRestoredAgent(r.Context(), token)
            if restoredErr == nil && restoredAgent != nil {
                // Found a restored agent, regenerate auth token and return
                return restoredAgent, nil
            }

            return nil, xerrors.Errorf("agent auth token invalid: %w", err)
        }

        return agent, nil
    }

    // tryAuthenticateRestoredAgent attempts to authenticate an agent that may have been restored from backup
    func (api *API) tryAuthenticateRestoredAgent(ctx context.Context, token string) (*database.WorkspaceAgent, error) {
        // Try to find an agent that matches the first part of the token (pre-secret portion)
        // This is a heuristic approach - in a real implementation you'd need a more secure method
        tokenParts := strings.Split(token, ".")
        if len(tokenParts) < 2 {
            return nil, xerrors.New("invalid token format")
        }

        agentID := tokenParts[0]
        agents, err := api.Database.GetWorkspaceAgentsCreatedAfter(ctx, time.Now().Add(-7*24*time.Hour))
        if err != nil {
            return nil, err
        }

        // Find a potential restored agent
        for _, agent := range agents {
            if strings.HasPrefix(agent.ID.String(), agentID) {
                // Found a potential match, regenerate auth for this agent
                newToken := uuid.New().String()
                err = api.Database.UpdateWorkspaceAgentAuthToken(ctx, database.UpdateWorkspaceAgentAuthTokenParams{
                    ID:        agent.ID,
                    AuthToken: newToken,
                })
                if err != nil {
                    return nil, err
                }

                // Log the recovery for audit purposes
                api.Logger.Info(ctx, "restored workspace agent authentication",
                    slog.F("agent_id", agent.ID),
                    slog.F("workspace_id", agent.WorkspaceID),
                )

                return agent, nil
            }
        }

        return nil, xerrors.New("no matching restored agent found")
    }

    // Also add needed database methods to coderd/database/database.go:

    // GetWorkspaceAgentsCreatedAfter gets all workspace agents created after the given time
    func (q *Q) GetWorkspaceAgentsCreatedAfter(ctx context.Context, after time.Time) ([]database.WorkspaceAgent, error) {
        // Implementation details will depend on your database schema
        // ...
    }

    // UpdateWorkspaceAgentAuthToken updates the auth token for a workspace agent
    func (q *Q) UpdateWorkspaceAgentAuthToken(ctx context.Context, params database.UpdateWorkspaceAgentAuthTokenParams) error {
        // Implementation details will depend on your database schema
        // ...
    }

Alternative Approach

Another potentially more secure approach would be to add a recovery endpoint specifically for restored agents:

    // In coderd/workspaceagents.go

    func (api *API) registerAgentRecoveryHandlers(r *chi.Mux) {
        r.Post("/api/v2/workspaceagents/recovery", api.handleAgentRecovery)
    }

    func (api *API) handleAgentRecovery(rw http.ResponseWriter, r *http.Request) {
        // Use the request context for responses (it was undefined in the first draft)
        ctx := r.Context()

        // Extract agent identity information from request
        var req struct {
            AgentID       string `json:"agent_id"`
            WorkspaceID   string `json:"workspace_id"`
            RecoveryToken string `json:"recovery_token"` // A token derived from agent's instance data
        }

        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            httpapi.Write(ctx, rw, http.StatusBadRequest, codersdk.Response{
                Message: "Invalid request",
                Detail:  err.Error(),
            })
            return
        }

        // Validate this is a legitimate agent through instance verification
        // ...

        // Regenerate token and update database
        // ...

        // Return new auth token to the agent
    }
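
One way the "instance verification" step might work is a keyed hash over identifiers the server already knows. The sketch below is an assumption, not part of Coder: `recoveryToken` and `validRecoveryToken` are hypothetical helpers, and it presumes coderd holds a secret and hands the derived token to the agent at provisioning time, so that the token is captured in the backup.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// recoveryToken derives a hex-encoded HMAC-SHA256 over the agent and
// workspace IDs. The secret and both IDs are hypothetical inputs.
func recoveryToken(secret []byte, agentID, workspaceID string) string {
	mac := hmac.New(sha256.New, secret)
	// The separator keeps ("ab", "c") distinct from ("a", "bc").
	mac.Write([]byte(agentID + "|" + workspaceID))
	return hex.EncodeToString(mac.Sum(nil))
}

// validRecoveryToken recomputes the expected token and compares it with the
// presented one in constant time.
func validRecoveryToken(secret []byte, agentID, workspaceID, presented string) bool {
	expected := recoveryToken(secret, agentID, workspaceID)
	return hmac.Equal([]byte(expected), []byte(presented))
}

func main() {
	secret := []byte("example-only-secret")
	tok := recoveryToken(secret, "agent-1", "workspace-1")

	fmt.Println(validRecoveryToken(secret, "agent-1", "workspace-1", tok)) // true
	fmt.Println(validRecoveryToken(secret, "agent-2", "workspace-1", tok)) // false
}
```

Unlike matching on an ID prefix, a token of this kind cannot be guessed from the agent ID alone, and the server needs no extra database column to verify it. Anyone who obtains the backup also obtains the token, so the backup itself would need protecting.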

Implementation Notes

  1. The restoration approach requires careful security consideration to prevent unauthorized access
  2. You should add validation to ensure only legitimate restored agents can recover their authentication
  3. Consider adding a configuration option to disable this feature if not needed
  4. Add logging for audit purposes when agents are restored

This solution allows legitimate restored agents to reconnect while maintaining security. The agent would need to attempt the normal authentication flow first, and if that fails, try the recovery mechanism.
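
The agent-side order of operations described here (normal authentication first, recovery only on failure) could look roughly like the following. It is a sketch under stated assumptions: `connect`, `authFunc`, and `recoverFunc` are hypothetical names, and the recovery call stands in for whatever endpoint is eventually chosen.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnauthorized stands in for a 401 response from the control plane.
var errUnauthorized = errors.New("agent not authorized")

// authFunc attempts to authenticate with a token.
type authFunc func(token string) error

// recoverFunc asks the (hypothetical) recovery mechanism for a new token.
type recoverFunc func() (string, error)

// connect tries the stored token first and falls back to recovery only on an
// authorization failure. It returns the token that finally worked.
func connect(token string, auth authFunc, recoverToken recoverFunc) (string, error) {
	err := auth(token)
	if err == nil {
		return token, nil
	}
	if !errors.Is(err, errUnauthorized) {
		return "", err // other failures (e.g. network) should not trigger recovery
	}
	newToken, err := recoverToken()
	if err != nil {
		return "", fmt.Errorf("recover agent token: %w", err)
	}
	if err := auth(newToken); err != nil {
		return "", fmt.Errorf("authenticate with recovered token: %w", err)
	}
	return newToken, nil
}

func main() {
	auth := func(token string) error {
		if token != "fresh" {
			return errUnauthorized
		}
		return nil
	}
	recoverToken := func() (string, error) { return "fresh", nil }

	token, err := connect("stale", auth, recoverToken)
	fmt.Println(token, err) // fresh <nil>
}
```

Restricting the fallback to authorization errors keeps transient connectivity problems from triggering token regeneration.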


Replies: 1 comment 1 reply


I don't think there is value in restoring running workspaces (agents). But yes, backing up the persistent storage of all workspaces and the Coder DB itself with Velero looks promising.

So I am interested in a use case where we don't have to make any changes in Coder and Velero can work independently.

We can probably do an integration guide on how to configure Velero to perform backups of the Coder DB and Coder workspaces.

@suse-coder

We also need workspace backups, so that when a workspace (soon shared) is deleted we can recover it, even if it was deleted by the user.

3 participants
@bjornrobertsson @matifali @suse-coder
