Feature Request: Improve Workspace Recovery when backup/restore covers only a limited subset #17260
Understanding the Coder Agent Authentication Issue with Velero Backup/Restore

When a Workspace and its resources are restored, e.g. with Velero Backup, the Workspace is running but is not associated correctly and has lost the connection to coderd. Velero supports different methods of backup, e.g. Full:

```shell
velero backup create coder-full-backup --include-namespaces coder
velero restore create --from-backup coder-full-backup
```

Based on the guidance, Coder only suggests backup of the PostgreSQL database: https://coder.com/docs/admin/infrastructure/validated-architectures#disaster-recovery

For instance-level recovery, however, a smaller subset would include only Workspaces. You may want to restore only ONE Workspace (reasons vary, but a partial backup is logical when addressing a limited problem):

```shell
velero backup create $BACKUP_NAME --selector com.coder.workspace.name="$WORKSPACE_NAME" --wait
velero restore create --from-backup $BACKUP_NAME --wait
```

Status seen in Workspace pod
The issue occurs when restoring a Coder workspace with Velero, but it is likely to happen with any partial restore. Expecting to be able to restart the Workspace is reasonable, yet in the restored state the 'Retry' button is not enabled.

Root Cause Analysis (of the error and Coder GitHub code)
The key files involved in agent authentication:
Claude's Suggestions:

I recommend modifying the

```go
// Add to coderd/workspaceagents.go

// authenticateAgent handles workspace agent authentication, with support for restored agents
func (api *API) authenticateAgent(r *http.Request) (*database.WorkspaceAgent, error) {
	token := r.Header.Get(AgentAuthTokenHeader)
	if token == "" {
		return nil, xerrors.New("no agent auth token provided")
	}

	agent, err := api.Database.GetWorkspaceAgentByAuthToken(r.Context(), token)
	if err != nil {
		if !xerrors.Is(err, sql.ErrNoRows) {
			return nil, xerrors.Errorf("get agent by auth token: %w", err)
		}

		// Check if this is a restored agent trying to reconnect
		restoredAgent, restoredErr := api.tryAuthenticateRestoredAgent(r.Context(), token)
		if restoredErr == nil && restoredAgent != nil {
			// Found a restored agent, regenerate auth token and return
			return restoredAgent, nil
		}

		return nil, xerrors.Errorf("agent auth token invalid: %w", err)
	}

	return agent, nil
}

// tryAuthenticateRestoredAgent attempts to authenticate an agent that may have been restored from backup
func (api *API) tryAuthenticateRestoredAgent(ctx context.Context, token string) (*database.WorkspaceAgent, error) {
	// Try to find an agent that matches the first part of the token (pre-secret portion)
	// This is a heuristic approach - in a real implementation you'd need a more secure method
	tokenParts := strings.Split(token, ".")
	if len(tokenParts) < 2 {
		return nil, xerrors.New("invalid token format")
	}

	agentID := tokenParts[0]
	agents, err := api.Database.GetWorkspaceAgentsCreatedAfter(ctx, time.Now().Add(-7*24*time.Hour))
	if err != nil {
		return nil, err
	}

	// Find a potential restored agent
	for _, agent := range agents {
		if strings.HasPrefix(agent.ID.String(), agentID) {
			// Found a potential match, regenerate auth for this agent
			newToken := uuid.New().String()
			err = api.Database.UpdateWorkspaceAgentAuthToken(ctx, database.UpdateWorkspaceAgentAuthTokenParams{
				ID:        agent.ID,
				AuthToken: newToken,
			})
			if err != nil {
				return nil, err
			}

			// Log the recovery for audit purposes
			api.Logger.Info(ctx, "restored workspace agent authentication",
				slog.F("agent_id", agent.ID),
				slog.F("workspace_id", agent.WorkspaceID),
			)

			// agents is a slice of values, so return a pointer to the match
			return &agent, nil
		}
	}

	return nil, xerrors.New("no matching restored agent found")
}

// Also add needed database methods to coderd/database/database.go:

// GetWorkspaceAgentsCreatedAfter gets all workspace agents created after the given time
func (q *Q) GetWorkspaceAgentsCreatedAfter(ctx context.Context, after time.Time) ([]database.WorkspaceAgent, error) {
	// Implementation details will depend on your database schema
	// ...
}

// UpdateWorkspaceAgentAuthToken updates the auth token for a workspace agent
func (q *Q) UpdateWorkspaceAgentAuthToken(ctx context.Context, params database.UpdateWorkspaceAgentAuthTokenParams) error {
	// Implementation details will depend on your database schema
	// ...
}
```

Alternative Approach

Another potentially more secure approach would be to add a recovery endpoint specifically for restored agents:

```go
// In coderd/workspaceagents.go
func (api *API) registerAgentRecoveryHandlers(r *chi.Mux) {
	r.Post("/api/v2/workspaceagents/recovery", api.handleAgentRecovery)
}

func (api *API) handleAgentRecovery(rw http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	// Extract agent identity information from request
	var req struct {
		AgentID       string `json:"agent_id"`
		WorkspaceID   string `json:"workspace_id"`
		RecoveryToken string `json:"recovery_token"` // A token derived from agent's instance data
	}

	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		httpapi.Write(ctx, rw, http.StatusBadRequest, codersdk.Response{
			Message: "Invalid request",
			Detail:  err.Error(),
		})
		return
	}

	// Validate this is a legitimate agent through instance verification
	// ...

	// Regenerate token and update database
	// ...

	// Return new auth token to the agent
}
```

Implementation Notes
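One point the recovery handler leaves open is how `RecoveryToken` would be validated. As a minimal sketch only, the block below assumes coderd holds a secret key and hands each workspace a token derived from its agent and workspace IDs at provision time, so a restored pod can present it later. The names `recoveryMAC`, `deriveRecoveryToken`, and `verifyRecoveryToken` are hypothetical and not part of Coder; only the standard-library `crypto/hmac`, `crypto/sha256`, and `encoding/hex` calls are real.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// recoveryMAC computes HMAC-SHA256 over the agent and workspace IDs.
// Hypothetical helper: the choice of inputs is an assumption, not Coder's design.
func recoveryMAC(secret []byte, agentID, workspaceID string) []byte {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(agentID))
	mac.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") produce different MACs
	mac.Write([]byte(workspaceID))
	return mac.Sum(nil)
}

// deriveRecoveryToken returns the hex-encoded token that would be issued to the workspace.
func deriveRecoveryToken(secret []byte, agentID, workspaceID string) string {
	return hex.EncodeToString(recoveryMAC(secret, agentID, workspaceID))
}

// verifyRecoveryToken recomputes the MAC and compares it in constant time.
func verifyRecoveryToken(secret []byte, agentID, workspaceID, presented string) bool {
	got, err := hex.DecodeString(presented)
	if err != nil {
		return false
	}
	return hmac.Equal(got, recoveryMAC(secret, agentID, workspaceID))
}

func main() {
	secret := []byte("example-server-secret")
	token := deriveRecoveryToken(secret, "agent-1", "workspace-1")

	// A token is only accepted for the agent/workspace pair it was derived for.
	fmt.Println(verifyRecoveryToken(secret, "agent-1", "workspace-1", token))
	fmt.Println(verifyRecoveryToken(secret, "agent-1", "workspace-2", token))
}
```

Binding the token to both IDs means a token captured from one workspace cannot be replayed for another, and `hmac.Equal` avoids leaking how many leading bytes matched.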
The proposed solution allows legitimate restored agents to reconnect while maintaining security. The agent would need to attempt the normal authentication flow first and, if that fails, try the recovery mechanism.
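The agent-side ordering described here (normal authentication first, recovery only as a fallback) can be sketched as follows. This is an illustration under stated assumptions, not Coder's agent code: the `Authenticator` interface, `ErrTokenRejected`, `connect`, and the in-memory `fakeServer` are all hypothetical stand-ins for the real coderd calls.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTokenRejected is a hypothetical sentinel meaning "coderd does not know this token".
var ErrTokenRejected = errors.New("agent auth token rejected")

// Authenticator abstracts the two calls the agent would make (both hypothetical).
type Authenticator interface {
	Authenticate(token string) error
	Recover(agentID, workspaceID string) (newToken string, err error)
}

// connect tries the normal flow first and falls back to recovery only when
// the token itself was rejected; other errors are returned unchanged in kind.
func connect(a Authenticator, agentID, workspaceID, token string) (string, error) {
	err := a.Authenticate(token)
	if err == nil {
		return token, nil
	}
	if !errors.Is(err, ErrTokenRejected) {
		// e.g. a network failure: not evidence of a restore, so do not recover.
		return "", fmt.Errorf("authenticate: %w", err)
	}

	newToken, err := a.Recover(agentID, workspaceID)
	if err != nil {
		return "", fmt.Errorf("recover: %w", err)
	}
	if err := a.Authenticate(newToken); err != nil {
		return "", fmt.Errorf("authenticate with recovered token: %w", err)
	}
	return newToken, nil
}

// fakeServer is an in-memory stand-in for coderd, used only for this example.
type fakeServer struct {
	valid      map[string]bool
	recoveries int
}

func (s *fakeServer) Authenticate(token string) error {
	if s.valid[token] {
		return nil
	}
	return ErrTokenRejected
}

func (s *fakeServer) Recover(agentID, workspaceID string) (string, error) {
	s.recoveries++
	token := "recovered-" + agentID
	s.valid[token] = true
	return token, nil
}

func main() {
	srv := &fakeServer{valid: map[string]bool{"current-token": true}}

	// A restored pod presents a token the database no longer knows.
	token, err := connect(srv, "agent-1", "workspace-1", "stale-token")
	fmt.Println(token, err)
}
```

Restricting the fallback to a rejected token keeps transient failures from triggering token regeneration, which would otherwise invalidate a healthy agent's credentials.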
Replies: 1 comment 1 reply
I don't think there is value in restoring running workspaces (agents). But yes, backing up the persistent storage of all workspaces and the Coder DB itself with Velero looks promising. So I am interested in a use case where we don't have to make any changes in Coder and Velero can work independently. We can probably do an integration guide on how to configure Velero to perform backups of the Coder DB and Coder workspaces.
We also need workspace backup, so that when a workspace (soon shared) is deleted we can recover it, even when it was deleted by the user.