Description
Summary
A race condition occurs in Coder's high availability (HA) deployment when PostgreSQL password rotation is managed by HashiCorp Vault.
During password rotation, apps (jupyter-notebook, code-server) become inaccessible with infinite loading, requiring manual pod restart to resolve.
No similar issue has been found.
Environment
- Coder Version: 2.20.2
- Deployment: Kubernetes with HA (2 replicas)
- Database: PostgreSQL 17
- Secret Management: HashiCorp Vault with VaultDynamicSecret
- Network: Air-gapped environment
- Authentication: OIDC with GitLab
Issue Description
Problem
When Vault rotates the PostgreSQL database password and triggers a rollout restart of coder pods, a race condition occurs between the two Coder instances during the replica synchronization process. This results in:
- Workspace apps become inaccessible: jupyter-notebook and code-server show infinite loading
- DERP health check instability: Switches between healthy/unhealthy states
- Replica sync failures: Error messages indicating communication issues between replicas
- Authentication issues: Apps return 502 errors with "Back to site" HTML responses
Root Cause Analysis
The issue appears to be related to the replicasync process between Coder instances during password rotation. Key evidence:
Failed sibling replica pings:
coderd: failed to ping sibling replica, this could happen if the replica has shutdown  error= do probe: Get "http://192.A.X.Y:$PORT/derp/latency-check": context deadline exceeded
Coordinator heartbeat failures:
coderd.pgcoord: coordinator failed heartbeat check  coordinator_id=$UUID
DERP connectivity issues:
net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
Reproduction Steps
- Deploy Coder in HA mode (2 replicas) with PostgreSQL
- Configure Vault to manage PostgreSQL password rotation with VaultDynamicSecret
- Set up rolloutRestartTargets to restart the Coder deployment on password change (see the manifest sketch after this list)
- Trigger password rotation (manually or wait for scheduled rotation)
- Observe that apps become inaccessible despite successful pod restart
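For reference, a minimal Vault Secrets Operator manifest of this shape reproduces the setup; the mount, path, role, secret, and namespace names below are illustrative assumptions rather than the exact configuration used:

```yaml
# Sketch of a VaultDynamicSecret that syncs rotated PostgreSQL credentials into a
# Kubernetes Secret and rollout-restarts the Coder deployment on every rotation.
# Mount, path, and resource names are assumptions for illustration.
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: coder-postgres-creds
  namespace: coder-namespace
spec:
  vaultAuthRef: coder-vault-auth      # assumed VaultAuth resource
  mount: database                     # assumed database secrets engine mount
  path: creds/coder                   # assumed Vault role path
  destination:
    create: true
    name: coder-db-secret             # Secret referenced by the Coder deployment
  rolloutRestartTargets:
    - kind: Deployment
      name: coder                     # triggers the rolling restart that exposes the race
```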
Technical Details
Race Condition Mechanism
During a rolling update combined with a database password change, the following sequence occurs:
- Vault rotates PostgreSQL password
- Rolling restart begins (one pod at a time)
- First pod restarts with new password, second pod still has old connection context
- Replica synchronization fails due to inconsistent database connection states
- DERP network coordination becomes unstable
- Workspace connectivity breaks
Failed Workarounds
- Rolling Update Strategy: Adding terminationGracePeriodSeconds: 120 and a proper rolling update configuration (see the sketch after this list) didn't resolve the issue
- Deployment Strategy Change: Switching to type: Recreate initially worked but caused other instability issues with continuous pod restarts
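For context, the rolling update tuning that was attempted corresponds roughly to the following manifest fragment (a sketch only; labels, image tag, and surge/unavailable values are assumptions):

```yaml
# Sketch of the attempted rolling update configuration for the Coder deployment.
# Labels, image tag, and surge settings are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coder
  namespace: coder-namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: coder
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0                  # keep one replica serving while the other restarts
  template:
    metadata:
      labels:
        app: coder
    spec:
      terminationGracePeriodSeconds: 120 # allow the old pod time to drain before shutdown
      containers:
        - name: coder
          image: ghcr.io/coder/coder:v2.20.2
```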
Expected vs Actual Behavior
Expected: After PostgreSQL password rotation and pod restart, Apps should remain accessible with minimal downtime.
Actual: Apps become completely inaccessible with infinite loading, requiring manual intervention (pod deletion/restart) to restore functionality.
Error Messages and Logs
Coder Pod Logs
coderd: failed to ping sibling replica, this could happen if the replica has shutdown
coderd.pgcoord: coordinator failed heartbeat check
coderd: requester is not authorized to access the object
Workspace Agent Logs
net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
net.tailnet.net.wgengine: wg: [v2] Received message with unknown type
HTTP Responses
GET /@user/workspace/apps/jupyter-notebook/api/events/subscribe
Status: 502
Response: ">Back to site</a>"
Impact
- High: Complete loss of workspace app functionality during password rotation
- Business Critical: Affects all users in air-gapped production environment
- Security Impact: Prevents automated password rotation compliance
Suggested Solutions
Immediate Workaround
Use manual pod deletion instead of rolling restart:
kubectl delete pods -l app=coder -n coder-namespace
Proposed Fixes
Implement graceful replica sync during password rotation
- Add coordination mechanism between replicas during database credential changes
- Ensure consistent database connection state across all instances
Enhance DERP relay stability during restarts
- Improve error handling in enterprise/replicasync/replicasync.go
- Add retry mechanisms for failed peer connections
Add password rotation awareness
- Detect database credential changes and coordinate replica restart sequence
- Implement proper cleanup of stale connection pools
Code References
Suspected components based on error messages:
- enterprise/replicasync/replicasync.go:381 - Peer replica ping logic
- cmd/pgcoord/main.go - PostgreSQL coordinator
- internal/db/pgcoord/pgcoord.go - Database coordination logic
Additional Context
This issue specifically affects HA deployments with external secret management systems like Vault.
The condition appears to be timing-dependent and related to the coordination between multiple Coder instances during database authentication changes.
The issue doesn't occur with single-instance deployments or when database credentials remain static, indicating that this is an HA-specific race condition during credential rotation scenarios.