Description
Summary
A race condition occurs in Coder's high availability (HA) deployment when PostgreSQL password rotation is managed by HashiCorp Vault.
During password rotation, apps (jupyter-notebook, code-server) become inaccessible with infinite loading, requiring manual pod restart to resolve.
No similar issue has been found.
Environment
- Coder Version: 2.20.2
- Deployment: Kubernetes with HA (2 replicas)
- Database: PostgreSQL 17
- Secret Management: HashiCorp Vault with VaultDynamicSecret
- Network: Air-gapped environment
- Authentication: OIDC with GitLab
Issue Description
Problem
When Vault rotates the PostgreSQL database password and triggers a rollout restart of coder pods, a race condition occurs between the two Coder instances during the replica synchronization process. This results in:
- Workspace apps become inaccessible: jupyter-notebook and code-server show infinite loading
- DERP health check instability: Switches between healthy/unhealthy states
- Replica sync failures: Error messages indicating communication issues between replicas
- Authentication issues: Apps return 502 errors with "Back to site" HTML responses
Root Cause Analysis
The issue appears to be related to the replicasync process between Coder instances during password rotation. Key evidence:
Failed sibling replica pings:
coderd: failed to ping sibling replica, this could happen if the replica has shutdown  error= do probe: Get "http://192.A.X.Y:$PORT/derp/latency-check": context deadline exceeded
Coordinator heartbeat failures:
coderd.pgcoord: coordinator failed heartbeat check  coordinator_id=$UUID
DERP connectivity issues:
net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
Reproduction Steps
- Deploy Coder in HA mode (2 replicas) with PostgreSQL
- Configure Vault to manage PostgreSQL password rotation with VaultDynamicSecret
- Set up rolloutRestartTargets to restart the Coder deployment on password change (see the manifest sketch after this list)
- Trigger password rotation (manually or wait for scheduled rotation)
- Observe that apps become inaccessible despite successful pod restart
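For reference, a minimal Vault Secrets Operator manifest of this shape reproduces the setup; the mount, path, role, secret, and namespace names below are illustrative assumptions rather than the exact configuration used:

```yaml
# Sketch of a VaultDynamicSecret that syncs rotated PostgreSQL credentials into a
# Kubernetes Secret and rollout-restarts the Coder deployment on every rotation.
# Mount, path, and resource names are assumptions for illustration.
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: coder-postgres-creds
  namespace: coder-namespace
spec:
  vaultAuthRef: coder-vault-auth      # assumed VaultAuth resource
  mount: database                     # assumed database secrets engine mount
  path: creds/coder                   # assumed Vault role path
  destination:
    create: true
    name: coder-db-secret             # Secret referenced by the Coder deployment
  rolloutRestartTargets:
    - kind: Deployment
      name: coder                     # triggers the rolling restart that exposes the race
```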
Technical Details
Race Condition Mechanism
During a rolling update combined with a database password change, the following sequence occurs:
- Vault rotates PostgreSQL password
- Rolling restart begins (one pod at a time)
- First pod restarts with new password, second pod still has old connection context
- Replica synchronization fails due to inconsistent database connection states
- DERP network coordination becomes unstable
- Workspace connectivity breaks
Failed Workarounds
- Rolling Update Strategy: Adding terminationGracePeriodSeconds: 120 and a proper rolling update configuration (see the sketch after this list) didn't resolve the issue
- Deployment Strategy Change: Switching to type: Recreate initially worked but caused other instability issues with continuous pod restarts
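For context, the rolling update tuning that was attempted corresponds roughly to the following manifest fragment (a sketch only; labels, image tag, and surge/unavailable values are assumptions):

```yaml
# Sketch of the attempted rolling update configuration for the Coder deployment.
# Labels, image tag, and surge settings are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coder
  namespace: coder-namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: coder
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0                  # keep one replica serving while the other restarts
  template:
    metadata:
      labels:
        app: coder
    spec:
      terminationGracePeriodSeconds: 120 # allow the old pod time to drain before shutdown
      containers:
        - name: coder
          image: ghcr.io/coder/coder:v2.20.2
```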
Expected vs Actual Behavior
Expected: After PostgreSQL password rotation and pod restart, Apps should remain accessible with minimal downtime.
Actual: Apps become completely inaccessible with infinite loading, requiring manual intervention (pod deletion/restart) to restore functionality.
Error Messages and Logs
Coder Pod Logs
coderd: failed to ping sibling replica, this could happen if the replica has shutdown
coderd.pgcoord: coordinator failed heartbeat check
coderd: requester is not authorized to access the object
Workspace Agent Logs
net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
net.tailnet.net.wgengine: wg: [v2] Received message with unknown type
HTTP Responses
GET /@user/workspace/apps/jupyter-notebook/api/events/subscribe
Status: 502
Response: ">Back to site</a>"
Impact
- High: Complete loss of workspace app functionality during password rotation
- Business Critical: Affects all users in air-gapped production environment
- Security Impact: Prevents automated password rotation compliance
Suggested Solutions
Immediate Workaround
Use manual pod deletion instead of rolling restart:
kubectl delete pods -l app=coder -n coder-namespace
Proposed Fixes
Implement graceful replica sync during password rotation
- Add coordination mechanism between replicas during database credential changes
- Ensure consistent database connection state across all instances
Enhance DERP relay stability during restarts
- Improve error handling in enterprise/replicasync/replicasync.go
- Add retry mechanisms for failed peer connections
Add password rotation awareness
- Detect database credential changes and coordinate replica restart sequence
- Implement proper cleanup of stale connection pools
Code References
Suspected components based on error messages:
- enterprise/replicasync/replicasync.go:381 - Peer replica ping logic
- cmd/pgcoord/main.go - PostgreSQL coordinator
- internal/db/pgcoord/pgcoord.go - Database coordination logic
Additional Context
This issue specifically affects HA deployments with external secret management systems like Vault.
The condition appears to be timing-dependent and related to the coordination between multiple Coder instances during database authentication changes.
The issue doesn't occur with single-instance deployments or when database credentials remain static, indicating that this is an HA-specific race condition during credential rotation scenarios.