Bug Report: Race Condition in Coder HA Setup During Vault-Managed PostgreSQL Password Rotation #19030

Closed as not planned
Assignees
cstyan
Labels
customer-reported
@bjornrobertsson

Description

Summary

A race condition occurs in Coder's high availability (HA) deployment when PostgreSQL password rotation is managed by HashiCorp Vault.

During password rotation, apps (jupyter-notebook, code-server) become inaccessible with infinite loading, requiring manual pod restart to resolve.

No similar issue has been found.

Environment

  • Coder Version: 2.20.2
  • Deployment: Kubernetes with HA (2 replicas)
  • Database: PostgreSQL 17
  • Secret Management: HashiCorp Vault with VaultDynamicSecret
  • Network: Air-gapped environment
  • Authentication: OIDC with GitLab

Issue Description

Problem

When Vault rotates the PostgreSQL database password and triggers a rollout restart of the Coder pods, a race condition occurs between the two Coder instances during the replica synchronization process. This results in:

  1. Workspace apps become inaccessible: jupyter-notebook and code-server show infinite loading
  2. DERP health check instability: Switches between healthy/unhealthy states
  3. Replica sync failures: Error messages indicating communication issues between replicas
  4. Authentication issues: Apps return 502 errors with "Back to site" HTML responses

Root Cause Analysis

The issue appears to be related to the replicasync process between Coder instances during password rotation. Key evidence:

  1. Failed sibling replica pings:

    coderd: failed to ping sibling replica, this could happen if the replica has shutdown error= do probe: Get "http://192.A.X.Y:$PORT/derp/latency-check": context deadline exceeded
  2. Coordinator heartbeat failures:

    coderd.pgcoord: coordinator failed heartbeat check coordinator_id=$UUID
  3. DERP connectivity issues:

    net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route

Reproduction Steps

  1. Deploy Coder in HA mode (2 replicas) with PostgreSQL
  2. Configure Vault to manage PostgreSQL password rotation with VaultDynamicSecret
  3. Set up rolloutRestartTargets to restart the Coder deployment on password change
  4. Trigger password rotation (manually or wait for scheduled rotation)
  5. Observe that Apps become inaccessible despite successful pod restart

Technical Details

Race Condition Mechanism

During rolling updates with database password changes, the following condition occurs:

  1. Vault rotates PostgreSQL password
  2. Rolling restart begins (one pod at a time)
  3. First pod restarts with new password, second pod still has old connection context
  4. Replica synchronization fails due to inconsistent database connection states
  5. DERP network coordination becomes unstable
  6. Workspace connectivity breaks
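
The first three steps can be sketched as a toy model. This is only an illustration of the sequence as described in this report, not of Coder's actual behavior: the `Pod` class, the pod names, and the integer credential versions are all hypothetical. It shows that a one-at-a-time restart necessarily passes through a state in which one replica holds the new password and the other still holds the old one.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Pod:
    name: str                # hypothetical pod name
    credential_version: int  # password version the pod started with


def stale(pods: List[Pod], version: int) -> List[str]:
    """Names of pods still holding a credential other than `version`."""
    return [p.name for p in pods if p.credential_version != version]


def rolling_restart(pods: List[Pod], new_version: int) -> Iterator[Tuple[str, List[str]]]:
    """Yield (event, stale pod names) after rotation and after each restart."""
    yield "rotated", stale(pods, new_version)
    for pod in pods:  # one pod at a time, as in a rolling update
        pod.credential_version = new_version
        yield f"restarted {pod.name}", stale(pods, new_version)


pods = [Pod("coder-0", 1), Pod("coder-1", 1)]
for event, stale_names in rolling_restart(pods, new_version=2):
    print(event, "stale:", stale_names)
```

The middle state, where exactly one pod is stale, is the window in which the failures described in steps 4 to 6 are suspected to begin.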

Failed Workarounds

  1. Rolling Update Strategy: Adding terminationGracePeriodSeconds: 120 and proper rolling update configuration didn't resolve the issue
  2. Deployment Strategy Change: Switching to type: Recreate initially worked but caused other instability issues with continuous pod restarts

Expected vs Actual Behavior

Expected: After PostgreSQL password rotation and pod restart, Apps should remain accessible with minimal downtime.

Actual: Apps become completely inaccessible with infinite loading, requiring manual intervention (pod deletion/restart) to restore functionality.

Error Messages and Logs

Coder Pod Logs

coderd: failed to ping sibling replica, this could happen if the replica has shutdown
coderd.pgcoord: coordinator failed heartbeat check
coderd: requester is not authorized to access the object

Workspace Agent Logs

net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
net.tailnet.net.wgengine: wg: [v2] Received message with unknown type

HTTP Responses

GET /@user/workspace/apps/jupyter-notebook/api/events/subscribe
Status: 502
Response: ">Back to site</a>"
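
To check whether a pod is in the failed state after a rotation, the log signatures above can be counted with a short script. This is a minimal sketch: the substrings are copied from the excerpts in this report, while the category names and the `classify` function are made up for illustration.

```python
from collections import Counter
from typing import Iterable

# Substrings copied from the log excerpts in this report.
# The category names on the left are hypothetical labels.
SIGNATURES = {
    "replica_ping": "failed to ping sibling replica",
    "coordinator_heartbeat": "coordinator failed heartbeat check",
    "authorization": "requester is not authorized to access the object",
    "derp_unknown_peer": "does not know about peer",
    "wg_unknown_message": "Received message with unknown type",
}


def classify(lines: Iterable[str]) -> Counter:
    """Count how many log lines match each known failure signature."""
    counts: Counter = Counter()
    for line in lines:
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts


sample = [
    "coderd: failed to ping sibling replica, this could happen if the replica has shutdown",
    "coderd.pgcoord: coordinator failed heartbeat check coordinator_id=$UUID",
]
print(classify(sample))
```

Feeding it the output of `kubectl logs` for each Coder pod gives a quick view of which of the symptoms are present on which replica.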

Impact

  • High: Complete loss of workspace app functionality during password rotation
  • Business Critical: Affects all users in air-gapped production environment
  • Security Impact: Prevents automated password rotation compliance

Suggested Solutions

Immediate Workaround

Use manual pod deletion instead of rolling restart:

kubectl delete pods -l app=coder -n coder-namespace

Proposed Fixes

  1. Implement graceful replica sync during password rotation

    • Add coordination mechanism between replicas during database credential changes
    • Ensure consistent database connection state across all instances
  2. Enhance DERP relay stability during restarts

    • Improve error handling in enterprise/replicasync/replicasync.go
    • Add retry mechanisms for failed peer connections
  3. Add password rotation awareness

    • Detect database credential changes and coordinate replica restart sequence
    • Implement proper cleanup of stale connection pools
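
As an illustration of the "retry mechanisms for failed peer connections" idea, here is a minimal sketch of a retry loop with exponential backoff. It is written in Python for brevity and is not taken from Coder's Go code: `ping_with_retry`, its parameters, and the `probe` callable are hypothetical. `probe` stands in for whatever performs the peer check, for example a GET against a sibling's /derp/latency-check endpoint, and is expected to raise on failure.

```python
import time
from typing import Callable


def ping_with_retry(
    probe: Callable[[], None],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it succeeds or `attempts` is exhausted.

    Returns True on the first success and False if every attempt raised.
    The wait between attempts doubles each time, capped at max_delay.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            probe()
            return True
        except Exception:
            if attempt == attempts:
                return False
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False
```

The `sleep` parameter is injectable so the backoff schedule can be tested without real waiting. A retry like this would only mask a short restart window; it would not help if a replica stays permanently unreachable.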

Code References

Suspected components based on error messages:

  • enterprise/replicasync/replicasync.go:381 - Peer replica ping logic
  • cmd/pgcoord/main.go - PostgreSQL coordinator
  • internal/db/pgcoord/pgcoord.go - Database coordination logic

Additional Context

This issue specifically affects HA deployments with external secret management systems like Vault.

The condition appears to be timing-dependent and related to the coordination between multiple Coder instances during database authentication changes.

The issue doesn't occur with single-instance deployments or when database credentials remain static, indicating that this is an HA-specific race condition during credential rotation.
