Problem Statement
Currently, when deploying Coder OSS with `replicas=2` (or more) without an Enterprise license, the deployment creates multiple `coderd` instances that compete for database access, leading to:
- Inconsistent behavior: ~50% success rate for connections (Terminal works sometimes, VS Code fails completely)
- Poor user experience: No clear indication that multi-replica setup requires Enterprise license
- Silent failures: No warning or error messages about unsupported configuration
- Wasted resources: Multiple instances running when only one can be active
Current Behavior
- Multiple `coderd` instances start successfully
- All instances attempt to connect to PostgreSQL
- Traffic gets distributed across instances without proper coordination
- Results in race conditions and connection failures
- VS Code extensions fail completely, some terminal connections work intermittently
Proposed Solution
Implement a database-level locking mechanism for OSS deployments that would:
1. Primary Instance Locking
- First `coderd` instance to connect successfully becomes the "primary"
- Creates a lock record in PostgreSQL (e.g., an `instance_locks` table with instance ID, timestamp, and heartbeat)
- Continuously updates the heartbeat to maintain lock ownership
2. Standby Instance Behavior
- Additional instances detect existing lock and enter "cold standby" mode
- Standby instances:
- Monitor primary instance heartbeat
- Return HTTP 503 (Service Unavailable) for all requests with clear error message
- Automatically promote to primary if original instance fails/heartbeat expires
- Log clear status messages about standby state
3. Clear User Feedback
- Startup logs: Clear indication of primary vs standby status
- Health endpoints: Different responses for primary (`200 OK`) vs. standby (`503 Service Unavailable`)
- Admin UI warning: Banner indicating "Multiple replicas detected - Enterprise license required for load balancing"
Implementation Details
```sql
-- Example lock table structure
CREATE TABLE IF NOT EXISTS instance_locks (
    lock_name    VARCHAR(255) PRIMARY KEY,
    instance_id  UUID        NOT NULL,
    acquired_at  TIMESTAMPTZ NOT NULL,
    heartbeat_at TIMESTAMPTZ NOT NULL,
    expires_at   TIMESTAMPTZ NOT NULL
);
```
```go
// Pseudo-code for lock acquisition
func (s *Server) acquirePrimaryLock(ctx context.Context) (bool, error) {
	// Try to acquire or refresh the lock.
	// Return true if this instance is primary, false if standby.
}
```
4. Configuration Options
Add environment variables:
- `CODER_OSS_STANDBY_MODE=auto` (default: auto-detect and enter standby)
- `CODER_LOCK_TIMEOUT=30s` (how long before the lock expires)
- `CODER_HEARTBEAT_INTERVAL=10s` (how often to update the heartbeat)
Benefits
- Graceful degradation: Multi-replica deployments work predictably without license
- High availability: Automatic failover when primary instance fails
- Clear feedback: Users understand what's happening and why
- Resource efficiency: Only one active instance processing requests
- Enterprise upsell: Clear path to licensed version for true load balancing
Alternative Considerations
- License check with graceful shutdown: Detect multi-replica + no license and shut down extra instances
- Load balancer integration: Provide health check endpoints that only return healthy for primary
- Admin warnings: Dashboard notifications about suboptimal configuration
Related Issues/Context
This addresses the common Kubernetes deployment pattern where users naturally set `replicas=2`
for high availability, not realizing it requires Enterprise licensing. The current behavior creates a frustrating debugging experience.
Priority: Medium-High (affects common deployment scenarios)
Labels: `enhancement`, `oss`, `database`, `high-availability`