Description
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
When the Coder PostgreSQL database becomes unavailable, the Coder instance/pod generates an excessive amount of logs - approximately 500,000 lines per minute. The logs repeatedly show connection attempts and failures without any throttling or backoff mechanism.
The log pattern shown under Relevant Log Output repeats continuously, at a rate of at least 40 lines every 1/100th of a second (at least 4,000 lines per second).
Connection attempts continue at full speed with no reduction in frequency, generating approximately 500,000 log lines per minute, which could:
- Fill up disk space rapidly, potentially exhausting it
- Grow log files to unmanageable sizes
- Make log analysis difficult
- Make other issues hard to diagnose because of the log flooding
- Potentially impact system performance
Relevant Log Output
2025-03-21 17:16:24.504 [info] coderd.pgcoord: closed incoming coordinate call while unhealthy coordinator_id=62a74eaa-8477-451e-a81b-1bf4247b23bc peer_id=e0fe7b54-8f5b-4568-be73-f43586e8b3f3
2025-03-21 17:16:24.504 [info] coderd.servertailnet: obtained tailnet API v2+ client
2025-03-21 17:16:24.504 [info] coderd.servertailnet: tailnet API v2+ connection lost
Expected Behavior
When the database is offline, the Tailnet process should be able to stay silent, or at least log far less, so that the deployment does not run out of disk space or memory (depending on where the logs are stored).
The system should implement an exponential backoff or throttling mechanism to reduce log verbosity when the database is unavailable. Connection attempts should decrease in frequency over time.
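To make the requested behavior concrete, here is a minimal, language-agnostic sketch in Python of a capped exponential backoff. It is not Coder's code: the names `backoff_delays` and `retry_with_backoff`, the parameters, and the default values are all assumptions chosen for illustration.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=60.0, jitter=0.0, rng=random.random):
    """Yield the wait (in seconds) before each successive retry.

    Hypothetical helper: the delay grows geometrically from `base` and never
    exceeds `cap`. With jitter > 0, each delay is scaled by a random factor
    in [1 - jitter, 1 + jitter] before clamping, so that many clients do not
    retry in lockstep.
    """
    delay = base
    while True:
        scale = 1.0 + jitter * (2.0 * rng() - 1.0)
        yield min(cap, delay * scale)
        delay = min(cap, delay * factor)


def retry_with_backoff(connect, sleep=time.sleep, max_attempts=None, **backoff_args):
    """Call connect() until it succeeds, waiting longer after each failure.

    `connect` is any callable that raises ConnectionError while the
    dependency is down. `max_attempts=None` means retry forever.
    """
    for attempt, delay in enumerate(backoff_delays(**backoff_args), start=1):
        try:
            return connect()
        except ConnectionError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
```

With `base=0.5`, `factor=2.0`, and `cap=60.0`, the waits would be 0.5 s, 1 s, 2 s, and so on up to one attempt per minute, rather than thousands of attempts per second.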
Possible Solution
Implement a tapering retry mechanism in the database connection logic:
- Add exponential backoff for connection retries
- Reduce logging verbosity after initial connection failures
- Log only state changes (e.g., "database still unavailable after X attempts")
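The "log only state changes" idea could look roughly like the sketch below, written in Python for brevity. `OutageLogger`, its methods, and the `remind_every` parameter are hypothetical names used for illustration; nothing here reflects Coder's actual logging code.

```python
class OutageLogger:
    """Log health transitions instead of every failed attempt.

    Hypothetical helper: emits one line when an outage starts, a reminder
    every `remind_every` consecutive failures, and one line on recovery.
    """

    def __init__(self, emit=print, remind_every=100):
        self.emit = emit
        self.remind_every = remind_every
        self.failures = 0  # consecutive failures in the current outage

    def failure(self, err):
        self.failures += 1
        if self.failures == 1:
            # First failure: the state changed from healthy to unhealthy.
            self.emit(f"database unavailable: {err}")
        elif self.failures % self.remind_every == 0:
            # Periodic reminder so a long outage stays visible in the log.
            self.emit(f"database still unavailable after {self.failures} attempts")

    def success(self):
        if self.failures:
            # State changed back to healthy; report how long the outage ran.
            self.emit(f"database recovered after {self.failures} failed attempts")
        self.failures = 0
```

Under these assumptions, 250 consecutive failures followed by a recovery produce four log lines instead of 250 or more.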
Related Issues
Similar to issue #11799
Steps to Reproduce
- Have a running Coder deployment
- Tail the log with kubectl logs -f
- Stop your PostgreSQL service
Environment
- Host OS: Irrelevant
- Coder version: 2.19.1+
Additional Context
No response