- Notifications
You must be signed in to change notification settings - Fork928
Description
Summary
Coder users are experiencing intermittent and prolonged failures during SSH connections, especially in VPN environments where DNS servers are partially or intermittently unresponsive. These issues are directly tied to the behavior of Go’s standard resolver, which lacks in-process caching and has limited timeout and failover behavior.
This proposes we configure a more resilient DNS resolution, in-memory caching, enhanced diagnostics viacoder netcheck
, and tighter integration with Coder's SSH ProxyCommand to mitigate these failures and improve user experience.
Problem Statement
Current Pain Points
- SSH ProxyCommand fails with errors like:
could not get canonical name
Did not find remote IP address (is SSH ProxyCommand disabled?)
coder netcheck
logs:"no address for node"
or timeouts- VPN environments inject multiple DNS servers, some of which are non-responsive
- Go’s
net.Resolver
has no DNS cache, leading to repetitive failed lookups - DNS queries (e.g.,
gethostbyname
) can hang without enforced timeouts - Lack of CLI tooling for diagnosing DNS-related failures
Root Cause Analysis
Unresponsive or Hanging DNS Servers
VPN clients often assign multiple DNS servers, but not all are reachable or fast.
Regular DNS handling of OS can lead to long timeouts.
Poor DNS Servers can cause failure rates or success rates (A/B where A is the number of healthy servers and B the total number of DNS Servers (in the resolv.conf), minimum one, and up to maximum of three IP Addresses)
This failure rate therefore is mathematically 100% for one host and it is bad, 1/2 for one bad host out of two, or 1/3, 2/3 for one or two failing hosts out of three.
Due to round-robin methods, some end-users can ultimately have varying amount of success, but typically describe in the above scenario.Go DNS Resolver Limitations
- No built-in in-process DNS cache
- Each call to
net.LookupHost()
performs a fresh network resolution - Honour the GODEBUG environment variable (netdns=go or netdns=cgo) which could provide improved behaviour, to selectively use Go or System DNS behaviour
- net: DNS lookup timeout only when using go resolver golang/go#28419
- cgo may be disabled by default on macOS builds (Coder is likely not enabling cgo as it's not enabled when cross compiling or compiling on macOS)
- net: Support the /etc/resolver DNS resolution configuration hierarchy on OS X when cgo is disabled golang/go#12524
- EDNS0 is not discussed here (tunnel issues or packet size)
- net: go DNS resolver fails to connect to local DNS server golang/go#67925
Lack of Observability & Failover
- No health scoring or metrics for DNS servers
- No fallback to alternate resolvers on failure
Error Visibility
- Error output from
coder netcheck
does not surface DNS issues clearly - No user-facing tools to diagnose DNS resolution failures in SSH flows
- Error output from
Proposed Solution
1. Enhanced DNS Resolution with Resilience
typeDNSConfigstruct {ServerTimeout time.Duration`json:"server_timeout"`// e.g. 2s per DNS serverTotalTimeout time.Duration`json:"total_timeout"`// e.g. 10s totalRetryAttemptsint`json:"retry_attempts"`// e.g. 2 retriesParallelQuerybool`json:"parallel_query"`// Query all resolvers concurrently}
Features:
- Per-server timeout with total query deadline
- Parallel or sequential failover query modes
- Retry logic for transient errors
- Health scoring of DNS servers for future prioritization
2. In-Process DNS Caching
Inspired byrs/dnscache, which wraps Go’snet.Resolver
:
typeCacheConfigstruct {TTL time.Duration`json:"ttl"`// Default: 5mRefreshInterval time.Duration`json:"refresh_interval"`// Background refreshMaxEntriesint`json:"max_entries"`// Default: 1000PersistentCachebool`json:"persistent_cache"`// OptionalBackgroundRefreshbool`json:"background_refresh"`// Optional}
Features:
- TTL-based expiration
- Background refresh of entries to maintain freshness
- Stale result fallback on DNS outage
- Manual cache clearing (
coder dns-cache --clear
) - Optional persistent cache across CLI sessions
3. Enhanced Diagnostics incoder netcheck
# New DNS-focused netcheck optionscoder netcheck --dns-detailedcoder netcheck --dns-servers-onlycoder netcheck --dns-timeout 3s
Diagnostic Output:
- Server-by-server resolution time and success/failure status
- Identification of dead/stale servers
- Parallel test to detect mixed-responsiveness issues
- DNS configuration tips and fix suggestions
4. ProxyCommand Integration
With Match.coder and coder. in the ProxyCommand, we have the option to apply potential new parameters to encourage failover or retries:
# New parameters:dns: timeout: 2s cache_ttl: 5m parallel_queries:true max_retries: 3ssh: dns_resilience:true connection_timeout: 30s
Features:
- Retry logic in the face of DNS resolution failure
- Parallel resolution at SSH connection start
- Graceful fallback to cached DNS data
- DNS-aware keep-alive and connection pooling
Alternatively we could also use 'coder config-ssh' to add HostName for sensitive hosts, to avoid DNS lookups.
Having fixed IP Addresses in the .ssh/config file could help, but they could also expire with LB's often changing IP Addresses, dynamic hosting etc.
Implementation Plan
Phase 1: Core DNS Enhancements
- Add configurable DNS timeout and retry logic
- Implement in-memory DNS caching
- Support DNS server failover and parallel queries
Phase 2: Enhanced Diagnostics
- Expand
netcheck
to report per-resolver behavior - Add CLI flags for DNS tests and timeout controls
- Implement NXDOMAIN Detection (If query for the ACCESS URL or Wildcard URL return NXDOMAIN, tell the end-user)
Phase 3: SSH Integration
- Add DNS retry and cache fallback to ProxyCommand
- Expose new DNS options in CLI and config file
- Support persistent DNS cache storage
- Check if HostName can be used for DNS avoidance (with %h and %p parameters)
Benefits
Immediate
- Reliable SSH in VPN environments with bad DNS
- Faster reconnections due to caching
- Clear error messages and troubleshooting guidance
Long-Term
- Infrastructure for DNS-over-HTTPS/TLS support
- Enhanced resilience in mixed-network setups
- Less user frustration and fewer support cases
Testing Strategy
- Simulate VPN with one working and one broken DNS server
- Measure SSH connection startup time with and without cache
- Verify fallback behavior when all DNS resolvers fail
- Benchmark success rate and latency with new parallel resolver
- Test CLI output clarity and usefulness under failure conditions
Backward Compatibility
- New features are opt-in via configuration flags
- Existing behavior remains default
- Graceful fallback to Go’s default resolver if advanced config fails
Related Issues and Logs
netcheck: [v1] measuring ICMP latency of xyz:us-west-2b (2): no address for node uswest2bnetcheck: netcheck.runProbe: named node "uswest2a" has no address
ssh: could not get canonical namessh: Did not find remote IP address (is SSH ProxyCommand disabled?)