Bug: Enhanced DNS Resilience and Diagnostics for Coder SSH ProxyCommand #18616

Open
Labels
  • customer-reported: Bugs reported by enterprise customers. Only humans may set this.
  • customer-requested: Features requested by enterprise customers. Only humans may set this.
  • s4: Internal bugs (e.g. test flakes), extreme edge cases, and bug risks
@bjornrobertsson

Summary

Coder users are experiencing intermittent and prolonged failures during SSH connections, especially in VPN environments where DNS servers are partially or intermittently unresponsive. These issues are directly tied to the behavior of Go’s standard resolver, which lacks in-process caching and has limited timeout and failover behavior.

This issue proposes more resilient DNS resolution, in-memory caching, enhanced diagnostics via coder netcheck, and tighter integration with Coder's SSH ProxyCommand to mitigate these failures and improve the user experience.


Problem Statement

Current Pain Points

  • SSH ProxyCommand fails with errors like:
    • could not get canonical name
    • Did not find remote IP address (is SSH ProxyCommand disabled?)
  • coder netcheck logs: "no address for node" or timeouts
  • VPN environments inject multiple DNS servers, some of which are non-responsive
  • Go’s net.Resolver has no DNS cache, leading to repetitive failed lookups
  • DNS queries (e.g., gethostbyname) can hang without enforced timeouts
  • Lack of CLI tooling for diagnosing DNS-related failures

Root Cause Analysis

  1. Unresponsive or Hanging DNS Servers
    VPN clients often assign multiple DNS servers, but not all are reachable or fast.
    The operating system's regular DNS handling can then lead to long timeouts.
    With B DNS servers listed in resolv.conf (at least one and at most three IP addresses), of which A are healthy, roughly A/B of lookups succeed and (B - A)/B fail.
    The failure rate is therefore 100% when the only server is bad, 1/2 for one bad server out of two, and 1/3 or 2/3 for one or two bad servers out of three.
    Because of round-robin server selection, individual end users can see varying amounts of success, but they typically fall into one of the scenarios above.

  2. Go DNS Resolver Limitations

  3. Lack of Observability & Failover

    • No health scoring or metrics for DNS servers
    • No fallback to alternate resolvers on failure
  4. Error Visibility

    • Error output from coder netcheck does not surface DNS issues clearly
    • No user-facing tools to diagnose DNS resolution failures in SSH flows
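
The failure-rate arithmetic from item 1 can be written out as a short sketch. It assumes the simplified model described there (queries spread evenly over the configured servers, no OS-level retry), and `failureRate` is a hypothetical helper introduced only for illustration.

```go
package main

import "fmt"

// failureRate returns the expected fraction of lookups that fail when
// queries are spread evenly over `total` DNS servers of which `healthy`
// respond. Assumption: this is the simplified model from the issue text;
// it ignores any retry or failover done by the OS resolver.
func failureRate(healthy, total int) float64 {
	if total <= 0 || healthy < 0 || healthy > total {
		return 1 // treat an empty or invalid configuration as always failing
	}
	return float64(total-healthy) / float64(total)
}

func main() {
	// resolv.conf lists between one and three servers.
	for total := 1; total <= 3; total++ {
		for healthy := 0; healthy < total; healthy++ {
			fmt.Printf("%d of %d servers healthy: %.0f%% of lookups fail\n",
				healthy, total, 100*failureRate(healthy, total))
		}
	}
}
```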

Proposed Solution

1. Enhanced DNS Resolution with Resilience

    type DNSConfig struct {
        ServerTimeout time.Duration `json:"server_timeout"` // e.g. 2s per DNS server
        TotalTimeout  time.Duration `json:"total_timeout"`  // e.g. 10s total
        RetryAttempts int           `json:"retry_attempts"` // e.g. 2 retries
        ParallelQuery bool          `json:"parallel_query"` // Query all resolvers concurrently
    }

Features:

  • Per-server timeout with total query deadline
  • Parallel or sequential failover query modes
  • Retry logic for transient errors
  • Health scoring of DNS servers for future prioritization
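
As a rough illustration of the per-server timeout, total deadline, and parallel query mode listed above, the sketch below races all servers and returns the first success. It is not Coder's implementation: `lookupFunc`, `resolveParallel`, `fakeLookup`, and the server addresses are hypothetical names introduced here, and the fake lookup stands in for a real DNS query. It uses `errors.Join`, so it assumes Go 1.20 or newer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// lookupFunc resolves host against one DNS server. Hypothetical seam for
// this sketch; a real version could wrap a net.Resolver pinned to server.
type lookupFunc func(ctx context.Context, server, host string) ([]string, error)

// resolveParallel queries every server concurrently, bounds each query by
// perServer and the whole call by total, and returns the first success.
func resolveParallel(servers []string, host string, perServer, total time.Duration, lookup lookupFunc) ([]string, error) {
	if len(servers) == 0 {
		return nil, errors.New("no DNS servers configured")
	}
	ctx, cancel := context.WithTimeout(context.Background(), total)
	defer cancel() // also stops slower queries once a winner is found

	type result struct {
		addrs []string
		err   error
	}
	results := make(chan result, len(servers)) // buffered so late senders never block

	for _, server := range servers {
		go func(server string) {
			sctx, scancel := context.WithTimeout(ctx, perServer)
			defer scancel()
			addrs, err := lookup(sctx, server, host)
			results <- result{addrs, err}
		}(server)
	}

	var errs []error
	for range servers {
		select {
		case r := <-results:
			if r.err == nil {
				return r.addrs, nil
			}
			errs = append(errs, r.err)
		case <-ctx.Done():
			return nil, errors.Join(append(errs, ctx.Err())...)
		}
	}
	return nil, errors.Join(errs...)
}

// fakeLookup simulates one dead server that hangs until its deadline.
func fakeLookup(ctx context.Context, server, host string) ([]string, error) {
	if server == "10.0.0.1:53" {
		<-ctx.Done() // unresponsive server
		return nil, fmt.Errorf("%s: %w", server, ctx.Err())
	}
	return []string{"192.0.2.10"}, nil
}

func main() {
	servers := []string{"10.0.0.1:53", "10.0.0.2:53"}
	addrs, err := resolveParallel(servers, "workspace.coder", 200*time.Millisecond, time.Second, fakeLookup)
	fmt.Println(addrs, err)
}
```

Racing the servers means one dead resolver costs nothing when another is healthy; a sequential failover mode would instead pay up to one per-server timeout for each dead server tried first.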

2. In-Process DNS Caching

Inspired by rs/dnscache, which wraps Go’s net.Resolver:

    type CacheConfig struct {
        TTL               time.Duration `json:"ttl"`                // Default: 5m
        RefreshInterval   time.Duration `json:"refresh_interval"`   // Background refresh
        MaxEntries        int           `json:"max_entries"`        // Default: 1000
        PersistentCache   bool          `json:"persistent_cache"`   // Optional
        BackgroundRefresh bool          `json:"background_refresh"` // Optional
    }

Features:

  • TTL-based expiration
  • Background refresh of entries to maintain freshness
  • Stale result fallback on DNS outage
  • Manual cache clearing (coder dns-cache --clear)
  • Optional persistent cache across CLI sessions
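
The TTL expiration and stale-result fallback above could look roughly like the following. This is a minimal sketch under stated assumptions, not the rs/dnscache API: `staleCache`, `newStaleCache`, `errLookupFailed`, and the injected `lookup` function are hypothetical, and background refresh, entry limits, and persistence are left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errLookupFailed is a stand-in error used by the demo lookup below.
var errLookupFailed = errors.New("dns lookup failed")

type entry struct {
	addrs   []string
	expires time.Time
}

// staleCache serves fresh entries directly, refreshes expired ones, and
// falls back to the expired entry if the refresh fails.
type staleCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	lookup  func(host string) ([]string, error) // underlying resolver (assumed)
	entries map[string]entry
}

func newStaleCache(ttl time.Duration, lookup func(string) ([]string, error)) *staleCache {
	return &staleCache{ttl: ttl, lookup: lookup, entries: map[string]entry{}}
}

// Lookup returns the addresses and whether they came from a stale entry.
// For simplicity the lock is held during the upstream lookup.
func (c *staleCache) Lookup(host string) ([]string, bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e, ok := c.entries[host]
	if ok && time.Now().Before(e.expires) {
		return e.addrs, false, nil // still fresh
	}
	fresh, err := c.lookup(host)
	if err == nil {
		c.entries[host] = entry{addrs: fresh, expires: time.Now().Add(c.ttl)}
		return fresh, false, nil
	}
	if ok {
		return e.addrs, true, nil // DNS outage: serve the stale answer
	}
	return nil, false, err
}

// Clear drops every entry, as a manual cache-clearing command might.
func (c *staleCache) Clear() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = map[string]entry{}
}

func main() {
	healthy := true
	// A TTL of 0 makes every call attempt a refresh, which keeps the demo short.
	cache := newStaleCache(0, func(host string) ([]string, error) {
		if !healthy {
			return nil, errLookupFailed
		}
		return []string{"192.0.2.10"}, nil
	})
	fmt.Println(cache.Lookup("workspace.coder")) // fresh answer
	healthy = false
	fmt.Println(cache.Lookup("workspace.coder")) // stale answer during the outage
}
```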

3. Enhanced Diagnostics in coder netcheck

    # New DNS-focused netcheck options
    coder netcheck --dns-detailed
    coder netcheck --dns-servers-only
    coder netcheck --dns-timeout 3s

Diagnostic Output:

  • Server-by-server resolution time and success/failure status
  • Identification of dead/stale servers
  • Parallel test to detect mixed-responsiveness issues
  • DNS configuration tips and fix suggestions
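
One way the server-by-server timing could be gathered is to pin a `net.Resolver` to a single server through its `Dial` hook. The sketch below assumes that approach; `probeServer`, `probeResult`, `classify`, and the status labels are hypothetical and are not existing netcheck code. `PreferGo` is set because the `Dial` hook only applies to Go's built-in resolver.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// probeResult records how one DNS server answered one query.
type probeResult struct {
	Server  string
	Elapsed time.Duration
	Err     error
}

// classify maps a lookup error to a short status label (labels are illustrative).
func classify(err error) string {
	var dnsErr *net.DNSError
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &dnsErr) && dnsErr.IsTimeout:
		return "timeout"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	default:
		return "error"
	}
}

// probeServer resolves host using only the given server ("ip:port").
func probeServer(server, host string, timeout time.Duration) probeResult {
	r := &net.Resolver{
		PreferGo: true, // use Go's resolver so the Dial hook is honored
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: timeout}
			// Ignore the system-configured address and dial our server instead.
			return d.DialContext(ctx, network, server)
		},
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	start := time.Now()
	_, err := r.LookupHost(ctx, host)
	return probeResult{Server: server, Elapsed: time.Since(start), Err: err}
}

func main() {
	if len(os.Args) < 3 {
		fmt.Println("usage: probe <host> <server:port> [<server:port>...]")
		return
	}
	for _, server := range os.Args[2:] {
		r := probeServer(server, os.Args[1], 2*time.Second)
		fmt.Printf("%-22s %-8s %v\n", r.Server, classify(r.Err), r.Elapsed.Round(time.Millisecond))
	}
}
```

Probing each server in isolation, rather than through the system resolver, is what makes a single dead or slow server visible instead of being hidden behind the others.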

4. ProxyCommand Integration

Since the ProxyCommand is already applied through the Match .coder and coder. host entries, we have the option to add new parameters there to encourage failover or retries:

    # New parameters:
    dns:
      timeout: 2s
      cache_ttl: 5m
      parallel_queries: true
      max_retries: 3
    ssh:
      dns_resilience: true
      connection_timeout: 30s

Features:

  • Retry logic in the face of DNS resolution failure
  • Parallel resolution at SSH connection start
  • Graceful fallback to cached DNS data
  • DNS-aware keep-alive and connection pooling
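
The retry behavior in the first bullet could be as small as the helper below. It is a generic sketch, assuming a doubling delay between attempts; `retry` and `errTransient` are hypothetical names, and `max_retries: 3` from the example configuration would map to `attempts`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a stand-in for a temporary DNS failure in the demo.
var errTransient = errors.New("transient DNS failure")

// retry calls fn up to attempts times, doubling the delay after each
// failure, and wraps the last error if every attempt fails.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1 // always try at least once
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // first two resolutions fail
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```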

Alternatively, 'coder config-ssh' could add a HostName entry for sensitive hosts so that no DNS lookup is needed.
Fixed IP addresses in the .ssh/config file could help, but they can go stale when load balancers change IP addresses, with dynamic hosting, etc.


Implementation Plan

Phase 1: Core DNS Enhancements

  • Add configurable DNS timeout and retry logic
  • Implement in-memory DNS caching
  • Support DNS server failover and parallel queries

Phase 2: Enhanced Diagnostics

  • Expand netcheck to report per-resolver behavior
  • Add CLI flags for DNS tests and timeout controls
  • Implement NXDOMAIN detection (if a query for the access URL or wildcard URL returns NXDOMAIN, tell the end user)
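
For the NXDOMAIN bullet, Go's `net.DNSError` exposes `IsNotFound` and `IsTimeout`, which is enough to tell "name does not exist" apart from "server did not answer". The sketch below assumes that approach; `isNXDOMAIN`, `explain`, the demo helpers, and the message wording are hypothetical. One caveat: Go sets `IsNotFound` for its "no such host" result, which covers NXDOMAIN but may also cover a name that exists without the requested record type.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isNXDOMAIN reports whether err is a DNS "no such host" answer rather
// than a timeout or an unreachable server.
func isNXDOMAIN(err error) bool {
	var dnsErr *net.DNSError
	return errors.As(err, &dnsErr) && dnsErr.IsNotFound
}

// explain turns a lookup error into a user-facing hint (wording is illustrative).
func explain(host string, err error) string {
	var dnsErr *net.DNSError
	switch {
	case err == nil:
		return host + ": resolved"
	case isNXDOMAIN(err):
		return host + ": name does not exist (NXDOMAIN); check the access URL or wildcard URL"
	case errors.As(err, &dnsErr) && dnsErr.IsTimeout:
		return host + ": DNS server timed out; a configured resolver may be unreachable"
	default:
		return host + ": lookup failed: " + err.Error()
	}
}

// demoNotFound and demoTimeout build sample errors so the demo needs no network.
func demoNotFound() error {
	return &net.DNSError{Err: "no such host", Name: "dev.example.invalid", IsNotFound: true}
}

func demoTimeout() error {
	inner := &net.DNSError{Err: "i/o timeout", Name: "dev.example.invalid", IsTimeout: true}
	return fmt.Errorf("resolve: %w", inner) // wrapped, as callers often do
}

func main() {
	fmt.Println(explain("dev.example.invalid", demoNotFound()))
	fmt.Println(explain("dev.example.invalid", demoTimeout()))
	fmt.Println(explain("dev.example.invalid", nil))
}
```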

Phase 3: SSH Integration

  • Add DNS retry and cache fallback to ProxyCommand
  • Expose new DNS options in CLI and config file
  • Support persistent DNS cache storage
  • Check if HostName can be used for DNS avoidance (with %h and %p parameters)

Benefits

Immediate

  • Reliable SSH in VPN environments with bad DNS
  • Faster reconnections due to caching
  • Clear error messages and troubleshooting guidance

Long-Term

  • Infrastructure for DNS-over-HTTPS/TLS support
  • Enhanced resilience in mixed-network setups
  • Less user frustration and fewer support cases

Testing Strategy

  • Simulate VPN with one working and one broken DNS server
  • Measure SSH connection startup time with and without cache
  • Verify fallback behavior when all DNS resolvers fail
  • Benchmark success rate and latency with new parallel resolver
  • Test CLI output clarity and usefulness under failure conditions

Backward Compatibility

  • New features are opt-in via configuration flags
  • Existing behavior remains default
  • Graceful fallback to Go’s default resolver if advanced config fails

Related Issues and Logs

    netcheck: [v1] measuring ICMP latency of xyz:us-west-2b (2): no address for node uswest2b
    netcheck: netcheck.runProbe: named node "uswest2a" has no address
    ssh: could not get canonical name
    ssh: Did not find remote IP address (is SSH ProxyCommand disabled?)
