Open
Description
Problem
DNS management for ai.coder.com and its proxy domains currently requires manual Slack requests to #help-me-ops for CloudFlare changes. This creates friction during workshops and incident response, and makes ops team availability a potential single point of failure.
Context
Domains Managed in CloudFlare:
- ai.coder.com + *.ai.coder.com → us-east-2 NLB (Control Plane)
- oregon-proxy.ai.coder.com + *.oregon-proxy.ai.coder.com → us-west-2 NLB
- emea-proxy.ai.coder.com + *.emea-proxy.ai.coder.com → eu-west-2 NLB
Current Process:
- Request DNS change in #help-me-ops Slack channel
- Wait for ops team response
- Manual CloudFlare console changes
- Verify propagation
Pain Points:
- Manual process doesn't scale during incidents
- No self-service for infrastructure team
- Dependency on ops team availability
- No automated validation of DNS configuration
Requirements
Terraform/IaC Management
- Migrate CloudFlare DNS records to Terraform
- Use CloudFlare Terraform provider
- Store state in shared backend (S3)
- Document DNS change process via Infrastructure as Code
- Enable self-service DNS changes via PR/approval workflow
Automated Validation
- Add DNS validation to pre-workshop checklist
- Implement automated tests:
    # Verify all 6 domains resolve correctly
    dig ai.coder.com
    dig oregon-proxy.ai.coder.com
    dig emea-proxy.ai.coder.com

    # Verify wildcard subdomains
    dig test.ai.coder.com
    dig test.oregon-proxy.ai.coder.com
    dig test.emea-proxy.ai.coder.com
- Run validation as part of CI/CD pipeline
- Alert on DNS misconfiguration
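One way the `dig` checks above could be turned into a pass/fail step for CI or the pre-workshop checklist is sketched below. This is a rough sketch, not an agreed design: `resolve_host` and `check_dns` are hypothetical helper names, and `test` is only an arbitrary label used to exercise each wildcard record.

```shell
#!/bin/sh
# Sketch: report any of the six hostnames from this issue that do not resolve.

# Base domains from "Domains Managed in CloudFlare".
BASE_DOMAINS="ai.coder.com oregon-proxy.ai.coder.com emea-proxy.ai.coder.com"

# resolve_host is a hypothetical helper. Override it to target a specific
# resolver, for example: dig +short @1.1.1.1 "$1"
resolve_host() {
  dig +short "$1"
}

# check_dns prints one status line per hostname and returns non-zero
# if any lookup comes back empty.
check_dns() {
  dns_failed=0
  for dns_domain in $BASE_DOMAINS; do
    # Check the base name and one name covered by its wildcard record.
    for dns_host in "$dns_domain" "test.$dns_domain"; do
      dns_answer="$(resolve_host "$dns_host" || true)"
      if [ -z "$dns_answer" ]; then
        echo "FAIL $dns_host"
        dns_failed=1
      else
        echo "OK   $dns_host"
      fi
    done
  done
  return "$dns_failed"
}
```

A CI job could source this file and run `check_dns`, treating a non-zero exit status as a failed check.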
Documentation
- Document CloudFlare API access requirements
- Document Terraform workflow for DNS changes
- Add DNS troubleshooting to incident runbook (Create pre-workshop validation checklist and runbook #4) ✅ (already added)
- Create rollback procedure for DNS changes
Self-Service Workflow
- Grant infrastructure team CloudFlare API access (scoped to ai.coder.com zone)
- Implement PR-based approval workflow:
- Create Terraform change in PR
- Automated validation/plan
- Team review
- Apply after approval
- Set up notifications for DNS changes (Slack, email)
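The PR workflow steps above could map onto Terraform CLI calls roughly as follows. This is a sketch under assumptions: the function names, the `TF_BIN` override, and the `tfplan` file name are hypothetical, and the CI system that would invoke them is not decided in this issue.

```shell
#!/bin/sh
# Sketch: Terraform commands behind the "automated validation/plan" and
# "apply after approval" steps.

# TF_BIN is a hypothetical override so the binary can be swapped or stubbed.
TF_BIN="${TF_BIN:-terraform}"

# Runs on every PR: formatting check, init, validate, then a saved plan
# that reviewers can inspect.
run_plan_checks() {
  "$TF_BIN" fmt -check -recursive &&
    "$TF_BIN" init -input=false &&
    "$TF_BIN" validate &&
    "$TF_BIN" plan -input=false -out=tfplan
}

# Runs only after approval: applies exactly the plan that was reviewed.
apply_approved_plan() {
  "$TF_BIN" apply -input=false tfplan
}
```

Applying the saved plan file rather than re-planning at apply time keeps the applied change identical to what the team reviewed.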
Monitoring & Alerting
- Monitor DNS resolution for all 6 domains
- Alert on:
- DNS resolution failures
- TTL expiration
- Certificate expiration (related to DNS)
- Unexpected DNS changes
- Add to monitoring dashboard (Implement comprehensive resource monitoring and alerting #6)
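For the "unexpected DNS changes" alert, one possible approach is to compare live CNAME targets against a checked-in list of expected values. The sketch below assumes a hypothetical file of `host target` pairs; `lookup_cname` and `check_drift` are illustrative names, not existing tooling.

```shell
#!/bin/sh
# Sketch: flag hostnames whose CNAME target differs from the expected one.

# lookup_cname is a hypothetical helper. dig prints the target with a
# trailing dot, so the expected file should include the dot as well.
lookup_cname() {
  dig +short "$1" CNAME
}

# check_drift FILE
# FILE holds one "hostname expected-target" pair per line (hypothetical format).
# Prints a DRIFT line per mismatch and returns non-zero if any were found.
check_drift() {
  drift_found=0
  while read -r drift_host drift_expected; do
    [ -n "$drift_host" ] || continue   # skip blank lines
    drift_actual="$(lookup_cname "$drift_host" || true)"
    if [ "$drift_actual" != "$drift_expected" ]; then
      echo "DRIFT $drift_host: expected '$drift_expected', got '$drift_actual'"
      drift_found=1
    fi
  done < "$1"
  return "$drift_found"
}
```

A scheduled job could run `check_drift` against the expected-targets file and alert on a non-zero exit status.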
Success Criteria
- DNS changes can be made via Terraform without manual Slack requests
- Infrastructure team has self-service access to CloudFlare DNS
- DNS configuration validated automatically before and after workshops
- DNS issues detected and alerted before user impact
- Zero workshop delays due to DNS misconfigurations
Implementation Notes
CloudFlare Terraform Example:
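    # Control plane record, pointing at the us-east-2 NLB.
    # Depending on the pinned provider version, `value` may need to be `content`.
    resource "cloudflare_record" "ai_coder_com" {
      zone_id = var.cloudflare_zone_id
      name    = "ai"
      value   = aws_lb.coder_nlb_us_east_2.dns_name
      type    = "CNAME"
      ttl     = 300
      proxied = false
    }

    # Wildcard for subdomains of the control plane.
    resource "cloudflare_record" "ai_coder_com_wildcard" {
      zone_id = var.cloudflare_zone_id
      name    = "*.ai"
      value   = aws_lb.coder_nlb_us_east_2.dns_name
      type    = "CNAME"
      ttl     = 300
      proxied = false
    }

    # Oregon proxy record, pointing at the us-west-2 NLB.
    resource "cloudflare_record" "oregon_proxy" {
      zone_id = var.cloudflare_zone_id
      name    = "oregon-proxy.ai"
      value   = aws_lb.coder_nlb_us_west_2.dns_name
      type    = "CNAME"
      ttl     = 300
      proxied = false
    }

    # ... additional records for London proxy and wildcards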
CloudFlare API Scoping:
- Use API token (not Global API Key)
- Scope to ai.coder.com zone only
- Grant DNS edit permissions only
- Rotate token periodically
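Before wiring a new or rotated token into Terraform, it can be checked against the CloudFlare token verification endpoint. The sketch below assumes the token is exported as `CLOUDFLARE_API_TOKEN`; `fetch_token_status` and `token_is_active` are hypothetical helper names, and the check only confirms the token is active, not that its scope is correct.

```shell
#!/bin/sh
# Sketch: confirm the API token is active before using it in a pipeline.

# fetch_token_status is a hypothetical helper that returns the raw JSON
# response from the token verification endpoint.
fetch_token_status() {
  curl -sS \
    -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
    "https://api.cloudflare.com/client/v4/user/tokens/verify"
}

# token_is_active succeeds only if the response reports status "active".
# The grep match is a simplification; a JSON parser would be more robust.
token_is_active() {
  fetch_token_status | grep -Eq '"status"[[:space:]]*:[[:space:]]*"active"'
}
```

A pipeline could run `token_is_active || exit 1` ahead of the plan step so an expired or revoked token fails fast.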
Security Considerations
- CloudFlare API token stored in secure secret management (AWS Secrets Manager, HashiCorp Vault)
- API token scoped to minimum required permissions
- Audit log for all DNS changes
- Require PR approval for production DNS changes
Future Domains
This infrastructure should support upcoming domains:
- coderdemo.io: SE official demo environment
- devcoder.io: CS / Engineering collaboration environment
Related
Sept 30 Workshop Postmortem
#2 (Image management and subdomain routing)
#4 (Pre-workshop validation checklist)
#6 (Monitoring and alerting)
Incident Runbook - Subdomain Routing Failures
Pre-Workshop Checklist - CloudFlare DNS Verification