# Postmortem: Agentic Workshop Incident - September 30, 2024

**Date:** September 30, 2024
**Environment:** https://ai.coder.com
**Severity:** High
**Duration:** ~10 minutes into the workshop until post-workshop fixes
**Impact:** Multiple user workspaces died/restarted, wiping user progress during the live workshop

---

## Executive Summary

During the Agentic Workshop on September 30, the AI demo environment experienced multiple cascading failures when approximately 10+ users simultaneously onboarded and deployed workspaces. While initial deployments succeeded, resource contention and architectural issues caused workspace instability, data loss, and service disruptions across the multi-region infrastructure. The incident revealed gaps in stress testing and highlighted limitations in the current architecture that were not apparent during smaller-scale internal testing.

---

## Timeline

**Pre-incident:** Workshop begins, users start the onboarding process
**T+0 min:** Initial workspace deployments roll through successfully
**T+~10 min:** Workspaces begin competing for resources as workloads start running
**T+~10 min:** LiteLLM authentication key briefly expires (a few seconds)
**T+~10 min:** Workspaces start dying and restarting, triggering self-healing mechanisms
**T+~10 min:** User progress wiped due to ephemeral volume issues
**T+~10 min:** Subdomain routing issues surface between the Oregon and London proxy clusters
**Post-workshop:** Fixes applied to address all identified issues

---

## Architecture Context

### Multi-Region Deployment

**Control Plane (us-east-2 - Ohio)**:
- Coder Server: 2 replicas @ 4 vCPU / 8 GB each
- External Provisioners: 6 replicas (default org) @ 500m CPU / 512 MB each
- LiteLLM Service: 4 replicas @ 2 vCPU / 4 GB each
- Primary domain: `ai.coder.com` + `*.ai.coder.com`

**Proxy Clusters**:
- Oregon (us-west-2): 2 replicas @ 500m CPU / 1 GB, domain: `oregon-proxy.ai.coder.com`
- London (eu-west-2): 2 replicas @ 500m CPU / 1 GB, domain: `emea-proxy.ai.coder.com`

**Image Management**:
- Source: `ghcr.io/coder/coder-preview` (non-GA preview for beta AI features)
- Mirrored to private AWS ECR (us-east-2)
- Critical dependency: ECR must stay in sync with GHCR

**DNS Management**:
- 6 domains managed in CloudFlare (control plane + 2 proxies, each with wildcard)
- Manual process via the #help-me-ops Slack channel

---

## Root Causes

### 1. Resource Contention - Ephemeral Volume Storage

**Cause:** Limited node storage capacity for ephemeral volumes could not handle concurrent workspace workloads. Each workspace template consumes 2-4 vCPU and 4-8 GB memory, with ephemeral storage on node-local volumes.

**Impact:** Workspaces died and restarted when nodes exhausted storage, triggering self-healing that wiped user progress.

**Why it wasn't caught:**
- No stress testing with realistic concurrent user load (10+ users)
- Internal testing used lower concurrency
- Capacity planning didn't account for simultaneous workspace workloads
- No monitoring/alerting for ephemeral volume storage thresholds

**Technical Details:**
- Workspace templates allow 2-4 vCPU / 4-8 GB configuration
- ~10 concurrent workspaces @ 4 vCPU / 8 GB = 40+ vCPU / 80+ GB demand
- Ephemeral volumes for each workspace competed for node storage
- Karpenter auto-scaled nodes but storage capacity per node remained fixed (see the sizing sketch below)
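
The arithmetic behind this contention is worth making explicit. A minimal sizing sketch: the vCPU/memory figures come from the template limits above, while `WS_EPHEMERAL_GB` and the node dimensions are illustrative assumptions, not measured values:

```python
import math

# Per-workspace demand at the template's upper bound (from the template limits above).
WS_VCPU, WS_MEM_GB = 4, 8
WS_EPHEMERAL_GB = 20          # assumed ephemeral volume per workspace; not a measured value

# Assumed node shape; substitute the actual Karpenter instance types in use.
NODE_VCPU, NODE_MEM_GB, NODE_DISK_GB = 16, 32, 100

def nodes_needed(workspaces: int) -> int:
    """Nodes required so that CPU, memory, AND node-local disk all fit."""
    by_cpu = math.ceil(workspaces * WS_VCPU / NODE_VCPU)
    by_mem = math.ceil(workspaces * WS_MEM_GB / NODE_MEM_GB)
    by_disk = math.ceil(workspaces * WS_EPHEMERAL_GB / NODE_DISK_GB)
    return max(by_cpu, by_mem, by_disk)

for n in (5, 10, 15):
    print(f"{n} workspaces -> {nodes_needed(n)} node(s)")
```

The failure mode falls out of the `by_disk` term: Karpenter reacts to CPU/memory pressure, so the cluster can look healthy on compute while node-local disk is the binding constraint.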

### 2. Image Management Inconsistencies

**Cause:** The non-GA Coder preview image (`ghcr.io/coder/coder-preview`) mirrored to private ECR fell out of sync between the control plane (us-east-2) and proxy clusters (us-west-2, eu-west-2).

**Impact:** Image version mismatches caused subdomain routing failures across regions. Workspaces couldn't be accessed via proxy URLs (`*.oregon-proxy.ai.coder.com`, `*.emea-proxy.ai.coder.com`).

**Why it wasn't caught:**
- Manual ECR mirroring process from GHCR is error-prone
- No automated validation of image digests across all clusters
- Issue only manifests under multi-region load with simultaneous deployments
- Pre-workshop checklist lacked image consistency verification

**Technical Details:**
- Image sync process:
  1. Pull from `ghcr.io/coder/coder-preview:latest`
  2. Tag and push to private ECR
  3. Deploy to all 3 regions (us-east-2, us-west-2, eu-west-2)
- During the workshop, the ECR mirror was stale
- Control plane ran a newer image than the proxies
- Subdomain routing logic failed due to the version mismatch (a digest check sketch follows)
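
A pre-workshop digest check across the three clusters is cheap to automate. A minimal sketch using the official `kubernetes` Python client; the kubeconfig context names, namespace, and label selector are assumptions to adapt to the real clusters:

```python
from kubernetes import client, config

# Assumed kubeconfig context names; adjust to the real cluster contexts.
CONTEXTS = {
    "us-east-2": "coder-control-plane",
    "us-west-2": "coder-proxy-oregon",
    "eu-west-2": "coder-proxy-london",
}
# Assumed namespace and pod label for the Coder deployments.
NAMESPACE, LABEL = "coder", "app.kubernetes.io/name=coder"

def deployed_digests(context: str) -> set[str]:
    """Collect the image digests actually running in one cluster."""
    api = client.CoreV1Api(config.new_client_from_config(context=context))
    pods = api.list_namespaced_pod(NAMESPACE, label_selector=LABEL)
    return {
        status.image_id  # imageID includes the resolved sha256 digest
        for pod in pods.items
        for status in (pod.status.container_statuses or [])
    }

seen = {region: deployed_digests(ctx) for region, ctx in CONTEXTS.items()}
for region, digests in seen.items():
    print(region, digests)
if len({frozenset(d) for d in seen.values()}) != 1:
    raise SystemExit("MISMATCH: clusters are not running the same image")
print("OK: all clusters agree")
```

Checking what pods actually run (rather than what the mirror contains) catches both a stale ECR mirror and a cluster that never rolled out the new image.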

### 3. LiteLLM Key Expiration

**Cause:** The LiteLLM authentication key expired briefly during the workshop. LiteLLM uses an auxiliary addon that rotates keys every 4-5 hours.

**Impact:** Brief service disruption (a few seconds) for AI features (Claude Code CLI, Goose CLI). Key rotation also forces all workspaces to restart to consume new keys.

**Note:** Currently using open-source LiteLLM, which has limited key management flexibility. The enterprise version is not justified for current needs.

**Why it wasn't caught:**
- No pre-workshop validation of key expiration times
- Key rotation schedule not documented or considered in workshop planning
- No monitoring/alerting for upcoming key expirations

**Technical Details:**
- LiteLLM: 4 replicas @ 2 vCPU / 4 GB, round-robin between AWS Bedrock and GCP Vertex AI
- Auxiliary addon runs on a 4-5 hour schedule
- Key rotation requires a workspace restart to pick up new credentials
- If rotation occurs during a workshop, it causes mass workspace restarts (see the window check below)
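
Because the interval is known (4-5 hours) but the exact firing time is not, a useful pre-workshop check is whether any rotation could land inside the event window. A sketch, assuming the last rotation timestamp can be recovered from the addon's logs or state (that retrieval step is not shown), using the 4-hour lower bound as the conservative case:

```python
from datetime import datetime, timedelta, timezone

ROTATION_MIN = timedelta(hours=4)  # addon rotates every 4-5 hours; 4 h is the worst case

def rotations_in_window(last_rotation: datetime,
                        window_start: datetime,
                        window_end: datetime) -> list[datetime]:
    """Earliest-possible rotation times that fall inside the workshop window."""
    t, hits = last_rotation, []
    while t <= window_end:
        t += ROTATION_MIN
        if window_start <= t <= window_end:
            hits.append(t)
    return hits

# Example: last rotation at 07:30 UTC, workshop from 10:00 to 12:00 UTC.
last = datetime(2024, 9, 30, 7, 30, tzinfo=timezone.utc)
start = datetime(2024, 9, 30, 10, 0, tzinfo=timezone.utc)
end = datetime(2024, 9, 30, 12, 0, tzinfo=timezone.utc)
hits = rotations_in_window(last, start, end)
if hits:
    print("WARNING: rotation may fire during the workshop:", hits)
```

If a possible rotation falls inside the window, rotating manually just before the event pushes the next cycle past it.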

### 4. Provisioner Capacity Bottleneck

**Cause:** Default provisioner capacity (6 replicas @ 500m CPU / 512 MB) was insufficient for ~10 concurrent users simultaneously creating workspaces.

**Impact:** Workspace create operations queued or timed out, causing delays and a poor user experience.

**Why it wasn't caught:**
- No capacity planning guidelines for concurrent user scaling
- Provisioners are single-threaded (1 provisioner = 1 Terraform operation)
- No monitoring of provisioner queue depth
- Workshop planning didn't include provisioner pre-scaling

**Technical Details:**
- 10 users × 1 workspace each = 10 concurrent Terraform operations
- 6 provisioners = max 6 concurrent operations
- Remaining 4 operations queued, causing delays
- Recommendation: Scale to 8-10 replicas for 10-15 users (see the sizing sketch below)
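
Because of that 1:1 mapping, sizing reduces to simple division. A minimal sketch of the sizing rule; the average build time is an assumed figure for illustration:

```python
import math

BUILD_MINUTES = 2.5  # assumed average per-workspace Terraform apply time

def queue_estimate(users: int, provisioners: int) -> tuple[int, float]:
    """Waves of builds and worst-case wait before the last build starts."""
    waves = math.ceil(users / provisioners)
    worst_wait = (waves - 1) * BUILD_MINUTES
    return waves, worst_wait

for replicas in (6, 8, 10):
    waves, wait = queue_estimate(users=10, provisioners=replicas)
    print(f"{replicas} provisioners: {waves} wave(s), last build waits ~{wait:.0f} min")
```

At 10 replicas every one of 10 builds starts immediately, which is the basis for the 8-10 replica recommendation above.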

### 5. DNS Management Dependency

**Cause:** Managing CloudFlare DNS manually via the #help-me-ops Slack channel created the potential for delays during incident response.

**Impact:** No immediate impact during the workshop, but DNS issues would have been slow to resolve.

**Why it's a concern:**
- 6 domains to manage: control plane + 2 proxies (each with wildcard)
- No self-service for the infrastructure team
- Dependency on ops team availability
- No automated validation of DNS configuration

---

## Impact Assessment

**Users Affected:** All workshop participants (~10+ concurrent users)
**Data Loss:** User workspace progress wiped due to ephemeral volume restarts
**Service Availability:** Degraded for ~10+ minutes during the workshop
**Business Impact:** Poor user experience during a live demonstration/workshop event

**Metrics**:
- Workspace failure rate: ~40-50% (estimated, 4-5 workspaces restarted)
- Average workspace restart time: 2-3 minutes
- Number of incidents: 3 major (storage, image sync, key expiration)
- User-visible impact duration: ~10 minutes

---

## What Went Well

- Initial deployment phase worked correctly (first ~10 minutes)
- Self-healing mechanisms activated (though they resulted in data loss)
- Karpenter successfully scaled nodes in response to demand
- LiteLLM key rotation was brief (a few seconds)
- Issues were contained to the workshop environment (no production impact)
- Team responded post-workshop with comprehensive fixes
- Base infrastructure foundation is solid (EKS, Karpenter, multi-region setup)
- Multi-region architecture design is sound

---

## What Went Wrong

- No internal stress testing with realistic concurrent user load prior to the workshop
- Ephemeral volume capacity planning insufficient for simultaneous workloads
- Image management strategy across multi-region clusters not robust
- No pre-workshop validation of authentication keys or the key rotation schedule
- Lack of monitoring/alerting for resource contention thresholds
- Provisioner capacity not scaled proactively
- No pre-workshop checklist or validation procedures
- Manual processes (ECR sync, CloudFlare DNS) created points of failure
- No capacity planning guidelines for concurrent user scaling

---

## Action Items

### Completed (Post-Workshop)
- ✅ Applied fixes for all identified issues
- ✅ Created comprehensive incident documentation
- ✅ Documented architecture and component details
- ✅ Created pre-workshop validation checklist
- ✅ Created incident runbook
- ✅ Established GitHub tracking issues

### High Priority (Before Next Workshop)

**Storage & Capacity** (Issue #1)
- [ ] Audit current ephemeral volume allocation per node
- [ ] Calculate storage requirements for target concurrent workspace count
- [ ] Implement storage capacity monitoring and alerting
- [ ] Define resource limits per workspace to prevent node exhaustion
- [ ] Test with realistic concurrent user load

**Image Management** (Issue #2, Issue #7)
- [ ] Automate ECR image mirroring from `ghcr.io/coder/coder-preview`
- [ ] Implement pre-deployment validation of image digests across all clusters
- [ ] Add to pre-workshop checklist
- [ ] Document rollback procedure for bad images

**LiteLLM Key Management** (Issue #3)
- [ ] Implement monitoring/alerting for key expiration (7, 3, 1 day warnings)
- [ ] Document key rotation procedure
- [ ] Add key expiration check to pre-workshop checklist
- [ ] Disable/schedule key rotation around workshops

**Pre-Workshop Validation** (Issue #4)
- [ ] Complete pre-workshop checklist 2 days before each workshop
- [ ] Validate LiteLLM keys, image consistency, storage capacity
- [ ] Test subdomain routing across all regions
- [ ] Scale provisioners based on expected attendance
- [ ] Confirm monitoring and alerting are operational

**Provisioner Scaling** (Issue #8)
- [ ] Document scaling recommendations based on concurrent user count
- [ ] Scale provisioners 1 day before workshops (6 → 8-10 for 10-15 users)
- [ ] (Long-term) Implement provisioner auto-scaling based on queue depth

**Monitoring & Alerting** (Issue #6)
- [ ] Ephemeral volume storage capacity per node (alert at 70%, 85%, 95%; sketched below)
- [ ] Concurrent workspace count
- [ ] Workspace restart/failure rate
- [ ] Image pull times across clusters
- [ ] LiteLLM key expiration
- [ ] Subdomain routing success rate
- [ ] Provisioner queue depth
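
As a sketch of the tiered thresholds in the first item above, a small check that maps per-node ephemeral volume usage onto the 70/85/95 levels. How usage is collected (kubelet stats, node-exporter metrics, etc.) is left open; the function only classifies a percentage:

```python
THRESHOLDS = [(95.0, "critical"), (85.0, "warning"), (70.0, "notice")]

def storage_alert(node: str, used_pct: float) -> str | None:
    """Return an alert line for a node's ephemeral volume usage, or None if healthy."""
    for limit, severity in THRESHOLDS:  # ordered highest first
        if used_pct >= limit:
            return f"[{severity}] {node}: ephemeral storage at {used_pct:.0f}% (>= {limit:.0f}%)"
    return None

# Example readings (illustrative numbers, not from the incident):
for node, pct in {"node-a": 64.0, "node-b": 88.5, "node-c": 97.2}.items():
    alert = storage_alert(node, pct)
    if alert:
        print(alert)
```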

### Medium Priority (1-3 months)

**CloudFlare DNS Automation** (Issue #9)
- [ ] Migrate CloudFlare DNS to Terraform
- [ ] Enable self-service DNS changes via PR workflow
- [ ] Add DNS validation to CI/CD pipeline
- [ ] Implement monitoring for DNS resolution

**Monthly Workshop Cadence** (Issue #5)
- [ ] Establish monthly workshop schedule
- [ ] Develop workshop content/agenda
- [ ] Define success metrics
- [ ] Create feedback collection mechanism
- [ ] Track month-over-month improvements

### Long-Term (3+ months)

**Stress Testing Automation**
- [ ] Build internal stress testing tooling
- [ ] Simulate concurrent user load
- [ ] Automate capacity validation
- [ ] Integrate into CI/CD pipeline

**Architectural Improvements**
- [ ] Evaluate persistent storage options to prevent data loss
- [ ] Consider workspace state backup/restore mechanisms
- [ ] Implement provisioner auto-scaling (HPA based on queue depth)
- [ ] Optimize ephemeral volume allocation strategy

---

## Lessons Learned

### What We Learned

1. **Production-like testing is essential:** Internal testing without realistic concurrent load is insufficient for demo/workshop environments. The gap between "works in testing" and "works at scale" is significant.

2. **Capacity planning needs real-world data:** Architectural assumptions (storage, provisioners, LiteLLM) must be validated under actual user load patterns. Theoretical capacity ≠ practical capacity.

3. **Manual processes don't scale:** ECR image syncing and CloudFlare DNS management via Slack requests create bottlenecks and points of failure during incidents.

4. **Multi-region consistency is hard:** Keeping images, configurations, and services synchronized across us-east-2, us-west-2, and eu-west-2 requires automation and validation.

5. **Key rotation timing matters:** LiteLLM's 4-5 hour rotation schedule must be coordinated with workshop timing to avoid forced workspace restarts during events.

6. **Provisioner scaling is critical:** Single-threaded Terraform operations mean provisioner count directly determines concurrent workspace operation capacity.

7. **Pre-event validation is non-negotiable:** A structured checklist covering infrastructure, capacity, authentication, and routing prevents avoidable issues.

8. **Monthly cadence provides continuous validation:** Regular workshops will surface optimization opportunities and prevent regressions. The base infrastructure is solid; now we need operational refinement.

### What We'll Do Differently

1. **Always run the pre-workshop checklist** 2 days before events
2. **Scale provisioners and LiteLLM proactively** based on expected attendance
3. **Disable LiteLLM key rotation** during workshop windows
4. **Validate image consistency** across all regions before workshops
5. **Monitor ephemeral storage** and alert before capacity issues arise
6. **Automate manual processes** (ECR sync, DNS management)
7. **Conduct monthly workshops** to continuously stress test and improve
8. **Document everything** for faster incident response and knowledge sharing

### Process Improvements

1. **Pre-Workshop Checklist:** Mandatory 2-day pre-event validation covering all infrastructure components
2. **Incident Runbook:** Step-by-step procedures for common failure scenarios
3. **Capacity Planning:** Clear guidelines for scaling based on concurrent user count
4. **Monitoring Dashboard:** Real-time visibility during workshops for proactive issue detection
5. **Post-Workshop Retrospective:** Structured feedback loop to track improvements month-over-month

---

## Technical Recommendations

### Immediate (Week 1)
1. Implement ephemeral storage monitoring with alerting
2. Create automated ECR sync job (GitHub Actions or AWS Lambda); a sketch follows this list
3. Document provisioner scaling procedure in runbook
4. Add LiteLLM key expiration to monitoring
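
For item 2, the sync job can be a scheduled script that shells out to standard container tooling. A minimal sketch using the Docker CLI; the ECR account ID and repository are placeholders, and it assumes `docker login` to ECR has already run (e.g. via `aws ecr get-login-password`):

```python
import subprocess

SRC = "ghcr.io/coder/coder-preview:latest"
# Placeholder account ID and repo; substitute the real private ECR target.
DST = "123456789012.dkr.ecr.us-east-2.amazonaws.com/coder/coder-preview:latest"

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)  # raise on failure so the job surfaces errors

run("docker", "pull", SRC)
run("docker", "tag", SRC, DST)
run("docker", "push", DST)
```

Pairing this with the digest validation from Root Cause 2 means a failed or partial sync is caught before a workshop rather than during one.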

### Short-term (Month 1)
1. Migrate CloudFlare DNS to Terraform
2. Implement image digest validation across clusters
3. Set up workshop-specific monitoring dashboard
4. Create provisioner HPA based on CPU/memory

### Long-term (Quarter 1)
1. Build stress testing automation
2. Implement provisioner queue depth monitoring and auto-scaling
3. Evaluate persistent storage options for workspace data
4. Expand to additional demo environments (coderdemo.io, devcoder.io)

---

## Success Metrics

Track these metrics month-over-month:

**Platform Stability**:
- Workspace restart/failure rate: Target <2%
- Incidents with user-visible impact: Target 0
- Storage contention events: Target 0
- Subdomain routing errors: Target 0
- Average workspace start time: Target <2 minutes

**Workshop Quality**:
- Participant satisfaction score: Target 4.5+/5
- Percentage completing workshop: Target >90%
- Number of blockers encountered: Target <3

**Operational Efficiency**:
- Pre-workshop checklist completion time: Target <30 minutes
- Time to resolve incidents: Target <5 minutes
- Manual interventions required: Target <2 per workshop

---

## Related Resources

### Documentation
- [Architecture Overview](./workshops/ARCHITECTURE.md)
- [Monthly Workshop Guide](./workshops/MONTHLY_WORKSHOP_GUIDE.md)
- [Pre-Workshop Checklist](./workshops/PRE_WORKSHOP_CHECKLIST.md)
- [Incident Runbook](./workshops/INCIDENT_RUNBOOK.md)
- [Post-Workshop Retrospective Template](./workshops/POST_WORKSHOP_RETROSPECTIVE.md)
- [Participant Guide](./workshops/PARTICIPANT_GUIDE.md)

### GitHub Issues
- [#1 - Optimize ephemeral volume storage capacity](https://github.com/coder/ai.coder.com/issues/1)
- [#2 - Standardize image management across clusters](https://github.com/coder/ai.coder.com/issues/2)
- [#3 - Improve LiteLLM key rotation and monitoring](https://github.com/coder/ai.coder.com/issues/3)
- [#4 - Create pre-workshop validation checklist](https://github.com/coder/ai.coder.com/issues/4)
- [#5 - Establish monthly workshop cadence](https://github.com/coder/ai.coder.com/issues/5)
- [#6 - Implement comprehensive monitoring and alerting](https://github.com/coder/ai.coder.com/issues/6)
- [#7 - Automate ECR image mirroring](https://github.com/coder/ai.coder.com/issues/7)
- [#8 - Implement provisioner auto-scaling](https://github.com/coder/ai.coder.com/issues/8)
- [#9 - Automate CloudFlare DNS management](https://github.com/coder/ai.coder.com/issues/9)

---

## Approvals

**Infrastructure Team Lead**: _________________
**Product Team Lead**: _________________
**Date**: _________________

---

**Prepared by:** Dave Ahr
**Review Date:** October 2024
**Next Review:** After first monthly workshop