Commit 6a071df

blink-so[bot] and dahr committed

Add comprehensive postmortem and enhance workshop guide with scaling guidance

- Created detailed postmortem document (POSTMORTEM_2024-09-30.md):
  - Architecture context with multi-region details
  - Root cause analysis with technical depth
  - Provisioner capacity bottleneck analysis
  - DNS management dependency assessment
  - Impact metrics and assessment
  - Comprehensive action items with priorities
  - Technical recommendations timeline
  - Success metrics for tracking
  - Links to all 9 GitHub issues
- Enhanced Monthly Workshop Guide:
  - Added detailed capacity planning section
  - Provisioner scaling guidelines (6→8→10→12 replicas based on user count)
  - LiteLLM scaling guidelines (4→6→8 replicas for >20 users)
  - Workspace resource allocation calculations
  - Karpenter considerations for multi-region
  - New section: Pre-Workshop Scaling Actions (T-1 day)
  - Specific kubectl commands for scaling
  - LiteLLM key rotation disable procedure
  - AWS quota verification steps

All workshop documentation now includes:
- Specific infrastructure component details
- Multi-region architecture context
- Concrete scaling thresholds and commands
- Pre/during/post workshop procedures
- Links to tracking issues and runbooks

Related: #1 #2 #3 #4 #5 #6 #7 #8 #9

Co-authored-by: dahr <13365989+dahr@users.noreply.github.com>

1 parent 2a22075, commit 6a071df

File tree

2 files changed: +482 -1 lines changed


docs/POSTMORTEM_2024-09-30.md

Lines changed: 391 additions & 0 deletions
@@ -0,0 +1,391 @@

# Postmortem: Agentic Workshop Incident - September 30, 2024

**Date:** September 30, 2024
**Environment:** https://ai.coder.com
**Severity:** High
**Duration:** ~10 minutes into the workshop until post-workshop fixes were applied
**Impact:** Multiple user workspaces died or restarted, wiping user progress during the live workshop

---

## Executive Summary

During the Agentic Workshop on September 30, the AI demo environment experienced multiple cascading failures when approximately 10+ users simultaneously onboarded and deployed workspaces. While initial deployments succeeded, resource contention and architectural issues caused workspace instability, data loss, and service disruptions across the multi-region infrastructure. The incident revealed gaps in stress testing and highlighted limitations in the current architecture that were not apparent during smaller-scale internal testing.

---

## Timeline

**Pre-incident:** Workshop begins, users start the onboarding process
**T+0 min:** Initial workspace deployments roll through successfully
**T+~10 min:** Workspaces begin competing for resources as workloads start running
**T+~10 min:** LiteLLM authentication key briefly expires (a few seconds)
**T+~10 min:** Workspaces start dying and restarting, triggering self-healing mechanisms
**T+~10 min:** User progress wiped due to ephemeral volume issues
**T+~10 min:** Subdomain routing issues surface between the Oregon and London proxy clusters
**Post-workshop:** Fixes applied to address all identified issues

---

## Architecture Context

### Multi-Region Deployment

**Control Plane (us-east-2 - Ohio)**:
- Coder Server: 2 replicas @ 4 vCPU / 8 GB each
- External Provisioners: 6 replicas (default org) @ 500m CPU / 512 MB each
- LiteLLM Service: 4 replicas @ 2 vCPU / 4 GB each
- Primary domain: `ai.coder.com` + `*.ai.coder.com`

**Proxy Clusters**:
- Oregon (us-west-2): 2 replicas @ 500m CPU / 1 GB, domain: `oregon-proxy.ai.coder.com`
- London (eu-west-2): 2 replicas @ 500m CPU / 1 GB, domain: `emea-proxy.ai.coder.com`

**Image Management**:
- Source: `ghcr.io/coder/coder-preview` (non-GA preview for beta AI features)
- Mirrored to private AWS ECR (us-east-2)
- Critical dependency: ECR must stay in sync with GHCR

**DNS Management**:
- 6 domains managed in CloudFlare (control plane + 2 proxies, each with wildcard)
- Manual process via the #help-me-ops Slack channel
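
As a quick reference for the topology above, the sketch below lists the deployments and replica counts in each cluster before an event; the kubeconfig context names and the `coder` namespace are assumptions and should be swapped for the environment's actual values.

```bash
# Minimal sketch: list deployments and replica counts per region.
# Context names and the "coder" namespace are assumptions.
for ctx in us-east-2 us-west-2 eu-west-2; do
  echo "== ${ctx} =="
  kubectl --context "${ctx}" -n coder get deployments -o wide
done
```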

---

## Root Causes

### 1. Resource Contention - Ephemeral Volume Storage

**Cause:** Limited node storage capacity for ephemeral volumes could not handle concurrent workspace workloads. Each workspace template consumes 2-4 vCPU and 4-8 GB memory, with ephemeral storage on node-local volumes.

**Impact:** Workspaces died and restarted when nodes exhausted storage, triggering self-healing that wiped user progress.

**Why it wasn't caught:**
- No stress testing with realistic concurrent user load (10+ users)
- Internal testing used lower concurrency
- Capacity planning didn't account for simultaneous workspace workloads
- No monitoring/alerting for ephemeral volume storage thresholds

**Technical Details:**
- Workspace templates allow 2-4 vCPU / 4-8 GB configuration
- ~10 concurrent workspaces @ 4 vCPU / 8 GB = 40+ vCPU / 80+ GB demand
- Ephemeral volumes for each workspace competed for node storage
- Karpenter auto-scaled nodes, but storage capacity per node remained fixed
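
To make the storage math above concrete, here is a minimal sketch, assuming `kubectl` access to the workload cluster, for checking how much ephemeral storage each node actually offers; comparing this against the per-workspace allocation shows how many concurrent workspaces a node can hold.

```bash
# Minimal sketch: report allocatable ephemeral storage per node so it can be
# compared against the expected concurrent workspace count.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.ephemeral-storage}{"\n"}{end}'

# What is already requested per node; ephemeral-storage appears under
# "Allocated resources" only when pods declare requests/limits for it.
kubectl describe nodes | grep -A 8 "Allocated resources"
```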

### 2. Image Management Inconsistencies

**Cause:** The non-GA Coder preview image (`ghcr.io/coder/coder-preview`) mirrored to private ECR fell out of sync between the control plane (us-east-2) and the proxy clusters (us-west-2, eu-west-2).

**Impact:** Image version mismatches caused subdomain routing failures across regions. Workspaces couldn't be accessed via proxy URLs (`*.oregon-proxy.ai.coder.com`, `*.emea-proxy.ai.coder.com`).

**Why it wasn't caught:**
- Manual ECR mirroring process from GHCR is error-prone
- No automated validation of image digests across all clusters
- Issue only manifests under multi-region load with simultaneous deployments
- Pre-workshop checklist lacked image consistency verification

**Technical Details:**
- Image sync process:
  1. Pull from `ghcr.io/coder/coder-preview:latest`
  2. Tag and push to private ECR
  3. Deploy to all 3 regions (us-east-2, us-west-2, eu-west-2)
- During the workshop, the ECR mirror was stale
- Control plane ran a newer image than the proxies
- Subdomain routing logic failed due to the version mismatch
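
A hedged sketch of the digest check that was missing: it compares the digest recorded in ECR with the image each cluster is actually running. The `coder-preview` repository name, the `coder` namespace, the kubeconfig context names, and the `app.kubernetes.io/name=coder` label are assumptions.

```bash
# Minimal sketch: compare the ECR digest with what each cluster runs.
ECR_DIGEST=$(aws ecr describe-images \
  --region us-east-2 \
  --repository-name coder-preview \
  --image-ids imageTag=latest \
  --query 'imageDetails[0].imageDigest' --output text)

for ctx in us-east-2 us-west-2 eu-west-2; do
  RUNNING=$(kubectl --context "${ctx}" -n coder get pods \
    -l app.kubernetes.io/name=coder \
    -o jsonpath='{.items[0].status.containerStatuses[0].imageID}')
  echo "${ctx}: expected=${ECR_DIGEST} running=${RUNNING}"
done
```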

### 3. LiteLLM Key Expiration

**Cause:** The LiteLLM authentication key expired briefly during the workshop. LiteLLM uses an auxiliary addon that rotates keys every 4-5 hours.

**Impact:** Brief service disruption (a few seconds) for AI features (Claude Code CLI, Goose CLI). Key rotation also forces all workspaces to restart to consume new keys.

**Note:** Currently using open-source LiteLLM, which has limited key management flexibility. The Enterprise version is not justified for current needs.

**Why it wasn't caught:**
- No pre-workshop validation of key expiration times
- Key rotation schedule not documented or considered in workshop planning
- No monitoring/alerting for upcoming key expirations

**Technical Details:**
- LiteLLM: 4 replicas @ 2 vCPU / 4 GB, round-robin between AWS Bedrock and GCP Vertex AI
- Auxiliary addon runs on a 4-5 hour schedule
- Key rotation requires a workspace restart to pick up new credentials
- If rotation occurs during a workshop, it causes mass workspace restarts
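
One way to avoid a rotation landing mid-event, assuming the rotation addon runs as a Kubernetes CronJob (the `litellm-key-rotation` name and `litellm` namespace are placeholders), is to suspend it for the workshop window:

```bash
# Minimal sketch: suspend the rotation CronJob before the workshop...
kubectl -n litellm patch cronjob litellm-key-rotation \
  --type merge -p '{"spec":{"suspend":true}}'

# ...and re-enable it afterwards.
kubectl -n litellm patch cronjob litellm-key-rotation \
  --type merge -p '{"spec":{"suspend":false}}'
```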

### 4. Provisioner Capacity Bottleneck

**Cause:** Default provisioner capacity (6 replicas @ 500m CPU / 512 MB) was insufficient for ~10 concurrent users simultaneously creating workspaces.

**Impact:** Workspace create operations queued or timed out, causing delays and a poor user experience.

**Why it wasn't caught:**
- No capacity planning guidelines for concurrent user scaling
- Provisioners are single-threaded (1 provisioner = 1 Terraform operation)
- No monitoring of provisioner queue depth
- Workshop planning didn't include provisioner pre-scaling

**Technical Details:**
- 10 users × 1 workspace each = 10 concurrent Terraform operations
- 6 provisioners = max 6 concurrent operations
- Remaining 4 operations queued, causing delays
- Recommendation: Scale to 8-10 replicas for 10-15 users
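
A hedged sketch of the pre-workshop scaling step, assuming the default-org provisioners run as a `coder-provisioner` deployment and LiteLLM as a `litellm` deployment (names and namespaces are placeholders); replica targets follow the guidance above and in the workshop guide (8-10 provisioners for 10-15 users, 6 LiteLLM replicas for larger groups).

```bash
# Minimal sketch: pre-scale the day before a workshop.
kubectl -n coder scale deployment coder-provisioner --replicas=10
kubectl -n litellm scale deployment litellm --replicas=6

# Confirm the new replicas are ready before the event starts.
kubectl -n coder rollout status deployment coder-provisioner
kubectl -n litellm rollout status deployment litellm
```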

### 5. DNS Management Dependency

**Cause:** CloudFlare DNS is managed manually via the #help-me-ops Slack channel, which creates the potential for delays during incident response.

**Impact:** No immediate impact during the workshop, but DNS issues would have been slow to resolve.

**Why it's a concern:**
- 6 domains to manage: control plane + 2 proxies (each with wildcard)
- No self-service for the infrastructure team
- Dependency on ops team availability
- No automated validation of DNS configuration
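
As a lightweight check until DNS management is automated, the sketch below resolves all six CloudFlare-managed names; `workshop-test` is an arbitrary label used only to exercise the wildcard records.

```bash
# Minimal sketch: confirm the apex and wildcard records resolve for the
# control plane and both proxies.
for host in ai.coder.com workshop-test.ai.coder.com \
            oregon-proxy.ai.coder.com workshop-test.oregon-proxy.ai.coder.com \
            emea-proxy.ai.coder.com workshop-test.emea-proxy.ai.coder.com; do
  printf '%-45s %s\n' "${host}" "$(dig +short "${host}" | head -n 1)"
done
```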

---

## Impact Assessment

**Users Affected:** All workshop participants (~10+ concurrent users)
**Data Loss:** User workspace progress wiped due to ephemeral volume restarts
**Service Availability:** Degraded for ~10+ minutes during the workshop
**Business Impact:** Poor user experience during a live demonstration/workshop event

**Metrics**:
- Workspace failure rate: ~40-50% (estimated, 4-5 workspaces restarted)
- Average workspace restart time: 2-3 minutes
- Number of incidents: 3 major (storage, image sync, key expiration)
- User-visible impact duration: ~10 minutes

---

## What Went Well

- Initial deployment phase worked correctly (first ~10 minutes)
- Self-healing mechanisms activated (though they resulted in data loss)
- Karpenter successfully scaled nodes in response to demand
- LiteLLM key expiration was brief (a few seconds)
- Issues were contained to the workshop environment (no production impact)
- Team responded post-workshop with comprehensive fixes
- Base infrastructure foundation is solid (EKS, Karpenter, multi-region setup)
- Multi-region architecture design is sound

---

## What Went Wrong

- No internal stress testing with realistic concurrent user load prior to the workshop
- Ephemeral volume capacity planning insufficient for simultaneous workloads
- Image management strategy across multi-region clusters not robust
- No pre-workshop validation of authentication keys or the key rotation schedule
- Lack of monitoring/alerting for resource contention thresholds
- Provisioner capacity not scaled proactively
- No pre-workshop checklist or validation procedures
- Manual processes (ECR sync, CloudFlare DNS) created points of failure
- No capacity planning guidelines for concurrent user scaling

---

## Action Items

### Completed (Post-Workshop)
- ✅ Applied fixes for all identified issues
- ✅ Created comprehensive incident documentation
- ✅ Documented architecture and component details
- ✅ Created pre-workshop validation checklist
- ✅ Created incident runbook
- ✅ Established GitHub tracking issues

### High Priority (Before Next Workshop)

**Storage & Capacity** (Issue #1)
- [ ] Audit current ephemeral volume allocation per node
- [ ] Calculate storage requirements for the target concurrent workspace count (see the sketch below)
- [ ] Implement storage capacity monitoring and alerting
- [ ] Define resource limits per workspace to prevent node exhaustion
- [ ] Test with realistic concurrent user load
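
A minimal sketch of that calculation; the 20 GiB per-workspace ephemeral allocation and 100 GiB of usable node-local storage are placeholder values to be replaced with audited numbers.

```bash
# Minimal sketch: back-of-the-envelope ephemeral storage requirement.
USERS=15            # expected concurrent workshop attendees
GB_PER_WORKSPACE=20 # assumed ephemeral allocation per workspace
GB_PER_NODE=100     # assumed usable node-local storage per node

REQUIRED=$((USERS * GB_PER_WORKSPACE))
NODES=$(( (REQUIRED + GB_PER_NODE - 1) / GB_PER_NODE ))
echo "Need ${REQUIRED} GiB of ephemeral storage (~${NODES} nodes at ${GB_PER_NODE} GiB each)"
```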

**Image Management** (Issue #2, Issue #7)
- [ ] Automate ECR image mirroring from `ghcr.io/coder/coder-preview`
- [ ] Implement pre-deployment validation of image digests across all clusters
- [ ] Add to pre-workshop checklist
- [ ] Document rollback procedure for bad images

**LiteLLM Key Management** (Issue #3)
- [ ] Implement monitoring/alerting for key expiration (7, 3, 1 day warnings)
- [ ] Document key rotation procedure
- [ ] Add key expiration check to pre-workshop checklist
- [ ] Disable/schedule key rotation around workshops

**Pre-Workshop Validation** (Issue #4)
- [ ] Complete pre-workshop checklist 2 days before each workshop
- [ ] Validate LiteLLM keys, image consistency, storage capacity
- [ ] Test subdomain routing across all regions (see the sketch below)
- [ ] Scale provisioners based on expected attendance
- [ ] Confirm monitoring and alerting is operational
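
A hedged sketch of the routing spot-check; the `/healthz` path is an assumption about what the deployed Coder endpoints expose, so substitute whichever health route the environment actually serves.

```bash
# Minimal sketch: confirm the control plane and both proxies answer over HTTPS.
for url in https://ai.coder.com/healthz \
           https://oregon-proxy.ai.coder.com/healthz \
           https://emea-proxy.ai.coder.com/healthz; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "${url}")
  echo "${url} -> HTTP ${code}"
done
```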

**Provisioner Scaling** (Issue #8)
- [ ] Document scaling recommendations based on concurrent user count
- [ ] Scale provisioners 1 day before workshops (6 → 8-10 for 10-15 users)
- [ ] (Long-term) Implement provisioner auto-scaling based on queue depth

**Monitoring & Alerting** (Issue #6)
- [ ] Ephemeral volume storage capacity per node (alert at 70%, 85%, 95%)
- [ ] Concurrent workspace count
- [ ] Workspace restart/failure rate
- [ ] Image pull times across clusters
- [ ] LiteLLM key expiration
- [ ] Subdomain routing success rate
- [ ] Provisioner queue depth

### Medium Priority (1-3 months)

**CloudFlare DNS Automation** (Issue #9)
- [ ] Migrate CloudFlare DNS to Terraform
- [ ] Enable self-service DNS changes via PR workflow
- [ ] Add DNS validation to CI/CD pipeline
- [ ] Implement monitoring for DNS resolution

**Monthly Workshop Cadence** (Issue #5)
- [ ] Establish monthly workshop schedule
- [ ] Develop workshop content/agenda
- [ ] Define success metrics
- [ ] Create feedback collection mechanism
- [ ] Track month-over-month improvements

### Long-Term (3+ months)

**Stress Testing Automation**
- [ ] Build internal stress testing tooling
- [ ] Simulate concurrent user load
- [ ] Automate capacity validation
- [ ] Integrate into CI/CD pipeline

**Architectural Improvements**
- [ ] Evaluate persistent storage options to prevent data loss
- [ ] Consider workspace state backup/restore mechanisms
- [ ] Implement provisioner auto-scaling (HPA based on queue depth)
- [ ] Optimize ephemeral volume allocation strategy

---

## Lessons Learned

### What We Learned

1. **Production-like testing is essential:** Internal testing without realistic concurrent load is insufficient for demo/workshop environments. The gap between "works in testing" and "works at scale" is significant.

2. **Capacity planning needs real-world data:** Architectural assumptions (storage, provisioners, LiteLLM) must be validated under actual user load patterns. Theoretical capacity ≠ practical capacity.

3. **Manual processes don't scale:** ECR image syncing and CloudFlare DNS management via Slack requests create bottlenecks and points of failure during incidents.

4. **Multi-region consistency is hard:** Keeping images, configurations, and services synchronized across us-east-2, us-west-2, and eu-west-2 requires automation and validation.

5. **Key rotation timing matters:** LiteLLM's 4-5 hour rotation schedule must be coordinated with workshop timing to avoid forced workspace restarts during events.

6. **Provisioner scaling is critical:** Single-threaded Terraform operations mean provisioner count directly determines concurrent workspace operation capacity.

7. **Pre-event validation is non-negotiable:** A structured checklist covering infrastructure, capacity, authentication, and routing prevents avoidable issues.

8. **Monthly cadence provides continuous validation:** Regular workshops will surface optimization opportunities and prevent regressions. The base infrastructure is solid; now we need operational refinement.

### What We'll Do Differently

1. **Always run the pre-workshop checklist** 2 days before events
2. **Scale provisioners and LiteLLM proactively** based on expected attendance
3. **Disable LiteLLM key rotation** during workshop windows
4. **Validate image consistency** across all regions before workshops
5. **Monitor ephemeral storage** and alert before capacity issues arise
6. **Automate manual processes** (ECR sync, DNS management)
7. **Conduct monthly workshops** to continuously stress test and improve
8. **Document everything** for faster incident response and knowledge sharing

### Process Improvements

1. **Pre-Workshop Checklist:** Mandatory 2-day pre-event validation covering all infrastructure components
2. **Incident Runbook:** Step-by-step procedures for common failure scenarios
3. **Capacity Planning:** Clear guidelines for scaling based on concurrent user count
4. **Monitoring Dashboard:** Real-time visibility during workshops for proactive issue detection
5. **Post-Workshop Retrospective:** Structured feedback loop to track improvements month-over-month

---

## Technical Recommendations

### Immediate (Week 1)
1. Implement ephemeral storage monitoring with alerting
2. Create automated ECR sync job (GitHub Actions or AWS Lambda); see the sketch below
3. Document provisioner scaling procedure in runbook
4. Add LiteLLM key expiration to monitoring
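
A minimal sketch of the sync step such a job could run (for example on a schedule in GitHub Actions), assuming an ECR repository named `coder-preview` in us-east-2 and a runner already authenticated to ghcr.io; `<account-id>` is a placeholder.

```bash
# Minimal sketch: mirror the preview image from GHCR into ECR.
SRC=ghcr.io/coder/coder-preview:latest
DST=<account-id>.dkr.ecr.us-east-2.amazonaws.com/coder-preview:latest

aws ecr get-login-password --region us-east-2 \
  | docker login --username AWS --password-stdin "${DST%%/*}"

docker pull "${SRC}"
docker tag "${SRC}" "${DST}"
docker push "${DST}"
```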

### Short-term (Month 1)
1. Migrate CloudFlare DNS to Terraform
2. Implement image digest validation across clusters
3. Set up workshop-specific monitoring dashboard
4. Create provisioner HPA based on CPU/memory; see the sketch below
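
A hedged one-liner for that interim HPA, assuming the `coder-provisioner` deployment in the `coder` namespace; queue-depth-based scaling would eventually replace this CPU-based trigger.

```bash
# Minimal sketch: CPU-based autoscaling for the external provisioners.
kubectl -n coder autoscale deployment coder-provisioner \
  --min=6 --max=12 --cpu-percent=70
```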

### Long-term (Quarter 1)
1. Build stress testing automation
2. Implement provisioner queue depth monitoring and auto-scaling
3. Evaluate persistent storage options for workspace data
4. Expand to additional demo environments (coderdemo.io, devcoder.io)

---

## Success Metrics

Track these metrics month-over-month:

**Platform Stability**:
- Workspace restart/failure rate: Target <2%
- Incidents with user-visible impact: Target 0
- Storage contention events: Target 0
- Subdomain routing errors: Target 0
- Average workspace start time: Target <2 minutes

**Workshop Quality**:
- Participant satisfaction score: Target 4.5+/5
- Percentage completing workshop: Target >90%
- Number of blockers encountered: Target <3

**Operational Efficiency**:
- Pre-workshop checklist completion time: Target <30 minutes
- Time to resolve incidents: Target <5 minutes
- Manual interventions required: Target <2 per workshop

---

## Related Resources

### Documentation
- [Architecture Overview](./workshops/ARCHITECTURE.md)
- [Monthly Workshop Guide](./workshops/MONTHLY_WORKSHOP_GUIDE.md)
- [Pre-Workshop Checklist](./workshops/PRE_WORKSHOP_CHECKLIST.md)
- [Incident Runbook](./workshops/INCIDENT_RUNBOOK.md)
- [Post-Workshop Retrospective Template](./workshops/POST_WORKSHOP_RETROSPECTIVE.md)
- [Participant Guide](./workshops/PARTICIPANT_GUIDE.md)

### GitHub Issues
- [#1 - Optimize ephemeral volume storage capacity](https://github.com/coder/ai.coder.com/issues/1)
- [#2 - Standardize image management across clusters](https://github.com/coder/ai.coder.com/issues/2)
- [#3 - Improve LiteLLM key rotation and monitoring](https://github.com/coder/ai.coder.com/issues/3)
- [#4 - Create pre-workshop validation checklist](https://github.com/coder/ai.coder.com/issues/4)
- [#5 - Establish monthly workshop cadence](https://github.com/coder/ai.coder.com/issues/5)
- [#6 - Implement comprehensive monitoring and alerting](https://github.com/coder/ai.coder.com/issues/6)
- [#7 - Automate ECR image mirroring](https://github.com/coder/ai.coder.com/issues/7)
- [#8 - Implement provisioner auto-scaling](https://github.com/coder/ai.coder.com/issues/8)
- [#9 - Automate CloudFlare DNS management](https://github.com/coder/ai.coder.com/issues/9)

---

## Approvals

**Infrastructure Team Lead**: _________________
**Product Team Lead**: _________________
**Date**: _________________

---

**Prepared by:** Dave Ahr
**Review Date:** October 2024
**Next Review:** After first monthly workshop