# Postmortem: Agentic Workshop Incident - September 30, 2024

**Date:** September 30, 2024
**Environment:** https://ai.coder.com
**Severity:** High
**Duration:** ~10 minutes into the workshop until post-workshop fixes
**Impact:** Multiple user workspaces died/restarted, wiping user progress during the live workshop

---

## Executive Summary

During the Agentic Workshop on September 30, the AI demo environment experienced multiple cascading failures when approximately 10+ users simultaneously onboarded and deployed workspaces. While initial deployments succeeded, resource contention and architectural issues caused workspace instability, data loss, and service disruptions across the multi-region infrastructure. The incident revealed gaps in stress testing and highlighted limitations in the current architecture that were not apparent during smaller-scale internal testing.

---

## Timeline

**Pre-incident:** Workshop begins, users start the onboarding process
**T+0 min:** Initial workspace deployments roll through successfully
**T+~10 min:** Workspaces begin competing for resources as workloads start running
**T+~10 min:** LiteLLM authentication key briefly expires (a few seconds)
**T+~10 min:** Workspaces start dying and restarting, triggering self-healing mechanisms
**T+~10 min:** User progress wiped due to ephemeral volume issues
**T+~10 min:** Subdomain routing issues surface between the Oregon and London proxy clusters
**Post-workshop:** Fixes applied to address all identified issues

---

## Architecture Context

### Multi-Region Deployment

**Control Plane (us-east-2 - Ohio)**:
- Coder Server: 2 replicas @ 4 vCPU / 8 GB each
- External Provisioners: 6 replicas (default org) @ 500m CPU / 512 MB each
- LiteLLM Service: 4 replicas @ 2 vCPU / 4 GB each
- Primary domain: `ai.coder.com` + `*.ai.coder.com`

**Proxy Clusters**:
- Oregon (us-west-2): 2 replicas @ 500m CPU / 1 GB, domain: `oregon-proxy.ai.coder.com`
- London (eu-west-2): 2 replicas @ 500m CPU / 1 GB, domain: `emea-proxy.ai.coder.com`

**Image Management**:
- Source: `ghcr.io/coder/coder-preview` (non-GA preview for beta AI features)
- Mirrored to private AWS ECR (us-east-2)
- Critical dependency: ECR must stay in sync with GHCR

**DNS Management**:
- 6 domains managed in CloudFlare (control plane + 2 proxies, each with wildcard)
- Manual process via the #help-me-ops Slack channel

---

## Root Causes

### 1. Resource Contention - Ephemeral Volume Storage

**Cause:** Limited node storage capacity for ephemeral volumes could not handle concurrent workspace workloads. Each workspace template consumes 2-4 vCPU and 4-8 GB memory, with ephemeral storage on node-local volumes.

**Impact:** Workspaces died and restarted when nodes exhausted storage, triggering self-healing that wiped user progress.

**Why it wasn't caught:**
- No stress testing with realistic concurrent user load (10+ users)
- Internal testing used lower concurrency
- Capacity planning didn't account for simultaneous workspace workloads
- No monitoring/alerting for ephemeral volume storage thresholds

**Technical Details:**
- Workspace templates allow 2-4 vCPU / 4-8 GB configuration
- ~10 concurrent workspaces @ 4 vCPU / 8 GB = 40+ vCPU / 80+ GB demand
- Ephemeral volumes for each workspace competed for node storage
- Karpenter auto-scaled nodes but storage capacity per node remained fixed (see the sizing sketch below)
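
The arithmetic behind this contention is worth making explicit. A minimal sizing sketch: the vCPU/memory figures come from the template limits above, while `WS_EPHEMERAL_GB` and the node dimensions are illustrative assumptions, not measured values:

```python
import math

# Per-workspace demand at the template's upper bound (from the template limits above).
WS_VCPU, WS_MEM_GB = 4, 8
WS_EPHEMERAL_GB = 20          # assumed ephemeral volume per workspace; not a measured value

# Assumed node shape; substitute the actual Karpenter instance types in use.
NODE_VCPU, NODE_MEM_GB, NODE_DISK_GB = 16, 32, 100

def nodes_needed(workspaces: int) -> int:
    """Nodes required so that CPU, memory, AND node-local disk all fit."""
    by_cpu = math.ceil(workspaces * WS_VCPU / NODE_VCPU)
    by_mem = math.ceil(workspaces * WS_MEM_GB / NODE_MEM_GB)
    by_disk = math.ceil(workspaces * WS_EPHEMERAL_GB / NODE_DISK_GB)
    return max(by_cpu, by_mem, by_disk)

for n in (5, 10, 15):
    print(f"{n} workspaces -> {nodes_needed(n)} node(s)")
```

The failure mode falls out of the `by_disk` term: Karpenter reacts to CPU/memory pressure, so the cluster can look healthy on compute while node-local disk is the binding constraint.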

### 2. Image Management Inconsistencies

**Cause:** The non-GA Coder preview image (`ghcr.io/coder/coder-preview`) mirrored to private ECR fell out of sync between the control plane (us-east-2) and proxy clusters (us-west-2, eu-west-2).

**Impact:** Image version mismatches caused subdomain routing failures across regions. Workspaces couldn't be accessed via proxy URLs (`*.oregon-proxy.ai.coder.com`, `*.emea-proxy.ai.coder.com`).

**Why it wasn't caught:**
- Manual ECR mirroring process from GHCR is error-prone
- No automated validation of image digests across all clusters
- Issue only manifests under multi-region load with simultaneous deployments
- Pre-workshop checklist lacked image consistency verification

**Technical Details:**
- Image sync process:
  1. Pull from `ghcr.io/coder/coder-preview:latest`
  2. Tag and push to private ECR
  3. Deploy to all 3 regions (us-east-2, us-west-2, eu-west-2)
- During the workshop, the ECR mirror was stale
- Control plane ran a newer image than the proxies
- Subdomain routing logic failed due to the version mismatch (a digest check sketch follows)
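
A pre-workshop digest check across the three clusters is cheap to automate. A minimal sketch using the official `kubernetes` Python client; the kubeconfig context names, namespace, and label selector are assumptions to adapt to the real clusters:

```python
from kubernetes import client, config

# Assumed kubeconfig context names; adjust to the real cluster contexts.
CONTEXTS = {
    "us-east-2": "coder-control-plane",
    "us-west-2": "coder-proxy-oregon",
    "eu-west-2": "coder-proxy-london",
}
# Assumed namespace and pod label for the Coder deployments.
NAMESPACE, LABEL = "coder", "app.kubernetes.io/name=coder"

def deployed_digests(context: str) -> set[str]:
    """Collect the image digests actually running in one cluster."""
    api = client.CoreV1Api(config.new_client_from_config(context=context))
    pods = api.list_namespaced_pod(NAMESPACE, label_selector=LABEL)
    return {
        status.image_id  # imageID includes the resolved sha256 digest
        for pod in pods.items
        for status in (pod.status.container_statuses or [])
    }

seen = {region: deployed_digests(ctx) for region, ctx in CONTEXTS.items()}
for region, digests in seen.items():
    print(region, digests)
if len({frozenset(d) for d in seen.values()}) != 1:
    raise SystemExit("MISMATCH: clusters are not running the same image")
print("OK: all clusters agree")
```

Checking what pods actually run (rather than what the mirror contains) catches both a stale ECR mirror and a cluster that never rolled out the new image.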

### 3. LiteLLM Key Expiration

**Cause:** The LiteLLM authentication key expired briefly during the workshop. LiteLLM uses an auxiliary addon that rotates keys every 4-5 hours.

**Impact:** Brief service disruption (a few seconds) for AI features (Claude Code CLI, Goose CLI). Key rotation also forces all workspaces to restart to consume new keys.

**Note:** Currently using open-source LiteLLM, which has limited key management flexibility. The enterprise version is not justified for current needs.

**Why it wasn't caught:**
- No pre-workshop validation of key expiration times
- Key rotation schedule not documented or considered in workshop planning
- No monitoring/alerting for upcoming key expirations

**Technical Details:**
- LiteLLM: 4 replicas @ 2 vCPU / 4 GB, round-robin between AWS Bedrock and GCP Vertex AI
- Auxiliary addon runs on a 4-5 hour schedule
- Key rotation requires a workspace restart to pick up new credentials
- If rotation occurs during a workshop, it causes mass workspace restarts (see the window check below)
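
Because the interval is known (4-5 hours) but the exact firing time is not, a useful pre-workshop check is whether any rotation could land inside the event window. A sketch, assuming the last rotation timestamp can be recovered from the addon's logs or state (that retrieval step is not shown), using the 4-hour lower bound as the conservative case:

```python
from datetime import datetime, timedelta, timezone

ROTATION_MIN = timedelta(hours=4)  # addon rotates every 4-5 hours; 4 h is the worst case

def rotations_in_window(last_rotation: datetime,
                        window_start: datetime,
                        window_end: datetime) -> list[datetime]:
    """Earliest-possible rotation times that fall inside the workshop window."""
    t, hits = last_rotation, []
    while t <= window_end:
        t += ROTATION_MIN
        if window_start <= t <= window_end:
            hits.append(t)
    return hits

# Example: last rotation at 07:30 UTC, workshop from 10:00 to 12:00 UTC.
last = datetime(2024, 9, 30, 7, 30, tzinfo=timezone.utc)
start = datetime(2024, 9, 30, 10, 0, tzinfo=timezone.utc)
end = datetime(2024, 9, 30, 12, 0, tzinfo=timezone.utc)
hits = rotations_in_window(last, start, end)
if hits:
    print("WARNING: rotation may fire during the workshop:", hits)
```

If a possible rotation falls inside the window, rotating manually just before the event pushes the next cycle past it.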

### 4. Provisioner Capacity Bottleneck

**Cause:** Default provisioner capacity (6 replicas @ 500m CPU / 512 MB) was insufficient for ~10 concurrent users simultaneously creating workspaces.

**Impact:** Workspace create operations queued or timed out, causing delays and a poor user experience.

**Why it wasn't caught:**
- No capacity planning guidelines for concurrent user scaling
- Provisioners are single-threaded (1 provisioner = 1 Terraform operation)
- No monitoring of provisioner queue depth
- Workshop planning didn't include provisioner pre-scaling

**Technical Details:**
- 10 users × 1 workspace each = 10 concurrent Terraform operations
- 6 provisioners = max 6 concurrent operations
- Remaining 4 operations queued, causing delays
- Recommendation: Scale to 8-10 replicas for 10-15 users (see the sizing sketch below)
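
Because of that 1:1 mapping, sizing reduces to simple division. A minimal sketch of the sizing rule; the average build time is an assumed figure for illustration:

```python
import math

BUILD_MINUTES = 2.5  # assumed average per-workspace Terraform apply time

def queue_estimate(users: int, provisioners: int) -> tuple[int, float]:
    """Waves of builds and worst-case wait before the last build starts."""
    waves = math.ceil(users / provisioners)
    worst_wait = (waves - 1) * BUILD_MINUTES
    return waves, worst_wait

for replicas in (6, 8, 10):
    waves, wait = queue_estimate(users=10, provisioners=replicas)
    print(f"{replicas} provisioners: {waves} wave(s), last build waits ~{wait:.0f} min")
```

At 10 replicas every one of 10 builds starts immediately, which is the basis for the 8-10 replica recommendation above.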

### 5. DNS Management Dependency

**Cause:** Managing CloudFlare DNS manually via the #help-me-ops Slack channel created the potential for delays during incident response.

**Impact:** No immediate impact during the workshop, but DNS issues would have been slow to resolve.

**Why it's a concern:**
- 6 domains to manage: control plane + 2 proxies (each with wildcard)
- No self-service for the infrastructure team
- Dependency on ops team availability
- No automated validation of DNS configuration

---

## Impact Assessment

**Users Affected:** All workshop participants (~10+ concurrent users)
**Data Loss:** User workspace progress wiped due to ephemeral volume restarts
**Service Availability:** Degraded for ~10+ minutes during the workshop
**Business Impact:** Poor user experience during a live demonstration/workshop event

**Metrics**:
- Workspace failure rate: ~40-50% (estimated, 4-5 workspaces restarted)
- Average workspace restart time: 2-3 minutes
- Number of incidents: 3 major (storage, image sync, key expiration)
- User-visible impact duration: ~10 minutes

---

## What Went Well

- Initial deployment phase worked correctly (first ~10 minutes)
- Self-healing mechanisms activated (though they resulted in data loss)
- Karpenter successfully scaled nodes in response to demand
- LiteLLM key rotation was brief (a few seconds)
- Issues were contained to the workshop environment (no production impact)
- Team responded post-workshop with comprehensive fixes
- Base infrastructure foundation is solid (EKS, Karpenter, multi-region setup)
- Multi-region architecture design is sound

---

## What Went Wrong

- No internal stress testing with realistic concurrent user load prior to the workshop
- Ephemeral volume capacity planning insufficient for simultaneous workloads
- Image management strategy across multi-region clusters not robust
- No pre-workshop validation of authentication keys or the key rotation schedule
- Lack of monitoring/alerting for resource contention thresholds
- Provisioner capacity not scaled proactively
- No pre-workshop checklist or validation procedures
- Manual processes (ECR sync, CloudFlare DNS) created points of failure
- No capacity planning guidelines for concurrent user scaling

---

## Action Items

### Completed (Post-Workshop)
- ✅ Applied fixes for all identified issues
- ✅ Created comprehensive incident documentation
- ✅ Documented architecture and component details
- ✅ Created pre-workshop validation checklist
- ✅ Created incident runbook
- ✅ Established GitHub tracking issues

### High Priority (Before Next Workshop)

**Storage & Capacity** (Issue #1)
- [ ] Audit current ephemeral volume allocation per node
- [ ] Calculate storage requirements for target concurrent workspace count
- [ ] Implement storage capacity monitoring and alerting
- [ ] Define resource limits per workspace to prevent node exhaustion
- [ ] Test with realistic concurrent user load

**Image Management** (Issue #2, Issue #7)
- [ ] Automate ECR image mirroring from `ghcr.io/coder/coder-preview`
- [ ] Implement pre-deployment validation of image digests across all clusters
- [ ] Add to pre-workshop checklist
- [ ] Document rollback procedure for bad images

**LiteLLM Key Management** (Issue #3)
- [ ] Implement monitoring/alerting for key expiration (7, 3, 1 day warnings)
- [ ] Document key rotation procedure
- [ ] Add key expiration check to pre-workshop checklist
- [ ] Disable/schedule key rotation around workshops

**Pre-Workshop Validation** (Issue #4)
- [ ] Complete pre-workshop checklist 2 days before each workshop
- [ ] Validate LiteLLM keys, image consistency, storage capacity
- [ ] Test subdomain routing across all regions
- [ ] Scale provisioners based on expected attendance
- [ ] Confirm monitoring and alerting are operational

**Provisioner Scaling** (Issue #8)
- [ ] Document scaling recommendations based on concurrent user count
- [ ] Scale provisioners 1 day before workshops (6 → 8-10 for 10-15 users)
- [ ] (Long-term) Implement provisioner auto-scaling based on queue depth

**Monitoring & Alerting** (Issue #6)
- [ ] Ephemeral volume storage capacity per node (alert at 70%, 85%, 95%; sketched below)
- [ ] Concurrent workspace count
- [ ] Workspace restart/failure rate
- [ ] Image pull times across clusters
- [ ] LiteLLM key expiration
- [ ] Subdomain routing success rate
- [ ] Provisioner queue depth
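
As a sketch of the tiered thresholds in the first item above, a small check that maps per-node ephemeral volume usage onto the 70/85/95 levels. How usage is collected (kubelet stats, node-exporter metrics, etc.) is left open; the function only classifies a percentage:

```python
THRESHOLDS = [(95.0, "critical"), (85.0, "warning"), (70.0, "notice")]

def storage_alert(node: str, used_pct: float) -> str | None:
    """Return an alert line for a node's ephemeral volume usage, or None if healthy."""
    for limit, severity in THRESHOLDS:  # ordered highest first
        if used_pct >= limit:
            return f"[{severity}] {node}: ephemeral storage at {used_pct:.0f}% (>= {limit:.0f}%)"
    return None

# Example readings (illustrative numbers, not from the incident):
for node, pct in {"node-a": 64.0, "node-b": 88.5, "node-c": 97.2}.items():
    alert = storage_alert(node, pct)
    if alert:
        print(alert)
```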

### Medium Priority (1-3 months)

**CloudFlare DNS Automation** (Issue #9)
- [ ] Migrate CloudFlare DNS to Terraform
- [ ] Enable self-service DNS changes via PR workflow
- [ ] Add DNS validation to CI/CD pipeline
- [ ] Implement monitoring for DNS resolution

**Monthly Workshop Cadence** (Issue #5)
- [ ] Establish monthly workshop schedule
- [ ] Develop workshop content/agenda
- [ ] Define success metrics
- [ ] Create feedback collection mechanism
- [ ] Track month-over-month improvements

### Long-Term (3+ months)

**Stress Testing Automation**
- [ ] Build internal stress testing tooling
- [ ] Simulate concurrent user load
- [ ] Automate capacity validation
- [ ] Integrate into CI/CD pipeline

**Architectural Improvements**
- [ ] Evaluate persistent storage options to prevent data loss
- [ ] Consider workspace state backup/restore mechanisms
- [ ] Implement provisioner auto-scaling (HPA based on queue depth)
- [ ] Optimize ephemeral volume allocation strategy

---

## Lessons Learned

### What We Learned

1. **Production-like testing is essential:** Internal testing without realistic concurrent load is insufficient for demo/workshop environments. The gap between "works in testing" and "works at scale" is significant.

2. **Capacity planning needs real-world data:** Architectural assumptions (storage, provisioners, LiteLLM) must be validated under actual user load patterns. Theoretical capacity ≠ practical capacity.

3. **Manual processes don't scale:** ECR image syncing and CloudFlare DNS management via Slack requests create bottlenecks and points of failure during incidents.

4. **Multi-region consistency is hard:** Keeping images, configurations, and services synchronized across us-east-2, us-west-2, and eu-west-2 requires automation and validation.

5. **Key rotation timing matters:** LiteLLM's 4-5 hour rotation schedule must be coordinated with workshop timing to avoid forced workspace restarts during events.

6. **Provisioner scaling is critical:** Single-threaded Terraform operations mean provisioner count directly determines concurrent workspace operation capacity.

7. **Pre-event validation is non-negotiable:** A structured checklist covering infrastructure, capacity, authentication, and routing prevents avoidable issues.

8. **Monthly cadence provides continuous validation:** Regular workshops will surface optimization opportunities and prevent regressions. The base infrastructure is solid; now we need operational refinement.

### What We'll Do Differently

1. **Always run the pre-workshop checklist** 2 days before events
2. **Scale provisioners and LiteLLM proactively** based on expected attendance
3. **Disable LiteLLM key rotation** during workshop windows
4. **Validate image consistency** across all regions before workshops
5. **Monitor ephemeral storage** and alert before capacity issues arise
6. **Automate manual processes** (ECR sync, DNS management)
7. **Conduct monthly workshops** to continuously stress test and improve
8. **Document everything** for faster incident response and knowledge sharing

### Process Improvements

1. **Pre-Workshop Checklist:** Mandatory 2-day pre-event validation covering all infrastructure components
2. **Incident Runbook:** Step-by-step procedures for common failure scenarios
3. **Capacity Planning:** Clear guidelines for scaling based on concurrent user count
4. **Monitoring Dashboard:** Real-time visibility during workshops for proactive issue detection
5. **Post-Workshop Retrospective:** Structured feedback loop to track improvements month-over-month

---

## Technical Recommendations

### Immediate (Week 1)
1. Implement ephemeral storage monitoring with alerting
2. Create automated ECR sync job (GitHub Actions or AWS Lambda); a sketch follows this list
3. Document provisioner scaling procedure in runbook
4. Add LiteLLM key expiration to monitoring
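
For item 2, the sync job can be a scheduled script that shells out to standard container tooling. A minimal sketch using the Docker CLI; the ECR account ID and repository are placeholders, and it assumes `docker login` to ECR has already run (e.g. via `aws ecr get-login-password`):

```python
import subprocess

SRC = "ghcr.io/coder/coder-preview:latest"
# Placeholder account ID and repo; substitute the real private ECR target.
DST = "123456789012.dkr.ecr.us-east-2.amazonaws.com/coder/coder-preview:latest"

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)  # raise on failure so the job surfaces errors

run("docker", "pull", SRC)
run("docker", "tag", SRC, DST)
run("docker", "push", DST)
```

Pairing this with the digest validation from Root Cause 2 means a failed or partial sync is caught before a workshop rather than during one.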

### Short-term (Month 1)
1. Migrate CloudFlare DNS to Terraform
2. Implement image digest validation across clusters
3. Set up workshop-specific monitoring dashboard
4. Create provisioner HPA based on CPU/memory

### Long-term (Quarter 1)
1. Build stress testing automation
2. Implement provisioner queue depth monitoring and auto-scaling
3. Evaluate persistent storage options for workspace data
4. Expand to additional demo environments (coderdemo.io, devcoder.io)

---

## Success Metrics

Track these metrics month-over-month:

**Platform Stability**:
- Workspace restart/failure rate: Target <2%
- Incidents with user-visible impact: Target 0
- Storage contention events: Target 0
- Subdomain routing errors: Target 0
- Average workspace start time: Target <2 minutes

**Workshop Quality**:
- Participant satisfaction score: Target 4.5+/5
- Percentage completing workshop: Target >90%
- Number of blockers encountered: Target <3

**Operational Efficiency**:
- Pre-workshop checklist completion time: Target <30 minutes
- Time to resolve incidents: Target <5 minutes
- Manual interventions required: Target <2 per workshop

---

## Related Resources

### Documentation
- [Architecture Overview](./workshops/ARCHITECTURE.md)
- [Monthly Workshop Guide](./workshops/MONTHLY_WORKSHOP_GUIDE.md)
- [Pre-Workshop Checklist](./workshops/PRE_WORKSHOP_CHECKLIST.md)
- [Incident Runbook](./workshops/INCIDENT_RUNBOOK.md)
- [Post-Workshop Retrospective Template](./workshops/POST_WORKSHOP_RETROSPECTIVE.md)
- [Participant Guide](./workshops/PARTICIPANT_GUIDE.md)

### GitHub Issues
- [#1 - Optimize ephemeral volume storage capacity](https://github.com/coder/ai.coder.com/issues/1)
- [#2 - Standardize image management across clusters](https://github.com/coder/ai.coder.com/issues/2)
- [#3 - Improve LiteLLM key rotation and monitoring](https://github.com/coder/ai.coder.com/issues/3)
- [#4 - Create pre-workshop validation checklist](https://github.com/coder/ai.coder.com/issues/4)
- [#5 - Establish monthly workshop cadence](https://github.com/coder/ai.coder.com/issues/5)
- [#6 - Implement comprehensive monitoring and alerting](https://github.com/coder/ai.coder.com/issues/6)
- [#7 - Automate ECR image mirroring](https://github.com/coder/ai.coder.com/issues/7)
- [#8 - Implement provisioner auto-scaling](https://github.com/coder/ai.coder.com/issues/8)
- [#9 - Automate CloudFlare DNS management](https://github.com/coder/ai.coder.com/issues/9)

---

## Approvals

**Infrastructure Team Lead**: _________________
**Product Team Lead**: _________________
**Date**: _________________

---

**Prepared by:** Dave Ahr
**Review Date:** October 2024
**Next Review:** After first monthly workshop