# Workshop Incident Runbook

## Purpose

This runbook provides step-by-step procedures for diagnosing and resolving common incidents during monthly workshops.

---

## Incident Response Process

### 1. Initial Response

1. **Acknowledge** the incident in team chat
2. **Assess severity**:
   - **P0 (Critical)**: Complete service outage, data loss, security breach
   - **P1 (High)**: Significant degradation affecting multiple users
   - **P2 (Medium)**: Limited impact, workarounds available
   - **P3 (Low)**: Cosmetic issues, no user impact
3. **Assign incident commander** (P0/P1 only)
4. **Start incident log** (document timeline, actions, and decisions; a minimal template is sketched below)
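
One way to bootstrap the log — a minimal sketch, assuming incident logs are kept as Markdown files; the filename and fields are illustrative, not prescribed:

```bash
# Create a timestamped incident log skeleton (filename and fields are illustrative)
LOG="incident-$(date +%Y-%m-%d-%H%M).md"
cat > "$LOG" <<'EOF'
# Incident Log

- **Severity**: P?
- **Incident commander**:
- **Started (UTC)**:
- **Resolved (UTC)**:

## Timeline

| Time (UTC) | Action / Observation | Who |
|------------|----------------------|-----|

## Decisions

## Root Cause
EOF
echo "Created $LOG"
```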

### 2. Communication

- **Internal**: Update team in dedicated incident channel
- **Participants**: Provide status updates if impact is user-visible
- **Escalation**: Contact on-call engineer for P0/P1 incidents

### 3. Resolution & Follow-up

- Document root cause
- Create GitHub issue for permanent fix
- Update this runbook if a new incident type is discovered
- Include incident in post-workshop retrospective

---

## Common Incidents

### 1. Workspace Restarts / Self-Healing Loop

**Symptoms**:
- Workspaces repeatedly restarting
- Users losing progress
- Self-healing mechanisms triggering continuously

**Likely Causes**:
- Ephemeral volume storage exhaustion
- Resource contention (CPU, memory)
- Node capacity exceeded

**Diagnosis**:

```bash
# Check node storage
kubectl top nodes
kubectl get nodes -o wide

# Check ephemeral volume usage
kubectl get pods -A -o json | jq '.items[] | select(.spec.volumes != null) | {name: .metadata.name, namespace: .metadata.namespace, volumes: [.spec.volumes[] | select(.emptyDir != null)]}'

# Check for evicted pods
kubectl get pods -A | grep Evicted

# Check workspace pod events
kubectl describe pod <workspace-pod-name> -n <namespace>

# Check Karpenter node allocation
kubectl logs -l app.kubernetes.io/name=karpenter -n karpenter --tail=100
```

**Resolution**:

**Immediate**:
1. Identify workspaces consuming excessive storage:
   ```bash
   kubectl exec -it <workspace-pod> -- df -h
   ```
2. If a specific workspace is problematic, delete it:
   ```bash
   kubectl delete pod <workspace-pod> -n <namespace>
   ```
3. If the issue is cluster-wide, scale up nodes or increase storage capacity (see the sketch below)
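
If Karpenter manages capacity, raising the NodePool limits allows it to provision additional nodes. A minimal sketch, assuming Karpenter v1 CRDs and a NodePool named `default`; the limit values are illustrative:

```bash
# Inspect the current NodePool resource ceiling (NodePool name is an assumption)
kubectl get nodepool default -o jsonpath='{.spec.limits}'

# Raise the ceiling so Karpenter can add nodes (values are illustrative)
kubectl patch nodepool default --type merge \
  -p '{"spec":{"limits":{"cpu":"200","memory":"400Gi"}}}'
```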

**Temporary Workaround**:
- Pause new workspace deployments
- Ask participants to save work and stop workspaces
- Clean up unused workspaces (see the sketch below)
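
One way to do the cleanup from the command line — a sketch assuming the `coder` CLI is installed and authenticated against this deployment; review the output of `coder list` before deleting anything:

```bash
# List workspaces across all users
coder list --all

# Stop a workspace that is no longer needed (frees its compute)
coder stop <owner>/<workspace-name>

# Delete an abandoned workspace entirely
coder delete <owner>/<workspace-name>
```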

**Permanent Fix**:
- See GitHub Issue #1 for long-term storage optimization

---

### 2. Subdomain Routing Failures

**Symptoms**:
- Users cannot access workspaces via subdomain URLs
- 404 or DNS errors on workspace URLs
- Inconsistent routing across regions

**Likely Causes**:
- Image version mismatch between control plane and proxy clusters
- Ingress controller misconfiguration
- DNS propagation delays

**Diagnosis**:

```bash
# Check Coder image versions across clusters
kubectl get pods -n coder -o jsonpath='{.items[*].spec.containers[*].image}' --context=control-plane
kubectl get pods -n coder -o jsonpath='{.items[*].spec.containers[*].image}' --context=oregon
kubectl get pods -n coder -o jsonpath='{.items[*].spec.containers[*].image}' --context=london

# Check ingress configuration
kubectl get ingress -A
kubectl describe ingress <ingress-name> -n <namespace>

# Check DNS resolution
dig <workspace-subdomain>.ai.coder.com
nslookup <workspace-subdomain>.ai.coder.com

# Check load balancer status
kubectl get svc -n coder
```

**Resolution**:

**Immediate**:
1. Verify image versions match across clusters (see the comparison loop below)
2. If a mismatch is found, restart the Coder pods in the affected cluster:
   ```bash
   kubectl rollout restart deployment/coder -n coder
   ```
3. If it is a DNS issue, wait for propagation or flush the DNS cache
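
To compare all three clusters in one pass — a small loop over the kubeconfig contexts named in the diagnosis step above:

```bash
# Print the unique Coder image(s) per cluster so mismatches stand out
for ctx in control-plane oregon london; do
  echo "== $ctx =="
  kubectl --context="$ctx" get pods -n coder \
    -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u
done
```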

**Temporary Workaround**:
- Direct users to a working region
- Use direct IP access if the subdomain fails

**Permanent Fix**:
- See GitHub Issue #2 for image management standardization

---

### 3. LiteLLM Authentication Failures

**Symptoms**:
- Users cannot authenticate
- "Invalid API key" or similar errors
- AI features not working

**Likely Causes**:
- Expired LiteLLM key
- Rate limiting
- Service outage

**Diagnosis**:

```bash
# Check LiteLLM pod logs
kubectl logs -l app=litellm -n <namespace> --tail=100

# Test LiteLLM API key
curl -H "Authorization: Bearer <api-key>" https://<litellm-endpoint>/v1/models

# Check key expiration (method depends on your key management)
# TODO: Add specific command for your environment
```
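
If you run the LiteLLM proxy, its key-management API can report per-key metadata, including expiry. A sketch, assuming admin access with the master key; treat the endpoint and response fields as assumptions to verify against your LiteLLM version:

```bash
# Query key metadata via the LiteLLM management API
# (endpoint and fields are assumptions; confirm for your LiteLLM version)
curl -s -H "Authorization: Bearer <master-key>" \
  "https://<litellm-endpoint>/key/info?key=<api-key>" | jq '.info.expires'
```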

**Resolution**:

**Immediate**:
1. Verify the key expiration date
2. If expired, rotate the key immediately:
   ```bash
   # Follow your key rotation procedure
   # Update the secret:
   kubectl create secret generic litellm-key \
     --from-literal=api-key=<new-key> \
     --dry-run=client -o yaml | kubectl apply -f -

   # Restart LiteLLM pods
   kubectl rollout restart deployment/litellm -n <namespace>
   ```
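
After rotating, confirm the new key works end to end, using the same endpoint as the diagnosis step:

```bash
# Wait for the restarted pods to become ready
kubectl rollout status deployment/litellm -n <namespace>

# Expect HTTP 200 with the new key
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer <new-key>" \
  https://<litellm-endpoint>/v1/models
```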

**Temporary Workaround**:
- If the expiration window is brief, wait for key rotation to complete
- Disable AI features temporarily if critical

**Permanent Fix**:
- See GitHub Issue #3 for key rotation automation

---

### 4. High Resource Contention

**Symptoms**:
- Slow workspace performance
- Timeouts during operations
- Elevated CPU/memory usage across the cluster

**Likely Causes**:
- Too many concurrent workspaces
- Workload-heavy exercises
- Insufficient node capacity

**Diagnosis**:

```bash
# Check cluster resource usage
kubectl top nodes
kubectl top pods -A

# Check Karpenter scaling
kubectl get nodeclaims -A
kubectl logs -l app.kubernetes.io/name=karpenter -n karpenter --tail=50

# Check pod resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"
```

**Resolution**:

**Immediate**:
1. Trigger Karpenter to scale up nodes if it is not auto-scaling:
   ```bash
   # Check Karpenter NodePool status
   kubectl get nodepool
   ```
2. If nodes are at capacity, consider increasing instance sizes
3. Identify and pause resource-heavy workloads (see the sketch below)
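
To find the heaviest consumers quickly, `kubectl top` can sort by usage; the scale-down shown assumes the workload is deployment-managed and safe to pause:

```bash
# Top memory and CPU consumers across the cluster
kubectl top pods -A --sort-by=memory | head -n 15
kubectl top pods -A --sort-by=cpu | head -n 15

# Pause a deployment-managed workload (name/namespace are illustrative)
kubectl scale deployment <heavy-workload> --replicas=0 -n <namespace>
```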

**Temporary Workaround**:
- Reduce concurrent workspace count
- Switch to less resource-intensive exercises
- Stagger workspace deployments

**Permanent Fix**:
- Adjust resource limits per workspace
- Implement better capacity planning (see Issue #1)
- Add resource monitoring alerts (see Issue #6)

---

### 5. Image Pull Failures

**Symptoms**:
- Workspaces stuck in "ContainerCreating" state
- ImagePullBackOff errors
- Slow workspace startup times

**Likely Causes**:
- Registry authentication issues
- Network connectivity problems
- Rate limiting from the container registry
- Image doesn't exist or incorrect tag

**Diagnosis**:

```bash
# Check pod status
kubectl get pods -A | grep -E 'ImagePull|ErrImagePull'

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Check image pull secrets
kubectl get secrets -A | grep docker

# Verify the image exists
docker pull <image-name>:<tag>
# or
crane manifest <image-name>:<tag>
```

**Resolution**:

**Immediate**:
1. Verify registry credentials are valid:
   ```bash
   kubectl get secret <image-pull-secret> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
   ```
2. Re-create the image pull secret if expired:
   ```bash
   kubectl create secret docker-registry <secret-name> \
     --docker-server=<registry> \
     --docker-username=<username> \
     --docker-password=<password> \
     -n <namespace>
   ```
3. Restart affected pods (see the sketch below)
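
Deleting pods stuck in a pull-error state lets their controllers recreate them with the refreshed secret — a sketch; review the list before piping it into delete:

```bash
# Find pods stuck on image pulls, then delete them so they are recreated
kubectl get pods -A --no-headers \
  | awk '/ImagePullBackOff|ErrImagePull/ {print $1, $2}' \
  | while read -r ns pod; do
      kubectl delete pod "$pod" -n "$ns"
    done
```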

**Temporary Workaround**:
- Use cached images if available
- Switch to an alternative image registry

**Permanent Fix**:
- Implement image pre-caching on nodes
- Use image pull secrets with longer expiration
- See GitHub Issue #2 for image management improvements

---

## Emergency Contacts

| Role | Name | Contact |
|------|------|---------|
| Infrastructure Lead | | |
| On-Call Engineer | | |
| Platform Team Lead | | |
| Escalation Contact | | jullian@coder.com |

---

## Post-Incident Checklist

- [ ] Incident resolved and documented
- [ ] Root cause identified
- [ ] GitHub issue created for permanent fix
- [ ] Runbook updated with new learnings
- [ ] Team notified of resolution
- [ ] Participants notified if impacted
- [ ] Incident added to post-workshop retrospective

---

## Related Resources

- [Monthly Workshop Guide](./MONTHLY_WORKSHOP_GUIDE.md)
- [Pre-Workshop Checklist](./PRE_WORKSHOP_CHECKLIST.md)
- [Post-Workshop Retrospective Template](./POST_WORKSHOP_RETROSPECTIVE.md)
- GitHub Issues: [#1](https://github.com/coder/ai.coder.com/issues/1), [#2](https://github.com/coder/ai.coder.com/issues/2), [#3](https://github.com/coder/ai.coder.com/issues/3), [#6](https://github.com/coder/ai.coder.com/issues/6)