
Commit a242f3f

blink-so[bot] and dahr committed

Add monthly workshop infrastructure and documentation

Created comprehensive workshop documentation including:

- Monthly workshop planning guide
- Pre-workshop validation checklist
- Post-workshop retrospective template
- Incident runbook for common issues
- Participant guide for attendees

These documents support the monthly workshop cadence to continuously stress test the platform and drive improvements.

Related to #5

Co-authored-by: dahr <13365989+dahr@users.noreply.github.com>

1 parent 0713aa4

File tree: 5 files changed, +1115 −0 lines changed

docs/workshops/INCIDENT_RUNBOOK.md

Lines changed: 332 additions & 0 deletions
@@ -0,0 +1,332 @@
# Workshop Incident Runbook

## Purpose

This runbook provides step-by-step procedures for diagnosing and resolving common incidents during monthly workshops.

---

## Incident Response Process

### 1. Initial Response

1. **Acknowledge** the incident in team chat
2. **Assess severity**:
   - **P0 (Critical)**: Complete service outage, data loss, security breach
   - **P1 (High)**: Significant degradation affecting multiple users
   - **P2 (Medium)**: Limited impact, workarounds available
   - **P3 (Low)**: Cosmetic issues, no user impact
3. **Assign incident commander** (P0/P1 only)
4. **Start incident log** (document timeline, actions, decisions; a starter sketch follows this list)
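A minimal sketch for starting the incident log. It assumes logs are kept as Markdown files in a shared `docs/workshops/incidents/` directory; that path and the field names are placeholders, not an established convention:

```bash
#!/usr/bin/env bash
# Create a timestamped incident log skeleton to fill in as the incident progresses.
set -euo pipefail

INCIDENT_DIR="docs/workshops/incidents"   # assumed location; adjust to your layout
STAMP="$(date -u +%Y-%m-%dT%H-%MZ)"
LOG_FILE="${INCIDENT_DIR}/${STAMP}-incident.md"

mkdir -p "${INCIDENT_DIR}"
cat > "${LOG_FILE}" <<EOF
# Incident ${STAMP}

- Severity: P?
- Incident commander:
- Status: investigating

## Timeline (UTC)

- ${STAMP}: Incident acknowledged in team chat

## Actions & Decisions

## Root Cause

EOF
echo "Incident log created at ${LOG_FILE}"
```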
### 2. Communication

- **Internal**: Update team in dedicated incident channel (see the sketch below)
- **Participants**: Provide status updates if impact is user-visible
- **Escalation**: Contact on-call engineer for P0/P1 incidents
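For the internal update, a status message can be posted to the incident channel programmatically. This is only a sketch and assumes the channel has a Slack-style incoming webhook; the webhook URL is a placeholder, not an existing secret:

```bash
# Post a short status update to the incident channel via an incoming webhook (placeholder URL).
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/<placeholder>"
curl -sf -X POST -H 'Content-Type: application/json' \
  -d '{"text": "[P1] Workspace restarts under investigation. IC: <name>. Next update in 30 min."}' \
  "${SLACK_WEBHOOK_URL}"
```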
### 3. Resolution & Follow-up

- Document root cause
- Create GitHub issue for permanent fix (see the sketch below)
- Update this runbook if new incident type discovered
- Include incident in post-workshop retrospective
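One way to open the follow-up issue from the command line is the GitHub CLI. This is a sketch, assuming the issues live in coder/ai.coder.com (as the Related Resources links suggest) and that `gh` is authenticated; the title, body, and label values are placeholders:

```bash
# File the permanent-fix issue; the "incident-follow-up" label must already exist in the repo.
gh issue create \
  --repo coder/ai.coder.com \
  --title "Post-incident: <short description>" \
  --body "Root cause: <summary>. Incident log: <link>. Follow-up from the monthly workshop." \
  --label "incident-follow-up"
```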
---

## Common Incidents

### 1. Workspace Restarts / Self-Healing Loop

**Symptoms**:
- Workspaces repeatedly restarting
- Users losing progress
- Self-healing mechanisms triggering continuously

**Likely Causes**:
- Ephemeral volume storage exhaustion
- Resource contention (CPU, memory)
- Node capacity exceeded

**Diagnosis**:

```bash
# Check node storage
kubectl top nodes
kubectl get nodes -o wide

# Check ephemeral volume usage
kubectl get pods -A -o json | jq '.items[] | select(.spec.volumes != null) | {name: .metadata.name, namespace: .metadata.namespace, volumes: [.spec.volumes[] | select(.emptyDir != null)]}'

# Check for evicted pods
kubectl get pods -A | grep Evicted

# Check workspace pod events
kubectl describe pod <workspace-pod-name> -n <namespace>

# Check Karpenter node allocation
kubectl logs -l app.kubernetes.io/name=karpenter -n karpenter --tail=100
```

**Resolution**:

**Immediate**:
1. Identify workspaces consuming excessive storage:
   ```bash
   kubectl exec -it <workspace-pod> -- df -h
   ```
2. If specific workspace is problematic, delete it:
   ```bash
   kubectl delete pod <workspace-pod> -n <namespace>
   ```
3. If cluster-wide issue, scale up nodes or increase storage capacity

**Temporary Workaround**:
- Pause new workspace deployments
- Ask participants to save work and stop workspaces
- Clean up unused workspaces (see the cleanup sketch below)
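A minimal cleanup sketch for the evicted pods surfaced during diagnosis; it assumes evicted workspace pods are safe to delete, so review the list before running the delete loop:

```bash
# List evicted pods across the cluster, then delete them once reviewed.
kubectl get pods -A --field-selector=status.phase=Failed -o json \
  | jq -r '.items[] | select(.status.reason == "Evicted") | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read -r ns pod; do
      echo "Deleting evicted pod ${pod} in ${ns}"
      kubectl delete pod "${pod}" -n "${ns}"
    done
```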
**Permanent Fix**:
- See GitHub Issue #1 for long-term storage optimization

---

### 2. Subdomain Routing Failures

**Symptoms**:
- Users cannot access workspaces via subdomain URLs
- 404 or DNS errors on workspace URLs
- Inconsistent routing across regions

**Likely Causes**:
- Image version mismatch between control plane and proxy clusters
- Ingress controller misconfiguration
- DNS propagation delays

**Diagnosis**:

```bash
# Check Coder image versions across clusters
kubectl get pods -n coder -o jsonpath='{.items[*].spec.containers[*].image}' --context=control-plane
kubectl get pods -n coder -o jsonpath='{.items[*].spec.containers[*].image}' --context=oregon
kubectl get pods -n coder -o jsonpath='{.items[*].spec.containers[*].image}' --context=london

# Check ingress configuration
kubectl get ingress -A
kubectl describe ingress <ingress-name> -n <namespace>

# Check DNS resolution
dig <workspace-subdomain>.ai.coder.com
nslookup <workspace-subdomain>.ai.coder.com

# Check load balancer status
kubectl get svc -n coder
```
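The same version check can be looped over the three contexts for a side-by-side view; this is a convenience sketch reusing the context names above:

```bash
# Print the unique Coder image references per cluster context.
for ctx in control-plane oregon london; do
  echo "== ${ctx} =="
  kubectl get pods -n coder \
    -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' \
    --context="${ctx}" | sort -u
done
```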
**Resolution**:

**Immediate**:
1. Verify image versions match across clusters
2. If mismatch found, restart Coder pods in affected cluster:
   ```bash
   kubectl rollout restart deployment/coder -n coder
   ```
3. If DNS issue, wait for propagation or flush DNS cache (examples below)
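Flushing the local DNS cache depends on the client OS; two common variants, assuming systemd-resolved on Linux and standard macOS clients:

```bash
# Linux with systemd-resolved
sudo resolvectl flush-caches

# macOS
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder
```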
**Temporary Workaround**:
- Direct users to working region
- Use direct IP access if subdomain fails

**Permanent Fix**:
- See GitHub Issue #2 for image management standardization

---

### 3. LiteLLM Authentication Failures

**Symptoms**:
- Users cannot authenticate
- "Invalid API key" or similar errors
- AI features not working

**Likely Causes**:
- Expired LiteLLM key
- Rate limiting
- Service outage

**Diagnosis**:

```bash
# Check LiteLLM pod logs
kubectl logs -l app=litellm -n <namespace> --tail=100

# Test LiteLLM API key
curl -H "Authorization: Bearer <api-key>" https://<litellm-endpoint>/v1/models

# Check key expiration (method depends on your key management)
# TODO: Add specific command for your environment
```

**Resolution**:

**Immediate**:
1. Verify key expiration date
2. If expired, rotate key immediately:
   ```bash
   # Follow your key rotation procedure
   # Update secret:
   kubectl create secret generic litellm-key \
     --from-literal=api-key=<new-key> \
     --dry-run=client -o yaml | kubectl apply -f -

   # Restart LiteLLM pods
   kubectl rollout restart deployment/litellm -n <namespace>
   ```
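After rotating the secret, a quick smoke test confirms the new key is accepted; this reuses the OpenAI-compatible `/v1/models` route from the diagnosis step, with the endpoint and key as placeholders:

```bash
# Expect an HTTP 200 and a list of model IDs if the rotated key is accepted.
curl -sf -H "Authorization: Bearer <new-key>" \
  "https://<litellm-endpoint>/v1/models" | jq -r '.data[].id'
```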
**Temporary Workaround**:
- If the expiration is brief, wait for key rotation
- Disable AI features temporarily if critical

**Permanent Fix**:
- See GitHub Issue #3 for key rotation automation

---

### 4. High Resource Contention

**Symptoms**:
- Slow workspace performance
- Timeouts during operations
- Elevated CPU/memory usage across cluster

**Likely Causes**:
- Too many concurrent workspaces
- Workload-heavy exercises
- Insufficient node capacity

**Diagnosis**:

```bash
# Check cluster resource usage
kubectl top nodes
kubectl top pods -A

# Check Karpenter scaling
kubectl get nodeclaims -A
kubectl logs -l app.kubernetes.io/name=karpenter -n karpenter --tail=50

# Check pod resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"
```
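To see which workloads are driving the load, the same `kubectl top` data can be sorted; this assumes metrics-server is available, as the `kubectl top` commands above already imply:

```bash
# Top memory and CPU consumers across all namespaces.
kubectl top pods -A --sort-by=memory | head -n 20
kubectl top pods -A --sort-by=cpu | head -n 20
```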
**Resolution**:

**Immediate**:
1. Trigger Karpenter to scale up nodes if not auto-scaling:
   ```bash
   # Check Karpenter NodePool status
   kubectl get nodepool
   ```
2. If nodes are at capacity, consider increasing instance sizes
3. Identify and pause resource-heavy workloads

**Temporary Workaround**:
- Reduce concurrent workspace count
- Switch to less resource-intensive exercises
- Stagger workspace deployments

**Permanent Fix**:
- Adjust resource limits per workspace
- Implement better capacity planning (see Issue #1)
- Add resource monitoring alerts (see Issue #6)

---

### 5. Image Pull Failures

**Symptoms**:
- Workspaces stuck in "ContainerCreating" state
- ImagePullBackOff errors
- Slow workspace startup times

**Likely Causes**:
- Registry authentication issues
- Network connectivity problems
- Rate limiting from container registry
- Image doesn't exist or incorrect tag

**Diagnosis**:

```bash
# Check pod status
kubectl get pods -A | grep -E 'ImagePull|ErrImagePull'

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Check image pull secrets
kubectl get secrets -A | grep docker

# Verify image exists
docker pull <image-name>:<tag>
# or
crane manifest <image-name>:<tag>
```
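If the tag itself is in doubt, `crane` can also resolve the digest or list the tags the registry actually has; this is a sketch, and a private registry needs `crane auth login <registry>` first:

```bash
# Resolve the exact digest the tag points to (fails fast if the tag is missing).
crane digest <image-name>:<tag>

# List available tags to spot a typo or a missing push.
crane ls <image-name>
```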
**Resolution**:

**Immediate**:
1. Verify registry credentials are valid:
   ```bash
   kubectl get secret <image-pull-secret> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
   ```
2. Re-create image pull secret if expired:
   ```bash
   kubectl create secret docker-registry <secret-name> \
     --docker-server=<registry> \
     --docker-username=<username> \
     --docker-password=<password> \
     -n <namespace>
   ```
3. Restart affected pods (see the sketch below)
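For step 3, one way to restart only the affected pods is to delete the ones stuck in image-pull errors; this sketch assumes they are managed by a controller that will recreate them with the refreshed secret:

```bash
# Delete pods whose status column shows an image-pull error so they are recreated.
kubectl get pods -n <namespace> --no-headers \
  | awk '$3 ~ /ImagePullBackOff|ErrImagePull/ {print $1}' \
  | while read -r pod; do kubectl delete pod "${pod}" -n <namespace>; done
```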
**Temporary Workaround**:
- Use cached images if available
- Switch to alternative image registry

**Permanent Fix**:
- Implement image pre-caching on nodes
- Use image pull secrets with longer expiration
- See GitHub Issue #2 for image management improvements

---

## Emergency Contacts

| Role | Name | Contact |
|------|------|---------|
| Infrastructure Lead | | |
| On-Call Engineer | | |
| Platform Team Lead | | |
| Escalation Contact | | jullian@coder.com |

---

## Post-Incident Checklist

- [ ] Incident resolved and documented
- [ ] Root cause identified
- [ ] GitHub issue created for permanent fix
- [ ] Runbook updated with new learnings
- [ ] Team notified of resolution
- [ ] Participants notified if impacted
- [ ] Incident added to post-workshop retrospective

---

## Related Resources

- [Monthly Workshop Guide](./MONTHLY_WORKSHOP_GUIDE.md)
- [Pre-Workshop Checklist](./PRE_WORKSHOP_CHECKLIST.md)
- [Post-Workshop Retrospective Template](./POST_WORKSHOP_RETROSPECTIVE.md)
- GitHub Issues: [#1](https://github.com/coder/ai.coder.com/issues/1), [#2](https://github.com/coder/ai.coder.com/issues/2), [#3](https://github.com/coder/ai.coder.com/issues/3), [#6](https://github.com/coder/ai.coder.com/issues/6)

