# Workshop Incident Runbook

## Purpose

This runbook provides step-by-step procedures for diagnosing and resolving common incidents during monthly workshops.

---

## Incident Response Process

### 1. Initial Response

1. **Acknowledge** the incident in team chat
2. **Assess severity**:
   - **P0 (Critical)**: Complete service outage, data loss, security breach
   - **P1 (High)**: Significant degradation affecting multiple users
   - **P2 (Medium)**: Limited impact, workarounds available
   - **P3 (Low)**: Cosmetic issues, no user impact
3. **Assign incident commander** (P0/P1 only)
4. **Start incident log** (document timeline, actions, and decisions; a minimal template is sketched below)
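
One way to bootstrap the log — a minimal sketch, assuming incident logs are kept as Markdown files; the filename and fields are illustrative, not prescribed:

```bash
# Create a timestamped incident log skeleton (filename and fields are illustrative)
LOG="incident-$(date +%Y-%m-%d-%H%M).md"
cat > "$LOG" <<'EOF'
# Incident Log

- **Severity**: P?
- **Incident commander**:
- **Started (UTC)**:
- **Resolved (UTC)**:

## Timeline

| Time (UTC) | Action / Observation | Who |
|------------|----------------------|-----|

## Decisions

## Root Cause
EOF
echo "Created $LOG"
```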

### 2. Communication

- **Internal**: Update team in dedicated incident channel
- **Participants**: Provide status updates if impact is user-visible
- **Escalation**: Contact on-call engineer for P0/P1 incidents

### 3. Resolution & Follow-up

- Document root cause
- Create GitHub issue for permanent fix
- Update this runbook if a new incident type is discovered
- Include incident in post-workshop retrospective

---

## Common Incidents

### 1. Workspace Restarts / Self-Healing Loop

**Symptoms**:
- Workspaces repeatedly restarting
- Users losing progress
- Self-healing mechanisms triggering continuously

**Likely Causes**:
- Ephemeral volume storage exhaustion
- Resource contention (CPU, memory)
- Node capacity exceeded

**Diagnosis**:

```bash
# Check node storage
kubectl top nodes
kubectl get nodes -o wide

# Check ephemeral volume usage
kubectl get pods -A -o json | jq '.items[] | select(.spec.volumes != null) | {name: .metadata.name, namespace: .metadata.namespace, volumes: [.spec.volumes[] | select(.emptyDir != null)]}'

# Check for evicted pods
kubectl get pods -A | grep Evicted

# Check workspace pod events
kubectl describe pod <workspace-pod-name> -n <namespace>

# Check Karpenter node allocation
kubectl logs -l app.kubernetes.io/name=karpenter -n karpenter --tail=100
```

**Resolution**:

**Immediate**:
1. Identify workspaces consuming excessive storage:
   ```bash
   kubectl exec -it <workspace-pod> -- df -h
   ```
2. If a specific workspace is problematic, delete it:
   ```bash
   kubectl delete pod <workspace-pod> -n <namespace>
   ```
3. If the issue is cluster-wide, scale up nodes or increase storage capacity (see the sketch below)
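
If Karpenter manages capacity, raising the NodePool limits allows it to provision additional nodes. A minimal sketch, assuming Karpenter v1 CRDs and a NodePool named `default`; the limit values are illustrative:

```bash
# Inspect the current NodePool resource ceiling (NodePool name is an assumption)
kubectl get nodepool default -o jsonpath='{.spec.limits}'

# Raise the ceiling so Karpenter can add nodes (values are illustrative)
kubectl patch nodepool default --type merge \
  -p '{"spec":{"limits":{"cpu":"200","memory":"400Gi"}}}'
```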

**Temporary Workaround**:
- Pause new workspace deployments
- Ask participants to save work and stop workspaces
- Clean up unused workspaces (see the sketch below)
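
One way to do the cleanup from the command line — a sketch assuming the `coder` CLI is installed and authenticated against this deployment; review the output of `coder list` before deleting anything:

```bash
# List workspaces across all users
coder list --all

# Stop a workspace that is no longer needed (frees its compute)
coder stop <owner>/<workspace-name>

# Delete an abandoned workspace entirely
coder delete <owner>/<workspace-name>
```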

**Permanent Fix**:
- See GitHub Issue #1 for long-term storage optimization

---

### 2. Subdomain Routing Failures

**Symptoms**:
- Users cannot access workspaces via subdomain URLs
- 404 or DNS errors on workspace URLs
- Inconsistent routing across regions

**Likely Causes**:
- Image version mismatch between control plane and proxy clusters
- Ingress controller misconfiguration
- DNS propagation delays

**Diagnosis**:

```bash
# Check Coder image versions across clusters
kubectl get pods -n coder -o jsonpath='{.items[*].spec.containers[*].image}' --context=control-plane
kubectl get pods -n coder -o jsonpath='{.items[*].spec.containers[*].image}' --context=oregon
kubectl get pods -n coder -o jsonpath='{.items[*].spec.containers[*].image}' --context=london

# Check ingress configuration
kubectl get ingress -A
kubectl describe ingress <ingress-name> -n <namespace>

# Check DNS resolution
dig <workspace-subdomain>.ai.coder.com
nslookup <workspace-subdomain>.ai.coder.com

# Check load balancer status
kubectl get svc -n coder
```

**Resolution**:

**Immediate**:
1. Verify image versions match across clusters (see the comparison loop below)
2. If a mismatch is found, restart the Coder pods in the affected cluster:
   ```bash
   kubectl rollout restart deployment/coder -n coder
   ```
3. If it is a DNS issue, wait for propagation or flush the DNS cache
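
To compare all three clusters in one pass — a small loop over the kubeconfig contexts named in the diagnosis step above:

```bash
# Print the unique Coder image(s) per cluster so mismatches stand out
for ctx in control-plane oregon london; do
  echo "== $ctx =="
  kubectl --context="$ctx" get pods -n coder \
    -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u
done
```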

**Temporary Workaround**:
- Direct users to a working region
- Use direct IP access if the subdomain fails

**Permanent Fix**:
- See GitHub Issue #2 for image management standardization

---

### 3. LiteLLM Authentication Failures

**Symptoms**:
- Users cannot authenticate
- "Invalid API key" or similar errors
- AI features not working

**Likely Causes**:
- Expired LiteLLM key
- Rate limiting
- Service outage

**Diagnosis**:

```bash
# Check LiteLLM pod logs
kubectl logs -l app=litellm -n <namespace> --tail=100

# Test LiteLLM API key
curl -H "Authorization: Bearer <api-key>" https://<litellm-endpoint>/v1/models

# Check key expiration (method depends on your key management)
# TODO: Add specific command for your environment
```
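
If you run the LiteLLM proxy, its key-management API can report per-key metadata, including expiry. A sketch, assuming admin access with the master key; treat the endpoint and response fields as assumptions to verify against your LiteLLM version:

```bash
# Query key metadata via the LiteLLM management API
# (endpoint and fields are assumptions; confirm for your LiteLLM version)
curl -s -H "Authorization: Bearer <master-key>" \
  "https://<litellm-endpoint>/key/info?key=<api-key>" | jq '.info.expires'
```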

**Resolution**:

**Immediate**:
1. Verify the key expiration date
2. If expired, rotate the key immediately:
   ```bash
   # Follow your key rotation procedure
   # Update the secret:
   kubectl create secret generic litellm-key \
     --from-literal=api-key=<new-key> \
     --dry-run=client -o yaml | kubectl apply -f -

   # Restart LiteLLM pods
   kubectl rollout restart deployment/litellm -n <namespace>
   ```
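
After rotating, confirm the new key works end to end, using the same endpoint as the diagnosis step:

```bash
# Wait for the restarted pods to become ready
kubectl rollout status deployment/litellm -n <namespace>

# Expect HTTP 200 with the new key
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer <new-key>" \
  https://<litellm-endpoint>/v1/models
```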

**Temporary Workaround**:
- If the expiration window is brief, wait for key rotation to complete
- Disable AI features temporarily if critical

**Permanent Fix**:
- See GitHub Issue #3 for key rotation automation

---

### 4. High Resource Contention

**Symptoms**:
- Slow workspace performance
- Timeouts during operations
- Elevated CPU/memory usage across the cluster

**Likely Causes**:
- Too many concurrent workspaces
- Workload-heavy exercises
- Insufficient node capacity

**Diagnosis**:

```bash
# Check cluster resource usage
kubectl top nodes
kubectl top pods -A

# Check Karpenter scaling
kubectl get nodeclaims -A
kubectl logs -l app.kubernetes.io/name=karpenter -n karpenter --tail=50

# Check pod resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"
```

**Resolution**:

**Immediate**:
1. Trigger Karpenter to scale up nodes if it is not auto-scaling:
   ```bash
   # Check Karpenter NodePool status
   kubectl get nodepool
   ```
2. If nodes are at capacity, consider increasing instance sizes
3. Identify and pause resource-heavy workloads (see the sketch below)
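
To find the heaviest consumers quickly, `kubectl top` can sort by usage; the scale-down shown assumes the workload is deployment-managed and safe to pause:

```bash
# Top memory and CPU consumers across the cluster
kubectl top pods -A --sort-by=memory | head -n 15
kubectl top pods -A --sort-by=cpu | head -n 15

# Pause a deployment-managed workload (name/namespace are illustrative)
kubectl scale deployment <heavy-workload> --replicas=0 -n <namespace>
```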

**Temporary Workaround**:
- Reduce concurrent workspace count
- Switch to less resource-intensive exercises
- Stagger workspace deployments

**Permanent Fix**:
- Adjust resource limits per workspace
- Implement better capacity planning (see Issue #1)
- Add resource monitoring alerts (see Issue #6)

---

### 5. Image Pull Failures

**Symptoms**:
- Workspaces stuck in "ContainerCreating" state
- ImagePullBackOff errors
- Slow workspace startup times

**Likely Causes**:
- Registry authentication issues
- Network connectivity problems
- Rate limiting from the container registry
- Image doesn't exist or incorrect tag

**Diagnosis**:

```bash
# Check pod status
kubectl get pods -A | grep -E 'ImagePull|ErrImagePull'

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Check image pull secrets
kubectl get secrets -A | grep docker

# Verify the image exists
docker pull <image-name>:<tag>
# or
crane manifest <image-name>:<tag>
```

**Resolution**:

**Immediate**:
1. Verify registry credentials are valid:
   ```bash
   kubectl get secret <image-pull-secret> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
   ```
2. Re-create the image pull secret if expired:
   ```bash
   kubectl create secret docker-registry <secret-name> \
     --docker-server=<registry> \
     --docker-username=<username> \
     --docker-password=<password> \
     -n <namespace>
   ```
3. Restart affected pods (see the sketch below)
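
Deleting pods stuck in a pull-error state lets their controllers recreate them with the refreshed secret — a sketch; review the list before piping it into delete:

```bash
# Find pods stuck on image pulls, then delete them so they are recreated
kubectl get pods -A --no-headers \
  | awk '/ImagePullBackOff|ErrImagePull/ {print $1, $2}' \
  | while read -r ns pod; do
      kubectl delete pod "$pod" -n "$ns"
    done
```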

**Temporary Workaround**:
- Use cached images if available
- Switch to an alternative image registry

**Permanent Fix**:
- Implement image pre-caching on nodes
- Use image pull secrets with longer expiration
- See GitHub Issue #2 for image management improvements

---

## Emergency Contacts

| Role | Name | Contact |
|------|------|---------|
| Infrastructure Lead | | |
| On-Call Engineer | | |
| Platform Team Lead | | |
| Escalation Contact | | jullian@coder.com |

---

## Post-Incident Checklist

- [ ] Incident resolved and documented
- [ ] Root cause identified
- [ ] GitHub issue created for permanent fix
- [ ] Runbook updated with new learnings
- [ ] Team notified of resolution
- [ ] Participants notified if impacted
- [ ] Incident added to post-workshop retrospective

---

## Related Resources

- [Monthly Workshop Guide](./MONTHLY_WORKSHOP_GUIDE.md)
- [Pre-Workshop Checklist](./PRE_WORKSHOP_CHECKLIST.md)
- [Post-Workshop Retrospective Template](./POST_WORKSHOP_RETROSPECTIVE.md)
- GitHub Issues: [#1](https://github.com/coder/ai.coder.com/issues/1), [#2](https://github.com/coder/ai.coder.com/issues/2), [#3](https://github.com/coder/ai.coder.com/issues/3), [#6](https://github.com/coder/ai.coder.com/issues/6)