Commit 2a22075

blink-so[bot] and dahr committed

Update workshop documentation with detailed architecture context

- Enhanced Pre-Workshop Checklist with specific commands and procedures
  - Added multi-region kubectl commands (us-east-2, us-west-2, eu-west-2)
  - Included LiteLLM key rotation details (4-5 hour schedule)
  - Added provisioner scaling guidelines (6 default, scale to 10 for >15 users)
  - Documented CloudFlare DNS verification for all 6 domains
  - Added ECR image mirror validation procedures
  - Included Karpenter health checks
  - Added LiteLLM capacity scaling (4 replicas, scale to 6-8 for >20 users)
- Updated Incident Runbook with architecture-specific procedures
  - Added LiteLLM auxiliary addon key rotation incident (forces workspace restarts)
  - Included ECR image sync procedures for subdomain routing failures
  - Added CloudFlare DNS troubleshooting via #help-me-ops
  - Documented provisioner scaling procedures
  - Added new incident type: Provisioner Failures
  - Included multi-region context throughout all incidents
  - Added specific resource limits and replica counts
- Created Architecture Overview document
  - Mermaid diagram showing multi-region setup
  - Component details for all services
  - Capacity planning tables
  - Workspace template specifications
  - Known limitations and tracking issues
  - Future expansion plans (coderdemo.io, devcoder.io)

Related: #1 #2 #3 #4 #5 #6 #7 #8 #9

Co-authored-by: dahr <13365989+dahr@users.noreply.github.com>

1 parent a242f3f, commit 2a22075

3 files changed: +840 -107 lines changed


‎docs/workshops/ARCHITECTURE.md‎

Lines changed: 325 additions & 0 deletions
@@ -0,0 +1,325 @@
# ai.coder.com Architecture Overview

This document provides a visual and technical overview of the multi-region infrastructure for https://ai.coder.com.

---

## High-Level Architecture

```mermaid
graph TB
    subgraph CloudFlare["CloudFlare DNS"]
        DNS1["ai.coder.com<br/>*.ai.coder.com"]
        DNS2["oregon-proxy.ai.coder.com<br/>*.oregon-proxy.ai.coder.com"]
        DNS3["emea-proxy.ai.coder.com<br/>*.emea-proxy.ai.coder.com"]
    end

    subgraph "us-east-2 (Ohio) - Control Plane"
        NLB1["Network Load Balancer"]
        CoderServer["Coder Server<br/>2 replicas<br/>4 vCPU / 8 GB"]
        Provisioners1["External Provisioners<br/>Default: 6 replicas<br/>Exp/Demo: 2 each<br/>500m CPU / 512 MB"]
        Karpenter1["Karpenter<br/>Node Auto-Scaling"]
        EKS1["EKS Cluster<br/>Worker Nodes"]
    end

    subgraph "us-west-2 (Oregon) - Proxy"
        NLB2["Network Load Balancer"]
        CoderProxy1["Coder Proxy<br/>2 replicas<br/>500m CPU / 1 GB"]
        Karpenter2["Karpenter<br/>Node Auto-Scaling"]
        EKS2["EKS Cluster<br/>Worker Nodes"]
    end

    subgraph "eu-west-2 (London) - Proxy"
        NLB3["Network Load Balancer"]
        CoderProxy2["Coder Proxy<br/>2 replicas<br/>500m CPU / 1 GB"]
        Karpenter3["Karpenter<br/>Node Auto-Scaling"]
        EKS3["EKS Cluster<br/>Worker Nodes"]
    end

    subgraph "LiteLLM (us-east-2)"
        ALB["Application Load Balancer"]
        LiteLLM["LiteLLM Service<br/>4 replicas<br/>2 vCPU / 4 GB"]
        KeyRotator["Auxiliary Addon<br/>Key Rotation<br/>Every 4-5 hours"]
    end

    subgraph "AI Providers"
        Bedrock["AWS Bedrock<br/>Claude Models"]
        Vertex["GCP Vertex AI<br/>Claude Models"]
    end

    subgraph "Image Registry"
        GHCR["ghcr.io<br/>coder/coder-preview"]
        ECR["Private AWS ECR<br/>coder-preview mirror"]
    end

    DNS1 --> NLB1
    DNS2 --> NLB2
    DNS3 --> NLB3

    NLB1 --> CoderServer
    NLB2 --> CoderProxy1
    NLB3 --> CoderProxy2

    CoderServer --> Provisioners1
    CoderProxy1 --> CoderServer
    CoderProxy2 --> CoderServer

    Provisioners1 --> EKS1
    CoderServer --> EKS1
    CoderProxy1 --> EKS2
    CoderProxy2 --> EKS3

    Karpenter1 --> EKS1
    Karpenter2 --> EKS2
    Karpenter3 --> EKS3

    CoderServer --> ALB
    ALB --> LiteLLM
    LiteLLM --> Bedrock
    LiteLLM --> Vertex

    KeyRotator -.-> LiteLLM

    GHCR -.->|"Manual/Automated Sync"| ECR
    ECR --> CoderServer
    ECR --> CoderProxy1
    ECR --> CoderProxy2
```

---

## Component Details

### Control Plane (us-east-2 - Ohio)

**Coder Server**
- **Function**: Main control plane for workspace management
- **Deployment**: Helm release managed via Terraform
- **Replicas**: 2
- **Resources**: 4 vCPU / 8 GB per replica
- **Capacity**: Supports up to 1,000 users
- **Image**: `ghcr.io/coder/coder-preview` (mirrored to private ECR)
- **Ingress**: Network Load Balancer
- **Authentication**: GitHub OAuth (external users), Okta OIDC (internal users)

**External Provisioners**
- **Function**: Execute Terraform operations for workspace lifecycle
- **Deployment**: Helm release managed via Terraform
- **Replicas**:
  - Default org: 6 replicas (scale to 8-10 for workshops >15 users)
  - Experimental org: 2 replicas
  - Demo org: 2 replicas
- **Resources**: 500m CPU / 512 MB per replica
- **Limitation**: 1 provisioner = 1 concurrent Terraform operation
- **IAM**: AWS IAM role for EC2 workspace provisioning
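
Because each provisioner runs one Terraform operation at a time, the effect of the replica count on a burst of workspace builds can be estimated with a ceiling division. The snippet below is a minimal sketch of that arithmetic. The function name `provisioning_waves` is hypothetical, and it assumes every build takes roughly the same time, which real builds will not.

```python
import math


def provisioning_waves(concurrent_builds: int, provisioner_replicas: int) -> int:
    """Estimate how many sequential rounds of Terraform runs a burst needs.

    Assumption (illustrative only): each replica runs exactly one build at a
    time and all builds take about the same time.
    """
    if provisioner_replicas <= 0:
        raise ValueError("provisioner_replicas must be positive")
    if concurrent_builds < 0:
        raise ValueError("concurrent_builds must not be negative")
    return math.ceil(concurrent_builds / provisioner_replicas)


# 20 simultaneous workspace builds with the default 6 replicas vs. 10 replicas.
print(provisioning_waves(20, 6))   # 4
print(provisioning_waves(20, 10))  # 2
```

Under these assumptions, scaling from 6 to 10 replicas halves the number of rounds for a 20-person burst, which is why the replica count is raised ahead of larger workshops.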

**Karpenter**
- **Function**: Dynamic node auto-scaling for EKS cluster
- **Triggers**: Pod pending state, resource requests
- **AMI**: EKS-optimized Ubuntu/Bottlerocket/AL2023
- **Dependencies**: AWS SQS, EventBridge, IAM roles

---

### Proxy Clusters

**Oregon Proxy (us-west-2)**
- **Function**: Regional workspace access proxy
- **Replicas**: 2
- **Resources**: 500m CPU / 1 GB per replica
- **Domains**: `oregon-proxy.ai.coder.com` + `*.oregon-proxy.ai.coder.com`
- **Ingress**: Network Load Balancer
- **Token**: Managed via Terraform `coderd_workspace_proxy` resource

**London Proxy (eu-west-2)**
- **Function**: Regional workspace access proxy
- **Replicas**: 2
- **Resources**: 500m CPU / 1 GB per replica
- **Domains**: `emea-proxy.ai.coder.com` + `*.emea-proxy.ai.coder.com`
- **Ingress**: Network Load Balancer
- **Token**: Managed via Terraform `coderd_workspace_proxy` resource

---

### LiteLLM Service (us-east-2)

**LiteLLM Deployment**
- **Function**: LLM proxy/router for AI features
- **Deployment**: Kubernetes manifests (not Helm)
- **Replicas**: 4 (scale to 6-8 for workshops >20 users)
- **Resources**: 2 vCPU / 4 GB per replica
- **Ingress**: Application Load Balancer (HTTPS)
- **Providers**: Round-robin between AWS Bedrock and GCP Vertex AI
- **Models**: Claude (Sonnet, Haiku, Opus)
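
To make the round-robin behaviour concrete, the sketch below alternates requests between the two providers named above. It only illustrates the rotation pattern and is not LiteLLM's routing code or configuration. The `assign` helper and the request labels are hypothetical.

```python
from itertools import cycle

# Provider names come from the list above; everything else is illustrative.
PROVIDERS = ["AWS Bedrock", "GCP Vertex AI"]


def assign(requests, providers=PROVIDERS):
    """Pair each request with the next provider in strict rotation."""
    picker = cycle(providers)
    return [(request, next(picker)) for request in requests]


for request, provider in assign(["req-1", "req-2", "req-3", "req-4"]):
    print(f"{request} -> {provider}")
```

With two providers, odd-numbered requests go to the first and even-numbered requests to the second, so load is split evenly regardless of which workspace issued the request.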

**Auxiliary Key Rotation**
- **Function**: Periodically generates and rotates LiteLLM keys
- **Frequency**: Every 4-5 hours
- **Impact**: Forces all workspaces to restart and consume the new key
- **Note**: Disable during workshops to avoid disruptions
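
When deciding whether rotation must be disabled for a given session, it can help to check whether any rotation could fall inside the workshop window. The following is a rough sketch under stated assumptions: the time of the last rotation is known, and every interval between rotations is between 4 and 5 hours. The function name and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(hours=4)  # lower bound of "every 4-5 hours"
MAX_INTERVAL = timedelta(hours=5)  # upper bound of "every 4-5 hours"


def rotation_may_hit(last_rotation: datetime, start: datetime, end: datetime) -> bool:
    """Return True if some future rotation could land inside [start, end].

    The k-th rotation after `last_rotation` can occur anywhere between
    k * 4h and k * 5h later, so each k defines a possible window.
    """
    if end < start:
        raise ValueError("end must not be before start")
    k = 1
    while last_rotation + k * MIN_INTERVAL <= end:
        if last_rotation + k * MAX_INTERVAL >= start:
            return True
        k += 1
    return False


last = datetime(2024, 10, 1, 8, 0)
print(rotation_may_hit(last, datetime(2024, 10, 1, 9, 0), datetime(2024, 10, 1, 11, 0)))   # False
print(rotation_may_hit(last, datetime(2024, 10, 1, 9, 0), datetime(2024, 10, 1, 12, 30)))  # True
```

A `False` result only holds if the recorded last-rotation time is accurate, so disabling rotation outright remains the safer choice for workshops.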

**Authentication**
- **AWS Bedrock**: IAM role with limited Bedrock permissions
- **GCP Vertex**: Service account with Vertex AI permissions

---

### Image Management

**Source**: `ghcr.io/coder/coder-preview`
- Non-GA preview image with beta AI features
- Publicly accessible on GitHub Container Registry

**Private ECR Mirror**
- Mirrored copy in AWS ECR (us-east-2)
- **Critical**: Must stay in sync with GHCR source
- **Issue**: Manual sync process prone to drift
- **Solution**: See Issue #7 for automation
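
Drift between the source and the mirror comes down to comparing image digests per tag. The sketch below shows only that comparison step. How the tag-to-digest maps are fetched from each registry is deliberately left out, and the tags and digest strings in the example are made up.

```python
def find_drift(source_digests: dict, mirror_digests: dict) -> dict:
    """Report tags whose mirrored digest is missing or differs from the source.

    Both arguments map an image tag to its digest string. Fetching them from
    GHCR or ECR is outside the scope of this sketch.
    """
    drift = {}
    for tag, digest in source_digests.items():
        mirrored = mirror_digests.get(tag)
        if mirrored is None:
            drift[tag] = "missing from mirror"
        elif mirrored != digest:
            drift[tag] = "digest mismatch"
    return drift


# Hypothetical tags and digests, for illustration only.
source = {"latest": "sha256:aaa", "preview-1": "sha256:bbb", "preview-2": "sha256:ccc"}
mirror = {"latest": "sha256:aaa", "preview-1": "sha256:old"}
print(find_drift(source, mirror))
```

An empty result means every source tag is present in the mirror with the same digest. Anything else is a candidate for a re-sync before the workshop.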

**Workspace Images**
- Build from Scratch w/ Claude: Stored in private ECR
- Build from Scratch w/ Goose: Stored in private ECR
- Real World App w/ Claude: `codercom/example-universal:ubuntu` (DockerHub)

---

### DNS Management (CloudFlare)

**Managed Domains**:
1. `ai.coder.com` + `*.ai.coder.com` → us-east-2 NLB
2. `oregon-proxy.ai.coder.com` + `*.oregon-proxy.ai.coder.com` → us-west-2 NLB
3. `emea-proxy.ai.coder.com` + `*.emea-proxy.ai.coder.com` → eu-west-2 NLB

**Current Process**: Manual changes via #help-me-ops Slack channel
**Improvement**: See Issue #9 for Terraform automation
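
The mapping from hostnames to regions listed above can be expressed as a small lookup in which the most specific base domain wins. This is a simplified model of how the apex and wildcard records interact, intended as a troubleshooting aid. It performs no DNS queries, and the `region_for_host` helper is hypothetical.

```python
# Base domains and target regions from the managed-domain list above.
ROUTES = {
    "ai.coder.com": "us-east-2",
    "oregon-proxy.ai.coder.com": "us-west-2",
    "emea-proxy.ai.coder.com": "eu-west-2",
}


def region_for_host(hostname: str):
    """Return the region whose NLB a hostname should reach, or None.

    A hostname matches a base domain if it equals it or is a subdomain of it.
    Longer base domains are checked first so the proxy records take
    precedence over the broader *.ai.coder.com wildcard.
    """
    host = hostname.lower().rstrip(".")
    for base in sorted(ROUTES, key=len, reverse=True):
        if host == base or host.endswith("." + base):
            return ROUTES[base]
    return None


print(region_for_host("ai.coder.com"))                   # us-east-2
print(region_for_host("app.oregon-proxy.ai.coder.com"))  # us-west-2
print(region_for_host("ws.emea-proxy.ai.coder.com"))     # eu-west-2
```

Comparing this expected region with where a hostname actually resolves is one way to narrow down a subdomain routing problem before raising it in #help-me-ops.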

---

## Workspace Templates

### Build from Scratch w/ Claude
- **Image**: Custom image from private ECR
- **Pre-installed**: Claude Code CLI, desktop-commander, playwright
- **Resources**: 2-4 vCPU, 4-8 GB (user-configurable)
- **LLM Provider**: LiteLLM
- **GitHub Auth**: Optional (use personal credentials or coder-contrib account)
- **AI Interface**: Claude coder_app via AgentAPI or Coder Tasks

### Build from Scratch w/ Goose
- **Image**: Custom image from private ECR
- **Pre-installed**: Goose CLI, desktop-commander, playwright
- **Resources**: 2-4 vCPU, 4-8 GB (user-configurable)
- **LLM Provider**: LiteLLM
- **GitHub Auth**: Optional
- **AI Interface**: Goose coder_app via AgentAPI or Coder Tasks

### Real World App w/ Claude
- **Image**: `codercom/example-universal:ubuntu` (DockerHub)
- **Application**: Django app (auto-starts on workspace launch)
- **Pre-installed**: Claude Code CLI, AgentAPI
- **Resources**: 2-4 vCPU, 4-8 GB (user-configurable)
- **LLM Provider**: LiteLLM
- **GitHub Auth**: Optional
- **Use Case**: Live application modification with AI assistance

---

## Supporting Infrastructure

### AWS Load Balancer Controller
- **Function**: Manages AWS NLB/ALB via Kubernetes Service/Ingress objects
- **Deployment**: Helm release managed via Terraform
- **IAM**: Dedicated IAM role with LoadBalancer management permissions

### AWS EBS CSI Driver
- **Function**: Provisions EBS volumes via Kubernetes PersistentVolume objects
- **Deployment**: Helm release managed via Terraform
- **IAM**: Dedicated IAM role with EBS management permissions

### cert-manager
- **Function**: SSL certificate renewal for all load balancers
- **Integration**: Works with AWS Load Balancer Controller

---

## Capacity Planning

### Concurrent User Targets

| Users | Provisioner Replicas | LiteLLM Replicas | Karpenter Nodes |
|-------|----------------------|------------------|-----------------|
| <10   | 6 (default)          | 4 (default)      | Auto-scale      |
| 10-15 | 8                    | 4                | Auto-scale      |
| 15-20 | 10                   | 4-6              | Auto-scale      |
| 20-30 | 12-15                | 6-8              | Auto-scale      |
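
The table can be turned into a lookup for pre-workshop planning. The sketch below encodes the rows as (min, max) replica ranges. Because the user ranges in the table share their boundary values (15 and 20), it assumes a boundary count belongs to the lower tier. That tie-break is an assumption of this sketch and is not stated in the table. The `recommended_replicas` name is hypothetical.

```python
# (max users, provisioner replicas (min, max), LiteLLM replicas (min, max))
TIERS = [
    (9, (6, 6), (4, 4)),      # <10 users (defaults)
    (15, (8, 8), (4, 4)),     # 10-15 users
    (20, (10, 10), (4, 6)),   # 15-20 users; 15 is assumed to fall in the row above
    (30, (12, 15), (6, 8)),   # 20-30 users; 20 is assumed to fall in the row above
]


def recommended_replicas(users: int) -> dict:
    """Look up replica ranges for an expected number of concurrent users."""
    if users < 0:
        raise ValueError("users must not be negative")
    for max_users, provisioners, litellm in TIERS:
        if users <= max_users:
            return {"provisioners": provisioners, "litellm": litellm}
    raise ValueError("the capacity table does not cover more than 30 users")


print(recommended_replicas(12))  # {'provisioners': (8, 8), 'litellm': (4, 4)}
print(recommended_replicas(25))  # {'provisioners': (12, 15), 'litellm': (6, 8)}
```

Karpenter node counts are omitted because the table lists them as auto-scaled in every tier.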

### Workspace Resource Allocation

**Per Workspace** (template-dependent):
- **CPU**: 2-4 vCPU
- **Memory**: 4-8 GB
- **Storage**: Ephemeral volumes (node-local)

**Example**: 15 concurrent workspaces @ 4 vCPU / 8 GB each = 60 vCPU / 120 GB total
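
The same arithmetic generalises to other workshop sizes and template settings. A minimal sketch follows, with a hypothetical helper name. It counts only workspace requests and ignores system pods and per-node overhead.

```python
def total_resources(workspaces: int, vcpu_each: int, memory_gb_each: int):
    """Return (total vCPU, total memory in GB) requested by the workspaces."""
    return workspaces * vcpu_each, workspaces * memory_gb_each


# Upper and lower ends of the per-workspace range for 15 concurrent workspaces.
print(total_resources(15, 4, 8))  # (60, 120)
print(total_resources(15, 2, 4))  # (30, 60)
```

The spread between the two results shows how much the required node capacity depends on the sizes participants choose in the template.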

---

## Known Limitations & Issues

### Storage
- **Issue**: Ephemeral volume storage capacity limited per node
- **Impact**: Workspaces restart when nodes exhaust storage
- **Tracking**: Issue #1

### Image Synchronization
- **Issue**: ECR mirror can fall out of sync with GHCR
- **Impact**: Image version mismatch causes subdomain routing failures
- **Tracking**: Issue #2, Issue #7

### LiteLLM Key Rotation
- **Issue**: Automatic rotation every 4-5 hours forces workspace restarts
- **Impact**: User progress lost during workshops if rotation occurs
- **Mitigation**: Disable rotation before workshops
- **Tracking**: Issue #3

### DNS Management
- **Issue**: Manual process via Slack requests
- **Impact**: Slow incident response, dependency on ops team
- **Tracking**: Issue #9

### Provisioner Scaling
- **Issue**: Manual scaling required, no auto-scaling
- **Impact**: Timeouts during simultaneous workspace operations
- **Tracking**: Issue #8

---

## Related Documentation

- [Monthly Workshop Guide](./MONTHLY_WORKSHOP_GUIDE.md)
- [Pre-Workshop Checklist](./PRE_WORKSHOP_CHECKLIST.md)
- [Incident Runbook](./INCIDENT_RUNBOOK.md)
- [Post-Workshop Retrospective Template](./POST_WORKSHOP_RETROSPECTIVE.md)
- [Participant Guide](./PARTICIPANT_GUIDE.md)

---

## Future Expansion

Planned additional demo environments:

### coderdemo.io
- **Purpose**: SE official demo environment
- **Level**: Production-grade, best practices, reference architecture
- **Status**: Not yet live

### devcoder.io
- **Purpose**: CS / Engineering collaboration environment
- **Use Case**: Enablement, internal feedback loops, dogfooding
- **Status**: Not yet live

---

**Last Updated**: October 2024
**Maintained By**: Infrastructure Team
**Questions**: #help-me-ops or jullian@coder.com
