Commit 2a22075

blink-so[bot] and dahr committed

Update workshop documentation with detailed architecture context

- Enhanced Pre-Workshop Checklist with specific commands and procedures
  - Added multi-region kubectl commands (us-east-2, us-west-2, eu-west-2)
  - Included LiteLLM key rotation details (4-5 hour schedule)
  - Added provisioner scaling guidelines (6 default, scale to 10 for >15 users)
  - Documented CloudFlare DNS verification for all 6 domains
  - Added ECR image mirror validation procedures
  - Included Karpenter health checks
  - Added LiteLLM capacity scaling (4 replicas, scale to 6-8 for >20 users)
- Updated Incident Runbook with architecture-specific procedures
  - Added LiteLLM auxiliary addon key rotation incident (forces workspace restarts)
  - Included ECR image sync procedures for subdomain routing failures
  - Added CloudFlare DNS troubleshooting via #help-me-ops
  - Documented provisioner scaling procedures
  - Added new incident type: Provisioner Failures
  - Included multi-region context throughout all incidents
  - Added specific resource limits and replica counts
- Created Architecture Overview document
  - Mermaid diagram showing multi-region setup
  - Component details for all services
  - Capacity planning tables
  - Workspace template specifications
  - Known limitations and tracking issues
  - Future expansion plans (coderdemo.io, devcoder.io)

Related: #1 #2 #3 #4 #5 #6 #7 #8 #9

Co-authored-by: dahr <13365989+dahr@users.noreply.github.com>

1 parent a242f3f, commit 2a22075

File tree

3 files changed

+840

-107

lines changed

docs/workshops

`‎docs/workshops/ARCHITECTURE.md‎`

Lines changed: 325 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,325 @@`
	`1`	`+# ai.coder.com Architecture Overview`
	`2`	`+`
	`3`	`+This document provides a visual and technical overview of the multi-region infrastructure for https://ai.coder.com.`
	`4`	`+`
	`5`	`+---`
	`6`	`+`
	`7`	`+## High-Level Architecture`
	`8`	`+`
	`9`	+```mermaid
	`10`	`+graph TB`
	`11`	`+ subgraph CloudFlare["CloudFlare DNS"]`
	`12`	`+ DNS1["ai.coder.com<br/>*.ai.coder.com"]`
	`13`	`+ DNS2["oregon-proxy.ai.coder.com<br/>*.oregon-proxy.ai.coder.com"]`
	`14`	`+ DNS3["emea-proxy.ai.coder.com<br/>*.emea-proxy.ai.coder.com"]`
	`15`	`+ end`
	`16`	`+`
	`17`	`+ subgraph "us-east-2 (Ohio) - Control Plane"`
	`18`	`+ NLB1["Network Load Balancer"]`
	`19`	`+ CoderServer["Coder Server<br/>2 replicas<br/>4 vCPU / 8 GB"]`
	`20`	`+ Provisioners1["External Provisioners<br/>Default: 6 replicas<br/>Exp/Demo: 2 each<br/>500m CPU / 512 MB"]`
	`21`	`+ Karpenter1["Karpenter<br/>Node Auto-Scaling"]`
	`22`	`+ EKS1["EKS Cluster<br/>Worker Nodes"]`
	`23`	`+ end`
	`24`	`+`
	`25`	`+ subgraph "us-west-2 (Oregon) - Proxy"`
	`26`	`+ NLB2["Network Load Balancer"]`
	`27`	`+ CoderProxy1["Coder Proxy<br/>2 replicas<br/>500m CPU / 1 GB"]`
	`28`	`+ Karpenter2["Karpenter<br/>Node Auto-Scaling"]`
	`29`	`+ EKS2["EKS Cluster<br/>Worker Nodes"]`
	`30`	`+ end`
	`31`	`+`
	`32`	`+ subgraph "eu-west-2 (London) - Proxy"`
	`33`	`+ NLB3["Network Load Balancer"]`
	`34`	`+ CoderProxy2["Coder Proxy<br/>2 replicas<br/>500m CPU / 1 GB"]`
	`35`	`+ Karpenter3["Karpenter<br/>Node Auto-Scaling"]`
	`36`	`+ EKS3["EKS Cluster<br/>Worker Nodes"]`
	`37`	`+ end`
	`38`	`+`
	`39`	`+ subgraph "LiteLLM (us-east-2)"`
	`40`	`+ ALB["Application Load Balancer"]`
	`41`	`+ LiteLLM["LiteLLM Service<br/>4 replicas<br/>2 vCPU / 4 GB"]`
	`42`	`+ KeyRotator["Auxiliary Addon<br/>Key Rotation<br/>Every 4-5 hours"]`
	`43`	`+ end`
	`44`	`+`
	`45`	`+ subgraph "AI Providers"`
	`46`	`+ Bedrock["AWS Bedrock<br/>Claude Models"]`
	`47`	`+ Vertex["GCP Vertex AI<br/>Claude Models"]`
	`48`	`+ end`
	`49`	`+`
	`50`	`+ subgraph "Image Registry"`
	`51`	`+ GHCR["ghcr.io<br/>coder/coder-preview"]`
	`52`	`+ ECR["Private AWS ECR<br/>coder-preview mirror"]`
	`53`	`+ end`
	`54`	`+`
	`55`	`+ DNS1 --> NLB1`
	`56`	`+ DNS2 --> NLB2`
	`57`	`+ DNS3 --> NLB3`
	`58`	`+`
	`59`	`+ NLB1 --> CoderServer`
	`60`	`+ NLB2 --> CoderProxy1`
	`61`	`+ NLB3 --> CoderProxy2`
	`62`	`+`
	`63`	`+ CoderServer --> Provisioners1`
	`64`	`+ CoderProxy1 --> CoderServer`
	`65`	`+ CoderProxy2 --> CoderServer`
	`66`	`+`
	`67`	`+ Provisioners1 --> EKS1`
	`68`	`+ CoderServer --> EKS1`
	`69`	`+ CoderProxy1 --> EKS2`
	`70`	`+ CoderProxy2 --> EKS3`
	`71`	`+`
	`72`	`+ Karpenter1 --> EKS1`
	`73`	`+ Karpenter2 --> EKS2`
	`74`	`+ Karpenter3 --> EKS3`
	`75`	`+`
	`76`	`+ CoderServer --> ALB`
	`77`	`+ ALB --> LiteLLM`
	`78`	`+ LiteLLM --> Bedrock`
	`79`	`+ LiteLLM --> Vertex`
	`80`	`+`
	`81`	`+ KeyRotator -.-> LiteLLM`
	`82`	`+`
	`83`	`+ GHCR -.->|"Manual/Automated Sync"| ECR`
	`84`	`+ ECR --> CoderServer`
	`85`	`+ ECR --> CoderProxy1`
	`86`	`+ ECR --> CoderProxy2`
	`87`	+```
	`88`	`+`
	`89`	`+---`
	`90`	`+`
	`91`	`+## Component Details`
	`92`	`+`
	`93`	`+### Control Plane (us-east-2 - Ohio)`
	`94`	`+`
	`95`	`+Coder Server`
	`96`	`+- Function: Main control plane for workspace management`
	`97`	`+- Deployment: Helm release managed via Terraform`
	`98`	`+- Replicas: 2`
	`99`	`+- Resources: 4 vCPU / 8 GB per replica`
	`100`	`+- Capacity: Supports up to 1,000 users`
	`101`	+- Image: `ghcr.io/coder/coder-preview` (mirrored to private ECR)
	`102`	`+- Ingress: Network Load Balancer`
	`103`	`+- Authentication: GitHub OAuth (external users), Okta OIDC (internal users)`
	`104`	`+`
	`105`	`+External Provisioners`
	`106`	`+- Function: Execute Terraform operations for workspace lifecycle`
	`107`	`+- Deployment: Helm release managed via Terraform`
	`108`	`+- Replicas:`
	`109`	`+  - Default org: 6 replicas (scale to 8-10 for workshops >15 users)`
	`110`	`+  - Experimental org: 2 replicas`
	`111`	`+  - Demo org: 2 replicas`
	`112`	`+- Resources: 500m CPU / 512 MB per replica`
	`113`	`+- Limitation: 1 provisioner = 1 concurrent Terraform operation`
	`114`	`+- IAM: AWS IAM role for EC2 workspace provisioning`
	`115`	`+`
	`116`	`+Karpenter`
	`117`	`+- Function: Dynamic node auto-scaling for EKS cluster`
	`118`	`+- Triggers: Pod pending state, resource requests`
	`119`	`+- AMI: EKS-optimized Ubuntu/Bottlerocket/AL2023`
	`120`	`+- Dependencies: AWS SQS, EventBridge, IAM roles`
	`121`	`+`
	`122`	`+---`
	`123`	`+`
	`124`	`+### Proxy Clusters`
	`125`	`+`
	`126`	`+Oregon Proxy (us-west-2)`
	`127`	`+- Function: Regional workspace access proxy`
	`128`	`+- Replicas: 2`
	`129`	`+- Resources: 500m CPU / 1 GB per replica`
	`130`	+- Domains: `oregon-proxy.ai.coder.com` + `*.oregon-proxy.ai.coder.com`
	`131`	`+- Ingress: Network Load Balancer`
	`132`	+- Token: Managed via Terraform `coderd_workspace_proxy` resource
	`133`	`+`
	`134`	`+London Proxy (eu-west-2)`
	`135`	`+- Function: Regional workspace access proxy`
	`136`	`+- Replicas: 2`
	`137`	`+- Resources: 500m CPU / 1 GB per replica`
	`138`	+- Domains: `emea-proxy.ai.coder.com` + `*.emea-proxy.ai.coder.com`
	`139`	`+- Ingress: Network Load Balancer`
	`140`	+- Token: Managed via Terraform `coderd_workspace_proxy` resource
	`141`	`+`
	`142`	`+---`
	`143`	`+`
	`144`	`+### LiteLLM Service (us-east-2)`
	`145`	`+`
	`146`	`+LiteLLM Deployment`
	`147`	`+- Function: LLM proxy/router for AI features`
	`148`	`+- Deployment: Kubernetes manifests (not Helm)`
	`149`	`+- Replicas: 4 (scale to 6-8 for workshops >20 users)`
	`150`	`+- Resources: 2 vCPU / 4 GB per replica`
	`151`	`+- Ingress: Application Load Balancer (HTTPS)`
	`152`	`+- Providers: Round-robin between AWS Bedrock and GCP Vertex AI`
	`153`	`+- Models: Claude (Sonnet, Haiku, Opus)`
	`154`	`+`
	`155`	`+Auxiliary Key Rotation`
	`156`	`+- Function: Periodically generates and rotates LiteLLM keys`
	`157`	`+- Frequency: Every 4-5 hours`
	`158`	`+- Impact: Forces all workspaces to restart and consume new key`
	`159`	`+- Note: Disable during workshops to avoid disruptions`
	`160`	`+`
	`161`	`+Authentication`
	`162`	`+- AWS Bedrock: IAM role with limited Bedrock permissions`
	`163`	`+- GCP Vertex: Service account with Vertex AI permissions`
	`164`	`+`
	`165`	`+---`
	`166`	`+`
	`167`	`+### Image Management`
	`168`	`+`
	`169`	+Source: `ghcr.io/coder/coder-preview`
	`170`	`+- Non-GA preview image with beta AI features`
	`171`	`+- Publicly accessible on GitHub Container Registry`
	`172`	`+`
	`173`	`+Private ECR Mirror`
	`174`	`+- Mirrored copy in AWS ECR (us-east-2)`
	`175`	`+- Critical: Must stay in sync with GHCR source`
	`176`	`+- Issue: Manual sync process prone to drift`
	`177`	`+- Solution: See Issue #7 for automation`
	`178`	`+`
	`179`	`+Workspace Images`
	`180`	`+- Build from Scratch w/ Claude: Stored in private ECR`
	`181`	`+- Build from Scratch w/ Goose: Stored in private ECR`
	`182`	+- Real World App w/ Claude: `codercom/example-universal:ubuntu` (DockerHub)
	`183`	`+`
	`184`	`+---`
	`185`	`+`
	`186`	`+### DNS Management (CloudFlare)`
	`187`	`+`
	`188`	`+Managed Domains:`
	`189`	+1. `ai.coder.com` + `*.ai.coder.com` → us-east-2 NLB
	`190`	+2. `oregon-proxy.ai.coder.com` + `*.oregon-proxy.ai.coder.com` → us-west-2 NLB
	`191`	+3. `emea-proxy.ai.coder.com` + `*.emea-proxy.ai.coder.com` → eu-west-2 NLB
	`192`	`+`
	`193`	`+Current Process: Manual changes via #help-me-ops Slack channel`
	`194`	`+Improvement: See Issue #9 for Terraform automation`
	`195`	`+`
	`196`	`+---`
	`197`	`+`
	`198`	`+## Workspace Templates`
	`199`	`+`
	`200`	`+### Build from Scratch w/ Claude`
	`201`	`+- Image: Custom image from private ECR`
	`202`	`+- Pre-installed: Claude Code CLI, desktop-commander, playwright`
	`203`	`+- Resources: 2-4 vCPU, 4-8 GB (user-configurable)`
	`204`	`+- LLM Provider: LiteLLM`
	`205`	`+- GitHub Auth: Optional (use personal credentials or coder-contrib account)`
	`206`	`+- AI Interface: Claude coder_app via AgentAPI or Coder Tasks`
	`207`	`+`
	`208`	`+### Build from Scratch w/ Goose`
	`209`	`+- Image: Custom image from private ECR`
	`210`	`+- Pre-installed: Goose CLI, desktop-commander, playwright`
	`211`	`+- Resources: 2-4 vCPU, 4-8 GB (user-configurable)`
	`212`	`+- LLM Provider: LiteLLM`
	`213`	`+- GitHub Auth: Optional`
	`214`	`+- AI Interface: Goose coder_app via AgentAPI or Coder Tasks`
	`215`	`+`
	`216`	`+### Real World App w/ Claude`
	`217`	+- Image: `codercom/example-universal:ubuntu` (DockerHub)
	`218`	`+- Application: Django app (auto-starts on workspace launch)`
	`219`	`+- Pre-installed: Claude Code CLI, AgentAPI`
	`220`	`+- Resources: 2-4 vCPU, 4-8 GB (user-configurable)`
	`221`	`+- LLM Provider: LiteLLM`
	`222`	`+- GitHub Auth: Optional`
	`223`	`+- Use Case: Live application modification with AI assistance`
	`224`	`+`
	`225`	`+---`
	`226`	`+`
	`227`	`+## Supporting Infrastructure`
	`228`	`+`
	`229`	`+### AWS Load Balancer Controller`
	`230`	`+- Function: Manages AWS NLB/ALB via Kubernetes Service/Ingress objects`
	`231`	`+- Deployment: Helm release managed via Terraform`
	`232`	`+- IAM: Dedicated IAM role with LoadBalancer management permissions`
	`233`	`+`
	`234`	`+### AWS EBS CSI Driver`
	`235`	`+- Function: Provisions EBS volumes via Kubernetes PersistentVolume objects`
	`236`	`+- Deployment: Helm release managed via Terraform`
	`237`	`+- IAM: Dedicated IAM role with EBS management permissions`
	`238`	`+`
	`239`	`+### cert-manager`
	`240`	`+- Function: SSL certificate renewal for all load balancers`
	`241`	`+- Integration: Works with AWS Load Balancer Controller`
	`242`	`+`
	`243`	`+---`
	`244`	`+`
	`245`	`+## Capacity Planning`
	`246`	`+`
	`247`	`+### Concurrent User Targets`
	`248`	`+`
	`249`	`+| Users | Provisioner Replicas | LiteLLM Replicas | Karpenter Nodes |`
	`250`	`+|-------|----------------------|------------------|-----------------|`
	`251`	`+| <10 | 6 (default) | 4 (default) | Auto-scale |`
	`252`	`+| 10-15 | 8 | 4 | Auto-scale |`
	`253`	`+| 15-20 | 10 | 4-6 | Auto-scale |`
	`254`	`+| 20-30 | 12-15 | 6-8 | Auto-scale |`
	`255`	`+`
	`256`	`+### Workspace Resource Allocation`
	`257`	`+`
	`258`	`+Per Workspace (template-dependent):`
	`259`	`+- CPU: 2-4 vCPU`
	`260`	`+- Memory: 4-8 GB`
	`261`	`+- Storage: Ephemeral volumes (node-local)`
	`262`	`+`
	`263`	`+Example: 15 concurrent workspaces @ 4 vCPU / 8 GB each = 60 vCPU / 120 GB total`
	`264`	`+`
	`265`	`+---`
	`266`	`+`
	`267`	`+## Known Limitations & Issues`
	`268`	`+`
	`269`	`+### Storage`
	`270`	`+- Issue: Ephemeral volume storage capacity limited per node`
	`271`	`+- Impact: Workspaces restart when nodes exhaust storage`
	`272`	`+- Tracking: Issue #1`
	`273`	`+`
	`274`	`+### Image Synchronization`
	`275`	`+- Issue: ECR mirror can fall out of sync with GHCR`
	`276`	`+- Impact: Image version mismatch causes subdomain routing failures`
	`277`	`+- Tracking: Issue #2, Issue #7`
	`278`	`+`
	`279`	`+### LiteLLM Key Rotation`
	`280`	`+- Issue: Automatic rotation every 4-5 hours forces workspace restarts`
	`281`	`+- Impact: User progress lost during workshops if rotation occurs`
	`282`	`+- Mitigation: Disable rotation before workshops`
	`283`	`+- Tracking: Issue #3`
	`284`	`+`
	`285`	`+### DNS Management`
	`286`	`+- Issue: Manual process via Slack requests`
	`287`	`+- Impact: Slow incident response, dependency on ops team`
	`288`	`+- Tracking: Issue #9`
	`289`	`+`
	`290`	`+### Provisioner Scaling`
	`291`	`+- Issue: Manual scaling required, no auto-scaling`
	`292`	`+- Impact: Timeouts during simultaneous workspace operations`
	`293`	`+- Tracking: Issue #8`
	`294`	`+`
	`295`	`+---`
	`296`	`+`
	`297`	`+## Related Documentation`
	`298`	`+`
	`299`	`+- [Monthly Workshop Guide](./MONTHLY_WORKSHOP_GUIDE.md)`
	`300`	`+- [Pre-Workshop Checklist](./PRE_WORKSHOP_CHECKLIST.md)`
	`301`	`+- [Incident Runbook](./INCIDENT_RUNBOOK.md)`
	`302`	`+- [Post-Workshop Retrospective Template](./POST_WORKSHOP_RETROSPECTIVE.md)`
	`303`	`+- [Participant Guide](./PARTICIPANT_GUIDE.md)`
	`304`	`+`
	`305`	`+---`
	`306`	`+`
	`307`	`+## Future Expansion`
	`308`	`+`
	`309`	`+Planned additional demo environments:`
	`310`	`+`
	`311`	`+### coderdemo.io`
	`312`	`+- Purpose: SE official demo environment`
	`313`	`+- Level: Production-grade, best practices, reference architecture`
	`314`	`+- Status: Not yet live`
	`315`	`+`
	`316`	`+### devcoder.io`
	`317`	`+- Purpose: CS / Engineering collaboration environment`
	`318`	`+- Use Case: Enablement, internal feedback loops, dogfooding`
	`319`	`+- Status: Not yet live`
	`320`	`+`
	`321`	`+---`
	`322`	`+`
	`323`	`+Last Updated: October 2024`
	`324`	`+Maintained By: Infrastructure Team`
	`325`	`+Questions: #help-me-ops or jullian@coder.com`
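
Note on the Capacity Planning section of the diff above: the replica table and the "15 concurrent workspaces" arithmetic can be sanity-checked with a short script. The following is a minimal sketch and is not part of the commit. The names `ReplicaPlan`, `plan_for`, and `workspace_footprint` are hypothetical, and the handling of overlapping row boundaries (15 users falls in the 10-15 row, 20 users in the 15-20 row) is an assumption based on the ">15 users" and ">20 users" wording elsewhere in the document.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReplicaPlan:
    provisioners: Tuple[int, int]  # (min, max) default-org provisioner replicas
    litellm: Tuple[int, int]       # (min, max) LiteLLM replicas


# (inclusive upper bound on concurrent users, plan), transcribed from the
# "Concurrent User Targets" table. Boundary handling is an assumption.
_TIERS = [
    (9, ReplicaPlan((6, 6), (4, 4))),     # <10 users (defaults)
    (15, ReplicaPlan((8, 8), (4, 4))),    # 10-15 users
    (20, ReplicaPlan((10, 10), (4, 6))),  # 15-20 users
    (30, ReplicaPlan((12, 15), (6, 8))),  # 20-30 users
]


def plan_for(users: int) -> ReplicaPlan:
    """Return the table row that covers the given number of concurrent users."""
    if users < 0:
        raise ValueError("users must be non-negative")
    for upper, plan in _TIERS:
        if users <= upper:
            return plan
    # The table stops at 30 users; anything larger is outside documented guidance.
    raise ValueError("the capacity table does not cover more than 30 users")


def workspace_footprint(count: int, vcpu: int = 4, mem_gb: int = 8) -> Tuple[int, int]:
    """Total (vCPU, GB) requested by `count` workspaces of the given size."""
    return count * vcpu, count * mem_gb
```

For instance, `workspace_footprint(15)` reproduces the 60 vCPU / 120 GB figure from the document's example, and `plan_for(25)` returns the 12-15 provisioner / 6-8 LiteLLM row.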
