Description
Motivation
Currently, the scale-up lambda determines the AMI for new GitHub Action runners using static configuration (e.g., SSM parameter name) selected per runner group at deployment time. This requires deploying and managing separate infrastructure modules for each AMI requirement, making it difficult to target specific runner images for ad hoc workflows or testing pipelines.
One challenge occurs when you have both production and staging pipelines, possibly even in separate organizations, but want to use the same runner group. With the existing single deployment and SSM parameter per runner group, multiple changes (e.g., testing new AMIs in staging) overwrite the shared SSM parameter, leading to confusion and errors. This makes testing and validation before pushing to production difficult and error-prone.
Proposal
Introduce an option (toggleable per runner group in the Terraform code) to dynamically select the AMI (ID or ARN) for EC2 runners based on labels provided in the triggering workflow job. This enables workflows to specify custom runner requirements directly via labels (e.g., runs-on: [self-hosted, test-ami-v2]) and ensures that only jobs matching the appropriate labels will launch runners with the corresponding AMIs.
- Default to the current behavior (static AMI from SSM parameter) unless the dynamic label-based mapping is enabled for the runner group.
- When enabled, pass workflow job labels/metadata from the webhook/SQS to the scale-up lambda.
- Provide a configuration mapping (per runner group/module) from runner labels to SSM AMI parameter names (or direct AMI IDs/ARNs).
- Scale-up lambda uses the job labels to select the correct AMI at runtime, falling back to default if no match.
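The selection step described above could look roughly like the following. This is only a sketch to make the proposal concrete: the names AmiSelectionConfig, enableDynamicAmi, defaultAmiSource, labelToAmi, and selectAmiSource are hypothetical and do not exist in the module today, and the SSM parameter paths are made-up examples.

```typescript
// Hypothetical per-runner-group configuration; none of these names exist in the module today.
interface AmiSelectionConfig {
  // Feature toggle; when false the current static behavior is kept.
  enableDynamicAmi: boolean;
  // Current static source: an SSM parameter name (or a direct AMI ID).
  defaultAmiSource: string;
  // Mapping from runner label to SSM parameter name (or direct AMI ID/ARN).
  labelToAmi: Map<string, string>;
}

// Returns the AMI source for a job: the first label with a mapping wins,
// otherwise the static default is used.
function selectAmiSource(config: AmiSelectionConfig, jobLabels: string[]): string {
  if (!config.enableDynamicAmi) {
    return config.defaultAmiSource;
  }
  for (const label of jobLabels) {
    const mapped = config.labelToAmi.get(label);
    if (mapped !== undefined) {
      return mapped;
    }
  }
  return config.defaultAmiSource;
}

// Example configuration with made-up parameter paths.
const exampleConfig: AmiSelectionConfig = {
  enableDynamicAmi: true,
  defaultAmiSource: '/runners/service-x/ami/prod',
  labelToAmi: new Map([['test-ami-v2', '/runners/service-x/ami/test-v2']]),
};

const chosen = selectAmiSource(exampleConfig, ['self-hosted', 'test-ami-v2']);
```

With this shape, a job using runs-on: [self-hosted, test-ami-v2] would resolve to the test parameter, while a job with only self-hosted would fall back to the default. Resolving the chosen SSM parameter to an actual AMI ID is left out of the sketch.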
Benefits
- Greatly improves flexibility for CI/CD organizations with both production and staging pipelines, especially when using shared runner group names.
- No more accidental overwriting of SSM parameters between pipeline scenarios.
- Streamlines test pipelines, allowing safe staging runs with different AMIs without impacting production pipelines.
- Reduces infrastructure management overhead and simplifies testing new runner images.
Example Use Case
- Organization A (production) and Organization B (staging) both use runner group 'service-x'.
- Production uses AMI 'prod-ami', staging uses 'test-ami'.
- Currently, staging pipeline deployments overwrite the shared SSM parameter for 'service-x'. When multiple changes are being tested at the same time, each deployment overwrites the staging AMI set by the others, so we risk testing the wrong AMI.
- With the label-based AMI toggle, both can use distinct AMIs for the same runner group, and staging tests never impact each other.
Implementation Details
- Add Terraform variable and internal wiring to toggle this feature per runner group/module.
- Modify ActionRequestMessage and SQS message to include workflow job labels.
- Add config mapping and retrieval in scale-up lambda.
- Retain backwards compatibility for users not enabling the toggle.
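As a rough illustration of the backwards-compatibility point, the scale-up side could treat the labels field as optional when reading the SQS message body. The field name labels and the helper labelsFromMessageBody are assumptions for this sketch; the existing ActionRequestMessage fields are not reproduced here.

```typescript
// Hypothetical helper: extracts job labels from an SQS message body.
// Assumes the message is JSON and that labels, if present, is an array of strings.
// Messages produced before the change (no labels field) yield an empty list,
// so the caller falls back to the static AMI.
function labelsFromMessageBody(body: string): string[] {
  const parsed: unknown = JSON.parse(body);
  if (typeof parsed !== 'object' || parsed === null) {
    return [];
  }
  const labels = (parsed as { labels?: unknown }).labels;
  if (!Array.isArray(labels)) {
    return [];
  }
  // Ignore any non-string entries rather than failing the whole message.
  return labels.filter((label): label is string => typeof label === 'string');
}

// A message with labels, and an older message without them.
const withLabels = labelsFromMessageBody('{"id": 1, "labels": ["self-hosted", "test-ami-v2"]}');
const withoutLabels = labelsFromMessageBody('{"id": 2}');
```

Keeping the field optional means users who do not enable the toggle, or messages already in the queue during a rollout, are handled the same way as today.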
Related Context
This approach would align with label-based queue selection but extend to image provisioning, closing a key gap in dynamic self-hosted runner scenarios and multi-org, multi-pipeline CI setups.