Reuse Ephemeral runners #6315
Conversation
vercel bot commented Feb 21, 2025 (edited)
The latest updates on your projects. Learn more about Vercel for Git ↗︎. 1 Skipped Deployment.
The approach looks quite reasonable!
Resolved review thread: terraform-aws-github-runner/modules/runners-instances/templates/install-config-runner.ps1
Resolved review thread: terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/cache.test.ts
Resolved review thread: terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/runners.ts
```ts
metrics.ec2CreateReplaceRootVolumeTaskFailure,
() => {
  return ec2
    .createReplaceRootVolumeTask({
```
Any idea how long this operation takes?
No, I don't send metrics for it specifically on the spa.
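For reference, the API call itself returns quickly; the actual root-volume swap completes asynchronously and can be polled via `describeReplaceRootVolumeTasks`. A minimal sketch of how one could measure both, assuming AWS SDK v2 and a hypothetical `emitLatencyMetric` helper (neither the helper nor the metric names are part of this PR):

```ts
// Sketch only: times the CreateReplaceRootVolumeTask API call, then polls the
// task until it leaves the in-progress states. `emitLatencyMetric` is hypothetical.
import { EC2 } from 'aws-sdk';

const ec2 = new EC2();
declare function emitLatencyMetric(name: string, ms: number): void;

async function replaceRootVolumeTimed(instanceId: string): Promise<void> {
  const start = Date.now();
  const task = await ec2.createReplaceRootVolumeTask({ InstanceId: instanceId }).promise();
  emitLatencyMetric('ec2.createReplaceRootVolumeTask.apiCall', Date.now() - start);

  const taskId = task.ReplaceRootVolumeTask?.ReplaceRootVolumeTaskId;
  if (!taskId) return;

  // Poll until the task finishes to measure the end-to-end replacement time.
  for (;;) {
    const res = await ec2
      .describeReplaceRootVolumeTasks({ ReplaceRootVolumeTaskIds: [taskId] })
      .promise();
    const state = res.ReplaceRootVolumeTasks?.[0]?.TaskState;
    if (state !== 'pending' && state !== 'in-progress') break;
    await new Promise((r) => setTimeout(r, 5000)); // wait 5s between polls
  }
  emitLatencyMetric('ec2.replaceRootVolumeTask.endToEnd', Date.now() - start);
}
```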
Workflow testing the changes: https://github.com/pytorch/pytorch-canary/actions/runs/13680037408/job/38251223446?pr=249
Finished reviewing the first half of this code. Will do the other half in a bit.
Resolved review thread: terraform-aws-github-runner/modules/runners-instances/templates/install-config-runner.ps1
```js
functions: 90,
lines: 93,
statements: 94
```
what does this do?
Jest configuration. I'm just raising the bar for the tests: as I wrote more tests and improved the coverage a bit, I'm making sure people maintain that standard rather than ignoring tests.
```ts
const redisData = JSON.parse(redisResponse, mapReviver) as RedisStore;
while (+new Date() / 1000 < acquireTimeout) {
  try {
    if (tryRead !== undefined) {
```
when would tryRead ever be undefined?
When you don't want to use it? I don't think there's a point in duplicating the code to write one function that supports a try-read and one that doesn't. We have both cases in our application.
```ts
async () => {
  if (!redisPool) throw Error('Redis should be initialized!');
  console.debug(`Trying to get a redis client ${queryKey}`);
```
nit: are debug statements like this important to leave in?
Maybe just at the beginning, if we need to troubleshoot something. I would rather have them, for now.
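If the noise ever becomes a concern, one middle ground would be gating these statements behind an environment flag rather than deleting them. A sketch (the `DEBUG_REDIS` flag and helper are made up for illustration):

```ts
// Illustrative only: DEBUG_REDIS is a hypothetical flag, not part of this PR.
const redisDebugEnabled = process.env.DEBUG_REDIS === '1';

function redisDebug(msg: string): void {
  if (redisDebugEnabled) console.debug(msg);
}

// At the call site quoted above:
// redisDebug(`Trying to get a redis client ${queryKey}`);
```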
Resolved review thread: terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/runners.ts
Could you please share your major concern with not moving forward with the implementation? I assume that since you didn't approve, you have more concerns than just the syntax improvements you shared in your comments.
Looks good. Could you please share the test plan for this? Can it be tested in canary?
jeanschmidt commented Mar 6, 2025 (edited)
Sure, I did test in canary; here is the workflow showing the runner being reused according to plan. This paste shows how it will impact our internal infra (terraform plan): P1747996633 (edit: updated the paste, as the one before was flagged)
Can it be deployed to canary only, so we can run a couple of PRs and then explore the runners' logs for any possible issues or failures?
Yes, what else do you need besides a pytorch-canary CI run with the results of the test? Do you want me to reproduce the test and paste the full logs from scaleUp?
I believe we probably want to run full CI/CD using this lambda in canary using …
lgtm
Can you please update the PR with the rollout/rollback plan for this change? It would be good to see how this is expected to be safely tested in prod, with the option for a fast rollback if something goes wrong.
Merged commit fc07220 from …nschmidt/reuse_runners into main
…le_config.py to enforce it (#6402)

# TLDR
Adds an `ephemeral` variant to all runner types, and updates validate_scale_config.py to enforce it.

# What?
Enables experimenting with ephemeral runners for all workflows. The status of the experiment can be found in the [test-infra issue](#5132).

# Why?
Those runners are ephemeral, as eliminating nonephemeral runners is a follow-up to the recent security incident. Refreshable infrastructure has been something we've been trying to accomplish for a while, but haven't been successful at. The major blocker we face is related to stockouts and unreliability on the GH side. Most of it is because nonephemeral runners can run other jobs and continue clearing the queue in case of a problem; this is not possible for ephemeral runners.

# How?
To remediate stockouts, the [reuse/refresh of ephemeral instances](#6315) has been introduced. To remediate GH-side issues, a [queue healing mechanism](#6390) is being implemented. Also, to guarantee the stability of this behaviour, [additional unit tests have been included with other changes](#6403).

# Next steps
After merging those changes, we intend to put a small percentage of jobs on ephemeral runners, so we can evaluate the impact on queue times, gather statistics on reuse, and tune noflap cluster behaviour. Once we feel comfortable, the experiment will be shifted to 100% and we'll migrate all workflows to fully ephemeral instances. Eventually, all runners will be ephemeral, and the experiment and runner variants will be removed as we update other workflows like slow, trunk and nightlies.

# The ephemeral runners are not working properly?
Then go to the experiment in the `What?` section of this document and set the experiment named `ephemeral` to 0.
About
With the goal of eventually moving to all instances being ephemeral, we need to fix the major limitation we have with ephemeral instances: stockouts. This is a problem because we currently release the instances back to AWS as soon as they finish a job.
The goal is to make the instances reusable before returning them to AWS, by:

- Tagging instances with `EphemeralRunnerFinished=finish_timestamp` when the job finishes, so scaleUp is hinted that the instance can be reused;
- Having scaleUp find instances tagged with `EphemeralRunnerFinished` and try to use them to run a new job;
- Tagging instances with `EBSVolumeReplacementRequestTm` to record when the instance was marked for reuse, and removing `EphemeralRunnerFinished` so others won't find the same instance for reuse;
- Having scaleDown check `minRunningTime` using either creation time, `EphemeralRunnerFinished`, or `EBSVolumeReplacementRequestTm`, depending on instance status.

A hedged sketch of this tagging flow follows below.
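A minimal sketch of the reuse-tagging flow described above, using the AWS SDK v2 calls involved; the function names and structure are illustrative, not the PR's exact code:

```ts
// Sketch only: illustrates the tag lifecycle for reusing ephemeral runners.
import { EC2 } from 'aws-sdk';

const ec2 = new EC2();

// Step 1: on job completion, mark the instance as reusable.
async function markRunnerFinished(instanceId: string): Promise<void> {
  await ec2
    .createTags({
      Resources: [instanceId],
      Tags: [{ Key: 'EphemeralRunnerFinished', Value: `${Math.floor(Date.now() / 1000)}` }],
    })
    .promise();
}

// Steps 2-3: when scaleUp picks the instance, claim it for reuse. Record the
// refresh-request time, drop the "finished" tag so no one else grabs the same
// instance, then refresh its root volume before it takes a new job.
async function claimForReuse(instanceId: string): Promise<void> {
  await ec2
    .createTags({
      Resources: [instanceId],
      Tags: [{ Key: 'EBSVolumeReplacementRequestTm', Value: `${Math.floor(Date.now() / 1000)}` }],
    })
    .promise();
  await ec2
    .deleteTags({
      Resources: [instanceId],
      Tags: [{ Key: 'EphemeralRunnerFinished' }],
    })
    .promise();
  await ec2.createReplaceRootVolumeTask({ InstanceId: instanceId }).promise();
}
```

scaleDown would then compute `minRunningTime` from the newest of the instance creation time and these two tags, as the last bullet above describes.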
Disaster recovery plan: https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.6p894klzung1