Reuse Ephemeral runners #6315

Merged

jeanschmidt merged 21 commits into main from jeanschmidt/reuse_runners

Mar 10, 2025

jeanschmidt (Contributor) commented Feb 21, 2025 (edited)

About

To eventually make all instances ephemeral, we need to fix the major limitation of ephemeral instances: stockouts.

Stockouts are a problem because we currently release each instance as soon as it finishes its job.

The goal is to reuse instances before returning them to AWS:

- Ephemeral instances that finished a job are tagged with `EphemeralRunnerFinished=finish_timestamp`, so scaleUp is hinted that they can be reused;
- scaleUp finds instances that have the `EphemeralRunnerFinished` tag and tries to use them to run a new job;
- scaleUp acquires a lock on the instance name to avoid concurrent reuse;
- scaleUp tags re-deployed instances with `EBSVolumeReplacementRequestTm`, recording when the instance was marked for reuse;
- scaleUp removes `EphemeralRunnerFinished` so others won't find the same instance for reuse;
- scaleUp creates the necessary SSM parameters and returns the instance to its fresh state by restoring the EBS volume.
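The reuse steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in this PR: the `Instance` shape, `tryLock`/`unlock`, and `claimForReuse` are hypothetical names, and a `Set` stands in for the distributed lock. Only the two tag names come from the description.

```typescript
// Hypothetical in-memory model; the real flow talks to EC2, SSM and a shared lock.
interface Instance {
  id: string;
  tags: Map<string, string>;
}

const FINISHED_TAG = 'EphemeralRunnerFinished';
const REPLACE_TAG = 'EBSVolumeReplacementRequestTm';

// Stand-in for the lock on the instance name (assumption: one process, no expiry).
const heldLocks = new Set<string>();

function tryLock(name: string): boolean {
  if (heldLocks.has(name)) return false;
  heldLocks.add(name);
  return true;
}

function unlock(name: string): void {
  heldLocks.delete(name);
}

// Returns the instance claimed for reuse, or undefined when none is available.
function claimForReuse(instances: Instance[], nowSeconds: number): Instance | undefined {
  for (const instance of instances) {
    if (!instance.tags.has(FINISHED_TAG)) continue; // busy, or never finished a job
    if (!tryLock(instance.id)) continue; // someone else is working on this instance
    try {
      // Re-check under the lock: a concurrent claimant may have removed the tag.
      if (!instance.tags.has(FINISHED_TAG)) continue;
      instance.tags.set(REPLACE_TAG, String(nowSeconds)); // mark when reuse was requested
      instance.tags.delete(FINISHED_TAG); // hide the instance from other claimants
      // The real flow would now create SSM parameters and restore the EBS volume.
      return instance;
    } finally {
      unlock(instance.id);
    }
  }
  return undefined;
}
```

Removing the finished tag while still holding the lock is what keeps two scaleUp invocations from claiming the same instance in this sketch.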

scaleDown then:

- Avoids removing ephemeral instances that are still within `minRunningTime`, measured from creation time, `EphemeralRunnerFinished`, or `EBSVolumeReplacementRequestTm`, depending on instance status.
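One way to read the scaleDown rule is sketched below. The precedence between the three timestamps is an assumption made for illustration (the description only says the choice depends on instance status), and `RunnerTimestamps`, `referenceTime`, and `isWithinMinRunningTime` are hypothetical names.

```typescript
// Hypothetical view of the timestamps scaleDown can read for an instance (epoch seconds).
interface RunnerTimestamps {
  launchTime: number;
  ephemeralRunnerFinished?: number; // value of the EphemeralRunnerFinished tag, if present
  ebsVolumeReplacementRequestTm?: number; // value of the EBSVolumeReplacementRequestTm tag, if present
}

// Assumed precedence: an idle, finished runner counts from its finish time,
// a runner being refreshed counts from the replacement request,
// and a never-reused runner counts from its creation time.
function referenceTime(r: RunnerTimestamps): number {
  return r.ephemeralRunnerFinished ?? r.ebsVolumeReplacementRequestTm ?? r.launchTime;
}

// True when the instance should not be removed yet.
function isWithinMinRunningTime(
  r: RunnerTimestamps,
  nowSeconds: number,
  minRunningTimeSeconds: number,
): boolean {
  return nowSeconds - referenceTime(r) < minRunningTimeSeconds;
}
```

Under this reading, an old instance that was just reused or just finished a job is protected for another `minRunningTime` window, which gives scaleUp a chance to pick it up.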

Disaster recovery plan:

https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.6p894klzung1

20250221080318

27605af

vercelbot commented Feb 21, 2025 (edited)

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

Name	Status	Preview	Updated (UTC)
torchci	⬜️ Ignored (Inspect)	Visit Preview	Mar 7, 2025 2:04pm

facebook-github-bot added the CLA Signed label

Feb 21, 2025

Merge branch 'main' into jeanschmidt/reuse_runners

8bdf74c

vercelbot deployed to Preview

February 26, 2025 16:14

jeanschmidt added 5 commits

February 26, 2025 17:21

20250226172157

0dff5ce

20250228171059

a7440ca

20250228180314

e543ec4

20250228181204

73d2688

20250228212509

4034502

jeanschmidt changed the title ~~[WIP] - Experimentations with runners reuse~~ [WIP] - Reuse Ephemeral runners

Feb 28, 2025

ZainRizvi reviewed

Feb 28, 2025

ZainRizvi left a comment

The approach looks quite reasonable!

terraform-aws-github-runner/modules/runners-instances/templates/install-config-runner.ps1 (outdated, resolved)

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/cache.test.ts (resolved)

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/runners.ts (outdated, resolved)

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/runners.ts

		metrics.ec2CreateReplaceRootVolumeTaskFailure,
		() => {
		return ec2
		.createReplaceRootVolumeTask({

Contributor:

Any idea how long this operation takes?

Contributor (Author):

no. I don't send metrics for it specifically on the spa.

Contributor (Author):

Around 500 ms; let's assume <1000 ms.

jeanschmidt added 6 commits

March 3, 2025 20:22

20250303202251

e2a0c6b

20250305153903

cc31240

20250305160640

462c68a

20250305160840

9aa05de

20250305162757

f7b0033

20250305171242

4fc719d

jeanschmidt (Contributor, Author) commented Mar 5, 2025

Workflow testing the changes: https://github.com/pytorch/pytorch-canary/actions/runs/13680037408/job/38251223446?pr=249

20250305174846

09fde96

jeanschmidt changed the title ~~[WIP] - Reuse Ephemeral runners~~ Reuse Ephemeral runners

Mar 5, 2025

jeanschmidt requested a review from ZainRizvi

March 5, 2025 16:49

jeanschmidt added 3 commits

March 5, 2025 17:55

20250305175541

bd5a8e8

20250305175922

d1e77e9

Merge branch 'main' of https://github.com/pytorch/test-infra into jeanschmidt/reuse_runners

112af31

vercelbot deployed to Preview

March 5, 2025 17:21

ZainRizvi reviewed

Mar 5, 2025

ZainRizvi left a comment

Finished reviewing the first half of this code. Will do the other half in a bit

terraform-aws-github-runner/modules/runners-instances/templates/install-config-runner.ps1 (outdated, resolved)

terraform-aws-github-runner/modules/runners/lambdas/runners/jest.config.js

Comment on lines +10 to +12

		functions:90,
		lines:93,
		statements:94

Contributor:

what does this do?

Contributor (Author):

Jest configuration.

I am raising the bar for the tests: since I wrote more tests and improved the coverage a bit, these thresholds make sure people maintain that standard rather than ignore tests.
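For context, a Jest coverage gate of this kind is usually expressed through the `coverageThreshold` option. A sketch, assuming the three values in the diff sit under the `global` key (the excerpt does not show the enclosing keys or any `branches` value, so none is given here); `jestConfig` is just an illustrative variable name:

```typescript
// Sketch of a Jest configuration object; a real jest.config file would export it.
const jestConfig = {
  coverageThreshold: {
    // Jest fails the run if overall coverage drops below any of these percentages.
    global: {
      functions: 90,
      lines: 93,
      statements: 94,
    },
  },
};
```

With thresholds like these, a change that lowers coverage below the stated percentages makes the test run fail instead of passing silently.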

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/cache.ts

		const redisData = JSON.parse(redisResponse, mapReviver) as RedisStore;
		while (+new Date() / 1000 < acquireTimeout) {
		try {
		if (tryRead !== undefined) {

Contributor:

when would tryRead ever be undefined?

Contributor (Author):

When you don't want to use it? I don't see the point of duplicating the code to have one function that supports a try-read and one that does not.

We have both cases in our application.
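To illustrate the point about not duplicating code, here is a simplified, synchronous sketch of a helper whose try-read step is optional. `computeWithOptionalRead` and its parameters are hypothetical names; the actual function in cache.ts is asynchronous and also deals with locking and timeouts, which are left out here.

```typescript
// One helper serves both kinds of caller: tryRead is consulted only when provided.
function computeWithOptionalRead<T>(compute: () => T, tryRead?: () => T | undefined): T {
  if (tryRead !== undefined) {
    const cached = tryRead();
    if (cached !== undefined) return cached; // value already available: skip the work
  }
  return compute();
}

// Caller that wants the stored value when there is one.
const store = new Map<string, number>([['answer', 42]]);
const fromStore = computeWithOptionalRead(() => 0, () => store.get('answer'));

// Caller that always wants a fresh computation simply omits tryRead.
const freshValue = computeWithOptionalRead(() => 7);
```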

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/cache.ts

		async () => {
		if (!redisPool) throw Error('Redis should be initialized!');

		console.debug(`Trying to get a redis client ${queryKey}`);

Contributor:

nit: are debug statements like this important to leave in?

Contributor (Author):

Maybe just at the beginning, if we need to troubleshoot something. I would rather have them, for now.

terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/runners.ts (outdated, resolved)

jeanschmidt added 2 commits

March 6, 2025 14:26

Proposal from zain to 'onliner' some powershell code

9f34c4b

replacing tag named ebsVolumeReplacementRequestTm with ebsVolumeReplacementRequestTimestamp

8d59d65

jeanschmidt (Contributor, Author) commented Mar 6, 2025

@ZainRizvi

Could you please share your major concerns about moving forward with the implementation? Since you didn't approve, I assume you have more concerns than the syntax improvements you shared in your comments.

atalman requested changes

Mar 6, 2025

atalman left a comment

Looks good. Could you please share the test plan for this? Can it be tested in canary?

jeanschmidt (Contributor, Author) commented Mar 6, 2025 (edited)

Workflow testing the changes: https://github.com/pytorch/pytorch-canary/actions/runs/13680037408/job/38251223446?pr=249

Sure, I did test in canary, here is the workflow showing the runner being reused according to plan:

#6315 (comment)

This paste shows how it will impact our internal infra (terraform plan): P1747996633

(Edit: updated the paste, as the previous one was flagged.)

atalman (Contributor) commented Mar 6, 2025

Can it be deployed to canary only, so we can run a couple of PRs and then explore the runner logs for any possible issues or failures?

jeanschmidt (Contributor, Author) commented Mar 6, 2025

> Can it be deployed to canary only, so we can run a couple of PRs and then explore the runner logs for any possible issues or failures?

Yes. What else do you need apart from a pytorch-canary CI run with the results of the test? Do you want me to reproduce the test and paste the full logs from scaleUp?

atalman (Contributor) commented Mar 6, 2025

I believe we probably want to run full CI/CD using this lambda in canary, using the ciflow/binaries and ciflow/trunk labels, before deploying to prod.

atalman approved these changes

Mar 6, 2025

atalman left a comment

lgtm

ZainRizvi approved these changes

Mar 6, 2025

ZainRizvi (Contributor) commented Mar 6, 2025

Can you please update the PR with the rollout/rollback plan for this change? Would be good to see how this is expected to be safely tested in prod with the option for a fast rollback if something goes wrong

Merge branch 'main' of https://github.com/pytorch/test-infra into jeanschmidt/reuse_runners

74976e6

vercelbot deployed to Preview

March 7, 2025 13:44

Ops, small mistake on powershell code

ddc17dd

jeanschmidt merged commit fc07220 into main

Mar 10, 2025

6 checks passed

jeanschmidt deleted the jeanschmidt/reuse_runners branch

March 10, 2025 12:47

This was referenced Mar 11, 2025

Enable experimentation with ephemeral runners on pull.yml pytorch/pytorch#148897

Closed

Add ephemeral variants for all runner types, and updates validate_scale_config.py to enforce it #6402

Merged

jeanschmidt added a commit that referenced this pull request

Mar 13, 2025

Add ephemeral variants for all runner types, and updates validate_scale_config.py to enforce it (#6402)

b6abb4a

# TLDR

Adds `ephemeral` variant to all runner types, and updates validate_scale_config.py to enforce it.

# What?

Enable the experiment with ephemeral runners for all workflows. The status of the experiment can be found in the [test-infra issue](#5132).

# Why?

Those runners are ephemeral, as eliminating nonephemeral runners is a follow-up to the recent security incident. Refreshable infrastructure has been something we've been trying to accomplish for a while, but haven't been successful. The major blocker we face is related to stockouts and unreliability on the GH side. Most of it is because nonephemeral runners can run other jobs and continue clearing the queue in case of a problem. This is not possible for ephemeral runners.

# How?

To remediate stockouts, the [reuse/refresh of ephemeral instances](#6315) has been introduced.

In order to remediate GH-side issues, a [queue healing mechanism](#6390) is being implemented.

Also, in order to guarantee the stability of this behaviour, [additional unit tests have been included with other changes](#6403).

# Next steps

After merging those changes, we intend to put a small percentage of jobs on ephemeral runners, so we can evaluate the impact on queue times, gather statistics on reuse, and tune noflap cluster behaviour. Once we feel comfortable, the experiment will be shifted to 100% and we'll migrate all workflows to fully ephemeral instances. Eventually, all runners will be ephemeral and the experiment and runner variants will be removed, as we update other workflows like slow, trunk and nightlies.

# The ephemeral runners are not working properly?

Then go to the experiment in the `What?` section of this document and set the experiment named `ephemeral` to 0.

Camyll pushed a commit that referenced this pull request

Mar 13, 2025

Add ephemeral variants for all runner types, and updates validate_scale_config.py to enforce it (#6402)

084c572

Labels

CLA Signed

This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

4 participants
