Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Reuse Ephemeral runners#6315

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
jeanschmidt merged 21 commits intomainfromjeanschmidt/reuse_runners
Mar 10, 2025
Merged

Reuse Ephemeral runners#6315

jeanschmidt merged 21 commits intomainfromjeanschmidt/reuse_runners
Mar 10, 2025

Conversation

jeanschmidt
Copy link
Contributor

@jeanschmidtjeanschmidt commentedFeb 21, 2025
edited
Loading

About

With the goal to eventually move to all instances being ephemeral, we need to fix the major limitation we have with ephemeral instances: stockouts.

This is a problem as we currently release the instances when they finish the job.

The goal is to make the instances to be reused before return them to AWS by:

  • Tagging ephemeral instances that finished a job withEphemeralRunnerFinished=finish_timestamp so scaleUp is hinted that it can be reused;
  • scaleUp finds instances that have theEphemeralRunnerFinished and try to use them to run a new job;
  • scaleUp acquires lock on the instance name to avoid concurrency on reuse;
  • scaleUp mark instances re-deployed withEBSVolumeReplacementRequestTm tagging when the instance was marked for reuse;
  • scaleUp removeEphemeralRunnerFinished so others won't find the same instance for reuse;
  • scaleUp creates the necessary SSM parameters and return the instance to its fresh state by restoring EBS volume;

ScaleDown then:

  • Avoids removing ephemeral instances byminRunningTime using either creation time orEphemeralRunnerFinished orEBSVolumeReplacementRequestTm depending on instance status;

Disaster recovery plan:

https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.6p894klzung1

@vercelVercel
Copy link

vercelbot commentedFeb 21, 2025
edited
Loading

The latest updates on your projects. Learn more aboutVercel for Git ↗︎

1 Skipped Deployment
NameStatusPreviewUpdated (UTC)
torchci⬜️ Ignored (Inspect)Visit PreviewMar 7, 2025 2:04pm

@facebook-github-botfacebook-github-bot added the CLA SignedThis label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. labelFeb 21, 2025
@jeanschmidtjeanschmidt changed the title[WIP] - Experimentations with runners reuse[WIP] - Reuse Ephemeral runnersFeb 28, 2025
Copy link
Contributor

@ZainRizviZainRizvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The approach looks quite reasonable!

metrics.ec2CreateReplaceRootVolumeTaskFailure,
() => {
return ec2
.createReplaceRootVolumeTask({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Any idea how long this operation takes?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

no. I don't send metrics for it specifically on the spa.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Screenshot 2025-03-06 at 14 30 28

arround 500ms, lets assume <1000ms

ZainRizvi reacted with thumbs up emoji
@jeanschmidt
Copy link
ContributorAuthor

@jeanschmidtjeanschmidt changed the title[WIP] - Reuse Ephemeral runnersReuse Ephemeral runnersMar 5, 2025
Copy link
Contributor

@ZainRizviZainRizvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Finished reviewing the first half of this code. Will do the other half in a bit

Comment on lines +10 to +12
functions:90,
lines:93,
statements:94
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

what does this do?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

jest configuration.

I am just raising the bar of the tests, as I wrote more tests and improved the coverage a bit more, I am just making sure people are maintaining the standard, over ignoring tests.

const redisData = JSON.parse(redisResponse, mapReviver) as RedisStore;
while (+new Date() / 1000 < acquireTimeout) {
try {
if (tryRead !== undefined) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

when would tryRead ever be undefined?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

when you don't want to use it? I don't think there is a point to replicate the code to write a function that supports a try-read and one that does not.

We have both cases in our application.

async () => {
if (!redisPool) throw Error('Redis should be initialized!');

console.debug(`Trying to get a redis client ${queryKey}`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

nit: are debug statements like this important to leave in?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Maybe just at the beginning, if we need to troubleshoot something. I would rather have them, for now.

@jeanschmidt
Copy link
ContributorAuthor

@ZainRizvi

Could you please share your major concern with not moving forward with the implementation? I assume that as you didn't approve it is because you have more concerns than just some syntaxes improvements you shared in your comments.

Copy link
Contributor

@atalmanatalman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

looks good. Could you please share the test plan for this ? Can it be tested in canary ?

@jeanschmidt
Copy link
ContributorAuthor

jeanschmidt commentedMar 6, 2025
edited
Loading

Workflow testing the changes:https://github.com/pytorch/pytorch-canary/actions/runs/13680037408/job/38251223446?pr=249

Sure, I did test in canary, here is the workflow showing the runner being reused according to plan:

#6315 (comment)

This paste shows how it will impact our internal infra (terraform plan): P1747996633

(edit updated paste, as the one before was flagged)

@atalman
Copy link
Contributor

Can it be deployed to canary only, so we can run couple of PRs and then explore logs of runners for any possible issues or failures ?

@jeanschmidt
Copy link
ContributorAuthor

Can it be deployed to canary only, so we can run couple of PRs and then explore logs of runners for any possible issues or failures ?

Yes, what else do you need except from a pytorch-canary CI with the results of the test? Do you want me to reproduce the test and paste the full logs from scaleUp?

@atalman
Copy link
Contributor

I believe we probably want to run full CI/CD using this lambda in canary usingciflow/binaries andciflow/trunk labels. Before deploying to prod.

Copy link
Contributor

@atalmanatalman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

lgtm

@ZainRizvi
Copy link
Contributor

Can you please update the PR with the rollout/rollback plan for this change? Would be good to see how this is expected to be safely tested in prod with the option for a fast rollback if something goes wrong

jeanschmidt reacted with thumbs up emoji

@jeanschmidtjeanschmidt merged commitfc07220 intomainMar 10, 2025
6 checks passed
@jeanschmidtjeanschmidt deleted the jeanschmidt/reuse_runners branchMarch 10, 2025 12:47
jeanschmidt added a commit that referenced this pull requestMar 13, 2025
…le_config.py to enforce it (#6402)# TLDRAdds `ephemeral` variant to all runner types, and updatesvalidate_scale_config.py to enforce it# What?Enable experiment with ephemeral runners for all workflows. The statusof the experiment can be found in the [test-infraissue](#5132).# Why?Those runners are ephemeral, as eliminating nonephemeral runners is afollow up for the recent security incident. Refreshable infrastructurehave been something we've trying to accomplish for a while, but haven'tbeen successful. The major blocker we face is related to stockouts andunreliability from GH side. Most of it is because nonephemeral runnerscan run other jobs and continue clearing the queue in case of a problem.This is not possible for ephemeral runners.# How?To remediate stockouts, the [reuse/refresh of ephemeralinstances](#6315) have beenintroduced.In order to remediate GH side issues, [queue healingmechanism](#6390) is beingimplemented.Also, in order to guarantee the stability of this behaviour, [additionalunit tests have been included with otherchanges](#6403).# Next stepsAfter merging those changes, we intend to put a small percentage of jobsto use ephemeral runners, so we can evaluate impact on queue times andgather statistics on reuse and tune noflap cluster behaviour. Once wefeel comfortable the experiment will be shifted to 100% and we'llmigrate all workflows to fully ephemeral instances. Eventually, allrunners will be ephemeral and the experiment and runner variants will beremoved, as we update other workflows like slow, trunk and nightlies.# The ephemeral runners are not working properly?Then go to the experiment on the `What?` section of this document andset the experiment named `ephemeral` it to 0.
Camyll pushed a commit that referenced this pull requestMar 13, 2025
…le_config.py to enforce it (#6402)# TLDRAdds `ephemeral` variant to all runner types, and updatesvalidate_scale_config.py to enforce it# What?Enable experiment with ephemeral runners for all workflows. The statusof the experiment can be found in the [test-infraissue](#5132).# Why?Those runners are ephemeral, as eliminating nonephemeral runners is afollow up for the recent security incident. Refreshable infrastructurehave been something we've trying to accomplish for a while, but haven'tbeen successful. The major blocker we face is related to stockouts andunreliability from GH side. Most of it is because nonephemeral runnerscan run other jobs and continue clearing the queue in case of a problem.This is not possible for ephemeral runners.# How?To remediate stockouts, the [reuse/refresh of ephemeralinstances](#6315) have beenintroduced.In order to remediate GH side issues, [queue healingmechanism](#6390) is beingimplemented.Also, in order to guarantee the stability of this behaviour, [additionalunit tests have been included with otherchanges](#6403).# Next stepsAfter merging those changes, we intend to put a small percentage of jobsto use ephemeral runners, so we can evaluate impact on queue times andgather statistics on reuse and tune noflap cluster behaviour. Once wefeel comfortable the experiment will be shifted to 100% and we'llmigrate all workflows to fully ephemeral instances. Eventually, allrunners will be ephemeral and the experiment and runner variants will beremoved, as we update other workflows like slow, trunk and nightlies.# The ephemeral runners are not working properly?Then go to the experiment on the `What?` section of this document andset the experiment named `ephemeral` it to 0.
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@ZainRizviZainRizviZainRizvi approved these changes

@atalmanatalmanatalman approved these changes

Assignees
No one assigned
Labels
CLA SignedThis label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

4 participants
@jeanschmidt@atalman@ZainRizvi@facebook-github-bot

[8]ページ先頭

©2009-2025 Movatter.jp