Tags: pytorch/test-infra
v20250317-134413
Adds scaleUpHealing chron (#6412)

# TLDR

This change introduces a new lambda, `${var.environment}-scale-up-chron`, along with all the TypeScript code and required Terraform changes.

# What is changing?

This PR introduces the TypeScript code for the new lambda and the related Terraform changes to run it every 30 minutes, with a 15-minute timeout. Its permissions and access are the same as those of scaleUp.

The lambda queries HUD at a URL specified by the user configuration `retry_scale_up_chron_hud_query_url` and receives a list of instance types and the number of jobs enqueued for each. It then synchronously tries to deploy those runners. A minimal sketch of this flow is included below.

It introduces 2 new parameters in the main module:

* `retry_scale_up_chron_hud_query_url`, which for now should point to https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D only in the installations that will benefit from it (both Meta and Linux Foundation PROD clusters, NOT canary). When this variable is set to the empty string (the default), this cron is not installed.
* `scale_config_org`, which should point to the org where scale-config files are defined. In our case it is `pytorch`.

[Example of the change](https://github.com/pytorch-labs/pytorch-gha-infra/pull/622/files)

# Why are we changing this?

We're introducing this change to help recover lost requests for infra scaling. It has been evident for a while that during GitHub API outages we fail to receive new job webhooks or fail to provision new runners. Most of the time our retry mechanism can deal with the situation, but when we are not receiving webhooks, or in other more esoteric failure modes, there is no way to recover.

With this change, every 30 minutes, jobs enqueued for longer than 30 minutes for one of the autoscaled instance types will trigger the creation of those instances.

A few design decisions:

1. Why rely on HUD? HUD already has this information, so it is simple to just get it from there.
2. Why not send a scale message and let scaleUp handle it? We want isolation, so that we can easily circuit-break the creation of enqueued instances. This isolation also guarantees that if the scaler is failing to deploy a given instance type, this mechanism won't risk flooding or overflowing the main scaler, which has to deal with all the other types.
3. Why randomize the instance creation order? If some instance type is problematic, we are not completely preventing the recovery of other instance types (just interfering with it). We also gain some time between creations of instances of the same type, allowing for smoother operation.
4. Why a new lambda? See point 2.

# If something goes wrong?

Given that we did as much work as possible to ensure maximal isolation between the regular scaler and the cron recovery scaler introduced here, we are not foreseeing any gaps that could break the main scaler and, as a consequence, cause system breakages. Having said that, if you need to revert these changes from production, just follow the steps in https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.jwflgevrww4j

---------

Co-authored-by: Zain Rizvi <ZainR@meta.com>
Co-authored-by: Camyll Harajli <camyllh@meta.com>
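A minimal sketch of the cron's core loop, assuming a HUD response shape of `{ queued_jobs: [...] }` and a hypothetical `scaleUpInstance()` helper; field and function names here are illustrative, not the actual implementation in test-infra:

```typescript
// Illustrative shape of one row returned by the HUD queued-jobs aggregate query.
interface QueuedJobAggregate {
  runner_label: string;
  num_queued_jobs: number;
  min_queue_time_minutes: number;
}

async function scaleUpChron(hudQueryUrl: string): Promise<void> {
  const response = await fetch(hudQueryUrl);
  const payload = (await response.json()) as { queued_jobs: QueuedJobAggregate[] };

  // Only act on jobs that have been waiting longer than the cron interval.
  const stale = payload.queued_jobs.filter((j) => j.min_queue_time_minutes >= 30);

  // Randomize the order so one problematic instance type cannot starve the rest.
  const shuffled = [...stale].sort(() => Math.random() - 0.5);

  for (const job of shuffled) {
    for (let i = 0; i < job.num_queued_jobs; i++) {
      // Deploy runners synchronously, one at a time, to keep the load smooth.
      await scaleUpInstance(job.runner_label);
    }
  }
}

// Hypothetical placeholder for the actual runner-provisioning call.
async function scaleUpInstance(runnerLabel: string): Promise<void> {
  console.log(`would provision a runner of type ${runnerLabel}`);
}
```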
v20250313-185750
Adds additional tests to getRunnerTypes, simplifies code a bit, adds the support for `c.` runners (#6403)

## Support for `c.` runners

We don't support our canary environment with variants very well (not sure why), and this condition was not present, so variants got the wrong naming when running in Meta's canary environment: `variant.c.runner.name` instead of `c.variant.runner.name`. This fix is very low impact and should not change anything in production. A small sketch of the naming rule follows below.

## Simplifies the code

There is a useless `if` that I noticed when reading `getRunnerTypes`. I am removing that conditional.

## Additional tests for getRunnerTypes

As we're adding empty variants as part of the ephemeral migration, I want to make sure that we have green tests and that we cover this situation in our unit tests. The new situations to cover are:

1) Empty variants
2) Meta's canary environment naming convention `c.`
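A minimal sketch of the naming rule described above; the real logic lives in `getRunnerTypes`, and the function and parameter names here are illustrative only:

```typescript
// Builds the variant-qualified runner name. In the canary environment the `c.`
// prefix must stay at the front (`c.variant.runner.name`), not after the
// variant (`variant.c.runner.name`).
function variantRunnerName(baseRunnerName: string, variant: string): string {
  // Empty variants fall back to the plain runner name.
  if (!variant) {
    return baseRunnerName;
  }
  if (baseRunnerName.startsWith('c.')) {
    return `c.${variant}.${baseRunnerName.slice('c.'.length)}`;
  }
  return `${variant}.${baseRunnerName}`;
}

// Example: variantRunnerName('c.linux.4xlarge', 'amz2023') -> 'c.amz2023.linux.4xlarge'
```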
v20250310-124810
Reuse Ephemeral runners (#6315)

# About

With the goal of eventually moving to all instances being ephemeral, we need to fix the major limitation we have with ephemeral instances: stockouts. This is a problem because we currently release instances as soon as they finish a job.

The goal is to reuse instances before returning them to AWS by:

* Tagging ephemeral instances that finished a job with `EphemeralRunnerFinished=finish_timestamp`, hinting to scaleUp that they can be reused;
* scaleUp finds instances that have the `EphemeralRunnerFinished` tag and tries to use them to run a new job;
* scaleUp acquires a lock on the instance name to avoid concurrent reuse;
* scaleUp marks re-deployed instances with the `EBSVolumeReplacementRequestTm` tag, recording when the instance was marked for reuse;
* scaleUp removes `EphemeralRunnerFinished` so others won't find the same instance for reuse;
* scaleUp creates the necessary SSM parameters and returns the instance to its fresh state by restoring the EBS volume.

ScaleDown then:

* Avoids removing ephemeral instances by `minRunningTime`, using either the creation time, `EphemeralRunnerFinished`, or `EBSVolumeReplacementRequestTm`, depending on instance status.

A sketch of the tag handshake is included after this note.

# Disaster recovery plan:

If this PR introduces breakages, they will almost certainly be related to the capacity to deploy new instances/runners rather than to any different behaviour in the runner itself. So, after reverting this change, it is important to make sure the runner queue is under control. This can be accomplished by checking the queue size on [HUD metrics](https://hud.pytorch.org/metrics) and running the [send_scale_message.py](https://github.com/pytorch-labs/pytorch-gha-infra/blob/main/scale_tools/send_scale_message.py) script to make sure those instances are properly deployed by the stable version of the scaler.

## Step by step to revert this change from **META**

1 - Identify whether this PR is causing the observed problem: [look at queue size](https://hud.pytorch.org/metrics) and whether it is related to impacted runners (ephemeral ones). It can also help to investigate the [metrics on unidash](https://www.internalfb.com/intern/unidash/dashboard/aws_infra_monitoring_for_github_actions/lambda_scaleup) and the [logs](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/gh-ci-scale-up?tab=monitoring) related to the scaleUp lambda.

2 - If you confirm that the problem is triggered by this PR, revert it from main, so it won't cause impact again if someone else working on other changes accidentally releases a version of test-infra that includes it.

3 - To restore the infrastructure to the point before this change:

A) Find the commit (or, unlikely, more than one) on pytorch-gha-infra that points to a release version of test-infra containing this change (most likely the latest). It will be a change updating the Terrafile to point to a newer version of test-infra ([example](pytorch-labs/pytorch-gha-infra@c4e888f)). By convention we name such commits `Release vDATE-TIME`, like `Release v20250204-163312`.

B) Revert that commit from https://github.com/pytorch-labs/pytorch-gha-infra

C) Follow [the steps](https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.vj4fvy46wzwk) outlined in the Pytorch GHA Infra runbook.

D) That document contains pointers to monitoring and to confirming that the metrics / queue / logs you identified are recovering, and how to make sure you have recovered.

4 - Restore user experience:

A) If you have access, follow the [instructions on how to recover ephemeral queued jobs](https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.ba0nyrda8jch) in the above-mentioned document.

B) Another option is to cancel the queued jobs and trigger them again.
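A minimal sketch of the reuse handshake described above, using the tag names from this note (`EphemeralRunnerFinished`, `EBSVolumeReplacementRequestTm`). The locking, SSM-parameter creation, and EBS volume replacement steps are omitted, and the `RunnerType` tag is illustrative; this only shows the tag bookkeeping and is not the actual scaleUp code:

```typescript
import {
  EC2Client,
  DescribeInstancesCommand,
  CreateTagsCommand,
  DeleteTagsCommand,
} from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({});

async function findReusableEphemeralRunner(runnerType: string): Promise<string | undefined> {
  // Look for a running instance of the right type that finished its job and
  // was hinted as reusable via the EphemeralRunnerFinished tag.
  const result = await ec2.send(
    new DescribeInstancesCommand({
      Filters: [
        { Name: 'tag-key', Values: ['EphemeralRunnerFinished'] },
        { Name: 'tag:RunnerType', Values: [runnerType] }, // illustrative tag name
        { Name: 'instance-state-name', Values: ['running'] },
      ],
    })
  );
  return result.Reservations?.[0]?.Instances?.[0]?.InstanceId;
}

async function markInstanceForReuse(instanceId: string): Promise<void> {
  const now = Math.floor(Date.now() / 1000).toString();
  // Record when the instance was marked for reuse so scaleDown can apply
  // minRunningTime against this timestamp instead of the creation time.
  await ec2.send(
    new CreateTagsCommand({
      Resources: [instanceId],
      Tags: [{ Key: 'EBSVolumeReplacementRequestTm', Value: now }],
    })
  );
  // Remove the reuse hint so no other scaleUp invocation picks the same instance.
  await ec2.send(
    new DeleteTagsCommand({
      Resources: [instanceId],
      Tags: [{ Key: 'EphemeralRunnerFinished' }],
    })
  );
}
```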
v20250306-173054
Adding tooling and documentation for locally run tflint (#6370)

Created a Makefile in `./terraform-aws-github-runner` to perform tflint actions, and replaced the tflint calls in CI (`tflint.yml`) with this Makefile. This makes it much easier to test locally and get green signals on CI, reducing the loop time for fixing small syntax bugs.
v20250305-171119
[Bugfix] wait for ssm parameter to be created (#6359)

Sometimes the SSM parameter is not properly created. After investigation, I identified that the promise was not being properly awaited, which could cause some operations to be canceled.
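An illustration of the class of bug described above: an un-awaited promise can be dropped when the Lambda invocation ends, so the SSM parameter is never created. The function and parameter names are illustrative, not the actual ones used by scaleUp:

```typescript
import { SSMClient, PutParameterCommand } from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});

// Buggy pattern: the call returns a promise that is never awaited, so the
// Lambda may freeze or exit before the request completes, and errors are lost.
function createRunnerTokenParameterBuggy(name: string, token: string): void {
  ssm.send(new PutParameterCommand({ Name: name, Value: token, Type: 'SecureString' }));
}

// Fixed pattern: await the SSM call (and surface its errors) before returning.
async function createRunnerTokenParameter(name: string, token: string): Promise<void> {
  await ssm.send(new PutParameterCommand({ Name: name, Value: token, Type: 'SecureString' }));
}
```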
v20250205-165758
20250205175711
v20250205-164602
20250205174527
v20250205-163646
20250205173601
v20250205-163308
20250205173224
v20250205-161117
Adds ci-queue-pct lambda code to aws/lambdas and include it to the release