Tags: pytorch/test-infra
v20250709-181311
[ez][CH] Fix infra_metrics.cloudwatch_metrics schema: use DateTime64 (#6909)

The timestamp used by CloudWatch has milliseconds, so change the timestamp field to match that.

Testing: replaced the old table, then ran `python tools/rockset_migration/s32ch.py --clickhouse-table "infra_metrics.cloudwatch_metrics" --stored-data t.json --s3-bucket fbossci-cloudwatch-metrics --s3-prefix ghci-related`
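The precision difference is easy to see in Python: an epoch-millisecond timestamp only survives if the column keeps sub-second precision, which is what the switch to `DateTime64(3)` buys. A minimal sketch (the sample timestamp value is illustrative):

```python
from datetime import datetime, timezone

# CloudWatch emits epoch timestamps in milliseconds, e.g. 1752084791437.
raw_ms = 1752084791437

# Second precision (what a plain DateTime column keeps): milliseconds lost
seconds_only = datetime.fromtimestamp(raw_ms // 1000, tz=timezone.utc)

# Millisecond precision (what DateTime64(3) keeps)
with_millis = seconds_only.replace(microsecond=(raw_ms % 1000) * 1000)

print(seconds_only.microsecond)  # 0
print(with_millis.microsecond)   # 437000
```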
v20250708-173352
[ghinfra] Set up ingestion from s3 -> clickhouse for cloudwatch (#6898)

Path: cloudwatch metrics -> firehose -> s3 (new bucket fbossci-cloudwatch-metrics) -> clickhouse

This is the s3 -> clickhouse part. I think clickhouse has some built-in ingestions for kinesis but I'm lazy...

Requires pytorch-labs/pytorch-gha-infra#751

Testing: ran the python code via `python tools/rockset_migration/s32ch.py --clickhouse-table "infra_metrics.cloudwatch_metrics" --stored-data t.json --s3-bucket fbossci-cloudwatch-metrics --s3-prefix ghci-related`
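The s3 -> clickhouse leg amounts to reading the objects Firehose wrote and turning each JSON record into a row for a batched insert. A minimal sketch of the parsing half, assuming newline-delimited JSON objects; the record shape and field names here are hypothetical, and the real logic lives in `tools/rockset_migration/s32ch.py`:

```python
import json

def records_to_rows(body: str) -> list[dict]:
    """Parse a Firehose-delivered S3 object body (assumed to be
    newline-delimited JSON) into rows ready for a batched insert."""
    rows = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        rows.append(json.loads(line))
    return rows

# Example: two metric records as they might be concatenated in one object
body = (
    '{"metric_name": "queue_size", "value": 12, "timestamp": 1752084791437}\n'
    '{"metric_name": "queue_size", "value": 9, "timestamp": 1752084792437}\n'
)
rows = records_to_rows(body)
print(len(rows))  # 2
```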
v20250703-021349
Add revert category extraction and exclude `ghfirst` reverts from stats (#6882)

Adds revert category extraction from GitHub comments and excludes `ghfirst` reverts from precision/recall metrics.

## Changes

### 1. Added Revert Category Extraction
- New method `extract_revert_categories_batch()` in `autorevert_checker.py`
- Extracts categories (`nosignal`, `ignoredsignal`, `landrace`, `weird`, `ghfirst`) from GitHub issue comments
- Single batch query for performance

### 2. Enhanced `get_commits_reverted_with_info()`
- Now includes category information for each revert
- Uses batch extraction for all reverts at once

### 3. Updated Metrics Calculation
- Excludes `ghfirst` reverts from recall calculation
- Shows category breakdown in summary statistics
- Per-workflow precision now shows both total and non-ghfirst metrics

### 4. Fixed Pattern Detection Bug
- Fixed `AttributeError: 'NoneType' object has no attribute 'head_sha'`
- Created proper mapping between failures and their newer commits

<details><summary>Bug Fix Details</summary>

**Problem**: `newer_commit_same_job` was used outside its loop scope

**Solution**: Created `failure_to_newer_commit` dict to track mappings

```python
# Map each failure to its newer commit
failure_to_newer_commit = {}
for (rule, job) in suspected_failures:
    newer_commit_same_job, newer_same_jobs = self._find_last_commit_with_job(...)
    if newer_commit_same_job and any(...):
        failure_to_newer_commit[(rule, job)] = newer_commit_same_job

# Use mapping in pattern creation
for (failure_rule, job_name), newer_commit in failure_to_newer_commit.items():
    patterns.append({
        "newer_commits": [newer_commit.head_sha, suspected_commit1.head_sha],
        ...
    })
```

</details>

## Example Output

```
python -m pytorch_auto_revert autorevert-checker Lint trunk pull inductor linux-binary-manywheel --hours 720 --verbose
==================================================
SUMMARY STATISTICS
==================================================
Workflow(s): Lint, trunk, pull, inductor, linux-binary-manywheel
Timeframe: 720 hours
Commits checked: 6741
Auto revert patterns detected: 419
Actual reverts inside auto revert patterns detected (precision): 50 (11.9%)
Total revert commits in period: 121
Revert categories:
  nosignal: 46 (38.0%)
  ghfirst: 28 (23.1%)
  uncategorized: 21 (17.4%)
  ignoredsignal: 16 (13.2%)
  weird: 9 (7.4%)
  landrace: 1 (0.8%)
Total reverts excluding ghfirst: 93
Reverts (excluding ghfirst) that dont match any auto revert pattern detected (recall): 50 (53.8%)
Per workflow precision:
  Lint: 6 reverts out of 17 patterns (35.3%) [excluding ghfirst: 6 (35.3%)]
  trunk: 2 reverts out of 14 patterns (14.3%) [excluding ghfirst: 2 (14.3%)]
  pull: 40 reverts out of 354 patterns (11.3%) [excluding ghfirst: 33 (9.3%)]
  inductor: 2 reverts out of 31 patterns (6.5%) [excluding ghfirst: 2 (6.5%)]
  linux-binary-manywheel: 0 reverts out of 3 patterns (0.0%) [excluding ghfirst: 0 (0.0%)]
Reverted patterns:
 - Python RuntimeError: e1aee866 (ignoredsignal)
 - GitHub workflows weren't regenerated: 3b6569b1 (ignoredsignal)
 - GitHub workflows weren't regenerated: bbbced94 (landrace)
 - Python RuntimeError: 060838c2 (nosignal)
 - Lintrunner failure: 1a55fb0e (weird)
 - Lintrunner failure: 3239da0c (nosignal)
 - MSVC compiler error: eab45643 (ignoredsignal)
 - Bad response status code: ea7b2330 (uncategorized)
 - gtest failure: 347ace4c (nosignal)
 - pytest failure: 216bd609 (nosignal)
 - pytest failure: 84c588e5 (ignoredsignal)
 - GHA error: 863327ae (ignoredsignal)
 - GHA error: eb9efb37 (ghfirst)
 - GHA error: 9c39bc24 (ignoredsignal)
 - Python Test File RuntimeError: 6de41ce0 (uncategorized)
 - Fallback for other test failure rules: 3f920f3d (nosignal)
 - pytest failure: f179b719 (ghfirst)
 - pytest failure: 92409b6c (uncategorized)
 - pytest failure: d1b4e0fa (nosignal)
 - pytest failure: 099d0d61 (uncategorized)
 - pytest failure: c79c7bbe (nosignal)
 - pytest failure: c95f7fa8 (nosignal)
 - pytest failure: 08dae945 (weird)
 - pytest failure: fb75dea2 (uncategorized)
 - pytest failure: 9de23d0c (uncategorized)
 - pytest failure: 830a335a (ghfirst)
 - pytest failure: 6d3a4356 (ignoredsignal)
 - pytest failure: a6a3a441 (nosignal)
 - Python Test timeout (KeyboardInterrupt): 8142a028 (nosignal)
 - pr_time_benchmarks regression: 2b9d638e (weird)
 - pytest failure: dc5e8f79 (nosignal)
 - pytest failure: 5264f8cd (weird)
 - pytest failure: 8823138e (nosignal)
 - pytest failure: f154f9b3 (ignoredsignal)
 - pr_time_benchmarks regression: b07725a9 (ghfirst)
 - pr_time_benchmarks regression: d4d0ede6 (ignoredsignal)
 - Build error: 2596e3d0 (nosignal)
 - pytest failure: c1f531f0 (ghfirst)
 - GHA error: c6b4f986 (nosignal)
 - GHA error: 529e0357 (nosignal)
 - GHA error: e694280d (ghfirst)
 - GHA error: e1180c72 (weird)
 - GHA error: 7dcc77e4 (nosignal)
 - GHA error: a3098a74 (ignoredsignal)
 - GHA error: a14f427d (nosignal)
 - GHA error: bee9c70c (nosignal)
 - GHA error: 409c396a (ghfirst)
 - GHA error: 67fb9b7c (nosignal)
 - GHA error: 1b50c125 (nosignal)
 - pytest failure: 196c95d4 (nosignal)
```
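The category extraction boils down to spotting the classification in the revert comment. A minimal sketch, assuming the category appears as a `-c` flag in a `@pytorchbot revert` command; the comment format here is an assumption, not the exact parser inside `extract_revert_categories_batch()`:

```python
import re

CATEGORIES = {"nosignal", "ignoredsignal", "landrace", "weird", "ghfirst"}

def extract_category(comment: str) -> str:
    """Pull the revert category out of a revert command comment,
    falling back to 'uncategorized' when no known category is found."""
    m = re.search(r"-c\s+(\w+)", comment)
    if m and m.group(1) in CATEGORIES:
        return m.group(1)
    return "uncategorized"

print(extract_category("@pytorchbot revert -m 'breaks trunk' -c nosignal"))  # nosignal
print(extract_category("@pytorchbot revert -m 'internal failure'"))          # uncategorized
```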
v20250630-183403
[Pytorch AutoRevert] - Improves autorevert check heuristics (#6853)

Makes some improvements in the back analysis for the revert logic, with the goal of improving precision and recall and validating it as a viable strategy.

Checked against the workflows: pull trunk inductor linux-binary-manywheel

Old code:
```
Timeframe: 720 hours
Commits checked: 6177
Auto revert patterns detected: 188
Actual reverts inside auto revert patterns detected: 24 (12.8%)
Total revert commits in period: 115
Reverts that dont match any auto revert pattern detected: 91
```

Newer code:
```
Workflow(s): pull, trunk, inductor, linux-binary-manywheel
Timeframe: 720 hours
Commits checked: 5403
Auto revert patterns detected: 442
Actual reverts inside auto revert patterns detected (precision): 48 (10.9%)
Total revert commits in period: 115
Reverts that dont match any auto revert pattern detected (recall): 67 (58.3%)
Per workflow precision:
  pull: 45 reverts out of 411 patterns (10.9%)
  trunk: 1 reverts out of 8 patterns (12.5%)
  inductor: 2 reverts out of 20 patterns (10.0%)
  linux-binary-manywheel: 0 reverts out of 3 patterns (0.0%)
```

Critical implemented changes:
* Look forward and back for the first commit that ran the failed job, instead of trusting that it is always the one right before or right after.
* Job names have parts we don't care about, like shard indices. As a failure could happen in any shard, we want to find any shard with the same failure.

Things I tried that don't lead to great results:
* ignoring error classification - too low precision, no significant increase in recall
* not requiring error repetition - too low precision, no significant increase in recall

My take:

With a precision of 10% it justifies the cost of re-running jobs in order to confirm redness status. Even though it is not possible to test directly, I suspect that requiring the same output 2 times for all 3 signals should elevate the precision to a very high standard. Unfortunately the only way to verify is to run this in shadow mode.

With a recall of 55%, it points to being able to capture **most** of the introduced trunk redness errors. Lots of reverts might not be caused by CI redness, especially not in the workflows we are analyzing (could be performance degradation, GHF/internal reasons and many others). This number seems comfortable enough to provide a substantial gain in CI quality.
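The shard-insensitive matching described above can be approximated by stripping shard indices from the job name before comparing. A sketch, assuming job names carry shard info as a pair of numbers in a trailing parenthesized group such as `test (default, 2, 4, linux.4xlarge)`; the exact naming scheme is an assumption:

```python
import re

def normalize_job_name(name: str) -> str:
    """Drop shard indices like ', 2, 4' from a job name so that the
    same failure on any shard compares equal."""
    return re.sub(r",\s*\d+,\s*\d+", "", name)

a = normalize_job_name("test (default, 1, 4, linux.4xlarge)")
b = normalize_job_name("test (default, 3, 4, linux.4xlarge)")
print(a == b)  # True
```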
v20250630-164255
runners: Revert things related to batch termination (#6868)

This reverts the following PRs:
* #6859
* #6858
* #6855
* #6854
* #6852

These were causing issues where scale-down was too aggressively scaling down instances, leading to runners not being refreshed by scale-up.

I do think the SSM expiration stuff is worth a re-do though, but there were merge conflicts so I have to revert the entire thing.
v20250627-202541
v20250627-200622
runners: make ssm policy an array (#6858)

Fixes an issue where the SSM parameter policies were not being set correctly.

Resulted in errors like:
ValidationException: Invalid policies input: {"Type":"Expiration","Version":"1.0","Attributes":{"Timestamp":"2025-06-27T19:11:55.437Z"}}.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
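The fix is a serialization detail: SSM expects the policies string to be a JSON array of policy objects, not a bare object. A sketch of the difference, using the policy from the error above (the surrounding put-parameter call is omitted):

```python
import json

policy = {
    "Type": "Expiration",
    "Version": "1.0",
    "Attributes": {"Timestamp": "2025-06-27T19:11:55.437Z"},
}

# Rejected: a single object, as in the ValidationException above
bad = json.dumps(policy)

# Accepted: SSM wants an array of policy objects
good = json.dumps([policy])

print(good.startswith("["))  # True
```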
v20250627-185904
[log classifier] Rule for graph break registry check (#6837)

For failures like [GH job link](https://github.com/pytorch/pytorch/actions/runs/15859789097/job/44714997710) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/c1ad4b8e7a16f54c35a3908b56ed7d9f95eef586)

Currently matches `##[error]Process completed with exit code 1.` but there is a better line: `Found the unimplemented_v2 or unimplemented_v2_with_warning calls below that don't match the registry in graph_break_registry.json.`
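A quick way to check that a rule like this would fire on the specific registry line rather than only the generic exit-code line; the actual log classifier's rule syntax differs, this is just the regex idea:

```python
import re

log_lines = [
    "Found the unimplemented_v2 or unimplemented_v2_with_warning calls below "
    "that don't match the registry in graph_break_registry.json.",
    "##[error]Process completed with exit code 1.",
]

specific = re.compile(r"calls below.*don't match the registry in graph_break_registry\.json")
generic = re.compile(r"##\[error\]Process completed with exit code \d+")

# Prefer the specific rule when both match somewhere in the log
match = next((l for l in log_lines if specific.search(l)), None) \
    or next((l for l in log_lines if generic.search(l)), None)
print("graph_break_registry" in match)  # True
```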
v20250627-183532
runners: Add expiration policy to SSM parameters (#6855)

Instead of doing expensive cleanups we can rely on SSM parameter policies to do the cleanup for us! This is a workaround to avoid the need to do expensive cleanup of SSM parameters.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
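The expiration policy itself just needs a future UTC timestamp in ISO-8601 form with milliseconds, like the one shown in the error message further up the page. A sketch of building it with the stdlib; the 7-day retention window is an arbitrary example, not the value used by the runners:

```python
from datetime import datetime, timedelta, timezone

def expiration_policy(days: int) -> dict:
    """Build an SSM Expiration policy object whose Timestamp is `days`
    from now, formatted with millisecond precision and a trailing Z."""
    ts = datetime.now(timezone.utc) + timedelta(days=days)
    stamp = ts.strftime("%Y-%m-%dT%H:%M:%S.") + f"{ts.microsecond // 1000:03d}Z"
    return {"Type": "Expiration", "Version": "1.0", "Attributes": {"Timestamp": stamp}}

policy = expiration_policy(7)
print(policy["Type"])  # Expiration
```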