Tags: pytorch/test-infra
v20250317-134413
Adds scaleUpHealing chron (#6412)

# TLDR

This change introduces a new lambda, `${var.environment}-scale-up-chron`, along with all the TypeScript code and required Terraform changes.

# What is changing?

This PR introduces the TypeScript code for the new lambda and the related Terraform changes to run it every 30 minutes, with a 15-minute timeout. Its permissions and access are the same as those of scaleUp.

The lambda queries HUD at a URL specified by the user configuration `retry_scale_up_chron_hud_query_url` and receives a list of instance types and the number of jobs enqueued for each. It then synchronously tries to deploy those runners. A minimal sketch of this flow is included below.

It introduces 2 new parameters in the main module:

* `retry_scale_up_chron_hud_query_url`, which for now should point to https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D only in the installations that will benefit from it (both Meta and Linux Foundation PROD clusters, NOT canary). When this variable is set to the empty string (the default), this cron is not installed.
* `scale_config_org`, which should point to the org where scale-config files are defined. In our case it is `pytorch`.

[Example of the change](https://github.com/pytorch-labs/pytorch-gha-infra/pull/622/files)

# Why are we changing this?

We're introducing this change to help recover lost requests for infra scaling. It has been evident for a while that during GitHub API outages we fail to receive new job webhooks or fail to provision new runners. Most of the time our retry mechanism can deal with the situation, but when we are not receiving webhooks, or in other more esoteric failure modes, there is no way to recover.

With this change, every 30 minutes, jobs enqueued for longer than 30 minutes for one of the autoscaled instance types will trigger the creation of those instances.

A few design decisions:

1. Why rely on HUD? HUD already has this information, so it is simple to just get it from there.
2. Why not send a scale message and let scaleUp handle it? We want isolation, so that we can easily circuit-break the creation of enqueued instances. This isolation also guarantees that if the scaler is failing to deploy a given instance type, this mechanism won't risk flooding or overflowing the main scaler, which has to deal with all the other types.
3. Why randomize the instance creation order? If some instance type is problematic, we are not completely preventing the recovery of other instance types (just interfering with it). We also gain some time between creations of instances of the same type, allowing for smoother operation.
4. Why a new lambda? See point 2.

# If something goes wrong?

Given that we did as much work as possible to ensure maximal isolation between the regular scaler and the cron recovery scaler introduced here, we are not foreseeing any gaps that could break the main scaler and, as a consequence, cause system breakages. Having said that, if you need to revert these changes from production, just follow the steps in https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.jwflgevrww4j

---------

Co-authored-by: Zain Rizvi <ZainR@meta.com>
Co-authored-by: Camyll Harajli <camyllh@meta.com>
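A minimal sketch of the cron's core loop, assuming a HUD response shape of `{ queued_jobs: [...] }` and a hypothetical `scaleUpInstance()` helper; field and function names here are illustrative, not the actual implementation in test-infra:

```typescript
// Illustrative shape of one row returned by the HUD queued-jobs aggregate query.
interface QueuedJobAggregate {
  runner_label: string;
  num_queued_jobs: number;
  min_queue_time_minutes: number;
}

async function scaleUpChron(hudQueryUrl: string): Promise<void> {
  const response = await fetch(hudQueryUrl);
  const payload = (await response.json()) as { queued_jobs: QueuedJobAggregate[] };

  // Only act on jobs that have been waiting longer than the cron interval.
  const stale = payload.queued_jobs.filter((j) => j.min_queue_time_minutes >= 30);

  // Randomize the order so one problematic instance type cannot starve the rest.
  const shuffled = [...stale].sort(() => Math.random() - 0.5);

  for (const job of shuffled) {
    for (let i = 0; i < job.num_queued_jobs; i++) {
      // Deploy runners synchronously, one at a time, to keep the load smooth.
      await scaleUpInstance(job.runner_label);
    }
  }
}

// Hypothetical placeholder for the actual runner-provisioning call.
async function scaleUpInstance(runnerLabel: string): Promise<void> {
  console.log(`would provision a runner of type ${runnerLabel}`);
}
```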
v20250313-185750
Adds additional tests to getRunnerTypes, simplifies code a bit, adds the support for `c.` runners (#6403)

## Support for `c.` runners

We don't support our canary environment with variants very well (not sure why), and this condition was not present, so variants got the wrong naming when running in Meta's canary environment: `variant.c.runner.name` instead of `c.variant.runner.name`. This fix is very low impact and should not change anything in production. A small sketch of the naming rule follows below.

## Simplifies the code

There is a useless `if` that I noticed when reading `getRunnerTypes`. I am removing that conditional.

## Additional tests for getRunnerTypes

As we're adding empty variants as part of the ephemeral migration, I want to make sure that we have green tests and that we cover this situation in our unit tests. The new situations to cover are:

1) Empty variants
2) Meta's canary environment naming convention `c.`
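A minimal sketch of the naming rule described above; the real logic lives in `getRunnerTypes`, and the function and parameter names here are illustrative only:

```typescript
// Builds the variant-qualified runner name. In the canary environment the `c.`
// prefix must stay at the front (`c.variant.runner.name`), not after the
// variant (`variant.c.runner.name`).
function variantRunnerName(baseRunnerName: string, variant: string): string {
  // Empty variants fall back to the plain runner name.
  if (!variant) {
    return baseRunnerName;
  }
  if (baseRunnerName.startsWith('c.')) {
    return `c.${variant}.${baseRunnerName.slice('c.'.length)}`;
  }
  return `${variant}.${baseRunnerName}`;
}

// Example: variantRunnerName('c.linux.4xlarge', 'amz2023') -> 'c.amz2023.linux.4xlarge'
```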
v20250310-124810
Reuse Ephemeral runners (#6315)

# About

With the goal of eventually moving to all instances being ephemeral, we need to fix the major limitation we have with ephemeral instances: stockouts. This is a problem because we currently release instances as soon as they finish a job.

The goal is to reuse instances before returning them to AWS by:

* Tagging ephemeral instances that finished a job with `EphemeralRunnerFinished=finish_timestamp`, hinting to scaleUp that they can be reused;
* scaleUp finds instances that have the `EphemeralRunnerFinished` tag and tries to use them to run a new job;
* scaleUp acquires a lock on the instance name to avoid concurrent reuse;
* scaleUp marks re-deployed instances with the `EBSVolumeReplacementRequestTm` tag, recording when the instance was marked for reuse;
* scaleUp removes `EphemeralRunnerFinished` so others won't find the same instance for reuse;
* scaleUp creates the necessary SSM parameters and returns the instance to its fresh state by restoring the EBS volume.

ScaleDown then:

* Avoids removing ephemeral instances by `minRunningTime`, using either the creation time, `EphemeralRunnerFinished`, or `EBSVolumeReplacementRequestTm`, depending on instance status.

A sketch of the tag handshake is included after this note.

# Disaster recovery plan:

If this PR introduces breakages, they will almost certainly be related to the capacity to deploy new instances/runners rather than to any different behaviour in the runner itself. So, after reverting this change, it is important to make sure the runner queue is under control. This can be accomplished by checking the queue size on [HUD metrics](https://hud.pytorch.org/metrics) and running the [send_scale_message.py](https://github.com/pytorch-labs/pytorch-gha-infra/blob/main/scale_tools/send_scale_message.py) script to make sure those instances are properly deployed by the stable version of the scaler.

## Step by step to revert this change from **META**

1 - Identify whether this PR is causing the observed problem: [look at queue size](https://hud.pytorch.org/metrics) and whether it is related to impacted runners (ephemeral ones). It can also help to investigate the [metrics on unidash](https://www.internalfb.com/intern/unidash/dashboard/aws_infra_monitoring_for_github_actions/lambda_scaleup) and the [logs](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/gh-ci-scale-up?tab=monitoring) related to the scaleUp lambda.

2 - If you confirm that the problem is triggered by this PR, revert it from main, so it won't cause impact again if someone else working on other changes accidentally releases a version of test-infra that includes it.

3 - To restore the infrastructure to the point before this change:

A) Find the commit (or, unlikely, more than one) on pytorch-gha-infra that points to a release version of test-infra containing this change (most likely the latest). It will be a change updating the Terrafile to point to a newer version of test-infra ([example](pytorch-labs/pytorch-gha-infra@c4e888f)). By convention we name such commits `Release vDATE-TIME`, like `Release v20250204-163312`.

B) Revert that commit from https://github.com/pytorch-labs/pytorch-gha-infra

C) Follow [the steps](https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.vj4fvy46wzwk) outlined in the Pytorch GHA Infra runbook.

D) That document contains pointers to monitoring and to confirming that the metrics / queue / logs you identified are recovering, and how to make sure you have recovered.

4 - Restore user experience:

A) If you have access, follow the [instructions on how to recover ephemeral queued jobs](https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.ba0nyrda8jch) in the above-mentioned document.

B) Another option is to cancel the queued jobs and trigger them again.
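A minimal sketch of the reuse handshake described above, using the tag names from this note (`EphemeralRunnerFinished`, `EBSVolumeReplacementRequestTm`). The locking, SSM-parameter creation, and EBS volume replacement steps are omitted, and the `RunnerType` tag is illustrative; this only shows the tag bookkeeping and is not the actual scaleUp code:

```typescript
import {
  EC2Client,
  DescribeInstancesCommand,
  CreateTagsCommand,
  DeleteTagsCommand,
} from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({});

async function findReusableEphemeralRunner(runnerType: string): Promise<string | undefined> {
  // Look for a running instance of the right type that finished its job and
  // was hinted as reusable via the EphemeralRunnerFinished tag.
  const result = await ec2.send(
    new DescribeInstancesCommand({
      Filters: [
        { Name: 'tag-key', Values: ['EphemeralRunnerFinished'] },
        { Name: 'tag:RunnerType', Values: [runnerType] }, // illustrative tag name
        { Name: 'instance-state-name', Values: ['running'] },
      ],
    })
  );
  return result.Reservations?.[0]?.Instances?.[0]?.InstanceId;
}

async function markInstanceForReuse(instanceId: string): Promise<void> {
  const now = Math.floor(Date.now() / 1000).toString();
  // Record when the instance was marked for reuse so scaleDown can apply
  // minRunningTime against this timestamp instead of the creation time.
  await ec2.send(
    new CreateTagsCommand({
      Resources: [instanceId],
      Tags: [{ Key: 'EBSVolumeReplacementRequestTm', Value: now }],
    })
  );
  // Remove the reuse hint so no other scaleUp invocation picks the same instance.
  await ec2.send(
    new DeleteTagsCommand({
      Resources: [instanceId],
      Tags: [{ Key: 'EphemeralRunnerFinished' }],
    })
  );
}
```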
v20250306-173054
Adding tooling and documentation for locally run tflint (#6370)

Created a Makefile in `./terraform-aws-github-runner` to perform tflint actions, and replaced the tflint calls in CI (`tflint.yml`) with this Makefile. This makes it much easier to test locally and get green signals on CI, reducing the loop time for fixing small syntax bugs.
v20250305-171119
[Bugfix] wait for ssm parameter to be created (#6359)

Sometimes the SSM parameter is not properly created. After investigation, I identified that the promise was not being properly awaited, which could cause some operations to be canceled.
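An illustration of the class of bug described above: an un-awaited promise can be dropped when the Lambda invocation ends, so the SSM parameter is never created. The function and parameter names are illustrative, not the actual ones used by scaleUp:

```typescript
import { SSMClient, PutParameterCommand } from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});

// Buggy pattern: the call returns a promise that is never awaited, so the
// Lambda may freeze or exit before the request completes, and errors are lost.
function createRunnerTokenParameterBuggy(name: string, token: string): void {
  ssm.send(new PutParameterCommand({ Name: name, Value: token, Type: 'SecureString' }));
}

// Fixed pattern: await the SSM call (and surface its errors) before returning.
async function createRunnerTokenParameter(name: string, token: string): Promise<void> {
  await ssm.send(new PutParameterCommand({ Name: name, Value: token, Type: 'SecureString' }));
}
```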
v20250205-165758
20250205175711
v20250205-164602
20250205174527
v20250205-163646
20250205173601
v20250205-163308
20250205173224
v20250205-161117
Adds ci-queue-pct lambda code to aws/lambdas and include it to the release