- Notifications
You must be signed in to change notification settings - Fork4
Admin/XMover: Problematic translog detection and cancellation#533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:xmover
Are you sure you want to change the base?
Uh oh!
There was an error while loading.Please reload this page.
Conversation
coderabbitaibot commentedSep 10, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughIntroduces shard-size analytics and reporting in ShardAnalyzer, adds TranslogReporter and a new CLI command problematic-translogs, extends RecoveryMonitor with problematic shard visibility, incorporates partition-aware table displays, updates data model and database queries to include partition_values, and adds documentation and tests covering new behaviors. Changes
Sequence Diagram(s)sequenceDiagram autonumber actor U as User participant CLI as xmover CLI participant TR as TranslogReporter participant DB as CrateDBClient U->>CLI: xmover problematic-translogs --size-mb N [--cancel] CLI->>TR: problematic_translogs(size_mb=N) TR->>DB: query replicas with uncommitted translog >= N MB DB-->>TR: rows (table, partition_values, shard_id, node, ...) TR-->>CLI: generated ALTER ... REROUTE CANCEL SHARD commands alt --cancel flag and commands exist CLI->>U: Prompt for confirmation U-->>CLI: Confirm/Decline opt If confirmed loop For each command CLI->>DB: execute_query(command) Note right of CLI: 1s delay between commands end end end CLI-->>U: Summary (counts, successes/failures)sequenceDiagram autonumber participant SA as ShardAnalyzer participant DB as CrateDBClient participant UI as ShardReporter UI SA->>DB: fetch shard stats and nodes/zones DB-->>SA: shards (sizes, states), node zones SA-->>UI: distribution_stats (buckets, totals, warnings) UI-->>UI: render size distribution + optional large/small detailssequenceDiagram autonumber participant RM as RecoveryMonitor participant DB as CrateDBClient participant T as Terminal RM->>DB: get_recoveries() DB-->>RM: active/completed recoveries with partition_values RM->>DB: get_problematic_shards() DB-->>RM: shards not STARTED (with partition_values) RM-->>T: status lines, summaries, examples (partition-aware)Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
373efef to143ffbfCompareThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
@WalBeh: Unfortunately, the software tests are failing here. I guess CrateDB has to be properly provisioned before.
pytest --no-cov tests/admin/test_problematic_translogs.pyThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
We've fixed the software tests per13d59cc.
dac2a6d tocb05b4dCompare143ffbf toa48d9fcCompare21dd237 to3b3942fCompared6a1753 to0c12e4fCompareamotl commentedSep 19, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
@WalBeh:
Can you add software tests here or to another branch which will bring up software test code coverage again, with a specific focus on your new code? EDIT: All right, coverage is still >50%. However, the addition still decreased it, so it would not hurt to still add more test cases when applicable.
|
amotl commentedSep 19, 2025
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (11)
cratedb_toolkit/admin/xmover/model.py (1)
61-61:Standardize “no partition” sentinel.Tests use the string "NULL". Prefer None to represent “no partition” at the data layer to avoid leaking sentinel strings into UIs. If "NULL" must be retained for compatibility, document it on the field.
doc/admin/xmover/handbook.md (1)
271-287:Add language to fenced code block to satisfy markdownlint.Label the sample output block as text.
Apply:
-**Sample Output:**-```+**Sample Output:**+```text Found 3 shards with problematic translogs: ... Total: 3 ALTER commands generated-```+```cratedb_toolkit/admin/xmover/util/format.py (1)
46-58:Fix typing: avoid implicit Optional (RUF013).Annotate
partition_valuesas Optional and import it.Apply:
+from typing import Optional@@-def format_table_display_with_partition(schema_name: str, table_name: str, partition_values: str = None) -> str:+def format_table_display_with_partition(+ schema_name: str, table_name: str, partition_values: Optional[str] = None+) -> str:Optional (defensive): treat common sentinels as “no partition” to keep displays clean:
- if partition_values:+ if partition_values and partition_values.strip() not in {"NULL", "-", "none", ""}: return f"{base_display} {partition_values}"cratedb_toolkit/admin/xmover/cli.py (2)
480-493:Catching broad Exception — either narrow or document.If
execute_queryraises a known exception type, catch that. Otherwise, suppress the Ruff BLE001 warning explicitly.Apply one of:
- except Exception as e:+ # Broad except is intentional for CLI UX+ except Exception as e: # noqa: BLE001or, if available:
- except Exception as e:+ except (click.ClickException, RuntimeError) as e:Confirm the concrete exception type thrown by
CrateDBClient.execute_queryand replaceExceptionaccordingly.
506-510:Same: broad except for analysis block.Mirror the decision above for consistency.
Apply:
- except Exception as e:+ except Exception as e: # noqa: BLE001tests/admin/test_problematic_translogs.py (2)
156-159:Patch time.sleep at the module where it’s used to avoid cross-test bleed.Patch the symbol in the CLI module (or the module that calls sleep), not the global
time.sleep.-), patch("time.sleep"): # Mock sleep to speed up test+), patch("cratedb_toolkit.admin.xmover.cli.time.sleep"): # Mock sleep where usedIf sleep is called from another module, target that module path instead.
128-131:Make the query assertion resilient to quoting of "primary".The SQL should quote the column as
sh."primary" = FALSE. Update the assertion to allow both.+import re ...-assert "primary = FALSE" in query+assert re.search(r'"?primary"?\s*=\s*FALSE', query), querycratedb_toolkit/admin/xmover/operational/monitor.py (1)
140-147:Use the shared formatter for table+partition display for consistency.Align
_format_recovery_tablewith the rest of the file.-# Format table name with partition values if available-table_display = f"{recovery.schema_name}.{recovery.table_name}"-if recovery.partition_values:- table_display = f"{table_display} {recovery.partition_values}"+table_display = format_table_display_with_partition(+ recovery.schema_name, recovery.table_name, recovery.partition_values+)cratedb_toolkit/admin/xmover/util/database.py (1)
516-573:Docstring vs filter mismatch: exclude actively recovering shards or adjust wording.
get_problematic_shardssays “aren't actively recovering” but currently includes allstate != 'STARTED', which can includeRECOVERING. Either update the filter or the docstring.Option A (exclude recovering):
- where_conditions = ["s.state != 'STARTED'"]+ where_conditions = ["s.state NOT IN ('STARTED','RECOVERING')"]Option B: Clarify the docstring to say it returns “non-started shards (may include recovering),” and rely on caller-side filtering (as done in monitor).
cratedb_toolkit/admin/xmover/analysis/shard.py (2)
1069-1127:Narrow the try/except and useelse:for Ruff TRY300; annotate broad catch if intentional.Limit the try to the query, put the rendering in
else:, and annotate the broad catch (BLE001) since this is a CLI UX path.- try:- result = self.client.execute_query(query, [size_mb])- rows = result.get("rows", [])-- if not rows:- console.print(f"[green]✓ No replica shards found with translog uncommitted size > {size_mb}MB[/green]")- return []-- console.print(f"Found {len(rows)} shards with problematic translogs:")- console.print()- ...- return alter_commands- except Exception as e:- console.print(f"[red]Error analyzing problematic translogs: {e}[/red]")+ try:+ result = self.client.execute_query(query, [size_mb])+ except Exception as e: # noqa: BLE001+ console.print(f"[red]Error analyzing problematic translogs: {e}[/red]") import traceback- console.print(f"[dim]{traceback.format_exc()}[/dim]") return []+ else:+ rows = result.get("rows", [])+ if not rows:+ console.print(f"[green]✓ No replica shards found with translog uncommitted size > {size_mb}MB[/green]")+ return []+ console.print(f"Found {len(rows)} shards with problematic translogs:")+ console.print()+ ...+ return alter_commands
771-773:Remove unused variable or use it.
zoneis computed but unused.-# FIXME: `zone` does not get used.-zone = next((node.zone for node in self.nodes if node.id == node_id), "unknown") # noqa: F841+# zone lookup omitted intentionally; remove variable to avoid noiseOr use it in emitted stats if valuable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
cratedb_toolkit/admin/xmover/analysis/shard.py(4 hunks)cratedb_toolkit/admin/xmover/cli.py(4 hunks)cratedb_toolkit/admin/xmover/model.py(2 hunks)cratedb_toolkit/admin/xmover/operational/monitor.py(11 hunks)cratedb_toolkit/admin/xmover/util/database.py(5 hunks)cratedb_toolkit/admin/xmover/util/format.py(1 hunks)doc/admin/xmover/handbook.md(1 hunks)tests/admin/test_cli.py(1 hunks)tests/admin/test_problematic_translogs.py(1 hunks)tests/admin/test_recovery_monitor.py(5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T17:32:10.537Z
Learnt from: amotlPR: crate/cratedb-toolkit#523File: cratedb_toolkit/admin/xmover/analyze/shard.py:0-0Timestamp: 2025-09-18T17:32:10.537ZLearning: XMover (cratedb_toolkit.admin.xmover) intentionally uses print statements throughout its codebase instead of structured logging, as it's designed as a CLI tool where direct console output is the intended behavior.Applied to files:
cratedb_toolkit/admin/xmover/cli.pycratedb_toolkit/admin/xmover/operational/monitor.py
🪛 markdownlint-cli2 (0.18.1)
doc/admin/xmover/handbook.md
272-272: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.13.0)
cratedb_toolkit/admin/xmover/util/format.py
46-46: PEP 484 prohibits implicitOptional
Convert toOptional[T]
(RUF013)
cratedb_toolkit/admin/xmover/cli.py
490-490: Do not catch blind exception:Exception
(BLE001)
506-506: Do not catch blind exception:Exception
(BLE001)
cratedb_toolkit/admin/xmover/analysis/shard.py
1126-1126: Consider moving this statement to anelse block
(TRY300)
1128-1128: Do not catch blind exception:Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-and-publish
🔇 Additional comments (5)
cratedb_toolkit/admin/xmover/model.py (1)
2-5:Type alias and imports look good.The
TableStatsTypealias and expanded typing imports are appropriate and improve clarity.tests/admin/test_cli.py (1)
10-19:CLI surface coverage updated — looks good.Including
active-shards,problematic-translogs, andshard-distributionhere keeps the smoke test aligned with the CLI.tests/admin/test_recovery_monitor.py (1)
74-74:Partition awareness wired into tests — OK.Passing
partition_valuesaligns tests with the new dataclass signature.If "no partition" is expected, consider using
Nonerather than the string "NULL" to avoid UI artifacts. Confirm downstream formatting handlesNonecorrectly.Also applies to: 138-138, 156-156, 224-224, 241-241
cratedb_toolkit/admin/xmover/cli.py (1)
44-56:Don't require a DB connection for offline/diagnostic commands — lazy-init the CrateDBClientEager connection blocks
explain-errorand other offline flows; initialize the client only for subcommands that need DB access.File: cratedb_toolkit/admin/xmover/cli.py (lines 44–56)
-# Test connection on startup.-client = CrateDBClient()-if not client.test_connection():- console.print("[red]Error: Failed connecting to CrateDB[/red]")- console.print(- "Please check your database connection string, "- "i.e. the CRATE_CONNECTION_STRING environment variable, "- "possibly stored within an .env file"- )- raise click.Abort()--# Propagate the client handle.-ctx.obj["client"] = client+# Only initialize the DB client for commands that need it+needs_db = {+ "analyze",+ "find-candidates",+ "recommend",+ "check-balance",+ "shard-distribution",+ "active-shards",+ "monitor-recovery",+ "problematic-translogs",+ "zone-analysis",+ "validate-move",+ "test-connection",+}+sub = ctx.invoked_subcommand+if sub in needs_db:+ client = CrateDBClient()+ if not client.test_connection():+ console.print("[red]Error: Failed connecting to CrateDB[/red]")+ console.print(+ "Please check your database connection string, "+ "i.e. the CRATE_CONNECTION_STRING environment variable, "+ "possibly stored within an .env file"+ )+ raise click.Abort()+ ctx.obj["client"] = clientVerification: attempted the provided smoke-check but the import failed here with "ModuleNotFoundError: No module named 'croud'"; unable to run the smoke test in this environment. Re-run the smoke-check locally or in CI (with project dependencies, including croud, installed) after applying the patch to confirm offline commands work as expected.
tests/admin/test_problematic_translogs.py (1)
13-214:Solid, comprehensive test coverage for the new CLI path.Scenarios are well covered (empty, partitioned/non‑partitioned, mixed, cancel flow, failures, defaults). Nice use of CliRunner and mocking.
| WHERE | ||
| sh.state = 'STARTED' | ||
| AND sh.translog_stats['uncommitted_size']\ | ||
| > ? * 1024^2 | ||
| AND primary = FALSE | ||
| ORDER BY | ||
| 6 DESC\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Quote and qualify the "primary" column in SQL.
Unquotedprimary can break on some engines; qualify with the alias for clarity.
- AND primary = FALSE+ AND sh."primary" = FALSE
Note: adjust the related test to tolerate quotedprimary (see test update suggestion).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| WHERE | |
| sh.state='STARTED' | |
| ANDsh.translog_stats['uncommitted_size'] \ | |
| > ?*1024^2 | |
| ANDprimary=FALSE | |
| ORDERBY | |
| 6DESC \ | |
| WHERE | |
| sh.state='STARTED' | |
| ANDsh.translog_stats['uncommitted_size'] \ | |
| > ?*1024^2 | |
| ANDsh."primary"=FALSE | |
| ORDERBY | |
| 6DESC \ |
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analysis/shard.py around lines 1060 to 1066 theSQL uses an unquoted column name primary which can fail on some SQL engines;qualify it with the table alias (e.g., sh."primary" or sh.primary depending onquoting rules used elsewhere) in the WHERE clause and in any relatedORDER/GROUP/SELECT references, and update the related test to accept thequoted/qualified form of primary.| schema_name:str | ||
| table_name:str | ||
| partition_values:Optional[str]# Partition values for partitioned tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
🧩 Analysis chain
New required field may be a breaking change; verify all call sites.
partition_values: Optional[str] is non‑default and now required at construction. Ensure all creators ofRecoveryInfo across the repo (and downstream users) were updated accordingly.
Run to verify:
🏁 Script executed:
#!/bin/bash# Find all RecoveryInfo constructions and show argumentsrg -nPU --type=py'(?s)RecoveryInfo\s*\(' -C2
Length of output: 2141
🏁 Script executed:
#!/bin/bashset -euo pipefailfile="cratedb_toolkit/admin/xmover/model.py"if [[!-f"$file" ]];thenecho"MISSING:$file"exit 0finl -ba"$file"| sed -n'1,240p'
Length of output: 59
Make partition_values optional (or update all callers).
Multiple RecoveryInfo(...) constructions don’t pass partition_values; either change the model to add a default (e.g. partition_values: Optional[str] = None in cratedb_toolkit/admin/xmover/model.py at ~line 61) or update all call sites to pass the new field. Affected call sites found: tests/admin/test_recovery_monitor.py:71,135,152,219,237 and cratedb_toolkit/admin/xmover/util/database.py:448.
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/model.py around line 61, the RecoveryInfo fieldpartition_values is declared as Optional[str] but has no default causing callersthat omit it to fail; update the model to make it truly optional by giving it adefault (e.g. partition_values: Optional[str] = None) so existing callerscontinue to work, or alternatively update all call sites (tests andcratedb_toolkit/admin/xmover/util/database.py) to pass a value forpartition_values.| is_recovering=any( | ||
| r.shard_id==shard["shard_id"] | ||
| andr.table_name==shard["table_name"] | ||
| andr.schema_name==shard["schema_name"] | ||
| forrinrecoveries | ||
| ) | ||
| ifnotis_recovering: | ||
| non_recovering_shards.append(shard) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Include partition in “is_recovering” checks to avoid false matches across partitions.
table_name + shard_id is not unique for partitioned tables; this can hide problematic shards or misclassify them.
- is_recovering = any(- r.shard_id == shard["shard_id"]- and r.table_name == shard["table_name"]- and r.schema_name == shard["schema_name"]- for r in recoveries- )+ is_recovering = any(+ r.shard_id == shard["shard_id"]+ and r.table_name == shard["table_name"]+ and r.schema_name == shard["schema_name"]+ and (r.partition_values or None) == (shard.get("partition_values") or None)+ for r in recoveries+ )
Apply the same change at Lines 395-401 and 463-470.
Also applies to: 395-401, 463-470
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/operational/monitor.py around lines 311-319, theis_recovering check only compares shard_id, table_name and schema_name which canproduce false matches for partitioned tables; update the predicate to alsocompare partition (or partition_value/partition_id field used in shard dict andRecovery objects) so the match includes partition and preventsmisclassification; make the identical change in the same logic at lines 395-401and 463-470 so all three checks include partition in the equality conditions.- More or less just line-length fixes.- Only a single type adjustment was needed on the return value of the `analyze_distribution` method.- Ruff recommended to use set comprehensions, so here we go.- At a single spot where an exception has been `pass`ed, we added error output. Is it bad?
The new command can find and cancel replica shards with largeuncommitted translog sizes. It provides options to set size thresholdsand optionally execute cancel commands after user confirmation.
…portingThis commit adds comprehensive shard size analysis capabilities,including:- Size distribution bucketing and reporting- Early warning detection for oversized and undersized shards- Partition-aware shard status display- Integration with recovery monitoring- Improved problematic shard detection- Enhanced CLI output formattingThe changes help identify shard sizing issues that could impact clusterperformance and stability.
- Add max shard size tracking to size buckets- Add small shards analysis view to analyzer command
0c12e4f tobc2a66bCompare
Uh oh!
There was an error while loading.Please reload this page.
About
The new subcommand
xmover problematic-translogscan find and cancel replica shards with large uncommitted translog sizes. It provides options to set size thresholds and optionally execute cancel commands after user confirmation.Install
uv pip install --upgrade'cratedb-toolkit @ git+https://github.com/crate/cratedb-toolkit.git@xmover-problematic-translogs'Documentation
References