
Add metric yaml #22168


Draft
bluepal-avanthi-dundigala wants to merge 4 commits into devel from feature/cor-8-coordinator-shard-metrics

Conversation

@bluepal-avanthi-dundigala commented Dec 9, 2025 (edited by cursorbot)

Scope & Purpose

(Please describe the changes in this PR for reviewers: motivation, rationale - mandatory)

  • 💩 Bugfix
  • 🍕 New feature
  • 🔥 Performance improvement
  • 🔨 Refactoring/simplification

Checklist

  • Tests
    • Regression tests
    • C++ unit tests
    • Integration tests
    • Resilience tests
  • 📖 CHANGELOG entry made
  • 📚 documentation written (release notes, API changes, ...)
  • Backports
    • Backport for 3.12.0: (Please link PR)
    • Backport for 3.11: (Please link PR)
    • Backport for 3.10: (Please link PR)

Related Information

(Please reference tickets / specification / other PRs etc)

  • Docs PR:
  • Enterprise PR:
  • GitHub issue / Jira ticket:
  • Design document:

Note

Add documentation for six new shard-related metrics (totals, not replicated/out-of-sync, follower counts) across coordinator and DBServer.

  • Metrics docs added:
    • Coordinator (Statistics):
      • Documentation/Metrics/arangodb_metadata_total_number_of_shards.yaml: arangodb_metadata_total_number_of_shards
      • Documentation/Metrics/arangodb_metadata_number_follower_shards.yaml: arangodb_metadata_number_follower_shards
      • Documentation/Metrics/arangodb_metadata_number_not_replicated_shards.yaml: arangodb_metadata_number_not_replicated_shards
      • Documentation/Metrics/arangodb_metadata_number_out_of_sync_shards.yaml: arangodb_metadata_number_out_of_sync_shards
    • DBServer (Replication):
      • Documentation/Metrics/arangodb_shards_follower_number.yaml: arangodb_shards_follower_number
      • Documentation/Metrics/arangodb_shard_followers_out_of_sync_number.yaml: arangodb_shard_followers_out_of_sync_number
  • All metrics marked introducedIn: "3.12.7", type gauge, with descriptions and troubleshoot guidance.
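For readers unfamiliar with the repository's metric definitions, one of the files listed above might look roughly as follows. This is a hedged sketch only: the field set is assumed from the conventions of existing Documentation/Metrics files, and all prose values are illustrative, not the actual content of this PR.

```yaml
# Hypothetical sketch of a Documentation/Metrics entry; field names follow
# the conventions of existing metric YAML files, values are illustrative.
name: arangodb_metadata_total_number_of_shards
introducedIn: "3.12.7"
help: |
  Total number of shards in the cluster.
unit: number
type: gauge
category: Statistics
complexity: low
exposedBy:
  - coordinator
description: |
  Total number of shards known to the coordinator, computed from the Plan.
troubleshoot: |
  A sudden drop may indicate collections or databases being dropped;
  compare with the follower and out-of-sync shard metrics.
```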

Written by Cursor Bugbot for commit 35cf940. This will update automatically on new commits.

bluepal-avanthi-dundigala marked this pull request as draft on December 9, 2025 06:15
bluepal-avanthi-dundigala requested review from jbajic and removed request for a team on December 9, 2025 06:16
Comment on lines +1 to +2
name: arangodb_shards_follower_number
introducedIn: "3.12.7"
Contributor:

Isn't this the same as arangodb_metadata_number_follower_shards.yaml?

Reply:

I added a separate file for the coordinator metrics as we previously agreed. If you prefer to use the existing 'arangodb_metadata_number_follower_shards.yaml', we need to update it by including 'coordinator' in the 'exposedBy' section.

Member:

I guess this file (arangodb_shards_follower_number.yaml) was taken from the PR #22121 (I assume as an example to work with). It can be deleted in this branch, as it will be added when the other one is merged. The same goes for the file arangodb_shard_followers_out_of_sync_number.yaml.

Reply:

Deleting files.

Contributor @KVS85 left a comment:

See review.

@@ -0,0 +1,35 @@
name: arangodb_metadata_number_follower_shards
introducedIn: "3.12.7"

Contributor:

Please, put 3.12.8 here instead.

Reply:

Updated the version to 3.12.8.

@@ -0,0 +1,35 @@
name: arangodb_metadata_number_not_replicated_shards
introducedIn: "3.12.7"

Contributor:

Please, put 3.12.8 here instead.

Reply:

Updated the version to 3.12.8.

@@ -0,0 +1,35 @@
name: arangodb_metadata_number_out_of_sync_shards
introducedIn: "3.12.7"

Contributor:

Please, put 3.12.8 here instead.

Reply:

Updated the version to 3.12.8.

@@ -0,0 +1,36 @@
name: arangodb_metadata_total_number_of_shards
introducedIn: "3.12.7"

Contributor:

Please, put 3.12.8 here instead.

Reply:

Updated the version to 3.12.8.

@@ -0,0 +1,23 @@
name: arangodb_shard_followers_out_of_sync_number
introducedIn: "3.12.7"

Contributor:

Please, put 3.12.8 here instead.

Reply:

Updated the version to 3.12.8.

Comment on lines 1958 to 1973
updateMetadataMetrics();

// Update coordinator specific shard metrics computed from Plan
if (_metadataMetrics.has_value()) {
updateCoordinatorPlanMetrics();
}
Member:

Until now, updateMetadataMetrics() has been the (only) method that updates _metadataMetrics. Now you had the need to add another one to be called during loadCurrent() (updateCoordinatorCurrentMetrics()), and so I also get the sibling updateCoordinatorPlanMetrics() here.

Nevertheless, the current situation with these three methods is a little confusing, at least because updateMetadataMetrics() no longer updates (all the) metadata metrics.

To mitigate that, you could do one or both of:

  • change the name(s) to clarify their respective responsibilities
  • merge updateCoordinatorPlanMetrics() with updateMetadataMetrics()

Later, when we follow through with loading Plan and Current synchronously, it all can be merged into one call.

Comment on lines 6373 to 6377
if (it == _shardsToPlanServers.end() || !it->second) {
// no plan servers recorded
++notReplicated;
continue;
}
Member:

This case should be impossible; _shards and _shardsToPlanServers have to be consistent here.

It's surely sensible to be defensive and handle it safely regardless. As it hints at a bug at a different place, this should get a maintainer-mode assertion (e.g. TRI_ASSERT(it != _shardsToPlanServers.end() && it->second)). And possibly a log message with an error in production, though that doesn't feel necessary to me.

FYI, maintainer-mode assertions (TRI_ASSERT) are only compiled in maintainer builds, and are not part of release builds.
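The combined "assert loudly in maintainer builds, handle defensively in release builds" pattern might look like the following sketch, with a plain assert() standing in for ArangoDB's TRI_ASSERT and simplified map types standing in for the real ClusterInfo members (_shards, _shardsToPlanServers):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Stand-in for ArangoDB's maintainer-mode assertion: with NDEBUG defined
// (release builds) this compiles to a no-op, mirroring TRI_ASSERT.
#define TRI_ASSERT(expr) assert(expr)

using ServerList = std::shared_ptr<std::vector<std::string>>;

// Simplified stand-ins for _shards / _shardsToPlanServers.
uint64_t countShardsWithoutPlanServers(
    std::map<std::string, int> const& shards,
    std::map<std::string, ServerList> const& shardsToPlanServers) {
  uint64_t notReplicated = 0;
  for (auto const& entry : shards) {
    auto it = shardsToPlanServers.find(entry.first);
    // The two maps must be consistent; an inconsistency hints at a bug
    // elsewhere, so flag it loudly in maintainer builds...
    TRI_ASSERT(it != shardsToPlanServers.end() && it->second != nullptr);
    // ...but still handle it defensively in release builds.
    if (it == shardsToPlanServers.end() || !it->second) {
      ++notReplicated;
      continue;
    }
  }
  return notReplicated;
}
```

With consistent maps the assertion is silent and the defensive branch is never taken; an inconsistent map would abort a maintainer build at the assertion instead of silently skewing the metric.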

Comment on lines 6386 to 6388
} else {
  ++notReplicated;
}
Member:

As far as I can tell, this case should not be possible, either. Let's also add a maintainer-mode assertion (e.g. TRI_ASSERT(!servers.empty())) to cement this.

uint64_t notReplicated = 0;

// read plan under read lock to compare leaders/replication factors
READ_LOCKER(readLocker, _planProt.lock);
Member:

This is possibly dangerous, as we're already holding a write lock for _currentProt.lock. We need to be certain this can't introduce deadlocks (e.g. if there is another code path that acquires a write lock for _planProt.lock, and then tries to get a read lock for _currentProt.lock, it could run into a deadlock with this code).

AFAIR, the ClusterInfo generally only uses one of those locks at a time. But I am not certain; I haven't checked the current code. If there is already code that does that, we must follow a common pattern that avoids deadlocks. And regardless of whether there is such code already, this should be documented at both locks.

Alternatively, we could change the caller to hold only a read lock instead of a write lock for _currentProt.lock: as long as we only ever hold both locks as read locks at the same time, this will prevent deadlocks as well.

I think I prefer the latter solution (only holding read locks), as it also means holding the write lock for a shorter amount of time, reducing contention with readers, which can be detrimental otherwise. Note that this holds not only for updateCoordinatorPlanMetrics, but also for the existing function updateMetadataMetrics(): it would be preferable to execute all of them under read locks only! Note that loadPlan and loadCurrent both also hold the _planProt.mutex and _currentProt.mutex, respectively, for their duration: this can be relied on to prevent concurrent calls to the update*Metrics functions.
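The "read locks only" alternative can be sketched as follows; planLock/currentLock are hypothetical stand-ins for _planProt.lock/_currentProt.lock, with std::shared_mutex replacing the real ArangoDB lock types and READ_LOCKER macro:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <shared_mutex>

// Hypothetical sketch of the "only read locks" pattern suggested above.
struct ClusterInfoSketch {
  mutable std::shared_mutex planLock;     // stands in for _planProt.lock
  mutable std::shared_mutex currentLock;  // stands in for _currentProt.lock
  std::atomic<uint64_t> totalShards{0};

  void updateMetricsFromPlanAndCurrent(uint64_t planShardCount) {
    // Both locks are taken as shared (read) locks. Readers do not exclude
    // each other, which sidesteps the write/read lock-ordering hazard
    // described in the review (assuming no writer-priority starvation).
    std::shared_lock planGuard(planLock);
    std::shared_lock currentGuard(currentLock);
    // The metrics themselves are atomics, so publishing them needs no
    // write lock at all.
    totalShards.store(planShardCount, std::memory_order_relaxed);
  }
};
```

Because the metric fields are atomics, holding only shared locks while reading Plan and Current is sufficient; no writer ever needs to wait on the metric update itself.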

auto const& shardId = it.first;
auto const& serversPtr = it.second;
++totalShards;
size_t currentN = (serversPtr ? serversPtr->size() : 0);
Member:

serversPtr must never be null, or there's a bug somewhere else. Double-checking it is absolutely fine, but I'd also add a maintainer-mode assertion to spot possible bugs during testing (e.g. TRI_ASSERT(serversPtr != nullptr)).

Comment on lines +6424 to +6426
if (currentN > 0) {
  followers += (currentN - 1);
}
Member:

Similarly: in addition to serversPtr not being null, I think *serversPtr must never be empty. I'd also add a maintainer-mode assertion to that end. Though in this case, I am not 100% certain, so please do keep the check. :-)

// read plan under read lock to compare leaders/replication factors
READ_LOCKER(readLocker, _planProt.lock);

for (auto const& it : _shardsToCurrentServers) {
Member:

Note that currently, Current can have stale entries, i.e. for collections/shards that have already been deleted. This could cause miscounts. It may help to think of the Plan as the actual cluster-wide metadata for e.g. a collection, and corresponding Current entries as a last reported state.

I suggest to flip it around: loop over _shardsToPlanServers instead, and look up the corresponding entries in _shardsToCurrent.
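The flipped iteration could look roughly like this sketch; the names and map types are simplified stand-ins for the actual ClusterInfo members, not the real code:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Servers = std::vector<std::string>;

// Plan is the authoritative metadata; Current is only a last reported
// state and may contain stale entries for already-deleted shards. By
// iterating over the Plan map and looking shards up in Current, stale
// Current entries are never counted.
uint64_t countPlannedShardsNotReported(
    std::map<std::string, Servers> const& shardsToPlanServers,
    std::map<std::string, Servers> const& shardsToCurrentServers) {
  uint64_t missing = 0;
  for (auto const& entry : shardsToPlanServers) {
    auto it = shardsToCurrentServers.find(entry.first);
    if (it == shardsToCurrentServers.end() || it->second.empty()) {
      ++missing;  // planned shard with nothing reported in Current yet
    }
  }
  return missing;
}
```

A stale Current entry for a deleted shard simply has no matching Plan key and is skipped, instead of inflating the totals.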

Comment on lines 6397 to 6399
_metadataMetrics->numberOutOfSyncShards.store(0, std::memory_order_relaxed);
_metadataMetrics->numberNotReplicatedShards.store(notReplicated,
                                                  std::memory_order_relaxed);

Member:

These two metrics should only be updated in updateCoordinatorCurrentMetrics(). Suggested change: remove the lines above. Which also means we don't need to calculate notReplicated here (and we can't, really).

Comment on lines +6460 to +6463
_metadataMetrics->totalNumberOfShards.store(totalShards,
                                            std::memory_order_relaxed);
_metadataMetrics->numberFollowerShards.store(followers,
                                             std::memory_order_relaxed);

Member:

These two metrics are already being reported by updateCoordinatorCurrentMetrics(), and should not be reported here as well. Suggested change: remove the lines above.

Comment on lines 6445 to 6457
// out of sync: leader changed or missing
if (plannedN == 0) {
  // No plan present: treat as out-of-sync
  ++outOfSync;
} else if (currentLeader.empty()) {
  ++outOfSync;
} else if (!plannedLeader.empty() && plannedLeader != currentLeader) {
  ++outOfSync;
}

if (plannedN > currentN) {
  ++notReplicated;
}
Member:

IMO outOfSync and notReplicated don't quite count the right thing; let me try to explain my idea of them (high-level, skipping over some details):

If there is only one in-sync replica (the leader), the shard should be counted as notReplicated. This is also the case for shards that are planned with replicationFactor: 1 (which means the plan already only contains the leader as the only replica).

If any follower from Plan is missing from the list in Current, it is regarded as out-of-sync.

Note that a shard can be counted as both notReplicated and outOfSync, or any one of them, or neither.

But note that there are some additional edge cases to consider, e.g. during a leader failover or move shard operation. Let's discuss these separately.
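A hedged sketch of these counting rules, with simplified names, the first planned server treated as the leader, and the failover/move-shard edge cases deliberately ignored:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Servers = std::vector<std::string>;  // first entry is the leader

struct ShardCounts {
  uint64_t notReplicated = 0;
  uint64_t outOfSync = 0;
};

// Hypothetical implementation of the rules described above: a shard is
// notReplicated when Current lists at most the leader in sync (including
// replicationFactor: 1 shards), and outOfSync when any planned follower
// is absent from the Current list. The two conditions are independent.
ShardCounts countShardHealth(std::map<std::string, Servers> const& plan,
                             std::map<std::string, Servers> const& current) {
  ShardCounts counts;
  for (auto const& entry : plan) {
    Servers const& planned = entry.second;
    auto it = current.find(entry.first);
    Servers const inSync = (it != current.end()) ? it->second : Servers{};
    if (inSync.size() <= 1) {
      ++counts.notReplicated;  // only the leader (or nothing) is in sync
    }
    // every planned follower (all but the first planned server) must
    // appear in Current, otherwise the shard is out of sync
    for (size_t i = 1; i < planned.size(); ++i) {
      if (std::find(inSync.begin(), inSync.end(), planned[i]) ==
          inSync.end()) {
        ++counts.outOfSync;
        break;
      }
    }
  }
  return counts;
}
```

For example, a shard planned with servers {A, B} but only {A} in sync counts toward both metrics, while a replicationFactor-1 shard with its leader in sync counts only as notReplicated.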

bluepal-avanthi-dundigala force-pushed the feature/cor-8-coordinator-shard-metrics branch from c1270fb to 8b98ae3 on December 19, 2025 07:56

Reviewers

@goedderz left review comments

@jbajic left review comments

@KVS85 requested changes

Requested changes must be addressed to merge this pull request.


5 participants

@bluepal-avanthi-dundigala, @KVS85, @goedderz, @jbajic
