Add metric yaml #22168
base: devel
Conversation
name: arangodb_shards_follower_number
introducedIn: "3.12.7"
Isn't this the same as arangodb_metadata_number_follower_shards.yaml?
> Isn't this the same as arangodb_metadata_number_follower_shards.yaml?
I added a separate file for the coordinator metrics as we previously agreed. If you prefer to use the existing 'arangodb_metadata_number_follower_shards.yaml', we need to update it by including 'coordinator' in the 'exposedBy' section.
I guess this file (arangodb_shards_follower_number.yaml) was taken from PR #22121 (I assume as an example to work with). It can be deleted in this branch, as it will be added when the other one is merged. The same goes for the file arangodb_shard_followers_out_of_sync_number.yaml.
Deleting files.
KVS85 left a comment
See review.
@@ -0,0 +1,35 @@
name: arangodb_metadata_number_follower_shards
introducedIn: "3.12.7"
Please, put 3.12.8 here instead.
Updated the version to 3.12.8.
@@ -0,0 +1,35 @@
name: arangodb_metadata_number_not_replicated_shards
introducedIn: "3.12.7"
Please, put 3.12.8 here instead.
Updated the version to 3.12.8.
@@ -0,0 +1,35 @@
name: arangodb_metadata_number_out_of_sync_shards
introducedIn: "3.12.7"
Please, put 3.12.8 here instead.
Updated the version to 3.12.8.
@@ -0,0 +1,36 @@
name: arangodb_metadata_total_number_of_shards
introducedIn: "3.12.7"
Please, put 3.12.8 here instead.
Updated the version to 3.12.8.
@@ -0,0 +1,23 @@
name: arangodb_shard_followers_out_of_sync_number
introducedIn: "3.12.7"
Please, put 3.12.8 here instead.
Updated the version to 3.12.8.
arangod/Cluster/ClusterInfo.cpp Outdated
updateMetadataMetrics();
// Update coordinator specific shard metrics computed from Plan
if (_metadataMetrics.has_value()) {
  updateCoordinatorPlanMetrics();
}
Until now, updateMetadataMetrics() has been the (only) method that updates _metadataMetrics. Now you had the need to add another one to be called during loadCurrent() (updateCoordinatorCurrentMetrics()), and so I also get the sibling updateCoordinatorPlanMetrics() here.
Nevertheless, the current situation with these three methods is a little confusing, at least because updateMetadataMetrics() no longer updates (all the) metadata metrics.
To mitigate that, you could do one or both of:
- change the name(s) to clarify their respective responsibilities
- merge updateCoordinatorPlanMetrics() with updateMetadataMetrics()
Later, when we follow through with loading Plan and Current synchronously, it all can be merged into one call.
arangod/Cluster/ClusterInfo.cpp Outdated
if (it == _shardsToPlanServers.end() || !it->second) {
  // no plan servers recorded
  ++notReplicated;
  continue;
}
This case should be impossible: _shards and _shardsToPlanServers have to be consistent here.
It's surely sensible to be defensive and handle it safely regardless. As it hints at a bug at a different place, this should get a maintainer-mode assertion (e.g. TRI_ASSERT(it != _shardsToPlanServers.end() && it->second)). And possibly a log message with an error in production, though that doesn't feel necessary to me.
Fyi, maintainer-mode assertions (TRI_ASSERT) are only compiled in maintainer builds, and are not part of release builds.
arangod/Cluster/ClusterInfo.cpp Outdated
} else {
  ++notReplicated;
}
As far as I can tell, this case should not be possible, either. Let's also add a maintainer-mode assertion (e.g. TRI_ASSERT(!servers.empty())) to cement this.
uint64_t notReplicated = 0;
// read plan under read lock to compare leaders/replication factors
READ_LOCKER(readLocker, _planProt.lock);
This is possibly dangerous, as we're already holding a write lock for _currentProt.lock. We need to be certain this can't introduce deadlocks (e.g. if there is another code path that acquires a write lock for _planProt.lock, and then tries to get a read lock for _currentProt.lock, it could run into a deadlock with this code).
Afair, the ClusterInfo generally only uses one of those locks at a time. But I am not certain, I haven't checked the current code. If there is already code that does that, we must follow a common pattern that avoids deadlocks. And regardless of whether there is such code already, this should be documented at both locks.
Alternatively, we could change the caller to hold only a read lock instead of a write lock for _currentProt.lock: As long as we only ever hold both locks as read locks at the same time, this will prevent deadlocks as well.
I think I prefer the latter solution (only holding read locks), as it also means holding the write lock for a shorter amount of time, reducing contention with readers: this can be detrimental otherwise. Note that this holds not only for updateCoordinatorPlanMetrics, but also for the existing function updateMetadataMetrics(): It would be preferable to execute all of them under read locks only! Note that loadPlan and loadCurrent both also hold the _planProt.mutex and _currentProt.mutex, respectively, for their duration: This can be relied on to prevent concurrent calls to the update*Metrics functions.
auto const& shardId = it.first;
auto const& serversPtr = it.second;
++totalShards;
size_t currentN = (serversPtr ? serversPtr->size() : 0);
serversPtr must never be null, or there's a bug somewhere else. Double-checking it is absolutely fine, but I'd also add a maintainer-mode assertion to spot possible bugs during testing (e.g. TRI_ASSERT(serversPtr != nullptr)).
if (currentN > 0) {
  followers += (currentN - 1);
}
Similarly: In addition to serversPtr not being null, I think *serversPtr must never be empty. I'd also add a maintainer-mode assertion to that end. Though in this case, I am not 100% certain, so please do keep the check. :-)
arangod/Cluster/ClusterInfo.cpp Outdated
// read plan under read lock to compare leaders/replication factors
READ_LOCKER(readLocker, _planProt.lock);
for (auto const& it : _shardsToCurrentServers) {
Note that currently, Current can have stale entries, i.e. for collections/shards that have already been deleted. This could cause miscounts. It may help to think of the Plan as the actual cluster-wide metadata for e.g. a collection, and corresponding Current entries as a last reported state.
I suggest to flip it around: Loop over _shardsToPlanServers instead, and look up the corresponding entries in _shardsToCurrent.
arangod/Cluster/ClusterInfo.cpp Outdated
_metadataMetrics->numberOutOfSyncShards.store(0, std::memory_order_relaxed);
_metadataMetrics->numberNotReplicatedShards.store(notReplicated,
                                                  std::memory_order_relaxed);
These two metrics should only be updated in updateCoordinatorCurrentMetrics():
Suggested change (delete):
_metadataMetrics->numberOutOfSyncShards.store(0, std::memory_order_relaxed);
_metadataMetrics->numberNotReplicatedShards.store(notReplicated,
                                                  std::memory_order_relaxed);
Which also means we don't need to calculate notReplicated here (and we can't, really).
_metadataMetrics->totalNumberOfShards.store(totalShards,
                                            std::memory_order_relaxed);
_metadataMetrics->numberFollowerShards.store(followers,
                                             std::memory_order_relaxed);
These two metrics are already being reported by updateCoordinatorCurrentMetrics(), and should not be reported here as well.
Suggested change (delete):
_metadataMetrics->totalNumberOfShards.store(totalShards,
                                            std::memory_order_relaxed);
_metadataMetrics->numberFollowerShards.store(followers,
                                             std::memory_order_relaxed);
arangod/Cluster/ClusterInfo.cpp Outdated
// out of sync: leader changed or missing
if (plannedN == 0) {
  // No plan present: treat as out-of-sync
  ++outOfSync;
} else if (currentLeader.empty()) {
  ++outOfSync;
} else if (!plannedLeader.empty() && plannedLeader != currentLeader) {
  ++outOfSync;
}
if (plannedN > currentN) {
  ++notReplicated;
}
Imo outOfSync and notReplicated don't quite count the right thing; let me try to explain my idea of them (high-level, skipping over some details):
If there is only one in-sync replica (the leader), the shard should be counted as notReplicated. This is also the case for shards that are planned with replicationFactor: 1 (which means the plan already only contains the leader as the only replica).
If any follower from Plan is missing from the list in Current, it is regarded as out-of-sync.
Note that a shard can be counted as both notReplicated and outOfSync, or any one of them, or neither.
But note that there are some additional edge cases to consider, e.g. during a leader failover or move shard operation. Let's discuss these separately.
Force-pushed from c1270fb to 8b98ae3.
Scope & Purpose
(Please describe the changes in this PR for reviewers, motivation, rationale - mandatory)
Checklist
Related Information
(Please reference tickets / specification / other PRs etc)
Note
Add documentation for six new shard-related metrics (totals, not replicated/out-of-sync, follower counts) across coordinator and DBServer.
- Documentation/Metrics/arangodb_metadata_total_number_of_shards.yaml: arangodb_metadata_total_number_of_shards
- Documentation/Metrics/arangodb_metadata_number_follower_shards.yaml: arangodb_metadata_number_follower_shards
- Documentation/Metrics/arangodb_metadata_number_not_replicated_shards.yaml: arangodb_metadata_number_not_replicated_shards
- Documentation/Metrics/arangodb_metadata_number_out_of_sync_shards.yaml: arangodb_metadata_number_out_of_sync_shards
- Documentation/Metrics/arangodb_shards_follower_number.yaml: arangodb_shards_follower_number
- Documentation/Metrics/arangodb_shard_followers_out_of_sync_number.yaml: arangodb_shard_followers_out_of_sync_number

All with introducedIn: "3.12.7", type gauge, with descriptions and troubleshoot guidance.
Written by Cursor Bugbot for commit 35cf940.