Add metric yaml #22168
base: devel
Conversation
name: arangodb_shards_follower_number
introducedIn: "3.12.7"
Isn't this the same as arangodb_metadata_number_follower_shards.yaml?
> Isn't this the same as arangodb_metadata_number_follower_shards.yaml?
I added a separate file for the coordinator metrics as we previously agreed. If you prefer to use the existing 'arangodb_metadata_number_follower_shards.yaml', we need to update it by including 'coordinator' in the 'exposedBy' section.
I guess this file (arangodb_shards_follower_number.yaml) was taken from PR #22121 (I assume as an example to work with). It can be deleted in this branch, as it will be added when the other one is merged. The same goes for the file arangodb_shard_followers_out_of_sync_number.yaml.
Deleting files.
KVS85 left a comment
See review.
@@ -0,0 +1,35 @@
name: arangodb_metadata_number_follower_shards
introducedIn: "3.12.7"
Please, put 3.12.8 here instead.
Updated the version to 3.12.8.
@@ -0,0 +1,35 @@
name: arangodb_metadata_number_not_replicated_shards
introducedIn: "3.12.7"
Please, put 3.12.8 here instead.
Updated the version to 3.12.8.
@@ -0,0 +1,35 @@
name: arangodb_metadata_number_out_of_sync_shards
introducedIn: "3.12.7"
Please, put 3.12.8 here instead.
Updated the version to 3.12.8.
@@ -0,0 +1,36 @@
name: arangodb_metadata_total_number_of_shards
introducedIn: "3.12.7"
Please, put 3.12.8 here instead.
Updated the version to 3.12.8.
@@ -0,0 +1,23 @@
name: arangodb_shard_followers_out_of_sync_number
introducedIn: "3.12.7"
Please, put 3.12.8 here instead.
Updated the version to 3.12.8.
arangod/Cluster/ClusterInfo.cpp Outdated
updateMetadataMetrics();
// Update coordinator specific shard metrics computed from Plan
if (_metadataMetrics.has_value()) {
  updateCoordinatorPlanMetrics();
}
Until now, updateMetadataMetrics() has been the (only) method that updates _metadataMetrics. Now you had the need to add another one to be called during loadCurrent() (updateCoordinatorCurrentMetrics()), and so I also get the sibling updateCoordinatorPlanMetrics() here.
Nevertheless, the current situation with these three methods is a little confusing, at least because updateMetadataMetrics() no longer updates (all the) metadata metrics.
To mitigate that, you could do one or both of:
- change the name(s) to clarify their respective responsibilities
- merge updateCoordinatorPlanMetrics() with updateMetadataMetrics()
Later, when we follow through with loading Plan and Current synchronously, it all can be merged into one call.
arangod/Cluster/ClusterInfo.cpp Outdated
if (it == _shardsToPlanServers.end() || !it->second) {
  // no plan servers recorded
  ++notReplicated;
  continue;
}
This case should be impossible: _shards and _shardsToPlanServers have to be consistent here.
It's surely sensible to be defensive and handle it safely regardless. As it hints at a bug at a different place, this should get a maintainer-mode assertion (e.g. TRI_ASSERT(it != _shardsToPlanServers.end() && it->second)). And possibly a log message with an error in production, though that doesn't feel necessary to me.
Fyi, maintainer-mode assertions (TRI_ASSERT) are only compiled in maintainer builds, and are not part of release builds.
arangod/Cluster/ClusterInfo.cpp Outdated
} else {
  ++notReplicated;
}
As far as I can tell, this case should not be possible, either. Let's also add a maintainer-mode assertion (e.g. TRI_ASSERT(!servers.empty())) to cement this.
uint64_t notReplicated = 0;
// read plan under read lock to compare leaders/replication factors
READ_LOCKER(readLocker, _planProt.lock);
This is possibly dangerous, as we're already holding a write lock for _currentProt.lock. We need to be certain this can't introduce deadlocks (e.g. if there is another code path that acquires a write lock for _planProt.lock, and then tries to get a read lock for _currentProt.lock, it could run into a deadlock with this code).
Afair, the ClusterInfo generally only uses one of those locks at a time. But I am not certain, I haven't checked the current code. If there is already code that does that, we must follow a common pattern that avoids deadlocks. And regardless of whether there is such code already, this should be documented at both locks.
Alternatively, we could change the caller to hold only a read lock instead of a write lock for _currentProt.lock: As long as we only ever hold both locks as read locks at the same time, this will prevent deadlocks as well.
I think I prefer the latter solution (only holding read locks), as it also means holding the write lock for a shorter amount of time, reducing contention with readers: this can be detrimental otherwise. Note that this holds not only for updateCoordinatorPlanMetrics, but also for the existing function updateMetadataMetrics(): It would be preferable to execute all of them under read locks only! Note that loadPlan and loadCurrent both also hold the _planProt.mutex and _currentProt.mutex, respectively, for their duration: This can be relied on to prevent concurrent calls to the update*Metrics functions.
auto const& shardId = it.first;
auto const& serversPtr = it.second;
++totalShards;
size_t currentN = (serversPtr ? serversPtr->size() : 0);
serversPtr must never be null, or there's a bug somewhere else. Double-checking it is absolutely fine, but I'd also add a maintainer-mode assertion to spot possible bugs during testing (e.g. TRI_ASSERT(serversPtr != nullptr)).
if (currentN > 0) {
  followers += (currentN - 1);
}
Similarly: In addition to serversPtr not being null, I think *serversPtr must never be empty. I'd also add a maintainer-mode assertion to that end. Though in this case, I am not 100% certain, so please do keep the check. :-)
arangod/Cluster/ClusterInfo.cpp Outdated
// read plan under read lock to compare leaders/replication factors
READ_LOCKER(readLocker, _planProt.lock);
for (auto const& it : _shardsToCurrentServers) {
Note that currently, Current can have stale entries, i.e. for collections/shards that have already been deleted. This could cause miscounts. It may help to think of the Plan as the actual cluster-wide metadata for e.g. a collection, and corresponding Current entries as a last reported state.
I suggest to flip it around: Loop over _shardsToPlanServers instead, and look up the corresponding entries in _shardsToCurrent.
arangod/Cluster/ClusterInfo.cpp Outdated
_metadataMetrics->numberOutOfSyncShards.store(0, std::memory_order_relaxed);
_metadataMetrics->numberNotReplicatedShards.store(notReplicated,
                                                  std::memory_order_relaxed);
These two metrics should only be updated in updateCoordinatorCurrentMetrics():
Suggested change (delete):
_metadataMetrics->numberOutOfSyncShards.store(0, std::memory_order_relaxed);
_metadataMetrics->numberNotReplicatedShards.store(notReplicated,
                                                  std::memory_order_relaxed);
Which also means we don't need to calculate notReplicated here (and we can't, really).
_metadataMetrics->totalNumberOfShards.store(totalShards,
                                            std::memory_order_relaxed);
_metadataMetrics->numberFollowerShards.store(followers,
                                             std::memory_order_relaxed);
These two metrics are already being reported by updateCoordinatorCurrentMetrics(), and should not be reported here as well.
Suggested change (delete):
_metadataMetrics->totalNumberOfShards.store(totalShards,
                                            std::memory_order_relaxed);
_metadataMetrics->numberFollowerShards.store(followers,
                                             std::memory_order_relaxed);
arangod/Cluster/ClusterInfo.cpp Outdated
// out of sync: leader changed or missing
if (plannedN == 0) {
  // No plan present: treat as out-of-sync
  ++outOfSync;
} else if (currentLeader.empty()) {
  ++outOfSync;
} else if (!plannedLeader.empty() && plannedLeader != currentLeader) {
  ++outOfSync;
}
if (plannedN > currentN) {
  ++notReplicated;
}
Imo outOfSync and notReplicated don't quite count the right thing; let me try to explain my idea of them (high-level, skipping over some details):
If there is only one in-sync replica (the leader), the shard should be counted as notReplicated. This is also the case for shards that are planned with replicationFactor: 1 (which means the plan already only contains the leader as the only replica).
If any follower from Plan is missing from the list in Current, it is regarded as out-of-sync.
Note that a shard can be counted as both notReplicated and outOfSync, or any one of them, or neither.
But note that there are some additional edge cases to consider, e.g. during a leader failover or move shard operation. Let's discuss these separately.
Force-pushed from c1270fb to 8b98ae3.
Scope & Purpose
(Please describe the changes in this PR for reviewers, motivation, rationale - mandatory)
Checklist
Related Information
(Please reference tickets / specification / other PRs etc)
Note
Add documentation for six new shard-related metrics (totals, not replicated/out-of-sync, follower counts) across coordinator and DBServer.
- Documentation/Metrics/arangodb_metadata_total_number_of_shards.yaml: arangodb_metadata_total_number_of_shards
- Documentation/Metrics/arangodb_metadata_number_follower_shards.yaml: arangodb_metadata_number_follower_shards
- Documentation/Metrics/arangodb_metadata_number_not_replicated_shards.yaml: arangodb_metadata_number_not_replicated_shards
- Documentation/Metrics/arangodb_metadata_number_out_of_sync_shards.yaml: arangodb_metadata_number_out_of_sync_shards
- Documentation/Metrics/arangodb_shards_follower_number.yaml: arangodb_shards_follower_number
- Documentation/Metrics/arangodb_shard_followers_out_of_sync_number.yaml: arangodb_shard_followers_out_of_sync_number

All with introducedIn: "3.12.7", type gauge, with descriptions and troubleshoot guidance.
Written by Cursor Bugbot for commit 35cf940.