feat: expose agent metrics via Prometheus endpoint #7011


Merged
mtojek merged 16 commits into coder:main from mtojek:6724-metrics
Apr 7, 2023

Conversation

@mtojek (Member) commented Apr 5, 2023 (edited)

Related: #6724

This PR exposes basic agent metrics via the Prometheus endpoint. It is low-hanging fruit, as all the information is already exposed via the HTTP API.

# HELP coderd_agents_apps Agent applications with statuses.
# TYPE coderd_agents_apps gauge
coderd_agents_apps{agent_name="main",app_name="code-server",health="healthy",username="admin",workspace_name="workspace-1"} 1
coderd_agents_apps{agent_name="main",app_name="code-server",health="healthy",username="admin",workspace_name="workspace-2"} 1
coderd_agents_apps{agent_name="main",app_name="code-server",health="healthy",username="admin",workspace_name="workspace-3"} 1
# HELP coderd_agents_connection_latencies_seconds Agent connection latencies in seconds.
# TYPE coderd_agents_connection_latencies_seconds gauge
coderd_agents_connection_latencies_seconds{agent_id="main",derp_region="Coder Embedded Relay",preferred="true",username="admin",workspace_name="workspace-1"} 0.031924709
coderd_agents_connection_latencies_seconds{agent_id="main",derp_region="Coder Embedded Relay",preferred="true",username="admin",workspace_name="workspace-2"} 0.028658416
coderd_agents_connection_latencies_seconds{agent_id="main",derp_region="Coder Embedded Relay",preferred="true",username="admin",workspace_name="workspace-3"} 0.028099541
# HELP coderd_agents_connections Agent connections with statuses.
# TYPE coderd_agents_connections gauge
coderd_agents_connections{agent_name="main",lifecycle_state="ready",status="connected",tailnet_node="nodeid:16966f7df70d8cc5",username="admin",workspace_name="workspace-3"} 1
coderd_agents_connections{agent_name="main",lifecycle_state="start_timeout",status="connected",tailnet_node="nodeid:3237d00938be23e3",username="admin",workspace_name="workspace-2"} 1
coderd_agents_connections{agent_name="main",lifecycle_state="start_timeout",status="connected",tailnet_node="nodeid:3779bd45d00be0eb",username="admin",workspace_name="workspace-1"} 1
# HELP coderd_agents_up The number of active agents per workspace.
# TYPE coderd_agents_up gauge
coderd_agents_up{username="admin",workspace_name="workspace-1"} 1
coderd_agents_up{username="admin",workspace_name="workspace-2"} 1
coderd_agents_up{username="admin",workspace_name="workspace-3"} 1
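For orientation, a rough sketch of how gauge vectors with these names and labels could be declared with prometheus/client_golang. This is illustrative only; the real definitions live in coderd/prometheusmetrics and may differ, only two of the four metrics are shown, and registerAgentMetrics is a hypothetical helper.

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	agentsUpGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "coderd",
		Subsystem: "agents",
		Name:      "up",
		Help:      "The number of active agents per workspace.",
	}, []string{"username", "workspace_name"})

	agentsConnectionsGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "coderd",
		Subsystem: "agents",
		Name:      "connections",
		Help:      "Agent connections with statuses.",
	}, []string{"agent_name", "lifecycle_state", "status", "tailnet_node", "username", "workspace_name"})
)

// registerAgentMetrics (hypothetical) registers the vectors so they show up
// on the Prometheus endpoint; values are then set per label combination.
func registerAgentMetrics(registerer prometheus.Registerer) {
	registerer.MustRegister(agentsUpGauge, agentsConnectionsGauge)
	agentsUpGauge.WithLabelValues("admin", "workspace-1").Set(1)
}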

@mtojek self-assigned this Apr 5, 2023
@mtojek marked this pull request as ready for review April 5, 2023 12:41
Comment on lines 193 to 196
agentsGauge.Reset()
agentsConnectionsGauge.Reset()
agentsConnectionLatenciesGauge.Reset()
agentsAppsGauge.Reset()
Member:

This feels like a race condition. We reset the gauges, then loop over something that does db calls.

The loop can be midway through when a scrape happens, and the numbers will be incorrect.

Member:

One cheap way to fix this is to make our own prometheus.Collector and only update the values of the metrics at the end of the function. So we store the last values while we compute the new values, and always return the last completed dataset.
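To make the suggestion concrete, here is a minimal sketch of such a collector, assuming prometheus/client_golang; cachedCollector and its fields are illustrative names, not code from this PR.

package metrics

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// cachedCollector always serves the last fully computed dataset while a
// background routine prepares the next one.
type cachedCollector struct {
	mu   sync.Mutex
	last []prometheus.Metric
}

// Describe sends no descriptors, which registers the collector as
// "unchecked"; useful when the metric set changes between updates.
func (c *cachedCollector) Describe(_ chan<- *prometheus.Desc) {}

// Collect returns the last completed dataset, never a half-built one.
func (c *cachedCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, m := range c.last {
		ch <- m
	}
}

// update swaps in a freshly computed dataset once it is complete.
func (c *cachedCollector) update(next []prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.last = next
}

The database loop would then build the next []prometheus.Metric (for example with prometheus.MustNewConstMetric) and call update only once it has finished.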

Member (Author):

Gauges are being reset on every loop iteration before collecting any metrics. In theory, it's possible that a scrape happens after the reset but before the end of the iteration, so metrics may disappear.

I will look into a custom collector. BTW, most likely the same issue affects prometheusmetrics.Workspaces().

Member:

Yea, that one at least doesn't do db calls in the increment, but it does affect that too.
This is my first time looking at this.

Member (Author):

This feels like a race condition. We reset the gauges, then loop over something that does db calls.

I implemented a wrapper to mitigate concurrency risks. It seems to be the simplest way to address it. Collect() will wait a bit if there is a "flushing" procedure ongoing.

}

// Agents tracks the total number of workspaces with labels on status.
func Agents(ctx context.Context, logger slog.Logger, registerer prometheus.Registerer, db database.Store, coordinator *atomic.Pointer[tailnet.Coordinator], derpMap *tailcfg.DERPMap, agentInactiveDisconnectTimeout, duration time.Duration) (context.CancelFunc, error) {
Member:

Does this data already live on each workspace/agent?

I wonder if there is another way to do this. In v1 we implemented a custom prometheus.Collector to handle agent stats in a non-racy way.

Member (Author):

My idea was to keep it aligned with the other prometheusmetrics and use a single source of metrics: the database. In this case, the information we present over the coderd API is consistent with the Prometheus endpoint.

Regarding the prometheus.Collector, I will take a look 👍 (as stated in the other comment).

Member (Author):

I find the v1 implementation a bit complex for this use case. As I tried to keep metric collection separate from the agent reporting logic, a collector like the one in v1 would be great if we had metrics coming from different parts of the application.

BTW, it looks like the v1 collector doesn't support vectors, but here we depend mostly on them. Porting the collector would make it more complex.

Member:

Yea, I didn't think we could port the v1 metrics verbatim. How it works, though, is that each agent uses Prometheus to create its own metrics and pushes them to the coderd it is connected to. The aggregator then combines all those metrics, labeling them for each workspace.

The v1 collector does support labels, which is what "vectors" are. Each unique label set is a single "metric" to the aggregator. So coderd_agents_metric{favorite-number="7"} and coderd_agents_metric{favorite-number="1"} are 2 different prometheus.Metric. This matches the Prometheus design of labels:

Remember that every unique combination of key-value label pairs represents a new time series

I liked the v1 design as it made it easier to add metrics from the agent, as I think we make our own payloads in v2. 🤷‍♂️

I was more pointing it out because making a Collector gives you a lot more freedom in how to manipulate the Gather part of the metrics.
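As an aside, a tiny standalone example of that point; this is not code from v1 or from this PR, and the label name uses an underscore here because Prometheus label names cannot contain hyphens.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	g := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "coderd_agents_metric",
		Help: "Example gauge vector.",
	}, []string{"favorite_number"})

	// Two label values -> two distinct time series (two prometheus.Metric).
	g.WithLabelValues("7").Set(1)
	g.WithLabelValues("1").Set(1)

	ch := make(chan prometheus.Metric, 2)
	g.Collect(ch)
	close(ch)

	for m := range ch {
		var pb dto.Metric
		_ = m.Write(&pb)
		fmt.Printf("favorite_number=%s value=%v\n", pb.GetLabel()[0].GetValue(), pb.GetGauge().GetValue())
	}
}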

Member (Author):

How it works, though, is that each agent uses Prometheus to create its own metrics and pushes them to the coderd it is connected to.

That is one approach, but unfortunately, we would miss metrics from disconnected agents.

As stated in the PR description, the idea behind this submission is to expose metric data we have already collected and stored in the database. If we have this data already stored in the database, why not just use it? :) A similar story applies to "agent stats", which are stored in a dedicated database table.

Let me know your thoughts.

Emyrk reacted with thumbs up emoji
Member:

Disconnected agent stats in v1 Prometheus eventually go "stale" and are then removed. Prometheus doesn't need to be a perfect source of truth, so a little lag (e.g. 1 min) is OK imo.


I agree with you, though, that just exposing what we currently have is the go-to move. My initial hunch was to make a collector that has all the internal counters you are trying to track.

When the "Gather" func is called, the Collector returns the cached counters.

Every "update" period, new counts are created and incremented in an internal loop. When the counts are finished, the Collector is locked, all counters are updated, and it is unlocked.

So the exposed counters are always "correct".

Member (Author):

My initial hunch was to make a collector that has all the internal counters you are trying to track.

The idea is good for sure, it's just a matter of the capacity we have 👍

Every "update" perioud, new counts are created and incremented in an internal loop. When the counts are finished, then the Collector is locked, all counters updated, unlocked.

Yup, this is more or less what I've implemented at the gauge-vector level: CachedGaugeVec. I just found it simpler than the concept of a collector.

Emyrk reacted with thumbs up emoji
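For readers following along, a condensed sketch of the CachedGaugeVec pattern described here, pieced together from the fragments quoted later in this review; the actual implementation in coderd/prometheusmetrics may differ in detail.

package metrics

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

type VectorOperation int

const (
	VectorOperationAdd VectorOperation = iota
	VectorOperationSet
)

type vectorRecord struct {
	operation   VectorOperation
	value       float64
	labelValues []string
}

// CachedGaugeVec buffers values recorded via WithLabelValues and only
// publishes them to the wrapped GaugeVec on Commit, so a concurrent
// scrape sees either the previous dataset or the new one, never a mix.
type CachedGaugeVec struct {
	m        sync.Mutex
	gaugeVec *prometheus.GaugeVec
	records  []vectorRecord
}

// WithLabelValues buffers a value; it is not visible to scrapes yet.
func (v *CachedGaugeVec) WithLabelValues(op VectorOperation, value float64, labelValues ...string) {
	v.m.Lock()
	defer v.m.Unlock()
	v.records = append(v.records, vectorRecord{operation: op, value: value, labelValues: labelValues})
}

// Commit resets the gauge vector and replays the buffered records under
// the lock, then clears the buffer for the next collection cycle.
func (v *CachedGaugeVec) Commit() {
	v.m.Lock()
	defer v.m.Unlock()

	v.gaugeVec.Reset()
	for _, record := range v.records {
		g := v.gaugeVec.WithLabelValues(record.labelValues...)
		switch record.operation {
		case VectorOperationAdd:
			g.Add(record.value)
		case VectorOperationSet:
			g.Set(record.value)
		default:
			panic("unsupported vector operation")
		}
	}
	v.records = nil
}

// Collect blocks while a Commit is in progress, so a scrape waits for a
// complete dataset ("Collect() will wait a bit").
func (v *CachedGaugeVec) Collect(ch chan<- prometheus.Metric) {
	v.m.Lock()
	defer v.m.Unlock()
	v.gaugeVec.Collect(ch)
}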
Member:

Ah that makes sense and looks good 👍


@mtojek requested a review from Emyrk April 6, 2023 11:35
@Emyrk (Member) left a comment:

Small nits, LG

Comment on lines 37 to 42
func (v *CachedGaugeVec) Describe(desc chan<- *prometheus.Desc) {
	v.m.Lock()
	defer v.m.Unlock()

	v.gaugeVec.Describe(desc)
}
Member:

This actually does not need the mutex. Describe is safe and does not return the counter, which is what you are protecting.

Describe is not really called much in prod, if ever, so it's not that big of a deal.

Member (Author):

I wasn't sure if Describe should be guarded or not, so thanks for raising it. I removed the mutex from the function.

Comment on lines +9 to +14
type CachedGaugeVec struct {
	m sync.Mutex

	gaugeVec *prometheus.GaugeVec
	records  []vectorRecord
}
Member:

Can you doc the usage? And the why?

Eg:

CachedGaugeVec does .....
Calling WithLabelValues will update the internal gauge value. The value will not be returned by 'Collect' until 'Commit' is called.
'Commit' will reset the internal value, requiring the next set of values to build upon a completely reset metric.

Or something...

Member (Author):

Added 👍

	})
}

func (v *CachedGaugeVec) Commit() {
Member:

// Commit will set the internal value as the cached value to return from 'Collect'.
// The internal metric value is completely reset, so the caller should expect
// the gauge to be empty for the next 'WithLabelValues' values.

Suggested change:

func (v *CachedGaugeVec) Commit() {

Member (Author):

Comment added.

Comment on lines 69 to 76
switch record.operation {
case VectorOperationAdd:
	g.Add(record.value)
case VectorOperationSet:
	g.Set(record.value)
default:
	panic("unsupported vector operation")
}
Member:

Might want this switch statement on the WithLabelValues call so the panic is closer to the source.

Member (Author):

This switch is also useful to pick the right operation, so I will add another switch-case-panic to WithLabelValues.

@mtojek merged commit 0347231 into coder:main Apr 7, 2023
@github-actions bot locked and limited conversation to collaborators Apr 7, 2023
Comment on lines +59 to +62
switch operation {
case VectorOperationAdd:
case VectorOperationSet:
default:
Member:

I usually prefer this. But it does not matter.

switch operation {
case VectorOperationAdd, VectorOperationSet:
default:
}

Reviewers

@Emyrk approved these changes
@mafredri awaiting requested review
@coadler awaiting requested review
@bpmct awaiting requested review

Assignees

@mtojek

2 participants

@mtojek, @Emyrk
