fix: refactor agent resource monitoring API to avoid excessive calls to DB #20430


Merged
cstyan merged 6 commits into main from callum/workspace-agent-call-volume on Oct 28, 2025

Conversation

@cstyan (Contributor)

This should resolve coder/internal#728 by refactoring the ResourceMonitorAPI struct to only require querying the resource monitor once for memory and once for volumes, then using the stored monitors on the API struct from that point on. This should eliminate the vast majority of calls to GetWorkspaceByAgentID and FetchVolumesResourceMonitorsUpdatedAfter/FetchMemoryResourceMonitorsUpdatedAfter (millions of calls per week).
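
Roughly, the idea is to hang the fetched monitors off the API struct and reuse them for every subsequent call. A minimal sketch of that shape, assuming placeholder type and method names rather than the real coder/coder code:

package sketch

import (
	"context"
	"fmt"
	"sync"
)

// MemoryMonitor and monitorStore are placeholders for the generated database
// model and query layer; they are not the real coder/coder types.
type MemoryMonitor struct{ Enabled bool }

type monitorStore interface {
	FetchMemoryMonitor(ctx context.Context) (MemoryMonitor, error)
}

type ResourceMonitorAPI struct {
	db monitorStore

	mu           sync.Mutex
	memoryLoaded bool
	memory       MemoryMonitor
}

// memoryMonitor hits the database on the first call only; every later call
// (e.g. one per pushed datapoint batch) returns the cached value.
func (a *ResourceMonitorAPI) memoryMonitor(ctx context.Context) (MemoryMonitor, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if !a.memoryLoaded {
		mon, err := a.db.FetchMemoryMonitor(ctx)
		if err != nil {
			return MemoryMonitor{}, fmt.Errorf("fetch memory resource monitor: %w", err)
		}
		a.memory = mon
		a.memoryLoaded = true
	}
	return a.memory, nil
}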

Tests passed, and I ran an instance of coder via a workspace with a template that added resource monitoring every 10s. Note that this is the default Docker container, so there are other sources of GetWorkspaceByAgentID DB queries, and that the workspace had been running for ~15 minutes at the time I gathered this data.

Over 30s for the ResourceMonitor calls:

coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep ResourceMonitor | grep count
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="UpdateMemoryResourceMonitor"} 155
coderd_db_query_latencies_seconds_count{query="UpdateVolumeResourceMonitor"} 155
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep ResourceMonitor | grep count
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="UpdateMemoryResourceMonitor"} 158
coderd_db_query_latencies_seconds_count{query="UpdateVolumeResourceMonitor"} 158

And over 1m for the GetWorkspaceByAgentID calls; the majority are from the workspace metadata stats updates:

coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep GetWorkspaceByAgentID | grep count
coderd_db_query_latencies_seconds_count{query="GetWorkspaceByAgentID"} 876
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep GetWorkspaceByAgentID | grep count
coderd_db_query_latencies_seconds_count{query="GetWorkspaceByAgentID"} 918

for fetching workspaces/workspace agent monitor definitions
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
}

a.monitorsLock.RLock()
defer a.monitorsLock.RUnlock()
Contributor

I think we could simplify this locking by just moving it to PushResourcesMonitoringUsage and enforcing that calls to this function are not concurrent. It's meant to be sequential, and that's how the agent uses it.
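
For illustration, a sketch of what that simplification could look like, with placeholder types and method names rather than the real agentapi signatures:

package sketch

import (
	"context"
	"sync"
)

// Datapoint and the cached monitor field are placeholders, not the real
// protobuf or database types.
type Datapoint struct{ MemoryUsedBytes int64 }

type ResourceMonitorAPI struct {
	mu     sync.Mutex
	memory struct{ Enabled bool }
}

// PushResourcesMonitoringUsage takes the lock once for the whole request; this
// assumes the agent never issues these RPCs concurrently.
func (a *ResourceMonitorAPI) PushResourcesMonitoringUsage(ctx context.Context, dps []Datapoint) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.monitorMemory(ctx, dps)
}

// monitorMemory reads the cached monitor directly; the caller already holds
// the lock, so no RLock/RUnlock pair is needed here.
func (a *ResourceMonitorAPI) monitorMemory(_ context.Context, dps []Datapoint) error {
	if !a.memory.Enabled {
		return nil
	}
	// ... evaluate dps against the monitor's thresholds and record state ...
	return nil
}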

return xerrors.Errorf("fetch memory resource monitor: %w", err)
}
if err == nil {
a.memoryMonitor = &memMon
Contributor

Why is this field a pointer if the fetch doesn't return a pointer? It seems OK as a value, which would save an allocation and GC work.

Contributor (Author)

I was using the nil check as a way to check for existence/proper instantiation, but we can alternatively use CreatedAt.IsZero to check.
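
A tiny sketch of the two checks being weighed here, using an illustrative struct rather than the generated database model:

package sketch

import "time"

// MemoryMonitor is a stand-in for the database row.
type MemoryMonitor struct {
	Enabled   bool
	CreatedAt time.Time
}

type api struct {
	memoryPtr *MemoryMonitor // pointer form: "loaded" means memoryPtr != nil
	memory    MemoryMonitor  // value form: avoids the extra allocation
}

// Pointer form: existence is signalled by non-nil.
func (a *api) hasMemoryMonitorPtr() bool { return a.memoryPtr != nil }

// Value form: existence is signalled by a non-zero CreatedAt.
func (a *api) hasMemoryMonitor() bool { return !a.memory.CreatedAt.IsZero() }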

// Load memory monitor once
var memoryErr error
a.memOnce.Do(func() {
memoryErr = a.fetchMemoryMonitor(ctx)
Contributor

This doesn't really work, because you're passing a closure to Do() that includes memErr, which is a local variable in this function. So, only the first call to this once can possibly capture the error. Every subsequent call will get memErr unchanged, even if there was an error.

What needs to happen here depends on the error handling strategy.

One option would be to fetch the initial value of these monitors at the time the *ResourceMonitoringAPI itself is instantiated (that is, when the agent connects to the RPC service). If we fail to fetch the monitors, we error out the connection and assume the agent will reconnect. That's simple, and probably OK, given that we expect DB errors to be rare.

It has the nominal drawback that it tears the whole connection down when the agent could just retry the RPCs for resource monitoring, but at present the agent doesn't do that. If anything fails on the agentapi, it tends to just tear everything down and start again. So being more sophisticated, like hitting the DB when we get RPC calls and allowing retries on errors, would be an improvement that our only client doesn't take advantage of.
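
A sketch of that first option, again with placeholder names: do the one-time reads when the API struct is constructed and fail the connection setup if they error, so later RPCs never touch the DB for the monitor definitions:

package sketch

import (
	"context"
	"fmt"
)

// Placeholder types; not the real coder/coder models or store interface.
type MemoryMonitor struct{ Enabled bool }
type VolumeMonitor struct{ Path string }

type store interface {
	FetchMemoryMonitor(ctx context.Context) (MemoryMonitor, error)
	FetchVolumeMonitors(ctx context.Context) ([]VolumeMonitor, error)
}

type ResourceMonitoringAPI struct {
	memory  MemoryMonitor
	volumes []VolumeMonitor
}

// NewResourceMonitoringAPI performs both DB reads exactly once, at the point
// where the agent connects. An error here aborts the connection; the agent is
// expected to reconnect, which retries the fetch.
func NewResourceMonitoringAPI(ctx context.Context, db store) (*ResourceMonitoringAPI, error) {
	mem, err := db.FetchMemoryMonitor(ctx)
	if err != nil {
		return nil, fmt.Errorf("fetch memory resource monitor: %w", err)
	}
	vols, err := db.FetchVolumeMonitors(ctx)
	if err != nil {
		return nil, fmt.Errorf("fetch volume resource monitors: %w", err)
	}
	return &ResourceMonitoringAPI{memory: mem, volumes: vols}, nil
}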

Contributor (Author)

Yep, that's just a miss on my part during some refactoring of the sync Do functions. I was trying to avoid the refactor that would be required to return an error and have the client retry properly, while also not exiting completely on a transient failure. With your added context, though, it sounds like we're fine with the teardown/early exit, as it's what we're already doing everywhere else for similar paths in the agent code.

We can explore the refactor later if we decide it would be useful 👍

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@spikecurtis (Contributor) left a comment

A small suggestion inline, but I don't need to review again.

}

if !monitor.Enabled {
if !a.memoryMonitor.Enabled || a.memoryMonitor.CreatedAt.IsZero() {
Contributor

The zero value of a.memoryMonitor.Enabled is false, so I think the first check is sufficient here.

Contributor (Author)

👍 Yep, it should be sufficient to have just one (either check works here); I'll change this back to just checking Enabled.

I'd added CreatedAt.IsZero() here since we had to change if memoryErr != nil { in GetResourcesMonitoringConfiguration, and there we need to use CreatedAt.IsZero to indicate "no configuration".
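
For clarity, a sketch of the two checks and where each one is needed (illustrative struct, not the generated model):

package sketch

import "time"

type MemoryMonitor struct {
	Enabled   bool
	CreatedAt time.Time
}

// Push path: a never-configured monitor is the zero value, so Enabled is
// already false and a single check covers both "disabled" and "missing".
func shouldMonitorMemory(m MemoryMonitor) bool { return m.Enabled }

// Configuration endpoint: here "disabled" and "missing" must be distinguished,
// so a non-zero CreatedAt is used to mean "a monitor row exists".
func hasMemoryConfiguration(m MemoryMonitor) bool { return !m.CreatedAt.IsZero() }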

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan merged commit 45c43d4 into main on Oct 28, 2025
49 of 51 checks passed
@cstyan deleted the callum/workspace-agent-call-volume branch on October 28, 2025 at 20:38
@github-actions bot locked and limited conversation to collaborators on Oct 28, 2025

Reviewers

@spikecurtis approved these changes

Assignees

@cstyan


Development

Successfully merging this pull request may close these issues.

bug: GetWorkspaceByAgentID is called over a million times a day on dogfood

3 participants

@cstyan @spikecurtis
