fix: refactor agent resource monitoring API to avoid excessive calls to DB #20430
Conversation
for fetching workspaces/workspace agent monitor definitions
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
}
a.monitorsLock.RLock()
defer a.monitorsLock.RUnlock()
I think we could simplify this locking by just moving it to PushResourcesMonitoringUsage and enforcing that calls to this function are not concurrent. It's meant to be sequential, and that's how the agent uses it.
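For illustration, a minimal self-contained sketch of that approach, with hypothetical names (pushUsage, a plain Mutex for monitorsLock) rather than the PR's actual API: the handler takes one coarse lock for its whole body and relies on callers invoking it sequentially.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type memoryMonitor struct {
	Enabled   bool
	CreatedAt time.Time
}

type resourcesMonitoringAPI struct {
	// monitorsLock guards memoryMonitor; it is held for the whole push call
	// instead of being taken around each individual field read.
	monitorsLock  sync.Mutex
	memoryMonitor memoryMonitor
}

// pushUsage stands in for PushResourcesMonitoringUsage: callers are expected
// to invoke it sequentially, so one coarse lock is enough.
func (a *resourcesMonitoringAPI) pushUsage(usedBytes int64) {
	a.monitorsLock.Lock()
	defer a.monitorsLock.Unlock()

	if !a.memoryMonitor.Enabled {
		return // nothing to record
	}
	fmt.Printf("recording memory usage: %d bytes\n", usedBytes)
}

func main() {
	a := &resourcesMonitoringAPI{
		memoryMonitor: memoryMonitor{Enabled: true, CreatedAt: time.Now()},
	}
	a.pushUsage(512 << 20)
}
```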
return xerrors.Errorf("fetch memory resource monitor: %w", err)
}
if err == nil {
a.memoryMonitor = &memMon
Why is this field a pointer if the fetch doesn't return a pointer? It seems fine as a value, which can save an allocation and GC pressure.
I was using the nil check as a way to check for existence/proper instantiation, but we can alternatively use CreatedAt.IsZero to check.
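A tiny sketch of the value-field alternative, using hypothetical types rather than the PR's code: the monitor is stored by value and CreatedAt.IsZero() stands in for the old nil-pointer check.

```go
package main

import (
	"fmt"
	"time"
)

type memoryMonitor struct {
	Enabled   bool
	CreatedAt time.Time
}

type api struct {
	// Stored by value: no extra allocation, and the zero value doubles as
	// "no monitor loaded yet".
	memoryMonitor memoryMonitor
}

// hasMemoryMonitor replaces the previous pointer nil check.
func (a *api) hasMemoryMonitor() bool {
	return !a.memoryMonitor.CreatedAt.IsZero()
}

func main() {
	a := &api{}
	fmt.Println(a.hasMemoryMonitor()) // false: zero value, nothing fetched yet

	a.memoryMonitor = memoryMonitor{Enabled: true, CreatedAt: time.Now()}
	fmt.Println(a.hasMemoryMonitor()) // true
}
```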
// Load memory monitor once
var memoryErr error
a.memOnce.Do(func() {
memoryErr = a.fetchMemoryMonitor(ctx)
This doesn't really work, because you're passing a closure to Do() that captures memoryErr, which is a local variable in this function. So only the first call to this Once can possibly capture the error. Every subsequent call will see memoryErr unchanged, even if there was an error.
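To make the pitfall concrete, here is a small self-contained sketch with hypothetical names (not the PR's code): because the closure writes to a variable local to the calling function, only the first caller can ever observe a fetch error; later callers get nil even though nothing was loaded.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fetchMonitor simulates a DB fetch that fails.
func fetchMonitor() error { return errors.New("db unavailable") }

type api struct {
	memOnce sync.Once
}

func (a *api) loadMemoryMonitor() error {
	var memoryErr error // local to *this* call
	a.memOnce.Do(func() {
		memoryErr = fetchMonitor()
	})
	return memoryErr
}

func main() {
	a := &api{}
	fmt.Println(a.loadMemoryMonitor()) // "db unavailable": the closure runs and writes this call's local
	fmt.Println(a.loadMemoryMonitor()) // <nil>: Do is now a no-op, so this call's local is never written
}
```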
What needs to happen here depends on the error handling strategy.
One option would be to fetch the initial value of these monitors at the time the *ResourceMonitoringAPI itself is instantiated (that is, when the agent connects to the RPC service). If we fail to fetch the monitors, we error out the connection and assume the agent will reconnect. That's simple, and probably OK, given that we expect DB errors to be rare.
It has the nominal drawback that it tears the whole connection down when the agent could just retry the resource monitoring RPCs, but at present the agent doesn't do that. If anything fails on the agentapi, it tends to just tear everything down and start again. So being more sophisticated (hitting the DB when we get RPC calls and allowing retries on error) would be an improvement that our only client doesn't take advantage of.
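A minimal sketch of that first option, with illustrative names (monitorStore, NewResourcesMonitoringAPI, the monitor types) that are assumptions rather than the PR's actual signatures: both monitor sets are fetched when the API is constructed, and a DB failure surfaces as a failed connection that the agent handles by reconnecting.

```go
package resourcemonitor

import (
	"context"

	"golang.org/x/xerrors"
)

// MemoryMonitor and VolumeMonitor are stand-ins for the real database rows.
type MemoryMonitor struct{ Enabled bool }
type VolumeMonitor struct{ Path string }

// monitorStore is a stand-in for the subset of the database API needed here.
type monitorStore interface {
	FetchMemoryMonitor(ctx context.Context) (MemoryMonitor, error)
	FetchVolumeMonitors(ctx context.Context) ([]VolumeMonitor, error)
}

type ResourcesMonitoringAPI struct {
	memoryMonitor  MemoryMonitor
	volumeMonitors []VolumeMonitor
}

// NewResourcesMonitoringAPI loads the monitor definitions exactly once, at
// connection time. If the DB is unavailable, constructing the API fails and
// the agent is expected to reconnect.
func NewResourcesMonitoringAPI(ctx context.Context, store monitorStore) (*ResourcesMonitoringAPI, error) {
	memMon, err := store.FetchMemoryMonitor(ctx)
	if err != nil {
		return nil, xerrors.Errorf("fetch memory resource monitor: %w", err)
	}
	volMons, err := store.FetchVolumeMonitors(ctx)
	if err != nil {
		return nil, xerrors.Errorf("fetch volume resource monitors: %w", err)
	}
	return &ResourcesMonitoringAPI{
		memoryMonitor:  memMon,
		volumeMonitors: volMons,
	}, nil
}
```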
Yep, that's just a miss on my part during some refactoring of the sync.Once Do functions. I was trying to avoid the refactor that would be required to return an error and have the client retry properly, while also not exiting completely on a transient failure. With your added context, though, it sounds like we're fine with the teardown/early exit since that's what we already do everywhere else for similar paths in the agent code.
We can explore the refactor later if we decide it would be useful 👍
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
A small suggestion inline, but I don't need to review again.
}
if !monitor.Enabled {
if !a.memoryMonitor.Enabled || a.memoryMonitor.CreatedAt.IsZero() {
The zero value of a.memoryMonitor.Enabled is false, so I think the first check is sufficient here.
👍 Yep, it should be sufficient to have just one (either check works here); I'll change this back to just checking Enabled.
I'd added CreatedAt.IsZero() here since we had to change if memoryErr != nil { in GetResourcesMonitoringConfiguration, and there we need to use CreatedAt.IsZero to indicate "no configuration".
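A toy illustration of that distinction, with hypothetical types: in the push path the zero value of Enabled already short-circuits correctly, while a configuration endpoint that must report "no configuration" needs something like CreatedAt.IsZero, since Enabled alone cannot tell "disabled" from "never configured".

```go
package main

import (
	"fmt"
	"time"
)

type memoryMonitor struct {
	Enabled   bool
	CreatedAt time.Time
}

// shouldPush mirrors the push path: the zero value's Enabled is false, so a
// never-loaded monitor is skipped by the same check as a disabled one.
func shouldPush(m memoryMonitor) bool { return m.Enabled }

// hasConfiguration mirrors the configuration path, where "disabled" and
// "no configuration at all" must be told apart.
func hasConfiguration(m memoryMonitor) bool { return !m.CreatedAt.IsZero() }

func main() {
	var missing memoryMonitor // never fetched
	disabled := memoryMonitor{Enabled: false, CreatedAt: time.Now()}

	fmt.Println(shouldPush(missing), shouldPush(disabled))             // false false: one check suffices here
	fmt.Println(hasConfiguration(missing), hasConfiguration(disabled)) // false true: IsZero distinguishes them
}
```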
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Merged commit 45c43d4 into main
This should resolve coder/internal#728 by refactoring the ResourceMonitorAPI struct to only require querying the resource monitors once for memory and once for volumes, then using the stored monitors on the API struct from that point on. This should eliminate the vast majority of calls to GetWorkspaceByAgentID and FetchVolumesResourceMonitorsUpdatedAfter/FetchMemoryResourceMonitorsUpdatedAfter (millions of calls per week).

Tests passed, and I ran an instance of coder via a workspace with a template that added resource monitoring every 10s. Note that this is the default docker container, so there are other sources of GetWorkspaceByAgentID db queries. Also note that this workspace had been running for ~15 minutes at the time I gathered this data.

Over 30s for the ResourceMonitor calls:

And over 1m for the GetWorkspaceAgentByID calls, the majority of which are from the workspace metadata stats updates: