fix: refactor agent resource monitoring API to avoid excessive calls to DB #20430
Conversation
for fetching workspaces/workspace agent monitor definitions
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
}
a.monitorsLock.RLock()
defer a.monitorsLock.RUnlock()
I think we could simplify this locking by just moving it to PushResourcesMonitoringUsage and enforcing that calls to this function are not concurrent. It's meant to be sequential, and that's how the agent uses it.
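For illustration, a minimal self-contained sketch of that approach, with hypothetical names (pushUsage, a plain Mutex for monitorsLock) rather than the PR's actual API: the handler takes one coarse lock for its whole body and relies on callers invoking it sequentially.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type memoryMonitor struct {
	Enabled   bool
	CreatedAt time.Time
}

type resourcesMonitoringAPI struct {
	// monitorsLock guards memoryMonitor; it is held for the whole push call
	// instead of being taken around each individual field read.
	monitorsLock  sync.Mutex
	memoryMonitor memoryMonitor
}

// pushUsage stands in for PushResourcesMonitoringUsage: callers are expected
// to invoke it sequentially, so one coarse lock is enough.
func (a *resourcesMonitoringAPI) pushUsage(usedBytes int64) {
	a.monitorsLock.Lock()
	defer a.monitorsLock.Unlock()

	if !a.memoryMonitor.Enabled {
		return // nothing to record
	}
	fmt.Printf("recording memory usage: %d bytes\n", usedBytes)
}

func main() {
	a := &resourcesMonitoringAPI{
		memoryMonitor: memoryMonitor{Enabled: true, CreatedAt: time.Now()},
	}
	a.pushUsage(512 << 20)
}
```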
return xerrors.Errorf("fetch memory resource monitor: %w", err)
}
if err == nil {
a.memoryMonitor = &memMon
Why is this field a pointer if the fetch doesn't return a pointer? It seems fine as a value, which can save an allocation and GC pressure.
I was using the nil check as a way to check for existence/proper instantiation, but we can alternatively use CreatedAt.IsZero to check.
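A tiny sketch of the value-field alternative, using hypothetical types rather than the PR's code: the monitor is stored by value and CreatedAt.IsZero() stands in for the old nil-pointer check.

```go
package main

import (
	"fmt"
	"time"
)

type memoryMonitor struct {
	Enabled   bool
	CreatedAt time.Time
}

type api struct {
	// Stored by value: no extra allocation, and the zero value doubles as
	// "no monitor loaded yet".
	memoryMonitor memoryMonitor
}

// hasMemoryMonitor replaces the previous pointer nil check.
func (a *api) hasMemoryMonitor() bool {
	return !a.memoryMonitor.CreatedAt.IsZero()
}

func main() {
	a := &api{}
	fmt.Println(a.hasMemoryMonitor()) // false: zero value, nothing fetched yet

	a.memoryMonitor = memoryMonitor{Enabled: true, CreatedAt: time.Now()}
	fmt.Println(a.hasMemoryMonitor()) // true
}
```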
// Load memory monitor once
var memoryErr error
a.memOnce.Do(func() {
memoryErr = a.fetchMemoryMonitor(ctx)
This doesn't really work, because you're passing a closure to Do() that captures memoryErr, which is a local variable in this function. So only the first call to this Once can possibly capture the error. Every subsequent call will see memoryErr unchanged, even if there was an error.
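To make the pitfall concrete, here is a small self-contained sketch with hypothetical names (not the PR's code): because the closure writes to a variable local to the calling function, only the first caller can ever observe a fetch error; later callers get nil even though nothing was loaded.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fetchMonitor simulates a DB fetch that fails.
func fetchMonitor() error { return errors.New("db unavailable") }

type api struct {
	memOnce sync.Once
}

func (a *api) loadMemoryMonitor() error {
	var memoryErr error // local to *this* call
	a.memOnce.Do(func() {
		memoryErr = fetchMonitor()
	})
	return memoryErr
}

func main() {
	a := &api{}
	fmt.Println(a.loadMemoryMonitor()) // "db unavailable": the closure runs and writes this call's local
	fmt.Println(a.loadMemoryMonitor()) // <nil>: Do is now a no-op, so this call's local is never written
}
```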
What needs to happen here depends on the error handling strategy.
One option would be to fetch the initial value of these monitors at the time the *ResourceMonitoringAPI itself is instantiated (that is, when the agent connects to the RPC service). If we fail to fetch the monitors, we error out the connection and assume the agent will reconnect. That's simple, and probably OK, given that we expect DB errors to be rare.
It has the nominal drawback that it tears the whole connection down when the agent could just retry the resource monitoring RPCs, but at present the agent doesn't do that. If anything fails on the agentapi, it tends to just tear everything down and start again. So being more sophisticated (hitting the DB when we get RPC calls and allowing retries on error) would be an improvement that our only client doesn't take advantage of.
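A minimal sketch of that first option, with illustrative names (monitorStore, NewResourcesMonitoringAPI, the monitor types) that are assumptions rather than the PR's actual signatures: both monitor sets are fetched when the API is constructed, and a DB failure surfaces as a failed connection that the agent handles by reconnecting.

```go
package resourcemonitor

import (
	"context"

	"golang.org/x/xerrors"
)

// MemoryMonitor and VolumeMonitor are stand-ins for the real database rows.
type MemoryMonitor struct{ Enabled bool }
type VolumeMonitor struct{ Path string }

// monitorStore is a stand-in for the subset of the database API needed here.
type monitorStore interface {
	FetchMemoryMonitor(ctx context.Context) (MemoryMonitor, error)
	FetchVolumeMonitors(ctx context.Context) ([]VolumeMonitor, error)
}

type ResourcesMonitoringAPI struct {
	memoryMonitor  MemoryMonitor
	volumeMonitors []VolumeMonitor
}

// NewResourcesMonitoringAPI loads the monitor definitions exactly once, at
// connection time. If the DB is unavailable, constructing the API fails and
// the agent is expected to reconnect.
func NewResourcesMonitoringAPI(ctx context.Context, store monitorStore) (*ResourcesMonitoringAPI, error) {
	memMon, err := store.FetchMemoryMonitor(ctx)
	if err != nil {
		return nil, xerrors.Errorf("fetch memory resource monitor: %w", err)
	}
	volMons, err := store.FetchVolumeMonitors(ctx)
	if err != nil {
		return nil, xerrors.Errorf("fetch volume resource monitors: %w", err)
	}
	return &ResourcesMonitoringAPI{
		memoryMonitor:  memMon,
		volumeMonitors: volMons,
	}, nil
}
```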
Yep, that's just a miss on my part during some refactoring of the sync.Once Do functions. I was trying to avoid the refactor that would be required to return an error and have the client retry properly, while also not exiting completely on a transient failure. With your added context, though, it sounds like we're fine with the teardown/early exit since that's what we already do everywhere else for similar paths in the agent code.
We can explore the refactor later if we decide it would be useful 👍
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
A small suggestion inline, but I don't need to review again.
}
if !monitor.Enabled {
if !a.memoryMonitor.Enabled || a.memoryMonitor.CreatedAt.IsZero() {
The zero value of a.memoryMonitor.Enabled is false, so I think the first check is sufficient here.
👍 Yep, it should be sufficient to have just one (either check works here); I'll change this back to just checking Enabled.
I'd added CreatedAt.IsZero() here since we had to change if memoryErr != nil { in GetResourcesMonitoringConfiguration, and there we need to use CreatedAt.IsZero to indicate "no configuration".
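A toy illustration of that distinction, with hypothetical types: in the push path the zero value of Enabled already short-circuits correctly, while a configuration endpoint that must report "no configuration" needs something like CreatedAt.IsZero, since Enabled alone cannot tell "disabled" from "never configured".

```go
package main

import (
	"fmt"
	"time"
)

type memoryMonitor struct {
	Enabled   bool
	CreatedAt time.Time
}

// shouldPush mirrors the push path: the zero value's Enabled is false, so a
// never-loaded monitor is skipped by the same check as a disabled one.
func shouldPush(m memoryMonitor) bool { return m.Enabled }

// hasConfiguration mirrors the configuration path, where "disabled" and
// "no configuration at all" must be told apart.
func hasConfiguration(m memoryMonitor) bool { return !m.CreatedAt.IsZero() }

func main() {
	var missing memoryMonitor // never fetched
	disabled := memoryMonitor{Enabled: false, CreatedAt: time.Now()}

	fmt.Println(shouldPush(missing), shouldPush(disabled))             // false false: one check suffices here
	fmt.Println(hasConfiguration(missing), hasConfiguration(disabled)) // false true: IsZero distinguishes them
}
```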
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Merged commit 45c43d4 into main
This should resolve coder/internal#728 by refactoring the ResourceMonitorAPI struct to only require querying the resource monitors once for memory and once for volumes, then using the stored monitors on the API struct from that point on. This should eliminate the vast majority of calls to GetWorkspaceByAgentID and FetchVolumesResourceMonitorsUpdatedAfter/FetchMemoryResourceMonitorsUpdatedAfter (millions of calls per week).

Tests passed, and I ran an instance of coder via a workspace with a template that added resource monitoring every 10s. Note that this is the default docker container, so there are other sources of GetWorkspaceByAgentID db queries. Also note that this workspace had been running for ~15 minutes at the time I gathered this data.

Over 30s for the ResourceMonitor calls:

And over 1m for the GetWorkspaceAgentByID calls, the majority of which are from the workspace metadata stats updates: