feat: integrate agentAPI with resources monitoring logic #16438


Merged
defelmnq merged 32 commits into main from agent_res_mon on Feb 14, 2025

Conversation

@defelmnq (Contributor) commented on Feb 4, 2025 (edited)

As part of the new resources monitoring logic, more specifically the OOM & OOD notifications, we need to update the AgentAPI and the agent logic.

This PR aims to do that. More specifically:
We bump the AgentAPI & TailnetAPI to version 24, adding two new methods to the AgentAPI (sketched below):

  • One method to fetch the resources monitoring configuration
  • One method to push the datapoints for the resources monitoring.
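
For reference, a minimal sketch of what those two methods look like on the version-24 DRPC client. The method and message names are taken from this PR's diff; the exact interface shape is an assumption:

type DRPCAgentClient24 interface {
	GetResourcesMonitoringConfiguration(ctx context.Context, in *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error)
	PushResourcesMonitoringUsage(ctx context.Context, in *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
}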

This PR also adds new logic on the agent side: a routine that runs on a ticker, fetching the resources usage on each tick and storing it in a FIFO-like queue (a minimal sketch follows).
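
A minimal, self-contained sketch of such a FIFO-like queue. The PR's actual implementation sits behind NewResourcesMonitorQueue; the generic type and method names here are illustrative assumptions:

package resourcesmonitor

// Queue keeps at most maxItems datapoints; once full, pushing a new item
// drops the oldest one, so the queue always holds the most recent window.
type Queue[T any] struct {
	items    []T
	maxItems int
}

func NewQueue[T any](maxItems int) *Queue[T] {
	return &Queue[T]{maxItems: maxItems}
}

func (q *Queue[T]) Push(item T) {
	if len(q.items) >= q.maxItems {
		q.items = q.items[1:] // shift the oldest element off the front
	}
	q.items = append(q.items, item)
}

func (q *Queue[T]) IsFull() bool { return len(q.items) == q.maxItems }
func (q *Queue[T]) Items() []T   { return q.items }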

Finally, this PR fixes a problem we had with the RBAC logic on the resources monitoring model, applying the same logic as we have for similar entities.

@defelmnq self-assigned this on Feb 4, 2025
@defelmnq changed the title from "feat: connect agentAPI with resources monitoring logic" to "feat: integrate agentAPI with resources monitoring logic" on Feb 10, 2025
@defelmnq marked this pull request as ready for review on February 11, 2025 05:04
	Status status = 6;
}

message GetResourcesMonitoringConfigurationRequest {
defelmnq (Author):

This message is based on what has been described in the RFC, and has been mutually discussed and approved by Vincent and @DanielleMaywood.

"github.com/coder/quartz"
)

func (a*agent)pushResourcesMonitoring(ctx context.Context,aAPI proto.DRPCAgentClient24)error {
defelmnq (Author):

Self review: we have this intermediate function in order to add an extra layer and simplify things for testing purposes.
Since all arguments are injected as parameters to PushResourcesMonitoringWithConfig, we can use the mocked versions.

configFetcher ResourcesMonitorConfigurationFetcher,
datapointsPusher ResourcesMonitorDatapointsPusher,
) error {
	config, err := configFetcher(ctx, &proto.GetResourcesMonitoringConfigurationRequest{})
defelmnq (Author):

Self review: there are three situations in which we do not want to run the goroutine:

  • If we cannot fetch the configuration from coderd.
  • If resources_monitoring is not enabled.
  • If we cannot instantiate the clistat client (which is the one used to fetch resources usage).

}, nil
}

func (a *ResourcesMonitoringAPI) PushResourcesMonitoringUsage(ctx context.Context, req *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error) {
defelmnq (Author):

Self review: this function is just here as a placeholder. We are fine keeping it this way for now, as it will be replaced by @DanielleMaywood's PR.


datapointsQueue := NewResourcesMonitorQueue(int(config.Config.NumDatapoints))

clk.TickerFunc(ctx, time.Duration(config.Config.TickInterval)*time.Second, func() error {
defelmnq (Author):

Self review: we prefer to log instead of returning an error when a problem occurs.

Using TickerFunc from quartz, returning an error would shut down the whole ticker and stop it. Returning nil stops the current tick, but the next one will still trigger.
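
A minimal sketch of that choice, assuming quartz's TickerFunc/Waiter API as used in the diff above; collect and logger are illustrative stand-ins, not code from this PR:

func runMonitor(ctx context.Context, clk quartz.Clock, logger slog.Logger, interval time.Duration, collect func(context.Context) error) error {
	return clk.TickerFunc(ctx, interval, func() error {
		if err := collect(ctx); err != nil {
			// Log and return nil: a non-nil error would shut the whole
			// ticker down, while nil only ends the current tick.
			logger.Error(ctx, "failed to collect resources", slog.Error(err))
		}
		return nil
	}).Wait()
}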

@defelmnq (Author):

@spikecurtis I'd love to have your review, especially on the whole AgentAPI / TailnetAPI part. I want to be sure that I increase the version as we should.

@dannykopping (Contributor) left a comment:

Generally looking good, but I'm concerned that this doesn't match the RFC. See my comment in agent/resources_monitor.go.

func TestPushResourcesMonitoringWithConfig(t *testing.T) {
	t.Parallel()

	tests := []struct {
Contributor:

Looking at the code coverage, we're missing a few test-cases here:

  • disabled config pushes nothing
  • errors when fetching memory/volume info
  • errors when pushing to coderd

defelmnq (Author):

I improved the testability of this block a lot by creating a struct for the resources_fetcher part, so we can mock it.

Also added test cases for all the missing cases. Should be way better now.

ResourcesMonitorDatapointsPusher func(ctx context.Context, params *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
)

func PushResourcesMonitoringWithConfig(ctx context.Context,
Contributor:

This implementation does not seem to match the RFC design.

Once the buffer fills up, it'll send the full buffer to the control plane. However, it'll keep doing so every TickInterval thereafter, since the buffer only ever replaces the oldest value.

The RFC specifies that the payload should only be delivered every 15s for the 20 datapoints; it's currently performing collection and submission in the same ticker with no distinction between them (see the sketch below).
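
For contrast, a sketch of the split the RFC seems to describe: two tickers sharing one queue, collecting often and submitting less often. All names and intervals here are illustrative assumptions, not code from this PR:

// Collection ticker: sample resource usage frequently.
clk.TickerFunc(ctx, time.Second, func() error {
	queue.Push(fetchDatapoint(ctx)) // hypothetical helper
	return nil
}, "collect")

// Submission ticker: flush the buffered datapoints to coderd every 15s.
clk.TickerFunc(ctx, 15*time.Second, func() error {
	_, err := datapointsPusher(ctx, &proto.PushResourcesMonitoringUsageRequest{
		Datapoints: queue.Items(),
	})
	return err // or log-and-continue, as discussed elsewhere in this thread
}, "push")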

We also seem to not be adding the UNKNOWN datapoints when values cannot be retrieved.

Was this a conscious choice? Is the RFC out-of-date perhaps?

defelmnq (Author):

About the UNKNOWN status: this is handled directly on the coderd side.
It will be visible in the part currently being done by Danielle. From the configuration stored in the DB and what we return on the agent side, coderd (and the processing component) decides whether the resource, for the given datapoint, is OK, NOK, or UNKNOWN.

About the tick logic: that's maybe a misunderstanding in the RFC, but when I defined it to be delivered every 15s, it was based on the TickInterval, considering it was set to 15. Having both the TickInterval and another value set independently would require some synchronisation to avoid sending the datapoints twice without a refresh, or fetching datapoints too frequently and thus not pushing data to coderd at each tick.

TL;DR - I think the best solution is to have the TickInterval handle both the refresh of resources and the push to coderd, so we are sure each push will always have the latest data.

Contributor:

About the UNKNOWN status: this is handled directly on the coderd side.
It will be visible in the part currently being done by Danielle. From the configuration stored in the DB and what we return on the agent side, coderd (and the processing component) decides whether the resource, for the given datapoint, is OK, NOK, or UNKNOWN.

Can you point me to some specific code I can look at, please? Will you be inferring that a datapoint was missed by looking at the time of each datapoint? I wonder why we opted for this approach rather than explicitly sending a noop datapoint to indicate a problem.

TL;DR - I think the best solution is to have the TickInterval handle both the refresh of resources and the push to coderd, so we are sure each push will always have the latest data.

This doesn't really address what I'm asking, though.

We're going to be sending all n datapoints every TickInterval once the queue fills up, since we never drain the queue but only shift off the oldest element.
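
A toy walkthrough of that behaviour with NumDatapoints = 3, using the queue sketch from the PR description above; the annotations are illustrative:

q := NewQueue[int](3)
for tick := 1; tick <= 5; tick++ {
	q.Push(tick)
	if q.IsFull() {
		fmt.Println("push to coderd:", q.Items())
	}
}
// push to coderd: [1 2 3]
// push to coderd: [2 3 4]  <- two of the three datapoints were already sent
// push to coderd: [3 4 5]  <- only one datapoint per push is actually new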

@spikecurtis (Contributor) left a comment:

Proto version changes look overall fine, but I want to review again re: RBAC


type (
	ResourcesMonitorConfigurationFetcher func(ctx context.Context, params *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error)
	ResourcesMonitorDatapointsPusher func(ctx context.Context, params *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
Contributor:

This should be an interface. With a bare function type, IDE tooling has a hard time finding implementations of this function. If you make it an interface, you can find implementations and see what this is.
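
A sketch of the suggested shape. Note that the generated DRPC client already has a method with this signature, so it would satisfy the interface for free:

type ResourcesMonitorDatapointsPusher interface {
	PushResourcesMonitoringUsage(ctx context.Context, req *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
}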

defelmnq (Author):

I removed the func parts and ResourcesMonitorConfigurationFetcher to have an interface - should be way better this way. 👀

datapointsPusher ResourcesMonitorDatapointsPusher,
) error {
	config, err := configFetcher(ctx, &proto.GetResourcesMonitoringConfigurationRequest{})
	if err != nil {
Contributor:

Instead of passing a "fetcher", you should just pass the config and do the fetching before you call this function. It will make testing easier, and you can avoid defining ResourcesMonitorConfigurationFetcher, which is a real mouthful.

defelmnq (Author):

✅ changed too.

	return xerrors.Errorf("failed to create resources fetcher: %w", err)
}

datapointsQueue := NewResourcesMonitorQueue(int(config.Config.NumDatapoints))
Contributor:

I think this would be a lot more readable with a ResourceMonitor struct that includes things like the resources fetcher and queue as members, instead of this nested function style. It makes it hard to keep track of the scope of things, that is, which things are per tick and which last between ticks.

You set up the long-lived struct, then start the ticker calling a method on the struct. That way it's very clear what is part of the tick and what outlives it. It also means you don't have to pass the fetcher to helpers like fetchResourceMonitoredVolume.
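
A sketch of the struct-based layout being suggested, reusing the queue and pusher shapes from earlier in this thread; the ResourcesFetcher interface and all field and method names are assumptions:

type ResourcesMonitor struct {
	fetcher ResourcesFetcher // long-lived: survives between ticks
	queue   *Queue[*proto.PushResourcesMonitoringUsageRequest_Datapoint]
	pusher  ResourcesMonitorDatapointsPusher
}

// Tick holds only per-tick state; everything long-lived is on the struct.
func (m *ResourcesMonitor) Tick(ctx context.Context) error {
	datapoint, err := m.fetcher.Fetch(ctx) // hypothetical method
	if err != nil {
		return err
	}
	m.queue.Push(datapoint)
	if !m.queue.IsFull() {
		return nil
	}
	_, err = m.pusher.PushResourcesMonitoringUsage(ctx, &proto.PushResourcesMonitoringUsageRequest{
		Datapoints: m.queue.Items(),
	})
	return err
}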

defelmnq (Author):

I changed a lot about how things were organised based on the comments, so now:

  • The resources monitor logic is in its own package.
  • It works with a struct and methods to make it an entity.
  • Members are used to store the queue & config.

🙏

@@ -0,0 +1,132 @@
package agent
Contributor:

I think Go style would be to move this into its own package so you don't have to include some variant of "resource" and "monitor" in the name of everything to keep things clear.

defelmnq (Author):

I moved it to its own package ✅ It should indeed simplify the naming a lot.

}

func (a *ResourcesMonitoringAPI) GetResourcesMonitoringConfiguration(ctx context.Context, _ *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error) {
	agent, err := a.AgentFn(ctx)
Contributor:

nit:

I think we could avoid using the AgentFn pattern here. We only need the agent's ID, but this function makes a call to the database, using the agent's ID, to fetch the agent. We could instead have the ResourcesMonitoringAPI carry an AgentID field and just pass in the AgentID from Options (sketched below).
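
A sketch of that suggestion; the field names are assumptions:

type ResourcesMonitoringAPI struct {
	AgentID  uuid.UUID // passed in from Options at construction time
	Database database.Store
}

// GetResourcesMonitoringConfiguration can then query by a.AgentID directly,
// without resolving the whole agent row via AgentFn first.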

defelmnq (Author):

A bit surprised there's such a simple option, but it makes sense. Changed - thanks! ✅

@defelmnq (Author):

@dannykopping @spikecurtis thanks for the first review. I changed and iterated on most of the things except the RBAC logic, as it will be a whole topic and I don't want to block the rest.

Can I ask you for another review (without RBAC), ensuring we are all good on the whole logic?

Globally:

  • I moved the ResourcesMonitor part to its own package.
  • Created a fetcher interface to improve testability (and added a few test cases).
  • Changed the proto part to have optional fields, and removed the global enabled in favor of an enabled flag for each resource.
  • Some cleanup here and there based on comments.

Thanks!

@spikecurtis (Contributor):

I'm happy to give another review, but I consider the RBAC to be a blocker to merging.


@defelmnq (Author):

Thanks @spikecurtis for the other review. I think we reached something way better here:

  • I fixed your last comments.
  • Re-inserted the RBAC logic. It took a bit of time to understand how everything works, but it is now tested and I am happy with the result.

Provisionerd can add entries.
Whoever can read the associated workspace can also fetch the associated resources_monitoring.

@dannykopping (Contributor) left a comment (edited):

I still have concerns about how this data collection is implemented.

GetResourcesMonitoringConfiguration specifies that a collection will occur every 10 seconds. Datapoints will be added to the queue, and once the queue fills up it'll send the contents of the queue to coderd. Once the queue is full it never empties; it just shifts the oldest element off to accommodate a new one.

The CollectionIntervalSeconds defines both the collection AND send interval, which means that once the queue is full it'll send the full contents every CollectionIntervalSeconds, but...

CollectionIntervalSeconds is 10s by default, which is too slow for datapoint collection; the point of the system we discussed in the RFC was to collect often (maybe even once per second), and to buffer those datapoints to send every so often so as to not overwhelm coderd. This implementation means we're sending the same datapoint NumDatapoints-1 times to coderd. I don't see the point in this.

Regarding the processor: with this design it would take up to 200s / 3.3m (20 datapoints collected 10s apart) to go from an unhealthy state to a healthy state; that will not feel responsive, and we can do much better!

In another comment I described my reservations about this design concerning failed collections; from the answer it seems like coderd will be inferring those failed collections by noticing gaps in the data. How we'll get this precise I don't know (since we're not collecting timestamps with each datapoint), and it seems unnecessary since we already know at the source when this has occurred. Currently we're just skipping over these failed datapoints in the agent with a log, and leaving coderd to guess as to what happened; this doesn't seem like a resilient design to me.

@defelmnq pointed out that an empty datapoint would be sent if the collection failed (I missed this on line 73). coderd will also look at the monitoring config, determine that the datapoint is blank but shouldn't be, and mark it as UNKNOWN. I think it could've been more explicit from the agent's perspective, but this approach will work fine.

{
	// If one of the resources fails to be fetched, the datapoints should still be pushed with the other resources.
	name:   "ErrorFetchingVolume",
	config: &proto.GetResourcesMonitoringConfigurationResponse{
Contributor:

Nit: since these all have the same config, consider creating a const to improve readability.
Alternatively (I'd prefer this), you could change these configs up so that you avoid accidentally overfitting for this specific config.

// We have one call per tick once we've reached ${config.NumDatapoints}.
expectedCalls := tt.numTicks - int(tt.config.Config.NumDatapoints) + 1
require.Equal(t, expectedCalls, counterCalls)
cancel()
Contributor:

Nit: I don't think you need this here since you have a defer cancel() already, which will be invoked after this.

Comment on lines +145 to +152
require.Len(t, req.Datapoints, 20)
require.Nil(t, req.Datapoints[0].Memory)
require.NotNil(t, req.Datapoints[0].Volumes)
require.Equal(t, &proto.PushResourcesMonitoringUsageRequest_Datapoint_VolumeUsage{
	Volume: "/",
	Total:  100000,
	Used:   50000,
}, req.Datapoints[0].Volumes[0])
Contributor:

Nit: this feels awkward; the assertions should be happening in the test body, not the test case definitions.
Don't we always want to check that the returned values match expectations?

@spikecurtis (Contributor) left a comment:

RBAC stuff looks OK now.

@dannykopping (Contributor) left a comment:

OK, @defelmnq and I discussed the implementation, and I'm happy to unblock this as it stands.
What I thought we were building and what I agreed to in the RFC were different, and for that I take responsibility. Apologies for the inertia this caused on this review!

One optimisation we agreed to do now is to start shipping the payloads even before the buffer fills up, so we can alert early if errors are present straight away.


For later:

My argument is that we should collect faster and send at roughly the same rate, to achieve a higher-resolution measurement and consequently have more reactive alerts / status updates.

To achieve this, we would need to send at least half (but probably all) of the previous NumDatapoints datapoints to ensure we can satisfy the 3 rules, which may only hold across current + previous payloads (a toy evaluation follows the list):

  • 4 datapoints in a row above threshold = NOK
  • any 10 datapoints in a NumDatapoints payload above threshold = NOK
  • all NumDatapoints within threshold = OK
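
A toy evaluation of those three rules. Datapoint, Status, and thresholdExceeded are hypothetical; the rules don't say what happens in between, so this sketch keeps the previous status:

func evaluate(datapoints []Datapoint, previous Status) Status {
	above, run, maxRun := 0, 0, 0
	for _, d := range datapoints {
		if thresholdExceeded(d) {
			above++
			run++
			if run > maxRun {
				maxRun = run
			}
		} else {
			run = 0
		}
	}
	switch {
	case maxRun >= 4 || above >= 10:
		return NOK // rules 1 and 2
	case above == 0:
		return OK // rule 3: every datapoint within threshold
	default:
		return previous // not specified by the rules above
	}
}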

@defelmnq merged commit bc609d0 into main on Feb 14, 2025
30 checks passed
@defelmnq deleted the agent_res_mon branch on February 14, 2025 09:28
github-actions (bot) locked and limited conversation to collaborators on Feb 14, 2025

Reviewers

@dannykopping approved these changes

@spikecurtis approved these changes

@johnstcn Awaiting requested review from johnstcn (code owner)

@DanielleMaywood Awaiting requested review from DanielleMaywood

Assignees

@defelmnq


4 participants

@defelmnq @spikecurtis @dannykopping @DanielleMaywood
