feat: integrate agentAPI with resources monitoring logic #16438


Merged
defelmnq merged 32 commits into main from agent_res_mon on Feb 14, 2025

Conversation

@defelmnq (Contributor) commented on Feb 4, 2025 (edited)

As part of the new resources monitoring logic, more specifically the OOM & OOD notifications, we need to update the AgentAPI and the agent logic.

This PR aims to do that. More specifically:
We bump the AgentAPI & TailnetAPI to version 24, adding two new methods to the AgentAPI (sketched below):

  • One method to fetch the resources monitoring configuration
  • One method to push the datapoints for the resources monitoring.
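
For reference, a minimal sketch of what those two methods look like on the version-24 DRPC client. The method and message names are taken from this PR's diff; the exact interface shape is an assumption:

type DRPCAgentClient24 interface {
	GetResourcesMonitoringConfiguration(ctx context.Context, in *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error)
	PushResourcesMonitoringUsage(ctx context.Context, in *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
}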

This PR also adds new logic on the agent side: a routine that runs on a ticker, fetching the resources usage on each tick and storing it in a FIFO-like queue (a minimal sketch follows).
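
A minimal, self-contained sketch of such a FIFO-like queue. The PR's actual implementation sits behind NewResourcesMonitorQueue; the generic type and method names here are illustrative assumptions:

package resourcesmonitor

// Queue keeps at most maxItems datapoints; once full, pushing a new item
// drops the oldest one, so the queue always holds the most recent window.
type Queue[T any] struct {
	items    []T
	maxItems int
}

func NewQueue[T any](maxItems int) *Queue[T] {
	return &Queue[T]{maxItems: maxItems}
}

func (q *Queue[T]) Push(item T) {
	if len(q.items) >= q.maxItems {
		q.items = q.items[1:] // shift the oldest element off the front
	}
	q.items = append(q.items, item)
}

func (q *Queue[T]) IsFull() bool { return len(q.items) == q.maxItems }
func (q *Queue[T]) Items() []T   { return q.items }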

Finally, this PR fixes a problem we had with the RBAC logic on the resources monitoring model, applying the same logic as we have for similar entities.

@defelmnq self-assigned this on Feb 4, 2025
@defelmnq changed the title from "feat: connect agentAPI with resources monitoring logic" to "feat: integrate agentAPI with resources monitoring logic" on Feb 10, 2025
@defelmnq marked this pull request as ready for review on February 11, 2025 05:04
	Status status = 6;
}

message GetResourcesMonitoringConfigurationRequest {
defelmnq (Author):

This message is based on what has been described in the RFC, and has been mutually discussed and approved by Vincent and @DanielleMaywood.

"github.com/coder/quartz"
)

func (a*agent)pushResourcesMonitoring(ctx context.Context,aAPI proto.DRPCAgentClient24)error {
defelmnq (Author):

Self review: we have this intermediate function in order to add an extra layer and simplify things for testing purposes.
Since all arguments are injected as parameters to PushResourcesMonitoringWithConfig, we can use the mocked versions.

configFetcher ResourcesMonitorConfigurationFetcher,
datapointsPusher ResourcesMonitorDatapointsPusher,
) error {
	config, err := configFetcher(ctx, &proto.GetResourcesMonitoringConfigurationRequest{})
defelmnq (Author):

Self review: there are three situations in which we do not want to run the goroutine:

  • If we cannot fetch the configuration from coderd.
  • If resources_monitoring is not enabled.
  • If we cannot instantiate the clistat client (which is the one used to fetch resources usage).

}, nil
}

func (a *ResourcesMonitoringAPI) PushResourcesMonitoringUsage(ctx context.Context, req *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error) {
defelmnq (Author):

Self review: this function is just here as a placeholder. We are fine keeping it this way for now, as it will be replaced by @DanielleMaywood's PR.


datapointsQueue := NewResourcesMonitorQueue(int(config.Config.NumDatapoints))

clk.TickerFunc(ctx, time.Duration(config.Config.TickInterval)*time.Second, func() error {
defelmnq (Author):

Self review: we prefer to log instead of returning an error when a problem occurs.

Using TickerFunc from quartz, returning an error would shut down the whole ticker and stop it. Returning nil stops the current tick, but the next one will still trigger.
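
A minimal sketch of that choice, assuming quartz's TickerFunc/Waiter API as used in the diff above; collect and logger are illustrative stand-ins, not code from this PR:

func runMonitor(ctx context.Context, clk quartz.Clock, logger slog.Logger, interval time.Duration, collect func(context.Context) error) error {
	return clk.TickerFunc(ctx, interval, func() error {
		if err := collect(ctx); err != nil {
			// Log and return nil: a non-nil error would shut the whole
			// ticker down, while nil only ends the current tick.
			logger.Error(ctx, "failed to collect resources", slog.Error(err))
		}
		return nil
	}).Wait()
}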

@defelmnq (Author):

@spikecurtis I'd love to have your review, especially on the whole AgentAPI / TailnetAPI part. I want to be sure that I increase the version as we should.

@dannykopping (Contributor) left a comment:

Generally looking good, but I'm concerned that this doesn't match the RFC. See my comment in agent/resources_monitor.go.

func TestPushResourcesMonitoringWithConfig(t *testing.T) {
	t.Parallel()

	tests := []struct {
Contributor:

Looking at the code coverage, we're missing a few test-cases here:

  • disabled config pushes nothing
  • errors when fetching memory/volume info
  • errors when pushing to coderd

defelmnq (Author):

I improved the testability of this block a lot by creating a struct for the resources_fetcher part, so we can mock it.

Also added test cases for all the missing cases. Should be way better now.

ResourcesMonitorDatapointsPusher func(ctx context.Context, params *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
)

func PushResourcesMonitoringWithConfig(ctx context.Context,
Contributor:

This implementation does not seem to match the RFC design.

Once the buffer fills up, it'll send the full buffer to the control plane. However, it'll keep doing so every TickInterval thereafter, since the buffer only ever replaces the oldest value.

The RFC specifies that the payload should only be delivered every 15s for the 20 datapoints; it's currently performing collection and submission in the same ticker with no distinction between them (see the sketch below).
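
For contrast, a sketch of the split the RFC seems to describe: two tickers sharing one queue, collecting often and submitting less often. All names and intervals here are illustrative assumptions, not code from this PR:

// Collection ticker: sample resource usage frequently.
clk.TickerFunc(ctx, time.Second, func() error {
	queue.Push(fetchDatapoint(ctx)) // hypothetical helper
	return nil
}, "collect")

// Submission ticker: flush the buffered datapoints to coderd every 15s.
clk.TickerFunc(ctx, 15*time.Second, func() error {
	_, err := datapointsPusher(ctx, &proto.PushResourcesMonitoringUsageRequest{
		Datapoints: queue.Items(),
	})
	return err // or log-and-continue, as discussed elsewhere in this thread
}, "push")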

We also seem to not be adding the UNKNOWN datapoints when values cannot be retrieved.

Was this a conscious choice? Is the RFC out-of-date perhaps?

defelmnq (Author):

About the UNKNOWN status: this is handled directly on the coderd side.
It will be visible in the part currently being done by Danielle. From the configuration stored in the DB and what we return on the agent side, coderd (and the processing component) decides whether the resource, for the given datapoint, is OK, NOK, or UNKNOWN.

About the tick logic: that's maybe a misunderstanding in the RFC, but when I defined it to be delivered every 15s, it was based on the TickInterval, considering it was set to 15. Having both the TickInterval and another value set independently would require some synchronisation to avoid sending the datapoints twice without a refresh, or fetching datapoints too frequently and thus not pushing data to coderd at each tick.

TL;DR - I think the best solution is to have the TickInterval handle both the refresh of resources and the push to coderd, so we are sure each push will always have the latest data.

Contributor:

About the UNKNOWN status: this is handled directly on the coderd side.
It will be visible in the part currently being done by Danielle. From the configuration stored in the DB and what we return on the agent side, coderd (and the processing component) decides whether the resource, for the given datapoint, is OK, NOK, or UNKNOWN.

Can you point me to some specific code I can look at, please? Will you be inferring that a datapoint was missed by looking at the time of each datapoint? I wonder why we opted for this approach rather than explicitly sending a noop datapoint to indicate a problem.

TL;DR - I think the best solution is to have the TickInterval handle both the refresh of resources and the push to coderd, so we are sure each push will always have the latest data.

This doesn't really address what I'm asking, though.

We're going to be sending all n datapoints every TickInterval once the queue fills up, since we never drain the queue but only shift off the oldest element.
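
A toy walkthrough of that behaviour with NumDatapoints = 3, using the queue sketch from the PR description above; the annotations are illustrative:

q := NewQueue[int](3)
for tick := 1; tick <= 5; tick++ {
	q.Push(tick)
	if q.IsFull() {
		fmt.Println("push to coderd:", q.Items())
	}
}
// push to coderd: [1 2 3]
// push to coderd: [2 3 4]  <- two of the three datapoints were already sent
// push to coderd: [3 4 5]  <- only one datapoint per push is actually new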

@spikecurtis (Contributor) left a comment:

Proto version changes look overall fine, but I want to review again re: RBAC


type (
	ResourcesMonitorConfigurationFetcher func(ctx context.Context, params *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error)
	ResourcesMonitorDatapointsPusher func(ctx context.Context, params *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
Contributor:

This should be an interface. With a bare function type, IDE tooling has a hard time finding implementations of this function. If you make it an interface, you can find implementations and see what this is.
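
A sketch of the suggested shape. Note that the generated DRPC client already has a method with this signature, so it would satisfy the interface for free:

type ResourcesMonitorDatapointsPusher interface {
	PushResourcesMonitoringUsage(ctx context.Context, req *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
}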

defelmnq (Author):

I removed the func parts and ResourcesMonitorConfigurationFetcher to have an interface - should be way better this way. 👀

datapointsPusher ResourcesMonitorDatapointsPusher,
) error {
	config, err := configFetcher(ctx, &proto.GetResourcesMonitoringConfigurationRequest{})
	if err != nil {
Contributor:

Instead of passing a "fetcher", you should just pass the config and do the fetching before you call this function. It will make testing easier, and you can avoid defining ResourcesMonitorConfigurationFetcher, which is a real mouthful.

defelmnq (Author):

✅ changed too.

	return xerrors.Errorf("failed to create resources fetcher: %w", err)
}

datapointsQueue := NewResourcesMonitorQueue(int(config.Config.NumDatapoints))
Contributor:

I think this would be a lot more readable with a ResourceMonitor struct that includes things like the resources fetcher and queue as members, instead of this nested function style. It makes it hard to keep track of the scope of things, that is, which things are per tick and which last between ticks.

You set up the long-lived struct, then start the ticker calling a method on the struct. That way it's very clear what is part of the tick and what outlives it. It also means you don't have to pass the fetcher to helpers like fetchResourceMonitoredVolume.
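
A sketch of the struct-based layout being suggested, reusing the queue and pusher shapes from earlier in this thread; the ResourcesFetcher interface and all field and method names are assumptions:

type ResourcesMonitor struct {
	fetcher ResourcesFetcher // long-lived: survives between ticks
	queue   *Queue[*proto.PushResourcesMonitoringUsageRequest_Datapoint]
	pusher  ResourcesMonitorDatapointsPusher
}

// Tick holds only per-tick state; everything long-lived is on the struct.
func (m *ResourcesMonitor) Tick(ctx context.Context) error {
	datapoint, err := m.fetcher.Fetch(ctx) // hypothetical method
	if err != nil {
		return err
	}
	m.queue.Push(datapoint)
	if !m.queue.IsFull() {
		return nil
	}
	_, err = m.pusher.PushResourcesMonitoringUsage(ctx, &proto.PushResourcesMonitoringUsageRequest{
		Datapoints: m.queue.Items(),
	})
	return err
}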

defelmnq (Author):

I changed a lot about how things were organised based on the comments, so now:

  • The resources monitor logic is in its own package.
  • It works with a struct and methods to make it an entity.
  • Members are used to store the queue & config.

🙏

@@ -0,0 +1,132 @@
package agent
Contributor:

I think Go style would be to move this into its own package so you don't have to include some variant of "resource" and "monitor" in the name of everything to keep things clear.

defelmnq (Author):

I moved it to its own package ✅ It should indeed simplify the naming a lot.

}

func (a *ResourcesMonitoringAPI) GetResourcesMonitoringConfiguration(ctx context.Context, _ *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error) {
	agent, err := a.AgentFn(ctx)
Contributor:

nit:

I think we could avoid using the AgentFn pattern here. We only need the agent's ID, but this function makes a call to the database, using the agent's ID, to fetch the agent. We could instead have the ResourcesMonitoringAPI carry an AgentID field and just pass in the AgentID from Options (sketched below).
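
A sketch of that suggestion; the field names are assumptions:

type ResourcesMonitoringAPI struct {
	AgentID  uuid.UUID // passed in from Options at construction time
	Database database.Store
}

// GetResourcesMonitoringConfiguration can then query by a.AgentID directly,
// without resolving the whole agent row via AgentFn first.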

defelmnq (Author):

A bit surprised there's such a simple option, but it makes sense. Changed - thanks! ✅

@defelmnq (Author):

@dannykopping @spikecurtis thanks for the first review. I changed and iterated on most of the things except the RBAC logic, as it will be a whole topic and I don't want to block the rest.

Can I ask you for another review (without RBAC), ensuring we are all good on the whole logic?

Globally:

  • I moved the ResourcesMonitor part to its own package.
  • Created a fetcher interface to improve testability (and added a few test cases).
  • Changed the proto part to have optional fields, and removed the global enabled in favor of an enabled flag for each resource.
  • Some cleanup here and there based on comments.

Thanks!

@spikecurtis (Contributor):

I'm happy to give another review, but I consider the RBAC to be a blocker to merging.


@defelmnq (Author):

Thanks @spikecurtis for the other review. I think we reached something way better here:

  • I fixed your last comments.
  • Re-inserted the RBAC logic. It took a bit of time to understand how everything works, but it is now tested and I am happy with the result.

Provisionerd can add entries.
Whoever can read the associated workspace can also fetch the associated resources_monitoring.

@dannykopping (Contributor) left a comment (edited):

I still have concerns about how this data collection is implemented.

GetResourcesMonitoringConfiguration specifies that a collection will occur every 10 seconds. Datapoints will be added to the queue, and once the queue fills up it'll send the contents of the queue to coderd. Once the queue is full it never empties; it just shifts the oldest element off to accommodate a new one.

The CollectionIntervalSeconds defines both the collection AND send interval, which means that once the queue is full it'll send the full contents every CollectionIntervalSeconds, but...

CollectionIntervalSeconds is 10s by default, which is too slow for datapoint collection; the point of the system we discussed in the RFC was to collect often (maybe even once per second), and to buffer those datapoints to send every so often so as to not overwhelm coderd. This implementation means we're sending the same datapoint NumDatapoints-1 times to coderd. I don't see the point in this.

Regarding the processor: with this design it would take up to 200s / 3.3m (20 datapoints collected 10s apart) to go from an unhealthy state to a healthy state; that will not feel responsive, and we can do much better!

In another comment I described my reservations about this design concerning failed collections; from the answer it seems like coderd will be inferring those failed collections by noticing gaps in the data. How we'll get this precise I don't know (since we're not collecting timestamps with each datapoint), and it seems unnecessary since we already know at the source when this has occurred. Currently we're just skipping over these failed datapoints in the agent with a log, and leaving coderd to guess as to what happened; this doesn't seem like a resilient design to me.

@defelmnq pointed out that an empty datapoint would be sent if the collection failed (I missed this on line 73). coderd will also look at the monitoring config, determine that the datapoint is blank but shouldn't be, and mark it as UNKNOWN. I think it could've been more explicit from the agent's perspective, but this approach will work fine.

{
	// If one of the resources fails to be fetched, the datapoints should still be pushed with the other resources.
	name:   "ErrorFetchingVolume",
	config: &proto.GetResourcesMonitoringConfigurationResponse{
Contributor:

Nit: since these all have the same config, consider creating a const to improve readability.
Alternatively (I'd prefer this), you could change these configs up so that you avoid accidentally overfitting for this specific config.

// We have one call per tick once we've reached ${config.NumDatapoints}.
expectedCalls := tt.numTicks - int(tt.config.Config.NumDatapoints) + 1
require.Equal(t, expectedCalls, counterCalls)
cancel()
Contributor:

Nit: I don't think you need this here since you have a defer cancel() already, which will be invoked after this.

Comment on lines +145 to +152
require.Len(t, req.Datapoints, 20)
require.Nil(t, req.Datapoints[0].Memory)
require.NotNil(t, req.Datapoints[0].Volumes)
require.Equal(t, &proto.PushResourcesMonitoringUsageRequest_Datapoint_VolumeUsage{
	Volume: "/",
	Total:  100000,
	Used:   50000,
}, req.Datapoints[0].Volumes[0])
Contributor:

Nit: this feels awkward; the assertions should be happening in the test body, not the test case definitions.
Don't we always want to check that the returned values match expectations?

@spikecurtis (Contributor) left a comment:

RBAC stuff looks OK now.

@dannykopping (Contributor) left a comment:

OK, @defelmnq and I discussed the implementation, and I'm happy to unblock this as it stands.
What I thought we were building and what I agreed to in the RFC were different, and for that I take responsibility. Apologies for the inertia this caused on this review!

One optimisation we agreed to do now is to start shipping the payloads even before the buffer fills up, so we can alert early if errors are present straight away.


For later:

My argument is that we should collect faster and send at roughly the same rate, to achieve a higher-resolution measurement and consequently have more reactive alerts / status updates.

To achieve this, we would need to send at least half (but probably all) of the previous NumDatapoints datapoints to ensure we can satisfy the 3 rules, which may only hold across current + previous payloads (a toy evaluation follows the list):

  • 4 datapoints in a row above threshold = NOK
  • any 10 datapoints in a NumDatapoints payload above threshold = NOK
  • all NumDatapoints within threshold = OK
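
A toy evaluation of those three rules. Datapoint, Status, and thresholdExceeded are hypothetical; the rules don't say what happens in between, so this sketch keeps the previous status:

func evaluate(datapoints []Datapoint, previous Status) Status {
	above, run, maxRun := 0, 0, 0
	for _, d := range datapoints {
		if thresholdExceeded(d) {
			above++
			run++
			if run > maxRun {
				maxRun = run
			}
		} else {
			run = 0
		}
	}
	switch {
	case maxRun >= 4 || above >= 10:
		return NOK // rules 1 and 2
	case above == 0:
		return OK // rule 3: every datapoint within threshold
	default:
		return previous // not specified by the rules above
	}
}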

@defelmnq merged commit bc609d0 into main on Feb 14, 2025
30 checks passed
@defelmnq deleted the agent_res_mon branch on February 14, 2025 09:28
github-actions (bot) locked and limited conversation to collaborators on Feb 14, 2025

Reviewers

@dannykopping approved these changes

@spikecurtis approved these changes

@johnstcn Awaiting requested review from johnstcn (code owner)

@DanielleMaywood Awaiting requested review from DanielleMaywood

Assignees

@defelmnq


4 participants

@defelmnq @spikecurtis @dannykopping @DanielleMaywood
