feat: add connection statistics for workspace agents #6469
Conversation
This made for some weird tracking... we want the point-in-time number of counts!
Took a preliminary look at this, I'll finish my review tomorrow.
agent/agent.go Outdated
select {
case a.connStatsChan <- stats:
	// Only store the latest stat when it's successfully sent!
	// Otherwise, it should be sent again on the next iteration.
	a.latestStat.Store(stats)
Considering the previous comment about Tailscale resetting counts on every report, I'd think this current implementation will lose stats?
I imagine a more safe way to update the stats would be something along the lines of:
a.statMu.Lock()
a.stats.RxBytes += ...
select {
case a.connStatsChan <- a.stats: // note: a copy assuming basic struct
	// sent
default:
	// dropped this report
}
a.statMu.Unlock()
This way we're always incrementing the numbers, even from dropped reports.
Ahh, good point!
I wonder if we'd be better off blocking instead of dropping? It seems like that's fine from the Tailscale side, and then we don't really run any risks here.
I changed it to block instead. Let me know your thoughts! The loop retries anyways.
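As a point of reference, here is a minimal sketch of the blocking variant being described. The Stats and agent types below are stand-ins for illustration, not the actual definitions in the agent package:

package agent

import "context"

// Stats is a stand-in for the agent's connection stats payload.
type Stats struct {
	RxBytes int64
	TxBytes int64
}

type agent struct {
	connStatsChan chan *Stats
}

// reportBlocking illustrates "block instead of drop": the send waits until
// the consumer picks up the report (or the context is canceled), so a slow
// receiver delays the next report rather than losing this one.
func (a *agent) reportBlocking(ctx context.Context, stats *Stats) error {
	select {
	case a.connStatsChan <- stats:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}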
coderd/coderd.go Outdated
@@ -405,15 +408,16 @@ func New(options *Options) *API {
	r.Post("/csp/reports", api.logReportCSPViolations)
	r.Get("/buildinfo", buildInfo)
	r.Route("/deployment", func(r chi.Router) {
		r.Use(apiKeyMiddleware)
		r.Get("/config", api.deploymentConfig)
Breaking change? /config/deployment => /deployment/config.
Since this endpoint is only relied on in the dashboard, I wouldn't consider this breaking, but if you think it is, I'm fine to put up with it!
No strong preference, just being cautious. 😄
coderd/database/migrations/000102_workspace_agent_stats_types.up.sql Outdated (resolved)
I didn't look very closely at the frontend, but the backend looks mostly good. I still have some concerns about the agent stats reporting (see comments), but that's either a gap in my understanding or something we should fix; approving nonetheless.
cli/deployment/config.go Outdated
@@ -406,7 +406,7 @@ func newConfig() *codersdk.DeploymentConfig {
	Usage:   "How frequently agent stats are recorded",
	Flag:    "agent-stats-refresh-interval",
	Hidden:  true,
-	Default: 10 * time.Minute,
+	Default: 30 * time.Second,
This is a pretty big change; I think it's OK, but it increases spamminess somewhat.
It definitely is, but I did some napkin math and I think it should be alright.
Even if a user has hundreds of workspaces, a few hundred more writes/minute shouldn't be a big deal. I suppose it might spam the logs, which I'll check and resolve before merge.
I'd hate for all of coderd to be spammed with stat logs in a large deployment ;p
@@ -1270,10 +1267,16 @@ func (a *agent) startReportingConnectionStats(ctx context.Context) {
	// Convert from microseconds to milliseconds.
	stats.ConnectionMedianLatencyMS /= 1000
	lastStat := a.latestStat.Load()
	if lastStat != nil && reflect.DeepEqual(lastStat, stats) {
I guess this still confuses me a bit. If Tailscale stats aren't cumulative, isn't the only way this matches lastStat if there was no chatter (tx/rx) and the latency and sessions for SSH/Code/JetBrains stayed the same?
Since we're also doing network pings in the latency check, I think there is a non-zero chance for multiple reportStats calls to be running concurrently, essentially competing over Load()/Store() here?
Hmm, good points. I'll refactor this.
After looking at this again, it seems like this should be fine.
This will only match if there's no traffic, but that's arguably great because then we aren't spamming the database with nonsense. I don't want to do this in coderd, because we'd need to query for the last stat to ensure it's not the same.
reportStats is blocking, so subsequent agent stat refreshes will wait before running again; I don't think they'd compete.
Let me know if I'm overlooking something or didn't understand properly; I'm sick and my brain is stuffy right now ;p
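To illustrate that argument, here is a sketch of a serialized reporting loop under those assumptions; collect and report are hypothetical stand-ins for the real helpers, and the types are simplified rather than taken from the agent package:

package agent

import (
	"context"
	"reflect"
	"sync/atomic"
	"time"
)

// connStats is a simplified stand-in for the reported stats payload.
type connStats struct{ RxBytes, TxBytes int64 }

type statsAgent struct {
	latestStat atomic.Value // stores *connStats
}

// reportLoop sketches why iterations shouldn't compete: report blocks, so
// the next tick only runs after the previous report has finished, and an
// unchanged snapshot (no traffic) skips the database write entirely.
func (a *statsAgent) reportLoop(ctx context.Context, interval time.Duration,
	collect func() *connStats, report func(context.Context, *connStats) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
		stats := collect()
		last, _ := a.latestStat.Load().(*connStats)
		if last != nil && reflect.DeepEqual(last, stats) {
			continue // identical to the previous report: no traffic, nothing to store
		}
		if err := report(ctx, stats); err != nil {
			continue // retry on the next tick with fresh counters
		}
		a.latestStat.Store(stats)
	}
}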
@@ -0,0 +1 @@
ALTER TABLE workspace_agent_stats ALTER COLUMN connection_median_latency_ms TYPE bigint;
Will this work with non-empty data? You could consider adding two fixtures in testdata/fixtures (000106_pre_workspace_agent_stats_connection_latency.up.sql and 000107_post_workspace_agent_stats_connection_latency.up.sql). In the former you add a row with a bigint value and in the latter you add a row with a float value. If tests pass then all is good. 👍🏻
codersdk/deployment.go Outdated
type DeploymentStats struct {
	// AggregatedFrom is the time in which stats are aggregated from.
	// This might be back in time a specific duration or interval.
	AggregatedFrom time.Time `json:"aggregated_since" format:"date-time"`
Any reason to keep json/field out of sync?
Nope, just a mistake on my end. Good catch!
codersdk/deployment.go Outdated
	BuildingWorkspaces int64 `json:"building_workspaces"`
	RunningWorkspaces  int64 `json:"running_workspaces"`
	FailedWorkspaces   int64 `json:"failed_workspaces"`
	StoppedWorkspaces  int64 `json:"stopped_workspaces"`
Could utilize nesting, e.g. workspaces.pending, session_count.vscode, etc. Matter of preference, so dealer's choice.
Agreed that's a bit easier to parse!
codersdk/deployment.go Outdated
	SessionCountReconnectingPTY int64 `json:"session_count_reconnecting_pty"`
	WorkspaceRxBytes            int64 `json:"workspace_rx_bytes"`
	WorkspaceTxBytes            int64 `json:"workspace_tx_bytes"`
These could be under workspaces.rx_bytes. Singular vs plural (workspace, workspaces) above is a bit confusing currently.
This adds a bar at the bottom of the dashboard (only visible to admins) with periodically updating statistics on workspaces.
After merge, I'll add warnings for the DERPForcedWebsocket property to indicate reduced throughput.