Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

feat: expose workspace statuses (with details) as a prometheus metric#12762

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
dannykopping merged 15 commits intocoder:mainfromdannykopping:dk/workspace-metric
Apr 2, 2024

Conversation

dannykopping
Copy link
Contributor

@dannykoppingdannykopping commentedMar 26, 2024
edited
Loading

Implements#12462

This produces the following metric:

coderd_api_workspace_detail{status="succeeded",template_name="docker",template_version="distracted_hopper6",workspace_name="test_workspace",workspace_owner="danny",workspace_transition="start"}

Notes to reviewers:

  • Minor behaviour change: previously if a metric existed but then subsequent queries loaded no entries, it would no reset the gauge. Additionally, we set the state of the gauge on startup to help with testing. Both of these would create a weird user-experience in my view so I got rid of them, but it has some small side-effects which I call out in the code
  • I've reduced the refresh rate ofActiveUsers andWorkspaces metrics to a minute; 5 minutes seemed quite excessive and may not allow operators to act fast enough; not sure if there was a reason these were different to the other metrics in this file
  • I've also made theWorkspace tests run with a slower interval (25ms instead of 1ms) and a faster cut-off (1s vs 10s)

…levant detailLight refactoringSigned-off-by: Danny Kopping <danny@coder.com>
Light refactoringSigned-off-by: Danny Kopping <danny@coder.com>
…ed inappropriately slowSigned-off-by: Danny Kopping <danny@coder.com>
Moderate refactoringSigned-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Use stronger randSigned-off-by: Danny Kopping <danny@coder.com>
Comment on lines -984 to -987
if strings.HasPrefix(scanner.Text(), "coderd_api_workspace_latest_build_total") {
hasWorkspaces = true
continue
}
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Side-effect of clearing the gauge when no db rows are loaded.
This wasn't strictly necessary for the test since other manually-registered metrics are included to validate this test.

mtojek reacted with thumbs up emoji
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
@dannykoppingdannykopping marked this pull request as ready for reviewMarch 26, 2024 13:38
LatestBuildCanceledAt sql.NullTime `db:"latest_build_canceled_at" json:"latest_build_canceled_at"`
LatestBuildError sql.NullString `db:"latest_build_error" json:"latest_build_error"`
LatestBuildTransition WorkspaceTransition `db:"latest_build_transition" json:"latest_build_transition"`
LatestBuildStatus NullProvisionerJobStatus `db:"latest_build_status" json:"latest_build_status"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is there a way to make this be non-nullable in the query?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Afraid not, looks like thelatest_build attributes are the result of a left join:

SELECT
workspaces.id, workspaces.created_at, workspaces.updated_at, workspaces.owner_id, workspaces.organization_id, workspaces.template_id, workspaces.deleted, workspaces.name, workspaces.autostart_schedule, workspaces.ttl, workspaces.last_used_at, workspaces.dormant_at, workspaces.deleting_at, workspaces.automatic_updates, workspaces.favorite,
COALESCE(template.name, 'unknown') as template_name,
latest_build.template_version_id,
latest_build.template_version_name,
users.username as username,
latest_build.completed_at as latest_build_completed_at,
latest_build.canceled_at as latest_build_canceled_at,
latest_build.error as latest_build_error,
latest_build.transition as latest_build_transition,
latest_build.job_status as latest_build_status
FROM
workspaces
JOIN
users
ON
workspaces.owner_id = users.id
LEFT JOIN LATERAL (
SELECT
workspace_builds.id,
workspace_builds.transition,
workspace_builds.template_version_id,
template_versions.name AS template_version_name,
provisioner_jobs.id AS provisioner_job_id,
provisioner_jobs.started_at,
provisioner_jobs.updated_at,
provisioner_jobs.canceled_at,
provisioner_jobs.completed_at,
provisioner_jobs.error,
provisioner_jobs.job_status
FROM
workspace_builds
LEFT JOIN
provisioner_jobs
ON
provisioner_jobs.id = workspace_builds.job_id
LEFT JOIN
template_versions
ON
template_versions.id = workspace_builds.template_version_id
WHERE
workspace_builds.workspace_id = workspaces.id
ORDER BY
build_number DESC
LIMIT
1
) latest_build ON TRUE

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I suppose by definition a workspace cannot exist without a build...

I see inworkspace_builds:

createtableif not existspublic.workspace_builds(    id                  uuidnot nullprimary key,    created_attimestamp with time zonenot null,    updated_attimestamp with time zonenot null,    workspace_id        uuidnot nullreferencespublic.workspaceson delete cascade,...

The workspace foreign key is not null, so I think this left join may be inappropriate?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Hhmm. I changed it and ranmake gen but it's still being defined as a nullable field.
Also, one of its sibling fieldslatest_build_transition is not nullable and is derived from the same join, so actually all of the above about the join was ostensibly was invalid.

I'll need to dig a little further.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ah! I was basically on the correct track!

provisioner_jobs was being left joined as well, and since that also has a non-nullable foreign key withworkspace_builds.job_id, it is safe to use an inner join. I can actually leave the outer lateral join as it is but just change this to correct the generated code.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

johnstcn reacted with hooray emoji
@@ -12405,7 +12406,7 @@ WHERE
-- @authorize_filter
), filtered_workspaces_order AS (
SELECT
fw.id, fw.created_at, fw.updated_at, fw.owner_id, fw.organization_id, fw.template_id, fw.deleted, fw.name, fw.autostart_schedule, fw.ttl, fw.last_used_at, fw.dormant_at, fw.deleting_at, fw.automatic_updates, fw.favorite, fw.template_name, fw.template_version_id, fw.template_version_name, fw.username, fw.latest_build_completed_at, fw.latest_build_canceled_at, fw.latest_build_error, fw.latest_build_transition
fw.id, fw.created_at, fw.updated_at, fw.owner_id, fw.organization_id, fw.template_id, fw.deleted, fw.name, fw.autostart_schedule, fw.ttl, fw.last_used_at, fw.dormant_at, fw.deleting_at, fw.automatic_updates, fw.favorite, fw.template_name, fw.template_version_id, fw.template_version_name, fw.username, fw.latest_build_completed_at, fw.latest_build_canceled_at, fw.latest_build_error, fw.latest_build_transition, fw.latest_build_status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

follow-up idea: it could be useful to have this field in thecodersdk.Workspace as well :-)

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@johnstcn and I spoke offline to elaborate on this, apparently we convert between thedatabase.Workspace andcodersdk.Workspace types in some situations (convertWorkspace, for example), and it would be good to maintain parity between them. This would be an enhancement for a subsequent PR, we agreed.

Signed-off-by: Danny Kopping <danny@coder.com>
Copy link
Member

@johnstcnjohnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Smoke-tested:

After creating workspace:

curl -fsSL http://localhost:21112 | grep coderd_api_workspace_detail# HELP coderd_api_workspace_detail The current workspace details by template, transition, owner, and status.# TYPE coderd_api_workspace_detail gaugecoderd_api_workspace_detail{status="succeeded",template_name="docker",template_version="nice_hofstadter1",workspace_name="test",workspace_owner="admin",workspace_transition="start"} 1

After stopping a workspace and waiting approx 1 minute:

curl -fsSL http://localhost:21112 | grep coderd_api_workspace_detail# HELP coderd_api_workspace_detail The current workspace details by template, transition, owner, and status.# TYPE coderd_api_workspace_detail gaugecoderd_api_workspace_detail{status="succeeded",template_name="docker",template_version="nice_hofstadter1",workspace_name="test",workspace_owner="admin",workspace_transition="stop"} 1

@dannykopping
Copy link
ContributorAuthor

Based on@johnstcn's comment and a little time away from this, I'm thinking the current metric name is not quite explanatory enough.

I'm thinking ofcoderd_api_workspace_latest_build_status. Thoughts?

Copy link
Member

@mtojekmtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

High level review round

@@ -118,7 +119,7 @@ LEFT JOIN LATERAL (
provisioner_jobs.job_status
FROM
workspace_builds
LEFTJOIN
JOIN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Check this with@mafredri if it was not intentional 👍

Copy link
Member

@mafredrimafredriMar 27, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I would avoid making this change. I believe there's a window of time when there exists a new build but it's not assigned to a job. So the nullability is something that should be handled.

This logic change essentially results in returning the previous build until the it's been assigned to a job.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm not sure if that's possible?

createtableif not exists workspace_builds(    id                  uuidnot null    ...    workspace_id        uuidnot nullreferencespublic.workspaceson delete cascade,    ...    job_id              uuidnot null        uniquereferencespublic.provisioner_jobson delete cascade,    ...);

These keys are all non-nullable.
Is there something I'm missing?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I should say though that I'd rather not make this change if you in any way suspect this will cause issues.
The way I had it prior to#12762 (comment) worked fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Builds are inserted with their respective job in a single transaction. c.f.wsbuilder

Also, Danny's right that job_id isNOT NULL with a foreign key constraint, so even if we brokewsbuilder it still wouldn't be possible to have a build with no job.

mtojek reacted with eyes emoji
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Given the constraint, the change is perfectly fine 👍. I should’ve checked it myself before commenting. Sorry for the noise@dannykopping.

Copy link
Member

@mafredrimafredriMar 28, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The DB schema should probably be updated to include NOT NULL on the foreign keys, though. Otherwise NULL entries are not enforced by the DB.

Actually, where did you get the schema@dannykopping you referenced above? Doesn’t look the same as our dump which has NOT NULL.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

No worries@mafredri!

I just generated the DDL from an existing database in my IDE (GoLand). The above also hasnot null but you have to scroll over a bit to the right.

mafredri reacted with heart emoji
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Haha, it was perfectly hidden by the code block size 😂.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I know right??! Sorry about that lol

return nil, err
}

workspaceDetails := prometheus.NewGaugeVec(prometheus.GaugeOpts{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thinking loud:

During the last scale tests we hit 2000 workspaces, does it mean that we're going to swamp our Prometheus endpoint with details for all of them? If so, maybe we should make it configurable? cc@johnstcn

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think a time-series per workspace is excessive here. Operators want to know general counts of workspaces in different states. If they want to know specific workspace names they can just hit our API or look at the/workspaces page in our frontend.

This is especially true since the metrics will be reported by each Coderd, so if you have M workspaces and N Coderd replicas, that's M*N time-series that prometheus needs to track.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

That's fair. I was basing the decision to include it from the original request so I'll defer to@bpmct before removing it.

Copy link
Member

@johnstcnjohnstcnMar 27, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

See also Spike's comment:#12462 (comment)

I think what they want is a GUAGE that tracks the current number of workspaces in a particular state, not a counter of builds.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Oh wow, it seems I misunderstood the initial ask.

I was under the incorrect assumption that@bpmct wanted a metricto match the format described in the issue, and "the builds metric" which he was referring to was the existingcoderd_api_workspace_latest_build_total metric 🤦

coderd_workspace_builds_total already exists!

# HELP coderd_workspace_builds_total The number of workspaces started, updated, or deleted.# TYPE coderd_workspace_builds_total countercoderd_workspace_builds_total{status="failed",template_name="docker",template_version="vibrant_davinci2",workspace_name="asfsasfafs",workspace_owner="danny",workspace_transition="START"} 1coderd_workspace_builds_total{status="success",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_name="2",workspace_owner="danny",workspace_transition="START"} 1coderd_workspace_builds_total{status="success",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_name="asfsasfafs",workspace_owner="danny",workspace_transition="START"} 1coderd_workspace_builds_total{status="success",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_name="asfsasfafs",workspace_owner="danny",workspace_transition="STOP"} 1coderd_workspace_builds_total{status="success",template_name="docker",template_version="mystifying_driscoll3",workspace_name="2",workspace_owner="danny",workspace_transition="STOP"} 1

Plus we don't have to worry about cardinality because these metrics are written by the provisioner when executing a job, not reading from the database state.

I have amended the new metric to drop theworkspace label.

Ithink this will satisfy the requirements but@bpmct please confirm.

# HELP coderd_workspace_latest_build_status The current workspace statuses by template, transition, and owner.# TYPE coderd_workspace_latest_build_status gaugecoderd_workspace_latest_build_status{status="failed",template_name="docker",template_version="happy_feynman7",workspace_owner="danny",workspace_transition="stop"} 1coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_owner="danny",workspace_transition="start"} 2coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="distracted_hopper6",workspace_owner="danny",workspace_transition="start"} 1coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="ecstatic_greider3",workspace_owner="danny",workspace_transition="start"} 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ithink this will satisfy the requirements but@bpmct please confirm.

# HELP coderd_workspace_latest_build_status The current workspace statuses by template, transition, and owner.# TYPE coderd_workspace_latest_build_status gaugecoderd_workspace_latest_build_status{status="failed",template_name="docker",template_version="happy_feynman7",workspace_owner="danny",workspace_transition="stop"} 1coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_owner="danny",workspace_transition="start"} 2coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="distracted_hopper6",workspace_owner="danny",workspace_transition="start"} 1coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="ecstatic_greider3",workspace_owner="danny",workspace_transition="start"} 1

Yep!

dannykopping reacted with thumbs up emoji
@mtojek
Copy link
Member

I'm thinking of coderd_api_workspace_latest_build_status. Thoughts?

I wonder if it isn't too late for renaming as people may use it on their dashboards.

@dannykopping
Copy link
ContributorAuthor

I'm thinking of coderd_api_workspace_latest_build_status. Thoughts?

I wonder if it isn't too late for renaming as people may use it on their dashboards.

This is a new metric being added by this PR.

@mtojek
Copy link
Member

My bad, I assumed that this iscoderd_workspace_builds_total being refactored.

dannykopping reacted with thumbs up emoji

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
return
}

workspaceStatuses.Reset()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

this has a race condition where if the registry collects the metrics before the for-loop below is complete, it will get partial data. This can result in glitchy stats were they are zeroed out for a single data point and then return to "normal." If anyone puts alerts on those stats it could result in false alerting in addition to visual noise on the dashboard.

I realize this is following the pattern of the existing metric, which has the same bug.

Instead these stats should be a custom collector that queries the database and returns the requested data on demand.

Copy link
ContributorAuthor

@dannykoppingdannykoppingMar 28, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Instead these stats should be a custom collector that queries the database and returns the requested data on demand.
Definitely agree! Called this out in the description as well.

I'm not sure if we can fix this race without the change to the collector pattern. Do you have any ideas?
In reality the race would have to be on the nanosecond scale because this loop will iterate very quickly, but your point is absolutely valid.

I'll add a follow-up issue for this in any case.

mtojek reacted with thumbs up emoji
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Copy link
Member

@mtojekmtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Looks like it works as intended 👍 I left some comments on naming, but definitely not blockers for pushing.

workspaceStatuses.Reset()
}

logger.Warn(ctx, "failed to load active workspaces", slog.Error(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

nit: maybelogger.Error?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I wouldn't strictly call it an error because it could be an instance ofsql.ErrNoRows which is valid but undesirable. To be the most correct I would make it warn in case ofsql.ErrNoRows and error in all other cases, but that feels unnecessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

which is valid but undesirable

Will it also returnsql.ErrNoRows if there are zero workspaces = fresh deployment?

Copy link
ContributorAuthor

@dannykoppingdannykoppingMar 28, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Believe so, yes.
Or if all workspaces have been deleted.

@dannykopping
Copy link
ContributorAuthor

I believe I've addressed all the comments. I'll merge, but if there any subsequent thoughts, I'll amend those in a follow-up.
Thanks for the reviews!

@dannykoppingdannykopping merged commit79fb8e4 intocoder:mainApr 2, 2024
@dannykoppingdannykopping deleted the dk/workspace-metric branchApril 2, 2024 07:57
@github-actionsgithub-actionsbot locked and limited conversation to collaboratorsApr 2, 2024
Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.
Reviewers

@johnstcnjohnstcnjohnstcn approved these changes

@spikecurtisspikecurtisspikecurtis left review comments

@mtojekmtojekmtojek approved these changes

@bpmctbpmctbpmct left review comments

@mafredrimafredriAwaiting requested review from mafredri

Assignees

@dannykoppingdannykopping

Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

6 participants
@dannykopping@mtojek@mafredri@johnstcn@spikecurtis@bpmct

[8]ページ先頭

©2009-2025 Movatter.jp