feat: add prebuild timing metrics to Prometheus #19503


Merged
ssncferreira merged 13 commits into main from ssncferreira/prebuild_metrics on Aug 28, 2025

Conversation

@ssncferreira (Contributor) commented Aug 22, 2025 (edited)

Description

This PR introduces one counter and two histograms related to workspace creation and claiming. The goal is to provide clearer observability into how workspaces are created (regular vs prebuild) and the time cost of those operations.

coderd_workspace_creation_total

  • Metric type: Counter
  • Name: coderd_workspace_creation_total
  • Labels: organization_name, template_name, preset_name

This counter tracks creations of regular workspaces (those not created from a prebuild pool) and whether a preset was used.
Currently, we already expose coderd_prebuilt_workspaces_claimed_total for claimed prebuilt workspaces, but we lack a comparable metric for regular workspace creations. This metric fills that gap, making it possible to compare regular creations against claims.

Implementation notes:

  • Exposed as a coderd_ metric, consistent with other workspace-related metrics (e.g. coderd_api_workspace_latest_build: https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
  • Every defaultRefreshRate (1 minute), the DB query GetRegularWorkspaceCreateMetrics is executed to fetch all regular workspaces (not created from a prebuild pool).
  • The counter is updated with the total from all time (not just since metric introduction). This differs from the histograms below, which only accumulate from their introduction forward. A rough sketch of this pattern is shown after this list.
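A rough, self-contained sketch of that pattern (not the PR's actual code: the row type and fetch function are hypothetical stand-ins for GetRegularWorkspaceCreateMetrics, and the real implementation refreshes on the 1-minute timer and caches results rather than querying on every scrape):

package main

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical row shape returned by the DB query.
type workspaceCreationRow struct {
	OrganizationName, TemplateName, PresetName string
	CreatedCount                               int64
}

type workspaceCreationCollector struct {
	desc  *prometheus.Desc
	fetch func(context.Context) ([]workspaceCreationRow, error)
}

func (c *workspaceCreationCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *workspaceCreationCollector) Collect(ch chan<- prometheus.Metric) {
	rows, err := c.fetch(context.Background())
	if err != nil {
		return // a real implementation would log and keep the last known values
	}
	for _, r := range rows {
		// Expose the all-time total from the DB as a counter sample.
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue,
			float64(r.CreatedCount), r.OrganizationName, r.TemplateName, r.PresetName)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&workspaceCreationCollector{
		desc: prometheus.NewDesc(
			"coderd_workspace_creation_total",
			"Total regular (non-prebuilt) workspaces created, by preset.",
			[]string{"organization_name", "template_name", "preset_name"}, nil,
		),
		fetch: func(context.Context) ([]workspaceCreationRow, error) {
			// Stand-in for the GetRegularWorkspaceCreateMetrics DB query.
			return []workspaceCreationRow{{"acme", "dev-env", "large", 42}}, nil
		},
	})
}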

coderd_workspace_creation_duration_seconds & coderd_prebuilt_workspace_claim_duration_seconds

  • Metric types: Histogram
  • Names:
    • coderd_workspace_creation_duration_seconds
      • Labels: organization_name, template_name, preset_name, type (regular, prebuild)
    • coderd_prebuilt_workspace_claim_duration_seconds
      • Labels: organization_name, template_name, preset_name

We already have coderd_provisionerd_workspace_build_timings_seconds, which tracks build run times for all workspace builds handled by the provisioner daemon.
However, in the context of this issue, we are only interested in creation and claim build times, not all transitions; additionally, that metric does not include preset_name, and adding it there would significantly increase cardinality. Therefore, separate, more focused metrics are introduced here:

  • coderd_workspace_creation_duration_seconds: Build time to create a workspace (either a regular workspace or the initial provisioning build of a prebuilt workspace into the pool).
  • coderd_prebuilt_workspace_claim_duration_seconds: Time to claim a prebuilt workspace from the pool.

Two separate histograms are used because:

  • Creation (regular or prebuild): full provisioning builds of similar magnitude, generally expected to take longer than a claim operation.
  • Claim: expected to be a much faster provisioning build.

Native histogram usage

Provisioning times vary widely between projects. Using static buckets risks unbalanced or poorly informative histograms.
To address this, these metrics use Prometheus native histograms:

  • First introduced in Prometheus v2.40.0
  • Recommended stable usage from v2.45+
  • Requires Go client prometheus/client_golang v1.15.0+
  • Experimental and must be explicitly enabled on the server (--enable-feature=native-histograms)

For compatibility, we also retain a classic bucket definition (aligned with the existing provisioner metric: https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189). A sketch of this dual definition is shown after the list below.

  • If native histograms are enabled, Prometheus ingests the high-resolution histogram.
  • If not, it falls back to the predefined buckets.
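For concreteness, a minimal sketch of how one metric can carry both the classic buckets and the native-histogram parameters in client_golang; the bucket boundaries and label values are illustrative, not necessarily those committed in this PR. The client exposes both representations; which one Prometheus keeps depends on whether --enable-feature=native-histograms is set, as described above.

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	creationDuration := prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "coderd",
		Name:      "workspace_creation_duration_seconds",
		Help:      "Workspace creation build time in seconds.",
		// Classic buckets: what Prometheus ingests when it is NOT started
		// with --enable-feature=native-histograms (values illustrative).
		Buckets: []float64{10, 30, 60, 60 * 5, 60 * 10, 60 * 30, 60 * 60},
		// Native histogram parameters: ingested instead of the buckets
		// above when the server has native histograms enabled.
		NativeHistogramBucketFactor:     1.1,
		NativeHistogramMaxBucketNumber:  100,
		NativeHistogramMinResetDuration: time.Hour,
	}, []string{"organization_name", "template_name", "preset_name", "type"})
	reg.MustRegister(creationDuration)

	// type is "regular" or "prebuild"; the duration would come from the build job.
	creationDuration.
		WithLabelValues("acme", "dev-env", "large", "regular").
		Observe((3 * time.Minute).Seconds())
}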

Implementation notes:

  • Unlike the counter, these histograms are updated in real time at workspace build job completion.
  • They reflect data only from the point of introduction forward (no historical backfill).

Relates to

Closes: #19528
Native histograms tested in the observability stack: coder/observability#50

@ssncferreira force-pushed the ssncferreira/prebuild_metrics branch 4 times, most recently from 9a14915 to 7bd4c6a on August 25, 2025 11:06
@ssncferreira force-pushed the ssncferreira/prebuild_metrics branch from 7bd4c6a to 1cd4417 on August 25, 2025 11:48
COUNT(*) AS created_count
FROM workspaces w
INNER JOIN workspace_builds wb ON wb.workspace_id = w.id
AND wb.build_number = 1
@ssncferreira (Contributor, author) commented Aug 25, 2025 (edited)

This metric counts workspaces whose first build succeeded. If the first build failed and a later build succeeded, that workspace is not included. We could change the query to count the first successful build instead (wb.transition = 'start' AND pj.job_status = 'succeeded'); however, we would still need to exclude prebuild-created workspaces by verifying that the first build was not initiated by the prebuilds system user.
It is guaranteed that the first build of a prebuilt workspace will always be initiated by the prebuilds system user, right?

Member

It is guaranteed that the first build of a prebuilt workspace will always be initiated by the prebuilds system user, right?

This assumption would fail if an admin were to create a workspace 'on behalf of' the prebuilds user. I don't see a logical reason why someone would do this though, as the reconciler would likely try to delete it?

Contributor (author)

This assumption would fail if an admin were to create a workspace 'on behalf of' the prebuilds user.

Is this possible? If yes, should we allow this?

as the reconciler would likely try to delete it?

The reconciliation loop would detect whether the number of desired instances matches the number of running instances. If that were not the case, it would determine there is an extraneous instance, but it wouldn't necessarily delete this one; it just deletes the oldest prebuilt workspace: https://github.com/coder/coder/blob/main/coderd/prebuilds/preset_snapshot.go#L327

Member

Is this possible?

Yes, it's possible and supported to create a workspace for another user:

$ coder create --help
coder v2.25.1-devel+dd867bd74

USAGE:
  coder create [flags] [workspace]

  Create a workspace

    - Create a workspace for another user (if you have permission):

        $ coder create <username>/<workspace_name>

If yes, should we allow this?

Good question :) It gets very weird to reason about especially if the workspace that gets created is a prebuilt workspace...

[Screenshot omitted]

I would consider it out of scope of this issue.

Contributor (author)

Addressed in: 11d7ed3
Updated the SQL to get the first successful start build for each workspace, then exclude workspaces where that build was initiated by the prebuilds system user.
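Illustrative shape of such a query (not the committed SQL; table and column names follow the fragments quoted in this thread where possible and are otherwise assumptions):

-- Illustrative only; the committed query may differ.
WITH first_successful_start AS (
    SELECT DISTINCT ON (wb.workspace_id)
        wb.workspace_id,
        wb.initiator_id,
        wb.template_version_preset_id
    FROM workspace_builds wb
    JOIN provisioner_jobs pj ON pj.id = wb.job_id
    WHERE wb.transition = 'start'
      AND pj.job_status = 'succeeded'
    ORDER BY wb.workspace_id, wb.build_number, wb.id
)
SELECT t.name AS template_name,
       COALESCE(tvp.name, '') AS preset_name,
       o.name AS organization_name,
       COUNT(*) AS created_count
FROM workspaces w
JOIN first_successful_start fsb ON fsb.workspace_id = w.id
JOIN templates t ON t.id = w.template_id
JOIN organizations o ON o.id = w.organization_id
LEFT JOIN template_version_presets tvp ON tvp.id = fsb.template_version_preset_id
-- Exclude workspaces whose first successful start was initiated by the
-- prebuilds system user.
WHERE fsb.initiator_id != 'c42fdf75-3097-471c-8c33-fb52454d81c0'::uuid
GROUP BY t.name, COALESCE(tvp.name, ''), o.name
ORDER BY t.name, preset_name, o.name;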


workspaceCreationTimings := prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Namespace: "coderd",
    Subsystem: "provisionerd",
Contributor (author)

This PR introduces histogram metrics that are registered at the API layer but observed in provisionerdserver. Because provisionerdserver doesn't own a Prometheus registry, we inject a callback from the API to record observations. Both metrics are updated in provisionerdserver's completeWorkspaceBuildJob.
Let me know what you think, or whether it would be better to define the metrics in provisionerdserver, or perhaps to update them in another, more appropriate place.

Member

We have already set a precedent in enterprise/coderd/prebuilds.NewStoreReconciler, where we inject api.PrometheusRegistry. Would it make sense to do the same here for consistency?

Contributor (author)

Yes, that makes sense 👍 Just wanted to check whether there would be a more appropriate place to introduce these metrics, before making that change

dannykopping reacted with thumbs up emoji
Contributor (author)

Addressed in b59ebad
Introduced a new metrics file in the provisionerdserver package. Since each provisioner daemon gets its own provisionerdserver instance, the provisionerdserver metrics are created once in enablePrometheus() and passed down to each provisionerdserver instance. This ensures all daemons share the same metrics objects and avoids Prometheus registration conflicts.

This follows the existing pattern used for provisionerd.Metrics, which are created once in cli/server.go and shared across all in-memory provisioner daemons: https://github.com/coder/coder/blob/main/cli/server.go#L1004
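A minimal sketch of that sharing pattern, with hypothetical type and metric names rather than the PR's actual ones: metrics are created and registered once, then the same instances are handed to every in-memory provisioner daemon's server.

package main

import "github.com/prometheus/client_golang/prometheus"

type Metrics struct {
	WorkspaceCreationTimings *prometheus.HistogramVec
	WorkspaceClaimTimings    *prometheus.HistogramVec
}

func NewMetrics(reg prometheus.Registerer) *Metrics {
	m := &Metrics{
		WorkspaceCreationTimings: prometheus.NewHistogramVec(prometheus.HistogramOpts{
			Namespace: "coderd",
			Name:      "workspace_creation_duration_seconds",
			Help:      "Workspace creation build time in seconds.",
		}, []string{"organization_name", "template_name", "preset_name", "type"}),
		WorkspaceClaimTimings: prometheus.NewHistogramVec(prometheus.HistogramOpts{
			Namespace: "coderd",
			Name:      "prebuilt_workspace_claim_duration_seconds",
			Help:      "Prebuilt workspace claim time in seconds.",
		}, []string{"organization_name", "template_name", "preset_name"}),
	}
	// Registering once avoids duplicate-registration panics when several
	// provisionerdserver instances share the same registry.
	reg.MustRegister(m.WorkspaceCreationTimings, m.WorkspaceClaimTimings)
	return m
}

type server struct{ metrics *Metrics } // each daemon's server gets the same *Metrics

func main() {
	m := NewMetrics(prometheus.NewRegistry())
	_ = []*server{{metrics: m}, {metrics: m}, {metrics: m}}
}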

Comment on lines 196 to 203
NativeHistogramBucketFactor: 1.1,
// Max number of native buckets kept at once to bound memory.
NativeHistogramMaxBucketNumber: 100,
// Merge/flush small buckets periodically to control churn.
NativeHistogramMinResetDuration: time.Hour,
// Treat tiny values as zero (helps with noisy near-zero latencies).
NativeHistogramZeroThreshold: 0,
NativeHistogramMaxZeroThreshold: 0,
@ssncferreira (Contributor, author) commented Aug 25, 2025 (edited)

These are the native histogram parameters. If we choose this approach, we need to (edit: should) ensure our Prometheus server supports native histograms (still experimental) before deploying to dogfood. Additionally, we should also update the documentation to note that we rely on this Prometheus feature.
As described in the PR, if Prometheus is not started with native histograms enabled, the metrics will automatically fall back to standard bucketed histograms using the bucket scheme listed above (aligned with the existing provisionerd metrics).

Note: these native histograms were tested locally with Prometheus v3.5.0

Member

we need to ensure our Prometheus server supports native histograms (still experimental)

I wouldn't assume that existing customers have an experimental Prometheus feature enabled. But as you say, there is a transparent fallback so this shouldn't be an issue?

Contributor (author)

But as you say, there is a transparent fallback so this shouldn't be an issue?

This shouldn’t be an issue while customers are using classic histograms, but there will inevitably be a migration phase once customers want to adopt native histograms (or if we decide to enforce them). From https://prometheus.io/docs/specs/native_histograms/#migration-considerations:

Classic and native histograms cannot be aggregated with each other. A change from classic to native histograms at a certain point in time makes it hard to create dashboards that work across the transition point, and range vectors that contain the transition point will inevitably be incomplete (i.e. a range vector selecting classic histograms will only contain data points in the earlier part of the range, and a range vector selecting native histograms will only contain data points in the later part of the range).

From my perspective, we have three paths forward:

  1. Enforce native histograms only:
    • Customers would need to explicitly enable the feature on their Prometheus server today to be able to see these metrics.
    • This is risky, since not all customers may be willing (or able) to do so yet.
    • Given that this feature was requested by a customer, this option might be a non-starter.
  2. Support both native and classic histograms (the approach proposed in this PR):
    • Provides the best compatibility: customers without native histograms enabled continue using classic, while those with them enabled get the benefits.
    • At some point, there would still need to be a migration once customers decide to use native histograms (either by explicitly enabling them or once they become the default).
    • There are some interesting considerations here: https://prometheus.io/docs/specs/native_histograms/#migration-considerations (a query-level illustration of the difference follows after this list).
    • For our dogfood setup, it might make sense to mirror the expected customer journey:
      1. Run without native histograms.
      2. Enable native histograms in parallel with classic.
      3. Fully transition to native.
    • This way, we can validate the migration path ourselves and uncover issues before customers do.
  3. Stick with classic histograms only:
    • Safest and simplest option in the short term.
    • We can still migrate later once the feature becomes stable.
    • At that point, we could migrate all histogram metrics together, not just these new ones.
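To make the transition point concrete, here is what the same p95 query looks like on each side; the metric name comes from this PR, while the 0.95 quantile and 5m range are arbitrary, purely to illustrate the classic-vs-native query shapes:

# Classic (fallback) histogram: quantile computed from the _bucket series.
histogram_quantile(0.95,
  sum by (le) (rate(coderd_prebuilt_workspace_claim_duration_seconds_bucket[5m])))

# Native histogram: no _bucket suffix and no le label; the quantile is
# derived from the native histogram sample itself.
histogram_quantile(0.95,
  sum(rate(coderd_prebuilt_workspace_claim_duration_seconds[5m])))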

Member

Without fully understanding the implications yet, my weak preference is to tread the expected path and find issues first!

Contributor

I prefer option #2. A graceful fallback seems like a good idea, particularly given how wide you've made the fallback buckets. The metrics will be less useful, but still directionally helpful (i.e. you probably don't care whether a claim took 5m, 6m, 7m, or 10m; anything above 5m is just awful).

We should add a follow-up to add the native histogram flag to coder/observability, and ensure everything works out-of-the-box. I'd prefer we did that before we merge this, otherwise proof of local validation using the flag will suffice.

Contributor (author)

Native histograms successfully tested in coder/observability: coder/observability#50
Will finish this to make it ready for the release, and then continue with the observability draft PR to introduce Grafana dashboards for the metrics introduced here.

@dannykopping (Contributor) left a comment

I like the approach you've taken here 👍

Comment on lines 218 to 222
60,       // 1min
60 * 5,
60 * 10,
60 * 30,  // 30min
60 * 60,  // 1hr
Contributor

I'd prefer we have higher resolution between 1 and 5m, since that'll actually be helpful, and cap at 5m because anything beyond 5m is so horrific that prebuilds are basically useless.

Contributor (author)

That’s a good point. I wasn’t completely sure what appropriate claim time values would be, since that can vary a lot from customer to customer. For this first draft, I just reused the same buckets we already use for other histograms, like https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L203-L211 for workspace build times.

I agree with having higher resolution between 1m and 5m, I’m just not sure about capping at 5m 🤔 Looking at those existing values, it seems we’ve had builds that take between 30 minutes and 1 hour. Do we still expect claim times to reliably stay under 5 minutes even in those cases? Just curious how you arrived at the 5m mark 🙂

Contributor

Looking at those existing values, it seems we’ve had builds that take between 30 minutes and 1 hour

For claims?

Do we still expect claim times to reliably stay under 5 minutes even in those cases? Just curious how you arrived at the 5m mark 🙂

If a prebuild claim takes longer than 5m, there's no point in using prebuilds. I chose 5m since it was the most realistic upper bound of the buckets you listed.

Contributor (author)

For claims?

No, the bucket definitions I was referring to are for workspace builds, more precisely coderd_provisionerd_workspace_build_timings_seconds: https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L203-L211.
These record the time taken for a workspace to build, and I believe they were introduced before prebuilds existed. Based on these buckets, my assumption was that some workspace builds (without prebuilds) have taken on the order of 30m–1h, so I was curious whether in those cases a claim from a prebuilt workspace would still reliably fall under 5m.

Contributor

Let's come back to the point of these metrics... Operators need to know when claims are taking too long, right? Any prebuilt workspace claim which exceeds 5m is pointless. Why do we need any more resolution above a threshold like this?

We're talking only about the fallback case here, btw. If we have native histograms then operators can see the "real" distribution, if they really need to.

Contributor (author)

We're talking only about the fallback case here, btw. If we have native histograms then operators can see the "real" distribution, if they really need to.

True, but my guess is that most customers with Prometheus are not using native histograms, so they will most likely use the classic histograms.

Why do we need any more resolution above a threshold like this?

I'm ok with capping at 5m. I was mainly curious if there’s already an agreement that 5 minutes is the maximum acceptable claim time for prebuilt workspaces to still be considered effective (especially taking into account longer build times for workspaces without prebuilds)

@ssncferreira (Contributor, author) commented Aug 27, 2025 (edited)

Addressed the bucket definition for the claim histogram in 11d7ed3
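For reference, a layout along the lines discussed (finer resolution between 1m and 5m, capped at 5m) could look like the sketch below; these exact values are illustrative and not necessarily what 11d7ed3 uses:

// Hypothetical claim-duration buckets: fine-grained up to 5 minutes, with
// everything slower falling into the implicit +Inf bucket.
var claimDurationBuckets = []float64{
	5, 10, 30,                  // fast claims
	60, 90, 120, 180, 240, 300, // 1m–5m at finer resolution
}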


workspaceClaimTimings := prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Namespace: "coderd",
    Subsystem: "provisionerd",
Contributor

It feels a bit awkward to have this namespaced in this way as opposed to under coderd_prebuilt_workspaces.

Contributor (author)

Yes, I agree. Right now, the histogram metrics are set in the provisionerd server, which is why the subsystem was originally defined as provisionerd. I can rename this one to prebuilt_workspace, but then for consistency, we should also update workspace_creation_duration_seconds.

An alternative would be to drop the provisionerd subsystem altogether and just go with:

  • coderd_workspace_creation_duration_seconds
  • coderd_prebuilt_workspace_claim_duration_seconds

@dannykopping (Contributor) commented Aug 26, 2025 (edited)

An alternative

Yeah, that seems like a good solution. We already have metrics like coderd_workspace_builds_total.

ssncferreira reacted with thumbs up emoji
Contributor (author)

Addressed in 11d7ed3

@matifali added the cherry-pick/v2.26 label ("Needs to be cherry-picked to the 2.26 release branch") on Aug 27, 2025
}
}

func TestWorkspaceCreationTotal(t *testing.T) {
Contributor (author)

Note: The tests included in this PR are not exhaustive. I plan to add additional test cases to cover more use cases in a follow-up PR. Keeping this PR focused helps avoid delaying the release of this feature.

dannykopping reacted with thumbs up emoji
@ssncferreira marked this pull request as ready for review on August 28, 2025 09:22
-- Exclude workspaces whose first successful start was the prebuilds system user
AND fsb.initiator_id != 'c42fdf75-3097-471c-8c33-fb52454d81c0'::uuid
GROUP BY t.name, COALESCE(tvp.name, ''), o.name
ORDER BY t.name, preset_name, o.name;
@ssncferreira (Contributor, author) commented Aug 28, 2025 (edited)

This SQL query follows a similar pattern to GetPrebuildMetrics.

This query (in combination with others in prometheusmetrics) will be executed every defaultRefreshRate, which is currently set to 1 minute. An EXPLAIN was run on dogfood's database and the results are here: https://explain.dalibo.com/plan/6f96bfac3baac3d6

Member

Is it necessary to look at all workspace builds ever, or do we need to look at builds in a certain time window?

The explain output shows a number of seq scans which we probably want to turn into at least index scans.

@ssncferreira (Contributor, author) commented Aug 28, 2025 (edited)

Is it necessary to look at all workspace builds ever, or do we need to look at builds in a certain time window?

This is to account for the edge case where a workspace was not successfully provisioned on the first workspace build; that is why we are getting the first successful start workspace build. I would say that in most cases the first workspace build will be successful, so maybe we don't need this level of detail for the counter and we can just look at the first workspace build (like we did in the initial version of the query).

The explain output shows a number of seq scans which we probably want to turn into at least index scans.

Yes, that is the main issue here 😕 I can try to create some indexes to improve the performance here.
(edit: actually it seems the biggest bottleneck here is the ORDER BY wb.workspace_id, wb.build_number, wb.id)
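A hypothetical index matching that ordering (illustrative only, not validated against the plan above) could look like:

-- Hypothetical: lets the DISTINCT ON / ORDER BY
-- (workspace_id, build_number, id) be served by an index scan
-- instead of a sort over a sequential scan.
CREATE INDEX IF NOT EXISTS idx_workspace_builds_workspace_build_number_id
    ON workspace_builds (workspace_id, build_number, id);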

Comment on lines 2317 to 2319
// Is a regular workspace creation build
// Only consider the first build number for regular workspaces
workspaceBuild.BuildNumber == 1,
@ssncferreira (Contributor, author) commented Aug 28, 2025 (edited)

It would be nice if we could add metadata to describe whether this is the first regular workspace build, similar to what we do with PrebuiltWorkspaceBuildStage_CREATE 🤔

@dannykopping (Contributor) left a comment

LGTM! Really appreciate the care and consideration that went into this.

ssncferreira reacted with heart emoji
@johnstcn (Member) left a comment

I think we can go ahead and merge this as-is, but we may need to make some follow-up changes to the query to improve the performance and reduce the number of seq scans.

ssncferreira reacted with thumbs up emoji
@ssncferreira merged commit 0ab345c into main on Aug 28, 2025
47 of 50 checks passed
@ssncferreira deleted the ssncferreira/prebuild_metrics branch on August 28, 2025 14:00
@github-actions bot locked and limited conversation to collaborators on Aug 28, 2025
Reviewers

@dannykopping approved these changes

@johnstcn approved these changes

Awaiting requested review from @SasSwart

Assignees

@ssncferreira

Labels
cherry-pick/v2.26 (Needs to be cherry-picked to the 2.26 release branch)
Development

Successfully merging this pull request may close these issues.

Implement Prebuild Timing Metrics in Prometheus
4 participants
@ssncferreira, @dannykopping, @johnstcn, @matifali
