perf: Remove GetRegularWorkspaceCreateMetrics, calculate metric values inline #21281
Conversation
github-actions bot commented Dec 15, 2025 (edited)
I have read the CLA Document and I hereby sign the CLA. 1 out of 2 committers have signed the CLA.
Force-pushed 8eb6651 to 40720fd (Compare)

Add two new Prometheus metrics to track workspace creations without expensive database queries:

1. workspace_creation_attempts_total — tracks when regular (non-prebuilt) workspace creation attempts begin (first 'start' build initiated)
   - Labels: organization_name, template_name, preset_name
   - Increments in coderd/workspaces.go when the workspace build is created
2. workspace_creation_outcomes_total — tracks outcomes of workspace first builds
   - Labels: organization_name, template_name, preset_name, status (success/failure)
   - Increments in coderd/provisionerdserver when the provisioner job completes

Benefits:
- Real-time metrics (no polling delay)
- No expensive database queries
- Event-driven counters in the packages where the events occur
- Success rate can be calculated: outcomes{status="success"} / attempts

Tests:
- Added comprehensive tests for both metrics using prometheus testutil
- Tests verify metrics increment correctly on first builds only
- Tests verify labels track organization, template, and preset names
- Tests verify outcomes track success/failure status

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

This commit removes the GetRegularWorkspaceCreateMetrics database query and replaces it with inline, event-driven Prometheus counters that track workspace creation attempts and outcomes.

Changes:
- Removed the SQL query from coderd/database/queries/workspaces.sql
- Removed generated method implementations from dbmock, dbauthz, and dbmetrics
- Removed the old metric collection from coderd/prometheusmetrics/prometheusmetrics.go
- Fixed missing imports in coderd/database/queries.sql.go (pre-existing issue)
- Fixed method names to use GetPresetByID instead of GetTemplateVersionPresetByID

The new metrics (WorkspaceCreationAttemptsTotal and WorkspaceCreationOutcomesTotal) track workspace creations inline as they occur, eliminating the need for expensive database aggregations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Simplified the tests to verify basic metric functionality without requiring a full PostgreSQL database. The new tests:
- Verify metrics can be incremented and read correctly
- Test that different label values (org, template, preset, status) are tracked
- Run quickly without needing Docker/PostgreSQL

The original integration tests required full workspace creation flows with PostgreSQL. These simplified unit tests focus on verifying that the prometheus metric itself works correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Rewrote metric tests to use dbmock instead of requiring PostgreSQL:
- Created table-driven tests covering all conditional logic scenarios
- Mock database calls with gomock expectations
- Test build numbers, transitions, initiators, presets, success/failure
- Tests run quickly without Docker/PostgreSQL dependencies

Created new test files:
- coderd/workspaces_metric_test.go: tests WorkspaceCreationAttemptsTotal
- coderd/provisionerdserver/provisionerdserver_metric_test.go: tests WorkspaceCreationOutcomesTotal

Removed old integration tests that required PostgreSQL and cleaned up unused imports.

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
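The conditional logic those table-driven tests exercise can be distilled to a small predicate. The sketch below is hypothetical — field names and the `shouldCountAttempt` helper are simplified stand-ins for the `database` types the PR uses — but it shows the shape of the table-driven scenarios (build numbers, transitions, initiators):

```go
package main

import "fmt"

// buildInfo is a simplified stand-in for the workspace build row; the real
// code compares against database.WorkspaceTransitionStart and
// database.PrebuildsSystemUserID.
type buildInfo struct {
	BuildNumber int32
	Transition  string // "start" or "stop"
	Prebuild    bool   // initiator is the prebuilds system user
}

// shouldCountAttempt: only a workspace's first 'start' build, not initiated
// by the prebuilds system user, counts as a creation attempt.
func shouldCountAttempt(b buildInfo) bool {
	return b.BuildNumber == 1 && b.Transition == "start" && !b.Prebuild
}

func main() {
	// Table-driven cases mirroring the scenarios the PR's tests cover.
	cases := []struct {
		name string
		in   buildInfo
		want bool
	}{
		{"first start build", buildInfo{1, "start", false}, true},
		{"second build", buildInfo{2, "start", false}, false},
		{"stop transition", buildInfo{1, "stop", false}, false},
		{"prebuild initiator", buildInfo{1, "start", true}, false},
	}
	for _, c := range cases {
		fmt.Printf("%s: got %v, want %v\n", c.name, shouldCountAttempt(c.in), c.want)
	}
}
```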
Extracted the metric increment logic into a reusable function `incrementWorkspaceCreationOutcomesMetric` to avoid duplicating the conditional logic in tests. This ensures the test verifies the actual production code path.

Changes:
- Added `incrementWorkspaceCreationOutcomesMetric()` in provisionerdserver.go
- Updated the test to call the extracted function instead of duplicating logic
- Tests now verify the actual implementation, not a recreation of it

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extracted the metric increment logic into a reusable function `incrementWorkspaceCreationAttemptsMetric` to avoid duplicating the conditional logic in tests. This ensures the test verifies the actual production code path.

Changes:
- Added `incrementWorkspaceCreationAttemptsMetric()` in workspaces.go
- Updated the test to call the extracted function instead of duplicating logic
- Tests now verify the actual implementation, not a recreation of it

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed metrics registration so they are properly exported via the /metrics endpoint. Previously, metrics were created with promauto but never registered with the custom PrometheusRegistry, so they weren't exported in multi-replica deployments.

Changes:
- WorkspaceCreationAttemptsTotal: now created with promauto.With(registry) in API initialization
- WorkspaceCreationOutcomesTotal: added to provisionerdserver.Metrics and registered with the registry
- Renamed test files to follow the *_internal_test.go naming pattern for internal package tests
- Updated tests to create test metric instances and server/API objects

Metrics are now properly registered and will be scraped from all coderd replicas.

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The workspace creation metrics were not appearing on the /metrics endpoint due to registry conflicts:

1. WorkspaceCreationAttemptsTotal was only initialized when options.DeploymentValues.Prometheus.Enable was true, but the registry is always available. Removed the unnecessary check.
2. ProvisionerdServerMetrics was never initialized. Added initialization and registration with options.PrometheusRegistry.
3. CRITICAL: the global WorkspaceCreationOutcomesTotal variable used promauto.NewCounterVec(), which registers with the DEFAULT registry, not the custom options.PrometheusRegistry. This caused duplicate registration attempts and prevented the metric from appearing. Removed the global variable entirely and made the metric instance-based only, matching the pattern used by the working workspaceCreationTimings and workspaceClaimTimings metrics.

Both metrics now register correctly with the custom PrometheusRegistry and will appear on the /metrics endpoint after workspace creation events.
Refactored provisionerdserver.NewMetrics() to follow the same pattern as notifications.NewMetrics():
- Takes a prometheus.Registerer parameter
- Uses promauto.With(reg) to automatically register metrics on creation
- Removed the separate Register() method (no longer needed)

This matches the existing pattern in the codebase and ensures metrics are registered with the correct custom PrometheusRegistry, not the global default registry.

Changes:
- Updated the NewMetrics signature to accept prometheus.Registerer
- Changed metric creation from prometheus.New* to promauto.With(reg).New*
- Removed the Register() method
- Updated coderd.go to pass options.PrometheusRegistry when creating metrics
- Updated all tests to pass prometheus.NewRegistry()

Both workspace creation metrics will now appear correctly on /metrics.
Updated cli/server.go to pass a prometheus.Registerer to provisionerdserver.NewMetrics() and removed the separate Register() call, since promauto now handles registration automatically.
Add a nil check before creating ProvisionerdServerMetrics to avoid duplicate registration when metrics are already provided via options. This fixes a panic that occurred because metrics were being created in both cli/server.go and coderd.go.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Force-pushed 4454f50 to 1aca85a (Compare)
ssncferreira left a comment (edited)
Additionally, this is one of our metrics that, IMO, is semantically incorrect / not following Prometheus metric best practices, because every replica of coderd exposes the same metric value.

Completely agree. However, the reason coderd_workspace_creation_total was implemented as a counter updated with the all-time total is that we already have other metrics, like coderd_prebuilt_workspaces_claimed_total, with the same issue: they're also queried from the database, and every replica exports the same values. This was a requirement to be able to compare regular workspaces vs prebuilt workspaces (dashboard link) to understand whether prebuilds are being adopted.

With this PR's removal of coderd_workspace_creation_total, we lose this comparison, since coderd_prebuilt_workspaces_claimed_total is all-time whereas coderd_workspace_creation_attempts_total only counts since the metric's introduction.
I think the main problem with GetRegularWorkspaceCreateMetrics is the first_success_build subquery, which attempts to find the first successful build (even if the first build failed). We could follow a simpler approach, as we do for coderd_workspace_creation_duration_seconds, where we only report if the first build is successful. This should improve query performance while maintaining consistency with the prebuild metrics (even though we wouldn't get exact values for workspace creation).
Alternatively, we could change the prebuild metrics to also be reported since server start instead of all-time, but that would require more extensive changes and would be a breaking change for those metrics 😕
```go
if workspaceBuild.BuildNumber == 1 &&
	workspaceBuild.Transition == database.WorkspaceTransitionStart &&
	workspaceBuild.InitiatorID != database.PrebuildsSystemUserID {
```
We already do similar checks for the timing metrics coderd_workspace_creation_duration_seconds and coderd_prebuilt_workspace_claim_duration_seconds in completeWorkspaceBuildJob:
```go
if s.metrics != nil && workspaceBuild.Transition == database.WorkspaceTransitionStart {
```
As well as getting the presetID from the database. So maybe we could update all metrics in the same place?
I'll have a look 👍
```go
	workspaceBuild.Transition == database.WorkspaceTransitionStart &&
	workspaceBuild.InitiatorID != database.PrebuildsSystemUserID {
	// Determine status based on job completion.
	status := "success"
```
This metric already encompasses coderd_workspace_creation_attempts_total, right? Since it reports both success and failure statuses. Could we consolidate into a single metric instead of having two separate ones?
What I had in mind, and it's not dealt with here yet, is that technically there can be an error for the workspace build job prior to the insertion of the job into the database. So attempts should potentially be renamed and also track failures there.

Then, by comparing the two metrics, we can determine whether workspace builds are failing before they get to a provisioner or within the provisioner; and if there's a large discrepancy between the sums of these two metrics, we would be able to track a growing backlog of build jobs, or potentially build jobs being lost somehow.
Move the workspace creation outcomes metric increment into the same section as the timing metrics in provisionerdserver.go. This consolidates all metric updates in one place, reuses the same preset lookup logic, and uses fresh job data from the database.

Changes:
- Add an UpdateWorkspaceCreationOutcomesMetric method to the Metrics struct
- Update provisionerdserver.go to call both the timing and outcomes metrics in the same section (lines 2399-2458)
- Remove the separate incrementWorkspaceCreationOutcomesMetric method
- Remove the workspaceCreationOutcomesTotal field from the server struct
- Simplify tests to use the new consolidated approach
- Remove the unused prometheus import from provisionerdserver.go

The consolidation addresses feedback to update all metrics in the same place where preset lookups and job status checks already occur for the timing metrics.
cstyan commented Dec 18, 2025
The Grafana dashboard could be modified to query PG directly or we can introduce an API endpoint (for wrapping with auth or something similar) that executes those queries as well, if the
It appears so, yes. The main source of cost is sequential scans on both
I think this is the change we're going to need to make, not just here for the workspace creation metrics (prebuilds and regular) but more generally going forward. Whether or not it's a breaking change, I'm not sure what guarantees we actually provide around metrics stability across versions.
Implemented via tasks.
As of Dec 15, even with only ~30k calls per week, the current GetRegularWorkspaceCreateMetrics query is either the first or second most expensive. The average execution time is ~185ms, but the plan P99 is over 2.5s and the execution P99 over 4s. Additionally, this is one of our metrics that, IMO, is semantically incorrect / not following Prometheus metric best practices, because every replica of coderd exposes the same metric value.

This changes the "regular" workspace metric to be split into two: one for workspace creation 'events' and another for workspace creation job successes/failures.
To query correctly, end users will now need to sum, by some subset or all of the labels, across their coderd replicas to get the total count of workspace creations and successes/failures. IMO this also provides nicer insight into failure cases when they happen, as I believe ATM you can only see that the initial creation code path failed by looking at logs.
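Summing across replicas could look something like the following PromQL (illustrative only; label and metric names are taken from the PR description, and the exact aggregation will depend on the deployment's labeling):

```promql
# Total creation attempts per template, summed across coderd replicas
sum by (template_name) (coderd_workspace_creation_attempts_total)

# Success rate per template
sum by (template_name) (coderd_workspace_creation_outcomes_total{status="success"})
  /
sum by (template_name) (coderd_workspace_creation_attempts_total)
```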