perf: Remove GetRegularWorkspaceCreateMetrics, calculate metric values inline #21281
Conversation
github-actions bot commented Dec 15, 2025 (edited)
I have read the CLA Document and I hereby sign the CLA. 1 out of 2 committers have signed the CLA.
Force-pushed 8eb6651 to 40720fd (Compare)

Add two new Prometheus metrics to track workspace creations without expensive database queries:

1. workspace_creation_attempts_total — tracks when regular (non-prebuilt) workspace creation attempts begin (first 'start' build initiated)
   - Labels: organization_name, template_name, preset_name
   - Increments in coderd/workspaces.go when the workspace build is created
2. workspace_creation_outcomes_total — tracks outcomes of workspace first builds
   - Labels: organization_name, template_name, preset_name, status (success/failure)
   - Increments in coderd/provisionerdserver when the provisioner job completes

Benefits:
- Real-time metrics (no polling delay)
- No expensive database queries
- Event-driven counters in the packages where the events occur
- Success rate can be calculated: outcomes{status="success"} / attempts

Tests:
- Added comprehensive tests for both metrics using prometheus testutil
- Tests verify metrics increment correctly on first builds only
- Tests verify labels track organization, template, and preset names
- Tests verify outcomes track success/failure status

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

This commit removes the GetRegularWorkspaceCreateMetrics database query and replaces it with inline, event-driven Prometheus counters that track workspace creation attempts and outcomes.

Changes:
- Removed the SQL query from coderd/database/queries/workspaces.sql
- Removed generated method implementations from dbmock, dbauthz, and dbmetrics
- Removed the old metric collection from coderd/prometheusmetrics/prometheusmetrics.go
- Fixed missing imports in coderd/database/queries.sql.go (pre-existing issue)
- Fixed method names to use GetPresetByID instead of GetTemplateVersionPresetByID

The new metrics (WorkspaceCreationAttemptsTotal and WorkspaceCreationOutcomesTotal) track workspace creations inline as they occur, eliminating the need for expensive database aggregations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Simplified the tests to verify basic metric functionality without requiring a full PostgreSQL database. The new tests:
- Verify metrics can be incremented and read correctly
- Test that different label values (org, template, preset, status) are tracked
- Run quickly without needing Docker/PostgreSQL

The original integration tests required full workspace creation flows with PostgreSQL. These simplified unit tests focus on verifying that the prometheus metric itself works correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Rewrote metric tests to use dbmock instead of requiring PostgreSQL:
- Created table-driven tests covering all conditional logic scenarios
- Mock database calls with gomock expectations
- Test build numbers, transitions, initiators, presets, success/failure
- Tests run quickly without Docker/PostgreSQL dependencies

Created new test files:
- coderd/workspaces_metric_test.go: tests WorkspaceCreationAttemptsTotal
- coderd/provisionerdserver/provisionerdserver_metric_test.go: tests WorkspaceCreationOutcomesTotal

Removed old integration tests that required PostgreSQL and cleaned up unused imports.

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
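The conditional logic those table-driven tests exercise can be distilled to a small predicate. The sketch below is hypothetical — field names and the `shouldCountAttempt` helper are simplified stand-ins for the `database` types the PR uses — but it shows the shape of the table-driven scenarios (build numbers, transitions, initiators):

```go
package main

import "fmt"

// buildInfo is a simplified stand-in for the workspace build row; the real
// code compares against database.WorkspaceTransitionStart and
// database.PrebuildsSystemUserID.
type buildInfo struct {
	BuildNumber int32
	Transition  string // "start" or "stop"
	Prebuild    bool   // initiator is the prebuilds system user
}

// shouldCountAttempt: only a workspace's first 'start' build, not initiated
// by the prebuilds system user, counts as a creation attempt.
func shouldCountAttempt(b buildInfo) bool {
	return b.BuildNumber == 1 && b.Transition == "start" && !b.Prebuild
}

func main() {
	// Table-driven cases mirroring the scenarios the PR's tests cover.
	cases := []struct {
		name string
		in   buildInfo
		want bool
	}{
		{"first start build", buildInfo{1, "start", false}, true},
		{"second build", buildInfo{2, "start", false}, false},
		{"stop transition", buildInfo{1, "stop", false}, false},
		{"prebuild initiator", buildInfo{1, "start", true}, false},
	}
	for _, c := range cases {
		fmt.Printf("%s: got %v, want %v\n", c.name, shouldCountAttempt(c.in), c.want)
	}
}
```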
Extracted the metric increment logic into a reusable function `incrementWorkspaceCreationOutcomesMetric` to avoid duplicating the conditional logic in tests. This ensures the test verifies the actual production code path.

Changes:
- Added `incrementWorkspaceCreationOutcomesMetric()` in provisionerdserver.go
- Updated the test to call the extracted function instead of duplicating logic
- Tests now verify the actual implementation, not a recreation of it

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extracted the metric increment logic into a reusable function `incrementWorkspaceCreationAttemptsMetric` to avoid duplicating the conditional logic in tests. This ensures the test verifies the actual production code path.

Changes:
- Added `incrementWorkspaceCreationAttemptsMetric()` in workspaces.go
- Updated the test to call the extracted function instead of duplicating logic
- Tests now verify the actual implementation, not a recreation of it

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed metrics registration so they are properly exported via the /metrics endpoint. Previously, metrics were created with promauto but never registered with the custom PrometheusRegistry, so they weren't exported in multi-replica deployments.

Changes:
- WorkspaceCreationAttemptsTotal: now created with promauto.With(registry) in API initialization
- WorkspaceCreationOutcomesTotal: added to provisionerdserver.Metrics and registered with the registry
- Renamed test files to follow the *_internal_test.go naming pattern for internal package tests
- Updated tests to create test metric instances and server/API objects

Metrics are now properly registered and will be scraped from all coderd replicas.

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The workspace creation metrics were not appearing on the /metrics endpoint due to registry conflicts:

1. WorkspaceCreationAttemptsTotal was only initialized when options.DeploymentValues.Prometheus.Enable was true, but the registry is always available. Removed the unnecessary check.
2. ProvisionerdServerMetrics was never initialized. Added initialization and registration with options.PrometheusRegistry.
3. CRITICAL: the global WorkspaceCreationOutcomesTotal variable used promauto.NewCounterVec(), which registers with the DEFAULT registry, not the custom options.PrometheusRegistry. This caused duplicate registration attempts and prevented the metric from appearing. Removed the global variable entirely and made the metric instance-based only, matching the pattern used by the working workspaceCreationTimings and workspaceClaimTimings metrics.

Both metrics now register correctly with the custom PrometheusRegistry and will appear on the /metrics endpoint after workspace creation events.
Refactored provisionerdserver.NewMetrics() to follow the same pattern as notifications.NewMetrics():
- Takes a prometheus.Registerer parameter
- Uses promauto.With(reg) to automatically register metrics on creation
- Removed the separate Register() method (no longer needed)

This matches the existing pattern in the codebase and ensures metrics are registered with the correct custom PrometheusRegistry, not the global default registry.

Changes:
- Updated the NewMetrics signature to accept prometheus.Registerer
- Changed metric creation from prometheus.New* to promauto.With(reg).New*
- Removed the Register() method
- Updated coderd.go to pass options.PrometheusRegistry when creating metrics
- Updated all tests to pass prometheus.NewRegistry()

Both workspace creation metrics will now appear correctly on /metrics.
Updated cli/server.go to pass a prometheus.Registerer to provisionerdserver.NewMetrics() and removed the separate Register() call, since promauto now handles registration automatically.
Add a nil check before creating ProvisionerdServerMetrics to avoid duplicate registration when metrics are already provided via options. This fixes a panic that occurred because metrics were being created in both cli/server.go and coderd.go.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Force-pushed 4454f50 to 1aca85a (Compare)
ssncferreira left a comment (edited)
Additionally, this is one of our metrics that, IMO, is semantically incorrect / not following Prometheus metric best practices, because every replica of coderd exposes the same metric value.

Completely agree. However, the reason coderd_workspace_creation_total was implemented as a counter updated with the all-time total is that we already have other metrics, like coderd_prebuilt_workspaces_claimed_total, with the same issue: they're also queried from the database, and every replica exports the same values. This was a requirement to be able to compare regular workspaces vs prebuilt workspaces (dashboard link) to understand whether prebuilds are being adopted.

With this PR's removal of coderd_workspace_creation_total, we lose this comparison, since coderd_prebuilt_workspaces_claimed_total is all-time whereas coderd_workspace_creation_attempts_total only counts since the metric's introduction.
I think the main problem with GetRegularWorkspaceCreateMetrics is the first_success_build subquery, which attempts to find the first successful build (even if the first build failed). We could follow a simpler approach, as we do for coderd_workspace_creation_duration_seconds, where we only report if the first build is successful. This should improve query performance while maintaining consistency with the prebuild metrics (even though we wouldn't get exact values for workspace creation).
Alternatively, we could change the prebuild metrics to also be reported since server start instead of all-time, but that would require more extensive changes and would be a breaking change for those metrics 😕
```go
if workspaceBuild.BuildNumber == 1 &&
	workspaceBuild.Transition == database.WorkspaceTransitionStart &&
	workspaceBuild.InitiatorID != database.PrebuildsSystemUserID {
```
We already do similar checks for the timing metrics coderd_workspace_creation_duration_seconds and coderd_prebuilt_workspace_claim_duration_seconds in completeWorkspaceBuildJob:
```go
if s.metrics != nil && workspaceBuild.Transition == database.WorkspaceTransitionStart {
```
As well as getting the presetID from the database. So maybe we could update all metrics in the same place?
I'll have a look 👍
```go
	workspaceBuild.Transition == database.WorkspaceTransitionStart &&
	workspaceBuild.InitiatorID != database.PrebuildsSystemUserID {
	// Determine status based on job completion.
	status := "success"
```
This metric already encompasses coderd_workspace_creation_attempts_total, right? Since it reports both success and failure statuses. Could we consolidate into a single metric instead of having two separate ones?
What I had in mind, and it's not dealt with here yet, is that technically there can be an error for the workspace build job prior to the insertion of the job into the database. So attempts should potentially be renamed and also track failures there.

Then, by comparing the two metrics, we can determine whether workspace builds are failing before they get to a provisioner or within the provisioner; and if there's a large discrepancy between the sums of these two metrics, we would be able to track a growing backlog of build jobs, or potentially build jobs being lost somehow.
Move the workspace creation outcomes metric increment into the same section as the timing metrics in provisionerdserver.go. This consolidates all metric updates in one place, reuses the same preset lookup logic, and uses fresh job data from the database.

Changes:
- Add an UpdateWorkspaceCreationOutcomesMetric method to the Metrics struct
- Update provisionerdserver.go to call both the timing and outcomes metrics in the same section (lines 2399-2458)
- Remove the separate incrementWorkspaceCreationOutcomesMetric method
- Remove the workspaceCreationOutcomesTotal field from the server struct
- Simplify tests to use the new consolidated approach
- Remove the unused prometheus import from provisionerdserver.go

The consolidation addresses feedback to update all metrics in the same place where preset lookups and job status checks already occur for the timing metrics.
cstyan commented Dec 18, 2025
The Grafana dashboard could be modified to query PG directly or we can introduce an API endpoint (for wrapping with auth or something similar) that executes those queries as well, if the
It appears so, yes. The main source of cost is sequential scans on both
I think this is the change we're going to need to make, not just here for the workspace creation metrics (prebuilds and regular) but more generally going forward. Whether or not it's a breaking change, I'm not sure what guarantees we actually provide around metrics stability across versions.
Implemented via tasks.
As of Dec 15, even with only ~30k calls per week, the current GetRegularWorkspaceCreateMetrics query is either the first or second most expensive. The average execution time is ~185ms, but the plan P99 is over 2.5s and the execution P99 over 4s. Additionally, this is one of our metrics that, IMO, is semantically incorrect / not following Prometheus metric best practices, because every replica of coderd exposes the same metric value.

This changes the "regular" workspace metric to be split into two: one for workspace creation 'events' and another for workspace creation job successes/failures.
To query correctly, end users will now need to sum, by some subset or all of the labels, across their coderd replicas to get the total count of workspace creations and successes/failures. IMO this also provides nicer insight into failure cases when they happen, as I believe ATM you can only see that the initial creation code path failed by looking at logs.
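Summing across replicas could look something like the following PromQL (illustrative only; label and metric names are taken from the PR description, and the exact aggregation will depend on the deployment's labeling):

```promql
# Total creation attempts per template, summed across coderd replicas
sum by (template_name) (coderd_workspace_creation_attempts_total)

# Success rate per template
sum by (template_name) (coderd_workspace_creation_outcomes_total{status="success"})
  /
sum by (template_name) (coderd_workspace_creation_attempts_total)
```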