feat(scaletest): add runner for prebuilds #20571
base: main
Conversation
ethanndickson commented Oct 30, 2025 • edited
This stack of pull requests is managed by Graphite. Learn more about stacking.
```go
func TestRun(t *testing.T) {
	t.Parallel()
	t.Skip("This test takes several minutes to run, and is intended as a manual regression test")
```
ethanndickson Oct 30, 2025 • edited
I had a brief attempt at doing this with a fake API client, and it just became so complex so quickly. Because we're uploading the template from within the runner, we can't use the echo provisioner either, which would be considerably faster. With the Terraform provisioner it takes about 2 minutes.
Force-pushed from f0b61b4 to fff0900, then from fff0900 to d40117d.

```hcl
data "coder_workspace_preset" "presets" {
  count = local.num_presets
```
Is there some reason we need a local for this, instead of just `{{.NumPresets}}`?
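For illustration, a minimal sketch of the direct-substitution approach, assuming the Terraform source is rendered through Go's `text/template` and the data struct exposes a `NumPresets` field (names here are illustrative, not taken from the PR):

```go
package main

import (
	"os"
	"text/template"
)

// A pared-down version of the idea: substitute the preset count straight
// into the Terraform source rather than routing it through an HCL local.
const templateSource = `
data "coder_workspace_preset" "presets" {
  count = {{.NumPresets}}
}
`

func main() {
	tmpl := template.Must(template.New("tf").Parse(templateSource))
	// In the real runner the count would come from the scaletest config.
	_ = tmpl.Execute(os.Stdout, struct{ NumPresets int }{NumPresets: 4})
}
```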
```go
if err != nil {
	return nil, err
}
var result strings.Builder
```
I think a normal bytes buffer is fine here, since you never actually care about getting the result as a string.
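A sketch of the buffer-based variant being suggested, assuming the surrounding function renders a template and returns bytes (the function and parameter names are illustrative):

```go
package prebuilds

import (
	"bytes"
	"text/template"
)

// renderTemplate accumulates the rendered output in a bytes.Buffer and
// returns the raw bytes; no string conversion is needed if nothing
// downstream wants the result as a string.
func renderTemplate(tmpl *template.Template, data any) ([]byte, error) {
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```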
```go
for _, ws := range workspaces.Workspaces {
	if ws.LatestBuild.Job.Status == codersdk.ProvisionerJobRunning ||
		ws.LatestBuild.Job.Status == codersdk.ProvisionerJobSucceeded {
```
Jobs could technically fail. I know the template is really trivial, but I think we have a timeout for provisioners, and under extreme load some could time out and have their jobs marked failed. The prebuilds reconciler should retry them, but I'm curious how you think they should be counted in the metrics.
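One way the failed case could be surfaced, sketched with the codersdk status constants; the helper name and the decision to count failures separately are illustrative, not part of the PR:

```go
package prebuilds

import "github.com/coder/coder/v2/codersdk"

// countPrebuildJobs tallies workspaces by their latest build's job status,
// so that jobs marked failed (e.g. provisioner timeouts under load) show
// up explicitly instead of being silently excluded from the counts.
func countPrebuildJobs(workspaces []codersdk.Workspace) (runningOrSucceeded, failed int) {
	for _, ws := range workspaces {
		switch ws.LatestBuild.Job.Status {
		case codersdk.ProvisionerJobRunning, codersdk.ProvisionerJobSucceeded:
			runningOrSucceeded++
		case codersdk.ProvisionerJobFailed:
			failed++
		}
	}
	return runningOrSucceeded, failed
}
```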
```go
Subsystem: "scaletest",
Name:      "prebuild_job_creation_latency_seconds",
Help:      "Time from when prebuilds are unpaused to when all the template build jobs have been created.",
}, []string{"template_name"}),
```
The default buckets for a histogram are:

```go
var DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
```

That is, anything that takes longer than 10 seconds isn't a good fit, because you'll only fill one bucket and be unable to infer anything about the distribution other than the fact that it's longer than 10 seconds.
The other thing to say is that we are only going to record a single observation per template, so it doesn't make sense for `template_name` to be a label for this histogram. If you actually want to look at the latency per template, you need to make this a gauge and just set it to the latency.
With some tests only creating 12 or 18 templates, I kind of question the utility of a histogram anyway.
Prebuilds only reconcile every minute by default, so I think we've made a kind of tacit choice that we don't really care about the latency of prebuilds at the level of seconds. Another option for these metrics would be to have them be gauges for the number of workspaces created, acquired, and completed (and failed, if any). We'd only see their values change at the speed of Prometheus scraping (15-30s is typical), but I think we'd all be happy if everything completes within 30s at scale, and we wouldn't worry too much about the dynamics on the per-second level.
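For reference, a sketch of the per-template gauge alternative using prometheus/client_golang; the metric is set once to the observed latency rather than observed into a histogram (registration and naming details are illustrative):

```go
package prebuilds

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// A gauge labelled by template: with a single observation per template,
// setting a gauge to the latency conveys everything a histogram would,
// without the bucket-resolution problem.
var prebuildJobCreationLatency = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Subsystem: "scaletest",
	Name:      "prebuild_job_creation_latency_seconds",
	Help:      "Time from when prebuilds are unpaused to when all the template build jobs have been created.",
}, []string{"template_name"})

// RecordJobCreation sets the gauge for one template; registration with a
// prometheus.Registerer would happen once at metrics setup.
func RecordJobCreation(latency time.Duration, templateName string) {
	prebuildJobCreationLatency.WithLabelValues(templateName).Set(latency.Seconds())
}
```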
```go
r.cfg.DeletionBarrier.Wait()
logger.Info(ctx, "all runners reached deletion barrier, proceeding with prebuild deletion")
err = r.measureDeletion(ctx, logger)
```
I really don't think this should be in Cleanup. I kinda see your reasoning, but this is part of the test run because we are measuring stuff.
The comment on the harness cleanup says:

```go
// Cleanup should be called after the test run has finished and results have
// been collected.
func (h *TestHarness) Cleanup(ctx context.Context) (err error) {
```

The standard thing to do is run the harness, collect results, and then finally clean up. If you did that with a runner like this, you wouldn't get the deletion metrics as part of the results. For the sake of clarity and consistency, put this prebuild deletion stuff in the main Run.
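A structural sketch of that ordering, using stand-in types; only the Run/Cleanup split is the point, and the method and field names are illustrative rather than the runner's real signatures:

```go
package prebuilds

import "context"

// Stand-in types for the scaletest runner; real signatures differ.
type barrier interface{ Wait() }

type Runner struct {
	deletionBarrier barrier
}

func (r *Runner) measureCreation(ctx context.Context) error { return nil }
func (r *Runner) measureDeletion(ctx context.Context) error { return nil }
func (r *Runner) deleteTemplate(ctx context.Context) error  { return nil }

// Run performs both measured phases, so the deletion metrics exist by the
// time the harness collects results.
func (r *Runner) Run(ctx context.Context) error {
	if err := r.measureCreation(ctx); err != nil {
		return err
	}
	// Wait for every runner before tearing the prebuilds down.
	r.deletionBarrier.Wait()
	return r.measureDeletion(ctx)
}

// Cleanup only does teardown that is not being measured, after results
// have been collected.
func (r *Runner) Cleanup(ctx context.Context) error {
	return r.deleteTemplate(ctx)
}
```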
```go
	}
}
if r.prebuildDeletionJobCreationLatency == 0 && (currentCount < targetNumWorkspaces || deletingCount > 0) {
```
The metric is supposed to be when *all* workspaces have their deleting job created, but this will trigger if *any* deleting jobs are created.
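A sketch of the corrected guard, assuming deletingCount tracks workspaces whose deletion build job exists and targetNumWorkspaces is the total being torn down; the helper and the deletionStart parameter are hypothetical:

```go
package prebuilds

import "time"

// maybeRecordDeletionJobCreation sets the latency only once every
// workspace has a deletion job created, not when the first deleting
// job appears.
func maybeRecordDeletionJobCreation(latency *time.Duration, deletionStart time.Time, deletingCount, targetNumWorkspaces int) {
	if *latency == 0 && deletingCount == targetNumWorkspaces {
		*latency = time.Since(deletionStart)
	}
}
```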
```go
r.cfg.Metrics.RecordDeletionJobCreation(r.prebuildDeletionJobCreationLatency, r.template.Name)
}
if r.prebuildDeletionJobAcquiredLatency == 0 && deletingCount > 0 {
```
Again, this triggers on *any*, when it's supposed to be *all*.
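And the analogous fix for the acquired metric, again only firing once every deletion job has been picked up; acquiredCount is a hypothetical counter of deletion jobs a provisioner has started on:

```go
package prebuilds

import "time"

// maybeRecordDeletionJobAcquired sets the latency only once all deletion
// jobs have been acquired by a provisioner, mirroring the creation metric.
func maybeRecordDeletionJobAcquired(latency *time.Duration, deletionStart time.Time, acquiredCount, targetNumWorkspaces int) {
	if *latency == 0 && acquiredCount == targetNumWorkspaces {
		*latency = time.Since(deletionStart)
	}
}
```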

Relates to coder/internal#914
Adds a runner for scaletesting prebuilds. The runner uploads a no-op template with prebuilds, watches for the corresponding workspaces to be created, and then does the same to tear them down. I didn't originally plan on having metrics for the teardown, but I figured we might as well, as it's still the same prebuilds reconciliation loop mechanism being tested.