feat(scaletest): add runner for prebuilds #20571


Open

ethanndickson wants to merge 1 commit into main from ethan/prebuilds-runner

Conversation

@ethanndickson (Member) commented Oct 30, 2025 (edited)

Relates to coder/internal#914

Adds a runner for scaletesting prebuilds. The runner uploads a no-op template with prebuilds, watches for the corresponding workspaces to be created, and then does the same to tear them down. I didn't originally plan on having metrics for the teardown, but I figured we might as well, since it's the same prebuilds reconciliation loop mechanism being tested.


@ethanndickson marked this pull request as ready for review October 30, 2025 01:04
func TestRun(t *testing.T) {
	t.Parallel()

	t.Skip("This test takes several minutes to run, and is intended as a manual regression test")
@ethanndickson (Member, Author) commented Oct 30, 2025 (edited)

I had a brief attempt at doing this with a fake API client, and it just becomes so complex so quickly. Because we're uploading the template from within the runner, we can't use the echo provisioner either, which would be considerably faster. With the Terraform provisioner it takes about 2 minutes.



data "coder_workspace_preset" "presets" {
  count = local.num_presets
@spikecurtis (Contributor) commented:

Is there some reason we need a local for this, instead of just {{.NumPresets}}?

if err != nil {
	return nil, err
}
var result strings.Builder
@spikecurtis (Contributor) commented:

I think a normal bytes buffer is fine here, since you never actually care about getting the result as a string.


for _, ws := range workspaces.Workspaces {
	if ws.LatestBuild.Job.Status == codersdk.ProvisionerJobRunning ||
		ws.LatestBuild.Job.Status == codersdk.ProvisionerJobSucceeded {
@spikecurtis (Contributor) commented:

Jobs could technically fail. I know the template is really trivial, but I think we have a timeout for provisioners, and under extreme load some could time out and have their jobs marked failed. The prebuilds reconciler should retry them, but I'm curious how you think they should be counted in the metrics.

Subsystem: "scaletest",
Name: "prebuild_job_creation_latency_seconds",
Help: "Time from when prebuilds are unpaused to when all the template build jobs have been created.",
}, []string{"template_name"}),
@spikecurtis (Contributor) commented:

The default buckets for a histogram are:

var DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

That is, anything that takes longer than 10 seconds isn't a good fit, because you'll only fill one bucket and be unable to infer anything about the distribution other than the fact that it's longer than 10 seconds.

The other thing to say is that we are only going to record a single observation per template, so it doesn't make sense for template_name to be a label for this histogram. If you actually want to look at the latency per template, you need to make this a gauge and just set it to the latency.

With some tests only creating 12 or 18 templates, I kind of question the utility of a histogram anyway.

Prebuilds only reconcile every minute by default, so I think we've made a kind of tacit choice that we don't really care about the latency of prebuilds at the level of seconds. Another option for these metrics would be to have them be gauges for the number of workspaces created, acquired, and completed (and failed, if any). We'd only see their values change at the speed of Prometheus scraping (15-30s is typical), but I think we'd all be happy if everything completes within 30s at scale, and we wouldn't worry too much about the dynamics on the per-second level.

r.cfg.DeletionBarrier.Wait()
logger.Info(ctx, "all runners reached deletion barrier, proceeding with prebuild deletion")

err = r.measureDeletion(ctx, logger)
@spikecurtis (Contributor) commented:

I really don't think this should be in Cleanup. I kinda see your reasoning, but this is part of the test run, because we are measuring stuff.

The comment on the harness cleanup says:

// Cleanup should be called after the test run has finished and results have
// been collected.
func (h *TestHarness) Cleanup(ctx context.Context) (err error) {

The standard thing to do is run the harness, collect results, and then finally clean up. If you did that with the runner like this, you wouldn't get the deletion metrics as part of the results. For the sake of clarity and consistency, put this prebuild deletion stuff in the main Run.

}
}

if r.prebuildDeletionJobCreationLatency == 0 && (currentCount < targetNumWorkspaces || deletingCount > 0) {
@spikecurtis (Contributor) commented:

The metric is supposed to be when all workspaces have their deleting job created, but this will trigger if any deleting jobs are created.

r.cfg.Metrics.RecordDeletionJobCreation(r.prebuildDeletionJobCreationLatency, r.template.Name)
}

if r.prebuildDeletionJobAcquiredLatency == 0 && deletingCount > 0 {
@spikecurtis (Contributor) commented:

Again, this triggers on any, when it's supposed to be all.


Reviewers

@spikecurtis left review comments

@kacpersaw awaiting requested review

At least 1 approving review is required to merge this pull request.

Assignees

@ethanndickson


3 participants

@ethanndickson @spikecurtis
