The old implementation of GetWorkspacesEligibleForTransition returns many workspaces that are not actually eligible for transition. This new implementation reduces this number significantly (at least on our dogfood instance).

    $ psql -f old.sql
     count
    -------
       100
    (1 row)

    $ psql -f new.sql
     count
    -------
         5
    (1 row)

For each workspace returned from GetWorkspacesEligibleForTransition, we make (at minimum) 7 calls to the database. That means for our dogfood instance, we currently make roughly 700 DB calls a minute, most of them for workspaces that are not eligible for transition. The new implementation cuts this down to <50 DB calls a minute (a reduction of over 90%).
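As a rough check of the arithmetic above, here is a minimal sketch. The constant and function names are illustrative only (they do not exist in the codebase), and 7 is the stated minimum number of calls per workspace, so real figures may be higher.

```go
package main

import "fmt"

// minCallsPerWorkspace is the lower bound quoted in the description.
// It is an illustrative constant, not a value read from the codebase.
const minCallsPerWorkspace = 7

// dbCallsPerTick returns the minimum number of DB calls made for the
// workspaces returned by one run of the query.
func dbCallsPerTick(workspaces int) int {
	return workspaces * minCallsPerWorkspace
}

// reductionPercent returns how much smaller after is than before, in percent.
func reductionPercent(before, after int) float64 {
	return 100 * float64(before-after) / float64(before)
}

func main() {
	before := dbCallsPerTick(100) // old query returned 100 workspaces
	after := dbCallsPerTick(5)    // new query returned 5 workspaces
	fmt.Printf("before=%d after=%d reduction=%.0f%%\n",
		before, after, reductionPercent(before, after))
}
```

With the counts from the psql output, this gives 700 calls before and 35 after, which is consistent with the "<50 DB calls a minute" figure.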

Looking at the planning/execution time of the new query, it appears to be either within run-to-run variance or a little quicker.

Before:
https://explain.dalibo.com/plan/95f6e257d8fg51gd

After:
https://explain.dalibo.com/plan/h2cb9hc533cac0c7

DanielleMaywood added 3 commits

November 7, 2024 13:07

fix: rewrite GetWorkspacesEligibleForTransition query

fb857e0

feat: add caching to reduce db calls

de6e6af

revert: cache

e81a985

github-actions bot assigned DanielleMaywood

Nov 7, 2024

DanielleMaywood added 3 commits

November 7, 2024 15:15

revert: mistake

190617c

chore: comment query

18e9550

fix: update dbmem.go to match new query

5aff6e3

DanielleMaywood marked this pull request as ready for review

November 8, 2024 10:34

DanielleMaywood requested review from dannykopping, johnstcn and mafredri

November 8, 2024 10:35

mafredri reviewed

Nov 8, 2024

Member

mafredri left a comment (edited)

I'll take another look through as I haven't looked at autostop behavior in a long time and need to wrap my head around it a bit more. I do have concerns that there are cases where we won't pick up jobs we should, or that there are ways to abuse the system.

Let's say I'm a nefarious user and start a bitcoin miner in a tmux of my workspace. Then I hit stop + cancel immediately afterwards. Now my workspace is in a cancelled state, the previous job was stop and I don't think it'll be picked up by any condition here and thus my bitcoin miner remains running in perpetuity. There may be other similar cases and I'm not sure what should happen in these, but I could also be wrong.

PS. Very nice job on reducing the DB load through better filters 👍🏻

coderd/database/queries/workspaces.sql (outdated)

		-- If the workspace's template has an inactivity_ttl set
		-- it may be eligible for dormancy.
		-- A workspace may be eligible for dormant stop if the following are true:
		-- * The workspace is not dormant.

Member

mafredri, Nov 8, 2024

Who/when marks a workspace as dormant? Is there a chance the workspace can be marked dormant and we never trigger either autostop or dormant stop?

Member

johnstcn, Nov 8, 2024

Autobuild executor does it here:

coder/coderd/autobuild/lifecycle_executor.go

Lines 497 to 506 in 7b33ab0

	// isEligibleForDormantStop returns true if the workspace should be dormant
	// for breaching the inactivity threshold of the template.
	func isEligibleForDormantStop(ws database.Workspace, templateSchedule schedule.TemplateScheduleOptions, currentTick time.Time) bool {
		// Only attempt against workspaces not already dormant.
		return !ws.DormantAt.Valid &&
			// The template must specify an time_til_dormant value.
			templateSchedule.TimeTilDormant > 0 &&
			// The workspace must breach the time_til_dormant value.
			currentTick.Sub(ws.LastUsedAt) > templateSchedule.TimeTilDormant
	}

Member

mafredri, Nov 8, 2024

Ok, that plus this seems relevant:

coder/coderd/autobuild/lifecycle_executor.go

Lines 421 to 428 in 7b33ab0

	case isEligibleForDormantStop(ws, templateSchedule, currentTick):
		// Only stop started workspaces.
		if latestBuild.Transition == database.WorkspaceTransitionStart {
			return database.WorkspaceTransitionStop, database.BuildReasonDormancy, nil
		}
		// We shouldn't transition the workspace but we should still
		// make it dormant.
		return "", database.BuildReasonDormancy, nil

As does this:

coder/coderd/autobuild/lifecycle_executor.go

Lines 254 to 264 in 7b33ab0

	// Transition the workspace to dormant if it has breached the template's
	// threshold for inactivity.
	if reason == database.BuildReasonDormancy {
		wsOld := ws
		wsNew, err := tx.UpdateWorkspaceDormantDeletingAt(e.ctx, database.UpdateWorkspaceDormantDeletingAtParams{
			ID: ws.ID,
			DormantAt: sql.NullTime{
				Time:  dbtime.Now(),
				Valid: true,
			},
		})

So one thing that can happen at least in the code-version is that a workspace has remained in a failed stop state for a long time. Then it becomes eligible for dormant stop but since it wasn't a start, we don't try to stop it but mark it as dormant anyway.

Then we'll have a workspace marked dormant that we never attempted to stop.
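The scenario can be illustrated with a simplified, self-contained model of the switch case quoted above. The types and the `dormancyDecision` function are stand-ins written for this sketch, not the real `database` package types or executor code.

```go
package main

import "fmt"

// Transition is a stand-in for database.WorkspaceTransition.
type Transition string

const (
	TransitionStart Transition = "start"
	TransitionStop  Transition = "stop"
)

// dormancyDecision models the isEligibleForDormantStop case for a
// workspace that has already breached time_til_dormant. It reports
// whether a stop build is queued and whether the workspace is marked
// dormant, based on the transition of its latest build.
func dormancyDecision(latest Transition) (queueStop, markDormant bool) {
	if latest == TransitionStart {
		return true, true
	}
	// Not a start: no stop is attempted, but the dormancy reason is
	// still returned, so the workspace is marked dormant anyway.
	return false, true
}

func main() {
	// A workspace whose latest build was a (failed) stop.
	queueStop, markDormant := dormancyDecision(TransitionStop)
	fmt.Println("queue stop:", queueStop, "mark dormant:", markDormant)
}
```

In this model a workspace whose latest build is a failed stop is marked dormant without a stop ever being queued, which is the gap described above.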

coderd/database/queries/workspaces.sql (outdated)

		workspaces.dormant_at IS NULL AND
		templates.time_til_dormant > 0 AND
		workspaces.dormant_at IS NULL
		(@now :: timestamptz) - workspaces.last_used_at > (INTERVAL '1 millisecond' * (templates.time_til_dormant / 1000000))

Member

mafredri, Nov 8, 2024

I realize this is probably already in use elsewhere but not going to lie, it's weird seeing nanos being converted into millis. I know ms is the smallest unit in pg but feels like this isn't super intuitive for humans to understand what's going on here.

Obviously storing ns in the db is the main problem (and we're not fixing that here). But I'd prefer to see the conversion into seconds instead of ms as that feels like a more intuitive unit.

Even the db column for time_til_dormant doesn't have a comment so this is pretty much hidden magic 😄.

Not a blocker, but we should probably do something about this at some point.

Member

johnstcn, Nov 8, 2024

Every day I regret my previous decision 😓

coderd/database/queries/workspaces.sql (outdated)

		-- If the user account is suspended, and the workspace is running.
		-- A workspace may be eligible for failed stop if the following are true:
		-- * The template has a failure ttl set.
		-- * The workspace build was a start transition.

Member

mafredri, Nov 8, 2024

If we failed to autostop previously, then this limitation will prevent this case from triggering and I think we'll never retry it because of the other cases. Is that as intended?

Member

johnstcn, Nov 8, 2024 (edited)

This should be the same logic that's done later on? We're not changing any logic here, just front-loading it.

coder/coderd/autobuild/lifecycle_executor.go

Lines 525 to 536 in 7b33ab0

	// isEligibleForFailedStop returns true if the workspace is eligible to be stopped
	// due to a failed build.
	func isEligibleForFailedStop(build database.WorkspaceBuild, job database.ProvisionerJob, templateSchedule schedule.TemplateScheduleOptions, currentTick time.Time) bool {
		// If the template has specified a failure TLL.
		return templateSchedule.FailureTTL > 0 &&
			// And the job resulted in failure.
			job.JobStatus == database.ProvisionerJobStatusFailed &&
			build.Transition == database.WorkspaceTransitionStart &&
			// And sufficient time has elapsed since the job has completed.
			job.CompletedAt.Valid &&
			currentTick.Sub(job.CompletedAt.Time) > templateSchedule.FailureTTL
	}

Member

mafredri, Nov 8, 2024

I didn't look at the original logic, but then I suppose I'm proposing that the original logic doesn't take this into account?

Here, unless the failed build is "start", we'll never try to stop it. The failed build could just as well be a cancelled (or failed) "stop" which has left resources behind.

Member

johnstcn, Nov 11, 2024 (edited)

That's a good catch. We might want to adjust this behaviour in a separate PR.
EDIT: filed #15477

Member

mafredri commented Nov 8, 2024

Thanks for also linking the explains. I think adding an index on the deleted column for workspaces could help too.

There only exists one tangentially related index:

    "workspaces_owner_id_lower_idx" UNIQUE, btree (owner_id, lower(name::text)) WHERE deleted = false

But it's non-trivial to utilize it, so an index targeting that single column is probably the best.

fix: typo

b953902

johnstcn mentioned this pull request

Nov 11, 2024

bug: autobuild: current FailureTTL logic does not allow stopping workspaces with a failed build #15477

Open

Contributor, Author

DanielleMaywood commented Nov 12, 2024

@johnstcn and I spotted, whilst doing some testing yesterday, that we're still returning false-positives in the following scenario:

The user is active
The job did not fail
The workspace is stopped
The workspace has an autostart schedule.

    (users.status = 'active'::user_status
     AND provisioner_jobs.job_status != 'failed'::provisioner_job_status
     AND workspace_builds.transition = 'stop'::workspace_transition
     AND workspaces.autostart_schedule IS NOT NULL)

What is happening here is that workspaces.autostart_schedule is a cron expression, and there is currently no simple way (that I'm aware of) to perform this check in the database, so we delegate this check to the coder server.

johnstcn approved these changes

Nov 13, 2024

Member

johnstcn left a comment

@mafredri Are you happy to merge the changes to this query as-is for now and investigate later improvements in a follow-up PR?

mafredri approved these changes

Nov 13, 2024

Member

mafredri left a comment (edited)

@johnstcn I worry about either going stale now that we have logic duplicated in both the code and the query. Is there a reason to keep both?

Just thinking out loud but right now the conditions are lost since they're part of the WHERE. But if we moved them to SELECT instead they could be utilized as booleans in the code.

But aside from the duplication, and considering we're tracking additional state handling in #15477, I have no further objections. Feel free to merge.
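One way to picture the WHERE-to-SELECT idea is sketched below. It assumes the query were changed to return one boolean column per condition; the `eligibleRow` struct, its field names, and `transitionReason` are hypothetical and do not reflect the generated query types.

```go
package main

import "fmt"

// eligibleRow is a hypothetical row shape in which each condition that
// currently lives in the WHERE clause is selected as a boolean column.
type eligibleRow struct {
	WorkspaceID         string
	EligibleAutostop    bool
	EligibleAutostart   bool
	EligibleDormantStop bool
	EligibleFailedStop  bool
}

// transitionReason picks the first matching condition. The Go side only
// orders and acts on the flags instead of re-deriving each condition.
func transitionReason(r eligibleRow) string {
	switch {
	case r.EligibleAutostop:
		return "autostop"
	case r.EligibleAutostart:
		return "autostart"
	case r.EligibleDormantStop:
		return "dormancy"
	case r.EligibleFailedStop:
		return "failed stop"
	default:
		return ""
	}
}

func main() {
	row := eligibleRow{WorkspaceID: "ws-1", EligibleDormantStop: true}
	fmt.Println(row.WorkspaceID, transitionReason(row))
}
```

The query would still need to filter out rows where every flag is false to keep the reduced row count, and checks that cannot run in the database would remain in Go.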

DanielleMaywood merged commit f2fe379 into main

Nov 13, 2024

26 checks passed

DanielleMaywood deleted the dm-autobuild-experiment branch

November 13, 2024 10:24

DanielleMaywood mentioned this pull request

Nov 20, 2024

fix: make GetWorkspacesEligibleForTransition return even less false positives #15594

Merged

DanielleMaywood added a commit that referenced this pull request

Dec 2, 2024

fix: make GetWorkspacesEligibleForTransition return even less false positives (#15594)

e21a301

Relates to #15082

Further to #15429, this reduces the amount of false-positives returned by the 'is eligible for autostart' part of the query. We achieve this by calculating the 'next start at' time of the workspace, storing it in the database, and using it in our `GetWorkspacesEligibleForTransition` query.

The prior implementation of the 'is eligible for autostart' query would return _all_ workspaces that at some point in the future _might_ be eligible for autostart. This now ensures we only return workspaces that _should_ be eligible for autostart.

We also now pass `currentTick` instead of `t` to the `GetWorkspacesEligibleForTransition` query as otherwise we'll have one round of workspaces that are skipped by `isEligibleForTransition` due to `currentTick` being a truncated version of `t`.

fix: make GetWorkspacesEligibleForTransition return less false-positives #15429