chore: acquire lock for individual workspace transition #15859
Conversation
```go
go func() {
	tickChB <- next
	close(tickChB)
}()
```
Is this potentially racy? We're testing that the lock acquire works, but theoretically that might not happen if the first coderd grabs the job, completes it, and then the second one does.
I doubt it matters, as I suppose we're happy even if the try-acquire is hit only a fraction of the time, but thought I'd flag it anyway.
Looking again, you're probably right. I ran the test with verbose logging and it looks like this all occurs within 0.05s.
If the test doesn't hit the lock, then we are likely to hit a flake. I'll have a go at increasing this time buffer.
I think you might be able to reduce (but not eliminate) raciness by having a second `chan struct{}` that you then close after starting both goroutines, making them both wait until it's closed to start, e.g.

```go
startCh := make(chan struct{})
go func() {
	<-startCh
	tickChA <- next
	close(tickChA)
}()
go func() {
	<-startCh
	tickChB <- next
	close(tickChB)
}()
close(startCh)
```

You might also be able to get both of them to tick very closely in time by sharing the same tick channel and making it buffered with size 2. (Of course, then you'd want to avoid closing the channel twice to avoid a panic.)
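(For illustration, a sketch of that buffered shared-channel alternative, assuming both lifecycle executors could be handed the same tick channel, with `next` as in the test above:)

```go
// Sketch of the shared-channel alternative: one buffered tick channel
// fed to both lifecycle executors instead of tickChA and tickChB.
tickCh := make(chan time.Time, 2)

// Buffer both ticks up front so executors A and B can each receive one
// at nearly the same time, without either send blocking.
tickCh <- next
tickCh <- next

// Close exactly once, after both sends; a per-executor close here
// would panic on the second call.
close(tickCh)
```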
I've gone with your proposal @johnstcn.
It looks like for testing we just use an echo provisioner job, so getting that to take artificially longer for this specific test may not be a trivial task.
Merged commit 50ff06c into main.
When Coder is run in High Availability mode, each Coder instance has a lifecycle executor. These lifecycle executors are all trying to do the same work, and whilst transactions save us from this causing an issue, we are still doing extra work that could be prevented.
This PR adds a `TryAcquireLock` call for each attempted workspace transition, meaning two Coder instances shouldn't duplicate effort.

This approach does still allow some duplicated effort to occur, though, because we aren't locking the entire `runOnce` function. The following scenario could still occur:

1. `X` calls `GetWorkspacesEligibleForTransition`, returning workspace `W`
2. `X` acquires the lock to transition workspace `W`
3. `X` starts transitioning workspace `W`
4. `Y` calls `GetWorkspacesEligibleForTransition`, returning workspace `W`
5. `X` finishes transitioning workspace `W`
6. `X` releases the lock to transition workspace `W`
7. `Y` acquires the lock to transition workspace `W`
8. `Y` starts transitioning workspace `W`
9. `Y` fails to transition workspace `W`
10. `Y` releases the lock to transition workspace `W`
I decided against locking `runOnce` for now, as we run each workspace transition in its own transaction. Using nested transactions here would require extra design work and consideration.
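For illustration, here is a minimal sketch of the per-transition try-lock pattern described above, using a Postgres transaction-scoped advisory lock. This is an assumption-laden sketch rather than the actual Coder implementation; `tryTransition` and `lockKey` are hypothetical names.

```go
package lifecycle

import (
	"context"
	"database/sql"
)

// tryTransition is a hypothetical helper (names are illustrative, not
// the actual Coder API) showing the per-workspace try-lock pattern:
// take a transaction-scoped advisory lock, and skip the workspace if
// another replica already holds it.
func tryTransition(ctx context.Context, db *sql.DB, lockKey int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // effectively a no-op once Commit succeeds

	var acquired bool
	// pg_try_advisory_xact_lock never blocks: it returns false if
	// another session holds the lock, and the lock is released
	// automatically when the transaction ends.
	if err := tx.QueryRowContext(ctx,
		"SELECT pg_try_advisory_xact_lock($1)", lockKey,
	).Scan(&acquired); err != nil {
		return err
	}
	if !acquired {
		// Another Coder instance is already transitioning this workspace.
		return nil
	}

	// ... perform the workspace transition here, inside the same transaction ...

	return tx.Commit()
}
```

Because the lock is transaction-scoped, it is released at commit or rollback, which matches the scenario above: `Y` can still acquire the lock after `X` has finished, but the two replicas never transition the same workspace concurrently.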