Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

feat: implement reconciliation loop#17261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
evgeniy-scherbina merged 158 commits intomainfromyevhenii/510-reconciliation-loop-v2
Apr 17, 2025

Conversation

evgeniy-scherbina
Copy link
Contributor

@evgeniy-scherbinaevgeniy-scherbina commentedApr 4, 2025
edited
Loading

Closescoder/internal#510

Refactoring Summary

1)CalculateActions Function

Issues Before Refactoring:

  • Large function (~150 lines), making it difficult to read and maintain.
  • The control flow is hard to follow due to complex conditional logic.
  • TheReconciliationActions struct was partially initialized early, then mutated in multiple places, making the flow error-prone.

Original source:

func (pPresetState)CalculateActions(clock quartz.Clock,backoffInterval time.Duration) (*ReconciliationActions,error) {
// TODO: align workspace states with how we represent them on the FE and the CLI
// right now there's some slight differences which can lead to additional prebuilds being created
// TODO: add mechanism to prevent prebuilds being reconciled from being claimable by users; i.e. if a prebuild is
// about to be deleted, it should not be deleted if it has been claimed - beware of TOCTOU races!
var (
actualint32// Running prebuilds for active version.
desiredint32// Active template version's desired instances as defined in preset.
eligibleint32// Prebuilds which can be claimed.
outdatedint32// Prebuilds which no longer match the active template version.
extraneousint32// Extra running prebuilds for active version (somehow).
starting,stopping,deletingint32// Prebuilds currently being provisioned up or down.
)
ifp.Preset.UsingActiveVersion {
actual=int32(len(p.Running))
desired=p.Preset.DesiredInstances.Int32
}
for_,prebuild:=rangep.Running {
ifp.Preset.UsingActiveVersion {
ifprebuild.Ready {
eligible++
}
extraneous=int32(math.Max(float64(actual-p.Preset.DesiredInstances.Int32),0))
}
ifprebuild.TemplateVersionID==p.Preset.TemplateVersionID&&!p.Preset.UsingActiveVersion {
outdated++
}
}
// In-progress builds are common across all presets belonging to a given template.
// In other words: these values will be identical across all presets belonging to this template.
for_,progress:=rangep.InProgress {
num:=progress.Count
switchprogress.Transition {
casedatabase.WorkspaceTransitionStart:
starting+=num
casedatabase.WorkspaceTransitionStop:
stopping+=num
casedatabase.WorkspaceTransitionDelete:
deleting+=num
}
}
var (
toCreate=int(math.Max(0,float64(
desired-(actual+starting)),// The number of prebuilds currently being stopped (should be 0)
))
toDelete=int(math.Max(0,float64(
outdated-// The number of prebuilds running above the desired count for active version
deleting),// The number of prebuilds currently being deleted
))
actions=&ReconciliationActions{
Actual:actual,
Desired:desired,
Eligible:eligible,
Outdated:outdated,
Extraneous:extraneous,
Starting:starting,
Stopping:stopping,
Deleting:deleting,
}
)
// If the template has become deleted or deprecated since the last reconciliation, we need to ensure we
// scale those prebuilds down to zero.
ifp.Preset.Deleted||p.Preset.Deprecated {
toCreate=0
toDelete=int(actual+outdated)
actions.Desired=0
}
// We backoff when the last build failed, to give the operator some time to investigate the issue and to not provision
// a tonne of prebuilds (_n_ on each reconciliation iteration).
ifp.Backoff!=nil&&p.Backoff.NumFailed>0 {
actions.Failed=p.Backoff.NumFailed
backoffUntil:=p.Backoff.LastBuildAt.Add(time.Duration(p.Backoff.NumFailed)*backoffInterval)
ifclock.Now().Before(backoffUntil) {
actions.Create=0
actions.DeleteIDs=nil
actions.BackoffUntil=backoffUntil
// Return early here; we should not perform any reconciliation actions if we're in a backoff period.
returnactions,nil
}
}
// It's possible that an operator could stop/start prebuilds which interfere with the reconciliation loop, so
// we check if there are somehow more prebuilds than we expect, and then pick random victims to be deleted.
ifextraneous>0 {
// Sort running IDs by creation time so we always delete the oldest prebuilds.
// In general, we want fresher prebuilds (imagine a mono-repo is cloned; newer is better).
slices.SortFunc(p.Running,func(a,b database.GetRunningPrebuiltWorkspacesRow)int {
ifa.CreatedAt.Before(b.CreatedAt) {
return-1
}
ifa.CreatedAt.After(b.CreatedAt) {
return1
}
return0
})
fori:=0;i<int(extraneous);i++ {
ifi>=len(p.Running) {
// This should never happen.
// TODO: move up
// c.logger.Warn(ctx, "unexpected reconciliation state; extraneous count exceeds running prebuilds count!",
//slog.F("running_count", len(p.Running)),
//slog.F("extraneous", extraneous))
continue
}
actions.DeleteIDs=append(actions.DeleteIDs,p.Running[i].ID)
}
// TODO: move up
// c.logger.Warn(ctx, "found extra prebuilds running, picking random victim(s)",
//slog.F("template_id", p.Preset.TemplateID.String()), slog.F("desired", desired), slog.F("actual", actual), slog.F("extra", extraneous),
//slog.F("victims", victims))
// Prevent the rest of the reconciliation from completing
returnactions,nil
}
actions.Create=int32(toCreate)
// if toDelete > 0 && len(p.Running) != toDelete {
// TODO: move up
// c.logger.Warn(ctx, "mismatch between running prebuilds and expected deletion count!",
//slog.F("template_id", s.preset.TemplateID.String()), slog.F("running", len(p.Running)), slog.F("to_delete", toDelete))
// }
// TODO: implement lookup to not perform same action on workspace multiple times in $period
// i.e. a workspace cannot be deleted for some reason, which continually makes it eligible for deletion
fori:=0;i<toDelete;i++ {
ifi>=len(p.Running) {
// TODO: move up
// Above warning will have already addressed this.
continue
}
actions.DeleteIDs=append(actions.DeleteIDs,p.Running[i].ID)
}
returnactions,nil
}

Improvements After Refactoring:

  • Simplified and broken down into smaller, focused helper methods.
  • The flow of the function is now more linear and easier to understand.
  • Struct initialization is cleaner, avoiding partial and incremental mutations.

Refactored function:

func (pPresetSnapshot)CalculateActions(clock quartz.Clock,backoffInterval time.Duration) (*ReconciliationActions,error) {
// TODO: align workspace states with how we represent them on the FE and the CLI
// right now there's some slight differences which can lead to additional prebuilds being created
// TODO: add mechanism to prevent prebuilds being reconciled from being claimable by users; i.e. if a prebuild is
// about to be deleted, it should not be deleted if it has been claimed - beware of TOCTOU races!
actions,needsBackoff:=p.needsBackoffPeriod(clock,backoffInterval)
ifneedsBackoff {
returnactions,nil
}
if!p.isActive() {
returnp.handleInactiveTemplateVersion()
}
returnp.handleActiveTemplateVersion()
}


2)ReconciliationActions Struct

Issues Before Refactoring:

  • The struct mixed both actionable decisions and diagnostic state, which blurred its purpose.
  • It was unclear which fields were necessary for reconciliation logic, and which were purely for logging/observability.

Improvements After Refactoring:

  • Split into two clear, purpose-specific structs:
    • ReconciliationActions — defines the intended reconciliation action.
    • ReconciliationState — captures runtime state and metadata, primarily for logging and diagnostics.

Original struct:

// ReconciliationActions represents the set of actions which must be taken to achieve the desired state for prebuilds.
typeReconciliationActionsstruct {
Actualint32// Running prebuilds for active version.
Desiredint32// Active template version's desired instances as defined in preset.
Eligibleint32// Prebuilds which can be claimed.
Outdatedint32// Prebuilds which no longer match the active template version.
Extraneousint32// Extra running prebuilds for active version (somehow).
Starting,Stopping,Deletingint32// Prebuilds currently being provisioned up or down.
Failedint32// Number of prebuilds which have failed in the past CODER_WORKSPACE_PREBUILDS_RECONCILIATION_BACKOFF_LOOKBACK_PERIOD.
Createint32// The number of prebuilds required to be created to reconcile required state.
DeleteIDs []uuid.UUID// IDs of running prebuilds required to be deleted to reconcile required state.
BackoffUntil time.Time// The time to wait until before trying to provision a new prebuild.
}

SasSwartand others added30 commitsMarch 17, 2025 12:11
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
appeasing linterSigned-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
}

if !preset.UsingActiveVersion && len(ps.Running) == 0 && len(ps.InProgress) == 0 {
logger.Debug(ctx, "skipping reconciliation for preset; inactive, no running prebuilds, and no in-progress operations",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This short-circuit is not necessary: ReconcilePreset should accurately calculate the actions.

I don't think it's a good idea to have this short circuit because you're splitting the business logic of reconciliation into two different places that do the same thing today. If we want to modify the logic, it will be easy to make the change only in one place and have inconsistent or undesired behavior.

Copy link
ContributorAuthor

@evgeniy-scherbinaevgeniy-scherbinaApr 16, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Okay, I'll remove it
cc@dannykopping

Btw: there are a bit similar check here:https://github.com/coder/coder/blob/yevhenii/510-reconciliation-loop-v2/enterprise/coderd/prebuilds/reconcile.go#L320-L329

Should we keep it or remove as well?


eg.Go(func() error {
// Pass outer context.
err = c.ReconcilePreset(ctx, *ps)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

seems strange to me that we take the snapshot under the ReconciliationLock transaction, but then don't use that transaction for the individual reconciliation of presets.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Reconciliation of presets are independent. If reconciliation for one preset will fail - we don't want to rollback changes for another presets.


Few more thoughts:

We needReconciliationLock only to prevent other coderd replicas to executereconciliation iterations in parallel because we runReconciliation Loop on all replicas.

So I guess we don't really need transaction - we just need distributed lock. So we can usepg_advisory_lock instead ofpg_advisory_xact_lock?
But I see that we don't usepg_advisory_lock anywhere.


fyi:SnapshotState also creates transaction because it maybe called outsideReconciliationLock. But this tx will be ignored within reconciliation loop - because we don't support nested txs.

Copy link
Member

@mtojekmtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks@evgeniy-scherbina for addressing my comments. LGTM 👍

  1. I saw somebody raised an idea of internal tests, so that could address the problem. I’m cool with that.
  2. A separate reconciliation loop for presets is understandable, but there is a risk of increasing code complexity in favor of a very little gain. Time difference would be a matter of seconds, right? Non-blocking this PR, but wanted to confirm if I’m not missing anything.
  3. Thanks for the explanation! I admit I took into consideration slightly different assumptions, more focused on ensuring the final state. Since the current logic reduces build failures, I will not challenge it further.

// reconciliation loop will bring these resources back into their desired numbers in an EVENTUALLY-consistent way.
//
// For example: we could decide to provision 1 new instance in this reconciliation.
// While that workspace is being provisioned, another template version is created which means this same preset will
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think we should clarify what "this same preset" means a little more precisely, since the preset ID will strictly be different for new template versions.

fields := []any{
slog.F("action_type", actions.ActionType),
slog.F("create_count", actions.Create),
slog.F("delete_count", len(actions.DeleteIDs)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think it's a pity that we don't log the current state of the preset here anymore. This will be helpful for us when debugging reconciliation problems. What is the reason why you removed these fields?

https://github.com/coder/coder/blob/dk/prebuilds/enterprise/coderd/prebuilds/reconcile.go#L307-L314

Copy link
ContributorAuthor

@evgeniy-scherbinaevgeniy-scherbinaApr 16, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Okay, I'll put it back. Initially I removed it because I splitted Actions & State structures. Basically I only need 3 fields to make a reconciliation decision:

  • backoff
  • create
  • deleteIDs

I don't need 12 fields includingoutdated,extraneous,failed, etc... Because it maybe a bit confusing for API consumer if I see structure like this:
fe60b56#diff-e435a5b0fb81846382c9401a9058a87a473f9f39eca0e884d1dd981c4d2885a1R29-R41

Imagine you're API consumer and you see:

  • outdated,extraneous,failed,deleteIDs

What decision to make? Should you deleteoutdated orfailed ordeleteIDs?
Obviously you can always read the source and understand how it works, but it means it's bad API if you need to do it.


It was my thoughts during refactoring, but maybe I'm not right.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@dannykopping I put it back.

@@ -0,0 +1,53 @@
package prebuilds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We'll need at least some of the API in the AGPL code because we need the noop implementation "running" until a premium license is used. Other than that, I think most of this can shift to enterprise.

I believe I had it this way because AGPL cannot import enterprise but the reverse is allowed. The interfaces reference the types which are in this package, and if we moved those to enterprise then it would violate the license.

This is how the implementation will be swapped out at runtime based on the license:
https://github.com/coder/coder/blob/dk/prebuilds/enterprise/coderd/coderd.go#L1165-L1188

johnstcn reacted with thumbs up emoji
Stop(ctx context.Context, cause error)
}

type Reconciler interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can we do this refactor in a follow-up? Let's try keep the momentum on this PR.

evgeniy-scherbinaand others added21 commitsApril 16, 2025 15:11
Co-authored-by: Spike Curtis <spike@coder.com>
Co-authored-by: Spike Curtis <spike@coder.com>
Co-authored-by: Danny Kopping <danny@coder.com>
@evgeniy-scherbinaevgeniy-scherbina merged commit27bc60d intomainApr 17, 2025
32 checks passed
@evgeniy-scherbinaevgeniy-scherbina deleted the yevhenii/510-reconciliation-loop-v2 branchApril 17, 2025 13:29
@github-actionsgithub-actionsbot locked and limited conversation to collaboratorsApr 17, 2025
Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.
Reviewers

@dannykoppingdannykoppingdannykopping approved these changes

@johnstcnjohnstcnjohnstcn approved these changes

@mtojekmtojekmtojek approved these changes

@spikecurtisspikecurtisspikecurtis approved these changes

@EmyrkEmyrkAwaiting requested review from Emyrk

@SasSwartSasSwartAwaiting requested review from SasSwart

Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

Prebuild Reconciliation Loop
7 participants
@evgeniy-scherbina@dannykopping@johnstcn@spikecurtis@mtojek@SasSwart@deansheather

[8]ページ先頭

©2009-2025 Movatter.jp