- Notifications
You must be signed in to change notification settings - Fork906
feat: track resource replacements when claiming a prebuilt workspace#17571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Uh oh!
There was an error while loading.Please reload this page.
coderd/database/migrations/000318_resource_replacements_notification.up.sql OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
4859410
to634082b
CompareUh oh!
There was an error while loading.Please reload this page.
@@ -75,6 +75,7 @@ message CompletedJob { | |||
repeated provisioner.Resource resources = 2; | |||
repeated provisioner.Timing timings = 3; | |||
repeated provisioner.Module modules = 4; | |||
repeated provisioner.ResourceReplacement resourceReplacements = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I believe this is backwards-compatible sincerepeated
fields are effectivelyoptional
if nil length.
// nolint:gocritic // Necessary to query all the required data. | ||
ctx = dbauthz.AsSystemRestricted(ctx) | ||
// Since this may be called in a fire-and-forget fashion, we need to give up at some point. | ||
trackCtx, trackCancel := context.WithTimeout(ctx, time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This is a best-effort attempt to warn operators of this situation; it's ok if it times out, we'll get a log to trace this with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I see this partially covers my earlier comment. I still think it'd be good to take shutdowns into consideration for these and define what that behavior should be (rather than undefined). Right now these routines will be rudely interrupted during shutdown rather than exiting cleanly. Likewise these can be left running even if a CompleteJob is interrupted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Right now these routines will be rudely interrupted during shutdown rather than exiting cleanly.
Good call! I've now passed in theprovisionerdserver
's lifecycle context.
Likewise these can be left running even if a CompleteJob is interrupted.
I believe that's already the case?
@@ -258,7 +258,7 @@ func getStateFilePath(workdir string) string { | |||
} | |||
// revive:disable-next-line:flag-parameter | |||
func (e *executor) plan(ctx, killCtx context.Context, env, vars []string, logr logSink,destroy bool) (*proto.PlanComplete, error) { | |||
func (e *executor) plan(ctx, killCtx context.Context, env, vars []string, logr logSink,metadata *proto.Metadata) (*proto.PlanComplete, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This is the meat of the feature; everything else is just plumbing between system and user eyeball.
level := proto.LogLevel_INFO | ||
// Terraform indicates that a resource will be deleted and recreated by showing the change along with this substring. | ||
if bytes.Contains(line, []byte("# forces replacement")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
A bit flimsy; open to other ideas.
In any case, this is just sugar. The fact that the plan, with all its drift details, are shown will be sufficient. Highlighting the lines is just a courtesy to the user.
Uh oh!
There was an error while loading.Please reload this page.
coderd/database/migrations/000320_resource_replacements_notification.up.sql OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
@@ -13,43 +16,65 @@ import ( | |||
"github.com/coder/coder/v2/coderd/prebuilds" | |||
) | |||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Muddies the purpose of the PR a bit, but it was a worthwhile driveby refactoring given that we're adding a new metric (MetricResourceReplacementsCount
) and we need to check for its value in a test.
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
…replacement(s)Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
also a test that was broken from an earlier fixSigned-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
No major issues uncovered, my main concern is the goroutine management in CompleteJob, other than that mostly minor nits and suggestions.
Uh oh!
There was an error while loading.Please reload this page.
coderd/database/migrations/000320_resource_replacements_notification.up.sql OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
coderd/database/migrations/000320_resource_replacements_notification.up.sql OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
…17757)Used in combination withcoder/terraform-provider-coder#396This is required by both#17475 and#17571Operators may need to conditionalize their templates to perform certainoperations once a prebuilt workspace has been claimed. This value will**only** be set once a claim takes place and a subsequent `terraformapply` occurs. Any `terraform apply` runs thereafter will beindistinguishable from a normal run on a workspace.---------Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Uh oh!
There was an error while loading.Please reload this page.
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
mafredri left a comment• edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
A few more comments but my main concern is around using theatomic.Pointer
for the notifier, I don't really see a reason for it in which case I think the code should be simplified.
The other remaining concern is around goroutine/shutdown which could lead to test flakes (leak detection).
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
… APIavoid all this pointer jugglingSigned-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
…nywaySigned-off-by: Danny Kopping <dannykopping@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Nice work!
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
My review limited to the protobuf / version changes.
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
6e96778
intomainUh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Closescoder/internal#369
We can't know whether a replacement (i.e. drift of terraform state leading to a resource needing to be deleted/recreated) will take place apriori; we can only detect it at
plan
time, because the provider decides whether a resource must be replaced and it cannot be inferred through static analysis of the template.This is likely to be the most common gotcha with using prebuilds, since it requires a slight template modification to use prebuilds effectively, so let's head this off before it's an issue for customers.
Drift details will now be logged in the workspace build logs:
Plus a notification will be sent to template admins when this situation arises:
A new metric -
coderd_prebuilt_workspaces_resource_replacements_total
- will also increment each time a workspace encounters replacements.We only trackthat a resource replacement occurred, not how many. Just one is enough to ruin a prebuild, but we can't know apriori which replacement would cause this.
For example, say we have 2 replacements: a
docker_container
and anull_resource
; we don't know which one mightcause an issue (or indeed if either would), so we just track the replacement.