- Notifications
You must be signed in to change notification settings - Fork928
refactor: replace startup script logs EOF with starting/ready time#8082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
// Only the latest state is reported, intermediate states may be | ||
// lost if the agent can't communicate with the API. | ||
// reportLifecycleLoop reports the current lifecycle state once. All state | ||
// changes are reported in order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Review: Originally, this was a weird/performance choice to only submit the latest status which simply resulted in needing to document the behavior increasing the complexity. We now change it so that the agent always reports all states (still non-blocking), we achieve this by adding the timestamp for the event in the payload.
a.lifecycleMu.Unlock() | ||
return | ||
} | ||
a.lifecycleState=state | ||
a.logger.Debug(ctx,"set lifecycle state",slog.F("state",state),slog.F("last",lastState)) | ||
a.lifecycleStates=append(a.lifecycleStates,report) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Review: Guaranteed to not grow unboundedly (len(a.lifecycleStates) <= enum entries
).
}else { | ||
logger.Info(ctx,"script completed",slog.F("execution_time",execTime),slog.F("exit_code",exitCode)) | ||
} | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Review: Small cleanup to unify logging between startup and shutdown scripts.
d8a51d1
tocedf8a4
Comparereturn err | ||
} | ||
workspace, err := q.db.GetWorkspaceByAgentID(ctx, agent.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Review: Perhaps I shouldn't have made this simplification? I see mixed use in dbauthz not sure why we were doing both.
488b0ae
to1334344
CompareThis commit reverts some of the changes in#8029 and implements analternative method of keeping track of when the startup script has endedand there will be no more logs.This is achieved by adding new agent fields for tracking when the agententers the "starting" and "ready"/"start_error" lifecycle states. Thetimestamps simplify logic since we don't need understand if the currentstate is before or after the state we're interested in. They can also beused to show data like how long the startup script took to execute. Thisalso allowed us to remove the EOF field from the logs as theimplementation was problematic when we returned the EOF log entry in theresponse since requesting _after_ that ID would give no logs and the APIwould thus lose track of EOF.
1334344
to73f19d6
Comparecoderd/database/migrations/000128_drop_startup_logs_eof_and_add_completion.down.sql OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
@@ -461,15 +470,20 @@ func (a *agent) reportLifecycleLoop(ctx context.Context) { | |||
// setLifecycle sets the lifecycle state and notifies the lifecycle loop. | |||
// The state is only updated if it's a valid state transition. | |||
func (a *agent) setLifecycle(ctx context.Context, state codersdk.WorkspaceAgentLifecycle) { | |||
report := agentsdk.PostLifecycleRequest{ | |||
State: state, | |||
ChangedAt: database.Now(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
database.Now()
ortime.Now()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Intentionally useddatabase.Now()
here for consistency since it's a value that will be stored in the DB. Startup logs also usedatabase.Now()
but we do seem to have some mixed use in the agent and some other places may be wrong?
Thoughts@kylecarbs? For all intents and purposes, this shouldn't matter since the DB fields aretimestamptz
, what's the motivation for usingdatabase.Now()
that always uses UTC? Logging purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
what's the motivation for using database.Now() that always uses UTC? Logging purposes?
- It might be a good candidate for the linter rule.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
👍
Thanks for addressing all comments. If CI is happy to merge it, I'm cool with that too! You can work on improvements in follow-ups.
This change breaks envbuilder, logstream-kube, and anything that writes startup logs to the UI after the startup script might be complete. We should rename startup logs to something more generalized like agent logs to remove this confusion in the future. The way it worked before (I believe) was that the EOF was sent incorrectly, but we really shouldn't have been sending an EOF at all - there is no EOF, because infrastructure logs are never actually complete. A pod can always restart, or something can trigger a reboot of the agent. |
The only break is the EOF btw, which is like 6 lines of code ;p I'm just removing that for now after talking with Mathias, and we'll work on a better long-term fix soon. |
PR here:#8528 This just reverts the EOF, which should bring back the prior behavior. |
This commit reverts some of the changes in#8029 and implements an
alternative method of keeping track of when the startup script has ended
and there will be no more logs.
This is achieved by adding new agent fields for tracking when the agent
enters the "starting" and "ready"/"start_error" lifecycle states. The
timestamps simplify logic since we don't need understand if the current
state is before or after the state we're interested in. They can also be
used to show data like how long the startup script took to execute. This
also allowed us to remove the EOF field from the logs as the
implementation was problematic when we returned the EOF log entry in the
response since requestingafter that ID would give no logs and the API
would thus lose track of EOF.