feat(coderd/database/dbpurge): retain most recent agent build logs #14460
Conversation
// After dbpurge completes, the ticker is reset. Trap this call.
trapReset := clk.Trap().TickerReset()

func awaitDoTick(ctx context.Context, t *testing.T, clk *quartz.Mock) chan struct{} {
Addresses #14480 (comment)
//nolint:paralleltest // It uses LockIDDBPurge.
func TestDeleteOldWorkspaceAgentLogs(t *testing.T) {
	db, _ := dbtestutil.NewDB(t)
	ctx := testutil.Context(t, testutil.WaitShort)
Addresses #14480 (comment)
…t builds and delete never-connected workspaces
// Start the ticker with the initial delay.
ticker := clk.NewTicker(delay)
Review: I found this to be the source of a race condition while testing.
I moved the initial 'tick' to instead be inside the goroutine below.
What was the race condition?
It may actually be the same race condition you mentioned below, just moved around slightly.
now := dbtime.Now()
//nolint:paralleltest // It uses LockIDDBPurge.
t.Run("AgentHasNotConnectedSinceWeek_LogsExpired", func(t *testing.T) {
Review: I collapsed both of the sub-tests into a single test with more breadth.
-- If an agent hasn't connected in the last 7 days, we purge its logs.
-- Exception: if the logs are related to the latest build, we keep those around.
-- Logs can take up a lot of space, so it's important we clean up frequently.
-- name: DeleteOldWorkspaceAgentLogs :exec
DELETE FROM workspace_agent_logs WHERE agent_id IN
	(SELECT id FROM workspace_agents WHERE last_connected_at IS NOT NULL
		AND last_connected_at < @threshold ::timestamptz);
WITH
	latest_builds AS (
		SELECT
			workspace_id, max(build_number) AS max_build_number
		FROM
			workspace_builds
		GROUP BY
			workspace_id
	),
	old_agents AS (
		SELECT
			wa.id
		FROM
			workspace_agents AS wa
		JOIN
			workspace_resources AS wr
		ON
			wa.resource_id = wr.id
		JOIN
			workspace_builds AS wb
		ON
			wb.job_id = wr.job_id
		LEFT JOIN
			latest_builds
		ON
			latest_builds.workspace_id = wb.workspace_id
		AND
			latest_builds.max_build_number = wb.build_number
		WHERE
			-- Filter out the latest builds for each workspace.
			latest_builds.workspace_id IS NULL
			AND CASE
				-- If the last time the agent connected was before @threshold
				WHEN wa.last_connected_at IS NOT NULL THEN
					wa.last_connected_at < @threshold ::timestamptz
				-- The agent never connected, and was created before @threshold
				ELSE wa.created_at < @threshold ::timestamptz
			END
	)
DELETE FROM workspace_agent_logs WHERE agent_id IN (SELECT id FROM old_agents);
This is the main change here. I welcome any further scrutiny or suggestions on this query.
I realize that old_agents can potentially get pretty big, but #14340 may help address this.
It would be interesting to see a https://explain.dalibo.com run against dogfood. If the Postgres query planner is working well with this query, it should hopefully be able to minimize the work in the CTE, but since it's a CTE it might fail at that optimization. It might actually produce a better result to move the old_agents CTE inline in the delete.
I ran the query and two variants (one with the old_agents CTE inlined, and one with both CTEs inlined).
The plan looks basically the same for all three:
Thanks! Looks like there's not much difference and the original is easiest to plan for, so we might as well leave it. Perhaps issues will surface at a larger scale, but I think it's alright to deal with those as they appear.
(Side note: so many sequential scans though; I guess we don't have indexes to optimize this join order :/.)
Correct me if I'm wrong: with this change you leave the latest workspace build's logs, as the agent might not have connected yet?
If I understood you well, then 👍 from me.
Correct, that was the overall consensus from #10576.
if !toBeDeleted {
	validLogs = append(validLogs, log)

/*
	old_agents AS (
This is really nice!
I'm concerned it'll go stale...
Although it could be argued all the business logic duplicated in this file is stale the second it's written.
We really need to get rid of dbmem.
coderd/database/dbpurge/dbpurge.go Outdated
defer close(closed)
defer ticker.Stop()
// Force an initial tick immediately.
ticker.Reset(time.Nanosecond)
Won't this set the ticker's new period to a nanosecond?
I believe this is the idiomatic way to tick immediately:
ticker := time.NewTicker(period)
defer ticker.Stop()
for ; true; <-ticker.C {
	...
}
Won't this set the ticker's new period to a nanosecond?
Yes, briefly, until the second prong of the select is reached. Then the ticker is stopped and doTick() executes. At the end of doTick(), the ticker is reset with a longer delay.
The corrective reset happens in doTick, so yes and no. This does, however, introduce a race where we may tick twice on startup instead of once:
1. tick reset nano
2. tick
3. <-ticker.C
4. tick
5. ticker.Stop()
6. doTick()
7. <-ticker.C (immediately after, because of 4)
The "immediate tick" idiom you mention may not play nicely with <-ctx.Done().
Ah yeah, that's very possible. I didn't think about it when I commented, but the original code is also not impervious to this race.
An alternative non-racy way is to kick it off with a different channel:

var tickerC <-chan time.Time
initC := make(chan time.Time, 1)
initC <- time.Time{}
close(initC)
tickerC = initC
// ...
case tick := <-tickerC:
	tickerC = ticker.C
	doTick(...)

(Or a separate case if that's preferable.)
An alternative is to just call doTick() immediately before the for-select loop.
Get out of here @johnstcn with your good and reasonable ideas!
I think the "initCh" approach definitely has merits, but I'm not sure it's necessary here as the context is still available in doTick. :)
Except for the ticker init change which may produce two initial ticks, LGTM.
func mustCreateAgentWithLogs(ctx context.Context, t *testing.T, db database.Store, user database.User, org database.Organization, tmpl database.Template, tv database.TemplateVersion, agentLastConnectedAt time.Time, output string) uuid.UUID {
	agent := mustCreateAgent(t, db, user, org, tmpl, tv)
	return ch
}
Nice 👍🏻
Merged commit 0f8251b into main.
Builds upon #14480
Addresses #10576

Updates the DeleteOldWorkspaceAgentLogs query to:

The motivation for the second change comes from running the following query on the dogfood database: