Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

feat(coderd/database/dbpurge): retain most recent agent build logs#14460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
johnstcn merged 9 commits intomainfromcj/retain-last-build-logs
Aug 30, 2024

Conversation

johnstcn
Copy link
Member

@johnstcnjohnstcn commentedAug 27, 2024
edited
Loading

Builds upon#14480
Addresses#10576

Updates theDeleteOldWorkspaceAgentLogs to:

  • Retain logs for the most recent build regardless of age,
  • Delete logs for agents that never connected and were created before the cutoff for deleting logs while still retaining the logs most recent build.

The motivation for the second changes is from running the following query on the dogfood database:

coder=> SELECT COUNT(*), MIN(workspace_agents.created_at) FROM workspace_agent_logs JOIN workspace_agents ON workspace_agent_logs.agent_id = workspace_agents.idWHERE workspace_agents.last_connected_at IS NULLAND workspace_agents.created_at < (NOW() - INTERVAL '7 days'); count |              min              -------+-------------------------------  2169 | 2023-11-21 21:48:26.339238+00(1 row)

@johnstcnjohnstcn self-assigned thisAug 27, 2024
@johnstcnjohnstcnforce-pushed thecj/retain-last-build-logs branch 4 times, most recently from139174a to34be657CompareAugust 29, 2024 15:59
@johnstcnjohnstcn changed the base branch frommain tocj/dbpurge-quartzAugust 29, 2024 15:59
@johnstcnjohnstcn changed the titlewip: feat(coderd/database/dbpurge): retain most recent agent build logsfeat(coderd/database/dbpurge): retain most recent agent build logsAug 29, 2024
@johnstcnjohnstcnforce-pushed thecj/retain-last-build-logs branch from12df09c to3321a28CompareAugust 30, 2024 09:38

// After dbpurge completes, the ticker is reset. Trap this call.
trapReset:=clk.Trap().TickerReset()
funcawaitDoTick(ctx context.Context,t*testing.T,clk*quartz.Mock)chanstruct{} {
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

//nolint:paralleltest // It uses LockIDDBPurge.
funcTestDeleteOldWorkspaceAgentLogs(t*testing.T) {
db,_:=dbtestutil.NewDB(t)
ctx:=testutil.Context(t,testutil.WaitShort)
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

dannykopping reacted with thumbs up emoji
Base automatically changed fromcj/dbpurge-quartz tomainAugust 30, 2024 10:55
@johnstcnjohnstcnforce-pushed thecj/retain-last-build-logs branch from31c3351 to0f8ab25CompareAugust 30, 2024 11:00
Comment on lines +34 to +35
// Start the ticker with the initial delay.
ticker:=clk.NewTicker(delay)
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Review: I found this to be the source of a race condition while testing.
I moved the initial 'tick' to instead be inside the goroutine below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What was the race condition?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It may actually be the same race condition you mentioned below, just moved around slightly.

now:=dbtime.Now()

//nolint:paralleltest // It uses LockIDDBPurge.
t.Run("AgentHasNotConnectedSinceWeek_LogsExpired",func(t*testing.T) {
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Review: I collapsed both of the sub-tests into a single test with more breadth.

mafredri reacted with thumbs up emoji
Comment on lines 190 to +233
-- If an agent hasn't connected in the last 7 days, we purge it's logs.
-- Exception: if the logs are related to the latest build, we keep those around.
-- Logs can take up a lot of space, so it's important we clean up frequently.
-- name: DeleteOldWorkspaceAgentLogs :exec
DELETEFROM workspace_agent_logsWHERE agent_idIN
(SELECT idFROM workspace_agentsWHERE last_connected_atIS NOT NULL
AND last_connected_at< @threshold ::timestamptz);
WITH
latest_buildsAS (
SELECT
workspace_id,max(build_number)AS max_build_number
FROM
workspace_builds
GROUP BY
workspace_id
),
old_agentsAS (
SELECT
wa.id
FROM
workspace_agentsAS wa
JOIN
workspace_resourcesAS wr
ON
wa.resource_id=wr.id
JOIN
workspace_buildsAS wb
ON
wb.job_id=wr.job_id
LEFT JOIN
latest_builds
ON
latest_builds.workspace_id=wb.workspace_id
AND
latest_builds.max_build_number=wb.build_number
WHERE
-- Filter out the latest builds for each workspace.
latest_builds.workspace_id ISNULL
AND CASE
-- If the last time the agent connected was before @threshold
WHENwa.last_connected_atIS NOT NULL THEN
wa.last_connected_at< @threshold ::timestamptz
-- The agent never connected, and was created before @threshold
ELSEwa.created_at< @threshold ::timestamptz
END
)
DELETEFROM workspace_agent_logsWHERE agent_idIN (SELECT idFROM old_agents);
Copy link
MemberAuthor

@johnstcnjohnstcnAug 30, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is the main change here. I welcome any further scrutiny or suggestions of this query.

I realize thatold_agents can potentially get pretty big, but#14340 may help address this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would be interesting to see ahttps://explain.dalibo.com against dogfood. If the postgres query analyzer is working good with this query, it should hopefully be able to minimize the work in the CTE, but since it's a CTE it might fail at that optimization. Might actually produce a better result moving theold agents CTE inline in the delete.

johnstcn reacted with eyes emoji
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I ran the query and two variants (one with theold_agents CTE inlined, and one with both CTEs inelined).

The plan looks basically the same for all three:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks! Looks like there's not much difference and the original is easiest to plan for so might as well leave it. Perhaps issues may surface on a larger scale but I think it's alright to deal with those as they appear.

(Side note: So many sequential scans though, guess we don't have indexes to optimize this join order :/.)

johnstcn reacted with confused emoji
@johnstcnjohnstcn marked this pull request as ready for reviewAugust 30, 2024 11:05
Copy link
Member

@mtojekmtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Correct me if I'm wrong, with this change you leave the latest workspace build logs as the agent might have not connected yet?

If I understood you weel, then 👍 from me.

@johnstcn
Copy link
MemberAuthor

Correct me if I'm wrong, with this change you leave the latest workspace build logs as the agent might have not connected yet?

Correct, that was the overall consensus from#10576

if!toBeDeleted {
validLogs=append(validLogs,log)
/*
old_agents AS (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is really nice!

I'm concerned it'll go stale...
Although it could be argued all the business logic duplicated in this file is stale the second it's written.
Wereally need to get rid of dbmem.

johnstcn reacted with thumbs up emoji
deferclose(closed)
deferticker.Stop()
// Force an initial tick immediately.
ticker.Reset(time.Nanosecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Won't this set the ticker's new period to a nanosecond?

I believe this is the idiomatic way to tick immediately:

ticker:=time.NewTicker(period)deferticker.Stop()for ;true;<- ticker.C {...}

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Won't this set the ticker's new period to a nanosecond?

Yes, briefly until the second prong of theselect is reached. Then the ticker is stopped anddoTick() executes. At the end ofdoTick(), the ticker is reset with a longer delay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The correction reset happens indoTick, so yes and no. This does however introduce a race where we may tick two times before instead of once on startup.

  1. tick reset nano
  2. tick
  3. <-ticker.C
  4. tick
  5. ticker.Stop()
  6. doTick()
  7. <-ticker.C (immediately after because of 4)

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The "immediate tick" idiom you mention may not play nicely with<-ctx.Done()

Copy link
Member

@mafredrimafredriAug 30, 2024
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ah yeah, that's very possible. I didn't think about it when I commented but original code is also not impervious to this race.

An alternative non-racy way is to kick it off with a different channel:

vartickerC<-chan time.TimeinitC:=make(chan time.Time,1)initC<- time.Time{}close(initC)tickerC=initC// ...casetick:=tickerC:tickerC=ticker.CdoTick(...)

(Or a separate case if that's preferable.)

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

An alternative is to just calldoTick() immediately before thefor-select loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Get out of here@johnstcn with your good and reasonable ideas!

johnstcn reacted with laugh emoji
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think the "initCh" approach definitely has merits, but not sure if it's necessary here as the context is still available indoTick. :)

Copy link
Member

@mafredrimafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Except for the ticker init change which may produce two initial ticks, LGTM.

Comment on lines +34 to +35
// Start the ticker with the initial delay.
ticker:=clk.NewTicker(delay)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What was the race condition?

deferclose(closed)
deferticker.Stop()
// Force an initial tick immediately.
ticker.Reset(time.Nanosecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The correction reset happens indoTick, so yes and no. This does however introduce a race where we may tick two times before instead of once on startup.

  1. tick reset nano
  2. tick
  3. <-ticker.C
  4. tick
  5. ticker.Stop()
  6. doTick()
  7. <-ticker.C (immediately after because of 4)

funcmustCreateAgentWithLogs(ctx context.Context,t*testing.T,db database.Store,user database.User,org database.Organization,tmpl database.Template,tv database.TemplateVersion,agentLastConnectedAt time.Time,outputstring) uuid.UUID {
agent:=mustCreateAgent(t,db,user,org,tmpl,tv)
returnch
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nice 👍🏻

@johnstcnjohnstcn merged commit0f8251b intomainAug 30, 2024
27 checks passed
@johnstcnjohnstcn deleted the cj/retain-last-build-logs branchAugust 30, 2024 16:39
@github-actionsgithub-actionsbot locked and limited conversation to collaboratorsAug 30, 2024
Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.

Reviewers

@mafredrimafredrimafredri approved these changes

@dannykoppingdannykoppingdannykopping left review comments

@mtojekmtojekmtojek approved these changes

Assignees

@johnstcnjohnstcn

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

4 participants

@johnstcn@mafredri@dannykopping@mtojek

[8]ページ先頭

©2009-2025 Movatter.jp