chore: refactor agent routines that use the v2 API #12223

Merged
spikecurtis merged 1 commit into main from spike/10534-agent-api-routines
Feb 23, 2024

Conversation

spikecurtis (Contributor) commented Feb 20, 2024 (edited)

In anticipation of needing the LogSender to run on a context that doesn't get immediately canceled when you Close() the agent, I've undertaken a little refactor to manage the goroutines that get run against the Tailnet and Agent API connection.

The refactor manages two contexts: one that gets canceled right away at the start of graceful shutdown, and another that stays up to allow graceful shutdown to complete.
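A minimal sketch of that two-context pattern, under assumptions: the gracefulCtx/hardCtx names come from the discussion below, but the struct shape, timeout, and Close logic here are illustrative, not the agent's actual implementation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type agent struct {
	gracefulCtx    context.Context // canceled first, at the start of graceful shutdown
	gracefulCancel context.CancelFunc
	hardCtx        context.Context // stays up so graceful shutdown work can complete
	hardCancel     context.CancelFunc
	done           chan struct{} // closed when graceful shutdown work has finished
}

func newAgent() *agent {
	hardCtx, hardCancel := context.WithCancel(context.Background())
	// gracefulCtx is derived from hardCtx, so a hard stop also cancels it.
	gracefulCtx, gracefulCancel := context.WithCancel(hardCtx)
	return &agent{
		gracefulCtx:    gracefulCtx,
		gracefulCancel: gracefulCancel,
		hardCtx:        hardCtx,
		hardCancel:     hardCancel,
		done:           make(chan struct{}),
	}
}

func (a *agent) Close() {
	a.gracefulCancel() // start graceful shutdown; routines on hardCtx keep running
	select {
	case <-a.done: // graceful shutdown finished
	case <-time.After(5 * time.Second): // deadline is illustrative
	}
	a.hardCancel() // tear down everything still running on hardCtx
}

func main() {
	a := newAgent()
	go func() {
		defer close(a.done)
		<-a.gracefulCtx.Done()
		fmt.Println("graceful shutdown started; e.g. a LogSender could still flush on hardCtx")
	}()
	a.Close()
	fmt.Println("agent closed")
}
```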

spikecurtis (Graphite App) commented Feb 20, 2024 (edited)

This stack of pull requests is managed by Graphite. Learn more about stacking.


spikecurtis marked this pull request as ready for review February 20, 2024 11:19
spikecurtis force-pushed the spike/10534-agent-api-routines branch 4 times, most recently from 7ecf10e to 2f05031 February 20, 2024 16:19
mafredri (Member) left a comment

This is a monster undertaking and has been due for a long time. Thanks for tackling it, and nice work so far!

if err != nil {
	return xerrors.Errorf("init script runner: %w", err)
}
err = a.trackGoroutine(func() {
mafredri (Member):

I'm a bit worried about the complexity of what's going on here (muddy-colored functions). We're inside a routine belonging to the API conn manager, but within it we're starting a routine that's tracked at a higher level.

This is definitely due for a refactor, but it's probably fine to keep it like this for now.

Maybe we should separate out all the script stuff into its own service/loop that performs the appropriate execution based on agent events.
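For context, a trackGoroutine-style helper usually ties spawned goroutines to shutdown roughly like the sketch below; the field names and error text here are assumptions, not the actual agent code.

```go
package main

import (
	"errors"
	"sync"
)

// Stand-in agent type with only the fields this sketch needs.
type agent struct {
	closeMutex     sync.Mutex
	closing        bool
	closeWaitGroup sync.WaitGroup
}

// trackGoroutine (sketch): refuse new goroutines once the agent is closing,
// and count live ones so shutdown can wait for them to finish.
func (a *agent) trackGoroutine(fn func()) error {
	a.closeMutex.Lock()
	defer a.closeMutex.Unlock()
	if a.closing {
		return errors.New("track goroutine: agent is closing")
	}
	a.closeWaitGroup.Add(1)
	go func() {
		defer a.closeWaitGroup.Done()
		fn()
	}()
	return nil
}

func main() {
	a := &agent{}
	_ = a.trackGoroutine(func() {}) // spawn and track a goroutine

	a.closeMutex.Lock()
	a.closing = true // from here on, trackGoroutine rejects new work
	a.closeMutex.Unlock()
	a.closeWaitGroup.Wait() // shutdown waits for everything already tracked
}
```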

spikecurtis (Contributor, Author):

Yeah, I think separating the script stuff out along with the logging into a neat subcomponent would be useful. Part of the problem is that the "Manifest" is a grab bag of different things, which makes it harder to build well-encapsulated sub-components.

Not for this PR though. I want to finish off the v2 Agent API before we contemplate a v2.1!

mafredri reacted with thumbs up emoji
if network == nil {
	// use the graceful context here, because creating the tailnet is not itself tied to the
	// agent API.
	network, err = a.createTailnet(a.gracefulCtx, manifest.AgentID, manifest.DERPMap, manifest.DERPForceWebSockets, manifest.DisableDirectConnections)
mafredri (Member):

I feel like this context may be problematic during Close. If we close the agent before we've established a connection, then the connection can't establish and we have some connection-dependent things in Close.

Why should we limit tailnet to anything except "hardCtx"?

This function is also pretty confusing since it takes ctx as an argument, but uses the graceful context here.

spikecurtis (Contributor, Author):

Yeah, this whole thing needs a bigger refactor like you suggested above. This routine doesn't directly use the drpc.Conn at all, so it could arguably be outside the apiConnRoutineManager, but it does rely on the manifest, and several other routines block on the network being available. Fixing that is more than I wanted to bite off.

In terms of the context, I don't really think it matters whether we use the gracefulCtx or the hardCtx. As soon as we cancel the gracefulCtx, we send disconnect to the coordinator and clients will start disconnecting their tunnels.

If we close before we've ever established a tailnet, then I don't think anything in Close() can block on it. We'd never have any SSH sessions, and would never have connected to the Coordinator.

mafredri reacted with thumbs up emoji
lifecycleWaitLoop:
	for {
		select {
		case <-ctx.Done():
		case <-a.hardCtx.Done():
mafredri (Member):

An alternative to using hardCtx here would be to close a.lifeCycleReported once the loop exits. (Closing hardCtx would signal the loop to close.) This ensures we don't leave the routine behind.
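Sketch of the suggested pattern, with illustrative names (the channel, ticker, and function below are stand-ins, not the agent's actual fields): the loop closes a channel when it exits, so callers can wait on that channel instead of leaving the goroutine behind.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// reportLifecycle closes lifecycleReported when it exits, so a caller can wait
// on that channel and know the goroutine is gone.
func reportLifecycle(ctx context.Context, lifecycleReported chan struct{}) {
	defer close(lifecycleReported) // signal that this goroutine has fully exited
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// ... report the current lifecycle state to the API here ...
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	reported := make(chan struct{})
	go reportLifecycle(ctx, reported)

	cancel()   // e.g. hardCtx canceled during shutdown
	<-reported // loop has fully exited; no routine left behind
	fmt.Println("lifecycle reporter exited")
}
```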

spikecurtis (Contributor, Author):

I guess it would be fine to use the channel close to indicate a timeout, but I don't want to wait for the channel to close in the success case, since that's a chicken-and-egg problem (we wait until the final state is reported to close the hardCtx in the success case).

Since it only enforces that the goroutine exits in the failure case, I don't think it's a worthwhile refactor.

mafredri reacted with thumbs up emoji
break lifecycleWaitLoop
}
}
}

// Wait for graceful disconnect from the Coordinator RPC
select {
case <-a.hardCtx.Done():
mafredri (Member):

While always checking the context is a good way to ensure we actually do exit, it creates a lot of paths that short-circuit execution, potentially leaving things behind. The state is also harder to reason about as there are multiple branches.

That is to say, I think it would be OK to skip hardCtx here? I guess the only danger is if coordination.Close() hangs indefinitely or is very slow.

This change might also suggest it'd be sensible to handle hardCtx in runCoordinator (instead of just return <-errCh).
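One way that could look, as a hedged sketch (the signature and error handling are assumptions, not the actual runCoordinator):

```go
package main

import (
	"context"
	"fmt"
)

// Sketch only: instead of `return <-errCh`, also watch hardCtx so the routine
// cannot block forever waiting for the coordinator error.
func runCoordinator(hardCtx context.Context, errCh <-chan error) error {
	select {
	case err := <-errCh:
		return err
	case <-hardCtx.Done():
		return hardCtx.Err()
	}
}

func main() {
	hardCtx, hardCancel := context.WithCancel(context.Background())
	errCh := make(chan error, 1)
	hardCancel() // simulate a hard stop while the coordinator is still connected
	fmt.Println("coordinator routine exited:", runCoordinator(hardCtx, errCh))
}
```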

spikecurtis (Contributor, Author):

That arm of the select is mostly about logging when we hit a timeout.

I take your point from earlier discussions that relying solely on context to tear down goroutines can cause us to exit before finalizer actions have run. Here we are explicitly checking the status of a finalizer.

We're definitely not doing a complete job of tracking goroutines in the agent, but I'm not convinced that should be a goal unto itself. Tracking and waiting for errant goroutines doesn't actually prevent leaks. If there is a leak, then in testing it causes the test to hang, whereas goleak fails much faster. In production, it will cause shutdown to hang, so the cluster manager has to force-stop the agent.
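For reference, goleak (go.uber.org/goleak) is the leak detector mentioned here; a minimal usage sketch in a test package (the test names are illustrative):

```go
package agent_test

import (
	"testing"

	"go.uber.org/goleak"
)

// Fail the package's tests promptly if any goroutines are left running,
// instead of letting a leaked goroutine make a test hang.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

func TestAgentShutdown(t *testing.T) {
	// Or check a single test in isolation:
	defer goleak.VerifyNone(t)
	// ... start the agent, exercise it, and Close() it here ...
}
```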

mafredri (Member):

Fair 👍🏻

spikecurtis force-pushed the spike/10534-agent-api-routines branch 2 times, most recently from 5cf1bb4 to f9ca6b2 February 22, 2024 06:14
spikecurtis force-pushed the spike/10534-agent-api-routines branch from f9ca6b2 to d9b360c February 22, 2024 08:15
mafredri (Member) left a comment

Looking good, thanks for making the proposed changes!

// Wait for graceful disconnect from the Coordinator RPC
select {
case <-a.hardCtx.Done():
	a.logger.Warn(context.Background(), "timed out waiting for Coordinator RPC disconnect")
mafredri (Member):

Suggested change:
-	a.logger.Warn(context.Background(), "timed out waiting for Coordinator RPC disconnect")
+	a.logger.Warn(context.Background(), "timed out waiting for coordinator RPC disconnect")

spikecurtis merged commit af3fdc6 into main Feb 23, 2024
spikecurtis deleted the spike/10534-agent-api-routines branch February 23, 2024 07:04
github-actions bot locked and limited conversation to collaborators Feb 23, 2024
Reviewers: mafredri (approved these changes), coadler (awaiting requested review)
Assignees: spikecurtis
Labels: none
Projects: none
Milestone: none
Participants: spikecurtis, mafredri
