chore: replace wsconncache with a single tailnet #8176
Conversation
@@ -0,0 +1,169 @@
package agenttest
This was essentially copy/pasted out of agent_test.go so it can be reused.
I didn't manually test this, but the code looks good. It's sick that it's backward-compatible.
We should have a timeline on the removal of wsconncache, because that seems like some big dead weight.
// NewServerTailnet creates a new tailnet intended for use by coderd. It
// automatically falls back to wsconncache if a legacy agent is encountered.
func NewServerTailnet(
Although maybe a bit weird, I'd argue we should put this in workspaceagents.go since wsconncache will be going away.
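For readers skimming the diff, here is a rough, purely illustrative sketch of the fallback idea described in the doc comment above. None of these names (dialer, serverTailnet, legacyCache, isLegacy) are the PR's actual API; they only exist to show the shape of "one shared tailnet, with wsconncache kept as a legacy path".

package sketch

import (
	"context"
	"net"

	"github.com/google/uuid"
)

// dialer abstracts both the shared tailnet and the legacy per-agent cache.
type dialer interface {
	DialAgent(ctx context.Context, agentID uuid.UUID) (net.Conn, error)
}

type serverTailnet struct {
	shared      dialer             // one tailnet shared by all of coderd
	legacyCache dialer             // wsconncache-style fallback
	isLegacy    map[uuid.UUID]bool // assumption: some way to detect legacy agents
}

func (s *serverTailnet) DialAgent(ctx context.Context, agentID uuid.UUID) (net.Conn, error) {
	if s.isLegacy[agentID] {
		// Legacy agents can't participate in the shared tailnet, so fall
		// back to a cached per-agent connection.
		return s.legacyCache.DialAgent(ctx, agentID)
	}
	return s.shared.DialAgent(ctx, agentID)
}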
Ignore test failures, will fix those in the morning.
func (c *pgCoord) ServeMultiAgent(id uuid.UUID) agpl.MultiAgentConn {
	_, _ = c, id
	panic("not implemented") // TODO: Implement
This effectively torpedoes the pgCoord, so we can't merge this PR until it's fixed.
Since I'm still mostly fleshing out the API, I didn't think it made sense to implement it on pgCoord yet. Definitely a blocker for merge, though.
FWIW, I think what we need to do is drop the agent_id column from the tailnet_clients table and add a new table that tracks the agents each client wants to connect to:
CREATE TABLE tailnet_subscriptions (
	client_id uuid NOT NULL,
	coordinator_id uuid NOT NULL,
	agent_id uuid NOT NULL,
	PRIMARY KEY (client_id, coordinator_id, agent_id),
	FOREIGN KEY (client_id) REFERENCES tailnet_clients(id) ON DELETE CASCADE,
	FOREIGN KEY (coordinator_id) REFERENCES tailnet_coordinators(id) ON DELETE CASCADE
);
FWIW, I think the right architecture for both the in-memory and distributed coordinator is to consider an individual client connection as a special case of the more general "multiagent" idea. That is, a normal end user client is a multiagent where the number of agents is exactly one. It's much easier to maintain and understand code that computes the general case, and then at the edge we have adaptations that wire up clients with exactly one agent, rather than trying to build two different sets of logic and keep them consistent. So, in the core logic of the coordinator, clients are all multi-agent, and we compute and queue up updates on them based on the set of agents they've asked to connect to. Then, at the very edge we have special-case code that either serializes the nodes over the websocket, or sends them out over channels/callbacks, depending on whether we have an end user, remote client, or a coderd in-process client.
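To make the "a client is just a multiagent with exactly one agent" idea concrete, here is a minimal sketch of such an edge adapter. The names (multiAgentConn, singleAgentClient, Node) are invented for illustration and are not the coordinator's actual API.

package sketch

import (
	"context"

	"github.com/google/uuid"
)

// Node stands in for the coordinator's node type.
type Node struct{}

// multiAgentConn is a stand-in for the general multiagent interface the
// core coordinator would expose.
type multiAgentConn interface {
	SubscribeAgent(agentID uuid.UUID) error
	UpdateSelf(node *Node) error
	NextUpdate(ctx context.Context) ([]*Node, bool)
}

// singleAgentClient adapts the general multiagent connection to the
// "end user client connected to exactly one agent" special case.
type singleAgentClient struct {
	conn    multiAgentConn
	agentID uuid.UUID
}

func newSingleAgentClient(conn multiAgentConn, agentID uuid.UUID) (*singleAgentClient, error) {
	// The only special-case logic lives at the edge: subscribe to exactly
	// one agent. The core still treats this client as a multiagent.
	if err := conn.SubscribeAgent(agentID); err != nil {
		return nil, err
	}
	return &singleAgentClient{conn: conn, agentID: agentID}, nil
}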
	// connection to.
	agentNodes map[uuid.UUID]*tailnetNode
	transport  *http.Transport
This is so nice, great work!
tailnet/coordinator.go Outdated
	// multiAgents holds all of the unique multiAgents listening on this
	// coordinator. We need to keep track of these separately because we need to
	// make sure they're closed on coordinator shutdown. If not, they won't be
	// able to reopen another multiAgent on the new coordinator.
	multiAgents map[uuid.UUID]*MultiAgent
Haven't added this to the haCoordinator yet.
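For context, a minimal sketch of the shutdown behavior that map exists to support (names assumed; this is not the actual coordinator, and the haCoordinator version isn't implemented yet):

package sketch

import (
	"sync"

	"github.com/google/uuid"
)

// multiAgentCloser stands in for whatever close method MultiAgent exposes.
type multiAgentCloser interface {
	Close() error
}

type coordinator struct {
	mu          sync.Mutex
	closed      bool
	multiAgents map[uuid.UUID]multiAgentCloser
}

// Close shuts the coordinator down and closes every tracked multiAgent so
// callers can reconnect to a replacement coordinator instead of hanging on
// a dead one.
func (c *coordinator) Close() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return nil
	}
	c.closed = true
	for id, ma := range c.multiAgents {
		_ = ma.Close()
		delete(c.multiAgents, id)
	}
	return nil
}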
tailnet/coordinator.go Outdated
	// agentToConnectionSockets maps agent IDs to connection IDs of conns that
	// are subscribed to updates for that agent.
-	agentToConnectionSockets map[uuid.UUID]map[uuid.UUID]*TrackedConn
+	agentToConnectionSockets map[uuid.UUID]map[uuid.UUID]Enqueueable
This is great!
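For readers following along, a rough guess at the kind of interface this change implies; the method set below is illustrative only, not the actual Enqueueable definition from the PR.

package sketch

import "github.com/google/uuid"

// Node stands in for the coordinator's node type.
type Node struct{}

// enqueueable is a hypothetical interface that lets both websocket
// connections (TrackedConn) and in-process multiagents receive node updates
// from the coordinator without the core caring which one it is talking to.
type enqueueable interface {
	UniqueID() uuid.UUID
	Enqueue(nodes []*Node) error
	Close() error
}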
coderd/tailnet.go Outdated
	}
	delete(s.agentNodes, agentID)
	// TODO(coadler): actually remove from the netmap
Will do this before merge
// It is exported so that tests can use it.
const WriteTimeout = time.Second * 5

type TrackedConn struct {
Good call on moving this stuff to its own file. Much more readable.
	Stats() (start, lastWrite int64)
	Overwrites() int64
	CoordinatorClose() error
	Close() error
I don't understand the distinction between CoordinatorClose() and Close(). How are they different, and why do we need both?
There's sort of a chicken-and-egg problem. When you want to close an individual queue, you expect it to be removed from the coordinator. When the coordinator is closing, you want the coordinator to close all of the queues, but you don't want the queues to reach back into the coordinator to close themselves and re-enter the mutex.
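A small sketch of the two paths being described, with invented names rather than the actual queue API, showing why a "close and deregister yourself" method and a "coordinator already holds its lock" method can both exist:

package sketch

import (
	"sync"

	"github.com/google/uuid"
)

type queue struct {
	coord *coordinator
	id    uuid.UUID
	once  sync.Once
	done  chan struct{}
}

// Close is called by the queue's owner: it removes the queue from the
// coordinator (taking the coordinator's mutex) and then closes itself.
func (q *queue) Close() error {
	q.coord.removeQueue(q.id) // acquires coord.mu
	return q.CoordinatorClose()
}

// CoordinatorClose is called by the coordinator while it already holds its
// own mutex, so it must not reach back into the coordinator.
func (q *queue) CoordinatorClose() error {
	q.once.Do(func() { close(q.done) })
	return nil
}

type coordinator struct {
	mu     sync.Mutex
	queues map[uuid.UUID]*queue
}

func (c *coordinator) removeQueue(id uuid.UUID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.queues, id)
}

// Close shuts down every queue without the re-entrancy problem: the
// coordinator holds its mutex and calls CoordinatorClose, never Close.
func (c *coordinator) Close() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, q := range c.queues {
		_ = q.CoordinatorClose()
	}
	c.queues = map[uuid.UUID]*queue{}
	return nil
}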
}

func (m *MultiAgent) UpdateSelf(node *Node) error {
	return m.OnNodeUpdate(m.ID, node)
Closing the MultiAgent feels incomplete to me, since it seems that we would still send updates, subscribes, and unsubscribes into the core even after it's been closed.
Furthermore, NextUpdate() runs on its own, separate Context, rather than respecting the closed state of the MultiAgent. This seems like a recipe for zombie goroutines.
NextUpdate() does respect the closed state of the MultiAgent, since m.updates gets closed when the MultiAgent is closed.
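For clarity, a sketch of that pattern with assumed field names: NextUpdate reports ok=false once the updates channel has been closed, so a looping caller stops instead of leaking a goroutine.

package sketch

import (
	"context"
	"sync"
)

type Node struct{}

type multiAgent struct {
	mu      sync.Mutex
	closed  bool
	updates chan []*Node
}

// NextUpdate blocks until the next batch of node updates arrives, the
// context is canceled, or the MultiAgent is closed. The second return value
// reports whether the MultiAgent is still usable.
func (m *multiAgent) NextUpdate(ctx context.Context) ([]*Node, bool) {
	select {
	case <-ctx.Done():
		// Context cancellation doesn't mean the MultiAgent is closed;
		// the caller can inspect ctx.Err() separately.
		return nil, true
	case nodes, ok := <-m.updates:
		// ok is false once Close has closed m.updates, which is how a
		// looping caller learns to stop.
		return nodes, ok
	}
}

func (m *multiAgent) Close() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.closed {
		return nil
	}
	m.closed = true
	close(m.updates)
	return nil
}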
I've made this opt-in via a feature flag, so it won't be enabled by default. The implementations for pgcoord and moons aren't done, but I plan on addressing that in follow-up PRs due to this one getting huge. After both are resolved, I'll remove the feature flag.
I understand this PR is getting big and unwieldy, so I'm fine with it going in behind an experiment flag, but I would like to see a GitHub epic and GitHub issues tracking the things that need a resolution before this can go GA.
coderd/tailnet.go Outdated
	s.nodesMu.Lock()
	agentConn := s.getAgentConn()
	for agentID, node := range s.agentNodes {
		if time.Since(node.lastConnection) > cutoff {
This measures the last time we started a connection, not the last time the connection was used. If we proxy a long-lived connection like a ReconnectingPTY or a devURL websocket, it could easily be in use for more than 30 minutes.
We might need some ref-counting to keep track of the connections to each agent, so that we only expire them when they are no longer used.
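One possible shape for that ref-counting, sketched here with made-up names: acquiring an agent bumps a counter and returns a release func, and the cleanup loop only expires agents with zero in-flight connections whose last release is older than the cutoff.

package sketch

import (
	"sync"
	"time"

	"github.com/google/uuid"
)

type tailnetNode struct {
	refs        int       // in-flight proxied connections to this agent
	lastRelease time.Time // when the last connection finished
}

type serverTailnet struct {
	mu    sync.Mutex
	nodes map[uuid.UUID]*tailnetNode
}

// acquireAgent records an in-flight connection and returns a release func
// the proxy must call when the (possibly long-lived) connection finishes.
func (s *serverTailnet) acquireAgent(id uuid.UUID) (release func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	node, ok := s.nodes[id]
	if !ok {
		node = &tailnetNode{}
		s.nodes[id] = node
	}
	node.refs++
	return func() {
		s.mu.Lock()
		defer s.mu.Unlock()
		node.refs--
		node.lastRelease = time.Now()
	}
}

// expireAgents removes agents with no active connections that haven't been
// used since the cutoff, instead of expiring by connection start time.
func (s *serverTailnet) expireAgents(cutoff time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, node := range s.nodes {
		if node.refs == 0 && time.Since(node.lastRelease) > cutoff {
			delete(s.nodes, id)
		}
	}
}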
This PR mostly removes wsconncache from coderd, replacing it with a single tailnet per coderd. The biggest benefit to this is not needing a cache of tailnets sitting around for each agent. Additionally, by reusing the same http.Transport for proxying to workspaces, we're able to reuse TCP conns between requests instead of opening and closing one per request. This is the same strategy we use in wgtunnel.

I made some changes to the coordinator interface to support coderd hooking directly in, so I'd definitely like some feedback on that. I haven't yet implemented it for pgcoord.

Left out of this PR is support for moons. This was getting a bit too big, so I'll instead do that in a followup. Moons will still use wsconncache for the time being.
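As an aside on the transport reuse mentioned above, the general pattern looks roughly like the following. This is a sketch, not the actual coderd wiring: one http.Transport whose DialContext goes over the shared tailnet, so HTTP keep-alive pools TCP connections across proxied requests.

package sketch

import (
	"context"
	"net"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

// tailnetDialer stands in for the coderd-wide tailnet connection.
type tailnetDialer interface {
	DialContext(ctx context.Context, network, addr string) (net.Conn, error)
}

// newProxyTransport builds a single http.Transport that dials workspace
// agents over the shared tailnet. Because the transport is reused for every
// proxied request, idle keep-alive connections are pooled instead of being
// opened and closed per request.
func newProxyTransport(tn tailnetDialer) *http.Transport {
	return &http.Transport{
		DialContext:         tn.DialContext,
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	}
}

// newAgentProxy returns a reverse proxy to an agent-local target that shares
// the transport above.
func newAgentProxy(target *url.URL, transport *http.Transport) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.Transport = transport
	return proxy
}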