fix: coordinator node update race #7345
Conversation
Signed-off-by: Spike Curtis <spike@coder.com>
tailnet/coordinator.go Outdated
@@ -159,8 +158,30 @@ type coordinator struct {
	agentNameCache *lru.Cache[uuid.UUID, string]
}

func NewCore(logger slog.Logger) *Core {
Thoughts on putting the coordinator in its own package? The names `NewCore` and `Core` don't make much sense in the context of `tailnet` itself, which is about providing connections.
An alternative would be putting the in-memory coordinator in its own package.
I originally exported those names thinking we'd want the `Core` for the `haCoordinator`, but I'm now thinking they won't be that reusable. So, if the concern is that the public `tailnet` API contains these and it's not clear, we could just make them private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Makes sense to me!
This fixes the most pressing bug on #7295 but leaves the `haCoordinator` mainly as is, meaning that we're still subject to various distributed races that leave us in a bad state where clients can't connect. A full fix for `haCoordinator` is quite a large change, so I plan to write an RFC for it.

The main idea of this fix is that we establish an output queue of updates on each connection. This allows us to queue up the initial node update while holding the lock, ensuring we don't miss any updates coming from the other side just as we are connecting. The queue is buffered, and actual I/O on the connection is done on a separate goroutine, where it will never contend for the lock.
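To make the shape of the fix concrete, here is a minimal sketch of a per-connection output queue with a separate writer goroutine. This is illustrative only: the `Node`, `connection`, and `demo` names are hypothetical stand-ins, not the actual types in this PR.

```go
package main

import (
	"context"
	"fmt"
)

// Node stands in for the node update payload; it is a placeholder,
// not the real tailnet node type.
type Node struct{ ID int }

// connection owns a buffered queue of outgoing updates. Enqueueing can
// happen while the coordinator mutex is held because it never blocks on
// I/O; a dedicated writer goroutine drains the queue.
type connection struct {
	updates chan []*Node
}

func newConnection(ctx context.Context, send func([]*Node) error) *connection {
	c := &connection{updates: make(chan []*Node, 64)} // buffered queue
	go func() {
		// Writer goroutine: performs the actual I/O and never
		// contends for the coordinator lock.
		for {
			select {
			case <-ctx.Done():
				return
			case nodes := <-c.updates:
				if err := send(nodes); err != nil {
					return
				}
			}
		}
	}()
	return c
}

// enqueue is safe to call with the coordinator mutex held: it either
// buffers the update or drops it, but never waits on the network.
func (c *connection) enqueue(nodes []*Node) {
	select {
	case c.updates <- nodes:
	default:
		// Queue full; a real coordinator would close the connection
		// or coalesce updates rather than silently drop them.
	}
}

// demo queues an initial node update as if the lock were held and
// returns how many nodes the writer goroutine delivered.
func demo() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	received := make(chan []*Node, 1)
	c := newConnection(ctx, func(nodes []*Node) error {
		received <- nodes
		return nil
	})
	c.enqueue([]*Node{{ID: 1}}) // initial update; does not block
	return len(<-received)
}

func main() {
	fmt.Println(demo()) // prints 1
}
```

Because `enqueue` never performs I/O, it can run inside the critical section, which is what closes the window where an update from the other side could be missed during connection setup.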
There was also a lot of complex manipulation of locking and unlocking the mutex in the existing routines, so I refactored the `coordinator` to contain a mutex-protected in-memory "core". All operations on this `core` are via methods that hold the lock, so it should be much easier to follow which code paths hold the lock and which don't (`core` operations vs the rest of the `coordinator` methods).
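A rough sketch of that structure, assuming hypothetical field and method names rather than the actual ones in the PR: every method on `core` acquires the mutex for its whole duration, so anything outside `core` methods is known to run unlocked.

```go
package main

import (
	"fmt"
	"sync"
)

// core holds all shared coordinator state behind a single mutex.
// Field names here are illustrative, not the real coder/coder types.
type core struct {
	mu    sync.Mutex
	nodes map[string]string // e.g. agent ID -> serialized node
}

func newCore() *core {
	return &core{nodes: make(map[string]string)}
}

// updateNode holds the lock for its entire body, like every core method.
func (c *core) updateNode(id, node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[id] = node
}

// node looks up a stored node, again entirely under the lock.
func (c *core) node(id string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.nodes[id]
	return n, ok
}

// coordinator wraps the core; its own methods never touch core fields
// directly, so they never hold the mutex themselves.
type coordinator struct {
	core *core
}

func main() {
	coord := &coordinator{core: newCore()}
	coord.core.updateNode("agent-1", "node-a")
	n, _ := coord.core.node("agent-1")
	fmt.Println(n) // prints node-a
}
```

The payoff of this split is that the locking discipline becomes a structural property: if code lives on `core`, it holds the lock; if it lives on `coordinator`, it doesn't.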