- Notifications
You must be signed in to change notification settings - Fork1.1k
fix: coordinator node update race#7345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Signed-off-by: Spike Curtis <spike@coder.com>
tailnet/coordinator.go Outdated
| agentNameCache*lru.Cache[uuid.UUID,string] | ||
| } | ||
| funcNewCore(logger slog.Logger)*Core { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thoughts on putting the coordinator in it's own package? The nameNewCore andCore doesn't make much sense in the context oftailnet itself for providing connections.
An alternative would be putting the in-memory coordinator in it's own package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I originally exported those names thinking we'd want theCore for thehaCoordinator but I'm now thinking they won't be that reusable. So, if the concern is that the publictailnet API contains these and it's not clear, we could just make them private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Makes sense to me!
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Uh oh!
There was an error while loading.Please reload this page.
This fixes the most pressing bug on#7295 but leaves the
haCoordinatormainly as is, meaning that we're still subject to various distributed races leaving us in a bad state where clients can't connect.A full fix for
haCoordinatoris quite a large change, so I plan to write an RFC for it.The main idea of this fix is that we establish an output queue of updates on each connection. This allows us to queue up the initial node update while holding the lock, ensuring we don't miss any updates coming from the other side just as we are connecting. The queue is buffered, and actual I/O on the connection is done on a separate goroutine, where it will never contend for the lock.
There was also a lot of complex manipulation of locking and unlocking the mutex in the existing routines, so I refactored the
coordinatorto contain a mutex-protected memory "core". All operations on thiscoreare via methods thatso it should be much easier to follow what things are holding the lock and what things aren't (
coreoperations vs the rest of thecoordinatormethods).