- Notifications
You must be signed in to change notification settings - Fork927
fix(enterprise): mark nodes from unhealthy coordinators as lost#13123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
fix(enterprise): mark nodes from unhealthy coordinators as lost#13123
Uh oh!
There was an error while loading.Please reload this page.
Conversation
This stack of pull requests is managed by Graphite.Learn more about stacking. |
Instead of removing the mappings of unhealthy coordinators entirely,mark them as lost instead. This prevents peers from disappearing fromother peers if a coordinator misses a heartbeat.
58801cf
toa8ac205
CompareThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This LGTM, but I'm relying on reading code to convince myself that it solves the high level use case of sending LOST updates to peers when a coordinator fails heartbeats.
In addition to the unit test on the heartbeats subcomponent, I'd like to see a test in a similar vein toTestPGCoordinatorSingle_MissedHeartbeats
where we simulate a second coordinator entirely by DB calls, make it miss its heartbeats, and then verify that the first coordinator sends out a LOST update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM! Minor suggestion inline, but I don't need to review again
enterprise/tailnet/pgcoord_test.go Outdated
defer coordinator.Close() | ||
agent := test.NewPeer(ctx, t, coordinator, "agent") | ||
defer agent.Close(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think the agent on the real coordinator is superfluous to this test. Just need to test the agent update on the fake coordinator, then have it miss heartbeats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Ah, good point.
Uh oh!
There was an error while loading.Please reload this page.
Fixes#13041
Instead of removing the mappings of unhealthy coordinators entirely,
mark them as lost. This prevents peers from disappearing from
other peers if a coordinator misses a heartbeat.