Description
PGCoordinator has a mechanism where if it is unable to heartbeat over the pub-sub, it declares itself unhealthy, disconnects any coordinatees (agents, server tailnet, CLI), and immediately disconnects any new coordinatees that connect to it.
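The behavior described above can be sketched roughly as follows. This is only an illustration, not the actual PGCoordinator code: `healthTracker`, `recordHeartbeat`, `accept`, and the consecutive-failure `threshold` are all hypothetical names and assumptions made for the example.

```go
package main

import "sync"

// healthTracker is a hypothetical stand-in for the coordinator's health state.
// It is not the real PGCoordinator type.
type healthTracker struct {
	mu        sync.Mutex
	failures  int               // consecutive failed heartbeats
	threshold int               // assumed: failures needed to go unhealthy
	conns     map[string]func() // coordinatee ID -> function that closes it
}

func newHealthTracker(threshold int) *healthTracker {
	return &healthTracker{threshold: threshold, conns: make(map[string]func())}
}

// recordHeartbeat records the outcome of one heartbeat attempt over the pubsub.
// On reaching the failure threshold it disconnects every current coordinatee.
func (h *healthTracker) recordHeartbeat(ok bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if ok {
		h.failures = 0
		return
	}
	h.failures++
	if h.failures >= h.threshold {
		for id, closeFn := range h.conns {
			closeFn()
			delete(h.conns, id)
		}
	}
}

// accept registers a new coordinatee. While unhealthy, the connection is
// closed immediately and accept returns false.
func (h *healthTracker) accept(id string, closeFn func()) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.failures >= h.threshold {
		closeFn()
		return false
	}
	h.conns[id] = closeFn
	return true
}
```

In this sketch a single successful heartbeat restores health; the real recovery condition may differ.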
The purpose of this feature is that if a Coordinator loses its connection to the pubsub/database through a network partition, it drops its connections so that coordinatees can retry and hopefully land on a healthy peer.
However, if multiple PGCoordinators go unhealthy at the same time, coordinatees can bounce between coordinators.
Furthermore, there is a bug in our implementation such that when we disconnect a coordinatee that has never sent a node binding, we trigger an unnecessary DeleteTailnetPeer query to the database. The query is idempotent, so any individual query does no harm, but since we do it once per connection, this can trigger a storm of queries.
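One possible shape for a fix, sketched under assumptions: track whether the coordinatee ever sent a node binding and skip the delete when it did not. The `conn` type, the `boundNode` flag, and the `peerStore` interface with its `DeleteTailnetPeer(peerID string) error` signature are hypothetical and only illustrate the guard; the real query signature may differ.

```go
package main

// peerStore is a hypothetical interface wrapping the DeleteTailnetPeer query.
// The signature here is an assumption for illustration.
type peerStore interface {
	DeleteTailnetPeer(peerID string) error
}

// conn is a hypothetical per-coordinatee connection.
type conn struct {
	peerID    string
	boundNode bool // set once the coordinatee sends its first node binding
	store     peerStore
}

// onNodeBinding marks that this coordinatee now has state in the database.
func (c *conn) onNodeBinding() { c.boundNode = true }

// disconnect cleans up the peer row only if a binding was ever written.
// A coordinatee that never sent a binding has nothing to delete, so the
// query is skipped entirely.
func (c *conn) disconnect() error {
	if !c.boundNode {
		return nil
	}
	return c.store.DeleteTailnetPeer(c.peerID)
}

// countingStore is a test double that counts delete calls.
type countingStore struct{ deletes int }

func (s *countingStore) DeleteTailnetPeer(string) error {
	s.deletes++
	return nil
}
```

With this guard, a coordinatee that is dropped immediately on connect issues no query, so a reconnect storm does not become a query storm.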
Impact:
Contributing or major factor in a production outage at a customer.