NRG: WAL requires repair after truncation#7587
Draft
MauriceVanVeen wants to merge 5 commits into main from maurice/nrg-truncate-recovery
+414 −19
The WAL was assumed to never be corrupted, which would lead to truncation and the chance for state to diverge. This PR fixes that by:
repair.idx, which persists for the lifetime of the repair needing to happen. Since a lookup or write to this file only happens on startup or when the repair completes, this is not expensive to do. (This information could technically be added to a file like tav.idx and be persisted along with the term and vote, but that file is not currently extensible, requiring a separate file for now.)

This PR builds on the concept of "empty votes" and fixes some bugs in the PR that introduced it: #7038. As highlighted in that PR, but including this PR's fixes, it means that:
The latter two points are technically unsafe, since normally a Raft-based system is meant to halt. But this is where we prefer the system to become available again. Illustrating this with a simpler example: for an R3 in-memory stream, two servers can be restarted and lose all their data, yet the data is not entirely lost as long as a single server still holds it. However, if all servers were restarted and all data was (obviously) lost, we'd rather not halt but instead take the loss of data and continue operation (while ensuring all servers agree on the state of the log, albeit reset). This makes us try as best as we can to preserve the log, but if against all odds the data ends up lost, we'd rather not brick the system to the point of requiring manual intervention.
Additionally, the user previously had no way of knowing this happened. Now the following will be logged after all three servers hosting an R3 in-memory stream are restarted such that all of the stream's data was lost:

If only a single server with the in-memory data was left and was catching up followers but got shut down halfway, the most up-to-date follower with partial data will become the new leader, and the above message will be logged, but with "the log was partially reset".
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>