- Notifications
You must be signed in to change notification settings - Fork291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Allow for in-place restores.#1094
base:4.x
Are you sure you want to change the base?
Conversation
6369b6a
to9557275
CompareString restoreFileName = | ||
PATH_JOINER.join(dataDir, keyspace, columnFamily, indexDir, fileName); | ||
return_ = new File(restoreFileName); | ||
return_ = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
While we are in here, consider renaming this variable, its not idiomatic java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Done.
String dataDirectory, String monitoringFolder) throws IOException { | ||
ImmutableSet.Builder<Path> backupPaths = ImmutableSet.builder(); | ||
Path dataPath = Paths.get(dataDirectory); | ||
if (Files.exists(dataPath) && Files.isDirectory(dataPath)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
TheFile.exists
call is redundant here:https://docs.oracle.com/javase/6/docs/api/java/io/File.html#isDirectory()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Done.
Files.newDirectoryStream(keyspaceDirPath, Files::isDirectory)) { | ||
for (Path columnfamilyDirPath : keyspaceStream) { | ||
Path backupDirPath = | ||
Paths.get(columnfamilyDirPath.toString(), monitoringFolder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I believe you can usecolumnFamilyDirPath.resolve(monitoringFolder)
here instead of converting back and forth between string and path. Reference:https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/Path.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Done.
for (Path columnfamilyDirPath : keyspaceStream) { | ||
Path backupDirPath = | ||
Paths.get(columnfamilyDirPath.toString(), monitoringFolder); | ||
if (Files.exists(backupDirPath) && Files.isDirectory(backupDirPath)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Same comment here that theFile.exists
call is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Done.
if (instanceState.getRestoreStatus() != null | ||
&& instanceState.getRestoreStatus().getStatus() != null | ||
&& instanceState.getRestoreStatus().getStatus() == Status.STARTED) { | ||
if (instanceState.isRestoring()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I like this cleanup 🙌
failedImports.addAll(importData(keyspace, table, tableDir.toString())); | ||
} | ||
} else { | ||
recursiveMove(Paths.get(srcDir), Paths.get(configuration.getDataFileLocation())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Since the move is non-atomic I can see a weird edge case where we start Cassandra while the move is happening (and we have like 10,000+ files). Should we have some sort of "fence" that tells the subsystem that starts Cassandra it needs to wait while this happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Yes we should. I will need to address this in a follow-up, however. This is analogous to Cassandra getting started accidentally before the full download is complete the in the model we use today. Which hasn't happened yet, knock on wood. It might requires some rethinking about consolidating Cassandra's remediation policy in our environment. That is, the between Priam and systemd.
if (Files.isRegularFile(path)) { | ||
Files.move(path, destination.resolve(path.getFileName())); | ||
} else if (Files.isDirectory(path)) { | ||
recursiveMove(path, destination.resolve(path.getFileName())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
If am reading this right, if the source path we are currently iterating over is a directory we recurse by moving the files in that directory into[destination/directory name]
. Cant we just move the entire directory instead of recursing then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
My goal here was for the the operation to be strictly additive. That is, if there are already files present I want to ensure that for them to get overwritten that the operator does something explicit.
Note that to that end I have changed the policy regarding whether to replace files if they are already present in a follow-up commit.
} else if (Files.isDirectory(path)) { | ||
recursiveMove(path, destination.resolve(path.getFileName())); | ||
} else { | ||
throw new IOException("Failed determining type of inode is " + path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Do we need this error or just skip/log? Likely an edge case but do we really want to fail if thats the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
In the new model using walkFileTree() we always continue.
// Cleanup local data | ||
File dataDir = new File(config.getDataFileLocation()); | ||
if (dataDir.exists() && dataDir.isDirectory()) FileUtils.cleanDirectory(dataDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Question about this: I know we don't need to stopcass
anymore because we useimport
but I was surprised to see we no longer clean the directory. Is this for the case where we want to restore a subset of the data but keep whats there? Do we still need the option to clean the directory for some restores?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The common cases for restores are:
- Restores into empty clusters (e.g., red-black)
- Restored into live clusters (e.g., emergencies)
In neither case do we want any data to be deleted. I would prefer that in outlier cases where that is required that there is a separate explicit job or command that the operator needs to run to best manage expectations. It is scary today that we could configure Priam such that a restart deletes all of the data.
"config.doesCassandraStartManually() is set to True, hence Cassandra needs to be started manually ..."); | ||
List<String> failedImports = cassOps.importAll(config.getRestoreDataLocation()); | ||
if (!failedImports.isEmpty()) { | ||
instanceState.endRestore(Status.FAILED, LocalDateTime.now()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
do we want to use the list of failed imports somewhere to log what failed and be able to resume? instead of starting over from the beginning?
Also if we only need the boolean why return the list? I would rather us use the list but if we are not going to I mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This is a good point. In case we don't specify --copy-data and Cassandra's behavior is to remove the origin files regardless of successful import, we want a log of what failed so we can at least get a sense of the scope of what is missing. I will log the failures.
…hs when fetching incrementals. The iterators are always fully materialized. Also, remove the now-redundant method from BackupRestoreUtil that merely wrapped the MetaProxy call.
… false in practice.Remove final modifier on static method.Make static property an instance variable instead.Use parameterized constructor call in place of two lines.
…is not relevant to whether a restore should be attempted.
…intuitive and will be empty in practice in most cases anyway.
…re to cease requiring an IConfiguration but rather just take the data directory path as a String.
…y to restore to a staging area and then import or move the data depending on whether Cassandra is running.
…offline restore. Remove unused code.
9557275
toe1ebaa5
Compare
6369b6a 2024-05-02 | Cease stopping Cassandra in the restore process and reveal the ability to restore to a staging area and then import or move the data depending on whether Cassandra is running.
4ece8d3 2024-05-01 | Move getBackupDirectories to BackupRestoreUtil and change its signature to cease requiring an IConfiguration but rather just take the data directory path as a String.
6b3d45a 2024-04-10 | Cease wiping the data directory in advance of restore. It is counter-intuitive and will be empty in practice in most cases anyway.
f10289f 2024-04-10 | Consolidate restore status API.
4d6d07f 2024-03-09 | Remove backup racs as that configuration is not used in practice and is not relevant to whether a restore should be attempted.
aa3dffe 2024-03-09 | Remove redundant vertical whitespace.
8c23800 2024-03-09 | Remove redundant logging.
646973d 2024-03-09 | Remove all non-javadoc comments that don't provide warnings to future maintainers.
8d18f51 2024-03-09 | Remove waitForCompletion parameter to download method which is always false in practice. Remove final modifier on static method. Make static property an instance variable instead. Use parameterized constructor call in place of two lines.
ce6c7c1 2024-03-09 | Remove unused code from our restore tooling.