Backup using Checkpointing #52

Open
9 tasks
marvin-j97 opened this issue May 20, 2024 Discussed in #50 · 6 comments
Labels
enhancement New feature or request epic help wanted Extra attention is needed test

Comments

@marvin-j97
Collaborator

marvin-j97 commented May 20, 2024

Possible API

  • Keyspace::backup_to<P: AsRef<Path>>(path: P) -> crate::Result<()>
  • TxKeyspace::backup_to<P: AsRef<Path>>(path: P) -> crate::Result<()> (just needs to call inner)

(Once we have #70, we could also provide an offline backup method that takes an exclusive temporary lock on an unopened Keyspace and clones it file by file, ideally using hard links.)
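The file-by-file clone mentioned above could look roughly like the following. This is a minimal sketch using only the standard library; `clone_dir_hard_linked` and `sync_path` are hypothetical names, not part of fjall, and the exclusive lock on the unopened Keyspace is left out.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Fsyncs a file or directory. Only implemented for Unix in this sketch,
/// where a read-only handle is enough; other platforms are a no-op here.
fn sync_path(path: &Path) -> io::Result<()> {
    #[cfg(unix)]
    fs::File::open(path)?.sync_all()?;
    #[cfg(not(unix))]
    let _ = path;
    Ok(())
}

/// Recursively clones `src` into `dst`, hard-linking each file where
/// possible and falling back to a plain copy (e.g. across file systems).
/// NOTE: hypothetical helper, not an existing fjall function.
pub fn clone_dir_hard_linked(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;

    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let from = entry.path();
        let to = dst.join(entry.file_name());

        if entry.file_type()?.is_dir() {
            clone_dir_hard_linked(&from, &to)?;
        } else if fs::hard_link(&from, &to).is_err() {
            // Hard links cannot cross mount points; copy and sync instead.
            fs::copy(&from, &to)?;
            sync_path(&to)?;
        }
    }

    // Sync the directory so the new entries themselves are durable.
    sync_path(dst)
}
```

Hard links are cheap because disk segments are immutable once written, so the backup and the live keyspace can share them safely; the fallback copy only matters when the target lives on another file system.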

Steps

  • Implement a copy method in value-log and lsm-tree that copies an LSM-tree in its current form, hard-linking where possible. Every file must be synced, including metadata files (manifest, ...). This should be easy to unit test
  • In fjall, lock the journal and take a global lock (let's call it BackupLock) to prevent garbage collection and compactions from submitting results (which would change disk segments)
  • Create the new folder
  • Fsync & copy all journal files
  • Drop journal lock (write ops can now proceed)
  • For each partition, call partition.tree.copy to the appropriate position in the new folder
  • Copy over keyspace-level metadata files (actually it's just version)
  • Finish backup (drops BackupLock, everything else can now proceed)
  • Using the global lock, prevent any changes to all LSM-tree levels (no segments are allowed to disappear through compaction, flushes can actually proceed technically, as long as they are not picked up by the backup)
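The order in which the two locks are taken and released in the steps above can be sketched as a dry run. Everything here is a stand-in: the `Keyspace` struct, its fields and the string log are assumptions made for illustration, and no files are actually copied.

```rust
use std::sync::{Mutex, RwLock};

/// Stand-in for the keyspace; all names and fields are hypothetical.
pub struct Keyspace {
    /// Guards the journal; held only while journal files are copied.
    pub journal: Mutex<Vec<String>>,
    /// The "BackupLock": the backup holds it exclusively, while
    /// compactions and GC would take it shared to submit results.
    pub backup_lock: RwLock<()>,
    pub partitions: Vec<String>,
}

impl Keyspace {
    /// Walks through the backup steps and returns them as a log,
    /// so the sequence can be inspected without touching the disk.
    pub fn backup_to(&self, target: &str) -> Vec<String> {
        let mut log = Vec::new();

        // Lock the journal and the global backup lock.
        let journal = self.journal.lock().expect("journal lock poisoned");
        let backup_guard = self.backup_lock.write().expect("backup lock poisoned");

        log.push(format!("create {}", target));
        for file in journal.iter() {
            log.push(format!("sync+copy journal {}", file));
        }

        // Write operations can proceed from here on.
        drop(journal);
        log.push("journal unlocked".to_string());

        for partition in &self.partitions {
            log.push(format!("copy tree {}", partition));
        }
        log.push("copy version".to_string());

        // Compactions, GC and everything else can proceed again.
        drop(backup_guard);
        log
    }
}
```

The point of the sketch is only that the journal lock covers a much shorter span than the BackupLock, which is what keeps writes from stalling for the whole backup.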

Unsure

How to prevent compactions from applying their result? There may be compactions going on currently. We may need to pass the BackupLock to all trees, and make compactors take the lock? Can this be tested reliably? Same goes for blob file rewrites.
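One possible shape for this, as a sketch only: the lock is a cloneable handle passed to every tree, the backup holds it exclusively, and a compactor (or blob file rewrite) takes it shared at the moment it wants to apply its result. `BackupLock`, `begin_backup` and `try_submit` are hypothetical names, and a real compactor would probably wait or re-queue instead of just giving up.

```rust
use std::sync::{Arc, RwLock, RwLockWriteGuard};

/// Hypothetical lock shared between the keyspace and all of its trees.
#[derive(Clone, Default)]
pub struct BackupLock(Arc<RwLock<()>>);

impl BackupLock {
    /// Taken by the backup and held for its whole duration.
    pub fn begin_backup(&self) -> RwLockWriteGuard<'_, ()> {
        self.0.write().expect("backup lock poisoned")
    }

    /// Runs `apply` (e.g. swapping new segments into the levels) only if
    /// no backup is in progress. Returns `None` if the result could not
    /// be submitted, so the caller has to retry later.
    pub fn try_submit<T>(&self, apply: impl FnOnce() -> T) -> Option<T> {
        // The shared guard is held until `apply` has finished.
        let _shared = self.0.try_read().ok()?;
        Some(apply())
    }
}
```

Because the submit step is a plain function call that either runs or is refused, it can be unit tested deterministically without racing real compaction threads: take the exclusive guard, assert that `try_submit` returns `None`, drop the guard, and assert that it succeeds.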


Also discussed in #50

Originally posted by jeromegn May 19, 2024
If one wanted to do a backup of the database, what's the best practice here? Is there a way to do an "online" (not shutting down the process using the database) backup?

Since there are many directories and files, I assume the safest way is to persist from the process, exit and then create a tarball of the whole directory.

With SQLite, for example, it's possible to run a VACUUM INTO that will create a space-efficient snapshot of the database. SQLite can support multiple processes though, so it's a different ball game.

@marvin-j97 marvin-j97 added enhancement New feature or request epic help wanted Extra attention is needed labels May 20, 2024
@xenacool

Apache Paimon is an LSM format designed for writing to S3 compatible object storage while it's running. I really liked this author's articles on it.

https://jack-vanlightly.com/analyses/2024/7/3/understanding-apache-paimon-consistency-model-part-1

@zach-schoenberger

Has there been any more thought on this? I'm looking at using this crate but backups are necessary for my use case. I wouldn't mind trying out an implementation if there's been a general decision on the right approach.

@marvin-j97
Collaborator Author

marvin-j97 commented Nov 7, 2024

> Has there been any more thought on this? I'm looking at using this crate but backups are necessary for my use case. I wouldn't mind trying out an implementation if there's been a general decision on the right approach.

The general idea is to use RocksDB-like checkpoints for online backups. I didn't want to add this in 1.x because I knew 2.x would add and change some things anyway. A full implementation needs some synchronization, because the journal(s) have to be copied first, then all disk segments (hard-linked if possible) and metadata files, for all partitions. While the backup is running, journal GC, journal writes, blob file rewrites, memtable flushes and compactions must not be allowed to complete, but reads should not be blocked. So I think there needs to be a keyspace-level RwLock...

Another way would be to do a full scan through (using a snapshot) and write it out to a flat file...
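A rough sketch of that flat-file variant is below. The record layout (length-prefixed key, then length-prefixed value) is made up for this example, and the iterator stands in for whatever a snapshot scan would yield; nothing here is fjall's API.

```rust
use std::io::{self, Read, Write};

/// Writes every pair as `[key_len: u32 LE][key][value_len: u32 LE][value]`
/// and returns the number of pairs written. The format is an assumption.
pub fn dump<'a, W: Write>(
    items: impl Iterator<Item = (&'a [u8], &'a [u8])>,
    mut out: W,
) -> io::Result<u64> {
    let mut count = 0u64;
    for (key, value) in items {
        for part in [key, value] {
            if part.len() > u32::MAX as usize {
                return Err(io::Error::new(io::ErrorKind::InvalidInput, "item too large"));
            }
            out.write_all(&(part.len() as u32).to_le_bytes())?;
            out.write_all(part)?;
        }
        count += 1;
    }
    out.flush()?;
    Ok(count)
}

/// Reads one length-prefixed chunk; `None` means the input has ended.
/// (A truncated length prefix is also treated as the end in this sketch.)
fn read_chunk<R: Read>(input: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len = [0u8; 4];
    match input.read_exact(&mut len) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let mut buf = vec![0u8; u32::from_le_bytes(len) as usize];
    input.read_exact(&mut buf)?;
    Ok(Some(buf))
}

/// Reads all pairs back, e.g. to restore them into a fresh keyspace.
pub fn load<R: Read>(mut input: R) -> io::Result<Vec<(Vec<u8>, Vec<u8>)>> {
    let mut items = Vec::new();
    while let Some(key) = read_chunk(&mut input)? {
        let value = read_chunk(&mut input)?
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "missing value"))?;
        items.push((key, value));
    }
    Ok(items)
}
```

Compared to a checkpoint, this needs no extra locking beyond the snapshot itself, but it reads and rewrites every item, so it costs far more I/O than hard-linking existing segments.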

For offline backups, simply using cp -R will work.

@marvin-j97 marvin-j97 changed the title Think about backup strategies Backup using Checkpointing Nov 12, 2024
@marvin-j97
Collaborator Author

marvin-j97 commented Nov 12, 2024

I have added a todo list to the OP. I still have some stuff coming up for 2.4.0/2.5.0, so this isn't gonna make it in there, but I want to take a look at it in the near future... unless someone wants to take a look into it and contribute a possible solution, even if just a draft.

@Svenskunganka

> Once the backup starts, no [...] journal writes [...] are allowed to complete until the backup is done.

Would rotating to a new journal (that is excluded from backup) when triggering a checkpoint backup allow writes to also continue? I guess it could result in unbounded memtable and journal growth if backup takes a long time during a period of high write throughput though.

@marvin-j97
Collaborator Author

marvin-j97 commented Nov 19, 2024

@Svenskunganka I think it can be more granular. Copying the active journal should be quick, so we only need to lock it briefly. After that we can unlock the active journal again and copy the sealed journals; for consistency, journal GC just must not drop them while this is going on. My todo list in the OP is more correct, I think.
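To make the two phases concrete, here is a small in-memory model. It is only a sketch under assumptions: `Journals`, `JournalState` and the `gc_paused` flag are hypothetical, journals are modeled as a byte buffer plus a set of IDs, and no real files are involved.

```rust
use std::collections::BTreeSet;
use std::sync::Mutex;

/// In-memory model of the journal state; all names are hypothetical.
pub struct JournalState {
    /// Contents of the active journal.
    pub active: Vec<u8>,
    /// IDs of sealed journals that still exist on disk.
    pub sealed: BTreeSet<u64>,
    /// Set while a checkpoint is copying sealed journals.
    pub gc_paused: bool,
}

pub struct Journals(pub Mutex<JournalState>);

impl Journals {
    /// Phase 1, a short critical section: copy the active journal, note
    /// which sealed journals exist, and pause journal GC.
    pub fn begin_checkpoint(&self) -> (Vec<u8>, Vec<u64>) {
        let mut state = self.0.lock().expect("journal lock poisoned");
        state.gc_paused = true;
        (state.active.clone(), state.sealed.iter().copied().collect())
    }

    /// Writers only need the lock for the duration of one append.
    pub fn append(&self, bytes: &[u8]) {
        let mut state = self.0.lock().expect("journal lock poisoned");
        state.active.extend_from_slice(bytes);
    }

    /// Journal GC: refuses to drop anything while a checkpoint runs.
    /// Returns whether the sealed journal was actually removed.
    pub fn try_gc(&self, id: u64) -> bool {
        let mut state = self.0.lock().expect("journal lock poisoned");
        !state.gc_paused && state.sealed.remove(&id)
    }

    /// Phase 2 is done (sealed journals copied): allow GC again.
    pub fn end_checkpoint(&self) {
        self.0.lock().expect("journal lock poisoned").gc_paused = false;
    }
}
```

Between `begin_checkpoint` and `end_checkpoint` writes keep going into the active journal, and only journal GC is held back, which matches the idea that the active journal is locked only for the length of its own copy.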


4 participants