Backup using Checkpointing #52

marvin-j97 · 2024-05-20T15:06:53Z

Possible API

Keyspace::backup_to<P: AsRef<Path>>(path: P) -> crate::Result<()>
TxKeyspace::backup_to<P: AsRef<Path>>(path: P) -> crate::Result<()> (just needs to call inner)
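A minimal sketch of how the two proposed methods could relate. The `Keyspace`/`TxKeyspace` structs, their fields, and the `Result` alias are hypothetical stand-ins. The signatures above omit the receiver, so `&self` is an assumption. The body of `Keyspace::backup_to` is only a placeholder for the checkpoint steps.

```rust
use std::path::{Path, PathBuf};

// Hypothetical stand-in for `crate::Result`; the real error type is not specified.
pub type Result<T> = std::result::Result<T, std::io::Error>;

// Hypothetical keyspace: only the folder it lives in matters for this sketch.
pub struct Keyspace {
    pub path: PathBuf,
}

impl Keyspace {
    /// Creates an online backup of this keyspace at `path`.
    /// Placeholder body: a real implementation would run the checkpoint steps
    /// (lock journal, copy journals, copy each partition tree, copy metadata).
    pub fn backup_to<P: AsRef<Path>>(&self, path: P) -> Result<()> {
        std::fs::create_dir_all(path.as_ref())?;
        Ok(())
    }
}

// Hypothetical transactional wrapper around a plain keyspace.
pub struct TxKeyspace {
    inner: Keyspace,
}

impl TxKeyspace {
    pub fn new(inner: Keyspace) -> Self {
        Self { inner }
    }

    /// Just delegates to the inner keyspace, as suggested above.
    pub fn backup_to<P: AsRef<Path>>(&self, path: P) -> Result<()> {
        self.inner.backup_to(path)
    }
}
```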

(Once we have #70, we could also provide an offline backup method that takes a temporary exclusive lock on an unopened Keyspace and clones it file by file, ideally using hard links.)
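For the offline case, a file-by-file clone could look roughly like this. It is a sketch that assumes the caller already holds the exclusive lock that #70 would enable and that the keyspace is a plain directory tree. `clone_dir` is a hypothetical helper, not an existing fjall function. Hard links are tried first. Files that cannot be linked (e.g. across filesystems) are copied and fsynced.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: recursively clones `src` into `dst`, hard linking each
/// file when possible and otherwise copying it and fsyncing the copy.
/// Assumes the keyspace at `src` is closed and exclusively locked by the caller.
pub fn clone_dir(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let from = entry.path();
        let to = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            clone_dir(&from, &to)?;
        } else if fs::hard_link(&from, &to).is_err() {
            // Linking failed (e.g. `dst` is on another filesystem): copy instead.
            fs::copy(&from, &to)?;
            fs::OpenOptions::new().write(true).open(&to)?.sync_all()?;
        }
    }
    // On Unix, also fsync the directory so the new entries themselves are durable.
    if cfg!(unix) {
        fs::File::open(dst)?.sync_all()?;
    }
    Ok(())
}
```

Note that a hard-linked file shares its contents with the original. This is only safe for files that are never modified in place after the backup.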

Steps

Implement a copy method in value-log and lsm-tree that copies an LSM-tree in its current form: hard link files if possible and make sure every file is synced, including metadata files (manifest, ...). This should be easy to unit test.
In fjall, take the journal lock and a global lock (let's call it BackupLock) that prevents garbage collection and compactions from applying their results (which would change the set of disk segments)
Create the new folder
Fsync & copy all journal files
Drop journal lock (write ops can now proceed)
For each partition, call partition.tree.copy to the appropriate location in the new folder
Copy over keyspace-level metadata files (currently just the version file)
Finish backup (drops BackupLock, everything else can now proceed)
While the global lock is held, no LSM-tree level may change: no segment is allowed to disappear through compaction. Flushes could technically still proceed, as long as the backup does not pick up their output.
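A rough sketch of the lock ordering in these steps. It assumes a simplified on-disk layout (`journals/`, `partitions/<name>/`, `version`) and hypothetical `journal_lock`/`backup_lock` fields, so it does not reflect fjall's real layout or internals. One detail worth calling out is that the active journal is still being appended to, so journals are copied byte for byte. A hard link would share later writes. Immutable segment files can be hard-linked.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::{Mutex, RwLock};

/// Hypothetical, heavily simplified keyspace: files live under
/// `<root>/journals/`, `<root>/partitions/<name>/` and `<root>/version`.
pub struct Keyspace {
    pub root: PathBuf,
    /// Held by writers while appending to the active journal.
    pub journal_lock: Mutex<()>,
    /// The "BackupLock": compactions, blob file rewrites and journal GC would take
    /// a read guard before applying results; a backup holds the write guard.
    pub backup_lock: RwLock<()>,
}

/// Copies every regular file in `src` into `dst`. With `link = true` a hard link
/// is tried first; otherwise (or if linking fails) the bytes are copied and fsynced.
fn copy_files(src: &Path, dst: &Path, link: bool) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        if !entry.file_type()?.is_file() {
            continue;
        }
        let to = dst.join(entry.file_name());
        if link && fs::hard_link(entry.path(), &to).is_ok() {
            continue;
        }
        fs::copy(entry.path(), &to)?;
        fs::OpenOptions::new().write(true).open(&to)?.sync_all()?;
    }
    Ok(())
}

impl Keyspace {
    pub fn backup_to(&self, dst: &Path) -> io::Result<()> {
        // Hold the BackupLock for the whole backup: no segment may disappear.
        let _backup_guard = self.backup_lock.write().unwrap();
        fs::create_dir_all(dst)?;
        {
            // Briefly block writers. A real journal writer would also flush its
            // buffers here. Journals are copied, never hard-linked.
            let _journal_guard = self.journal_lock.lock().unwrap();
            copy_files(&self.root.join("journals"), &dst.join("journals"), false)?;
        } // Journal lock dropped: writes can proceed.
        // Segment files are immutable, so hard links are safe here.
        for entry in fs::read_dir(self.root.join("partitions"))? {
            let entry = entry?;
            if entry.file_type()?.is_dir() {
                let to = dst.join("partitions").join(entry.file_name());
                copy_files(&entry.path(), &to, true)?;
            }
        }
        // Keyspace-level metadata (just the version file in this sketch).
        let version = dst.join("version");
        fs::copy(self.root.join("version"), &version)?;
        fs::OpenOptions::new().write(true).open(&version)?.sync_all()?;
        Ok(())
    } // _backup_guard dropped: compactions may apply their results again.
}
```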

Unsure

How do we prevent compactions from applying their results? Compactions may already be running when the backup starts. We may need to pass the BackupLock to all trees and make compactors take it before applying results? Can this be tested reliably? The same goes for blob file rewrites.
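One possible answer, sketched with a hypothetical shared `BackupLock` that wraps a `std::sync::RwLock`. Compactors (and blob file rewrites) do their merge work without the lock and only take a read guard to swap in their result, while the backup holds the write guard. Reads never touch this lock. The `try_apply_compaction` variant makes the behavior testable deterministically on a single thread, because `RwLock::try_read` fails instead of blocking while the write guard is held. All names here are illustrative, not existing lsm-tree APIs.

```rust
use std::sync::{Arc, Mutex, RwLock, RwLockWriteGuard};

/// Hypothetical keyspace-wide lock handed to every tree (cheap to clone).
#[derive(Clone, Default)]
pub struct BackupLock(Arc<RwLock<()>>);

impl BackupLock {
    /// Held for the whole backup. Blocks compaction results, not reads.
    pub fn begin_backup(&self) -> RwLockWriteGuard<'_, ()> {
        self.0.write().unwrap()
    }
}

/// Hypothetical tree state: only the IDs of its live disk segments.
#[derive(Default)]
pub struct Tree {
    pub segments: Mutex<Vec<u64>>,
}

impl Tree {
    /// Swaps the compaction `inputs` for the new `output` segment.
    fn swap_segments(&self, inputs: &[u64], output: u64) {
        let mut segments = self.segments.lock().unwrap();
        segments.retain(|id| !inputs.contains(id));
        segments.push(output);
    }

    /// Called by a compactor after the merge work is done (without any lock).
    /// Many compactors can hold read guards at once; a running backup makes them wait.
    pub fn apply_compaction(&self, lock: &BackupLock, inputs: &[u64], output: u64) {
        let _guard = lock.0.read().unwrap();
        self.swap_segments(inputs, output);
    }

    /// Non-blocking variant: returns `false` instead of waiting while a backup runs.
    pub fn try_apply_compaction(&self, lock: &BackupLock, inputs: &[u64], output: u64) -> bool {
        match lock.0.try_read() {
            Ok(_guard) => {
                self.swap_segments(inputs, output);
                true
            }
            Err(_) => false,
        }
    }
}
```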

Also discussed in #50

Originally posted by jeromegn, May 19, 2024:
If one wanted to do a backup of the database, what's the best practice here? Is there a way to do an "online" (not shutting down the process using the database) backup?

Since there are many directories and files, I assume the safest way is to persist from the process, exit and then create a tarball of the whole directory.

With SQLite, for example, it's possible to run a VACUUM INTO that will create a space-efficient snapshot of the database. SQLite can support multiple processes though, so it's a different ball game.


xenacool · 2024-08-13T18:17:21Z

Apache Paimon is an LSM format designed for writing to S3 compatible object storage while it's running. I really liked this author's articles on it.

https://jack-vanlightly.com/analyses/2024/7/3/understanding-apache-paimon-consistency-model-part-1

zach-schoenberger · 2024-11-07T14:20:22Z

Has there been any more thought on this? I'm looking at using this crate, but backups are necessary for my use case. I wouldn't mind trying out an implementation if there's been a general decision on the right approach.

marvin-j97 · 2024-11-07T16:02:13Z


The general idea is to use RocksDB-like Checkpoints for online backups. I didn't want to add it in 1.x because I knew 2.x would add and change some stuff anyway. But to fully implement it, there needs to be some synchronization, because the journal(s) need to be copied first, then all disk segments (hard-linked if possible) and metadata files of all partitions. While the backup is running, journal GC, journal writes, blob file rewrites, memtable flushes, and compactions are temporarily not allowed to complete, but reads should not be blocked. So I think there needs to be a keyspace-level RwLock...

Another option would be a full scan (using a snapshot) that writes everything out to a flat file...
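A sketch of the flat-file alternative, assuming the snapshot can be consumed as an iterator of key/value pairs. Anything yielding `(key, value)` pairs of byte-like types works with `dump`. The length-prefixed record format (`u32` little-endian length, then bytes, for the key and then the value) is made up for this example and is not an existing fjall format. `load` reads such a file back.

```rust
use std::io::{self, Read, Write};

/// Writes every (key, value) pair yielded by `iter` (e.g. a snapshot scan) as
/// records of the form [u32 LE key_len][key][u32 LE value_len][value].
/// Returns the number of records written.
pub fn dump<W, I, K, V>(out: &mut W, iter: I) -> io::Result<u64>
where
    W: Write,
    I: IntoIterator<Item = (K, V)>,
    K: AsRef<[u8]>,
    V: AsRef<[u8]>,
{
    let mut count = 0;
    for (k, v) in iter {
        for part in [k.as_ref(), v.as_ref()] {
            let len = u32::try_from(part.len())
                .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "item too large"))?;
            out.write_all(&len.to_le_bytes())?;
            out.write_all(part)?;
        }
        count += 1;
    }
    out.flush()?;
    Ok(count)
}

/// Reads records written by `dump` back into (key, value) pairs.
/// EOF while reading a key length ends the stream; a truncated trailing
/// length prefix is not detected in this sketch.
pub fn load<R: Read>(input: &mut R) -> io::Result<Vec<(Vec<u8>, Vec<u8>)>> {
    let mut pairs = Vec::new();
    loop {
        let mut len = [0u8; 4];
        match input.read_exact(&mut len) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(pairs),
            Err(e) => return Err(e),
        }
        let key = read_chunk(input, u32::from_le_bytes(len))?;
        input.read_exact(&mut len)?;
        let value = read_chunk(input, u32::from_le_bytes(len))?;
        pairs.push((key, value));
    }
}

fn read_chunk<R: Read>(input: &mut R, len: u32) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len as usize];
    input.read_exact(&mut buf)?;
    Ok(buf)
}
```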

For offline backups, simply using cp -R will work.

marvin-j97 · 2024-11-12T21:46:05Z

I have added a todo list to the OP. I still have some stuff coming up for 2.4.0/2.5.0, so this isn't gonna make it in there, but I want to take a look at it in the near future... unless someone wants to take a look into it and contribute a possible solution, even if just a draft.

Svenskunganka · 2024-11-19T04:45:03Z

Quoting @marvin-j97: "Once the backup starts, no [...] journal writes [...] are allowed to complete until the backup is done."

Would rotating to a new journal (that is excluded from backup) when triggering a checkpoint backup allow writes to also continue? I guess it could result in unbounded memtable and journal growth if backup takes a long time during a period of high write throughput though.

marvin-j97 · 2024-11-19T12:48:03Z

@Svenskunganka I think it can be more granular. Copying the active journal should be quick, so we just need to lock it shortly. Then, we can already unlock the active journal - and then, we copy the sealed journals; they are just not allowed to be dropped by journal GC during all this time for consistency. My todo list in the OP is more correct I think.
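The more granular journal handling described here can be modeled in memory. In this sketch (a hypothetical `Journals` type, with journals as byte vectors rather than files), the backup first pins journal GC. It then holds the active-journal lock only long enough to copy it, and copies the sealed journals while writers continue. It also records how many sealed journals existed while the active journal was locked, so a rotation that happens afterwards is not copied twice. That detail is an assumption on top of the comment above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical in-memory model of a keyspace's journals (bytes instead of files).
#[derive(Default)]
pub struct Journals {
    /// The active journal, still being appended to.
    pub active: Mutex<Vec<u8>>,
    /// Sealed (immutable) journals, oldest first.
    pub sealed: Mutex<Vec<Vec<u8>>>,
    /// Number of running backups that need the sealed journals to stay.
    gc_pins: AtomicUsize,
}

/// While alive, journal GC may not drop sealed journals.
pub struct GcPin<'a>(&'a AtomicUsize);

impl Drop for GcPin<'_> {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
}

impl Journals {
    pub fn append(&self, record: &[u8]) {
        self.active.lock().unwrap().extend_from_slice(record);
    }

    /// Rotation: seals the active journal and starts an empty one.
    pub fn seal_active(&self) {
        let mut active = self.active.lock().unwrap();
        let full = std::mem::take(&mut *active);
        self.sealed.lock().unwrap().push(full);
    }

    /// Journal GC: drops all sealed journals (assumed already flushed to segments)
    /// unless a backup has pinned them. Returns how many were dropped.
    pub fn gc(&self) -> usize {
        let mut sealed = self.sealed.lock().unwrap();
        if self.gc_pins.load(Ordering::SeqCst) > 0 {
            return 0;
        }
        let dropped = sealed.len();
        sealed.clear();
        dropped
    }

    pub fn pin_gc(&self) -> GcPin<'_> {
        self.gc_pins.fetch_add(1, Ordering::SeqCst);
        GcPin(&self.gc_pins)
    }

    /// Returns copies of the active journal and of the sealed journals.
    pub fn backup(&self) -> (Vec<u8>, Vec<Vec<u8>>) {
        let _pin = self.pin_gc();
        let (active, sealed_count) = {
            // Short critical section: writers wait only while the active journal is copied.
            let active = self.active.lock().unwrap();
            // Sealed journals that exist now belong to this backup; a later rotation
            // appends the journal we just copied, which must be skipped.
            let count = self.sealed.lock().unwrap().len();
            (active.clone(), count)
        };
        // Writers proceed again. GC is pinned, so no sealed journal is removed and
        // the first `sealed_count` entries are still the ones counted above.
        let sealed = self.sealed.lock().unwrap()[..sealed_count].to_vec();
        (active, sealed)
    }
}
```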

marvin-j97 added the enhancement, epic, and help wanted labels May 20, 2024

marvin-j97 changed the title from "Think about backup strategies" to "Backup using Checkpointing" Nov 12, 2024

marvin-j97 added the test label Nov 12, 2024
