This repository has been archived by the owner on May 31, 2023. It is now read-only.

scat

Scatter your data before losing it

Backup tool that treats its stores as throwaway, untrustworthy commodity

Features

  • Decentralization: avoid trusting any one third party with all your data

    • Round-Robin interleave across uneven storage capacities ~JBOD
    • mesh heterogeneous storage hosts: local/remote, big/small, fast/slow
    • automatic redistribution: add/remove stores later
    • ex: back up 15GiB of data over 2GiB in Google Drive, 5GiB on spare VPS disk space and the rest on my HDD
  • Block-level de-duplication

    • CDC-based detection of duplicate blocks, from restic
    • incremental, immutable backups
    • reuse identical blocks of unrelated backups from common stores
    • ex: back up a 10GiB pre-allocated disk image with 2GiB used, backup takes <2GiB
    • ex: append 1 byte to a 1GiB file, next backup takes ~1MiB (last block)
  • RAID-like error correction

    • striping with distributed parity ~RAID 5/6
    • grow/shrink array later
    • SHA256-based integrity checks
    • Reed-Solomon erasure coding
    • ex: backup with 1 parity block among Drive, Backblaze and my HDD
      • some block comes back corrupted from my HDD, recover from Drive and B2
      • I'm locked out of my Google account, recover from B2 and my HDD
  • Redundancy: N-copies duplication

    • ensure N+ copies exist at all times ~RAID 1
    • increase fault-tolerance from erasure coding ~RAID 1+5/6
    • automatic failover on restore
    • ex: backup in 2 copies among Drive, Backblaze and my HDD
      • my HDD died, recover from Drive and B2
      • with 1 parity block
        • my HDD died and I forgot my Google password, recover from B2
  • Stream-based: less is more

    • file un-/packing, filtering → tar
    • snapshot management → git
    • remote file transfer → ssh
    • cloud storage → rclone
    • asymmetric-key encryption → gpg
    • progress, throughput → pv
    • Android backup → Termux + ssh
  • And:

    • compression
    • multithreaded: configurable concurrency
    • idempotent backup: resumable, run often
    • easy to set up, use, and hack on
    • cross-platform: binaries for Linux, macOS, Windows, etc.

...pick some or all of the above, apply in any order.

Indeed, scat decomposes backing up and restoring into basic stream processors ("procs") arranged like filters in a pipeline. They're chained together, piping the output of proc x to the input of proc x+1. As such, though created for backing up data, its core doesn't actually know anything about backups, but provides the necessary procs.

Such modularity enables unlimited flexibility: stream data from anywhere (local/remote file, arbitrary command, etc.), process it in any way (encrypt, compress, filter through arbitrary command, etc.), to anywhere: write/read/upload/download is just another proc at the end/beginning of a chain.

                 +---------------------------------+
                 | chain proc                      |
                 |                                 |
+---------+      |  +--------+         +--------+  |
| chunk 0 +----->|  | proc 0 |         | proc 1 |  |
| (seed)  |      |  +--+-----+         +--------+  |
+---------+      |     |                    ^      |
                 |     |    +-------+       |      |
                 |     +--->|+-------+ -----+      |
                 |          +|+-------+            |
                 |           +| chunk |            |
                 |            +-------+            |
                 +---------------------------------+

...where the seed may be a tar stream and procs 0..n would be split, checksum, parity, gzip, scp, etc., all part of a chain that is itself a proc.
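
The chaining idea can be sketched in a few lines of Python. This is only an illustration of the concept, not scat's implementation: the `Proc` type and the `split`, `checksum`, `gzip_proc` and `chain` helpers are hypothetical names, chunks are modelled as plain dicts, and fixed-size splitting stands in for real chunking.

```python
import gzip
import hashlib
from typing import Callable, Iterable, Iterator

# Hypothetical model: a chunk is a dict, a proc maps one chunk to zero or more chunks.
Chunk = dict
Proc = Callable[[Chunk], Iterable[Chunk]]

def split(size: int) -> Proc:
    """Cut the seed chunk's data into fixed-size pieces (stand-in for real chunking)."""
    def proc(c: Chunk) -> Iterator[Chunk]:
        data = c["data"]
        for i in range(0, len(data), size):
            yield {"data": data[i:i + size]}
    return proc

def checksum(c: Chunk) -> Iterator[Chunk]:
    """Attach the SHA-256 of the chunk's current data."""
    yield {**c, "hash": hashlib.sha256(c["data"]).hexdigest()}

def gzip_proc(c: Chunk) -> Iterator[Chunk]:
    """Compress the chunk's data, keeping its other fields."""
    yield {**c, "data": gzip.compress(c["data"], mtime=0)}

def chain(*procs: Proc) -> Proc:
    """Pipe the output chunks of each proc into the next; the result is itself a proc."""
    def proc(c: Chunk) -> Iterator[Chunk]:
        chunks: Iterable[Chunk] = [c]
        for p in procs:
            chunks = [out for ch in chunks for out in p(ch)]
        yield from chunks
    return proc

# A chain can be nested inside another chain, since both are procs.
backup = chain(split(4), chain(checksum, gzip_proc))
out = list(backup({"data": b"hello world!"}))
```

Here the seed chunk fans out into three chunks after `split`, and each of them then flows through the nested `checksum | gzip` chain.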

Demo


Full-length 4K demo video: on YouTube

Setup

  1. Download: latest release
    • flat versioning scheme: v0, v1, etc.
  2. Put scat in your $PATH

Usage

Stream processing, like performing a backup from a tar stream, is done via a proc chain formulated as a proc string.

The following examples showcase proc strings for typical use cases and make good starting points. Copy them into shell scripts and experiment, backing up and restoring test files until you fully understand the mechanics at play and get the behaviour you want. It's important to be comfortable in both directions: to back up often, and to not fear the moment a restore becomes necessary.

See Proc string for syntax documentation and the full list of available procs.

Backup

Example of backing up dir foo/ in a RAID 5 fashion to 2 Google Drive accounts and 1 VPS (compress, encrypt, 2 data shards, 1 parity shard, upload >= 2 exclusive copies - using 8 threads, 4 concurrent transfers)

  • seed ← stdin: tar stream of foo/
  • procs: split, checksum, index track, compress, parity-split, checksum, encrypt, striped upload, index write (implicit)
  • output → stdout: index

Command:

$ tar c foo | scat -stats "split | backlog 8 {
  checksum
  | index foo_index
  | gzip
  | parity 2 1
  | checksum
  | cmd gpg --batch -e -r 00828C1D
  | group 3
  | concur 4 stripe(1 2
      mydrive=rclone(drive:tmp)=7gib
      mydrive2=rclone(drive2:tmp)=14gib
      myvps=scp(bankmon tmp)
    )
  }"

The combination of parity, group and stripe creates a RAID 5:

  1. parity(2 1): split into 2 data shards and 1 parity shard
  2. group(3): aggregate all 3 shards for striping
  3. stripe(1 2 ...): interleave those across given stores, making 1 copy of each, ensuring at least 2 of 3 are on distinct stores from the others so we can afford to lose any one of them and still be able to recompute original data
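
To see why 2 data shards plus 1 parity shard survive the loss of any one shard, here is a minimal Python sketch. It is an assumption-laden stand-in: scat's parity proc uses Reed-Solomon erasure coding, whereas this sketch uses plain XOR, which gives the same "any 2 of 3" property in the single-parity case. The `parity_split` and `parity_join` names are hypothetical.

```python
from typing import List, Optional

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity_split(data: bytes) -> List[bytes]:
    """Return [data shard 0, data shard 1, parity shard], all of equal length."""
    half = (len(data) + 1) // 2
    d0 = data[:half]
    d1 = data[half:].ljust(half, b"\0")  # pad the second shard if the length is odd
    return [d0, d1, xor_bytes(d0, d1)]

def parity_join(shards: List[Optional[bytes]], size: int) -> bytes:
    """Rebuild the original data from any 2 of the 3 shards (None marks a lost shard)."""
    if sum(s is None for s in shards) > 1:
        raise ValueError("cannot recover: more than one shard is missing")
    d0, d1, p = shards
    if d0 is None:
        d0 = xor_bytes(d1, p)   # d0 = d1 XOR parity
    elif d1 is None:
        d1 = xor_bytes(d0, p)   # d1 = d0 XOR parity
    return (d0 + d1)[:size]     # drop the padding

data = b"scatter!!"
shards = parity_split(data)
shards[0] = None                # simulate one store going away
recovered = parity_join(shards, len(data))
```

Striping the three shards over distinct stores is what turns "any one shard may be lost" into "any one store may be lost".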

Order matters. Notably:

  • split before compression and encryption to correctly detect identical chunks
  • checksum right after split, before index and after the last producer proc, to properly track output chunks: see index
    • but encrypt after the final checksum: gpg -e is not deterministic (the same input encrypts to different output each time), so checksumming its output would cause identical chunks to be re-written/re-uploaded
  • compress before parity-split and encryption for better ratio
  • group before striping: see stripe
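
The ordering rules above can be illustrated with a small Python sketch. It rests on several assumptions: fixed-size chunks replace scat's CDC-based splitting, a dict plays the role of a store, and `fake_encrypt` is a hypothetical stand-in for `gpg -e` that only mimics its non-deterministic output.

```python
import hashlib
import os

def split(data: bytes, size: int) -> list:
    """Fixed-size chunking (a simplification of content-defined chunking)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def fake_encrypt(chunk: bytes) -> bytes:
    """Stand-in for `gpg -e`: output differs on every call, even for equal input."""
    return os.urandom(16) + chunk

def backup(data: bytes, store: dict, size: int = 4) -> int:
    """Store chunks keyed by plaintext hash; return how many were actually written."""
    written = 0
    for chunk in split(data, size):
        key = hashlib.sha256(chunk).hexdigest()  # checksum BEFORE encryption
        if key not in store:                     # identical chunk already stored: skip
            store[key] = fake_encrypt(chunk)
            written += 1
    return written

store = {}
first = backup(b"AAAABBBBCCCC", store)    # all 3 chunks are new
second = backup(b"AAAABBBBCCCCD", store)  # only the appended tail chunk is new
```

Had the checksum been taken after `fake_encrypt`, the two runs would share no keys and every chunk would be written again. Note also that fixed-size chunking only handles appends well: inserting a byte at the start would shift every chunk boundary, which is the case content-defined chunking is designed to cope with.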

Note:

  • Both backlog and concur are used above. The former limits the number of concurrent instances of a chain proc ({}) to 8, while the latter limits the number of concurrent transfers by stripe to 4. They may appear redundant (why not use one or the other for both?), but they take different types of arguments and serve distinct purposes. See backlog and concur.

  • rclone(drive:tmp) and scp(bankmon tmp) lay out their arguments differently. The former takes a single "remote" argument (passed as-is to rclone), while the latter takes "[user@]host" (passed as-is to ssh) followed by the remote directory. See rclone and scp.

Restore

Reverse chain:

  • seed ← stdin: index
  • procs: index read, download, decrypt, integrity check, parity-join, uncompress, join
  • output → stdout: tar stream of foo

Command:

$ scat -stats "uindex | backlog 8 {
  backlog 4 multireader(
    drive=rclone(drive:tmp)
    drive2=rclone(drive2:tmp)
    bankmon=scp(bankmon tmp)
  )
  | cmd gpg --batch -d
  | uchecksum
  | group 3
  | uparity 2 1
  | ugzip
  | join -
}" < foo_index | tar x
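
The symmetry between the two chains can be sketched in Python. This is a simplified model, not scat's code: parity and encryption are left out, a dict stands in for the remote stores, the index is assumed to be just an ordered list of chunk hashes, and the `backup` and `restore` function names are hypothetical.

```python
import gzip
import hashlib

def backup(data: bytes, store: dict, size: int = 4) -> list:
    """split -> checksum -> gzip -> upload; returns the index (ordered chunk hashes)."""
    index = []
    for i in range(0, len(data), size):
        chunk = data[i:i + size]
        key = hashlib.sha256(chunk).hexdigest()
        store[key] = gzip.compress(chunk)
        index.append(key)
    return index

def restore(index: list, store: dict) -> bytes:
    """download -> ungzip -> integrity check -> join, in index order."""
    out = bytearray()
    for key in index:
        chunk = gzip.decompress(store[key])
        if hashlib.sha256(chunk).hexdigest() != key:
            raise ValueError(f"chunk {key[:8]} failed its integrity check")
        out += chunk
    return bytes(out)

store = {}
index = backup(b"hello world!", store)
restored = restore(index, store)
```

Each restore step undoes one backup step in reverse order, and the integrity check is what lets a corrupted chunk be detected rather than silently joined into the output.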

More

The above only demonstrates a subset of what's possible with scat. More procs exist, and they may be assembled in different ways to suit one's particular needs. See Proc string.

Command

$ scat [options] <proc>

Options:

  • -stats print stats: rates, quotas, etc.
  • -version show version
  • -help show usage

Args:

Progress

Being stream-based implies not knowing in advance the total size of the data to process. Thus, no progress percentage can be reported. However, when transferring files or directories, size can be known by the caller and passed to pv.

Note: When piping from pv, do not pass the -stats option to scat. Both commands would step on each other's toes writing to stderr and moving the terminal cursor.

File backup:

$ pv my_file | scat "..."

Directory backup (approximate progress, not taking into account tar headers):

# Using GNU du:
$ tar c my_dir | pv -s $(du -sb my_dir | cut -f1) | scat "..."

# Under macOS, install GNU coreutils
$ brew install coreutils
$ # same as above, replacing du with gdu

# ...or using stock Darwin du, even more approximate:
$ tar c my_dir | pv -s $(du -sk my_dir | cut -f1)k | scat "..."
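
Where neither GNU du nor gdu is available, the byte count can also be computed with a few lines of Python. This is a hedged alternative rather than part of scat: `dir_size` is a hypothetical helper, and like the du variants it ignores tar headers, so the progress remains approximate.

```python
import os

def dir_size(path: str) -> int:
    """Sum the sizes in bytes of regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total
```

The resulting number can be passed to pv -s in place of the du substitution used above.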

Snapshots

Making snapshots boils down to versioning the index file in a git repository:

$ git init
$ git add foo_index
$ git commit -m "backup of foo"

Restoring a snapshot consists of checking out a particular commit and restoring using the old index file:

$ git checkout <commit-ish>
$ # ...use foo_index: see restore example

You could keep a single repository for all your backups and, after each backup, commit the index files along with the backup and restore scripts used to write and read them. This lets you modify proc strings from one backup to the next, reusing identical chunks if any, while still being able to restore old snapshots created with potentially different proc strings, without having to remember what they were at the time.

Rationale

scat was born out of frustration with existing backup solutions.

At the time of writing the initial version, I had one or more of the following gripes with available solutions:

  • central repository
  • static, inflexible set of storage engines
    • local filesystem, SSH, S3, etc. vs generic stdout piping
  • only file-level deduplication
  • reinventing the wheel: own implementation of file un-/packing, pattern-based filtering, snapshot management, storage engines, encryption, etc.
  • coding style not to my taste, monolithic (if not spaghetti) code base

I wanted to be able to:

  • back up anything (one file/dir, some files)
  • from anywhere (PC, phone)
  • to anywhere (other PC, cloud, vacant space on some VPS)
  • when sensible, rely on tools I'm familiar with (ex: tar, git, ssh, rclone, gpg)
    • instead of trusting whether some new tool properly re-implements what existing battle-tested tools already do well

without:

  • trusting any third party (hard drive, server/cloud provider, etc.) for reliable storage/retrieval or privacy
  • having to divide at the file-level myself: some dir here, other dir there, that big file doesn't fit anywhere without splitting it
  • having to keep track of what's where, let alone copies

I believe scat achieves these objectives 🙂

Next

Thanks