s3napback: Cycling, Incremental, Compressed, Encrypted Backups on Amazon S3
===========================================================================
Manual for version 1.0
2008-05-07
Copyright (c) 2008 David Soergel <[email protected]>
The problem
-----------
In searching for a way to back up one of my Linux boxes to Amazon S3, I was surprised to find that none of the many backup methods and scripts I found on the net did what I wanted, so I wrote yet another one.
The design requirements were:
* Occasional full backups, and daily incremental backups
* Stream data to S3, rather than making a local temp file first (i.e., if I want to archive all of /home at once, there's no point in making a huge local tarball and doing lots of disk access in the process)
* Break up large archives into manageable chunks
* Encryption
As far as I could tell, no available backup script (including, e.g., s3sync, backup-manager, and s3backup) met all four requirements.
The closest thing is [js3tream](http://js3tream.sourceforge.net), which handles streaming and splitting, but not incremental backups or encryption. Those are both fairly easy to add, though, using tar and gpg, as [suggested](http://js3tream.sourceforge.net/linux_tar.html) by the js3tream author. However, the s3backup.sh script he provides uses temp files (unnecessarily), and does not encrypt. So I modified it a bit to produce [s3backup-gpg-streaming.sh](s3backup-gpg-streaming.sh).
That's not the end of the story, though, since it leaves open the problem of managing the backup rotation. I found the explicit cron jobs suggested on the js3tream site too messy, especially since I sometimes want to back up a lot of different directories. Some other available solutions will send incremental backups to S3, but never purge the old ones, and so use ever more storage.
Finally, I wanted to easily deal with MySQL and Subversion dumps.
The solution
------------
I wrote s3napback.pl, which wraps js3tream and solves all of the above issues by providing:
* Dead-simple configuration
* Automatic rotation of backup sets
* Alternation of full and incremental backups (using "tar -g")
* Integrated GPG encryption
* No temporary files used anywhere, only pipes and TCP streams (optionally, uses smallish temp files to save memory)
* Integrated handling of MySQL dumps
* Integrated handling of Subversion repositories, and of directories containing multiple Subversion repositories.
It's not rocket science, just a wrapper that makes things a bit easier.
Prerequisites
-------------
* Java 1.5 or above
* gpg
Quick Start
-----------
Download and extract the s3napback package. There's no "make" step or anything of the sort; just extract it to some convenient location (I use /usr/local/s3napback).
Configure it with your S3 login information by creating a file called e.g. /usr/local/s3napback/key.txt containing
key=your AWS key
secret=your AWS secret
You'll need a GPG key pair for encryption. Create it with
gpg --gen-key
Since you'll need the secret key to decrypt your backups, you'll obviously need to store it in a safe place (see [the GPG manual](http://www.gnupg.org/gph/en/manual/c481.html)).
If you'll be backing up a different machine from the one where you generated the key pair, export the public key:
gpg --export [email protected] > backup.pubkey
and import it on the machine to be backed up:
gpg --import backup.pubkey
gpg --edit-key [email protected]
then "trust"
Create a configuration file something like this (descriptions of the options follow, if they're not entirely obvious):
DiffDir /usr/local/s3napback/diffs
Bucket dev.davidsoergel.com.backup1
GpgRecipient [email protected]
S3Keyfile /usr/local/s3napback/key.txt
ChunkSize 25000000
NotifyEmail [email protected] # not implemented yet
LogFile /var/log/s3napback.log # not implemented yet
LogLevel 2 # not implemented yet
# make diffs of these every day, store fulls once a week, and keep two weeks
<Cycle>
Frequency 1
Phase 0
Diffs 7
Fulls 2
Directory /etc
Directory /home/build
Directory /home/notebook
Directory /home/trac
Directory /usr
Directory /var
</Cycle>
# make diffs of these every week, store fulls once a month, and keep two months
<Cycle>
Frequency 7
Phase 0
Diffs 4
Fulls 2
Directory /bin
Directory /boot
Directory /lib
Directory /lib64
Directory /opt
Directory /root
Directory /sbin
</Cycle>
# make a diff of this every day, store fulls once a week, and keep eight weeks
<Directory /home/foobar>
Frequency 1
Phase 0
Diffs 7
Fulls 8
Exclude /home/foobar/wumpus
</Directory>
# backup an entire machine
<Directory />
Frequency 1
Phase 0
Diffs 7
Fulls 8
Exclude /proc
Exclude /dev
Exclude /sys
Exclude /tmp
</Directory>
# store a MySQL dump of all databases every day, keeping 14.
<MySQL all>
Frequency 1
Phase 0
Fulls 14
</MySQL>
# store a MySQL dump of a specific database every day, keeping 14.
<MySQL mydatabase>
Frequency 1
Phase 0
Fulls 14
</MySQL>
# store a full dump of all Subversion repos every day, keeping 10.
<SubversionDir /home/svn/repos>
Frequency 1
Phase 0
Fulls 10
</SubversionDir>
# store a full dump of a specific Subversion repo every day, keeping 10.
<Subversion /home/svn/repos/myproject>
Frequency 1
Phase 0
Fulls 10
</Subversion>
To run it, just run the script, passing the config file with the -c option:
./s3napback.pl -c s3snap.conf
That's it! You can put that command in a cron job to run once a day.
Note that gpg will look for a keyring under ~/.gnupg, but on some systems /etc/crontab sets the HOME environment variable to "/". So, you may want to change that to "/root"; or actually create and populate /.gnupg; or just use the GpgKeyring option in the s3napback config file to specify a keyring explicitly.
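For example, an entry along these lines in /etc/crontab (or a file under /etc/cron.d) runs the backup nightly; the paths, schedule, and log file are illustrative, and HOME is set explicitly so gpg finds root's keyring:
HOME=/root
30 3 * * * root /usr/local/s3napback/s3napback.pl -c /usr/local/s3napback/s3snap.conf >> /var/log/s3napback.log 2>&1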
Principles of operation
-----------------------
The cycling of backup sets here is rudimentary, taking its inspiration from the cron job approach given on the js3tream page. The principle is that we'll sort the snapshots into a fixed number of "slots"; every new backup simply overwrites the oldest slot, so we don't need to explicitly purge old files.
This is a fairly crappy schedule, in that the rotation doesn't decay over time. We just keep a certain number of daily backups (full or diff), and that's it. For my purposes, that's good enough for now; but I bet someone out there will figure out a clever means of producing a decaying schedule.
Note also that the present scheme means that, once the oldest full backup is deleted, the diffs based on it will still be stored until they are overwritten, but may not be all that useful. For instance, if you do daily diffs and weekly fulls for two weeks, then at some point you'll go from this situation, where you can reconstruct your data for any day from the last two weeks (F = full, D = diff, time flowing to the right):
FDDDDDDFDDDDDD
to this one:
DDDDDDFDDDDDDF
where the full backup on which the six oldest diffs are based is gone, so in fact you can only fully reconstruct the last 8 days. You can still retrieve files that changed on the days represented by the old diffs, of course.
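To make the slot idea concrete, here is a rough sketch of how a fixed-slot rotation can be computed for one backup set; the arithmetic is illustrative and not necessarily exactly what s3napback.pl does internally:
# Illustrative only: Frequency 1, Diffs 7, Fulls 2 gives 14 slots.
frequency=1; diffs=7; fulls=2
day=$(( $(date +%s) / 86400 ))        # days since the epoch
run=$(( day / frequency ))            # which backup run this is
slot=$(( run % (diffs * fulls) ))     # slot to overwrite, 0..13 here
if [ $(( slot % diffs )) -eq 0 ]; then
  echo "slot $slot: full backup (starts a new cycle)"
else
  echo "slot $slot: incremental against this cycle's full"
fi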
Configuration
-------------
First off you'll need some general configuration statements:
* DiffDir
a directory where tar can store its diff files (necessary for incremental backups).
* Bucket
the destination bucket on S3.
* GpgRecipient
the address of the public key to use for encryption. The gpg keyring of the user you're running the script as (i.e., root, for a systemwide cron job) must contain a matching key.
* GpgKeyring
path to the keyring file containing the public key for the GpgRecipient. Defaults to ~/.gnupg/pubring.gpg
* S3KeyFile
the file containing your AWS authentication keys.
* ChunkSize
the size of the chunks to be stored on S3, in bytes.
Then you can specify as many directories, databases, and repositories as you like to be backed up. These may be contained in <Cycle> blocks, for the sake of reusing timing configuration, or may be blocks themselves with individual timings.
* <Cycle name>
<name> a unique identifier for the cycle. This is not used except to establish the uniqueness of each block.
* Frequency <days> tells how often a backup should be made at all, in days.
* Phase <days> Allows adjusting the day on which the backup is made, with respect to the frequency. Can take values 0 <= Phase < Frequency; defaults to 0. This can be useful, for instance, if you want to alternate daily backups between two backup sets. This can be accomplished by creating two nearly identical backup specifications, both with Frequency 2, but where one has a Phase of 0 and the other has a Phase of 1.
* Diffs <number> tells how long the cycle between full backups should be. (Really there will be one fewer diff than this, since the full backup that starts the cycle itself counts as one.)
* Fulls <number> tells how many total cycles to keep.
* Directory <name> or <Directory name>
<name> a directory to be backed up
May appear as a property within a cycle block, or as a block in its own right, e.g. <Directory /some/path>. The latter case is just a shorthand for a cycle block containing a single Directory property.
* MySQL <databasename> or <MySQL databasename>
In order for this to work, the user you're running the script as must be able to mysqldump the requested databases without entering a password. This can be accomplished with a .my.cnf file in that user's home directory (a sample appears after this list).
<databasename> names a single database to be backed up, or "all" to dump all databases.
The Diffs property is ignored, since MySQL dumps are always "full".
* Subversion <repository> or <Subversion repository>
In order for this to work, the user you're running the script as must have permission to svnadmin dump the requested repository.
<repository> names a single svn repository to be backed up.
The Diffs property is ignored, since svnadmin dumps are always "full".
* SubversionDir <repository-dir> or <SubversionDir repository-dir>
<repository-dir> a directory containing multiple Subversion repositories, all of which should be backed up.
The Diffs property is ignored, since svnadmin dumps are always "full".
(this feature was inspired by http://www.hlynes.com/2006/10/01/backups-part-2-subversion)
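For the MySQL blocks above, a minimal ~/.my.cnf for the user running the script looks something like this (the user name and password are placeholders; keep the file readable only by that user, e.g. chmod 600):
[client]
user     = backup
password = secret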
Recovery
--------
Recovery is not automated, but if you need it, you'll be motivated to follow this simple manual process.
To retrieve a backup, use js3tream to download the files you need, then decrypt them with GPG, then extract the tarballs. Always start with the most recent FULL backup, then apply all available diffs in date order, regardless of the slot number.
The procedure will be something along these lines:
java -jar js3tream.jar --debug -n -f -v -K $s3keyfile -o -b $bucket:$name | gpg -d | tar xvz
Note that because of the streaming nature of all this, you can extract part of an archive even if there's not enough disk space to store the entire archive. You'll still have to download the whole thing, unfortunately, since it's only at the tar stage that you can select which files will be restored.
java -jar js3tream.jar --debug -n -f -v -K $s3keyfile -o -b $bucket:$name | gpg -d | tar xvz /path/to/desired/file
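To rebuild a directory from a full backup plus its diffs, repeat the same pipeline for each archive in date order; GNU tar extracts incremental archives properly when given an empty snapshot file via -g /dev/null. The names in the loop are placeholders for the actual keys of your most recent full backup and its later diffs:
for name in full-slot diff-slot-1 diff-slot-2; do
  java -jar js3tream.jar --debug -n -f -v -K $s3keyfile -o -b $bucket:$name | gpg -d | tar -xvz -g /dev/null
done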
Future Improvements
-------------------
* Code could be a lot cleaner, handle errors better, etc.
* Rotation schedule should be made decaying somehow
* S3 uploads could be done in parallel, since that can speed things up a lot
* Recovery could be automated
Please let me know if you make these or any other improvements!