s3napback: Cycling, Incremental, Compressed, Encrypted Backups on Amazon S3
===========================================================================
Manual for version 1.0
2008-05-07
Copyright (c) 2008 David Soergel <[email protected]>
The problem
-----------
In searching for a way to back up one of my Linux boxes to Amazon S3, I was surprised to find that none of the many backup methods and scripts I found on the net did what I wanted, so I wrote yet another one.
The design requirements were:
* Occasional full backups, and daily incremental backups
* Stream data to S3, rather than making a local temp file first (i.e., if I want to archive all of /home at once, there's no point in making a huge local tarball and doing lots of disk access in the process)
* Break up large archives into manageable chunks
* Encryption
As far as I could tell, no available backup script (including, e.g., s3sync, backup-manager, and s3backup) met all four requirements.
The closest thing is [js3tream](http://js3tream.sourceforge.net), which handles streaming and splitting, but not incremental backups or encryption. Those are both fairly easy to add, though, using tar and gpg, as [suggested](http://js3tream.sourceforge.net/linux_tar.html) by the js3tream author. However, the s3backup.sh script he provides uses temp files (unnecessarily), and does not encrypt. So I modified it a bit to produce [s3backup-gpg-streaming.sh](s3backup-gpg-streaming.sh).
That's not the end of the story, though, since it leaves open the problem of managing the backup rotation. I found the explicit cron jobs suggested on the js3tream site too messy, especially since I sometimes want to back up a lot of different directories. Some other available solutions will send incremental backups to S3, but never purge the old ones, and so use ever more storage.
Finally, I wanted to easily deal with MySQL and Subversion dumps.
The solution
------------
I wrote s3napback.pl, which wraps js3tream and solves all of the above issues by providing:
* Dead-simple configuration
* Automatic rotation of backup sets
* Alternation of full and incremental backups (using "tar -g")
* Integrated GPG encryption
* No temporary files used anywhere, only pipes and TCP streams (optionally, uses smallish temp files to save memory)
* Integrated handling of MySQL dumps
* Integrated handling of Subversion repositories, and of directories containing multiple Subversion repositories.
It's not rocket science, just a wrapper that makes things a bit easier.
Prerequisites
-------------
* Java 1.5 or above
* gpg
Quick Start
-----------
Download and extract the s3napback package. There's no "make" step or anything of the sort; just extract it to some convenient location (I use /usr/local/s3napback).
Configure it with your S3 login information by creating a file called e.g. /usr/local/s3napback/key.txt containing
key=your AWS key
secret=your AWS secret
You'll need a GPG key pair for encryption. Create it with
gpg --gen-key
Since you'll need the secret key to decrypt your backups, you'll obviously need to store it in a safe place (see [the GPG manual](http://www.gnupg.org/gph/en/manual/c481.html)).
If you'll be backing up a different machine from the one where you generated the key pair, export the public key:
gpg --export [email protected] > backup.pubkey
and import it on the machine to be backed up:
gpg --import backup.pubkey
gpg --edit-key [email protected]
then "trust"
Create a configuration file something like this (descriptions of the options follow, if they're not entirely obvious):
DiffDir /usr/local/s3napback/diffs
Bucket dev.davidsoergel.com.backup1
GpgRecipient [email protected]
S3Keyfile /usr/local/s3napback/key.txt
ChunkSize 25000000
NotifyEmail [email protected] # not implemented yet
LogFile /var/log/s3napback.log # not implemented yet
LogLevel 2 # not implemented yet
# make diffs of these every day, store fulls once a week, and keep two weeks
<Cycle>
Frequency 1
Phase 0
Diffs 7
Fulls 2
Directory /etc
Directory /home/build
Directory /home/notebook
Directory /home/trac
Directory /usr
Directory /var
</Cycle>
# make diffs of these every week, store fulls once a month, and keep two months
<Cycle>
Frequency 7
Phase 0
Diffs 4
Fulls 2
Directory /bin
Directory /boot
Directory /lib
Directory /lib64
Directory /opt
Directory /root
Directory /sbin
</Cycle>
# make a diff of this every day, store fulls once a week, and keep eight weeks
<Directory /home/foobar>
Frequency 1
Phase 0
Diffs 7
Fulls 8
Exclude /home/foobar/wumpus
</Directory>
# backup an entire machine
<Directory />
Frequency 1
Phase 0
Diffs 7
Fulls 8
Exclude /proc
Exclude /dev
Exclude /sys
Exclude /tmp
</Directory>
# store a MySQL dump of all databases every day, keeping 14.
<MySQL all>
Frequency 1
Phase 0
Fulls 14
</MySQL>
# store a MySQL dump of a specific database every day, keeping 14.
<MySQL mydatabase>
Frequency 1
Phase 0
Fulls 14
</MySQL>
# store a full dump of all Subversion repos every day, keeping 10.
<SubversionDir /home/svn/repos>
Frequency 1
Phase 0
Fulls 10
</SubversionDir>
# store a full dump of a specific Subversion repo every day, keeping 10.
<Subversion /home/svn/repos/myproject>
Frequency 1
Phase 0
Fulls 10
</Subversion>
To run it, just run the script, passing the config file with the -c option:
./s3napback.pl -c s3snap.conf
That's it! You can put that command in a cron job to run once a day.
Note that gpg will look for a keyring under ~/.gnupg, but on some systems /etc/crontab sets the HOME environment variable to "/". So, you may want to change that to "/root"; or actually create and populate /.gnupg; or just use the GpgKeyring option in the s3napback config file to specify a keyring explicitly.
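For example, an entry along these lines in /etc/crontab (or a file under /etc/cron.d) runs the backup nightly; the paths, schedule, and log file are illustrative, and HOME is set explicitly so gpg finds root's keyring:
HOME=/root
30 3 * * * root /usr/local/s3napback/s3napback.pl -c /usr/local/s3napback/s3snap.conf >> /var/log/s3napback.log 2>&1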
Principles of operation
-----------------------
The cycling of backup sets here is rudimentary, taking its inspiration from the cron job approach given on the js3tream page. The principle is that we'll sort the snapshots into a fixed number of "slots"; every new backup simply overwrites the oldest slot, so we don't need to explicitly purge old files.
This is a fairly crappy schedule, in that the rotation doesn't decay over time. We just keep a certain number of daily backups (full or diff), and that's it. For my purposes, that's good enough for now; but I bet someone out there will figure out a clever means of producing a decaying schedule.
Note also that the present scheme means that, once the oldest full backup is deleted, the diffs based on it will still be stored until they are overwritten, but may not be all that useful. For instance, if you do daily diffs and weekly fulls for two weeks, then at some point you'll go from this situation, where you can reconstruct your data for any day from the last two weeks (F = full, D = diff, time flowing to the right):
FDDDDDDFDDDDDD
to this one:
DDDDDDFDDDDDDF
where the full backup on which the six oldest diffs are based is gone, so in fact you can only fully reconstruct the last 8 days. You can still retrieve files that changed on the days represented by the old diffs, of course.
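To make the slot idea concrete, here is a rough sketch of how a fixed-slot rotation can be computed for one backup set; the arithmetic is illustrative and not necessarily exactly what s3napback.pl does internally:
# Illustrative only: Frequency 1, Diffs 7, Fulls 2 gives 14 slots.
frequency=1; diffs=7; fulls=2
day=$(( $(date +%s) / 86400 ))        # days since the epoch
run=$(( day / frequency ))            # which backup run this is
slot=$(( run % (diffs * fulls) ))     # slot to overwrite, 0..13 here
if [ $(( slot % diffs )) -eq 0 ]; then
  echo "slot $slot: full backup (starts a new cycle)"
else
  echo "slot $slot: incremental against this cycle's full"
fi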
Configuration
-------------
First off you'll need some general configuration statements:
* DiffDir
a directory where tar can store its diff files (necessary for incremental backups).
* Bucket
the destination bucket on S3.
* GpgRecipient
the address of the public key to use for encryption. The gpg keyring of the user you're running the script as (i.e., root, for a systemwide cron job) must contain a matching key.
* GpgKeyring
path to the keyring file containing the public key for the GpgRecipient. Defaults to ~/.gnupg/pubring.gpg
* S3KeyFile
the file containing your AWS authentication keys.
* ChunkSize
the size of the chunks to be stored on S3, in bytes.
Then you can specify as many directories, databases, and repositories as you like to be backed up. These may be contained in <Cycle> blocks, for the sake of reusing timing configuration, or may be blocks themselves with individual timings.
* <Cycle name>
<name> a unique identifier for the cycle. This is not used except to establish the uniqueness of each block.
* Frequency <days> tells how often a backup should be made at all, in days.
* Phase <days> Allows adjusting the day on which the backup is made, with respect to the frequency. Can take values 0 <= Phase < Frequency; defaults to 0. This can be useful, for instance, if you want to alternate daily backups between two backup sets. This can be accomplished by creating two nearly identical backup specifications, both with Frequency 2, but where one has a Phase of 0 and the other has a Phase of 1.
* Diffs <number> tells how long the cycle between full backups should be. (Really there will be one fewer diff than this, since the full backup that starts the cycle itself counts as one.)
* Fulls <number> tells how many total cycles to keep.
* Directory <name> or <Directory name>
<name> a directory to be backed up
May appear as a property within a cycle block, or as a block in its own right, e.g. <Directory /some/path>. The latter case is just a shorthand for a cycle block containing a single Directory property.
* MySQL <databasename> or <MySQL databasename>
In order for this to work, the user you're running the script as must be able to mysqldump the requested databases without entering a password. This can be accomplished with a .my.cnf file in that user's home directory (a sample appears after this list).
<databasename> names a single database to be backed up, or "all" to dump all databases.
The Diffs property is ignored, since MySQL dumps are always "full".
* Subversion <repository> or <Subversion repository>
In order for this to work, the user you're running the script as must have permission to svnadmin dump the requested repository.
<repository> names a single svn repository to be backed up.
The Diffs property is ignored, since svnadmin dumps are always "full".
* SubversionDir <repository-dir> or <SubversionDir repository-dir>
<repository-dir> a directory containing multiple Subversion repositories, all of which should be backed up.
The Diffs property is ignored, since svnadmin dumps are always "full".
(this feature was inspired by http://www.hlynes.com/2006/10/01/backups-part-2-subversion)
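For the MySQL blocks above, a minimal ~/.my.cnf for the user running the script looks something like this (the user name and password are placeholders; keep the file readable only by that user, e.g. chmod 600):
[client]
user     = backup
password = secret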
Recovery
--------
Recovery is not automated, but if you need it, you'll be motivated to follow this simple manual process.
To retrieve a backup, use js3tream to download the files you need, then decrypt them with GPG, then extract the tarballs. Always start with the most recent FULL backup, then apply all available diffs in date order, regardless of the slot number.
The procedure will be something along these lines:
java -jar js3tream.jar --debug -n -f -v -K $s3keyfile -o -b $bucket:$name | gpg -d | tar xvz
Note that because of the streaming nature of all this, you can extract part of an archive even if there's not enough disk space to store the entire archive. You'll still have to download the whole thing, unfortunately, since it's only at the tar stage that you can select which files will be restored.
java -jar js3tream.jar --debug -n -f -v -K $s3keyfile -o -b $bucket:$name | gpg -d | tar xvz /path/to/desired/file
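To rebuild a directory from a full backup plus its diffs, repeat the same pipeline for each archive in date order; GNU tar extracts incremental archives properly when given an empty snapshot file via -g /dev/null. The names in the loop are placeholders for the actual keys of your most recent full backup and its later diffs:
for name in full-slot diff-slot-1 diff-slot-2; do
  java -jar js3tream.jar --debug -n -f -v -K $s3keyfile -o -b $bucket:$name | gpg -d | tar -xvz -g /dev/null
done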
Future Improvements
-------------------
* Code could be a lot cleaner, handle errors better, etc.
* Rotation schedule should be made decaying somehow
* S3 uploads could be done in parallel, since that can speed things up a lot
* Recovery could be automated
Please let me know if you make these or any other improvements!