Offsite backups are often talked about and rarely done well at small to medium enterprises. The usual reasons are the cost of an offsite storage facility, the complexity of backup software, and operational costs. However, offsite backups are critical to keeping an enterprise running if an unfortunate event hits the physical computing facility. Replacement hardware is often just a trip to Best Buy, the Apple Store, or your favorite chain, but without your data what good is it? AWS offers a low-cost solution to this problem that needs only a little Java and shell script knowledge. This post is split into two parts: getting the data in and getting the data out. This part deals with getting your data safely into Glacier.

Glacier – What it is and is not

Glacier is a cheap service built for bulk uploads and downloads where timeliness is not of the essence. Inventories of your vaults are taken every 24 hours, and it can take hours to get an archive back. To save yourself grief, do not think of Glacier as disk space, but rather as a robot that can fetch your data from a warehouse. For disk space, S3 and EBS are what you want in the Amazon world.

Decide what to back up

This part is usually unnecessarily difficult. The end product is a list of directories that will be fed to a shell script. You do not need everything. Typically you just need user data, transient data, and OS configs. Think of whatever the delta is from a stock OS image to what you have now, and back that delta up. If you want to back up a database, make sure you are using a dump and not relying on the file system to be consistent.
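
One way to keep that list manageable is a plain text file with one directory per line, filtered before it reaches the backup script. The sketch below is only an illustration: the file name backup-dirs.txt and the helper load_dirs are hypothetical and do not come from the original script.

```shell
#!/bin/bash
# Hypothetical helper: read one directory per line, skip blank lines and
# '#' comments, and print only the directories that exist on this host.
load_dirs() {
  local listfile=$1 line
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      ''|'#'*) continue ;;
    esac
    if [ -d "$line" ]; then
      printf '%s\n' "$line"
    else
      printf 'skipping missing directory: %s\n' "$line" >&2
    fi
  done < "$listfile"
}

# Usage sketch (bash 4+): fill the array that the backup script loops over.
# mapfile -t dirs < <(load_dirs backup-dirs.txt)
```

Keeping the list in a file means the same script can run unchanged on hosts with different layouts.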

Stage/Chunk it

Here is a script that can do it:
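 #!/bin/bash
 # BACKUP_PREFIX and RETENTION were not shown in the original listing; these values are assumptions.
 BACKUP_PREFIX=/backup    # staging partition (assumed path)
 RETENTION=7              # completed backups to keep (assumed value)
 CPUS=$(grep -c '^processor' /proc/cpuinfo)
 DATE=$(date +%m%d%Y-%H%M%S)
 # Inferred from the later "mv $DATE ../complete": each run stages under inflight/$DATE.
 BACKUPDIR="$BACKUP_PREFIX/$HOSTNAME/inflight/$DATE"
 dirs=(/root /sbin /usr /var /opt /lib32 /lib64 /etc /bin /boot)
 mkdir -p "$BACKUP_PREFIX/$HOSTNAME/complete" "$BACKUPDIR"
 mount /boot
 CPUS=$((CPUS + 1))
 # MemTotal is reported in kB; split -b reads the bare number as bytes, so pieces are about RAM/1024.
 SPLITSIZE=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
 for dir in "${dirs[@]}"; do
   [ -d "$dir" ] || continue               # skip directories this host does not have
   TMP=$(echo "$dir" | sed -e 's/\//_/g')  # /var becomes _var for the piece names
   echo "(cd $BACKUPDIR; tar cvpf - $dir | split -b $SPLITSIZE - backup_$TMP)"
   (cd "$BACKUPDIR" && tar cvpf - "$dir" | split -b "$SPLITSIZE" - "backup_$TMP")
 done
 # Compress every piece in parallel, one xz process per CPU plus one.
 echo "(cd $BACKUPDIR; find . -type f | xargs -n 1 -P $CPUS xz -9 -e)"
 (cd "$BACKUPDIR" && find . -type f -print0 | xargs -0 -n 1 -P "$CPUS" xz -9 -e)
 (cd "$BACKUP_PREFIX/$HOSTNAME/inflight" && mv "$DATE" ../complete)
 umount /boot
 # Prune: walk completed backups newest first and delete everything past RETENTION.
 i=1
 for completedir in $(cd "$BACKUP_PREFIX/$HOSTNAME/complete" && ls -c); do
   echo "$completedir $i $RETENTION"
   if [ "$i" -gt "$RETENTION" ]; then
     echo "(cd $BACKUP_PREFIX/$HOSTNAME/complete; rm -rf $completedir)"
     (cd "$BACKUP_PREFIX/$HOSTNAME/complete" && rm -rf "$completedir")
   fi
   i=$((i + 1))
 done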

This script makes a lot of decisions for you; all you really need to decide is where you are staging your data and which directories you are backing up. It creates a multi-part archive that is fairly efficient to produce on the hardware it runs on: the piece size is derived from installed memory, and compression runs one xz process per CPU plus one.
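
Because each directory is split first and the pieces are compressed afterwards, reading a backup means decompressing the pieces in order and concatenating them before tar sees the stream. Here is a minimal sketch of a sanity check under that assumption. The helper name list_backup is hypothetical, and the piece names assume the backup_ prefix and the default two-letter suffixes that split produces.

```shell
#!/bin/bash
# Hypothetical helper: list the files stored in one directory's backup.
# $1 = directory holding the pieces, $2 = piece prefix (for example backup__etc)
list_backup() {
  local stagedir=$1 prefix=$2
  # The glob expands in sorted order (aa, ab, ...), the order split wrote the
  # pieces in; xz -dc writes each decompressed piece to stdout, back to back.
  (cd "$stagedir" && xz -dc "$prefix"??.xz | tar tf -)
}

# Usage sketch: peek at what a completed backup of /etc contains.
# list_backup "/backup/$HOSTNAME/complete/<date>" backup__etc | head
```

If tar reports an unexpected end of archive here, a piece is missing or out of order, which is much better to learn before the upload than after a retrieval.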

Speaking of staging, you will need formatted disk space to hold the archive while it is being sent up to AWS. Typically you want the partition to hold a week's worth of backups. Why a week? It gives you breathing room to solve any backup issues without losing continuity of your backups.
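
A rough way to size the staging partition is to compare what the source directories occupy with what is free on the staging mount. The sketch below is an estimate under stated assumptions: it ignores compression, so it overestimates the need, and the helper enough_room and the /backup path are hypothetical names.

```shell
#!/bin/bash
# Hypothetical helper: succeed if `copies` backups of `need_kb` kilobytes each
# fit into `avail_kb` kilobytes of free space (copies defaults to 7, one week).
enough_room() {
  local need_kb=$1 avail_kb=$2 copies=${3:-7}
  [ $((need_kb * copies)) -le "$avail_kb" ]
}

# Usage sketch: uncompressed size of the sources versus free space on staging.
# need=$(du -skc /etc /root 2>/dev/null | tail -n 1 | awk '{print $1}')
# avail=$(df -Pk /backup | awk 'NR==2 {print $4}')
# enough_room "$need" "$avail" 7 || echo "staging partition too small" >&2
```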

Possibly encrypt it

Take a look at the shell script. If you are extra paranoid, use gpg to encrypt each archive piece AFTER it has been compressed. Encrypted data looks random, so if you encrypt first the compression algorithm will have nothing left to work with.
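
With that ordering, encryption is simply one more pass over the compressed pieces. This is a minimal sketch, assuming a GnuPG public key has already been imported and trusted for a hypothetical recipient (backup@example.com). The helper name encrypt_pieces and the DRY_RUN switch are introduced here for illustration only.

```shell
#!/bin/bash
# Hypothetical helper: encrypt every compressed piece in a staging directory
# to a public key, removing the plaintext piece only after gpg succeeds.
# With DRY_RUN set, the gpg commands are printed instead of executed.
encrypt_pieces() {
  local stagedir=$1 recipient=$2 piece
  for piece in "$stagedir"/*.xz; do
    [ -e "$piece" ] || continue   # no pieces: the glob stayed literal
    if [ -n "$DRY_RUN" ]; then
      echo "gpg --batch --recipient $recipient --output $piece.gpg --encrypt $piece"
    else
      gpg --batch --recipient "$recipient" --output "$piece.gpg" --encrypt "$piece" \
        && rm -f "$piece"
    fi
  done
}

# Usage sketch:
# DRY_RUN=1 encrypt_pieces "/backup/$HOSTNAME/complete/<date>" backup@example.com
```

Public-key encryption keeps the private key off the machine being backed up, so a compromised host cannot read older archives. In batch mode gpg may refuse a key it does not consider trusted, so test the command by hand first.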

Get it into Glacier

First we need to create a vault to hold our stuff. Log into your AWS console and go to Glacier management. Select the option to create a new vault and remember the vault's name. Then go to IAM and create a backup user, remembering to record the access key and secret key. Now we are ready to start.

This is a custom Java program snippet based on the AWS sample code:

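// client, credentials, and vaultName are set up earlier in the complete program.
for (String s : args) {
    try {
        ArchiveTransferManager atm = new ArchiveTransferManager(client, credentials);
        UploadResult result = atm.upload(vaultName, "my archive " + (new Date()), new File(s));
        System.out.println(s + " Archive ID: " + result.getArchiveId());
    } catch (Exception e) {
        System.err.println(s + " failed: " + e);
    }
}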

The complete code and the pom file to build it are included in the git repository. The pom should compile this program into a single jar that can be executed with the java -jar myjar.jar command. First, though, we need to configure the program. Create the program's properties file in the directory from which you will run the java command; it should hold your secret key, access key, vault name, and AWS endpoint. It should look like this:

 #Insert your AWS Credentials
#for the endpoint select your region

Lastly, feed the Java program a list of the archive files, perhaps like this:

cd stage; ls *| xargs java -jar

This will get all the files in the staging directory up to your vault. In about 24 hours you will see an inventory report with the size and number of uploads. Remember to save the output: the archive IDs are used to retrieve the data later. You will need to do some internal bookkeeping to keep the data organized. Amazon provides a safe place for the data, not an easy way to index or find things.
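
One way to handle that bookkeeping is to turn the uploader's output into a small index file. The sketch below assumes the output was saved to a log and that each successful upload printed a line of the form "<file> Archive ID: <id>", as the Java snippet does. The helper index_uploads and the tab-separated layout are hypothetical choices, not part of the original code.

```shell
#!/bin/bash
# Hypothetical helper: read uploader output on stdin and print
# "date<TAB>host<TAB>file<TAB>archive-id" for every upload line.
index_uploads() {
  local stamp=$1 host=$2
  awk -v stamp="$stamp" -v host="$host" '
    {
      marker = " Archive ID: "
      pos = index($0, marker)
      if (pos == 0) next                 # not an upload line, ignore it
      file = substr($0, 1, pos - 1)
      id = substr($0, pos + length(marker))
      printf "%s\t%s\t%s\t%s\n", stamp, host, file, id
    }'
}

# Usage sketch: append today's uploads to a running index.
# index_uploads "$(date +%F)" "$HOSTNAME" < upload.log >> glacier-index.tsv
```

Keep a copy of the index somewhere other than the machine being backed up; without the archive IDs, retrieval means waiting for a vault inventory first.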

To be continued…

Next post: getting it all back after said Armageddon.


All code is available, excuse the bad SSL cert:
