Digital Antiquity Backup Utility

Overview

This document describes Digital Antiquity’s archival and retrieval procedures for its assets, including the tDAR resource filestore, the tDAR PostgreSQL metadata database, and the Digital Antiquity websites.

Note: This document is not an installation guide or a tutorial. Installation, usage, and configuration instructions are available on Digital Antiquity's Bitbucket site.

Usage and project information: https://bitbucket.org/tdar/backup
Installation and configuration: https://bitbucket.org/tdar/backup/src/b7409a3a8f0aa0a29125a87cf4fe34b4a30a3441/installation-notes.md

Process Summary

Digital Antiquity will perform a routine full backup of its assets and transfer the backup to the Amazon Glacier service (by way of Amazon’s S3-to-Glacier utility). Digital Antiquity will augment these full backups with smaller differential backups that occur on a more frequent schedule. Digital Antiquity will compress and encrypt all data before sending it to Glacier.

Priorities

Backup files must be secure. All files must be encrypted using industry-accepted best practices. Unencrypted data must never be sent to an external provider.
The backup and restore process must be automated.
However, the process should be transparent and documented well enough that an administrator could perform all or part of the backup and recovery process manually.

Backup Procedures

This section describes the general steps involved in the backup process. The process is implemented as a set of Unix scripts and utilities.

Full Backups (“snapshots”)

Manifest file generation

Essentially a listing of every file contained in the backup
Elements:

full path and filename
hash signature (xxhash)
(undecided) owner+group
(undecided) permissions
(undecided) create+modify date
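
The manifest elements above can be sketched in Python as follows. This is a minimal illustration and not the project's actual script. It makes two assumptions that are not in this document: SHA-256 from the standard library stands in for xxhash so that the example runs without third-party packages, and paths are recorded relative to the backup root rather than as full paths. The function names are hypothetical.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()  # assumption: stand-in for xxhash
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return {relative path: hex digest} for every file found under root."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = file_digest(full_path)
    return manifest


def write_manifest(manifest, out_path):
    """Write one 'digest<TAB>path' line per file, sorted by path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for rel_path in sorted(manifest):
            out.write(f"{manifest[rel_path]}\t{rel_path}\n")
```

The undecided elements (owner, group, permissions, dates) are left out; if adopted, they could be read with os.stat and appended as extra columns.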

Backup to scratch location
File “Winterization”

Archive (tar)
Compression
Encryption
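
The three winterization steps could be chained as in the sketch below. This document does not name an encryption tool, so the sketch takes the encryption step as a caller-supplied function; that interface, the function name, and the file naming are all assumptions made for illustration.

```python
import os
import tarfile


def winterize(source_dir, scratch_dir, name, encrypt):
    """Archive and gzip source_dir, then hand the result to encrypt().

    encrypt is a caller-supplied function (hypothetical interface): it
    takes the path of the plaintext archive, writes an encrypted copy to
    a different path, and returns that path.
    """
    os.makedirs(scratch_dir, exist_ok=True)
    archive_path = os.path.join(scratch_dir, name + ".tar.gz")

    # Mode "w:gz" performs the tar and compression steps in a single pass.
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source_dir, arcname=".")

    encrypted_path = encrypt(archive_path)
    if encrypted_path == archive_path:
        raise ValueError("encrypt() must write to a new file")

    # Remove the plaintext archive so only encrypted data is left to transfer.
    os.remove(archive_path)
    return encrypted_path
```

The scratch directory should sit outside the source directory; otherwise the archive would try to include itself.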

Transfer to endpoints.

Endpoint 1: Transfer backup file(s) to Glacier

Use s3cmd to transfer the backup to an Amazon S3 bucket
After 1 month, Amazon automatically migrates backup files to Amazon Glacier
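
As a rough illustration of the s3cmd step, the sketch below builds and runs an "s3cmd put" command from Python. The helper names and the bucket and key passed in are hypothetical; it also assumes s3cmd is installed and already configured with credentials.

```python
import subprocess


def s3cmd_put_command(local_path, bucket, key):
    """Build the argument list for uploading one file with s3cmd."""
    return ["s3cmd", "put", local_path, f"s3://{bucket}/{key}"]


def upload(local_path, bucket, key):
    """Run the upload; raises CalledProcessError if s3cmd exits non-zero."""
    subprocess.run(s3cmd_put_command(local_path, bucket, key), check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with unusual file names.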

Endpoint 2: Transfer backup files to external hard drive

Differential Backups

The process for differential backups is very similar, with the main differences relating to the manifest file generation process.

Manifest File Generation

Obtain the manifest from the most recent full backup
Generate a new full manifest of the current filesystem
Using the old and new manifests, derive the list of file actions

Deleted files
New + modified files - this also serves as the manifest for the contents of the differential backup.
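
Deriving the two lists from the old and new manifests amounts to a dictionary comparison. The sketch below assumes each manifest has been loaded as a mapping from path to hash; that in-memory representation and the function name are assumptions, not part of the documented format.

```python
def diff_manifests(old, new):
    """Compare two {path: digest} manifests.

    Returns (deleted, changed): paths present only in the old manifest,
    and paths that are new or whose digest differs.
    """
    deleted = sorted(path for path in old if path not in new)
    changed = sorted(
        path for path, digest in new.items() if old.get(path) != digest
    )
    return deleted, changed


old = {"a.txt": "111", "b.txt": "222", "c.txt": "333"}
new = {"a.txt": "111", "b.txt": "999", "d.txt": "444"}
deleted, changed = diff_manifests(old, new)
# deleted == ["c.txt"]; changed == ["b.txt", "d.txt"]
```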

Backup to scratch location
File winterization
Transfer to endpoints

Restoration Procedures

Restoring a Full Backup

Obtain full backup manifest
Obtain full backup file (i.e. transfer to scratch location)
Unpack full backup

Decrypt
Decompress
Untar
Optional - consult manifest & verify hashes.
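
The optional verification step might look like the sketch below, which checks an unpacked tree in the scratch location against a manifest. It assumes the manifest is available as a mapping from relative path to expected hash and, as a stand-in for xxhash, uses SHA-256 from the standard library; the function name is hypothetical.

```python
import hashlib
import os


def verify_restore(scratch_root, manifest):
    """Return the manifest paths that are missing or whose hash differs.

    manifest maps relative path -> expected hex digest. SHA-256 is used
    here as a stand-in for xxhash (assumption).
    """
    failures = []
    for rel_path, expected in sorted(manifest.items()):
        full_path = os.path.join(scratch_root, rel_path)
        if not os.path.isfile(full_path):
            failures.append(rel_path)
            continue
        digest = hashlib.sha256()
        with open(full_path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected:
            failures.append(rel_path)
    return failures
```

An empty result means every file listed in the manifest was found with the expected hash, so the move from scratch to target can proceed.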

Move backup files from scratch to target

Restoring Differential Backup

Obtain differential backup manifest
Unpack differential backup file
Process deletions and additions

Process deleted files: the differential manifest specifies which files to remove from the target filesystem.
Move backup files from scratch to target
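
A minimal sketch of applying one differential backup follows. It assumes the deleted and modified lists have already been read from the differential manifest as relative paths, and it copies files out of scratch rather than moving them so that the scratch copy survives a failed run; both choices, and the function name, are assumptions.

```python
import os
import shutil


def apply_differential(scratch_root, target_root, deleted, changed):
    """Apply one differential backup to a restored target tree.

    deleted and changed are lists of paths relative to the backup root,
    as read from the differential manifest (format is an assumption).
    """
    # Deletions first, so a removed path cannot linger on the target.
    for rel_path in deleted:
        victim = os.path.join(target_root, rel_path)
        if os.path.isfile(victim):
            os.remove(victim)

    # Then copy every new or modified file over the target.
    for rel_path in changed:
        source = os.path.join(scratch_root, rel_path)
        destination = os.path.join(target_root, rel_path)
        os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
        shutil.copy2(source, destination)
```

When several differentials exist, they would need to be applied in chronological order on top of the restored full backup.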

Glacier Backup Details

Procedure for Backing up to Glacier

We plan to use Amazon’s automated S3-to-Glacier transfer functionality. More information can be found here: https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/.

Amazon Glacier is Amazon’s data archival service. Glacier provides low-cost, durable storage that is tailored for data archival and backup services.

Amazon S3 is a near-realtime online storage service. While it can serve as a backup destination, it is more tailored for low-latency & high-availability file access and S3 pricing reflects this. S3 has been in service longer than Glacier, and benefits from a good selection of mature 3rd-party file transfer utilities.
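
The automated S3-to-Glacier transfer is driven by a lifecycle rule on the bucket. The sketch below builds such a rule as JSON in the shape accepted by S3 lifecycle configuration. The 30-day value approximates the one-month migration described earlier in this document; the rule ID, the prefix, and the function name are hypothetical.

```python
import json


def glacier_lifecycle_rule(prefix, days=30, rule_id="archive-backups"):
    """Build an S3 lifecycle configuration that moves objects to Glacier."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


print(json.dumps(glacier_lifecycle_rule("filestore/"), indent=2))
```

The resulting document would be attached to the bucket once; after that, S3 applies the transition without any further action from the backup scripts.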

Amazon Primers

Amazon S3 Filesystem Layout

Buckets

Top-level container
Container for objects
Non-hierarchical (no buckets in buckets)

Objects

Essentially files
Have name, permissions.

Folders

Hierarchical
Don’t really exist. Serve as a construct when downloading and visualizing.
Internally, just a prefix prepended to object name.
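
To make the point about folders concrete, the sketch below emulates a folder listing over a flat list of object keys by grouping on a shared prefix and a "/" delimiter. It is a plain-Python illustration, not a call to any Amazon API, and the function name is hypothetical; the keys are borrowed from the example layout later in this document.

```python
def list_folder(keys, prefix="", delimiter="/"):
    """Emulate a folder listing over a flat collection of object keys.

    Returns (subfolders, objects) directly under prefix, the way a
    delimiter-based listing groups keys by common prefix.
    """
    subfolders, objects = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _tail = rest.partition(delimiter)
        if sep:
            subfolders.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return sorted(subfolders), sorted(objects)


keys = [
    "filestore/2015q1/full/filestore-2015q1.manifest",
    "filestore/2015q1/full/tdar-filestore-2015q1.tar.gz",
    "filestore/2015q1/differential/2015-jan-01/filestore-2015-jan-01.tar.gz",
]
folders, objects = list_folder(keys, prefix="filestore/2015q1/")
# folders == ["filestore/2015q1/differential/", "filestore/2015q1/full/"]
# objects == []
```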

Limits

Limited # of buckets per account (subject to an AWS account quota)
Unlimited # of objects per bucket
Max object size: 5TB

Glacier Filesystem Layout

Vault

Top-level container, akin to an S3 bucket

Archive

Roughly akin to an S3 object.

Limits: effectively none, for a filesystem of our size (or a filesystem 1000x our size)

How Manifests Work

How S3-to-Glacier works

How to copy to S3

Suggested S3 File Layout

The basics

One bucket per “app” (e.g. “tdar filesystem”, “postgres”, “jira”, etc)
snapshot contained in “snapshot” subfolder
differential backups in “diffs” subfolder.
each folder contains:

one object containing manifest file(s)
one object containing winterized backup + manifest file(s)

Example Layout

s3://tdar/filestore/2015q1

s3://tdar/filestore/2015q1/full

filestore-2015q1.manifest
tdar-filestore-2015q1.tar.gz

s3://tdar/filestore/2015q1/differential

s3://tdar/filestore/2015q1/differential/2015-jan-01

filestore-2015-jan-01.deleted.txt
filestore-2015-jan-01.modified.txt
filestore-2015-jan-01.tar.gz

s3://tdar/filestore/2015q1/differential/2015-jan-14

filestore-2015-jan-14.deleted.txt
filestore-2015-jan-14.modified.txt
filestore-2015-jan-14.tar.gz

s3://tdar/postgresql/2015q1

s3://tdar/postgresql/2015q1/full

…

s3://tdar/postgresql/2015q1/differential

…