Digital Antiquity Backup Utility
Overview
This document describes Digital Antiquity’s archival and retrieval procedures for its assets, including the tDAR resource filestore, the tDAR PostgreSQL metadata database, and the Digital Antiquity websites.
Note: This document is not an installation guide or a tutorial. Installation, usage, and configuration instructions are available on Digital Antiquity's Bitbucket site.
- Usage and project information: https://bitbucket.org/tdar/backup
- Installation and configuration: https://bitbucket.org/tdar/backup/src/b7409a3a8f0aa0a29125a87cf4fe34b4a30a3441/installation-notes.md
Process Summary
Digital Antiquity will perform a routine full backup of its assets and transfer the backup to the Amazon Glacier service (by way of Amazon’s S3-to-Glacier utility). Digital Antiquity will augment these full backups with smaller differential backups that run on a more frequent schedule. Digital Antiquity will compress and encrypt all data before sending it to Glacier.
Priorities
Backup files must be secure. All files must be encrypted using industry-accepted best practices. Unencrypted data must never be sent to an external provider.
The backup and restore process must be automated.
However, the process should be transparent and documented well enough that an administrator could perform all or part of the backup and recovery process manually.
Backup Procedures
This section outlines the steps in the backup process. The process is implemented as a set of Unix scripts and utilities.
Full Backups (“snapshots”)
Manifest file generation
Essentially a listing of every file contained in the backup
Elements:
full path and filename
hash signature (xxhash)
(undecided) owner+group
(undecided) permissions
(undecided) create+modify date
Backup to scratch location
File “Winterization”
Archive (tar)
Compression
Encryption
Transfer to endpoints.
Endpoint 1: Transfer backup file(s) to Glacier
use s3cmd to transfer backup to Amazon S3 bucket
After 1 month, Amazon automatically migrates backup files to Amazon Glacier
Endpoint 2: Transfer backup files to external hard drive
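The manifest, archive, and compression steps above can be sketched in Python as follows. This is an illustration only, not the backup utility's actual code. It makes several assumptions: BLAKE2b from the standard library stands in for xxhash (which would need a third-party package), paths are recorded relative to the backup root rather than as full paths, the tab-separated manifest format is hypothetical, and the function names are invented for this example. Encryption is omitted because this document does not name an encryption tool.

```python
import hashlib
import os
import tarfile


def file_digest(path, chunk_size=1 << 20):
    """Hash one file in chunks so large files are never fully loaded into memory."""
    # Assumption: BLAKE2b is a stand-in for the xxhash signature named above.
    digest = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map every file under root (path relative to root) to its hash signature."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = file_digest(full_path)
    return manifest


def write_manifest(manifest, manifest_path):
    """Write one 'hash<TAB>path' line per file, sorted by path (hypothetical format)."""
    with open(manifest_path, "w", encoding="utf-8") as out:
        for rel_path in sorted(manifest):
            out.write(f"{manifest[rel_path]}\t{rel_path}\n")


def archive_and_compress(root, archive_path):
    """Create a gzip-compressed tar archive of everything under root."""
    # archive_path should live outside root so the archive does not include itself.
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(root, arcname=".")
```

Owner, group, permissions, and dates are left out of the sketch because they are still marked as undecided in the element list.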
Differential Backups
The process for differential backups is very similar to the full backup process. The main differences are in how the manifest file is generated.
Manifest File Generation
obtain full backup manifest
generate new, full backup manifest
using old+new manifest, derive list of file actions
Deleted files
New + modified files - this also serves as the manifest for the contents of the differential backup.
Backup to scratch location
File winterization
Transfer to endpoints
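Deriving the file actions from the old and new manifests amounts to comparing two path-to-hash mappings. The sketch below assumes each manifest has already been loaded into a dictionary keyed by path. The function name and the sample paths and hashes are illustrative only.

```python
def diff_manifests(old, new):
    """Compare two {path: hash} manifests and return (deleted, changed) path lists."""
    # Deleted: present in the previous full manifest, absent from the new one.
    deleted = sorted(path for path in old if path not in new)
    # New + modified: absent from the old manifest, or present with a different hash.
    changed = sorted(path for path, digest in new.items() if old.get(path) != digest)
    return deleted, changed


old = {"a.pdf": "111", "b.pdf": "222", "c.pdf": "333"}
new = {"a.pdf": "111", "b.pdf": "999", "d.pdf": "444"}
deleted, changed = diff_manifests(old, new)
print(deleted)  # ['c.pdf']
print(changed)  # ['b.pdf', 'd.pdf']
```

The second list is the set of files to copy to the scratch location, and it doubles as the manifest of the differential backup.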
Restoration Procedures
Restoring a Full Backup
Obtain full backup manifest
Obtain full backup file (i.e. transfer to scratch location)
Unpack full backup
decrypt
decompress
untar
Optional - consult manifest & verify hashes.
Move backup files from scratch to target
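The optional verification step can be sketched as below: re-hash each unpacked file in the scratch location and compare it with the manifest before moving anything to the target. The assumptions here are a hypothetical tab-separated manifest (one `hash<TAB>relative path` line per file) and BLAKE2b as a stand-in for xxhash. Neither is specified by this document.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Hash one file in chunks (BLAKE2b is a stand-in for xxhash)."""
    digest = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_manifest(manifest_path):
    """Read 'hash<TAB>path' lines (hypothetical format) into a {path: hash} dict."""
    manifest = {}
    with open(manifest_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                digest, rel_path = line.split("\t", 1)
                manifest[rel_path] = digest
    return manifest


def verify_restore(manifest, scratch_root):
    """Return (missing, corrupt): manifest paths that are absent or hash differently."""
    missing, corrupt = [], []
    for rel_path, expected in sorted(manifest.items()):
        full_path = os.path.join(scratch_root, rel_path)
        if not os.path.isfile(full_path):
            missing.append(rel_path)
        elif file_digest(full_path) != expected:
            corrupt.append(rel_path)
    return missing, corrupt
```

A restore script would likely stop before the final move if either list is non-empty.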
Restoring Differential Backup
Obtain differential backup manifest
Unpack differential backup file
Process deletions and additions
Process Deleted files - the differential manifest specifies which files to remove from the target filesystem.
Move backup files from scratch to target
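Applying a differential backup to a target can be sketched as two passes: remove the files listed as deleted, then move the unpacked files from scratch into place. The function name and arguments below are illustrative, and the sketch assumes all paths in the deleted list are relative to the target root.

```python
import os
import shutil


def apply_differential(deleted_paths, scratch_root, target_root):
    """Remove deleted files from the target, then move scratch files into place."""
    # Pass 1: process deletions listed in the differential manifest.
    for rel_path in deleted_paths:
        victim = os.path.join(target_root, rel_path)
        if os.path.isfile(victim):
            os.remove(victim)
    # Pass 2: move new and modified files from scratch to the target.
    for dirpath, _dirnames, filenames in os.walk(scratch_root):
        for name in filenames:
            source = os.path.join(dirpath, name)
            rel_path = os.path.relpath(source, scratch_root)
            destination = os.path.join(target_root, rel_path)
            os.makedirs(os.path.dirname(destination), exist_ok=True)
            # Copy then delete, so an existing file is overwritten on any platform.
            shutil.copy2(source, destination)
            os.remove(source)
```

Differential backups would need to be applied in chronological order on top of the restored full backup for the target to end up in the expected state.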
Glacier Backup Details
Procedure for Backing up to Glacier
We plan to use Amazon’s automated S3-to-Glacier transfer functionality. More information can be found here: https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/.
Amazon Glacier is Amazon’s data archival service. Glacier provides low-cost, durable storage that is tailored for data archival and backup services.
Amazon S3 is a near-realtime online storage service. While it can serve as a backup destination, it is more tailored for low-latency & high-availability file access and S3 pricing reflects this. S3 has been in service longer than Glacier, and benefits from a good selection of mature 3rd-party file transfer utilities.
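The automated S3-to-Glacier transfer is driven by a lifecycle rule on the bucket. The sketch below builds such a rule as a plain dictionary in the shape the S3 lifecycle configuration API accepts. The 30-day figure is an assumption standing in for "1 month", and the rule ID and the empty prefix (meaning every object in the bucket) are illustrative choices, not settings taken from this document.

```python
import json


def glacier_lifecycle_rule(days=30, prefix=""):
    """Build a lifecycle configuration that transitions objects to Glacier."""
    return {
        "Rules": [
            {
                "ID": f"archive-to-glacier-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches all objects
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


# Print the rule as JSON, for example to save to a file and apply to a bucket.
print(json.dumps(glacier_lifecycle_rule(), indent=2))
```

Because the rule lives on the bucket, the backup scripts themselves only need to upload to S3. No Glacier-specific client is involved.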
Amazon Primers
Amazon S3 Filesystem Layout
Buckets
Top-level container
Container for objects
Non-hierarchical (no buckets in buckets)
Objects
Essentially files
Have name, permissions.
Folders
Hierarchical
Don’t really exist. Serve as a construct when downloading and visualizing.
Internally, just a prefix prepended to object name.
Limits
# of buckets per account is capped by an AWS service quota (historically 100 by default; increases can be requested)
Unlimited # of objects per bucket
Max object size: 5TB
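To make the "folders are just prefixes" point concrete, the sketch below emulates a folder-style listing over a flat list of object keys, in pure Python with no AWS calls. The function name is invented for this example, and the `snapshot/` and `diffs/` prefixes are an assumption based on the subfolder names suggested later in this document.

```python
def list_folder(keys, prefix="", delimiter="/"):
    """Emulate a folder listing over flat object keys using a prefix and delimiter."""
    objects, folders = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        if delimiter in remainder:
            # Anything past the next delimiter is presented as a "subfolder".
            folders.add(prefix + remainder.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(folders)


keys = [
    "snapshot/filestore-2015q1.manifest",
    "snapshot/tdar-filestore-2015q1.tar.gz",
    "diffs/filestore-2015-jan-01.tar.gz",
]
print(list_folder(keys))               # ([], ['diffs/', 'snapshot/'])
print(list_folder(keys, "snapshot/"))  # both snapshot keys, no subfolders
```

The bucket stores only the three full key names. The folder view is computed at listing time.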
Glacier Filesystem Layout
Vault
Top-level container, akin to an S3 bucket
Archive
Roughly akin to an S3 object.
Limits: effectively none, for a filesystem of our size (or a filesystem 1000x our size)
How Manifests Work
How S3-to-Glacier works
How to copy to S3
Suggested S3 File Layout
The basics
One bucket per “app” (e.g. “tdar filesystem”, “postgres”, “jira”, etc)
snapshot contained in “snapshot” subfolder
differential backups in “diffs” subfolder.
each folder contains:
one object containing manifest file(s)
one object containing winterized backup + manifest file(s)
Example Layout
snapshot/filestore-2015q1.manifest
snapshot/tdar-filestore-2015q1.tar.gz
diffs/filestore-2015-jan-01.deleted.txt
diffs/filestore-2015-jan-01.modified.txt
diffs/filestore-2015-jan-01.tar.gz
diffs/filestore-2015-jan-14.deleted.txt
diffs/filestore-2015-jan-14.modified.txt
diffs/filestore-2015-jan-14.tar.gz
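Object names in this style can be generated rather than typed by hand. The sketch below is one possible naming helper, not part of the backup utility. It assumes the "snapshot" and "diffs" subfolders become key prefixes and that a single name stem is used for every file in a set, whereas the example above mixes "filestore-" and "tdar-filestore-". The function names are hypothetical.

```python
import datetime

# Fixed month abbreviations, so names do not depend on the system locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def snapshot_keys(app, year, quarter):
    """Return the object keys for one quarterly full backup (snapshot)."""
    stem = f"snapshot/{app}-{year}q{quarter}"
    return [f"{stem}.manifest", f"{stem}.tar.gz"]


def diff_keys(app, day):
    """Return the object keys for one differential backup taken on the given date."""
    stem = f"diffs/{app}-{day.year}-{MONTHS[day.month - 1]}-{day.day:02d}"
    return [f"{stem}.deleted.txt", f"{stem}.modified.txt", f"{stem}.tar.gz"]


print(snapshot_keys("filestore", 2015, 1))
print(diff_keys("filestore", datetime.date(2015, 1, 14)))
```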