Digital Antiquity Backup Utility


Overview

This document describes Digital Antiquity’s archival and retrieval procedures for its assets, including the tDAR resource filestore, the tDAR PostgreSQL metadata database, and Digital Antiquity websites.


Note: This document is not an installation guide or a tutorial.  Installation, usage, and configuration instructions are available on Digital Antiquity's Bitbucket site.

 

Process Summary


Digital Antiquity will perform a routine full backup of assets and transfer the backups to the Amazon Glacier service (by way of Amazon’s S3-to-Glacier utility). Digital Antiquity will augment these full backups with smaller differential backups that occur on a more frequent schedule. Digital Antiquity will compress and encrypt all data prior to sending it to Glacier.


Priorities

  • Backup files must be secure.  All files must be encrypted using industry-accepted best practices. Unencrypted data must never be sent to an external provider.

  • The backup and restore process must be automated.

  • Even so, the process should be transparent and documented well enough that an administrator could perform all or part of the backup and recovery process manually.


Backup Procedures

This section outlines the steps involved in the backup process. The process is implemented as a set of Unix scripts and utilities.

Full Backups (“snapshots”)

  1. Manifest file generation

    1. Essentially a listing of every file contained in the backup

    2. Elements:

      1. full path and filename

      2. hash signature (xxhash)

      3. (undecided) owner+group

      4. (undecided) permissions

      5. (undecided) create+modify date

  2. Backup to scratch location

  3. File “Winterization”

    1. Archive (tar)

    2. Compression

    3. Encryption  

  4. Transfer to endpoints.

    1. Endpoint 1: Transfer backup file(s) to Glacier

      1. use s3cmd to transfer backup to Amazon S3 bucket

      2. After 1 month,  Amazon automatically migrates backup files to Amazon Glacier

    2. Endpoint 2: Transfer backup files to external hard drive
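The manifest and archive steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the utility's actual scripts: SHA-256 stands in for xxhash (which is not in the Python standard library), paths are recorded relative to the backup root, the tab-separated manifest format is hypothetical, and the undecided fields (owner, permissions, dates) are left out. Encryption is not shown because it depends on an external tool.

```python
import hashlib
import os
import tarfile


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never load fully into memory."""
    # Assumption: SHA-256 is a stand-in for the xxhash signature in the outline.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return {relative_path: hex_digest} for every file os.walk finds under root."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = file_digest(full_path)
    return manifest


def write_manifest(manifest, manifest_path):
    """Write one 'digest<TAB>path' line per file, sorted by path (hypothetical format)."""
    with open(manifest_path, "w", encoding="utf-8") as out:
        for rel_path in sorted(manifest):
            out.write(f"{manifest[rel_path]}\t{rel_path}\n")


def archive_and_compress(root, archive_path):
    """Tar and gzip the tree in one pass; encryption would follow as a separate step."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(root, arcname=".")
```

Sorting the manifest by path keeps successive manifests easy to compare line by line.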

Differential Backups

The process for differential backups closely follows the full-backup process; the main differences are in manifest file generation.

  1. Manifest File Generation

    1. Obtain the manifest from the last full backup

    2. Generate a new, full manifest of the current filesystem

    3. Using the old and new manifests, derive the list of file actions

      1. Deleted files

      2. New + modified files - this also serves as the manifest for the contents of the differential backup.

  2. Backup to scratch location

  3. File winterization

  4. Transfer to endpoints
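Deriving the file actions from two manifests amounts to a dictionary comparison. The sketch below assumes each manifest has been loaded as a mapping from path to hash; the function name and the sample data are hypothetical.

```python
def diff_manifests(old, new):
    """Compare two {path: digest} manifests.

    Returns (deleted, changed): paths to remove from the target, and paths
    that are new or modified and therefore belong in the differential backup.
    """
    deleted = sorted(path for path in old if path not in new)
    changed = sorted(
        path for path, digest in new.items() if old.get(path) != digest
    )
    return deleted, changed


old_manifest = {"a.txt": "111", "b.txt": "222", "c.txt": "333"}
new_manifest = {"a.txt": "111", "b.txt": "999", "d.txt": "444"}

deleted, changed = diff_manifests(old_manifest, new_manifest)
print(deleted)  # ['c.txt']
print(changed)  # ['b.txt', 'd.txt']
```

The `changed` list doubles as the manifest for the contents of the differential backup, as noted in the steps above.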


Restoration Procedures

Restoring a Full Backup

  1. Obtain full backup manifest

  2. Obtain full backup file  (i.e. transfer to scratch location)

  3. Unpack full backup

    1. Decrypt

    2. Decompress

    3. Untar

    4. Optional - consult manifest & verify hashes.

  4. Move backup files from scratch to target
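The optional verification step can be sketched as follows. The code assumes the manifest has been loaded as a mapping from relative path to hash and, as an assumption, uses SHA-256 in place of xxhash; the function names are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks; SHA-256 is an assumed stand-in for xxhash."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_manifest(manifest, scratch_dir):
    """Return (missing, corrupted) lists of paths for an unpacked backup."""
    missing, corrupted = [], []
    for rel_path, expected in sorted(manifest.items()):
        full_path = os.path.join(scratch_dir, rel_path)
        if not os.path.isfile(full_path):
            missing.append(rel_path)
        elif sha256_of(full_path) != expected:
            corrupted.append(rel_path)
    return missing, corrupted
```

Running the check against the scratch location, before anything is moved to the target, means a damaged backup never touches the live filesystem.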

Restoring a Differential Backup

  1. Obtain differential backup manifest

  2. Unpack differential backup file

  3. Process deletions and additions

    1. Process Deleted files - the differential manifest specifies which files to remove from the target filesystem.

    2. Move backup files from scratch to target
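A minimal sketch of the deletion and addition steps, assuming the differential manifest has already been parsed into two lists of relative paths and the differential archive has been unpacked into a scratch directory. The function and parameter names are hypothetical.

```python
import os
import shutil


def apply_differential(deleted, changed, scratch_dir, target_dir):
    """Apply one differential backup to an already-restored target tree."""
    # Deletions first, so a removed file never lingers in the target.
    for rel_path in deleted:
        victim = os.path.join(target_dir, rel_path)
        if os.path.exists(victim):
            os.remove(victim)
    # Then move new and modified files from scratch, replacing stale copies.
    for rel_path in changed:
        source = os.path.join(scratch_dir, rel_path)
        destination = os.path.join(target_dir, rel_path)
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        shutil.move(source, destination)
```

Paths listed as deleted but already absent from the target are skipped, so re-running an interrupted restore does not fail on the deletion step.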


Glacier Backup Details


Procedure for Backing up to Glacier

We plan to use Amazon’s automated S3-to-Glacier transfer functionality.  More information can be found here: https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/.


Amazon Glacier is Amazon’s data archival service.   Glacier provides low-cost, durable storage that is tailored for data archival and backup services.  


Amazon S3 is a near-realtime online storage service. While it can serve as a backup destination, it is tailored more for low-latency, high-availability file access, and S3 pricing reflects this. S3 has been in service longer than Glacier and benefits from a good selection of mature third-party file transfer utilities.
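The automated S3-to-Glacier transfer is driven by a lifecycle rule attached to the bucket. The sketch below only builds a rule document in the shape the S3 lifecycle configuration accepts (Rules, Filter, Status, Transitions); how the rule is applied to a bucket (console, command-line tool, or SDK) is left out. The rule ID and prefix are hypothetical, and the 30-day default is an assumption matching the one-month migration delay described earlier.

```python
import json


def glacier_transition_rule(prefix="", days=30, rule_id="archive-to-glacier"):
    """Build an S3 lifecycle configuration that moves objects to Glacier."""
    return {
        "Rules": [
            {
                "ID": rule_id,  # hypothetical name, any unique label works
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


print(json.dumps(glacier_transition_rule(prefix="snapshot/"), indent=2))
```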



Amazon Primers

Amazon S3 Filesystem Layout

  • Buckets  

    • Top-level container

    • Container for objects

    • Non-hierarchical (no buckets in buckets)

  • Objects

    • Essentially files

    • Have name, permissions.

  • Folders

    • Hierarchical

    • Don’t really exist. Serve as a construct when downloading and visualizing.

    • Internally, just a prefix prepended to object name.

  • Limits

    • Buckets per account are capped by an AWS account quota (historically 100 by default)

    • Unlimited # of objects per bucket

    • Max object size: 5TB
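Because folders are only key prefixes, a "folder listing" is just a filter over flat object names. The following sketch mimics that behavior locally, with no AWS calls; the function name and the sample keys are hypothetical.

```python
def list_folder(keys, prefix="", delimiter="/"):
    """Mimic a delimiter-based listing: return (subfolders, objects) under prefix."""
    subfolders, objects = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        if delimiter in remainder:
            # Anything past the next delimiter collapses into one "folder".
            subfolders.add(prefix + remainder.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(subfolders), sorted(objects)


keys = [
    "snapshot/manifest.txt",
    "snapshot/backup.tar.gz.gpg",
    "diffs/2013-01-01/manifest.txt",
    "readme.txt",
]
print(list_folder(keys))               # (['diffs/', 'snapshot/'], ['readme.txt'])
print(list_folder(keys, "snapshot/"))  # ([], ['snapshot/backup.tar.gz.gpg', 'snapshot/manifest.txt'])
```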

Glacier Filesystem Layout

  • Vault

    • Top-level container, akin to an S3 bucket

  • Archive

    • Roughly akin to an S3 object.

  • Limits: effectively none, for a filesystem of our size (or a filesystem 1000x our size)



How Manifests Work

How S3-to-Glacier works

How to copy to S3


Suggested S3 File Layout

  • The basics

    • One bucket per “app” (e.g. “tdar filesystem”, “postgres”, “jira”, etc.)

    • Snapshot stored in the “snapshot” subfolder

    • Differential backups stored in the “diffs” subfolder

    • Each folder contains:

      • one object containing manifest file(s)

      • one object containing winterized backup + manifest file(s)

Example Layout
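
One possible rendering of the suggested layout as object keys, sketched under assumptions: the bucket name, the file names, and the idea of giving each differential backup its own labeled subfolder under “diffs” are all hypothetical and not settled by this document.

```python
def backup_keys(kind, label=None):
    """Return the (manifest_key, archive_key) pair for one backup folder."""
    if kind == "snapshot":
        folder = "snapshot/"
    elif kind == "diff":
        if not label:
            raise ValueError("a differential backup needs a label, e.g. a date")
        # Assumption: one labeled subfolder per differential backup.
        folder = f"diffs/{label}/"
    else:
        raise ValueError(f"unknown backup kind: {kind!r}")
    # Hypothetical object names for the manifest and the winterized archive.
    return folder + "manifest.txt", folder + "backup.tar.gz.enc"


bucket = "tdar-filestore"  # hypothetical bucket name, one bucket per app
for kind, label in [("snapshot", None), ("diff", "2013-01-08"), ("diff", "2013-01-15")]:
    for key in backup_keys(kind, label):
        print(f"{bucket}/{key}")
```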