

...

The tDAR data model (bean model) is built around two needs: expressing and managing data about archaeological information, and managing administrative information. At the center of the model is the Resource. Resource is never explicitly instantiated, but it cannot be declared abstract because of functional requirements of Hibernate and Hibernate Search. Resources are split into two categories: those with files and those without. Projects, which are resources without files, exist to help with data management for multiple resources. InformationResource objects, which are resources with files, exist in a number of forms: Document, Dataset, Image, and supporting formats (Coding Sheet and Ontology). InformationResource beans may be part of a Project. Resources can be managed and organized through ResourceCollection objects and are described via various Keyword objects. Resources are also related to Creators (People and Institutions) through both rights and other roles.
Figure 1: tDAR Resource Class Hierarchy
Figure 2: tDAR Keyword Class Hierarchy
The inheritance and relationships are managed by JPA 2.0 and Hibernate, as well as within the Java bean hierarchy. This affordance is likely necessary in the code, but it does complicate some of the Hibernate interactions. At the center of the data model is a set of interfaces and static classes that centralize and manage

Serialized Data Models

There are effectively four separate serialized data models for tDAR: the SQL database through Hibernate, the Lucene indexes through Hibernate Search, Java objects through Freemarker, and XML and JSON through JAXB (mainly).
These serializations bring both benefits and complexities to tDAR. The challenges are keeping the data in sync across all representations and filtering out data that is not appropriate to a given context or that the user does not have the rights to see.

...

tDAR's file storage and management model is heavily influenced by the California Digital Library's Micro-Services model. Data is stored on the file-system in a pre-determined structure described as a PairTree filestore (https://confluence.ucop.edu/display/Curation/PairTree). The filestore maintains archival copies of all of the data and metadata in tDAR. This organization allows us to map any data stored within the Postgres database that supports the application's web interface to the data stored on the file-system, while also partitioning data on the file-system into manageable chunks. Technically, the user interface is driven by the Postgres database and a set of Lucene indexes for searching and storing data. The resource IDs (document, data set, etc.) are the keys to the PairTree store. When a resource is saved or modified, the store is updated, keeping the data in sync. Each branch of the filestore is a folder for one record, "rec/", illustrated in Figure 2, below. Data associated with each tDAR record is stored in a structure inspired by the D-Flat convention (https://confluence.ucop.edu/display/Curation/D-flat), ensuring a consistent organization of the archival record.

Figure 2: tDAR Filestore Visualized

/home/tdar/filestore/36/67/45$ tree
--- rec/
(1)  |-- record.2013-02-12--19-44-32.xml
     |-- record.2013-03-11--08-00-41.xml
     |-- record.2013-03-11--08-01-12.xml
(2)  |-- 7134/
(3)      |-- v1/
(4)          |-- aa-volume-376-no3.pdf
             |-- aa-volume-376-no3.pdf.MD5
(5)          |-- deriv/
                 |-- aa-volume-376-no3_lg.jpg
                 |-- aa-volume-376-no3_md.jpg
                 |-- aa-volume-376-no3_sm.jpg
                 |-- aa-volume-376-no3.pdf.txt
                 |-- log.xml
More generally, a path might look like:
/home/tdar/filestore/resource id/file id/version/
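
The mapping from a resource ID to a filestore directory can be sketched as follows. This is a minimal illustration in Python, not tDAR's own code. It assumes the numeric resource ID is split into two-character segments, as the example directory 36/67/45 suggests, and it follows the listing above in placing the file folder beneath "rec/". The function names pairtree_dir and version_dir are hypothetical.

```python
from pathlib import PurePosixPath


def pairtree_dir(root: str, resource_id: int) -> PurePosixPath:
    """Return the directory for a resource (assumed two-character segments)."""
    digits = str(resource_id)
    # "366745" -> ["36", "67", "45"]; an odd-length ID leaves a final
    # one-character segment.
    segments = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return PurePosixPath(root).joinpath(*segments)


def version_dir(root: str, resource_id: int, file_id: int, version: int) -> PurePosixPath:
    """Return the folder holding one version of one file of a resource.

    Assumption: the file folder sits beneath "rec/", as in the listing.
    """
    return pairtree_dir(root, resource_id) / "rec" / str(file_id) / f"v{version}"


print(version_dir("/home/tdar/filestore", 366745, 7134, 1))
```

Under these assumptions, resource 366745 with file 7134 at version 1 resolves to the v1/ folder shown in the listing.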

...

  1. XML representations of each metadata record are stored at the base of each record directory. They are dated and time-stamped to allow for multiple versions. These are automatically generated when a user saves a file, and they provide both a backup in case of a database error and a history of changes over time.
  2. Sub-folders are created for each file a user associates with that record, using the file id as the folder name.
  3. Within each folder associated with a “file” is another folder for each successive version of the file. When a file is replaced, it is given a new version number, e.g., v1.
  4. Within each “version” folder is the original uploaded version of the file along with the MD5 checksum for that file. This MD5 is also stored in tDAR's metadata database in order to perform routine integrity checks on the file.
  5. Finally, a derivatives folder, “deriv”, maintains additional supporting files. Each derivative could theoretically be generated or re-generated as needed, but we decided that improved performance is worth the cost of storing derivatives. The derivatives include:
    1. Three separate thumbnails (small, medium, large) for each document, image, or other resource for which a thumbnail would be useful, for use in various displays in tDAR.
    2. Extracted metadata from the document header, for indexing and other purposes.
    3. Translated versions of data sets using coding sheets and ontologies.
    4. Extracted text from documents, data sets, or other files, for full-text indexing. Storing it speeds up reindexing, which occurs when a record is saved and at other points.
    5. Other files as needed
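
The time-stamped record names in item 1 follow a pattern that can be read off the listing: record.YYYY-MM-DD--HH-MM-SS.xml. The sketch below, in Python rather than tDAR's own code, builds and parses such names and picks the most recent one. The helper names are hypothetical, and the format string is inferred from the example file names only.

```python
from datetime import datetime

# Timestamp layout inferred from names such as record.2013-02-12--19-44-32.xml
STAMP = "%Y-%m-%d--%H-%M-%S"
PREFIX, SUFFIX = "record.", ".xml"


def record_filename(saved_at: datetime) -> str:
    """Build the file name for a metadata record saved at the given time."""
    return f"{PREFIX}{saved_at.strftime(STAMP)}{SUFFIX}"


def parse_record_filename(name: str) -> datetime:
    """Recover the save time from a record file name."""
    if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
        raise ValueError(f"not a record file: {name}")
    return datetime.strptime(name[len(PREFIX):-len(SUFFIX)], STAMP)


def latest_record(names: list[str]) -> str:
    """Return the name of the most recently saved record version."""
    return max(names, key=parse_record_filename)
```

Applied to the three record files in the listing, latest_record would select record.2013-03-11--08-01-12.xml.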

...

As tDAR's functionality expands and the number of files and formats grows, these workflows will need to become more flexible and more capable.

Fixity Checks and File Integrity

In addition to the workflows associated with the ingest of files, tDAR performs recurring integrity checks on these files to ensure that they have not been corrupted or otherwise altered from their original form at the time of ingest. At the time of ingest, the system records the MD5 checksum of the file in tDAR's metadata database. tDAR then routinely confirms the fixity of these files by comparing each file's current MD5 checksum against the MD5 value recorded in the tDAR metadata database. Any discrepancies are recorded and reported to Digital Antiquity staff.
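
The comparison described above can be sketched in a few lines. This is an illustrative Python example, not tDAR's implementation: the recorded checksums are passed in as a plain dictionary standing in for the metadata database, and the function names md5_of and find_fixity_failures are hypothetical. Treating a missing file as a failure is also an assumption.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_fixity_failures(recorded: dict[Path, str]) -> list[Path]:
    """Return the files whose current MD5 differs from the recorded value.

    `recorded` maps each file to the checksum stored at ingest. A file that
    no longer exists is reported as a failure as well (an assumption).
    """
    failures = []
    for path, expected in recorded.items():
        if not path.is_file() or md5_of(path) != expected.lower():
            failures.append(path)
    return failures
```

A scheduled job could call find_fixity_failures with the stored checksums and report any returned paths to staff.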

Web Layer – Struts2, Freemarker, JavaScript, CSS, and HTML

...

AbstractPersistableController


Figure 3: tDAR AbstractPersistableController Class Hierarchy
The AbstractPersistableController is probably the most complex structure within the tDAR controller infrastructure. It attempts to manage and simplify CRUD (Create / Read / Update / Delete) actions within tDAR by centralizing most of the logic and flow, and by allowing stub methods to be overridden by subsequent controllers to adjust the workflow as needed. The controller breaks actions down into the following general process:

...
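
The idea of a fixed flow with overridable stub methods is the template-method pattern. The sketch below shows the pattern in Python. It is not the AbstractPersistableController, and it does not reproduce its actual steps: the class names, the hook names before_save and after_save, and the log messages are all hypothetical.

```python
class PersistableControllerSketch:
    """Base controller: fixes the order of a save; subclasses fill in hooks."""

    def save(self, record: dict) -> list[str]:
        log: list[str] = []
        self.before_save(record, log)          # stub, may be overridden
        log.append(f"saved {record['title']}")  # shared logic lives here
        self.after_save(record, log)           # stub, may be overridden
        return log

    def before_save(self, record: dict, log: list[str]) -> None:
        """Default stub: do nothing."""

    def after_save(self, record: dict, log: list[str]) -> None:
        """Default stub: do nothing."""


class DocumentControllerSketch(PersistableControllerSketch):
    """Adjusts the workflow by overriding only the hook it needs."""

    def after_save(self, record: dict, log: list[str]) -> None:
        log.append("queued full-text extraction")
```

The base class owns the sequence, so a subclass changes behavior without duplicating the shared save logic.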

The other major controller hierarchy is the SearchResultHandler interface and AbstractLookupController hierarchy. The abstract class and interface are intended to standardize and centralize how tDAR interacts with both search results and the SearchService. The interface provides standard names for the parameters that support searching, including pagination for the end-user interface. It also provides a standard interface between search controllers and the search service for managing common parameters such as the query, results, and sorting, among others.
Over time, we've begun to use the SearchParameters and ReservedSearchParameters classes to assist in the creation and management of queries within the system as well. These helper classes assist in the generation of Boolean search queries by collecting the objects for us, so that we do not have to generate groups of fields manually. The objects themselves came out of refactoring the AdvancedSearchController to handle generic Boolean searches, but they have also helped simplify the logic.
Figure 4: AbstractLookupController Hierarchy
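
To illustrate what collecting values and generating grouped Boolean queries can look like, here is a small Python sketch. It is not the SearchParameters class: the class name, the add and to_query methods, the field names, and the Lucene-style output syntax are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SearchParametersSketch:
    """Collects values per field, then emits one grouped Boolean query."""

    operator: str = "AND"  # how the per-field groups are combined
    terms: dict[str, list[str]] = field(default_factory=dict)

    def add(self, field_name: str, value: str) -> "SearchParametersSketch":
        """Record one value for a field; returns self so calls can chain."""
        self.terms.setdefault(field_name, []).append(value)
        return self

    def to_query(self) -> str:
        """Values for one field are OR-ed; the field groups are joined by
        the operator. No escaping of special characters is attempted."""
        groups = []
        for name, values in self.terms.items():
            alternatives = " OR ".join(f'{name}:"{value}"' for value in values)
            groups.append(f"({alternatives})")
        return f" {self.operator} ".join(groups)


params = SearchParametersSketch().add("title", "pottery").add("title", "ceramic")
params.add("resourceType", "DOCUMENT")
print(params.to_query())
```

The caller only adds values; the grouping and joining happen in one place, which is the simplification the helper classes are described as providing.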

Asynchronous Actions

Because some actions are complex, a number of controllers have asynchronous actions associated with them. A few are interactive, while most are not. These asynchronous actions are associated with long-running tasks such as indexing, re-indexing, and loading or processing of data. Asynchronous data processing is done through two different models depending on the result.

...

  • TDAR.common: Common functions and utilities that are utilized on most pages in tDAR, and low-level functionality utilized by the other TDAR components
  • TDAR.advancedSearch: functionality related to tDAR's "Advanced Search" page.
  • TDAR.autocomplete: provides the functionality for "autocomplete" form fields.
  • TDAR.contexthelp: enables context-sensitive help pop-ups on various tDAR forms.
  • TDAR.datatable: extends the jQuery DataTable plugin and allows it to be more easily used in conjunction with tDAR-specific data.
  • TDAR.fileupload: extends the jQuery File Upload plugin and enables validation rules on the types of files and file names that users may upload to tDAR.
  • TDAR.integration: support for tDAR's dataset integration UI.
  • TDAR.maps: enables Google Maps support and provides a UI that allows users to designate map boundaries for tDAR resources.
  • TDAR.pricing: support for tDAR's pricing page UI.
  • TDAR.repeatrow: enables support for multi-valued data-entry in tDAR forms.

...


Concerns & Potential Pitfalls

...