tDAR technical overview

tDAR technical overview

Understanding data and data structures within tDAR

Data within tDAR is primarily modeled through the java bean structure, with a series of primary and secondary entities. Primary entities include resources, creators, keywords, and resource collections. Other entities tend to be relationships between primary entities or properties of them. Creators, resources, and keywords are all hierarchical entities which implement inheritance. Inheritance in these cases help us manage common fields, and simplify data management.

Data Model

The tDAR data model (bean model) is built around the needs of expressing and managing data about archaeological information and managing administrative information. At the center of the model is the Resource. Although Resource is not an abstract class, it is never explicitly instantiated – due to some functional requirements with Hibernate and Hibernate Search, it cannot be abstract. Resources are split into two categories, those with files, and those without. Projects, resources without files, exist to help with data management for multiple resources. InformationResource objects, resources with files, exist in a number of forms – Document, Dataset, Image, and supporting formats (Coding Sheet, and Ontology). InformationResource beans may be part of a Project. Resources can be managed and organized through ResourceCollection objects and are described via various Keyword Objects. Resources are also related to Creators (People and Institutions) through both rights and other roles.
Figure 1: tDAR Resource Class Hierarchy
Figure 2: tDAR Keyword Class Hierarchy
The inheritance and relationships are managed by JPA 2.0 and hibernate, as well as within the Java Bean Hierarchy. This affordance is likely necessary in the code, but does complicate some of the hibernate interactions. At the center of the data model are a set of interfaces and static classes that centralize and manage

Serialized Data Models

There are effectively three separate serialized data models for tDAR: the SQL database through hibernate, the Lucene indexes through hibernate search, Java objects through Freemarker, and XML and JSON through JAXB (mainly).
These serializations provide both benefits and complexities to tDAR. There is both the challenge of keeping the data in sync across all representations, but also filtering data that may not be appropriate to that context, or that the user does not have the rights to see.

Hibernate and the Postgres Database

The primary serialization or representation of tDAR is the Hibernate managed Postgres database. We leverage hibernate to manage the persistence and interaction with the database through JPA 2.0. Abstracting as much of the database interaction allows us to avoid as much Postgres specific database knowledge within the application layer. Hibernate nicely supports inheritance models within the database which allows us to manage and map tDAR objects into database objects and back. Hibernate also simplifies management of sets, lists, and object relationships in most places, thus simplifying the code that we have to generate.
The complexity to the tDAR object model does provide some challenges for hibernate. First is the issue of hash codes and identity. Due to how hash codes and identity are managed and generated within the system, there are no unique business keys that most objects can use as hash-code entries. Thus, we've taken the practice of using just the database generated id as the hash code where available. We've done this broadly throughout for consistency, even in a few cases where a hash code could have been generated from other values.
Another complexity is in querying and working with the object graph. The complex interrelationships between objects and object hierarchies in tDAR often mean that basic queries may return too much data, or suffer from N+1 issues. We have addressed these issues in a few ways, first, by using queries that utilize object constructors to produce "sparse" or "skeleton" records, or by severing bi-directional relationships and using secondary queries to populate data (an example of this being the relationship between projects and their resources, where a project does not have a relationship with each of its children, but the child has a relationship to its project). Moving forward, this type of issue represents one of our biggest performance bottlenecks. Some of this may be addressable using fetch profiles or other hibernate specific solutions.
A few other known issues with the database include, a few unique keys that cannot be represented in hibernate properly due to their multi-column nature. Better versioning of the database schema via tools like Liquibase. Better performance in general and better caching to name a few.

Non-Hibernate Data stored within the Database

The data stored within the tdardata database represents a different class of data. This data is loaded from data sets within the system and cannot be properly managed by hibernate because schema are generated on the fly as a data set is loaded and each is not backed by java entities, but instead by simple data objects like lists or sets. This data is managed through Spring's JDBC support and through abstractions in the PostgresDatabase class. The main interaction with this data is for two purposes, simple browsing, and data integration.
A second class of database data that is not managed in hibernate is the PostGIS data. We use PostGIS to perform reverse geolocation within the application. This enables us to allow contributors to draw bounding boxes within the system, and be able to utilize the bounding data when other users search for terms. E.g. a bounding box around the UK would enable a user to search for "England" and find it without that user entering the geographic keyword.
This work (the geolocation) is done locally for a few reasons. 1) privacy and security of the lookup (as the data may be a confidential site location) 2) customization (as we can load in our own shape files to query with content unique to our clients). Currently data loaded into this database includes country, county, state, and continent data, but it might include state and federal lands and other information.

Solr

There are a number of different queries and functions that would be more complex if relegated to the database. These include some of the complex queries built by the advanced search interface, full text queries that use data outside the database, queries that use inheritance, or resources with hierarchical rights assignments, for example. We leverage Solr to manage the Lucene indexes for us and Lucene to run these queries. When a record is saved, we either manually, or use Spring Events to index and the updated bean. Lucene provides some nice benefits in allowing us to transforms the data that's stored and managed in a flat format that is more conducive to many searches.

The flatter  indexes may prevent complex Boolean searches from being possible. Another challenge is that Lucene may not be as good at certain types of data queries, such as these that utilize numeric values. Spatial queries are particularly complex due to the bounding ranges. Lucene also gives us some benefits in being able to use query and data analyzers when working with data, so stemming, synonyms, and special character queries become easier.
One challenge with the database and Lucene split, however is the time between when the database change is made and when the index is updated. For large collections or projects, the permissions or values in the index may take more time to propagate than a second.

Moving forward, we may want to consider moving to a more schemaless solr model and trying to integrate the schema management into the app similar to how we handle liquibase.

Java Objects & Freemarker

The primary serialization model for beans is through Freemarker. Through struts, the beans the getters exposed on the controller are put onto the object stack for Freemarker. Freemarker can then iterate over and interact with the beans that are visible to populate XML or HTML as necessary. This model works quite nicely in many cases for us as it exposes and manages the beans elegantly. One challenge is that too much information can be exposed to the Freemarker layer in some cases. For example, if data needs to be obfuscated, such as Latitude or Longitude data, or if other information needs to be applied so the Freemarker layer can determine what to render. Ideally, model objects could be pruned, or managed prior to exposure to the Freemarker layer.

XML & JSON

tDAR uses XML and JSON for serialization of data to internal and external sources. Originally, tDAR used xStream and Json-lib to manage serialization of data to XML and JSON respectively. Over time, we've removed xStream and replaced it with JAXB, and have moved away from Json-lib in place of Jackson, though more work needs to be done here.
XML Serialization was primarily for internal use and re-use, it facilitated messaging and transfer of data between different parts of the system. XML serialization also allowed for logging of complex objects such as data integration, record serialization for messaging, record serialization into the filestore as an archival representation, and the import API. JSON serialization was mainly for use and re-use of tDAR data by the JavaScript layer of the tDAR software.
Moving forward, there are a number of challenges to approach with XML and JSON serialization. As we move toward pure JAXB serialization of data for both JSON and XML, tDAR must tackle the fact that the different serializations have different data requirements. While the internal record serialization should contain all fields and all values, JSON serialization may want to or need to filter out data such as email addresses or personally identifiable information. XML that may be useful in a full record serialization may not be appropriate for data import (transient values, for example). Another challenge is maintaining the XML schema versioning in line with the data model changes as both need to be revised at the same time, and changes to the schema cause backwards compatibility issues with the XML in the filestore.
One option to tackle some of these issues might be to implement Jackson's serialization profiles for different formats to serialize the same JSON. Similarly, MoXy may be a better JAXB implementation than the default.

Service Layer

Spring & Autowiring

tDAR uses the Spring Integration framework to assist in management of dependencies and services throughout the application. In many case, these dependencies and services are simply academic in their "interchangeability." That is, most of our Dao's and services are pretty implementation specific, though we have swapped a few out over time. Spring mainly allows us to configure new services and existing services quickly and easily. As the application has grown however, one of the challenges is that our services are dependent on other services. Although we've attempted to keep this to a minimum, it does add complexity. A few different solutions have been chosen (1) using getters to access the shared service (e.g. getting the AuthenticationAndAuthorizationService from the ObfuscationService); (2) moving shared components into a new service with no dependencies; and, (3) moving logic into the Dao layer if necessary. As tDAR continues to grow, some of the autowiring logic may need to be reviewed and revisited.
tDAR does have a few "configurable" services that allow it to be extended to different working environments. These are built around external dependencies, specifically authentication and authorization through the "AuthenticationAndAuthorizationService" and DOI generation via the DOIService. These services has multiple Daos to back their features. For DOIs, this is one for EZID and one for ANDS. For authentication and authorization, we utilize a number of DAOs, different methods for connecting to Atlassian's Crowd, LDAP, and a few local "test" setups.
The final use for autowiring is for tracking all implementations of various interfaces. The primary example being the Workflows. This use allows us to make sure that all implemented Tasks, and Workflows are wired in together without having to explicitly manage the instances ourselves.

Hibernate (Dao & Service Layers)

We have tried to adhere to the MVC model as closely as possible with a full Service and Dao layer to back up data access and management. tDAR is constructed using the Spring Integration framework, which allows for both hard-wiring and dynamic backing of services and dao layers into the system, thus allowing greater extensibility in the future. As the project has grown, the Service and Dao have grown in complexity and undergone a series of refactors to manage complexity, and to help with the IOC/Autowiring.
The goal of the first refactor, probably the most significant, was to manage duplication in the Dao and service layers. The challenge was maintaining duplicate, yet common methods for each of the services that backed each of the bean types; an additional challenge was that every bean required a Dao and Service regardless of whether it required unique functionality beyond some basic methods. This duplication enabled the potential for bugs to be increased whenever new beans were added. The first refactor adjusted the data and service layers to develop the GenericDao and GenericService objects. These provided a few distinct functions: a) they centralized all common functionality in a class that could be sub-classed or called by other Dao and Service layer objects, and (b) as they required a class to be passed to common methods like find(), they removed the need to create specific Service or Dao classes to cover common functionality, finally, c) as these had no dependencies, they alleviated a number of autowiring issues with Spring.
One challenge of this generic model, and how the tDAR Service and Dao layer are setup is that as they were migrated to use generics, flow of control became a bit more complicated. Specifically, the Dao layer can be used and re-used for different bean types. This becomes important and complicating as different beans may have different requirements or functionality around some basic methods. Two specific examples are those beans that implement the HasResource interface, and InformationResourceFileVersions. These two types of beans have unique deletion methods which require specific cases handled in the GenericDao.delete() method to override the default behavior of simply calling hibernate's delete method. We have attempted to maintain as little of this logic as possible.
In working with hibernate in general, we've attempted to keep as standard as possible, and as close to the JPA 2.0 standard as we can. A few additional issues or non-standard behaviors continue to remain within the system including: using the session vs. entity manager, using the database id for hashCode / equality, maintaining a few bidirectional relationships between entities (between InformationResourceFile and InformationResource, for example), and maintaining some database uniqueness keys that hibernate cannot completely manage (InformationResourceFileVersion).
A final challenge of Hibernate are the performance aspects of its inheritance model. Hibernate and JPA2 provides a few methods for dealing with inheritance and how it gets mapped to tables. Based on the distribution of data and our data model, we've chosen a model that uses the "joined" inheritance where subclasses have their own tables but they only contain fields unique to that subclass. This reduces the duplication in the schema and simplifies the model in some ways, but based on how hibernate performs queries, often unnecessarily binds many tables and may perform slow queries because it returns too much data or deals with locks of other tables. We have used a few methods to avoid this where possible including using "projection" in HQL or other queries, or trying to simplify queries. We are currently investigating some other methods such as FetchProfiles, more complex reflection, caching and views.

Query Builder DSL

The tDAR query builder DSL is pretty simple and maintains two different functions (a) to help with the generation of queries and maintaining of the FieldQueryParts and FieldGroups; and (b) to allow for the overriding of field analyzers at runtime to allow the incoming data to be analyzed differently from the data already stored in the index.

Search Query DSL

The tDAR query builder DSL functions as wrapper around Lucene in a different way than the Hibernate Search query builder DSL. Our DSL represents fields and groups as classes and allows them to be combined to create queries using the underlying lucene syntax. Besides basic field queries, more complex field objects exists to represent either "ranged" queries, to handle values with just IDs, objects, or complex query parts. A few examples:

  • FieldQueryPart – represents a generic field query part that can be used for any "simple" field, such as a string of enum

  • GeneralSearchResourceQueryPart – represents the general search for a resource and combine weighted searches on various fields to return a relevant result.

  • HydratableQueryPart – integrates a keyword, person, or any persistable that might simply have an ID being passed in from the controller and "hydrates" the value before performing the search.

  • SpatialQueryPart – enables latitude longitude box searches within the spatial data. It specifically handles edge cases around the International Date Line and issues of precision (scale) of the request – that is, if you draw a box around a specific region of Tuscon, you would get data filtered out of the search results for bounding boxes that cover the entire world or country.

The tDAR Filestore

When a file is uploaded to tDAR, it is automatically validated, inspected, processed, and stored on the file-system. The workflow chosen is unique to the type of file being stored, and the type of tDAR resource (Image, Document, etc.) to which the file belongs. A workflow commonly follows these steps:

  1. The original is stored within the filestore,

  2. the inspection process identifies the specific workflow,

  3. then the specific workflow is run as tDAR:

    1. inspects and validates files,

    2. creates derivatives for display or re-use,

    3. extracts metadata for further analysis,

    4. indexes text within the files, and finally

    5. additional metadata and generate files are stored within the filestore and logged.

tDAR's file storage and management model is heavily influenced by the California Digital Library's Micro-Services model. Data is stored on the file-system in a pre-determined structure described as a PairTree filestore https://confluence.ucop.edu/display/Curation/PairTree. The filestore maintains archival copies of all of the data and metadata in tDAR. This organization allows us to map any data stored within the Postgres database that supports the application's web interface with the data stored on the file-system, while also partitioning data on the file-system into manageable chunks Technically, the user interface is driven by the Postgres database and a set of Lucene indexes for search and storing data. The resource IDs (document, data set, etc.) are the keys to the Pairtree store. When a resource is saved or modified, the store is updated, keeping data in sync.. Each branch of the filestore is a folder for each record, "rec/" illustrated in Figure 2, below. Data associated with each tDAR record is stored in a structure inspired by the D-Flat https://confluence.ucop.edu/display/Curation/D-flat convention ensuring a consistent organization of the archival record.

Figure 2: tDAR Filestore Visualized

tDAR filestore visualized
/home/tdar/filestore/36/67/45$ tree --- rec/ (1) |-- record.2013-02-12--19-44-32.xml |-- record.2013-03-11--08-00-41.xml |-- record.2013-03-11--08-01-12.xml (2) |--- 7134/ (3) | | --- v1/ (4) | |-- aa-volume-376-no3.pdf |-- aa-volume-376-no3.pdf.MD5 (5) | |-- deriv/ |-- aa-volume-376-no3_lg.jpg |-- aa-volume-376-no3_md.jpg |-- aa-volume-376-no3_sm.jpg |-- aa-volume-376-no3.pdf.txt |-- log.xml More generally a path might look like: /home/tdar/filestore/resource id/file id/version/

Note: the numbered items in the figure above map to the numbered items in the list below.  Important terms referring back to parts of the figure are highlighted using a distinct font face.

Each section numbered in Figure 2 represents part of the tDAR Record and parts of the OAIS Model’s Submission Information Package (SIP), Dissemination Information Package (DIP), and Archival Information Package (AIP):

  1. XML representations of each metadata record are stored at the base of each record directory. They are dated and time-stamped to allow for multiple versions.  These are automatically generated when a user saves a file and provide a backup in case of a database error, and versions to see changes over time.

  2. Sub-folders are created for each file a user associated with that record using the file id as a folder name. 

  3. Within each folder associated with a “file”, is another folder for successive versions of a file – thus when/if a file is replaced, it is provisioned a new version number eg. v1.

  4. Within each “version” folder is the original uploaded version of the file along with the MD5 checksum for that file.  This MD5 is also stored in tDAR's metadata database in order to perform routine integrity checks on the file.

  5. Finally, a derivatives “deriv“ folder maintains additional supporting files. Each derivative could theoretically be generated or re-generated as needed, but we decided that improved performance is worth the cost of storing derivatives.   The derivatives include:

    1. 3 separate thumbnails (small, medium, large) for each document or image or other resources for which a thumbnail would be useful, for use in various displays in tDAR.

    2. Extracted metadata from the document header, for indexing and other purposes.

    3. Translated versions of data sets using coding sheets and ontologies

    4. Extracted text for full-text indexing for documents, data sets, or other files, for faster reindexing, which occurs when a record is saved or other points.

    5. Other files as needed