The tDAR data model (bean model) is built around the needs of expressing and managing data about archaeological information as well as administrative information. At the center of the model is the Resource. Although Resource is never explicitly instantiated, it is not an abstract class; functional requirements of Hibernate and Hibernate Search prevent it from being abstract. Resources are split into two categories: those with files and those without. Projects, resources without files, exist to help with data management across multiple resources. InformationResource objects, resources with files, come in a number of forms: Document, Dataset, Image, and supporting formats (Coding Sheet and Ontology). InformationResource beans may be part of a Project. Resources can be managed and organized through ResourceCollection objects and are described via various Keyword objects. Resources are also related to Creators (People and Institutions) through both rights assignments and other roles.
Figure 1: tDAR Resource Class Hierarchy
Figure 2: tDAR Keyword Class Hierarchy
The inheritance and relationships are managed by JPA 2.0 and Hibernate, as well as within the Java bean hierarchy. This duplication is likely necessary in the code, but it does complicate some of the Hibernate interactions. At the center of the data model is a set of interfaces and static classes that centralize and manage common behavior.
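
As a rough, condensed sketch of the hierarchy described above (fields and annotations are heavily abbreviated; the real beans carry far more), the core entities look something like this:

Code Block
title: Sketch of the Resource hierarchy (illustrative)
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;
import javax.persistence.ManyToOne;

// Condensed, illustrative sketch of the hierarchy in Figure 1; the real
// beans carry many more fields, annotations, and shared interfaces.
@Entity
@Inheritance(strategy = InheritanceType.JOINED) // the "joined" strategy discussed below
class Resource {
    @Id
    private Long id;
    // keyword, collection, and creator/rights relationships ...
}

@Entity
class Project extends Resource {
    // a resource without files; helps manage groups of related resources
}

@Entity
class InformationResource extends Resource {
    // a resource with files; may belong to a Project
    @ManyToOne
    private Project project;
}

@Entity
class Document extends InformationResource {
}

@Entity
class Dataset extends InformationResource {
}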

Serialized Data Models

There are effectively four separate serialized data models for tDAR: the SQL database through Hibernate, the Lucene indexes through Hibernate Search, Java objects through Freemarker, and XML and JSON through JAXB (mainly).
These serializations bring both benefits and complexities to tDAR. There is the challenge of keeping the data in sync across all representations, as well as of filtering data that may not be appropriate to a given context or that the user does not have the rights to see.

...

The data stored within the tdardata database represents a different class of data. This data is loaded from data sets within the system and cannot be properly managed by Hibernate because schemas are generated on the fly as each data set is loaded; the tables are not backed by Java entities but by simple data objects like lists or sets. This data is managed through Spring's JDBC support and through abstractions in the PostgresDatabase class. The main interaction with this data is for two purposes: simple browsing and data integration.
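
Because these generated tables have no entity mappings, access goes through plain JDBC. A minimal sketch of the style of access the PostgresDatabase class wraps (class and method names here are hypothetical):

Code Block
title: Browsing a dynamically generated dataset table (illustrative)
import java.util.List;
import java.util.Map;

import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical sketch: browsing rows of a dynamically created dataset table.
// Because table schemas are generated at load time, results come back as
// generic maps rather than mapped entities.
public class DatasetBrowserSketch {
    private final JdbcTemplate jdbcTemplate;

    public DatasetBrowserSketch(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public List<Map<String, Object>> browse(String generatedTableName, int limit) {
        // the table name is produced internally at import time, not user input
        return jdbcTemplate.queryForList(
                "SELECT * FROM " + generatedTableName + " LIMIT ?", limit);
    }
}
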
A second class of database data that is not managed in Hibernate is the PostGIS data. We use PostGIS to perform reverse geolocation within the application. This enables contributors to draw bounding boxes within the system and lets us utilize that bounding data when other users search for terms. For example, a bounding box around the UK would enable a user to search for "England" and find the record even though the contributor never entered that geographic keyword.
This geolocation work is done locally for a few reasons: (1) privacy and security of the lookup, as the data may be a confidential site location; and (2) customization, as we can load our own shape files to query with content unique to our clients. Currently the data loaded into this database includes country, county, state, and continent data, but it may eventually include state and federal lands and other information.
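
A hedged sketch of what such a reverse-geolocation lookup might look like; the table and column names are illustrative, not tDAR's actual schema:

Code Block
title: Reverse geolocation against PostGIS (illustrative)
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical reverse-geolocation lookup: find named regions whose geometry
// intersects a contributor-drawn bounding box (WGS84 / SRID 4326 assumed).
public class ReverseGeoLookupSketch {
    private final JdbcTemplate jdbcTemplate;

    public ReverseGeoLookupSketch(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public List<String> regionNamesFor(double minLng, double minLat,
            double maxLng, double maxLat) {
        return jdbcTemplate.queryForList(
                "SELECT name FROM regions "
                        + "WHERE ST_Intersects(geom, ST_MakeEnvelope(?, ?, ?, ?, 4326))",
                String.class, minLng, minLat, maxLng, maxLat);
    }
}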

...

Solr

There are a number of queries and functions that would be more complex if relegated to the database. These include some of the complex queries built by the advanced search interface, full-text queries that use data outside the database, queries that use inheritance, and queries over resources with hierarchical rights assignments, for example. We leverage Hibernate Search to manage the Lucene indexes for us and Lucene to run these queries. Hibernate Search allows us to tie into the persistence cycle of objects in the database to keep the database and Lucene indexes in sync. When a record is saved, the updated bean is indexed, either manually or via Spring events. Lucene provides some nice benefits in allowing us to transform the data into a flat stored format that is more conducive to many searches. Hibernate Search also enables us to perform a Lucene search and return hydrated beans from the database. As with all things, there are performance considerations: if we call saveOrUpdate() multiple times, the record may be reindexed multiple times, hence we sometimes have to block index calls. Alternately, transient data may not get updated.
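
For example, a raw Lucene query run through Hibernate Search returns managed entities rather than index documents. A minimal sketch against the Lucene 3.x-era Hibernate Search API (the field name is illustrative):

Code Block
title: Lucene query returning hydrated beans (illustrative)
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

// Sketch: run a Lucene query through Hibernate Search and get hydrated
// Resource beans back, rather than raw index documents.
public class ResourceSearcherSketch {
    @SuppressWarnings("unchecked")
    public List<Resource> findByTitle(Session session, String phrase) throws ParseException {
        FullTextSession fullTextSession = Search.getFullTextSession(session);
        QueryParser parser = new QueryParser(Version.LUCENE_36, "title",
                new StandardAnalyzer(Version.LUCENE_36));
        // Lucene identifies matching documents; Hibernate Search then loads
        // the corresponding entities through the normal Hibernate session.
        return fullTextSession.createFullTextQuery(parser.parse(phrase), Resource.class).list();
    }
}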

The flatter indexes may prevent some complex Boolean searches from being possible. Another challenge is that Lucene may not be as good at certain types of data queries, such as those that utilize numeric values. Spatial queries are particularly complex due to the bounding ranges. Lucene also gives us benefits in being able to use query and data analyzers when working with data, so stemming, synonyms, and special-character queries become easier.
One challenge with the database and Lucene split, however, is the lag between when a database change is made and when the index is updated. For large collections or projects, the permissions or values in the index may take more than a second to propagate.

Moving forward, Hibernate Search now provides query builders, spatial search, and a number of other improvements that may simplify some of the logic we've built ourselves. Alternately, we need to investigate whether we're making the best use of what is stored within these indexes, whether we might benefit from performing some of these operations in the database instead of Lucene, or whether re-architecting how Hibernate Search and Hibernate retrieve the data from the database might provide better performance. Another significant challenge is the total number of fields we're storing in Lucene and whether all are truly necessary for operation. We may also want to consider moving to a more schemaless Solr model and integrating the schema management into the application, similar to how we handle Liquibase.

Java Objects & Freemarker

...

We have tried to adhere to the MVC model as closely as possible, with full Service and Dao layers to back up data access and management. tDAR is constructed using the Spring framework, which allows for both hard-wiring and dynamic backing of the service and dao layers, thus allowing greater extensibility in the future. As the project has grown, the Service and Dao layers have grown in complexity and undergone a series of refactors to manage that complexity and to help with IoC/autowiring.
The goal of the first refactor, probably the most significant, was to reduce duplication in the Dao and service layers. The challenge was maintaining duplicate yet common methods for each of the services that backed each of the bean types; an additional challenge was that every bean required a Dao and Service regardless of whether it needed unique functionality beyond some basic methods. This duplication increased the potential for bugs whenever new beans were added. The first refactor adjusted the data and service layers to introduce the GenericDao and GenericService objects. These provided a few distinct functions: (a) they centralized all common functionality in a class that could be sub-classed or called by other Dao and Service layer objects; (b) as they required a class to be passed to common methods like find(), they removed the need to create specific Service or Dao classes to cover common functionality; and (c) as they had no dependencies, they alleviated a number of autowiring issues with Spring.
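
A condensed sketch of the GenericDao idea (signatures illustrative, not tDAR's exact API):

Code Block
title: GenericDao pattern (illustrative)
import java.io.Serializable;

import org.hibernate.Session;
import org.hibernate.SessionFactory;

// Sketch of the GenericDao pattern: common CRUD methods take the target
// class as a parameter, so most beans need no dedicated Dao subclass.
public class GenericDaoSketch {
    private final SessionFactory sessionFactory;

    public GenericDaoSketch(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    private Session session() {
        return sessionFactory.getCurrentSession();
    }

    @SuppressWarnings("unchecked")
    public <T> T find(Class<T> cls, Serializable id) {
        return (T) session().get(cls, id);
    }

    public void saveOrUpdate(Object entity) {
        session().saveOrUpdate(entity);
    }

    public void delete(Object entity) {
        // special cases (HasResource beans, InformationResourceFileVersion)
        // are handled here before falling through to Hibernate's delete
        session().delete(entity);
    }
}
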
One challenge of this generic model, and of how the tDAR Service and Dao layers are set up, is that as they were migrated to use generics, flow of control became more complicated. Specifically, the Dao layer can be used and re-used for different bean types. This becomes important, and complicating, because different beans may have different requirements or functionality around some basic methods. Two specific examples are the beans that implement the HasResource interface and InformationResourceFileVersions. These two types of beans have unique deletion methods that require specific cases in the GenericDao.delete() method to override the default behavior of simply calling Hibernate's delete method. We have attempted to keep as little of this logic as possible.
In working with Hibernate in general, we've attempted to stay as standard as possible and as close to the JPA 2.0 specification as we can. A few additional issues or non-standard behaviors remain within the system, including: using the session vs. the entity manager, using the database id for hashCode / equality, maintaining a few bidirectional relationships between entities (between InformationResourceFile and InformationResource, for example), and maintaining some database uniqueness keys that Hibernate cannot completely manage (InformationResourceFileVersion).
A final challenge of Hibernate is the performance cost of its inheritance model. Hibernate and JPA 2.0 provide a few strategies for dealing with inheritance and how it gets mapped to tables. Based on the distribution of data in our data model, we've chosen the "joined" inheritance strategy, where subclasses have their own tables that contain only the fields unique to that subclass. This reduces duplication in the schema and simplifies the model in some ways, but based on how Hibernate performs queries, it often unnecessarily joins many tables and may produce slow queries because it returns too much data or contends for locks on other tables. We have used a few methods to avoid this where possible, including using "projection" in HQL or other queries, or simplifying queries. We are currently investigating other methods such as FetchProfiles, more complex reflection, caching, and views.
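
For instance, projecting just the needed columns in HQL avoids hydrating the full joined hierarchy. A small sketch (field names are illustrative):

Code Block
title: HQL projection to avoid wide joins (illustrative)
import java.util.List;

import org.hibernate.Session;

// Sketch: select only the columns needed rather than hydrating full,
// joined entity graphs. Field names are illustrative.
public class ProjectionExampleSketch {
    @SuppressWarnings("unchecked")
    public List<Object[]> idAndTitle(Session session) {
        return session.createQuery(
                "select r.id, r.title from Resource r").list();
    }
}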

...

Query Builder DSL

tDAR's use of the Hibernate Search API predates a number of major enhancements to that API, including faceting, the query builder DSL, geospatial searching and indexing, and full-text (Tika) indexing, among others. Because of this, a number of major infrastructure components were built within the system that may be duplicative of the Hibernate Search APIs.

The tDAR query builder DSL is fairly simple and serves two functions: (a) to help with the generation of queries and the maintenance of FieldQueryParts and FieldGroups; and (b) to allow field analyzers to be overridden at runtime so that incoming data can be analyzed differently from the data already stored in the index.

...

Analyzers

tDAR has a number of custom analyzers and also overrides the default analyzer for Hibernate Search. These analyzers handle a number of issues for us (a sketch of an illustrative definition follows the list):

  • Proper formatting of numbers, dates, and latitude/longitude values for indexing and querying
  • Tokenization or non-tokenization of field values into keywords or phrases
  • ASCII normalization of special characters
  • Edge N-GRAM tokenization for auto-complete lookups
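
As a sketch of how such an analyzer could be declared with Hibernate Search's @AnalyzerDef (the name, filter chain, and gram sizes are illustrative, not tDAR's actual definitions):

Code Block
title: Edge n-gram auto-complete analyzer (illustrative)
import org.apache.solr.analysis.ASCIIFoldingFilterFactory;
import org.apache.solr.analysis.EdgeNGramFilterFactory;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

// Illustrative auto-complete analyzer: tokenize, lowercase, fold special
// characters to ASCII, then emit edge n-grams for prefix matching.
@AnalyzerDef(name = "autocomplete",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
                @TokenFilterDef(factory = EdgeNGramFilterFactory.class, params = {
                        @Parameter(name = "minGramSize", value = "2"),
                        @Parameter(name = "maxGramSize", value = "20") })
        })
public class AutocompleteAnalyzerSketch {
}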

Search Query DSL

The tDAR search query DSL functions as a wrapper around Lucene in a different way than the Hibernate Search query builder DSL. Our DSL represents fields and groups as classes and allows them to be combined to create queries using the underlying Lucene syntax. Besides basic field queries, more complex field objects exist to represent "ranged" queries and to handle values with just IDs, objects, or complex query parts. A few examples (a toy composition sketch follows the list):

  • FieldQueryPart – represents a generic field query part that can be used for any "simple" field, such as a string or enum
  • GeneralSearchResourceQueryPart – represents the general search for a resource and combines weighted searches on various fields to return relevant results.
  • HydratableQueryPart – integrates a keyword, person, or any persistable that might simply have an ID passed in from the controller and "hydrates" the value before performing the search.
  • SpatialQueryPart – enables latitude/longitude box searches within the spatial data. It specifically handles edge cases around the International Date Line and issues of precision (scale) of the request – that is, if you draw a box around a specific region of Tucson, bounding boxes that cover the entire world or country are filtered out of the search results.
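
The sketch below is a toy re-implementation conveying the shape of this DSL; the real tDAR classes additionally handle analyzers, escaping, boosts, and hydration:

Code Block
title: Query part composition (toy sketch)
import java.util.Arrays;
import java.util.List;

// Toy re-implementation showing the shape of the DSL: parts render
// themselves as Lucene syntax and groups combine them.
class FieldQueryPartSketch {
    private final String field;
    private final String value;

    FieldQueryPartSketch(String field, String value) {
        this.field = field;
        this.value = value;
    }

    String toQueryString() {
        return field + ":(" + value + ")";
    }
}

class FieldGroupSketch {
    private final String operator; // "AND" / "OR"
    private final List<FieldQueryPartSketch> parts;

    FieldGroupSketch(String operator, FieldQueryPartSketch... parts) {
        this.operator = operator;
        this.parts = Arrays.asList(parts);
    }

    String toQueryString() {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) {
                sb.append(' ').append(operator).append(' ');
            }
            sb.append(parts.get(i).toQueryString());
        }
        return sb.append(')').toString();
    }
}
// Usage: new FieldGroupSketch("AND",
//         new FieldQueryPartSketch("title", "pueblo"),
//         new FieldQueryPartSketch("status", "ACTIVE")).toQueryString()
// yields: (title:(pueblo) AND status:(ACTIVE))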

Interceptors and Class/FieldBridges

We maintain a few interceptors and field bridges for Hibernate Search to assist with indexing as well. These perform a few functions (a sketch of a simple bridge follows the list):

  • LatLongClassBridge – This class bridge helps with proper indexing of Latitude/Longitude values.
  • PersistentReaderBridge – This allows us to pass a number of URIs (file://) to the indexer to optimize reading of full-text data to be indexed.
  • TdarPaddedNumberBridge – This field bridge formats dates and numbers in a consistent format.
  • DontIndexWhenNotReadyInterceptor – This interceptor blocks Hibernate Search from indexing a resource until it's marked as ready. Hibernate Search will reindex whenever a save/update/delete or other action is called on an entity that it manages. Through the course of a controller interaction, this may happen multiple times; hence, we block indexing until the entity is ready, as reindexing is expensive.
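
As an illustration, a padded-number bridge in the spirit of TdarPaddedNumberBridge might look like the following (a sketch, not the actual implementation):

Code Block
title: Padded number StringBridge (illustrative)
import org.hibernate.search.bridge.StringBridge;

// Sketch in the spirit of TdarPaddedNumberBridge: zero-padding makes
// lexicographic order in the index match numeric order, so range queries
// behave correctly. A real bridge would also handle negative values.
public class PaddedNumberBridgeSketch implements StringBridge {
    @Override
    public String objectToString(Object object) {
        if (object == null) {
            return null;
        }
        return String.format("%019d", ((Number) object).longValue());
    }
}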

The tDAR Filestore

When a file is uploaded to tDAR, it is automatically validated, inspected, processed, and stored on the file-system. The workflow chosen is unique to the type of file being stored, and the type of tDAR resource (Image, Document, etc.) to which the file belongs. A workflow commonly follows these steps:

...

tDAR's file storage and management model is heavily influenced by the California Digital Library's Micro-Services model. Data is stored on the file-system in a pre-determined structure described as a PairTree filestore (https://confluence.ucop.edu/display/Curation/PairTree). The filestore maintains archival copies of all of the data and metadata in tDAR. This organization allows us to map any data stored within the Postgres database that supports the application's web interface to the data stored on the file-system, while also partitioning data on the file-system into manageable chunks. Technically, the user interface is driven by the Postgres database and a set of Lucene indexes for searching and storing data. The resource IDs (document, data set, etc.) are the keys to the PairTree store; when a resource is saved or modified, the store is updated, keeping data in sync. Each branch of the filestore is a folder for one record ("rec/", illustrated in Figure 2, below). Data associated with each tDAR record is stored in a structure inspired by the D-Flat convention (https://confluence.ucop.edu/display/Curation/D-flat), ensuring a consistent organization of the archival record.

Figure 2: tDAR Filestore Visualized

Code Block
title: tDAR filestore visualized
/home/tdar/filestore/36/67/45$ tree
rec/
(1) |-- record.2013-02-12--19-44-32.xml
    |-- record.2013-03-11--08-00-41.xml
    |-- record.2013-03-11--08-01-12.xml
(2) |-- 7134/
(3)     |-- v1/
(4)         |-- aa-volume-376-no3.pdf
            |-- aa-volume-376-no3.pdf.MD5
(5)         |-- deriv/
                |-- aa-volume-376-no3_lg.jpg
                |-- aa-volume-376-no3_md.jpg
                |-- aa-volume-376-no3_sm.jpg
                |-- aa-volume-376-no3.pdf.txt
                |-- log.xml
More generally, a path might look like:
/home/tdar/filestore/<resource id>/<file id>/<version>/
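
A hedged sketch of PairTree-style path construction consistent with the example above (tDAR's actual implementation may pad or split ids differently):

Code Block
title: PairTree path construction (illustrative)
// Hedged sketch of PairTree-style path construction, consistent with the
// example above (e.g. resource id 366745 -> "36/67/45"); tDAR's actual
// implementation may pad or split ids differently.
public class PairTreePathSketch {
    public static String toPath(long resourceId) {
        String id = String.valueOf(resourceId);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < id.length(); i += 2) {
            if (i > 0) {
                sb.append('/');
            }
            sb.append(id, i, Math.min(i + 2, id.length()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPath(366745L)); // prints 36/67/45
    }
}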

...

AbstractPersistableController


Figure 3: tDAR AbstractPersistableController Class Hierarchy
The AbstractPersistableController is probably the most complex structure within the tDAR controller infrastructure. It attempts to manage and simplify CRUD (create, read/view, update, delete) actions within tDAR by centralizing most of the logic and flow while allowing stub methods to be overridden by subclasses to adjust the workflow as needed. The controller breaks actions down into the following general process:

...
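
A hypothetical sketch of this template-method structure (class and method names are illustrative, not tDAR's actual API):

Code Block
title: Template-method controller structure (illustrative)
// Hypothetical sketch (names illustrative): the abstract controller owns
// the overall CRUD flow; subclasses override only the hooks they need.
public abstract class PersistableControllerSketch<P> {

    public final String save(P persistable) {
        if (!authorize(persistable)) {
            return "unauthorized";
        }
        beforeSave(persistable); // hook: default no-op
        persist(persistable);    // shared persistence logic
        afterSave(persistable);  // hook: default no-op
        return "success";
    }

    protected boolean authorize(P persistable) {
        return true; // overridden where rights checks are required
    }

    protected void beforeSave(P persistable) {
    }

    protected void afterSave(P persistable) {
    }

    protected abstract void persist(P persistable);
}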

The other major controller hierarchy is the SearchResultHandler interface and AbstractLookupController hierarchy. The abstract class and interface are intended to standardize and centralize how tDAR interacts with both search results and the SearchService. The interface provides standard names for search parameters, including pagination for the end-user interface. It also provides a standard interface between search controllers and the search service for managing common parameters such as the query, results, and sorting, among others.
Over time, we've begun to use the SearchParameters and ReservedSearchParameters classes to assist in the creation and management of queries within the system as well. These helper classes assist in the generation of Boolean search queries by collecting the objects for us without our manually generating groups of fields. The objects themselves came out of refactoring the AdvancedSearchController to handle generic Boolean searches, but they have also helped simplify the logic.
Figure 4: AbstractLookupController Hierarchy

Asynchronous Actions

Due to the complexity of some actions, a number of controllers have asynchronous actions associated with them. A few are interactive, while most are not. These asynchronous actions are associated with long-running tasks such as indexing, re-indexing, and the loading or processing of data. Asynchronous data processing is done through two different models, depending on the result.
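
One plausible shape for the non-interactive model, assuming Spring's @Async support (the actual tDAR mechanisms may differ):

Code Block
title: Asynchronous service method (illustrative)
import java.util.concurrent.Future;

import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.AsyncResult;
import org.springframework.stereotype.Service;

// Hedged sketch of one asynchronous model, assuming Spring's @Async support;
// the actual tDAR mechanisms may differ.
@Service
public class ReindexServiceSketch {

    @Async
    public Future<Integer> reindexAll() {
        int count = 0;
        // ... iterate over resources and reindex each, incrementing count ...
        return new AsyncResult<Integer>(count);
    }
}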

...