...

The data stored within the tdardata database represents a different class of data. This data is loaded from data sets within the system and cannot be properly managed by Hibernate, because schemas are generated on the fly as each data set is loaded and tables are backed not by Java entities but by simple data objects like lists or sets. This data is managed through Spring's JDBC support and through abstractions in the PostgresDatabase class. The main interactions with this data serve two purposes: simple browsing and data integration.
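A minimal sketch of what this map-based access might look like with Spring's JdbcTemplate; the class and method names here are illustrative, and the real logic lives in PostgresDatabase:

```java
import java.util.List;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical sketch: browsing rows of a data-set table whose schema was
// generated on the fly, so results come back as plain maps rather than
// mapped Java entities.
public class DataTableBrowser {

    private final JdbcTemplate jdbcTemplate;

    public DataTableBrowser(DataSource tdardataDataSource) {
        this.jdbcTemplate = new JdbcTemplate(tdardataDataSource);
    }

    public List<Map<String, Object>> selectAllFrom(String tableName, int limit) {
        // table names are generated internally, never taken from user input,
        // so simple concatenation is tolerable for this sketch
        return jdbcTemplate.queryForList("SELECT * FROM " + tableName + " LIMIT " + limit);
    }
}
```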
A second class of database data that is not managed in Hibernate is the PostGIS data. We use PostGIS to perform reverse geolocation within the application. This enables us to allow contributors to draw bounding boxes within the system, and to utilize that bounding data when other users search for terms. For example, a bounding box around the UK would enable a user to search for "England" and find the resource without anyone having entered that geographic keyword.
This geolocation work is done locally for two reasons: (1) privacy and security of the lookup, as the data may be a confidential site location; and (2) customization, as we can load in our own shape files to query with content unique to our clients. The data currently loaded into this database includes country, county, state, and continent data, but it might grow to include state and federal lands and other information.
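A sketch of the kind of containment query this implies; the table and column names are assumptions, but ST_MakeEnvelope and ST_Intersects are standard PostGIS functions:

```java
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical sketch of reverse geolocation: given a contributor-drawn
// bounding box, find the names of administrative areas (country, state,
// county, continent) that intersect it.
public class ReverseGeolocationDao {

    private final JdbcTemplate jdbcTemplate;

    public ReverseGeolocationDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public List<String> findIntersectingPlaceNames(double minLng, double minLat, double maxLng, double maxLat) {
        // ST_MakeEnvelope builds a rectangle in WGS84 (SRID 4326);
        // ST_Intersects matches any loaded shape that overlaps it
        String sql = "SELECT name FROM admin_boundaries "
                + "WHERE ST_Intersects(geom, ST_MakeEnvelope(?, ?, ?, ?, 4326))";
        return jdbcTemplate.queryForList(sql, String.class, minLng, minLat, maxLng, maxLat);
    }
}
```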

...

Solr

There are a number of different queries and functions that would be more complex if relegated to the database. These include some of the complex queries built by the advanced search interface, full-text queries that use data outside the database, queries that use inheritance, and queries over resources with hierarchical rights assignments, for example. We leverage Hibernate Search to manage the Lucene indexes for us and Lucene to run these queries. Hibernate Search allows us to tie into the persistence cycle of objects in the database to make sure that the database and the Lucene indexes are kept in sync. When a record is saved, Hibernate Search indexes the updated bean, either manually or through Spring Events. Lucene provides some nice benefits in allowing us to transform the data that's stored and managed into a flat format that is more conducive to many searches. Hibernate Search also enables us to perform a Lucene search and return hydrated beans from the database. There are performance considerations here, as with all things: if we call saveOrUpdate() multiple times, the record may be reindexed multiple times, hence we sometimes have to block index calls. Alternately, transient data may not get updated.

The flatter indexes may prevent complex Boolean searches from being possible. Another challenge is that Lucene may not be as good at certain types of data queries, such as those that utilize numeric values. Spatial queries are particularly complex due to the bounding ranges. Lucene also gives us some benefits in being able to use query and data analyzers when working with data, so stemming, synonyms, and special-character queries become easier.
One challenge with the database and Lucene split, however, is the lag between when a database change is made and when the index is updated. For large collections or projects, the permissions or values in the index may take more than a second to propagate.
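A minimal sketch of how an entity ties into this indexing cycle, written against the classic Hibernate Search 4.x annotations; the entity and its fields are illustrative, not tDAR's actual mapping:

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.hibernate.search.annotations.Analyze;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

// Illustrative entity: saving it through Hibernate also queues a Lucene
// index update, which is how the database and index stay in sync.
@Entity
@Indexed
public class Resource {

    @Id
    @GeneratedValue
    @DocumentId
    private Long id;

    // tokenized for full-text search
    @Field
    private String title;

    // stored untokenized so it can be used for exact filtering
    @Field(analyze = Analyze.NO)
    private String status;

    // getters and setters omitted for brevity
}
```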

Moving forward, Hibernate Search now provides query builders, spatial search, and a number of improvements that may simplify some of the logic we've built ourselves. Alternately, we need to investigate whether we're making the best use of what should be stored within these indexes, whether we might benefit from performing some of these operations in the database instead of Lucene, or whether re-architecting how Hibernate Search and Hibernate retrieve the data out of the database might provide better performance. Another significant challenge is the total number of fields we're storing in Lucene and whether all are truly necessary for operation; we may want to consider moving to a more schemaless Solr model and integrating the schema management into the app, similar to how we handle Liquibase.

Java Objects & Freemarker

...

We have tried to adhere to the MVC model as closely as possible, with full Service and Dao layers to back up data access and management. tDAR is constructed using the Spring framework, which allows for both hard-wiring and dynamic injection of the service and Dao layers into the system, thus allowing greater extensibility in the future. As the project has grown, the Service and Dao layers have grown in complexity and undergone a series of refactors to manage that complexity and to help with IoC/autowiring.
The goal of the first refactor, probably the most significant, was to manage duplication in the Dao and service layers. The challenge was maintaining duplicate, yet common, methods for each of the services that backed each of the bean types; an additional challenge was that every bean required a Dao and Service regardless of whether it required unique functionality beyond some basic methods. This duplication increased the potential for bugs whenever new beans were added. The first refactor adjusted the data and service layers to develop the GenericDao and GenericService objects. These provided a few distinct functions: (a) they centralized all common functionality in a class that could be sub-classed or called by other Dao and Service layer objects; (b) because they required a class to be passed to common methods like find(), they removed the need to create specific Service or Dao classes to cover common functionality; and (c) because they had no dependencies, they alleviated a number of autowiring issues with Spring.
One challenge of this generic model, and of how the tDAR Service and Dao layers are set up, is that as they were migrated to use generics, flow of control became a bit more complicated. Specifically, the Dao layer can be used and re-used for different bean types. This becomes important, and complicating, because different beans may have different requirements or functionality around some basic methods. Two specific examples are the beans that implement the HasResource interface and InformationResourceFileVersions. These two types of beans have unique deletion methods, which require specific cases in GenericDao.delete() to override the default behavior of simply calling Hibernate's delete method. We have attempted to keep as little of this logic as possible.
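A simplified sketch of the GenericDao idea, including the kind of type-specific branching in delete() described above; HasResource is stubbed as a marker interface here, and the cleanup helper is a hypothetical placeholder:

```java
import java.io.Serializable;

import org.hibernate.Session;
import org.hibernate.SessionFactory;

// marker stub standing in for tDAR's real HasResource interface
interface HasResource {
}

// Simplified sketch: one class-parameterized DAO covers the common methods
// for every bean, so simple beans need no dedicated Dao of their own.
public class GenericDao {

    private final SessionFactory sessionFactory;

    public GenericDao(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // passing the class removes the need for a per-bean find() implementation
    public <T> T find(Class<T> cls, Serializable id) {
        return cls.cast(getSession().get(cls, id));
    }

    public void saveOrUpdate(Object bean) {
        getSession().saveOrUpdate(bean);
    }

    public void delete(Object bean) {
        if (bean instanceof HasResource) {
            // beans tied to a resource need extra cleanup before deletion;
            // the real logic in tDAR is more involved than this placeholder
            prepareForDelete((HasResource) bean);
        }
        getSession().delete(bean);
    }

    private void prepareForDelete(HasResource bean) {
        // hypothetical placeholder for the type-specific cleanup
    }

    private Session getSession() {
        return sessionFactory.getCurrentSession();
    }
}
```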
In working with Hibernate in general, we've attempted to stay as standard as possible, and as close to the JPA 2.0 standard as we can. A few issues or non-standard behaviors remain within the system, including: using the session rather than the entity manager, using the database id for hashCode/equality, maintaining a few bidirectional relationships between entities (between InformationResourceFile and InformationResource, for example), and maintaining some database uniqueness keys that Hibernate cannot completely manage (InformationResourceFileVersion).
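A minimal sketch of the id-based equality mentioned above, a common (if debated) Hibernate pattern; the class name is illustrative:

```java
// Sketch of database-id based equality: persisted entities compare by id,
// while unsaved entities (id == null) fall back to identity semantics.
public abstract class AbstractPersistable {

    private Long id;

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof AbstractPersistable)) {
            return false;
        }
        AbstractPersistable other = (AbstractPersistable) obj;
        return id != null && id.equals(other.id);
    }

    @Override
    public int hashCode() {
        // known trade-off: the hash changes once an id is assigned, which is
        // part of why the text above calls this behavior non-standard
        return id == null ? super.hashCode() : id.hashCode();
    }
}
```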
A final challenge with Hibernate is the performance of its inheritance model. Hibernate and JPA 2 provide a few strategies for dealing with inheritance and how it gets mapped to tables. Based on the distribution of data and our data model, we've chosen the "joined" inheritance strategy, where subclasses have their own tables that contain only the fields unique to that subclass. This reduces duplication in the schema and simplifies the model in some ways, but because of how Hibernate performs queries, it often unnecessarily joins many tables and may perform slow queries because it returns too much data or contends for locks on other tables. We have used a few methods to avoid this where possible, including using "projection" in HQL or other queries, or simplifying the queries themselves. We are currently investigating other methods such as FetchProfiles, more complex reflection, caching, and views.
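An illustrative mapping of the "joined" strategy described above; the entity names are generic stand-ins:

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;

// With JOINED inheritance the subclass table holds only its own columns
// and is joined to the parent table on the shared primary key, which is
// where the extra joins at query time come from.
@Entity
@Inheritance(strategy = InheritanceType.JOINED)
public class Resource {

    @Id
    @GeneratedValue
    private Long id;

    private String title;
}

@Entity
class Document extends Resource {

    // only Document-specific fields live in the document table
    private String isbn;
}
```

A projection query such as `select r.id, r.title from Resource r` reduces the data pulled back for list views, which is the kind of workaround mentioned above.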

Hibernate Search

tDAR's use of the Hibernate Search API predates a number of major enhancements to the API, including faceting, the query builder DSL, geospatial searching and indexing, and full-text (Tika) indexing, among others. Because of this, a number of major infrastructure components were built within the system that may duplicate the Hibernate Search APIs.

Query Builder DSL

The tDAR query builder DSL is fairly simple and serves two functions: (a) to help with the generation of queries and the management of FieldQueryParts and FieldGroups; and (b) to allow field analyzers to be overridden at runtime, so that incoming data can be analyzed differently from the data already stored in the index.

...

Analyzers

tDAR has a number of custom analyzers, and also overrides the default analyzer for Hibernate Search. These analyzers help us handle a number of issues (a sketch follows the list):

  • Proper formatting of numbers, dates, and latitude/longitude values for indexing and querying
  • Tokenization or non-tokenization of field values into keywords or phrases
  • ASCII normalization of special characters
  • Edge N-GRAM tokenization for auto-complete lookups
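A sketch of an auto-complete analyzer in the spirit of the list above, written against the Lucene 4.x API; the exact filter chain and n-gram sizes are assumptions:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Lower-cases tokens, folds special characters to ASCII, then emits edge
// n-grams so that partial prefixes match ("arch" finds "archaeology").
public class AutocompleteAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
        TokenStream filter = new LowerCaseFilter(Version.LUCENE_47, source);
        filter = new ASCIIFoldingFilter(filter);
        filter = new EdgeNGramTokenFilter(Version.LUCENE_47, filter, 2, 20);
        return new TokenStreamComponents(source, filter);
    }
}
```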

Search Query DSL

The tDAR query builder DSL functions as a wrapper around Lucene in a different way than the Hibernate Search query builder DSL. Our DSL represents fields and groups as classes and allows them to be combined to create queries using the underlying Lucene syntax. Besides basic field queries, more complex field objects exist to represent "ranged" queries, or to handle values with just IDs, objects, or complex query parts. A few examples (a toy sketch follows the list):

  • FieldQueryPart – represents a generic field query part that can be used for any "simple" field, such as a string or enum
  • GeneralSearchResourceQueryPart – represents the general search for a resource and combines weighted searches on various fields to return relevant results.
  • HydratableQueryPart – integrates a keyword, person, or any persistable that might simply have an ID passed in from the controller, and "hydrates" the value before performing the search.
  • SpatialQueryPart – enables latitude/longitude box searches within the spatial data. It specifically handles edge cases around the International Date Line and issues of precision (scale) of the request – that is, if you draw a box around a specific region of Tucson, bounding boxes that cover an entire country or the whole world are filtered out of the search results.
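A self-contained toy version of the fields-and-groups idea; the names loosely mirror the tDAR classes, but the real implementations differ in signatures and capabilities:

```java
import java.util.ArrayList;
import java.util.List;

// Toy DSL: field parts and groups are objects that render themselves
// into the underlying Lucene query syntax.
public class QueryPartSketch {

    static class FieldQueryPart {
        private final String field;
        private final String value;

        FieldQueryPart(String field, String value) {
            this.field = field;
            this.value = value;
        }

        String generateQueryString() {
            return field + ":" + value;
        }
    }

    static class QueryPartGroup {
        private final String operator; // "AND" or "OR"
        private final List<FieldQueryPart> parts = new ArrayList<FieldQueryPart>();

        QueryPartGroup(String operator) {
            this.operator = operator;
        }

        QueryPartGroup append(FieldQueryPart part) {
            parts.add(part);
            return this;
        }

        String generateQueryString() {
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < parts.size(); i++) {
                if (i > 0) {
                    sb.append(" ").append(operator).append(" ");
                }
                sb.append(parts.get(i).generateQueryString());
            }
            return sb.append(")").toString();
        }
    }

    public static void main(String[] args) {
        String query = new QueryPartGroup("AND")
                .append(new FieldQueryPart("title", "pueblo"))
                .append(new FieldQueryPart("status", "ACTIVE"))
                .generateQueryString();
        System.out.println(query); // (title:pueblo AND status:ACTIVE)
    }
}
```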

Interceptors and Class/FieldBridges

We maintain a few interceptors and field bridges for Hibernate Search to assist in indexing as well. These perform a few functions (a sketch of one bridge follows the list):

  • LatLongClassBridge – This class bridge helps with proper indexing of Latitude/Longitude values.
  • PersistentReaderBridge – This allows us to pass a number of URIs (file://) to the indexer to optimize reading of full-text data to be indexed.
  • TdarPaddedNumberBridge – This field bridge formats dates and numbers in a consistent format.
  • DontIndexWhenNotReadyInterceptor – This interceptor blocks Hibernate Search from indexing a resource until it's marked as ready. Hibernate Search will reindex whenever a save/update/delete or other action is called on an entity that it manages. Over the course of a controller interaction this may happen multiple times, hence we block indexing until the entity is ready, as reindexing is expensive.
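A minimal sketch in the spirit of TdarPaddedNumberBridge, using Hibernate Search's StringBridge contract; the 19-digit width is an assumption:

```java
import org.hibernate.search.bridge.StringBridge;

// Left-pads numbers to a fixed width so that lexicographic Lucene range
// queries order them correctly ("0042" < "0100", whereas the plain
// strings "42" and "100" sort the wrong way around).
public class PaddedNumberBridge implements StringBridge {

    @Override
    public String objectToString(Object object) {
        if (object == null) {
            return null;
        }
        return String.format("%019d", ((Number) object).longValue());
    }
}
```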

The tDAR Filestore

When a file is uploaded to tDAR, it is automatically validated, inspected, processed, and stored on the file system. The workflow chosen is unique to the type of file being stored and to the type of tDAR resource (Image, Document, etc.) to which the file belongs. A workflow commonly follows these steps:

...
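A hypothetical sketch of this per-file-type workflow pattern; the class and method names are illustrative, not tDAR's actual API:

```java
import java.io.File;

// Toy version of the workflow idea: a chain of steps (validate, inspect,
// process, store) assembled per file type and run over the uploaded file.
public class UploadWorkflow {

    interface Step {
        void apply(File file) throws Exception;
    }

    private final Step[] steps;

    public UploadWorkflow(Step... steps) {
        this.steps = steps;
    }

    public void run(File uploadedFile) throws Exception {
        for (Step step : steps) {
            step.apply(uploadedFile);
        }
    }
}
```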