Design Document

Open design issues

This is a living document that collects any design issues and questions that come up in tDAR.

Mapping values to nodes in an ontology

Primary use case: move the mapping of string values to ontology nodes out of the data table column and into a coding sheet.  Associating a coding sheet with a data table column in order to translate it then also brings in all of that coding sheet's value-to-ontology-node mappings.

Thus, there will soon be two primary ways to create a "mapping template":

  1. Directly from a coding sheet by associating an ontology with that coding sheet
  2. Directly from a data table column by associating an ontology with the data table column (Note: the web UI should make this exclusive, so that only one supporting resource, either a coding sheet or an ontology, can be associated with a data table column).  Behind the scenes this will create an identity coding sheet to house the integration template / mappings.

Behind the scenes, this means that a CodingSheet is the primary entity encapsulating data value to ontology node mappings.  
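
As a rough illustration of the model described above, the following Python sketch treats a coding sheet as the single holder of value-to-ontology-node mappings. Every name in it (`CodingRule`, `CodingSheet`, `identity_coding_sheet`, the sample column and values) is hypothetical and not taken from the tDAR code base. It only shows how a column linked directly to an ontology could receive an identity coding sheet, that is, one whose codes translate to themselves.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class CodingRule:
    """One row of a coding sheet: a raw code and the term it translates to."""
    code: str
    term: str
    ontology_node: Optional[str] = None  # label of the mapped node, if any


@dataclass
class CodingSheet:
    """Hypothetical stand-in for the entity that owns all value-to-node mappings."""
    title: str
    rules: Dict[str, CodingRule] = field(default_factory=dict)

    def map_to_node(self, code: str, ontology_node: str) -> None:
        """Attach an ontology node to an existing code."""
        self.rules[code].ontology_node = ontology_node

    def node_for(self, value: str) -> Optional[str]:
        """Return the mapped node for a data value, or None if unmapped."""
        rule = self.rules.get(value)
        return rule.ontology_node if rule else None


def identity_coding_sheet(column_name: str, values: Iterable[str]) -> CodingSheet:
    """Build a sheet in which every distinct column value is its own code and term."""
    sheet = CodingSheet(title=f"Identity coding sheet for {column_name}")
    for value in sorted(set(values)):
        sheet.rules[value] = CodingRule(code=value, term=value)
    return sheet


# Way 2 from the list above: an ontology is associated directly with a column.
sheet = identity_coding_sheet("taxon", ["sheep", "goat", "sheep"])
sheet.map_to_node("sheep", "Artiodactyla")
print(sheet.node_for("sheep"))  # Artiodactyla
print(sheet.node_for("goat"))   # None
```

With this shape, both ways of creating a mapping template end up in the same place: the mappings live on the coding sheet, whether the user supplied the sheet or it was generated.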

Open questions:
  1. What happens if we change the coding sheet's mappings after it's been linked?  Should those changes be propagated transparently to all directly linked data table columns?
    1. For right now: yes.  Later on we can do some additional work to support versioning of coding sheets, notifying users that have linked to the older version that there is a newer version and showing them the diffs, etc.

Publishing Ontologies

How should we set up our ontology URIs, and do we need to publish them online somewhere, e.g., http://www.tdar.org/ontology/<resource-id>/<resource-title>#Artiodactyla ?  
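
To make the URI pattern above concrete, here is a minimal Python sketch that builds a node URI from a resource id, a resource title, and a node label. The base URI comes from the example above; the slug rule, the function names, and the sample id and title are assumptions made for illustration only.

```python
import re
from urllib.parse import quote

BASE_URI = "http://www.tdar.org/ontology"  # base taken from the example URI above


def slugify(title: str) -> str:
    """Lower-case the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def ontology_node_uri(resource_id: int, resource_title: str, node_label: str) -> str:
    """Build <base>/<resource-id>/<resource-title>#<node> (assumed layout)."""
    fragment = quote(node_label.replace(" ", "_"), safe="")
    return f"{BASE_URI}/{resource_id}/{slugify(resource_title)}#{fragment}"


# The id 42 and the title are made up for this example.
print(ontology_node_uri(42, "Fauna Taxon Ontology", "Artiodactyla"))
# http://www.tdar.org/ontology/42/fauna-taxon-ontology#Artiodactyla
```

One consequence of including the title in the path is that renaming a resource changes its node URIs, which may matter if the URIs are published.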

Authority Management Logging (or: TDAR Application Logging)

Possible solutions:

  1. Store logs in the filesystem somewhere
    1. Pros: 
      1. Simple
    2. Cons: 
      1. Where do they belong, and in which filestore?  We would have to deal with discovering, parsing, and searching the logs ourselves
  2. Store logs in a table in the db
    1. Pros: 
      1. easier to query, report, display, comes with transactional semantics
    2. Cons:
      1. an error in deduping will roll back all the logging (may not be an issue, since none of the dedupe should take effect in that case; just log the error in the regular log4j logs)
      2. yet another table in our already turgid database
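
The transactional behaviour mentioned for the database option can be sketched as follows. This is only an illustration, assuming an in-memory SQLite database as a stand-in for the real one; the table names `keyword` and `authority_log`, the `dedupe` function, and the sample data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyword (id INTEGER PRIMARY KEY, label TEXT)")
conn.execute(
    "CREATE TABLE authority_log ("
    "id INTEGER PRIMARY KEY, person TEXT, action TEXT, detail TEXT)"
)
conn.executemany(
    "INSERT INTO keyword (id, label) VALUES (?, ?)",
    [(1, "Bos taurus"), (2, "bos taurus")],
)
conn.commit()


def dedupe(conn, person, authority_id, duplicate_id, fail=False):
    """Log the dedupe and delete the duplicate in a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO authority_log (person, action, detail) "
                "VALUES (?, 'DEDUPE', ?)",
                (person, f"{duplicate_id} -> {authority_id}"),
            )
            conn.execute("DELETE FROM keyword WHERE id = ?", (duplicate_id,))
            if fail:
                raise RuntimeError("simulated dedupe failure")
        return True
    except RuntimeError:
        # The log row is gone too; the error would go to the regular application log.
        return False


def counts(conn):
    """Return (rows in authority_log, rows in keyword)."""
    log_rows = conn.execute("SELECT COUNT(*) FROM authority_log").fetchone()[0]
    keyword_rows = conn.execute("SELECT COUNT(*) FROM keyword").fetchone()[0]
    return log_rows, keyword_rows


dedupe(conn, "editor@example.org", 1, 2, fail=True)
after_failure = counts(conn)
dedupe(conn, "editor@example.org", 1, 2)
after_success = counts(conn)
print(after_failure)  # (0, 2): neither the log row nor the delete survived
print(after_success)  # (1, 1): both were committed together
```

The failed run leaves no trace in the log table, which is the con listed above: if a record of failed attempts is wanted, it has to be written outside the transaction.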

In either case we need to keep track of:

  1. person performing the action (just the id / name / email ?)
  2. action being performed (only DEDUPE at the moment, possibly DELETE / UPDATE in the future, and more)
  3. a payload of all the relevant data needed
    1. a giant blob of a JSON string 
    2. individual fields 
    3. the tradeoff: millions of little records that are atomic and easily searchable versus one big blob that duplicates less data but is possibly trickier to parse / search.

Use cases to keep track of:

  1. resource has a set of keywords, where multiple keywords may be deduped in the same operation to the same authority record
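
The two payload shapes discussed above can be compared on the multi-keyword use case. The sketch below is illustrative only: `DedupeEvent`, its field names, and the ids are assumptions, not an existing tDAR structure.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class DedupeEvent:
    """Hypothetical record of one dedupe operation."""
    person_id: int
    action: str
    authority_id: int         # the record that is kept
    duplicate_ids: List[int]  # the keywords merged into it


def as_blob(event: DedupeEvent) -> str:
    """Option 1: one log record holding the whole payload as a JSON string."""
    return json.dumps(asdict(event), sort_keys=True)


def as_rows(event: DedupeEvent) -> List[Tuple[int, str, int, int]]:
    """Option 2: one atomic, individually searchable row per merged keyword."""
    return [
        (event.person_id, event.action, event.authority_id, duplicate_id)
        for duplicate_id in event.duplicate_ids
    ]


event = DedupeEvent(person_id=7, action="DEDUPE",
                    authority_id=100, duplicate_ids=[101, 102, 103])
blob = as_blob(event)
rows = as_rows(event)
print(blob)       # a single record for the whole operation
print(len(rows))  # 3 records, each repeating person, action, and authority id
```

The row form repeats the person, action, and authority id on every record, while the blob form keeps the operation together but needs parsing before it can be searched by keyword id.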

Ontology enhanced search

  1. The user maps data values (strings) to specific nodes in the ontology via the web interface. I've deferred this task for the time being to deal with bug fixes first, but we can still have a test harness that creates these mappings internally in a hard-coded way to test the ontology-based search.
  2. Initial ontology-enhanced-search use cases that we'd like to implement:
    1. synonym search, where a given search term maps into a set S of equivalent terms / synonyms (within an ontology), returning any resources / information resources with hits for any term within S. There are some issues here that I think we need to clarify before we can properly implement it:
      1. where will the synonyms / equivalence classes be generated? Are they editable by users or expected to be already encoded in the ontology?
      2. how should we represent the synonyms for a given node in the ontology? In our metadata RDBMS or within the OWL file itself as <owl:sameAs> or <owl:equivalentClass> elements, e.g.,

        <owl:Class rdf:ID="FootballTeam">
          <owl:sameAs rdf:resource="http://sports.org/US#SoccerTeam"/>
        </owl:Class>
        

        or

        <owl:Class rdf:ID="Wine">
          <owl:equivalentClass rdf:resource="&vin;Wine"/>
        </owl:Class>
        
    2. children-of search, where a given search term maps to a particular node in an ontology (potentially using the synonyms from the previous point), and all children of that node in the ontology (including synonyms?) are relevant to the search.
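
The two search use cases above can be sketched with a toy in-memory ontology. This is a minimal illustration under stated assumptions: the synonym groups and the parent-to-children table are hard-coded sample data, all function names are hypothetical, and "children" is read here as all descendants, with their synonyms included.

```python
from typing import Dict, List, Set

# Hypothetical toy ontology: synonym groups and a parent -> children table.
SYNONYMS: List[Set[str]] = [
    {"cattle", "cow", "bos taurus"},
    {"sheep", "ovis aries"},
]
CHILDREN: Dict[str, List[str]] = {
    "artiodactyla": ["bovidae"],
    "bovidae": ["cattle", "sheep"],
}


def synonyms_of(term: str) -> Set[str]:
    """Return the set S of terms equivalent to `term` (always includes the term)."""
    term = term.lower()
    for group in SYNONYMS:
        if term in group:
            return set(group)
    return {term}


def descendants_of(term: str) -> Set[str]:
    """Return every node below `term`, resolving synonyms to locate nodes."""
    found: Set[str] = set()
    stack = list(synonyms_of(term))
    while stack:
        node = stack.pop()
        for child in CHILDREN.get(node, []):
            if child not in found:  # also guards against cycles
                found.add(child)
                stack.extend(synonyms_of(child))
    return found


def expand_query(term: str, include_children: bool = False) -> Set[str]:
    """Synonym search, optionally widened to a children-of search."""
    terms = synonyms_of(term)
    if include_children:
        for child in descendants_of(term):
            terms |= synonyms_of(child)
    return terms


print(sorted(expand_query("Cow")))
# ['bos taurus', 'cattle', 'cow']
print(sorted(expand_query("bovidae", include_children=True)))
# ['bos taurus', 'bovidae', 'cattle', 'cow', 'ovis aries', 'sheep']
```

A resource would then count as a hit if it matches any term in the expanded set. Where the synonym groups come from (the RDBMS or the OWL file) is left open here, as it is in the questions above.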

Re-using metadata mapping

If someone uses the same table structure, column names, etc., and wants to re-apply the metadata mapping they used for a previous dataset, how do they apply it to the new one? Right now a metadata mapping may be associated with a coding sheet, so when the user selects a coding sheet to translate their dataset, perhaps they should automatically get the same metadata mapping for that coding sheet. This needs more thought.
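
One possible approach, sketched here purely as a discussion aid, is to carry associations over by matching column names. The function name, the case-insensitive matching rule, and the sample column and coding sheet names are all assumptions rather than a decided design.

```python
from typing import Dict, List, Tuple


def reapply_mapping(
    previous: Dict[str, str], new_columns: List[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Copy column -> coding sheet associations to columns with matching names.

    Matching ignores case and surrounding whitespace. Returns the associations
    that could be applied and the columns that still need manual attention.
    """
    by_name = {name.strip().lower(): sheet for name, sheet in previous.items()}
    applied: Dict[str, str] = {}
    unmatched: List[str] = []
    for column in new_columns:
        key = column.strip().lower()
        if key in by_name:
            applied[column] = by_name[key]
        else:
            unmatched.append(column)
    return applied, unmatched


# Hypothetical previous dataset: column name -> coding sheet title.
previous = {"Taxon": "Fauna taxon codes", "Element": "Fauna element codes"}
applied, unmatched = reapply_mapping(previous, ["taxon", "element ", "Side"])
print(applied)    # {'taxon': 'Fauna taxon codes', 'element ': 'Fauna element codes'}
print(unmatched)  # ['Side']
```

Because the value-to-node mappings hang off the coding sheet, re-associating the sheet would bring them along; only the unmatched columns would need user input.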

File storage scheme

Currently there is a file.store.location property that acts as the root directory for wherever files should be placed (this can be either an absolute path or a relative path). In a production system this is probably best put in an absolute path that is backed up on a regular basis. We should come up with some consistent naming / path convention relative to this file.store.location root directory.
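
As a starting point for discussion, the sketch below resolves the file.store.location root (absolute or relative) and derives a path beneath it. The per-resource-id directory layout, the function names, and the sample paths are assumptions invented for this example and are not an agreed convention.

```python
from pathlib import PurePosixPath


def resolve_store_root(file_store_location: str, base_dir: str) -> PurePosixPath:
    """Use the property as-is when absolute, else resolve it against base_dir."""
    root = PurePosixPath(file_store_location)
    return root if root.is_absolute() else PurePosixPath(base_dir) / root


def resource_file_path(root: PurePosixPath, resource_id: int, filename: str) -> PurePosixPath:
    """Hypothetical convention: <root>/<resource-id>/<filename>."""
    return root / str(resource_id) / filename


# Sample values only; neither path is taken from a real deployment.
relative_root = resolve_store_root("filestore", "/opt/tdar")
absolute_root = resolve_store_root("/var/tdar/filestore", "/opt/tdar")
print(relative_root)  # /opt/tdar/filestore
print(resource_file_path(absolute_root, 1234, "dataset.xls"))
# /var/tdar/filestore/1234/dataset.xls
```

Keying directories on the resource id rather than the title keeps paths stable when a resource is renamed.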

See FileStorageMigration

URL mappings

Enhance current URLs with a more RESTful set of URLs? This may be beneficial for future interoperability efforts.

Road Map