Introduction to the tDAR Application
Use Cases / Personas:
The tDAR application has four distinct audiences: visitors, who are attempting to discover and access archaeological resources; contributors, who are adding, managing, and archiving archaeological materials; researchers, who use tDAR's advanced tools to perform data integration tasks; and finally, administrators. In general, there is a high level of overlap in functionality between these user bases. All users need to search for and download materials; researchers and contributors additionally need to be able to create resources, manage them, upload documents, add and modify metadata, manage permissions, and archive materials. Both researchers and contributors need to be able to properly document data sets using coding sheets, but only researchers need to be able to integrate data sets.
High-Level Record Types within tDAR
tDAR manages and defines the following objects uniquely within the system.
Creators (People and Institutions)
Creators within the system have different uses. They uniquely represent a person or institution within tDAR. Creator records have names, addresses, descriptions, and URLs among other common metadata, as well as administrative metadata such as when a record was created or updated. Records may be "wrapped" by other objects to help with managing rights and permissions, or to provide distinct roles on a resource, such as authorship. Person records have additional fields to properly represent a user, including links to institutions.
Keywords
tDAR has a few generic types of keywords: controlled, hierarchical, and uncontrolled. Keywords have labels, descriptions, and administrative metadata such as creation/modification dates.
Uncontrolled keywords allow for the free-form entry of keyword data. All data here is user entered, and while users may utilize auto-completes to manage uniqueness, the keywords are created automatically as a user adds them.
Controlled keywords are managed by the system administrators as opposed to the end-users. Users are able to apply keywords to a resource, but cannot create new ones.
Lastly, controlled keywords may be organized hierarchically. These keywords may represent concepts such as cultures or locations, or other data that can be represented via a tree or hierarchical grouping.
Resources
The primary objects created by users within tDAR are resources. Different resource types roughly map to different generic file formats – documents, images, data sets, geospatial data, coding sheets, ontologies, and 3D & sensory data. The one object type that does not map to a general format is the concept of a Project.
Projects within tDAR allow users to associate resources within the system with shared metadata. Projects enable shared metadata by associating groups of resources within a project, and allowing those resources to inherit parts of or the complete project metadata. Projects also provide a basic organizational structure for resources within them.
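The inheritance behavior described above can be sketched as a simple fallback rule: a resource takes the project's value for any inherited field it leaves unset. All names below are illustrative assumptions, not tDAR's actual data model.

```python
# Sketch of project-level metadata inheritance: a resource inherits any
# field it marks as inherited (and leaves unset) from its parent project.
# Field names and the resolution rule are illustrative only.

def effective_metadata(project_meta: dict, resource_meta: dict,
                       inherited_fields: set) -> dict:
    """Resolve a resource's metadata, falling back to the project's
    values for fields the resource marks as inherited."""
    resolved = dict(resource_meta)
    for field in inherited_fields:
        if not resolved.get(field):
            resolved[field] = project_meta.get(field)
    return resolved

project = {"investigation_type": "survey", "culture": "Hohokam"}
resource = {"title": "Unit 5 photo log", "culture": ""}
meta = effective_metadata(project, resource, {"investigation_type", "culture"})
```

A resource can inherit either parts of the project metadata (a subset of fields) or the complete set, simply by varying which fields are in the inherited set.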
Collections
Collections allow users to manage and organize resources. Collections are similar to tags in that resources may be part of multiple collections; collections are also hierarchical. While projects enable inheritance of metadata, collections enable inheritance of administrative data, such as rights and permissions. Collections also allow for different displays of information for end users based on the creator's choices.
The tDAR Metadata Schema
The tDAR metadata schema maintains direct mappings to a few existing metadata schemas while extending them to include archaeologically specific metadata. The schema maintains direct mappings to Dublin Core and MODS for general and bibliographic metadata. However, it extends these to include information on sites, locations, dates, notes, types of archaeological investigation, material types, and other metadata. As new resource types are added to tDAR, additional metadata fields are added to the schema to support them, based on the needs of the resource type, consultation with the Guides to Good Practice (http://guidestogoodpractice.org), and metadata already maintained within the object itself (such as play time and format). In the future, the schema may also need to be extended to properly support PREMIS (http://www.loc.gov/standards/premis/) metadata for archiving.
A copy of the tDAR data dictionary is available at: https://dev.tdar.org/confluence/display/TDAR/Data+Dictionary
Visitor / Basic User Functionality:
The core functionality within tDAR focuses on the following actions: discovery (searching, browsing, downloading) and management (creating & editing, deleting, managing rights).
Users need to be able to do simple keyword searches for resources stored within tDAR. Queries should return resources that match the query in common fields such as title, description, and keywords, or within the content of uploaded materials. Results should be ranked so that more relevant materials appear before less relevant ones; higher occurrences of terms, or placement of terms in the title or description, should all increase relevance. Basic searches should also have the ability to limit results by location and resource type. Users should be able to search for people, institutions, collections, and resources.
Advanced search should function similarly to a traditional Boolean search, using a simple user-facing query builder. Users should be able to search for data in specific fields on a field-by-field basis and combine field queries with AND/OR.
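The fielded Boolean search described above can be sketched with a tiny in-memory matcher. This is purely illustrative: a real implementation would compile such clauses into a Lucene/Solr query rather than scan records, and the field names here are invented.

```python
# Minimal sketch of a fielded Boolean query: per-field (field, term)
# clauses combined with AND or OR. Record fields are illustrative.

def matches(record: dict, clauses: list, operator: str = "AND") -> bool:
    """clauses: list of (field, term) pairs; operator: 'AND' or 'OR'."""
    hits = [term.lower() in str(record.get(field, "")).lower()
            for field, term in clauses]
    return all(hits) if operator == "AND" else any(hits)

record = {"title": "Hohokam Ceramics Report",
          "description": "Pottery from the Salt River valley"}
hit = matches(record, [("title", "ceramics"), ("description", "pottery")], "AND")
```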
Other Types of Search
Where possible, search should be enabled via API calls through OpenSearch, while also enabling services like Google and Google Scholar to properly search and index the system.
Besides searching for content, users need to be able to browse it via separate interfaces. Ways to explore the content might include browsing by controlled or uncontrolled keyword, title, contributor, year created, year contributed to tDAR, by map, by contact sheet (images), or by other creative discovery methods.
Special Types of Browsing
Unique resource types, such as images, data sets, and ontologies, should extend the browsing interface to allow authenticated users to view and browse the data before downloading it. For example, a user should be able to browse a gallery of images if multiple are associated with a resource, browse the data within a data set's tables, or see a visual representation of an ontology.
Users should be able to download records when permitted. Contributors should be able to download records and archival copies of records along with metadata.
Users should be able to manage content that interests them; this may include creating collections, tracking resources they recently viewed, or simply bookmarking items they find interesting.
Users should be in control of their own data: they should be able to view and modify their personal information and delete it as need be. They should also be able to reset their passwords.
Creation of new records within tDAR should comply with a few criteria:
- The creation process should be quick
- The creation process should be simple
- Records should be as complete as possible
The tDAR metadata schema is expansive so as to properly capture the who/what/where/when/why around each archaeological record or report. Thus, when allowing users to create and modify records, the process needs to be as simple and easy as possible. Ways in which we enable this include:
- providing lookups and auto-completes on as many fields as possible
- reusing information provided by the uploader, such as map data
- allowing asynchronous operations, such as file uploads
- validating data on entry, prior to save
- providing context-sensitive help where possible
Data entry should ideally parallel the data entry page's layout, so that the system lays out the entry page in a logical format as a user moves through a document or other object. For documents especially, the page should be laid out in a structure similar to a cover page so the user does not have to jump around the page as they enter fields. Finally, the system provides for a very limited "minimum viable record," but improvements might be made in determining what reasonable metadata for a record might be beyond the required fields. Another logical improvement might be to separate out the upload process and place it ahead of the data entry process, to better validate the data being uploaded and potentially extract metadata.
Beyond the tDAR metadata, specific types of resources will also need additional metadata. Documents, for example, may need metadata to complete a citation. Data sets may need metadata to describe columns, tables, and relationships between tables. These may require additional screens or parts of the metadata page. Where possible, if the files being uploaded contain this information, it should be used to populate the fields instead of prompting the user.
Unlike creating records, where ideally the user flow is linear on the page, modifying records tends to be a more piecemeal process where the user flow jumps around the page. Thus, tools like the scroll-spy at the top enable the user to move between distinct sections quickly and easily – modifying rights, or correcting a map, without having to scroll manually to that part of the page. This method is a trade-off for having separate pages for each of these functions.
Within the context of being an "archive," deletion needs a different definition: actually deleting data must be extremely hard. Instead, deletion should function more as a "user flag" than as actual removal of data. Deleted records should simply be marked with a status on primary data objects such as Resources, Creators, Keywords, Collections, and Files, although, ultimately, it should be possible to purge each from the system if necessary. Finally, deletion should require a "reason" specified by the user.
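The status-flag model of deletion can be sketched as follows. The enum values, field names, and exception behavior are assumptions for illustration, not tDAR's actual implementation.

```python
# Sketch of "deletion as a user flag": records are marked DELETED with a
# required reason rather than removed; a separate purge step would handle
# true removal. All names here are illustrative.

from enum import Enum

class Status(Enum):
    ACTIVE = "active"
    DELETED = "deleted"

class Record:
    def __init__(self, title: str):
        self.title = title
        self.status = Status.ACTIVE
        self.deletion_reason = None

    def delete(self, reason: str):
        """Flag the record as deleted; a reason is mandatory."""
        if not reason:
            raise ValueError("deletion requires a reason")
        self.status = Status.DELETED
        self.deletion_reason = reason

r = Record("Site form, AZ U:9:1")
r.delete("uploaded in duplicate")
```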
Relationships between Resources
As the structures of uploaded resources become more complex, it will become necessary to enable users to express relationships between them, e.g. an image is part of a photo log, or a data set is documented by a report.
All user and data actions should be logged with, at minimum, who did what and when, and ideally with as much information as possible to enable an "undo" of the action. This logging may occur either through storage of XML data within the filestore or by logging specific actions within the Resource Revision Log.
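A revision-log entry along those lines could look like the sketch below: the entry records who/what/when, plus the before/after state needed to undo the change. The structure is an assumption, not the actual Resource Revision Log format.

```python
# Sketch of a revision log entry carrying enough state to undo an action.
# Field names are illustrative assumptions.

import datetime

def log_entry(user: str, action: str, resource_id: int,
              before: dict, after: dict) -> dict:
    return {
        "user": user,
        "action": action,
        "resource_id": resource_id,
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "before": before,   # prior values of the changed fields, enabling undo
        "after": after,
    }

def undo(entry: dict, current: dict) -> dict:
    """Restore only the fields the logged action changed."""
    restored = dict(current)
    restored.update(entry["before"])
    return restored

entry = log_entry("mallory", "edit", 42,
                  {"title": "Old title"}, {"title": "New title"})
restored = undo(entry, {"title": "New title", "year": 1999})
```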
One distinct curation action is the ability to manage and organize resources within the system. Within tDAR, organization can be done primarily through the use of Collections. Organization may take on a set of distinct functions: for presentation or for local management. Examples of this might include displaying a set of projects on a map so that visitors can explore data within the system via the map, or hierarchically organizing materials based on a local organizational model used by the uploading / contributing organization.
Actions within tDAR often require multiple records to be created, edited, or deleted; tDAR should provide tools to perform these in batch. Batch actions should include uploading materials and files, adding or removing metadata values, assigning rights, and other common actions.
Managing Users, Rights, & Permissions
Finally, users need to be able to manage rights and permissions at various levels: the per-file level, the per-resource level, or the aggregate level across multiple resources. Within tDAR, each file may be marked as PUBLIC, CONFIDENTIAL, or EMBARGOED. Files marked as CONFIDENTIAL or EMBARGOED limit access to specific users, as opposed to all users. The uploader, or users the uploader designates, should be able to control who has the ability to view and modify records or download restricted files. As organizations are large and complex, users should also be able to manage groups of users or groups of resources easily within the system and apply rights appropriately. It should not require the owner to edit 100 records to assign a user rights to each.
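A per-file access check for the three statuses might be sketched as below. The treatment of EMBARGOED files as confidential until an embargo date lapses, and all function and parameter names, are assumptions for illustration.

```python
# Sketch of per-file download checks for PUBLIC / CONFIDENTIAL / EMBARGOED.
# Assumption: an embargoed file behaves like a confidential one until its
# embargo date passes. Names are illustrative.

import datetime

def can_download(status: str, user: str, authorized: set,
                 embargo_until: datetime.date = None,
                 today: datetime.date = None) -> bool:
    today = today or datetime.date.today()
    if status == "PUBLIC":
        return True
    if status == "EMBARGOED" and embargo_until and today >= embargo_until:
        return True  # embargo has lapsed; file is openly downloadable
    return user in authorized  # CONFIDENTIAL, or embargo still active

auth = {"uploader", "colleague"}  # users the uploader has designated
```

Aggregate-level rights would then reduce to computing the authorized set from collection or group membership rather than per-resource lists.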
Where possible, APIs should be provided to help manage and handle tasks. While the system may provide forms for the uploading and management of files and resources, it should be possible to load these via an API. Similarly, it should be possible to find records via an API. Ideally, these APIs should follow industry and library standards; this might include SWORD, REST, OpenSearch, OAI-PMH, OAI-ORE, or Z39.50, for example. Again, an ideal implementation would ensure that all actions a user can perform within the system can also be performed via APIs.
A unique function of tDAR is data integration. Data integration allows multiple data sets to be merged together based on shared data and criteria in each data set. Data integration will not work with two data sets that have no overlap in types of data; e.g. a data set of ceramics data and a data set of weather data will likely have no commonality. Better examples would be two data sets from the same site that share locus and trench information, one containing fauna and one containing ceramics, or two fauna data sets each from a different site. The main requirement is at least one column with similar data between the two data sets. Note that the data in each column, and the way it is recorded, does not have to be the same. Each column that will be used to merge or integrate the data sets needs to be mapped to a shared ontology. This ontology acts as a bridge or translation table between the two or more columns. The more columns mapped to ontologies between data sets, the more data can be mapped or reconciled between the different data sets.
The result of this mapping will be a unified data set that contains selected columns from each data set. This unified data set allows for synthetic analysis across sites, or across databases, in ways that are traditionally difficult, time-consuming, and require technical abilities beyond most users. The resulting data can then be fed into tools like SPSS, Stata, or R for statistical analysis.
Data integration requires a few logical steps:
- Documenting data sets
- Optionally applying coding sheets (lookup tables)
- Mapping columns to a shared ontology
- Performing the data integration
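The steps above can be sketched end to end with a toy example: two data sets record taxon differently, each taxon column is translated through a coding sheet to a shared ontology value, and the rows are merged into one unified table. All data, column names, and mappings below are invented for illustration.

```python
# Toy sketch of data integration: coding sheets translate each data set's
# raw taxon column to shared ontology values, then rows are merged.

faunal_a = [{"site": "A", "taxon_code": "1"}, {"site": "A", "taxon_code": "2"}]
faunal_b = [{"site": "B", "animal": "pigeon"}, {"site": "B", "animal": "dog"}]

# Coding sheets: raw column value -> shared ontology node
coding_sheet_a = {"1": "Columba livia", "2": "Canis familiaris"}
coding_sheet_b = {"pigeon": "Columba livia", "dog": "Canis familiaris"}

def integrate(rows, column, coding_sheet, source):
    """Translate one data set's integration column to ontology values."""
    return [{"source": source, "site": r["site"],
             "taxon": coding_sheet[r[column]]} for r in rows]

unified = (integrate(faunal_a, "taxon_code", coding_sheet_a, "dataset_a")
           + integrate(faunal_b, "animal", coding_sheet_b, "dataset_b"))
```

The unified rows now share one taxon vocabulary, so counts and comparisons across both sites become straightforward.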
Documenting Data Sets
Beyond basic documentation of names and descriptions for tables, columns, and relationships, users can document the types of data stored in each column and then associate each column with a coding sheet and an ontology. Data types are not the traditional database-level types, but instead represent higher-level descriptions (measurements, coded values, and counts). Measurements and counts are useful for describing and working with numeric data to properly represent it. This type of information is often lost when working with data sets, and capturing it is important because it distinguishes numeric data from data that may merely appear to be coded. Coded values are values that have a coding sheet (e.g. a lookup table). Coding sheets are commonly used in the field to speed data entry and simplify collection and entry into a database or a paper form.
Ontologies & Coding Sheets
Within tDAR, ontologies are applied through coding sheets. Each column can be marked as a "coded value" and thus mapped to a coding sheet, or as an "integration column," enabling mapping to an ontology. Ontologies, hierarchical data structures, are used to map values because they enable groupings of values as well as synonyms. A common ontology might be the taxonomy for fauna, whereby bones are mapped to genus, species, family, or order. This sort of mapping enables data at different levels of specificity to be associated, such that all pigeons, or all 'birds,' could be grouped and counted together.
Where possible, this complexity is hidden: if a user maps directly to an ontology, a coding sheet is generated in the background, and if a user maps to a coding sheet that already has ontology mappings, those mappings are reused.
Mapping a set of values to an ontology can be time-consuming; by mapping them to an intermediate value, the coding sheet, we can apply the coding sheet to multiple data sets and thus map once and reuse the mapping. Updating values is equally simple: change the coding sheet mapping, and the change propagates to every data set the sheet is applied to.
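The map-once, reuse-everywhere property can be sketched concretely: several data sets decode through one shared coding sheet, so a single correction to the sheet updates them all. The data and taxon names are invented for illustration.

```python
# Sketch of the coding sheet as a reusable intermediate mapping shared by
# multiple data sets. Structures and values are illustrative.

coding_sheet = {"1": "Aves", "2": "Mammalia"}

def decode(dataset, sheet):
    """Translate a data set's coded column through the shared sheet."""
    return [sheet[v] for v in dataset]

site_a = ["1", "2", "1"]
site_b = ["2", "2"]

before = decode(site_a, coding_sheet) + decode(site_b, coding_sheet)

# One correction to the shared sheet propagates to every data set using it.
coding_sheet["1"] = "Columbiformes"
after = decode(site_a, coding_sheet) + decode(site_b, coding_sheet)
```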
In the future, we should be able to enable researchers, who are not the creators of data sets, to map data to ontologies and thus create their own, custom data integrations.
Once data sets are mapped to shared ontologies, an integration can be performed. Different models exist for integration, but a useful one would be to allow the user to dynamically create a spreadsheet representing the output. First, the user would choose which data sets to integrate. Second, they would identify the different columns they want in the output, dragging the specific columns over to map them to the output columns. Columns that were not mapped to a shared ontology would be included as "display values."
Finally, users might filter values and choose which ontology values should be grouped. The filtering would enable users to specify parts of data sets as opposed to the entire data set. Selecting ontology values would allow for grouping of the hierarchies based on user needs. Using the example above, it would allow the grouping and combining of all birds, or pigeons together instead of the exact species or sub-species.
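Grouping by ontology value, as in the bird example, amounts to rolling counts up a hierarchy. The sketch below uses a tiny invented taxonomy where each node knows its parent; the user's chosen grouping level determines which observations are counted together.

```python
# Sketch of rolling observations up an ontology hierarchy so counts can be
# grouped at a user-chosen level. The taxonomy here is illustrative.

parent = {
    "rock dove": "pigeons",
    "wood pigeon": "pigeons",
    "pigeons": "birds",
    "sparrow": "birds",
    "birds": None,
}

def rolls_up_to(node, target):
    """True if `node` equals `target` or descends from it in the hierarchy."""
    while node is not None:
        if node == target:
            return True
        node = parent.get(node)
    return False

observations = ["rock dove", "wood pigeon", "sparrow", "rock dove"]
bird_count = sum(rolls_up_to(o, "birds") for o in observations)
pigeon_count = sum(rolls_up_to(o, "pigeons") for o in observations)
```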
The result of the integration would ideally include: a technical or descriptive summary of the integration performed, providing enough data to manually or automatically replay the entire integration; a pivot table or count summary of the integration, allowing the user to see an overview of the data; and the data itself in a downloadable format. Current integrations are limited to working with specific tables, but in the future it would be useful to integrate using join tables as well.
Data on tDAR Content
One of the primary administrative functions within tDAR centers on simply understanding the repository. Data on usage statistics, files, resources, users, and what is historically and currently within tDAR is critical for managing the system. Beyond showing historical statistics on usage, files, and users, the system should be able to show current (real-time) usage information as well, including active users and actions.
Indexes & Re-indexing
There are very few true administrative activities that are run manually, but one of them is rebuilding the Lucene indexes if necessary. This should be done within the administrative interface.
Regularly Scheduled Administrative Activities
The administrator should be able to manually kick off some administrative activities, but many should run daily or weekly, automatically. These include checking files, generating DOIs, checking for overdrawn accounts, cleaning up indexes, generating cached or aggregate statistics, and other tasks. The administrative portal should enable the administrator both to view the output of these jobs and to manually kick them off in case of error.
Administrators should be able to manage users within the system, changing group memberships, rights, permissions, and creating and deleting users if necessary.
Error and File Management
The administrator should be able to view all errors within the system and reprocess files that have processing errors or warning statuses. Reprocessing may be necessary if an error occurred, such as exhausted memory or disk space, or because a bug was found and corrected.
Administrators should be able to view all logs within the system.
Large, complex databases often include issues surrounding data duplication. Users with different spelling variations of their name are a perfect example; alternate spellings or names for a keyword are another. The system should be able to capture and manage variants of people, institutions, and keywords within the system and resolve them. Duplicates should be handled properly in the index, such that searching for the duplicate finds the master, while in most cases the user-entered values are what is displayed back to the user. Alternately, deduplication should enable the administrator to completely delete the alternate value if so desired.
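The resolve-for-search, display-as-entered behavior can be sketched as a variant table pointing at master records. The data model below is an illustrative assumption.

```python
# Sketch of duplicate resolution: variant names resolve to one master
# record for indexing and search, while the user-entered variant is kept
# for display. Names and structures are illustrative.

masters = {"Smith, John": 1}                      # canonical name -> record id
variants = {"Smith, J.": "Smith, John",           # variant -> canonical name
            "John Smith": "Smith, John"}

def resolve(name):
    """Return (master_id, display_name) for a possibly-variant name."""
    master = variants.get(name, name)
    return masters.get(master), name  # index by master, display as entered

master_id, display = resolve("Smith, J.")
```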
Unique requirements of archaeological data:
Protection and management of geospatial and location data is more important within the archaeological context not only because US law protects site locations, but because of the moral and ethical implications of exposing site location data to looters.
Latitude & Longitude
Location data can be found in a number of places: within the metadata of uploaded files (e.g. encoded in an image's metadata tags), within the contents of the uploaded file as text or within an embedded map, or within the metadata that a user enters (even if it is within the description). tDAR should attempt to warn users before exposing lat/long data where possible, and obfuscate it when entered into known locations, such as the declared maps.
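One simple obfuscation approach for the declared maps is to snap a site's bounding box outward to a coarse grid before public display, so the exact point cannot be recovered. The grid size and function names below are arbitrary assumptions, not tDAR's actual obfuscation algorithm.

```python
# Sketch of bounding-box obfuscation: expand the declared box outward to a
# coarse grid so the exact site location is not exposed to visitors.
# The 0.1-degree grid size is an arbitrary illustrative choice.

import math

GRID = 0.1  # degrees; coarse enough to hide an exact point

def obfuscate_box(min_lat, min_lng, max_lat, max_lng, grid=GRID):
    """Snap a lat/long bounding box outward to the grid."""
    return (math.floor(min_lat / grid) * grid,
            math.floor(min_lng / grid) * grid,
            math.ceil(max_lat / grid) * grid,
            math.ceil(max_lng / grid) * grid)

box = obfuscate_box(33.4213, -111.9352, 33.4219, -111.9347)
```

The obfuscated box always contains the original one, so discovery-by-map still works while the precise coordinates stay hidden.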
Protection of uploaded materials (confidential files)
Separately, tDAR must provide methods to protect access to files within the system. It is recommended that contributors use appendices to manage confidential information, or information that should be redacted. Regardless, it is necessary to allow users to restrict access to these or other files that may contain location or other confidential data.
A central part of the tDAR model is financial self-sufficiency. Through this, users must be able to pay for uploading materials to the system. The system should prevent users from uploading more than they have rights to, but if a user overdraws, they should be able to recover as much work and data as possible, without penalty, once the additional amount is paid. Payments may be made by third parties or managed by users on behalf of other users, as well as via a single payer-user-uploader model. Examples include a CRM firm paying for all of its employees, money being shared across a large grant, or a faculty member paying for a student to upload materials for them.
Ideally, the system should be dynamic enough to handle different charging and payment models where possible: for example, charging on space, files, or resources alone, or on some combination. Finally, for marketing purposes, it may be useful to allow users or the system to generate coupons, allowing accounts to be dynamically split. The system should provide logging and accounting for specific payments (invoices); groups of payments by a user (accounts); and groups of accounts if necessary.
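A pluggable charging model along those lines can be sketched as a rate table over usage dimensions, so space-only, file-only, or combined pricing all fall out of the same function. All rates below are invented for illustration.

```python
# Sketch of a pluggable charging model: the cost function is driven by a
# rate table over any subset of usage dimensions. Rates are illustrative.

def compute_charge(usage: dict, rates: dict) -> float:
    """usage: {'mb': ..., 'files': ..., 'resources': ...};
    rates: per-unit prices for any subset of those dimensions."""
    return sum(usage.get(dim, 0) * rate for dim, rate in rates.items())

# Space-only pricing vs. a combined space/files/resources model.
space_only = compute_charge({"mb": 500, "files": 10}, {"mb": 0.05})
combined = compute_charge({"mb": 500, "files": 10, "resources": 2},
                          {"mb": 0.02, "files": 1.0, "resources": 5.0})
```

Coupons and account splits would then be adjustments applied on top of the computed charge rather than changes to the model itself.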