CNRI is proposing to develop a framework and a related set of infrastructural tools that will greatly improve the ability of research organizations to register scientific data sets, either those they hold directly or those which they have funded and which are held elsewhere, and expose them for discovery, analysis, and further processing. The project will build the tools and low-level APIs to be suitable for use by data producers as well as organizations that are expert in metadata and data organization.

There is no widely adopted infrastructure currently in place for sharing research data. Individual pieces certainly exist, especially within given domains, but transparent and seamless sharing of scientific data requires a level of standardization and acceptance that simply doesn't exist today. To realize its potential, widely available scientific data must be discoverable, referenceable, and understandable, and it must be so without enormous investments of time and effort by those providing or consuming the data. Research institutions currently expose their data through institution-specific web sites and APIs. The PIs propose to build a pair of registries that will enable the use of a common API as well as the ability to federate registries across institutions when it makes sense, without requiring the existing underlying storage and management systems to change. We also propose to design basic metadata schemas to be used in those registries.

The first of the two registries is a metadata registry in which data sets can be registered and described. A common API will be built both for the registration process and for access to the resulting metadata objects. Each metadata object and, if required, each data set, will be given a unique, persistent identifier. These identifiers will resolve to the metadata objects and data sets respectively, and their assignment will be part of the deposit API. We will also enable related objects to be associated with each other through the registry and through identifier resolution, depending on the specific case at hand. This will be transparent to users of the access API.
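The deposit and resolution behavior described above can be illustrated with a minimal in-memory sketch. It is not the proposed registry software: the class and method names (MetadataRegistry, deposit, relate, resolve), the identifier prefix "example", and the data set URL are all hypothetical and chosen only for illustration.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetadataObject:
    identifier: str                           # identifier of the metadata object
    description: dict                         # descriptive metadata for the data set
    dataset_identifier: Optional[str] = None  # assigned only if required
    related: set = field(default_factory=set)  # identifiers of related objects


class MetadataRegistry:
    """Toy stand-in for a metadata registry (hypothetical API)."""

    def __init__(self, prefix="example"):  # "example" is a placeholder prefix
        self._prefix = prefix
        self._suffixes = itertools.count(1)
        self._objects = {}   # identifier -> MetadataObject
        self._datasets = {}  # identifier -> location of the data set

    def _mint(self):
        # Identifier assignment is part of the deposit step.
        return f"{self._prefix}/{next(self._suffixes)}"

    def deposit(self, description, dataset_location=None):
        obj = MetadataObject(self._mint(), dict(description))
        if dataset_location is not None:
            obj.dataset_identifier = self._mint()
            self._datasets[obj.dataset_identifier] = dataset_location
        self._objects[obj.identifier] = obj
        return obj

    def relate(self, first, second):
        # Associate two registered objects with each other.
        self._objects[first].related.add(second)
        self._objects[second].related.add(first)

    def resolve(self, identifier):
        # An identifier resolves to a metadata object or to a data set location.
        if identifier in self._objects:
            return self._objects[identifier]
        return self._datasets[identifier]


registry = MetadataRegistry()
survey = registry.deposit(
    {"title": "Ocean temperature survey"},
    dataset_location="https://data.example.org/ocean.csv",  # made-up URL
)
notes = registry.deposit({"title": "Sensor calibration notes"})
registry.relate(survey.identifier, notes.identifier)
```

In this sketch a caller of resolve does not need to know whether an identifier names a metadata object or a data set, which is one simple way to keep identifier handling transparent to users of the access API.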

The second of the two registries is a type registry. The metadata objects and data sets will each be typed, and the type registry will provide the information needed to decipher those types. The goal is to be able to answer the question: given a specific identifier or piece of data, what does it represent and how should it be interpreted? This interaction will be made as transparent as possible to the access API. The interaction between these two registries is key to the proposed framework.
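As an illustration of how a type registry could answer "what does this piece of data represent and how should it be interpreted", the following sketch looks up a type definition by identifier and uses it to decode a raw value. The type definition layout, the identifier "example/type-1", and the names TypeDefinition, TypeRegistry, and interpret are assumptions made for this example, not part of the proposed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeDefinition:
    identifier: str
    name: str
    fields: tuple  # each entry: (field name, unit or convention, converter)


class TypeRegistry:
    """Toy stand-in for a type registry (hypothetical API)."""

    def __init__(self):
        self._types = {}

    def register(self, definition):
        self._types[definition.identifier] = definition

    def lookup(self, type_identifier):
        return self._types[type_identifier]


def interpret(raw_row, type_identifier, types):
    """Decode a comma-separated row using the registered type definition."""
    definition = types.lookup(type_identifier)
    values = raw_row.split(",")
    if len(values) != len(definition.fields):
        raise ValueError("row does not match the registered type")
    return {
        name: (convert(value), unit)
        for (name, unit, convert), value in zip(definition.fields, values)
    }


types = TypeRegistry()
types.register(
    TypeDefinition(
        identifier="example/type-1",  # placeholder identifier
        name="daily sea-surface temperature reading",
        fields=(
            ("temperature", "degC", float),
            ("date", "ISO 8601", str),
        ),
    )
)

reading = interpret("12.5,2013-09-15", "example/type-1", types)
```

A data set whose metadata object carries the type identifier can then be decoded by any client that can reach the type registry, without prior knowledge of the producer's conventions.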

The proposed deliverables will include an open source release of the metadata registry and the type registry software, the basic metadata schemas applicable for those registries, and a prototype service that demonstrates the infrastructure capability by federating research data from at least two sources.

Project Report

This brief project looked at improvements in the availability and reuse of scientific data. The reuse of such data beyond the original researchers is seen as a potentially large benefit to the evolution of science and technology, as more people both inside and outside of the normal science and technology communities gain access and can build on existing data. Mere availability, however, is not sufficient for successful reuse; it must be complemented by understandability of the data. Our project addressed these issues through the creation of two different registries: a Type Registry and a Metadata Registry. Building on previous efforts in this area, the CNRI PIs built both registries and evaluated the framework using real data to determine whether the two registries aid the understanding and interpretation of data. The PIs also engaged the transnational Research Data Alliance (RDA) through the creation of an RDA Working Group on Data Type Registries.

CNRI extended its Digital Object Registry and Digital Object Repository software to produce the Configurable Turnkey Registry (CTR) software. CTR is a server application that can be configured to accept one or more kinds of metadata records. Two examples of configurations that can be made using the software are a Metadata Registry describing datasets, and a Data Type Registry describing data structures together with the detailed background information on how those structures were assembled and what they mean. Once configured and instantiated, the CTR instance will automatically:

a) Provide a programming interface, i.e., an API, to the CTR instance enabling clients to register metadata records that conform to the configured schema. Invalid metadata records are rejected. The API also enables clients to retrieve and search for registered metadata records.
b) Allot unique and persistent identifiers to registered metadata records. Handle client libraries and Handle web proxies (located at http://hdl.handle.net) can be used to retrieve, or be redirected to, the metadata records held in the CTR instance.
c) Produce a human user interface customized to the metadata elements described in the configured schema. End users can then manually register metadata records, and search for and retrieve registered metadata records.

With the help of the CTR software, specifically after instantiating a Type Registry and a Metadata Registry, datasets can be registered, and the data structures and conventions pertaining to those datasets can be recorded in a way that helps both machines and humans understand, process, and reuse those datasets. The results of this project demonstrate that the data-typing process can be automated and that, in general, current data infrastructures could be expanded along the lines of the registries developed in this project to improve the data understanding and processing that aid data reuse. The results exceeded our expectations and follow-on work is already underway. A number of data production and curation agencies across a variety of disciplines have expressed interest in further evaluating the CTR, Metadata, and Type Registry technologies, and CNRI is currently working with them to set up such registries and to release the underlying software in open source.
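The behaviors listed for a configured CTR instance (schema-checked registration, identifier allotment, retrieval and search) can be sketched in a few lines. This is a toy model under stated assumptions, not the CTR software or its API: the class ToyRegistry, its methods, the schema format (field name mapped to a Python type), and the prefix "example" are hypothetical. Only the proxy address http://hdl.handle.net comes from the report.

```python
import itertools

HANDLE_PROXY = "http://hdl.handle.net"  # Handle web proxy named in the report


class InvalidRecord(ValueError):
    """Raised when a record does not conform to the configured schema."""


class ToyRegistry:
    """Toy model of a configurable registry (hypothetical, not the CTR API)."""

    def __init__(self, prefix, schema):
        # schema maps each required field name to the type its value must have
        self._prefix = prefix
        self._schema = dict(schema)
        self._suffixes = itertools.count(1)
        self._records = {}

    def register(self, record):
        missing = [k for k in self._schema if k not in record]
        extra = [k for k in record if k not in self._schema]
        wrong = [
            k for k, expected in self._schema.items()
            if k in record and not isinstance(record[k], expected)
        ]
        if missing or extra or wrong:
            # Invalid metadata records are rejected.
            raise InvalidRecord(
                f"missing={missing} unexpected={extra} wrong_type={wrong}"
            )
        identifier = f"{self._prefix}/{next(self._suffixes)}"
        self._records[identifier] = dict(record)
        return identifier

    def retrieve(self, identifier):
        return self._records[identifier]

    def search(self, text):
        # Case-insensitive substring match over all field values.
        needle = text.lower()
        return [
            identifier
            for identifier, record in self._records.items()
            if any(needle in str(value).lower() for value in record.values())
        ]

    def proxy_url(self, identifier):
        # Address at which a Handle web proxy would be asked for the identifier.
        return f"{HANDLE_PROXY}/{identifier}"


# One possible configuration: a small metadata registry for datasets.
datasets = ToyRegistry("example", {"title": str, "year": int})
first = datasets.register({"title": "Ocean temperature survey", "year": 2013})

try:
    datasets.register({"title": "Missing year"})
    rejected = False
except InvalidRecord:
    rejected = True
```

Configuring a second instance with a different schema, for example one describing data structures, would give a type registry in the same way, which mirrors the idea of one configurable application serving both roles.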

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1349985
Program Officer: Robert Chadduck
Project Start:
Project End:
Budget Start: 2013-09-15
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2013
Total Cost: $99,986
Indirect Cost:
Name: Corporation for National Research Initiatives (NRI)
Department:
Type:
DUNS #:
City: Reston
State: VA
Country: United States
Zip Code: 20191