The PI proposes work that will begin with government data and will be extended through the Research Data Alliance (RDA), thereby bolstering the prospects of a common set of data models, APIs, and registries for both government and scientific data. Given the global connections of RDA, the proposed work offers the potential to advance common data infrastructure even beyond the US landscape. Johns Hopkins University will develop a set of implementation-independent data models and application programming interface (API) specifications to support semantically useful sharing and machine action over metadata aggregated from heterogeneous data sources. Support for aligning and reasoning over common concepts within these data will be provided through a system of types and the properties (fields, concepts, etc.) associated with those types. The initial specification will include a type (and associated properties) that defines a set of core metadata to which participating data providers can map each registered dataset. Allowing multiple types will support extending these core properties and will permit each metadata record to enumerate the set of types to which it conforms. Properties may describe both high-level attributes of a dataset and more detailed features. The PI and his team will develop and clarify the initial set of core metadata and develop an exemplar set of extension types.

Intellectual Merit: Seamless and effortless transfer of government and research data across time, geographies, and scientific domains is a difficult problem. This small and admittedly risky project can only hope to provide one piece of the puzzle, but the potential rewards of success are significant and well worth pursuing.

Broader Impacts: The proposed work would address a range of diverse data types from government and scientific sources, making them generally available for adoption without encumbrance.
The coordination of outreach and adoption through the Research Data Alliance would amplify the results of this proposed work to a wide range of communities, data producers, and data consumers. Additionally, since the proposal team works within a research library, there is a natural venue for outreach to a community that provides capacity for additional outreach, adoption, and sustainability. Finally, it is worth noting that one of the team members (DiLauro) is African-American.
Introduction

This project addressed the need for discovery of and access to data created by government agencies and data produced through federal grants. The variability of access methods and descriptive vocabularies makes it difficult to find and query such data. The standard approach to this problem is to provide a catalog or registry of the data resources. Such catalogs and registries have the following issues:

- Complex models that require a large investment to implement;
- Rigid models that stifle the evolution of vocabularies over time;
- Minimal or no support for upstream services that wish to determine which of the registered resources may be served by their capabilities.

Consequently, we have investigated a more flexible and lightweight approach, taking advantage of less record-oriented and more graph-based models, that reflects the "Making Open and Machine Readable the New Default for Government Information" Executive Order of May 9, 2013[1] and the associated OMB Open Data Policy memorandum M-13-13[2].

Overall Logical Architecture

The architecture (Figure 1) comprises metadata and type registries, which provide a common framework for shared understanding among the various participants:

Type Registry Service

The role of the type registry (TR) service is to keep track of the properties, the types, and the relationships among them.

Data Model Explanation

Figure 2 shows the type registry data model. The following paragraphs explain key concepts and rationales for the structure of the model. Transformations represent the ability to convert from a source type or property to a target type or property. Changes to entries within the TR represent activity and are captured as Event Data.

Type Registry Usage Scenarios

- Create/update property and type entries.
- Create/update a type, given a set of attributes and a set of member property and/or type IDs.
- Create a transform from a set of attributes and source and target type or property IDs.
- Register conceptual relationships or transforms between a subject and object type or property.
- Delete a property, type, transform, or conceptual relationship, given its ID.
- Get a list of members, given a list of one or more type IDs.
- Get a list of possible source types or properties, given the ID of a target type/property.
- Get a list of conceptual relationships registered between source and target types/properties.

Metadata Registry Service

The metadata registry (MR) service supports access to data by capturing normalized expressions of data properties and providing a discovery mechanism based on those properties. The concept of an information type is used to cluster these properties into useful groupings.

Data Model Explanation

Figure 3 shows the metadata registry data model. The paragraphs below explain key concepts and rationales for the structure of the model. Each registered resource is described in a Metadata Registry (MR) entry, multiple versions of which may exist. Each entry contains one or more property-value pairs, each consisting of a property ID (URI) and a value. Events represent activity within the MR. Any change to entries within the MR is the result of an event, and information associated with that event is captured as Event Data.

Our overarching use case called for four types of data, each of which is supported by our general data model:

Core Metadata - General and meant to be applied consistently across all registered content.
Custom Metadata - Domain-, model-, or format-specific; not applicable to the full breadth of registered data.
Relationship Metadata - Expresses a relationship between the described resource and some other resource.

- Create/update a resource registry entry.
- Get the list of MR entries that support the required properties of a given list of types.
- Search the registry for entries that match given search criteria.
- Return the events associated with a given resource.
- Return the events associated with a given version of a resource's entry.
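As a rough illustration of how the two registries could interact, the Python sketch below models types as sets of member property and/or type IDs, and registry entries as versioned sets of property-value pairs keyed by property URI. It then answers the scenario of finding the entries that support the required properties of a given list of types. The class names, method names, and example.org URIs are all hypothetical and are not taken from the project's specifications. The in-memory dictionaries merely stand in for whatever storage an actual implementation would use.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TypeRegistry:
    """Toy type registry (hypothetical API): type ID -> member IDs."""
    types: dict[str, set[str]] = field(default_factory=dict)

    def put_type(self, type_id: str, member_ids: set[str]) -> None:
        """Create or update a type from its member property/type IDs."""
        self.types[type_id] = set(member_ids)

    def members(self, type_ids: list[str]) -> set[str]:
        """Return every property ID reachable from the given types.

        Members that are themselves registered types are expanded;
        any other member is treated as a property ID.
        """
        props: set[str] = set()
        seen: set[str] = set()
        stack = list(type_ids)
        while stack:
            tid = stack.pop()
            if tid in seen:
                continue
            seen.add(tid)
            for member in self.types.get(tid, set()):
                if member in self.types:
                    stack.append(member)
                else:
                    props.add(member)
        return props


@dataclass
class MetadataRegistry:
    """Toy metadata registry (hypothetical API) with versioned entries."""
    # resource ID -> list of versions, each a dict of property URI -> value
    entries: dict[str, list[dict[str, str]]] = field(default_factory=dict)
    # event log of (resource ID, action) pairs, a stand-in for Event Data
    events: list[tuple[str, str]] = field(default_factory=list)

    def put_entry(self, resource_id: str, properties: dict[str, str]) -> None:
        """Create an entry, or add a new version of an existing one."""
        action = "update" if resource_id in self.entries else "create"
        self.entries.setdefault(resource_id, []).append(dict(properties))
        self.events.append((resource_id, action))

    def entries_supporting(self, tr: TypeRegistry, type_ids: list[str]) -> list[str]:
        """List resources whose latest version carries all required properties."""
        required = tr.members(type_ids)
        return sorted(
            rid for rid, versions in self.entries.items()
            if required.issubset(versions[-1])
        )


# Hypothetical identifiers, used only for illustration.
CORE = "http://example.org/types/core"
GEO = "http://example.org/types/geo"
TITLE = "http://example.org/props/title"
PUBLISHER = "http://example.org/props/publisher"
BBOX = "http://example.org/props/boundingBox"

tr = TypeRegistry()
tr.put_type(CORE, {TITLE, PUBLISHER})
tr.put_type(GEO, {CORE, BBOX})  # an extension type that builds on the core type

mr = MetadataRegistry()
mr.put_entry("dataset-1", {TITLE: "Example A", PUBLISHER: "Example Org"})
mr.put_entry("dataset-2", {TITLE: "Example B", PUBLISHER: "Example Org",
                           BBOX: "-77,38,-76,39"})

core_matches = mr.entries_supporting(tr, [CORE])
geo_matches = mr.entries_supporting(tr, [GEO])
```

In this sketch both datasets satisfy the core type, while only the second also satisfies the extension type, because the extension type inherits the core properties and adds one of its own. Treating a type as a set of member IDs keeps the model graph-like: new extension types can be registered without altering existing entries.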
Metadata Registry Usage Scenarios

Data Repository

A data repository (DR) comprises a technical platform and any associated human and machine services for storing, retrieving, and (sometimes) managing its content.

Harvester

Since some data repositories will not have a sufficient technical platform, expertise, or staffing to perform the modifications necessary to interact with the MR, we incorporated the concept of a harvester that can interact with the TR and MR on behalf of data repositories.

Conclusions

Broader Impacts

This project reflects the work of the RDA[3] Persistent Identifier Information Types Working Group[4]. Some products of this research have been adopted as candidate approaches in NIST’s Common Access Platform[5] design activities.

Future Work

It would be useful to incorporate the final results of this project into the RDA, to engage the community in adopting a specific vocabulary for the relationships described within the models, and, finally, to test our assertion that metadata and type registries and their models are sufficient to manage descriptions of software, workflows, and service artifacts.

[1] www.whitehouse.gov/the-press-office/2013/05/09/executive-order-making-open-and-machine-readable-new-default-government-
[2] www.whitehouse.gov/sites/default/files/omb/memoranda/2013/m-13-13.pdf
[3] https://rd-alliance.org
[4] https://rd-alliance.org/groups/pid-information-types-wg.html
[5] www.nist.gov/data/itag.cfm