In analyzing the needs of instrument data providers, several challenges are clear:
- Data is coming in faster and in greater volumes, outstripping our ability to perform adequate quality control.
- Data is being used in new ways, and we frequently do not have sufficient information on what happened to the data along the processing stages to determine whether it is suitable for a use we did not envision.
- We often fail to capture, represent and propagate manually generated information that needs to go with the data flows.
- Each time we develop a new instrument, we develop a new data ingest procedure, collect different metadata and organize it differently. It is then hard to use with previous projects.
- The task of event determination and feature classification is onerous, and we don't do it until after we get the data.

These statements point to the lack of a comprehensive, reusable data ingest framework that consists of a semantically rich set of annotations along the data ingest workflow and a smart storage, propagation and retrieval mechanism for the provenance and derivation information. For the purpose of this project, provenance is defined as: the origin or source from which something comes; its intention for use; who or what it was generated for; its manner, sense of place, and time of manufacture, production or discovery; and its history of subsequent owners, all documented in detail sufficient to allow reproducibility. Thus, the goal of this project is to provide an extensible representation of provenance for data ingest systems. Initially, we limit our focus to the set of solar coronal physics instruments operated at the Mauna Loa Solar Observatory in Hawaii by the High Altitude Observatory, National Center for Atmospheric Research. Over time, we will target the broader area of solar and solar-terrestrial physics, including the proposed Coronal Solar Magnetism Observatory. This project leverages innovative work with the Inference Web explanation framework, which provides a set of tools for generating, validating, manipulating, summarizing, and presenting knowledge provenance. In addition, we will utilize its interlingua for provenance, justification, and trust representation: the Proof Markup Language (PML).

Two important concepts to be captured are the data quality and the nature of the processing stages that the data have passed through. Both qualitative and quantitative encodings of data quality are very important to a scientist determining whether the data of interest are useful, applicable or accurate enough for the intended use.

The provenance work will have broad applicability since it will include domain-independent portions geared for any data ingest system as well as a domain-literate module aimed at solar and solar-terrestrial physics. This project is expected to generate a science-driven extension to PML that will provide representational primitives required for scientific provenance. The project will also contribute to community standards by adding metadata to ontologies developed in related projects with a wide degree of applicability to similar community and government programs.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Application #: 0968277
Program Officer: Kevin L. Thompson
Project Start:
Project End:
Budget Start: 2009-04-21
Budget End: 2012-08-31
Support Year:
Fiscal Year: 2009
Total Cost: $565,627
Indirect Cost:
Name: Rensselaer Polytechnic Institute
Department:
Type:
DUNS #:
City: Troy
State: NY
Country: United States
Zip Code: 12180