As digital data collections created through computational science experiments proliferate, it becomes increasingly important to address provenance issues bearing on data validity and quality: recording and managing information about where each data object originated, what processes were applied to it, and by whom. The first outcome of this work is a provenance collection and experience reuse tool that makes minimal assumptions about the software environment and imposes minimal burden on the application writer. It stores and produces results in a form suitable for publication to a digital library. The provenance collection system is standalone, requires little effort to integrate into an application framework, and exhibits good performance.

A second outcome of the work is a recommender system for workflow completion that applies case-based reasoning to provenance collections in order to suggest future workflow-driven investigations. The workflow completion tool builds on computational models of case-based reasoning to develop a support system that leverages the collective experience of the provenance system's users. To evaluate the tool effectively, this work builds a gigabyte-scale benchmark database of real and synthetic provenance information: real workflows are solicited from the community and extended with synthetic data for completeness in testing. The software and database are available to the research community.

Project Report

Beth Plale (PI), David Leake (co-PI)

In this project, a team of computer scientists investigated ways in which provenance, or lineage information, of scientific data can be captured automatically as the data is generated. The provenance of a scientific data product captures its history: how it came into being, what software processes were applied, and what data was used as input. Provenance is important for reuse because it carries the information a scientist would typically use to decide how much trust to place in the data. As scientific data is shared more widely, it will be increasingly important for this trust information to travel with the data.

Our investigations looked at provenance capture "in the wild", that is, capture that is not tied to a single workflow system, as has been done by others. Provenance collection is often a tightly coupled part of a workflow system, a service frequently found in research cyberinfrastructure, but it is better served as a standalone tool. We theorized that the cost of provenance capture, when not tightly embedded in a workflow system, is a function of manual effort (E) and completeness (C) of the provenance information. Manual effort (E) is the programmer or user effort expended to install provenance capture hooks into an application. E and C are generally in linear proportion to one another; that is, a unit of additional effort yields an additional unit of provenance information. Through the lightweight provenance capture methods we developed, we decreased the manual effort so that a full unit of completeness can be achieved for a fraction of the increase in manual labor.

We also investigated algorithms that retrieve historical provenance information and use it to advise a scientist who is composing a new workflow. For instance, we developed methods for making recommendations from provenance traces even when the traces conflict with one another.
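The case-based recommendation idea can be sketched roughly as follows. This is an illustrative toy, not the project's actual algorithm: the function name, the step names, and the sample traces are all invented. Past provenance traces are treated as ordered lists of workflow steps, and conflicting traces simply contribute competing votes for the next step.

```python
from collections import Counter

# Hypothetical past provenance traces: each is an ordered
# list of workflow steps recorded from an earlier investigation.
traces = [
    ["fetch_sequence", "align", "build_tree", "visualize"],
    ["fetch_sequence", "align", "build_tree", "publish"],
    ["fetch_sequence", "blast_search", "visualize"],
]

def suggest_next(partial, traces):
    """Suggest next steps for a partially composed workflow by
    majority vote over past traces whose prefix matches it.
    Conflicting traces are handled naturally: each casts a
    vote, and suggestions are ranked by vote count."""
    votes = Counter()
    for trace in traces:
        if trace[:len(partial)] == partial and len(trace) > len(partial):
            votes[trace[len(partial)]] += 1
    return votes.most_common()

print(suggest_next(["fetch_sequence", "align"], traces))
# -> [('build_tree', 2)]
```

A real system would use a softer similarity measure than exact prefix match (and richer trace representations), but the vote-based handling of disagreement is the essence of drawing advice from conflicting histories.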
We applied our provenance capture techniques to several applications. One is the Life Science Grid, an open source platform developed in the late 2000s by Eli Lilly and Company. The Life Science Grid gave pharmacology researchers what could be likened to a desktop rich with tools that act on, say, a gene sequence: one tool might visualize the sequence, another might find similar sequences, and so on. We applied provenance capture to demonstrate that it could be a valuable part of documenting a researcher's discovery process, documentation needed as new molecules advance through the steps to becoming new drugs.

We also applied provenance collection to the processing pipeline for imagery from the Advanced Microwave Scanning Radiometer for EOS (AMSR-E) instrument aboard a satellite. From this effort we developed relevancy methods to distinguish provenance that is interesting to a scientist from provenance that is housekeeping in nature. Two images of provenance captured from the AMSR-E pipeline are included. Finally, in collaboration with the Indiana University Global Research Network Operations Center (GRNOC), we applied provenance capture to the NSF-funded GENI network, a computer network for testing new network protocols. In this ongoing project, our goal is to expose information about the experiments carried out on the GENI network.

The project resulted in the Karma provenance tool, a standalone tool that can be added to existing cyberinfrastructure to collect and represent provenance data. Karma's modular architecture supports multiple instrumentation plugins, making it usable in different architectural settings. The Karma provenance tool, licensed under Apache License, Version 2.0, is available at http://pti.iu.edu/d2i/provenance_karma. The project involved 8 computer science graduate students and 2 undergraduate students, and resulted in 6 talks and 11 papers.
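To make concrete the kind of record a standalone provenance collector gathers when an application adds capture hooks, here is a minimal sketch. It is not Karma's actual API; every name in it (ProvenanceRecorder, record, export) and the sample event values are hypothetical stand-ins, shown only to illustrate the origin/process/agent information described above.

```python
import json
import time
import uuid

class ProvenanceRecorder:
    """Hypothetical standalone recorder: the application calls
    record() around each processing step, independent of any
    workflow system."""

    def __init__(self):
        self.events = []

    def record(self, process, inputs, outputs, agent):
        event = {
            "id": str(uuid.uuid4()),       # unique event identifier
            "time": time.time(),           # when the step ran
            "process": process,            # what was applied
            "inputs": list(inputs),        # where the data came from
            "outputs": list(outputs),      # what was produced
            "agent": agent,                # by whom
        }
        self.events.append(event)
        return event

    def export(self):
        # Serialize the lineage in a form suitable for
        # publication alongside the data products.
        return json.dumps(self.events, indent=2)

rec = ProvenanceRecorder()
rec.record("brightness_temp_calibration",
           inputs=["amsre_raw_swath.hdf"],
           outputs=["amsre_l2a.hdf"],
           agent="pipeline-operator")
print(rec.export())
```

Even a toy like this shows why per-step hooks are the unit of manual effort (E): each call adds one more event, and hence one more unit of completeness (C), to the captured lineage.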

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 0721674
Program Officer: Robert Chadduck
Project Start:
Project End:
Budget Start: 2007-09-01
Budget End: 2011-08-31
Support Year:
Fiscal Year: 2007
Total Cost: $437,954
Indirect Cost:
Name: Indiana University
Department:
Type:
DUNS #:
City: Bloomington
State: IN
Country: United States
Zip Code: 47401