This project is addressing a systemic problem in scientific research: although datasets collected through scientific protocols may be properly stored, the protocol itself is often only recorded on paper or stored electronically as the script developed to implement the protocol. Once the scientist who has implemented the protocol leaves the laboratory, this record may be lost. Collected datasets become meaningless without a description of the process used to produce them; furthermore, the experiment designed to produce the data is not reproducible.
This research is developing a database (ProtocolDB) to manage scientific protocols and the datasets collected from their execution. The approach will allow scientists to query, compare, and revise protocols, and to express queries across protocols and data. The research is also addressing the issue of recording and querying the provenance (the why and where) of data. ProtocolDB will benefit scientists by providing a scientific portfolio for the laboratory that not only enables querying and reasoning about protocols, executions of protocols, and collected datasets, but also enables data sharing and collaboration between teams.
The intellectual merit of the research includes the design of a model for scientific workflows and a query language to retrieve, transform, and compare scientific workflows, integrate datasets, and reason about data provenance. This theoretical contribution will advance the development of systems supporting the expression of scientific protocols. The ProtocolDB implementation will be evaluated by our scientific partners. The broader impact of the project is the development of a general-purpose system for managing scientific protocols and their collected datasets. The established collaborations, involving academic, governmental, and private institutions, will contribute significantly to the breadth of its use.
The ProtocolDB project led to the development of new technology to support scientific workflows and had a strong educational impact.
Scientific Outcomes - The scientific outcomes of the project included a new model to represent scientific workflows, a database to store workflow descriptions, and new methods to support reasoning on scientific workflows. The model is a multi-layer approach: a semantic workflow, expressed in the terms of a domain ontology, is mapped to one or more implementations, in which the tools and methods that implement each research task are identified, and finally to one or more data flows, which correspond to workflow executions. At each level, the workflow is expressed as an algebraic expression. Equivalences between algebraic expressions support reasoning on scientific workflows; in particular, they support tasks such as resource discovery, optimization, and data provenance. The findings were peer-reviewed, presented at various international conferences, and published in journals and proceedings. In addition, a new international workshop on Resource Discovery (RED) was created in the context of the project.
Education Outcomes - The educational outcomes were particularly strong. The main benefit was the cross-disciplinary involvement that the project fostered. Several students enrolled in the Professional Science Master in Computational Biosciences, Bioinformatics, and Genomics worked on specific projects to earn credit for a 4-credit mandatory class, Applications and Complex Problem Solving in Computational Biology, in the first year and a 6-credit internship in the second year. In addition, through several REU grants, three undergraduate students were involved in the project.