Biological databases are well curated and richly interconnected, yet integrating their data remains a manual, time-intensive, and error-prone process. The proposal develops a computational infrastructure that exploits biologists' domain expertise to express and execute data integration protocols for biological pipelines. The challenges to be addressed include: (1) an exploration interface that can express complex high-level queries, which are in turn translated into lower-level data manipulation operators; (2) the specification and population of an alternative splice protein analysis pipeline; and (3) a mediation testbed implemented using XML-based wrapper technology and mediator technology (IBM DB2 II).

Metrics of the links and paths that exist between integrated databases may be used to characterize query results in ways that are useful to biologists and data administrators, and this proposal develops models to predict these metrics. Navigational queries require traversing multiple paths that differ in cost and benefit (result cardinality). Cost models, together with domain- and task-specific semantics, are used to choose the best path or set of paths for a biological pipeline.
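As a rough illustration of choosing among navigational paths by cost and benefit, the following minimal sketch enumerates the acyclic paths between two sources and ranks them by estimated result cardinality per unit cost. The source names, the per-link `cost` and `fanout` figures, and the benefit-to-cost ranking rule are all hypothetical assumptions made for this example; they are not the cost models or semantics the project itself develops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    cost: float    # assumed cost of following the link for one source entry
    fanout: float  # assumed average number of target entries per source entry


def enumerate_paths(links, start, goal, path=None, seen=None):
    """Yield every acyclic path (a list of Link objects) from start to goal."""
    path = path or []
    seen = seen or {start}
    if start == goal:
        yield list(path)
        return
    for link in links:
        if link.source == start and link.target not in seen:
            yield from enumerate_paths(
                links, link.target, goal, path + [link], seen | {link.target}
            )


def estimate(path, start_cardinality=1.0):
    """Return (total cost, result cardinality) for a path under the toy model."""
    cost, cardinality = 0.0, start_cardinality
    for link in path:
        cost += cardinality * link.cost   # each current entry follows the link
        cardinality *= link.fanout        # entries multiply by the link fanout
    return cost, cardinality


def benefit_per_cost(path):
    cost, cardinality = estimate(path)
    return cardinality / cost if cost > 0 else float("inf")


def best_path(links, start, goal):
    """Pick the path with the highest estimated benefit per unit cost."""
    return max(enumerate_paths(links, start, goal), key=benefit_per_cost)


# Hypothetical link graph: two ways to reach "citations" from "genes".
LINKS = [
    Link("genes", "proteins", cost=1.0, fanout=2.0),
    Link("proteins", "citations", cost=2.0, fanout=5.0),
    Link("genes", "citations", cost=4.0, fanout=3.0),
]

chosen = best_path(LINKS, "genes", "citations")
print([link.target for link in chosen], estimate(chosen))
```

In this toy graph the two-hop path wins because it returns more results per unit of cost than the direct link. A different ranking rule, for example one weighted by domain- or task-specific preferences, could select another path or a set of paths.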

The research addresses many of the SEIII challenges of large-scale data sharing. The project addresses the exploration of databases by biologists; captures and exploits domain-specific knowledge; develops efficient methodology to compute results; and designs and populates a publicly accessible pipeline and website. The broader impact, beyond the specific biological resources and protein pipeline, is the development of a methodology and evaluation platform that can be applied to any task that requires access to, and analysis of, multiple interconnected heterogeneous resources.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0430915
Program Officer: Sylvia J. Spengler
Project Start:
Project End:
Budget Start: 2005-02-01
Budget End: 2009-07-31
Support Year:
Fiscal Year: 2004
Total Cost: $781,615
Indirect Cost:
Name: University of Maryland College Park
Department:
Type:
DUNS #:
City: College Park
State: MD
Country: United States
Zip Code: 20742