An overwhelming and growing volume and diversity of biological data is available on the Web. The objective of the proposed work is to help biology students and researchers make effective use of this wealth of data by developing a proof-of-concept software assistant, BioLogica, capable of inferring information from multiple and diverse data sources. A cross-disciplinary team has been assembled to carry out this work. Automated logical deduction will be used to achieve semantic integration of multiple heterogeneous biological data sources and to answer queries that require combining data from diverse sources. An axiomatic biological theory will be developed with the ability to 1) express fundamental biological concepts and relationships, 2) represent metadata describing the capabilities of diverse data sources, 3) understand the meaning of complex queries, 4) decompose queries into simpler components to be answered by available sources, and 5) assemble answers by combining results from multiple component queries. While queries can be posed in English or conveyed via a graphical user interface, ultimately they are rephrased as conjectures, that is, theorems to be proven in the formal biological theory. Each conjecture is proven by an automatic theorem prover, and an answer to the query is extracted automatically from the proof. The project includes the development of a formal biological language and theory, the development of techniques to automate the formation of "procedural attachments" (software to access the data sources), and the discovery of domain-specific strategies to accelerate the theorem-proving process. BioLogica will demonstrate that spatial and temporal inference, together with more general reasoning, applied to a domain-specific formal theory provides an effective approach to the semantic integration of scientific data sources.
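The query-answering cycle described above (decompose a query into simpler goals, answer each goal through a procedural attachment, and read the answer off the resulting bindings) can be illustrated with a minimal sketch. This is not BioLogica's theorem prover: it is a toy backward-chaining evaluator for conjunctive queries, and every name in it (the `encodes` and `located_in` predicates, the in-memory tables, `make_attachment`, `solve`) is a hypothetical stand-in chosen for illustration.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Binding = Dict[str, str]
Goal = Tuple[str, str, str]  # (predicate, arg1, arg2); variables start with "?"
Rows = List[Tuple[str, str]]
Attachment = Callable[[Optional[str], Optional[str]], Iterator[Tuple[str, str]]]

# Hypothetical in-memory tables standing in for two remote data sources.
GENE_PRODUCT: Rows = [("geneA", "proteinX"), ("geneB", "proteinY")]
PROTEIN_LOCATION: Rows = [("proteinX", "nucleus"), ("proteinY", "membrane")]


def make_attachment(rows: Rows) -> Attachment:
    """Wrap a table as a 'procedural attachment': a callable that returns
    the rows matching whichever arguments are already known (None = unknown)."""
    def lookup(a: Optional[str], b: Optional[str]) -> Iterator[Tuple[str, str]]:
        for x, y in rows:
            if (a is None or a == x) and (b is None or b == y):
                yield x, y
    return lookup


# Assumed mapping from predicates of the toy theory to their data sources.
ATTACHMENTS: Dict[str, Attachment] = {
    "encodes": make_attachment(GENE_PRODUCT),
    "located_in": make_attachment(PROTEIN_LOCATION),
}


def is_var(term: str) -> bool:
    return term.startswith("?")


def solve(goals: List[Goal], binding: Binding) -> Iterator[Binding]:
    """Prove the goals left to right, yielding one binding per successful proof."""
    if not goals:
        yield dict(binding)
        return
    pred, a, b = goals[0]
    # Substitute values already bound by earlier goals; unbound variables become None.
    a_val = binding.get(a) if is_var(a) else a
    b_val = binding.get(b) if is_var(b) else b
    for x, y in ATTACHMENTS[pred](a_val, b_val):
        extended = dict(binding)
        if is_var(a):
            extended[a] = x
        if is_var(b):
            extended[b] = y
        yield from solve(goals[1:], extended)


# "Which genes encode a protein located in the nucleus?" as a conjecture
# with two component goals that share the variable ?protein.
query: List[Goal] = [
    ("encodes", "?gene", "?protein"),
    ("located_in", "?protein", "nucleus"),
]
answers = [b["?gene"] for b in solve(query, {})]
print(answers)
```

Each goal is answered by a different source, and the shared variable `?protein` is what joins the two partial results. A real prover would also apply axioms of the theory between lookups; this sketch only shows the decomposition and answer-extraction steps.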
The formal knowledge base and prototype tool will be made available on the Web, and annual workshops will be held to acquaint the broader community with the use of the tool and to exchange ideas about improving the underlying technology. A variety of planned student projects will train students in techniques for integrating diverse biological data sources and in developing bioinformatics tools. We will work with interested faculty to incorporate use of the tool into courses. The social impacts include providing a software assistant that enables research biologists to deal with the bewildering multiplicity of available online data, and making such data accessible to instructors and students. The natural-language component of the proposed prototype is capable of spoken, as well as typed, interaction. Thus BioLogica could be adapted for use by researchers and students with visual or other impairments. It is expected that the techniques introduced by BioLogica for the biological sciences will be applicable more generally to the semantic integration of data sources in all the sciences.