One central information technology problem of the next decade will be to create a means of querying a heterogeneous set of life science databases, generally via the Internet. Life science web databases hold information that is critical to researchers. Although these databases collect data and provide automated procedures for data manipulation, user access to them is often still inadequate. They lack a comparable data representation, a unified interface for data exchange among the databases, and a customizable infrastructure (i.e., support for views) that addresses the individual needs of scientists. The research objective of this career plan is to address these issues by developing an information integration system for life science databases that supports views (BACIIS+: Biological and Chemical Information Integration System, plus views).
BACIIS+ will allow for better communication between life science databases, will support continuous and rapid expansion and adaptation to the evolving biological field, and will provide better and more customizable approaches to data access and data analysis through dynamic views. The open research issues involved in the development of BACIIS+ include: 1) the management of large data sets, 2) the interoperability of geographically distributed autonomous databases, 3) the seamless semantic-based integration of these databases with total transparency to the user, and 4) the support for distributed multi-database views. Semantic integration combines data in a meaningful way, whereas syntactic integration merely collects and pastes together data from different databases. Static views are a limited set of views whose scope is predefined, whereas dynamic views are created on demand and their scope is defined entirely by the user.
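The distinctions drawn above can be illustrated with a minimal sketch. It is not the BACIIS+ design: the two toy sources, their field names, the record identifier `X0001`, the `ONTOLOGY` mapping, and all function names are hypothetical, chosen only to contrast syntactic with semantic integration and a static view with a dynamic one.

```python
# Hypothetical records from two sources that name the same concepts differently.
protein_db = [{"acc": "X0001", "organism": "Homo sapiens", "name": "kinase A"}]
structure_db = [{"protein_id": "X0001", "species": "Homo sapiens", "resolution": 2.1}]

# Assumed mapping from each source's field names to shared terms.
ONTOLOGY = {
    "protein_db": {"acc": "protein_accession", "organism": "organism",
                   "name": "protein_name"},
    "structure_db": {"protein_id": "protein_accession", "species": "organism",
                     "resolution": "resolution_angstrom"},
}

def syntactic_integration(*sources):
    """Collect and paste records together without interpreting their fields."""
    return [record for source in sources for record in source]

def semantic_integration(sources):
    """Rename fields to shared terms, then merge records describing one protein."""
    merged = {}
    for source_name, records in sources.items():
        mapping = ONTOLOGY[source_name]
        for record in records:
            normalized = {mapping[key]: value for key, value in record.items()}
            merged.setdefault(normalized["protein_accession"], {}).update(normalized)
    return list(merged.values())

# A static view: its scope (the field list) is fixed in advance.
STATIC_SUMMARY_FIELDS = ["protein_accession", "protein_name"]

def static_summary_view(records):
    return [{f: r.get(f) for f in STATIC_SUMMARY_FIELDS} for r in records]

def dynamic_view(records, fields, predicate=lambda r: True):
    """A dynamic view: the user supplies both the fields and the filter on demand."""
    return [{f: r.get(f) for f in fields} for r in records if predicate(r)]

pasted = syntactic_integration(protein_db, structure_db)
integrated = semantic_integration(
    {"protein_db": protein_db, "structure_db": structure_db}
)
high_resolution = dynamic_view(
    integrated,
    fields=["protein_name", "resolution_angstrom"],
    predicate=lambda r: r["resolution_angstrom"] < 2.5,
)
```

In this sketch the syntactic result still holds two unrelated records, while the semantic result holds one record that combines the protein name with the structure resolution, which is what lets the dynamic view filter on one field and return another.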
Two new graduate courses are planned: Life Science Information Systems and Computational Biology Algorithms. The first course will cover the complexity of information extraction and management in the context of life science data. The second course will cover the foundations of sequential and parallel algorithms for sequence similarity analysis. Collaborators from Eli Lilly & Company and Dow AgroSciences are involved in the courses, providing industry perspectives. The proposed courses build on a current course taught by the PI that covers advanced molecular biology and includes three major sections: bioinformatics, computational modeling, and molecular machinery. The plan actively seeks the participation of undergraduate and minority students through senior design projects. A campus-wide bioinformatics initiative includes a newly established School of Informatics and the Indiana Genomics Center. An international workshop on the interoperability of life science databases is in development.