The objective of the proposed research is to develop a general and robust machine learning system for integrated analysis of high-throughput biological data, with the aim of predicting gene function and protein-protein interactions. Achieving this goal requires addressing multiple challenges, including data heterogeneity, variable data quality, high noise levels, and a paucity of training samples. These challenges have prevented the successful application of traditional machine learning methods to diverse biological data. The research team will leverage the diverse bioinformatics, machine learning, and biology expertise of the co-PIs and collaborators to develop accurate and effective approaches optimized for integrated analysis of genomic data. For prediction of protein-protein interactions, this investigation will focus on Bayesian approaches, building on successful preliminary research. For gene function prediction, the focus will be on developing novel machine learning methods that use heterogeneous biological data as well as the protein-protein interactions predicted by the system. The proposed research will lead to the development of a general bioinformatics system that will use diverse large-scale biological data, including gene expression microarrays, physical and genetic interaction datasets, and sequence and literature data, to produce an accurate map of protein-protein interactions and predictions of function for each of the proteins. This system will address the critical need in genomics to extract accurate biological information from disparate high-throughput data sources, enabling the first step toward accurate and comprehensive study of cellular processes at the whole-genome level.
Additionally, the proposed analysis will provide genomics researchers with quantitative rankings of the relative reliability of high-throughput experimental technologies, showing biologists which technologies are more accurate than others. A significant advantage of this plan is that the research team will work closely with biologists to evaluate the predictions and feed the results back into the investigation to further improve the system and the quality of its predictions.
The proposed system will provide predictions that will drive biological experimentation, enabling genome-wide annotation of genes of unknown function. The system will be publicly available to genomics researchers through its integration with the Saccharomyces Genome Database, a model organism database for yeast, and through distribution of the integrated framework to other model organism databases. The interdisciplinary approach of this proposal will further the impact of advanced computer science on biology and will foster further interactions between the two fields, both through research and through interdisciplinary education.