The objective of the proposed research is to develop a general and robust machine learning system for the integrated analysis of high-throughput biological data, with the aim of predicting gene function and protein-protein interactions. Achieving this goal requires addressing multiple challenges, including data heterogeneity, variable data quality, high noise levels, and a paucity of training samples. These challenges have prevented the successful application of traditional machine learning methods to diverse biological data. The research team will leverage the diverse bioinformatics, machine learning, and biology expertise of the co-PIs and collaborators to develop accurate and effective approaches optimized for the integrated analysis of genomic data. For prediction of protein-protein interactions, the investigation will focus on Bayesian approaches that build on successful preliminary research. For gene function prediction, the focus will be on developing novel machine learning methods that use heterogeneous biological data as well as the protein-protein interactions predicted by the system. The proposed research will lead to the development of a general bioinformatics system that uses diverse large-scale biological data, including gene expression microarrays, physical and genetic interaction datasets, and sequence and literature data, to produce an accurate map of protein-protein interactions and functional predictions for each protein. This system will address the critical need in genomics to extract accurate biological information from disparate high-throughput data sources, enabling the first step toward an accurate and comprehensive study of cellular processes at the whole-genome level.
Additionally, the proposed analysis will provide genomics researchers with quantitative rankings of the relative reliability of high-throughput experimental technologies, showing biologists which technologies are more accurate than others. A significant advantage of this plan is that the research team will work closely with biologists to evaluate the predictions and feed the results back into the investigation, further improving the system and the quality of its predictions.
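
As an illustration of how a Bayesian approach can both integrate heterogeneous evidence and rank data sources by reliability, the following is a minimal sketch, not the project's actual method. It assumes a naive Bayes model in which each data source contributes an independent likelihood ratio estimated against a gold standard of known interacting and non-interacting pairs. The source names, protein identifiers, prior, and pseudocount are all hypothetical, and a missing observation is treated as uninformative for simplicity.

```python
import math


def likelihood_ratios(gold, evidence, pseudocount=1.0):
    """Estimate, for each data source, the ratio
    P(pair reported | interacts) / P(pair reported | does not interact)
    from a gold standard mapping pair -> True/False.
    The pseudocount smooths estimates from small gold standards."""
    positives = [pair for pair, label in gold.items() if label]
    negatives = [pair for pair, label in gold.items() if not label]
    ratios = {}
    for source, reported in evidence.items():
        true_hits = sum(1 for pair in positives if pair in reported)
        false_hits = sum(1 for pair in negatives if pair in reported)
        p_pos = (true_hits + pseudocount) / (len(positives) + 2 * pseudocount)
        p_neg = (false_hits + pseudocount) / (len(negatives) + 2 * pseudocount)
        ratios[source] = p_pos / p_neg
    return ratios


def posterior(pair, evidence, ratios, prior):
    """Combine the sources that report `pair` under the naive Bayes
    (conditional independence) assumption. Sources that do not report
    the pair are skipped, i.e. absence is treated as uninformative."""
    log_odds = math.log(prior / (1.0 - prior))
    for source, reported in evidence.items():
        if pair in reported:
            log_odds += math.log(ratios[source])
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical gold standard: four known interactions, four known non-interactions.
gold = {
    ("A", "B"): True, ("A", "C"): True, ("B", "C"): True, ("D", "E"): True,
    ("A", "D"): False, ("A", "E"): False, ("B", "D"): False, ("C", "E"): False,
}

# Hypothetical data sources, each reduced to the set of pairs it reports.
evidence = {
    "affinity_purification": {("A", "B"), ("A", "C"), ("B", "C"), ("X", "Y")},
    "coexpression": {("A", "B"), ("D", "E"), ("A", "D"), ("B", "D"), ("X", "Z")},
}

ratios = likelihood_ratios(gold, evidence)

# Sources ordered from most to least reliable on this gold standard.
ranking = sorted(ratios, key=ratios.get, reverse=True)

# Posterior interaction probabilities for two uncharacterized pairs.
p_xy = posterior(("X", "Y"), evidence, ratios, prior=0.2)
p_xz = posterior(("X", "Z"), evidence, ratios, prior=0.2)
```

In this sketch the per-source likelihood ratio doubles as the reliability score: a source whose reports are enriched for known interactions receives a ratio above 1 and raises the posterior, while a source that reports true and false pairs equally often receives a ratio near 1 and leaves the prior unchanged.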

The proposed system will provide predictions that will drive biological experimentation, enabling genome-wide annotation of unknown genes. The system will be publicly available to genomics researchers through its integration with the Saccharomyces Genome Database, a model organism database for yeast, and through distribution of the integrated framework to other model organism databases. The interdisciplinary approach of this proposal will extend the impact of advanced computer science on biology and will foster further interactions between the two fields, both through research and through interdisciplinary education.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0513552
Program Officer: Sylvia J. Spengler
Budget Start: 2005-07-15
Budget End: 2009-06-30
Fiscal Year: 2005
Total Cost: $471,442
Name: Princeton University
City: Princeton
State: NJ
Country: United States
Zip Code: 08540