Currently, a gap exists between the explosion of high-throughput data generation in molecular biology and the slower growth of reliable functional information extracted from the data. This gap arises largely because the currently available large-scale experimental technologies for rapid protein function assessment lack the specificity necessary for accurate gene function prediction. Bioinformatics methods that integrate diverse data sources achieve higher accuracy and thus alleviate this lack of specificity, but there is a paucity of generalizable, efficient, and accurate methods for data integration. In addition, no efficient methods exist to display diverse genomic data, even though visualization has been very valuable for analysis of data from large-scale technologies such as gene expression microarrays. The long-term goal of this proposal is to develop an accurate and generalizable bioinformatics framework for integrated analysis and visualization of heterogeneous biological data. We propose to address the data integration problem with a Bayesian network approach and effective visualization methods. We have shown the efficacy of this method in a proof-of-principle system that increased the accuracy of gene function prediction for Saccharomyces cerevisiae compared to individual data sources. Building on our previous work, we present a two-part plan to improve and expand our system and to develop novel visualization methods for genomic data based on scalable display technology. First, we will investigate the computational and theoretical issues behind accurate integration, analysis, and effective visualization of heterogeneous high-throughput data.
Then, leveraging our existing system and the algorithmic improvements developed in the first part of the project, we will design and implement a full-scale data integration and function prediction system for Saccharomyces cerevisiae that will be integrated with the Saccharomyces Genome Database (SGD), a model organism database for yeast. The proposed system will provide highly accurate automatic function prediction that can accelerate genomic functional annotation through targeted experimental testing. Furthermore, our system will perform general data integration and offer researchers a unified view of diverse high-throughput data through integration and visualization tools, thereby facilitating hypothesis generation and data analysis. Our scalable visualization methods will enable teams of researchers to examine biological data interactively and thus support the highly collaborative nature of genomic research. In addition to contributing to S. cerevisiae genomics, the technology for efficient and accurate heterogeneous data integration and visualization developed as a result of this proposal will form a basis for systems that address the same set of issues for other organisms, including human.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM071966-05
Application #: 7595813
Study Section: Special Emphasis Panel (ZRG1-BDMA (01))
Program Officer: Lyster, Peter
Project Start: 2005-04-01
Project End: 2010-12-31
Budget Start: 2009-04-01
Budget End: 2010-12-31
Support Year: 5
Fiscal Year: 2009
Total Cost: $243,004
Indirect Cost:
Name: Princeton University
Department: Biostatistics & Other Math Sci
Type: Schools of Engineering
DUNS #: 002484665
City: Princeton
State: NJ
Country: United States
Zip Code: 08544
Zhou, Jian; Theesfeld, Chandra L; Yao, Kevin et al. (2018) Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet 50:1171-1179
Kaletsky, Rachel; Yao, Victoria; Williams, April et al. (2018) Transcriptome analysis of adult Caenorhabditis elegans cells reveals tissue-specific gene and isoform expression. PLoS Genet 14:e1007559
Dannenfelser, Ruth; Nome, Marianne; Tahiri, Andliena et al. (2017) Data-driven analysis of immune infiltrate in a large cohort of breast cancer and its association with disease progression, ER activity, and genomic complexity. Oncotarget 8:57121-57133
Nirschl, Christopher J; Suárez-Fariñas, Mayte; Izar, Benjamin et al. (2017) IFNγ-Dependent Tissue-Immune Homeostasis Is Co-opted in the Tumor Microenvironment. Cell 170:127-141.e15
Watson, Emma; Olin-Sandoval, Viridiana; Hoy, Michael J et al. (2016) Metabolic network rewiring of propionate flux compensates vitamin B12 deficiency in C. elegans. Elife 5:
Zhou, Jian; Troyanskaya, Olga G (2016) Probabilistic modelling of chromatin code landscape reveals functional diversity of enhancer-like chromatin states. Nat Commun 7:10528
Krishnan, Arjun; Zhang, Ran; Yao, Victoria et al. (2016) Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nat Neurosci 19:1454-1462
Zhou, Jian; Troyanskaya, Olga G (2015) Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 12:931-4
Gorenshteyn, Dmitriy; Zaslavsky, Elena; Fribourg, Miguel et al. (2015) Interactive Big Data Resource to Elucidate Human Immune Pathways and Diseases. Immunity 43:605-14
Wong, Aaron K; Krishnan, Arjun; Yao, Victoria et al. (2015) IMP 2.0: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res 43:W128-33

Showing the most recent 10 out of 67 publications