Gene expression experiments are an abundant and robust source of functional genomics data, with thousands of microarray studies and a growing number of high-throughput RNA sequencing studies publicly available, most interrogating clinical and biological systems relevant to disease. They hold the promise of data-driven characterization of gene function and regulation, including in specific tissues, cell lines, and disease states, and can advance the understanding and modeling of regulatory changes that form the basis of human disease. However, these data remain largely underutilized, as biology researchers do not have effective tools to explore and analyze the entire data collection to generate novel hypotheses and direct experiments. The situation resembles that of the Internet before search engines: a biology researcher has to know a priori which datasets pertain to the biological question she is asking, reflect the tissue- and cell-lineage-specific signals of interest to her, and accurately measure the expression of genes related to her pathways of interest. There is a clear need for methods that enable biology researchers to use their domain-specific knowledge to direct their exploration of public human expression data, so that they can generate hypotheses and direct experiments addressing challenging biomedical questions. Such a system should give users the ability to effectively explore automatically identified datasets relevant to their biological question of interest, leverage metazoan complexity, including cell-lineage and disease-specific signals, and allow researchers to securely include their unpublished data in the analysis. To address these challenges, this proposal describes a "Google-style" public search engine for large collections of gene expression data, built using novel search algorithms and leveraging cloud-computing technologies.
This system implements a novel query-based, context-sensitive algorithm for searching large expression compendia that exploits the complexity of metazoan organisms, including the cell-lineage complexity and disease aspects inherent to human expression studies. Furthermore, the challenge of heterogeneity in human samples will be addressed by developing novel hierarchical learning methods to predict cell-lineage- or tissue-specific gene expression based on the compendium and to identify these signals in each dataset. This will enable users to explore tissue-specific expression and will also be integrated with the search algorithm to improve search accuracy. The proposed algorithms, search engine, and user interface will be extensively evaluated in close collaboration with biology researchers, and top predictions will be tested experimentally. These methods will be implemented in a user-friendly public search system that will leverage cloud computing to provide robust interactive query response and will enable biology researchers to explore both published data collections and their own pre-publication datasets in a context-specific, integrated, and secure manner.
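
To make the idea of a query-based, context-sensitive search more concrete, the following is a minimal sketch of one way such a search could work. It is not the algorithm from this proposal, which the abstract does not specify. It assumes that a dataset is relevant to a query when the query genes are coexpressed within it, and that candidate genes can be ranked by their dataset-weighted correlation to the query. The names `pearson`, `dataset_weight`, `search`, and the toy `compendium` are hypothetical and introduced only for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def dataset_weight(dataset, query):
    """Assumed relevance weight: mean pairwise correlation among the
    query genes measured in this dataset, floored at zero."""
    present = sorted(g for g in query if g in dataset)
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)


def search(compendium, query):
    """Rank non-query genes by dataset-weighted mean correlation to the query."""
    query = set(query)
    totals, weight_sums = {}, {}
    for dataset in compendium.values():
        w = dataset_weight(dataset, query)
        if w == 0.0:
            continue  # dataset judged irrelevant to this query
        present = sorted(g for g in query if g in dataset)
        for gene, profile in dataset.items():
            if gene in query:
                continue
            r = sum(pearson(profile, dataset[q]) for q in present) / len(present)
            totals[gene] = totals.get(gene, 0.0) + w * r
            weight_sums[gene] = weight_sums.get(gene, 0.0) + w
    scores = {g: totals[g] / weight_sums[g] for g in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy compendium (made-up numbers): gene -> expression across four samples.
compendium = {
    "study_A": {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8],
                "G3": [1, 2, 3, 5], "G4": [4, 3, 2, 1]},
    "study_B": {"G1": [1, 2, 3, 4], "G2": [4, 3, 2, 1],
                "G3": [4, 1, 3, 2], "G4": [1, 2, 3, 4]},
}
ranked = search(compendium, ["G1", "G2"])
```

In this toy example the query genes are coexpressed only in `study_A`, so `study_B` receives zero weight and does not influence the ranking. A real system would also need normalization across platforms, handling of missing genes, and the tissue-specific signals described above, none of which this sketch attempts.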

Public Health Relevance

We will develop a Google-style search engine for massive collections of human gene expression data. Our system will enable researchers to use their domain knowledge to explore the entirety of public human expression data to generate hypotheses and direct experiments addressing a diverse range of challenging biomedical questions. Public availability of our system will advance genome-level understanding of human biology and facilitate development of novel drugs, therapies, and personalized medical treatments.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Bonazzi, Vivien
Princeton University
Biostatistics & Other Math Sci
Schools of Engineering
United States