Gene expression experiments are an abundant and robust source of functional genomics data, with thousands of microarray and a growing number of high-throughput RNA sequencing studies publicly available, most interrogating clinical and biological systems relevant to disease. They hold the promise of data-driven characterization of gene function and regulation, including in specific tissues, cell lines, and disease states, and can advance the understanding and modeling of regulatory changes that form the basis of human disease. However, these data remain largely underutilized, as biology researchers do not have effective tools to explore and analyze the entire data collection to generate novel hypotheses and direct experiments. The situation is similar to that of the Internet before search engines: a biology researcher has to know a priori which datasets pertain to the biological question she is asking, reflect the tissue- or cell-lineage-specific signals of interest to her, and accurately measure the expression of genes related to her pathways of interest. There is a clear need for methods that will enable biology researchers to use their domain-specific knowledge to direct their exploration of public human expression data, enabling them to generate hypotheses and direct experiments addressing challenging biomedical questions. Such a system should provide users with the ability to effectively explore automatically identified datasets relevant to their biological question of interest, leverage metazoan complexity including cell-lineage and disease-specific signals, and allow the researcher to securely include their unpublished data in the analysis. To address these challenges, this proposal describes a "Google-style" public search engine for large collections of gene expression data built using novel search algorithms and leveraging cloud-computing technologies.
This system implements a novel query-based, context-sensitive algorithm for searching large expression compendia that exploits the complexity of metazoan organisms, including the cell-lineage complexity and disease aspects inherent to human expression studies. Furthermore, the challenge of heterogeneity in human samples will be addressed by developing novel hierarchical learning methods to predict cell-lineage- or tissue-specific gene expression based on the compendium and to identify these signals in each dataset. This will enable users to explore tissue-specific expression and will also be integrated with the search algorithm to improve search accuracy. The proposed algorithms, search engine, and user interface will be extensively evaluated in close collaboration with biology researchers, and top predictions will be tested experimentally. These methods will be implemented in a user-friendly public search system that will leverage cloud computing to provide robust interactive query response and will enable biology researchers to explore both published data collections and their own pre-publication datasets in a context-specific, integrated, and secure manner.
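The abstract does not spell out how the query-based search works. As a rough illustration only, the sketch below shows one way such a search could be organized: each dataset is weighted by how strongly the query genes are co-expressed within it, and candidate genes are then ranked by their weighted average correlation to the query. This is an assumption made for illustration, not the proposal's algorithm; the function names (`pearson`, `dataset_weight`, `search`), the dataset names, and the toy expression values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def dataset_weight(dataset, query):
    """Assumed relevance score: mean pairwise correlation among query genes, floored at 0."""
    present = [g for g in query if g in dataset]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)


def search(compendium, query):
    """Weight every dataset by the query, then rank non-query genes across datasets."""
    weights = {name: dataset_weight(ds, query) for name, ds in compendium.items()}
    score_sum, weight_sum = {}, {}
    for name, ds in compendium.items():
        w = weights[name]
        if w == 0.0:
            continue  # dataset judged irrelevant to this query
        present = [g for g in query if g in ds]
        for gene, profile in ds.items():
            if gene in query:
                continue
            # Average correlation of this gene to the query genes in this dataset.
            r = sum(pearson(profile, ds[q]) for q in present) / len(present)
            score_sum[gene] = score_sum.get(gene, 0.0) + w * r
            weight_sum[gene] = weight_sum.get(gene, 0.0) + w
    scored = [(gene, score_sum[gene] / weight_sum[gene]) for gene in score_sum]
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return weights, ranked


# Toy compendium (hypothetical values): gene -> expression across four samples.
compendium = {
    "liver_study": {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8],
                    "C": [1.1, 2.1, 2.9, 4.2], "D": [4, 3, 2, 1]},
    "noise_study": {"A": [1, 2, 3, 4], "B": [4, 1, 3, 2],
                    "C": [2, 1, 4, 3], "D": [1, 2, 3, 4]},
}
weights, ranked = search(compendium, query=["A", "B"])
for gene, score in ranked:
    print(gene, round(score, 3))
```

In this toy case the query genes are co-expressed only in the first dataset, so the second receives zero weight and does not influence the ranking. A production system would also need normalization across platforms, handling of missing values, and the tissue-specific signals the abstract describes, none of which this sketch attempts.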

Public Health Relevance

We will develop a Google-style search engine for massive collections of human gene expression data. Our system will enable researchers to use their domain knowledge to explore the entirety of public human expression data to generate hypotheses and direct experiments addressing a diverse range of challenging biomedical questions. Public availability of our system will advance genome-level understanding of human biology and facilitate development of novel drugs, therapies, and personalized medical treatments.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Bonazzi, Vivien
Princeton University
Biostatistics & Other Math Sci
Schools of Engineering
United States
Zhou, Jian; Theesfeld, Chandra L; Yao, Kevin et al. (2018) Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet 50:1171-1179
Watson, Emma; Olin-Sandoval, Viridiana; Hoy, Michael J et al. (2016) Metabolic network rewiring of propionate flux compensates vitamin B12 deficiency in C. elegans. Elife 5:
Zhou, Jian; Troyanskaya, Olga G (2016) Probabilistic modelling of chromatin code landscape reveals functional diversity of enhancer-like chromatin states. Nat Commun 7:10528
Krishnan, Arjun; Zhang, Ran; Yao, Victoria et al. (2016) Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nat Neurosci 19:1454-1462
Zhou, Jian; Troyanskaya, Olga G (2015) Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 12:931-4
Wong, Aaron K; Krishnan, Arjun; Yao, Victoria et al. (2015) IMP 2.0: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res 43:W128-33
Goya, Jonathan; Wong, Aaron K; Yao, Victoria et al. (2015) FNTM: a server for predicting functional networks of tissues in mouse. Nucleic Acids Res 43:W182-7
Park, Christopher Y; Krishnan, Arjun; Zhu, Qian et al. (2015) Tissue-aware data integration approach for the inference of pathway interactions in metazoan organisms. Bioinformatics 31:1093-101
Greene, Casey S; Krishnan, Arjun; Wong, Aaron K et al. (2015) Understanding multicellular function and disease with human tissue-specific networks. Nat Genet 47:569-76
Zhu, Qian; Wong, Aaron K; Krishnan, Arjun et al. (2015) Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat Methods 12:211-4, 3 p following 214

Showing the most recent 10 out of 23 publications