Pathway analysis of genomic data, the use of prior knowledge about how genes function together in biological systems, plays an increasingly critical role in gaining biological insights from large-scale genomic studies, particularly in cancer research. However, even the richest source of computer-accessible biological pathway information, the Gene Ontology (GO), is very incomplete, which hampers pathway analyses. Over the past three years, the GO Consortium (GOC) has developed a project showing that a rigorous phylogenetic approach can increase the amount of knowledge for human genes five-fold through careful use of experimental data obtained in model organisms such as the mouse, fruit fly, and yeast. The GOC project, however, relies on expert human biologists and will not scale to the entire human genome. Here, we propose to develop a computational approach that leverages the experience gained in the GOC project. We will develop an accurate, scalable computational solution to the gene function inference problem, which will dramatically increase the amount of biological information that can be used in the analysis of genome-scale human datasets. In brief, the task is to integrate knowledge obtained from experiments across multiple organisms, in the context of the family tree that relates the genes, by constructing a probabilistic model of function conservation and divergence. The main application of the probabilistic model will be to infer the function of human genes from experiments in other organisms. While each gene family will have a specific model depending on its own unique history, to avoid overfitting we will estimate only a small number of parameters that are shared across all families. We propose to use the same rigorous model of functional evolution as employed in the GOC project, which is based on evolutionary gain and loss of different kinds of functions (e.g., a catalytic function, a binding function, or even participation in a biological process or pathway), using not only GO annotations but also additional information such as protein domain structure and active sites. We will use the manually curated examples from the GO Consortium both as a training set for developing our computational inference method and as a test set for assessing it. We expect that this work will result in a dramatic increase in the number of GO annotations for human genes, yielding much more informative results from pathway analysis and thus generating additional insights into human disease risk, progression, and potential therapies. While our approach is general, we will focus manual validation on cancer-related pathways in order to ensure applicability specifically in cancer research.
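
As a rough illustration of what a probabilistic model of function gain and loss on a gene tree can look like, the sketch below treats a single function as a two-state character (present or absent) evolving along the branches of a family tree, and computes the posterior probability that a human gene has the function given experimental observations in other organisms. This is an assumption-laden toy, not the project's actual method: the two-state Markov model, the stationary root prior, the `Node` class, the function names, the tree shape, the branch lengths, and the `gain` and `loss` rates are all hypothetical choices made for the example. The two rates stand in for the small number of parameters shared across families.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A node in a (hypothetical) gene family tree."""
    name: str
    branch_length: float = 0.0     # length of the branch leading to this node
    state: Optional[int] = None    # 1 = function observed, 0 = observed absent, None = unknown
    children: List["Node"] = field(default_factory=list)


def transition_matrix(gain: float, loss: float, t: float):
    """Two-state Markov transition probabilities over a branch of length t.

    Row index is the parent state, column index is the child state.
    """
    total = gain + loss
    pi1 = gain / total                  # stationary probability of "has function"
    decay = math.exp(-total * t)
    p01 = pi1 * (1.0 - decay)           # gain of function along the branch
    p10 = (1.0 - pi1) * (1.0 - decay)   # loss of function along the branch
    return ((1.0 - p01, p01), (p10, 1.0 - p10))


def partial_likelihood(node: Node, gain: float, loss: float) -> Tuple[float, float]:
    """Probability of the observations below `node`, for each state of `node`."""
    if not node.children:
        if node.state is None:
            return (1.0, 1.0)           # unobserved leaf carries no information
        return (1.0, 0.0) if node.state == 0 else (0.0, 1.0)
    result = [1.0, 1.0]
    for child in node.children:
        child_l = partial_likelihood(child, gain, loss)
        p = transition_matrix(gain, loss, child.branch_length)
        for parent_state in (0, 1):
            result[parent_state] *= sum(
                p[parent_state][s] * child_l[s] for s in (0, 1)
            )
    return (result[0], result[1])


def tree_likelihood(root: Node, gain: float, loss: float) -> float:
    """Likelihood of all observed leaves, using a stationary prior at the root."""
    pi1 = gain / (gain + loss)
    l0, l1 = partial_likelihood(root, gain, loss)
    return (1.0 - pi1) * l0 + pi1 * l1


def posterior_has_function(root: Node, target: Node, gain: float, loss: float) -> float:
    """Posterior probability that `target` has the function, given the other leaves."""
    saved = target.state
    joint = []
    for s in (0, 1):
        target.state = s                # clamp the target leaf to each state in turn
        joint.append(tree_likelihood(root, gain, loss))
    target.state = saved
    return joint[1] / (joint[0] + joint[1])


# Hypothetical family: function observed experimentally in mouse and fly,
# nothing known for yeast, and the human gene is the inference target.
human = Node("human", 0.1)
mouse = Node("mouse", 0.1, state=1)
fly = Node("fly", 0.6, state=1)
yeast = Node("yeast", 1.0)
mammals = Node("mammals", 0.3, children=[human, mouse])
animals = Node("animals", 0.2, children=[mammals, fly])
root = Node("root", children=[animals, yeast])

print(posterior_has_function(root, human, gain=0.05, loss=0.2))
```

In this toy setup, an observation in a closely related gene (short total branch length to the human leaf) moves the posterior much further from the prior than an observation in a distant one, which is the intuition behind weighting model-organism experiments by their position in the family tree. A realistic model would also need to handle multiple functions, gene duplication events, and the additional evidence mentioned above, such as domain structure and active sites.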

Public Health Relevance

An enormous amount of biological information has been painstakingly accumulated from experiments not only in human cells, but in many other organisms as well. This information has been critical for using genomics experiments to understand cancer progression and potential therapies, but current approaches make use of only a small fraction of the available information. This project will dramatically increase the amount of biological information available, improving genomics analysis in studies of cancer and other diseases.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Program Projects (P01)
Study Section: Special Emphasis Panel (ZCA1)
University of Southern California
Los Angeles
United States
