The overall goal of this project is to develop novel statistical methods for integrative analysis of genomic data in cancer research. We propose to develop analytical tools that can integrate data from multiple genomic platforms and incorporate external omic information from publicly available databases. These tools will be applicable both to etiological studies geared toward causal discovery and to clinical and translational studies geared toward predictive modeling. Advances in high-throughput molecular technologies have enabled large-scale omic projects (e.g., ENCODE, The Cancer Genome Atlas, Epigenome Roadmap) to generate vast amounts of information on the structure, function, and regulation of the genome. In addition to these publicly available data, individual studies are increasingly generating multiplatform genomic profiles (e.g., genotypes, gene expression, methylation, copy number variation, miRNA) to elucidate the complex mechanisms of cancer development and progression and to investigate determinants and predictors of health and clinical outcomes. Integration across these multiple genomic 'dimensions' and incorporation of the available external information can increase the ability to discover causal relationships (e.g., cancer-SNP associations), enhance prediction and prognosis modeling (e.g., of cancer aggressiveness), and provide insights into biological mechanisms. We propose two analytic approaches aimed at addressing the challenges to effective integration across multiplatform genomic data and incorporation of external information from omic projects. The first approach (Aim 1) is a Bayesian regression and feature selection method that integrates prior omic information flexibly, allowing the data to 'speak for themselves' in determining which pieces of external information are relevant to the problem at hand.
The method works with individual-level data and also with meta-analytic summaries, making it well suited for analyzing data from large multi-study consortia. The second approach (Aim 2) is a regularized regression and feature selection method for integrating multiplatform genomic features measured on the same set of individuals. The method is designed to scale to the very large numbers of features typical of genomewide platforms, to account for the different properties of each genomic data type, and to incorporate relevant external information to increase efficiency. Both approaches can be applied for causal discovery and for developing predictive and prognostic models. We will apply our methods to search for novel risk variants in the CORECT consortium of genomewide association studies, and to construct a prognostic model of colorectal cancer (CRC) recurrence based on genomewide expression and methylation data in the ColoCare consortium cohort of CRC patients. This work will provide new tools for analyzing high-dimensional multiplatform genomic data that can take advantage of available external information.

Public Health Relevance

Cancer results from a complex series of alterations of the structure, function, and regulation of the genome. Integration of information across these multiple genomic 'dimensions' can provide insights into the development and progression of cancer and accelerate the discovery of novel biomarkers for prediction and prognosis. The goal of this project is to develop novel statistical methods for integrating multiple levels of genomic information to elucidate the complex mechanisms of cancer development and progression and to investigate the determinants and predictors of cancer clinical outcomes. We will apply these methods to two studies that have characterized germline and somatic variation in tumors: one of colorectal cancer patients followed for clinical outcomes, and one large consortium of colorectal cancer association studies.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Program Projects (P01)
Study Section: Special Emphasis Panel (ZCA1)
University of Southern California
Los Angeles
United States
Ryser, Marc D; Min, Byung-Hoon; Siegmund, Kimberly D et al. (2018) Spatial mutation patterns as markers of early colorectal tumor cell mobility. Proc Natl Acad Sci U S A 115:5774-5779
Liu, Jie; Liang, Gangning; Siegmund, Kimberly D et al. (2018) Data integration by multi-tuning parameter elastic net regression. BMC Bioinformatics 19:369
Moss, Lilit C; Gauderman, William J; Lewinger, Juan Pablo et al. (2018) Using Bayes model averaging to leverage both gene main effects and G × E interactions to identify genomic regions in genome-wide association studies. Genet Epidemiol :
Ritz, Beate R; Chatterjee, Nilanjan; Garcia-Closas, Montserrat et al. (2017) Lessons Learned From Past Gene-Environment Interaction Successes. Am J Epidemiol 186:778-786
Gauderman, W James; Mukherjee, Bhramar; Aschard, Hugues et al. (2017) Update on the State of the Science for Analytical Methods for Gene-Environment Interactions. Am J Epidemiol 186:762-770
Thomas, Duncan C (2017) Estimating the Effect of Targeted Screening Strategies: An Application to Colonoscopy and Colorectal Cancer. Epidemiology 28:470-478
Rao, D C; Sung, Yun J; Winkler, Thomas W et al. (2017) Multiancestry Study of Gene-Lifestyle Interactions for Cardiovascular Traits in 610 475 Individuals From 124 Cohorts: Design and Rationale. Circ Cardiovasc Genet 10:
The Gene Ontology Consortium (2017) Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res 45:D331-D338
Mi, Huaiyu; Huang, Xiaosong; Muruganujan, Anushya et al. (2017) PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Res 45:D183-D189
Gref, Anna; Merid, Simon K; Gruzieva, Olena et al. (2017) Genome-Wide Interaction Analysis of Air Pollution Exposure and Childhood Asthma with Functional Follow-up. Am J Respir Crit Care Med 195:1373-1383

Showing the most recent 10 out of 28 publications