This proposal aims to develop advanced statistical methods for analyzing large next-generation sequencing data in genetic epidemiological studies of cancer. The genomic era holds unprecedented promise for understanding multifactorial diseases, such as cancer, and for identifying specific targets that can be used to develop patient-tailored therapies. Although hundreds of genome-wide association studies in the last few years have identified over a thousand common genetic variants associated with many complex diseases, these variants explain only a small fraction of disease heritability. Recent advances in next-generation sequencing technologies provide an exciting new opportunity for discovering genes and biomarkers associated with diseases or traits, studying gene-environment interactions, predicting disease risk, and advancing personalized medicine. However, large sequencing data sets, especially those involving rare variants, present fundamental statistical and computational challenges in data analysis and result interpretation. A shortage of appropriate and powerful statistical methods for analyzing next-generation sequencing data has become a bottleneck for effectively using these rich resources to rapidly develop novel molecular cancer prevention and treatment strategies. The purpose of this proposal is to respond to this need. The proposed methods are motivated by and applied to the Harvard Lung Cancer and Breast Cancer exome and targeted sequencing association studies, in which the investigators play a major leadership role.
The specific aims are: (1) to develop a unified, powerful and robust statistical framework to test the association between rare variants and diseases and traits in sequencing association studies; (2) to develop penalized likelihood-based methods for risk prediction in population-based sequencing studies; (3) to use the causal inference framework for mediation analysis to estimate and test for the direct effects of rare genetic variants, and their indirect effects mediated through environmental risk factors, on disease risk in sequencing studies, while accounting for measurement error in exposures; and (4) to develop efficient, user-friendly, open-access statistical software. This project integrates closely with Projects 1 and 2 through a common theme of analyzing large and complex observational study data, and takes advantage of the expertise of Projects 1 and 2 in causal inference for mediation analysis and in modeling environmental exposures to study the interplay of genes and environment. It also relies heavily on the Statistical Computing Core, and on the organizational infrastructure, team-building strategies, workshops and visitor program provided through the Administrative Core.

Public Health Relevance

This project aims to develop statistical methods to advance cancer prevention and intervention strategies by using next-generation sequencing data to identify genetic variants associated with cancer, to build genetic risk prediction models for cancer, and to study the direct and indirect effects of genetic variants in the interplay of genes and environment in cancer risk and progression.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Program Projects (P01)
Study Section
Special Emphasis Panel (ZCA1-RPRB-2)
Harvard University
United States