This proposal aims to develop advanced statistical methods for analyzing large next-generation sequencing data sets in genetic cancer epidemiological studies. The genomic era offers unprecedented promise for understanding multifactorial diseases such as cancer and for identifying specific targets that can be used to develop patient-tailored therapies. Although hundreds of genome-wide association studies in the last few years have identified over a thousand common genetic variants associated with many complex diseases, these variants explain only a small fraction of disease heritability. Recent advances in next-generation sequencing technologies provide an exciting new opportunity for discovering genes and biomarkers associated with diseases or traits, studying gene-environment interactions, predicting disease risk, and advancing personalized medicine. However, large sequencing data sets, and rare variants in particular, present fundamental statistical and computational challenges in data analysis and result interpretation. A shortage of appropriate and powerful statistical methods for analyzing next-generation sequencing data has become a bottleneck to using these rich resources effectively to rapidly develop novel molecular cancer prevention and treatment strategies. The purpose of this proposal is to respond to this need. The proposed methods are motivated by and applied to the Harvard Lung Cancer and Breast Cancer exome and targeted sequencing association studies, in which the investigators play a major leadership role.
The specific aims are: (1) To develop a unified, powerful and robust statistical framework to test the association between rare variants and diseases and traits in sequencing association studies; (2) To develop penalized likelihood-based methods for risk prediction in population-based sequencing studies; (3) To use the causal inference framework for mediation analysis to estimate and test for the direct effects of rare genetic variants and their indirect effects mediated through environmental risk factors on disease risk in sequencing studies, and to account for measurement error in exposures; (4) To develop efficient, user-friendly, open-access statistical software. This project integrates closely with Projects 1 and 2 through a common theme of analyzing large and complex observational study data, and takes advantage of the expertise of Projects 1 and 2 in causal inference for mediation analysis and in modeling environmental exposures to study the interplay of genes and environment. It also relies heavily on the Statistical Computing Core and on the organizational infrastructure, team-building strategies, workshops and visitor program provided through the Administrative Core.

Public Health Relevance

This project aims to develop statistical methods to advance cancer prevention and intervention strategies by using next-generation sequencing data to identify genetic variants associated with cancer, to build genetic risk prediction models for cancer, and to study the direct and indirect effects of genetic variants in the interplay of genes and environment in cancer risk and progression.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Program Projects (P01)
Study Section
Special Emphasis Panel (ZCA1-RPRB-2 (M1))
Harvard University
United States
