This proposal aims to develop advanced statistical methods for analyzing large next-generation sequencing data in genetic cancer epidemiological studies. The genomic era holds unprecedented promise for understanding multifactorial diseases, such as cancer, and for identifying specific targets that can be used to develop patient-tailored therapies. Although hundreds of genome-wide association studies in the last few years have identified over a thousand common genetic variants associated with many complex diseases, these variants explain only a small fraction of disease heritability. Recent advances in next-generation sequencing technologies provide an exciting new opportunity for discovering genes and biomarkers associated with diseases or traits, studying gene-environment interactions, predicting disease risk, and advancing personalized medicine. However, large sequencing data, especially data on rare variants, present fundamental statistical and computational challenges in data analysis and result interpretation. A shortage of appropriate and powerful statistical methods for analyzing next-generation sequencing data has become a bottleneck for effectively using these rich resources to rapidly develop novel molecular cancer prevention and treatment strategies. The purpose of this proposal is to respond to this need. The proposed methods are motivated by and applied to the Harvard Lung Cancer and Breast Cancer exome and targeted sequencing association studies, in which the investigators play a major leadership role.
The specific aims are: (1) to develop a unified, powerful, and robust statistical framework to test the association between rare variants and diseases and traits in sequencing association studies; (2) to develop penalized likelihood-based methods for risk prediction in population-based sequencing studies; (3) to use the causal inference framework for mediation analysis to estimate and test for the direct effects of rare genetic variants and their indirect effects mediated through environmental risk factors on disease risk in sequencing studies, and to account for measurement error in exposures; and (4) to develop efficient, user-friendly, open-access statistical software. This project integrates closely with Projects 1 and 2 through a common theme of analyzing large and complex observational study data, and it takes advantage of the expertise of Projects 1 and 2 in causal inference for mediation analysis and in modeling environmental exposures to study the interplay of genes and environment. It also relies heavily on the Statistical Computing Core and on the organizational infrastructure, team-building strategies, workshops, and visitor program provided through the Administrative Core.

Public Health Relevance

This project aims to develop statistical methods to advance cancer prevention and intervention strategies by using next-generation sequencing data to identify genetic variants associated with cancer, to build genetic risk prediction models for cancer, and to study the direct and indirect effects of genetic variants in the interplay of genes and environment in cancer risk and progression.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Program Projects (P01)
Study Section
Special Emphasis Panel (ZCA1-RPRB-2 (M1))
Harvard University
United States