Large-scale genomic, proteomic and other "omic" research has become increasingly important and common for discovering disease genes and "omic" biomarkers for cancer prevention and intervention, and for studying gene-environment interactions in population-based studies. Such high-dimensional "omic" data present fundamental statistical and computational challenges in data analysis and result interpretation. Few statistical methods have been developed for the analysis of high-dimensional "omic" data in population-based studies. This methodological shortage slows the use of genomic and proteomic data to advance population sciences. The purpose of this proposal is to respond to this need by developing advanced statistical methods, in conjunction with other advanced quantitative methods, for the analysis of high-dimensional genomic and proteomic data arising from population-based studies.
The specific aims are: (1) To develop regularized estimating equation-based variable selection methods for gene/biomarker discovery in the presence of a large number of SNPs or proteins and in studying gene-environment (space) interactions. The methods will be developed for (a) continuous and discrete cross-sectional/case-control data, (b) longitudinal, clustered and spatial data, and (c) independent, clustered, and spatial survival data; (2) To develop penalized likelihood-based methods for multiple testing for high-dimensional genomic and proteomic data subject to moderate/high correlation, such as microarrays and proteomic mass-spectrometry data, with the goal of providing higher statistical power and better false discovery rate (FDR) estimation; (3) To develop a suite of tools using contemporary advances in signal processing based on local Fourier analysis to effectively preprocess mass spectrometry (MS) proteomic data; (4) To develop supervised clustering methods for array CGH (aCGH) data to identify aCGH profiles related to survival; (5) To develop efficient, user-friendly statistical software that implements these methods, with the goal of disseminating it freely to health science researchers. The proposed methods will be applied to data from the motivating Harvard/MGH lung cancer genetic susceptibility and progression studies, the Harvard/MGH lung cancer proteomic study, the DFCI lung cancer LBK mutation microarray study, the longitudinal HIV codon mutation study, and the Harvard/MGH brain tumor aCGH study. This project integrates closely with the spatial and surveillance projects 1 and 2 and the cores, as they share a common theme of analyzing high-dimensional observational study data, need advanced computing, and jointly provide tools for studying gene-space interactions.
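For orientation on the FDR goal in aim (2), the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. It is not the penalized likelihood-based method proposed here; it is the conventional baseline, which is usually justified under independent or positively dependent tests, and that limitation is one reason correlated microarray and mass-spectrometry data call for other approaches. The function name `benjamini_hochberg` and the example p-values are illustrative assumptions, not part of the proposal.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag hypotheses rejected by the Benjamini-Hochberg step-up rule.

    Illustrative baseline only (hypothetical helper, not the proposal's method).
    Returns a list of booleans aligned with the input p-values.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank

    # Step-up: reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values, purely for illustration.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, alpha=0.05))
```

Because the rule is step-up, a hypothesis can be rejected even when its own p-value exceeds its rank-specific threshold, provided some larger p-value meets its threshold.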

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Program Projects (P01)
Study Section
Special Emphasis Panel (ZCA1-RPRB-7)
Harvard University
United States