Large-scale genomic, proteomic, and other "omic" research has become increasingly important and common for discovering disease genes and "omic" biomarkers for cancer prevention and intervention, and for studying gene-environment interactions in population-based studies. Such high-dimensional "omic" data present fundamental statistical and computational challenges in data analysis and result interpretation. Few statistical methods have been developed for the analysis of high-dimensional "omic" data in population-based studies. This methodological shortage limits how quickly genomic and proteomic data can be used to advance population sciences. The purpose of this proposal is to respond to this need by developing advanced statistical methods, in conjunction with other advanced quantitative methods, for the analysis of high-dimensional genomic and proteomic data arising from population-based studies.
The specific aims are: (1) To develop regularized estimating equation-based variable selection methods for gene/biomarker discovery in the presence of a large number of SNPs or proteins and for studying gene-environment (space) interactions. The methods will be developed for (a) continuous and discrete cross-sectional/case-control data, (b) longitudinal, clustered, and spatial data, and (c) independent, clustered, and spatial survival data; (2) To develop penalized likelihood-based methods for multiple testing of high-dimensional genomic and proteomic data subject to moderate/high correlation, such as microarray and proteomic mass-spectrometry data, with the goal of providing higher statistical power and better false discovery rate (FDR) estimation; (3) To develop a suite of tools that use contemporary advances in signal processing based on local Fourier analysis to effectively preprocess mass spectrometry (MS) proteomic data; (4) To develop supervised clustering methods for array CGH (aCGH) data to identify aCGH profiles related to survival; (5) To develop efficient, user-friendly statistical software that implements these methods, with the goal of disseminating it freely to health science researchers. The proposed methods will be applied to data from the motivating Harvard/MGH lung cancer genetic susceptibility and progression studies, the Harvard/MGH lung cancer proteomic study, the DFCI lung cancer LBK mutation microarray study, the longitudinal HIV codon mutation study, and the Harvard/MGH brain tumor aCGH study. This project integrates closely with the spatial and surveillance projects 1 and 2 and the cores: they share a common theme of analyzing high-dimensional observational study data, need advanced computing, and jointly provide tools for studying gene-space interactions.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Program Projects (P01)
Study Section
Special Emphasis Panel (ZCA1)
Harvard University
United States
Bobb, Jennifer F; Claus Henn, Birgit; Valeri, Linda et al. (2018) Statistical software for analyzing the health effects of multiple concurrent exposures via Bayesian kernel machine regression. Environ Health 17:67
Chen, Han; Cade, Brian E; Gleason, Kevin J et al. (2018) Multiethnic Meta-Analysis Identifies RAI1 as a Possible Obstructive Sleep Apnea-related Quantitative Trait Locus in Men. Am J Respir Cell Mol Biol 58:391-401
Pierce, Brandon L; Kraft, Peter; Zhang, Chenan (2018) Mendelian randomization studies of cancer risk: a literature review. Curr Epidemiol Rep 5:184-196
Barfield, Richard; Feng, Helian; Gusev, Alexander et al. (2018) Transcriptome-wide association studies accounting for colocalization using Egger regression. Genet Epidemiol 42:418-433
Liu, Zhonghua; Lin, Xihong (2018) Multiple phenotype association tests using summary statistics in genome-wide association studies. Biometrics 74:165-175
Emilsson, Louise; García-Albéniz, Xabier; Logan, Roger W et al. (2018) Examining Bias in Studies of Statin Treatment and Survival in Patients With Cancer. JAMA Oncol 4:63-70
Sun, Ryan; Carroll, Raymond J; Christiani, David C et al. (2018) Testing for gene-environment interaction under exposure misspecification. Biometrics 74:653-662
Antonelli, Joseph; Cefalu, Matthew; Palmer, Nathan et al. (2018) Doubly robust matching estimators for high dimensional confounding adjustment. Biometrics :
Wilson, Ander; Zigler, Corwin M; Patel, Chirag J et al. (2018) Model-averaged confounder adjustment for estimating multivariate exposure effects with linear regression. Biometrics 74:1034-1044
Chipman, J; Braun, D (2017) Simpson's paradox in the integrated discrimination improvement. Stat Med 36:4468-4481

Showing the most recent 10 out of 192 publications