Large-scale genomic, proteomic and other "omic" research has become increasingly important and common for discovering disease genes and "omic" biomarkers for cancer prevention and intervention, and for studying gene-environment interactions in population-based studies. Such high-dimensional "omic" data present fundamental statistical and computational challenges in data analysis and result interpretation. Limited statistical developments have been made on analysis of high-dimensional "omic" data in population-based studies. Such a methodological shortage limits the speed of using genomic and proteomic data to effectively advance population sciences. The purpose of this proposal is to respond to this need by developing advanced statistical methods in conjunction with other advanced quantitative methods for analysis of high-dimensional genomic and proteomic data arising from population-based studies.
The specific aims are: (1) To develop regularized estimating equation-based variable selection methods for gene/biomarker discovery in the presence of a large number of SNPs or proteins and in studying gene-environment (space) interactions. The methods will be developed for (a) continuous and discrete cross-sectional/case-control data, (b) longitudinal, clustered and spatial data, and (c) independent, clustered, and spatial survival data; (2) To develop penalized likelihood-based methods for multiple testing for high-dimensional genomic and proteomic data subject to moderate/high correlation, such as microarrays and proteomic mass-spectrometry data, with the goal of providing higher statistical power and better false discovery rate (FDR) estimation; (3) To develop a suite of tools using contemporary advances in signal processing based on local Fourier analysis to effectively preprocess mass spectrometry (MS) proteomic data; (4) To develop supervised clustering methods for array CGH (aCGH) data to identify aCGH profiles related to survival; (5) To develop efficient, user-friendly statistical software that implements these methods, with the goal of disseminating them freely to health science researchers. The proposed methods will be applied to data from the motivating Harvard/MGH lung cancer genetic susceptibility and progression studies, the Harvard/MGH lung cancer proteomic study, the DFCI lung cancer LBK mutation microarray study, the longitudinal HIV codon mutation study, and the Harvard/MGH brain tumor aCGH study. This project integrates closely with the spatial and surveillance projects 1 and 2 and the cores, as they share a common theme of analyzing high-dimensional observational study data, need advanced computing, and jointly provide tools for studying gene-space interactions.