The major focus of this project is the development of methodologies for high-dimensional data that arise from emerging high-throughput genomic technologies. The types of data that we focus on are single nucleotide polymorphism (SNP) data from genome-wide association studies (GWAS) and whole exome sequencing data, though many of the methods developed here can be readily applied to other types of high-dimensional data. One feature of these data is that the number of predictors (genes or SNPs) p is typically much larger than the number of observations n. The key to handling these high-dimensional data is to reduce the dimensionality effectively. There are several challenges in reducing the dimensionality. First, many variants contribute to complex diseases. GWAS target common variants that typically have only modest effects, whereas the variants with larger effects found in sequencing studies are rarer. The consequence is that the variants associated with the trait do not stand out, because of stochastic variation as well as the number of variants under study. Second, many of these variants act in combination with environmental factors and other variants. This poses even more challenges, as the number of potential gene-environment and gene-gene interactions is much greater than the number of marginal analyses. Third, to elucidate complex disease risk, a comprehensive approach that considers many genetic variants, environmental factors, and their interactions is needed. Developing methods that deal with large numbers of variants and environmental factors is the focus of this project.
Using adaptive function estimation techniques, which have been developed for many large nonparametric regression problems, we will develop a suite of statistical and computational techniques for identifying environmental factors that modify genetic effects, for predicting disease risk from many thousands of SNPs, and for identifying significant predictors in exome sequencing studies. In adaptive function estimation, an unknown function is modeled as a combination of many basis functions. Model selection techniques, such as the lasso and boosting, have been developed for selecting which combination of basis functions is best at predicting a (disease) outcome. These approaches are well suited to the problems studied in this project. The investigators on this project are directly involved in a number of genetic association studies as (principal) investigators.
The specific aims that we propose are in response to actual analysis problems facing these projects. This direct relation to projects ensures the relevance of the methods we intend to develop.
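As a rough illustration of the adaptive function estimation idea described above, the following minimal sketch treats SNP main effects, one environmental exposure, and SNP-by-environment products as basis functions, and uses the lasso to select among them when the number of basis functions exceeds the number of observations. Everything in it is an assumption made for illustration: the data are simulated, the trait is quantitative rather than a disease outcome, the effect sizes and causal indices are arbitrary, and the helper names (simulate_design, standardize, lasso_cd) are hypothetical. The solver is plain cyclic coordinate descent, a standard way to fit the lasso, and is not claimed to be the method the investigators will develop.

```python
import numpy as np


def simulate_design(n=200, n_snps=300, seed=0):
    """Simulate genotypes, one binary exposure, and a quantitative trait.

    All effect sizes and causal indices are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.1, 0.5, size=n_snps)                  # minor allele frequencies
    G = rng.binomial(2, maf, size=(n, n_snps)).astype(float)  # 0/1/2 allele counts
    E = rng.binomial(1, 0.5, size=n).astype(float)            # binary environmental factor
    # Basis functions: SNP main effects, the exposure, and SNP x E products.
    X = np.hstack([G, E[:, None], G * E[:, None]])
    names = ([f"snp{j}" for j in range(n_snps)] + ["env"]
             + [f"snp{j}:env" for j in range(n_snps)])
    # Trait driven by one main effect and one interaction, plus noise.
    y = 0.8 * G[:, 10] + 1.0 * G[:, 20] * E + rng.normal(size=n)
    return X, y, names


def standardize(X, y):
    """Center y; center and scale each column of X to unit variance."""
    Xc = X - X.mean(axis=0)
    sd = Xc.std(axis=0)
    sd[sd == 0] = 1.0  # constant columns stay at zero
    return Xc / sd, y - y.mean()


def soft_threshold(z, lam):
    """Shrink z toward zero by lam, setting small values exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # equals y - X @ beta while beta is zero
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = X[:, j] @ resid / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                beta[j] = new
    return beta


if __name__ == "__main__":
    X, y, names = simulate_design()
    Xs, ys = standardize(X, y)
    # Smallest penalty at which every coefficient is zero.
    lam_max = np.max(np.abs(Xs.T @ ys)) / len(ys)
    beta = lasso_cd(Xs, ys, lam=0.5 * lam_max)
    selected = [names[j] for j in np.flatnonzero(beta)]
    print(f"{len(selected)} of {len(names)} basis functions selected:", selected)
```

The penalty of half of lam_max is an arbitrary choice for the demonstration. In practice the penalty would typically be chosen by cross-validation, and a binary disease outcome would call for a logistic rather than squared-error loss.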
The major focus of this proposal is the development of analytical approaches for high-dimensional data that arise from genome-wide association studies and whole exome sequencing studies. In particular, we propose to develop adaptive methods to construct predictive models and to identify gene-environment interactions in GWAS, and to improve power for association studies in whole exome sequencing studies.
Di, Chongzhi; Crainiceanu, Ciprian M; Jank, Wolfgang S (2014) Multilevel sparse functional principal component analysis. Stat 3:126-143
Logsdon, Benjamin A; Dai, James Y; Auer, Paul L et al. (2014) A variational Bayes discrete mixture test for rare variant association. Genet Epidemiol 38:21-30
Tapsoba, Jean de Dieu; Kooperberg, Charles; Reiner, Alexander et al. (2014) Robust estimation for secondary trait association in case-control genetic studies. Am J Epidemiol 179:1264-72
Carty, Cara L; Bhattacharjee, Samsiddhi; Haessler, Jeff et al. (2014) Analysis of metabolic syndrome components in >15 000 african americans identifies pleiotropic variants: results from the population architecture using genomics and epidemiology study. Circ Cardiovasc Genet 7:505-13
Saegusa, Takumi; Di, Chongzhi; Chen, Ying Qing (2014) Hypothesis testing for an extended cox model with time-varying coefficients. Biometrics 70:619-28
Pashova, H; LeBlanc, M; Kooperberg, C (2013) Boosting for detection of gene-environment interactions. Stat Med 32:255-66