The major focus of this project is the development of methodologies for high-dimensional data that arise from emerging high-throughput genomic technologies. The types of data that we focus on are single nucleotide polymorphism (SNP) data from genome-wide association studies (GWAS) and whole-exome sequencing data, though many methods developed here can be readily applied to other types of high-dimensional data. One feature of these data is that the number of predictors (genes or SNPs) p is typically much larger than the number of observations n. The key to handling these high-dimensional data is to reduce the dimensionality effectively. There are several challenges in doing so. First, many variants contribute to complex diseases. GWAS target common variants that typically have only modest effects, whereas the variants in sequencing studies that have larger effects are rarer. As a consequence, the variants that are associated with the trait do not stand out, owing to both stochastic variation and the sheer number of variants under study. Second, many of these variants act in combination with environmental factors and other variants. This poses even more challenges, as the number of potential gene-environment and gene-gene interactions is much greater than the number of marginal analyses. Third, to elucidate complex disease risk, a comprehensive approach that considers many genetic variants, environmental factors, and their interactions is needed. Developing methods that deal with large numbers of variants and environmental factors is the focus of this project.
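To make the scale of the interaction problem concrete, the counts can be written out as a rough illustration. The symbol q (the number of environmental factors) and the values p = 500,000 and q = 10 are assumptions chosen only for this example; they are not figures from the project.

```latex
\begin{align*}
\text{marginal analyses:} \quad & p = 5 \times 10^{5} \\
\text{gene-environment interactions:} \quad & p\,q = 5 \times 10^{5} \times 10 = 5 \times 10^{6} \\
\text{pairwise gene-gene interactions:} \quad & \binom{p}{2} = \frac{p(p-1)}{2} \approx 1.25 \times 10^{11}
\end{align*}
```

Under these assumed values, the pairwise gene-gene search space alone is several hundred thousand times larger than the marginal one.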
Using adaptive function estimation techniques, which have been developed for many large nonparametric regression problems, we will develop a suite of statistical and computational techniques for identifying environmental factors that modify genetic effects, for predicting disease risk from many thousands of SNPs, and for identifying significant predictors in exome sequencing studies. In adaptive function estimation, an unknown function is modeled as a combination of many basis functions. Model selection techniques, such as the lasso and boosting, have been developed for selecting the combination of basis functions that best predicts a (disease) outcome. These approaches are well suited to the problems studied in this project. The investigators on this project are directly involved in a number of genetic association studies as (principal) investigators.
The specific aims that we propose are in response to actual analysis problems facing these projects. This direct relation to projects ensures the relevance of the methods we intend to develop.
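As a rough illustration of the idea that an unknown function is modeled as a combination of many basis functions, with the lasso selecting among them, the sketch below fits a lasso by cyclic coordinate descent to simulated data with p much larger than n. It is not the investigators' method. The function names, the simulation settings (60 subjects, 200 SNPs, three causal SNPs, minor allele frequency 0.3), and the choice of treating each SNP's standardized allele count as one basis function are all assumptions made for this example.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardize(col):
    """Center a column and scale it to unit (population) variance."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    if sd == 0.0:
        return [0.0] * n  # monomorphic SNP carries no information
    return [(v - mean) / sd for v in col]


def correlation_with(x, r):
    """Return (1/n) * <x, r>, the scaled inner product of two vectors."""
    return sum(xi * ri for xi, ri in zip(x, r)) / len(r)


def lasso_cd(columns, y, lam, n_sweeps=200, tol=1e-8):
    """Minimize (1/2n)||y - sum_j b_j x_j||^2 + lam * sum_j |b_j|
    by cyclic coordinate descent; `columns` must be standardized."""
    beta = [0.0] * len(columns)
    resid = list(y)
    for _ in range(n_sweeps):
        max_change = 0.0
        for j, x in enumerate(columns):
            # Inner product with the partial residual (x_j's own fit added back).
            rho = correlation_with(x, resid) + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [ri - delta * xi for xi, ri in zip(x, resid)]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def simulate(n=60, p=200, causal=(3, 50, 120), effect=1.0, maf=0.3, seed=1):
    """Hypothetical data: additive 0/1/2 genotypes, a few causal SNPs."""
    rng = random.Random(seed)
    geno = [[int(rng.random() < maf) + int(rng.random() < maf)
             for _ in range(n)] for _ in range(p)]
    columns = [standardize(g) for g in geno]
    y = [sum(effect * columns[j][i] for j in causal) + rng.gauss(0.0, 1.0)
         for i in range(n)]
    mean_y = sum(y) / n
    return columns, [v - mean_y for v in y]


columns, y = simulate()
# Smallest penalty at which every coefficient is exactly zero.
lam_max = max(abs(correlation_with(x, y)) for x in columns)
beta = lasso_cd(columns, y, lam=0.5 * lam_max)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("number of SNPs:", len(columns), "selected:", selected)
```

The penalty here is set to half of `lam_max` purely for demonstration; in practice it would usually be chosen by cross-validation or a similar criterion.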

Public Health Relevance

The major focus of this proposal is the development of analytical approaches for high-dimensional data that arise from genome-wide association studies and whole exome sequencing studies. In particular, we propose to develop adaptive methods to construct predictive models and to identify gene-environment interactions in GWAS, and to improve power for association studies in whole exome sequencing studies.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Ramos, Erin
Fred Hutchinson Cancer Research Center
United States
Cheng, Yichen; Dai, James Y; Paulson, Thomas G et al. (2017) Quantification of Multiple Tumor Clones Using Gene Array and Sequencing Data. Ann Appl Stat 11:967-991
Pashova, Hristina; LeBlanc, Michael; Kooperberg, Charles (2017) Structured detection of interactions with the directed lasso. Stat Biosci 9:676-691
Su, Yu-Ru; Di, Chong-Zhi; Hsu, Li (2017) Hypothesis testing in functional linear models. Biometrics 73:551-561
Cheng, Yichen; Dai, James Y; Kooperberg, Charles (2016) Group association test using a hidden Markov model. Biostatistics 17:221-34
Wang, Xiaoyu; Li, Xiaohong; Cheng, Yichen et al. (2015) Copy number alterations detected by whole-exome and whole-genome sequencing of esophageal adenocarcinoma. Hum Genomics 9:22
Coram, Marc A; Candille, Sophie I; Duan, Qing et al. (2015) Leveraging Multi-ethnic Evidence for Mapping Complex Traits in Minority Populations: An Empirical Bayes Approach. Am J Hum Genet 96:740-52
Logsdon, Benjamin A; Dai, James Y; Auer, Paul L et al. (2014) A variational Bayes discrete mixture test for rare variant association. Genet Epidemiol 38:21-30
Carty, Cara L; Bhattacharjee, Samsiddhi; Haessler, Jeff et al. (2014) Analysis of metabolic syndrome components in >15 000 African Americans identifies pleiotropic variants: results from the Population Architecture using Genomics and Epidemiology study. Circ Cardiovasc Genet 7:505-13
Di, Chongzhi; Crainiceanu, Ciprian M; Jank, Wolfgang S (2014) Multilevel sparse functional principal component analysis. Stat 3:126-143
Tapsoba, Jean de Dieu; Kooperberg, Charles; Reiner, Alexander et al. (2014) Robust estimation for secondary trait association in case-control genetic studies. Am J Epidemiol 179:1264-72

Showing the most recent 10 out of 18 publications