Although genome-wide association studies have identified hundreds of loci associated with common and complex diseases, significant challenges remain for statistical inference in these high-dimensional data. Specifically, rare variants identified by emerging genome-wide sequencing studies may explain the "missing heritability" but pose a challenge to the traditional locus-by-locus approach. Studies of gene-environment interactions have produced few successes, possibly because of limitations of existing analytical methods. Mediation of genetic effects by intermediate outcomes is an emerging topic of interest that may lead to disease prevention or treatment, yet statistical methods for inferring mediation effects remain underdeveloped. In this proposal, we plan to build novel statistical methods to address these challenges.
The methodological research is motivated by, but not limited to, the genome-wide association studies and the sequencing project in the Women's Health Initiative (WHI), including the "Genomics and Randomized Trials Network" (GARNET), "Population Architecture of Genes and Environment" (PAGE) and the "Exome Sequencing Project" (ESP). A distinguishing feature of this proposal is that the PI and co-investigators are themselves conducting these studies, so the proposed methodological innovations will be applied immediately to scientific questions of interest.
A number of statistical methods for rare variant analysis have been proposed recently. None of the existing methods accounts for the presence of neutral variants, i.e., alleles that have no functional influence on the trait. Including neutral variants in gene-set tests dilutes power. We propose a class of finite mixture models that explicitly separates out neutral variants to improve power.
The main challenge in identifying gene-environment interactions is a lack of power, owing to limited sample sizes and the typically small magnitude of interactions. Dimension reduction, such as gene-set based inference, is critical to reduce the number of hypothesis tests and to aggregate weak genetic effects. We will develop a suite of gene-set based, two-stage filtering procedures for detecting gene-environment interactions. We will also develop a multivariate sparse gene-set testing framework with an L1 penalty to assemble weak genetic effects in a gene or a pathway.
The difficulty in inferring mediation of genetic effects on diseases by intermediate outcomes lies in controlling for unknown confounders. Current approaches exploit "Mendelian Randomization", the random segregation of alleles, and use known genetic risk alleles as instrumental variables to infer causality. Limitations of the existing framework, mainly overly restrictive assumptions and an inability to model causal effects on binary outcomes, have impeded the applicability of such inference. We will revamp the instrumental variable framework, originally developed in econometrics, to better suit genetic studies.
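As a rough illustration of the instrumental variable idea behind Mendelian randomization described above, the following sketch simulates a genotype, an intermediate outcome, and a continuous trait that share an unmeasured confounder, then compares a naive regression slope with the simple Wald ratio estimate. It is not the method proposed here: the function names, effect sizes, allele frequency, and sample size are all hypothetical choices made for illustration, and the sketch covers only a continuous outcome with a single valid instrument.

```python
import numpy as np


def simulate(n=50_000, causal_effect=0.5, seed=0):
    """Simulate one genetic instrument, an intermediate outcome, and a trait.

    All numeric values here are illustrative assumptions, not estimates
    from any real study.
    """
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n).astype(float)  # risk-allele count (0, 1, 2)
    u = rng.normal(size=n)                          # unmeasured confounder
    m = 0.4 * g + u + rng.normal(size=n)            # intermediate outcome
    y = causal_effect * m + u + rng.normal(size=n)  # continuous trait
    return g, m, y


def slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def wald_ratio(g, m, y):
    """IV estimate: (effect of G on Y) divided by (effect of G on M)."""
    return slope(g, y) / slope(g, m)


g, m, y = simulate()
naive_estimate = slope(m, y)       # biased upward by the shared confounder u
iv_estimate = wald_ratio(g, m, y)  # should be close to the simulated effect of 0.5
print(f"naive: {naive_estimate:.2f}, IV: {iv_estimate:.2f}")
```

Because the simulated allele count is independent of the confounder, the ratio recovers the simulated causal effect while the naive slope does not. The sketch relies on the standard instrument assumptions (the allele affects the trait only through the intermediate outcome), which are among the restrictive assumptions the text refers to.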

Public Health Relevance

This proposal aims to develop novel statistical methods for the analysis of high-throughput genotyping and sequencing data, addressing three outstanding challenges in current genetic epidemiology: rare variants, gene-environment interactions, and mediation by intermediate outcomes. The proposed methods will help identify genetic predispositions and environmental exposures, knowledge that can inform the prevention and treatment of common diseases.

National Institutes of Health (NIH)
National Heart, Lung, and Blood Institute (NHLBI)
Research Project (R01)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Burwen, Dale R
Fred Hutchinson Cancer Research Center
United States
Dai, James Y; Zhang, Xinyi Cindy; Wang, Ching-Yun et al. (2016) Augmented case-only designs for randomized clinical trials with failure time endpoints. Biometrics 72:30-8
Cheng, Yichen; Dai, James Y; Kooperberg, Charles (2016) Group association test using a hidden Markov model. Biostatistics 17:221-34
Dai, James Y; Tapsoba, Jean de Dieu; Buas, Matthew F et al. (2016) Constrained Score Statistics Identify Genetic Variants Interacting with Multiple Risk Factors in Barrett's Esophagus. Am J Hum Genet 99:352-65
Dai, James Y; Tapsoba, Jean de Dieu; Buas, Matthew F et al. (2015) A newly identified susceptibility locus near FOXP1 modifies the association of gastroesophageal reflux with Barrett's esophagus. Cancer Epidemiol Biomarkers Prev 24:1739-47
Dai, James Y; Zhang, Xinyi Cindy (2015) Mendelian randomization studies for a continuous exposure under case-control sampling. Am J Epidemiol 181:440-9
Wang, Xiaoyu; Li, Xiaohong; Cheng, Yichen et al. (2015) Copy number alterations detected by whole-exome and whole-genome sequencing of esophageal adenocarcinoma. Hum Genomics 9:22
Logsdon, Benjamin A; Dai, James Y; Auer, Paul L et al. (2014) A variational Bayes discrete mixture test for rare variant association. Genet Epidemiol 38:21-30
Tapsoba, Jean de Dieu; Kooperberg, Charles; Reiner, Alexander et al. (2014) Robust estimation for secondary trait association in case-control genetic studies. Am J Epidemiol 179:1264-72
Li, Shuying S; Gilbert, Peter B; Tomaras, Georgia D et al. (2014) FCGR2C polymorphisms associate with HIV-1 vaccine protection in RV144 trial. J Clin Invest 124:3879-90
Dai, James Y; Chan, Kwun Chuen Gary; Hsu, Li (2014) Testing concordance of instrumental variable effects in generalized linear models with application to Mendelian randomization. Stat Med 33:3986-4007

Showing the most recent 10 out of 17 publications