Statistical inference in genome-wide association and sequencing studies

Dai, James

Abstract

Although genome-wide association studies have identified hundreds of loci associated with common, complex diseases, significant challenges remain for statistical inference in these high-dimensional data. Specifically, rare variants generated by emerging genome-wide sequencing studies may explain the missing heritability, but they pose a challenge to the traditional locus-by-locus approach. Studies of gene-environment interactions have not generated many successes, possibly because of limitations in existing analytical methods. Mediation of genetic effects by intermediate outcomes is an emerging topic of interest that may lead to disease prevention or treatment. Existing statistical methods for inferring mediation effects, however, remain underdeveloped. In this proposal, we plan to build novel statistical methods to address these challenges. The methodological research is motivated by, but not limited to, the genome-wide association studies and the sequencing project in the Women's Health Initiative (WHI), including the Genomics and Randomized Trials Network (GARNET), Population Architecture of Genes and Environment (PAGE) and the Exome Sequencing Project (ESP). A distinguishing feature of this proposal is that the PI and co-investigators are themselves conducting these studies, so the proposed methodological innovations will be applied immediately to address scientific questions of interest.
A number of statistical methods for rare variant analysis have been proposed recently. None of the existing methods accounts for the presence of neutral variants, i.e., alleles that have no functional influence on the trait. Including neutral variants in gene-set tests dilutes power. We propose a class of finite mixture models that explicitly tease out neutral variants to improve power.
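The dilution of power by neutral variants can be illustrated with a small simulation. The sketch below is not the finite mixture model proposed here; it only compares a simple burden-style score computed over all variants in a gene with the same score restricted to the causal variants. All names (`burden_z`, `simulate_gene`) and all parameter values (sample size, allele frequency, effect size) are assumptions chosen for illustration.

```python
import numpy as np


def burden_z(genotypes, trait):
    """Approximate z-score for a burden test: the sample correlation between
    the per-subject sum of minor alleles and the trait, times sqrt(n)."""
    burden = genotypes.sum(axis=1)
    r = np.corrcoef(burden, trait)[0, 1]
    return r * np.sqrt(len(trait))


def simulate_gene(n=5_000, n_variants=20, n_causal=5, maf=0.01,
                  effect=1.0, seed=1):
    """Simulate rare-variant genotypes (hypothetical settings) in which only
    the first n_causal variants influence a continuous trait."""
    rng = np.random.default_rng(seed)
    genotypes = rng.binomial(2, maf, size=(n, n_variants)).astype(float)
    causal_burden = genotypes[:, :n_causal].sum(axis=1)
    trait = effect * causal_burden + rng.normal(size=n)
    return genotypes, trait


genotypes, trait = simulate_gene()
z_all = burden_z(genotypes, trait)                 # causal + neutral variants
z_causal_only = burden_z(genotypes[:, :5], trait)  # causal variants only

print(f"z (all variants):    {z_all:.1f}")
print(f"z (causal variants): {z_causal_only:.1f}")
```

In practice the causal variants are unknown, which is why a model that infers which variants are neutral, rather than an oracle subset as used here, is needed.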
The main challenge in identifying gene-environment interactions is a lack of power, owing to limited sample sizes and the typically small magnitude of interactions. Dimension reduction, such as gene-set based inference, is critical to reduce the number of hypothesis tests and to enrich weak genetic effects. We will develop a suite of gene-set based, two-stage filtering procedures for detecting gene-environment interactions. We will also develop a multivariate sparse gene-set testing framework with an L1 penalty to assemble weak genetic effects in a gene or a pathway.
The difficulty in inferring mediation of genetic effects on diseases by intermediate outcomes lies in controlling for unknown confounders. Current approaches exploit Mendelian randomization, the random segregation of alleles, and use known genetic risk alleles as instrumental variables to infer causality. Limitations of the existing framework, chiefly its overly restrictive assumptions and its inability to model causal effects on binary outcomes, have impeded the applicability of such inference. We will revamp the instrumental variable framework, originally developed in econometrics, to better suit genetic studies.
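The instrumental variable idea behind Mendelian randomization can be sketched with the standard Wald ratio estimator, which divides the genotype-outcome covariance by the genotype-exposure covariance. The example below is a minimal illustration on simulated data with a continuous outcome, not the revised framework proposed here, and it does not address the binary-outcome limitation noted above. The function names and all simulation parameters are assumptions.

```python
import numpy as np


def wald_ratio(g, x, y):
    """Wald ratio IV estimate of the effect of exposure x on outcome y,
    using genotype g as the instrument: cov(g, y) / cov(g, x)."""
    return np.cov(g, y)[0, 1] / np.cov(g, x)[0, 1]


def ols_slope(x, y):
    """Naive regression slope of y on x, which ignores confounding."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def simulate(n=50_000, true_effect=0.5, seed=0):
    """Simulate a risk allele g, an unmeasured confounder u, an intermediate
    outcome x, and a disease-related trait y (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n).astype(float)  # allele count 0/1/2
    u = rng.normal(size=n)                          # unmeasured confounder
    x = 0.5 * g + u + rng.normal(size=n)            # g affects y only through x
    y = true_effect * x + u + rng.normal(size=n)
    return g, x, y


g, x, y = simulate()
naive_estimate = ols_slope(x, y)   # biased upward by the confounder u
iv_estimate = wald_ratio(g, x, y)  # consistent if g is a valid instrument

print(f"true effect:    0.50")
print(f"naive estimate: {naive_estimate:.2f}")
print(f"IV estimate:    {iv_estimate:.2f}")
```

The IV estimate is only trustworthy when the allele affects the outcome solely through the intermediate outcome and is independent of the confounder; these are the kinds of assumptions the text above describes as overly restrictive.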

Public Health Relevance

This proposal aims to develop novel statistical methods for the analysis of high-throughput genotyping and sequencing data, addressing three outstanding challenges in current genetic epidemiology: rare variants, gene-environment interactions, and mediation by intermediate outcomes. The proposed methods will help identify genetic predispositions and environmental exposures, informing the prevention and treatment of common diseases.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Heart, Lung, and Blood Institute (NHLBI)
Type: Research Project (R01)
Project #: 5R01HL114901-04
Application #: 8867044
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Wolz, Michael

Project Start: 2012-07-15
Project End: 2017-06-30
Budget Start: 2015-07-01
Budget End: 2017-06-30
Support Year: 4
Fiscal Year: 2015

Institution

Name: Fred Hutchinson Cancer Research Center
DUNS #: 078200995

City: Seattle
State: WA
Country: United States
Zip Code: 98109

Related projects


NIH 2015 R01 HL	Statistical inference in genome-wide association and sequencing studies Dai, James / Fred Hutchinson Cancer Research Center
NIH 2014 R01 HL	Statistical inference in genome-wide association and sequencing studies Dai, James / Fred Hutchinson Cancer Research Center
NIH 2013 R01 HL	Statistical inference in genome-wide association and sequencing studies Dai, James / Fred Hutchinson Cancer Research Center	$408,856
NIH 2012 R01 HL	Statistical inference in genome-wide association and sequencing studies Dai, James / Fred Hutchinson Cancer Research Center	$430,205

Publications

Dai, James Y; Wang, Xiaoyu; Buas, Matthew F et al. (2018) Whole-genome sequencing of esophageal adenocarcinoma in Chinese patients reveals distinct mutational signatures and genomic alterations. Commun Biol 1:174

Dai, James Y; Peters, Ulrike; Wang, Xiaoyu et al. (2018) Diagnostics for Pleiotropy in Mendelian Randomization Studies: Global and Individual Tests for Direct Effects. Am J Epidemiol 187:2672-2680

Dai, James Y; Liang, C Jason; LeBlanc, Michael et al. (2018) Case-only approach to identifying markers predicting treatment effects on the relative risk scale. Biometrics 74:753-763

Cheng, Yichen; Dai, James Y; Paulson, Thomas G et al. (2017) Quantification of Multiple Tumor Clones Using Gene Array and Sequencing Data. Ann Appl Stat 11:967-991

Contino, Gianmarco; Vaughan, Thomas L; Whiteman, David et al. (2017) The Evolving Genomic Landscape of Barrett's Esophagus and Esophageal Adenocarcinoma. Gastroenterology 153:657-673.e1

Pashova, Hristina; LeBlanc, Michael; Kooperberg, Charles (2017) Structured detection of interactions with the directed lasso. Stat Biosci 9:676-691

Dai, James Y; Tapsoba, Jean de Dieu; Buas, Matthew F et al. (2016) Constrained Score Statistics Identify Genetic Variants Interacting with Multiple Risk Factors in Barrett's Esophagus. Am J Hum Genet 99:352-65

Cheng, Yichen; Dai, James Y; Kooperberg, Charles (2016) Group association test using a hidden Markov model. Biostatistics 17:221-34

Dai, James Y; Zhang, Xinyi Cindy; Wang, Ching-Yun et al. (2016) Augmented case-only designs for randomized clinical trials with failure time endpoints. Biometrics 72:30-8

Wang, Xiaoyu; Dai, James Y (2016) TwoPhaseInd: an R package for estimating gene-treatment interactions and discovering predictive markers in randomized clinical trials. Bioinformatics 32:3348-3350

Showing the most recent 10 out of 24 publications