Cardiovascular diseases (CVD) affect millions of people in the US and across the world. There is strong evidence of a genetic component in CVD and related traits. An emerging consensus is that genes, environment and, perhaps more importantly, their interactions are all responsible for this complex disease. As a result, many genetic epidemiological (GE) studies of CVD use a design that tests hundreds of thousands of genetic predictors (e.g., single nucleotide polymorphism (SNP) markers) alongside hundreds of (related) disease phenotypes and environmental covariates. This has brought tremendous analytical challenges, particularly the high dimensionality of the data and the obscure interactions among the many variables. Searching for CVD genes has therefore become a task of selecting important variables from a vast number of SNPs and other predictor variables.

Our analyses of real data in several ongoing large-scale CVD-related studies motivated us to consider new methodological solutions to the variable selection problem, and this application builds on those positive preliminary findings. Our main idea is to develop a strategy for selecting important predictors of CVD by integrating multiple sources of information via statistical learning (i.e., optimizing the selection by repeated learning from examples). In this strategy, we will first develop a method for selecting significant SNPs in moderate-dimensional data (e.g., low thousands of SNPs, as in candidate gene studies) using an integrated classifier. The method will build upon existing techniques that assess the information carried by SNPs in terms of haplotype similarity, imputed functional potential, and gene-gene interactions. We will then scale up the new method to the high-dimensional setting of genome-wide association studies (e.g., at least hundreds of thousands of SNPs) through dimension reduction that exploits the local linkage-disequilibrium (LD) structure among SNPs, and by combining latent factor analysis of correlated CVD traits with pathway-based analysis to account for gene-environment (GxE) interactions. A fast-search algorithm will also be developed, based on an existing search heuristic that has been successfully applied to high-dimensional gene expression and genomic sequence data.

The new methods and algorithms will be coded into R programs and distributed as a tool set for an association analysis pipeline. The new methods will be evaluated through intensive simulation studies and by application to existing datasets from ongoing studies of CVD and related diseases. Results from the evaluation studies, together with the ancillary databases generated by the study, such as imputed functional scores of potential or known CVD SNPs, will be distributed on a dedicated project website. By doing so, we believe that the utilities resulting from the proposed research will make a significant contribution to many ongoing genetic epidemiological studies of CVD and related traits.
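To make the idea of dimension reduction based on local LD structure concrete, the following is a minimal illustrative sketch, not the method proposed in this application. It assumes genotypes coded as 0/1/2 allele counts, measures LD by the squared Pearson correlation between SNPs, and applies a generic greedy pruning rule within a sliding window. The function names, the toy data, and the `window` and `threshold` values are all hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Monomorphic SNP: correlation is undefined, so treat it as uncorrelated.
        return 0.0
    return cov * cov / (var_x * var_y)


def ld_prune(genotypes, window=50, threshold=0.8):
    """Greedy LD pruning (assumed rule, for illustration only).

    genotypes: list of SNPs in chromosomal order, each a list of allele counts
    per individual. A SNP is dropped if its r^2 with an already kept SNP that
    lies at most `window` positions upstream is >= `threshold`.
    Returns the indices of the SNPs that are kept.
    """
    kept = []
    for j, snp in enumerate(genotypes):
        redundant = any(
            j - k <= window and r_squared(snp, genotypes[k]) >= threshold
            for k in kept
        )
        if not redundant:
            kept.append(j)
    return kept


# Toy data (hypothetical): 4 SNPs genotyped in 6 individuals.
genotypes = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],  # identical to SNP 0, i.e. in perfect LD with it
    [2, 0, 1, 0, 2, 1],
    [1, 1, 0, 2, 0, 1],
]
kept = ld_prune(genotypes, window=3, threshold=0.8)
print(kept)
```

In this toy example the second SNP duplicates the first and is pruned, while the remaining SNPs are only weakly correlated with the kept ones and are retained. A real analysis would work with far larger windows and with genotype data that may contain missing calls, which this sketch does not handle.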

Public Health Relevance

This project aims at the timely development of computational tools for emerging large-scale genome-wide association studies of cardiovascular diseases (CVD), which affect millions of people in the US and across the world. The new methods address the analytical challenges posed by the high dimensionality of the data and the obscure interactions among the many variables in these studies, and the tools will be applied to ongoing studies of CVD and related diseases. The results, together with the computer programs and ancillary databases, will make a significant contribution to many ongoing and new genetic epidemiological studies of CVD and related diseases.

Agency
National Institutes of Health (NIH)
Institute
National Heart, Lung, and Blood Institute (NHLBI)
Type
Research Project (R01)
Project #
3R01HL091028-01A1S1
Application #
7845764
Study Section
Cardiovascular and Sleep Epidemiology (CASE)
Program Officer
Wolz, Michael
Project Start
2009-08-01
Project End
2011-07-31
Budget Start
2009-08-01
Budget End
2011-07-31
Support Year
1
Fiscal Year
2009
Total Cost
$229,243
Indirect Cost
Name
Washington University
Department
Biostatistics & Other Math Sci
Type
Schools of Medicine
DUNS #
068552207
City
Saint Louis
State
MO
Country
United States
Zip Code
63130
Barve, Ruteja A; Gu, C Charles; Yang, Wei et al. (2016) Genetic association of left ventricular mass assessed by M-mode and two-dimensional echocardiography. J Hypertens 34:88-96
Climer, Sharlee; Yang, Wei; de las Fuentes, Lisa et al. (2014) A custom correlation coefficient (CCC) approach for fast identification of multi-SNP association patterns in genome-wide SNPs data. Genet Epidemiol 38:610-21
Yang, Wei; Gu, C Charles (2014) Random forest fishing: a novel approach to identifying organic group of risk factors in genome-wide association studies. Eur J Hum Genet 22:254-9
Yang, Wei; Gu, C Charles (2013) A whole-genome simulator capable of modeling high-order epistasis for complex disease. Genet Epidemiol 37:686-94
de las Fuentes, Lisa; Yang, Wei; Dávila-Román, Victor G et al. (2012) Pathway-based genome-wide association analysis of coronary heart disease identifies biologically important gene sets. Eur J Hum Genet 20:1168-73
Yang, Wei; de las Fuentes, Lisa; Dávila-Román, Victor G et al. (2011) Variable set enrichment analysis in genome-wide association studies. Eur J Hum Genet 19:893-900
Ray, Monika; Ruan, Jianhua; Zhang, Weixiong (2008) Variations in the transcriptome of Alzheimer's disease reveal molecular networks involved in cardiovascular diseases. Genome Biol 9:R148