Cardiovascular diseases (CVD) affect millions of people in the United States and across the world. There is strong evidence of a genetic component in CVD and related traits. An emerging consensus is that genes, environment, and, perhaps more importantly, their interactions are jointly responsible for these complex diseases. As a result, many genetic epidemiological (GE) studies of CVD use a design that tests hundreds of thousands of genetic predictors (e.g., single-nucleotide polymorphism (SNP) markers) together with hundreds of related disease phenotypes and environmental covariates. This has brought tremendous analytical challenges, particularly the high dimensionality of the data and the obscure interactions among the many variables. Searching for CVD genes has consequently become a task of selecting important variables from a vast number of SNPs and other predictor variables. Our analyses of real data in several ongoing large-scale CVD-related studies motivated us to consider new methodological solutions to this variable selection problem, and this application builds on the positive preliminary findings from those analyses. Our main idea is to develop a strategy for selecting important predictors of CVD by integrating multiple sources of information through statistical learning (i.e., optimizing the selection by repeated learning from examples). In this strategy, we will first develop a method for selecting significant SNPs in moderate-dimensional data (e.g., a few thousand SNPs, as in candidate-gene studies) using an integrated classifier. The method will build upon existing techniques that assess SNP information on haplotype similarity, imputed functional potential, and gene-gene interactions.
We will then scale up the new method to the high-dimensional setting of genome-wide association studies (e.g., at least hundreds of thousands of SNPs) by dimension reduction that exploits the local linkage-disequilibrium (LD) structure among SNPs, and by combining latent factor analysis of correlated CVD traits with pathway-based analysis to account for gene-environment (GxE) interactions. A fast-search algorithm will also be developed, based on an existing search heuristic that has been applied successfully to high-dimensional gene expression and genomic sequence data. The new methods and algorithms will be implemented as R programs and distributed as a tool set for an association analysis pipeline. The new methods will be evaluated through intensive simulation studies and through application to existing datasets from ongoing studies of CVD and related diseases. Results from the evaluation studies, together with the ancillary databases generated by the study, such as imputed functional scores of potential or known CVD SNPs, will be distributed on a dedicated project website. By doing so, we believe that the tools resulting from the proposed research will make a significant contribution to many ongoing genetic epidemiological studies of CVD and related traits.
This project aims at the timely development of computational tools for emerging large-scale genome-wide association studies of cardiovascular diseases (CVD), which affect millions of people in the United States and across the world. The new methods address the analytical challenges posed by the high dimensionality of the data and the obscure interactions among the many variables in these studies, and the tools will be applied to ongoing studies of CVD and related diseases. The results, together with the computer programs and ancillary databases, will make a significant contribution to many ongoing and new genetic epidemiological studies of CVD and related diseases.