We continue to explore the utility of various machine learning methods in genome-wide association studies and in analyses of whole-exome sequence data, particularly with respect to power and detection of gene-gene and gene-environment interactions. We previously published a study that combined GWAS genotype data from the Framingham Heart Study data repository with computer-simulated trait data, which allowed us to show that these methods may be able to detect interaction effects in suitably powered studies. We have also evaluated the power of several of these methods in whole-exome sequence data from the 1000 Genomes Project, using computer-simulated phenotypes, as part of Genetic Analysis Workshop 17 (GAW17), and we published several papers concerning data mining in the GAW17 data in late 2011. We have published a paper showing that our novel recurrency method in Random Forests appears to differentiate variables of high importance from those of low importance better than other current methods. We have also used this recurrency approach to detect low-quality SNVs in whole exome and whole genome sequence data and applied the method to GAW19 data. Ongoing studies have shown that this method can detect epistatic interactions in the absence of main effects in simulated genetic data; these results were presented at several scientific meetings. We have further developed and tested a limited permutation method that allows estimation of false positive rates in conjunction with our recurrency approach. Simulations further suggest that our new recurrency method is powerful in multiple situations, controls false positives, and detects epistatic interactions with more power than parametric methods when there are no main effects.
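The recurrency idea described above can be illustrated with a minimal sketch. It assumes, since the report does not spell out the algorithm, that importance scores from several independent Random Forest runs are each scaled by the magnitude of that run's most negative score (taken as a rough noise level), and that a feature is retained only if its scaled score reaches a chosen factor in every run. The function names, the `factor` parameter, and the toy scores are hypothetical and are not the r2VIM interface.

```python
from typing import Dict, List, Set


def relative_importances(run: Dict[str, float]) -> Dict[str, float]:
    """Scale one run's importance scores by the size of its most negative score."""
    lowest = min(run.values())
    if lowest >= 0:
        # Assumption of this sketch: a negative score is needed to gauge noise.
        raise ValueError("need at least one negative importance score")
    noise = abs(lowest)
    return {name: score / noise for name, score in run.items()}


def recurrent_selection(runs: List[Dict[str, float]], factor: float = 1.0) -> Set[str]:
    """Keep features whose relative importance reaches `factor` in every run."""
    if not runs:
        return set()
    selected = set(runs[0])
    for run in runs:
        rel = relative_importances(run)
        selected &= {name for name, value in rel.items() if value >= factor}
    return selected


# Toy importance scores from three hypothetical forest runs (not real data).
runs = [
    {"snp1": 0.040, "snp2": 0.012, "snp3": -0.004, "snp4": 0.003},
    {"snp1": 0.037, "snp2": 0.002, "snp3": 0.001, "snp4": -0.005},
    {"snp1": 0.045, "snp2": 0.015, "snp3": -0.003, "snp4": -0.002},
]
print(sorted(recurrent_selection(runs, factor=2.0)))
```

Requiring agreement across runs is what makes the selection conservative in this sketch: a feature that scores well in one forest by chance, such as `snp2` in the first and third toy runs, is dropped when it fails in another.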
Ongoing work aims to improve control of false positive rates while retaining high power, to add approaches that increase power when the number of features is very large and only interaction effects influence disease risk, and to compare our new methods with several other feature-selection schemes. We have also been developing and testing a new approach for identifying which selected features are actually interacting as opposed to acting independently. We have developed and released a software package, r2VIM, which is available on Dr. Bailey-Wilson's website for broad access, and have published three papers describing this method. We are currently developing The Machine Suite, which will be an extension of r2VIM. A manuscript presenting the updates to our methods is in preparation, and results will be presented at two upcoming scientific meetings. We have also been developing a novel method to analyze matched case-control or case-parent trio data using Random Forests. Combining results from a large number of classification trees gives a flexible solution for analyzing matched datasets; a paper presenting some of this work, along with an applied analysis of oral cleft GWAS data, has been published (Li et al., 2015). Work to efficiently implement this method for large-scale genomic data is ongoing, and additional manuscripts are in development. We have developed novel tools for analysis and interpretation of whole exome sequence (WES) and whole genome sequence (WGS) data, including strategies for combining linkage and sequence results, various schemes for collapsing rare variants in genes and gene networks to improve the power of sequence analysis, and methods for integrating sequence analyses with existing genomics databases. Development of these analysis methods and tools is ongoing, driven by our own WES and WGS data from multiple studies of complex traits.
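As an illustration of what collapsing rare variants within genes can look like, the following minimal sketch sums each individual's alternate-allele counts over the rare variants assigned to a gene. It is one simple burden-style scheme chosen for illustration, not necessarily any of the schemes used in this project. The function name, the in-sample frequency estimate, the threshold value, and the variant and gene labels are all assumptions of the sketch.

```python
from collections import defaultdict
from typing import Dict, List


def collapse_rare_variants(
    genotypes: Dict[str, List[int]],
    variant_gene: Dict[str, str],
    freq_threshold: float = 0.01,
) -> Dict[str, List[int]]:
    """Return, per gene, each individual's summed count of rare alternate alleles.

    `genotypes` maps a variant ID to alternate-allele counts (0, 1, or 2),
    one entry per individual, in the same individual order for every variant.
    """
    n = len(next(iter(genotypes.values())))
    burden: Dict[str, List[int]] = defaultdict(lambda: [0] * n)
    for variant, counts in genotypes.items():
        # Assumption: allele frequency is estimated from the sample itself.
        freq = sum(counts) / (2 * n)
        if freq == 0 or freq > freq_threshold:
            continue  # skip monomorphic and common variants
        gene = variant_gene[variant]
        burden[gene] = [total + c for total, c in zip(burden[gene], counts)]
    return dict(burden)


# Toy data for four individuals (hypothetical variants and genes).
genotypes = {
    "v1": [0, 1, 0, 0],
    "v2": [0, 0, 0, 1],
    "v3": [1, 2, 1, 1],  # common in this toy sample, so it is not collapsed
    "v4": [0, 0, 1, 0],
}
variant_gene = {"v1": "GENE_A", "v2": "GENE_A", "v3": "GENE_A", "v4": "GENE_B"}
scores = collapse_rare_variants(genotypes, variant_gene, freq_threshold=0.2)
print(scores)
```

The per-gene scores can then be tested against a trait in place of the individual variants, which is where the gain in power from collapsing is expected to come from. The threshold of 0.2 is used only because the toy sample has four individuals.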
We have recently completed development of a sequence data quality assurance pipeline, a visualization program to display regions where individuals share multiple rare variants, and scripts to automate two-point linkage analysis (parametric and non-parametric) of whole exome and whole genome sequence data. We have worked on optimizing methods for performing multipoint analyses using extremely dense WES, WGS and exome chip data sets. Work is ongoing to improve pipelines that apply family-based methods for quality control of whole genome sequence data. Our WGS pipeline has been presented at several scientific meetings this past year. Given the limitations of the GAW simulated datasets, we have developed and tested our own simulation pipeline to simulate genome-wide association data with realistic haplotype block structures that are representative of (at least) European Caucasian and African-American populations. These simulations allow us to test and compare analysis methods across a wide array of biological models, including complex trait models with gene-gene and gene-environment interactions. Simulations are ongoing to compare our new methods to existing methods and to test the methods using more complex biological models. In collaboration with Dr. Ruzong Fan at NICHD and Dr. Chi-Yan Chiu (a Guest Researcher who is a faculty member at University of Tennessee Health Sciences Center), we have contributed to the development of new generalized functional linear models for gene-based tests of both quantitative and qualitative traits, as well as mixed effects models. These new methods have been shown to be more powerful than other gene-based tests while retaining good control of false positive rates. We have published multiple papers in this area in previous years, and one paper reporting these results has been published in this reporting period (1).
We have collaborated with members of Dr. Alexander Wilson's group (Genometrics Section, CSGB, NHGRI) on several methods development projects, including an ongoing project to develop approaches for selecting significant variants from GWAS when no replication samples are available, such that false positive rates are well controlled. A manuscript reporting on our new approach and software has been published in this reporting period (2). Finally, we have continued a collaboration with Drs. Ingo Ruczinski and Alexander Bureau on approaches to identifying causal rare variants in pedigree data. We extended our existing methodology (Rare Variant Sharing, RVS) to introduce gene-based analyses, a partial sharing test based on RV sharing probabilities for subsets of affected relatives, and a haplotype-based RV definition. RVS does not require external estimates of variant frequency or control samples, provides functionality to assess and address violations of key assumptions, and is available as open source software for genome-wide analysis. We have published a paper reporting on this work this year (3).
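To make the notion of an RV sharing probability concrete, the following minimal sketch computes, for a small pedigree, the probability that all affected relatives carry a rare variant given that at least one of them does. It assumes the variant is rare enough to have entered the pedigree as a single copy in exactly one founder, that each founder is equally likely to be that founder, and that the pedigree has no inbreeding. The function name and pedigree encoding are hypothetical and are not the interface of the RVS software. The enumeration is exponential in the number of non-founders, so it suits only small examples.

```python
from fractions import Fraction
from itertools import product
from typing import Dict, List, Optional, Tuple

# member -> (father, mother); founders map to None. Parents must be listed first.
Pedigree = Dict[str, Optional[Tuple[str, str]]]


def sharing_probability(pedigree: Pedigree, affected: List[str]) -> Fraction:
    """P(all affected carry the variant | at least one affected carries it)."""
    if not affected:
        raise ValueError("need at least one affected relative")
    founders = [m for m, parents in pedigree.items() if parents is None]
    children = [m for m, parents in pedigree.items() if parents is not None]
    seen = set(founders)
    for child in children:
        if not set(pedigree[child]) <= seen:
            raise ValueError("list parents before their children")
        seen.add(child)

    weight = Fraction(1, 2) ** len(children)  # probability of one transmission pattern
    all_share = Fraction(0)
    any_carry = Fraction(0)
    for founder in founders:  # the single founder assumed to introduce the variant
        for coins in product((False, True), repeat=len(children)):
            carrier = {m: m == founder for m in pedigree}
            for child, transmitted in zip(children, coins):
                father, mother = pedigree[child]
                # Without inbreeding at most one parent carries the single copy.
                carrier[child] = transmitted and (carrier[father] or carrier[mother])
            statuses = [carrier[a] for a in affected]
            all_share += weight if all(statuses) else 0
            any_carry += weight if any(statuses) else 0
    return all_share / any_carry


# Two affected first cousins (hypothetical member labels).
cousins: Pedigree = {
    "gf": None, "gm": None, "sp1": None, "sp2": None,
    "p1": ("gf", "gm"), "p2": ("gf", "gm"),
    "c1": ("p1", "sp1"), "c2": ("sp2", "p2"),
}
print(sharing_probability(cousins, ["c1", "c2"]))  # prints 1/15
```

Under these assumptions the probability depends only on the pedigree structure, which is consistent with the statement above that no external estimates of variant frequency or control samples are required. A small value, such as 1/15 for first cousins, means that observed sharing among all affected relatives would be unlikely by chance.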