A major project of this section is the development of new statistical genetics methodology, prompted by the needs of our applied studies, together with the testing and comparison of novel and existing statistical methods. We continue to explore the utility of various machine learning methods in genome-wide association studies (GWAS) and in analyses of whole-exome sequence data, particularly with respect to power and the detection of gene-gene and gene-environment interactions. We previously published a study that combined GWAS genotype data from the Framingham Heart Study data repository with computer-simulated trait data, which allowed us to show that these methods may be able to detect interaction effects in suitably powered studies. We are continuing to pursue the use of machine learning methods in genomics studies (1-3). In the past, as part of Genetic Analysis Workshop 17 (GAW17), we evaluated the power of several of these methods in whole-exome sequence data from the 1000 Genomes Project using computer-simulated phenotypes, and we published several papers on data mining in the GAW17 data in late 2011.
We are currently pursuing several novel methods utilizing probability machines, synthetic variables, and meta-analysis using Random Forests. We have published a paper showing that our novel recurrency method in Random Forests seems to differentiate between variables of high importance and variables of low importance better than other current methods (1). We have also used this recurrency approach to detect low-quality single-nucleotide variants (SNVs) in whole-exome and whole-genome sequence data and applied it to GAW19 data (2). Ongoing studies have also shown that this method can detect epistatic interactions in the absence of main effects in simulated genetic data; these results have been presented at several scientific meetings. We have further developed and tested a limited permutation method that allows estimation of false positive rates in conjunction with our recurrency approach.
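The recurrency rule itself is not spelled out in this summary. As a minimal sketch only, the idea of requiring a variable to look important in every one of several forest runs might be expressed as below. The sketch assumes that each run's importance scores are rescaled by the magnitude of that run's most negative score, treated here as a rough noise level. The function names, the threshold `factor`, and the importance values are all hypothetical, and the forests themselves are not fitted here.

```python
from typing import Dict, List


def relative_importance(run: Dict[str, float]) -> Dict[str, float]:
    """Scale one run's importance scores by the size of its most negative score."""
    # Assumption: negative importances arise only from noise, so the magnitude
    # of the most negative score is used as a rough noise scale for that run.
    negatives = [score for score in run.values() if score < 0]
    if not negatives:
        raise ValueError("run has no negative importance to use as a noise scale")
    noise = abs(min(negatives))
    return {name: score / noise for name, score in run.items()}


def recurrent_selection(runs: List[Dict[str, float]], factor: float) -> List[str]:
    """Keep variables whose relative importance reaches `factor` in every run."""
    scaled = [relative_importance(run) for run in runs]
    names = sorted(runs[0])
    return [name for name in names if all(s[name] >= factor for s in scaled)]


# Made-up importance scores from three hypothetical forest runs.
runs = [
    {"snp_a": 0.90, "snp_b": 0.05, "snp_c": -0.10, "snp_d": 0.35},
    {"snp_a": 0.80, "snp_b": 0.30, "snp_c": -0.08, "snp_d": -0.02},
    {"snp_a": 1.10, "snp_b": -0.04, "snp_c": 0.02, "snp_d": -0.12},
]
selected = recurrent_selection(runs, factor=3.0)
print(selected)
```

In this toy input only `snp_a` clears the threshold in all three runs; `snp_b` and `snp_d` each look important in a single run and are dropped, which is the behavior a recurrency requirement is meant to produce.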
Simulations further suggest that our new recurrency method is powerful in multiple situations, controls false positives, and detects epistatic interactions with more power than parametric methods when there are no main effects. We have developed and released a software package, r2VIM, which is available on Dr. Bailey-Wilson's website for broad access, and have published two papers describing this method, including one this year (1). We are currently developing The Machine Suite, which will be an extension of r2VIM. A manuscript presenting the updates to our methods is in preparation.
We have also been developing a novel method to analyze matched case-control or case-parent trio data using Random Forests. Combining results from a large number of classification trees gives a flexible solution for analyzing matched datasets. A paper presenting some of this work, along with an applied analysis of oral cleft GWAS data, has been published (Li et al., 2015). Work to efficiently implement this method for large-scale genomic data is ongoing, and additional manuscripts are in development.
We have developed novel tools for analysis and interpretation of whole-exome sequence (WES) and whole-genome sequence (WGS) data, including strategies for combining linkage and sequence results, various schemes for collapsing rare variants in genes and gene networks to improve the power of sequence analysis, and methods for integrating sequence analyses with existing genomics databases. Two papers presenting these results were published in late 2011 and another in 2014. In particular, we showed that family-based analyses such as two-point linkage analysis controlled false positive rates well and, for detection of rare variants of large effect, were more powerful than most methods that used the same number of unrelated individuals.
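The collapsing schemes mentioned above are not detailed in this summary. As a hedged illustration of the general idea only, the sketch below sums minor-allele counts over the rare variants in each gene to give one burden score per individual. The variant names, gene assignments, genotypes, and the frequency cutoff are all made up for the example, and the sample is far too small for a realistic frequency estimate.

```python
from typing import Dict, List

# Hypothetical genotypes: minor-allele counts (0, 1 or 2) for six individuals.
genotypes: Dict[str, List[int]] = {
    "var1": [0, 1, 0, 0, 0, 0],
    "var2": [0, 0, 0, 1, 0, 0],
    "var3": [1, 2, 1, 0, 1, 2],  # common variant, should not be collapsed
    "var4": [0, 0, 0, 0, 0, 1],
}
# Hypothetical variant-to-gene annotation.
gene_of: Dict[str, str] = {
    "var1": "GENE_A",
    "var2": "GENE_A",
    "var3": "GENE_A",
    "var4": "GENE_B",
}


def minor_allele_frequency(counts: List[int]) -> float:
    """Sample minor-allele frequency for a diploid genotype vector."""
    return sum(counts) / (2 * len(counts))


def collapse_rare(
    genotypes: Dict[str, List[int]], gene_of: Dict[str, str], maf_cutoff: float
) -> Dict[str, List[int]]:
    """Per gene, sum minor-allele counts over variants with MAF below the cutoff."""
    burden: Dict[str, List[int]] = {}
    for variant, counts in genotypes.items():
        if minor_allele_frequency(counts) >= maf_cutoff:
            continue  # skip common variants
        gene = gene_of[variant]
        current = burden.setdefault(gene, [0] * len(counts))
        burden[gene] = [a + b for a, b in zip(current, counts)]
    return burden


burden = collapse_rare(genotypes, gene_of, maf_cutoff=0.10)
print(burden)
```

The resulting per-gene scores could then be tested against a trait in place of the individual variants; collapsing over gene networks would follow the same pattern with a different annotation map.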
We followed up this comparison with a linkage study in the GAW18 data to evaluate significance thresholds for linkage analysis in whole-genome sequence data and found that false positive rates were less well controlled for WGS data than for WES data, suggesting that more stringent thresholds might be necessary. Development of these analysis methods and tools is ongoing, driven by our own WES and targeted sequence data from multiple studies of complex traits. We have recently completed development of a sequence data quality assurance pipeline, a visualization program to display regions where individuals share multiple rare variants, and scripts to automate two-point linkage analysis (parametric and non-parametric) of whole-exome and whole-genome sequence data. We have developed programs to analyze runs of homozygosity across different types of genotype and sequence data. We have worked on optimizing methods for performing multipoint analyses using extremely dense WES and exome chip data sets, and have shown that several linkage methods that purport to adjust adequately for intermarker linkage disequilibrium do not control false positive rates adequately when data of this extreme density are analyzed. This research was awarded a platform presentation at the 2015 International Genetic Epidemiology Society meeting (CL Simpson).
Given the limitations of the GAW simulated datasets, we have developed and tested our own simulation pipeline to simulate genome-wide association data with realistic haplotype block structures that will be representative of (at least) European Caucasian and African-American populations. These simulations are allowing us to test and compare analysis methods across a wide array of biological models, including complex trait models with gene-gene and gene-environment interactions. Simulations are ongoing to compare our new methods to existing methods and to test the methods using more complex biological models.
In collaboration with Dr. Ruzong Fan at NICHD, we have contributed to the development of new generalized functional linear models for gene-based tests of both quantitative and qualitative traits, as well as mixed effects models. These new methods have been shown to be more powerful than other gene-based tests while retaining good control of false positive rates. Two papers were published this year, one presenting a comparison of fixed and mixed effect models and one presenting extensions of these methods to multivariate analysis (3, 4). We are now in the process of applying these approaches to several of our genome-wide datasets.
This year we have also collaborated with various investigators in our International Consortium for Prostate Cancer Genetics (see report HG200331-13, Genetic Epidemiology of Cancer) to develop and test two new methods. One is a gene-based association method for rare variant analysis (5), and the other is a novel machine-learning-based method for annotation of genetic variants that combines existing prediction approaches with protein-prediction modeling (6). Finally, we have collaborated with members of Dr. Alexander Wilson's group (Genometrics Section, CSGB, NHGRI) on several methods development projects, including an ongoing project to develop approaches for selecting significant variants from GWAS when no replication samples are available, such that false positive rates are well controlled. A manuscript on this recent project is in preparation.
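The programs for analyzing runs of homozygosity mentioned earlier in this summary are not described in detail. As a rough illustration of the underlying scan only, the sketch below finds stretches of consecutive homozygous calls in one individual's genotype vector. It assumes genotypes are coded 0, 1 or 2 (with 1 heterozygous), counts markers instead of physical distance, and treats any other value as ending a run; the function name and the `min_markers` setting are hypothetical, and practical tools usually also tolerate occasional heterozygous or missing calls.

```python
from typing import List, Tuple


def runs_of_homozygosity(genotypes: List[int], min_markers: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) marker indices of homozygous runs."""
    runs: List[Tuple[int, int]] = []
    start = None  # index where the current homozygous run began, if any
    for i, call in enumerate(genotypes):
        if call in (0, 2):  # homozygous call extends or opens a run
            if start is None:
                start = i
        else:  # heterozygous or unrecognized call closes the run
            if start is not None and i - start >= min_markers:
                runs.append((start, i - 1))
            start = None
    # Close a run that reaches the end of the vector.
    if start is not None and len(genotypes) - start >= min_markers:
        runs.append((start, len(genotypes) - 1))
    return runs


# Made-up calls for one individual across twelve ordered markers.
calls = [0, 2, 2, 1, 0, 0, 0, 2, 0, 1, 2, 2]
roh = runs_of_homozygosity(calls, min_markers=4)
print(roh)
```

With these toy calls only the five-marker stretch at indices 4 through 8 meets the minimum length; the shorter homozygous stretches at either end are ignored.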

Support Year: 19
Fiscal Year: 2017
Name: Human Genome Research
Chiu, Chi-Yang; Jung, Jeesun; Wang, Yifan et al. (2017) A comparison study of multivariate fixed models and Gene Association with Multiple Traits (GAMuT) for next-generation sequencing. Genet Epidemiol 41:18-34
Larson, Nicholas B; McDonnell, Shannon; Cannon Albright, Lisa et al. (2017) gsSKAT: Rapid gene set analysis and multiple testing correction for rare-variant association studies using weighted linear kernels. Genet Epidemiol 41:297-308
König, Inke R; Auerbach, Jonathan; Gola, Damian et al. (2016) Machine learning and data mining in complex genomic data--a review on the lessons learned in Genetic Analysis Workshop 19. BMC Genet 17 Suppl 2:1
Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit et al. (2016) r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 9:7
Fan, Ruzong; Chiu, Chi-Yang; Jung, Jeesun et al. (2016) A Comparison Study of Fixed and Mixed Effect Models for Gene Level Association Studies of Complex Traits. Genet Epidemiol 40:702-721
Ioannidis, Nilah M; Rothstein, Joseph H; Pejaver, Vikas et al. (2016) REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet 99:877-885
Holzinger, Emily R; Szymczak, Silke; Malley, James et al. (2016) Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data. BMC Proc 10:147-152
Wang, Yifan; Liu, Aiyi; Mills, James L et al. (2015) Pleiotropy analysis of quantitative traits at gene level by multivariate functional linear models. Genet Epidemiol 39:259-75
Pendergrass, Sarah A; Verma, Shefali S; Hall, Molly A et al. (2015) Next-generation analysis of cataracts: determining knowledge driven gene-gene interactions using biofilter, and gene-environment interactions using the Phenx Toolkit*. Pac Symp Biocomput :495-505
Li, Qing; Kim, Yoonhee; Suktitipat, Bhoom et al. (2015) Gene-Gene Interaction Among WNT Genes for Oral Cleft in Trios. Genet Epidemiol 39:385-94
