A major project of this section is the development of new statistical genetics methodology, as prompted by the needs of our applied studies, and the testing and comparison of novel and existing statistical methods. The project to develop propensity scores as a method for including covariate effects in linkage analyses is ongoing. This method appears promising in that it is generally more powerful than including the covariates directly in the model and does not have strongly inflated Type I error rates. We have created programs that calculate permutation p-values for the linkage results obtained when using propensity scores in LODPAL in the S.A.G.E. program package, and programs that convert multipoint IBD sharing values calculated in SIMWALK2 so that they can be used by LODPAL in place of the IBD sharing calculated by the GENIBD program. The SIMWALK2 results are often more accurate than the approximations from GENIBD, and calculation is much faster. This year we have utilized propensity scores in IBDReg, are developing a program to compute empirical p-values, and are performing studies of its performance. We plan to apply these methods to Dr. Bailey-Wilson's lung cancer and prostate cancer data. We continue to explore the utility of various machine learning methods in genome-wide association studies (GWAS) and in analyses of whole-exome sequence data, particularly with respect to power and detection of gene-gene and gene-environment interactions. We previously published a study using GWAS genotype data from the Framingham Heart Study data repository with computer-simulated trait data, which allowed us to show that these methods may be able to detect interaction effects in suitably powered studies.
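As a generic illustration of the permutation (empirical) p-values mentioned above, the following minimal Python sketch shuffles group labels and counts how often the permuted statistic reaches the observed one. It is not the LODPAL or IBDReg procedure: the function names, the mean-difference statistic, and the add-one correction are assumptions chosen only to show the idea.

```python
import random

def mean_difference(values, labels):
    """Illustrative test statistic: mean of group 1 minus mean of group 0."""
    group1 = [v for v, g in zip(values, labels) if g == 1]
    group0 = [v for v, g in zip(values, labels) if g == 0]
    return sum(group1) / len(group1) - sum(group0) / len(group0)

def permutation_p_value(values, labels, statistic, n_perm=1000, seed=0):
    """One-sided empirical p-value obtained by permuting the labels.

    The add-one correction keeps the estimate strictly above zero.
    """
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between values and labels
        if statistic(values, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

In a linkage setting the statistic would be the linkage test statistic recomputed on each permuted dataset; the smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations limits the resolution.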
We are continuing to pursue the use of machine learning methods in genomics studies, and have evaluated the power of several of these methods in whole-exome sequence data from the 1000 Genomes Project using computer-simulated phenotypes as part of Genetic Analysis Workshop 17 (GAW17). We published several papers concerning data mining in the GAW17 data in late 2011. We are currently pursuing several novel methods utilizing probability machines, synthetic variables, and meta-analysis using Random Forests. We have developed a recurrency method in Random Forests that seems to differentiate variables of high importance from those of low importance better than other current methods. We have also used this recurrency approach to detect low-quality SNVs in whole-exome and whole-genome sequence data. One manuscript is in revision and others are in preparation. We have also been developing a novel method to analyze matched case-control or case-parent trio data using Random Forests. By combining results from a large number of classification trees, we have a flexible solution for analyzing matched datasets. This novel method is undergoing additional testing. We have used the GAW17 simulated whole-exome sequence (WES) data to develop novel tools for analysis and interpretation of WES data, including strategies for combining linkage and sequence results, various schemes for collapsing rare variants in genes and gene networks to improve the power of sequence analysis, and methods for integrating sequence analyses with existing genomics databases. Two papers presenting these results were published in late 2011. In particular, we showed that family-based approaches such as two-point linkage analysis controlled false positive rates well and were more powerful than most methods that utilized the same number of unrelated individuals for detection of rare variants of large effect.
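To make the idea of collapsing rare variants concrete, here is a minimal Python sketch of one simple scheme: each individual receives a gene-level indicator of whether they carry at least one rare allele. The text does not say which collapsing schemes were evaluated, so the carrier-indicator rule, the function names, and the frequency threshold are all assumptions for illustration.

```python
def minor_allele_frequency(genotypes, j):
    """Frequency of the minor allele at variant j.

    genotypes: list of rows (one per individual) of minor-allele counts 0/1/2.
    """
    return sum(row[j] for row in genotypes) / (2 * len(genotypes))

def collapse_rare_variants(genotypes, maf_threshold=0.01):
    """Collapse the rare variants of one gene into a 0/1 carrier indicator.

    A variant counts as rare if it is observed at least once and its sample
    frequency does not exceed maf_threshold (a hypothetical cutoff).
    """
    n_variants = len(genotypes[0])
    rare = [
        j for j in range(n_variants)
        if 0 < minor_allele_frequency(genotypes, j) <= maf_threshold
    ]
    return [1 if any(row[j] > 0 for j in rare) else 0 for row in genotypes]
```

The resulting indicator can be tested against the trait as a single variable, which is what gives collapsing its potential power advantage over testing each rare variant separately.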
We followed this up with a linkage study in the GAW18 data to evaluate significance thresholds for linkage analysis in whole-genome sequence (WGS) data and found that false positive rates were less well controlled for WGS data than for WES data, suggesting that more stringent thresholds might be necessary. This paper has been accepted for publication in BMC Proceedings. Development of these analysis methods and tools is ongoing, driven by our own WES and targeted sequence data from multiple studies of complex traits. We have recently completed development of a sequence data quality assurance pipeline, a visualization program to display regions where individuals share multiple rare variants, and scripts to automate two-point linkage analysis (parametric and non-parametric) of whole-exome and whole-genome sequence data. We have developed programs to analyze runs of homozygosity across different types of genotype and sequence data. Given the limitations of the GAW simulated datasets, we have developed and tested our own simulation pipeline to simulate genome-wide association data with realistic haplotype block structures that will be representative of (at least) European Caucasian and African-American populations. These simulations are allowing us to test and compare analysis methods across a wide array of biological models, including complex trait models with gene-gene and gene-environment interactions. To date, we have shown that Random Forests, Pinpoint, and logistic regression all have similarly good control of the false positive rate under the null, and that under simple additive models of disease causation these three methods have similar power to detect a small number of causal variants of small to moderate effect size. Simulations further suggest that our new recurrency method is powerful in multiple situations and controls false positives. Simulations are ongoing to compare additional methods and to test the methods using more complex biological models.
In collaboration with Dr. Qing Li in my section and Dr. Ingo Ruczinski at Johns Hopkins Bloomberg School of Public Health, an R program has been developed to simulate case-parent trio data for use in testing our ongoing development of methods for analyzing trio data. Dr. Li has also been developing a haplotype-based association method to analyze longitudinal data in collaboration with Dr. Kelly Benke at Johns Hopkins Bloomberg School of Public Health.
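For readers unfamiliar with case-parent trio simulation, the following minimal Python sketch shows the basic idea at a single biallelic variant: parental genotypes are drawn under Hardy-Weinberg proportions, the child receives one allele from each parent, and a trio is kept only if the child is affected under a genotype-specific penetrance. This is not the R program described above; every name and parameter here is hypothetical.

```python
import random

def simulate_trio(maf, rng):
    """Draw one (father, mother, child) trio of minor-allele counts."""
    def parent():
        # Two independent alleles, each minor with probability maf.
        return [1 if rng.random() < maf else 0 for _ in range(2)]
    father, mother = parent(), parent()
    # Mendelian transmission: one randomly chosen allele from each parent.
    child = [rng.choice(father), rng.choice(mother)]
    return sum(father), sum(mother), sum(child)

def simulate_case_trios(n_trios, maf, penetrance, seed=0):
    """Rejection-sample trios whose child is affected.

    penetrance: probabilities of disease for child genotypes 0, 1 and 2;
    at least one reachable genotype must have a nonzero value, otherwise
    the loop never terminates.
    """
    rng = random.Random(seed)
    trios = []
    while len(trios) < n_trios:
        f, m, c = simulate_trio(maf, rng)
        if rng.random() < penetrance[c]:
            trios.append((f, m, c))
    return trios
```

Setting all three penetrances equal gives data under the null hypothesis of no association, which is useful for checking the false positive rate of a trio-based test.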

National Human Genome Research Institute
König, Inke R; Auerbach, Jonathan; Gola, Damian et al. (2016) Machine learning and data mining in complex genomic data--a review on the lessons learned in Genetic Analysis Workshop 19. BMC Genet 17 Suppl 2:1
Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit et al. (2016) r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 9:7
Fan, Ruzong; Chiu, Chi-Yang; Jung, Jeesun et al. (2016) A Comparison Study of Fixed and Mixed Effect Models for Gene Level Association Studies of Complex Traits. Genet Epidemiol :
Ritchie, Marylyn D; Holzinger, Emily R; Li, Ruowang et al. (2015) Methods of integrating data to uncover genotype-phenotype interactions. Nat Rev Genet 16:85-97
Pendergrass, Sarah A; Verma, Shefali S; Hall, Molly A et al. (2015) Next-generation analysis of cataracts: determining knowledge driven gene-gene interactions using biofilter, and gene-environment interactions using the Phenx Toolkit*. Pac Symp Biocomput :495-505
Li, Qing; Kim, Yoonhee; Suktitipat, Bhoom et al. (2015) Gene-Gene Interaction Among WNT Genes for Oral Cleft in Trios. Genet Epidemiol 39:385-94
Holzinger, Emily Rose; Szymczak, Silke; Dasgupta, Abhijit et al. (2015) Variable selection method for the identification of epistatic models. Pac Symp Biocomput :195-206
Wang, Yifan; Liu, Aiyi; Mills, James L et al. (2015) Pleiotropy analysis of quantitative traits at gene level by multivariate functional linear models. Genet Epidemiol 39:259-75
Bureau, Alexandre; Younkin, Samuel G; Parker, Margaret M et al. (2014) Inferring rare disease risk variants based on exact probabilities of sharing by multiple affected relatives. Bioinformatics 30:2189-96
Schwender, Holger; Li, Qing; Neumann, Christoph et al. (2014) Detecting disease variants in case-parent trio studies using the bioconductor software package trio. Genet Epidemiol 38:516-22
