Statistical models for genetics data are often surprisingly challenging and frequently require new, advanced statistical methods. This project continues to investigate a number of such areas, including a global analysis of X-chromosome dosage compensation. Drosophila has a specialized dosage compensation complex that upregulates the male X chromosome about two-fold relative to the autosomes, thus maintaining X-versus-autosomal genic balance. However, this complex is present only in the soma, not in the germline. Nevertheless, germline tissues also display striking two-fold upregulation of genes on the male X chromosome, as revealed by careful measurements of gene expression using microarrays (Gupta, Malley, Oliver, et al., 2006). Analysis of published data from mouse and worm expression arrays reveals a similar balance between X and autosomal genes. Taken together, these results, which indicate that multiple means have evolved to achieve the same end, emphasize our fundamental ignorance of the underlying transcription-linked process that is being regulated. The paper by Gupta, Malley, Oliver, et al. (J. Biology, Feb. 2006) was accessed more than 8,500 times in the year following its appearance in February 2006 and was the third most accessed paper in the journal over that period.

More recently, we have undertaken the study of genome-wide associations and of how statistical learning machines can be applied to such ultra-large data (500K or 1,000K SNPs), with the aim of locating the most predictive genes or SNPs among the available features and understanding how linkage disequilibrium compromises or assists these detection methods. As discussed above, we have rapidly expanded our search and fusion program for analyzing ultra-large-scale genetic data sets. We routinely derive fully validated error rates and top lists of the most important predictors from two million SNPs per subject. Reproducible error rates and congruent lists of predictors are now obtainable rather easily using learning machines implemented on the NIH Biowulf cluster.
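
As a rough illustration of what a validated error rate and a top list of predictors can look like, the following minimal sketch bags single-SNP decision stumps on simulated genotypes and reports an out-of-bag error together with a ranking of the SNPs chosen most often. It is a toy stand-in, not the learning machines or the Biowulf pipeline used in this project: the simulated data, the single causal SNP, the noise level, and every function and parameter name (simulate, fit_stump, bagged_stumps, mtry, and so on) are assumptions made for this example only.

```python
import random
from collections import Counter

def simulate(n_subjects, n_snps, causal, rng, maf=0.3, noise=0.1):
    """Simulate 0/1/2 genotypes and a binary phenotype driven by one SNP (assumed model)."""
    X = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
         for _ in range(n_subjects)]
    y = []
    for row in X:
        label = 1 if row[causal] >= 1 else 0   # carriers of the minor allele are cases
        if rng.random() < noise:               # flip a fraction of labels as noise
            label = 1 - label
        y.append(label)
    return X, y

def fit_stump(X, y, rows, features):
    """Pick the single-SNP rule with the fewest errors on the in-bag rows."""
    best = None
    for f in features:
        for cut in (0, 1):
            for hi in (0, 1):
                errors = sum((hi if X[i][f] > cut else 1 - hi) != y[i] for i in rows)
                if best is None or errors < best[0]:
                    best = (errors, f, cut, hi)
    return best[1:]

def predict(stump, row):
    f, cut, hi = stump
    return hi if row[f] > cut else 1 - hi

def bagged_stumps(X, y, n_trees, mtry, rng):
    """Bag stumps over bootstrap samples; score each subject only on trees that did not see it."""
    n, p = len(X), len(X[0])
    votes = [Counter() for _ in range(n)]
    picked = Counter()
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of subjects
        features = rng.sample(range(p), mtry)           # random subset of SNPs
        stump = fit_stump(X, y, in_bag, features)
        picked[stump[0]] += 1
        for i in set(range(n)) - set(in_bag):           # out-of-bag subjects
            votes[i][predict(stump, X[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    oob_error = wrong / len(scored)
    ranking = [f for f, _ in picked.most_common()]      # most frequently selected SNPs first
    return oob_error, ranking

rng = random.Random(0)   # fixed seed so the run is reproducible
CAUSAL = 3               # hypothetical index of the one causal SNP
X, y = simulate(200, 20, CAUSAL, rng)
oob_error, ranking = bagged_stumps(X, y, n_trees=200, mtry=8, rng=rng)
print(f"OOB error: {oob_error:.3f}")
print("Top predictors:", ranking[:5])
```

Because each subject is scored only by stumps whose bootstrap sample excluded it, the error estimate is not inflated by training-set reuse, and fixing the seed makes both the error rate and the predictor list repeatable across runs. Real SNP data add correlated predictors through linkage disequilibrium, which this sketch deliberately leaves out.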

Support Year: 12
Fiscal Year: 2010
Total Cost: $67,880
Name: Center for Information Technology
Shah, Mona; Mamyrova, Gulnara; Targoff, Ira N et al. (2013) The clinical phenotypes of the juvenile idiopathic inflammatory myopathies. Medicine (Baltimore) 92:25-41
Malley, J D; Kruppa, J; Dasgupta, A et al. (2012) Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med 51:74-81
Kim, Yoonhee; Li, Qing; Cropp, Cheryl D et al. (2011) Performance of random forests and logic regression methods using mini-exome sequence data. BMC Proc 5 Suppl 9:S104
Dasgupta, Abhijit; Sun, Yan V; König, Inke R et al. (2011) Brief review of regression-based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience. Genet Epidemiol 35 Suppl 1:S5-11
Nicodemus, Kristin K; Malley, James D (2009) Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics 25:1884-90
Strobl, Carolin; Malley, James; Tutz, Gerhard (2009) An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol Methods 14:323-48
Kim, Yoonhee; Wojciechowski, Robert; Sung, Heejong et al. (2009) Evaluation of random forests performance for genome-wide association studies in the presence of interaction effects. BMC Proc 3 Suppl 7:S64