This project studies statistical learning machines as applied to biomedical and clinical prediction, probability assignment, regression, and ranking problems. The algorithms involved include Random Forests, support vector machines, neural networks, and variations of the boosting algorithm. These are all recently developed techniques originally constructed by the machine learning community, and they are only now starting to see applications in biomedical problems. These methods were not designed through familiar parametric statistical reasoning; however, using the more advanced methods of nonparametric density estimation, they are known to be provably Bayes risk consistent. Hence, as the data set grows, their classification of cases and subjects, for example, approaches the optimal (Bayes) error rate. As routinely applied to data collected by clinicians or biomedical researchers, these new techniques require modifications and enhancements appropriate to data collected from these alternate sources. In particular, we address (1) the problem of greatly unbalanced data sets, where the researcher typically has only a handful of positive cases and a great many negative cases, (2) the issue of accurate estimates of prediction error rates, where the researcher typically has a relatively small data set upon which to do both model fitting and testing, and (3) the interpretation of the means by which the prediction engine operates and the development of practical prognostic factors. These three problems are essential questions facing the use of modern prediction engines, but have been only lightly studied by the machine learning community. On the other hand, the rigorous methods of the mathematical statistics community have demonstrated the unusual versatility and flexibility of these methods. We have applied these statistical learning machine schemes to a wide variety of biological data sets, such as a 1,000K SNP data set on childhood-onset schizophrenia.
At the invitation of Cambridge University Press, we are writing a textbook, "Statistical Learning for Biological Data"; completion of the text and publication are anticipated in 2009.
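As a rough illustration of problems (1) and (2) in the abstract, the sketch below draws a class-balanced bootstrap sample from an unbalanced data set and sets aside the cases never drawn (the out-of-bag cases) as a held-out set for estimating prediction error. This is one common general approach, not necessarily the method developed in this project. The function name `balanced_bootstrap`, its parameters, and the toy data (5 positive and 95 negative cases) are hypothetical and chosen only for this example.

```python
import random


def balanced_bootstrap(labels, seed=0):
    """Draw a class-balanced bootstrap sample (hypothetical helper).

    labels: list of 0/1 class labels; both classes must be present.
    Returns (in_bag, out_of_bag) as lists of indices into labels.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    n = min(len(pos), len(neg))  # size of the smaller class
    # Sample n cases with replacement from each class, so that both
    # classes contribute equally to the training sample.
    in_bag = [rng.choice(pos) for _ in range(n)]
    in_bag += [rng.choice(neg) for _ in range(n)]
    # Cases never drawn are out-of-bag; they can act as held-out test
    # cases when the data set is too small for a separate test set.
    drawn = set(in_bag)
    out_of_bag = [i for i in range(len(labels)) if i not in drawn]
    return in_bag, out_of_bag


# Toy data (assumed): 5 positive cases and 95 negative cases.
labels = [1] * 5 + [0] * 95
in_bag, out_of_bag = balanced_bootstrap(labels, seed=42)
n_pos_in_bag = sum(labels[i] for i in in_bag)
print(len(in_bag), n_pos_in_bag, len(out_of_bag))
```

In an ensemble method, a draw like this would typically be repeated once per base learner, with each learner's errors on its own out-of-bag cases averaged to give an error estimate.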

Support Year: 7
Fiscal Year: 2009
Total Cost: $211,036
Name: Center for Information Technology
Battogtokh, Bilguunzaya; Mojirsheibani, Majid; Malley, James (2017) The optimal crowd learning machine. BioData Min 10:16
Holzinger, Emily R; Szymczak, Silke; Malley, James et al. (2016) Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data. BMC Proc 10:147-152
Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit et al. (2016) r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 9:7
Li, Jing; Malley, James D; Andrew, Angeline S et al. (2016) Detecting gene-gene interactions using a permutation-based random forest method. BioData Min 9:14
Salem, Ghadi H; Dennis, John U; Krynitsky, Jonathan et al. (2015) SCORHE: a novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks. Behav Res Methods 47:235-50
Holzinger, Emily Rose; Szymczak, Silke; Dasgupta, Abhijit et al. (2015) Variable selection method for the identification of epistatic models. Pac Symp Biocomput :195-206
Dasgupta, Abhijit; Szymczak, Silke; Moore, Jason H et al. (2014) Risk estimation using probability machines. BioData Min 7:2
Kruppa, Jochen; Liu, Yufeng; Biau, Gérard et al. (2014) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J 56:534-63
Chen, Zhen-Xia; Sturgill, David; Qu, Jiaxin et al. (2014) Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome Res 24:1209-23
Greenstein, Deanna; Kataria, Rachna; Gochman, Peter et al. (2014) Looking for childhood-onset schizophrenia: diagnostic algorithms for classifying children and adolescents with psychosis. J Child Adolesc Psychopharmacol 24:366-73

Showing the most recent 10 out of 28 publications