This project studies statistical learning machines as applied to biomedical and clinical prediction, probability assignment, regression, and ranking problems. The algorithms involved include Random Forests, support vector machines, neural networks, and variations of the boosting algorithm. These are all recently developed techniques, originally constructed by the machine learning community, that are only now starting to see applications in biomedical problems. These methods were not designed through familiar parametric statistical reasoning; however, using the more advanced methods of nonparametric density estimation, they have been shown to be provably Bayes risk consistent. Hence, as the data set grows, the methods classify cases and subjects optimally, for example. When routinely applied to data collected by clinicians or biomedical researchers, these new techniques require modifications and enhancements appropriate to data from these sources. In particular, we address (1) greatly unbalanced data sets, where the researcher typically has only a handful of positive cases and a great many negative cases; (2) accurate estimation of prediction error rates, where the researcher typically has a relatively small data set on which to do both model fitting and testing; and (3) the interpretation of the means by which the prediction engine operates, together with the development of practical prognostic factors. These three problems are essential questions facing the use of modern prediction engines, but they have been only lightly studied by the machine learning community. On the other hand, the rigorous methods of the mathematical statistics community have demonstrated the unusual versatility and flexibility of these methods. We have applied these statistical learning machine schemes to a wide variety of biological data sets, such as a two-million-SNP data set on childhood-onset schizophrenia.
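To make problems (1) and (2) concrete, the following is a minimal sketch and not the project's own method. It assumes a toy one-dimensional data set, a hypothetical `fit_stump` learner that predicts a positive case when the feature exceeds a threshold, and a hypothetical `balanced_oob` helper. Each round draws an equal-sized bootstrap sample from the positive and negative cases, and every case left out of that round is scored by the resulting stump, so the error estimate never reuses the data a stump was fitted on.

```python
import random


def fit_stump(sample):
    """Return the threshold t (taken from the sample's x values) that
    minimises misclassifications when predicting 1 for x > t."""
    best_t, best_err = None, None
    for t, _ in sample:
        err = sum((x > t) != bool(y) for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def balanced_oob(data, n_rounds=200, seed=0):
    """Per-class out-of-bag error of a bagged stump ensemble in which every
    bootstrap sample holds equally many positive and negative cases."""
    rng = random.Random(seed)
    pos = [i for i, (_, y) in enumerate(data) if y == 1]
    neg = [i for i, (_, y) in enumerate(data) if y == 0]
    k = min(len(pos), len(neg))
    votes = [[0, 0] for _ in data]  # out-of-bag votes for class 0 and class 1
    for _ in range(n_rounds):
        # Draw k cases with replacement from each class (balanced bootstrap).
        drawn = [rng.choice(pos) for _ in range(k)]
        drawn += [rng.choice(neg) for _ in range(k)]
        t = fit_stump([data[i] for i in drawn])
        in_bag = set(drawn)
        # Only cases not used to fit this stump receive a vote from it.
        for i, (x, _) in enumerate(data):
            if i not in in_bag:
                votes[i][int(x > t)] += 1
    errors = {0: [0, 0], 1: [0, 0]}  # class -> [wrong, scored]
    for (_, y), (v0, v1) in zip(data, votes):
        if v0 + v1 == 0:
            continue  # case was never out of bag
        pred = int(v1 > v0)  # majority vote; ties go to class 0
        errors[y][0] += pred != y
        errors[y][1] += 1
    return {c: (w / n if n else float("nan")) for c, (w, n) in errors.items()}


# Synthetic, assumed data: 10 positive cases and 500 negative cases.
rng = random.Random(1)
data = [(rng.gauss(2.0, 1.0), 1) for _ in range(10)]
data += [(rng.gauss(0.0, 1.0), 0) for _ in range(500)]
rates = balanced_oob(data)
print(rates)  # out-of-bag error rate for class 0 and for class 1
```

Reporting the error separately for each class matters here: with 10 positives among 510 cases, a rule that always predicts "negative" would have an overall error below 2% while missing every positive case.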
At the invitation of Cambridge University Press we wrote a complete and detailed book on machine learning methods for the biomedical research community. Our text, "Statistical Learning for Biological Data" (coauthors K. Malley and S. Pajevic; est. 320 pages plus figures and index), is scheduled to be published by C.U.P. in January 2011. It will appear in their peer-reviewed, multi-volume series Practical Guides to Biostatistics and Epidemiology.

Support Year: 8
Fiscal Year: 2010
Total Cost: $84,850
Name: Center for Information Technology
Battogtokh, Bilguunzaya; Mojirsheibani, Majid; Malley, James (2017) The optimal crowd learning machine. BioData Min 10:16
Holzinger, Emily R; Szymczak, Silke; Malley, James et al. (2016) Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data. BMC Proc 10:147-152
Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit et al. (2016) r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 9:7
Li, Jing; Malley, James D; Andrew, Angeline S et al. (2016) Detecting gene-gene interactions using a permutation-based random forest method. BioData Min 9:14
Salem, Ghadi H; Dennis, John U; Krynitsky, Jonathan et al. (2015) SCORHE: a novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks. Behav Res Methods 47:235-50
Holzinger, Emily Rose; Szymczak, Silke; Dasgupta, Abhijit et al. (2015) Variable selection method for the identification of epistatic models. Pac Symp Biocomput :195-206
Malley, James D; Moore, Jason H (2014) First complex, then simple. BioData Min 7:13
Malley, James D; Malley, Karen G; Moore, Jason H (2014) O brave new world that has such machines in it. BioData Min 7:26
Dasgupta, Abhijit; Szymczak, Silke; Moore, Jason H et al. (2014) Risk estimation using probability machines. BioData Min 7:2
Kruppa, Jochen; Liu, Yufeng; Biau, Gérard et al. (2014) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J 56:534-63

Showing the most recent 10 out of 28 publications