This project studies statistical learning machines as applied to biomedical and clinical problems of prediction, probability assignment, regression, and ranking. The algorithms involved include Random Forests, support vector machines, neural networks, and variations of the boosting algorithm. These are all recently developed techniques, originally constructed by the machine learning community, that are only now starting to see applications in biomedical problems. They were not designed through familiar parametric statistical reasoning; instead, using the more advanced methods of nonparametric density estimation, they are known to be provably Bayes risk consistent. They are therefore well adapted to large data sets, especially whole-genome data. More recently we have found methods to calculate risk and hazard using probability machines. These methods, now called Risk Machines, are entirely model-free and are provably valid using current techniques in mathematical statistics. The researcher need not supply any model or parametric input. Personalized risks can be calculated for individuals relative to any possible predictor or environmental hazard, or any interaction between gene and environment. These methods solve the problems first posed in our earlier book, "Statistical Learning for Biological Data" (coauthors K Malley and S Pajevic; published 2011). The practical applications of these solutions, including Risk Machines, will appear in our next book, "Estimation of Risk and Probability: A Machine Learning Approach" (in preparation). Consistent, valid estimates of genetic risk can be obtained with Risk Machines for such problems as childhood-onset schizophrenia. Also, new predictive features can be constructed from observed features that account for known problems in statistical genetics, such as recombination hot spots and linkage disequilibrium. These synthetic features can then be evaluated separately for estimating the subject's risk. This is another example of personalized medicine provided by statistical learning machines.
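The idea of a model-free, personalized risk can be illustrated with a small sketch. It is not the project's implementation. As a stand-in for a Random Forest, it uses k-nearest-neighbor averaging of a 0/1 outcome, another nonparametric probability estimator, so that the example needs only the Python standard library. The data are simulated, and all names (`knn_probability`, `risk_difference`, `simulate`, the single covariate, the binary exposure) are hypothetical. The sketch estimates the probability of the outcome for one subject with the exposure set to 1 and then to 0, and reports the difference as that subject's risk difference.

```python
import random


def knn_probability(train_X, train_y, x, k):
    """Estimate P(y = 1 | x) as the mean 0/1 outcome of the k nearest training points."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return sum(train_y[i] for i in order[:k]) / k


def risk_difference(train_X, train_y, subject, exposure_index, k):
    """Estimated risk for this subject if exposed minus the risk if unexposed."""
    exposed = list(subject)
    unexposed = list(subject)
    exposed[exposure_index] = 1.0
    unexposed[exposure_index] = 0.0
    return (knn_probability(train_X, train_y, exposed, k)
            - knn_probability(train_X, train_y, unexposed, k))


def simulate(n, seed):
    """Toy data (an assumption): exposure raises risk only when the covariate exceeds 0.5."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        covariate = rng.random()
        exposure = float(rng.random() < 0.5)
        p = 0.2 + (0.4 if exposure == 1.0 and covariate > 0.5 else 0.0)
        X.append([covariate, exposure])
        y.append(1 if rng.random() < p else 0)
    return X, y


X, y = simulate(n=2000, seed=0)
for covariate in (0.2, 0.8):
    rd = risk_difference(X, y, [covariate, 0.0], exposure_index=1, k=200)
    print(f"covariate={covariate}: estimated risk difference = {rd:.2f}")
```

No model for the interaction is specified anywhere in the estimator. In this toy setting the estimated risk difference should nonetheless come out larger for the subject whose covariate is above 0.5, because the neighbors used for each estimate are drawn from the matching region of the data.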

Center for Information Technology
Salem, Ghadi H; Dennis, John U; Krynitsky, Jonathan et al. (2015) SCORHE: a novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks. Behav Res Methods 47:235-50
Dasgupta, Abhijit; Szymczak, Silke; Moore, Jason H et al. (2014) Risk estimation using probability machines. BioData Min 7:2
Malley, James D; Moore, Jason H (2014) First complex, then simple. BioData Min 7:13
Kruppa, Jochen; Liu, Yufeng; Biau, Gérard et al. (2014) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J 56:534-63
Chen, Zhen-Xia; Sturgill, David; Qu, Jiaxin et al. (2014) Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome Res 24:1209-23
Greenstein, Deanna; Kataria, Rachna; Gochman, Peter et al. (2014) Looking for childhood-onset schizophrenia: diagnostic algorithms for classifying children and adolescents with psychosis. J Child Adolesc Psychopharmacol 24:366-73
Pan, Qinxin; Hu, Ting; Malley, James D et al. (2014) A system-level pathway-phenotype association analysis using synthetic feature random forest. Genet Epidemiol 38:209-19
Malley, James D; Moore, Jason H (2014) Innovation is often unnerving: the door into summer. BioData Min 7:12
Shah, Mona; Mamyrova, Gulnara; Targoff, Ira N et al. (2013) The clinical phenotypes of the juvenile idiopathic inflammatory myopathies. Medicine (Baltimore) 92:25-41
Malley, James D; Moore, Jason H (2013) The disconnect between classical biostatistics and the biological data mining community. BioData Min 6:12
