This project studies statistical learning machines as applied to biomedical and clinical prediction, probability assignment, regression, and ranking problems. The algorithms involved include Random Forests, support vector machines, neural networks, and variations of the boosting algorithm. These are all recently developed techniques, originally constructed by the machine learning community, that are only now starting to see applications in biomedical problems. These methods were not designed through familiar parametric statistical reasoning; however, using the more advanced methods of nonparametric density estimation, they are known to be provably Bayes risk consistent. They are, therefore, well adapted to large data, especially whole-genome data. More recently we have found methods to calculate risk and hazard using probability machines. These methods, now called Risk Machines, are entirely model-free and are provably valid using current techniques in mathematical statistics. No model or parametric input is required from the researcher. Personalized risks can be calculated for individuals relative to any possible predictor or environmental hazard, or any interaction between gene and environment. These methods solve the problems first posed in our earlier book, "Statistical Learning for Biological Data" (coauthors K Malley and S Pajevic; published 2011). The practical applications of these solutions, including Risk Machines, will appear in our next book, "Estimation of Risk and Probability: A Machine Learning Approach" (in preparation). Consistent, valid estimation of genetic risks can be found using Risk Machines for such problems as childhood-onset schizophrenia. Also, new predictive features can be constructed from observed features that account for known problems in statistical genetics such as recombination hot spots and linkage disequilibrium. These synthetic features can then be evaluated separately for estimating risk to the subject.
This is another example of personalized medicine provided by statistical learning machines.
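
As a rough illustration of the probability-machine idea described above, the sketch below regresses a 0/1 outcome on the predictors with a nonparametric learner and reads the fitted value as a probability. It then estimates a model-free risk difference by setting a binary exposure to 1 and to 0 for every subject and averaging the change in estimated probability. This is a minimal sketch under stated assumptions, not the project's implementation: a k-nearest-neighbor average stands in for the Random Forests and other machines named in the abstract, the data are simulated, and the names `knn_probability`, `risk_difference`, and `simulate` are hypothetical.

```python
import random


def knn_probability(train_X, train_y, x, k):
    """Estimate P(y = 1 | x) as the mean 0/1 outcome of the k nearest training points."""
    by_distance = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    nearest = by_distance[:k]
    return sum(train_y[i] for i in nearest) / k


def risk_difference(train_X, train_y, exposure_index, k):
    """Average change in estimated probability when a binary exposure is set to 1 versus 0."""
    total = 0.0
    for row in train_X:
        exposed = list(row)
        exposed[exposure_index] = 1.0
        unexposed = list(row)
        unexposed[exposure_index] = 0.0
        total += knn_probability(train_X, train_y, exposed, k)
        total -= knn_probability(train_X, train_y, unexposed, k)
    return total / len(train_X)


def simulate(n, seed):
    """Simulated data (an assumption for illustration): exposure raises the true risk by 0.4."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        exposure = float(rng.random() < 0.5)
        covariate = rng.random()
        true_probability = 0.2 + 0.4 * exposure + 0.2 * covariate
        X.append([exposure, covariate])
        y.append(1 if rng.random() < true_probability else 0)
    return X, y


X, y = simulate(n=600, seed=0)
rd = risk_difference(X, y, exposure_index=0, k=40)
print(f"estimated risk difference for the exposure: {rd:.3f} (simulated truth: 0.4)")
```

No parametric model of the outcome is specified anywhere in the sketch; the only tuning choice is the neighborhood size `k`, which plays the role that terminal node size plays in a forest.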

Center for Information Technology
Li, Jing; Malley, James D; Andrew, Angeline S et al. (2016) Detecting gene-gene interactions using a permutation-based random forest method. BioData Min 9:14
Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit et al. (2016) r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 9:7
Salem, Ghadi H; Dennis, John U; Krynitsky, Jonathan et al. (2015) SCORHE: a novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks. Behav Res Methods 47:235-50
Holzinger, Emily Rose; Szymczak, Silke; Dasgupta, Abhijit et al. (2015) Variable selection method for the identification of epistatic models. Pac Symp Biocomput :195-206
Greenstein, Deanna; Kataria, Rachna; Gochman, Peter et al. (2014) Looking for childhood-onset schizophrenia: diagnostic algorithms for classifying children and adolescents with psychosis. J Child Adolesc Psychopharmacol 24:366-73
Chen, Zhen-Xia; Sturgill, David; Qu, Jiaxin et al. (2014) Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome Res 24:1209-23
Ishwaran, Hemant; Malley, James D (2014) Synthetic learning machines. BioData Min 7:28
Dasgupta, Abhijit; Szymczak, Silke; Moore, Jason H et al. (2014) Risk estimation using probability machines. BioData Min 7:2
Malley, James D; Moore, Jason H (2014) First complex, then simple. BioData Min 7:13
Kruppa, Jochen; Liu, Yufeng; Biau, Gérard et al. (2014) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J 56:534-63

Showing the most recent 10 out of 23 publications