This project studies statistical learning machines as applied to biomedical and clinical prediction, probability assignment, regression, and ranking problems. The algorithms involved include Random Forests, support vector machines, neural networks, and variations of the boosting algorithm. These are all recently developed techniques, originally constructed by the machine learning community, that are only now starting to see applications in biomedical problems. These methods were not designed through familiar parametric statistical reasoning; however, using the more advanced methods of nonparametric density estimation, they are known to be provably Bayes risk consistent. They are, therefore, well adapted to large data, especially whole-genome data.

More recently we have found methods to calculate risk and hazard using probability machines. These methods, now called Risk Machines, are entirely model-free and are provably valid using current techniques in mathematical statistics. No model or parametric input is required from the researcher. Personalized risks can be calculated for individuals relative to any possible predictor or environmental hazard, or any interaction between gene and environment. These methods solve the problems first posed in our earlier book, "Statistical Learning for Biological Data" (coauthors K Malley and S Pajevic; published 2011). The practical applications of these solutions, including Risk Machines, will appear in our next book, "Estimation of Risk and Probability: A Machine Learning Approach" (in preparation).

Consistent, valid estimation of genetic risks can be obtained using Risk Machines for such problems as childhood-onset schizophrenia. Also, new predictive features can be constructed from observed features to account for known problems in statistical genetics, such as recombination hot spots and linkage disequilibrium. These synthetic features can then be evaluated separately for risk estimation for the subject. This is another example of personalized medicine provided by statistical learning machines.
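The idea of model-free risk estimation described above can be illustrated with a minimal sketch. It is not the project's implementation. It assumes a simulated data set with one continuous predictor and one binary exposure, and it swaps in a simple k-nearest-neighbor learner (in place of a Random Forest) to stay dependency-free. All names (`knn_probability`, `risk_difference`, `simulate`) and the simulated risk function are hypothetical. The learner averages 0/1 outcomes among similar subjects to estimate an individual probability, and the exposure effect is estimated by predicting each subject's risk with the exposure switched on and off.

```python
import random

def knn_probability(train, x, exposure, k=50):
    """Estimate P(y = 1 | x, exposure) as the mean 0/1 outcome among the
    k training subjects with the same exposure whose x is closest."""
    stratum = [(abs(xi - x), yi) for xi, ei, yi in train if ei == exposure]
    stratum.sort(key=lambda pair: pair[0])
    nearest = stratum[:k]
    return sum(yi for _, yi in nearest) / len(nearest)

def risk_difference(train, k=50):
    """Average, over all subjects, of the estimated risk with exposure
    set to 1 minus the estimated risk with exposure set to 0."""
    diffs = [
        knn_probability(train, x, 1, k) - knn_probability(train, x, 0, k)
        for x, _, _ in train
    ]
    return sum(diffs) / len(diffs)

def simulate(n, seed=0):
    """Hypothetical data: the true risk is 0.2 + 0.3 * exposure + 0.2 * [x > 0.5].
    The learner is never told this functional form."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        e = rng.randint(0, 1)
        p = 0.2 + 0.3 * e + 0.2 * (x > 0.5)
        y = 1 if rng.random() < p else 0
        data.append((x, e, y))
    return data

train = simulate(2000)

# Personalized risk for an exposed subject with x = 0.8 (true value 0.7).
personal_risk = knn_probability(train, x=0.8, exposure=1)

# Population-averaged effect of the exposure (true value 0.3).
rd = risk_difference(train)

print(f"estimated personal risk: {personal_risk:.2f}")
print(f"estimated risk difference: {rd:.2f}")
```

No parametric model is specified at any point: the step in risk at x = 0.5 is recovered from the data alone. In practice, any consistent nonparametric regression learner could take the place of the nearest-neighbor average.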
Battogtokh, Bilguunzaya; Mojirsheibani, Majid; Malley, James (2017) The optimal crowd learning machine. BioData Min 10:16
Holzinger, Emily R; Szymczak, Silke; Malley, James et al. (2016) Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data. BMC Proc 10:147-152
Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit et al. (2016) r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 9:7
Li, Jing; Malley, James D; Andrew, Angeline S et al. (2016) Detecting gene-gene interactions using a permutation-based random forest method. BioData Min 9:14
Salem, Ghadi H; Dennis, John U; Krynitsky, Jonathan et al. (2015) SCORHE: a novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks. Behav Res Methods 47:235-50
Holzinger, Emily Rose; Szymczak, Silke; Dasgupta, Abhijit et al. (2015) Variable selection method for the identification of epistatic models. Pac Symp Biocomput :195-206
Dasgupta, Abhijit; Szymczak, Silke; Moore, Jason H et al. (2014) Risk estimation using probability machines. BioData Min 7:2
Kruppa, Jochen; Liu, Yufeng; Biau, Gérard et al. (2014) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J 56:534-63
Chen, Zhen-Xia; Sturgill, David; Qu, Jiaxin et al. (2014) Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome Res 24:1209-23
Greenstein, Deanna; Kataria, Rachna; Gochman, Peter et al. (2014) Looking for childhood-onset schizophrenia: diagnostic algorithms for classifying children and adolescents with psychosis. J Child Adolesc Psychopharmacol 24:366-73
Showing the most recent 10 out of 28 publications