This project studies statistical learning machines as applied to biomedical and clinical prediction, probability assignment, regression, and ranking problems. Emphasis is placed on fast and freely available methods such as random forests and nearest neighbors. Our machine learning schemes (full list given above) include: risk machines, regression collectives, feature importance profiles, and rapid interaction detection (over genes, networks, and pathways) using probability and risk machines. The methods are completely model-free. The statistical inference and clinical interpretation provided coincide with those of the methods standard among biomedical researchers, when the models underlying those methods are correct. These methods solve several problems first posed in our earlier book, "Statistical Learning for Biological Data" (coauthors K Malley and S Pajevic; published 2011). The practical applications of these solutions will appear in our next book, now underway: Statistical Learning Machines: Inference and Interpretation (with Hemant Ishwaran, University of Florida).

Inventions and methods developed in FY 2014:

1. Risk machines:
- machines as black boxes can now be interpreted without having to define any model or guess at interactions or correlations.
- uses the notion of multiple machines, each provably consistent on some subset of the data.
- paper accepted at BioData Mining, 1 March 2014.

2. Interaction detection and estimation:
- the researcher does not have to specify any functional form of the interaction or introduce any new feature to describe the interaction.
- completely model-free and provably optimal.
- described in the BioData Mining paper (above).

3. Misclassification uncertainty estimation:
- helps answer the question: what is the probability that some scheme is misclassifying a subject?
- gives an estimate of uncertainty for a declared subject classification, specific to each subject.

4. Feature importance using the method of recurrency (and see below):
- helps rigorously answer the question: what is the probability that a given feature is in the Top 10 list of most predictive features?
- method uses recurrency: a feature that repeatedly does well generates an estimate of its probability of being in the Top 10.
- is provably optimal using probability machine theory (see Background above).

5. Noise reduction using the method of recurrency:
- can be used to filter out noise features.
- any learning machine can be used.
- in a data set of 550,000 SNPs with 9 causal SNPs, Silke Szymczak (NHGRI and University of Kiel) recovered 7 of the causal SNPs, along with 14 false positives.
- methods developed in two papers to be submitted in the next few months.

6. Feature content for most predictive features: probability content:
- how do the most important features work? How does the black box work?
- use strongly predictive terminal nodes in a random forest and generate a histogram of the most probable splits.
- used for Drosophila transcription start site prediction: the probability machine had an error of around 10% or less and exactly recovered the known tetramers at the true start sites (several thousand of each; several tissue types; two species).
- paper accepted at Genome Research (May 2014).

7. Synthetic features:
- basic idea: introduce sets of features as single new ones, and then send them to a learning machine.
- is entirely model-free.
- paper in Genetic Epidemiology, February 2014.
- biologically defined networks can be treated as new features in a learning machine and these can be compared: which networks impact others?
- which are predictive?
- how do the synthetic features and networks interact?

8. Regression collectives:
- any collection of learning machines can be combined for probability or risk estimation.
- paper in review at Journal of Multivariate Analysis (August 2014).
- not a committee, ensemble, or averaging method.
- method is provably at least as good as the best in the collection, for any data set.
- there is no need to identify the best in the collection, and indeed the best may vary across data sets.

9. Synthetic machines and synthetic forests:
- basic idea: use multiple machines, each as a synthetic feature, for input to a random forest or another machine.
- paper in review at BioData Mining (August 2014).

10. New book started: Statistical Learning Machines: Inference and Interpretation (with H. Ishwaran):
- will describe all methods above, with worked examples.
- the core vision: moving from the black box of a machine to interpretable results for the lab and the clinic.
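The recurrency idea in items 4 and 5 can be illustrated with a small sketch: rerun an importance ranking on many resampled versions of the data and record how often each feature lands in the top k. The report does not specify the resampling scheme or the importance measure, so everything below is an assumption made for illustration: bootstrap resampling, absolute correlation with the outcome as a stand-in for a learning machine's importance score, simulated data, and the hypothetical names `abs_correlation` and `top_k_frequency`. It is not the authors' implementation and carries none of the optimality guarantees mentioned above.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation, used here only as a stand-in
    importance score (assumption; any learning machine's importance
    measure could be substituted)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def top_k_frequency(X, y, k, n_rounds, rng):
    """Fraction of bootstrap rounds in which each feature ranks in the top k."""
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rounds):
        # Bootstrap: draw n subjects with replacement (assumed scheme).
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        scores = [abs_correlation([X[i][j] for i in idx], yb) for j in range(p)]
        ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
        for j in ranked[:k]:
            counts[j] += 1
    return [c / n_rounds for c in counts]


# Simulated data: 200 subjects, 20 features; only features 0 and 1
# influence the binary outcome, the other 18 are pure noise.
rng = random.Random(0)
n, p = 200, 20
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1 if row[0] + row[1] + rng.gauss(0, 0.5) > 0 else 0 for row in X]

freq = top_k_frequency(X, y, k=3, n_rounds=100, rng=rng)
for j in sorted(range(p), key=lambda j: freq[j], reverse=True)[:5]:
    print(f"feature {j}: in top 3 in {freq[j]:.0%} of rounds")
```

Features that recur in the top k across most rounds are candidates for true signal, while features that appear only occasionally can be treated as noise and filtered out, which is the use described in item 5.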
Battogtokh, Bilguunzaya; Mojirsheibani, Majid; Malley, James (2017) The optimal crowd learning machine. BioData Min 10:16
Holzinger, Emily R; Szymczak, Silke; Malley, James et al. (2016) Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data. BMC Proc 10:147-152
Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit et al. (2016) r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 9:7
Li, Jing; Malley, James D; Andrew, Angeline S et al. (2016) Detecting gene-gene interactions using a permutation-based random forest method. BioData Min 9:14
Salem, Ghadi H; Dennis, John U; Krynitsky, Jonathan et al. (2015) SCORHE: a novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks. Behav Res Methods 47:235-50
Holzinger, Emily Rose; Szymczak, Silke; Dasgupta, Abhijit et al. (2015) Variable selection method for the identification of epistatic models. Pac Symp Biocomput :195-206
Pan, Qinxin; Hu, Ting; Malley, James D et al. (2014) A system-level pathway-phenotype association analysis using synthetic feature random forest. Genet Epidemiol 38:209-19
Malley, James D; Moore, Jason H (2014) First complex, then simple. BioData Min 7:13
Malley, James D; Malley, Karen G; Moore, Jason H (2014) O brave new world that has such machines in it. BioData Min 7:26
Dasgupta, Abhijit; Szymczak, Silke; Moore, Jason H et al. (2014) Risk estimation using probability machines. BioData Min 7:2
Showing the most recent 10 out of 28 publications