This project studies statistical learning machines as applied to biomedical and clinical prediction, probability assignment, regression, and ranking problems. Emphasis is given to fast, freely available methods such as random forests and nearest neighbors. Our machine learning schemes (full list given above) include: risk machines, regression collectives, feature importance profiles, and rapid interaction detection (over genes, networks, and pathways) using probability and risk machines. The methods are completely model-free. Statistical inference and clinical interpretation are provided that coincide with the approaches standard for biomedical researchers, when those standard models are correct. These methods solve several problems first posed in our earlier book, "Statistical Learning for Biological Data" (coauthors K. Malley and S. Pajevic; published 2011). The practical applications of these solutions will appear in our next book, now underway: Statistical Learning Machines: Inference and Interpretation (with Hemant Ishwaran, University of Florida).

Inventions and methods developed in FY 2014:

1. Risk machines:
- Machines as black boxes can now be interpreted without having to define any model or guess at interactions or correlations.
- Uses the notion of multiple machines, each provably consistent on some subset of the data.
- Paper accepted at BioData Mining, 1 March 2014.

2. Interaction detection and estimation:
- The researcher does not have to specify any functional form of the interaction or introduce any new feature to describe it.
- Completely model-free and provably optimal.
- Described in the BioData Mining paper above.

3. Misclassification uncertainty estimation:
- Helps answer the question: what is the probability that some scheme is misclassifying a subject?
- Provides an estimate of uncertainty for a declared classification, specific to each subject.

4. Feature importance using the method of recurrency (and see below):
- Helps rigorously answer the question: what is the probability that a given feature is in the Top 10 list of most predictive features?
- A feature that repeatedly does well generates an estimate of its probability of being in the Top 10.
- Provably optimal using probability machine theory (see Background above).

5. Noise reduction using the method of recurrence:
- Can be used to filter out noise features.
- Any learning machine can be used.
- Silke Szymczak (NHGRI, and University of Kiel) used 550,000 SNPs with 9 causals and recovered 7 of them, while including 14 false positives.
- Methods developed in two papers to be submitted in the next few months.

6. Feature content for most predictive features (probability content):
- How do the most important features work? How does the black box work?
- Uses strong-prediction terminal nodes in a random forest to generate a histogram of the most probable splits.
- Used for Drosophila transcription start site prediction: the probability machine had error around 10% or less and exactly recovered the known tetramers at the true start sites (several thousand of each; several tissue types; two species).
- Paper accepted at Genome Research (May 2014).

7. Synthetic features:
- Basic idea: introduce sets of features as single new ones, then send them to a learning machine.
- Entirely model-free.
- Paper in Genetic Epidemiology, February 2014.
- Biologically defined networks can be treated as new features in a learning machine, and these can be compared: which networks impact others? Which are predictive? How do the synthetic features and networks interact?

8. Regression collectives:
- Any collection of learning machines can be combined for probability or risk estimation.
- Paper in review at Journal of Multivariate Analysis (August 2014).
- Not a committee, ensemble, or averaging method.
- Provably at least as good as the best machine in the collection, for any data set.
- There is no need to identify the best machine in the collection, and indeed the best may vary across data sets.

9. Synthetic machines and synthetic forests:
- Basic idea: use multiple machines, each as a synthetic feature, as input to a random forest or another machine.
- Paper in review at BioData Mining (August 2014).

10. New book started: Statistical Learning Machines: Inference and Interpretation (with H. Ishwaran):
- Will describe all methods above, with worked examples.
- The core vision: moving from the black box of a machine to interpretable results for the lab and the clinic.
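To make the probability-machine idea concrete (items 1 and 3): any consistent nonparametric regression machine applied to a 0/1 outcome estimates P(Y = 1 | x) directly, with no model specified. A minimal sketch using nearest neighbors, one of the emphasized methods; the function name and toy data are illustrative, not drawn from the papers:

```python
# Sketch of a probability machine: regress a 0/1 label nonparametrically
# (here with k-nearest neighbors) to get a subject-specific probability.
# Names and data are illustrative assumptions, not the papers' code.

def knn_probability_machine(train_x, train_y, query, k=3):
    """Estimate P(Y = 1 | query) as the mean 0/1 outcome of the k
    nearest training points (squared-error regression on the label)."""
    # Sort training points by squared Euclidean distance to the query.
    by_dist = sorted(
        zip(train_x, train_y),
        key=lambda xy: sum((a - b) ** 2 for a, b in zip(xy[0], query)),
    )
    neighbors = by_dist[:k]
    return sum(y for _, y in neighbors) / k

# Toy data: risk of the event rises with both coordinates.
train_x = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (0.8, 0.9), (0.9, 1.0), (1.0, 0.8)]
train_y = [0, 0, 0, 1, 1, 1]

low_risk = knn_probability_machine(train_x, train_y, (0.05, 0.05))   # near the 0s
high_risk = knn_probability_machine(train_x, train_y, (0.95, 0.95))  # near the 1s
```

The same call pattern works for risk estimation: the machine returns a probability for each individual subject rather than a single classification.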
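The recurrency idea behind feature importance (item 4) and noise filtering (item 5) can be sketched as follows: refit a machine on many resamples and estimate a feature's probability of belonging to the Top-m list by how often it recurs there. The per-feature scorer below (absolute covariance with the outcome) is a stand-in for any learning machine's importance measure; all names and data are illustrative:

```python
import random

# Sketch of recurrency: count, over bootstrap resamples, how often each
# feature lands in the Top-m importance list. The scorer here is a
# simple covariance-based stand-in for a real machine's importance.

def top_m_recurrence(x_rows, y, m=2, n_resamples=200, seed=0):
    rng = random.Random(seed)
    n, p = len(x_rows), len(x_rows[0])
    counts = [0] * p
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap resample
        xs = [x_rows[i] for i in idx]
        ys = [y[i] for i in idx]
        ybar = sum(ys) / n
        scores = []
        for j in range(p):                           # score each feature
            xbar = sum(row[j] for row in xs) / n
            cov = sum((row[j] - xbar) * (yy - ybar)
                      for row, yy in zip(xs, ys)) / n
            scores.append(abs(cov))
        for j in sorted(range(p), key=lambda j: -scores[j])[:m]:
            counts[j] += 1
    return [c / n_resamples for c in counts]         # recurrence probabilities

# Toy data: feature 0 drives the outcome; features 1-3 are pure noise.
rng = random.Random(1)
x_rows = [[rng.random() for _ in range(4)] for _ in range(60)]
y = [1 if row[0] > 0.5 else 0 for row in x_rows]
probs = top_m_recurrence(x_rows, y, m=1)
```

A feature that repeatedly does well accumulates a recurrence probability near 1; noise features hover near 0, which is the basis for filtering them out.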
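One published route to the collective guarantees described in item 8 is consensus-based aggregation in the COBRA style (Biau, Fischer, Guedj, and Malley): the prediction at a query averages the outcomes of training points on which every machine roughly agrees with its own prediction at the query. No machine's output is averaged directly, and the best machine never has to be identified. This is a hedged sketch in that spirit, not the papers' exact construction; data and names are illustrative:

```python
# Sketch of a regression collective via consensus aggregation: a
# training point votes only if ALL machines place it within eps of
# their prediction at the query. Illustrative stand-in, not the
# published construction.

def collective_predict(machines, train_x, train_y, query, eps=0.5):
    keep = []
    for xi, yi in zip(train_x, train_y):
        # Keep a training point only when every machine's prediction at
        # xi is within eps of its prediction at the query.
        if all(abs(m(xi) - m(query)) <= eps for m in machines):
            keep.append(yi)
    if not keep:                        # fallback: global mean outcome
        return sum(train_y) / len(train_y)
    return sum(keep) / len(keep)

# Toy 1-D data: the outcome equals the input.
train_x = [float(i) for i in range(10)]
train_y = list(train_x)

def good(x):        # a consistent machine
    return x

def bad(x):         # a useless constant machine
    return 5.0

# The collective tracks the good machine even with the bad one present.
pred = collective_predict([good, bad], train_x, train_y, query=3.0)
```

Note that the bad machine never drags the collective toward 5.0: its consensus set includes everything, so the good machine's consensus set decides, which is the sense in which the collective is at least as good as its best member.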
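The synthetic-feature idea of items 7 and 9 can be sketched as a two-stage pipeline: each predefined set of raw features (for instance, a gene network) is collapsed to one new feature, here the leave-one-out 1-nearest-neighbor prediction computed from that set alone, and the resulting synthetic features are what a second-stage machine sees. The groupings, function names, and data are illustrative assumptions:

```python
# Sketch of synthetic features: one first-stage machine per feature
# group, whose leave-one-out predictions become single new features for
# a second-stage machine. Illustrative stand-in, not the papers' code.

def loo_1nn_feature(rows, y, cols):
    """Synthetic feature: for each subject, the outcome of its nearest
    OTHER subject, measured only on the columns in `cols`."""
    out = []
    for i, r in enumerate(rows):
        best_j, best_d = None, None
        for j, s in enumerate(rows):
            if j == i:
                continue
            d = sum((r[c] - s[c]) ** 2 for c in cols)
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        out.append(y[best_j])
    return out

# Toy data: four subjects, three raw features.
rows = [[0.0, 0.1, 9.0], [0.1, 0.0, 8.0], [1.0, 0.9, 0.2], [0.9, 1.0, 0.1]]
y = [0, 0, 1, 1]
groups = [[0, 1], [2]]   # e.g., two "networks" of raw features

# One synthetic feature per group; a second-stage machine (a random
# forest, say) would take these columns as its inputs.
synthetic = [loo_1nn_feature(rows, y, g) for g in groups]
```

Comparing the second-stage importance of the synthetic features is then one way to ask which networks are predictive and how they interact.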
Battogtokh, Bilguunzaya; Mojirsheibani, Majid; Malley, James (2017) The optimal crowd learning machine. BioData Min 10:16
Holzinger, Emily R; Szymczak, Silke; Malley, James et al. (2016) Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data. BMC Proc 10:147-152
Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit et al. (2016) r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 9:7
Li, Jing; Malley, James D; Andrew, Angeline S et al. (2016) Detecting gene-gene interactions using a permutation-based random forest method. BioData Min 9:14
Salem, Ghadi H; Dennis, John U; Krynitsky, Jonathan et al. (2015) SCORHE: a novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks. Behav Res Methods 47:235-50
Holzinger, Emily Rose; Szymczak, Silke; Dasgupta, Abhijit et al. (2015) Variable selection method for the identification of epistatic models. Pac Symp Biocomput :195-206
Malley, James D; Moore, Jason H (2014) First complex, then simple. BioData Min 7:13
Malley, James D; Malley, Karen G; Moore, Jason H (2014) O brave new world that has such machines in it. BioData Min 7:26
Dasgupta, Abhijit; Szymczak, Silke; Moore, Jason H et al. (2014) Risk estimation using probability machines. BioData Min 7:2
Kruppa, Jochen; Liu, Yufeng; Biau, Gérard et al. (2014) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J 56:534-63
Showing the most recent 10 out of 28 publications