This project studies statistical learning machines as applied to personalized biomedical and clinical probability estimates for outcomes, patient-specific risk estimation, synthetic features, noise detection, and feature selection. In more detail (an illustrative code sketch for each numbered method appears after the list):

1. Probability machines generate personalized probability predictions for multiple phenotypes and outcomes, such as tumor versus not tumor. These methods fully supersede simple classification methods, which generate only zero-or-one predictions. The distinction is this: a pure classification scheme will produce the same prediction for an 85% chance of tumor and a 58% chance of tumor, yet these two outcomes can be expected to lead to distinct and critical patient-level evaluations, prognoses, and treatment plans, specific to patient subgroups. A probability machine produces provably consistent probability estimates (85% or 58%) for each patient, and does so using any number or type of predictors, with no model specification required and with arbitrary correlation structure in the features. A probability machine is thus a significantly better use of the available information in the data. If a specific classical analysis model, such as a logistic regression, is assumed to be exactly correct for the data, then the probability machine will provide estimates that can fully support, question, or challenge the validity of the logistic regression. Moreover, the researcher is not required to specify any interaction terms: the probability machine is provably consistent in the absence of any user-input interaction terms or so-called confounders.

2. Risk machines are built from multiple probability machines and counterfactual detection engines. They provide provably consistent estimates of all manner of risk effect estimates: log odds, risk ratios, and risk differences. Most critically, they provide patient-specific risk estimates. They are entirely model-free, can use any number or type of predictors, and allow for arbitrary, unspecified correlation structure in the features. If a specific classical analysis model, such as a logistic regression, is known to be correct for the data at hand, then the risk machine will provide estimates that can fully support, question, or challenge its validity. That is, the risk machine can provide a fully model-free validation of a smaller parametric model, if correct, by generating risk effect sizes that agree with the logistic regression model parameters. As with any probability machine, no user-input interaction terms are required; indeed, risk machines can be used for interaction detection in the absence of any parametric model.

3. The introduction of synthetic features considerably expands the classical notion of features or predictors by allowing the researcher to assemble new sets of features or networks, and allowing a statistical learning machine to then process the data using both original and synthetic features. Typically, a small linear parametric model is invoked to remove the effects of confounders such as age, gender, or population stratification. Unless that model is known to be exactly correct, this treatment of confounders is certain to be in error. The use of synthetic features is a fully nonparametric alternative approach to this problem.

4. Crowd machines optimally combine the results of any number of learning machines in a model-free scheme. They also relieve the researcher of having to optimally set any learning machine's tuning parameters: the results of a learning machine analysis therefore become independent of required tuning choices, such as a support vector machine's kernel or the architectural details of a neural net. The crowd machine combines detection from any number of machines, specifically allowing one or another machine to be optimal for some subset of patients or some subset of features. The crowd machine is not a simple ensemble, committee, or voting scheme. It has been shown to be provably optimal as a statistical data analysis scheme, at least as good as the best machine in the collection, and it does not require naming a winner among the machines. Indeed, the search for such winners is easily shown to be suboptimal, for example when a machine is best for some portion of the data but not for other subsets of the data.

5. Probability machines can be used for feature selection using the new and validated notion of recurrency. No linear ranking of features is ever necessary; in fact, simple examples show that such linear rankings can be inconsistent and contradictory. Features that are only weakly predictive can be reliably detected using the method of recurrency. That is, the data may have no main effects, no single features that are critical for estimating the personalized probability of an outcome or the patient-specific risk effect sizes, yet multiple subsets of features, none strongly predictive, may jointly provide excellent probability and risk estimates. The method of recurrency locates these features in the data.

6. Similarly, the method of recurrency can be used to remove features that are clearly noise and that only obscure the truly predictive features in the data.

7. Probability and risk machines can jointly provide nonparametric detection of interacting features. Such detection, via entanglement maps, can be undertaken in a fully model-free environment. Simple examples show that interactions among features are often not recovered using the pairwise products of those features in any model. Entanglement mapping has immediate application to genome-wide interaction detection, even when no single genetic marker (any SNP, say) is by itself a predictive feature.
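Sketch for point 1: a minimal probability machine, assuming numpy and scikit-learn are available. Following the project's published approach (Kruppa et al. 2014), a regression random forest fit to a 0/1 outcome directly estimates the conditional probability of the outcome; the simulated data and all parameter settings below are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.normal(size=(n, p))
# Illustrative truth: outcome probability depends nonlinearly on two features.
true_prob = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] + X[:, 1] ** 2 - 1.0)))
y = rng.binomial(1, true_prob)  # observed 0/1 outcome (e.g., tumor / not tumor)

# Regression mode on the 0/1 labels estimates P(y = 1 | x) for each patient,
# rather than forcing a hard zero-or-one classification.
pm = RandomForestRegressor(n_estimators=500, min_samples_leaf=5, random_state=0)
pm.fit(X, y)
personalized_prob = pm.predict(X)  # one probability per patient, in [0, 1]
print(personalized_prob[:5])
```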
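Sketch for point 2: a minimal risk machine in the spirit of Dasgupta et al. (2014). The idea is to fit a probability machine, then predict each patient's outcome probability under the two counterfactual settings of a binary exposure and form patient-specific risk differences and risk ratios. The exposure column and data generation here are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, p = 2000, 8
X = rng.normal(size=(n, p))
exposure = rng.binomial(1, 0.5, size=n)       # binary exposure of interest
Xfull = np.column_stack([exposure, X])
true_prob = 1 / (1 + np.exp(-(-1.0 + 1.2 * exposure + 0.8 * X[:, 0])))
y = rng.binomial(1, true_prob)

pm = RandomForestRegressor(n_estimators=500, min_samples_leaf=5, random_state=1)
pm.fit(Xfull, y)

# Counterfactual predictions: set the exposure to 1, then to 0, for everyone.
X1, X0 = Xfull.copy(), Xfull.copy()
X1[:, 0], X0[:, 0] = 1, 0
p1, p0 = pm.predict(X1), pm.predict(X0)

risk_difference = p1 - p0                     # patient-specific risk difference
risk_ratio = p1 / np.clip(p0, 1e-12, None)    # patient-specific risk ratio
print(risk_difference[:5], risk_ratio[:5])
```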
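Sketch for point 3: one simple reading of a synthetic feature, namely aggregating a researcher-assembled set of raw predictors (for example, markers in one hypothesized pathway) into a new column, and letting the machine process both the originals and the aggregate. The pathway grouping and the use of a mean as the aggregator are illustrative assumptions, not the project's definition.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n, p = 1000, 12
X = rng.normal(size=(n, p))
true_prob = 1 / (1 + np.exp(-X[:, :4].sum(axis=1) / 2))
y = rng.binomial(1, true_prob)

pathway = [0, 1, 2, 3]                   # hypothesized feature network (assumed)
synthetic = X[:, pathway].mean(axis=1)   # one synthetic feature for the set
X_aug = np.column_stack([X, synthetic])  # original features plus the synthetic one

pm = RandomForestRegressor(n_estimators=300, random_state=5).fit(X_aug, y)
print(pm.feature_importances_[-1])       # importance assigned to the synthetic feature
```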
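Sketch for point 4: combining several machines into a single crowd predictor. The published optimal crowd machine (Battogtokh et al. 2017) is more refined than this; here, as a stand-in, convex weights over held-out predictions are learned by non-negative least squares, so that no single winner need be named. All model choices are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=15, random_state=2)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=2)

machines = [
    RandomForestRegressor(n_estimators=300, random_state=2).fit(X_tr, y_tr),
    LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    SVC(probability=True, random_state=2).fit(X_tr, y_tr),
]

def prob1(model, X):
    # Uniform interface: probability of class 1 from each machine.
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.predict(X)  # the regression forest already outputs probabilities

P = np.column_stack([prob1(m, X_val) for m in machines])
w, _ = nnls(P, y_val.astype(float))  # non-negative weights per machine
w = w / w.sum()                      # normalize to a convex combination
crowd_prob = P @ w                   # crowd prediction on the held-out set
print(w, np.mean((crowd_prob - y_val) ** 2))
```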
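Sketch for points 5 and 6: recurrency-style feature selection in the spirit of r2VIM (Szymczak et al. 2016). A forest is refit under several seeds; features whose importance recurrently clears a noise-derived threshold in every run are kept, and features that never clear it are discarded as noise, with no reliance on a single linear ranking. The threshold factor, run count, and noise scale below are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
n, p = 1500, 30
X = rng.normal(size=(n, p))
# Only a weak joint signal in features 0-2; the remaining 27 are pure noise.
true_prob = 1 / (1 + np.exp(-0.6 * (X[:, 0] + X[:, 1] + X[:, 2])))
y = rng.binomial(1, true_prob)

runs, factor = 5, 1.0
hits = np.zeros(p, dtype=int)
for seed in range(runs):
    rf = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y)
    imp = permutation_importance(rf, X, y, n_repeats=5,
                                 random_state=seed).importances_mean
    noise = abs(imp.min())        # most negative importance taken as the noise scale
    hits += imp > factor * noise  # count runs in which a feature clears the threshold

selected = np.where(hits == runs)[0]  # recurrent in every run: keep (point 5)
noise_like = np.where(hits == 0)[0]   # never above threshold: remove (point 6)
print(selected, noise_like)
```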
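Sketch for point 7: a simple model-free stand-in for entanglement detection, not the project's published procedure. For independent, additively acting features, the loss increases from permuting them one at a time are approximately additive; a clear departure of the joint-permutation loss increase from that sum flags the pair as entangled. Here the outcome depends only on a product of two features, so neither has a main effect on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n, p = 3000, 6
X = rng.normal(size=(n, p))
# Pure interaction: outcome driven by the product of features 0 and 1.
true_prob = 1 / (1 + np.exp(-2.0 * X[:, 0] * X[:, 1]))
y = rng.binomial(1, true_prob)

pm = RandomForestRegressor(n_estimators=400, random_state=4).fit(X, y)
base = np.mean((pm.predict(X) - y) ** 2)  # baseline squared-error loss

def loss_with_permuted(cols):
    # Loss after destroying the information in the given columns by permutation.
    Xp = X.copy()
    for c in cols:
        Xp[:, c] = rng.permutation(Xp[:, c])
    return np.mean((pm.predict(Xp) - y) ** 2)

d0 = loss_with_permuted([0]) - base
d1 = loss_with_permuted([1]) - base
d01 = loss_with_permuted([0, 1]) - base
# A score far from zero signals that the pair acts jointly (entanglement).
print(f"entanglement score for (0, 1): {d01 - (d0 + d1):.4f}")
```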
Battogtokh, Bilguunzaya; Mojirsheibani, Majid; Malley, James (2017) The optimal crowd learning machine. BioData Min 10:16
Holzinger, Emily R; Szymczak, Silke; Malley, James et al. (2016) Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data. BMC Proc 10:147-152
Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit et al. (2016) r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 9:7
Li, Jing; Malley, James D; Andrew, Angeline S et al. (2016) Detecting gene-gene interactions using a permutation-based random forest method. BioData Min 9:14
Salem, Ghadi H; Dennis, John U; Krynitsky, Jonathan et al. (2015) SCORHE: a novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks. Behav Res Methods 47:235-50
Holzinger, Emily Rose; Szymczak, Silke; Dasgupta, Abhijit et al. (2015) Variable selection method for the identification of epistatic models. Pac Symp Biocomput :195-206
Malley, James D; Moore, Jason H (2014) First complex, then simple. BioData Min 7:13
Malley, James D; Malley, Karen G; Moore, Jason H (2014) O brave new world that has such machines in it. BioData Min 7:26
Dasgupta, Abhijit; Szymczak, Silke; Moore, Jason H et al. (2014) Risk estimation using probability machines. BioData Min 7:2
Kruppa, Jochen; Liu, Yufeng; Biau, Gérard et al. (2014) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J 56:534-63
Showing the most recent 10 of 28 publications.