This project studies statistical learning machines as applied to personalized biomedical and clinical probability estimates for outcomes, patient-specific risk estimation, noise detection, and feature (variable) selection. During the year covered by this report, I invented the Optimal Crowd learning machine, which can accurately combine the results of any number of learning machines in an entirely model-free environment. This major accomplishment is described in this section along with some related inventions.

The Optimal Crowd machine unburdens the researcher from having to derive or assign tuning parameters for any machine in the family, making the results independent of such parameters and their estimation, for example the kernel of a support vector machine or the architectural details of a neural net. In addition, the Optimal Crowd combines detection over any number of machines, specifically allowing one machine or another to be optimal for some subset of patients and/or some subset of features. The method improves upon simple ensemble, committee, or voting schemes in that it has been proven to be optimal, that is, always at least as good as the best machine in the collection; no such optimality has been proven for these other aggregation schemes. The Optimal Crowd does not require separate statistical optimization, and in fact does not even require naming a winner among the collection of machines. Indeed, I showed that the search for such winners is suboptimal in general modeling and statistical analysis, for example when one machine is best for some portion of the data but not for other subsets. The Optimal Crowd is a statistically optimal data-division scheme that uses all the predictions made by all the machines in the family over all the given data. Through mathematically rigorous theoretical work and extensive validation on well-known data sets, I proved that the accuracy of the Optimal Crowd also does not depend on the number of machines in the family. Equally important, the Optimal Crowd requires no training data of its own: it reasons directly from the many predictions of the family of machines.

The Optimal Crowd can be applied to probability and risk estimation across any family of distinct estimation schemes, and to patient-specific predictions. The probability and risk predictions made by any set of learning machines are thus optimized across the entire family with no additional computational requirements.

The Optimal Crowd can also be used for feature/variable selection through the notion of recurrency, likewise invented during the year covered by this annual report. Briefly, recurrency estimates the probability that a feature is noise or predictive. No linear ranking of features, which simple examples show can be inconsistent and contradictory, is necessary. Recurrency can reliably detect even weakly predictive features: the data may have no main effects and no single feature critical for estimating the personalized probability of an outcome, yet contain multiple subsets of features, none strongly predictive on its own, that jointly provide excellent probability and risk estimates. The Optimal Crowd can work with the method of recurrency to remove features that are clearly noise and that only obscure the truly predictive features in the data.
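To make the combining idea concrete, here is a minimal sketch under a simplifying assumption: each member machine issues a 0/1 prediction for every sample. Samples are partitioned by the full vector of machine predictions, and the outcome probability is estimated within each cell, so every machine's prediction is used and no winner is ever named. The function names below are illustrative, not the published implementation.

    import numpy as np

    def combine_by_prediction_vector(machine_preds, outcomes):
        # machine_preds: (n_samples, n_machines) array of 0/1 predictions,
        # one column per learning machine in the family.
        # outcomes: (n_samples,) array of observed 0/1 outcomes.
        # Group samples by their full vector of machine predictions and
        # estimate Pr(outcome = 1) within each group -- all predictions
        # by all machines are used, and no single machine is selected.
        cells = {}
        for row, y in zip(machine_preds, outcomes):
            cells.setdefault(tuple(row), []).append(y)
        return {key: float(np.mean(ys)) for key, ys in cells.items()}

    def crowd_probability(cell_probs, pred_vector, default=0.5):
        # Probability estimate for a new sample, given the member machines'
        # predictions on it; unseen prediction vectors fall back to `default`.
        return cell_probs.get(tuple(pred_vector), default)

Note how this respects the property described above: when one machine is accurate on one subset of patients and a different machine on another, the prediction-vector cells separate those subsets automatically, with no tuning parameters introduced by the combiner itself.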
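Similarly, the recurrency idea can be sketched as repeated refitting on resampled data, recording how often each feature recurs among the most useful features; the resulting per-feature recurrence probability replaces any single linear ranking. The use of random forests as the base machine, and all names and thresholds below, are illustrative assumptions only.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def recurrency_scores(X, y, n_runs=100, top_frac=0.2, seed=0):
        # Fraction of bootstrap refits in which each feature lands in the
        # top `top_frac` of importances. Scores near the chance level
        # (about top_frac) behave like noise; features that recur far more
        # often are candidate signals, even if individually weak.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        k = max(1, int(top_frac * p))
        hits = np.zeros(p)
        for _ in range(n_runs):
            idx = rng.integers(0, n, size=n)  # bootstrap resample
            rf = RandomForestClassifier(n_estimators=200, random_state=0)
            rf.fit(X[idx], y[idx])
            top = np.argsort(rf.feature_importances_)[-k:]
            hits[top] += 1
        return hits / n_runs  # estimated recurrence probability per feature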
Finally, the Optimal Crowd can provide model-free, nonparametric detection of interacting features. Such detection, which I named entanglement maps, can now be undertaken in a fully model-free environment. Simple examples show that interactions among features are often not recovered by including the pairwise products of those features in any model. Entanglement mapping has immediate application to genome-wide interaction detection, even when no single genetic marker, a single SNP say, is by itself a strong predictor.
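As one assumption-laden illustration of what a model-free interaction probe can look like: compare the cross-validated predictive value of a feature pair, estimated with a nonparametric learner, against the better of the two features alone, so that no product term is ever specified. The learner, scoring metric, and names below are illustrative choices, not necessarily those of entanglement mapping.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def entanglement_score(X, y, i, j, cv=5):
        # Gain in cross-validated AUC from using features i and j jointly
        # over the better of the two alone. A nonparametric learner is used
        # throughout, so no pairwise product or other interaction term is
        # ever written into a model.
        def cv_auc(cols):
            rf = RandomForestClassifier(n_estimators=200, random_state=0)
            return cross_val_score(rf, X[:, cols], y, cv=cv,
                                   scoring='roc_auc').mean()
        return cv_auc([i, j]) - max(cv_auc([i]), cv_auc([j]))

A clearly positive score suggests joint signal beyond either feature alone, the situation described above for genome-wide data in which no single SNP is a strong predictor.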
Battogtokh, Bilguunzaya; Mojirsheibani, Majid; Malley, James (2017) The optimal crowd learning machine. BioData Min 10:16
Holzinger, Emily R; Szymczak, Silke; Malley, James et al. (2016) Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data. BMC Proc 10:147-152
Szymczak, Silke; Holzinger, Emily; Dasgupta, Abhijit et al. (2016) r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 9:7
Li, Jing; Malley, James D; Andrew, Angeline S et al. (2016) Detecting gene-gene interactions using a permutation-based random forest method. BioData Min 9:14
Salem, Ghadi H; Dennis, John U; Krynitsky, Jonathan et al. (2015) SCORHE: a novel and practical approach to video monitoring of laboratory mice housed in vivarium cage racks. Behav Res Methods 47:235-50
Holzinger, Emily Rose; Szymczak, Silke; Dasgupta, Abhijit et al. (2015) Variable selection method for the identification of epistatic models. Pac Symp Biocomput :195-206
Malley, James D; Moore, Jason H (2014) First complex, then simple. BioData Min 7:13
Malley, James D; Malley, Karen G; Moore, Jason H (2014) O brave new world that has such machines in it. BioData Min 7:26
Dasgupta, Abhijit; Szymczak, Silke; Moore, Jason H et al. (2014) Risk estimation using probability machines. BioData Min 7:2
Kruppa, Jochen; Liu, Yufeng; Biau, Gérard et al. (2014) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J 56:534-63