This project studies statistical learning machines as applied to biomedical and clinical prediction, probability assignment, regression, and ranking problems. The algorithms involved include Random Forests, support vector machines (SVMs), neural networks (NNs), and variations of the boosting algorithm (Freund and Schapire, 1995). These are all recently developed techniques, originally constructed by the machine learning community, that are only now starting to see applications in biomedical problems. Because the methods were not designed through statistical reasoning, nor originally applied to data collected by clinicians or biomedical researchers, they require modifications and enhancements appropriate to data from these sources. In particular, we address three problems: (1) greatly unbalanced data sets, where the researcher typically has only a handful of positive cases and a great many negative cases; (2) accurate estimation of prediction error rates, where the researcher typically has a relatively small data set that must serve for both model fitting and testing; and (3) interpretation of the means by which the prediction engine operates, together with the development of practical prognostic factors. These three problems are essential to the use of modern prediction engines such as SVMs, NNs, and boosting methods, but have been only lightly studied by the machine learning community.

We are preparing an invited review and tutorial article for the journal Statistics in Medicine, in an effort to introduce, explain, and promote these methods to the biostatistical community.

We have applied these statistical learning machine methods in four studies: (1) a data set of six-month functional outcomes in ischemic stroke (in collaboration with Dr. Andreas Ziegler, University of Luebeck, Germany); (2) Random Forests applied to data collected to develop prognostic factors in systemic lupus erythematosus (in collaboration with Michael Ward, MD, NIAMS/NIH); (3) Random Forests applied to case-control myositis data, to estimate the importance of specific HLA alleles and to explore possible data clustering and prediction; and (4) a support vector machine committee classification method for computer-aided polyp detection in CT colonography (in collaboration with Anna Jerebko, PhD, Siemens, Inc., and Ronald Summers, MD, NIH, CC, Dept. of Diagnostic Radiology). Research papers are in preparation or currently under journal review for all of these studies. Results include significant improvements in sensitivity and specificity using SVMs and boosting, compared with conventional logistic regression followed by parameter shrinkage.
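Problems (1) and (2) above can be made concrete with a small sketch. It is not the project's method: the function names `stratified_folds` and `sensitivity_specificity`, the toy data set of 5 positive and 95 negative cases, and the choice of 5 folds are all assumptions made for illustration. The sketch shows why overall accuracy is misleading on unbalanced data, and how stratified splitting keeps the few positive cases spread across the folds used for error estimation.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds, dealing each class out round-robin
    so that every fold receives a similar share of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical unbalanced data: 5 positive cases, 95 negative cases.
labels = [1] * 5 + [0] * 95

# A trivial rule that always predicts "negative" is right 95% of the time
# overall, yet it detects none of the positive cases.
always_negative = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_negative)) / len(labels)
sens, spec = sensitivity_specificity(labels, always_negative)

# Stratified 5-fold split: each test fold holds exactly one positive case.
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Reporting sensitivity and specificity separately, as in the results quoted above, exposes the failure that the accuracy figure hides. With an unstratified random split of so few positives, some test folds could contain no positive cases at all, leaving sensitivity undefined for those folds.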