This project studies statistical learning machines as applied to biomedical and clinical prediction, probability assignment, regression, and ranking problems. The algorithms involved include Random Forests, support vector machines, neural networks, and variations of the boosting algorithm. These are all recently developed techniques, originally constructed by the machine learning community, that are only now starting to see applications in biomedical problems. They were not designed through familiar parametric statistical reasoning; however, using the more advanced methods of nonparametric density estimation, they are known to be provably Bayes risk consistent. Hence, as the data set grows, their error rates converge to the best achievable (Bayes) error, so that they classify cases and subjects optimally in the limit, for example. When routinely applied to data collected by clinicians or biomedical researchers, these techniques require modifications and enhancements appropriate to data from such sources. In particular, we address (1) greatly unbalanced data sets, where the researcher typically has only a handful of positive cases and a great many negative cases; (2) accurate estimation of prediction error rates, where the researcher typically has a relatively small data set on which to do both model fitting and testing; and (3) interpretation of the means by which the prediction engine operates, together with the development of practical prognostic factors. These three problems are essential questions facing the use of modern prediction engines, but they have been only lightly studied by the machine learning community. On the other hand, rigorous analysis by the mathematical statistics community has demonstrated the unusual versatility and flexibility of these learning machines. We have applied these statistical learning machine schemes to a wide variety of biological data sets, such as a 1,000K SNP data set on childhood-onset schizophrenia.
At the invitation of Cambridge University Press, we are writing a textbook, "Statistical Learning for Biological Data"; completion of the text and publication are anticipated in 2009.