Predictive models that generate estimated probabilities of medical outcomes have become widely used in health services research, in health policy, and, increasingly, in the assessment of health care and in real-time decision support. Logistic regression models for medical events are central to most probabilistic predictive clinical decision aids and are fundamental to comparative analyses of medical care based on risk-adjusted events. In such applications, inaccurate assessment of patient risk can have significant health care and health policy implications. Newer computer-based modeling techniques, including generalized additive models, classification trees, and neural networks, may capture information that regression methods miss or misrepresent. However, these methods use very local information in model construction and may be overfit to the sample data, and thus may not transport well to new settings. In years 1-3, we investigated the relative accuracy of predictions made by these modeling methods under a variety of data structures, including the presence of outliers and missing data. For many of these data structures we found that the more "local" procedures frequently did not generalize to new test data as well as traditional regression methods. However, our results suggest that as sample size and data complexity increase, the performance of these procedures may substantially improve. Thus, to test these findings under more general conditions, we now propose two additional years of research to 1) rigorously assess the relative predictive performance and transportability of other new innovative modeling methods and of original hybrid model-construction methods; 2) systematically investigate the relative predictive performance and model transportability of modeling methods applied to large and complex data structures; and 3) explore and assess procedures for handling outliers and missing data in classification trees and neural networks.
Completion of the proposed work will result in the first systematic exploration of the factors affecting the predictive performance of the major modeling methods used to predict medical outcomes, and of the comparative performance of models constructed by these methods on the extremely large data sets that are becoming increasingly available to researchers.
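The contrast between global regression models and highly "local" procedures can be illustrated with a minimal sketch. This is synthetic data and from-scratch models, not the project's actual study code: a hand-rolled logistic regression stands in for the global method, and a 1-nearest-neighbour classifier stands in as an extreme "local" method. The local method fits the training sample perfectly but transports less well to new data drawn from the same population.

```python
# Illustrative sketch only (assumption: synthetic data, from-scratch models).
# A global logistic regression vs. an extreme "local" method (1-NN):
# the local method memorizes the training sample, so its apparent (in-sample)
# accuracy is perfect, while its out-of-sample accuracy is typically lower.
import math
import random

random.seed(0)

def make_data(n):
    """Two standard-normal features; outcome generated from a logistic model."""
    data = []
    for _ in range(n):
        x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
        p = 1 / (1 + math.exp(-(0.8 * x1 - 0.5 * x2)))
        data.append(((x1, x2), 1 if random.random() < p else 0))
    return data

def fit_logistic(train, epochs=200, lr=0.1):
    """Plain stochastic-gradient logistic regression (a global model)."""
    w = [0.0, 0.0, 0.0]  # intercept, coefficient on x1, coefficient on x2
    for _ in range(epochs):
        for (x1, x2), y in train:
            p = 1 / (1 + math.exp(-(w[0] + w[1] * x1 + w[2] * x2)))
            g = p - y  # gradient of the log-loss for this observation
            w[0] -= lr * g
            w[1] -= lr * g * x1
            w[2] -= lr * g * x2
    return w

def logistic_acc(w, data):
    """Classification accuracy of the fitted logistic model on `data`."""
    correct = 0
    for (x1, x2), y in data:
        p = 1 / (1 + math.exp(-(w[0] + w[1] * x1 + w[2] * x2)))
        correct += (p >= 0.5) == (y == 1)
    return correct / len(data)

def nn1_acc(train, data):
    """1-nearest-neighbour accuracy: predicts the label of the closest training point."""
    correct = 0
    for (x1, x2), y in data:
        nearest = min(train, key=lambda t: (t[0][0] - x1) ** 2 + (t[0][1] - x2) ** 2)
        correct += nearest[1] == y
    return correct / len(data)

train, test = make_data(200), make_data(200)
w = fit_logistic(train)
print("logistic train/test accuracy: %.2f / %.2f"
      % (logistic_acc(w, train), logistic_acc(w, test)))
print("1-NN     train/test accuracy: %.2f / %.2f"
      % (nn1_acc(train, train), nn1_acc(train, test)))
```

On the training sample, 1-NN scores 1.0 by construction (every point is its own nearest neighbour), so only the held-out comparison is informative; this is the distinction between apparent performance and the external validity that the proposed work evaluates.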
Terrin, Norma; Schmid, Christopher H; Griffith, John L et al. (2003) External validity of predictive models: a comparison of logistic regression, classification trees, and neural networks. J Clin Epidemiol 56:721-9.
Schmid, C H; D'Agostino, R B; Griffith, J L et al. (1997) A logistic regression model when some events precede treatment: the effect of thrombolytic therapy for acute myocardial infarction on the risk of cardiac arrest. J Clin Epidemiol 50:1219-29.