Early onset of alcohol use during adolescence is associated with increased probability of later alcohol dependence, polydrug abuse, victimization, conduct problems, psychiatric comorbidities, and delayed achievement of adult milestones. Methods that yield rapid, accurate, and reliable predictions of which children and teens are at risk for early onset can improve the targeting of prevention interventions and enable the concentration of resources on the most debilitating and costly cases. One promising and untapped approach to this prediction problem is machine learning (also called "statistical learning," "data mining," or "predictive modeling"), a class of techniques arising from statistics, computer science, and engineering that seeks to build data-driven predictive algorithms. These techniques are most noticeably distinguished from "traditional" statistical methods (e.g., ordinary least squares regression) by their extreme emphasis on prediction of future cases, rather than explanation of the current data, and thus they may offer dramatic advantages over traditional approaches to identifying which children and teens will develop early onset alcohol use. This proposal will explore the potential contribution of machine learning methods by directly comparing their predictive performance to that of the traditional approach in a large-scale, multisite longitudinal study of the development of early onset alcohol use (N = 731). If machine learning methods do significantly outperform the traditional approach, future directions might include the development and implementation of machine-learning-based screening methods for real-world use. On the other hand, if machine learning methods do not outperform the traditional approach, this will suggest that at least in the context of the present study (i.e., these predictors, timeline, and outcome), machine learning does not improve the prediction of early onset alcohol use.
Analyses will investigate whether the performance of machine learning methods varies with the type of predictor variables used, the age span covered, and the outcome to be predicted. Thus, the current proposal uses an extant longitudinal dataset to carry out two specific aims: (1) Train five different machine learning algorithms and one traditional algorithm (ordinary logistic regression) to predict subsequent early onset alcohol use in a subset (70%) of the data. (2) Test these six predictive algorithms on the remaining 30% of the data and directly compare their predictive performance in multiple contexts.
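The train-then-test design in the two aims can be illustrated with a minimal sketch, which is not the study's actual pipeline. It uses only the Python standard library and rests on several assumptions not stated in the proposal: the split is stratified by outcome, performance is summarized by the area under the ROC curve (AUC), and the helper names (`stratified_split`, `auc`, `compare_models`) and the two toy "models" are hypothetical placeholders for the six algorithms.

```python
import random


def stratified_split(labels, train_frac=0.70, seed=0):
    """Split indices into train/test sets, preserving the outcome rate in each."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


def auc(y_true, scores):
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC requires both outcome classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def compare_models(models, X, y, train_frac=0.70, seed=0):
    """Fit every model on the training split only; score each on the held-out split.

    `models` maps a name to a function fit(X_train, y_train) that returns
    a function predict(x) giving a risk score for one participant.
    """
    train_idx, test_idx = stratified_split(y, train_frac, seed)
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    results = {}
    for name, fit in models.items():
        predict = fit(X_train, y_train)
        results[name] = auc(y_test, [predict(x) for x in X_test])
    return results


# Hypothetical stand-ins for real algorithms (assumptions for illustration only).
def fit_base_rate(X_train, y_train):
    rate = sum(y_train) / len(y_train)
    return lambda x: rate  # same score for everyone


def fit_first_feature(X_train, y_train):
    return lambda x: x[0]  # uses one predictor directly as the risk score


# Synthetic data only: 731 rows to match the stated N; the 200/531 split is invented.
rng = random.Random(1)
y = [1] * 200 + [0] * 531
X = [[rng.gauss(1.0 if yi else 0.0, 1.0)] for yi in y]

results = compare_models(
    {"base_rate": fit_base_rate, "first_feature": fit_first_feature}, X, y
)
```

Because every model is scored on the same held-out 30%, differences in `results` reflect out-of-sample prediction, not fit to the training data. Repeating the comparison with different predictor sets, age spans, or outcomes would correspond to the "multiple contexts" named in the second aim.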
Prospectively predicting which children and teens are at risk for early onset alcohol use enables targeted implementation of preventive interventions. Machine learning is a promising yet untapped approach that may be well-suited to this task. This study investigates the potential of several machine learning algorithms to contribute to the rapid, accurate, and reliable identification of individuals at risk for early onset alcohol use.