The analysis of observational data in the behavioral, social, and economic sciences is commonly undertaken with statistical modeling. Over the past decade, another approach to data analysis, which some have called "algorithmic," has been evolving in applied mathematics, computer science, and statistics. In this approach there is usually no effort to construct a model of how the data were generated. Instead, the goal is to link a set of inputs to one or more outputs so that a clear objective function is optimized by a computer algorithm. This proposal focuses on "ensemble methods," an especially promising special case of algorithmic methods. The goal is to help foster more effective interactions between the developers of ensemble methods and empirical researchers in the behavioral, social, and economic sciences. The approach is to apply ensemble methods, using real data sets, to important social science data analysis problems. These include 1) evaluation procedures for complex computer simulations, 2) diagnostic procedures for conventional statistical models, 3) adjustments for confounding in observational studies, and 4) classification and prediction exercises when the response variable is highly skewed. In each case, the performance of the ensemble methods will be compared to the performance of conventional modeling using ten assessment criteria detailed in the proposal.

The proposed application of ensemble methods to real data sets should have several broad benefits for the behavioral, social, and economic sciences, as well as for the mathematical and statistical sciences. Powerful and rapidly developing data analysis tools, under the broad rubric of "data mining," will be applied to difficult data analysis problems. These exercises will illustrate the strengths and weaknesses of ensemble methods for certain kinds of demanding empirical research. This experience will help inform the behavioral, social, and economic sciences about when ensemble methods can be useful and what their limitations can be. Equally important, the applications will provide a "test bed" for a variety of ensemble methods, from which potential refinements of existing ensemble procedures, and new technical questions insufficiently addressed in the current literature, are likely to emerge. Thus, the mathematical and statistical sciences can benefit as well. Finally, each of the data sets to be used is potentially rich in substantive implications. Although the emphasis in this proposal is methodological, it is entirely possible that the data analyses to be undertaken will also be instructive from a subject-matter perspective.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0653802
Program Officer: Tomek Bartoszynski
Project Start:
Project End:
Budget Start: 2006-07-01
Budget End: 2008-11-30
Support Year:
Fiscal Year: 2006
Total Cost: $347,884
Indirect Cost:
Name: University of Pennsylvania
Department:
Type:
DUNS #:
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104