This research will develop methods for feature (variable) selection in the challenging domain of high-dimensional, mixed (numerical and categorical predictors and/or responses), dirty, nontraditional data from complex systems. Complex systems generate rich, wide data sets with dozens to hundreds, or even thousands, of variables, and interdisciplinary research teams often struggle to learn from such information. The proposed hybrid ensemble strategy for feature selection combines serial and parallel ensembles of decision trees. Latent variables need not be constructed. Instead, variable importance scores will be developed that account for masking effects and redundancy and that support statistically valid conclusions through artificially generated variables. The methods will confront the inherent data challenges as well as nonlinear models, interactions, and effects of different magnitudes and scales. The approach will take advantage of widely available computing capabilities to provide a modern, comprehensive solution to this problem.
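
The idea of judging variable importance against artificially generated variables can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the procedure this project proposes: the function names (best_split_gain, select_features, make_demo_data), the synthetic data, and the thresholds are hypothetical. Full decision trees are replaced by one-split stumps fit on bootstrap samples, and a formal significance test is replaced by a simple win-frequency rule. Because each stump looks at one variable at a time, the sketch does not address masking, redundancy, or interactions.

```python
import random


def best_split_gain(xs, ys):
    """Largest drop in squared error from one threshold split of ys on xs."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    best, left_sum = 0.0, 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal feature values
        right_sum = total - left_sum
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i) - total ** 2 / n
        best = max(best, gain)
    return best


def select_features(X, y, n_rounds=30, min_win_rate=0.9, seed=0):
    """Keep features that beat every artificial contrast in most rounds.

    Hypothetical rule: in each bootstrap round, a feature "wins" if its
    stump gain exceeds the largest gain among the shuffled copies.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    wins = [0] * p
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        yb = [y[i] for i in idx]
        real, contrast = [], []
        for col in columns:
            xb = [col[i] for i in idx]
            shuffled = xb[:]
            rng.shuffle(shuffled)  # same values, relation to y destroyed
            real.append(best_split_gain(xb, yb))
            contrast.append(best_split_gain(shuffled, yb))
        threshold = max(contrast)
        for j in range(p):
            if real[j] > threshold:
                wins[j] += 1
    return [j for j in range(p) if wins[j] / n_rounds >= min_win_rate]


def make_demo_data(n=300, p=5, seed=1):
    """Synthetic data: features 0 and 1 drive y, the rest are noise."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] + (2.0 if row[1] > 0.5 else 0.0) + rng.gauss(0.0, 0.3)
         for row in X]
    return X, y


X, y = make_demo_data()
selected = select_features(X, y)
print("selected feature indices:", selected)
```

Shuffling a column keeps its marginal distribution but removes any relationship to the response, so the shuffled copies show how large an importance score can get by chance alone. A real variable is kept only when it consistently exceeds that chance level.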

If successful, this research will provide a transformative solution for modeling complex systems, one that leverages widely available computing resources. The hundreds of different measurements taken from these systems create a bottleneck, both conceptually for understanding the system and technically for model performance, and this research will address that bottleneck. An interdisciplinary research team will be able to apply the methods to identify key features, redundant features, compact models, and so forth, with statistical validity. Examples come from the National Science Foundation areas of emphasis (earth systems, dynamics of coupled natural and human systems, materials use, and ecology of infectious diseases) as well as from complex manufacturing, supply chains, design optimization, and transportation. Rather than relying on limiting data assumptions, the methods will be developed to apply to the high-dimensional, dirty, redundant, missing, mixed data and the nonlinear, interactive models that such systems often require.

Project Start:
Project End:
Budget Start: 2007-09-01
Budget End: 2010-02-28
Support Year:
Fiscal Year: 2007
Total Cost: $99,999
Indirect Cost:
Name: Arizona State University
Department:
Type:
DUNS #:
City: Tempe
State: AZ
Country: United States
Zip Code: 85281