This research will develop methods for feature (variable) selection in high-dimensional, mixed (numerical and categorical predictors and/or responses), dirty, nontraditional data from complex systems. Complex systems generate rich, wide data sets with dozens to hundreds, or even thousands, of variables, and interdisciplinary research teams are often challenged to learn from such information. The proposed feature selection strategy is a hybrid that combines serial and parallel ensembles of decision trees. Latent variables need not be constructed. Instead, variable importance scores will be developed that account for masking effects and redundancy and that provide statistically valid conclusions through artificial, generated variables. The methods will confront the inherent data challenges as well as nonlinear models, interactions, and effects of different magnitudes and scales. The approach will leverage widely available computational capabilities to provide a modern, comprehensive treatment of this problem.
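The role of the artificial, generated variables can be illustrated with a minimal sketch. It is not the hybrid serial and parallel tree ensemble proposed here. As a simplifying assumption, it scores each numeric predictor with a single-split regression stump on bootstrap samples and counts how often the predictor beats the best of a set of permuted copies, which carry no real relationship to the response. The function names (`stump_gain`, `contrast_selection_rate`), the toy data, and the "beat the best permuted copy" rule are hypothetical choices made for illustration only. Masking, redundancy, and categorical variables are not handled.

```python
import random


def stump_gain(xs, ys):
    """Largest drop in squared error from one threshold split of ys on xs."""
    n = len(ys)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(ys)
    best, left_sum = 0.0, 0.0
    for k in range(1, n):
        left_sum += ys[order[k - 1]]
        if xs[order[k]] == xs[order[k - 1]]:
            continue  # no threshold can separate tied values
        right_sum = total - left_sum
        # Reduction in the sum of squared errors for a split after k points.
        gain = left_sum**2 / k + right_sum**2 / (n - k) - total**2 / n
        best = max(best, gain)
    return best


def contrast_selection_rate(X, y, n_rounds=30, seed=0):
    """Fraction of rounds in which each predictor beats every artificial one."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    wins = [0] * p
    for _ in range(n_rounds):
        # Artificial variables: a freshly permuted copy of each real column.
        shadows = [rng.sample(col, n) for col in cols]
        # Bootstrap sample of the rows (assumption: stands in for an ensemble).
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        real = [stump_gain([col[i] for i in idx], yb) for col in cols]
        fake = [stump_gain([s[i] for i in idx], yb) for s in shadows]
        bar = max(fake)  # importance level reachable by chance alone
        for j in range(p):
            if real[j] > bar:
                wins[j] += 1
    return [w / n_rounds for w in wins]


# Toy data (hypothetical): only the first of four predictors drives y.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3.0 * row[0] + data_rng.gauss(0, 0.5) for row in X]
scores = contrast_selection_rate(X, y)
print(scores)
```

A rate near 1 indicates a predictor whose importance consistently exceeds what permuted, uninformative variables achieve. A rate well below 1 suggests the predictor cannot be distinguished from noise under this simplified score.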
If successful, this research will provide a transformative solution for modeling complex systems, one that leverages widely available computing resources. Hundreds of different measurements from these systems create a bottleneck, both conceptually for understanding the system and technically for model performance, and this research will address that bottleneck. An interdisciplinary research team will be able to apply the methods to identify key features, redundant features, compact models, and so forth, with statistical validity. Examples come from National Science Foundation areas of emphasis (earth systems, dynamics of coupled natural and human systems, materials use, and ecology of infectious diseases) as well as from complex manufacturing, supply chains, design optimization, and transportation. Rather than relying on restrictive data assumptions, the methods will be developed to apply to the high-dimensional, dirty, redundant, missing, mixed data and the nonlinear, interactive models that such systems often require.