More than 2.5 quintillion bytes of data are created daily in the form of sensor measurements, web posts and clicks, surveillance videos, purchase transactions, and health-care records. However, not all collected data are informative, and not all features are relevant to the outcomes of interest. While several researchers have focused on compressive sampling for minimum-error data reconstruction to improve data storage and acquisition, the objective of this research is broader: salient feature discovery. The key insight is sparsity, namely, that the outcomes of interest are tightly coupled to a small relevant set of observations. This research therefore focuses on the sparse identification of the observations that are essential to predicting the outcomes.
The investigators will develop a new information-theoretic framework and algorithmic tools for understanding the intrinsic relationships and sparse interactions between outcomes of interest and a set of features or observations, in order to improve inference capabilities and enhance decision-making amid this data deluge. The goal is to discover the sparse subset of features most relevant to predicting the outcomes, and to uncover the fundamental limits of the associated sparse models. The approach is based on a unifying Shannon information-theoretic framework, whereby the problem of salient feature identification is mapped to a problem of capacity analysis for an equivalent channel model. This research addresses, in a unified way, the challenges posed by models with correlated features, models with missing features, models with latent variables, and non-linearities in the measurement processes. Furthermore, this research involves the development of data-driven sparse recovery algorithms that reinforce the value of information when the underlying statistical models are partially or completely unknown. The investigators will use the developed methods to enhance the detection of sparse mixtures of explosives with fluorescence sensor arrays, and to identify high-degree hubs in computer networks using network tomography.
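As a rough illustration of the information-theoretic view of salient feature identification, the sketch below scores each feature by a plug-in estimate of its mutual information with the outcome and keeps the k highest-scoring features. It is not the investigators' method: the function names `empirical_mutual_information` and `rank_features_by_mi`, the restriction to discrete features, and the use of per-feature (marginal) scores are all assumptions made for the example. Marginal scoring in particular can miss features that are informative only jointly.

```python
import numpy as np


def empirical_mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for two discrete 1-D arrays (hypothetical helper)."""
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    # Build the empirical joint distribution from co-occurrence counts.
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of X, shape (|X|, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal of Y, shape (1, |Y|)
    product = px @ py                      # distribution under independence
    mask = joint > 0                       # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))


def rank_features_by_mi(X, y, k):
    """Return the sorted indices of the k features with the largest marginal
    mutual information with y, together with all scores (illustrative only)."""
    scores = np.array(
        [empirical_mutual_information(X[:, j], y) for j in range(X.shape[1])]
    )
    top = np.argsort(scores)[::-1][:k]
    return sorted(top.tolist()), scores


if __name__ == "__main__":
    # Synthetic data (an assumption for the demo): 20 binary features,
    # of which only features 3 and 11 determine the outcome.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(2000, 20))
    y = X[:, 3] & X[:, 11]
    selected, scores = rank_features_by_mi(X, y, k=2)
    print("selected features:", selected)
```

In this synthetic setting the two relevant features each carry roughly 0.3 bits about the outcome, while the remaining features carry almost none, so a small relevant subset stands out clearly from the rest.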