Massive volumes of data are now generated in social, scientific, business and military applications. However, access to the data is limited due to computational, bandwidth, and power limitations, as well as human constraints such as attention and privacy. A fundamental problem is thus to obtain the most useful information at minimum cost. Most existing techniques are heuristics without guarantees, do not scale, or cannot cope with dynamic, distributed phenomena that change over time. In addition, existing algorithms typically disregard human aspects of the information gathering task.

This research (1) develops principled algorithms for optimized information gathering in distributed, uncertain, dynamic domains, (2) studies human aspects of optimized information gathering, considering attention, perception and privacy as fundamental constraints, and (3) pursues novel, real-world applications. To accomplish these goals, fundamentally new techniques in combinatorial optimization, probabilistic reasoning and decision theory are developed. The approaches are unified in the mathematical framework of submodular function optimization. In addition to developing theoretically well-founded, rigorous approaches, an important part of the research is evaluation on interdisciplinary, real-world applications in social computing, scientific data analysis and stochastic optimization in sustainability.

These applications of optimized information gathering have the potential to enable new classes of systems and services for science and society. Impact is achieved through collaboration with partners from industry (Microsoft Corporation) and governmental facilities (USGS and JPL), as well as by broad dissemination of the results through an integrated education plan, tutorials, shared data and open-source software, and interdisciplinary collaborations.

Project Report

Massive volumes of data are now generated from corporate and public sources every second, in social, scientific, commercial and military applications. In addition, an ever-growing number of sensor devices allows increasingly detailed observation of the physical world. However, access to data and sensing resources is limited due to cost, computational, bandwidth, and power limitations, as well as human constraints such as attention and privacy. One of the key problems is therefore: how can we obtain the most useful information at minimum cost? This problem has been extensively studied in a number of areas, but most existing techniques are either heuristics with no guarantees or do not scale to large problems. Our recent research has shown that many information gathering problems satisfy submodularity, an intuitive diminishing returns criterion: adding an observation helps more when few observations have been made so far, and less when many have already been made. Exploiting this criterion allowed us to develop algorithms that have strong performance guarantees and perform well in real applications. However, the algorithms existing prior to the project could not cope with dynamic phenomena that change over time, and were inherently centralized, precluding distributed implementation.

One of the main outcomes of the project was the discovery of a new structural property, adaptive submodularity, that overcomes several of the challenges faced by existing methods. Crucially, it enables dealing with settings where new information arrives over time and can be taken into account to make better decisions in the future. We developed new algorithms that exploit this property and evaluated them on several applications. For example, we tackled the problem of Bayesian experimental design with noisy observations, where one must adaptively select from a number of expensive experiments in order to identify an unknown hypothesis. In particular, we considered a Bayesian experimental design problem in behavioral economics, where the goal was to tease apart competing economic theories of how people make decisions under uncertainty. Our results indicate that our novel design criterion outperforms existing techniques in terms of accuracy and efficiency.

We also demonstrated how the algorithms apply to more general sequential decision problems under uncertainty. Concretely, in collaboration with researchers from the United States Geological Survey and the US Fish and Wildlife Service, we applied our techniques to the problem of protecting endangered species by recommending patches of land to be used for conservation purposes.

Furthermore, we pioneered several novel, principled techniques for large-scale unsupervised learning. For example, we developed a new algorithm for crowdsourcing the analysis of large data sets: multiple annotators provide groupings of small collections of data points (e.g., images) according to perceived similarity, and our algorithm reconciles these disparate groupings into a single, consistent clustering over the entire data set. We also developed novel approaches for summarizing massive data sets in a way that is provably sufficient to accurately carry out certain analysis tasks, such as fitting probabilistic mixture models. These techniques allow existing model fitting methods to scale to massive data sets.
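To make the diminishing returns intuition concrete: a set function F is submodular if F(A ∪ {e}) − F(A) ≥ F(B ∪ {e}) − F(B) whenever A is a subset of B, i.e., an element's marginal value can only shrink as the selected set grows. The following Python sketch, which uses a hypothetical sensor-coverage objective and is not part of the project's software, illustrates the classical greedy algorithm for maximizing a monotone submodular function under a cardinality constraint; for such functions the greedy solution is guaranteed to achieve at least a (1 − 1/e) fraction of the optimal value.

    # Illustrative sketch only (hypothetical objective, not project code):
    # greedy maximization of a monotone submodular set function under a
    # cardinality (budget) constraint.

    def coverage(selected, coverage_sets):
        """Hypothetical submodular objective: number of locations covered."""
        covered = set()
        for sensor in selected:
            covered |= coverage_sets[sensor]
        return len(covered)

    def greedy_select(candidates, objective, budget):
        """Repeatedly add the candidate with the largest marginal gain."""
        selected = []
        for _ in range(budget):
            current_value = objective(selected)
            best, best_gain = None, 0.0
            for c in candidates:
                if c in selected:
                    continue
                gain = objective(selected + [c]) - current_value
                if gain > best_gain:
                    best, best_gain = c, gain
            if best is None:  # no remaining candidate adds value
                break
            selected.append(best)
        return selected

    if __name__ == "__main__":
        # Hypothetical scenario: each candidate sensor covers a set of locations.
        coverage_sets = {
            "s1": {1, 2, 3},
            "s2": {3, 4},
            "s3": {5, 6},
            "s4": {1, 2, 3, 4},
        }
        chosen = greedy_select(list(coverage_sets),
                               lambda s: coverage(s, coverage_sets),
                               budget=2)
        print(chosen)  # ['s4', 's3'] covers all six locations

In the adaptive setting studied in this project, the same greedy template is applied as a policy: at each step the experiment or observation with the largest expected marginal gain, conditioned on the outcomes observed so far, is selected, and adaptive submodularity yields analogous near-optimality guarantees for that policy.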
As another outcome of the project, we developed new courses, including a new graduate course, Data Mining: Learning from Large Data Sets, which presents state-of-the-art approaches for coping with the data deluge.

Budget Start: 2010-05-01
Budget End: 2013-04-30
Fiscal Year: 2009
Total Cost: $331,511
Institution: California Institute of Technology
City: Pasadena
State: CA
Country: United States
Zip Code: 91125