We are undergoing a revolution in data. We have grown accustomed to constant upheaval in computing -- quicker processors, bigger storage, and faster networks -- but this century presents the new challenge of almost unlimited access to raw data. Whether from sensor networks, social computing, or high-throughput cell biology, we face a deluge of data about our world. Scientists, engineers, policymakers, and industrialists need to use these enormous floods of data to make better decisions. This research project is about providing foundations for tools to achieve these goals.

Simple models give only coarse understanding. The world is sophisticated and dynamic, providing rich information. Furthermore, representation of uncertainty is critical to discovering patterns in complex data. Not only are many natural processes intrinsically random, but our knowledge is always limited. The calculus of probability allows us to represent this uncertainty and design algorithms to act effectively in an unpredictable world.

The gold standard for probabilistic analysis is Markov chain Monte Carlo (MCMC), a way to identify hypotheses about the unobserved structure of the world that are consistent with observed data. It is a powerful and principled way to perform data analysis, but traditional MCMC methods do not map well onto modern computing environments. MCMC is a sequential procedure that cannot generally take advantage of the parallelism offered by multi-core desktops and laptops, cloud computing, and graphics processing units. This research will develop new methods for MCMC that are provably correct, but that take advantage of large-scale parallel computing.

This work has a variety of broader impacts. In addition to the core technical contributions, the project engages in deep scientific collaborations. New photovoltaic materials will lead to better solar cells and more sustainable energy production. New techniques for uncovering genetic regulatory mechanisms will lead to better understanding of disease. Quantitative models of mouse activity will give insight into the neural basis of behavior and provide a deeper understanding of brain disorders.

From a technical point of view, this work pursues two complementary approaches to large-scale Bayesian data analysis with MCMC: 1) a novel general-purpose framework for sharing information between parallel Markov chains for faster mixing, and 2) a new computational concept for speculative parallelization of individual Markov chains. These theoretical and practical explorations, combined with the release of associated open source software, will yield more robust and scalable probabilistic modeling. The project will develop provably correct foundations and efficient new algorithms for parallelization of Markov transition operators for posterior simulation. These operators will be used in three collaborations that are representative of the methodological demands for large-scale statistical inference: 1) predicting the efficiencies of novel organic photovoltaic materials, 2) discovering new genetic regulatory mechanisms, and 3) building quantitative neuroscientific models of mouse behavior. While this proposal focuses on the generalizable technical challenges of these problems, these collaborations provide compelling examples of how machine learning can be broadly transformative.
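
To make the sequential nature of MCMC concrete, the following is a minimal sketch of random-walk Metropolis, a textbook MCMC method, and not the algorithms this project proposes. The target density (a standard normal), the step size, and the function names `log_target`, `metropolis_chain`, and `run_independent_chains` are all illustrative assumptions. Each iteration depends on the state left by the previous one, which is why a single chain is hard to parallelize; running several independent chains is the simplest baseline, with no information shared between them.

```python
import math
import random


def log_target(x):
    # Unnormalized log density of a standard normal (illustrative assumption).
    return -0.5 * x * x


def metropolis_chain(n_steps, step_size, seed, x0=0.0):
    """Run one random-walk Metropolis chain and return its visited states."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        # Propose a move centered on the current state: this is the
        # sequential dependence, since step t needs the result of step t-1.
        proposal = x + rng.gauss(0.0, step_size)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)
    return samples


def run_independent_chains(n_chains, n_steps, step_size=1.0):
    # Baseline only: the chains never exchange information, so each one
    # could run on its own core, but none of them mixes any faster.
    return [metropolis_chain(n_steps, step_size, seed=s) for s in range(n_chains)]
```

A call such as `run_independent_chains(4, 5000)` returns four lists of samples that can be pooled after discarding an initial burn-in. The two approaches described above go beyond this baseline, by letting parallel chains share information and by speculatively parallelizing the steps of a single chain; neither is implemented in this sketch.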

Finally, the project includes a significant outreach component, engaging with local middle schoolers, and involving underrepresented minorities in summer research.

National Science Foundation (NSF)
Division of Information and Intelligent Systems (IIS)
Program Officer
Weng-keen Wong
Harvard University
United States