With tremendous advances in spatial referencing technologies, such as the Global Positioning System (GPS), which can identify geographic coordinates with a simple hand-held device, researchers in many disciplines have gathered an unprecedented variety of geocoded temporal data. Consequently, flexible statistical modeling of spatiotemporal data has become an enormously active area of research over the last decade in many disciplines, including the environmental sciences, health sciences, and oceanography, among others. In all these applications, researchers require efficient data modeling tools that can adapt to the complexity and size of modern spatiotemporal data, empowering them to quickly fit a variety of scientific models that explain the intricate nature of associations. This research project develops a new class of distributed Bayesian statistical algorithms, called Aggregated Monte Carlo (AMC), that enables efficient modeling of massive spatiotemporal data on an unprecedented scale. While the PIs' motivation comes primarily from complex modeling and uncertainty quantification of massive spatiotemporal data, the proposed algorithm is general enough to make important contributions to the related literatures of machine learning and computer experiments. The overarching goal also includes the development of software toolkits to better serve practitioners in related disciplines.
There has been an explosion in the size, complexity, and availability of spatiotemporally indexed data. This growth has outpaced the development of Bayesian statistical methodology: fitting state-of-the-art stochastic-process models for analyzing spatiotemporal point-referenced and point process data is prohibitively slow unless restrictive assumptions are imposed. The main problem is that the Monte Carlo (MC) computations in Markov chain Monte Carlo (MCMC) methods for fitting these models scale poorly with the size of the data. To solve this problem, the PIs develop a general divide-and-conquer framework, called Aggregated Monte Carlo (AMC), for scaling MC computations in the stochastic process-based modeling of massive space-time data. AMC has three stages: dividing the data into smaller subsets, obtaining posterior samples of the unknown parameters and latent variables on each subset using MCMC, and combining the MCMC samples from all the subsets. AMC can be tuned to boost the scalability of any state-of-the-art model based on a stochastic process. Computationally, the main innovations include the development of general division and combination schemes for data with diverse spatiotemporal structures. Theoretically, the project provides bounds on the number of subsets such that the posterior distribution estimated using AMC gives a near-optimal approximation of the full-data posterior distribution in terms of the decay of posterior risks and contraction rates. Conceptually, AMC extends existing results on combination via the barycenter of subset posterior distributions from parametric models to nonparametric models with complex spatiotemporal structures. The most appealing features of AMC are that it exploits parallel computer architectures for efficient and flexible modeling of massive spatiotemporal data and that it provides posterior inference and uncertainty estimates with theoretical guarantees.
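To make the three-stage structure concrete, the following is a minimal, illustrative Python sketch for a toy one-parameter normal model; it is not the project's AMC algorithm. The stochastic-approximation step (raising each subset likelihood to the power k so each subset posterior has roughly the spread of the full-data posterior) and the quantile-averaging combination (the one-dimensional Wasserstein-2 barycenter) are standard choices from the divide-and-conquer MCMC literature and stand in here, as assumptions, for the general division and combination schemes the project develops.

    # Toy divide-and-conquer posterior computation: data ~ N(theta, 1),
    # prior theta ~ N(0, 10^2). Illustrative only; not the AMC algorithm.
    import numpy as np

    rng = np.random.default_rng(0)

    def subset_metropolis(y, k, n_iter=5000, step=0.5):
        """Random-walk Metropolis targeting the subset likelihood raised to
        the power k times the prior (stochastic-approximation tempering)."""
        def log_post(t):
            loglik = -0.5 * np.sum((y - t) ** 2)   # N(theta, 1) likelihood
            logprior = -0.5 * t ** 2 / 100.0       # N(0, 10^2) prior
            return k * loglik + logprior           # tempered subset posterior
        theta, lp = y.mean(), log_post(y.mean())
        draws = np.empty(n_iter)
        for i in range(n_iter):
            prop = theta + step * rng.normal()
            lp_prop = log_post(prop)
            if np.log(rng.uniform()) < lp_prop - lp:
                theta, lp = prop, lp_prop
            draws[i] = theta
        return draws[n_iter // 2:]                 # discard burn-in

    # Stage 1: divide the data into k smaller subsets.
    y = rng.normal(loc=2.0, scale=1.0, size=10_000)
    k = 10
    subsets = np.array_split(rng.permutation(y), k)

    # Stage 2: sample each subset posterior by MCMC (embarrassingly
    # parallel in practice; a serial loop suffices for illustration).
    chains = [subset_metropolis(s, k) for s in subsets]

    # Stage 3: combine via the one-dimensional Wasserstein-2 barycenter,
    # which reduces to averaging the sorted samples across subsets.
    m = min(len(c) for c in chains)
    barycenter_draws = np.mean([np.sort(c[:m]) for c in chains], axis=0)

    print("combined posterior mean:", barycenter_draws.mean())
    print("combined posterior sd:  ", barycenter_draws.std())

In this one-dimensional case the barycenter combination is exact quantile averaging; for the multivariate, nonparametric settings targeted by the project, the combination step is substantially more involved, which is precisely where the proposed general combination schemes and their theoretical guarantees come in.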
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.