The ever-increasing use of data-intensive methods in scientific discovery has led to a paradigm shift in science in recent years. High-throughput scientific experiments, routine use of digital sensors, and intensive computer simulations have created a data deluge, challenging scientific communities to find effective and computationally feasible methods for processing and analyzing very large datasets. Despite many attempts, however, the necessary development of theoretical and computational foundations for big data analysis is lagging far behind. Many existing statistical methods cannot handle such data-intensive problems, either in their theoretical foundations or in their computational complexity and scalability. For analyzing high-dimensional data with possibly complex structures, this research will offer a set of fundamental solutions using principled statistical methods. The resulting methods will provide a robust framework for big data analysis and allow scientists to use statistical models beyond their current limited applicability. The techniques developed in this project are likely to gain widespread acceptance across a broad spectrum of scientific disciplines, as well as in industry.
The focus of this research is mainly on Bayesian statistics. Many recent methods aim to improve the computational efficiency of Bayesian models by approximating the likelihood function using a small subset of the data. In contrast, the objective of this research is to explore the underlying structures of probability models and exploit these features to design efficient and scalable computational methods and algorithms for Bayesian inference in big data analysis. To this end, (1) the PIs will define and study the structure of probability distributions in order to develop novel geometrically motivated methods for statistical inference; (2) the PIs will develop efficient and scalable computational methods that accurately approximate probability distributions by exploiting their geometric properties; and (3) the PIs will apply these methods to real, computationally intensive problems from the biological sciences. Due to its interdisciplinary nature, this research is expected to contribute to several fields, including statistics, machine learning, applied mathematics, and data-intensive computing.