The ever-increasing use of data-intensive methods in scientific discovery has led to a paradigm shift in science in recent years. High-throughput scientific experiments, routine use of digital sensors, and intensive computer simulations have created a data deluge, imposing new challenges on scientific communities to find effective and computationally feasible methods for processing and analyzing very large datasets. Despite many attempts, however, the necessary development of theoretical and computational foundations for big data analysis is lagging far behind. Many existing statistical methods cannot handle such data-intensive problems, either in their theoretical foundations or in their computational complexity and scalability. For analyzing high-dimensional data with possibly complex structures, this research will offer a set of fundamental solutions using principled statistical methods. The resulting methods will provide a robust framework for big data analysis and allow scientists to use statistical models beyond their current limited applicability. The techniques developed in this project are likely to gain widespread acceptance across a broad spectrum of scientific disciplines, as well as in industry.

The focus of this research is mainly on Bayesian statistics. Many recent methods aim to improve the computational efficiency of Bayesian models by approximating the likelihood function using a small subset of the data. In contrast, the objective of this research is to explore the underlying structures of probability models and exploit these features to design efficient and scalable computational methods and algorithms for Bayesian inference in big data analysis. To this end, the PIs will (1) define and study the structure of probability distributions in order to develop novel, geometrically motivated methods for statistical inference; (2) develop efficient and scalable computational methods that accurately approximate probability distributions by exploiting their geometric properties; and (3) apply these methods to real, computationally intensive problems from the biological sciences. Due to its interdisciplinary nature, this research is expected to contribute to several fields, including statistics, machine learning, applied mathematics, and data-intensive computing.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1622490
Program Officer: Christopher Stark
Budget Start: 2016-08-01
Budget End: 2019-07-31
Fiscal Year: 2016
Total Cost: $249,964
Name: University of California Irvine
City: Irvine
State: CA
Country: United States
Zip Code: 92697