This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).

This grant supports research on adapting and optimizing Markov Chain Monte Carlo (MCMC) methods to compute Bayesian models on large data sets resident on secondary storage, exploiting database systems techniques. The work seeks to optimize computations, preserve model accuracy, and accelerate sampling from large, high-dimensional data sets by exploiting different data set layouts and indexing data structures. The team will develop weighted sampling methods that produce models of quality comparable to traditional sampling methods, but that are much faster for large data sets that cannot fit in primary storage. One sub-goal will study how to compress a large data set while preserving its statistical properties for parametric Bayesian models, and then how to adapt existing methods to handle compressed data sets.
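
To make the compression idea concrete, the following is a minimal sketch, not the project's actual method, of summarizing a large one-dimensional data set into per-bin sufficient statistics (counts, sums, and sums of squares) and then computing a conjugate Bayesian update for a Gaussian mean directly from the compressed summary. The function names (compress, posterior_mean), the binning scheme, and the prior settings are hypothetical illustrations.

```python
import numpy as np

def compress(x, n_bins=64):
    """Summarize a 1-D data set as per-bin sufficient statistics:
    count n, sum s, and sum of squares q (hypothetical helper)."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    n = np.bincount(idx, minlength=n_bins).astype(float)
    s = np.bincount(idx, weights=x, minlength=n_bins)
    q = np.bincount(idx, weights=x * x, minlength=n_bins)
    keep = n > 0
    return n[keep], s[keep], q[keep]

def posterior_mean(n, s, mu0=0.0, tau0=100.0, sigma2=1.0):
    """Conjugate Normal posterior for the mean (known variance sigma2),
    computed only from the count and linear-sum statistics."""
    N, S = n.sum(), s.sum()
    prec = 1.0 / tau0 + N / sigma2
    return (mu0 / tau0 + S / sigma2) / prec, 1.0 / prec

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=1_000_000)          # stand-in for a large data set

n, s, q = compress(x)                             # small weighted summary of x
m_full, _ = posterior_mean(np.ones_like(x), x)    # posterior from the raw data
m_comp, _ = posterior_mean(n, s)                  # posterior from the summary
print(m_full, m_comp)                             # identical: the linear sum is preserved exactly
```

Because this posterior depends on the data only through the count and the linear sum, the compressed summary reproduces it exactly; the per-bin sums of squares are retained for models that also need second-order statistics.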

Intellectual Merit

This endeavor requires developing novel computational methods that work efficiently with large data sets and numerically intensive computations. The main technical difficulty is that accurate posterior samples cannot be obtained from subsamples of a large data set; the team will therefore focus on accelerating sampling from the posterior distribution based on the entire data set. This problem is unusually difficult because stochastic methods require many iterations (typically thousands) over the entire data set to converge. Moreover, if the data set is compressed, traditional methods must be generalized to use weighted points combined with higher-order statistics, beyond the well-known sufficient statistics of the Gaussian distribution. Developing optimizations that combine primary and secondary storage is quite different from optimizing an algorithm that works only in primary storage. The effort requires comprehensive statistical knowledge of both Bayesian models and stochastic methods, beyond traditional data mining methods, as well as a strong database systems background in optimizing computations with large disk-resident matrices. This research will enable faster solution of larger-scale problems than modern statistical packages for stochastic models, making Bayesian analysis and model management easier, faster, and more flexible.
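
As an illustration of sampling driven by weighted points and higher-order statistics, here is a minimal sketch, again hypothetical rather than the project's algorithm, of a Gibbs sampler for a Gaussian model with unknown mean and variance that reads only the weighted sufficient statistics (counts, sums, and sums of squares) produced by a compression step such as the one sketched earlier; the function name gibbs_weighted and the prior settings are assumptions for illustration.

```python
import numpy as np

def gibbs_weighted(w, s, q, iters=2000, mu0=0.0, tau0=100.0, a0=1.0, b0=1.0, seed=0):
    """Gibbs sampler for a Normal(mu, sigma2) model driven only by weighted
    sufficient statistics of compressed data: per-group counts w, sums s,
    and sums of squares q (hypothetical helper, illustrative priors)."""
    rng = np.random.default_rng(seed)
    N, S, Q = w.sum(), s.sum(), q.sum()
    mu, sigma2 = S / N, 1.0
    draws = np.empty((iters, 2))
    for t in range(iters):
        # mu | sigma2: conjugate Normal update from (N, S)
        prec = 1.0 / tau0 + N / sigma2
        mu = rng.normal((mu0 / tau0 + S / sigma2) / prec, np.sqrt(1.0 / prec))
        # sigma2 | mu: inverse-gamma update; the residual sum of squares
        # expands as Q - 2*mu*S + mu^2*N, so only (N, S, Q) are needed
        rss = Q - 2.0 * mu * S + mu * mu * N
        sigma2 = 1.0 / rng.gamma(a0 + N / 2.0, 1.0 / (b0 + rss / 2.0))
        draws[t] = mu, sigma2
    return draws
```

For example, draws = gibbs_weighted(n, s, q) run on the per-bin statistics from the earlier sketch yields posterior samples of (mu, sigma2) without re-scanning the raw data, which is the kind of saving weighted methods aim for on disk-resident data sets.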

Broader Impact

This research will be carried out in three application areas: cancer research, water pollution, and medical data sets of patients with cancer and heart disease. The educational component of the grant will enhance current teaching and research on data mining. In an advanced data mining course, students will apply stochastic methods to compute complex Bayesian models on hundreds of variables and millions of records. Data mining research projects will be enhanced with Bayesian models, promoting interaction between statistics and computer science.

Keywords: Bayesian model, stochastic method, database system

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant
Application #: 0914861
Program Officer: Maria Zemankova
Budget Start: 2009-08-01
Budget End: 2013-07-31
Fiscal Year: 2009
Total Cost: $339,023
Name: University of Houston
City: Houston
State: TX
Country: United States
Zip Code: 77204