Enormous amounts of data are now being generated in many areas. Direct application of existing statistical methods cannot meet the computational demands of on-line analytical processing (OLAP) on such massive data. Computer scientists have developed a data warehouse environment called the data cube, which reduces computational cost by compressing the subsets defined by a set of partitioning variables. Analysis of any subset can then potentially be carried out by aggregating the compressed data, and the computational cost is low because the raw data need not be accessed. For complicated analyses, it is challenging to find proper compression and aggregation schemes and to study the statistical properties of the aggregated analysis. Similar issues arise in another massive data environment, the data stream. Existing work in these areas either aims for lossless analysis, which has had very limited success and only for simple calculations, or provides no theoretical evaluation of the analysis obtained by aggregation. The purpose of the proposed research is to develop statistically sound compression and aggregation methods for advanced statistical analysis of data cubes and data streams, to use this compression-then-aggregation strategy to improve the computational efficiency of some statistical analyses, and to develop the associated asymptotic theory.
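As a rough illustration of the compression-then-aggregation idea in one of the simple cases where lossless analysis is possible, the sketch below fits a least-squares line to a subset of a cube using only per-cell summaries. It is not the method proposed in this project. The `CellSummary` class, the helper functions, the (region, year) partitioning variables, and the data values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellSummary:
    """Compressed form of one cube cell: the sums needed for a least-squares line."""
    n: int
    sx: float
    sy: float
    sxx: float
    sxy: float

    def __add__(self, other):
        # Aggregation step: summaries of disjoint cells simply add.
        return CellSummary(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )


def compress(rows):
    """Compression step: reduce the raw (x, y) rows of one cell to its summary."""
    return CellSummary(
        n=len(rows),
        sx=sum(x for x, _ in rows),
        sy=sum(y for _, y in rows),
        sxx=sum(x * x for x, _ in rows),
        sxy=sum(x * y for x, y in rows),
    )


def fit_line(s):
    """Return (intercept, slope) of the least-squares line from a summary."""
    denom = s.n * s.sxx - s.sx ** 2
    slope = (s.n * s.sxy - s.sx * s.sy) / denom
    intercept = (s.sy - slope * s.sx) / s.n
    return intercept, slope


# Hypothetical raw data, partitioned by the variables (region, year).
cells = {
    ("east", 2008): [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)],
    ("east", 2009): [(4.0, 8.1), (5.0, 9.8)],
    ("west", 2008): [(1.5, 3.0), (2.5, 5.2), (3.5, 6.9)],
}

# Build the cube once; the raw rows are not needed for later queries.
cube = {key: compress(rows) for key, rows in cells.items()}

# Query: fit the line for region "east" from the compressed cells only.
east = cube[("east", 2008)] + cube[("east", 2009)]
east_fit = fit_line(east)

# Reference fit on the pooled raw rows, for comparison only.
east_raw = cells[("east", 2008)] + cells[("east", 2009)]
east_fit_raw = fit_line(compress(east_raw))
```

Here the aggregated fit agrees with the raw-data fit up to floating-point rounding because the five sums are sufficient for the least-squares line. For more complicated analyses no such finite set of additive summaries is generally available, which is the difficulty the abstract describes.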

Massive data sets are now common, and many traditional statistical techniques become inapplicable because of their high computational cost. In this proposal, the investigator will extend current data cube techniques to support more complicated OLAP of massive data sets by studying the statistical properties of the desired analyses. This interdisciplinary project will result in significant contributions to data warehousing, OLAP technology, and statistical computing. It will have a substantial impact on important applications in large-scale medical studies, national and homeland security, stream data mining, high-performance computing, and information technology. The project's findings will be broadly disseminated to the academic community and industry through scholarly publications and conferences. The new findings will also be used as course materials in the education and training of information analysts and university students.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
0906023
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2009-07-15
Budget End
2012-06-30
Support Year
Fiscal Year
2009
Total Cost
$119,934
Indirect Cost
Name
Washington University
Department
Type
DUNS #
City
Saint Louis
State
MO
Country
United States
Zip Code
63130