In Divide and Recombine (D&R), the data analyst divides the data into subsets; these are the S computations because they create the subsets. Statistical and visualization methods are then applied to each subset without communication among the computations; these are the W computations because they operate within subsets. Finally, the outputs of the W computations are recombined across subsets; these are the B computations because they operate between subsets. One goal of D&R is deep analysis: the ability to study the data in detail despite their size and complexity. A second goal is the ability to carry out the analysis wholly from within an interactive language for data analysis (ILDA) such as R. D&R achieves these goals by introducing a simple parallelization, not of the analysis methods themselves, which would be very complex, but of the data. This results in "embarrassingly parallel" computations that can be carried out efficiently by a distributed computational environment such as Hadoop, and Hadoop can be merged with an ILDA. The investigators will research two areas of statistical theory and methods for D&R. The first is the development of D&R division and recombination procedures. This topic is very broad because there are many analysis methods, and the procedures must change with the methods and the data structures they address. The second topic is a foundational mathematical theory. In the current fundamental paradigm for statistics, an analysis method is applied directly to all of the data in one big computation. The S, W, and B computations also use all of the data, but the results are in general not the same as those of the direct computation and have different statistical properties. This introduces a new fundamental paradigm for statistical accuracy and optimality.
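
The S, W, and B computations map directly onto a few lines of R. What follows is a minimal sketch, not the investigators' implementation: the data d, the subset count r, and the choice of averaging per-subset coefficients of a linear model as the B recombination are all illustrative assumptions.

    ## S: divide the rows into r random subsets
    ## W: fit a linear model within each subset, with no communication
    ## B: recombine by averaging the per-subset coefficient vectors
    set.seed(1)
    n <- 1e5; r <- 50                                  # illustrative sizes
    d <- data.frame(x = rnorm(n))                      # hypothetical data
    d$y <- 2 + 3 * d$x + rnorm(n)
    s <- split(d, sample(rep(seq_len(r), length.out = n)))     # S: division
    w <- lapply(s, function(sub) coef(lm(y ~ x, data = sub)))  # W: within
    b <- Reduce(`+`, w) / r                                    # B: between
    b  # close to coef(lm(y ~ x, data = d)), but in general not identical

The last line illustrates the point made above: the S, W, and B computations use all of the data, yet the recombined estimate generally differs from the one-big-computation result and has different statistical properties.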

In Divide and Recombine (D&R), large complex data are divided into subsets. Statistical and visualization methods are applied to each subset separately, and the results are then recombined across subsets. This analysis framework for large complex data can readily exploit current distributed computational environments because it leads to very simple parallel computation. The investigators will develop division and recombination procedures that give the analysis methods good statistical accuracy. Accuracy tends to be less than that of direct computation on all of the data at once, but the direct computation is often impractically slow or simply infeasible; D&R trades some accuracy for computational feasibility. The result is that almost any statistical or visualization method can be applied successfully to large complex data. This enables deep, detailed analysis that does not risk losing important information in the data, a depth of analysis feasible today only with small data.
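
Because the within-subset computations require no communication, the W step parallelizes trivially. A minimal sketch using R's built-in parallel package, reusing the subsets s from the example above (the worker count is an arbitrary illustration):

    library(parallel)
    cl <- makeCluster(4)                               # illustrative workers
    w <- parLapply(cl, s, function(sub) coef(lm(y ~ x, data = sub)))
    stopCluster(cl)
    b <- Reduce(`+`, w) / length(w)                    # same B recombination

On a single machine the parallel package stands in for the distributed environment; at scale, a system such as Hadoop plays the same role for the embarrassingly parallel W computations.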

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1228348
Program Officer: Yong Zeng
Budget Start: 2012-09-01
Budget End: 2017-08-31
Fiscal Year: 2012
Total Cost: $314,995
Name: Purdue University
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907