The increasing volume and variety of data open new opportunities, but much of this data is not carefully curated, which leads to uncertainty. Data analysis techniques that accurately characterize this uncertainty are needed. This project develops principled approaches to managing uncertainty, particularly by clustering and subsetting data and then combining the results from analyses of the subsets. Dividing the data into smaller problems promises scalability to Big Data, while the ability to combine results in a theoretically sound manner manages the uncertainty inherent in large data collections.

The key idea is that the Wasserstein barycenter of subset posteriors can be used to approximate the posterior efficiently. The project extends the theoretical understanding of Wasserstein barycenters, enhancing the ability to model uncertainty. New mathematical tools are being developed to bound the accuracy of the approximations in terms of the problem's size and nature and the computational time. The algorithms are evaluated on a rich variety of massive data sets, ranging from large-scale networks to biomedical data sets that collect huge numbers of biomarkers. In addition, the project provides interdisciplinary training in big data analytics to young talent, to improve the competitiveness of the workforce and increase the cohort of data science researchers.
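As a rough illustration of combining subset posteriors through a Wasserstein barycenter, the sketch below uses a toy one-dimensional problem: a Gaussian mean with known standard deviation and a flat prior. None of the specifics come from the project itself. The function names are hypothetical, raising each subset likelihood to the power of the number of subsets is one assumed way to calibrate the subset posteriors, and the barycenter step relies on the standard fact that in one dimension the 2-Wasserstein barycenter of equally weighted empirical measures with the same number of atoms is obtained by averaging their sorted samples (their quantile functions).

```python
import numpy as np


def subset_posterior_samples(y, n_subsets, sigma, n_draws, rng):
    """Draw from a subset posterior for a Gaussian mean (hypothetical helper).

    Assumes a flat prior and known sigma. The subset likelihood is raised to
    the power n_subsets (an assumption of this sketch), which gives the
    posterior N(mean(y), sigma^2 / (n_subsets * len(y))).
    """
    post_mean = y.mean()
    post_sd = sigma / np.sqrt(n_subsets * len(y))
    return rng.normal(post_mean, post_sd, size=n_draws)


def barycenter_1d(sample_sets):
    """2-Wasserstein barycenter of equally weighted 1D empirical measures.

    Every sample set must have the same number of atoms; the barycenter's
    atoms are the averages of the sorted samples.
    """
    sorted_sets = np.sort(np.asarray(sample_sets), axis=1)
    return sorted_sets.mean(axis=0)


rng = np.random.default_rng(0)
sigma, n_obs, n_subsets, n_draws = 1.0, 10_000, 10, 2_000

# Simulated data, shuffled and split into equally sized subsets.
data = rng.normal(2.0, sigma, size=n_obs)
subsets = np.array_split(rng.permutation(data), n_subsets)

# Sample each subset posterior independently (this step parallelizes).
draws = [
    subset_posterior_samples(s, n_subsets, sigma, n_draws, rng)
    for s in subsets
]

# Combine the subset posteriors into a single approximate posterior.
bary = barycenter_1d(draws)
```

In this conjugate toy case the full-data posterior is known in closed form (mean `data.mean()`, standard deviation `sigma / sqrt(n_obs)`), so the atoms in `bary` can be checked against it directly. In higher dimensions quantile averaging no longer applies and the barycenter has to be computed numerically.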

Project Start
Project End
Budget Start
2015-11-01
Budget End
2020-10-31
Support Year
Fiscal Year
2015
Total Cost
$985,882
Indirect Cost
Name
Duke University
Department
Type
DUNS #
City
Durham
State
NC
Country
United States
Zip Code
27705