The increasing volume and variety of data open new opportunities, but much of these data are not carefully curated, introducing uncertainty. Data analysis techniques are therefore needed that characterize this uncertainty accurately. This project develops principled approaches to managing uncertainty by clustering and subsetting data, analyzing the subsets separately, and then combining the results. Dividing the data into smaller problems promises scalability to Big Data, while the ability to combine results in a theoretically sound manner controls the uncertainty inherent in large data collections.
The key idea is that the Wasserstein barycenter of subset posteriors can be used to approximate the full-data posterior efficiently. The project extends the theoretical understanding of Wasserstein barycenters, enhancing the ability to model uncertainty. New mathematical tools are being developed to bound the accuracy of these approximations in terms of the size and nature of the problem and the computational time available. The algorithms are evaluated on a rich variety of massive data sets, ranging from large-scale networks to biomedical data sets containing huge numbers of biomarkers. In addition, the project provides interdisciplinary training in big data analytics to young talent, improving the competitiveness of the workforce and increasing the cohort of data science researchers.
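To make the divide-and-combine idea concrete, the following is a minimal sketch of the approach for a toy one-dimensional problem: a normal mean with known noise and a flat prior. It assumes the standard facts that, in one dimension, the Wasserstein-2 barycenter is obtained by averaging quantile functions, and that subset posteriors in this literature are often computed with the likelihood raised to the power of the number of subsets so each has full-data scale. All variable names and the specific model are illustrative assumptions, not the project's actual algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: N observations from a normal with unknown mean.
sigma = 2.0                      # known observation noise (assumption)
N, K = 10_000, 10                # total sample size, number of subsets
data = rng.normal(1.5, sigma, size=N)
shards = np.array_split(data, K)

# Subset posteriors for the mean under a flat prior, with the likelihood
# raised to the power K (a common "stochastic approximation" adjustment
# so each subset posterior has full-data scale). For this conjugate toy
# model each subset posterior is normal and can be sampled directly.
S = 5_000                        # posterior draws per subset
subset_draws = []
for shard in shards:
    post_sd = sigma / np.sqrt(K * len(shard))
    subset_draws.append(rng.normal(shard.mean(), post_sd, size=S))

# In one dimension, the Wasserstein-2 barycenter is obtained by
# averaging the subset posteriors' quantile functions pointwise.
grid = (np.arange(S) + 0.5) / S
quantiles = np.stack([np.quantile(d, grid) for d in subset_draws])
barycenter_draws = quantiles.mean(axis=0)   # draws from the barycenter

# The barycenter should approximate the full-data posterior N(x_bar, sigma^2/N).
print(barycenter_draws.mean(), barycenter_draws.std())
```

Each subset is processed independently (and could run on a separate machine); only the S quantile values per subset need to be communicated, which is what makes the combination step cheap relative to fitting the full data set at once.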