BIGDATA: Mid-Scale: DA: Distribution-based machine learning for high dimensional datasets

Singh, Aarti; Verstynen, Timothy; Poczos, Barnabas

Abstract

Many applications call for representation and analysis of 'distributional' data sets, where each data point is a collection of samples from a high-dimensional distribution (as opposed to valuations of a typically vector-valued random variable). In this setting, each data point can be modeled by a collection of distributions, one for each measured attribute. A concrete example of distributional data arises in the context of brain connectivity mapping. The human brain contains around a hundred billion neurons with several hundred trillion physical connections. Neuroimaging approaches, like Diffusion Spectrum Imaging, attempt to visualize the underlying anatomical architecture of neural pathways by creating 3D probability distributions of water diffusion along nerve fiber bundles, called orientation distribution functions.

The project aims to develop new statistical and algorithmic approaches to natural generalizations of a class of standard machine learning problems, in which multi-dimensional vector-valued data points are replaced by distributions. These include techniques for measuring distances and inner products between distributional data points; estimating variants of entropy, mutual information, and conditional mutual information; clustering distributional data; constructing low-dimensional embeddings of distributional data; and learning classifiers and function approximators from distributional data. The resulting methods will be evaluated on large diffusion scan imaging data sets (where the data point for each patient consists of 500,000 distributions).
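As a rough illustration of what an inner product and a distance between two distributional data points can look like, the sketch below uses a kernel mean embedding with a Gaussian kernel and the resulting maximum mean discrepancy (MMD). This is one standard technique chosen for illustration, not necessarily the approach developed in the project. The function names, the bandwidth value, and the synthetic Gaussian samples are all assumptions.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def embedding_inner_product(sample_p, sample_q, bandwidth=1.0):
    """Plug-in estimate of the inner product between two distributions:
    the mean kernel value over all cross pairs of samples."""
    total = sum(gaussian_kernel(x, y, bandwidth)
                for x in sample_p for y in sample_q)
    return total / (len(sample_p) * len(sample_q))


def mmd_squared(sample_p, sample_q, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD distance,
    computed as <P,P> + <Q,Q> - 2<P,Q>."""
    return (embedding_inner_product(sample_p, sample_p, bandwidth)
            + embedding_inner_product(sample_q, sample_q, bandwidth)
            - 2.0 * embedding_inner_product(sample_p, sample_q, bandwidth))


def draw_sample(rng, mean, n, dim=3):
    """Hypothetical data point: n draws from an isotropic unit-variance Gaussian."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
point_a = draw_sample(rng, 0.0, 100)
point_b = draw_sample(rng, 0.0, 100)  # same underlying distribution as point_a
point_c = draw_sample(rng, 2.0, 100)  # distribution with a shifted mean

# The estimate for (a, b) should be much smaller than the one for (a, c).
print(mmd_squared(point_a, point_b))
print(mmd_squared(point_a, point_c))
```

Each data point here is a list of sample vectors rather than a single vector, which mirrors the distributional setting described above. The all-pairs computation is quadratic in the number of samples, so a data set with hundreds of thousands of distributions per patient would need a more scalable estimator.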

The novel machine learning approaches for descriptive and predictive modeling of distributional data resulting from this project are expected to benefit other scientific fields where data points can be naturally modeled by sets of distributions, which is a common situation in physics, psychology, economics, epidemiology, medicine, and social network analysis. New distributional data sets to be obtained at CMU, augmenting the data available from NTU, are likely to allow other research groups to engage in research on big data analytics from distributional data. Release of open source software, video tutorials, and research training of graduate students contribute to the broader impacts of the project. Additional information about the project can be found at: www.autonlab.org/autonweb/20928.html.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1247658
Program Officer: Sylvia Spengler

Project Start
Project End
Budget Start: 2013-01-01
Budget End: 2016-12-31
Support Year
Fiscal Year: 2012
Total Cost: $1,000,000
Indirect Cost

Institution

Carnegie-Mellon University, Pittsburgh, PA, United States
