Many applications call for the representation and analysis of 'distributional' data sets, where each data point is a collection of samples from a high-dimensional distribution (as opposed to a single realization of a typically vector-valued random variable). In this setting, each data point can be modeled by a collection of distributions, one for each measured attribute. A concrete example of distributional data arises in the context of brain connectivity mapping. The human brain contains around a hundred billion neurons with several hundred trillion physical connections. Neuroimaging approaches such as Diffusion Spectrum Imaging attempt to visualize the underlying anatomical architecture of neural pathways by creating 3D probability distributions of water diffusion along nerve fiber bundles, called orientation distribution functions.

The project aims to develop new statistical and algorithmic approaches to natural generalizations of a class of standard machine learning problems, in which multi-dimensional vector-valued data points are replaced by distributions. These include techniques for measuring distances and inner products between distributional data points; estimating variants of entropy, mutual information, and conditional mutual information; clustering distributional data; constructing low-dimensional embeddings of distributional data; and learning classifiers and function approximators from distributional data. The resulting methods will be evaluated on large diffusion scan imaging data sets (where the data point for each patient consists of 500,000 distributions).
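To make the idea of a distance or inner product between distributional data points concrete, the following is a minimal sketch and not the project's own method. It uses a well-known stand-in technique, the kernel maximum mean discrepancy (MMD), in which the inner product between two distributions is estimated as the average kernel value over all pairs of samples. The function names, the Gaussian kernel, the bandwidth, and the synthetic 3-dimensional samples are all assumptions made for illustration.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_embedding_inner_product(xs, ys, bandwidth=1.0):
    """Plug-in estimate of the inner product between two distributions:
    the average kernel value over all pairs of samples."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two sample sets."""
    return (mean_embedding_inner_product(xs, xs, bandwidth)
            + mean_embedding_inner_product(ys, ys, bandwidth)
            - 2.0 * mean_embedding_inner_product(xs, ys, bandwidth))


def sample_gaussian(n, mean, dim=3, seed=0):
    """Hypothetical data generator: n samples from an isotropic Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


# Each "data point" is itself a set of samples from some distribution.
point_a = sample_gaussian(100, 0.0, seed=1)
point_b = sample_gaussian(100, 0.0, seed=2)  # same distribution as point_a
point_c = sample_gaussian(100, 2.0, seed=3)  # shifted distribution

same = mmd_squared(point_a, point_b)
different = mmd_squared(point_a, point_c)
print(f"MMD^2 (same distribution):    {same:.4f}")
print(f"MMD^2 (shifted distribution): {different:.4f}")
```

In this sketch, two sample sets drawn from the same distribution should yield a much smaller value than two sets drawn from shifted distributions. A pairwise matrix of such values could then feed standard clustering or embedding routines, though whether the project takes that route is not stated in the abstract.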

The novel machine learning approaches for descriptive and predictive modeling of distributional data resulting from this project are expected to benefit other scientific fields where data points can be naturally modeled by sets of distributions, which is a common situation in physics, psychology, economics, epidemiology, medicine, and social network analysis. New distributional data sets to be obtained at CMU to augment the data available from NTU are likely to allow other research groups to engage in research on big data analytics from distributional data. The release of open-source software and video tutorials, and the research training of graduate students, contribute to the broader impacts of the project. Additional information about the project can be found at: www.autonlab.org/autonweb/20928.html.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1247658
Program Officer: Sylvia Spengler
Budget Start: 2013-01-01
Budget End: 2016-12-31
Fiscal Year: 2012
Total Cost: $1,000,000
Name: Carnegie-Mellon University
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213