Machine learning techniques for high-dimensional inference are becoming progressively more important as many sources of abundant data, ranging from MRI, medical imaging and biological data to sensor networks and more traditional speech and computer vision data, become available and require automated processing. This project will address theoretical and algorithmic issues surrounding manifold and geometric methods for high-dimensional inference.

Intellectual Merit: Three of the fundamental challenges for modern machine learning can be summarized as follows: (1) high dimensionality of the data; (2) complex nonlinear structures in the data; (3) a large amount of the data obtained from modern data sources is unlabeled. A promising line of research, known as Manifold Learning, has emerged in recent years as a way to use certain geometric ideas to construct compact low-dimensional representations of the data and to use unlabeled data for learning. These algorithms have now been used successfully in a variety of applications, from motion segmentation to Markov decision processes. However, our theoretical understanding of these methods is still in its infancy. The main focus of this project is to develop a theoretical framework for the analysis of algorithms that utilize the geometry of high-dimensional data. Such a framework will bring together techniques from computer science, statistics and mathematics to gain insight into the properties of real-world data. This framework will provide guidelines for designing better algorithms for existing problems as well as for extending existing methods to new domains, such as the analysis of complex output spaces and time-dependent data. The PI also plans to investigate the usefulness of these ideas in Computer Vision.

Broader impacts: This project aims to build a theoretical foundation for a new class of inference algorithms, to design new algorithms for high-dimensional inference, and to consider their applications. A rigorous theoretical understanding of unlabeled data and its use in learning tasks is likely to have a significant impact on algorithm design and on the application of machine learning techniques in practice. This project will provide research and education opportunities for graduate and undergraduate students. It will also acquaint researchers from other areas and from industry with recent developments, and encourage collaborations through interdisciplinary workshops and a Machine Learning school.

www.cse.ohio-state.edu/~mbelkin/nsfcareerresearch

Project Report

The fundamental problem of machine learning is to extract useful information from data. Modern data often consist of complex, high-dimensional objects that have a large number of different attributes. For example, documents may be represented by the frequencies of various words and word combinations, while images are often represented by the brightness and colors of individual pixels or by sets of more sophisticated characteristics, such as edge locations. Even the smallest images contain hundreds or thousands of pixels. Thus, it becomes crucial to understand the relations between the different characteristics of such objects, or "the shape of the data". In the course of the project we have advanced our fundamental understanding of the properties of data by connecting them to certain basic mathematical objects, known as manifolds, which are high-dimensional generalizations of surfaces. We have shown theoretically that, for data coming from such surfaces, the underlying geometry of the surface can be reconstructed from finite data. Moreover, knowing this geometry turns out to be helpful in many applied tasks of machine learning, such as classifying objects (e.g., images or documents) into different categories or predicting their numerical attributes. Thus, we use the geometry of the underlying data to construct better predictors for various tasks and to make better inference methods possible. We have used this theoretical understanding to develop practical algorithms for such inferential tasks and tested them on a number of real-world datasets. We have also advanced the theoretical understanding of some widely used practical methods, such as spectral clustering and Gaussian Mixture Models, allowing researchers to better understand the limits of their applicability. Finally, our research has addressed another key problem, that of increasing the efficiency of learning methods by allowing them to work with bigger datasets.
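The idea of reconstructing geometry from finite data can be illustrated with a small sketch. The code below is not taken from the project. It assumes a standard graph-Laplacian embedding (in the spirit of spectral methods such as spectral clustering) and uses synthetic data: points on a circle placed isometrically in a 10-dimensional space with slight noise. The function name laplacian_eigenmap, the neighbor count, the noise level, and the sample size are all illustrative choices.

```python
import numpy as np


def laplacian_eigenmap(X, n_neighbors=4, n_components=2):
    """Embed the rows of X using eigenvectors of a k-nearest-neighbor graph Laplacian."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between all samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor

    # Binary k-nearest-neighbor adjacency matrix, symmetrized.
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors)
    W[rows, idx.ravel()] = 1.0
    W = np.maximum(W, W.T)

    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W

    # eigh returns eigenvalues in ascending order; the first eigenvector is
    # constant on a connected graph, so the embedding skips it.
    vals, vecs = np.linalg.eigh(L)
    return vals[: n_components + 1], vecs[:, 1 : n_components + 1]


# Synthetic data (an assumption for illustration): a circle, which is a
# one-dimensional manifold, placed isometrically in 10 dimensions.
rng = np.random.default_rng(0)
n = 200
theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))  # random orthogonal matrix
X = circle @ Q[:2, :] + 0.001 * rng.standard_normal((n, 10))

vals, Y = laplacian_eigenmap(X)

# If the circular structure is recovered, the 2-D embedding has nearly
# constant radius.
radius = np.linalg.norm(Y, axis=1)
print("smallest eigenvalues:", vals)
print("relative radius spread:", radius.std() / radius.mean())
```

In this sketch the two eigenvectors after the constant one trace out a circle again, so the low-dimensional structure hidden in the 10-dimensional samples is recovered from the neighborhood graph alone. Real data are noisier and unevenly sampled, and the choice of graph and normalization matters much more in that setting.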

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0643916
Program Officer: Todd Leen
Project Start:
Project End:
Budget Start: 2007-01-01
Budget End: 2012-12-31
Support Year:
Fiscal Year: 2006
Total Cost: $498,972
Indirect Cost:

Name: Ohio State University
Department:
Type:
DUNS #:
City: Columbus
State: OH
Country: United States
Zip Code: 43210