Most machine learning algorithms operate on fixed-dimensional feature vector representations. In many applications, however, the natural representation of the data consists of more complex objects, for example functions, distributions, and sets, rather than finite-dimensional vectors. This project aims to develop a new family of machine learning algorithms that can operate directly on these complex objects. The key innovation is efficient estimation of certain information theoretic quantities for learning predictive models from complex data. The research is organized around three specific aims: (a) Development and analysis of nonparametric estimators for certain important functionals of densities, such as entropy, mutual information, conditional mutual information, and divergence; and study of the theoretical properties of these estimators, including consistency, convergence rates of the bias and variance, and asymptotic normality. (b) Use of the preceding estimators to design new learning algorithms for clustering, classification, regression, and anomaly detection that work directly on sets, functions, and distributions without any additional hand-crafted feature extraction, histogram creation, or density estimation steps that could lead to loss of information. (c) Study of the theoretical properties of these new machine learning algorithms (computation time, sample complexity, generalization error) and empirical evaluation of the algorithms on a variety of important real-world problems, including nuclear detection, astronomical data analysis, and computer vision, in collaboration with researchers at Lawrence Livermore National Laboratory, the University of Washington and Johns Hopkins University, and Carnegie Mellon University, respectively.
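To make aim (a) concrete, the following is a minimal NumPy sketch of one classical nonparametric entropy estimator of the kind studied here: the Kozachenko-Leonenko k-nearest-neighbor estimator of differential entropy. This is an illustration of the general technique, not the project's specific estimators; the function names and the brute-force distance computation are choices made for this sketch.

```python
import numpy as np
from math import gamma, log, pi

def digamma(x):
    """Digamma psi(x) for x > 0 via recurrence plus an asymptotic series."""
    r = 0.0
    while x < 6:
        r -= 1.0 / x
        x += 1
    f = 1.0 / (x * x)
    return r + log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))

def knn_entropy(x, k=3):
    """Kozachenko-Leonenko k-NN estimate of differential entropy, in nats.

    x : (n, d) array of i.i.d. samples from an unknown density.
    The estimator is psi(n) - psi(k) + log(V_d) + (d/n) * sum_i log(eps_i),
    where eps_i is the distance from x_i to its k-th nearest neighbor and
    V_d is the volume of the unit ball in R^d.
    """
    n, d = x.shape
    # Pairwise Euclidean distances; O(n^2) memory, fine for an illustration.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # exclude each point's self-distance
    eps = np.sort(dist, axis=1)[:, k - 1]   # distance to the k-th neighbor
    log_vd = (d / 2) * log(pi) - log(gamma(d / 2 + 1))  # log volume of unit d-ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))
```

For a standard 1-D Gaussian the true differential entropy is 0.5 * log(2 * pi * e), roughly 1.419 nats, so the estimate on a moderately large sample should land close to that value. Estimators like this avoid any explicit density estimation step, which is exactly the property aim (b) exploits.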
Broader Impact. The project, if successful, could substantially advance the current state of the art in building predictive models from complex data. The results of this research, including publications and open source software, will be freely disseminated to the larger scientific community. The project provides enhanced research-based training opportunities for graduate and undergraduate students at Carnegie Mellon University as well as the collaborating institutions.
The most common machine learning algorithms operate on vectorial feature representations. In many applications, however, the natural representation of the data consists of more complex objects, for example functions, distributions, and sets, rather than finite-dimensional vectors. In this two-year NSF-EAGER project we developed new machine learning algorithms that can operate directly on these complex objects. To achieve this goal, as a very first step we had to develop new nonparametric statistical methods to estimate entropy, divergences, and mutual information. We proved the consistency of these estimators, derived upper bounds on their convergence rates, and for some of these novel estimators we proved that they are minimax optimal. We have achieved many important results in this estimation problem. Nonetheless, many open questions remain unanswered, and we would like to continue this research direction in the near future. In particular, many questions are open in the high-dimensional setting, when the dimensions of the distributions are large, and in the big data setting, when the number of instances is large. Efficient estimation of conditional mutual information is also a challenging problem. We developed new algorithms for distribution regression and function-to-function regression, and analyzed the statistical properties of these methods. Although the possible applications of these methods are innumerable, in this project we focused our efforts on computer vision, astronomy, and nuclear safety. Our novel methods, however, have also been successfully applied to anomaly detection in embankment dam piezometer data, neuroscience, and data analysis for Step-Down Unit (SDU) and Intensive Care Unit (ICU) monitoring equipment. Most of our research efforts have focused on the supervised setting: developing machine learning algorithms on complex objects for classification and regression.
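To illustrate the distribution regression setting mentioned above, here is a minimal sketch of one standard approach: each input is an entire sample set (representing an unobserved distribution), embedded via random Fourier features approximating a kernel mean embedding, followed by ridge regression on the embeddings. This is a generic textbook-style pipeline, not the project's specific algorithm; all function names, the bandwidth `sigma`, and the regularization `lam` are assumptions of this sketch.

```python
import numpy as np

def rff_mean_embedding(samples, w, b):
    """Approximate kernel mean embedding of one sample set.

    samples : (n_i, d) array of points drawn from the i-th input distribution.
    w, b    : random Fourier feature parameters shared across all sets.
    Returns a (D,) feature vector: the average of the RFF map over the set.
    """
    feats = np.sqrt(2.0 / w.shape[1]) * np.cos(samples @ w + b)
    return feats.mean(axis=0)

def fit_distribution_regression(sample_sets, y, D=100, sigma=1.0, lam=1e-2, seed=0):
    """Ridge regression from sample sets to real-valued labels y."""
    rng = np.random.default_rng(seed)
    d = sample_sets[0].shape[1]
    w = rng.normal(scale=1.0 / sigma, size=(d, D))   # Gaussian-kernel frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Phi = np.stack([rff_mean_embedding(s, w, b) for s in sample_sets])
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ np.asarray(y))
    return w, b, beta

def predict_distribution_regression(sample_sets, w, b, beta):
    Phi = np.stack([rff_mean_embedding(s, w, b) for s in sample_sets])
    return Phi @ beta
```

As a usage example, one can draw each training set from a Gaussian with a different mean and regress onto that mean; the learned model then predicts the mean of a new distribution from a fresh sample set, with no hand-made feature extraction or density estimation in between.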
In the future we would like to continue this research direction and answer many of the remaining open questions, such as how to learn the structure of these complex objects (e.g., cluster structure or low-dimensional manifold structure) and exploit this structure to develop better and faster algorithms. Our results have been published at top machine learning conferences, including NIPS, ICML, UAI, and AISTATS. The bulk of the requested funding was used to support graduate students for two years. Selected aspects of this research became part of the teaching materials for graduate and undergraduate education at CMU, in the School of Computer Science (technical aspects) and in Heinz College (public policy and IT aspects).