Many modern applications, ranging from computer vision to biology, require modeling and inferring high-dimensional continuous variables whose distributions exhibit multimodality, skewness, and rich latent structure. Most existing models in this regime rely heavily on parametric assumptions: the components of the model are typically assumed to be discrete or multivariate Gaussian, or the relations between variables are assumed to be linear, which may differ greatly from the actual data-generating processes. Furthermore, existing algorithms for discovering latent dependency structures and learning latent parameters are largely restricted to local search heuristics such as expectation maximization. Conclusions drawn under these restrictive assumptions and suboptimal solutions can be misleading if the underlying assumptions are violated or if the suboptimal solutions differ greatly from the globally optimal ones. This project aims to develop a novel framework that can (i) discover and exploit latent structures in the data, (ii) allow the model components to handle near-arbitrary distributions, and (iii) scale to modern massive datasets in a local-minimum-free fashion.
The key innovation of the project is a novel nonparametric latent variable modeling framework based on kernel embeddings of distributions. The basic idea is to map distributions into infinite-dimensional feature spaces using kernels, so that subsequent comparisons and manipulations of distributions can be carried out via feature-space operations such as inner products, distances, projections, linear transformations, and spectral analysis. Conceptually, the framework represents the components of latent variable models, such as marginal distributions over a single variable and joint distributions over pairs, triplets, and larger sets of variables, as infinite-dimensional vectors, matrices, tensors, and higher-order tensors, respectively. Probabilistic relations between these components, such as conditional distributions, the Sum Rule, and the Product Rule, become linear transformations and relations between these feature-space components.
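To make these feature-space operations concrete, the sketch below uses the standard notation of the kernel embedding literature; the feature map \phi, the mean embedding \mu_X, and the covariance operators \mathcal{C} are generic constructions from that literature rather than project-specific notation.

```latex
% Kernel mean embedding of a marginal P(X): an element of the RKHS
% induced by the (possibly infinite-dimensional) feature map \phi,
% together with its empirical estimate from a sample x_1, ..., x_n.
\[
  \mu_X = \mathbb{E}_X[\phi(X)], \qquad
  \hat{\mu}_X = \tfrac{1}{n}\sum_{i=1}^{n} \phi(x_i).
\]
% A joint distribution P(X,Y) embedded as a cross-covariance operator
% (an "infinite-dimensional matrix"):
\[
  \mathcal{C}_{XY} = \mathbb{E}_{XY}\!\left[\phi(X) \otimes \phi(Y)\right].
\]
% A conditional distribution P(Y | X) represented as a linear operator
% on the RKHS, and the Sum Rule, Q(Y) = \int P(Y | x)\,\pi(x)\,dx,
% carried out as a linear map applied to the embedding of \pi:
\[
  \mathcal{C}_{Y|X} = \mathcal{C}_{YX}\,\mathcal{C}_{XX}^{-1}, \qquad
  \mu_{Y|x} = \mathcal{C}_{Y|X}\,\phi(x), \qquad
  \mu_Y^{\pi} = \mathcal{C}_{Y|X}\,\mu_X^{\pi}.
\]
```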
The framework supports modeling data with diverse statistical features without making restrictive assumptions about the types of distributions and relations involved. It enables the application of a large pool of linear and multilinear algebraic (tensor) tools to challenging graphical model problems in the presence of latent variables, including structure discovery, inference, parameter learning, and latent feature extraction. The framework applies not only to general continuous variables, but also to variables taking values in strings, graphs, groups, compact manifolds, and other domains on which kernels can be defined.
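As a minimal illustration of one such feature-space operation, the sketch below compares two samples by the RKHS distance between their empirical mean embeddings (the maximum mean discrepancy). The function names rbf_gram and mmd2 and the choice of a Gaussian kernel are illustrative assumptions, not the project's own software.

```python
import numpy as np

def rbf_gram(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel matrix K[i, j] = k(X[i], Y[j])."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mmd2(X, Y, bandwidth=1.0):
    """Squared MMD: RKHS distance between the empirical mean embeddings
    of samples X and Y (biased V-statistic estimate)."""
    return (
        rbf_gram(X, X, bandwidth).mean()
        + rbf_gram(Y, Y, bandwidth).mean()
        - 2.0 * rbf_gram(X, Y, bandwidth).mean()
    )

# Samples from two different distributions lie farther apart in the RKHS
# than two samples drawn from the same distribution.
rng = np.random.default_rng(0)
sample_a = rng.normal(size=(200, 2))
sample_b = rng.normal(size=(200, 2))          # same distribution as sample_a
sample_c = rng.normal(loc=1.0, size=(200, 2)) # shifted distribution
print(mmd2(sample_a, sample_b))  # close to zero
print(mmd2(sample_a, sample_c))  # noticeably larger
```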
Besides advancing the state of the art in machine learning, the new nonparametric methods resulting from the project find applications in image understanding and gene expression data analysis. The project also contributes to research-based training of graduate and undergraduate students at Georgia Tech and CMU.