Jebara, Tony Chudnovsky, Maria

Columbia University, New York, NY, United States

This proposal investigates new computational and combinatorial tools for learning from data and performing inference about data. As such, it serves as a bridge between the machine learning community and the combinatorics community. The latter field develops efficient algorithms or shows when an efficient solution to a particular problem exists. The project will support graduate students in the intersection of these two fields to combine recent theoretical results with practical data problems. The team will produce a website of tutorials, data sets, downloadable code, interactive visual examples and course materials to allow better integration across the two fields. The combinatorics community has identified a family of inputs called perfect graphs where otherwise hard problems are efficiently solvable. This proposal will investigate how to bridge this powerful theoretical result to the area of machine learning and formally characterize which learning and inference problems are efficiently solvable.

More specifically, the team will investigate data learning problems (such as clustering a data set into subgroups that are self-similar or finding anomalies in a database) and statistical inference problems (computing the most likely or most typical outcome from observed measurements). These problems will be compiled or represented as graphs and networks. Then, these graphs and networks can be more formally diagnosed using perfect graph theory to determine if the instance of the problem is easy (or hard) and to provide efficient solutions via exact algorithms on perfect graphs. These solvers will be implemented using message passing or convex programming.

In the 1700s, Thomas Bayes invented Bayes' rule as a way of performing inference with a probability distribution. Inference and learning require answering two fundamental questions about a probability distribution: what is the most likely value of any given variable (e.g. maximum a posteriori (MAP) inference) and what is the expected value of any given variable (e.g. marginal inference (MI)). These problems are easy when the probability distributions only contain a few variables or the variables only have simple inter-dependencies. But, they become hard or inefficient if there are many variables at hand with many cycles of interdependence. A graphical model is a probability distribution represented as a network with edges between variables that are interdependent. Inference is hard in many real-world graphical models: when the variables are nodes in social networks, in power grids, in deep belief networks, in medical diagnostics networks, in networks of image pixels, and so on. This project tackled inference and learning problems over complicated networks of variables by using recent break-throughs in combinatorics and graph theory. The theory says that efficiency can emerge when graphs have some excluded sub-graph. For example, trees contain no cycles, perfect graphs contain no odd cycles, and claw-free graphs contain no claws. Recent theoretical results by the team show that hard problems become easy on such families of graphs. For example, otherwise hard graph coloring problems and graph stable-set problems become efficient to solve. By reformulating inference in terms of perfect graph theory and combinatorial algorithms, we are now able to solve MAP and MI problems on many large networks efficiently. Previously only heuristic algorithms such as loopy belief propagation were used to solve these problems and these approaches lacked efficiency, optimality or convergence guarantees. This grant produced algorithms that provide optimal inference answers efficiently. For example, in large and loopy networks that are fully associative (i.e. nodes encourage their neighbors to match their configuration), we can perform marginal inference in polynomial time. If the network is not fully associative, we double the number of nodes through a clever double-cover construction and then flip certain edges to obtain a fully associative surrogate network. This double-cover network can then be processed efficiently. Finally, by iterating our MAP and MI algorithms, we can perform learning and efficiently fit the graphical models and probability distributions to data.

- Agency
- National Science Foundation (NSF)
- Institute
- Division of Information and Intelligent Systems (IIS)
- Application #
- 1117631
- Program Officer
- Todd Leen

- Project Start
- Project End
- Budget Start
- 2011-08-01
- Budget End
- 2014-07-31
- Support Year
- Fiscal Year
- 2011
- Total Cost
- $456,417
- Indirect Cost

- Name
- Columbia University
- Department
- Type
- DUNS #

- City
- New York
- State
- NY
- Country
- United States
- Zip Code
- 10027