Swaminathan Vishwanathan, Purdue University; Manfred Warmuth, University of California, Santa Cruz
Machine learning is currently indispensable for building predictive models from massive data sets. A large majority of widely used machine learning algorithms are based on minimizing a convex loss function. A fundamental problem with all such models is that they are not robust to outliers. To address this limitation, this project develops probabilistic models based on a parametric family of distributions, namely the t-exponential family, which lead to quasi-convex loss functions and yield models that are robust to outliers.
The key challenge when working with the t-exponential family of distributions, as in the case of the exponential family, is to compute the log-partition function and perform inference efficiently. The project addresses this challenge in two specific cases. For problems with a small number of classes, exact iterative schemes are being developed. For problems where the number of classes is exponentially large, approximate inference techniques are being developed by extending variational methods.
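To make the small-class case concrete, the following is a minimal sketch, not the project's implementation. It assumes the common definition exp_t(x) = [1 + (1 - t)x]_+^(1/(1 - t)), restricts t to the range 1 < t < 2, and normalizes class scores with a simple fixed-point iteration. The function names exp_t, log_t, and t_log_partition, the tolerance, and the iteration cap are all hypothetical choices made for illustration.

```python
import math


def exp_t(x, t):
    """t-exponential [1 + (1 - t) x]_+ ** (1 / (1 - t)); equals exp(x) at t = 1."""
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    if base <= 0.0:
        # Outside the support: zero for t < 1, divergent for t > 1.
        return 0.0 if t < 1.0 else math.inf
    return base ** (1.0 / (1.0 - t))


def log_t(x, t):
    """Inverse of exp_t for x > 0; equals log(x) at t = 1."""
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)


def t_log_partition(scores, t, tol=1e-12, max_iter=200):
    """Find g such that sum_k exp_t(scores[k] - g) = 1 (assumes 1 < t < 2).

    Returns (g, probs) where probs[k] = exp_t(scores[k] - g).
    """
    if not 1.0 < t < 2.0:
        raise ValueError("this sketch only handles 1 < t < 2")
    s_max = max(scores)
    shifted = [s - s_max for s in scores]  # largest shifted score is 0
    current = list(shifted)
    for _ in range(max_iter):
        z = sum(exp_t(x, t) for x in current)
        # Fixed-point update: rescale the shifted scores by z ** (1 - t).
        updated = [z ** (1.0 - t) * x for x in shifted]
        change = max(abs(u - v) for u, v in zip(updated, current))
        current = updated
        if change < tol:
            break
    z = sum(exp_t(x, t) for x in current)
    g = s_max - log_t(1.0 / z, t)
    probs = [exp_t(s - g, t) for s in scores]
    return g, probs


if __name__ == "__main__":
    g, probs = t_log_partition([2.0, 0.5, -1.0], t=1.5)
    print("log-partition:", g)
    print("probabilities:", probs, "sum:", sum(probs))
```

Unlike the ordinary softmax, the normalizer here has no closed form for general t, which is why an iterative scheme is needed even with only a handful of classes. For t > 1 the resulting distribution has heavier tails than the exponential family, which is the property that makes the associated loss less sensitive to outliers.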
In partnership with Google, some of the data mining algorithms resulting from this project are being applied to the challenging real-world problem of recognizing text in photos (the PhotoOCR problem). The project offers opportunities for research-based advanced training of graduate students as well as research opportunities for undergraduates in machine learning and data mining. Algorithms for constructing predictive models from data that are robust in the presence of outliers are likely to find use in a broad range of applications. Open source implementations of algorithms, publications, and data sets resulting from the project are being made available through the project web page at: http://learning.stat.purdue.edu/wiki/tentropy/start