Swaminathan Vishwanathan, Purdue University; Manfred Warmuth, University of California, Santa Cruz
Machine learning is currently indispensible for building predictive models from massive data sets. A large majority of widely used machine learning algorithms are based on minimizing a convex loss function. A fundamental problem with all such models is that they are not robust to outliers. To address this limitation, this project develops probabilistic models based on a parametric family of distributions, namely, the t-exponential family, that lead to quasi-convex loss functions and yield models that are robust to outliers.
The key challenge when working with the t-exponential family of distributions, as in the case of the exponential family, is to compute the log-partition function and perform inference efficiently. The project addresses this challenge in two specific cases. For problems with small number of classes exact iterative schemes are being developed. For problems where the number of classes is exponentially large, approximate inference techniques are being developed by extending variational methods.
In partnership with Google, some of the data mining algorithms resulting from this project are being applied to a challenging real-world problem of recognizing text in photos (the PhotoOCR problem). The project offers opportunities for research-based advanced training of graduate students as well as research opportuinities for undergraduates in machine learning and data mining. Algorithms for constructing predictive models from data that are robust in the presence of outliers are likely to find use in a broad range of applications. Open source implementions of algorithms, publications, and data sets resulting from the project are being made available through the project web page at: http://learning.stat.purdue.edu/wiki/tentropy/start
The exponential family of distributions play an important role in statistics and machine learning. They underlie numerous models such as logistic regression and probabilistic graphical models. However, exponential family based probabilistic models are vulnerable to outliers. We focused on the binary and multiclass classification problems and proposed a new robust algorithm, t-logistic regression. t-logistic regression is obtained by replacing the exponential family in logistic regression by a more generalized distribution family, namely the t-exponential family from statistical Physics. Unlike the exponential family, the key challenge in working with the t-exponential family is the computation of the log-partition function, for which no general closed form solution exists. We showed how to overcome this difficulty via a computationally efficient iterative algorithm. Furthermore, we demonstrated the robustness of t-logistic regression both theoretically and empirically. Via extensive empirical evaluation, we showed that our algorithm is empirically stable, even though the empirical risk may have multiple local minima. In a second line of research we investigated "dropout" which is a heuristic used for training deep neural networks: for each training example a number of nodes in the neural net are randomly dropped. We were able show in an online learning context that a version of dropout applied in connection with the Follow the Leader Heuristic provable has optimal regret for both worst-case and iid data. Third, we found a certain learning problem that is hard to learn for any deep neural network provided that the network is trained via gradient descent. So far we have proven the hardness only of single neurons. However for larger networks we have experimental evidence of the hardness.