The broad goal of this line of research is to give a principled answer to the question, "What sort of data is efficiently learnable, and by what algorithms?" The current state of the art in machine learning is that there is an overwhelming number of possible algorithms that can be tried on a new learning problem, with no clear understanding of which techniques can be expected to work on which problems.
Further, it is often the case that machine learning algorithms that work well "in theory" do not perform as well "in practice," and vice versa. The PIs have outlined a plan for resolving these difficulties by unifying disparate methods via the Polynomial Method and investigating how efficient this method can be made. On a more immediate level, the PIs will aim for broad impact by advising and guiding graduate students and by widely disseminating research results.
Specifically, the PIs will investigate the effectiveness of the "Polynomial Method" in machine learning theory. The PIs observe that nearly all learning algorithms, in theory and in practice, can be viewed as fitting a low-degree polynomial to data. The PIs plan to systematically develop this Polynomial Method of learning by working on the following three strands of research:
1. Understand the extent to which low-degree polynomials can fit different natural types of target functions, under various data distributions and noise rates. This research involves novel methods from approximation theory and analysis.
2. Develop new algorithmic methods for finding well-fitting polynomials when they exist (see the sketch following this list). Here the PIs will work to adapt results in geometry and probability in order to identify and eliminate irrelevant data.
3. Delimit the effectiveness of the Polynomial Method. The PIs will show new results on the computational intractability of learning intersections of linear separators, and on learning linear separators with noise.
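To make the "fit a low-degree polynomial, then threshold it" viewpoint concrete, the following is a minimal illustrative sketch in Python. It is not the PIs' algorithm; the helper names and the choice of ordinary least squares are assumptions made purely for illustration.

```python
# Illustrative sketch only (hypothetical helper names, not the PIs' algorithms):
# fit a low-degree polynomial to labeled examples by least squares, then
# classify new points by the sign of the fitted polynomial.
import itertools
import numpy as np

def monomial_features(X, degree):
    """Expand each row of X into all multilinear monomials of total degree <= degree."""
    n = X.shape[1]
    cols = [np.ones(X.shape[0])]                      # constant term
    for d in range(1, degree + 1):
        for idx in itertools.combinations(range(n), d):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

def fit_low_degree_polynomial(X, y, degree):
    """Least-squares fit of a degree-`degree` polynomial to +/-1 labels y."""
    Phi = monomial_features(X, degree)
    coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return coeffs

def ptf_predict(coeffs, X, degree):
    """Polynomial threshold function hypothesis: sign of the fitted polynomial."""
    scores = monomial_features(X, degree) @ coeffs
    return np.where(scores >= 0, 1, -1)

# Tiny usage example: learn labels produced by a hidden degree-2 PTF.
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(2000, 8))
y = np.where(X[:, 0] * X[:, 1] + X[:, 2] > 0, 1, -1)   # hidden target rule
coeffs = fit_low_degree_polynomial(X, y, degree=2)
print(np.mean(ptf_predict(coeffs, X, 2) == y))          # training accuracy
```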
Our project was motivated by the observation that, across the wide range of learning algorithms used in machine learning today, a common thread is that nearly all contemporary supervised learning algorithms (in which a binary-valued function or prediction rule is learned from a sample of data points that have been labeled according to that rule) employ a "Polynomial Threshold Function" as their hypothesis. A Polynomial Threshold Function, or PTF, is simply a prediction rule in which an example point x is labeled positive or negative according to whether a particular real polynomial p(x) takes a positive or negative value on input x. With this motivation, our project studied polynomial threshold functions from several points of view as they relate to learning, with the twin goals of developing new effective learning algorithms that use polynomial threshold functions and of understanding the inherent limitations of learning with polynomial threshold functions. Below we describe two of the main results we achieved and their relation to learning.
One of our results was a new bound on the "average sensitivity" of any degree-d polynomial threshold function (that is, when the polynomial p(x_1,...,x_n) mentioned above has degree d). The average sensitivity of a binary-valued function measures the probability that the value of the function flips when a randomly chosen input bit x_i of a randomly chosen input bit-string x=(x_1,...,x_n) is flipped. We gave the first nontrivial upper bound on this probability, making progress on a conjecture from 1994. While this could be viewed as a pure probability or geometry result (the connection with geometry is that our result can equivalently be stated as bounding the size of a certain intersection between a degree-d polynomial surface and the Boolean hypercube), we also showed that the upper bound directly yields the first efficient algorithm for learning binary prediction rules expressible as degree-d polynomial threshold functions when the labels of the examples given to the learner have been corrupted with a challenging type of adversarial noise, as long as the examples themselves (the inputs x) are drawn uniformly from the Boolean hypercube.
A second major result from our project was a "bad-news" result showing that it can be computationally hard to find a low-degree polynomial threshold function that correctly labels noisy data points. The positive result described in the previous paragraph shows that, for a collection of data points chosen uniformly at random from the n-dimensional Boolean hypercube, there is a polynomial-time algorithm that can find a degree-d polynomial threshold function whose accuracy in classifying those data points is nearly as good as that of the *best* possible degree-d polynomial threshold function classifier. In strong contrast, our second result shows that if the collection of labeled data points is chosen by an adversary, then even small amounts of noise in the labels can make it computationally intractable to find a decent polynomial threshold function hypothesis. Under a widely accepted computational hardness conjecture (the "Unique Games" conjecture, a strengthening of the famous "P not equal to NP" conjecture), we show that even if there is a degree-d PTF that agrees with 99% of the examples in a data set of labeled examples, no polynomial-time algorithm can in general find even a degree-d PTF that agrees with 51% of the data set.
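To make the two definitions above concrete, here is a minimal sketch, in Python, of a polynomial threshold function and a Monte Carlo estimate of its average sensitivity over the Boolean hypercube {-1,+1}^n. The code and its names are hypothetical illustrations of the definitions, not the analysis carried out in the project.

```python
# Hypothetical sketch (not from the project): estimate the average sensitivity
# of a polynomial threshold function sign(p(x)) over the hypercube {-1,+1}^n,
# i.e. the probability that flipping one random bit of a random input flips the label.
import numpy as np

rng = np.random.default_rng(0)

def ptf_label(p, x):
    """PTF prediction rule: label x by the sign of the real polynomial p."""
    return 1 if p(x) >= 0 else -1

def average_sensitivity(p, n, samples=100_000):
    """Monte Carlo estimate of Pr[label flips when a random bit of a random x is flipped]."""
    flips = 0
    for _ in range(samples):
        x = rng.choice([-1, 1], size=n)
        i = rng.integers(n)
        y = x.copy()
        y[i] = -y[i]                      # flip one randomly chosen coordinate
        if ptf_label(p, x) != ptf_label(p, y):
            flips += 1
    return flips / samples

# Example: a degree-2 PTF on n = 10 bits, p(x) = x_0*x_1 + x_2 - 0.5
n = 10
p = lambda x: x[0] * x[1] + x[2] - 0.5
print(average_sensitivity(p, n))
```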