This project develops semi-supervised machine learning algorithms that are practical, and at the same time guided by rigorous theory. In particular, the project is developing learning theory that quantifies when and to what extent the combination of labeled and unlabeled data is provably beneficial. Based on the theory, novel algorithms are being developed to address issues that currently hinder the wide adoption of semi-supervised learning. The new algorithms will be able to guarantee that using unlabeled data is at least no worse, and often better, than supervised learning. The new algorithms will also be able to learn from unlimited amounts of supervised and unsupervised data as they arrive in real-time, something humans can do but computers cannot so far.
This project has a number of broader impacts: (1) Open-source software will be an enabling tool for new discoveries in science and technology, by making machine learning possible or better in situations where labeled data is scarce. Since the software specifically targets users who are not machine learning experts, the impact is expected to be across the whole spectrum of science and technology that utilizes machine learning. (2) The project advances our understanding of the learning process via new machine learning theory, which can be applied to both computers and humans. (3) The proposal contains projects ideally suited to engage students in computer science education and research.
One key task in machine learning is to make automatic decisions: Is this email spam? What is in that photo? Is this patient healthy? Traditionally, computers need to be trained on a large amount of labeled data to make such decisions. Labeling data means someone, usually a domain expert, has to evaluate the email, annotate the photo, or perform medical diagnosis on the patient. Often, this labeling process is slow or expensive, limiting the amount of labeled training data available to the computers and hindering the performance of machine learning. On the other hand, unlabeled data are usually abundant. This project studied semi-supervised learning, a machine learning approach that combines labeled and unlabeled data to improve automatic decisions.
In the course of the project we have advanced our fundamental understanding of semi-supervised learning. Our research highlighted how strongly semi-supervised learning depends on its underlying assumptions about the data distribution. We studied theory that quantifies when and to what extent the combination of labeled and unlabeled data is beneficial. We proposed semi-supervised learning algorithms that handle big data: the algorithms learn from unlimited amounts of supervised and unsupervised data as they stream in. We also created semi-supervised algorithms that can explore complex structures within data, widening the applicability of learning. Overall, we enhanced machine learning's ability to make better automatic decisions.
This project also had broader impact beyond machine learning and computer science. For instance, we showed that semi-supervised learning is a valid mathematical model to quantify human learning. As an example, children learn from labeled data (daddy points at a dog and says "Dog!"), as well as from unlabeled data (seeing various animals over time, without being told the names). Both kinds of experiences combine to shape concept learning (e.g., dog).
Our semi-supervised learning algorithms served as cognitive models that quantitatively fit human behaviors under such settings. Our models also predicted novel human behaviors. For example, we predicted that making decisions on unlabeled items can change a person's decision boundary. This prediction has since been validated by human behavioral experiments. In summary, this project advanced the frontier of semi-supervised learning in both computer science and cognitive science, and enriched our understanding of the learning process in both computers and humans.