Machine learning currently offers one of the most cost-effective approaches to building predictive models (e.g., classifiers for categorizing the millions of messages, news articles, and blogs that are generated every day). However, the effective use of machine learning methods in such settings is limited by the availability of a training corpus (i.e., a representative set of instances that have been labeled with the correponding categories). In domains where labeled data are scarce or expensive to acquire, there is an urgent need for cost-effective approaches to selectively acquiring labels for data samples used to train predictive models using machine learning.

This project explores novel techniques that take advantage of the low cost of micro-outsourcing using systems such as Amazon's mechanical Turk, to engage a large number of workers from around the world for acquiring the labels of instances to be used to construct the training corpus. There is currently little understanding of how to utilize the multiple noisy labels obtained using micro-outsourcing. There is a need for advanced techniques for taking advantage of the low cost of micro-outsourcing in order to improve data quality and the quality of models built from the available data. It explores novel approaches for utilizing multiple labels given to an instance by different labelers. It also extends active learning techniques for active selection of samples to be labeled to take into account the multi-sets of labels that have been already obtained from a pool of labelers.

Advances in techniques for active selection of data instances to be labeled in a micro-outsourcing setting can significantly improve the quality of data used to build predictive models in a broad range of applications, including gene annotation, image annotation, text classification, sentiment analysis, and recommender systems, where unlabeled data are plentiful yet labeled data are sparse. The project will provide research opportunities for students at University of Central Arkansas, a primarily undergraduate institution and help expand the STEM pipeline. Additional information about the project can be found at:

National Science Foundation (NSF)
Division of Information and Intelligent Systems (IIS)
Application #
Program Officer
Vasant G. Honavar
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of Central Arkansas
United States
Zip Code