Supervised machine learning is a critical component of software systems in a wide variety of applications. Although models induced via supervised learning algorithms often provide state-of-the-art accuracy, they are not applied as widely as they could be because they require labeled training instances, which are often expensive to acquire. One promising approach to addressing this limitation is to employ active learning algorithms. These methods issue queries, choosing which instances are to be labeled and added to the training set. The goal of this project is to develop a new class of algorithms for active learning that broadens the applicability of this approach to more complex, realistic settings. Specifically, we will develop methods that (i) address complex learning tasks such as biological network reconstruction and event extraction from natural language, (ii) assemble batches of queries when it is cost-effective to do so, (iii) are able to employ a variety of query types, and (iv) reason about the costs incurred for various queries.
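As a minimal illustration of the basic query loop described above, the following Python sketch shows pool-based active learning with uncertainty sampling on a toy one-dimensional problem. It is not one of the algorithms this project proposes, and it does not cover batch queries, multiple query types, or query costs. The names `oracle`, `fit_threshold`, and `active_learn`, the true threshold of 0.6, and the pool size are all hypothetical choices made for the example.

```python
import random


def oracle(x, true_threshold=0.6):
    """Hypothetical labeler standing in for an expensive annotator."""
    return int(x > true_threshold)


def fit_threshold(labeled):
    """Fit a 1-D threshold classifier from (x, y) pairs.

    The decision boundary is the midpoint between the largest labeled
    negative and the smallest labeled positive (0.0 and 1.0 are used
    as defaults when a class has no labeled examples yet).
    """
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    low = max(negatives) if negatives else 0.0
    high = min(positives) if positives else 1.0
    return (low + high) / 2.0


def active_learn(pool, budget):
    """Spend `budget` label queries, each chosen by uncertainty sampling."""
    pool = list(pool)      # unlabeled instances
    labeled = []           # training set built up by queries
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)
        # The learner is least certain about the instance closest to its
        # current decision boundary, so that instance is queried next.
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), labeled


rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]
threshold, labeled = active_learn(pool, budget=15)
print(f"labels requested: {len(labeled)}, learned threshold: {threshold:.3f}")
```

In this toy setting the learner requests only 15 of the 1000 available labels, because each query is directed at the region where the current model is least certain.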
Machine learning is an important methodology for inferring models that can make useful predictions in scientific, educational, health-care, business and consumer applications. The methods to be developed in this project will provide substantial benefits to machine-learning applications in such problem domains by reducing the cost of obtaining enough data to learn accurate models. Moreover, because this project is connected to specific collaborations with biologists, it is likely to have a direct impact on the ability of scientists to design, conduct and interpret experiments investigating networks of complex relationships such as host-virus interactions. The project will also play a role in training undergraduate and graduate students in interdisciplinary research, and in recruiting undergraduate students from underrepresented minority groups into scientific careers.