Machine learning (ML) has witnessed tremendous success both in establishing firm theoretical foundations and in reaching major applications ranging from the scientific (e.g. computational biology) to the practical (e.g. financial fraud detection, spam detection). However, the reach of machine learning has been hampered by an underlying inductive framework that has largely not evolved beyond using only labeled instances of concepts (e.g. emails with yes/no labels on whether they are spam), and by its overly simple view of the user or subject matter expert (SME) as a mere provider of labels for the training instances. When instructing humans, by contrast, teachers provide richer information: Why is an instance of a concept a good positive example? What are key differences between instances belonging to different classes? Which properties are transient and which are invariant? Where should the learner focus attention? What does the current learning task have in common with previously acquired concepts or processes? Answers to such questions not only enrich the learning process, but can also effectively reduce the hypothesis space and provide significantly greater speed-ups in learning than can be achieved with class membership feedback alone.

The aim of this project is to bring this kind of richer interaction into the realm of machine learning by developing frameworks as well as machine learning methods that can take advantage of fuller mixed-initiative communication. In particular, this project aims to develop ML algorithms that can exploit information from SMEs such as (1) identifying landmark instances; (2) proposing rules of thumb; (3) providing feedback on the similarity of instances; and (4) transferring similarity measures themselves. This project brings to bear four streams of research: (1) algorithms based on similarity functions and landmark instances; (2) active and "pro-active" learning; (3) Bayesian active transfer learning; and (4) learning to cope with temporal evolution in the underlying data distribution. In order to reach practical results, this project focuses on challenges where these new methods are both most needed and likely to prove most effective, such as learning in dynamic environments with concept drift, and settings where the potential for long-term transfer learning is present. Broader impacts include more effective learning by incorporating scientific domain knowledge in eScience, for instance in computational proteomics. Educational and research-community outreach includes participation of graduate and undergraduate students from Howard University, for instance in yearly research gatherings involving all students on the project, and reusable open-source methods and data sets.

Project Report

This project was aimed at advancing core machine learning technologies by developing principled algorithms that use novel forms of interaction between learning algorithms and domain experts. Traditional machine learning methods are based on observing large amounts of random labeled data (e.g., images labeled by whether or not they contain a face, or documents labeled by whether or not they are of interest to a user). These methods can be quite successful when labeled data is plentiful, labels are accurate, and the tasks being learned are not too complex. However, in many important application areas, these conditions do not hold. To address this challenge, this work investigated new machine learning methods that take a different approach and interact with domain experts, or more generally with the environment in which they are operating, through channels other than passive observation of labeled data. These channels included interactive and batch active probing for labels of instances and regions, complex queries in DNF learning, and the role of Bayesian priors. Three specific outcomes were the following. First, a notoriously difficult type of rule for machine learning algorithms is the "DNF formula," which can represent scenarios where data objects belong to the positive class for multiple different and fairly complex reasons (such as in detecting financial fraud, which can take multiple different forms). This project showed that if, in addition to asking a domain expert to label objects as positive or negative, the algorithm can also query whether two positively-labeled objects were positive for the same reason, then one can achieve much stronger guarantees than are known for traditional learning models.
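To make the idea of a "same reason" query concrete, here is a minimal Python sketch. It is not the project's algorithm, only an illustration under simplifying assumptions that do not come from the report: the hidden rule is a monotone DNF (each term is a conjunction of attributes that must be present), every positive example satisfies exactly one term, and the expert answers truthfully. The names `learn_monotone_dnf`, `same_reason_oracle`, `HIDDEN_TERMS`, and the fraud-style attributes are all hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence

# An example is the set of attributes that are true for it.
Example = FrozenSet[str]


def learn_monotone_dnf(
    positives: Sequence[Example],
    same_reason: Callable[[Example, Example], bool],
) -> List[FrozenSet[str]]:
    """Group positives by reason, then intersect each group into one term."""
    groups: List[List[Example]] = []
    for x in positives:
        for group in groups:
            # ASSUMPTION: "same reason" is an equivalence relation, so one
            # query against a group representative is enough.
            if same_reason(group[0], x):
                group.append(x)
                break
        else:
            groups.append([x])
    # Attributes shared by every member of a group form a conjunction that
    # all members satisfy; it contains the hidden term's attributes.
    return [frozenset.intersection(*group) for group in groups]


def predict(terms: Sequence[FrozenSet[str]], x: Example) -> bool:
    """A monotone DNF is positive when some term is fully contained in x."""
    return any(term <= x for term in terms)


# Hypothetical hidden rule standing in for the domain expert's knowledge.
HIDDEN_TERMS = [
    frozenset({"foreign_ip", "new_device"}),
    frozenset({"large_amount", "night"}),
]


def reason_of(x: Example) -> int:
    # Index of the (assumed unique) hidden term that x satisfies.
    return next(i for i, term in enumerate(HIDDEN_TERMS) if term <= x)


def same_reason_oracle(a: Example, b: Example) -> bool:
    return reason_of(a) == reason_of(b)


positives = [
    frozenset({"foreign_ip", "new_device", "weekend"}),
    frozenset({"large_amount", "night", "weekend"}),
    frozenset({"foreign_ip", "new_device", "night"}),
    frozenset({"large_amount", "night", "foreign_ip"}),
]
learned = learn_monotone_dnf(positives, same_reason_oracle)
```

In this toy run the two groups of positives intersect down to the two hidden terms. With fewer or less varied examples, the learned terms would still be consistent with the data but could contain extra attributes.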
Second, many applications ranging from crowdsourcing to scientific investigation have the property that experts can be queried for labels on data points but (a) the answers are highly noisy and (b) a "batch" of data can be labeled nearly as cheaply as a single point (e.g. in computational biology, a high-throughput microarray experiment can yield dozens of labels at once for essentially the same cost as a single point, but does not provide high per-point accuracy). To address such settings, this work developed "buy at bulk" active learning algorithms that can optimize how they query for labels based on the cost structure of the problem at hand. Third, a problem that has received significant attention in computer science is how best to deploy defensive resources such as customs agents or security cameras when faced with attackers who will respond to the defender's strategy in the way that is best for themselves. These problems (called "security games") are traditionally studied under the assumption that the goals (or "payoffs") of the attacker are fully known. However, if they are not fully known, strategies that optimize against just an educated guess can perform poorly. This work developed new learning algorithms that, based on observing responses to previous strategies, are able to adapt and over time perform nearly as well as if the attackers' goals were known in advance. In addition to the above outcomes, this project also developed algorithms for learning in electronic-commerce settings as well as machine learning algorithms for efficiently communicating with sensor networks. The methods developed in the project have already been extended to computational biology for high-throughput batch active learning using microarrays, and are being investigated in contexts such as host-pathogen protein interaction graph induction.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1065251
Program Officer: Hector Munoz-Avila
Project Start:
Project End:
Budget Start: 2011-09-01
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $1,048,227
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213