Machine learning models are trained on data that are annotated (labeled) by humans. The accuracy of the trained models generally improves with the number of annotated data examples. Yet, annotating takes time, money, and effort. Active learning aims to minimize these costs by determining which examples are most informative and directing the human labeler to them. Improvements in active learning will lower the costs associated with data annotation and lead to faster implementations of intelligent systems for a range of applications including robotics, speech technology, error and anomaly detection (for example in medicine, financial fraud, and condition-based maintenance of infrastructure), targeted advertising, human-computer interfaces, and bioinformatics.
In traditional active learning approaches, algorithms are limited in the types of information they can acquire, and they often do not provide any rationale to the user as to why a particular example is chosen for annotation. This CAREER project develops a new paradigm dubbed "rich and transparent active learning." This new paradigm opens a communication channel between algorithms and users whereby they can exchange a rich set of queries, answers, and explanations. By using rich feedback from users, the algorithms will be able to learn the target concept more economically, reducing the resources required to build an accurate predictive model. By explaining their reasoning, these algorithms will achieve transparency, build trust, and open themselves to scrutiny.
Toward that end, the project develops methods that allow algorithms to use a rich set of queries for resource-efficient model training and to generate explanations that are informative but not overwhelming for the users. The methods developed build on expected loss minimization, information theory, and principles from human-computer interaction. Approaches are evaluated using publicly available datasets and user studies carried out as part of the project. The project develops case studies on two high-impact real-world problems: detecting fraudulent health-care claims and identifying patients at risk of disease.
The rich and transparent active learning paradigm provides unique educational opportunities. In contrast to standard machine learning algorithms, which are operated as black boxes, interactive and transparent machine learning is expected to raise students' interest in and motivation for data science. Two PhD students and several undergraduate and high school students are being trained under this award. A new graduate course on interactive machine learning is being developed. Finally, the PI ensures effective outreach to under-represented groups by partnering with a Chicago public high school whose student population is 90% minority students.