Individuals and organizations must cope with massive amounts of unstructured text information: individuals sifting through a lifetime of e-mail and documents, journalists understanding the activities of government organizations, companies reacting to what people say about them online, or scholars making sense of digitized documents from the ancient world. This project's research goal is to bring together two previously disconnected components of how users understand this deluge of data: algorithms to sift through the data and interfaces to communicate the results of the algorithms. This project will allow users to provide feedback to algorithms that are typically employed on a "take it or leave it" basis: if the algorithm makes a mistake or misunderstands the data, users can correct the problem using an intuitive user interface and improve the underlying analysis. This project will jointly improve both the algorithms and the interfaces, leading to deeper understanding of, faster introduction to, and greater trust in the algorithms we rely on to understand massive textual datasets. The resulting source code and functional demos will be broadly disseminated, and tutorials will be shared online and in person, both as an educational effort and to aid adoption of the methodologies.

This project enables computer algorithms and humans to apply their respective strengths and collaborate in managing and making sense of large volumes of textual data. It "closes the loop" in novel ways to connect users with a class of big data analysis algorithms called topic models. This connection is made through interfaces that empower the user to change the underlying models by refining the number and granularity of topics, adding or removing words considered by the model, and adding constraints on what words appear together in topics. The underlying model also enables new visualizations in the form of a Metadata Map that uses active learning to focus users' limited attention on the most important documents in a collection. Users annotate documents with useful metadata and thereby further improve the quality of the discovered topics. The project includes evaluations of these methods through careful user studies and in-depth case studies to demonstrate that topics are more coherent, users can more quickly provide annotations, users trust the underlying algorithms more, and users can more effectively build an understanding of their textual data. The project web site will include pointers to the project Git repositories for source code, project demos, tutorials, and publications communicating experimental results.
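
The abstract does not say how active learning chooses which documents to show the user. As a rough illustration only, the sketch below assumes uncertainty sampling, a common active learning strategy: it ranks unlabeled documents by the entropy of a model's predicted label distribution and surfaces the most uncertain ones for annotation. The function names, document ids, and probabilities are all hypothetical and are not taken from the project.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(doc_label_probs, k=1):
    """Return the ids of the k documents with the highest predictive entropy.

    doc_label_probs maps a document id to the model's predicted
    probability for each candidate label (hypothetical input format).
    """
    ranked = sorted(
        doc_label_probs,
        key=lambda doc_id: entropy(doc_label_probs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up predictions for three unlabeled documents (assumption, not project data).
predictions = {
    "doc_a": [0.98, 0.01, 0.01],  # model is confident
    "doc_b": [0.40, 0.35, 0.25],  # model is very unsure
    "doc_c": [0.70, 0.20, 0.10],  # somewhere in between
}

# Ask the user to annotate the two documents the model is least sure about.
selected = most_uncertain(predictions, k=2)
print(selected)
```

In a loop of this kind, the user's new annotations would be fed back to the model, the predictions recomputed, and the ranking repeated, so that limited annotation effort goes to the documents expected to be most informative.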

National Science Foundation (NSF)
Division of Information and Intelligent Systems (IIS)
Program Officer
Hector Munoz-Avila
Brigham Young University
United States