There is a growing need for (semi-)automated tools to analyze and organize large collections of electronic information. In response, there is a surge of research on machine learning of probabilistic topic models, which automatically discover the hidden thematic structure in a large collection of documents. Once made explicit, this hidden structure facilitates browsing, searching, organizing, and summarizing vast amounts of information.
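To make this concrete, here is a minimal sketch of fitting latent Dirichlet allocation (LDA), the canonical topic model, using the open-source gensim library. The library choice, toy corpus, and parameter settings are illustrative assumptions, not part of the proposed work.

    # Minimal LDA sketch with gensim; toy data and settings are
    # illustrative assumptions only.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # A toy corpus: each document is a list of tokens.
    docs = [
        ["gene", "dna", "genome", "expression", "cell"],
        ["neuron", "brain", "synapse", "cortex", "cell"],
        ["stock", "market", "trading", "price", "bank"],
        ["gene", "brain", "expression", "neuron", "cell"],
    ]

    dictionary = Dictionary(docs)                   # token <-> integer id map
    corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words counts

    lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
                   passes=20, random_state=0)

    # Each "topic" is a distribution over the vocabulary ...
    for topic_id, terms in lda.print_topics(num_words=5):
        print(f"topic {topic_id}: {terms}")

    # ... and each document is a distribution over topics: the hidden
    # thematic structure used for browsing, searching, and organizing.
    print(lda.get_document_topics(corpus[0]))

Fit this way, each document's topic proportions provide the compact thematic summary that downstream browsing, search, and organization tools can index.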

This research program will build significantly on the current state of the art in topic modeling.

1. We will develop topic modeling algorithms that discover trends in document streams. Modeling evolutionary and revolutionary change in topics over time will be an important new capability for corpus analysts, providing methods for forecasting and understanding changing patterns in serial collections such as news feeds, scientific publications, and blogs (see the first sketch following this list).

2. Many modern corpora, such as Wikipedia, contain important links between documents. We will develop topic models of such interconnected collections that explicitly represent and generalize inter-document and inter-topic relationships, whether hyperlinks, scholarly citations, shared authorship, or statistical correlations. Capturing the patterns in these connections, and understanding their relationship to the texts, will have important implications for a great variety of scholarly, commercial, and personal recommender systems.

3. Analysts and other users very often approach a corpus with particular questions in mind. To facilitate focused, personalized exploration, we will develop supervised methods for discovering topic models that predict document-specific variables, notably forms of relevance, of online material such as scholarly papers, legal briefs, media sources, and product specifications (see the second sketch following this list).
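To make these aims concrete, here are two hedged sketches in Python. The library choices (gensim and scikit-learn), the toy corpora, and all parameter settings are illustrative assumptions, not commitments of the proposal.

First, for aim 1, gensim's LdaSeqModel (an open-source implementation of the dynamic topic model) tracks how a topic's word distribution drifts across time slices of a document stream:

    # Dynamic topic model sketch over two time slices of a toy news
    # stream; data and settings are illustrative assumptions.
    from gensim.corpora import Dictionary
    from gensim.models import LdaSeqModel

    docs = [
        ["election", "campaign", "vote", "poll"],     # slice 1
        ["election", "candidate", "debate", "vote"],  # slice 1
        ["election", "recount", "court", "ruling"],   # slice 2
        ["court", "ruling", "ballot", "recount"],     # slice 2
    ]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    # time_slice gives the number of documents in each consecutive slice.
    dtm = LdaSeqModel(corpus=corpus, id2word=dictionary,
                      time_slice=[2, 2], num_topics=2)

    # The same topic's word distribution, early versus late.
    print(dtm.print_topics(time=0))
    print(dtm.print_topics(time=1))

Second, for aim 3, the simplest stand-in for a supervised topic model is a two-stage pipeline: fit an unsupervised model, then regress a document-specific variable (here, a hypothetical relevance label) on each document's topic proportions. A genuinely supervised topic model couples the two steps during training; this sketch only illustrates the idea:

    # Two-stage stand-in for a supervised topic model: topic
    # proportions as features for predicting per-document relevance.
    import numpy as np
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from sklearn.linear_model import LogisticRegression

    docs = [
        ["court", "ruling", "brief", "appeal"],
        ["court", "judge", "legal", "brief"],
        ["camera", "lens", "battery", "zoom"],
        ["battery", "screen", "camera", "specs"],
    ]
    relevant = np.array([1, 1, 0, 0])  # hypothetical relevance labels

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
                   passes=20, random_state=0)

    def topic_vector(bow, k=2):
        # Represent a document by its (dense) topic proportions.
        vec = np.zeros(k)
        for t, p in lda.get_document_topics(bow, minimum_probability=0.0):
            vec[t] = p
        return vec

    X = np.array([topic_vector(bow) for bow in corpus])
    clf = LogisticRegression().fit(X, relevant)

    # Predict the relevance of a new document from its topics alone.
    new_doc = dictionary.doc2bow(["judge", "appeal", "legal"])
    print(clf.predict([topic_vector(new_doc)]))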

This project addresses significant current limitations of topic modeling, and will provide practical new research and education tools for understanding and organizing modern repositories of information. We will make these tools available as open-source software to support and encourage their application to real-world problems, and we will fold the results of our research into ongoing education and outreach programs.

Project Report

Over the course of this award, my research group and I have significantly pushed the state of the art in probabilistic topic modeling. Probabilistic topic models are a suite of algorithms for analyzing massive collections of text to uncover their latent themes. These "topics" can then be used to organize, summarize, visualize, and understand the collection. Our accomplishments during the period of this grant included the following.

- We developed new scalable methods for topic modeling that can handle massive document collections; for example, we can now analyze millions of documents on a single CPU. We adapted these methods to social networks and many other settings (a minimal sketch of this streaming style of updating follows this report).

- We developed new methods for checking topic models, identifying the topics that are salient and those that might be artifacts of the model's statistical assumptions. These techniques are vital to deploying topic models in real-world scenarios.

- We developed new methods for incorporating side information, such as sentiment or other metadata, into topic models. These methods both let us form predictions and find topics that reflect the metadata. For example, some of our methods can exploit public user-behavior data (such as scientists sharing their research libraries), letting us use topic models to build recommendation systems and to understand how readers implicitly organize the collections.

- We developed new methods for uncovering trees of topics from text collections, which give a richer representation and summarization of texts. For example, on a large collection of news articles, we find a general "government" topic and then more specific topics related to the various branches of government.

- We developed new methods for uncovering influential documents in long sequences of document collections. These methods can identify which documents appear to have had an impact on future documents; analysis of scientific articles, for example, found seminal papers without using any external information (such as citation counts).

Thanks to the support of this grant, topic models are now widely used across government, science, and the humanities to help practitioners understand and manage large collections of texts.
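As a hedged illustration of the first bullet's streaming style of inference, gensim's LdaModel can be updated online with new batches of documents, so the model grows with a stream rather than requiring the whole collection in memory at once. The batches and settings below are toy assumptions, not the project's actual algorithms or corpora.

    # Sketch of online (streaming) LDA updates with gensim; batch
    # contents are illustrative assumptions.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    batch1 = [["topic", "model", "text", "corpus"],
              ["inference", "variational", "model", "posterior"]]
    batch2 = [["stream", "online", "update", "batch"],
              ["document", "stream", "topic", "update"]]

    # Build the vocabulary once up front (a true streaming system would
    # fix or prune it as documents arrive).
    dictionary = Dictionary(batch1 + batch2)

    lda = LdaModel([dictionary.doc2bow(d) for d in batch1],
                   id2word=dictionary, num_topics=2, passes=10)

    # Later documents arrive: fold them in without retraining from scratch.
    lda.update([dictionary.doc2bow(d) for d in batch2])

    for topic_id, terms in lda.print_topics(num_words=4):
        print(f"topic {topic_id}: {terms}")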

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0745520
Program Officer: Todd Leen
Budget Start: 2008-07-01
Budget End: 2014-06-30
Fiscal Year: 2007
Total Cost: $549,943
Name: Princeton University
City: Princeton
State: NJ
Country: United States
Zip Code: 08540