This project addresses the open challenge of Multi-field Hierarchical Discovery and Tracking (mf-HDT) of emerging topics at different granularity levels based on combined evidence in heterogeneous data. The technical approach consists of a new Bayesian framework with powerful inference algorithms, namely multi-field Hierarchical Correlated Topic Modeling, for discovering multi-field hierarchies of latent topics, capturing inter-topic and cross-hierarchy correlations, and enabling query-driven threading of topics over a Markov chain of hierarchies. These technical innovations and capabilities go beyond existing Topic Detection and Tracking (TDT) methods and graphical models used to represent relationships between topics, citations, etc. Significant improvements are expected in both effectiveness and scalability over existing methods, especially in detecting newly emerging topics and tracking time-sensitive impact. The proposed approach will be evaluated on four large datasets of scientific literature data across a broad range of fields (Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics) as well as news stories, with human-produced queries and relevance judgments and human-assigned topic labels to support task-oriented evaluations.

The productivity of researchers, educational practitioners and students, government agencies supporting research, and industries depends heavily on the availability of up-to-date big pictures of scientific emergence and co-emergence within and across many fields, along with evidence of the impact of new technologies and of research or development funding. The proposed techniques, if successful, will provide principled and effective solutions with broad future impact in the applications above and beyond. The project Web site (http://nyc.lti.cs.cmu.edu/mfhdt/) will provide access to open-source software, datasets, results and publications in order to enable comparative evaluations and further studies by related research communities. The students involved in the project will benefit from direct experience in using and evaluating cutting-edge IT technologies in real-world applications. This experience is complementary to classroom teaching: the students can observe first-hand the direct implications of choosing various strategies for categorization, active learning and distributed computing.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
1216282
Program Officer
Maria Zemankova
Project Start
Project End
Budget Start
2012-10-01
Budget End
2016-09-30
Support Year
Fiscal Year
2012
Total Cost
$515,182
Indirect Cost
Name
Carnegie-Mellon University
Department
Type
DUNS #
City
Pittsburgh
State
PA
Country
United States
Zip Code
15213