Billions of new digital documents are created around the world every day, including emails, blog posts, legal documents, and news articles. To enable effective information management, many of these documents are processed by information retrieval systems, such as desktop search tools or Web search engines. Most existing technologies represent documents digitally; to a computer, such a representation is nothing more than a sequence of bits, devoid of any explicit meaning. Because most modern search engines rely on these basic representations, they often fail to account for the meaning of the words in the documents, which diminishes the quality of their results. Despite the importance of this fundamental problem, there have been surprisingly few attempts to build, and subsequently search, document representations that encode the rich meaning of text, especially for data sets that contain millions or billions of documents.

This research investigates how to automatically construct, index, and search next-generation super-enriched document representations. The approach relies on the careful integration of traditional text representations with natural language processing-based sources (e.g., named entities, synonyms, and paraphrases), rich knowledge sources (e.g., Wikipedia and Freebase), contextual sources, and other value-added sources of content. Constructing such representations for large document collections requires computationally intensive batch processing to mine, aggregate, and join data across disparate sources. To overcome these challenges, a scalable, massively distributed cloud computing solution is adopted. The resulting enriched document representations can be effectively applied to a wide variety of information retrieval, natural language processing, and data mining tasks.
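
The idea of joining a traditional text representation with additional sources and then searching the result can be illustrated with a minimal sketch. It is not the project's implementation: it runs in memory on a single machine rather than on a distributed platform, and every name in it (EnrichedDocument, build_enriched, build_index, search, the layout of the annotations dictionary, and the sample data) is hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EnrichedDocument:
    """A document as a set of named fields, one per enrichment source."""
    doc_id: str
    fields: dict = field(default_factory=dict)  # field name -> list of terms


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


def build_enriched(raw_docs, annotations):
    """Join raw text with per-source annotations on the document id.

    raw_docs:    {doc_id: text}
    annotations: {source_name: {doc_id: [terms]}}  (hypothetical layout)
    """
    enriched = {}
    for doc_id, text in raw_docs.items():
        doc = EnrichedDocument(doc_id, {"text": tokenize(text)})
        for source, by_doc in annotations.items():
            # A document with no annotations from a source gets an empty field.
            doc.fields[source] = [t.lower() for t in by_doc.get(doc_id, [])]
        enriched[doc_id] = doc
    return enriched


def build_index(enriched):
    """Fielded inverted index: (field name, term) -> set of document ids."""
    index = defaultdict(set)
    for doc in enriched.values():
        for name, terms in doc.fields.items():
            for term in terms:
                index[(name, term)].add(doc.doc_id)
    return index


def search(index, term, field_names):
    """Return ids of documents containing the term in any listed field."""
    hits = set()
    for name in field_names:
        hits |= index.get((name, term.lower()), set())
    return hits


# Made-up sample data: two documents and two enrichment sources.
raw_docs = {
    "d1": "The senator spoke in Pittsburgh",
    "d2": "A new automobile factory opened",
}
annotations = {
    "entities": {"d1": ["Pittsburgh"]},
    "synonyms": {"d2": ["car", "plant"]},
}
index = build_index(build_enriched(raw_docs, annotations))
```

In this toy setup, search(index, "car", ["text"]) finds nothing because the word never appears in the raw text, while search(index, "car", ["text", "synonyms"]) matches d2 through the synonym field. Keeping each source in its own field lets a retrieval model decide how much to trust each one.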

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1265301
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2012-09-01
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2012
Total Cost: $236,111
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213