An effective document representation is a crucial component of text processing; without it, even the most sophisticated methods and models perform poorly. Current document representations, such as the bag of words or Markov n-gram models, ignore nearly all sequential information and focus instead on the histogram of words or short phrases. The proposed work develops sequential representations for documents that go beyond bag of words and Markov models and effectively capture a wide range of sequential information. The main idea behind these representations is to use smoothing techniques to transform the word sequence into smooth curves that represent sequential content through changes in the local word histogram. By varying the amount of smoothing, the proposed representations interpolate between different sequential resolutions and so capture sequential details at varying levels of granularity. The proposed work provides improved document analysis, including the classification, segmentation, and summarization of documents. Furthermore, it enables visualization of the sequential trends in documents, leading to the emergence of computer-assisted document browsing technology. In addition to computer experiments validating improved modeling accuracy, the project involves a series of user studies that demonstrate its wide applicability.
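The smoothing idea described above can be illustrated with a minimal sketch. It is not the project's implementation: the function name `local_histograms`, the choice of a Gaussian kernel, and the mapping of word positions onto the interval [0, 1] are all assumptions made here for illustration. Each word contributes to the histogram at a location in proportion to a kernel weight, so the histogram changes smoothly as the location moves through the document.

```python
import math

def local_histograms(words, positions, bandwidth):
    """Kernel-smoothed local word histograms (illustrative sketch).

    words: list of tokens in document order.
    positions: locations in [0, 1] at which to evaluate the histogram.
    bandwidth: Gaussian kernel width (assumed > 0); larger means more smoothing.
    """
    vocab = sorted(set(words))
    index = {w: j for j, w in enumerate(vocab)}
    n = len(words)
    # Assumption: word i sits at the midpoint of its slot in [0, 1].
    locs = [(i + 0.5) / n for i in range(n)]
    curves = []
    for mu in positions:
        # Gaussian weight of every word relative to the location mu.
        weights = [math.exp(-0.5 * ((t - mu) / bandwidth) ** 2) for t in locs]
        total = sum(weights)
        hist = [0.0] * len(vocab)
        for w, k in zip(words, weights):
            hist[index[w]] += k / total  # normalized so each histogram sums to 1
        curves.append(hist)
    return vocab, curves

# Toy document: the first half is all "a", the second half is all "b".
doc = "a a a a b b b b".split()
vocab, fine = local_histograms(doc, [0.1, 0.9], bandwidth=0.05)
_, coarse = local_histograms(doc, [0.1, 0.9], bandwidth=100.0)
```

With the small bandwidth, the histogram near the start of the toy document is dominated by "a" and the one near the end by "b", so the sequential structure is visible. With the very large bandwidth, every location yields nearly the same histogram, which approaches the ordinary bag of words for the whole document. Intermediate bandwidths give the intermediate resolutions the abstract refers to.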

Broader impacts include the development of visualization tools that will assist users in reading and browsing documents, potentially helping millions of people absorb textual information quickly and effectively. Other education components include assisting foreign language learning, strengthening the computational aspects of the statistics program at Purdue, and mentoring minority students.

www.stat.purdue.edu/~lebanon/research/projects/multiResDocuments/

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0906550
Program Officer: Sylvia J. Spengler
Project Start:
Project End:
Budget Start: 2008-09-09
Budget End: 2014-04-30
Support Year:
Fiscal Year: 2009
Total Cost: $405,458
Indirect Cost:

Name: Georgia Tech Research Corporation
Department:
Type:
DUNS #:
City: Atlanta
State: GA
Country: United States
Zip Code: 30332