Traditional approaches to document retrieval focus on conversion to electronic text followed by indexing of the text content. Recently, some work in the community has focused on indexing document image content directly. Such techniques break down when text content is limited or highly degraded. Work on document quality estimation will be extended beyond image quality to address structural quality, a factor that is important for determining whether traditional document processing operations will succeed. The team will then explore the effects of enhancement on classification and retrieval and extend existing work to adapt to changes in quality. The research is motivated by the need for analysts to deal with very large collections of image data. The traditional goal of converting all documents to an electronic form and applying traditional text analysis methods fails when dealing with heterogeneous collections and very noisy (possibly multilingual) content. The approach will allow document image retrieval systems to scale to orders of magnitude beyond current capabilities, and permit users to move beyond content features and use structural similarity to explore large collections. Through classification, users will be able to mine large collections for clusters of similar content without knowing a priori what the collection contains. The result will be adaptive techniques that can learn from small numbers of samples without knowledge of the sources of degradation.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1359902
Program Officer: Sylvia Spengler
Budget Start: 2013-10-01
Budget End: 2015-09-30
Fiscal Year: 2013
Total Cost: $234,225
Name: University of Maryland College Park
City: College Park
State: MD
Country: United States
Zip Code: 20742