EAGER: Large Scale Document Image Triage, Indexing and Retrieval

Doermann, David; Davis, Larry

Abstract

Structural similarity search and retrieval in document images that contain both printed and handwritten text remains a challenging problem, especially for collections that are noisy and heterogeneous. Current approaches generally convert documents to electronic form before filtering. This work introduces triage as a way to filter very large collections by structural similarity to known attributes, followed by new clustering with broader terms and hashing to extend the scale of the collections that can be considered. The work will provide new directions for document image retrieval, especially where structure and layout vary widely, and will be made scalable in cloud environments. A second approach to scaling, particularly for duplicate detection, will extend multi-level locality sensitive hashing and generalize it to other analysis, indexing and retrieval problems. The project will involve graduate students, and results and software will be made available under Creative Commons licensing to support replication and extension of the results.
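As a rough illustration of the hashing idea mentioned in the abstract, the sketch below shows a single-level, random-hyperplane form of locality sensitive hashing. It is not the project's multi-level scheme. The function names, the feature vectors and the choice of random hyperplanes are all assumptions made for this example. Vectors that point in similar directions tend to receive the same bit signature, so likely duplicates fall into the same bucket and only those candidates need a detailed comparison.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group document ids by signature (vectors maps doc id -> feature vector)."""
    buckets = defaultdict(list)
    for doc_id, vec in vectors.items():
        buckets[signature(vec, hyperplanes)].append(doc_id)
    return buckets


def candidates(query, hyperplanes, buckets):
    """Return ids that share the query's bucket (possible near-duplicates)."""
    return buckets.get(signature(query, hyperplanes), [])


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=16, dim=4, seed=1)
    # Hypothetical layout feature vectors for three document images.
    docs = {
        "scan_a": [1.0, 2.0, 0.5, 0.0],
        "scan_a_copy": [2.0, 4.0, 1.0, 0.0],   # same direction, different scale
        "scan_b": [-1.0, 0.3, -2.0, 4.0],
    }
    index = build_index(docs, planes)
    print(candidates(docs["scan_a"], planes, index))
```

Using more bits makes buckets more selective, while several independent tables raise the chance that true near-duplicates collide at least once. Both are tuning choices that this sketch leaves out.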

Project Report

Our research is motivated by the need to deal with very large collections of image data. The traditional goal of converting all documents to electronic form and applying traditional text analysis methods fails when dealing with heterogeneous collections and very noisy (possibly multilingual) content.

First, we present a general approach for document image classification using Convolutional Neural Networks (CNNs). A CNN is a kind of neural network that shares weights among neurons in the same layer. CNNs are good at discovering spatially local correlations because they enforce a local connectivity pattern between neurons of adjacent layers. With multiple layers and pooling between layers, CNNs automatically learn hierarchical layout features with tolerance to spatial translation, and by sharing weights they capture repeating patterns efficiently. We employ rectified linear units as activations and dropout to reduce overfitting. Experiments on real-world unconstrained datasets show that our approach is more effective than previous approaches.

Second, we addressed the problem of signature matching. The goal of signature matching is to identify signatures in large collections that look similar. Authentication and/or verification can be performed once the number of candidate signatures is more manageable. We model the signature matching problem using supervised latent Dirichlet allocation (sLDA). sLDA is a statistical model developed from latent Dirichlet allocation (LDA) and was originally used for labeling documents. Co-occurring observations are combined in latent distributions called topics, each with an unknown distribution over the vocabulary. The documents in a collection share a set of topics, and each document is represented by a specific mixture of those topics. The work is tested on the DS-I Tobacco dataset and the DS-II UMD dataset. Compared with previous work, we achieved high accuracy at high speed.

Finally, we addressed the problem of scene text detection. Text in natural scenes carries important semantic information. Localizing text aids scene understanding and is also relevant to a number of computer vision applications such as internet image indexing, mobile vision and low vision aids. We approach the text detection problem from an image partitioning perspective and propose a novel framework to detect multi-oriented scene text lines with less dependency on font or language. Similar elements in the image first form weak hypotheses of groups, and a fine clustering is then performed that considers the long-range interactions typically seen in text lines. Finally, a text/non-text classification is performed on each region of the clustering result. We compare with methods that aim at detecting multi-oriented and multi-language text. On a recently published dataset, our method generates promising results compared to state-of-the-art methods.
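The CNN ingredients named in the report (shared weights, local connectivity, rectified linear units, pooling and dropout) can be sketched in a few lines. This is a minimal pure-Python illustration under assumed inputs, not the project's network. The toy image, the edge kernel, the function names and the 2x2 pooling size are all assumptions chosen for the example.

```python
import random


def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image ('valid' region only).

    The same kernel weights are reused at every position (weight sharing),
    and each output depends only on a small patch (local connectivity).
    As in most deep learning libraries, this is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Rectified linear unit: keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling; gives tolerance to small shifts."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]


def dropout(fmap, rate, rng):
    """Inverted dropout (training time): zero units with probability `rate`
    and rescale the survivors so the expected activation is unchanged."""
    keep = 1.0 - rate
    return [[v / keep if rng.random() < keep else 0.0 for v in row]
            for row in fmap]


def forward(image, kernel, rate=0.0, rng=None):
    """conv -> ReLU -> max pool -> dropout for a single feature map."""
    rng = rng or random.Random(0)
    return dropout(max_pool2x2(relu(conv2d_valid(image, kernel))), rate, rng)


if __name__ == "__main__":
    edge_kernel = [[-1.0, 1.0], [-1.0, 1.0]]      # responds to a dark-to-light edge
    page = [[0.0, 0.0, 1.0, 1.0, 1.0]] * 5        # toy 5x5 "layout" with one edge
    shifted = [[0.0, 1.0, 1.0, 1.0, 1.0]] * 5     # same edge, moved one pixel left
    print(forward(page, edge_kernel))
    print(forward(shifted, edge_kernel))
```

In this toy case the edge moves by one pixel but stays inside the same pooling window, so both pooled maps are identical. That is the translation tolerance the report attributes to pooling. A real classifier would stack many such layers with learned kernels.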

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1262122
Program Officer: Sylvia Spengler

Project Start
Project End
Budget Start: 2012-10-01
Budget End: 2014-09-30
Support Year
Fiscal Year: 2012
Total Cost: $300,000
Indirect Cost

Institution

University of Maryland College Park, College Park, MD, United States
