RI: Small: Coordinating Language Modeling, Computer Vision, and Machine Learning for Dramatic Advances in Optical Character Recognition

Learned-Miller, Erik; McCallum, Andrew

Abstract

The goal of this research is to develop new methods for improving the performance of optical character recognition (OCR) systems. In particular, the PI investigates "iterative contextual modeling", an approach to OCR in which high-confidence recognitions of the easier portions of a document are used to develop document-specific models. These models can be related to appearance--for example, a sample of correctly recognized words can be used to build a model of the font in a particular document. The models can also be based on language and vocabulary information. For example, after a portion of the words in a document has been recognized, the general topic of the document may be detected, at which point the distribution over likely words in the document can be updated. The ability to tune character appearance distributions and language statistics to the document at hand is expected to produce significant increases in the quality of OCR results.
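As a rough illustration of the language-side adaptation described in the abstract, the sketch below mixes a general word distribution with the word frequencies observed among a document's high-confidence recognitions. The abstract does not say how the distribution is updated, and it speaks of topic detection rather than direct counting, so the linear interpolation, the function name `adapt_word_prior`, and the `weight` parameter are all assumptions made for this example.

```python
from collections import Counter


def adapt_word_prior(general_prior, confident_words, weight=0.5):
    """Blend a general word prior with document-specific word frequencies.

    Assumption: simple linear interpolation. This is an illustrative
    stand-in, not the update rule used in the project.
    """
    counts = Counter(confident_words)
    total = sum(counts.values())
    if total == 0:
        # No reliable words yet: fall back to the general prior unchanged.
        return dict(general_prior)
    vocabulary = set(general_prior) | set(counts)
    return {
        word: (1 - weight) * general_prior.get(word, 0.0)
        + weight * counts[word] / total
        for word in vocabulary
    }


# Toy numbers, invented for illustration only.
general = {"bank": 0.5, "river": 0.25, "loan": 0.25}
confident = ["loan", "loan", "bank", "interest"]
adapted = adapt_word_prior(general, confident)
```

In this toy case, words that appear among the confident recognitions ("loan", "interest") gain probability, while "river", which was not observed, loses some.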

Project Report

In this work, we have made a number of important steps forward for research in optical character recognition (OCR). In our original proposal, we made the point that while OCR has been a successful commercial application of pattern recognition for some time, its accuracy is still not sufficient for many applications. With the support of this grant, we have made five principal contributions to the field:

1) a method for assessing which outputs of an OCR system have a probability of error below some preset threshold, together with a method for using those high-confidence outputs to build an OCR system specifically adapted to a particular document, without any human intervention;
2) a method to do OCR in any alphabetic language, such as Russian, English, or Greek, given only an electronic dictionary for that language and no information about the appearance of any characters in that language;
3) a new model for the spelling of syllables that can be used to improve research in difficult OCR problems;
4) a new system for recognizing text in outdoor scenes; and
5) a new method for segmenting text in outdoor scenes based on computer graphics rendering.

1. Confidence assessment and a document-specific OCR system. In this work, we ask: can we set aside some of the high-confidence outputs of an OCR system and retrain the character models using that data from the current document? That is, can we use correctly recognized text to better model the appearance of words in a document with a unique visual appearance? To do this, we need to run a baseline OCR system and figure out which words are correct. Previously, there was no reliable way to assess the confidence in an OCR system's output. In a paper in the Journal of Machine Learning Research, we describe how to compute an upper bound on the probability that a word output by an OCR system is incorrect. Given such a bound, we can set aside words that have very low probability of error.
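The selection step just described can be sketched as follows. This is a minimal illustration, not the method from the paper: it assumes the per-word error bounds have already been computed, and the `Recognition` record, its fields, and the `max_error` threshold are hypothetical names introduced for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Recognition:
    text: str           # word hypothesized by the baseline OCR system
    glyphs: list        # one glyph image per character (any object here)
    error_bound: float  # upper bound on the probability that `text` is wrong


def select_reliable(recognitions, max_error=0.01):
    """Keep words whose error bound is at or below the preset threshold."""
    return [
        r for r in recognitions
        # Also require one glyph per character so labels align with images.
        if r.error_bound <= max_error and len(r.text) == len(r.glyphs)
    ]


def build_training_set(reliable):
    """Group glyph images by character label for document-specific training."""
    samples = defaultdict(list)
    for r in reliable:
        for char, glyph in zip(r.text, r.glyphs):
            samples[char].append(glyph)
    return dict(samples)


# Toy data, invented for illustration; strings stand in for glyph images.
words = [
    Recognition("the", ["g1", "g2", "g3"], 0.001),
    Recognition("tbe", ["g4", "g5", "g6"], 0.40),
    Recognition("he", ["g7", "g8"], 0.005),
]
training = build_training_set(select_reliable(words))
```

Only the two low-bound words contribute training samples here; the doubtful reading "tbe" is left out, so its mislabeled glyphs cannot contaminate the document-specific character models.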
Using these highly reliable words, we can train a new OCR system on the specific document of interest. We show that this leads to significant gains in accuracy over a generically trained OCR system.

2. Font-free OCR. When most people hear that there is an OCR system that is given no information about what particular characters look like, they think there has been some error in communication--it sounds impossible. But consider the problem of decoding a word in which each letter has been consistently replaced by a number, such as "01221221331". It turns out that the only English word that fits this pattern is "Mississippi". Hence, we can decode the word without knowing anything a priori about how each letter is represented. Such techniques are often called cryptogram techniques, since a cryptogram is a code in which each letter has been replaced by another symbol. Previously, cryptogram techniques had been applied only to very clean, high-resolution documents. In an invited IJDAR paper, we showed how to apply them to difficult-to-read documents in which even character segmentation is difficult. We significantly extended the purview of such methods, even outperforming Google's Greek OCR, despite the fact that our method used no training data. This method can be used to perform OCR in any alphabetic language in which the characters are separable. Thus it could be used for Russian, Greek, or Spanish, but not for Chinese (non-alphabetic) or Arabic (characters are not separable).

3, 4, 5. Our last three contributions (all presented at ICDAR 2013) address the problem of "scene text recognition", in which the goal is to recognize words in photos of the real world--street signs, restaurant signs, and movie marquees.
This problem is characterized by very difficult fonts (stylized by graphic designers), non-English words (since proper names are common), and very short texts (1-5 words) that make using a language model difficult or impossible. One contribution here was a syllable model that scores the pronounceability of a word. This is useful for ruling out unpronounceable guesses at words while retaining proper nouns. Another contribution was a segmentation method that can handle difficult lighting situations (such as shadows and blur) by finding parameters of a real-world environment that would produce the observed image. A final contribution was an end-to-end scene text recognition system that includes a detection component, a segmentation component, a recognition component, and a language model. These three contributions significantly improve the state of the art in the difficult scene text recognition problem.
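The cryptogram idea behind font-free OCR (contribution 2) can be illustrated with a small sketch. It reduces a symbol sequence to its repetition pattern and looks for dictionary words with the same pattern. The function names and the four-word dictionary are assumptions made for this example; the actual system also has to cope with noisy documents and uncertain character segmentation, which this sketch ignores.

```python
def repetition_pattern(symbols):
    """Map each distinct symbol to the order in which it first appears."""
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen)) for s in symbols)


def decode(cipher, dictionary):
    """Return dictionary words whose repetition pattern matches the cipher."""
    target = repetition_pattern(cipher)
    return [word for word in dictionary
            if repetition_pattern(word.lower()) == target]


# Toy dictionary, invented for illustration; a real one would be far larger.
WORDS = ["mississippi", "missionary", "banana", "tennessee"]
matches = decode("01221221331", WORDS)
```

With this toy dictionary, `matches` contains only "mississippi". No glyph appearance information is used at all: the pattern of repeated symbols is enough to identify the word.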

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0916555
Program Officer: Jie Yang

Project Start
Project End
Budget Start: 2009-09-01
Budget End: 2013-08-31
Support Year
Fiscal Year: 2009
Total Cost: $487,395
Indirect Cost

Institution: University of Massachusetts Amherst, Amherst, MA, United States
