Project Report

Much of our collective knowledge is still stored in books and other printed media. As more of this printed material is digitized and made available online, it becomes increasingly important to recognize the text in these images accurately so that the resulting collections can be searched and organized. My work focuses on accurately recognizing machine-printed text in scanned images of books or newspapers, a problem generally referred to as Optical Character Recognition (OCR). OCR remains difficult for documents that are noisy or not scanned at high resolution. Many current approaches rely on stored font models and are vulnerable when the document is noisy or printed in a font dissimilar to the fonts stored in the system. These problems are apparent in the historical newspaper dataset I am currently analyzing.

In previous work, I addressed these problems by learning character models directly from the document itself rather than using pre-stored font models. In particular, I first used an external OCR system to obtain an initial transcription of the document and then identified the words in that transcription believed to be correctly recognized with high confidence. In this way, I learned document-specific character models adapted to the noise level and font of the document. A limitation of this approach, however, is that we may not obtain models for all character classes. For my summer project, I aimed to overcome this limitation by adapting existing, pre-stored font models to cover the character classes for which no document-specific model is available. In my approach, I first used pre-stored font models to account for the missing character classes and then adapted them to the noise in the target document. I found that simple degradations such as Gaussian blurring are not enough to adapt these external font models sufficiently, which motivates a more general document-specific adaptation; for this I used a probabilistic model (a simplified illustration of this adaptation step is sketched below). This model can find a reasonable approximation of the global noise level in a document but cannot capture the finer details of individual characters. For example, I can obtain a reasonable model of a character such as "b" but not of one such as "s". Overall, using these adapted character models yields a small improvement over using the non-adapted models, but more work is needed to obtain better document-specific degradations. Based on my findings, I am confident that a model that better captures document-specific degradations can help improve OCR accuracy for noisy documents.

My work in document-specific modeling can be a low-cost, automatic way to improve OCR accuracy on this kind of data. By improving recognition accuracy, we can better search and organize these collections, enabling further study by others, so this work should also be of interest to researchers in Information Retrieval. In addition, my approach is not specific to machine-printed text and, without major modification, can also be applied to handwriting recognition. One potential benefit of this work is the ability to build OCR systems for languages in which no OCR systems currently exist. Building a conventional OCR system is an expensive, time-consuming effort that companies undertake only for popular languages with many potential users.
For many less popular languages, there is no incentive to create an OCR system, since there are few users and the cost is prohibitive. Our approach is not specific to English and is general enough to recognize text in other languages. We do require an initial transcription, but this can be obtained if the initial OCR system is trained on a "nearby" language or by having someone manually transcribe a handful of documents. Our approach benefits from more annotated documents, but a few at the outset may be enough to obtain a satisfactory OCR system for the language. The overall benefit to society is that texts in less popular languages can be converted into machine-readable form and then accessed by anyone online. In this way, we can also facilitate further study of, and raise awareness of, less popular languages.
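As a rough illustration of the blur-based adaptation step discussed above, the following Python sketch estimates a single global blur level by comparing clean font glyphs against glyphs cropped from words the initial OCR pass recognized with high confidence, and then applies that blur to a pre-stored template for a missing character class. The function names, the grid search over sigma, and the pixel-wise fit are illustrative assumptions for this sketch, not the probabilistic model used in the project; as noted above, a single global blur captures the overall noise level but not finer character-specific degradations.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_blur_sigma(clean_glyphs, observed_glyphs, sigmas=np.linspace(0.5, 3.0, 11)):
    """Pick the blur level that best explains the document's high-confidence glyphs.

    clean_glyphs / observed_glyphs: same-size float arrays in [0, 1], paired by
    character class (e.g. a clean 'e' template vs. an 'e' cropped from a word the
    initial OCR system recognized with high confidence).
    """
    best_sigma, best_err = None, np.inf
    for sigma in sigmas:
        err = 0.0
        for clean, obs in zip(clean_glyphs, observed_glyphs):
            degraded = gaussian_filter(clean, sigma=sigma)
            err += np.mean((degraded - obs) ** 2)  # simple pixel-wise fit
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma

def adapt_missing_class(clean_template, sigma):
    """Degrade a pre-stored font template so it better resembles the target document."""
    return gaussian_filter(clean_template, sigma=sigma)

# Toy demo with synthetic 16x16 glyphs standing in for real character images.
rng = np.random.default_rng(0)
clean = [rng.random((16, 16)) for _ in range(3)]
observed = [gaussian_filter(g, sigma=1.5) + 0.05 * rng.standard_normal((16, 16))
            for g in clean]

sigma = estimate_blur_sigma(clean, observed)
adapted = adapt_missing_class(clean[0], sigma)  # e.g. a class never seen in the document
print(f"estimated document blur sigma: {sigma:.2f}")
```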

Agency: National Science Foundation (NSF)
Institute: Office of International and Integrative Activities (IIA)
Application #: 1108152
Program Officer: Carter Kimsey
Budget Start: 2011-06-01
Budget End: 2012-05-31
Fiscal Year: 2011
Total Cost: $5,700
Name: Kae Andrew
City: Amherst
State: MA
Country: United States
Zip Code: 01002