The Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive are investigating large-scale information extraction and retrieval technologies for digitized book collections. To provide effective analysis and search for scholars and the general public, and to handle the diversity and scale of these collections, this project focuses on improvements in seven interlocking technologies: improved OCR accuracy through word spotting, creating probabilistic models using joint distributions of features, and building topic-specific language models across documents; structural metadata extraction, to mine headers, chapters, tables of contents, and indices; linguistic analysis and information extraction, to perform syntactic analysis and entity extraction on noisy OCR output; inferred document relational structure, to mine citations, quotations, translations, and paraphrases; latent topic modeling through time, to improve language modeling for OCR and retrieval, and to track the spread of ideas across periods and genres; query expansion for relevance models, to improve relevance in information retrieval by offline pre-processing of document comparisons; and interfaces for exploratory data analysis, to provide users of the document collection with efficient tools to update complex models of important entities, events, topics, and linguistic features. When applied across large corpora, these technologies reinforce each other: improved topic modeling enables more targeted language models for OCR; extracting structural metadata improves citation analysis; and entity extraction improves topic modeling and query expansion. The testbed for this project is the growing corpus of over one million open-access books from the Internet Archive.

Project Report

The major goals of this project were to develop tools, processes, and systems to analyze, mine, and make accessible large quantities of text.

Major Activities:
1. The Internet Archive supported our partner at UMass Amherst by providing access to metadata, page images, and other data needed for experiments and demonstration systems.
2. The Archive continued to digitize books and add them to the corpus during the grant period, and its engineering staff carried out QA, OCR, and partner support activities to facilitate the research within the grant project.
3. The Internet Archive promoted this project and has been a user of the beta search tools developed for it, which will enable users to better understand the collections of text in the Archive.
4. The Internet Archive made a number of compute nodes on its cluster available to UMass Amherst researchers.

Specific Objectives: Providing technical guidance on access to and use of the Archive's massive book collection, and project guidance in partnership with UMass Amherst and Tufts University. This includes use cases, evaluations of existing analytical software, and recommendations on the features and capabilities required to conduct analyses on large text collections.

Significant Results: The addition of books to the Archive's free, publicly accessible digital archive during the project period expands the content available to students, teachers, researchers (within and beyond this project), and the reading public in general. The tools developed by UMass Amherst will increase the user base for these books and enhance understanding of their contents. The Internet Archive and UMass Amherst have begun discussing how to move some of UMass Amherst's discoveries (language recognition, duplicate detection) into the Internet Archive's code.

This project provided June Goldsmith and Kristine Hanna with new project management skills, including partner communication and coordination, as well as technical knowledge relating to the types of digital files in the Archive's collections, the mechanisms for downloading and using the books, and the technical requirements of the partners in accessing and processing the books. The project also provided Alexis Rossi and Hank Bromley an opportunity to work with our colleagues to better understand their needs in working with our code base. The Internet Archive disseminated the results to communities of interest and will continue to promote this project with our partners.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0911018
Program Officer: Sylvia J. Spengler
Budget Start: 2009-10-01
Budget End: 2013-09-30
Fiscal Year: 2009
Total Cost: $305,200

Name: Internet Archive
City: San Francisco
State: CA
Country: United States
Zip Code: 94129