The Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive are investigating large-scale information extraction and retrieval technologies for digitized book collections. To provide effective analysis and search for scholars and the general public, and to handle the diversity and scale of these collections, this project focuses on improvements in seven interlocking technologies: improved OCR accuracy through word spotting, creating probabilistic models using joint distributions of features, and building topic-specific language models across documents; structural metadata extraction, to mine headers, chapters, tables of contents, and indices; linguistic analysis and information extraction, to perform syntactic analysis and entity extraction on noisy OCR output; inferred document relational structure, to mine citations, quotations, translations, and paraphrases; latent topic modeling through time, to improve language modeling for OCR and retrieval, and to track the spread of ideas across periods and genres; query expansion for relevance models, to improve relevance in information retrieval by offline pre-processing of document comparisons; and interfaces for exploratory data analysis, to provide users of the document collection with efficient tools to update complex models of important entities, events, topics, and linguistic features. When applied across large corpora, these technologies reinforce each other: improved topic modeling enables more targeted language models for OCR; extracting structural metadata improves citation analysis; and entity extraction improves topic modeling and query expansion. The testbed for this project is the growing corpus of over one million open-access books from the Internet Archive.
Our work under the NSF 'Mining a Million Books' grant has focused on two complementary tasks, led by two different researchers. In years 1 and 2, we focused primarily on working with raw OCR-generated text at scale. In years 3, 4 and 5, we focused on integrating the generation of crucial, but very expensive, linguistic data from large corpora with language learning.
In the first two years, we focused on discovering linguistic information in the million-book collection of the Internet Archive. This work involved six levels of analysis: mining a corpus of 27,000 manually/automatically dated historical Latin texts drawn from a much larger collection of 1.2 million books in multiple languages; enhancing the metadata and enabling historical research by identifying the date of composition (as opposed to the print date) for a subset of those 27,000 books; analyzing changing lexical trends in Latin over that historical period; manually identifying a set of parallel Latin-English texts in the 1.2 million book collection to create a sense inventory and labeled training instances for automatic word sense disambiguation; using that trained model to automatically tag the word senses for all of the Latin words in the 27,000-work collection; and using that dated, sense-tagged collection to discover variation in Latin and English word senses over time.
In the second phase, we focused on the challenge of automatically producing linguistic data. The study of language in general, and especially of historical languages, must draw upon annotated corpora in which every word has one or more linguistic functions labeled. Morpho-syntactic annotation captures the form of individual words and their function. Nonetheless, automated syntactic analysis is still imperfect, and human annotators can be significantly more accurate. At the same time, syntactic analysis is challenging for human analysts as well. This makes syntactic analysis expensive in the best of cases.
When we work with historical languages, for which by definition no native speakers are available, the difficulties and costs rise accordingly. On the other hand, the ability to analyze syntax correlates well with the ability to translate a language, suggesting that the practice of syntactic annotation can help students perfect their knowledge of the language while they contribute to the development of new linguistic resources.
One major goal was to compare the errors made by students with the errors made by machines when dependency parsing Ancient Greek. A dependency graph is a tree in which every word is represented as a node and the directed arcs between the nodes show the syntactic modifiers of the words. Dependency parsing is the task of generating the dependency graph of a sentence. Data-driven dependency parsing techniques use an annotated corpus and learn to generate dependency graphs automatically from it. To make the comparison, we first studied the errors made by the students and by the different parsers, and then compared those errors. Twenty-two students from an advanced Greek course at Tufts University participated in this research. Annotations were made according to the guidelines of the Perseus Project's Ancient Greek Dependency Treebank (AGDT). The texts chosen for the experiment were the Iliad and the Odyssey of Homer from the AGDT, which provides gold standard annotations for a number of Ancient Greek texts. For machine annotation, we used a group of three different state-of-the-art dependency parsers. Our experiments show that the human annotators and the parsers made very similar errors, and both produced very similar frequencies of the different error types. What is hard for the students is hard for the parsers as well.
Future Work
The Open Philology Project at Leipzig, led by PI Crane, is carrying forward both threads of the work begun in this project.
The Open Greek and Latin subproject is building on the enhanced OCR, lexical trend analysis, and other scalable tasks from the first phase of work. The E-learning subproject builds upon the second phase. The overlap between the mistakes of automated parsers and those of students suggests that we can combine two normally distinct tasks: teaching students a historical language and improving the linguistic data about that language. By having students pay particular attention to the annotations that pose the greatest challenge to them, and by training them with texts for which gold standard annotations exist, we can help them improve their skills. By allowing multiple students to annotate sentences for which no gold standard yet exists, having the students and their instructors review the results, and then publishing their analyses with their names attached, we can begin to integrate data production with learning, providing a new motivation for learners. The shared human/machine errors thus help us develop a strategy to generate the linguistic metadata needed for growing collections of Classical Greek, Latin and other languages.
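As an illustration of the dependency graphs and the student/parser error comparison described above, the following is a minimal sketch, not the project's actual tooling. It assumes each sentence is stored as a list of (form, head index, relation) triples with head index 0 marking the root. The toy English sentence, the relation labels, and the function names `is_tree`, `attachment_score`, and `errors_by_gold_relation` are all hypothetical and are not taken from the AGDT data or from the parsers used in the study.

```python
from collections import Counter

# Each token is (form, head_index, relation); tokens are numbered from 1,
# and head_index 0 marks the root. Toy data with illustrative labels only.
GOLD = [("the", 2, "ATR"), ("poet", 3, "SBJ"), ("sings", 0, "PRED"), ("wrath", 3, "OBJ")]
STUDENT = [("the", 2, "ATR"), ("poet", 3, "SBJ"), ("sings", 0, "PRED"), ("wrath", 2, "ATR")]
PARSER = [("the", 2, "ATR"), ("poet", 3, "OBJ"), ("sings", 0, "PRED"), ("wrath", 2, "ATR")]


def is_tree(tokens):
    """Return True if there is exactly one root and every token reaches it without a cycle."""
    heads = [head for _, head, _ in tokens]
    if heads.count(0) != 1:
        return False
    for start in range(1, len(tokens) + 1):
        seen = set()
        node = start
        while node != 0:
            # A repeated node means a cycle; an out-of-range head is invalid.
            if node in seen or not (1 <= node <= len(tokens)):
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def attachment_score(gold, predicted):
    """Fraction of tokens whose predicted head matches the gold head (unlabeled)."""
    correct = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    return correct / len(gold)


def errors_by_gold_relation(gold, predicted):
    """Count tokens with a wrong head or a wrong relation, keyed by the gold relation."""
    errors = Counter()
    for (_, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if g_head != p_head or g_rel != p_rel:
            errors[g_rel] += 1
    return errors


if __name__ == "__main__":
    student_errors = errors_by_gold_relation(GOLD, STUDENT)
    parser_errors = errors_by_gold_relation(GOLD, PARSER)
    print("student score:", attachment_score(GOLD, STUDENT))
    print("parser score:", attachment_score(GOLD, PARSER))
    # Relations that both the student and the parser got wrong.
    print("shared error types:", sorted(set(student_errors) & set(parser_errors)))
```

Keying the error counts by the gold relation is one simple way to line up the two error distributions; a real comparison over a treebank would aggregate these counts across many sentences and annotators before comparing frequencies.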