This project, supported by an EArly-concept Grant for Exploratory Research (EAGER), is developing computational models of how manuscripts of premodern texts changed over time through copying errors, intentional editing, and translation into other languages. The purpose of these models is to reconstruct the original texts and to better understand the forces that shaped them. We are building on work that applies ideas from computational evolutionary biology to this task, but the main focus of the project is to explore whether recent ideas from computational linguistics and natural language processing are better suited to modeling the evolution of natural-language texts. In particular, we are exploring the use of techniques from nonprojective dependency parsing to model the tree of relationships among manuscripts, and of statistical machine translation to model the relationship between pairs of manuscripts.

The tools that result from the project will be made publicly available in order to foster cross-disciplinary research. These tools will enable scholars of ancient and medieval literature to use our models to analyze collections of manuscripts that may not have been feasible to analyze by hand. The techniques explored will also shed light on computationally hard learning and search problems, such as those that frequently arise in natural language processing.

Project Report

This project, supported by an EArly-concept Grant for Exploratory Research (EAGER), has been developing computational models of how manuscripts of premodern texts changed over time through copying errors, intentional editing, and translation into other languages. The purpose of these models is to reconstruct the original texts and to better understand the forces that shaped them. Previous work in this area took inspiration from evolutionary biology, treating texts as genes; the main focus of this project has been to explore whether recent ideas from computational linguistics and natural language processing are better suited to modeling the evolution of natural-language texts.

We developed two models during the course of this exploratory project. The first applied a technique called Structural EM (invented by Nir Friedman of Hebrew University) to this task, and achieved the best results we know of on a dataset created for the Computer Assisted Stemmatology Challenge (by Teemu Roos and Tuomas Heikkilä of the University of Helsinki). The second model was a Bayesian approach, trained using Monte Carlo techniques. Although this model occasionally performed excellently, it was not as consistent as the first approach.

We extended the second approach to model word-level and sound-level changes simultaneously. For example, the word 'dog' could plausibly change into 'hound' (because it has a similar meaning) or 'dock' (because it has a similar sound). This multi-level structure is a special property of human language and does not have an exact analogue in genetics. Unfortunately, time and resource limitations have so far prevented our initial experiments with this more sophisticated model from yielding results.
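The two kinds of change in the 'dog' example can be illustrated with a toy sketch. This is not the project's Bayesian model: it assumes a hand-written synonym table (`SYNONYMS`) as a stand-in for similarity of meaning, and uses spelling edit distance as a crude proxy for similarity of sound. The function names and the threshold `max_form_distance` are hypothetical and chosen only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical synonym table (an assumption, not data from the project).
SYNONYMS = {frozenset({"dog", "hound"})}


def classify_change(old: str, new: str, max_form_distance: int = 2) -> str:
    """Label a substitution as a word-level (meaning) change or a
    sound-level (form) change, using the two toy proxies above."""
    if old == new:
        return "unchanged"
    if frozenset({old, new}) in SYNONYMS:
        return "word-level"
    if edit_distance(old, new) <= max_form_distance:
        return "sound-level"
    return "unexplained"


print(classify_change("dog", "hound"))  # word-level
print(classify_change("dog", "dock"))   # sound-level
```

A probabilistic model would assign a likelihood to each kind of change rather than a hard label, but the sketch shows why a single substitution cost, as used for genetic sequences, cannot capture both cases at once.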

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1011778
Program Officer: Tatiana Korelsky
Budget Start: 2010-02-01
Budget End: 2012-01-31
Fiscal Year: 2010
Total Cost: $75,000

Name: University of Southern California
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089