One of the oldest problems in linguistics is to reconstruct ancient protolanguages on the basis of their modern descendants. Identifying ancestral word forms makes it possible to evaluate proposals about the nature of language change and to draw inferences about human prehistory. Currently, linguists painstakingly reconstruct protolanguages by hand, using knowledge of the relationships between languages and the plausibility of sound changes. This research project develops statistical, computational methods that automate or augment the reconstruction process. Unlike past computational approaches, these new models use detailed phonological representations to infer hidden sound changes. Moreover, they automatically infer which words are co-descendent (cognates).

These advances, combined with new algorithms for large-scale statistical inference, enable the analysis of orders of magnitude more data than prior work. The models from this project significantly expand the computational tools available to linguists; large-scale reconstructions make it possible to collect quantitative data to help answer long-standing questions about language change. Beyond word reconstruction, the models and tools from this project will be useful for related applications. One is machine translation, where reconstructions can be used to fill gaps in the mapping between the vocabularies of different languages. Another is the alignment of biological sequences, which requires determining which regions in those sequences are co-descendent. In addition, the technical advances in probabilistic modeling and approximate inference methods will have cross-cutting implications for a range of modeling problems in computational linguistics, bioinformatics, statistics, machine learning, and cognitive science.

Project Report

The languages we speak today, such as English, Spanish, and Mandarin, are the result of thousands of years of small changes, accumulated as people learn the elements of language from one another. The words we use, and how we put those words together into sentences, change across generations, enough that you would find it hard to understand somebody speaking the English of a thousand years ago. This research project explored whether these patterns of linguistic change could be modeled by computers, and whether those models could make it possible to reconstruct the forms of the languages spoken by our ancestors.

Language reconstruction is a challenging problem, usually tackled by linguists through painstaking manual effort. Automating this process could provide insight into ancient languages at a larger scale than manual reconstruction allows: looking at more languages, and potentially even looking further back in time. Using computer models of language change, this project was able to reconstruct the words used in a large family of ancient languages, the ancestors of the languages spoken on islands in the Pacific Ocean. The automatic reconstructions showed strong agreement with the manual reconstructions provided by linguists. Being able to reconstruct so many languages simultaneously also made it possible to answer a key question about language change: which sounds are most likely to change? Analyzing the reconstructed languages showed that sounds that are less important in discriminating between words in a language are most likely to change, confirming a hypothesis first offered fifty years ago.

A similar technique was used to reconstruct the order in which words appear in languages. In English, we would say "dog bites man"; putting these words in another order would either result in nonsense or change the meaning of the sentence. But other languages put words in different orders. Looking across modern languages and reconstructing the word order used in their ancestors revealed a clear pattern, with structures closer to "dog man bites" being used to express the same meaning in the ancestors of many modern languages. Again, the computer models could be used to provide quantitative answers to questions about language change, such as how much information modern languages really provide about the word order of ancient languages (and how this changes the further back into the past we go).

This research has resulted in new tools that linguists can use to complement their manual analyses and extend those analyses to a larger scale, as well as new insights about the theoretical nature of language change that are relevant to psychologists, anthropologists, and other cognitive scientists. Carrying out this work provided three young researchers with a uniquely interdisciplinary training experience, including education in linguistics as well as the latest methods from computer science.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1018733
Program Officer: Tatiana Korelsky
Project Start:
Project End:
Budget Start: 2010-08-15
Budget End: 2014-07-31
Support Year:
Fiscal Year: 2010
Total Cost: $460,143
Indirect Cost:

Name: University of California Berkeley
Department:
Type:
DUNS #:
City: Berkeley
State: CA
Country: United States
Zip Code: 94710