This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
This project takes on two problems: (1) deciphering ancient texts using computers, and (2) training automated language translation systems without using parallel texts. Statistical language processing software has played little role to date in the analysis of ancient texts, where data is limited and human intuition has so far ruled. Data for automated language translation is more plentiful, and research has made great strides in the 21st century. However, researchers remain heavily dependent on training with large parallel texts, which are scarce for many of the languages and domains for which people need automated translation.
The project develops unsupervised methods that compensate for the lack of parallel data, using alternative sources of linguistic knowledge. For ancient languages, these sources include known languages as decipherment targets, capitalizing on tight connections within a language family. In translation, large quantities of untranslated data are exploited to induce strong bilingual connections. Formulating these tasks in a decipherment framework brings powerful cryptographic theory and algorithms to bear. Such theory also helps estimate the translation accuracy that can be expected from fixed data resources and gauge whether a lost language is decipherable given a fixed amount of script.
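One way to make the last point concrete is Shannon's unicity distance from classical cryptography, which estimates how much ciphertext is needed before a cipher has, in principle, a unique solution. The sketch below is only an illustration under strong assumptions, not the project's method: it treats the unknown script as a simple one-to-one substitution over a known alphabet, and both the function name `unicity_distance` and the figure of 1.5 bits of entropy per character are assumed for the example.

```python
import math


def unicity_distance(alphabet_size: int, entropy_per_char: float) -> float:
    """Estimate how many symbols of text are needed for a unique solution.

    Assumes a simple substitution cipher, so the key space is every
    permutation of the alphabet (alphabet_size! keys).
    entropy_per_char is the assumed entropy of the underlying language
    in bits per character; it is an illustrative input, not a measurement.
    """
    # Key entropy in bits: log2 of the number of possible permutations.
    key_entropy = math.log2(math.factorial(alphabet_size))
    # Redundancy per character: maximum possible entropy minus actual entropy.
    redundancy = math.log2(alphabet_size) - entropy_per_char
    if redundancy <= 0:
        raise ValueError("language has no redundancy; no unique solution exists")
    return key_entropy / redundancy


if __name__ == "__main__":
    # Hypothetical case: 26-symbol alphabet, assumed 1.5 bits per character.
    needed = unicity_distance(26, 1.5)
    print(f"Estimated symbols needed: {needed:.1f}")
```

Under these assumptions, an inscription much shorter than the estimate cannot be pinned down by statistics alone, while a longer one may be. Real scripts complicate the picture with unknown symbol inventories, noisy transcription, and a mapping between languages that is not one-to-one, so the figure is at best a rough lower bound.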
Computational analysis of ancient scripts offers a better understanding of ancient cultures, and unsupervised techniques uncover connections between languages that are of great interest to historical linguists. Applying such techniques to automated language translation offers the chance to make translation available to the population at large for many more language pairs and domains.