In the last 50 years, computational linguistics research has touched barely 1% of the world's languages. In 100 years, 90% of them will be extinct or nearly so. What can computational linguistics offer to support the urgent task of documenting and analyzing the world's endangered languages? Based on the observation that bilingual parallel text is both the primary artifact collected in documentary linguistics and the primary object of statistical translation models, this project explores the use of machine translation to accelerate the global language documentation effort. Specifically, it develops novel ways to model any number of related languages simultaneously, pooling information from all the languages to make stronger inferences about each. In order to exploit language relationships, it explores methods that simultaneously model phonological, morphological, lexical, and syntactic phenomena. In addition, it develops algorithms to standardize highly variable transcription practices.
These technologies, which will be field-tested in the Eastern Highlands of Papua New Guinea, are designed to enable speakers of endangered languages who have no specialized linguistic training to create large collections of translated oral literature, providing an authentic and interpretable record of their language, serving current and future generations of scholars, teachers, and learners. They will do so, moreover, at much less cost than is needed to support the efforts of trained linguists and ethnographers to create such collections.
In the last 50 years, computational linguistics research has touched barely 1% of the world's languages. In 100 years, 90% of them will be extinct or nearly so. What can computational linguistics offer to support the urgent task of documenting and analyzing the world's endangered languages? Based on the observation that bilingual parallel text is both the primary artifact collected in documentary linguistics and the primary object of statistical translation models, this project explored the possibility of using machine translation to accelerate the global language documentation effort. It took the first steps towards enabling speakers of endangered languages who have no specialized linguistic training to create large collections of translated oral literature, providing an authentic and interpretable record of their language, serving current and future generations of scholars, teachers, and learners. We worked in the Eastern Highlands Province of Papua New Guinea, a country renowned for the great number and diversity of its languages. Our investigation was carried out in the context of organizing the International Workshop on Language Preservation, a two-week training course held at the University of Goroka in May 2012. The goal of the workshop was to provide hands-on training in digital technologies for language preservation, and to create archival documentation for several local languages. In all, we collected about 20,000 words of source text, of which about 16,000 were translated into another language (mostly English, with some into Tok Pisin and some into Alekano). The distribution of data across languages was unsurprisingly skewed (see accompanying image): most data was in Alekano (gah), which is the primary language in the Goroka area, and in Tokano (zuh), which is also spoken nearby and is closely related to Alekano.
We plan to release data in all languages to the public via the Language Commons (languagecommons.org), and we hope that this data will be of interest to researchers in both linguistics and natural language processing. This is a significant amount of data for a few weeks of work, but we want to be able to document languages much faster than this. One of the major bottlenecks was the keyboard-and-mouse user interface; for the future, we believe that speech-based interfaces will be much more efficient. The project succeeded in (1) mobilizing endangered language communities to take part in language documentation efforts; (2) collecting a modest amount of parallel text in nearly 20 local languages, some of which are highly endangered; and (3) providing insights into how to streamline the data collection process.