The purpose of this pilot project is to demonstrate the feasibility of a new approach to documenting endangered languages.

To allow wide-ranging investigation of a language even after it is no longer spoken, we need the equivalent of the million words of extant biblical Hebrew texts, or the five million words of extant classical Latin. But for endangered languages without a significant culture of literacy, diverse text collections on this scale seem out of reach.

Given typical speaking rates of about 10,000 word-equivalents per hour, a hundred hours of recorded speech -- conversations, narratives, or oral histories -- would give us the equivalent of a million words of text. With community involvement, hundreds of hours of such recordings are easily within reach.

However, transcribing such large audio collections is a daunting task: few native speakers are literate, and transcription is time-consuming, taking as much as 200 hours of work for every hour of audio. We propose to solve this problem by substituting re-speaking and verbal translation for transcription: one or more native speakers repeat each phrase of a recording, speaking slowly and carefully, and then translate it into a better-documented language.
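The figures quoted above can be checked with a little arithmetic. The sketch below uses only the two rates given in the text (10,000 word-equivalents per hour of speech, and up to 200 hours of work per hour of audio). The function names and the 2,000-hour working year are illustrative assumptions, not part of the proposal.

```python
# Rates taken from the text above.
WORDS_PER_AUDIO_HOUR = 10_000       # typical speaking rate, in word-equivalents
WORK_HOURS_PER_AUDIO_HOUR = 200     # upper-end transcription cost

# Assumption for illustration only: one person-year is about 2,000 working hours.
WORK_HOURS_PER_PERSON_YEAR = 2_000


def corpus_words(audio_hours: float) -> int:
    """Approximate size of the recorded corpus in word-equivalents."""
    return round(audio_hours * WORDS_PER_AUDIO_HOUR)


def transcription_work_hours(audio_hours: float) -> float:
    """Approximate effort needed to transcribe the corpus conventionally."""
    return audio_hours * WORK_HOURS_PER_AUDIO_HOUR


def transcription_person_years(audio_hours: float) -> float:
    """The same effort, expressed in (assumed) 2,000-hour person-years."""
    return transcription_work_hours(audio_hours) / WORK_HOURS_PER_PERSON_YEAR


if __name__ == "__main__":
    hours = 100
    print(corpus_words(hours))                # word-equivalents in the corpus
    print(transcription_work_hours(hours))    # hours of transcription work
    print(transcription_person_years(hours))  # person-years, under the assumption above
```

Under these rates, a hundred hours of audio corresponds to a million word-equivalents but to as much as 20,000 hours of conventional transcription work, which is the gap that re-speaking is meant to close.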

The utility of translated passages as a way to analyze otherwise-unknown languages has been demonstrated many times, starting with the Rosetta Stone. This aspect of our task is easier, since at least a grammatical sketch will in general be available.

Our goal in this project is to demonstrate the utility of re-speaking. We believe that linguists, starting out with relatively little knowledge of a language, can produce phonetic transcriptions that will be good enough to support subsequent analysis resulting in coherent texts, in a process analogous to (but easier than) the process that allowed previous generations of scholars to learn to read ancient Egyptian or Sumerian.

Project Report

Thousands of the world's languages are not adequately documented, and they are falling out of use faster than linguists can record and transcribe them. This project is developing mobile phone software for recording, respeaking, and oral translation, so that local linguistic communities can create interpretable documentation for their languages. The software prototype, currently available for Android phones, is called Aikuma; it recently won the Open Source Software World Challenge Grand Prize 2013. A field test is currently underway in Nepal, following previous rounds of testing and development in Papua New Guinea and Brazil. Laboratory experimentation has demonstrated that the audio collected by the phones is of sufficient quality to support instrumental phonetic investigation.

Aikuma avoids the usual transcription bottleneck, which prevents linguists from transcribing more than an hour or so of recordings for any language studied. Instead, we rely on a protocol known as "careful respeaking", in which someone listens to a previously made recording and carefully repeats what was said, phrase by phrase. Aikuma lets the user start respeaking at any point during playback, records what is said, and aligns it with the original source. Oral translation works in the same way. As a result, each source is associated with additional recordings that future linguists (working from the archive) can use for transcription and translation, even once no speakers of the language remain.

In ongoing work we are adding support for web-based transcription of audio sources with time-aligned respeakings and translations, and investigating issues of informed consent for audio file sharing among communities with limited exposure to digital preservation.
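One way to picture the alignment between a source recording and its respeaking or oral translation is as a list of paired time spans. The following is a minimal sketch under that assumption; it is not Aikuma's actual data format, and the names `AlignedSegment`, `find_segment`, and `uncovered_source_time` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlignedSegment:
    """One phrase of the source paired with its respoken (or translated) version.

    All times are in seconds from the start of the respective recording.
    """
    source_start: float
    source_end: float
    derived_start: float  # position in the respeaking or translation recording
    derived_end: float


def find_segment(segments: List[AlignedSegment],
                 source_time: float) -> Optional[AlignedSegment]:
    """Return the segment covering a given moment of the source, if any."""
    for seg in segments:
        if seg.source_start <= source_time < seg.source_end:
            return seg
    return None


def uncovered_source_time(segments: List[AlignedSegment],
                          source_duration: float) -> float:
    """Seconds of source audio not yet respoken (assumes non-overlapping segments)."""
    covered = sum(seg.source_end - seg.source_start for seg in segments)
    return source_duration - covered


if __name__ == "__main__":
    # Hypothetical alignment: two phrases of a 10-second source have been respoken.
    respeaking = [
        AlignedSegment(0.0, 2.5, 0.0, 4.0),
        AlignedSegment(2.5, 6.0, 4.0, 9.5),
    ]
    print(find_segment(respeaking, 3.0))
    print(uncovered_source_time(respeaking, 10.0))
```

With a structure of this kind, someone working from the archive could jump from any point in the original recording to the slower, clearer respoken phrase, and could see how much of a source still lacks a respeaking.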

Agency: National Science Foundation (NSF)
Institute: Division of Behavioral and Cognitive Sciences (BCS)
Type: Standard Grant (Standard)
Application #: 1160639
Program Officer: Shobhana Chelliah
Project Start:
Project End:
Budget Start: 2012-07-01
Budget End: 2014-12-31
Support Year:
Fiscal Year: 2011
Total Cost: $101,501
Indirect Cost:

Name: University of Pennsylvania
Department:
Type:
DUNS #:
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104