Linguists studying endangered languages in the field often ask their informants to describe pictures that illustrate particular characteristics of their language, such as how it uses pronouns or spatial relations or concepts of time. This enables the field linguist to obtain natural language with minimal instruction, exercising minimal influence on what is said, so that accurate information about the language can be recorded. Typically such pictures are prepared in advance, based upon research hypotheses about the language -- but often new hypotheses emerge in the course of sessions with informants. Researchers at Columbia University have proposed to develop an aid to field linguists which makes it possible to test new research questions as they arise in the field. They will adapt existing text-to-scene generation software, WordsEye, which allows users to create 3D scenes from simple English input, to produce a novel tool for fieldwork called WELT, the WordsEye Linguistics Tool.

WELT will ultimately have two modes of operation: 1) in Phase 1, English input will automatically generate a picture which can be used to elicit a targeted description; 2) in Phase 2, input in the target language will automatically generate a picture representing the meaning of the input, to verify linguistic hypotheses with native speakers.

While WELT is intended ultimately for general use, it will initially be developed to study Arrernte, an endangered language spoken by ~6000-8000 Arrernte people in Central Australia. While some aspects of this language are well documented, a number of idiosyncratic lexical and morphological features of the language that relate to describing spatial relations are not well understood. Such features are interesting because they relate directly to how a language is used by its speakers to describe the way they perceive the world. The language group's remote location and insular culture have made it difficult to document by traditional means, so tools such as WELT should be particularly useful. WELT will be tested in the field as part of an existing cooperation with Dr. Mark Dras and other researchers at Macquarie University, Sydney, Australia. The Division of Information & Intelligent Systems of the Directorate for Computer & Information Science & Engineering is [co-]funding this award as part of its commitment to support the development of computational tools and methods for the documentation of endangered languages.

Project Report

This work focused on developing the WordsEye Linguistics Tools (WELT), a novel toolset for field linguists studying endangered languages. WELT is based on WordsEye, an existing text-to-scene system for English, and includes tools for eliciting and documenting language data. While other fieldwork tools do exist, our work differs from these in that it will allow a linguist to create custom elicitation materials that can easily be modified in real time, to document the semantics of an endangered language, and to create a text-to-scene system that generates 3D scenes from endangered language input. We created the elicitation tool, which allows users to organize elicitation sessions around sets of 3D scenes they have created in WordsEye. While eliciting descriptions of these scenes from a native speaker of an endangered language, the field worker can easily modify scenes and create new ones in response to the data collected. The tool also provides the means to record audio and to type transcriptions, glosses, and notes. We used WordsEye to create about 40 scenes representing different spatial relations (based on the Max Planck topological relations picture series), and elicited Nahuatl descriptions of them from a native speaker informant. In preparation for eliciting data from native speakers of Arrernte, an Aboriginal Australian language, we created custom content for WordsEye that is particularly relevant to the culture and geographic location of the Arrernte people. We created the tool for writing syntax-to-semantics rules, which map syntactic structures of a language into a semantic form compatible with WordsEye. We used this tool to create sample syntax-to-semantics rules for Arrernte. By combining these rules with an existing Arrernte morphological analyzer and syntactic grammar, we built a demo pipeline that takes Arrernte text as input and generates a scene representing its meaning.
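To make the idea of a syntax-to-semantics rule more concrete, the following is a minimal sketch in Python. It is not WELT's actual rule format, which this report does not describe. The `Dep` and `SpatialRelation` types, the `RULES` table, the relation labels (`nsubj`, `pobj`), and the semantic names such as `ON_TOP_OF` are all hypothetical, and English words stand in for target-language lemmas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dep:
    """One syntactic dependency: head word, relation label, dependent word."""
    head: str
    relation: str
    dependent: str


@dataclass(frozen=True)
class SpatialRelation:
    """A scene-level semantic frame: the figure stands in `relation` to the ground."""
    relation: str
    figure: str
    ground: str


# Hypothetical rule table: a spatial predicate lemma maps to a semantic relation name.
RULES = {"on": "ON_TOP_OF", "under": "BENEATH", "in": "CONTAINED_IN"}


def apply_rules(deps):
    """Turn each (subject, spatial predicate, object) pattern into a SpatialRelation."""
    # Collect the arguments of every head word, keyed by relation label.
    args_by_head = {}
    for dep in deps:
        args_by_head.setdefault(dep.head, {})[dep.relation] = dep.dependent

    frames = []
    for head, args in args_by_head.items():
        # A rule fires only if the head is a known predicate with both arguments.
        if head in RULES and "nsubj" in args and "pobj" in args:
            frames.append(SpatialRelation(RULES[head], args["nsubj"], args["pobj"]))
    return frames


# Illustrative parse of "the cat is on the mat" (assumed analysis, not real output).
parse = [Dep("on", "nsubj", "cat"), Dep("on", "pobj", "mat")]
print(apply_rules(parse))
```

In a pipeline of the kind described above, a morphological analyzer and syntactic grammar would produce the parse, and frames like these would be handed to the scene generator. Those stages are assumed here and not shown.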
We also modified the underlying semantics of WordsEye so that the resulting scenes would reflect Arrernte culture rather than American culture. This included changing the default backdrops of scenes to reflect landscapes typical of Central Australia and replacing objects associated with American football with their Australian footy equivalents. Instead of requiring field linguists to build a computational grammar from scratch for each language, we began investigating ways that WELT could learn a syntactic parser from annotated examples. We created a corpus of English descriptions of spatial and motion events and labeled these with syntactic dependency relations; for five other languages, we used previously annotated corpora in other genres. We implemented an algorithm for learning a dependency parser and ran experiments to see (a) how successfully we could learn a parser from limited examples and (b) which machine learning techniques were most successful. We found that this method performed very well across all six languages, outperforming the baseline even when trained on fewer than ten labeled sentences. Of the machine learning classifiers, we found that support vector machines performed the best overall, but there were some cross-linguistic differences that will need to be investigated further. Our ultimate goal is to help linguists studying endangered languages to document and analyze these languages more easily using our computational tools. Saving endangered languages not only helps to preserve people's cultures but also gives new generations ways to explore the languages of their ancestors and motivates them to learn and use those languages.
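The parser-learning experiments described above compare a learned parser against a baseline, but the report names neither the evaluation metric nor the baseline. As an illustration only, the sketch below assumes a common choice for dependency parsing: unlabeled attachment score (the fraction of tokens whose predicted head matches the gold head), together with a hypothetical baseline that attaches every token to the token before it. The function names and the example sentence are assumptions, not part of the project's code or data.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted head index equals the gold head index.

    Each argument is a list of sentences; each sentence is a list of head
    indices, one per token, using 1-based token positions and 0 for the root.
    """
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted must contain the same number of sentences")

    total = 0
    correct = 0
    for gold, predicted in zip(gold_heads, predicted_heads):
        if len(gold) != len(predicted):
            raise ValueError("sentence lengths differ between gold and predicted")
        total += len(gold)
        correct += sum(g == p for g, p in zip(gold, predicted))
    return correct / total if total else 0.0


def left_attach_baseline(sentence_length):
    """Hypothetical baseline: token i attaches to token i-1; the first token is the root."""
    return list(range(sentence_length))


# Assumed gold analysis of "cat sleeps soundly": "sleeps" is the root,
# and both other words attach to it.
gold = [[2, 0, 2]]
baseline = [left_attach_baseline(3)]
print(unlabeled_attachment_score(gold, baseline))
```

Scoring a learned parser and a baseline on the same held-out sentences in this way, for training sets of increasing size, is one way to measure how quickly a parser improves with very few labeled examples.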

Agency: National Science Foundation (NSF)
Institute: Division of Behavioral and Cognitive Sciences (BCS)
Type: Standard Grant (Standard)
Application #: 1160700
Program Officer: Shobhana Chelliah
Budget Start: 2012-06-01
Budget End: 2014-11-30
Fiscal Year: 2011
Total Cost: $98,219
Name: Columbia University
City: New York
State: NY
Country: United States
Zip Code: 10027