This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
The main goal of this project is to develop a tagging method that neither relies on target-language training data nor requires bilingual dictionaries or parallel corpora. The main assumption is that a model for the target language can be approximated by language models from one or more related source languages.
Exploiting cross-lingual correspondence leads to a better understanding of 1) what linguistic properties are crucial for morphosyntactic transfer; 2) how to measure language similarity at different levels: syntax, lexicon, and morphology; 3) how this method applies to language pairs that do not belong to the same family; 4) what determines the success of the model; and 5) how to quantify its potential for a given language pair. Exploiting these cross-language relationships also significantly reduces the size, and hence the cost, of the training data required.
This project is a new cross-fertilization between theoretical linguistics (especially typology and diachronic linguistics) and natural language processing. The practical contribution is a robust and portable system for tagging resource-poor languages. With this new approach, it will be possible to rapidly deploy tools to analyze a language that suddenly becomes critical. This approach can also enhance NSF's initiatives in documenting endangered, low-density languages, as it leverages exactly the type of knowledge that a field linguist and a native speaker could provide. Additional benefits include high-quality annotated data, automatically derived multilingual lexicons, annotation schemes for new languages, new typological generalizations, and graduate and undergraduate researchers with significant experience in highly practical work on difficult and underrepresented languages.