As modern technology infrastructure spreads throughout the world, the quantity of electronic text, written in hundreds of different languages, continues to grow in size and diversity. Building effective information retrieval, extraction, and translation systems across this vast array of languages currently requires time-consuming and expensive linguistic annotations for each language. Generic, fully unsupervised, methods are unlikely to provide a language independent solution to this problem.

Focusing on part-of-speech prediction, this project undertakes a novel approach, combining elements of supervised and unsupervised learning without assuming any specific knowledge of the target language. Instead of treating individual languages as closed systems, language-independent "universals" are statistically estimated from dozens of languages for which annotated corpora exist, and these learned universals are used to predict the part-of-speech categories of unannotated languages. At the heart of the project is a data-driven exploration of language-independent corpus characteristics that relate cross-lingual linguistic categories to surface statistics of text. These learned patterns are incorporated into expressive structured prediction models using novel approximate learning and inference methods developed by the Principal Investigators of the project.

Of the world?s spoken languages, hundreds are at risk of immediate extinction and thousands more are likely to disappear over the coming decades. By facilitating the rapid creation of language-independent linguistic analysis tools, the technology developed under this project has the potential to revolutionize the documentation of endangered languages. In the long-term, this research direction will also help realize the full social benefits of the global technology infrastructure by creating intelligent text processing tools for hundreds of low-resource languages.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
1116097
Program Officer
Tatiana D. Korelsky
Project Start
Project End
Budget Start
2011-08-01
Budget End
2013-03-31
Support Year
Fiscal Year
2011
Total Cost
$110,000
Indirect Cost
Name
University of Pennsylvania
Department
Type
DUNS #
City
Philadelphia
State
PA
Country
United States
Zip Code
19104