RI: Small: Collaborative Research: Statistical Learning of Language Universals

Taskar, Ben

Abstract

As modern technology infrastructure spreads throughout the world, the quantity of electronic text, written in hundreds of different languages, continues to grow in size and diversity. Building effective information retrieval, extraction, and translation systems across this vast array of languages currently requires time-consuming and expensive linguistic annotations for each language. Generic, fully unsupervised, methods are unlikely to provide a language independent solution to this problem.

Focusing on part-of-speech prediction, this project undertakes a novel approach, combining elements of supervised and unsupervised learning without assuming any specific knowledge of the target language. Instead of treating individual languages as closed systems, language-independent "universals" are statistically estimated from dozens of languages for which annotated corpora exist, and these learned universals are used to predict the part-of-speech categories of unannotated languages. At the heart of the project is a data-driven exploration of language-independent corpus characteristics that relate cross-lingual linguistic categories to surface statistics of text. These learned patterns are incorporated into expressive structured prediction models using novel approximate learning and inference methods developed by the Principal Investigators of the project.

Of the world?s spoken languages, hundreds are at risk of immediate extinction and thousands more are likely to disappear over the coming decades. By facilitating the rapid creation of language-independent linguistic analysis tools, the technology developed under this project has the potential to revolutionize the documentation of endangered languages. In the long-term, this research direction will also help realize the full social benefits of the global technology infrastructure by creating intelligent text processing tools for hundreds of low-resource languages.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1116097
Program Officer: Tatiana Korelsky

Project Start
Project End
Budget Start: 2011-08-01
Budget End: 2013-03-31
Support Year
Fiscal Year: 2011
Total Cost: $110,000
Indirect Cost

RI: Small: Collaborative Research: Statistical Learning of Language Universals
Taskar, Ben
University of Pennsylvania, Philadelphia, PA, United States

Abstract

Funding Agency

Institution

Comments

Recent in Grantomics:

Recently viewed grants:

Recently added grants:

Abstract

Funding Agency

Institution

Comments