As modern technology infrastructure spreads throughout the world, the quantity of electronic text, written in hundreds of different languages, continues to grow in size and diversity. Building effective information retrieval, extraction, and translation systems across this vast array of languages currently requires time-consuming and expensive linguistic annotations for each language. Generic, fully unsupervised, methods are unlikely to provide a language independent solution to this problem.
Focusing on part-of-speech prediction, this project undertakes a novel approach, combining elements of supervised and unsupervised learning without assuming any specific knowledge of the target language. Instead of treating individual languages as closed systems, language-independent "universals" are statistically estimated from dozens of languages for which annotated corpora exist, and these learned universals are used to predict the part-of-speech categories of unannotated languages. At the heart of the project is a data-driven exploration of language-independent corpus characteristics that relate cross-lingual linguistic categories to surface statistics of text. These learned patterns are incorporated into expressive structured prediction models using novel approximate learning and inference methods developed by the Principal Investigators of the project.
Of the world?s spoken languages, hundreds are at risk of immediate extinction and thousands more are likely to disappear over the coming decades. By facilitating the rapid creation of language-independent linguistic analysis tools, the technology developed under this project has the potential to revolutionize the documentation of endangered languages. In the long-term, this research direction will also help realize the full social benefits of the global technology infrastructure by creating intelligent text processing tools for hundreds of low-resource languages.