Previous approaches to statistical machine translation (SMT) have employed phrase-based models that represent phrases as sequences of fully inflected words and are otherwise devoid of linguistic detail. Such approaches cannot generalize beyond the forms they have seen and essentially rely on memorizing the translations of words and phrases observed in training data.

This project aims to improve the quality of SMT by introducing more sophisticated models that represent phrases using multiple levels of information. These levels can include basic linguistic information such as parts of speech, lemmas, and agreement information (case, number, person), as well as more sophisticated linguistic detail including semantic classes, argument structure, co-reference, phrase boundaries, and information propagated from syntactic heads.

By annotating all data with this information and extending the models accordingly, there is the potential to learn much more from training data than was possible under previous approaches. Translations of unseen word forms can be learned if other forms of the same words occur; general facts about a language's word order can be learned; and linguistic context can be used to generate grammatical output. Such generalization has the potential to result in much higher quality translation, especially for languages that have only small amounts of training data. It therefore represents a significant advance over previous approaches to SMT.
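
As a minimal sketch of the unseen-word case described above, the Python below represents each word at several levels (surface form, lemma, part of speech, number) and backs off from the surface form to the lemma when the surface form was never observed. The `Token` class, the `translate` function, and the toy German-to-English tables are all hypothetical illustrations, not the project's actual models or data; a real system would use learned probabilistic tables rather than hand-written dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Token:
    """A word annotated at several levels (hypothetical toy representation)."""
    surface: str  # fully inflected form
    lemma: str    # base form
    pos: str      # part of speech
    number: str   # agreement feature: "sg" or "pl"


# Toy lookup tables, assumed for illustration only (not learned from a corpus).
SURFACE_TABLE: Dict[str, str] = {"Haus": "house"}  # source surface form -> target word
LEMMA_TABLE: Dict[str, str] = {"Haus": "house"}    # source lemma -> target lemma
PLURALS: Dict[str, str] = {"house": "houses"}      # target lemma -> target plural


def translate(token: Token) -> Optional[str]:
    """Translate by surface form; back off to lemma plus number if unseen."""
    if token.surface in SURFACE_TABLE:
        return SURFACE_TABLE[token.surface]
    target_lemma = LEMMA_TABLE.get(token.lemma)
    if target_lemma is None:
        return None  # neither the surface form nor the lemma is known
    if token.number == "pl":
        # Regenerate the inflection on the target side from the number feature.
        return PLURALS.get(target_lemma, target_lemma + "s")
    return target_lemma


seen = Token("Haus", "Haus", "NN", "sg")
unseen = Token("Häuser", "Haus", "NN", "pl")  # surface form absent from SURFACE_TABLE
print(translate(seen))    # house
print(translate(unseen))  # houses
```

A purely surface-based lookup would return nothing for "Häuser" here; the lemma and number annotations are what allow the other observed form of the word to be reused.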

Multi-level models have the potential for wide-ranging impact on all language technologies. Simultaneous modeling of different levels of representation is an extremely useful and natural way of describing language. This project is developing a general framework for the creation of multi-level probabilistic models of language and translation, and exploring its application to tasks beyond translation including generation, paraphrasing, and the automatic evaluation of natural language technologies.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0713448
Program Officer: Tatiana D. Korelsky
Budget Start: 2007-08-01
Budget End: 2012-07-31
Fiscal Year: 2007
Total Cost: $401,213
Name: Johns Hopkins University
City: Baltimore
State: MD
Country: United States
Zip Code: 21218