Previous approaches to statistical machine translation (SMT) have employed phrase-based models that represent phrases as sequences of fully-inflected words and are otherwise devoid of linguistic detail. Such approaches cannot generalize: they essentially rely on memorizing the translations of the words and phrases observed in the training data.
This project aims to improve the quality of SMT through the introduction of more sophisticated models which represent phrases using multiple levels of information. This can include basic linguistic information such as part-of-speech tags, lemmas, and agreement features (case, number, person), as well as more sophisticated linguistic detail including semantic classes, argument structure, co-reference, phrase boundaries, and information propagated from syntactic heads.
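As an illustration of the kind of multi-level representation described above, the sketch below pairs each surface word with its lemma, part of speech, and agreement features. All names here are hypothetical illustrations, not the project's actual data structures:

```python
from dataclasses import dataclass, field

# Hypothetical multi-level token: each word carries several layers of
# annotation rather than just its fully-inflected surface form.
@dataclass
class Token:
    surface: str                    # fully-inflected word as seen in text
    lemma: str                      # dictionary form
    pos: str                        # part-of-speech tag
    agreement: dict = field(default_factory=dict)  # e.g. case, number, person

# A phrase is a sequence of annotated tokens, so a model can match or
# score it at any single level (surface, lemma, POS) or at several
# levels at once.
phrase = [
    Token("las", "el", "DET", {"gender": "fem", "number": "plural"}),
    Token("casas", "casa", "NOUN", {"gender": "fem", "number": "plural"}),
]

def level(phrase, attr):
    """Project an annotated phrase onto a single representation level."""
    return [getattr(tok, attr) for tok in phrase]

print(level(phrase, "surface"))  # ['las', 'casas']
print(level(phrase, "lemma"))    # ['el', 'casa']
print(level(phrase, "pos"))      # ['DET', 'NOUN']
```

Because every level is available simultaneously, a model is free to back off from the sparse surface level to the denser lemma or POS levels wherever the surface form was not observed in training.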
By annotating all data with this information and extending the models appropriately, there is the potential to learn much more from training data than was possible under previous approaches: translations of unseen words can be learned when other forms of those words occur, general facts about a language's word order can be acquired, and linguistic context can be used to generate grammatical output. Such generalization has the potential to yield much higher quality translation, especially for languages with only small amounts of training data, and therefore represents a significant advance over previous approaches to SMT.
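The generalization to unseen word forms described above can be sketched as a lemma-level back-off: when a surface phrase is absent from the phrase table, fall back to a lookup at the lemma level. This is a minimal sketch with hypothetical tables and a toy lemmatizer, not the project's implementation:

```python
# Hypothetical phrase tables keyed at two levels of representation.
surface_table = {("casa",): "house"}   # surface forms seen in training
lemma_table = {("casa",): "house"}     # the same entries keyed by lemma

# Toy lemmatizer: maps inflected forms (seen or unseen) to lemmas.
lemmatize = {"casas": "casa", "casa": "casa"}

def translate(words):
    """Look up a phrase at the surface level, backing off to lemmas."""
    key = tuple(words)
    if key in surface_table:
        return surface_table[key]
    lemma_key = tuple(lemmatize.get(w, w) for w in words)
    return lemma_table.get(lemma_key)  # None if neither level matches

print(translate(["casa"]))   # 'house'  (surface form seen in training)
print(translate(["casas"]))  # 'house'  (unseen form, matched via its lemma)
```

A full multi-level model would also transfer the agreement features it recovers (here, plural number, to produce "houses"); the sketch shows only the lemma-level match that makes the unseen form reachable at all.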
Multi-level models have the potential for wide-ranging impact on all language technologies. Simultaneous modeling of different levels of representation is an extremely useful and natural way of describing language. This project is developing a general framework for the creation of multi-level probabilistic models of language and translation, and exploring its application to tasks beyond translation including generation, paraphrasing, and the automatic evaluation of natural language technologies.