Research in machine translation of human languages has made substantial progress recently, and surface patterns gleaned automatically from online bilingual texts work remarkably well for some language pairs. However, for many language pairs, the output of even the best systems is garbled, ungrammatical, and difficult to interpret. Chinese-to-English systems need particular improvement, despite the importance of this language pair, while English-to-Chinese translation, equally important for communication between individuals, is rarely studied. This project develops methods for automatically learning correspondences between Chinese and English at a semantic rather than surface level, allowing machine translation to benefit from recent work in semantic analysis of text and natural language generation. One part of this work determines what types of semantic analysis of source language sentences can best inform a translation system, focusing on analyzing dropped arguments, co-reference links, and discourse relations between clauses. These linguistic phenomena must generally be made more explicit when translating from Chinese to English. A second part of the work integrates natural language generation into statistical machine translation, leveraging generation technology to determine sentence boundaries, ordering of constituents, and production of function words that translation systems tend to get wrong. A third part develops and compares algorithms for training and decoding machine translation models defined on semantic representations. All of this research exploits newly developed linguistic resources for semantic analysis of both Chinese and English.
The ultimate benefits of improved machine translation technology are easier access to information and easier communication between individuals. This in turn leads to increased opportunities for trade, as well as better understanding between cultures. This project's systems for both Chinese-to-English and English-to-Chinese are developed with the expectation that the approaches will be applied to other language pairs in the future.
Statistical machine translation (MT) systems have improved greatly in the past several years and have reached a point where they are widely used, at least for getting the gist of foreign-language documents and web pages. However, reading the output of even the best Chinese-English machine translation systems remains a painful experience. Furthermore, current systems perform well only on the type of text on which they have been trained (most often newswire text), and require very large amounts of text from this domain. The project developed a number of new techniques to improve automatic translation between natural languages, focusing on Chinese-to-English translation. This project enabled cooperation between five universities on developing translation systems that represent more of the underlying semantics of an input sentence and that use the derived semantic relations to ensure those relations are preserved in the translation output. Work at the University of Rochester focused on the translation system itself, and on decoding algorithms for finding the best translation output under the new semantically aware models. We developed systems that take advantage of semantic role labeling of both the input and output languages, and showed that these systems improve the quality of translation output, as measured by agreement with human translators. For the related task of learning translation rules from parallel, bilingual training data, our group developed a new algorithm for sampling analyses of sentence pairs in order to find a translation grammar consisting of a small number of rules best able to explain the data as a whole.