Machine Translation (MT) into morphologically rich languages poses unique challenges that state-of-the-art approaches have not yet adequately addressed. Even the best available systems for translating into languages such as Arabic frequently produce output that is disfluent and lacks proper grammatical structure. This project explores novel approaches to these issues by developing a statistical MT framework that incorporates deeper levels of syntactic and morphological modeling. While the methods explored are largely language independent, the research is conducted and experimentally evaluated in the context of a large-scale English-to-Arabic MT system built from the large corpora available from the Linguistic Data Consortium (LDC).
The research focuses on two areas. The first is novel approaches for combining syntactic and non-syntactic translation resources that are automatically acquired from large amounts of parallel data. The second is the exploration of several alternative pathways for integrating into the MT framework the information provided by a high-accuracy morphological analysis and generation engine for Arabic. The project also explores methods for improving the syntax of Arabic MT output using syntactic transfer rules that model syntactic divergences between English and Arabic. The goal is to develop an English-to-Arabic MT system that produces significantly more fluent, grammatical, and accurate Arabic output than the current best systems, as measured by automatic MT evaluation metrics (such as BLEU and METEOR) and as judged by human evaluators.
High-accuracy, fully automatic Machine Translation from English into Arabic could be of great value to the Arabic-speaking population at large by opening up access to the English content available on the web. Such high-quality MT into Arabic may also improve access to markets in the Arabic-speaking world for US and international companies.