Researchers at New York University, Monmouth University and the University of Colorado are constructing Japanese/English and Chinese/English machine translation systems that automatically acquire rules from ``deep'' linguistic analyses of parallel text. This work is a natural culmination of automated example-based Machine Translation (MT) projects that have become increasingly sophisticated over the last two decades. The following recent advances in Natural Language Processing (NLP) technologies make this inquiry feasible: (1) annotated data, including bilingual treebanks, and processors trained on this data (parsers, PropBankers, etc.); (2) semantic post-processors of parser output; (3) programs that automatically align bitexts; and (4) bilingual tree-to-tree translation models.
The ordering of corresponding words in equivalent expressions varies widely, both across languages and within a single language. This research investigates ways to minimize the variation within a single language using a type of semantic representation (GLARF) that is derived automatically from syntactic trees. Such a semantic representation provides: (1) a reduction in the number of ways of representing the same underlying message, and (2) a way to handle long-distance dependencies (e.g. relative clauses) as local phenomena. Therefore, there is no need to resort to arbitrarily long sentence fragments or large trees for training. Furthermore, because less training data is needed, the sparse data problem is reduced.
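The two properties above can be illustrated with a minimal Python sketch. It is not GLARF itself: the `Clause` fields, the `normalize` function, and the flat (predicate, agent, patient) output are hypothetical simplifications chosen only to show how an active clause, a passive clause, and a relative clause with a gap can collapse to one local representation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Clause:
    """Toy surface analysis of one clause (hypothetical, not the GLARF format)."""
    verb: str                        # lemma of the main verb
    voice: str = "active"            # "active" or "passive"
    subject: Optional[str] = None    # surface subject, if present
    obj: Optional[str] = None        # surface object, if present
    by_phrase: Optional[str] = None  # agent of a passive clause
    gap: Optional[str] = None        # "subject" or "object" in a relative clause
    head_noun: Optional[str] = None  # noun the relative clause modifies


def normalize(c: Clause) -> Tuple[str, Optional[str], Optional[str]]:
    """Map a surface clause to (predicate, agent, patient)."""
    subject, obj = c.subject, c.obj
    # Fill the relative-clause gap with the head noun, so the
    # long-distance dependency becomes a local argument.
    if c.gap == "subject":
        subject = c.head_noun
    elif c.gap == "object":
        obj = c.head_noun
    # Undo passivization: the surface subject is the patient.
    if c.voice == "passive":
        return (c.verb, c.by_phrase, subject)
    return (c.verb, subject, obj)


# "The dog chased the cat."
active = Clause("chase", subject="dog", obj="cat")
# "The cat was chased by the dog."
passive = Clause("chase", voice="passive", subject="cat", by_phrase="dog")
# "the cat that the dog chased"
relative = Clause("chase", subject="dog", gap="object", head_noun="cat")

for clause in (active, passive, relative):
    print(normalize(clause))
```

In this toy setting all three surface forms yield the same triple, so a learner would need only one mapping rule for the three constructions instead of three.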
In training this translation model, the reduction in the number of ways of representing the same underlying message in turn reduces the number of mapping rules between the source tree and the target tree. The translation model, then, is a tree transducer, with ``deep'' linguistically analyzed trees for both source and target representations. In order to provide efficient algorithms for these partial tree-to-tree mappings, this research needs to focus on (a) the training algorithm and (b) constraints over the mapping rules that reduce the computational complexity.
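As a rough illustration of what a tree transducer with mapping rules does, the following Python sketch applies hand-written rules to a small source tree. Everything in it is an assumption made for illustration: the tuple tree encoding, the `RULES` and `LEXICON` tables, and the `transduce` and `leaves` functions are hypothetical, and in the project described here the rules would be acquired automatically from aligned parallel text rather than written by hand.

```python
from typing import Dict, List, Tuple, Union

# A tree is either a leaf word or a (label, children) pair.
Tree = Union[str, Tuple[str, List["Tree"]]]

# Hypothetical mapping rules: source label -> (target label, child order).
# The VP rule reorders verb-object into object-verb.
RULES: Dict[str, Tuple[str, List[int]]] = {
    "S": ("S", [0, 1]),
    "VP": ("VP", [1, 0]),
}

# Hypothetical toy lexicon (English lemma -> romanized Japanese).
LEXICON: Dict[str, str] = {"dog": "inu", "cat": "neko", "chase": "oikakeru"}


def transduce(tree: Tree) -> Tree:
    """Recursively rewrite a source tree into a target tree."""
    if isinstance(tree, str):
        # Leaves are translated by lexicon lookup; unknown words pass through.
        return LEXICON.get(tree, tree)
    label, children = tree
    # Nodes without a rule keep their label and child order.
    default = (label, list(range(len(children))))
    target_label, order = RULES.get(label, default)
    return (target_label, [transduce(children[i]) for i in order])


def leaves(tree: Tree) -> List[str]:
    """Read the words off a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]


source: Tree = ("S", ["dog", ("VP", ["chase", "cat"])])
target = transduce(source)
print(target)
print(leaves(target))
```

Each rule here touches only one node and its immediate children. That locality is the kind of constraint that keeps rule application linear in the size of the tree; rules spanning larger tree fragments would be more expressive but costlier to train and apply.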
This research is expected to yield several advantages. The core architecture of this transducer, which uses ``deep'' linguistic analyses, should yield more accurate results. The GLARF architecture allows control over the granularity of the automatically obtained linguistic analyses.
Broader Impact: The demand for machine translation ranges from local government (e.g. police forces) to national government (e.g. the CIA) and the private sector. Given the growth of the Internet outside the English-speaking world, better machine translation is of critical importance for the broader community. This work directly affects the ability of English speakers to understand websites written in Chinese and Japanese, two of the most widely used languages on the Internet. The technique is generalizable to other language pairs and can ultimately have even wider impact.