Example-based machine translation (EBMT) searches a parallel corpus of pre-translated texts for the closest match to each new sentence being translated. Traditional EBMT works well only when a very large, relevant parallel corpus is available (e.g. over 200 MB). The proposed investigation extends EBMT by generalizing words into semantic equivalence classes, by syntactically canonicalizing the source and target corpora, and by composing multiple partial matches rather than selecting a single "best" match. These new methods will be evaluated on at least Spanish-English and Korean-English machine translation. Generalized EBMT promises to produce significantly more accurate translations than traditional EBMT given a training corpus of the same size, or alternatively to produce translations of equivalent quality from a corpus an order of magnitude smaller. The inherently brief development cycle of EBMT, combined with the much smaller bilingual corpus requirement, makes generalized EBMT the future technology of choice for rapid deployment of machine translation to new, possibly exotic, language pairs.
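
The idea of generalizing words into equivalence classes can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the project: the one-sentence Spanish-English example base, the class tags `<DAY>` and `<NAME>`, the small lexicon, and the function names `generalize`, `build_index`, and `translate`. It also simplifies heavily: it looks up an exact template match rather than the closest match, assumes each class tag occurs at most once per sentence, and does not attempt syntactic canonicalization or the composition of partial matches.

```python
# Toy illustration only: corpus, classes, and lexicon are invented for this sketch.
SRC_CLASSES = {"lunes": "<DAY>", "martes": "<DAY>", "juan": "<NAME>", "maria": "<NAME>"}
TGT_CLASSES = {"monday": "<DAY>", "tuesday": "<DAY>", "juan": "<NAME>", "maria": "<NAME>"}

# Hypothetical bilingual lexicon for the members of each class.
LEXICON = {"lunes": "monday", "martes": "tuesday", "juan": "juan", "maria": "maria"}

# A one-sentence "parallel corpus" (source, target).
EXAMPLES = [("juan llega el lunes", "juan arrives on monday")]


def generalize(sentence, classes):
    """Replace class members with their class tag.

    Returns the generalized template and a dict mapping each tag to the
    word it replaced (assumes at most one occurrence of each tag).
    """
    template, fillers = [], {}
    for word in sentence.split():
        tag = classes.get(word)
        if tag is None:
            template.append(word)
        else:
            template.append(tag)
            fillers[tag] = word
    return " ".join(template), fillers


def build_index(examples):
    """Map each generalized source template to its generalized target template."""
    index = {}
    for src, tgt in examples:
        src_template, _ = generalize(src, SRC_CLASSES)
        tgt_template, _ = generalize(tgt, TGT_CLASSES)
        index[src_template] = tgt_template
    return index


def translate(sentence, index):
    """Translate via an exact template match, or return None if none exists."""
    src_template, fillers = generalize(sentence, SRC_CLASSES)
    tgt_template = index.get(src_template)
    if tgt_template is None:
        return None
    # Fill each class slot with the lexicon translation of the source word.
    out = [LEXICON[fillers[tok]] if tok in fillers else tok
           for tok in tgt_template.split()]
    return " ".join(out)


index = build_index(EXAMPLES)
print(translate("maria llega el martes", index))
```

Under these assumptions, the single stored example covers a sentence that shares none of its class-member words with it, which is the sense in which generalization lets a smaller corpus go further. A sentence whose template is not in the index, such as "maria sale el martes", returns `None` here; a fuller system would fall back to partial matches.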

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
9618941
Program Officer
Ephraim P. Glinert
Project Start
Project End
Budget Start
1997-03-01
Budget End
2001-02-28
Support Year
Fiscal Year
1996
Total Cost
$723,304
Indirect Cost
Name
Carnegie-Mellon University
Department
Type
DUNS #
City
Pittsburgh
State
PA
Country
United States
Zip Code
15213