Parallel corpora, i.e. texts that are translations of each other, are an important resource for many natural language processing tasks, and especially for building data-driven machine translation systems. Unfortunately, for the majority of languages, parallel corpora are virtually non-existent. To be able to develop machine translation systems for those languages, we need to be able to learn from non-parallel corpora. Comparable corpora (i.e. documents covering at least partially the same content) are available in far larger quantities and can be easily collected on the Web. Examples include news published in many languages by Voice of America or BBC, and the multi-lingual Wikipedia. To make the best use of comparable corpora, it is not sufficient to extract sentence pairs that are sufficiently parallel, thereby building a parallel corpus, and then to apply proven training procedures. Rather, new techniques are required to find sub-sentential translation equivalences in non-parallel sentences. Extracting phrase pairs from comparable corpora requires a cascaded approach:
- find comparable documents using, for example, cross-lingual information retrieval techniques;
- detect promising sentence pairs, i.e. those that may contain translational equivalences;
- apply robust phrase alignment techniques to detect phrase translation pairs within non-parallel sentence pairs.
The main focus of the project lies on this third step: developing novel alignment algorithms that do not rely on aligning all words within the sentences, as traditional word alignment algorithms do, but can separate parallel from non-parallel regions. The long-term benefit of this work will be that machine translation technology can be applied to languages for which no translation systems have so far been available, due to the lack of the language resources required by current technology. This will enable communication across language barriers, especially in critical situations like medical assistance or disaster relief.
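
The second and third steps of the cascade can be illustrated with a minimal Python sketch. It is not the project's alignment algorithm: it assumes a small hypothetical bilingual lexicon (LEXICON, German to English, invented for illustration) and swaps in a simple lexical-overlap heuristic. The function names, the threshold, and the example sentences are likewise assumptions. The point is only to show how a sentence pair can be scored as promising and how matched regions can be kept while unmatched (non-parallel) regions are left unaligned.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon (German -> English); an assumption for illustration.
LEXICON: Dict[str, Set[str]] = {
    "das": {"the"},
    "die": {"the"},
    "erdbeben": {"earthquake"},
    "traf": {"hit", "struck"},
    "stadt": {"city"},
}


def lexical_overlap(src: List[str], tgt: List[str]) -> float:
    """Fraction of source tokens with a lexicon translation present in tgt."""
    if not src:
        return 0.0
    tgt_set = set(tgt)
    hits = sum(1 for w in src if LEXICON.get(w, set()) & tgt_set)
    return hits / len(src)


def promising_pairs(
    src_sents: List[List[str]], tgt_sents: List[List[str]], threshold: float = 0.3
) -> List[Tuple[List[str], List[str]]]:
    """Step 2 (toy): keep sentence pairs whose overlap reaches the threshold."""
    return [
        (s, t)
        for s in src_sents
        for t in tgt_sents
        if lexical_overlap(s, t) >= threshold
    ]


def parallel_fragments(
    src: List[str], tgt: List[str], min_len: int = 2
) -> List[Tuple[List[str], List[str]]]:
    """Step 3 (toy): maximal runs of consecutive source tokens that have a
    translation in tgt. Tokens without a match end the current run, so
    non-parallel regions are never forced into an alignment. The target side
    lists the matched translations in source order, not a contiguous phrase."""
    tgt_set = set(tgt)
    fragments: List[Tuple[List[str], List[str]]] = []
    run_src: List[str] = []
    run_tgt: List[str] = []
    for w in src:
        matches = LEXICON.get(w, set()) & tgt_set
        if matches:
            run_src.append(w)
            run_tgt.append(sorted(matches)[0])
            continue
        if len(run_src) >= min_len:
            fragments.append((run_src, run_tgt))
        run_src, run_tgt = [], []
    if len(run_src) >= min_len:  # flush a run that reaches the sentence end
        fragments.append((run_src, run_tgt))
    return fragments


# Comparable but not parallel: each sentence has material the other lacks.
src = "gestern traf das erdbeben die stadt schwer".split()
tgt = "officials said the earthquake hit the city before dawn".split()

for s, t in promising_pairs([src], [tgt]):
    print(parallel_fragments(s, t))
```

In this toy setting, "gestern" and "schwer" on the source side and "officials said" and "before dawn" on the target side remain unaligned, which is the behaviour that distinguishes this kind of extraction from word alignment over fully parallel sentences.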

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0916866
Program Officer: Tatiana D. Korelsky
Project Start:
Project End:
Budget Start: 2009-09-15
Budget End: 2012-08-31
Support Year:
Fiscal Year: 2009
Total Cost: $100,000
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213