RI-Small: Exploiting Comparable Corpora for Machine Translation (CC4MT)

Vogel, Stephan

Abstract

Parallel corpora, i.e. texts that are translations of each other, are an important resource for many natural language processing tasks, and especially for building data-driven machine translation systems. Unfortunately, for the majority of languages, parallel corpora are virtually non-existent. To develop machine translation systems for those languages, we need to be able to learn from non-parallel corpora. Comparable corpora (i.e. documents covering at least partially the same content) are available in far larger quantities and can be easily collected on the Web. Examples include news published in many languages by Voice of America or the BBC, and the multilingual Wikipedia. To make the best use of comparable corpora, it is not sufficient to extract sentence pairs that are sufficiently parallel, build a parallel corpus from them, and then use proven training procedures. Rather, new techniques are required to find sub-sentential translation equivalences in non-parallel sentences. Extracting phrase pairs from comparable corpora requires a cascaded approach:

- find comparable documents using, for example, cross-lingual information retrieval techniques;
- detect promising sentence pairs, i.e. those that may contain translational equivalences;
- apply robust phrase alignment techniques to detect phrase translation pairs within non-parallel sentence pairs.

The main focus of the project lies on this third step: developing novel alignment algorithms that do not rely on aligning all words within the sentences, as traditional word alignment algorithms do, but can separate parallel from non-parallel regions. The long-term benefit of this work will be that machine translation technology can be applied to languages for which no translation systems are available so far because they lack the language resources required by current technology. This will enable communication across language barriers, especially in critical situations such as medical assistance or disaster relief.
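
The cascaded approach described in the abstract can be illustrated with a minimal Python sketch. It is not the project's algorithm: the toy German-English lexicon, the coverage score, the thresholds, and all function names are assumptions made up for this illustration. Each stage is reduced to a simple lexicon-overlap heuristic. The third stage only shows the general idea of keeping the parallel region of a sentence pair and discarding the rest, rather than aligning every word.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon (German -> English); a real system would learn this.
LEXICON: Dict[str, Set[str]] = {
    "das": {"the"},
    "die": {"the"},
    "hochwasser": {"flood"},
    "traf": {"hit"},
    "stadt": {"town", "city"},
}


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def coverage(src: List[str], tgt: List[str]) -> float:
    """Fraction of source tokens with a lexicon translation present in tgt."""
    if not src:
        return 0.0
    tgt_set = set(tgt)
    return sum(1 for w in src if LEXICON.get(w, set()) & tgt_set) / len(src)


def parallel_fragment(src: List[str], tgt: List[str]) -> Tuple[List[str], List[str]]:
    """Step 3 (crude stand-in): longest run of translatable source tokens,
    paired with the target span from the first to the last matching token."""
    tgt_set = set(tgt)
    best_start, best_len, start = 0, 0, 0
    for i, w in enumerate(src + [""]):  # empty sentinel closes the final run
        if not (LEXICON.get(w, set()) & tgt_set):
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = i + 1
    src_run = src[best_start:best_start + best_len]
    wanted = set().union(*(LEXICON[w] for w in src_run))
    hits = [j for j, t in enumerate(tgt) if t in wanted]
    tgt_run = tgt[hits[0]:hits[-1] + 1] if hits else []
    return src_run, tgt_run


def extract_phrase_pairs(
    src_docs: List[List[str]],
    tgt_docs: List[List[str]],
    doc_min: float = 0.3,   # assumed threshold for step 1
    sent_min: float = 0.5,  # assumed threshold for step 2
) -> List[Tuple[str, str]]:
    pairs = []
    for sd in src_docs:
        s_all = [w for s in sd for w in tokenize(s)]
        for td in tgt_docs:
            # Step 1: keep only document pairs that look comparable.
            t_all = [w for s in td for w in tokenize(s)]
            if coverage(s_all, t_all) < doc_min:
                continue
            for s in sd:
                for t in td:
                    # Step 2: keep only promising sentence pairs.
                    st, tt = tokenize(s), tokenize(t)
                    if coverage(st, tt) < sent_min:
                        continue
                    # Step 3: extract the parallel region only.
                    src_run, tgt_run = parallel_fragment(st, tt)
                    if src_run:
                        pairs.append((" ".join(src_run), " ".join(tgt_run)))
    return pairs


SRC_DOCS = [["Laut Polizei traf das Hochwasser die Stadt"]]
TGT_DOCS = [
    ["The flood hit the town and many people left"],
    ["Stock markets closed higher today"],
]
print(extract_phrase_pairs(SRC_DOCS, TGT_DOCS))
```

In this toy run, the second target document is rejected in step 1. The non-parallel parts of the remaining sentence pair ("Laut Polizei" and "and many people left") are left out of the extracted phrase pair.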

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0916866
Program Officer: Tatiana D. Korelsky

Budget Start: 2009-09-15
Budget End: 2012-08-31
Fiscal Year: 2009
Total Cost: $100,000

Institution

Carnegie-Mellon University, Pittsburgh, PA, United States
