Current methods for machine translation (MT) rely on large amounts of text data. However, such data is not available for many languages, or for specialized vocabularies even in major languages. This project instead elicits bilingual data from a bilingual human informant with little linguistic training. Bilingual speakers can be found for a language even when large corpora and trained linguists cannot. A Corpus Navigator uses knowledge from language typology to choose the pieces of data that are most valuable for the automatic learning of MT rules. The Corpus Navigator employs active learning in the sense that its state is updated by eliciting data from a human translator.
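The elicitation loop described above can be illustrated with a minimal sketch. It is not the project's implementation: the names `Candidate`, `NavigatorState`, `pick_next`, and `navigate` are hypothetical, and the scoring rule (prefer the sentence that probes the most features not yet covered) is an assumed stand-in for whatever typological knowledge the Corpus Navigator actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Candidate:
    """A source sentence the navigator could ask the informant to translate."""
    text: str
    features: Set[str]  # hypothetical typological features the sentence probes


@dataclass
class NavigatorState:
    """Everything the navigator has learned from the informant so far."""
    covered: Set[str] = field(default_factory=set)
    elicited: List[Tuple[str, str]] = field(default_factory=list)


def pick_next(pool: List[Candidate], state: NavigatorState) -> Candidate:
    # Assumed scoring rule: count the features that are not yet covered.
    return max(pool, key=lambda c: len(c.features - state.covered))


def navigate(pool: List[Candidate],
             translate: Callable[[str], str],
             budget: int) -> NavigatorState:
    """Elicit at most `budget` translations, updating the state after each one."""
    state = NavigatorState()
    remaining = list(pool)
    for _ in range(budget):
        if not remaining:
            break
        best = pick_next(remaining, state)
        if not best.features - state.covered:
            break  # no remaining sentence would teach anything new
        remaining.remove(best)
        # `translate` stands in for the human informant.
        state.elicited.append((best.text, translate(best.text)))
        state.covered |= best.features
    return state
```

Here `translate` would be replaced by an interface that shows the sentence to the informant and records the response. The point of the sketch is only that each elicited translation changes the state, which in turn changes which sentence is requested next.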
Two hypotheses are being tested: that an MT system can get by with less data if it is the right data, and that the right data can be acquired through an active learning process guided by linguistic knowledge. Current government-run MT evaluations provide a testbed for these hypotheses. The outputs of MT systems trained on different data sets are compared in order to determine whether the hypotheses hold. An initial prototype of the Corpus Navigator is being produced as a proof of concept.
This project will make it easier to build MT systems in situations where large text resources are not available. Languages that will be tested may include Inupiaq, Bengali, Thai, Urdu, Uzbek, and Tigrinya. The output of Corpus Navigation is a parallel, word-aligned corpus annotated with a semantic feature structure. This data will be made available to other researchers.
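As a rough illustration of what one entry in such a corpus might contain, the sketch below pairs a source sentence with its translation, a word alignment, and a flat set of semantic features. The class name `AlignedEntry`, its field names, and the feature labels are all assumptions made for this example, and the English-Spanish pair is only a placeholder. The project's actual data format is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AlignedEntry:
    """One sentence pair with word alignment and semantic features (hypothetical layout)."""
    source: List[str]                 # tokens of the elicitation sentence
    target: List[str]                 # tokens of the informant's translation
    alignment: List[Tuple[int, int]]  # (source index, target index) pairs
    features: Dict[str, str]          # assumed flat feature structure

    def aligned_pairs(self) -> List[Tuple[str, str]]:
        """Return the aligned words as (source word, target word) pairs."""
        return [(self.source[i], self.target[j]) for i, j in self.alignment]


# Placeholder example; the language pair and feature names are illustrative only.
entry = AlignedEntry(
    source=["the", "dogs", "slept"],
    target=["los", "perros", "durmieron"],
    alignment=[(0, 0), (1, 1), (2, 2)],
    features={"number": "plural", "tense": "past"},
)
```

Storing the alignment as index pairs rather than word pairs keeps repeated words unambiguous and allows one-to-many alignments to be expressed as several pairs that share an index.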