This multi-site research effort aims to develop a coherent, consistent, standardized interlingual representation, along with a methodology and sharable tools for annotating large bilingual corpora of parallel texts. It has four central components. First, six corpora are being compiled, each consisting of a number of texts in a particular source language along with three translations of each text into English. Second, a standardized interlingual representation is being developed based on a comparative analysis of these parallel text corpora. Third, the bilingual corpora are being annotated using the standardized interlingua and following a predefined annotation procedure. Fourth, metrics are being developed for evaluating the accuracy and appropriateness of the interlingual representations in terms of the grain size of the representation given a particular task. The metrics are based on inter-coder reliability, the growth rate of the interlingual representation, and the quality of the target-language text generated from the interlingua.
The resulting annotated, multilingual, parallel corpora will be useful as an empirical basis for developing a wide variety of interlingual NLP systems for tasks such as machine translation, question answering, web searching, summarization, or presentation generation, as well as a host of other research and development efforts in theoretical and applied linguistics, foreign language pedagogy, translation studies, and other related disciplines.
The participants include CRL at NMSU, ISI at USC, UMIACS at the University of Maryland, LTI at CMU, Columbia University, and The MITRE Corporation. The source languages include Arabic, Chinese, French, Hindi, Japanese, Spanish, and English.