This Small Grant for Exploratory Research is investigating novel methods for discriminative training of statistical language models for application to human language technologies such as automatic speech recognition (ASR) and machine translation (MT). A language model (LM) is conventionally estimated from a large corpus of text in the target domain via regularized maximum likelihood. Discriminative criteria have been used with some success in ASR, but their immense promise has been curtailed by the need for an additional corpus of transcribed speech to discriminate between correct word sequences and their incorrect "cohorts." This project is exploring ways to discriminatively estimate language models without requiring massive manual annotation, namely transcribed speech for ASR or parallel text for MT. The key idea being explored is that if a large amount of (say) monolingual Chinese text is available, then the MT cohorts of Chinese words and phrases may be accurately estimated by attempting to translate this text into (say) English using an existing MT system and examining which English words and phrases are most frequently in competition with each other. It is not necessary to know which of the competing words or phrases in a cohort set is the correct translation in any particular instance; it suffices to learn which words and phrases are most often in competition. The investigators are using monolingual English text to explore features that discriminate between observed instances of each member of a cohort set and its putative competitors; the data for discriminative training are thus derived synthetically. They are investigating whether such a discriminatively trained LM specifically targets the most debilitating ambiguities faced by the MT system. The ASR counterpart, with cohort sets derived from automatic transcription of unannotated speech, is also being explored.
This project benefits both the ASR and MT research communities by exploring statistical language models that can adapt without human intervention to changing tasks or language use, and that are less reliant on manually annotated data. Advances in ASR and MT will in turn facilitate more effective computer-aided access to information in multiple languages and media.
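
The cohort-estimation idea described in the abstract can be illustrated with a minimal sketch. It is not the investigators' method: it assumes a hypothetical input format (one n-best list of tokenized hypotheses per source sentence, best hypothesis first) and uses a simple sequence diff from Python's standard library as a stand-in for whatever alignment the MT system provides. The names `competing_pairs`, `estimate_cohorts`, and `min_count` are illustrative only.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def competing_pairs(nbest):
    """Yield (phrase, rival) pairs where a hypothesis differs from the 1-best.

    Assumption: `nbest` is a list of token lists for one source sentence,
    ordered best-first. No reference translation is needed.
    """
    best = nbest[0]
    for alt in nbest[1:]:
        matcher = SequenceMatcher(a=best, b=alt, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":  # the two hypotheses disagree on this span
                yield " ".join(best[i1:i2]), " ".join(alt[j1:j2])


def estimate_cohorts(nbest_lists, min_count=2):
    """Group phrases that compete with each other at least `min_count` times."""
    counts = Counter()
    for nbest in nbest_lists:
        for phrase, rival in competing_pairs(nbest):
            counts[frozenset((phrase, rival))] += 1

    cohorts = defaultdict(set)
    for pair, n in counts.items():
        if n >= min_count and len(pair) == 2:
            a, b = tuple(pair)
            cohorts[a].add(b)
            cohorts[b].add(a)
    return dict(cohorts)


# Toy n-best lists (invented data, standing in for MT output on monolingual text).
nbest_lists = [
    [["the", "bank", "raised", "rates"], ["the", "shore", "raised", "rates"]],
    [["a", "bank", "near", "town"], ["a", "shore", "near", "town"]],
    [["the", "bank", "closed"], ["the", "bench", "closed"]],
]
cohorts = estimate_cohorts(nbest_lists)
print(cohorts)
```

In this toy data, "bank" and "shore" compete twice and so form a cohort, while "bank" and "bench" compete only once and fall below the threshold. Occurrences of cohort members in monolingual English text could then serve as synthetic training examples, with the other members of the cohort treated as putative competitors.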

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0840112
Program Officer: Tatiana D. Korelsky
Project Start:
Project End:
Budget Start: 2008-09-01
Budget End: 2010-02-28
Support Year:
Fiscal Year: 2008
Total Cost: $137,464
Indirect Cost:

Name: Johns Hopkins University
Department:
Type:
DUNS #:
City: Baltimore
State: MD
Country: United States
Zip Code: 21218