This project is conducting fundamental research in statistical language modeling to improve human language technologies, including automatic speech recognition (ASR) and machine translation (MT).

A language model (LM) is conventionally optimized, using text in the target language, to assign high probability to well-formed sentences. This method has a fundamental shortcoming: the optimization does not explicitly target the kinds of distinctions necessary to accomplish the task at hand, such as discriminating (for ASR) between different words that are acoustically confusable or (for MT) between different target-language words that express the multiple meanings of a polysemous source-language word.

Discriminative optimization of the LM, which would overcome this shortcoming, requires large quantities of paired input-output sequences: speech and its reference transcription for ASR or source-language (e.g. Chinese) sentences and their translations into the target language (say, English) for MT. Such resources are expensive, and limit the efficacy of discriminative training methods.

In a radical departure from convention, this project is investigating discriminative training using easily available, *unpaired* input and output sequences: un-transcribed speech or monolingual source-language text and unpaired target-language text. Two key ideas are being pursued: (i) unlabeled input sequences (e.g. speech or Chinese text) are processed to learn likely confusions encountered by the ASR or MT system; (ii) unpaired output sequences (English text) are leveraged to discriminate between these well-formed sentences from the (supposed) ill-formed sentences the system could potentially confuse them with.

This self-supervised discriminative training, if successful, will advance machine intelligence in fundamental ways that impact many other applications.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
0964102
Program Officer
Tatiana D. Korelsky
Project Start
Project End
Budget Start
2010-06-01
Budget End
2014-05-31
Support Year
Fiscal Year
2009
Total Cost
$519,050
Indirect Cost
Name
Oregon Health and Science University
Department
Type
DUNS #
City
Portland
State
OR
Country
United States
Zip Code
97239