The development of automatic speech recognition (ASR) systems is severely constrained by the need for large amounts of training data. Furthermore, available training data often does not match the recognition task in style and domain, which particularly affects the language modeling component in an ASR system.

This project aims to develop methods for artificially generating language model training data for ASR. Specifically, statistical machine translation (SMT) models are used to produce task-specific data from different but related data representing, for example, a different speech style, dialect, or domain. First, SMT models are trained on a small amount of parallel in-domain and out-of-domain data. The trained model is then applied to a larger set of out-of-domain data. Finally, the 'translated' output is filtered with respect to its relevance to the target task. In addition to using existing SMT models for data generation, a new type of SMT model is introduced in which words are represented as collections of features. This results in a factorized probability model that can be estimated more robustly than a standard model. In this project the above strategy is used to create training data for conversational speech from written text. It is evaluated by comparison with standard language model adaptation and training methods.
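The final filtering step can be sketched as a toy relevance filter: estimate a language model from a small in-domain sample, then keep only those 'translated' candidate sentences that score well (low perplexity) under it. The Laplace-smoothed unigram model, the example sentences, and the threshold below are all illustrative assumptions, not the project's actual models or data.

```python
import math
from collections import Counter

def train_unigram_lm(sentences):
    """Laplace-smoothed unigram LM estimated from whitespace-tokenized text."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def perplexity(lm, sentence):
    """Per-word perplexity of a sentence under the given unigram model."""
    words = sentence.split()
    logp = sum(math.log(lm(w)) for w in words)
    return math.exp(-logp / len(words))

# Tiny stand-in for in-domain (conversational) text -- illustrative only.
in_domain = ["yeah i think so", "i mean you know", "so what do you think"]
lm = train_unigram_lm(in_domain)

# Candidate 'translated' sentences from the out-of-domain side; keep only
# those whose perplexity under the in-domain LM falls below a threshold
# (a tunable hyperparameter, set here purely for this toy data).
candidates = ["i think you know", "the quarterly report was filed"]
kept = [s for s in candidates if perplexity(lm, s) < 15.0]
print(kept)  # -> ['i think you know']
```

In practice the relevance filter would use a stronger n-gram model over the full target-task sample, but the shape of the computation is the same: score each generated sentence against an in-domain model and discard poor matches.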

This technique is expected to significantly reduce the requirements for task-specific data when developing or porting an ASR system to a new recognition task. Moreover, this work will contribute to increased cross-fertilization between machine translation and ASR research.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
0308297
Program Officer
Tatiana D. Korelsky
Budget Start
2003-09-01
Budget End
2008-09-30
Fiscal Year
2003
Total Cost
$412,000
Name
University of Washington
City
Seattle
State
WA
Country
United States
Zip Code
98195