The development of automatic speech recognition (ASR) systems is severely constrained by the need for large amounts of training data. Furthermore, the available training data often does not match the recognition task in style and domain, a mismatch that particularly affects the language modeling component of an ASR system.
This project develops methods for artificially generating language model training data for ASR. Specifically, statistical machine translation (SMT) models are used to produce task-specific data from different but related data representing, for example, a different speech style, dialect, or domain. First, an SMT model is trained on a small amount of parallel in-domain and out-of-domain data. The trained model is then applied to a larger set of out-of-domain data. Finally, the 'translated' output is filtered according to its relevance to the target task. In addition to using existing SMT models for data generation, a new type of SMT model is introduced in which words are represented as collections of features. This representation yields a factorized probability model that can be estimated more robustly than a standard model. In this project, the above strategy is used to create training data for conversational speech from written text. It is evaluated by comparison with standard language model adaptation and training methods.
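The final filtering step can be illustrated with a minimal sketch. The relevance criterion is not specified here, so the sketch assumes one common choice: each 'translated' sentence is scored by its per-word cross-entropy under a small language model trained on in-domain text, and only the best-scoring fraction is kept. The add-one smoothed unigram model, the function names, and the `keep_fraction` parameter are illustrative assumptions, not the project's actual method.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build an add-one smoothed unigram model over whitespace tokens.

    Assumption: a unigram model stands in for whatever in-domain
    language model would be used in practice.
    """
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words

    def logprob(tok):
        # Counter returns 0 for unseen tokens, so they get the smoothed floor.
        return math.log((counts[tok] + 1) / (total + vocab))

    return logprob


def cross_entropy(sentence, logprob):
    """Average negative log-probability per token (lower = more in-domain)."""
    toks = sentence.lower().split()
    if not toks:
        return float("inf")  # empty output is never considered relevant
    return -sum(logprob(t) for t in toks) / len(toks)


def filter_by_relevance(candidates, in_domain, keep_fraction=0.5):
    """Keep the fraction of candidates that best match the in-domain model."""
    logprob = train_unigram(in_domain)
    ranked = sorted(candidates, key=lambda s: cross_entropy(s, logprob))
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```

In this sketch, `in_domain` would hold the small amount of conversational data and `candidates` the SMT output produced from written text. A stronger in-domain model or a different selection rule could be substituted without changing the overall structure.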
This technique is expected to significantly reduce the amount of task-specific data required when developing an ASR system for a new recognition task or porting an existing system to one. Moreover, this work will contribute to increased cross-fertilization between machine translation and ASR research.