Speech recognizers all include a component for predicting, based on the past context, what words are likely to appear next. Today these components, known as language models, operate at the symbol level, abstracted away from the details of how and when the words are spoken. Spoken language, however, is not just a symbolic or mathematical object: it is produced and understood by human brains, with specific processing constraints, and these constraints directly affect what is said, and when, in dialog.
This project is developing language models and ``dialog models'' that explicitly use the information in the timings of words. Inspired by psychological research suggesting that dialog and language behaviors result from multiple simultaneously active cognitive processes, the working assumption is that the words likely to be spoken at a given time depend, probabilistically, on the elapsed time since various reference points: for example, since the speaker began talking, since the speaker's last disfluency, or since the listener's last back-channel. Statistical analyses of large corpora of human-human spoken dialogs, using machine learning methods, are revealing patterns and regularities which are being used to build language models with improved predictive power.
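To make this working assumption concrete, the sketch below interpolates a conventional lexical language model with probability tables conditioned on binned elapsed times since dialog reference points. The feature names, bin widths, weights, and probability tables are all hypothetical illustrations, not the project's actual model.

```python
# A minimal sketch (not the project's actual model) of next-word prediction
# conditioned on elapsed time since dialog reference points. All feature
# names, bin widths, weights, and probability tables here are hypothetical.

def timing_features(now, events):
    """Seconds elapsed since each reference event, e.g.
    events = {'turn_start': 12.3, 'last_disfluency': 14.0, 'last_backchannel': 13.1}."""
    return {name: now - t for name, t in events.items()}

def p_next_word(word, history, features, lexical_lm, timing_tables):
    """Interpolate a standard lexical LM with timing-conditioned estimates.

    lexical_lm(word, history)      -> P(word | preceding words)
    timing_tables[name][bin][word] -> P(word | binned elapsed time since event `name`)
    """
    p = 0.6 * lexical_lm(word, history)                 # interpolation weights are arbitrary
    share = 0.4 / max(len(timing_tables), 1)
    for name, table in timing_tables.items():
        bucket = min(int(features.get(name, 0.0)), 9)   # crude 1-second bins, capped at 9
        p += share * table.get(bucket, {}).get(word, 1e-6)
    return p
```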
These language models implicitly represent some aspects of dialog dynamics, with the potential to lead to an integrated understanding of the nature of dialog as a human ability. The improved language models are also likely to reduce speech recognition errors, enabling the development of spoken language systems that are more accurate, more efficient, and more useful.
Many people use speech recognizers every day, for example in dialogs with automated customer service systems. However, recognition errors are common, causing user frustration and severely limiting the useful deployment of these systems. One way to reduce recognition errors is to better use the context to predict what words and phrases the user is likely or unlikely to say next. The system component responsible for these predictions is called the language model. Current language models rely exclusively on lexical context. However, there is additional information that can be used to make predictions: temporal and prosodic aspects of the local context. Such features are especially informative because they reflect the cognitive and communicative processes underlying speech, and so can support deeper models of the speaker's internal state and, in turn, of what he or she is likely to say next.

In this project we examined a large number of prosodic and timing features and evaluated them as sources of information on what words the speaker is likely to say next. The most informative of these included recent speaking rate, volume, and pitch, and time until end of utterance. Using simple combinations of such features to augment standard (trigram) language models gave up to an 8.4% reduction in unpredictability (perplexity) on a standard collection of spontaneous dialogs (the Switchboard corpus). We further applied a mathematical modeling technique (Principal Component Analysis) to a larger set of 76 prosodic features spanning 6 seconds of context, encoding the speaking styles and behaviors of both the primary speaker and the interlocutor. We found that many of the principal dimensions discovered by this method effectively capture well-known but previously unquantified aspects of mental state and dialog state, and that this gave an even better 27% reduction in perplexity. Finally, in a very preliminary study, using the simpler models in a speech recognizer for German, we obtained up to a 1.0% reduction in word error rate on the standard Verbmobil II corpus. More recent spin-off investigations have extended these techniques and applied them to identifying important regions in dialog for purposes of compression, and to searching in audio archives.
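The two modeling steps described above can be sketched roughly as follows: Principal Component Analysis reduces a wide window of prosodic features to a few summary dimensions, and a prosody-conditioned word probability is interpolated with a baseline trigram probability before measuring perplexity. The random data, feature counts, interpolation weight, and toy prosody model in this sketch are illustrative assumptions, not the project's code or results.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rough sketch, with made-up data, of (1) PCA over a wide prosodic feature
# window and (2) perplexity of a trigram LM interpolated with a
# prosody-conditioned estimate. Everything numeric here is illustrative.

rng = np.random.default_rng(0)

# One row per word position: e.g. 76 prosodic features spanning ~6 s of context
# (pitch, volume, speaking rate, ... for both the speaker and the interlocutor).
prosody = rng.normal(size=(5000, 76))

pca = PCA(n_components=10)
dims = pca.fit_transform(prosody)       # low-dimensional "dialog state" summary

def interpolated_prob(p_trigram, p_prosody, lam=0.9):
    """Linear interpolation of baseline trigram probabilities with
    prosody-conditioned probabilities (lam is a tunable weight)."""
    return lam * p_trigram + (1.0 - lam) * p_prosody

def perplexity(word_probs):
    """Perplexity = exp of the average negative log probability."""
    return float(np.exp(-np.mean(np.log(word_probs))))

# Toy evaluation: stand-in probabilities for each test-set word.
p_tri = rng.uniform(0.01, 0.2, size=5000)
p_pro = np.clip(p_tri * (1 + 0.1 * dims[:, 0]), 1e-6, 1.0)  # hypothetical prosody model
print(perplexity(p_tri), perplexity(interpolated_prob(p_tri, p_pro)))
```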