This project focuses on applying a model used in text-to-speech synthesis (TTS) to the task of automatic speech recognition (ASR). The standard method in ASR for addressing variability due to phonemic context, or "coarticulation," requires a large amount of training data and is sensitive to differences between training and testing conditions. Despite the effective use of stochastic models, current ASR systems are often unable to sufficiently account for the large degree of variability observed in speech. In many cases, this variability is not due to random factors, but is due to predictable changes in the speech signal. These factors are currently modeled in order to generate speech via TTS, but they are not yet modeled in order to recognize speech, largely because of non-local dependencies. We apply the Asynchronous Interpolation Model (AIM) used in TTS to the task of speech recognition, by decomposing the speech signal into target vectors and weight trajectories, and then searching weight-trajectory and stochastic target-vector models for the highest-probability match to the input signal.
The goal of this research is to improve the robustness of ASR to variability that is due to phonemic and lexical context. This improvement will increase the use of ASR technology in automated information access by telephone, educational software, and universal access for individuals with visual, auditory, or speech-production challenges. More effective models of coarticulation may also increase our understanding of both human speech perception and speech production. Results from this project are disseminated through technical papers and the CSLU Toolkit software package.
This project models the degree and manner of the coarticulation of speech. Coarticulation occurs when a conceptually isolated speech sound is influenced by, or becomes more similar to, a preceding or following speech sound. For example, the vowel in the word "fear" is different from the vowel in the word "feet". Such research has applications in fundamental speech production research, speech disorder diagnosis, and text-to-speech synthesis, and could potentially increase the intelligibility of conversational speech. Specifically, we researched a methodology that models formant (spectral peak) trajectories as a sum of phoneme targets weighted by coarticulation functions. Using a genetic algorithm search approach, we were able to determine the model parameters fully automatically, even for hard-to-estimate phonemes such as unvoiced bursts. To validate our findings, we carried out a perceptual listening test and found that tokens reproduced by the model retained 95% of their original intelligibility, confirming a good model fit. As an application of the model, we focused on the difference in coarticulation between conversationally spoken speech and clearly spoken speech, and found evidence that, to some degree, conversational speech is a more coarticulated version of clear speech. The project supported the academic education of graduate and undergraduate students, as well as the creation and presentation of publications at international conferences.
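
The idea of a formant trajectory as a sum of phoneme targets weighted by coarticulation functions can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the project: the sigmoid shape of the weight functions, the function names, the boundary times and slope, and the target values in Hz. The genetic algorithm search that estimates these parameters from recorded speech is not shown.

```python
import numpy as np

def sigmoid(t, center, slope):
    """Smooth 0-to-1 step centered at `center` (assumed transition shape)."""
    return 1.0 / (1.0 + np.exp(-slope * (t - center)))

def coarticulation_weights(t, boundaries, slope):
    """Return weights of shape (n_phonemes, n_frames) that sum to 1 per frame.

    Each phoneme's weight is the difference of two successive sigmoid steps,
    so one phoneme fades out while the next fades in.
    """
    steps = np.array([sigmoid(t, b, slope) for b in boundaries])
    ones = np.ones((1, len(t)))
    zeros = np.zeros((1, len(t)))
    started = np.vstack([ones, steps])      # has phoneme i begun?
    superseded = np.vstack([steps, zeros])  # has phoneme i+1 begun?
    return started - superseded

def formant_trajectory(targets, weights):
    """Weighted sum of targets: (n_frames, n_formants)."""
    return weights.T @ targets

# Hypothetical F1/F2 targets in Hz for three successive phonemes.
targets = np.array([
    [300.0, 2300.0],
    [700.0, 1200.0],
    [400.0, 2000.0],
])

t = np.linspace(0.0, 0.3, 31)  # time axis in seconds
weights = coarticulation_weights(t, boundaries=[0.1, 0.2], slope=80.0)
trajectory = formant_trajectory(targets, weights)
```

In this sketch a smaller `slope` spreads each transition over more frames, so neighboring targets influence each other more strongly. That is one simple way to picture a "more coarticulated" rendition of the same phoneme sequence.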