Spontaneous speech is typically interrupted by one disfluency every 10 to 20 words. While human listeners comprehend disfluent speech with little effort, computer speech recognizers often fail to separate disfluent regions from their surrounding context, producing transcription errors and loss of meaning. This project develops methods for recognizing disfluency in automatic speech recognition by investigating perceptually salient acoustic correlates of disfluency.

The effects of disfluency on pitch, energy, and voice-source features are examined in the Switchboard corpus of spontaneous speech. The approximate repetition of pitch and energy contours is investigated as a cue marking the dependency between a disfluency and its subsequent repair; analysis-by-synthesis techniques adapted from the Stem-ML model of speech generation are used to recognize prosodic repetition that is often obscured by differences in scaling. Voice-quality correlates of disfluency, such as glottalization, are tracked through several acoustic measures of the spectral envelope, with receiver operating characteristic (ROC) testing used to identify the best predictors. Correlations between disfluency and the intonational features that mark accent and phrasing are examined through a ToBI-standard intonation labeling of the corpus.

These acoustic and prosodic correlates of disfluency are combined with a repetition language model in the design of a speech recognizer that automatically transcribes both words and disfluencies. The recognizer integrates cues at multiple linguistic levels that together identify regions of disfluency in spontaneous speech.
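To make the contour-repetition idea concrete, the sketch below shows one simplified way to test whether a repair approximately repeats the pitch contour of the reparandum while allowing for differences in overall scale. This is not the Stem-ML analysis-by-synthesis model itself, only a minimal least-squares proxy under stated assumptions: the function name, the resampling length, and the linear scale-plus-offset model are illustrative, not taken from the project.

    import numpy as np

    def contour_similarity(f0_a, f0_b, n_points=50):
        """Compare two pitch contours for approximate repetition,
        factoring out a global difference in scale and offset.

        f0_a, f0_b: 1-D arrays of F0 samples in Hz (hypothetical
        inputs; real contours would come from a pitch tracker run
        over the disfluency and its repair).
        """
        # Time-normalize: resample both contours to a common length
        # so regions of unequal duration can be compared point-wise.
        xs = np.linspace(0.0, 1.0, n_points)
        a = np.interp(xs, np.linspace(0.0, 1.0, len(f0_a)), f0_a)
        b = np.interp(xs, np.linspace(0.0, 1.0, len(f0_b)), f0_b)

        # Fit a linear scale + offset mapping a onto b by least
        # squares, since a repeated contour may be globally raised,
        # lowered, or compressed in range without losing its shape.
        A = np.column_stack([a, np.ones(n_points)])
        (scale, offset), *_ = np.linalg.lstsq(A, b, rcond=None)

        # Small residual error after removing scaling suggests the
        # repair approximately repeats the reparandum's contour.
        residual = b - (scale * a + offset)
        rmse = float(np.sqrt(np.mean(residual ** 2)))
        return rmse, scale, offset

    # Example with made-up contours: the second is the first,
    # compressed in range and raised by 20 Hz.
    a = np.array([180.0, 200.0, 220.0, 210.0, 190.0])
    b = 0.8 * a + 20.0
    rmse, scale, offset = contour_similarity(a, b)
    print(f"rmse={rmse:.2f} Hz, scale={scale:.2f}, offset={offset:.1f} Hz")

In this toy case the residual is near zero despite the shift and compression, which is the kind of scaling-invariant match the analysis-by-synthesis approach is designed to detect; the actual project fits a generative prosody model rather than a two-parameter affine map.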
This research will advance speech technology by enabling recognition of disfluency in natural speech. It will contribute new statistical and acoustic models of disfluency and a publicly accessible corpus of spontaneous speech with prosody and disfluency annotation.