Although automatic speech recognition technology is now increasingly embedded in systems and applications used in daily life, recognition accuracy is frequently inadequate in difficult acoustical environments or in the presence of interfering sound sources. The well-known ability of the human auditory system to process and interpret a desired speech signal effectively, even in the presence of multiple interfering sounds, has made it both an inspiration and a model for the design of automatic speech recognition systems. Nevertheless, most efforts along these lines have been largely unsuccessful to date, both because of the intrinsic difficulty of identifying the aspects of the speech signal that are most resilient to interference and distortion, and because of a historical failure to match physiologically and perceptually motivated features of sounds to the characteristics of the speech recognition systems that make use of them.
This project has three major components. First, new features for speech recognition systems are being developed based on contemporary knowledge of auditory physiology and perception. Second, techniques from computational auditory scene analysis are being used to identify and separate the components of a complex sound field that belong to a target speech signal, with missing-feature techniques restoring those components of the target signal that are obscured by interfering sounds. Most importantly, the speech recognition system itself is being modified at several levels to take full advantage of the statistical attributes of the extracted features.
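To make the missing-feature idea concrete, the following sketch illustrates mask-based reconstruction under strong simplifying assumptions. The synthetic log-energy features, the known stationary noise floor, the fixed reliability margin, and the bounded per-channel-mean imputation rule are all illustrative choices for exposition, not the methods actually pursued in this project, which uses more sophisticated reconstruction techniques.

```python
# Minimal sketch of mask-based missing-feature reconstruction.
# Assumptions (all hypothetical, chosen only for illustration):
#   - features are log-energies in 20 spectral channels;
#   - the interference has a known, stationary log-energy floor;
#   - a cell is "reliable" when the target exceeds the noise by a margin;
#   - unreliable cells are imputed with the per-channel mean of reliable
#     cells, bounded above by the observed mixture energy.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "clean" log-energy features: n_frames x n_channels.
n_frames, n_channels = 200, 20
clean = rng.normal(loc=10.0, scale=3.0, size=(n_frames, n_channels))

# Additive interference with an assumed known log-energy floor.
noise = np.full(n_channels, 6.0)
noisy = np.logaddexp(clean, noise)   # log-domain approximation of energy mixing

# Reliability mask: target dominates interference by at least 3 log units.
reliable = (clean - noise) > 3.0

# Per-channel means computed over reliable cells only.
observed_reliable = np.where(reliable, noisy, np.nan)
channel_means = np.nanmean(observed_reliable, axis=0)

# Bounded imputation: the true target energy can never exceed the observed
# mixture energy, so the imputed value is clipped by the noisy observation.
reconstructed = np.where(reliable, noisy, np.minimum(channel_means, noisy))

err_noisy = np.mean((noisy - clean) ** 2)
err_recon = np.mean((reconstructed - clean) ** 2)
print(f"MSE vs clean: noisy={err_noisy:.2f}, reconstructed={err_recon:.2f}")
```

Even this crude rule reduces the distortion of the unreliable cells relative to the raw noisy features; the reconstruction methods developed in the project exploit richer statistical models of the feature distribution.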
This project will address some of the most challenging problems in speech recognition in adverse acoustical environments. The attainment of our goals would have enormous impact in extending the automatic recognition of natural and casual speech to environments such as automobiles, personal digital assistants, and cell phones. In addition, this project has the potential to help unify the auditory and speech research communities, which have until now developed largely independent perspectives on how knowledge of human audition can best be applied to robust automatic speech recognition.