Computer technology for speech recognition has advanced remarkably over the past decade. Many of us use it daily, whether to dictate text messages on smartphones or to navigate automated phone systems. As good as these systems are, humans still outperform them in complex, crowded, and noisy acoustic environments. If more were known about how humans adapt to these challenging situations, speech technology could be made more adaptive and robust. Computer systems for speech recognition, for example, rely on complex "deep learning" networks that often must be trained in ways very different from how humans learn language. Neural network models aimed at simulating human language processing are much simpler, which allows scientists to develop hypotheses about how that processing works, but they do not use real speech as input. Instead, they use phonetic features that are more like text than speech and therefore fail to address the core problem of how humans map the acoustics of speech onto words. This research program focuses on bridging the gap between the complex artificial neural network models used in current speech recognition technologies and the simpler neural network models used to investigate how humans actually perceive speech.
This research program builds on a new neural network model of speech recognition that aims to achieve high accuracy on many words produced by multiple speakers. Crucially, the model can do this with minimal complexity (using far fewer layers than commercial speech recognition systems), which allows researchers to understand the computations it performs. The research plan includes extending the model to a large vocabulary, training it on naturalistic speech, and adding biologically plausible preprocessing modeled on the human auditory pathway. The model will be compared with key aspects of human spoken word recognition behavior as well as with human neural responses to speech. The work has the potential to generate new insights that advance speech technology by making it more robust in challenging environments, with potential impact on speech technology used in health, law, and education, and on the automatic captioning that makes speech accessible to people who are deaf or hard of hearing. In addition, individuals ranging from high school students to Ph.D. students will be part of the research team and will gain rich research experiences that promote the development of technical skills useful for careers in academic research or a variety of non-academic fields.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.