Arguably the most recalcitrant problem in speech perception has been identifying invariant aspects of the acoustic or neural signals that correspond to speech segments. This is an especially difficult problem because of the large variation in speech produced by different speakers. For decades it has been assumed that the main cues for speech recognition come from the most salient frequencies in our voices and how these frequencies change as we produce consonants and vowels. However, recent results using a form of speech that mimics what is heard by cochlear implant users have pointed to the primary importance of temporal cues, especially for the recognition of consonants. Temporal features in the responses of the auditory nerve have been identified after presentation of the American English stop consonants /d/, /t/, /p/ and /b/. For each of these stop consonants, the temporal features are unique and relatively invariant despite large acoustic differences in the speech sounds, and they could therefore provide the temporal cues necessary for speech recognition.

The present work extends this research to all American English consonants, including fricatives (such as /f/) and nasals (such as /n/ and /m/), produced by many different speakers. The hypothesis is that for each consonant there are unique temporal patterns in the responses of the auditory nerve and that these patterns are unchanged by variations in the acoustics of speech. The proposed experiments will examine the representation of consonant-vowel syllables in the auditory nerve of chinchillas, which hear over the same frequency range as humans. Syllables produced by 12 talkers will be taken from a public corpus and also synthesized using a noise vocoder, which mimics what cochlear implant patients hear. The responses of individual auditory nerve fibers to a syllable will be pooled to create an ensemble response. Dynamic time warping, which yields a measure of similarity that correlates highly with the psychoacoustic recognizability of a speech token, will be used to quantify the similarities and differences between ensemble responses.
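To make the analysis pipeline concrete, here is a minimal illustrative sketch in Python. The report publishes no code, so all names and the toy data are hypothetical: spike times from individual fibers are pooled into an ensemble histogram, and two such histograms are compared with a standard dynamic time warping distance.

```python
import numpy as np

def ensemble_response(spike_trains, duration_s, bin_ms=1.0):
    """Pool spike times from many auditory nerve fibers into a single
    ensemble peri-stimulus time histogram (spikes/s per fiber)."""
    bin_s = bin_ms / 1000.0
    edges = np.arange(0.0, duration_s + bin_s, bin_s)
    counts = np.zeros(len(edges) - 1)
    for spikes in spike_trains:
        counts += np.histogram(spikes, bins=edges)[0]
    return counts / (len(spike_trains) * bin_s)

def dtw_distance(x, y):
    """Classic dynamic time warping distance between two 1-D histograms.
    Smaller values mean the temporal patterns are more similar, even if
    one pattern is locally stretched or shifted relative to the other."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])                # local distance
            cost[i, j] = d + min(cost[i - 1, j],        # insertion
                                 cost[i, j - 1],        # deletion
                                 cost[i - 1, j - 1])    # match
    return cost[n, m]

# Toy example: two "fibers" responding to the same syllable, and the
# same responses shifted in time, as might occur for a second talker.
rng = np.random.default_rng(0)
fiber_a = np.sort(rng.uniform(0.00, 0.40, 50))
fiber_b = np.sort(rng.uniform(0.01, 0.41, 50))
resp1 = ensemble_response([fiber_a, fiber_b], duration_s=0.5)
resp2 = ensemble_response([fiber_a + 0.02, fiber_b + 0.02], duration_s=0.5)
print(dtw_distance(resp1, resp2))
```

Unlike a sample-by-sample comparison, the warping step lets responses that share a temporal pattern align closely even when the pattern is stretched or delayed, which is why a DTW-style measure suits comparisons across talkers.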

The study of temporal cues in ensemble responses is a new and fundamentally different approach to speech recognition, one that will provide important insights into how recognition is achieved despite acoustic variability. The results from these experiments will be necessary for developing better speech recognition algorithms, improving speech rehabilitation strategies, and enhancing speech coding in cochlear implants.

Project Report

The primary goal of this research was to investigate the encoding of temporal features of speech by both individual auditory nerve fibers and ensembles of fibers. This information is important for developing better speech recognition strategies for devices ranging from phones to cochlear implants. The temporal features we examined were those identified in our previous studies and in experiments by other investigators. To provide the normal variation in speech that is encountered in everyday listening, the syllables used as stimuli were spoken by many talkers. For about half of the talkers, we found identical temporal features in the ensemble responses, even though there was large variation in the responses of individual fibers. The presence of these temporal features across many speakers suggests that these cues make critical contributions to speech recognition. For the remaining talkers, we established that we did not record data from the fibers encoding the strongest frequencies in the speakers' voices; consequently, neither the ensemble nor the individual fiber responses contained the critical information. While one solution to this sampling problem would be to record from as many auditory nerve fibers as possible, we are pursuing a different approach, described below.

Our second goal was to investigate how the temporal features were represented in background noise. We explored both different intensities of noise relative to the speech and different overall levels of speech and noise, since both conditions degrade intelligibility. We found that the noise severely degraded both individual and ensemble encoding of temporal features, except in a specific subset of auditory nerve fibers that have a low rate of spontaneous activity and higher thresholds for sound. At high overall levels of speech and noise, even this group of fibers displayed a diminished ability to transmit the temporal features. This finding has wider implications because it suggests that the optimal strategy for improving speech intelligibility in noisy environments may be to turn down the sound level. This may be especially important as one ages and these low-spontaneous-rate fibers begin to degenerate. We will be pursuing this observation with psychoacoustic experiments over the next several years.

Because we had to obtain data from a large number of auditory nerve fibers to compute accurate ensemble averages, we had planned an extensive series of experiments. However, after a failure of the air conditioning system in the animal colony, all of the chinchillas in the colony had hearing thresholds above 70 dB SPL and could not provide acceptable data. The animal care facility did not complete the repairs to the air conditioning system and, in fact, has refused to do so. Rather than risk exposing new chinchillas to excessive temperatures, we have terminated the animal experiments. This halts data collection for responses to some of the talkers and means that we cannot investigate a wide set of consonants. These events emphasize the importance of non-institutional monitoring of animal colonies and the need for independent inspections that verify basic animal housing conditions, including temperature.
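Returning to the noise manipulation described above: the report does not specify the mixing procedure, so the following is only a plausible sketch, with hypothetical names. A common approach is to scale the noise to a target power relative to the speech (the speech-to-noise ratio) and then scale the mixture to probe different overall presentation levels.

```python
import numpy as np

def mix_speech_and_noise(speech, noise, snr_db, level_gain=1.0):
    """Add noise to speech at a given speech-to-noise ratio (dB),
    then apply an overall gain to probe different presentation levels.

    speech, noise: 1-D waveforms of equal length
    snr_db: desired ratio of speech power to noise power, in dB
    level_gain: linear gain applied to the final mixture
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so p_speech / p_noise_scaled equals 10**(snr_db / 10).
    noise_gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return level_gain * (speech + noise_gain * noise)

# Example: a 1 kHz tone standing in for speech, mixed at +5 dB SNR.
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 1000 * t)
noise = np.random.default_rng(1).standard_normal(fs)
mixture = mix_speech_and_noise(speech, noise, snr_db=5.0)
```

Varying `snr_db` and `level_gain` independently corresponds to the two conditions the report describes: noise intensity relative to the speech, and the overall level of speech and noise together.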
Our solution to the problem of obtaining enough auditory nerve data to compute the ensemble averages, given that we cannot undertake an extensive experimental series, is to use recently developed artificial intelligence algorithms to model our existing data and then predict the remaining responses. We are not certain that these algorithms will work, but since we have not had much success with current computational models of the auditory periphery, this seems to be the most viable approach. While this method appears to be the best way forward at the moment, it is not what was proposed to NSF. We have, therefore, not requested a no-cost extension and have returned the remaining grant funds to NSF. We are also working to make all of our recorded responses easily and freely available online. We believe that the results we obtained with NSF support will be a valuable resource for ourselves and other investigators for many years to come.
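The report does not name the artificial intelligence algorithms under consideration, so the following is only a hypothetical sketch of the general idea: fit a regression model on features of the fibers and stimuli already recorded, then predict binned responses for unrecorded combinations. Ridge regression and synthetic data stand in here for whatever model and features would actually be used.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Hypothetical features for each recorded fiber/syllable pair: e.g.,
# characteristic frequency, threshold, spontaneous rate, plus a few
# stimulus descriptors. Targets are binned firing rates (a PSTH).
n_pairs, n_features, n_bins = 300, 8, 50
X = rng.standard_normal((n_pairs, n_features))
true_map = rng.standard_normal((n_features, n_bins))
Y = X @ true_map + 0.1 * rng.standard_normal((n_pairs, n_bins))

# Fit on the recorded pairs, then check how well held-out responses
# are predicted (sklearn's Ridge handles multi-output targets).
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, Y_train)
print("held-out R^2:", model.score(X_test, Y_test))
```

Whether such a model generalizes to genuinely new talkers and consonants is exactly the open question the report raises.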

Agency: National Science Foundation (NSF)
Institute: Division of Behavioral and Cognitive Sciences (BCS)
Application #: 0743915
Program Officer: Betty H. Tuller
Budget Start: 2008-03-15
Budget End: 2014-02-28
Fiscal Year: 2007
Total Cost: $435,086
Name: University of Illinois Urbana-Champaign
City: Champaign
State: IL
Country: United States
Zip Code: 61820