Recognizing speech or other auditory objects in adverse environments -- e.g. with noise, reverberation, and multiple speakers -- is essential for human and animal communication. Current speech recognition technologies work well in high signal-to-noise conditions, but perform orders of magnitude below human performance in adverse conditions. Converging evidence from neuroscience suggests that auditory information is encoded in sparse and precisely timed spikes of sub-cortical neurons. However, the extent to which codes based on spike timing might underlie the robustness of human auditory object recognition has not yet been fully investigated. This project bridges this gap by devising a biologically inspired computational model of auditory processing at the cortical level and extracting computational principles that are essential for the model to achieve robust auditory object recognition.

The approach is to transform sounds into the spike sequences generated by feature-detecting thalamic auditory neurons, and to integrate these spikes spatially and temporally using the state-dependent dynamics of cortical neurons with active dendrites. In the proposed model, an auditory object first evokes sequential spiking of thalamic neurons that have been trained to detect useful features. Then, through feed-forward excitation and inhibition from the thalamus, and lateral excitation and inhibition from the cortical neurons, the state of the cortical network evolves, leading to temporal integration. Recognition of the auditory object is signaled when the cortical neurons reach a specific network state. The computational model is constrained by experimental results on the properties of cortical neurons, the organizational principles of cortical networks, and the activity-dependent plasticity rules of the network structures. The project aims both to design feature detectors that can robustly represent auditory objects with spatiotemporal spike sequences, and to build a cortical network model that can recognize specific auditory objects using state transitions driven by the thalamic inputs, with neuron dynamics that can be compared with those observed in the auditory cortex. The recognition performance of the computational model will be evaluated and improved with auditory tasks designed to compare different approaches to speech recognition.

Project Report

Automatic speech recognition (ASR) technology has been under development for decades, but only recently has it become good enough for widespread commercial application. Even today’s most advanced systems, however, are not reliable. This is particularly true in the challenging acoustic environments in which we use devices today, whether a smartphone in a noisy bar or a GPS system in a car with the radio on. Humans far outperform ASR systems, especially in challenging acoustic conditions with noise, reverberation, or multiple speakers.

In this project, we proposed to take inspiration from neurophysiological studies of neurons in auditory cortex. A number of studies have demonstrated that spike responses can be: spatially and temporally sparse; repeatable on multiple presentations of a stimulus; precisely timed to the level of milliseconds; and invariant to added acoustic noise. We sought better ways of modeling the behavior of cortical neurons and using the resulting representations for decoding speech.

In designing our method, we began with the notion that the sequence of spikes can carry important information on the structure of speech. Taking the view that speech comprises a sequence of phonetic features (onsets and offsets of vowels, hard consonants, etc.), we reasoned that each neuron in a population can perform a simple discrimination task on a single feature. We call these neurons "feature detectors." To build the feature detectors, we began with a well-known model of the auditory periphery. The model outputs signals corresponding to the auditory nerve (AN) firing rates in various frequency channels (Figure 1). To this vector of firing rates at a given time, we add delayed versions of the AN firing rates so that the feature detectors can identify patterns across both time and frequency. We model the feature detectors as sum-and-threshold neurons.
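
The sum-and-threshold detector described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names (`delay_embed`, `detector_spikes`), the zero-padding before the start of the signal, and the toy weights and threshold are all assumptions made for the example. The report does not specify how many delays were used or how the periphery model's output is formatted.

```python
import numpy as np


def delay_embed(rates, n_delays):
    """Stack each time frame with its n_delays preceding frames.

    rates: array of shape (n_channels, n_frames) holding AN firing rates.
    Returns an array of shape (n_channels * (n_delays + 1), n_frames).
    Frames before the start of the signal are treated as zero (an assumption).
    """
    n_channels, n_frames = rates.shape
    padded = np.concatenate([np.zeros((n_channels, n_delays)), rates], axis=1)
    # Block d holds the rates delayed by d frames.
    blocks = [
        padded[:, n_delays - d : n_delays - d + n_frames]
        for d in range(n_delays + 1)
    ]
    return np.concatenate(blocks, axis=0)


def detector_spikes(rates, weights, threshold, n_delays):
    """Sum-and-threshold neuron: spike wherever the weighted sum of the
    delay-embedded input reaches the threshold."""
    drive = weights @ delay_embed(rates, n_delays)
    return drive >= threshold


# Toy input: 2 frequency channels, 4 time frames (hypothetical values).
rates = np.array([[0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
# Hypothetical detector: channel 1 active now AND channel 0 active one frame ago.
# Embedded row order: [ch0 now, ch1 now, ch0 delayed, ch1 delayed].
weights = np.array([0.0, 1.0, 1.0, 0.0])
spikes = detector_spikes(rates, weights, threshold=1.5, n_delays=1)
```

In this toy case the detector fires only at the frame where activity in channel 0 is followed one frame later by activity in channel 1, which shows how the delayed inputs let a single weighted sum pick out a pattern across both time and frequency.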
To train the neurons, we chose a set of weights for each neuron such that it fires sparsely and selectively in response to its preferred feature. We trained a population of 1100 neurons to detect a wide variety of features. To compare the response properties of the feature detectors to those of cortical neurons, we computed their spectrotemporal receptive fields (STRFs). The STRF characterizes the preferred stimuli of a neuron and is frequently used in experimental studies. The STRFs display a variety of spectrotemporal modulations (Figure 2). To display a set of STRFs representative of the full range of modulation types, we ranked the STRFs using an overall modulation score (Figure 2B). A low modulation score indicates an STRF with greater spectral modulation, while a high modulation score indicates one with greater temporal modulation. We found that the full range of STRF modulation types was essential to our system’s robust recognition performance - any attempt to select a subset of feature detectors based on their STRF characteristics yielded a decrease in performance.

We next developed methods for decoding the spike sequences in order to identify words and perform ASR. Our approach from the beginning was to use the sequence of spikes to perform recognition. Figure 3A illustrates what this means. To each neuron we assign a unique numerical label. The full spike timing code is then converted to a sequence of these labels according to the order in which the spikes occurred. In cases where multiple spikes occurred at the same discrete time step, the corresponding neuron labels at that time step were placed in ascending order. Note that the overall order was arbitrary. This sequence of labels is what we call a "spike sequence code". We obtained good ASR performance using a novel template-based speech decoder.
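
The conversion from spike times to a spike sequence code described above might look like the following sketch. The input format (a list of `(time_step, neuron_label)` pairs) and the function name `spike_sequence_code` are assumptions for illustration; the report does not describe how spikes were stored.

```python
def spike_sequence_code(spikes):
    """Convert spikes to a sequence of neuron labels.

    spikes: iterable of (time_step, neuron_label) pairs, in any order.
    Labels are ordered by spike time; spikes that share a time step are
    placed in ascending label order, as described in the text.
    """
    # Tuples sort by time step first, then by label, which gives both
    # the temporal order and the ascending tie-break in one pass.
    return [label for _, label in sorted(spikes)]


# Hypothetical spikes: neurons 2 and 7 both fire at time step 3.
example_spikes = [(3, 7), (1, 12), (3, 2), (0, 5)]
code = spike_sequence_code(example_spikes)  # [5, 12, 2, 7]
```

Note that the resulting code discards the absolute spike times and keeps only their order, which is the property the decoder relies on.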
Instead of modeling the statistics of the speech, which requires some knowledge of the noise, we take a heuristic approach in which unknown speech is compared directly to template utterances stored from training data. Several examples of each word are used as templates. The best match is found using the longest common subsequence (LCS) between the test and template sequences. Because it tolerates corruptions of the spike code in noise, the LCS gives very robust recognition results, shown in Figure 4. In particular, we achieve 80% recognition of speech at 0 dB SNR. This exceeds the performance of another model designed specifically for noise robustness on the same task.

We have explored computational methods for producing spike-based representations of speech, as well as the use of these representations to perform recognition of noisy speech. The primary advantage of our system over other current ASR technologies is its highly robust performance in a variety of noise conditions despite being trained only on clean speech. We have carved out new territory in our use of spike-based processing for this task, and further improvements are likely as these methods are refined and developed.
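
The template matching by longest common subsequence described above can be illustrated with a short sketch. The LCS length is computed with the standard dynamic-programming recurrence. The scoring rule (LCS length divided by the length of the longer sequence), the `recognize` function, and the toy templates are assumptions for this example; the report does not state how LCS lengths were normalized or how ties were broken.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def recognize(test_seq, templates):
    """Return (word, score) for the best-matching template.

    templates: dict mapping each word to a list of template label sequences.
    Score is LCS length over the longer sequence length (an assumed
    normalization, not taken from the report).
    """
    best_word, best_score = None, -1.0
    for word, seqs in templates.items():
        for seq in seqs:
            longer = max(len(test_seq), len(seq))
            score = lcs_length(test_seq, seq) / longer if longer else 0.0
            if score > best_score:
                best_word, best_score = word, score
    return best_word, best_score


# Hypothetical templates: several clean examples per word.
templates = {
    "yes": [[5, 12, 2, 7], [5, 2, 7]],
    "no": [[9, 4, 1], [9, 1, 3]],
}
# A noisy test sequence: label 12 was replaced by a spurious label 8.
noisy = [5, 8, 2, 7]
word, score = recognize(noisy, templates)  # ("yes", 0.75)
```

Because a subsequence may skip elements, a spurious or missing spike lowers the score only slightly instead of breaking the match, which is consistent with the robustness to noise-induced corruptions noted in the text.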

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1116530
Program Officer: Kenneth C. Whang
Project Start:
Project End:
Budget Start: 2011-09-01
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2011
Total Cost: $296,470
Indirect Cost:
Name: Pennsylvania State University
Department:
Type:
DUNS #:
City: University Park
State: PA
Country: United States
Zip Code: 16802