Hidden Markov models (HMMs) have been successfully applied to automatic speech recognition for more than 35 years even though a key HMM assumption - the statistical independence of frames - is obviously violated by speech data. In fact, this data/model mismatch has inspired many attempts to modify or replace HMMs with alternative models that are better able to take into account the statistical dependence of frames. The scientific goal of this work is to discover predictable regions of statistical dependence in speech data and quantify their effect on HMM-based recognition accuracy. In contrast to previous studies of statistical dependency, this research uses the HMM to explore its departure from the data via exploratory data analysis (EDA). The methodology is to first analyze the data and its fit to the model, searching for regions of predictable statistical dependence - model/data mismatch. EDA is used again to develop simple models of the effect of the predictable mismatch on recognition accuracy. A key piece of this analysis is the development and use of graphical tools to visualize the statistical dependency, the recognition errors, and their relationship. The results of this research will provide important clues for the design of HMM generalizations. The analysis methodology is central to the field of statistics, but is rarely used in speech recognition research. Graduate students working on this project will learn its utility and how to use it on other problems. Open source versions of the software developed will be made available for free downloading.
Automatic speech recognition is an enormously successful application of statistical pattern recognition. Every day millions of people use applications based on this technology to carry out tasks that are most naturally accomplished by interacting with machines via voice. However, the most successful of these applications have always been rather limited in scope because, although useful, speech recognition can be maddeningly unreliable. For example, human beings are easily able to understand one another despite loud background noise in a crowded room, severe distortion over a telephone channel, or wide variation in accents within their common language, but even much milder examples of these problems will completely derail a speech recognition system. The vision of Captain Kirk calmly interacting with his spaceship's computer during a battle with the Klingon Empire must seem light years away to the poor soul trying to use an interactive voice menu to re-book a canceled flight over the phone in a busy airport. The goal of this project is to understand in a deep, quantitative way why the methodology used in nearly all speech recognizers is so brittle. At the heart of this methodology is a statistical model known as the hidden Markov model. This model uses a Markov chain, a model for the evolution of a discrete-time process, to approximate the temporal structure of the phonetic units that we use in speech recognition. (Technically this Markov chain is unobserved, or hidden, but it is used to explain the observable temporal structure in the phonetic units.) A Markov chain assumes that the set of possible underlying states in the process is finite. In addition, it specifies each of the transition probabilities from one possible state to the next. Once the model is in a particular state - call it the current state - the probability of transitioning from the current state to any other state depends only on the identity of the current state.
In particular, this transition probability does not depend on what happened before the model arrived in the current state: this is the Markov property from which Markov chains get their name. As the model advances along this hidden Markov chain, an acoustic observation is emitted at each step according to a prescribed, marginal probability distribution. Thus, this model makes two very strong assumptions, both of which have always been known to be false for human speech data. The first assumption arises from the structure of the hidden Markov chain. It is called the conditional independence assumption, and it concerns the temporal structure of speech: the successive frames emitted at each step are assumed to be independent of one another, given the hidden states. The second assumption lies in the choice that we make for the marginal probability distribution: this is almost always taken to be multivariate normal. In this project we use simulation and a novel sampling process to generate pseudo test data that deviate from the two hidden Markov model assumptions in a controlled fashion. The novel sampling process, called resampling, was adapted from Bradley Efron's work on the bootstrap. In essence, resampling is a non-parametric analog of simulating data from a known parametric distribution: given a sample from an unknown population distribution, we simulate from the empirical distribution derived from the sample. To simulate using this empirical distribution, we simply do random draws (with replacement) from the sample, hence the terminology resampling. These processes allow us to generate pseudo data that, at one extreme, agree with all of the model's assumptions and, at the other extreme, deviate from the model in exactly the way real data do. In between, we can precisely control the degree of data/model mismatch. By measuring recognition performance on these pseudo test data, we are able to quantify the effect of this controlled data/model mismatch on recognition accuracy.
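The two ways of generating pseudo data described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the project's software: the two-state transition matrix, the one-dimensional Gaussian emissions, and the function names (simulate_states, simulate_frames, pool_by_state, resample_frames) are all hypothetical, and pooling frames by their aligned state before resampling is one plausible reading of how the empirical distribution might be formed.

```python
import random
from collections import defaultdict

def simulate_states(n_frames, trans, rng):
    """Walk a Markov chain: the next state depends only on the current one."""
    state, path = 0, []
    for _ in range(n_frames):
        path.append(state)
        state = rng.choices(range(len(trans)), weights=trans[state])[0]
    return path

def simulate_frames(path, means, stdev, rng):
    """Parametric simulation: one independent Gaussian draw per frame."""
    return [rng.gauss(means[s], stdev) for s in path]

def pool_by_state(path, frames):
    """Group observed frames by the state they are aligned to."""
    pools = defaultdict(list)
    for s, x in zip(path, frames):
        pools[s].append(x)
    return pools

def resample_frames(path, pools, rng):
    """Non-parametric analog: draw with replacement from each state's pool."""
    return [rng.choice(pools[s]) for s in path]

rng = random.Random(0)
trans = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical transition probabilities
path = simulate_states(200, trans, rng)
# Stand-in for real frames; in practice these would come from speech data.
observed = simulate_frames(path, means=[0.0, 3.0], stdev=1.0, rng=rng)
pools = pool_by_state(path, observed)
pseudo = resample_frames(path, pools, rng)
```

In this sketch simulate_frames produces data that satisfy both model assumptions, while resample_frames keeps the empirical per-state distribution of the observed frames but draws each frame independently, so any dependence between neighboring frames is removed. Resampling larger units than single frames would retain more of the dependence found in real data; that extension is not shown here.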
The results of this study are startling enough that they should provoke further studies and a re-examination of where to improve the statistical models that we use in speech recognition. First, we demonstrate that if real speech data satisfied both of the hidden Markov model's assumptions, then speech recognizers would make virtually no errors. Second, our results show that the long-range statistical dependence that is present in speech data, and that is at variance with the hidden Markov model's conditional independence assumption, is the single largest source of recognition errors, dwarfing the errors due to the data violating our choice for the marginal probability distribution. Finally, we demonstrate that discriminative training - an extremely effective alternative to parameter estimation using maximum likelihood - improves recognition accuracy by indirectly compensating for the hidden Markov model's conditional independence assumption. Taken together, these extremely surprising results strongly suggest that pursuing a deeper understanding of the nature of the dependency structure in speech data is a critical first step towards better statistical models for more robust recognition performance.