Spontaneous speech, accented speech, and speech in noise continue to
pose significant challenges for automatic speech recognition (ASR)
technology; error rates of ASR systems remain unacceptably high for
these types of speech. This project establishes a consistent
framework that seeks to cope with all of these conditions. The novel
approach to phonetic variability investigated here views the problem
as one of phonetic information underspecification: some subset of
information that the listener receives will be missing or uncertain.
Lexical access is thus a phonetic code-breaking problem --- how can a
system accumulate phonetic cues in each of these conditions to
recognize words on the basis of incomplete evidence?
The research program of this project takes a multidisciplinary
approach to integrating linguistic theory with speech recognition
technology; discriminative statistical models of linguistic features
are employed to model nonlinear, overlapping phonological effects
observed in speech. The framework allows derivation of new linguistic
insights through analysis of trained systems.
The educational program fosters interdisciplinary research (with
cross-disciplinary graduate seminars) and increases participation of
underrepresented students in Computer Science by introducing language
technology topics early into the undergraduate curriculum and
encouraging undergraduate research.
Apart from cultivating a new way of thinking about pronunciation
variation for ASR, the broader impacts of this research include
providing collaborative resources for the ASR and linguistics
communities, to be discussed in tutorial and workshop settings.
Addressing noise, accent, and speaking style within a consistent
framework will also improve ASR technology for many who are
underserved by current systems.
The main scientific premise of this project is that the way we perceive speech in the face of noise and varying accents can be thought of as a problem of breaking a phonetic code: human listeners receive incomplete evidence that they are able to reassemble into messages. Computer models of speech used to recognize what was said (the Automatic Speech Recognition, or ASR, problem) could be improved by including evidence-combination techniques (a form of machine learning). Moreover, thinking about these machine learning techniques in an interdisciplinary way (combining insights from linguistics and computer science) can lead to new ways of thinking about general problems in linguistics. The main outcomes of the project included two general findings (summarized from roughly 25 publications).

First, statistical methods called Conditional Random Fields (CRFs), which are relatively new to the ASR field, were shown to be effective combiners of linguistic information. For example, one view of speech represents sound patterns as whole blocks in time (known as phones); these are the traditional building blocks of ASR systems. However, these sounds can also be broken into "phonological feature" categories -- a multi-dimensional representation of speech sounds. CRFs proved to be a much more effective method of combining these different representations of speech than the traditional Hidden Markov Model (HMM), and can decrease the errors made by a system much more than either representation alone. Our explorations examined how features can be combined both within short, local windows of speech and over longer timescales (see the first sketch below).

A second outcome was a new way of thinking about how phonetic information is affected by noise. When the human ear hears speech and noise together, some frequencies of the speech are blocked by the noise in a process called masking; this is similar to the visual phenomenon of distant objects being partially obscured by closer objects in the line of sight. Noise severely degrades ASR performance (i.e., it increases the error rate). Previous methods tried to estimate which parts of the signal were noise-masked and then reconstruct the underlying speech. However, our research showed that, surprisingly, treating the masked components as completely absent was a better strategy than other reconstruction techniques; it is likely better to focus on mask estimation rather than reconstruction (see the second sketch below). Tying into the phonetic-code aspect of the project, we found that mask estimation could be improved by using information from a speech recognition system, and then using machine learning techniques similar to CRFs to improve prediction of the mask. Another thread of research showed how CRFs could be used directly for mask prediction; this overall approach suggests that speech recognition and speech enhancement can be treated as complementary, cooperative processes.

The educational outcomes of this project included new techniques for teaching advanced machine learning concepts in speech and language technology classes, and helping students in linguistics disciplines use machine learning in their own dissertations. In terms of outreach, a tutorial on this material was presented at a major international conference, and three tutorial-style journal articles co-authored by the PI were influenced by this grant.
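To make the evidence-combination idea concrete, the first sketch below shows, in a highly simplified form, how per-frame scores from a phone classifier and a phonological-feature detector might be combined as weighted feature functions and decoded over time. This is an illustrative assumption of the general technique, not the project's actual CRF systems; the weights, dimensions, and toy data are invented for the example, and a real CRF would learn the weights from data.

```python
import numpy as np

def viterbi_combine(phone_scores, feat_scores, transition, w_phone=1.0, w_feat=1.0):
    """Decode a phone sequence from two evidence streams.

    phone_scores: (T, K) per-frame log-scores from a phone classifier.
    feat_scores:  (T, K) per-frame log-scores obtained by mapping
                  phonological-feature detector outputs onto the K phones.
    transition:   (K, K) log-scores for moving between phones.
    The weights play the role of the lambda parameters a CRF would learn.
    """
    T, K = phone_scores.shape
    emit = w_phone * phone_scores + w_feat * feat_scores  # combined evidence
    best = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    best[0] = emit[0]
    for t in range(1, T):
        cand = best[t - 1][:, None] + transition  # (prev state, current state)
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0) + emit[t]
    path = [int(best[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 5 frames, 3 phone classes, random illustrative scores.
rng = np.random.default_rng(0)
phones = np.log(rng.dirichlet(np.ones(3), size=5))
feats = np.log(rng.dirichlet(np.ones(3), size=5))
trans = np.log(np.full((3, 3), 0.2) + 0.4 * np.eye(3))
print(viterbi_combine(phones, feats, trans))
```

The point of the sketch is simply that both representations contribute to the per-frame score, so the decoded sequence can be better than what either stream supports alone; in the project's CRF work those combination weights are estimated discriminatively rather than fixed by hand.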
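The second sketch illustrates the missing-data idea described above: estimate which spectral channels are dominated by noise, then score a frame against a clean-speech model using only the reliable channels, rather than first reconstructing the masked ones. The SNR threshold, Gaussian model, and data are illustrative assumptions, not the project's implementation.

```python
import numpy as np

def estimate_mask(noisy_spec, noise_est, snr_threshold_db=0.0):
    """Label a spectral channel reliable when its local SNR exceeds a threshold."""
    snr_db = 10.0 * np.log10(np.maximum(noisy_spec, 1e-10) /
                             np.maximum(noise_est, 1e-10))
    return snr_db > snr_threshold_db  # True = reliable, False = masked

def masked_log_likelihood(frame, mask, mean, var):
    """Score a frame under a diagonal Gaussian using only reliable channels
    (missing-data marginalization) instead of reconstructing masked ones."""
    diff = frame[mask] - mean[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * var[mask]) + diff ** 2 / var[mask])

# Toy example: 8 spectral channels, a clean-speech Gaussian model, and a
# noisy observation in which several channels are dominated by noise.
rng = np.random.default_rng(1)
mean, var = np.ones(8), np.full(8, 0.5)
noisy = np.abs(rng.normal(1.0, 0.7, size=8))
noise_estimate = np.array([0.1, 0.1, 2.0, 2.0, 0.1, 2.0, 0.1, 2.0])
mask = estimate_mask(noisy, noise_estimate)
print(mask, masked_log_likelihood(noisy, mask, mean, var))
```

In this framing, the quality of the mask estimate matters more than any attempt to fill in the masked values, which is why the project focused on improving mask prediction, including with recognizer feedback and CRF-style predictors.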
The project fostered interdisciplinary research by serving as an example project presented to the new Buckeye Language Network, and members of the project participated in the OhioSpeaks workshop. The PI presented research from this and related projects on Capitol Hill as part of the 2008 Coalition for National Science Funding research day. The PI also gave a talk to the Central Ohio Autism Society on how machine learning can improve autism research.