Spontaneous speech, accented speech, and speech in noise continue to provide automatic speech recognition (ASR) technology with significant challenges; error rates of ASR systems are still unacceptably high for these types of speech. This project establishes a consistent framework that seeks to cope with all of these conditions. The novel approach to phonetic variability investigated here views the problem as one of phonetic information underspecification: some subset of the information that the listener receives will be missing or uncertain. Lexical access is thus a phonetic code-breaking problem: how can a system accumulate phonetic cues in each of these conditions to recognize words on the basis of incomplete evidence?

The research program of this project takes a multidisciplinary approach to integrating linguistic theory with speech recognition technology; discriminative statistical models of linguistic features are employed to model nonlinear, overlapping phonological effects observed in speech. The framework allows the derivation of new linguistic insights through analysis of trained systems.

The educational program fosters interdisciplinary research (with cross-disciplinary graduate seminars) and increases participation of underrepresented students in Computer Science by introducing language technology topics early into the undergraduate curriculum and encouraging undergraduate research.

Apart from cultivating a new way of thinking about pronunciation variation for ASR, the broader impacts of this research are to provide collaborative resources for the ASR and linguistics communities to discuss in tutorial and workshop settings. Addressing noise, accent, and speaking style in a consistent framework will also improve ASR technology for many who are underserved by current systems.

Project Report

The main scientific premise of this project is that the way we perceive speech in the face of noise and varying accents can be thought of as a problem of breaking a phonetic code: humans perceive incomplete evidence that they are able to reassemble into messages. Computer models used to recognize what was said (the automatic speech recognition, or ASR, problem) could be improved by incorporating evidence-combination techniques, a form of machine learning. Moreover, thinking about these machine learning techniques in an interdisciplinary way, combining insights from linguistics and computer science, can lead to new ways of thinking about general problems in linguistics. The main outcomes of the project included two general findings, summarized from roughly 25 publications.

First, statistical methods called Conditional Random Fields (CRFs), which are relatively new to the ASR field, were shown to be effective combiners of linguistic information. For example, one view of speech represents sound patterns as whole blocks in time, known as phones; these are the traditional building blocks of ASR systems. However, these sounds can also be broken into "phonological feature" categories, a multi-dimensional representation of speech sounds. CRFs proved to be a much more effective method for combining these different representations of speech than the traditional Hidden Markov Model (HMM), and they can decrease the errors made by a system much more than either representation alone. Our explorations examined how feature combinations can be modeled both within short, local windows of speech and over longer timescales (the first sketch below gives a toy illustration of this kind of evidence combination).

A second outcome was a new way of thinking about how phonetic information is affected by noise. When the human ear hears speech and noise together, some frequencies of the speech are blocked by the noise in a process called masking; this is similar to the visual phenomenon of distant objects being partially obscured by closer objects in the line of sight. Noise severely degrades ASR performance (that is, it increases the error rate). Previous methods tried to estimate which parts of the signal were noise-masked and then reconstruct the underlying speech. Our research showed that, surprisingly, treating the masked components as completely absent was a better strategy than the reconstruction techniques, suggesting that it is better to focus effort on mask estimation rather than reconstruction (the second sketch below illustrates this missing-data idea). Tying into the phonetic-code aspect of the project, we found that mask estimation could be improved by feeding back information from a speech recognition system and then using machine learning techniques similar to CRFs to improve prediction of the mask. Another thread of research showed how CRFs could be used directly for mask prediction; this overall approach suggests that speech recognition and speech enhancement can be treated as complementary, cooperative processes.

The educational outcomes of this project included new techniques for teaching advanced machine learning concepts in speech and language technology classes, as well as help for students in linguistics disciplines in applying machine learning to their own dissertations. In terms of outreach, a tutorial on this material was presented at a major international conference, and three tutorial-style journal articles co-authored by the PI were influenced by this grant.
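To make the evidence-combination idea concrete, the first sketch below shows a minimal linear-chain decoder over two per-frame evidence streams: phone-classifier scores and phonological-feature detector scores. It is an illustrative toy, not the project's actual system; the array shapes, weight matrices, and random inputs are invented for the example, and a real CRF would learn its weights discriminatively from labeled speech rather than taking them as given.

    import numpy as np

    def viterbi_decode(phone_scores, feat_scores, w_phone, w_feat, transitions):
        # phone_scores: (T, L) per-frame scores from a phone classifier
        # feat_scores:  (T, F) per-frame scores from phonological-feature detectors
        # w_phone:      (L, L) weights tying phone evidence to each output label
        # w_feat:       (F, L) weights tying feature evidence to each output label
        # transitions:  (L, L) label-to-label transition weights
        # Combined per-frame (unary) score: both evidence streams contribute.
        unary = phone_scores @ w_phone + feat_scores @ w_feat      # (T, L)
        T, L = unary.shape
        best = np.full((T, L), -np.inf)
        back = np.zeros((T, L), dtype=int)
        best[0] = unary[0]
        for t in range(1, T):
            cand = best[t - 1][:, None] + transitions + unary[t][None, :]
            back[t] = cand.argmax(axis=0)   # best previous label for each label
            best[t] = cand.max(axis=0)
        # Backtrace the highest-scoring label sequence.
        path = [int(best[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    # Toy usage: 10 frames, 4 labels, 6 phonological features, random detector outputs.
    rng = np.random.default_rng(0)
    print(viterbi_decode(rng.normal(size=(10, 4)), rng.normal(size=(10, 6)),
                         np.eye(4), rng.normal(size=(6, 4)), np.zeros((4, 4))))

The point of the sketch is only that both representations of the speech signal enter a single, jointly scored sequence model rather than being forced through one set of HMM state distributions.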
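The second sketch illustrates the missing-data idea described above, again only as a toy under invented assumptions: an oracle binary mask marks time-frequency channels whose local SNR falls below a threshold, and a diagonal-Gaussian speech model is scored by simply omitting (marginalizing over) the masked channels rather than reconstructing them. The function names, threshold, and Gaussian parameters are illustrative, not the project's actual models.

    import numpy as np
    from scipy.stats import norm

    def ideal_binary_mask(speech_energy, noise_energy, threshold_db=0.0):
        # A channel counts as "present" when its local SNR exceeds the threshold.
        snr_db = 10.0 * np.log10((speech_energy + 1e-10) / (noise_energy + 1e-10))
        return snr_db > threshold_db                                 # boolean (T, F)

    def masked_log_likelihood(obs, mask, means, variances):
        # Per-frame log-likelihood under a diagonal Gaussian, summed only over
        # unmasked (speech-dominated) channels; masked channels contribute nothing.
        ll = norm.logpdf(obs, loc=means, scale=np.sqrt(variances))   # (T, F)
        return np.where(mask, ll, 0.0).sum(axis=1)                   # (T,)

    # Toy usage: 5 frames x 8 frequency channels of made-up energies.
    rng = np.random.default_rng(1)
    speech = rng.gamma(2.0, size=(5, 8))
    noise = rng.gamma(2.0, size=(5, 8))
    mask = ideal_binary_mask(speech, noise)
    frame_scores = masked_log_likelihood(np.log(speech + noise), mask,
                                         means=np.zeros(8), variances=np.ones(8))
    print(mask.astype(int))
    print(frame_scores)

In practice the mask is not known and must itself be estimated from the noisy signal, which is where the classifier-based and CRF-based mask prediction described above comes in.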
Beyond these outcomes, the project fostered interdisciplinary research by serving as an example project presented at the new Buckeye Language Network, and members of the project participated in the OhioSpeaks workshop. The PI presented research from this and related projects on Capitol Hill as part of the 2008 Coalition for National Science Funding research day. The PI also gave a talk on how machine learning can improve autism research at the Central Ohio Autism Society.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0643901
Program Officer: Tatiana D. Korelsky
Project Start:
Project End:
Budget Start: 2006-12-15
Budget End: 2012-11-30
Support Year:
Fiscal Year: 2006
Total Cost: $502,952
Indirect Cost:
Name: Ohio State University
Department:
Type:
DUNS #:
City: Columbus
State: OH
Country: United States
Zip Code: 43210