Current speech recognition technology, while useful in constrained domains with cooperative speakers, still yields unacceptably high error rates (30-50%) on unconstrained conversational or broadcast speech. An important difference between these tasks and high-accuracy conditions is the larger variability in speaking style, even within data from a single speaker. Existing acoustic models do not account for the systematic factors behind this variability and so must be "broader," leading to more confusability among words and hence higher error rates. This work proposes to improve acoustic models by representing sources of variability at three time scales: the syllable, short regions within an utterance, and the speaker. At the syllable level, automatic clustering will capture syllable position and phonetic reduction effects. At the region level, a slowly varying hidden speaking mode will indicate systematic differences in pronunciations associated with reduced vs. clearly articulated speech. At the speaker level, hierarchical models of the correlation among speech sounds will improve adaptation of acoustic models from small amounts of data. Experiments will involve large vocabulary recognition of conversational speech using a multi-pass search strategy to handle the cost of the higher-order models proposed here. By representing systematic variability, the proposed work should significantly advance both the target task of unconstrained speech recognition and human-computer speech communication more generally.
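To make the "slowly varying hidden speaking mode" idea concrete, the following is a minimal toy sketch (not the proposal's actual model): a two-state hidden mode (clear vs. reduced) with high self-transition probability conditions which pronunciation variant each word takes, and the mode sequence is summed out with a standard forward recursion. All words, pronunciations, and probability values are invented for illustration.

```python
# Toy illustration of a slowly varying hidden speaking mode conditioning
# pronunciation variants; all numbers below are invented for demonstration.

MODES = ("clear", "reduced")

# High self-transition probabilities make the mode change slowly over time.
TRANS = {
    "clear":   {"clear": 0.9, "reduced": 0.1},
    "reduced": {"clear": 0.1, "reduced": 0.9},
}
INIT = {"clear": 0.5, "reduced": 0.5}

# Mode-conditioned pronunciation probabilities for two hypothetical words.
PRON = {
    ("probably", "clear"):   {"p r aa b ax b l iy": 0.8, "p r aa b l iy": 0.2},
    ("probably", "reduced"): {"p r aa b ax b l iy": 0.2, "p r aa b l iy": 0.8},
    ("going to", "clear"):   {"g ow ih ng t uw": 0.9, "g ah n ax": 0.1},
    ("going to", "reduced"): {"g ow ih ng t uw": 0.2, "g ah n ax": 0.8},
}


def sequence_likelihood(words, prons):
    """Forward algorithm over the hidden mode, summing out all mode sequences."""
    # Initialize with the prior over modes times the first word's emission.
    alpha = {m: INIT[m] * PRON[(words[0], m)].get(prons[0], 0.0) for m in MODES}
    for word, pron in zip(words[1:], prons[1:]):
        alpha = {
            m: sum(alpha[p] * TRANS[p][m] for p in MODES)
               * PRON[(word, m)].get(pron, 0.0)
            for m in MODES
        }
    return sum(alpha.values())


if __name__ == "__main__":
    words = ["probably", "going to"]
    # A consistently reduced rendering scores higher than a mixed one,
    # reflecting the assumption that the speaking mode changes slowly.
    print(sequence_likelihood(words, ["p r aa b l iy", "g ah n ax"]))       # ~0.309
    print(sequence_likelihood(words, ["p r aa b l iy", "g ow ih ng t uw"]))  # ~0.191
```

Under these assumed numbers, pronunciations that are consistently reduced (or consistently clear) across neighboring words score higher than mixed renderings, which is the effect the region-level mode variable is meant to capture.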

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 9618926
Program Officer: Ephraim P. Glinert
Budget Start: 1997-03-01
Budget End: 1999-09-29
Fiscal Year: 1996
Total Cost: $679,170
Name: Boston University
City: Boston
State: MA
Country: United States
Zip Code: 02215