The goal of this project is to improve the quality of text-to-speech synthesis. Text-to-speech synthesis is an increasingly widely used technology that plays a core role in automated information access by telephone, universal access for individuals with visual or other challenges, and educational software.

The project focuses on how humans perceive acoustic discontinuities in speech. Current text-to-speech synthesis technology operates by retrieving intervals of stored digitized speech from a database and splicing ("concatenating") them to form the output utterance. Unavoidably, there are acoustic discontinuities at the time points where successive speech intervals meet. For reasons that are currently poorly understood, many of these acoustic discontinuities are not audible even when they seem large by any objective measure. This relative insensitivity of human hearing is the reason that concatenative synthesis works at all. Conversely, seemingly small discontinuities are often audible. These facts raise the scientific question of how one can construct an objective acoustic discontinuity measure that accurately predicts, from the quantitative acoustic properties of two to-be-concatenated speech intervals, whether humans will hear a discontinuity.

This question is not only of interest for a better understanding of human hearing, but also of immediate practical relevance. Many text-to-speech synthesis systems select speech intervals at run time from a large speech corpus. During selection, these systems search through the space of all possible sequences of speech intervals that can be used for the utterance and select the sequence that has the lowest overall objective cost, computed from a measure such as the Euclidean distance between the final frame of one interval and the initial frame of the next. However, research has already shown that this method and related methods do not predict well whether humans will hear a discontinuity. The current research, by focusing explicitly on perceptually optimized objective cost measures, will directly contribute to the perceptual accuracy of cost measures and hence to synthesis quality.
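The selection step described above can be illustrated with a minimal sketch. It assumes that each speech interval is represented as a list of acoustic feature frames, that the only cost is the Euclidean distance between the final frame of one interval and the initial frame of the next, and that the search is done by dynamic programming. The names join_cost and select_units are hypothetical and chosen for illustration; they are not taken from the project or from any particular synthesis system, and a real system would likely combine this join cost with other cost terms.

```python
import math


def join_cost(left_unit, right_unit):
    """Euclidean distance between the final frame of left_unit
    and the initial frame of right_unit (each unit is a list of frames)."""
    return math.dist(left_unit[-1], right_unit[0])


def select_units(candidates):
    """Pick one candidate unit per position so that the summed join cost is minimal.

    candidates: list of positions; each position is a list of candidate units;
                each unit is a list of frames (sequences of floats).
    Returns (total_cost, chosen_indices), one index per position.
    """
    if not candidates:
        return 0.0, []

    # best[j] holds the cheapest (cost, path) ending at candidate j
    # of the position processed so far.
    best = [(0.0, [j]) for j in range(len(candidates[0]))]

    for pos in range(1, len(candidates)):
        previous_units = candidates[pos - 1]
        new_best = []
        for j, unit in enumerate(candidates[pos]):
            # Extend every path from the previous position and keep the cheapest.
            cost, path = min(
                (
                    (prev_cost + join_cost(previous_units[i], unit), prev_path)
                    for i, (prev_cost, prev_path) in enumerate(best)
                ),
                key=lambda entry: entry[0],
            )
            new_best.append((cost, path + [j]))
        best = new_best

    return min(best, key=lambda entry: entry[0])


# Illustrative data: three positions, two candidate units each, 2-D frames.
example_candidates = [
    [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [5.0, 5.0]]],
    [[[1.0, 1.0], [2.0, 2.0]], [[4.0, 5.0], [0.0, 0.0]]],
    [[[2.0, 2.0], [3.0, 3.0]], [[0.0, 1.0], [9.0, 9.0]]],
]
total_cost, chosen = select_units(example_candidates)
```

The sketch finds the sequence that is optimal under the Euclidean measure. As the abstract notes, such a measure does not predict well whether listeners hear a discontinuity, so a perceptually optimized measure would replace join_cost while the search itself could stay unchanged.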

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0313383
Program Officer: Tatiana D. Korelsky
Project Start:
Project End:
Budget Start: 2003-09-01
Budget End: 2007-08-31
Support Year:
Fiscal Year: 2003
Total Cost: $410,000
Indirect Cost:
Name: Oregon Health and Science University
Department:
Type:
DUNS #:
City: Portland
State: OR
Country: United States
Zip Code: 97239