The unnaturalness of synthesized speech found in current interactive voice response (IVR) systems is due to the lack of natural prosodic variation. State-of-the-art IVR systems are often highly intelligible and may sound natural for short prompts or when the text to be spoken is close to speech recorded for the system's database. Once they deviate from these narrow bounds, however, results range from "boring and mechanical" to "odd and confusing." To address these deficiencies, the PIs are developing a new method for learning contour assignment for dialogue systems that avoids the sparse data problem without requiring massive new annotation. Working from corpora, they have generated a series of hypotheses about which features of the dialogue context influence human speakers' choice of contour, and they are testing these hypotheses via a series of targeted laboratory experiments. By designing a carefully controlled set of production and perception studies, the PIs will be able to determine which intonational features are most reliably correlated with contour choice and which are perceptually most salient for listeners.
From a practical viewpoint, incorporating the appropriate assignment of full intonational contours into an IVR system will greatly enhance its perceived naturalness. From a scientific viewpoint, such a model will expand our understanding of how speakers use and hearers interpret intonational contour variation. From a social viewpoint, the creation of IVR systems that interact with users naturally will increase the acceptability of such systems, bringing the vision of ubiquitous access to information and services for all closer to reality.