The objective of this project is to develop a modeling framework that enables extensive use of prosodic information, such as pitch, duration, and energy characteristics, in a large class of applications that require spoken language understanding. To this end, prosodic features are extracted from the speech signal over regions defined by automatically detectable events. The result is a variable-length sequence of vectors that are usually high-dimensional, follow mixed discrete and continuous distributions, and may contain undefined values. The project focuses on finding a transformation that maps this sequence of prosodic features to a single vector that adequately represents its important characteristics. The proposed transform is formed by projecting the distribution of the features in a given sample onto a set of probability distributions that are represented by dynamic Bayesian networks and held in a predetermined dictionary.
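The idea of turning a variable-length feature sequence into a single fixed-length vector can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the project description: each dictionary entry is a simple diagonal Gaussian rather than a dynamic Bayesian network, the "projection" is taken to be the average per-frame log-likelihood under each entry, and undefined values are encoded as `None` and marginalized out. The names `DICTIONARY`, `frame_log_likelihood`, and `transform` are hypothetical.

```python
import math

# Hypothetical dictionary: each entry is a diagonal Gaussian (means, variances).
# This is only a stand-in; the project describes dynamic Bayesian networks.
DICTIONARY = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([2.0, -1.0], [1.0, 4.0]),
    ([-3.0, 5.0], [0.5, 2.0]),
]


def frame_log_likelihood(frame, means, variances):
    """Log-density of one frame, ignoring undefined (None) components."""
    total = 0.0
    for x, mu, var in zip(frame, means, variances):
        if x is None:
            # With independent dimensions, marginalizing out a missing value
            # simply drops its term from the sum.
            continue
        total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return total


def transform(sequence, dictionary=DICTIONARY):
    """Map a variable-length sequence to one score per dictionary entry."""
    if not sequence:
        raise ValueError("sequence must contain at least one frame")
    return [
        sum(frame_log_likelihood(f, m, v) for f in sequence) / len(sequence)
        for m, v in dictionary
    ]


# Two samples of different lengths, both containing undefined values.
short_seq = [[0.1, None], [0.3, -0.2]]
long_seq = [[2.1, -1.5], [1.8, None], [2.4, 0.3], [None, -2.0], [1.9, -0.7]]

# Both outputs have one component per dictionary entry, regardless of length.
print(len(transform(short_seq)), len(transform(long_seq)))
```

Under these assumptions, the output length depends only on the size of the dictionary, so samples of any duration become directly comparable. Replacing the Gaussians with sequence models would let the scores also reflect temporal structure, which this sketch ignores.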
The ultimate goal of the project is to create a general, probabilistic, model-based transform paradigm that operates robustly on complex feature sets. The work will therefore also contribute to other domains in which features exhibit characteristics that are challenging for standard approaches. The tools and corpora developed during the project will be made available to the community. The results will contribute to scientific knowledge on the use of prosodic information and increase the capabilities of spoken language understanding and dialog systems.