This EArly-concept Grant for Exploratory Research (EAGER) aims to create a neutral reference model from synthetic speech to contrast the emotional content of a speech signal. Emotional understanding is a crucial skill in human communication. For this reason, modeling and recognizing emotions is essential in the design and implementation of interfaces that are better attuned to the user's needs. Starting from the premise that paralinguistic information is non-uniformly conveyed across time, this study aims to identify emotionally prominent regions, or focal points, across various acoustic features. The study explores a novel approach based on synthetic speech to build reference models that characterize the patterns observed in neutral speech. These reference models are used to contrast the emotional information observed in localized segments of a speech signal. The study builds a synthetic speech signal that conveys the same lexical information as, and is temporally aligned with, the target sentence in the database. Since a single synthetic signal is not expected to capture the full range of variability observed in neutral speech, the study explores approaches to produce different neutral synthetic realizations. After creating a parallel corpus with time-aligned synthetic speech, the study evaluates how well synthetic speech captures the acoustic patterns and emotional percepts of neutral, non-emotional speech. A target signal from the database is then compared with the properties observed across the family of synthesized signals.

The study presents a novel approach to building a robust emotion recognition system that exploits the underlying non-uniform externalization process of expressive behaviors. Algorithms that are able to identify localized emotional segments have the potential to shift the current approaches used in affective computing. Instead of recognizing the emotional content of pre-segmented sentences, the problem is formulated as a detection paradigm, which is appealing from an application perspective. These advances represent a transformative breakthrough in behavioral analysis and affective computing. The proposed models and algorithms provide numerous insights for exploring and extending theories of linguistic and paralinguistic human behavior. With the base infrastructure for this exploratory research established, several new scientific avenues will emerge, leading to innovative advances in applications in security and defense, the next generation of advanced user interfaces, health informatics, and education. Furthermore, the scientific methods offer rich opportunities for interdisciplinary training and mentoring of undergraduate and graduate students.

Project Report

Research Objective and Significance: This exploratory project aimed to create a neutral reference model from synthetic speech to contrast the emotional content of a speech signal. Emotional understanding is a crucial skill in human communication. It plays an important role not only in interpersonal interactions, but also in many cognitive activities such as rational decision making, perception, and learning. Starting from the premise that paralinguistic information is non-uniformly conveyed across time, this study aimed to identify emotionally prominent regions, or focal points, across various acoustic features. The approach consists of comparing, frame by frame, the acoustic features extracted from a given speech signal with those extracted from a synthetic speech signal. The overarching research goals of the proposal were: (a) create neutral synthetic reference signals that convey the same lexical information as, and are temporally aligned with, a target sentence; (b) evaluate the hypothesis that synthesized speech provides a valid template reference to describe neutral speech; and (c) contrast the localized emotional content of a target sentence with the reference synthetic speech.

Research Project and Findings: The proposed system takes an input speech signal with its transcription, along with word- and phone-level alignments. The transcription is used to synthesize a speech signal conveying the same lexical information. This step is implemented with Festival, a general multi-lingual speech synthesis system. We created ten voices using different synthesis methods and evaluated the effect of synthesized speech quality on the overall emotion detection performance. After synthesizing the speech file, we temporally align the synthesized signal to the target sentence. The modification is done with the overlap-add method implemented in the Praat toolkit, so that the resulting aligned synthetic speech has word durations identical to those of the target speech. The emotional descriptors used in this study are the dimensions valence (negative versus positive) and arousal (calm versus active).

We compared the properties of neutral, synthetic, and emotional speech using feature analysis, contrasting the acoustic properties of synthetic speech with those observed in neutral and emotional speech. We estimated an exhaustive set of low-level descriptors, including the F0 contour, energy, and spectral features. We then estimated the similarity of the features by computing the normalized Euclidean distance between the features of the target and synthesized sentences (see the sketch below). This analysis provided an objective metric to determine the areas within the activation-valence space that are closest to the acoustic properties of the synthetic speech. Figure 1 illustrates the similarity between the target and synthetic speech in the acoustic domain for the top eight features, where darker colors represent lower normalized Euclidean distance. Speech signals with high arousal deviate the most from the synthetic reference signal. The results indicate that the approach can contrast emotional samples with low and high arousal. We also conducted perceptual evaluations to assess the emotional percepts of neutral, synthetic, and emotional speech. This analysis determined the emotional behaviors that deviate from those conveyed in synthetic speech. The results of the perceptual evaluation are reported in Figure 2 of the attached document, which compares target (red crosses) and synthesized (blue circles) sentences.
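The report does not include code, so the following Python sketch only illustrates the core contrast step under stated assumptions: librosa-based features (F0 via YIN, RMS energy, MFCCs) stand in for the project's low-level descriptors, the file names are hypothetical, and per-dimension z-normalization is one plausible reading of "normalized Euclidean distance."

```python
# Sketch: frame-level contrast between a target utterance and its
# time-aligned synthetic reference. Feature choices and file names are
# illustrative assumptions, not the project's exact configuration.
import numpy as np
import librosa

SR = 16000
HOP = 160  # 10 ms hop at 16 kHz

def frame_features(path):
    """Stack F0, RMS energy, and MFCCs into a (num_frames, dims) matrix."""
    y, _ = librosa.load(path, sr=SR)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=SR, hop_length=HOP)
    rms = librosa.feature.rms(y=y, hop_length=HOP)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13, hop_length=HOP)
    n = min(len(f0), len(rms), mfcc.shape[1])
    return np.column_stack([f0[:n], rms[:n], mfcc[:, :n].T])

def normalized_euclidean(target, reference):
    """Per-frame Euclidean distance after z-normalizing each feature
    dimension (an assumed reading of 'normalized Euclidean distance')."""
    n = min(len(target), len(reference))
    both = np.vstack([target[:n], reference[:n]])
    mu, sd = both.mean(axis=0), both.std(axis=0) + 1e-8
    zt = (target[:n] - mu) / sd
    zr = (reference[:n] - mu) / sd
    return np.linalg.norm(zt - zr, axis=1)  # one distance per frame

# Hypothetical file names for a target sentence and its aligned synthesis.
dist = normalized_euclidean(frame_features("target.wav"),
                            frame_features("synthetic_aligned.wav"))
print("mean frame distance:", dist.mean())
```

Frames with large distances would mark the emotionally prominent regions that the project set out to localize; low distances indicate segments close to the neutral reference.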
The histograms show the distributions of valence and activation for the synthetic sentences. The graph illustrates that the arousal values of the synthesized sentences are lower than those of the target sentences; evaluators perceived the reference sentences as having low arousal. The perceived valence of the synthetic sentences is skewed toward positive values. The underlying lexical content, which is preserved from the target sentences, may contribute to the perception of valence. This result agrees with previous findings showing that lexical and contextual information are important in the perception of valence.

Classification experiments demonstrated the emotional discriminative power of the features estimated by contrasting the original signal with the template references. We classified low versus high levels of activation and valence. As expected, the performance is better for activation. When the classifiers are trained with the proposed features in addition to conventional acoustic features, the classification results improve (a sketch of this comparison setup appears below), indicating that contrasting the features is useful for identifying emotional speech.

Other outcomes include a parallel corpus for the SEMAINE corpus with synthetic sentences that are temporally aligned with the target sentences, a PhD thesis, and a Master's thesis. We are disseminating these findings in international conferences and journals (over ten publications). The project involved the areas of emotion recognition, machine learning, and signal processing, providing attractive opportunities for training graduate students. Two research assistants and one undergraduate student have been involved in different aspects of the project. The proposed models and algorithms will provide numerous insights for exploring and extending theories of linguistic and paralinguistic human behavior. With the base infrastructure for the proposed research established, several new scientific avenues will emerge, leading to innovative advances in applications in security and defense, the next generation of advanced user interfaces, health informatics, and education.
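The report does not name the classifier or feature dimensionalities, so the following Python sketch only shows the shape of the comparison: training one model on conventional acoustic features alone and another on conventional plus contrast features. The feature matrices and labels are random placeholders, and the linear SVM is an assumption for illustration.

```python
# Sketch: low- vs. high-activation classification with and without the
# contrast features. All data here are hypothetical placeholders; with
# real features, the second configuration is the one the report found
# to improve performance.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
conventional = rng.normal(size=(n, 30))  # e.g., utterance-level statistics
contrast = rng.normal(size=(n, 10))      # e.g., distances to the synthetic reference
labels = rng.integers(0, 2, size=n)      # 0 = low activation, 1 = high activation

for name, X in [("conventional", conventional),
                ("conventional + contrast", np.hstack([conventional, contrast]))]:
    clf = make_pipeline(StandardScaler(), LinearSVC())
    acc = cross_val_score(clf, X, labels, cv=5).mean()
    print(f"{name}: {acc:.2f} accuracy")  # ~chance here, since the data are random
```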

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1329659
Program Officer: Tatiana Korelsky
Budget Start: 2013-03-15
Budget End: 2014-08-31
Fiscal Year: 2013
Total Cost: $59,338
Name: University of Texas at Dallas
City: Richardson
State: TX
Country: United States
Zip Code: 75080