This award develops the MSP-Podcast corpus, intended to be the largest publicly available database of naturalistic emotional speech. Affective computing is an important research area aiming to understand, analyze, recognize, and synthesize human emotions. Providing emotion capabilities to current interfaces can facilitate transformative applications in areas related to human-computer interaction, healthcare, security and defense, education, and entertainment. Speech provides an accessible modality for current interfaces, carrying important information beyond the verbal message. However, automatic emotion recognition from speech in realistic domains is a challenging task given the subtle expressive behaviors that occur during human interactions. Current emotional speech databases are limited in size and number of speakers, and suffer from inadequate or inconsistent emotional descriptors, a lack of naturalistic behaviors, and unbalanced emotional content. This CISE community research infrastructure addresses these key barriers, opening new opportunities to explore novel and powerful machine learning systems. The size, naturalness, and speaker and recording variety of the MSP-Podcast corpus allow the research community to create complex but powerful models with millions of parameters that generalize across environments. The MSP-Podcast corpus will also play a key role in other speech processing and human language understanding tasks. For the first time, the community will have the infrastructure to build automatic speech recognition and speaker verification solutions that are robust to variations caused by emotional content. These improvements will facilitate the transition of emotionally aware algorithms into practical applications with clear societal benefits.

The proposed infrastructure relies on a novel approach based on cross-corpus emotion classification along with crowdsourced annotations to build a large, naturalistic emotional database with balanced emotional content at reduced cost and with reduced manual labor. It relies on existing naturalistic recordings available on audio-sharing websites. The first task consists of selecting audio recordings that convey balanced and rich emotional content. The selected recordings contain natural conversations between many different people on various topics, both positive and negative. The second task is to segment the audio recordings into clean, single-speaker segments, removing silence, background music, noisy segments, and overlapped speech. This process is automated with algorithms for voice activity detection, speaker diarization, background music detection, and noise level estimation. The third task is to identify segments conveying balanced and rich emotional content. This task relies on machine learning models trained on existing corpora to retrieve samples with target emotional behaviors (e.g., detectors of "happy" sentences). This step is important since most turns are emotionally neutral, so randomly selecting turns would lead to a corpus with unbalanced emotional content. The community also plays an important role in selecting the target sentences to be emotionally annotated, with novel grand challenges and outreach activities to support the collection of similar corpora in different languages. The final task is to annotate the emotional content of the retrieved segments, relying on perceptual evaluations conducted on a crowdsourcing platform with a novel protocol that tracks worker performance in real time. This scalable approach provides control over the emotional content, increases speaker diversity, and maintains the spontaneous nature of the recordings.
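
As a minimal sketch of how the segmentation step (the second task) could begin, the snippet below gates audio through an off-the-shelf voice activity detector before diarization and noise screening. It assumes 16 kHz, 16-bit mono PCM audio and uses the WebRTC detector (the `webrtcvad` Python package) purely as an illustrative stand-in; the function name and frame size are assumptions, not the project's actual pipeline:

```python
import webrtcvad

def speech_frames(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=3):
    """Yield (offset_in_seconds, is_speech) for fixed-size audio frames so
    that silence can be trimmed before diarization and noise screening.
    Assumes 16-bit mono PCM at a rate supported by the WebRTC VAD."""
    vad = webrtcvad.Vad(aggressiveness)  # 0 (permissive) to 3 (aggressive)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for start in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        frame = pcm_bytes[start:start + frame_bytes]
        yield start / (2 * sample_rate), vad.is_speech(frame, sample_rate)
```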
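
The retrieval step (the third task) can likewise be approximated by training a binary detector on an existing labeled corpus and keeping only high-confidence candidates, so that the annotation budget targets emotional rather than neutral speech. This is a hedged sketch: the pre-computed per-segment features, the training corpus, and the 0.8 threshold are assumptions, not the award's actual models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_emotion_detector(features, labels):
    """Fit a binary detector (e.g., 'happy' vs. everything else) on an
    existing emotional corpus with per-segment acoustic features."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(features, labels)
    return model

def retrieve_candidates(model, segment_features, segment_ids, threshold=0.8):
    """Score unlabeled podcast segments and keep those the detector is
    confident about, countering the neutral-speech majority."""
    probs = model.predict_proba(segment_features)[:, 1]
    keep = probs >= threshold
    return list(zip(np.asarray(segment_ids)[keep], probs[keep]))
```

Thresholding trades recall for precision; in practice the cutoff would be tuned per emotion so that the final pool of segments sent for annotation stays balanced across target classes.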

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 2016719
Program Officer: Tatiana Korelsky
Budget Start: 2020-09-01
Budget End: 2023-08-31
Fiscal Year: 2020
Total Cost: $1,075,386
Name: University of Texas at Dallas
City: Richardson
State: TX
Country: United States
Zip Code: 75080