Despite a sizeable literature on emotional speech and speech under stress, little is understood about how features in continuous speech vary with subtle, real-world-relevant changes in physiological state within any particular speaker. This EArly-concept Grant for Exploratory Research (EAGER) relates speech features to direct measures of physiological activation, rather than to categorical hand-annotated labels of emotion or state. The study collects and analyzes a corpus of speech and autonomic nervous system (ANS) sensor data to discover what changes occur in speech features when a person is exposed to different activation-relevant emotional, cognitive, and stress-related conditions. The broader significance and impact is the discovery of cues in speech that can be used to estimate changes in a speaker's physiological activation level when no sensors are available. Applications include health care (monitoring physical, mental, and cognitive states), education and learning (monitoring engagement), social interaction (monitoring activation level), and law enforcement/intelligence (monitoring behavioral changes of high-interest individuals).
In Phase 1 (Corpus Collection), the project creates a 40-subject corpus of time-aligned speech and physiological signals. Activation is measured using state-of-the-art methods to extract cardiovascular (ECG), blood pressure, respiration rate, and skin conductance signals. Each subject participates in five conditions: (1) neutral baseline; (2) emotional (description of emotionally salient pictures); (3) stressed (a speaking task incentivized for accuracy and completion time); (4) cognitive load (a speaking task with a visual distractor, incentivized for task completion and distractor-task accuracy); and (5) computer-directed speech (a task requiring perfect recognition by a speech recognizer). In Phase 2 (Analysis), sensor output is post-processed to calibrate the signals and detect changes in activation. These changes are then compared to a range of automatically extracted features (based on acoustics, prosody, discourse patterns, and disfluency patterns) from the time-aligned speech. Analyses and machine learning experiments then examine which speech-feature changes correlate with changes in sensor output, both within and across speakers. Results shed light on how information from natural continuous speech can be used to estimate changes in a speaker's physiological activation level in ongoing, subtle, and everyday contexts.
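To make the Phase 2 analysis concrete, the sketch below illustrates one plausible form of the within-speaker correlation step: per-window speech features (the feature names, window size, and synthetic values here are illustrative assumptions, not the project's actual pipeline) are aligned with a per-window physiological signal, and window-to-window changes in each feature are tested for correlation with changes in the sensor output.

```python
# Minimal sketch (assumed pipeline, not the project's) of correlating
# window-level speech-feature changes with window-level sensor changes.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Assume both streams have already been time-aligned and summarized over
# the same 10-second analysis windows (one value per window per signal).
n_windows = 120
skin_conductance = rng.normal(5.0, 1.0, n_windows)  # microsiemens (simulated)

# Simulated speech features; f0_mean is given an artificial dependence on
# skin conductance so the example produces a detectable correlation.
f0_mean = 180 + 4.0 * skin_conductance + rng.normal(0, 3, n_windows)  # Hz
speech_rate = rng.normal(4.5, 0.5, n_windows)  # syllables/sec, independent

speech_features = {"f0_mean_hz": f0_mean, "speech_rate_sps": speech_rate}

# First-difference both streams so slow per-session drift (sensor baseline
# wander, vocal fatigue) does not masquerade as a correlation; then test
# whether window-to-window feature changes track sensor changes.
d_sc = np.diff(skin_conductance)
for name, values in speech_features.items():
    r, p = pearsonr(np.diff(values), d_sc)
    print(f"{name}: r={r:+.3f}, p={p:.4f}")
```

In a design along these lines, running the loop per speaker addresses the within-speaker question, while pooling standardized differences across all 40 subjects addresses the across-speaker question; the differencing step is one simple way to separate short-term activation changes from slower session-level trends.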