This dissertation research seeks to better understand how listening is informed by the act of moving our vocal apparatus, as well as the acoustic and perceptual implications of crosslinguistic articulatory differences. A number of factors, such as articulatory strength, trading relations, and acoustic energy in contrasts, are known to contribute to our ability to make use of the acoustic signal. The fundamental theoretical question addressed here is whether articulation per se has direct implications for abstract gestures as distinctive features. The project builds on earlier work investigating the role of energy characteristics. Acoustic data will be collected from native speakers of typologically distinct languages such as English, Dutch, German and French.

Tongue position and air pressure data for plosive sounds at the beginning and end of a syllable will be synchronized with the acoustic recordings. A comprehensive data set for these languages will be used, first, to establish percepts and the boundary conditions within which a perception study will be conducted and, second, to increase our understanding of the articulation-acoustic relation across the speech chain. The relative spectral energy levels in lower harmonics will be measured and evaluated, along with how they change as a speaker produces a vowel before or after one of these consonants. These spectral energy change characteristics have been found to correlate more strongly with voicing contrasts than traditional measures such as voice onset time, vowel length, and frequency changes, across contexts and across language types. Information gained from the acoustic and articulatory data collection will be used to generate tokens to be tested in a subsequent set of perception experiments. The work has implications for automatic speech recognition technology.

Project Report

This project, which models the perception of a speaker's laryngeal states for stops (the final sounds in bat and bad), aims to improve our understanding of how humans signal subtle speech differences. Speech is messy and often occurs in degraded acoustic environments. Nevertheless, humans are able to distinguish between sounds by balancing the subtleties among a family of perceptual indicators of those sounds. Plosives (or stops), such as those found at the end of the words bat and bad, are differentiated by acoustic landmarks on both the preceding vowel (e.g., length of vowel) and the consonant (e.g., length of consonant). While temporal, spectral and overall energy measures for this differentiation are known, changes in the energy of different frequencies in the spectrum have not been extensively examined.

Within a larger research program examining these changes within particular bands of spectral energy, this project sought confirmation through direct measurement of intra-oral pressure changes. Pressure data inside the mouth during speech was collected to allow for accurate marking of time landmarks (e.g., the onset of tongue motion, points of contact between the tongue and hard palate). Comparison of the intra-oral pressure data with the synchronous acoustic signal reveals a novel method of identifying temporal landmarks of the plosive gesture for voiced (/d/) and voiceless (/t/) stops. Pressure data was collected from 17 German-speaking and four English-speaking participants. Complementary perception data was collected to verify the ability of speakers to categorize the difference between voiced and voiceless plosives. Results of the collection and the subsequent pressure and acoustic analysis show that a derivative of energy (i.e., the degree to which energy changes from period to period) is useful for signaling the difference between a /t/ and a /d/. Specifically, a derivative of the energy in a band from 350 Hz to 560 Hz (for women) best captured differences between /t/ and /d/.
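
As an illustration of the energy-derivative measure described above, the following is a minimal sketch and not the project's actual analysis code. It assumes fixed-length analysis frames in place of glottal periods, a direct DFT over the bins that fall inside the band, and a simple first difference as the derivative; the function names and the frame length are hypothetical choices made only for this example.

```python
import math

def band_energy(frame, sample_rate, f_lo=350.0, f_hi=560.0):
    """Energy of one frame within [f_lo, f_hi] Hz.

    Uses a direct DFT restricted to the bins inside the band.
    The 350-560 Hz defaults follow the band reported in the text.
    """
    n = len(frame)
    k_lo = math.ceil(f_lo * n / sample_rate)   # first bin at or above f_lo
    k_hi = math.floor(f_hi * n / sample_rate)  # last bin at or below f_hi
    energy = 0.0
    for k in range(k_lo, k_hi + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        energy += (re * re + im * im) / n
    return energy

def band_energy_derivative(signal, sample_rate, frame_len, f_lo=350.0, f_hi=560.0):
    """Per-frame band energies and their frame-to-frame differences.

    Assumption: non-overlapping fixed-length frames stand in for the
    period-to-period analysis mentioned in the text.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    energies = [band_energy(f, sample_rate, f_lo, f_hi) for f in frames]
    deltas = [b - a for a, b in zip(energies, energies[1:])]
    return energies, deltas
```

Applied to a vowel that fades into a stop closure, the differences returned by `band_energy_derivative` would be negative while energy in the band is falling; how steeply they fall is the kind of cue the report describes as separating /t/ from /d/.
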
Outcomes and Broader Impacts: The finding that an energy derivative in this middle frequency band is useful was then applied successfully to archival recordings of the Dictionary of American Regional English. A comparison of stop-vowel and vowel-stop boundaries marked by humans (gold standard) and by existing technologies (Penn Forced Aligner) revealed that an algorithm based on the energy derivative reduced the error distance between the gold standard and existing technologies. This finding has implications not only for improving those technologies but also for speech recognition routines in general.
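
One way to make the notion of error distance concrete is the mean absolute difference between human-marked boundary times and automatically placed ones. The sketch below assumes that definition, which the report does not specify, and the function name is hypothetical.

```python
def mean_boundary_error(gold, predicted):
    """Mean absolute distance (in seconds) between paired boundary times.

    Assumption: boundaries are already matched one-to-one, so gold[i]
    and predicted[i] mark the same stop-vowel or vowel-stop boundary.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("boundary lists must be non-empty and of equal length")
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)
```

Calling the function once with a forced aligner's boundaries and once with boundaries adjusted by an energy-derivative algorithm, both against the same human annotations, would give two numbers that can be compared directly; a smaller value means closer agreement with the gold standard.
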

Project Start:
Project End:
Budget Start: 2011-09-01
Budget End: 2013-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $11,600
Indirect Cost:

Name: University of Wisconsin Madison
Department:
Type:
DUNS #:
City: Madison
State: WI
Country: United States
Zip Code: 53715