Distinctions of prosody (rhythm, stress, and intonation) are ubiquitous in spoken language. It often seems obvious to native speakers of English what prosody is most appropriate in a given sentence and context, and researchers in linguistics and related fields have proposed numerous formalized hypotheses about it. But establishing the validity of these hypotheses has proved remarkably hard. Much of the problem is that it is difficult to observe enough examples of a given phenomenon to evaluate hypotheses. The project aims to address this dearth of data by collecting, or "harvesting", examples of specific word sequences or word patterns from web sources. It is often possible to find hundreds or thousands of examples of people using the very same word pattern. If these examples are collected into a dataset and made available to the research community, it will be possible to evaluate theories about the form and meaning of prosody on an unprecedented scale. Scaling up the available data can be expected to have a transformative effect on our understanding of prosody.

Audio and audio-video recordings of spoken language, including podcasts, radio and television broadcasts, lectures, and much else, are pervasive on the web. This abundance does not help in itself, because it is not possible to listen to tens of thousands of hours of speech in order to find a few hundred examples of a certain type. Fortunately, a growing number of sites provide text transcriptions obtained with automatic speech recognition (for instance Fox Business News, WNYC, Elections Video Search at Google, and university lectures at MIT). Industry blogs and newsletters indicate that more large sites will come online soon. By searching for a word pattern in the text transcription and then retrieving the corresponding audio or video file, it becomes possible to find relevant data.
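The search step described above can be sketched in a few lines of Python. This is only an illustration, assuming that a transcript is available as a list of words with start and end times; the `TimedWord` type, the `find_pattern` function, the `padding` parameter, and the toy transcript are all hypothetical and do not reflect the format of any particular site.

```python
from typing import List, NamedTuple, Tuple


class TimedWord(NamedTuple):
    """One recognized word with its time span in seconds (assumed format)."""
    word: str
    start: float
    end: float


def find_pattern(transcript: List[TimedWord], pattern: str,
                 padding: float = 0.5) -> List[Tuple[float, float]]:
    """Return (start, end) time spans where the word pattern occurs.

    Matching ignores case and surrounding punctuation. Each span is
    widened by `padding` seconds on both sides so that the clip keeps
    some context around the target words.
    """
    target = pattern.lower().split()
    if not target:
        return []
    words = [w.word.lower().strip(".,!?;:\"'") for w in transcript]
    spans = []
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            start = max(0.0, transcript[i].start - padding)
            end = transcript[i + len(target) - 1].end + padding
            spans.append((start, end))
    return spans


# Invented toy transcript, for illustration only.
transcript = [
    TimedWord("well", 0.0, 0.3),
    TimedWord("in", 0.4, 0.5),
    TimedWord("my", 0.5, 0.7),
    TimedWord("opinion,", 0.7, 1.2),
    TimedWord("he", 1.4, 1.5),
    TimedWord("did", 1.5, 1.8),
]
spans = find_pattern(transcript, "in my opinion", padding=0.25)
```

Each returned span could then be used to cut the matching clip out of the retrieved audio or video file. Because automatic transcriptions contain errors, matches found this way are only candidates and still need to be checked by a listener.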

To construct datasets for prosody research from these web sources, the project team will implement software harvest engines that interact with the web through standard protocols. Datasets for eight to twelve specific phenomena will be collected. In order to demonstrate the impact of a data-intensive methodology, the samples will be analyzed using techniques of statistics and formal linguistics. For instance, an approach known as machine learning classification will be used to identify the specific features of the sound signal (such as pitch, vowel duration, and intensity) that are responsible for the perception of prosody.
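As a rough illustration of the classification idea, the sketch below trains a nearest-centroid classifier on three acoustic measurements and ranks the measurements by how far apart the two class centroids lie on each one. It is a stand-in chosen for brevity, not the project's actual method; the feature names, the class labels, and all numeric values are invented for the example.

```python
from statistics import mean, pstdev

FEATURES = ("pitch_hz", "vowel_duration_ms", "intensity_db")

# Invented toy measurements: (pitch, vowel duration, intensity), label.
samples = [
    ((220.0, 140.0, 72.0), "contrastive"),
    ((210.0, 150.0, 71.0), "contrastive"),
    ((230.0, 145.0, 73.0), "contrastive"),
    ((180.0, 95.0, 70.0), "neutral"),
    ((175.0, 100.0, 72.0), "neutral"),
    ((185.0, 90.0, 71.0), "neutral"),
]
rows = [features for features, _ in samples]
labels = [label for _, label in samples]

# Per-feature mean and standard deviation, so that features measured in
# different units (Hz, ms, dB) become comparable after scaling.
columns = list(zip(*rows))
means = [mean(col) for col in columns]
stdevs = [pstdev(col) or 1.0 for col in columns]


def scale(row):
    """Convert one row of raw measurements to z-scores."""
    return [(x - m) / s for x, m, s in zip(row, means, stdevs)]


def train_centroids(rows, labels):
    """Average the scaled rows of each class into one centroid per class."""
    centroids = {}
    for label in set(labels):
        members = [scale(r) for r, l in zip(rows, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def classify(row, centroids):
    """Assign the class whose centroid is closest in scaled feature space."""
    z = scale(row)
    return min(centroids, key=lambda label: sum(
        (a - b) ** 2 for a, b in zip(z, centroids[label])))


centroids = train_centroids(rows, labels)

# Rank features by the distance between the two class centroids.
first, second = sorted(centroids)
separation = {name: abs(a - b) for name, a, b in
              zip(FEATURES, centroids[first], centroids[second])}
ranked = sorted(separation, key=separation.get, reverse=True)

prediction = classify((225.0, 150.0, 72.0), centroids)
```

With real data, the ranking would be computed on held-out tokens and compared against human judgments; the toy values here only show the shape of the computation.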

Prosody and intonation play an important role in making discourse coherent, in signaling which parts of the communicated information are foregrounded or backgrounded, and in disambiguating speaker intention. Any advance in understanding prosody will not only deepen our understanding of the human language capability but also has implications in a wide range of areas, including language instruction, translation studies, speech therapy, the comprehensibility of synthesized speech, and speech recognition systems.

Project Report

Distinctions of prosody (rhythm, stress, and intonation) are ubiquitous in spoken language. It often seems obvious to native speakers of English what prosody is most appropriate in a given sentence and context, and researchers in linguistics and related fields have proposed numerous formalized hypotheses about it. But establishing the validity of these hypotheses has proved remarkably hard. Much of the problem is that it is difficult to observe enough examples of a given phenomenon to evaluate hypotheses. The research project "Harvesting Speech Datasets for Linguistic Research on the Web" addressed this dearth of data by collecting, or "harvesting", examples of specific word sequences or word patterns from web sources. It is often possible to find hundreds or thousands of examples of people using the very same word pattern. Collecting such datasets allows theories of the form and meaning of prosody to be evaluated on an unprecedented scale. In the project, software components for collecting and analyzing web-sourced prosodic data were created, and specific datasets were gathered and analyzed.

The project was a joint effort involving researchers at Cornell University and McGill University, and was funded in the first round of the international Digging into Data Challenge. Together with Mats Rooth, the research lead at Cornell, and Michael Wagner, who led the effort at McGill, graduate and undergraduate students at Cornell and McGill were key players in the research effort. Jonathan Howell, who was involved first as a graduate student at Cornell and then as a postdoc at McGill, wrote his doctoral dissertation "Meaning and Intonation: On the Web, in the Lab, and from the Theorist's Armchair" as part of the project. David Lutz, also a graduate student at Cornell, developed Ezra, a web platform for annotating web-sourced speech data. Lauren Garfinkle, an undergraduate in Linguistics at McGill, used the interface to annotate and transcribe numerous datasets. Parry Cadwallader, an undergraduate in Computer Science at Cornell, was also involved in software development.

Target utterances were short word sequences such as "in my opinion", "some people", "South Korea", or "than I did" that show contrastive prosody or variation in prosody. Hundreds of candidate tokens of each target were retrieved by interfacing with sites that index content using automatic speech recognition. Since speech recognition is not completely accurate, the candidates were filtered to discard incorrect matches, and the remaining tokens were transcribed using the Ezra interface to obtain a correct representation of the utterance context. The resulting datasets were then processed to obtain measurements of parameters such as vowel length, intensity, and pitch that are involved in signaling prosody. Howell showed in his dissertation research that a classification algorithm with access to these measurements can "hear" distinctions in prosody about as reliably as human listeners.

The Ezra software is distributed on an open source basis at github.com/del82/ezra. Datasets developed in the project can be browsed at compling.cis.cornell.edu/digging/.

Budget Start: 2010-08-01
Budget End: 2013-07-31
Fiscal Year: 2010
Total Cost: $100,000
Name: Cornell University
City: Ithaca
State: NY
Country: United States
Zip Code: 14850