Technologies for storing and processing vast amounts of text are mature and well-defined. In contrast, technologies for browsing or mining content from large collections of non-textual material, especially audio and video, are less well developed. Large-scale data mining on text has helped transform the relevant disciplines; the disciplines dealing with spoken language will reap similar benefits from accessible, searchable, large corpora.

This project explores the difficult problem of providing rich, intelligent data mining capabilities for a substantial collection of spoken audio data in American and British English. It applies and extends state-of-the-art techniques to offer sophisticated, rapid and flexible access to a richly annotated corpus of a year of speech (about 9,000 hours, 100 million words, or 2 terabytes), derived from the Linguistic Data Consortium, the British National Corpus, and other existing resources. This is ten times more data than has previously been used by researchers in fields such as phonetics, linguistics, and psychology, and 100 to 1,000 times the amounts that are used in common practice.

Speech-to-text alignment and search tools will open a new universe of data to researchers in many fields, from linguistics and phonetics to anthropology, speech communication, oral history, and media studies. Audio-video usage on the internet is large and growing at an extraordinary rate, offering ever larger amounts of an ever wider range of material. Reliable automatic annotation, indexing and search of this material will allow researchers to examine the distribution of both form and content across time, space, and social structure.

Project Report

The main aim of this project was to demonstrate the applicability of data-mining techniques to the scientific investigation of collections of thousands of hours of transcribed speech. A secondary goal was to complete the digitization and phonetic alignment of the spoken portion of the British National Corpus, in collaboration with Oxford University and the British Library.

Despite the advances we enjoy today in information retrieval, we still lack technologies that search spoken audio (and video) with the same ease as text. One solution to this problem is to provide time-aligned transcripts of audio recordings. Such transcripts include time-stamps for each part of the audio, from speaker turns and phrases to words, syllables, consonants, and vowels. Suitably configured systems allow a user to search the transcript and then use the time-stamps to retrieve the corresponding piece of the recording.

In addition to search and retrieval, such datasets become open to data-mining research in new ways. Scholars and scientists who are studying grammar and language change, who are looking for diagnostic or therapeutic measures in clinical contexts, or who are seeking to build better language technology, all benefit from access to these very large collections of automatically analyzed information about spoken-language performance.

The traditional way to create time-aligned phonetic transcripts is to use interactive tools that allow a human expert to annotate the audio corresponding to the associated text. This process is very labor-intensive if a detailed transcript is needed, taking more than 100 hours of effort for every hour of audio. To address this problem, we begin with a technique called "forced alignment", which uses speech-recognition technology to indicate exactly where in the audio each word from the transcript appears, at the same time providing detailed information about the timing of the various syllables and segments in the word, the exact vowel qualities used, and so on.

The main outcome from this project is the demonstration that our time alignment techniques produce accurate results for more than 10,000 hours of speech in English, Chinese and Spanish. To illustrate the utility of the results for speech data-mining, we used the resulting audio and time-aligned transcripts to study variation in the pronunciation of consonants, vowels, and pitch contours, as documented in several papers published in refereed conference proceedings, and one in a refereed journal.

In collaboration with colleagues at Oxford University and the British Library, we contributed to the alignment of the spoken portion of the "British National Corpus", which is now available to the public at www.phon.ox.ac.uk/AudioBNC. A sampler is also available at www.phon.ox.ac.uk/SpokenBNC. In a related project, we also worked with colleagues at oyez.org to word-align the complete set of Supreme Court Oral Arguments (available back to the 1950s), which is now available to the public on the oyez.org web site, and soon will be published for research use by LDC.
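
The search-and-retrieve use of time-aligned transcripts described in this report can be illustrated with a minimal sketch. It assumes a word-level alignment is already available as a list of (word, start, end) records; the `AlignedWord` type, the `find_phrase` helper, and the sample timings are hypothetical and are not taken from the project's own tools or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedWord:
    """One word of a transcript with its time-stamps (hypothetical record layout)."""
    word: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


def find_phrase(transcript, phrase):
    """Return (start, end) time spans for every occurrence of `phrase`.

    Matching is case-insensitive and works on whole words only.
    """
    target = [w.lower() for w in phrase.split()]
    n = len(target)
    spans = []
    if n == 0:
        return spans
    # Slide a window of n words over the transcript and compare word by word.
    for i in range(len(transcript) - n + 1):
        window = transcript[i:i + n]
        if [w.word.lower() for w in window] == target:
            spans.append((window[0].start, window[-1].end))
    return spans


# Invented sample alignment, for illustration only.
transcript = [
    AlignedWord("we", 0.00, 0.18),
    AlignedWord("hold", 0.18, 0.52),
    AlignedWord("these", 0.52, 0.80),
    AlignedWord("truths", 0.80, 1.35),
]

print(find_phrase(transcript, "these truths"))  # [(0.52, 1.35)]
```

Each returned span could then be passed to an audio player or file reader to extract the matching stretch of the recording. A collection of thousands of hours would more plausibly rely on an inverted index than on a linear scan like this one.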

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1048900
Program Officer: Tatiana Korelsky
Budget Start: 2010-08-15
Budget End: 2012-07-31
Fiscal Year: 2010
Total Cost: $99,899
Name: University of Pennsylvania
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104