The focus of this proposal is on the effective use of parser-derived and tagger-derived features within discriminative approaches to language modeling for automatic speech recognition. Discriminative language modeling approaches provide great flexibility in defining features, but the size of the potential parser-derived feature space demands efficient feature annotation and selection algorithms. The project has four specific aims. The first is to develop a set of efficient, general, and scalable syntactic feature selection algorithms for use with various kinds of annotation and several parameter estimation techniques. The second is to develop general tree and grammar transformation algorithms designed to preserve selected feature annotations while leading to faster parsing, or even to tagging approximations of parsing. The third is to evaluate a broad range of feature selection and grammar transformation approaches on a large vocabulary continuous speech recognition (LVCSR) task, namely Switchboard. The final aim is to design and package the algorithms so that they straightforwardly support future research into other applications, such as machine translation (MT), and into other languages, such as Chinese and Arabic. The algorithms developed as part of this project are expected to contribute to improvements in LVCSR accuracy and in applications that rely upon this technology. The algorithms are being packaged into a publicly available software library, enabling researchers working in many application areas (including LVCSR and MT) and on various languages to investigate best practices in syntactic language modeling for their specific tasks, without having to hand-select and evaluate feature sets.

Project Report

The focus of this project was on methods to improve statistical modeling for applications such as speech recognition, machine translation, and optical character recognition. Such applications produce word sequences -- sentences or fragments of sentences -- corresponding to their input speech, text, or images in a target language such as English. For example, a speech recognizer takes a spoken utterance and produces its best guess of the words that were spoken in that utterance. Statistical language models assign scores to these sentences that indicate whether they are "good" sentences or not, where "goodness" roughly means being an acceptable example of the language being modeled. In this manner, all else being equal, the application will output sequences of words that are a better a priori fit to the target language. Thus a speech recognition system would prefer "their dog" to "they're dog" even though the two are acoustically identical.

Language models typically look only at whether each word in a given sequence frequently co-occurs with its neighboring words in some large corpus of observed sequences in the language (a schematic sketch of such a collocation-based model is given after the accomplishments list below). However, language is a very productive phenomenon: any given sequence of words, valid or not, will contain subsequences never observed in the corpus. Information beyond word collocations can therefore be of utility, such as syntactic or morphological structure. Such information, however, unlike word collocations, is not explicit in the word sequence; rather, it is hidden structure that must be annotated by some kind of structural inference algorithm, e.g., syntactic parsing. This project investigated methods of annotating and exploiting syntactic information to improve the language models used in applications like speech recognition. Syntactic parsing is computationally expensive, particularly when a large set of word sequences must be collectively parsed in order to provide a language model score for each of them. For this reason, much of the work in this project explored new methods for fast syntactic parsing, and many important innovations were achieved in that area.

Over the five years of the project, we published nearly 20 papers in leading journals and academic conferences on the efficient annotation of syntactic structure and the use of that structure to support natural language processing applications. Among the accomplishments of the project, we:

- Established the utility of syntactic and morphological features in discriminative language models for English and for other languages such as Turkish. We found that features derived from part-of-speech tags can yield significant improvements in English speech recognition, while features derived from morphological annotations help in agglutinative languages such as Turkish (illustrated schematically in the second sketch below).

- Created new methods for using fast finite-state annotation algorithms to speed up slower context-free parsing algorithms, often making them more accurate as well as faster. Novel cascaded system architectures (e.g., pipeline iteration), system combination approaches, and specialized finite-state classifiers yielded, in aggregate, orders-of-magnitude speedups of context-free parsing. Of particular note were methods that guarantee improved worst-case complexity bounds on parsing via novel finite-state annotations.

- Created new methods for inducing stochastic context-free grammars with beneficial properties, such as being very compact and efficient to parse with, while still recovering syntactic structure with high accuracy. We found that methods from statistical hypothesis testing could be used to distinguish between syntactic configurations that were unobserved due to sparse data ("sampling zeros") and those that were unobserved due to legitimate syntactic constraints ("structural zeros"); such tests can be used to impose highly useful constraints on grammars, yielding compact but accurate grammars (illustrated schematically in the third sketch below).

- Applied these syntactic models to tasks in spoken language understanding and text processing, such as utterance segmentation and automatic discourse segmentation. We also investigated new language modeling applications, such as character-based language modeling for open-vocabulary typing systems for individuals with severe motor impairments who can access only a single switch to indicate yes or no.
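To make the collocation idea above concrete, the following is a minimal sketch, in Python, of a bigram language model with add-one smoothing. It is purely illustrative: the class and variable names are our own, and the models actually used in this line of work are considerably more sophisticated.

    from collections import Counter
    from math import log

    class BigramLM:
        """Toy bigram language model with add-one smoothing (illustrative only)."""

        def __init__(self, corpus):
            # corpus: an iterable of token lists, e.g. [["their", "dog", "barked"], ...]
            self.unigrams = Counter()
            self.bigrams = Counter()
            for sent in corpus:
                tokens = ["<s>"] + sent + ["</s>"]
                self.unigrams.update(tokens)
                self.bigrams.update(zip(tokens, tokens[1:]))
            self.vocab_size = len(self.unigrams)

        def score(self, sent):
            """Log-probability of a sentence under the smoothed bigram model."""
            tokens = ["<s>"] + sent + ["</s>"]
            logp = 0.0
            for prev, word in zip(tokens, tokens[1:]):
                # Add-one smoothing gives unseen bigrams a small nonzero probability,
                # but syntactic structure remains invisible to the model.
                num = self.bigrams[(prev, word)] + 1
                den = self.unigrams[prev] + self.vocab_size
                logp += log(num / den)
            return logp

A recognizer using such a model would prefer whichever transcription hypothesis scores higher, e.g. lm.score(["their", "dog", "barked"]) versus lm.score(["they're", "dog", "barked"]).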
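The discriminative language models in the first bullet can be pictured as global linear models: a hypothesis is mapped to a feature vector, and a learned weight vector scores it. The second sketch below shows hypothetical word-derived and tag-derived feature templates together with a structured-perceptron update; it illustrates the general approach under our own simplified assumptions, not the project's actual feature set or estimator.

    from collections import defaultdict

    def extract_features(words, tags):
        """Illustrative feature templates over a word sequence and its
        part-of-speech tags (the two lists are parallel)."""
        feats = defaultdict(float)
        for i in range(len(words)):
            feats[("w", words[i])] += 1.0                             # word unigram
            if i > 0:
                feats[("ww", words[i-1], words[i])] += 1.0            # word bigram
                feats[("tt", tags[i-1], tags[i])] += 1.0              # tag bigram
            if i > 1:
                feats[("ttt", tags[i-2], tags[i-1], tags[i])] += 1.0  # tag trigram
        return feats

    def score(weights, words, tags):
        """Linear model score: dot product of weights and features."""
        return sum(weights.get(f, 0.0) * v
                   for f, v in extract_features(words, tags).items())

    def perceptron_update(weights, gold, hyp, rate=1.0):
        """One structured-perceptron step: boost the features of the reference
        transcription and penalize those of the recognizer's incorrect guess.
        gold and hyp are (words, tags) pairs."""
        for f, v in extract_features(*gold).items():
            weights[f] = weights.get(f, 0.0) + rate * v
        for f, v in extract_features(*hyp).items():
            weights[f] = weights.get(f, 0.0) - rate * v

Morphological features for a language like Turkish would slot into the same template mechanism, with morpheme-level annotations in place of part-of-speech tags.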
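The sampling-versus-structural distinction in the third bullet can be framed as a hypothesis test. If the current grammar expects a configuration to occur E times in a corpus of the observed size, then under a Poisson assumption the probability of never seeing it is exp(-E); when that probability is vanishingly small, chance is an implausible explanation and the zero is likely structural. The third sketch below is a deliberately simplified caricature of such a test, with an illustrative threshold; the tests used in the project were more refined.

    from math import exp

    def classify_zero(expected_count, alpha=0.001):
        """Classify an unobserved syntactic configuration, given its expected
        count under the current grammar.  Under a Poisson model the chance of
        observing zero occurrences is exp(-expected_count)."""
        p_zero = exp(-expected_count)
        # If a zero count is this improbable by chance, treat the configuration
        # as structurally disallowed and prune it from the grammar.
        return "structural" if p_zero < alpha else "sampling"

    # A configuration expected ~12 times that never occurs is almost surely
    # disallowed: classify_zero(12.0) -> "structural" (exp(-12) is about 6e-6).
    # One expected ~1.5 times is plausibly just sparse data:
    # classify_zero(1.5) -> "sampling" (exp(-1.5) is about 0.22).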
Four graduate students received critical training in the discipline while working on this project. One finished her PhD and is now a post-doctoral fellow at the University of Maryland. Another is finishing his PhD this year, on topics central to this project. Three undergraduate summer interns also worked on the project, as part of NSF's Research Experiences for Undergraduates (REU) program, often in close collaboration with graduate student mentors. This project made a significant contribution to the educational goals of our research center, in addition to furthering the field's collective understanding of methods for efficiently annotating and using syntactic structure within natural language applications.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
0447214
Program Officer
Tatiana D. Korelsky
Budget Start
2005-04-01
Budget End
2011-03-31
Fiscal Year
2004
Total Cost
$527,550
Name
Oregon Health and Science University
City
Portland
State
OR
Country
United States
Zip Code
97239