The focus of this project is to develop finite-state syntactic processing models for natural language that use features encoding global structural constraints derived through multiple sequence alignment (MSA) techniques, to significantly improve accuracy without expensive context-free inference. MSAs are widely used in computational biology for building finite-state models that capture long-distance dependencies in sequences (e.g., in RNA secondary structure). Given a large set of functionally aligned sequences in MSA format, finite-state models can be constructed that allow for the efficient alignment of new sequences with the given MSA. In natural language processing (NLP), only very rarely have MSA techniques been used, and then to characterize phonetic or semantic similarity. This project is exploring the definition of a purely syntactic functional alignment between semantically unrelated strings from the same language, to define a structural MSA for constructing finite-state syntactic models. The project has two specific aims. The first aim is to develop natural language sequence processing algorithms and models that can: a) define sequence alignments with respect to syntactic function; b) build structural MSAs based on defined functional alignments; c) derive finite-state models to efficiently align new sequences with the built MSA; and d) extract features from an alignment with the MSA for improved sequence modeling. The second aim is to empirically validate this approach within a number of large-scale text processing applications in multiple domains and languages. The resulting algorithms are expected to provide improved finite-state natural language models that will contribute to the state-of-the-art in critical text processing applications.

Project Report

The focus of this project was to develop finite-state syntactic processing models for natural language that use features encoding global structural constraints derived through multiple sequence alignment (MSA) techniques, to significantly improve accuracy without expensive context-free inference. MSAs are widely used in computational biology for building finite-state models that capture long-distance dependencies in sequences (e.g., in RNA secondary structure). Given a large set of functionally aligned sequences in MSA format, finite-state models can be constructed that allow for the efficient alignment of new sequences with the given MSA. In natural language processing (NLP), only very rarely have MSA techniques been used, and then to characterize phonetic or semantic similarity. This project investigated methods for making use of new kinds of finite-state sequence models - such as multiple sequence alignments - for tasks that often rely on higher complexity models, such as syntactic processing. Syntactic parsing of natural language is an important step in many widely-used applications, such as information extraction and machine translation; but it can be too slow to support large scale use within such applications. In particular, the cubic complexity of syntactic parsing with context-free grammars makes linear complexity finite-state preprocessing and approximation an important component to scalable parsing systems, particularly those with very large grammars. For this reason, much of the syntactic processing work in this project was explored within a pre-processing scenario. In addition, the finite-state frameworks that we developed were applied to diverse problems in bioinformatics and natural language processing, including unsupervised morphological model induction, natural language generation, and weighted finite-state transducer representations of sequence models such as taggers. Over the three years of the project, we have published more than a dozen papers in leading journals and academic conferences on diverse topics related to the application of multiple sequence alignments in natural language processing and efficient inference with weighted finite-state transducers. Among the accomplishments of the project, we showed that: - Finite state tagging of syntactic constraints can lead to both worst-case and typical case complexity reductions, and very large efficiency gains relative to unconstrained context-free parsing. - Multiple sequence alignments can be used to build competitive systems for the learning of morphological structure from text. Further, supervised finite-state tagging approaches can be used to learn systems that simulate black-box or non-stochastic morphological models, leading to better generalization to new data and more efficient annotation. - Multiple sequence alignments can also be used for prenominal modifier ordering, an important task in natural language generation. Semi-supervised methods for learning such models on arbitrary amounts of text were shown to yield very high accuracy on this task. - Novel semirings -- multi-dimensional lexicographic semirings and new string semirings -- can be used to efficiently apply finite-state tagging models to the output of speech recognizers, yielding the Viterbi-best tag sequences for every path in the word lattice. In the course of the project, five graduate students received critical training in the discipline while working on this project. One visiting PhD student is finishing her PhD at the University of Aberdeen in Scotland, and will be starting a post-doctoral fellowship at Johns Hopkins University. Two others are in the final year of their PhD program, and are working on related topics in their PhD dissertation topics. A fourth has more than one year to go, but has begun making serious progress towards defining her topic. Three undergraduate students worked on this project as summer interns funded under NSF's Research Experience for Undergraduates program. One of these students became the lead author on a publication resulting from his summer work. This project contributed significantly to the educational goals of our research center, and improved our collective understanding of the role of finite-state models within larger natural language processing systems.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
0811745
Program Officer
Tatiana D. Korelsky
Project Start
Project End
Budget Start
2008-08-01
Budget End
2012-07-31
Support Year
Fiscal Year
2008
Total Cost
$437,750
Indirect Cost
Name
Oregon Health and Science University
Department
Type
DUNS #
City
Portland
State
OR
Country
United States
Zip Code
97239