The proposed research addresses the current stagnation in N-gram language modeling by developing a set of principled techniques for bringing higher-level structure to bear on the language-modeling problem. These techniques include trans-finite-state probabilistic grammar formalisms (probabilistic feature grammars and tree-insertion grammars) that are both practical and lexically sensitive, and similarity-based methods that directly address the data sparsity problem for large-N models. A novel technique of multi-pass parsing has potential utility in addressing some of the practical computational issues that arise in using such models. This promising combination of techniques, lexicalized grammar formalisms made practical by multi-pass parsing, together with similarity-based estimation to help overcome data sparsity, has the potential to improve language modeling beyond the asymptote it has reached, with applications in speech recognition, information retrieval, and many other language-processing tasks.
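To make the data-sparsity idea concrete, the following is a minimal illustrative sketch (not the project's actual method) of similarity-based smoothing for bigrams: an unseen bigram probability P(w2 | w1) is estimated by averaging bigram estimates over words distributionally similar to w1, weighted by that similarity. All function names and the toy corpus here are hypothetical.

```python
import math
from collections import Counter

def train_bigrams(corpus):
    """Count unigrams and bigrams from a list of tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
    return uni, bi

def mle(uni, bi, w1, w2):
    """Maximum-likelihood bigram estimate; zero for unseen bigrams."""
    return bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0

def context_vector(bi, w):
    """Distribution over right-neighbors of w, used to compare words."""
    vec = {b: c for (a, b), c in bi.items() if a == w}
    total = sum(vec.values())
    return {k: v / total for k, v in vec.items()} if total else {}

def cosine(u, v):
    num = sum(u[k] * v.get(k, 0.0) for k in u)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def similarity_smoothed(uni, bi, w1, w2):
    """Average P(w2 | w1') over all w1', weighted by similarity to w1."""
    base = context_vector(bi, w1)
    num = den = 0.0
    for w in uni:
        s = cosine(base, context_vector(bi, w))
        num += s * mle(uni, bi, w, w2)
        den += s
    return num / den if den else 0.0

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
uni, bi = train_bigrams(corpus)
# "dog ran" never occurs, so the MLE estimate is zero; but "dog" is
# distributionally similar to "cat", and "cat ran" was observed, so the
# similarity-smoothed estimate is nonzero.
print(mle(uni, bi, "dog", "ran"))                  # 0.0
print(similarity_smoothed(uni, bi, "dog", "ran"))
```

The key design point is that evidence about one word is transferred to distributionally similar words, which is what lets the estimator assign mass to unseen N-grams without backing off all the way to unigram frequencies.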

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 9712068
Program Officer: William Bainbridge
Budget Start: 1997-09-01
Budget End: 2002-06-30
Fiscal Year: 1997
Total Cost: $450,000
Name: Harvard University
City: Cambridge
State: MA
Country: United States
Zip Code: 02138