A major obstacle to building robust systems that can read, summarize, and extract information from text is the need for large amounts of linguistic knowledge to handle the myriad syntactic, semantic, and pragmatic ambiguities that pervade virtually all aspects of text analysis. The objective of this research is to address this knowledge engineering bottleneck for natural language processing (NLP) systems. The work extends a general knowledge acquisition framework that allows an NLP system to bootstrap its own knowledge bases directly from text using standard inductive machine learning techniques in conjunction with an annotated corpus and robust sentence analysis. In particular, the framework is being extended to handle additional problems in lexical and structural ambiguity resolution and is being evaluated using Penn Treebank data within the context of a larger NLP task. The work is of both theoretical and practical significance. First, the research will begin to determine the conditions under which machine learning techniques can be expected to offer a cost-effective approach to knowledge acquisition for NLP systems, especially in comparison to existing statistical techniques. Second, the work will expand the current system into an integrated tool that uses machine learning techniques to guide NLP system development.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 9624639
Program Officer: Ephraim P. Glinert
Budget Start: 1996-04-01
Budget End: 2000-03-31
Fiscal Year: 1996
Total Cost: $212,500
Name: Cornell University
City: Ithaca
State: NY
Country: United States
Zip Code: 14850