There is today a broad consensus among theoretical linguists (of all frameworks) and researchers in Natural Language Processing (NLP) about which syntactic phenomena occur in natural languages. However, analyses of these phenomena have been implemented in many different frameworks, and there is disagreement about specific analyses even within a single framework. As a result, linguistic resources such as annotated corpora or grammars cannot easily be reused across frameworks. This project will investigate the common categorization of syntax that underlies work in linguistics and NLP. This underlying categorization is called a "metagrammar". Given a metagrammar, a tool can be produced to automatically generate grammars in different frameworks.

This research comprises three main activities. The first involves comparative work on several languages (including English) that will lead to coordinated metagrammars for these languages. These framework-independent specifications will catalog syntactic properties and detail their possible interactions; categories shared between languages will lead to shared portions of the metagrammar. The second concerns the development of specific grammar statements that relate metagrammatical categories to constructs in particular frameworks and for particular languages. It is these statements that, in their interaction, determine word order. The third involves annotating the Penn Treebank (PTB) corpus with the syntactic properties from the metagrammar, thus making the information implicitly encoded in the phrase structure of the PTB explicit and usable by other frameworks.

This project will enable the NLP and linguistics communities to better share insights on syntactic phenomena. Additionally, the work will enable the development of new NLP tools that are less dependent on a particular representation. It will enable linguists to rapidly develop grammars and test suites for different frameworks and languages, thus allowing for both cross- and inter-framework evaluation of linguistic grammars. Upon completion of the project, the PTB re-annotated with the high-level categories of the metagrammar will be made available to the research community.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0414409
Program Officer: Tatiana D. Korelsky
Project Start:
Project End:
Budget Start: 2004-09-01
Budget End: 2008-08-31
Support Year:
Fiscal Year: 2004
Total Cost: $506,000
Indirect Cost:
Name: University of Pennsylvania
Department:
Type:
DUNS #:
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104