This project uses statistical models and human judgment to determine dynamic, probabilistic representations of extensible usages of words; these representations are suitable for incorporation into VerbNet, a lexical resource widely used in the Natural Language Processing (NLP) community. Existing lexical resources reflect a binary notion of usages as grammatical or not. However, in actual language use, forms vary in acceptability; moreover, the process of coercion extends words beyond their standard usages. For example, a strictly intransitive action verb such as 'sneeze' may be used as in 'She sneezed the foam off the cappuccino', expressing manner of motion. This research takes a two-pronged approach, involving extensive use of machine learning and a fundamental shift in the development and use of VerbNet. Specifically, the research develops probabilistic methods for: (1) analyzing usages of verbs in large corpora and incorporating the resulting probabilistic information into VerbNet classes; and (2) representing information about the likelihood of potential constructional coercions and the productivity of such extensions. These developments use the Hierarchical Bayesian Model of Parisien and Stevenson, which is an ideal framework for marrying probabilistic reasoning about complex, real-world data with the hierarchically organized VerbNet lexicon. Beyond the statistical models, the representations are informed by human judgments about the use of such constructions. Thus, this research enriches the current symbolic verb representations in VerbNet with probabilistic distributional information, which becomes salient through the influence of construction grammar.

Encoding verb knowledge probabilistically provides the necessary flexibility to represent extensional constructions and support their appropriate interpretation by NLP systems. This is especially useful for interpretation in new domains and genres, leading to advances in NLP technologies, such as question answering and machine translation, thus improving information access. Additionally, insights into statistical properties of constructions gained through this research are valuable for psycholinguistic models of language acquisition and second language learning.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1116782
Program Officer: Tatiana Korelsky
Project Start:
Project End:
Budget Start: 2011-09-01
Budget End: 2015-08-31
Support Year:
Fiscal Year: 2011
Total Cost: $300,000
Indirect Cost:
Name: University of Colorado at Boulder
Department:
Type:
DUNS #:
City: Boulder
State: CO
Country: United States
Zip Code: 80303