CCRI: Planning: Planning for the Development of a Platform to Support Multilingual and Multi-Domain Coreference Annotation for Natural Language Processing Research

O'Connor, Brendan; Iyyer, Mohit

Abstract

In natural language processing, coreference resolution involves clustering together all words and phrases within a text that refer to the same entity. For example, in the sentence "Monsieur Poirot assured Hastings that he ought to have faith in him," the strings "Monsieur Poirot" and "him" refer to the same person, while "Hastings" and "he" refer to a different character. Resolving these references is challenging because it requires syntactic, semantic, and world knowledge. It is also important, since coreference is essential to understanding the meaning of text for question answering, translation, corpus insights, and many other applications. Unfortunately, current coreference models are held back by the lack of human-annotated training data from diverse domains and world languages, mainly because such data is expensive and time-consuming to collect at scale.
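To make the clustering idea concrete, the example sentence can be written out as a small data structure. This is only an illustrative sketch: the token-span representation and the helper names `mention_text` and `cluster_texts` are assumptions made here, not the format or API of the platform described in this award.

```python
# Illustrative sketch only: the (start, end) token-span format is an assumption,
# not the annotation format used by the project.
text = "Monsieur Poirot assured Hastings that he ought to have faith in him"
tokens = text.split()

# Each mention is a (start, end) span over token indices, with end exclusive.
# Each inner list is one coreference cluster: mentions of the same entity.
clusters = [
    [(0, 2), (11, 12)],  # "Monsieur Poirot", "him"
    [(3, 4), (5, 6)],    # "Hastings", "he"
]


def mention_text(span):
    """Return the surface string covered by a token span."""
    start, end = span
    return " ".join(tokens[start:end])


def cluster_texts(span_clusters):
    """Map every cluster of spans to the strings those spans cover."""
    return [[mention_text(span) for span in cluster] for cluster in span_clusters]


resolved = cluster_texts(clusters)
for entity_id, mentions in enumerate(resolved):
    print(entity_id, mentions)
```

A human annotator's job, in this framing, is to mark the spans and group them; a coreference model is trained to predict the same grouping from the raw text.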

This CCRI planning grant will take the first step toward breaking the coreference data bottleneck by creating two new resources for the community: (1) a software platform that facilitates cheap and accurate crowdsourced data collection for tasks that require labeling text spans within documents, and (2) a multi-domain crowdsourced coreference dataset collected using this platform. Unlike prior datasets, which focus primarily on newswire text, the dataset will contain data from a variety of domains (such as books and web forums), allowing researchers who work on non-standard domains to integrate coreference systems into their modeling pipelines. This planning grant will also support discussions and conference workshops about the platform and data resources; the resulting community feedback will be incorporated into a CCRI full proposal that aims to use the platform to create a much larger, multilingual coreference dataset and to explore non-coreference data labeling tasks such as question answering.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1925548
Program Officer: Tatiana Korelsky

Budget Start: 2019-09-01
Budget End: 2021-08-31
Fiscal Year: 2019
Total Cost: $99,998

Institution

University of Massachusetts Amherst, Hadley, MA, United States
