In natural language processing, coreference resolution is the task of clustering together all words and phrases in a text that refer to the same entity. For example, in the sentence "Monsieur Poirot assured Hastings that he ought to have faith in him," the strings "Monsieur Poirot" and "him" refer to the same person, while "Hastings" and "he" refer to a different character. Resolving these references is challenging because it requires syntactic, semantic, and world knowledge, and it is important because understanding coreference is essential for question answering, translation, corpus analysis, and many other applications. Unfortunately, current coreference models are held back by the lack of human-annotated training data from diverse domains and languages, mainly because collecting such data at scale is expensive and time-consuming.
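As a minimal illustration (not part of the proposed resources), the output of a coreference system for the example sentence can be represented as clusters of text spans, where each cluster groups the mentions that refer to one entity; the span offsets below are simply character indices into the example sentence.

```python
# Illustrative sketch: coreference clusters for the example sentence,
# represented as groups of (start, end) character-offset spans.

text = "Monsieur Poirot assured Hastings that he ought to have faith in him."

# Character offsets of each mention in `text`.
mentions = {
    "Monsieur Poirot": (0, 15),
    "Hastings": (24, 32),
    "he": (38, 40),
    "him": (64, 67),
}

# A coreference clustering groups mentions that refer to the same entity.
clusters = [
    [mentions["Monsieur Poirot"], mentions["him"]],  # entity: Poirot
    [mentions["Hastings"], mentions["he"]],          # entity: Hastings
]

for cluster in clusters:
    print([text[start:end] for start, end in cluster])
# ['Monsieur Poirot', 'him']
# ['Hastings', 'he']
```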
This CCRI planning grant will take the first step toward breaking the coreference data bottleneck by creating two new resources for the community: (1) a software platform that enables cheap and accurate crowdsourced data collection for tasks that require labeling text spans within documents, and (2) a multi-domain crowdsourced coreference dataset collected using this platform. Unlike prior datasets, which focus primarily on newswire text, this dataset will contain data from a variety of domains (such as books and web forums), allowing researchers who work on non-standard domains to integrate coreference systems into their modeling pipelines. This planning grant will also support discussions and conference workshops about the platform and data resources; the resulting community feedback will be incorporated into a CCRI full proposal that aims to use the platform to create a much larger, multilingual coreference dataset and to explore non-coreference data labeling tasks such as question answering.
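To make the span-labeling setup concrete, the sketch below shows what a single crowdsourced annotation record might look like; the field names and values are illustrative assumptions, not the actual schema of the proposed platform.

```python
# Hypothetical annotation record for a span-labeling task, as such a
# platform might collect it; every field name here is an assumption.

annotation = {
    "document_id": "web_forum_00123",   # hypothetical document identifier
    "domain": "web_forums",             # e.g. "books", "web_forums", "newswire"
    "annotator_id": "worker_042",       # crowd worker who produced the labels
    "task": "coreference",
    "clusters": [
        # Each cluster is a list of labeled spans: (start_token, end_token).
        [(0, 2), (14, 15)],
        [(3, 4), (7, 8)],
    ],
}
```

Representing annotations as labeled spans rather than task-specific structures is what would let the same platform support other span-based labeling tasks, such as marking answer spans for question answering.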
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.