EAGER: Annotating and extracting detailed syntactic information from a 1.1-billion-word corpus

Santorini, Beatrice; Kulick, Seth

Abstract

Over the past decade, very large text corpora of English have become available to researchers that turn out to be of considerable value for the language sciences. Even more recently, methods in natural language processing have advanced to a point where we can begin to imagine conducting linguistic research using automatically parsed and uncorrected corpora of the sort that has so far been conducted using human-corrected corpora. It is this new situation that the PIs wish to exploit by producing an automatically parsed billion-plus word corpus of early modern English based on the digitized Early English Books Online (EEBO) corpus that has recently been completed and made accessible to research. The aim is to create an automatically parsed database with a level of accuracy suitable for both linguistic and computational research, using the recently developed cutting-edge methods in natural language processing. The resulting resource will make possible investigations hitherto impossible; specifically, the information contained in a parsed version of EEBO will permit researchers to investigate frequency effects not just of words, but of larger grammatical units (phrases and clauses). In addition to their inherent linguistic interest, the results of such investigations may lead to the discovery of more sophisticated meaning-based properties and how these vary, which should be of value for research in natural language processing. The PIs have made progress on this goal, having created a first automatically parsed version of the EEBO corpus and begun to assess its accuracy. Some features like the syntax of clausal negation are already within our reach, but for many other structures, it remains to be determined how accurate retrieval with large-scale methods can be.

Since EEBO is more than 300 times larger than even the largest individual human-corrected corpora, it is expected that a more accurately parsed version of it than the one now available will begin to allow researchers to study phenomena that are only sporadically attested in existing English corpora, to zero in on the very beginnings and ends of historical changes, to investigate many different types of frequency effects (including the novel ones already mentioned) with an accuracy and reliability not hitherto possible, and to rigorously evaluate mathematical models of language change. Because the stage of English covered by EEBO (1500-1700) is already recognizably the modern language, a parsed version of EEBO can to some extent stand proxy for a corpus of Present-Day English for research in the language sciences. As a result, it should be useful as a training and testing ground for applications in computational linguistics including part-of-speech tagging, parsing, named entity recognition, and eventually lemmatization, sense disambiguation, and others. EEBO's great genre variety and variable orthography and its moderate distance from Present-Day English will also make a parsed version of it a natural candidate for assessing and improving the robustness of these applications and for developing novel parser evaluation metrics that can serve as linguistically informed benchmarks for computational linguistics.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Behavioral and Cognitive Sciences (BCS)
Type: Standard Grant (Standard)
Application #: 2026850
Program Officer: Joan Maling

Budget Start: 2020-08-15
Budget End: 2023-01-31
Fiscal Year: 2020
Total Cost: $298,386

Institution

University of Pennsylvania, Philadelphia, PA, United States