This project addresses a central problem in natural language processing: the quality of linguistic annotation in electronic natural language corpora. Its purpose is to explore a new method for automatically detecting and correcting errors in annotated corpora, to extend the method's applicability and improve its precision and recall for different types of linguistic annotation, and to investigate its relation to fraud and anomaly detection approaches developed outside of computational linguistics.
More specifically, the project examines and extends the variation n-gram method for detecting annotation errors in four ways: exploring its applicability to dependency annotation; increasing its recall by generalizing what counts as a comparable context for different types of annotation; adding an error correction stage; and evaluating how annotation errors and their correction affect the use of corpus annotation in human language technology. The project also explores the potential broader impact beyond language technology, which is significant because the error detection methodology is in principle applicable to any collection of data that encodes judgments or classifications of repeated data subunits.
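As a rough illustration of the kind of check the variation n-gram method performs, the sketch below flags a token (the "nucleus") that recurs in an identical word context but carries different tags. It assumes a part-of-speech-tagged corpus represented as lists of (word, tag) pairs and a fixed window of one word on each side; the function name `find_variation_nuclei`, the `context` parameter, the padding symbol, and the toy corpus are all hypothetical and are not taken from the project's implementation.

```python
from collections import defaultdict


def find_variation_nuclei(corpus, context=1):
    """Return word windows whose middle token was tagged in more than one way.

    corpus:  list of sentences, each a list of (word, tag) pairs (assumed format).
    context: number of words of identical context required on each side.
    """
    # window of words -> tag of the middle token -> list of (sentence, token) positions
    observed = defaultdict(lambda: defaultdict(list))
    for s_idx, sentence in enumerate(corpus):
        # Pad so that tokens near sentence boundaries also get a full window.
        words = ["<pad>"] * context + [w for w, _ in sentence] + ["<pad>"] * context
        for t_idx, (_, tag) in enumerate(sentence):
            window = tuple(words[t_idx : t_idx + 2 * context + 1])
            observed[window][tag].append((s_idx, t_idx))
    # Keep only windows in which the same token in the same context varies in its tag.
    return {w: dict(tags) for w, tags in observed.items() if len(tags) > 1}


# Toy corpus (invented for illustration): "can" is tagged inconsistently
# in the identical context "I can play".
toy_corpus = [
    [("I", "PRP"), ("can", "MD"), ("play", "VB")],
    [("I", "PRP"), ("can", "NN"), ("play", "VB")],
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
]

for window, tags in find_variation_nuclei(toy_corpus).items():
    print(window, tags)
```

A flagged window is only a candidate error, since genuine ambiguity can also produce variation. Widening or generalizing the context changes how many candidates are found and how reliable they are, which is the trade-off between precision and recall that the project sets out to study.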
The success of the data-driven approaches and stochastic modeling now widely used in computational linguistic research and applications is rooted in the availability of linguistically annotated electronic natural language corpora. However, despite the central role that annotated corpora play in training and testing human language technology, the question of how errors in corpus annotation can be detected and corrected has received little attention. This project is the first systematic attempt to remedy the situation by developing automatic methods to improve the quality of linguistic annotation. The implemented error detection and correction algorithms will be made freely available, and the theoretical results will be made accessible through publications at leading international computational linguistics conferences.