At a September 2009 NSF-sponsored meeting in New York City, the NLP community is discussing the standardization and harmonization of the content of manual and automatic linguistic annotation. The meeting is building on the results of the previous Computing Research Infrastructure (CRI) award "Towards a Comprehensive Linguistic Annotation of Language" by establishing standards that researchers and developers are likely to follow. These standards govern tokenization, part of speech, head selection, and other basic components of linguistic content that higher-level annotation schemas assume in common. Once standards are set, violations should be conscious (not accidental), and researchers should justify any violations. The meeting also aims to set up incentives, in the form of grants for small (e.g., student) projects, because several initial standard-compliant annotation projects could plant the seeds needed for the standards to take root.
Intellectual merit: Establishing a common base for linguistic annotation will: (1) make it easier to use, merge, and compare different types of annotation (from different transducers, different manual sets of annotation, etc.); (2) make a more rigorous set of annotation standards possible; and (3) facilitate the use of sophisticated natural-language-informed applications that can draw on annotation created by several different projects simultaneously.
Broader impact: This standardization process will bring about greater cooperation among annotation researchers and, as a result, greatly improve the efficiency of such research. This could significantly improve the state of the art of all linguistic processing and thus of all applications (automatic search, translation, etc.) that rely on the automatic linguistic analysis of text.