Most word-centered linguistic annotation of texts proceeds by identifying keywords and labeling the surrounding phrases according to the roles they play in the meaning structures evoked by those keywords. This procedure misses most idioms (took a turn for the worse) and irregular grammatical patterns (only then would she agree to it). The "Beyond the Core" project is exploring ways of augmenting such annotations with layered representations of the multiword units and "non-core" grammatical constructions present in such texts. To this end, researchers are using FrameNet annotation tools to find non-core structures in texts and to label their phrases in a way that shows how they satisfy the formal and semantic constraints dictated by the individual constructions. The "Constructicon", where this information is archived, links each construction with annotated sentences that exemplify it.
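As a rough illustration of the linkage just described, the sketch below models a construction entry that holds annotated example sentences whose phrases carry construction-specific labels. It is only a minimal sketch under stated assumptions: the class names, the construction name, and the element labels are all hypothetical, and none of them is taken from the actual Constructicon or from the FrameNet tools.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LabeledSpan:
    start: int  # character offset where the phrase begins (inclusive)
    end: int    # character offset where the phrase ends (exclusive)
    label: str  # hypothetical construction-element label


@dataclass
class AnnotatedSentence:
    text: str
    spans: List[LabeledSpan] = field(default_factory=list)

    def label_phrase(self, phrase: str, label: str) -> None:
        """Label the first occurrence of `phrase` in the sentence."""
        start = self.text.find(phrase)
        if start < 0:
            raise ValueError(f"phrase not found: {phrase!r}")
        self.spans.append(LabeledSpan(start, start + len(phrase), label))

    def phrases(self, label: str) -> List[str]:
        """Return the text of every span carrying `label`."""
        return [self.text[s.start:s.end] for s in self.spans if s.label == label]


@dataclass
class Construction:
    name: str
    element_labels: List[str]  # labels this construction licenses
    examples: List[AnnotatedSentence] = field(default_factory=list)

    def add_example(self, sentence: AnnotatedSentence) -> None:
        """Attach an example, rejecting labels the construction does not define."""
        unknown = {s.label for s in sentence.spans} - set(self.element_labels)
        if unknown:
            raise ValueError(f"unknown labels: {sorted(unknown)}")
        self.examples.append(sentence)


# Hypothetical entry for the inversion pattern cited in the text.
inversion = Construction(
    name="Fronted_only_inversion",  # illustrative name, not an official one
    element_labels=["Fronted_phrase", "Auxiliary", "Subject"],
)

sentence = AnnotatedSentence("only then would she agree to it")
sentence.label_phrase("only then", "Fronted_phrase")
sentence.label_phrase("would", "Auxiliary")
sentence.label_phrase("she", "Subject")
inversion.add_example(sentence)

# A toy "constructicon": construction name -> entry with its examples.
constructicon: Dict[str, Construction] = {inversion.name: inversion}
```

Storing character offsets rather than copied strings keeps each label tied to the original sentence, which is one plausible way to let several annotation layers coexist over the same text.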
Although there is strong interest in non-core structures in the computational linguistics community, researchers do not know how many such structures there are, how important they are in NLP applications, how frequent they are in texts of different kinds, or whether the skills that enable trained linguists to recognize them can be reliably communicated to time-pressured annotators. This empirical study is providing that missing information.
The Constructicon and the full body of annotations will be made available to researchers via the FrameNet website, in both human-browsable and machine-readable form. The data will provide rich material for research on parsing, language understanding, and compositional semantics, and may serve as a training corpus for machine-learning methods of detecting known non-core constructions in raw text.