The Manually Annotated Sub-Corpus (MASC) is a shared corpus that supports research across several disciplines: linguistics, computational linguistics, psycholinguistics, sociolinguistics, and machine learning. It includes a wide variety of present-day American English texts annotated for several linguistic phenomena. Because MASC provides a unique resource, considerable community momentum has grown up around it. This project builds on that momentum to enable the corpus to grow on its own and to address the need for additional annotations. The major activities are to: (1) provide web-based mechanisms to facilitate community contribution and use of MASC annotations; (2) develop means to more fully automate the annotation validation process; (3) extend the WordNet annotations to cover adjectives, to support research on evaluation of "subjective" annotations and harmonization of WordNet with other resources; (4) promote use of MASC and new annotations by diverse groups, by sponsoring shared tasks that exploit the corpus's unique characteristics and supporting beta-testers of software, data, and annotations; and (5) aggressively develop an "Open Language Data" community around MASC through workshops, tutorials, and active participation in relevant community activities.
MASC provides an unparalleled resource for training and testing natural language processing tools, which can enable a major leap in the productivity of NLP research and ultimately affect the way people use and interact with computers. It is the first fully open, community-driven resource in the field. All data and annotations are freely distributed in a manner that permits immediate and easy access for users around the globe.
Corpora, or collections of texts organized around one or more commonalities, are an important resource for studies of language use across disciplines, including natural language processing, information retrieval, cognitive science, information science, and machine learning. Annotations enrich the observed language with unobserved information that is apparent to people in their use of language, and thus facilitate the study of human language. Annotations can identify the distinct units that make up a word or sentence, larger units that correspond to general discourse goals, or properties of language units such as the relations among them, their meanings, or their purposes.

The MASC project, a collaborative effort, extended and built upon a heterogeneous corpus of post-1990s American English that now carries twelve kinds of annotations on the full corpus. MASC also contains a companion corpus of word sense annotations on MASC sentences, using senses from WordNet, an extensive and widely used lexical resource of word meaning. The present award produced three kinds of results that advance our understanding of meaning in language use: the completion of the MASC word sense sentence corpus, a companion corpus using crowdsourced word sense labels, and a study of genre variation in the core MASC corpus.

The MASC word sense corpus applies WordNet senses to sentences drawn from the very heterogeneous MASC corpus; as a result, less common WordNet senses occur in the corpus. Words to be annotated were selected by four researchers prominent for creating and evaluating lexical and corpus resources. The 116 selected words are moderately polysemous (nouns, verbs, and adjectives with 7-8 senses each), yielding a total corpus size of well over 2 million words. This corpus increases our understanding of the issues involved in collecting high-quality sense annotations. Despite a common view that high agreement among annotators is unattainable with fine-grained inventories of more than about three senses per word, MASC methods led to very high agreement on half the words. The corpus thus serves as a valuable resource for further study of why some words yield lower agreement among annotators, and it has already been used to create a multilingual resource and to study the alignment of word meaning across distinct lexical resources. Word sense annotation within and across languages is increasingly useful given the growth of resources that are part of Linked Open Data (LOD); LOD contains concept names and relations, and work has already begun in several organizations to link word senses to concept names in these resources, making it possible to gather knowledge from text.

The finding that half of the words in the MASC word sense corpus had low agreement led us to develop a novel crowdsourced annotation method and a corresponding corpus. Untrained but highly motivated annotators were recruited so that many sense labels, rather than one, would be assigned to each annotated word. To convert the crowd's many labels into a single "correct" label, we applied a probabilistic model originally developed in the 1980s to infer a true label for diagnostic data, such as radiology films, from the opinions of many experts. Unlike conventional annotator reliability methods used in language annotation projects, this model assumes that different annotators have different degrees of accuracy, and it can estimate an annotator's accuracy given enough data from that annotator. It therefore produces much better results than majority voting in cases where a few minority annotators are more accurate than the majority. We extended the original model, which handles binary class labels, to accommodate many class labels (word senses) per word.
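The model described here closely matches the well-known Dawid-Skene estimator, which represents each annotator by a confusion matrix and fits the parameters with expectation maximization. The following Python sketch of a multi-class version is illustrative only, not the project's implementation; the function name, smoothing constant, and input format are assumptions made for the example.

    import numpy as np

    def dawid_skene(labels, n_classes, n_iter=100, tol=1e-6, smooth=0.01):
        """EM fit of a multi-class Dawid-Skene-style model (illustrative sketch).

        labels -- iterable of (item_id, annotator_id, label) triples,
                  with labels coded as integers 0..n_classes-1.
        Returns (posterior, confusion):
          posterior[i, j]    -- P(true label of item i is j)
          confusion[a, j, l] -- P(annotator a answers l | true label is j)
        """
        labels = list(labels)
        items = sorted({i for i, _, _ in labels})
        anns = sorted({a for _, a, _ in labels})
        item_idx = {v: n for n, v in enumerate(items)}
        ann_idx = {v: n for n, v in enumerate(anns)}
        I, A, C = len(items), len(anns), n_classes

        # counts[i, a, l] = number of times annotator a gave item i label l
        counts = np.zeros((I, A, C))
        for i, a, l in labels:
            counts[item_idx[i], ann_idx[a], l] += 1

        # Soft majority vote as the starting posterior over true labels.
        T = counts.sum(axis=1) + smooth
        T /= T.sum(axis=1, keepdims=True)

        for _ in range(n_iter):
            # M-step: class priors and per-annotator confusion matrices.
            priors = T.mean(axis=0)
            confusion = np.einsum('ij,ial->ajl', T, counts) + smooth
            confusion /= confusion.sum(axis=2, keepdims=True)

            # E-step: posterior over each item's true label.
            log_post = np.log(priors) + np.einsum('ial,ajl->ij',
                                                  counts, np.log(confusion))
            log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
            T_new = np.exp(log_post)
            T_new /= T_new.sum(axis=1, keepdims=True)

            if np.abs(T_new - T).max() < tol:
                T = T_new
                break
            T = T_new

        return T, confusion

The argmax of each posterior row gives the adjudicated sense, and the corresponding probability serves as a natural per-label confidence value; majority voting, by contrast, weights every annotator equally.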
The results demonstrated that crowdsourcing can produce higher-quality annotations at a lower cost per "correct" label, with the additional advantage over conventional methods of providing a confidence value for each label.

Because it includes equal parts of nineteen distinct genres and carries high-quality annotations, the MASC corpus is ideal for studies of genre variation: differences in the kinds of words and grammatical constructions used in genres such as news, fiction, technical reports, and social media. This project tested a hypothesis from the 1980s that had been developed on a similarly diverse corpus of British English using less reliable annotations. Our results confirmed several of the hypothesized dimensions of variation but added a highly influential additional dimension based on the distribution of proper nouns, which had not been studied in the earlier work. The visualizations and metrics produced as part of this study help explain why genre classification research often produces apparently contradictory results. An increased understanding of genre variation can contribute to the growing field of domain adaptation, which addresses the failure of statistical methods for language analysis to generalize from one domain or genre to another.
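In multidimensional studies of this kind, per-document rates of linguistic features are standardized and factored so that each resulting dimension can be interpreted from its feature loadings; a dimension dominated by proper-noun density, for example, would surface as a strong NNP loading. The following is a minimal sketch of that workflow, assuming documents arrive as part-of-speech tag sequences and using principal components in place of a full factor analysis; the FEATURES list and all names are invented for the example and the feature set is far smaller than in real studies.

    import numpy as np
    from sklearn.decomposition import PCA

    # Illustrative feature set (Penn Treebank tags); real analyses use
    # dozens of lexical and grammatical features.
    FEATURES = ['NNP', 'PRP', 'VBD', 'MD', 'JJ', 'IN']

    def feature_rates(pos_tags):
        """Per-1000-token rate of each feature tag in one document."""
        n = len(pos_tags)
        return np.array([1000.0 * pos_tags.count(f) / n for f in FEATURES])

    def genre_dimensions(docs, n_dims=2):
        """Standardize feature rates across documents and extract principal
        components; the loadings show which features define each dimension."""
        X = np.array([feature_rates(tags) for _, tags in docs])
        X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)  # z-score features
        pca = PCA(n_components=n_dims)
        scores = pca.fit_transform(X)      # each document's position per dimension
        return scores, pca.components_     # feature loadings per dimension

    # Toy demo (real input would be per-document tag sequences from MASC):
    docs = [
        ('news',    ['NNP', 'NNP', 'VBD', 'IN', 'NNP', 'NN'] * 50),
        ('fiction', ['PRP', 'VBD', 'JJ', 'NN', 'PRP', 'VBD'] * 50),
        ('news',    ['NNP', 'VBD', 'IN', 'NNP', 'JJ', 'NN'] * 50),
        ('fiction', ['PRP', 'MD', 'VB', 'JJ', 'NN', 'PRP'] * 50),
    ]
    scores, loadings = genre_dimensions(docs)

Plotting the document scores by genre gives the kind of visualization described above, and inspecting each row of the loadings shows which features define the corresponding dimension.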