With National Science Foundation support, Drs. Nancy Ide and Randi Reppen will conduct a three-year project to extensively annotate a 10-million-word portion of the American National Corpus (ANC). The ANC consists of both spoken and written language from North America across a range of registers, such as planned speeches, conversations, fiction, and newspapers. This research project uses techniques from both computational linguistics and corpus linguistics to annotate the ANC for a range of grammatical and semantic characteristics. Specifically, the project seeks to accomplish three major objectives: 1) develop automatic tools for annotating various elements and structures in the corpus; 2) create a 'gold standard' portion of the ANC, consisting of 10 million words in which the markup, annotation, and parts of speech have been hand-validated; and 3) describe the conceptual and meaning relations among words in the ANC within the framework of the 'semantic web', thus greatly enhancing analysis and retrieval capabilities. The investigators will carry out this research using a variety of software programs (many created specifically for this project) and extensive human/computer interaction to hand-validate the computer-assigned labels.
This research project is important for several reasons. First, the resulting corpus will be the first publicly available tagged corpus of spoken and written American English. Second, because the annotation of the corpus will be hand-validated, the resulting product will approach 100% accuracy. With this carefully annotated 10-million-word corpus, language researchers will be able to investigate a number of structural and linguistic relationships across texts that previously could not be studied. Because the annotations will be hand-validated, researchers can also use them to develop models for processing previously unseen texts. The ANC will be readily available to researchers via the web. In addition to the annotated corpus, the project will make available to researchers a suite of tools designed to retrieve information from the corpus.