The internet provides both general users and linguists with a massive amount of information and linguistic data that is readily accessible to anyone with a computer and valuable for research in linguistics, computational linguistics, and the social sciences. However, the nature of the different types of language used on the web remains unclear. To better understand the language of the internet, Drs. Biber and Davies will develop a comprehensive linguistic taxonomy of web registers. This taxonomy will be applied to a large, representative corpus of internet texts, which will be made freely available as an online resource for a range of research purposes.

The initial framework for classifying texts into register categories will be developed through hand-coding (using a rubric based on situational characteristics) and computational linguistic analysis of a sample of web sites indexed by Netscape's Open Directory Project. The resulting taxonomy will then be applied automatically to a second corpus (approximately 100,000 texts) and integrated into a searchable online interface. Assuming that each web text contains an average of 1,000 words, this searchable online corpus will contain approximately 100 million words.
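
The two-stage workflow described above (hand-code a sample, then apply the categories automatically to a larger corpus) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the project's actual method: the feature set (`FEATURE_WORDS`), the nearest-centroid classifier, and the helper names are all hypothetical. The last function simply reproduces the corpus-size estimate of 100,000 texts at an average of 1,000 words each.

```python
import math
from collections import Counter

# Hypothetical feature set: relative frequencies of a few common words.
# The abstract does not specify the linguistic features actually used.
FEATURE_WORDS = ["i", "you", "the", "of", "was"]


def features(text):
    """Return the relative frequency of each feature word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty texts
    return [counts[word] / total for word in FEATURE_WORDS]


def train_centroids(hand_coded):
    """Average the feature vectors of the hand-coded texts per register."""
    sums = {}
    n_texts = Counter()
    for text, register in hand_coded:
        acc = sums.setdefault(register, [0.0] * len(FEATURE_WORDS))
        for i, value in enumerate(features(text)):
            acc[i] += value
        n_texts[register] += 1
    return {reg: [v / n_texts[reg] for v in acc] for reg, acc in sums.items()}


def classify(text, centroids):
    """Assign the register whose centroid is nearest (Euclidean distance)."""
    vec = features(text)
    return min(centroids, key=lambda reg: math.dist(vec, centroids[reg]))


def estimated_corpus_words(n_texts, mean_words_per_text):
    """Corpus size estimate: number of texts times average text length."""
    return n_texts * mean_words_per_text
```

In use, `train_centroids` would be given pairs of (text, register label) from the hand-coded sample, `classify` would then label each text in the second corpus, and `estimated_corpus_words(100_000, 1_000)` gives the 100 million word figure stated above.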

The linguistic descriptions resulting from this project, together with the searchable online corpus, will provide the basis for more principled uses of the web as a data source. These uses include treating the web as a corpus to test hypotheses about language variation and change; identifying probabilistic patterns that can be incorporated into tools for lexicographic research or natural language processing applications; studying ways to extract essential information on specialized topics from web documents; and identifying examples of 'authentic' language that illustrate the use of words or grammatical constructions. The present project will provide detailed linguistic descriptions of the nature of the source documents, as well as a valuable online resource to facilitate future investigations of web-based texts.

Budget Start: 2012-09-15
Budget End: 2017-02-28
Fiscal Year: 2011
Total Cost: $331,806
Name: Northern Arizona University
City: Flagstaff
State: AZ
Country: United States
Zip Code: 86011