Written language data play a critical role both in linguistic research and in the development of modern software tools for language processing. Traditionally, data of this kind have been expensive to produce, requiring linguistic fieldwork, digitization of printed materials, or cooperation with commercial publishers. Over the last fifteen years, however, the vast quantities of text available on the web have made it possible to assemble "disposable" language databases quickly and easily for many languages. The relative ease with which indigenous and minority language groups can publish material online, through blogs, social media sites, and online newspapers, has brought the benefits of modern language processing to a much wider range of languages than was ever thought possible.
This project involves the collection and dissemination of language data (in the form of word frequency lists, sample texts, etc.) for over 1200 languages. The data are gathered by a web crawler that uses statistical methods to identify the language of documents automatically. The data will be made freely available in convenient formats to linguists, to support research on endangered languages; to software developers, to help in producing computational tools that make it easier for endangered language communities to use their languages online; and to local communities, to assist in grassroots language revitalization projects.
One of the scientific challenges will be the development of language identification techniques that scale up to thousands of languages, and that work effectively with limited training data.
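To make the idea of statistical language identification with limited training data concrete, the following is a minimal sketch, not the project's actual method. It assumes a character-trigram model with additive smoothing scored in the style of naive Bayes; the function names, the smoothing constant, and the two tiny training samples are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples, n=3):
    """Build one n-gram count profile per language from a small text sample."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def identify_language(text, profiles, n=3, alpha=0.5):
    """Pick the language whose smoothed n-gram model gives the text the highest log-probability."""
    # Shared vocabulary across all profiles, plus one slot for unseen n-grams.
    vocab_size = len(set().union(*profiles.values())) + 1
    grams = char_ngrams(text, n)
    best_lang, best_score = None, -math.inf
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # Additive smoothing keeps unseen n-grams from zeroing out the score,
        # which matters when the training sample is very small.
        score = sum(
            math.log((counts[g] + alpha) / (total + alpha * vocab_size))
            for g in grams
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Hypothetical toy training data: one short sentence per language (illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "ga": "tá an aimsir go hálainn inniu agus tá an ghrian ag taitneamh go geal",
}

PROFILES = train_profiles(SAMPLES)
print(identify_language("the dog sat on the mat", PROFILES))
print(identify_language("tá an ghrian ag taitneamh", PROFILES))
```

Character n-grams are a common choice in this setting because they need no tokenizer or dictionary and can be estimated from only a few sentences. Scaling such an approach to thousands of languages, many of them closely related, is the harder problem described above.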
The project also involves collaboration with local language groups and community members to develop tools such as spell checkers and grammar checkers, and it includes capacity building within those communities. Finally, while the primary focus of the project is the production of useful linguistic data, it will incidentally provide the best answer to date to the fundamental question "How many languages are represented on the web?", which has been the subject of research by academics and public-benefit organizations like UNESCO since the early days of the web.
The Division of Information & Intelligent Systems of the Directorate for Computer & Information Science & Engineering is [co-]funding this award as part of its commitment to support the development of computational tools and methods for the documentation of endangered languages.