This project will furnish several 'building blocks' for data interoperability within linguistics and all disciplines that use language data. The work will be based on the General Ontology for Linguistic Description (GOLD), a machine-readable information structure which allows computers to process and 'understand' linguistic concepts and the relations among them. Using GOLD, the project will develop an extensive network of ontology-aware lexical items drawn from sixteen different projects and over 3000 languages. Thus, computers will be able to understand the relationships between linguistic categories across languages, and interpret their linguistic function when they appear in texts. In addition, the project will develop a set of low-barrier data requirements which lexicon creators can implement in order to join this ontology-based network. It will also create an architecture to integrate network data into frameworks developed by major international standards initiatives. Finally, the project will establish DevSpace, an online facility designed to promote continuing information- and resource-sharing among linguists and developers interested in augmenting the network with additional tools and services.
Such a project is important because cross-linguistic data is central to many research communities. Language history and language comparison can provide critical insights into the genetics, culture, migrations, and contacts of human populations. Natural language data is also indispensable to major computational research initiatives, such as multilingual text processing. In providing linguistically interpreted lexical data from so many underdescribed languages, LEGO will ultimately aid in meaning extraction from texts, even for languages far too small to justify a full-scale natural language processing system. Thus, from both a computational perspective and a Humanities and Social Sciences perspective, the LEGO project will create a research resource of remarkable breadth and diversity, one which will serve multiple disciplines.
The major goal of this project was to create a sustainable, accessible data network of lexicons of endangered languages, with a multi-lexicon search facility based on the GOLD (General Ontology for Linguistic Description) ontology. Specifically, the LEGO project aimed to make publicly available a significant number of lexicons of endangered languages in a standardized format, with grammatical information mapped to the GOLD ontology, as well as a significant number of wordlists of endangered languages in a standardized XML format. These languages included Shoshone, Western Pantar, Western Sisaala, Tamashek, Fulfulde, Archi, Potawatomi, Mocovi, Biao Min, Qiang, VerbMobil German, Ibibio, Nhirrpi, Titan, Jarawara, Mbodomo, and Medumba. While most of the material was uploaded by project participants, the project also wrote an uploader that allows a linguist to join the datanet independently by uploading a lexicon and mapping it to GOLD. To make this material usable and accessible, the project wrote a multi-lexicon/wordlist browsing and search facility that supports search by language, language code, lexical item, gloss, and grammatical information. Over its five years, the LEGO project made publicly available on the Internet 25 lexicons of endangered languages (4 more are awaiting approval by their authors, and 5 more will be added this summer) and 2817 wordlists from understudied languages, supplemented by downloadable schema and stylesheets for converting lexicons into the format required by the LEGO datanet (LL-LIFT).