Code switching (CS) is a common practice among bilingual speakers of a given language pair in which the speakers switch back and forth between their shared languages. CS occurs in all genres of communication and at different levels of linguistic representation.

Computational algorithms trained for a single language fail when the input signal contains other languages, i.e., data exhibiting CS phenomena. One major barrier to research on processing CS is the lack of large, accurately annotated corpora of CS data.

This planning proposal aims to create the framework for a large, consistently annotated data repository targeting seven languages, annotated with features at different levels of granularity. In the course of the planning grant, we will hold a community workshop to ensure that the repository addresses the research community's needs, and we will work with the community to prepare the full Computing Research Infrastructure (CRI) proposal.

These data will be transformative for computational linguistics research: they will provide a testbed for adaptive learning algorithms, lead to significantly more robust handling of very diverse data sources, and create a framework for genuine multilingual processing. Moreover, the repository will have a direct impact on the way sociolinguists account for CS, leading to more robust and replicable generalizations.

Research on CS will help recognize the creativity of bilinguals in exploiting their verbal repertoire. The CS repository will enable new research in many interconnected fields and will contribute to raising general awareness of bi/multilingualism.

Project Report

The main goal of this project was to undertake the planning activities needed to develop a strong submission for a large Computing Research Infrastructure (CRI) proposal. Our ultimate goal is to collect a large repository of code-switched data, consistently annotated across different language pairs and at different levels of granularity, from phonology/morphology to pragmatics and discourse. Our planning activities therefore focused on specifying the types of annotation needed for the repository, identifying the research interests from the wider research community that this resource can serve, and ensuring that the annotation standards adopted would allow wide use of the repository.

The findings from our planning activities include the following:

We need two levels of annotation. During our discussions it became apparent that, to ensure interoperability across languages, we needed a set of core annotations. However, because the languages differ typologically, we also recognized that some language pairs need additional annotations to allow the study of more interesting CS questions. We therefore decided to have a set of core annotations shared across all languages, with each language pair adding a subset of more language-specific annotations. Part-of-speech (POS) tags are one example: we have a core tag set for all languages, and each language pair will have its own fine-grained set.

All annotators need to receive specific training. Our pilot annotations have shown that annotators need a short training session that gives them the appropriate background for the project.

We need a common tool to support the annotation effort. We agreed that developing a unified annotation tool will be advantageous, as it will ensure a standardized process across language pairs.

Code-switching poses different challenges to different POS taggers. Our preliminary experiments show that each tagger's accuracy degrades differently when processing code-switched data, and that the degree of degradation depends on the type of code-switching present. We are still analyzing these results, but so far the preliminary findings support our hypothesis that syntactic analyzers, as well as higher-level language processing tasks, need to model code-switching phenomena.

Outcomes of this project include a working set of core annotation guidelines that can be adopted by other researchers studying code-switching, a specific list of corpora to be annotated, and the resubmission of the large proposal to the National Science Foundation. We also began studying how code-switching affects the performance of existing monolingual part-of-speech taggers, and we plan to report these findings in a forthcoming research paper; a minimal sketch of this kind of evaluation appears below.
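As a rough illustration of the evaluation described in the report, the following minimal Python sketch tags a monolingual sentence and a code-switched sentence with a toy English-only tagger and scores both against gold core POS tags. The toy lexicon, the example sentences, their gold tags, and the core tag set are all illustrative assumptions, not the project's actual data, annotation scheme, or tools.

```python
# Illustrative sketch only, not the project's evaluation code. The toy
# lexicon, the two example sentences, their gold tags, and the "core"
# POS tag set below are assumptions made for this example.

# A toy "English-only" tagger: it knows a handful of English words and
# falls back to NOUN for anything out of vocabulary, which is roughly
# how an unknown-word fallback behaves in a real monolingual tagger.
ENGLISH_LEXICON = {
    "i": "PRON", "went": "VERB", "to": "ADP", "the": "DET",
    "store": "NOUN", "but": "CONJ", "it": "PRON", "was": "VERB",
    "closed": "ADJ", ".": "PUNCT",
}

def english_only_tagger(tokens):
    """Tag each token with its lexicon entry, or NOUN if unknown."""
    return [ENGLISH_LEXICON.get(tok.lower(), "NOUN") for tok in tokens]

def accuracy(tagger, tokens, gold_tags):
    """Fraction of tokens whose predicted core tag matches the gold tag."""
    predicted = tagger(tokens)
    return sum(p == g for p, g in zip(predicted, gold_tags)) / len(gold_tags)

# Monolingual English sentence and its gold core tags.
mono = ["I", "went", "to", "the", "store", "but", "it", "was", "closed", "."]
mono_gold = ["PRON", "VERB", "ADP", "DET", "NOUN",
             "CONJ", "PRON", "VERB", "ADJ", "PUNCT"]

# Spanish-English code-switched sentence and its gold core tags.
cs = ["I", "went", "to", "the", "tienda", "pero", "estaba", "cerrada", "."]
cs_gold = ["PRON", "VERB", "ADP", "DET", "NOUN",
           "CONJ", "VERB", "ADJ", "PUNCT"]

print("monolingual accuracy:  ", accuracy(english_only_tagger, mono, mono_gold))
print("code-switched accuracy:", accuracy(english_only_tagger, cs, cs_gold))
# Most Spanish tokens ("pero", "estaba", "cerrada") fall back to NOUN and are
# mis-tagged, so accuracy drops on the code-switched sentence, mirroring the
# degradation observed with real monolingual taggers.
```

The same harness could be pointed at a real monolingual tagger, with its output collapsed to the core tag set, to measure the degradation on actual code-switched corpora.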

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 0958088
Program Officer: Tatiana D. Korelsky
Budget Start: 2010-03-01
Budget End: 2012-02-29
Fiscal Year: 2009
Total Cost: $21,992
Name: University of Alabama Birmingham
City: Birmingham
State: AL
Country: United States
Zip Code: 35294