Code switching (CS) is the common practice among bilingual speakers of a given language pair of switching back and forth between the languages they share. CS occurs in all genres of communication and at different levels of linguistic representation.
Computational algorithms trained on a single language fail when the input signal contains other languages, i.e., data exhibiting CS phenomena. One major barrier to research on processing CS is the lack of large, accurately annotated corpora of CS data.
This planning proposal aims to create the framework for a large, consistently annotated data repository that will target 7 different languages, annotated with features at different levels of granularity. In the course of the planning grant, we plan to hold a community workshop to ensure that the repository addresses the research community's needs. We will work with the community to prepare the full CRI proposal.
This data will be transformative for computational linguistics research, as it will provide a testbed for adaptive learning algorithms, lead to significant gains in robustness when handling very diverse data sources, and create a framework for genuine multilingual processing. Moreover, it will have a direct impact on the way sociolinguists account for CS, leading to more robust and replicable generalizations.
Research on CS will help acknowledge the creativity of bilinguals in exploiting their verbal repertoire. The CS repository will enable new research in many interconnected fields. This research will contribute to raising general awareness of bi/multilingualism.
Intrasentential Linguistic Code Switching (ILCS) is the process by which bilingual speakers switch languages mid-utterance. An example from English/Egyptian Arabic ILCS is "I was working on the program, bas el computer hanneg before finishing." (translation of the Arabic portion: "the computer hung"). The phenomenon of ILCS is pervasive among bilingual speakers and is emerging rapidly in informal genres online. ILCS poses a significant impediment to NLP and speech processing of informal genres. We believe that large annotated resources for ILCS would be an invaluable contribution to computational linguistics as well as to theoretical linguistics areas that study ILCS.
In this project, we carried out a pilot feasibility study for the creation of an annotated multilingual, multigenre ILCS repository covering several language pairs. We carried out preliminary studies of Modern Standard Arabic/Dialectal Arabic, Arabic/English, Hindi/English, and Spanish/English. We held two round-table workshops to which we invited scientists from different fields interested in ILCS to give us feedback and help us design the makeup of such a repository, i.e., what kind of information would be of interest to their lines of research. Among the attendees were NLP and speech processing experts, sociolinguists, theoretical linguists, and neurolinguists. In the first workshop we came up with a set of desiderata for the annotation of such data; in the second workshop we experimented with the annotations on pilot data and further refined the annotation guidelines. We also collected large amounts of data for the language pairs in question; for instance, we collected over 9M words for Modern Standard Arabic/Dialectal Arabic. We collected initial pilot data for both speech and written text. In our guidelines we identified several levels of annotation: morphological, lexical, syntactic, and pragmatic.
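To make the idea of token-level ILCS annotation concrete, the following is a minimal sketch of how the example sentence above might be represented, with one record per token and a slot for each annotation level. It is an illustration only and not the project's actual annotation format: the `Token` class, its field names, the `switch_points` helper, and the language tags (`en`, `arz`) are all assumptions introduced here, and tagging "computer" as part of the Arabic portion simply follows the gloss given with the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One token with hypothetical slots for each annotation level."""
    text: str
    lang: str                        # lexical level: language tag (assumed scheme)
    morph: Optional[str] = None      # morphological level (left empty here)
    syntax: Optional[str] = None     # syntactic level (left empty here)
    pragmatic: Optional[str] = None  # pragmatic level (left empty here)


def switch_points(tokens: List[Token]) -> List[int]:
    """Return the indices at which the language tag differs from the previous token."""
    return [i for i in range(1, len(tokens))
            if tokens[i].lang != tokens[i - 1].lang]


# The example sentence, punctuation dropped; tags are illustrative assumptions.
WORDS = "I was working on the program bas el computer hanneg before finishing".split()
LANGS = ["en"] * 6 + ["arz"] * 4 + ["en"] * 2
sentence = [Token(w, l) for w, l in zip(WORDS, LANGS)]

if __name__ == "__main__":
    for i in switch_points(sentence):
        print(i, sentence[i - 1].text, "->", sentence[i].text)
```

Under these assumed tags, the sentence has two switch points: one entering the Arabic portion at "bas" and one returning to English at "before". A real scheme would also need to handle borrowings and ambiguous tokens, which this sketch does not attempt.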
We tried to come up with a set of guidelines that would cover all language pairs simultaneously: each level of linguistic annotation has a unified base, and specific language pairs, genres, and modalities have more detailed guidelines relevant to the specific phenomenon. The intuition is to create a common framework for deriving contrasting environments that could inform our understanding of typological variation and its implications for ILCS. For annotation of sample data, we used in-lab annotators, both lay people and informed linguists. We also experimented with crowdsourcing as a means of collecting large amounts of annotation for the Modern Standard Arabic (MSA)/Dialectal Arabic data as well as for the Hindi/English data. The latter effort was reported in a publication at the IJCNLP 2011 Workshop on Asian Language Resources (ALR 9) in Thailand, November 2011.
As a result of the successful planning award, we submitted a large CRI grant proposal for creating a repository of annotated resources for both speech and written ILCS phenomena in the following language pairs: Chinese/English, Modern Standard Arabic/Dialectal Arabic, Arabic/English|French, Hindi/English, and Spanish/English. We believe that these pairs cover a wide variety of language pairs of interest to the scientific community. They are also of vital interest to the USA in general due to the presence of significant subcommunities of speakers of these languages. In choosing the languages we paid special attention to the typological characteristics of the language pairs, for example contrasting rich/poor morphology and free/strict word order, in order to study the impact of such variation on the ILCS phenomenon. The presence of such a large annotated repository should help boost research in adaptive NLP and multilingual NLP, in addition to theoretical linguistics and sociolinguistics.