The main goal of this project is to build a team of undergraduate students in Linguistics and Computer Science who will work on the research project collaboratively. Through this experience, the students will learn about each other's disciplines, the linguistic properties of many interesting languages, tagset design, and various supervised and unsupervised approaches to morphosyntactic tagging, and will acquire mathematical modeling and algorithmic skills, all while working as part of the research team.
In this research, the PI together with her students will explore how cross-lingual correspondences can be used for scientific and technological benefit, and provide a better understanding of what linguistic properties are crucial for morphosyntactic transfer.
This award will play a central role in expanding inquiry-based learning, fostering student collaborations, and providing valuable hands-on experience with highly practical work on difficult, morphologically rich, under-represented languages. The impact of this project is expected to translate into significant gains in the recruitment of minority students through research training in Computer Science and Linguistics. The undergraduates who work on this project will collaborate on the writing of research publications and present at national meetings so that they can contribute to the research culture while enhancing their research awareness.
Morphological analysis, tagging and lemmatization are essential for many Natural Language Processing (NLP) applications of both practical and theoretical nature. Modern taggers and analyzers are very accurate. However, the standard way to create them for a particular language requires a substantial amount of expertise, time and money. A tagger is usually trained on a large corpus (100,000 words or more) annotated with correct tags. Morphological analyzers usually rely on large manually created lexicons. For example, the Czech analyzer (Hajic 2004) uses a lexicon with 300,000+ entries. As a result, most of the world's languages and dialects have no realistic prospect of obtaining morphological taggers or analyzers created this way.
We have developed a method for creating morphological taggers and analyzers for fusional languages without the need for large-scale knowledge- and labor-intensive resources for the target language. Instead, we rely on (i) resources available for a related language and (ii) a limited amount of high-impact, low-cost manually created resources. This greatly reduces cost, time requirements and the need for (language-specific) linguistic expertise.
We have built a team of undergraduate students in linguistics and computer science who worked on the research project collaboratively. Through this experience, the students learned about each other's disciplines, the linguistic properties of many interesting languages, and various supervised and unsupervised approaches to morphosyntactic tagging, and acquired mathematical modeling and algorithmic skills, all while working as part of the research team. Several undergraduate students have presented this research at local and national venues and contributed to the writing of research publications. All undergraduate students who participated in this research have either secured full-time positions in software engineering or continued to graduate school.