A Morphological Analyzer for Old Icelandic in FM/Haskell

Tangherlini, Timothy

Abstract

The goal of this project is to develop a morphological analyzer for Old Icelandic using an extension of the functional programming language FM/Haskell. Old Icelandic, one of the most complex of the early Germanic languages, is closely related to Old English. Old Icelandic is also interesting as the language of the sagas, Nordic mythology and early Germanic law. One of the main features of FM/Haskell is code that is easily interpreted, even by non-programmers; this allows the investigators to include students in the research team and to solicit feedback from the user community.

An important feature of the morphological analyzer system is the inclusion of an English-language look-up tool. Because Old Icelandic is highly inflected, students and non-experts often cannot find appropriate English definitions without considerable effort. The system also incorporates a clear method for debugging and error correction. The research team will develop simple user interfaces for adding inflectional prototypes and sub-prototypes, and for adding and editing lexical resources. When fully developed, the system will offer far greater accuracy than earlier morphological analyzers for Old Icelandic.

Given the architecture of the system, the approach can be easily extended to other languages; this will be demonstrated by developing a proof-of-concept analyzer for Old English. This extensibility is important for international collaborative efforts that focus on the development of these systems. The research community will be provided with a fast and highly accurate method for adding morphosyntactic detail to the rapidly increasing digital corpus of Old Icelandic documents; with this added detail, the speed and sophistication of searches on the corpus will increase dramatically. All code and instructions will be freely distributed, along with the library of language functions, allowing others to build systems for other morphologically complex languages.

Project Report

IceMorph: An Automated Morphological Analyzer in FM/Haskell and English Language Look-up Tool for Old Icelandic

The advent of inexpensive computing and the creation of large machine-actionable corpora of well-structured digital texts have made it possible to analyze significant amounts of text (> 1,000,000 tokens) and mark them for morphosyntactic features rapidly, automatically, and with a high degree of accuracy (> 80%). Although the problem of automatically tagging text with part-of-speech (POS) information has been largely solved for languages with little morphonological complexity, more complex languages, such as Old Icelandic (OIc) and other ancient languages, continue to pose problems for automated systems. Despite these difficulties, rich morphosyntactic markup that includes lemmatization holds great promise for both linguistic and textual scholarship. Accurate markup would enable the development of sophisticated online study environments that allow researchers to perform complex searches, make comparisons across multiple texts, and generate calculations concerning word use and syntactical patterns.

Our work, focusing on Old Icelandic, confirms that even for morphonologically complex Indo-European languages, the information gain from automatic morphosyntactic analysis of texts, measured as the percent of correctly tagged tokens, sentences and complete texts over the extant corpus, is a marked improvement over previously available hand-marked text.

A dream of many researchers in Old Icelandic is to be able to work with a large number of texts (and manuscript witnesses to texts), or even a comprehensive corpus, that include the high level of morphosyntactic detail of the early handbooks. Similarly, historical linguists (especially syntacticians) are eager to work with a much larger parsed corpus of Old Icelandic texts than is currently available.
Recent work, such as that of the Icelandic Parsed Historical Corpus group (IcePaHC), is a major step towards making such resources available, as it provides a considerable number of texts tagged in a semi-supervised fashion, and moves us closer to a comprehensive parsed Old Icelandic corpus. Yet it is unlikely that IcePaHC alone will provide adequate coverage for Old Icelandic textual research, in part because it is focused on the historical development of Icelandic up through the present, and in part because it provides limited lemmatization of the texts. As such, IcePaHC diverges from our project, which has as its sole focus the morphosyntactic analysis and lemmatization of Old Icelandic texts. We believe that the computational methods developed by our group can augment those of IcePaHC and others, and have the potential not only to extend the necessarily limited scope of the earlier, historical handbooks, but also to increase considerably the number of richly marked texts available to researchers.

The automatic morphosyntactic tagging, lemmatization and disambiguation of Old Icelandic texts is a non-trivial task. To approach this challenge, we divided our system, which we dubbed IceMorph, into two main parts: (1) a probabilistic morphological analyzer and disambiguation machine, and (2) a deterministic inflection engine. A dictionary look-up tool provides a useful extension to the combined system, returning the available English-language definitions for any given lemma. Integrating these functions with a machine-actionable text corpus completes the system.

IceMorph is intended to meet these two challenges, and does so with increasing accuracy:

(1) Given a word form from a text, IceMorph returns a lemma, its inflection, and syntactical detail for the word form in its textual context.
(2) Given a lemma from the dictionary, IceMorph returns an inflectional paradigm of the word, including irregular features, and discovers attestations of the inflected forms in the corpus. In future iterations, IceMorph will also return all examples of the form in context (keyword in context).

IceMorph is now available for researchers, students and the general public alike. Our easily repurposable code, which can be used for similar projects in other highly inflected, ancient Indo-European languages such as Old English (the language of Beowulf), is available through GitHub.

A feature of our approach is the use of FM/Haskell, a programming language that is easily understood by non-programmers. This programming decision allows linguistic scholars who may not have deep training in computer programming to contribute substantively to any similar implementation.

Another important feature of our work is the very small "training set" that we use to "train" our morphological analyzer. While many other projects use training sets of approximately 100k words, we use a training set of ~400 words. By harnessing expert feedback into a bootstrapping function, the accuracy of our system grows with use.

Finally, our implementation of an easy-to-use, inflected Old Icelandic-English dictionary allows students and researchers who are not familiar with the language to develop such familiarity rapidly.

The three accompanying images show the IceMorph web portal, and the two tools: the saga browser and the dictionary browser.
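
The two tasks described above (form to lemma, and lemma to paradigm) can be illustrated with a minimal sketch in plain Haskell. It is not IceMorph's code and does not use the FM library's API: the type names, the `strongMascA` paradigm function, the two-entry lexicon and the `analyze` and `gloss` helpers are all assumptions made for illustration. The sketch covers only the deterministic side (paradigm generation, a full-form index built by inverting the paradigms, and an English gloss look-up); it does not attempt probabilistic disambiguation in context, and it ignores sound changes such as u-umlaut.

```haskell
module Main where

import qualified Data.Map.Strict as Map

-- Hypothetical types for illustration; not IceMorph's or FM's actual API.
data Number = Sg | Pl deriving (Show, Eq, Ord, Enum, Bounded)
data Case   = Nom | Acc | Dat | Gen deriving (Show, Eq, Ord, Enum, Bounded)

type Tag   = (Number, Case)
type Lemma = String

-- Inflection engine: a paradigm function maps a lemma to every (tag, form) pair.
-- This one follows the strong masculine a-stem pattern of hestr 'horse'.
-- Assumes a non-empty lemma ending in nominative -r; no umlaut handling.
strongMascA :: Lemma -> [(Tag, String)]
strongMascA lemma =
  [ ((n, c), stem ++ ending n c) | n <- [minBound ..], c <- [minBound ..] ]
  where
    stem = init lemma          -- strip the nominative singular -r
    ending Sg Nom = "r"
    ending Sg Acc = ""
    ending Sg Dat = "i"
    ending Sg Gen = "s"
    ending Pl Nom = "ar"
    ending Pl Acc = "a"
    ending Pl Dat = "um"
    ending Pl Gen = "a"

-- Toy lexicon: lemma, its paradigm function, and an English gloss.
lexicon :: [(Lemma, Lemma -> [(Tag, String)], String)]
lexicon =
  [ ("hestr", strongMascA, "horse")
  , ("fiskr", strongMascA, "fish")
  ]

-- Full-form index: run every paradigm and invert it, so that each surface
-- form points to all (lemma, tag) analyses that generate it.
formIndex :: Map.Map String [(Lemma, Tag)]
formIndex = Map.fromListWith (flip (++))
  [ (form, [(lemma, tag)])
  | (lemma, paradigm, _) <- lexicon
  , (tag, form) <- paradigm lemma
  ]

-- Task (1), without context: all candidate analyses of a word form.
analyze :: String -> [(Lemma, Tag)]
analyze form = Map.findWithDefault [] form formIndex

-- Dictionary look-up: English gloss for a lemma, if the lexicon has one.
gloss :: Lemma -> Maybe String
gloss lemma = lookup lemma [ (l, g) | (l, _, g) <- lexicon ]

main :: IO ()
main = do
  mapM_ print (strongMascA "hestr")  -- task (2): the full paradigm
  print (analyze "hesta")            -- ambiguous: accusative or genitive plural
  print (gloss "hestr")
```

In this sketch `analyze "hesta"` returns two candidates, because the accusative and genitive plural share a form. Ambiguity of this kind is what a disambiguation component would have to resolve from the textual context.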

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Behavioral and Cognitive Sciences (BCS)
Application #: 0921123
Program Officer: Joan Maling

Budget Start: 2009-09-15
Budget End: 2013-08-31
Fiscal Year: 2009
Total Cost: $111,990

Institution: University of California Los Angeles, Los Angeles, CA, United States
