This project is developing a system that simultaneously discovers the patterns of word morphology and parts of speech for a wide range of the world's languages from unannotated text. Given a quantity of training text, the system will yield a transducer that segments the words in new texts into stems and affixes and determines the part of speech of each word as a whole. Through unsupervised learning, an iterative bootstrapping procedure will combine several different linguistic knowledge sources to gradually build up a representation of the language in the form of paradigms. From these paradigms, symbolic part-of-speech rules and morphophonological rewrite rules will be extracted and compiled into a probabilistic finite-state transducer that can label new texts with morphology and part of speech.
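As a rough illustration of what a paradigm-based analysis produces, the following minimal sketch segments a word into stem and suffix and assigns a part of speech. It assumes a paradigm is simply a part-of-speech label with a set of stems and a set of suffixes. The `Paradigm` class, the `analyze` function, and the English toy data are hypothetical and hand-written here; the project's paradigms would be learned from unannotated text, and its transducer would be probabilistic and would also apply morphophonological rewrite rules, none of which this sketch models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    """Hypothetical paradigm: stems that share one suffix set and one part of speech."""
    pos: str
    stems: frozenset
    suffixes: frozenset  # "" stands for the null suffix


def analyze(word, paradigms):
    """Return every (stem, suffix, pos) split of `word` licensed by some paradigm."""
    analyses = []
    for paradigm in paradigms:
        # Try every split point; the suffix may be empty.
        for i in range(1, len(word) + 1):
            stem, suffix = word[:i], word[i:]
            if stem in paradigm.stems and suffix in paradigm.suffixes:
                analyses.append((stem, suffix, paradigm.pos))
    return analyses


# Hand-written toy paradigms, for illustration only.
PARADIGMS = [
    Paradigm("VERB", frozenset({"walk", "jump"}), frozenset({"", "s", "ed", "ing"})),
    Paradigm("NOUN", frozenset({"dog", "walk"}), frozenset({"", "s"})),
]

if __name__ == "__main__":
    for w in ("walked", "walks", "dogs"):
        print(w, analyze(w, PARADIGMS))
```

A form such as "walks" receives both a verb and a noun analysis under these toy paradigms, which is the kind of ambiguity a probabilistic transducer is meant to rank.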
Despite the widespread application of machine learning techniques to natural language processing, developing morphological analyzers still involves much human effort. While the morphology of English is very simple, the automatic analysis of texts or speech in the majority of the world's languages depends on the availability of appropriate morphological analyzers. Morphological analysis is also important for automatic information extraction in the biomedical domain, where the complex structure of technical terms must be analyzed, even in English. Such analyzers are useful in most natural language processing applications, including parsing, information retrieval, machine translation, text summarization, correct pronunciation in speech synthesis, language models in speech recognition, language generation, and named entity recognition.