This project is developing a system that simultaneously discovers the patterns of word morphology and the parts of speech for a wide range of the world's languages from unannotated text. Given a quantity of training text, the system will yield a transducer that segments the words in new texts into stems and affixes and determines the part of speech of each word as a whole. Through unsupervised learning, an iterative bootstrapping procedure will combine several different linguistic knowledge sources to gradually build up a representation of the language in the form of paradigms. Symbolic part-of-speech rules and morphophonological rewrite rules will be extracted from these paradigms and compiled into a probabilistic finite-state transducer that can label new texts with morphology and part of speech.
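As a rough illustration of what a paradigm-based representation might look like, the toy Python sketch below groups candidate stems by the set of suffixes they occur with and then uses those groups to segment words. It is not the project's method: the function names `induce_paradigms` and `segment`, the thresholds `min_stem` and `min_stems`, and the suffix-only, non-probabilistic treatment are all assumptions made for this example, and it omits the bootstrapping, part-of-speech, and rewrite-rule components described above.

```python
from collections import defaultdict


def induce_paradigms(words, min_stem=3, min_stems=2):
    """Toy paradigm induction (hypothetical helper, suffixes only).

    Every split of a word into stem + suffix is a candidate. Stems that
    occur with exactly the same set of two or more suffixes are grouped,
    and a group is kept only if at least `min_stems` stems share it.
    """
    suffixes_of = defaultdict(set)
    for word in set(words):
        for i in range(min_stem, len(word) + 1):
            suffixes_of[word[:i]].add(word[i:])  # word[i:] may be empty

    paradigms = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        if len(suffixes) > 1:
            paradigms[frozenset(suffixes)].add(stem)

    return {sfx: stems for sfx, stems in paradigms.items()
            if len(stems) >= min_stems}


def segment(word, paradigms):
    """Split `word` into (stem, suffix) using the induced paradigms.

    Falls back to (word, "") when no known stem and suffix combine
    to form the word.
    """
    for suffixes, stems in paradigms.items():
        for suffix in suffixes:
            stem = word[:len(word) - len(suffix)]
            if word.endswith(suffix) and stem in stems:
                return stem, suffix
    return word, ""


corpus = ["walk", "walks", "walked",
          "jump", "jumps", "jumped",
          "open", "opens", "opened"]
paradigms = induce_paradigms(corpus)
result = segment("jumped", paradigms)
```

On this small corpus the only surviving paradigm pairs the stems walk, jump, and open with the empty suffix, "s", and "ed". On realistic text, exact suffix-set matching also produces spurious groupings (for example, "wal" and "tal" sharing "k", "ks", "ked" if both "walk" and "talk" are present), which is one reason a system of this kind would need to combine several knowledge sources.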

Despite the widespread application of machine learning techniques to natural language processing, developing morphological analyzers still involves much human effort. While the morphology of English is very simple, the automatic analysis of text or speech in the majority of the world's languages depends on the availability of appropriate morphological analyzers. Morphological analysis also matters for automatic information extraction in the biomedical domain, where the complex structure of technical terms must be analyzed even in English. Such analyzers are useful in most natural language processing applications, including parsing, information retrieval, machine translation, text summarization, correct pronunciation in speech synthesis, language models in speech recognition, language generation, and named entity recognition.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0415138
Program Officer: Tatiana D. Korelsky
Budget Start: 2005-06-01
Budget End: 2010-05-31
Fiscal Year: 2004
Total Cost: $450,000
Name: University of Pennsylvania
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104