This project advances learning methods for obtaining linguistic knowledge from raw or nearly raw text; such knowledge constitutes a core component of natural language processing technology but is difficult to obtain, usually relying on expensive manual annotation of text data. Specifically, this project aims to automate some of the mechanical aspects of developing learning algorithms for linguistic structure (in part by using an empirical Bayesian framework to unify considerable past work by the PI and others), to enrich models with stronger linguistic bias (particularly through lexicalization and the integration of morphology and syntax), and to apply these techniques to new natural language processing problems (boilerplate identification and quotation extraction). Another exciting dimension is learning from text collections in multiple languages (not necessarily including translations), which past work has shown can lead to better unsupervised learning. The project will lead to working systems, including generic tools applicable to many problems in natural language processing and machine learning. These tools will provide infrastructure for the PI's courses and will be publicly available to the research community. Research results will be published in leading journals and at major conferences. The project supports one primary graduate student and a postdoctoral researcher. Major impacts of this project will be improvements in the quality of rapidly ported natural language processing tools for new languages and text domains, as well as a deeper scientific understanding of natural language learning by machines.

Project Report

This project advanced the development of algorithms that obtain linguistic knowledge from raw or nearly raw text. Such knowledge constitutes a core component of natural language processing technology but is difficult to obtain, usually relying on expensive manual annotation of text data. Specifically, this project aimed to automate some of the mechanical aspects of developing learning algorithms for linguistic structure, to enrich models with stronger linguistic bias, and to apply these techniques to new natural language processing problems (including automatic question generation, social media analysis, citation analysis, and character analysis) as well as longstanding ones (machine translation, information extraction). Another key dimension was learning from text collections in multiple languages (not necessarily including translations), which past work has shown can lead to better unsupervised learning. The project contributed working systems, including tools applicable to many problems in natural language processing and machine learning, as well as easily usable off-the-shelf components. These tools have been made publicly available to the research community. Research results were published in twenty-two peer-reviewed publications. The project supported the PI's publication of a monograph used in graduate courses, as well as his participation as an instructor at several summer schools. A doctoral thesis on the theoretical aspects of automatic learning of probabilistic grammars was completed under this project. The project supported research activities by graduate students at all levels, undergraduate researchers, and a postdoctoral researcher.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0915187
Program Officer: Tatiana D. Korelsky
Budget Start: 2009-09-01
Budget End: 2013-08-31
Fiscal Year: 2009
Total Cost: $465,318
Name: Carnegie-Mellon University
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213