Supervised Natural Language Processing (NLP) systems perform poorly on domains and vocabulary that differ from training texts. A growing body of empirical and theoretical work points to the features used by traditional NLP systems as the culprit for domain-dependence and for the inability to generalize to previously unseen words.

This project is the first to systematically investigate representation-learning as a technique for improving performance on domain adaptation. It explores latent-variable language models ? including Factorial Hidden Markov Models, dependency parsing models, and deep architectures ? as techniques for extracting novel features from text. The resulting representations yield similar features for distributionally-similar words, thereby allowing generalization to words not seen during training of a classifier. The project also explores novel procedures for training a language model, which incorporate Web-scale ngram statistics as substitutes for standard statistics used in unsupervised training.

Language users are extraordinarily inventive, and new domains of discourse appear constantly, such as in specialized areas of science and technology. By building on top of the representations produced by this project, NLP systems can improve in accuracy on new domains and on Web text, bringing applications like the Semantic Web closer to reality. For resource-poor languages and domains, the project can help reduce the cost of annotating texts by reducing the need for broad coverage in the training texts. By involving the diverse student bodies at Temple University and Philadelphia-area high schools, the project helps to broaden participation in computer science research by underrepresented groups.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
1065397
Program Officer
Tatiana D. Korelsky
Project Start
Project End
Budget Start
2011-04-01
Budget End
2016-03-31
Support Year
Fiscal Year
2010
Total Cost
$705,982
Indirect Cost
Name
Temple University
Department
Type
DUNS #
City
Philadelphia
State
PA
Country
United States
Zip Code
19122