Each human language is a system of conventions for communicating information. But how does anyone come to know such a complex system? Even linguists find it difficult to describe, yet young children somehow figure out the rules and vocabulary of their native language. Adults keep learning as well, whether they are confronted with unfamiliar words, new conventions associated with social media, or the layout conventions of a new website.

This project develops new artificial intelligence methods for learning tasks of this kind. These methods will enable computers to deal with a wider variety of human language data, thus improving information access and global communication. They will also offer insight into how human intelligence succeeds at these problems.

The methods will seek to discern the systematic structure that explains the patterns in naturally occurring linguistic data. Specifically, computers will analyze such data in order to learn:

* How to break down words into meaningful parts and reassemble those parts into new words. This is a subject that linguists call morphophonology. It is practically important in automated analysis and translation of speech and text.

* How to break down sentences into meaningful phrases. This requires determining the basic word order facts of the language -- the problem of grammar induction, considered to be a central mystery of human language learning.

* How to extract machine-readable data from large websites that present databases in human-readable form. This involves automatically figuring out the database structure and layout conventions of a website.

* How to track names across large quantities of informal text. By discovering the principles that govern how people use and modify names, a computer can recognize that the nickname "Vlad P." or the patronymic form "Vladimir Vladimirovich" might be variant ways of referring to "Vladimir Putin," especially in a political comment.

The project will address each of these domains in a principled way. Our strategy in each domain is to develop a novel Bayesian generative model, together with efficient machine learning algorithms for approximate inference in that model. We expect to expand the range of modeling and inference techniques that are available to the natural language processing community.
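As a toy illustration of the generative-model strategy, consider the name-tracking task. The sketch below is not the project's model: it assumes a hypothetical prior over two canonical names and an illustrative noisy-channel likelihood that decays exponentially with edit distance, then applies Bayes' rule exactly. The names `edit_distance`, `posterior`, and `rate`, and the candidate list, are all assumptions made for this example.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def posterior(mention: str, prior: dict, rate: float = 1.0) -> dict:
    """Return P(name | mention), proportional to P(name) * exp(-rate * distance).

    The exponential channel is an assumption chosen for simplicity.
    """
    scores = {
        name: p * math.exp(-rate * edit_distance(name.lower(), mention.lower()))
        for name, p in prior.items()
    }
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


# Hypothetical prior over canonical names (illustrative values only).
prior = {"Vladimir Putin": 0.5, "Vladimir Lenin": 0.5}
post = posterior("Vladmir Puttin", prior)
best = max(post, key=post.get)
```

Here `best` is the canonical name with the highest posterior probability for the misspelled mention. A realistic model would learn the mutation process from data and would have to rely on approximate inference, since the space of candidate names and edit histories is far too large to enumerate as this sketch does.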

Innovative technical directions include the automatic reconstruction of phonological underlying forms, a novel treatment of grammar induction as structured prediction, a nonparametric model of databases and database-backed websites, and a phylogenetic model of name variation.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1423276
Program Officer: Tatiana Korelsky
Project Start:
Project End:
Budget Start: 2014-08-01
Budget End: 2018-07-31
Support Year:
Fiscal Year: 2014
Total Cost: $457,999
Indirect Cost:
Name: Johns Hopkins University
Department:
Type:
DUNS #:
City: Baltimore
State: MD
Country: United States
Zip Code: 21218