Widely-deployed applications of language technology such as translation systems and smart assistants rely heavily on machine learning models for sentence understanding. These models learn to understand language from data, which can often be as simple as a collection of published books or a download of Wikipedia, rather than through any kind of manual engineering or hands-on guidance by linguistic experts. While modern machine learning methods are quite effective, they are not perfect. When they fail to understand some text, it can be difficult to discover why, and even more difficult to craft interventions to address those failures. This CISE Research Initiation Initiative (CRII) project develops tools that bring methods and insights from research in linguistic science to bear on analyzing and refining machine learning systems for sentence understanding. The project should have a practical impact in making it easier to develop effective language technologies, a scientific impact in helping linguists use machine learning as a proxy to study human language learning, and a training impact in supporting several PhD students, through both research seminars and direct research collaborations, as they develop into experts in the interaction between linguistic science and language technology.

The methods used in this project rely on the human ability to judge the grammatical acceptability of a sentence; i.e., to decide whether someone could ever use a given sequence of words to say something. The project has three parts: (1) to build a large acceptability-based dataset for English that evaluates machine learning systems on their linguistic knowledge; (2) to use this data to evaluate widely-used standard approaches to machine learning for language, with a focus on promising recent approaches that use artificial neural networks to learn from plain text; and (3) to develop methods for using small custom datasets to directly repair gaps in the knowledge that these machine learning models acquire. Analyzing and improving artificial neural networks is difficult, since their internal representations of language are continuous and, at least superficially, do not resemble the kinds of representations that linguists use to analyze language. The investigators' methods are designed to minimize this difficulty by relying on converging evidence from multiple ways of using the same data.
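To make the evaluation idea concrete, the following is a minimal illustrative sketch (not the project's own code or dataset) of how a pretrained neural language model can be asked for an acceptability judgment. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, and scores a sentence by summing the log-probability of each word when it is masked out in turn; a model with more grammatical knowledge should assign a higher score to the acceptable sentence of a minimal pair.

```python
# Illustrative sketch only: scoring grammatical acceptability with a
# pretrained masked language model (assumes `transformers` and the
# `bert-base-uncased` checkpoint; not the project's actual methodology).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum log-probabilities of each token when it is masked out in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip the [CLS] and [SEP] special tokens at the ends.
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
    return total

# The higher-scoring sentence is the one the model "judges" more acceptable.
print(pseudo_log_likelihood("The cats sleep on the sofa."))
print(pseudo_log_likelihood("The cats sleeps on the sofa."))
```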

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start:
Project End:
Budget Start: 2019-03-15
Budget End: 2021-02-28
Support Year:
Fiscal Year: 2018
Total Cost: $174,894
Indirect Cost:
Name: New York University
Department:
Type:
DUNS #:
City: New York
State: NY
Country: United States
Zip Code: 10012