Language technology permeates the daily life of all Americans, from spam filters and spellcheckers to Google Translate and digital assistants like Alexa and Siri. But not all language communities benefit equally. To perform well, current language technology requires enormous amounts of data, which are not available for smaller languages and dialects such as Navajo or African American Vernacular English. This requirement stands in stark contrast to the learning needs of children, who need comparatively little data to acquire their native language perfectly. Linguists have attributed the ease of language acquisition to innate learning biases: all languages share universal properties, and a learning algorithm that is aware of these universals requires less data. But linguistic models of language universals are couched in terms that make them hard to incorporate into current learning algorithms. Bridging the gap between linguistic theory and language technology requires an understanding of universals that is both mathematically and computationally grounded. This project pursues such an understanding to inform both language science and language technology.

This project develops a computational model of universals in the domain of sentence structure and how it interacts with the shape of words. It adopts a framework that is built on techniques from theoretical computer science, mathematical logic, and abstract algebra. This foundation makes it possible to characterize universals in a manner that is both rigorous and sufficiently flexible to accommodate the rich diversity of languages. The researcher will generalize this formal machinery from sounds to words and sentences. The project draws from the linguistic literature on universals and feeds back into it by deriving new universals from the computational analysis. The results will allow for new machine learning algorithms that incorporate linguistic universals as a strong learning bias. In addition, this award supports the development of an interactive online learning platform. This resource will enable linguistics students across the country to master the skills they need to study language through a computational lens. Both the research and the education components of this project thus serve the purpose of bridging the gap between language science and language technology.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency
National Science Foundation (NSF)
Institute
Division of Behavioral and Cognitive Sciences (BCS)
Application #
1845344
Program Officer
Tyler Kendall
Project Start
Project End
Budget Start
2019-04-01
Budget End
2024-03-31
Support Year
Fiscal Year
2018
Total Cost
$165,592
Indirect Cost
Name
State University New York Stony Brook
Department
Type
DUNS #
City
Stony Brook
State
NY
Country
United States
Zip Code
11794