Much of our world knowledge (also known as "semantic" knowledge) comes from our direct experiences with the world. For example, people can learn that snow is cold through touch and that sparrows fly through sight. But analyses of language reveal that a large amount of information about the world is contained within the structure of language itself. That is, by tracking which words occur with which other words, it is possible to learn things that seem to require direct experience. Moreover, the co-occurrence of words in different languages seems to reflect somewhat different bodies of knowledge. The project's principal aims are to (1) explore the scope of semantic knowledge embedded in the structure of different languages and (2) understand the extent to which people use this embedded information to learn about the world. Our educational systems depend on the ability to transmit knowledge via language in both its spoken and written forms. Understanding the kinds of semantic knowledge typically learned through language can help reveal the consequences of inequities in language exposure, such as those caused by reading difficulties.
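As a rough illustration of what "tracking which words occur with which other words" can mean in practice, the sketch below counts how often pairs of words share a sentence. It is not the investigators' method: the toy corpus, the sentence-level window, and the function name `cooccurrence_counts` are all assumptions made for this example, and real distributional models use far larger corpora and more refined statistics.

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented purely for illustration (not project data).
corpus = [
    "the snow is cold and white",
    "cold snow fell on the hill",
    "the sparrow can fly",
    "a sparrow will fly south",
]


def cooccurrence_counts(sentences):
    """Count how many sentences each unordered word pair appears in together.

    Hypothetical helper: the co-occurrence window is assumed to be a whole
    sentence, and pairs are stored in alphabetical order.
    """
    counts = Counter()
    for sentence in sentences:
        # Deduplicate and sort so each pair is counted once per sentence
        # and always appears under the same key.
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(corpus)
print(counts[("cold", "snow")])     # "snow" and "cold" share two sentences
print(counts[("fly", "sparrow")])   # "sparrow" and "fly" share two sentences
print(counts[("cold", "sparrow")])  # never seen together, so the count is 0
```

Even in this tiny example, the counts associate snow with coldness and sparrows with flying without any direct experience of either, which is the kind of information the paragraph above describes as embedded in language.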

To understand the relationship between people's knowledge and information embedded in the structure of different languages, the investigators will compile a corpus of semantic features and generic statements from native speakers of eight languages: English, French, German, Dutch, Italian, Spanish, Mandarin Chinese, and Russian. They will then compare this information, generated by people, to information automatically derived from the distributional structure of each of the eight languages. This will allow them to determine whether cross-linguistic differences in people's knowledge are predicted from differences in the information embedded in the different languages. The investigators will estimate the causal impact of language on people's semantic knowledge using a quasi-experimental approach, rather than the more typical correlational analyses. The data will be compiled into a user-friendly, large-scale data resource (all with open source code and data) and integrated with existing multilingual text resources. The multilingual feature norms and ratings of generic statements will be an important resource for artificial intelligence and NLP research and may help identify sources of biases in training sets used in machine learning. This approach brings computational and empirical methods to the study of one of the oldest of human questions: how do we know what we know?

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start:
Project End:
Budget Start: 2020-08-15
Budget End: 2023-07-31
Support Year:
Fiscal Year: 2020
Total Cost: $744,405
Indirect Cost:
Name: University of Wisconsin Madison
Department:
Type:
DUNS #:
City: Madison
State: WI
Country: United States
Zip Code: 53715