Everyday knowledge about the world is a necessary condition for intelligent information processing and reasoning. People can read between the lines in text and see beyond what is visible in images because of everyday functional knowledge about how the world works. The primary goal of this research is to develop learning algorithms that can automatically acquire such knowledge, centered on entities and events, from large-scale multimodal web data. Entity knowledge includes a broad range of physical and conceptual knowledge about objects and people, including their attributes, their relative differences, and the logical relations among them. Event knowledge focuses on structural knowledge about everyday events in people's lives, organized through hierarchical and temporal relations among sub-events and event participants. Together, the resulting knowledge will be a critical step toward robust AI systems at the intersection of natural language processing and computer vision that can understand and reason about unstructured multimodal information. Potential applications of this research include interactive assistive systems for the visually impaired and multimodal educational interfaces.

This project investigates multimodal knowledge extraction as a new research paradigm, drawing connections between methods in natural language processing, such as information extraction, textual entailment, and frame semantics, and recent advances in computer vision. One of the critical challenges in commonsense knowledge acquisition is overcoming reporting bias: people do not state the obvious. This project therefore develops new learning algorithms based on graph-based collective inference that can reason about unspoken knowledge that systematically influences the way people describe the world in language, images, and videos. In addition, the project develops new models for visual semantic parsing and event recognition, which generalize existing work on activity recognition by specifying the structural components of events, such as actors, objects, locations, tools, intents, and goals. The learned knowledge and representations will be validated through several applications, including multimodal question answering and grounded language understanding.
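To make these two ideas concrete, the sketch below pairs a hypothetical event-frame structure (with fields chosen to mirror the components named above) with a toy graph-based collective inference step, implemented here as label propagation over a concept co-occurrence graph. All names, fields, and the propagation scheme are illustrative assumptions based on the abstract's description, not the project's actual implementation.

```python
import numpy as np
from dataclasses import dataclass, field

# Hypothetical event frame: fields mirror the structural components
# named in the abstract (actors, objects, locations, tools, intents, goals).
@dataclass
class EventFrame:
    action: str
    actors: list = field(default_factory=list)
    objects: list = field(default_factory=list)
    locations: list = field(default_factory=list)
    tools: list = field(default_factory=list)
    intent: str = ""
    goal: str = ""

def propagate(adjacency: np.ndarray, seeds: np.ndarray,
              alpha: float = 0.85, iters: int = 50) -> np.ndarray:
    """Toy collective inference via label propagation: plausibility
    scores for assertions that are rarely stated outright (reporting
    bias) are smoothed over a graph of related concepts, so unspoken
    knowledge is inferred from what IS written about its neighbors."""
    # Row-normalize so each node averages over its neighbors
    # (assumes every node has at least one edge).
    transition = adjacency / adjacency.sum(axis=1, keepdims=True)
    scores = seeds.copy()
    for _ in range(iters):
        scores = alpha * (transition @ scores) + (1 - alpha) * seeds
    return scores

# An event frame for a mundane activity, as event recognition might emit:
frame = EventFrame(action="slice", actors=["person"], objects=["bread"],
                   tools=["knife"], locations=["kitchen"],
                   intent="prepare food", goal="make a sandwich")

# Tiny example: "eggs are fragile" is rarely stated explicitly, but a
# fragility score propagates from related, more frequently described concepts.
concepts = ["egg", "glass", "drop", "break"]
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
seeds = np.array([0.0, 0.9, 0.0, 0.8])   # observed "fragility" evidence
print(propagate(A, seeds).round(3))       # "egg" acquires a nonzero score
```

The choice of label propagation is only one instance of graph-based collective inference; the point of the sketch is that scores flow along edges, so a node with no direct textual evidence can still receive support from its neighbors.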

Budget Start: 2017-08-01
Budget End: 2021-07-31
Fiscal Year: 2017
Total Cost: $700,000
Name: University of Washington
City: Seattle
State: WA
Country: United States
Zip Code: 98195