A fundamental capability of human intelligence is the ability to learn to act by watching instructional videos. This capability is reflected in the abstraction and summarization of instructional procedures, as well as in answering questions such as why and how something happened in the video. This project aims to build computational models that perform well on these tasks, which require, beyond the conventional recognition of objects, actions, and attributes in the scene, higher-order inference about the relations among them. Here, higher-order inference refers to inference that cannot be drawn immediately from direct observation and thus requires stronger semantics. The developed technology will enable applications in other fields, e.g., multimedia (video indexing and retrieval), robotics (reasoning about why and how questions), and healthcare (assistive devices for visually impaired people). In addition, the project will contribute to education and diversity by involving underrepresented groups in research activities, integrating research results into the teaching curriculum, and conducting outreach activities in local K-12 communities.

The research will develop a framework for higher-order inference in understanding web instructional videos, such that models devised within it are capable not only of discovering and captioning the procedures that constitute an instructional event but also of answering questions such as why and how something happened. The framework is built on a video story graph that models both dynamics (the composition of actions at different temporal scales) and evolution (the changes in object states and attributes), and it supports higher-order inference over deep learning units and the incorporation of external knowledge graphs in a unified framework. Methodologies for extracting such video story graphs and for using them to discover and caption procedures and to answer questions will be explored. Expected outcomes of this project include: a software package for constructing video story graphs, performing inference on them, and incorporating external knowledge; a web-deployed system that processes user-uploaded instructional videos; and a large video dataset with procedure and question-answering annotations.
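
The abstract does not specify a concrete data structure, but the video story graph idea can be illustrated with a minimal sketch: nodes represent actions (at some temporal scale) or object states, directed edges carry relations such as temporal order or causality, and a why-question can be approximated by tracing causal edges backward. All class, field, and relation names below are illustrative assumptions, not the project's actual design.

    # A minimal, illustrative sketch of a "video story graph": nodes hold
    # actions (at a given temporal scale) or object states, and directed
    # edges carry relations such as temporal order or causality.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        node_id: str
        kind: str        # "action" or "object_state"
        label: str       # e.g., "beat egg" or "egg: raw -> beaten"
        start: float     # segment start time (seconds)
        end: float       # segment end time (seconds)
        scale: int = 0   # temporal scale (0 = finest composition level)

    @dataclass
    class StoryGraph:
        nodes: dict = field(default_factory=dict)   # node_id -> Node
        edges: dict = field(default_factory=dict)   # (src, dst) -> relation

        def add_node(self, node: Node) -> None:
            self.nodes[node.node_id] = node

        def add_edge(self, src: str, dst: str, relation: str) -> None:
            self.edges[(src, dst)] = relation

        def causes_of(self, node_id: str) -> list:
            """Answer a toy 'why' question by listing the causal
            antecedents of a node (nodes linked by 'causes' edges)."""
            return [self.nodes[s].label
                    for (s, d), rel in self.edges.items()
                    if d == node_id and rel == "causes"]

    # Toy usage on a cooking video: the beating action causes the
    # egg's state change, so it answers "why is the egg beaten?".
    g = StoryGraph()
    g.add_node(Node("a1", "action", "beat egg", 12.0, 18.5))
    g.add_node(Node("s1", "object_state", "egg: raw -> beaten", 18.5, 18.5))
    g.add_edge("a1", "s1", "causes")
    print(g.causes_of("s1"))   # prints: ['beat egg']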

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Budget Start: 2018-09-01
Budget End: 2021-08-31
Fiscal Year: 2018
Total Cost: $465,990
Name: University of Rochester
City: Rochester
State: NY
Country: United States
Zip Code: 14627