Vision and language provide fundamental means to interpret, learn, and communicate about the world around us. A primary goal of computer vision and natural language processing research is therefore to automatically uncover and analyze the information that images and video, or text and speech, convey about the world. Both communities are concerned with tasks that require increasingly deep understanding, including the ability to reason with and draw inferences from this information. Since vision and language are complementary modalities, there is now also an increasing amount of work at the interface of the two fields. However, progress in multimodal analysis requires tighter collaboration between the two communities, since each currently relies on its own set of techniques, datasets, and evaluation criteria.

This community planning grant explores the need for, and the feasibility and usefulness of, a "visual entailment" corpus and an associated visual entailment recognition task. In natural language, entailment recognition is the problem of determining whether a particular statement can be inferred from a text document. This project explores a novel related problem, visual entailment, in which the goal is to determine whether a statement in natural language can be inferred from an image or video. The outcomes of the project include a novel dataset and a prototype research challenge, as well as increased collaboration between the vision and language communities.

Project Start:
Project End:
Budget Start: 2012-07-01
Budget End: 2014-06-30
Support Year:
Fiscal Year: 2012
Total Cost: $57,190
Indirect Cost:
Name: State University New York Stony Brook
Department:
Type:
DUNS #:
City: Stony Brook
State: NY
Country: United States
Zip Code: 11794