This project supports travel expenses for participants at the workshop on advances in language and vision. In the past few years, the fields of language and computer vision have made great progress in developing technologies for extracting semantic content from text and imagery, respectively. Each field seeks to adapt methods from the other, but often looks to the past literature rather than the current state of the art. This workshop makes significant scientific progress in multimodal representations and methods by bringing together the top researchers in both fields. The well-organized brainstorming and discussion sessions contribute new ideas to this emerging area. The outcome of the workshop provides guidelines for targeted research in this interdisciplinary area, including anticipated fundamental scientific advances, possible large-scale challenge problems, the needs and prospects for available datasets, and connections to significant applications and their associated long-term economic impact and other societal benefits.
The fields of language and vision have separately made great progress in recent years, developing automated extraction of semantic content from text and imagery, respectively. The states of the art in these fields are rapidly encroaching on each other: language is increasingly focused on how to "ground" meaning in physical observations, and vision is exploiting ontological structures and trying to "tell a story" from an image, relating objects, activities, people, and scenes. These potentially transformative open research questions indicate that the time is ripe to start understanding and constructing deeper connections between imagery and its associated language. In particular, it is time for these two fields to start communicating and collaborating in a richer manner. Two workshops, supported in part by this NSF grant, were held to explore future research directions in language and vision: one at NSF on May 17-18, 2011, and one at the NIPS conference on December 16, 2011. Participants observed that considerable progress has been made in each field by adopting relatively shallow models from the other field, and that this was likely to continue to bear fruit in various near-term internet search-style applications. However, transformative progress requires a significant advance towards a symmetric, integrated approach with corresponding "deep" semantic representations. Recommendations for specific research themes along this line included:
• "understanding understanding": going beyond recognizing individual image elements to grasp the underlying meaning of a picture. This will involve moving beyond recognition as a list of objects present (i.e., what is in the picture) to determine what information is useful to extract from an image, possibly including predictions of relationships, appearance characteristics, and what is happening or what will happen in the underlying world captured in a photo or video.
• "visual entailment": algorithms to verify whether statements can be inferred from images or videos. Similar to the first goal, this will require a deeper understanding of an image or video and will be necessary for interaction and communication with human consumers of computer vision. This may be framed as a visual Jeopardy problem.
• "seeing between the lines": collecting basic, mundane facts about the world to improve image and language understanding. Current research has moved away from collecting large quantities of world knowledge, but simple knowledge about specific environments, or about the world in general, may be necessary for building effective visual systems that can operate in the real world.
• a "semantic survival kit": determining what to recognize first. In order to build a universal vision system, we need to determine a set of basic recognition units and a priority for which things are most important to be able to recognize. Some suggestions from the workshops included the first words learned in a second language, vocabulary words used by young children, and words explained in "how things work" resources.
For more details, see the workshop materials available on the web at the URL: https://sites.google.com/site/languagevisionworkshop/