Understanding the scenes around us, i.e., recognizing objects, human actions and events, and inferring their spatial, temporal, correlative and causal relations, is a fundamental capability of human intelligence. Similarly, designing computer algorithms that can understand scenes is a fundamental problem in artificial intelligence. Humans consciously or unconsciously use all five senses (vision, audition, taste, smell, and touch) to understand a scene, as different senses provide complementary information. For example, a movie watched with the sound muted is very difficult to understand, and walking on a street with eyes closed and without other guidance can be dangerous. Existing machine scene understanding algorithms, however, are designed to rely on a single modality. Take the two most commonly used senses, vision and audition, as an example: there are scene understanding algorithms designed for each modality in isolation, but no systematic investigations have been conducted to integrate the two modalities towards more comprehensive audio-visual scene understanding. Designing algorithms that jointly model the audio and visual modalities towards complete audio-visual scene understanding is important, not only because this is how humans understand scenes, but also because it will enable novel applications in many fields. These fields include multimedia (video indexing and scene editing), healthcare (assistive devices for visually and aurally impaired people), surveillance and security (comprehensive monitoring of suspicious activities), and virtual and augmented reality (generation and alteration of visuals and/or soundtracks). In addition, the investigators will involve graduate and undergraduate students in the research activities, integrate the research results into the teaching curriculum, and conduct outreach activities to local schools and communities with the aim of broadening participation in computer science.
This project aims to achieve human-like audio-visual scene understanding that overcomes the limitations of single-modality approaches through big data analysis of Internet videos. The core idea is to learn to parse a scene into elements and to infer their relations, i.e., to form an audio-visual scene graph. An element of the audio-visual scene can be a joint audio-visual component of an event when the event shows correlated audio and visual features, or an audio-only or visual-only component when the event appears in just one modality. The relations between the elements include spatial and temporal relations at a lower level, as well as correlative and causal relations at a higher level. Through this scene graph, information across the two modalities can be extracted, exchanged and interpreted. The investigators propose three main research thrusts: (1) learning joint audio-visual representations of scene elements; (2) learning a scene graph to organize scene elements; and (3) cross-modality scene completion. Each of the three research thrusts explores a dimension in the space of audio-visual scene understanding, yet they are also interconnected. For example, the audio-visual scene elements serve as nodes of the scene graph, and the scene graph, in turn, provides structured information that guides the learning of the scene elements and their relations; cross-modality scene completion fills in missing data in the scene graph and is necessary for a good audio-visual understanding of the scene. Expected outcomes of this project include: a software package for learning joint audio-visual representations of various scene elements; a web-deployed system for audio-visual scene understanding that utilizes the learned scene elements and scene graphs, illustrated with text generation; a software package for cross-modality scene completion based on scene understanding; and a large-scale video dataset with annotations for audio-visual association, text generation and scene completion. Datasets, software and demos will be hosted on the project website.
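To make the scene-graph formulation concrete, the minimal sketch below shows one possible in-memory representation of an audio-visual scene graph, with nodes typed by modality (joint audio-visual, audio-only, or visual-only) and edges typed by the four relation categories named above. All class, field, and label names here are hypothetical choices made for illustration; they are not part of the proposed system or any existing library.

```python
# Illustrative sketch only: a minimal audio-visual scene graph data structure.
# Names (Modality, RelationType, SceneElement, etc.) are hypothetical.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Modality(Enum):
    AUDIO_VISUAL = "audio_visual"  # event has correlated audio and visual features
    AUDIO_ONLY = "audio_only"      # event appears only in the audio track
    VISUAL_ONLY = "visual_only"    # event appears only in the visual frames


class RelationType(Enum):
    SPATIAL = "spatial"            # lower-level relation, e.g., "left of"
    TEMPORAL = "temporal"          # lower-level relation, e.g., "before"
    CORRELATIVE = "correlative"    # higher-level relation: events co-occur
    CAUSAL = "causal"              # higher-level relation: one event causes another


@dataclass
class SceneElement:
    """A node of the scene graph: one audio, visual, or joint audio-visual event."""
    element_id: str
    label: str                                            # e.g., "car passing"
    modality: Modality
    embedding: List[float] = field(default_factory=list)  # joint representation (thrust 1)


@dataclass
class Relation:
    """A directed, typed edge between two scene elements."""
    source_id: str
    target_id: str
    relation: RelationType
    description: str = ""


@dataclass
class AudioVisualSceneGraph:
    """Scene elements plus their spatial/temporal/correlative/causal relations (thrust 2)."""
    elements: List[SceneElement] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def single_modality_elements(self) -> List[SceneElement]:
        """Elements observed in only one modality: candidates for cross-modality completion (thrust 3)."""
        return [e for e in self.elements if e.modality is not Modality.AUDIO_VISUAL]


# Example: a street scene with a visible, audible car and an off-screen siren.
graph = AudioVisualSceneGraph(
    elements=[
        SceneElement("e1", "car passing", Modality.AUDIO_VISUAL),
        SceneElement("e2", "siren", Modality.AUDIO_ONLY),
    ],
    relations=[
        Relation("e2", "e1", RelationType.TEMPORAL, "siren heard after the car passes"),
    ],
)
print([e.label for e in graph.single_modality_elements()])  # ['siren']
```

In this toy example, the off-screen siren is an audio-only node, so `single_modality_elements` flags it as a candidate for cross-modality scene completion, which would generate the missing visual counterpart and attach it to the graph.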