Due to advances in computer vision and audio processing, it is now possible to automatically annotate large video collections with basic information about their audiovisual contents (e.g., people, places, objects, audio transcripts). However, it remains difficult to carry out higher-level analytics tasks on video productively because of the challenge of defining complex, higher-level events of interest. In response, this project seeks to enable more sophisticated, higher-productivity video analysis through the design of a programming system for composing basic video annotations into higher-level patterns and events of interest in a video. Queries authored in the proposed system can serve as a direct specification of video events of interest, or as a mechanism for automatically generating data labels that provide supervision for subsequent model training. The proposed system will be applicable to many video domains; however, the project will feature a collaboration with journalists and news media personnel to conduct an audiovisual analysis that measures diversity and representation in nearly a decade of American cable TV news broadcasts (over 200,000 hours since 2010). Specifically, the project will create software tools for answering questions such as: Which individuals appear most often on the news? In what contexts (e.g., in interviews or on panels)? What topics and stories do particular individuals cover? In addition to disseminating the results of these analyses, the project will produce interactive web-based tools that will enable students and the public to perform their own diversity analyses of the contents of cable TV news.

The primary technical challenge of the project involves the design of a new video analysis system for defining spatio-temporal patterns and events of interest in video. Inspired by early multimedia database query systems, the system will support multi-modal video analyses by representing all video annotations (whether derived from pixels, audio, or transcripts) as continuous space-time volumes in a video. Users will define complex patterns via queries that compose (via spatio-temporal relations) and manipulate collections of simpler space-time annotations. The compositional nature of these queries will allow them to execute rapidly on large video collections, enabling analysts to iteratively conceptualize, prototype, and specify novel high-level patterns in videos. To reduce the cost of annotating large video collections, the project will also exploit the long-running nature of TV and film video streams to train low-cost models that are specific to a show's or film's video content. The project will investigate the use of model distillation (in a continuous, online setting) to train face and object detection models that maintain high accuracy on a video stream at an order of magnitude lower runtime cost than existing methods. All systems developed as part of the project will be distributed to the public as open-source software, and the project will host hackathons to educate students and the broader community about their use.
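To make the compositional query idea concrete, the minimal Python sketch below shows one way basic annotations could be represented as labeled space-time volumes and then joined via a temporal-overlap relation to define a higher-level "interview segment" event. All names here (SpaceTimeVolume, t_overlaps, t_intersect, join) and the example values are hypothetical illustrations, not the project's actual API; the sketch assumes that off-the-shelf face detection and transcript alignment have already produced the basic annotations.

from dataclasses import dataclass
from typing import List

# A hypothetical space-time annotation: a labeled 2D box that spans a time
# interval within a single video. Times are in seconds; spatial coordinates
# are normalized to [0, 1].
@dataclass
class SpaceTimeVolume:
    t1: float
    t2: float
    x1: float = 0.0
    x2: float = 1.0
    y1: float = 0.0
    y2: float = 1.0
    label: str = ""

def t_overlaps(a: SpaceTimeVolume, b: SpaceTimeVolume) -> bool:
    """Temporal relation: the two volumes share some span of time."""
    return a.t1 < b.t2 and b.t1 < a.t2

def t_intersect(a: SpaceTimeVolume, b: SpaceTimeVolume, label: str) -> SpaceTimeVolume:
    """Compose two overlapping volumes into one covering their common time span."""
    return SpaceTimeVolume(max(a.t1, b.t1), min(a.t2, b.t2), label=label)

def join(xs: List[SpaceTimeVolume], ys: List[SpaceTimeVolume], label: str) -> List[SpaceTimeVolume]:
    """Pair up temporally overlapping annotations from two collections."""
    return [t_intersect(x, y, label) for x in xs for y in ys if t_overlaps(x, y)]

# Basic annotations, e.g., produced by face detection and transcript
# alignment (the values are illustrative only).
host_faces = [SpaceTimeVolume(10, 95, 0.1, 0.45, 0.2, 0.8, "host")]
guest_faces = [SpaceTimeVolume(40, 90, 0.55, 0.9, 0.2, 0.8, "guest")]
interview_words = [SpaceTimeVolume(42, 44, label='"joining us now"')]

# Higher-level event: a host and a guest on screen together while an
# interview cue appears in the transcript.
on_screen_together = join(host_faces, guest_faces, "host+guest")
interviews = join(on_screen_together, interview_words, "interview segment")

for seg in interviews:
    print(f"{seg.label}: {seg.t1:.0f}s - {seg.t2:.0f}s")

Because a join of this kind manipulates lightweight interval records rather than raw pixels, queries composed in this style could plausibly be re-run interactively over large collections, which is the basis for the rapid-iteration claim above.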

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1908727
Program Officer: Sylvia Spengler
Budget Start: 2019-09-01
Budget End: 2022-08-31
Fiscal Year: 2019
Total Cost: $500,000
Name: Stanford University
City: Stanford
State: CA
Country: United States
Zip Code: 94305