The bombing attacks at the Boston Marathon in April 2013 presented the law enforcement community with significant challenges in terms of the volume and variety of video and still images acquired in the course of the investigation. Tens of thousands of individual media files in multiple formats were submitted from a variety of sources, including broadcast television feeds, private Closed-Circuit Television (CCTV) systems, mobile-device photographs and videos recovered from the scene, and photographs and videos submitted by the public. Teams of analysts reviewed this evidence using mostly manual processes to determine the sequence of events before and after the bombing, ultimately leading to a quick resolution of the case. In the aftermath, it has become evident that the proliferation of image and video recording capability in fixed and mobile devices makes it inevitable that a similar situation will occur in future events. As a result, it is incumbent upon the law enforcement community and the U.S. Government at large to further explore the use of automated approaches, available today or in the coming years, to better organize and analyze such large volumes of multimedia data. The findings of this workshop will help define the future research agenda.

Searching for actionable intelligence in unconstrained images and videos remains an unsolved problem. Solving it involves addressing many sub-problems, including video summarization, shot/scene-change detection, geo-tagging, robust face recognition, human action recognition, semantic description, image recognition, and the design of human-in-the-loop systems. Issues such as data collection and performance evaluation must also be addressed. Given that several hundred videos and a large collection of still images may be available for analysis, there is a great need for robust computer vision techniques. While many existing computer vision algorithms perform reasonably well under constrained acquisition conditions, their performance on unconstrained images and videos is less than satisfactory. This workshop addresses precisely the challenges that arise in analyzing large, unstructured collections of images and videos. It explores the state of the art in algorithms being developed in academia that can support forensic analysis and identification in large volumes of multimedia, and it informs long- and near-term research and development efforts aimed at optimally addressing this situation in the future. The workshop identifies those video and image analysis problems which are: (1) considered solved (i.e., ready to deploy in specific operational scenarios); (2) nearly solved (i.e., could lead to solutions with one to three years of development); and (3) over-the-horizon problems (i.e., challenges requiring concerted effort over the next 3-5 years and beyond).
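Of the sub-problems listed above, shot/scene-change detection is among the most mature. The following is a minimal sketch of a classical histogram-difference shot detector, assuming OpenCV as the tooling and an illustrative distance threshold; neither is prescribed by the workshop.

```python
# Minimal sketch of histogram-difference shot-boundary detection.
# OpenCV and the threshold value are illustrative assumptions only.
import cv2

def detect_shot_boundaries(video_path, threshold=0.5):
    """Return frame indices where the color histogram changes sharply."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Coarse HSV color histogram, normalized for comparison.
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Bhattacharyya distance: near 0 for similar frames, near 1 at cuts.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```

Detectors of this kind work well on clean broadcast footage; the unconstrained, jittery, low-light material discussed in this report is exactly where such simple statistics break down.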

Project Report

This report summarizes the workshop's findings at a level suitable for public release. Based on discussions at the workshop and subsequent conversations, the attendees arrived at lists of problems that are seen as solved and problems that require near-term and long-term investments. We summarize the conclusions reached by the attendees as to where future research investments must be made in this area.

Problems that Need Long-Term Investments

Video Summarization: Video summaries generated in response to a user-specified set of rules that can be computationally interpreted and translated into image/video operations are desired. Summaries must be robust to poor spatial resolution of objects of interest, noise and jitter, and poor illumination conditions. Summaries drawn from multiple viewpoints, perhaps even city-wide camera networks, may be needed to generate a complete picture of what is being imaged.

Visual Analysis and Geo-localization of Large-Scale Imagery: Algorithms and systems must be developed that (semi-)automatically determine the level of precision achievable for a given geo-location problem and then apply the appropriate methods to reach one of three precision regimes: visual element location, region location, or pinpoint location.

Image-based Biometrics: In the area of face recognition, long-term investment is needed for recognizing an individual seen under challenging viewing conditions, such as in extremely low-resolution images; recognizing a person from an extreme viewpoint, such as a profile, when only a frontal view is present in the gallery; and face recognition across aging. Sublinear methods for searching over large databases using descriptive features such as attributes must be developed (a sketch of one such method appears after the Person Re-Identification discussion below).

Human in the Loop (HIL): Humans can quickly transfer knowledge from one task to the next, from one modality to another, and so on. Current HIL systems, however, provide very little support for this type of functionality. This problem is present in all areas of HIL, from vision systems that cannot transfer information across camera views, to interfaces that cannot transfer information across user sessions, to perceptual models that cannot learn the commonalities and intricacies of different users. While there has been some progress on all of these problems, both the theoretical and algorithmic foundations are still in their infancy. It is also only now that enough computation and storage are becoming available for researchers to address problems such as dataset bias, i.e., how well an algorithm learned under certain training conditions generalizes to others.

Person Re-Identification: Full real-world scenarios, low-quality images, and unconstrained, uncooperative conditions are challenges that will need a longer time horizon. This will involve dealing with natural videos with high clutter and severe variations in environmental conditions. The main task, robust feature extraction, remains the key and will need to be achieved under far more challenging conditions; this may call for the development of novel features. A larger use of context, such as the joint re-identification of groups and individuals, can be expected to help, and semantically meaningful attributes could play a role in providing the required robustness.
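To illustrate the sublinear attribute-search need raised under Image-based Biometrics, the sketch below indexes attribute vectors with random-hyperplane locality-sensitive hashing so that, on average, a query scans only one bucket rather than the whole gallery. The dimensionality, bit count, and class name are illustrative assumptions, not values discussed at the workshop.

```python
# Minimal sketch of sublinear attribute search via random-hyperplane LSH.
# Dimensions, bit counts, and names are illustrative assumptions only.
import numpy as np

class AttributeLSH:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        # Each row is a random hyperplane; the sign pattern of the
        # projections gives the hash key for a vector.
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = {}

    def _key(self, vec):
        return tuple((self.planes @ vec > 0).astype(int))

    def add(self, item_id, vec):
        self.buckets.setdefault(self._key(vec), []).append((item_id, vec))

    def query(self, vec):
        # Only the matching bucket is scanned, so expected lookup cost
        # does not grow linearly with the database size.
        candidates = self.buckets.get(self._key(vec), [])
        def cosine(item):
            _, v = item
            return float(vec @ v) / (np.linalg.norm(vec) * np.linalg.norm(v) + 1e-9)
        return sorted(candidates, key=cosine, reverse=True)

# Usage: index 100,000 synthetic 40-dimensional attribute vectors, query one.
db = AttributeLSH(dim=40)
rng = np.random.default_rng(1)
for i in range(100_000):
    db.add(i, rng.standard_normal(40))
print(db.query(rng.standard_normal(40))[:3])
```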
Human Activity Understanding (Detection and Recognition) in Video: The ultimate goal of activity and action understanding is to provide explanations and descriptions of an action or event captured in a video. This kind of analysis is very important for end users, e.g., video analysts. To understand and explain a video, it is important to have a rich representation of each event, action, or activity in terms of objects, actions, and scenes, which can be used to describe the event in natural language. Investments are needed in efforts that aim to bridge the gap between the semantics required by high-level descriptions and what can be extracted from lower-level detectors. The long-term challenge for computer vision researchers is to develop approaches to human activity and action understanding that do not require any training, generalize to diverse datasets, and provide explanation and recounting of actions and activities in a video.

Large-scale Visual Recognition: Investments in the problems listed below are recommended. Robust methods are needed for 1000-category object classification and localization for image-retrieval purposes. Methods are required for 100-category human action and activity recognition in conjunction with human pose estimation and concurrent action recognition (multiple actions in parallel). At the instance level, the long-standing challenges of face recognition and of human identification and re-identification from video under reasonable conditions must be solved.
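For concreteness, 1000-category object classification of the kind recommended above is today typically approached with models pretrained on ImageNet's 1000 classes. The sketch below uses a pretrained ResNet-50 from torchvision; the choice of library and model is an assumption for illustration, not a tool named in this report.

```python
# Minimal sketch of 1000-category image classification with a pretrained
# ImageNet model; torchvision/ResNet-50 are assumed tools, not ones
# prescribed by the workshop.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

def classify(path, top_k=5):
    """Return the top-k (class_index, probability) pairs for one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(img), dim=1)[0]
    top = torch.topk(probs, top_k)
    return list(zip(top.indices.tolist(), top.values.tolist()))
```

Such models perform well on curated benchmark imagery; closing the gap on the low-quality, unconstrained footage central to forensic analysis is the long-term investment this report recommends.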

Project Start:
Project End:
Budget Start: 2014-01-15
Budget End: 2014-12-31
Support Year:
Fiscal Year: 2014
Total Cost: $55,850
Indirect Cost:
Name: University of Maryland College Park
Department:
Type:
DUNS #:
City: College Park
State: MD
Country: United States
Zip Code: 20742