The video analysis community has long attempted to bridge the gap between low-level feature extraction and semantic understanding and retrieval. One important barrier impeding progress is the lack of infrastructure needed to construct useful semantic concept ontologies, build modules that extract features from video, interpret the semantics of what the video contains, and evaluate these tasks against benchmark ground-truth data. To address this fundamental problem, this project will create a shared community resource comprising large video collections, extracted features, video segmentation tools, scalable semantic concept lexicons with annotations, ontologies relating the concepts to each other, annotation tools, learned models and complete software modules for automatically describing video through concepts, and finally a benchmark set of user queries for video retrieval evaluation.

The resource will allow researchers to build their own image/video classifiers, test new low-level features, expand the concept ontology, and explore higher-level search services without having to redevelop several person-years' worth of infrastructure. Using this tool suite and reference implementation, researchers can quickly customize concept ontologies and classifiers for diverse subdomains.

The contribution of the proposed work lies in the development of a large number of critical research resources for digital video analysis and search. The modular architecture of the proposed resources provides great flexibility in adding new ontologies and testing new analytics components developed by other researchers in different domains. The use of large, diverse, standardized video datasets and well-defined benchmark procedures ensures a rigorous process for assessing scientific progress.

The results will facilitate rapid exploration of new ideas and solutions, contributing to advancements of major societal interest, such as next-generation media search and security.

URL: www.informedia.cs.cmu.edu/analyticsLibrary

Project Report

This project has created a suite of tools for systematic development and rapid deployment of video analysis, video annotation, and model learning in different domains. These tools provide data, extracted features, learned concept models and automatic detectors, and an evaluation paradigm as a public resource. Together they supply many of the essential building blocks for multimedia applications such as video search systems. Additional tools and datasets with annotations have been made available on the website LIBSCOM.ORG, including both data and running code.

Part of the project's dissemination and outreach has been the organization of workshops on large-scale media analysis, bringing together researchers and disseminating our tools and results. The first such workshop took place at ACM Multimedia 2009 and provided a forum to understand key factors in research on very large multimedia datasets, such as the construction of datasets, creation of ground truth, and the sharing and extension of existing resources, features, algorithms, and tools. The project organized the next workshop on large-scale multimedia collections at the International Conference on Pattern Recognition (ICPR 2010), and most recently organized another very successful workshop on very large-scale multimedia collections and evaluations at ACM Multimedia 2010. This was followed by a course at the 2011 Summer School in Vision, Learning and Pattern Recognition (VLPR2011), which described active ongoing work in semantic concept detection for video data and used our tools as examples.

Key to this effort has been an ontology of visual concepts that can be automatically observed in video. The Large Scale Concept Ontology for Multimedia (LSCOM) contains associated annotated data over several core video datasets. LSCOM has been a collaborative effort to develop a large, standardized taxonomy for describing video. Its concepts were selected to be relevant for describing core content aspects of video, feasible for automatic detection with some level of accuracy, and useful for video retrieval. LSCOM additionally connects all of its concepts into a full ontology with hierarchical relations. The full LSCOM set contains over 2,600 concepts, 449 of which have been fully annotated over the TRECVID 2005 collection. This is now available through the LIBSCOM.org website, along with an extended set of concepts for additional YouTube data.

Based on local keypoints (e.g., the Scale-Invariant Feature Transform, SIFT) extracted at salient image patches, an image can be described as a bag-of-visual-words (BoVW). The project conducted comprehensive studies of the representation choices in BoVW, including vocabulary size, weighting scheme, stop-word removal, feature selection, spatial information, and visual bi-grams. This offered practical insight into how to optimize BoVW performance through appropriate representation choices; a sketch of one such choice, soft weighting, appears below. Experiments showed that a soft-weighting scheme outperforms other popular weighting schemes such as TF-IDF by a large margin. Extensive experiments on TRECVID datasets also indicate that the BoVW feature alone, with appropriate representation choices, already produces state-of-the-art concept detection performance. Based on these empirical findings, the method was applied to detect a large set of 374 semantic concepts.
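To make the soft-weighting idea concrete, the following minimal sketch builds a BoVW histogram in which each local descriptor votes for its few nearest visual words with a rank-decayed weight, instead of being hard-assigned to a single word. The number of neighbors (top_k) and the 1/2^rank decay are illustrative assumptions, not the exact parameters used in the project.

    import numpy as np

    def soft_weight_bovw(descriptors, vocabulary, top_k=4):
        """Build a soft-weighted bag-of-visual-words histogram.

        descriptors: (n_keypoints, d) array of local features (e.g., SIFT).
        vocabulary:  (n_words, d) array of visual-word centroids.
        """
        n_words = vocabulary.shape[0]
        hist = np.zeros(n_words)
        # Euclidean distances between every keypoint and every visual word.
        dists = np.linalg.norm(
            descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
        # For each keypoint, the indices of its top_k nearest words, by rank.
        nearest = np.argsort(dists, axis=1)[:, :top_k]
        for ranks in nearest:
            for rank, word in enumerate(ranks):
                hist[word] += 1.0 / (2 ** rank)  # rank-decayed vote
        # L1-normalize so images with different keypoint counts are comparable.
        total = hist.sum()
        return hist / total if total > 0 else hist

The resulting histogram can then be fed to any standard classifier (e.g., an SVM) trained per concept, which is how BoVW features are typically used for concept detection.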
The detectors, as well as the features and detection scores on several recent benchmark datasets, were released to the multimedia community. The project also developed and shared code for a robust new approach that extracts semantic concept information by explicitly encoding static image appearance features together with motion information. For high-level semantic concept detection in broadcast video, multimodal classifiers were trained that combine traditional static image features with a new motion feature analysis method (MoSIFT). Experimental results show that the combined features perform well on a variety of motion-related concepts and provide a large improvement over static image analysis features alone.

Because videos from different domains (e.g., news, documentaries, entertainment) have distinctive data distributions, cross-domain video concept detection becomes important: labeled data from one domain should be reusable to benefit classification in another domain with insufficient labeled data. The project developed a cross-domain active learning method that iteratively queries labels for the most informative samples in the target domain. In this work, the base learner is a Gaussian random field model, which has the advantage of exploring the distributions in both the source and target domains, and uncertainty sampling is the query strategy, as sketched below. In addition, an instance weighting method was created to accelerate the adaptability of the base learner, and an efficient model-updating method was developed that significantly sped up the active learning process.
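As an illustration of the uncertainty-sampling query loop described above, the sketch below repeatedly asks for labels on the target-domain samples the current model is least sure about. A logistic regression stands in for the project's Gaussian random field base learner, and the function name, query budget, and oracle interface are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def uncertainty_sampling_loop(X_source, y_source, X_target,
                                  y_target_oracle, n_queries=20):
        """Sketch of cross-domain active learning via uncertainty sampling.

        Source-domain labels seed the model; each round, the target sample
        whose predicted probability is closest to 0.5 (most uncertain) is
        queried and added to the training set.
        """
        X_lab, y_lab = X_source.copy(), y_source.copy()
        unlabeled = list(range(len(X_target)))
        clf = LogisticRegression(max_iter=1000)  # stand-in base learner
        for _ in range(min(n_queries, len(unlabeled))):
            clf.fit(X_lab, y_lab)
            probs = clf.predict_proba(X_target[unlabeled])[:, 1]
            # Uncertainty = closeness to the decision boundary.
            idx = int(np.argmin(np.abs(probs - 0.5)))
            chosen = unlabeled.pop(idx)
            # Query the "oracle" (in practice, a human annotator).
            X_lab = np.vstack([X_lab, X_target[chosen:chosen + 1]])
            y_lab = np.append(y_lab, y_target_oracle[chosen])
        return clf.fit(X_lab, y_lab)

In the project's setting, the instance weighting and efficient model-updating steps would replace the full refit inside the loop, which is the main cost in this simplified version.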

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 0751185
Program Officer: Tatiana D. Korelsky
Budget Start: 2008-03-15
Budget End: 2012-02-29
Fiscal Year: 2007
Total Cost: $454,000
Name: Carnegie-Mellon University
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213