In traditional computer vision, scene recognition and object recognition are two related visual tasks that are generally studied separately. Devising systems that solve these tasks in an integrated fashion makes it possible to build more efficient and robust recognition systems. At the lowest level, significant computational savings can be achieved if different categories share a common set of features. More importantly, jointly trained recognition systems can exploit similarities between object categories by learning features that lead to better generalization. In complex natural scenes, object recognition systems can be further improved by using contextual knowledge, both about the objects likely to be found in a given scene and about the spatial relationships between those objects. Object detection and recognition are generally posed as a matching problem between the object representation and the image features, with background features rejected as outliers. The PI will instead formulate object detection as a problem of aligning elements of the entire scene: the background, instead of being treated as a set of outliers, will be used to guide the detection process.

In developing integrated systems that try to recognize many objects, the lack of large annotated datasets becomes a major problem. The PI created and will extend two datasets: LabelMe and the 80 Million Tiny Images dataset. LabelMe is an online annotation tool that allows sharing and labeling images for computer vision research. Both datasets offer an invaluable resource for research and teaching in computer vision and computer graphics. They are also intended to foster creativity, as they allow students at all levels to explore well-established algorithms as well as devise new applications in computer vision and computer graphics. The PI will also develop new image and video datasets by exploiting the millions of images available on the Internet.
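For readers who want to work with the annotations directly, the following minimal sketch parses a single LabelMe annotation file. It assumes the standard LabelMe XML export layout (an <annotation> root containing <object> elements, each with a <name>, a <deleted> flag, and a <polygon> of <pt> points); the file path in the usage comment is hypothetical.

    import xml.etree.ElementTree as ET

    def load_polygons(xml_path):
        """Return a list of (label, [(x, y), ...]) pairs from a LabelMe file."""
        root = ET.parse(xml_path).getroot()
        objects = []
        for obj in root.findall("object"):
            # Skip annotations that were deleted in the online tool.
            if obj.findtext("deleted", default="0").strip() == "1":
                continue
            polygon = obj.find("polygon")
            if polygon is None:
                continue
            label = obj.findtext("name", default="").strip()
            points = [(float(pt.findtext("x")), float(pt.findtext("y")))
                      for pt in polygon.findall("pt")]
            objects.append((label, points))
        return objects

    # Example usage (file name is illustrative):
    # for label, poly in load_polygons("annotations/dining_room_01.xml"):
    #     print(label, len(poly), "vertices")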

The creation of robust systems for scene understanding will have a major impact on many fields by enabling smart devices that can understand and interact with their environment, from aids for the visually impaired to autonomous vehicles, robotic assistants, and online tools for searching visual information.

The PI will extend his teaching and research activities beyond the boundaries of the classroom and the laboratory by developing a substantial amount of online material.

URL: http://people.csail.mit.edu/torralba/integratedSceneRecognition/

Project Report

Among the many problems that must be addressed to build an artificial vision system, object and scene recognition are central themes of today's research. One of the field's recent successes is face detection: the ability to localize faces in images automatically, accurately, and quickly is now a common feature of most digital cameras. General object detection and scene understanding, however, remain challenging. Detecting a plate, for instance, appears to be harder than detecting a face, in part because a plate's appearance is defined by only a few shape cues, while its variability in shape, texture, and color is very large compared to faces. In many situations, what counts as a plate may be constrained only by the context it is part of. The current challenge for computer vision scientists is to create systems that can search for and recognize objects based on both location and context, such as understanding that a plate is likely to be on top of a dining room table or in a picture of a dining room.

We know that contextual regularities play a fundamental role in human recognition of objects in natural images, and their strength is illustrated in Fig. 1. Subjects describe the scenes as containing (a) a car in the street and (b) a pedestrian in the street; however, the pedestrian is in fact the same shape as the car, rotated 90 degrees. The non-typicality of this orientation for a car, within the context defined by the street, makes the car appear to be a pedestrian. The role of context becomes essential when the features of the objects are degraded or unavailable (e.g., when an object is too small or largely occluded).

The goal of this award was to explore new representations for scene understanding and to develop the datasets needed to train such generic systems. Datasets are an integral part of contemporary object recognition research: they have been a chief reason for the considerable progress in the field, not just as a source of large amounts of training data, but also as a means of measuring and comparing the performance of competing algorithms. An important part of the research carried out under this award has been the continuous development of LabelMe, a free web-based image annotation tool (fig. 2). The goal of the tool is to build a large collection of annotated images for training scene-understanding systems, and it is made available so that other researchers can build their own datasets. Using LabelMe, we have been building the SUN database, which spans more than 400 scene categories and contains more than 300,000 segmented objects.

The SUN database has allowed us to explore new representations for contextual reasoning and object detection. A context model can rule out unlikely combinations or locations of objects and guide detectors toward a semantically coherent interpretation of a scene. The performance benefit of context models has been limited, however, because most previous methods were tested on datasets with only a few object categories, in which most images contain just one or two of them. Our model incorporates global image features, dependencies between object categories, and the outputs of local detectors into one probabilistic framework.
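A minimal sketch of the intuition behind such a framework follows: local detector scores are rescored using pairwise co-occurrence statistics between categories. The simple pairwise form, the function and variable names, and the numbers are illustrative assumptions for exposition, not the full published model, which also reasons about object locations and global scene features.

    def rescore(detector_scores, cooccurrence, alpha=0.5):
        """detector_scores: dict category -> strongest local detector score (log-odds).
        cooccurrence: dict (category, other) -> log-odds of `category` being
        present given that `other` is present. Returns context-adjusted scores."""
        adjusted = {}
        for cat, score in detector_scores.items():
            # Sum contextual evidence from the other confidently detected categories.
            context = sum(cooccurrence.get((cat, other), 0.0)
                          for other, s in detector_scores.items()
                          if other != cat and s > 0.0)
            adjusted[cat] = score + alpha * context
        return adjusted

    # A confident "dining table" detection raises a borderline "plate"
    # and penalizes an out-of-context "car".
    scores = {"dining table": 2.1, "plate": -0.2, "car": 0.4}
    stats = {("plate", "dining table"): 1.5, ("car", "dining table"): -2.0}
    print(rescore(scores, stats))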
Our context model improves object recognition performance and provides a more coherent interpretation of a scene than detectors alone. In addition, it can be applied to scene understanding tasks that local detectors alone cannot solve, such as detecting objects that are out of context (fig. 3) or querying for the most typical and the least typical scenes in a dataset.

An important property of large databases is that even relatively simple algorithms can show improved performance when trained with very large amounts of data. In natural conditions, however, large-data and small-data regimes coexist. In a series of papers published at CVPR 2011 and NIPS 2011, we showed how transfer learning techniques can be used to recognize rare objects by sharing information with frequent objects. Interestingly, the algorithm learns to automatically cluster rare object classes with common object classes for which a lot of data is available. As a consequence, "big data" can help "small data" regimes when both coexist, which is a very common situation with naturally collected training data.
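The sketch below illustrates the flavor of this sharing under strong simplifying assumptions: each rare class's classifier is shrunk toward a data-weighted average of the classes it has been clustered with, so classes with little data borrow most of their model from frequent neighbors. The clustering input, the shrinkage rule, and all names are hypothetical stand-ins for exposition, not the algorithms published at CVPR/NIPS 2011.

    import numpy as np

    def share_weights(weights, counts, clusters, tau=50.0):
        """weights: dict class -> parameter vector; counts: dict class -> #examples;
        clusters: list of lists of class names that were grouped together."""
        shared = {}
        for cluster in clusters:
            # Data-weighted cluster mean: dominated by the frequent classes.
            total = sum(counts[c] for c in cluster)
            mean = sum(counts[c] * weights[c] for c in cluster) / total
            for c in cluster:
                lam = counts[c] / (counts[c] + tau)  # rare class => small lam
                shared[c] = lam * weights[c] + (1.0 - lam) * mean
        return shared

    # A rare class ("ottoman", 5 examples) borrows most of its model from
    # a frequent clustered neighbor ("sofa", 5000 examples).
    w = {"sofa": np.array([1.0, 0.0]), "ottoman": np.array([0.2, 0.9])}
    n = {"sofa": 5000, "ottoman": 5}
    print(share_weights(w, n, [["sofa", "ottoman"]]))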

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0747120
Program Officer: Kenneth C. Whang
Budget Start: 2008-04-01
Budget End: 2013-03-31
Fiscal Year: 2007
Total Cost: $500,000
Name: Massachusetts Institute of Technology
City: Cambridge
State: MA
Country: United States
Zip Code: 02139