Image interpretation, the ability to see and understand the three-dimensional world behind a two-dimensional image, goes to the very heart of the computer vision problem. The overall objective of this proposal is, given a single image, to automatically produce a coherent interpretation of the depicted scene. On one level, such interpretation should include opportunistically recognizing known objects (e.g. people, houses, cars, trees) and known materials (e.g. grass, sand, rock, foliage) as well as their rough positions and orientations within the scene. But more than that, the goal is to capture the overall "sense of the scene" even if we do not recognize some of its constituent parts.

To address this extremely difficult task, the PI proposes a novel framework that aims to jointly model the elements that make up a scene within the geometric context of the 3D space that they occupy. Because none of the measured quantities in the image -- geometry, materials, objects and object parts, scene classes, camera pose, etc. -- are reliable in isolation, they must all be considered together, in a coherent way. The geometric context representation will allow all the elements of the image to be physically "placed" within this contextual frame and will permit reasoning between them and their 3D environment in a joint optimization framework. During the timeframe of this proposal, the PI will develop such a framework, allowing a geometrically coherent semantic interpretation of an image to emerge.

Intellectual Merit: At the core of the proposal is an effort to unify two disjoint computer vision philosophies -- the traditional "Geometry" school that deals with 3D quantities like points and surfaces, and the newer "Appearance" school that operates in terms of 2D pixel patterns. These two views are here combined into one coherent framework, where appearance and geometry co-exist and rely on each other to jointly produce an interpretation of an image.

Broader Impact: There are a number of important real-world problems that will benefit from the proposed research even during its development. Direct applications of this work include: developing navigation assistant technology for the visually impaired, scene awareness for mobile robots and car safety, and creating graphical 3D walk-through environments from a single image.

URL: www.cs.cmu.edu/~efros/ImageInterpretation/

Project Report

Human vision is one of the most remarkable machines that has ever existed. From sparse, noisy, hopelessly ambiguous local scene measurements, our brain manages to create a complete, coherent visual experience, a comprehensible universe that is occasionally wrong but always rich and vivid. Not only are we able to identify the objects present in a scene, but we can also easily reason about their positions and relationships within the 3D world, even when looking at a simple 2D photograph. Understanding, and someday reproducing (or even surpassing), this remarkable human ability is among the most wonderful and exhilarating pursuits imaginable!

But how can scene interpretation, while seemingly effortless for humans, remain so excruciatingly difficult for a computer? One of the main reasons appears to be that recognition is inherently a global process. When we see a person at the street corner (Figure 1), the simple act of recognition is made possible not just by the pixels inside the person-shape (there are rarely enough of them!), but also by many other cues: the surface on which she is standing, the 3D perspective of the street, other objects in the scene (cars, pedestrians), etc.

As part of this project, we have developed the concept of geometric context as the glue that coherently binds together all the pieces of the scene understanding puzzle. We developed automatic algorithms to recover a 3D "contextual frame" of an image, a sort of theater stage representation containing major surfaces and their relationships to each other (Figure 2). Furthermore, we were able to use the geometric context as an optimization framework for connecting other scene elements. For example, given the geometric context estimate and a set of off-the-shelf local object detectors, we were able to capture their contextual dependencies in a geometrically valid way, producing an object detection system that worked much better than previous efforts.
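To give a flavor of how geometric context can constrain local detectors, the following is a minimal illustrative sketch, not the project's actual algorithm: all function names, cue definitions, and the naive-Bayes-style combination rule are assumptions made for illustration. The idea is simply that an appearance-only detection score can be rescored by geometric cues, such as whether the detection rests on a plausible support surface and whether its image size is consistent with the estimated horizon.

```python
# Hypothetical sketch of rescoring local object detections with
# geometric-context cues. The cue names and the multiplicative
# combination rule are illustrative assumptions, not the actual
# method developed in this project.

def rescore_detection(detector_score, ground_support_prob, horizon_consistency):
    """Combine a local detector's confidence with geometric cues.

    detector_score:       appearance-only confidence in [0, 1]
    ground_support_prob:  estimated probability that the surface
                          directly below the detection is "ground"
                          (a valid support surface), in [0, 1]
    horizon_consistency:  agreement between the detection's image
                          height and the camera/horizon estimate,
                          in [0, 1]
    """
    # Treat the cues as independent pieces of evidence and multiply
    # them (a naive-Bayes-style combination).
    return detector_score * ground_support_prob * horizon_consistency

# A person detection standing on the street keeps most of its
# confidence; an identical detector response "floating" with no
# ground support below it is strongly suppressed.
on_street = rescore_detection(0.8, 0.9, 0.95)   # plausible placement
floating = rescore_detection(0.8, 0.05, 0.2)    # implausible placement
```

Under this toy rule, the geometrically plausible detection retains a score of 0.684 while the implausible one drops to 0.008, illustrating how geometry can veto appearance without any change to the underlying detector.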
The central outcome of this project has been a large body of research that has opened up a new area of inquiry in computer vision -- geometrically coherent image interpretation. Not only is the geometric context algorithm now a part of the standard toolbox of computer vision (it even made it into the most popular college textbook on computer vision), but the method has also found applications in other areas, particularly computer graphics and robotics. As further evidence of impact, the papers produced under this project have all generated a disproportionately large number of citations, and one of them received the CVPR'06 Best Paper Award, ranking it at the top of several hundred papers that year. Moreover, Derek Hoiem, the main student supported by the project, won the ACM Dissertation Honorable Mention award for his thesis on this topic. Apart from scientific publications, the project resulted in several fully functional computer programs being released on the web, free of charge, for unlimited non-commercial use. Several annotated image datasets associated with the project have also been made freely available on the web.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0546547
Program Officer: Richard Voyles
Budget Start: 2006-02-01
Budget End: 2012-01-31
Fiscal Year: 2005
Total Cost: $499,514
Name: Carnegie-Mellon University
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213