Institution: University of Arizona.

This project will develop approaches for learning stochastic geometric models for object categories from image data. Good representations of object form that encode the variation within typical categories (e.g., cars) are needed for important problems in computer vision and intelligent systems. One key problem is object recognition at the category level. What makes an object a member of one category (e.g., tables) rather than another (e.g., chairs) relates strongly to its structure, and automatically choosing among categories to robustly recognize a new object requires appropriate representations of form.

A second problem is reasoning about object configuration and structure. For example, a standard chair should be recognizable as being similar to a table in certain ways, different in other ways, perhaps seen as blocking a particular path in a room, and considered useful as a step for reaching something. To achieve this level of understanding, representations for geometric structure that can link to physics and semantics are needed. But where should they come from?

To address this question, this project will explore learning effective representations from image data. More specifically, this project will study the novel approach of putting representation at the core, learning from data which objects can be modeled in this manner. The work will begin with simple, effective representations that are appropriate for some objects, and then expand the pool of models, largely by exploiting the fact that many complex objects are composed of simpler, natural substructures, and that these are shared across multiple object categories. One result of this process will be statistical models for objects based on image data that will be disseminated to the research community.

This research will have positive impact on many applications that rely on robust recognition and scene understanding from image data, particularly in cases where the configuration, orientation, and form of objects are relevant. These include applications where robots must function in natural environments, and systems for augmenting human operators in numerous industrial, military, and everyday situations.

The learned object category representations will have additional uses in image and video retrieval and for model palettes in computer graphics applications. This research will also impact biomedical research by improving automated extraction of biological structure from image data to recognize phenotypes and to quantify the relation of form and function in high throughput experiments.

This project integrates two important educational initiatives: 1) curriculum development to increase opportunities for classroom study in computer vision, machine learning, and scientific applications at the University of Arizona; and 2) an educational outreach program targeted at Tucson high-school students from low socioeconomic groups that will promote an understanding of the integration of science and computation.

Project URL: http://vision.cs.arizona.edu/kobus/CAREER

Project Report

This project advanced our understanding of computer vision systems that integrate and exploit 3D representations of the world. While 3D has a long history in computer vision, most recent work uses representations based on 2D images. However, for many applications, understanding the 3D geometry of the world is important. Further, 3D representations of the world are often simpler than those based on images, because information that is lost during projection (e.g., due to occlusion) is difficult to model, and the camera parameters (position, angle, and focal length), which lead to different views, are typically unknown.

In this project we developed the hypothesis that, given models for what is in the world, we can often infer the camera parameters. This is because evidence for what is in the world (e.g., chair versus table) is different from, and not confusable with, evidence for the camera parameters (e.g., a longer or shorter focal length). Our methodology throughout this project was to construct principled models for the expected evidence in images given a hypothesis for both what is in the world and the camera parameters, and then to search for a hypothesis that gives high agreement with the evidence in images and with our prior understanding of the world. We applied this methodology to three related domains.

The first domain was learning models of object structure. Here we showed that a computer program can learn the topology of simple furniture objects (e.g., the parts and how they are connected) by assuming that objects are contiguous in 3D, are constructed from simple parts (e.g., blocks), and that objects within a class share the same topology. We found that eight images of an object class were often sufficient to learn the topology of classes such as tables, chairs, sofas, and cabinets. This work led to a paper published in a top venue (NIPS'09).

The second domain was understanding indoor scenes from single images. Here, understanding includes the room layout, the camera parameters, and the identity and geometry of objects within the room, including frames (windows, doors, and pictures). We contributed a principled, fully Bayesian method for doing so. We first showed that the method can provide good estimates of the room layout, with objects represented by their 3D bounding boxes. Next, we showed that we can use external information (e.g., numbers from furniture catalogues) about the dimensions of objects to identify objects within the scene, which also improved the estimated room layout by reducing confusion from non-existent objects with unrealistic dimensions. Finally, we demonstrated that finer-grained modeling of non-convex objects (e.g., tables and chairs) further improved scene understanding, because the system was now able to tuck chairs partly underneath tables to better explain image evidence. In addition, we demonstrated that using context (e.g., that chairs are often near tables) to find objects improved scene understanding even further. These three thrusts led to three papers in top venues (CVPR'11, CVPR'12, and CVPR'13).
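To make the catalogue-dimension idea concrete, the following is a minimal sketch, not the project's implementation: it scores a candidate 3D bounding box against per-category dimension statistics. The category list, the numbers, and the function names are illustrative assumptions, not values or code from the project.

```python
import math

# Hypothetical per-category dimension statistics (meters), in the spirit of
# the furniture-catalogue priors described above: mean and standard deviation
# for (width, height, depth). These numbers are illustrative only.
CATEGORY_STATS = {
    "table": ((1.4, 0.75, 0.8), (0.3, 0.05, 0.2)),
    "chair": ((0.45, 0.9, 0.5), (0.05, 0.1, 0.05)),
    "cabinet": ((0.9, 1.8, 0.45), (0.3, 0.4, 0.1)),
}

def log_gaussian(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def dimension_log_prior(box_dims, category):
    """Log prior of a 3D box's (w, h, d) under a category's dimension statistics."""
    means, stds = CATEGORY_STATS[category]
    return sum(log_gaussian(x, m, s) for x, m, s in zip(box_dims, means, stds))

def most_plausible_category(box_dims):
    """Rank categories by how well they explain the box dimensions."""
    return max(CATEGORY_STATS, key=lambda c: dimension_log_prior(box_dims, c))

# A 1.5m x 0.74m x 0.9m box is far more plausible as a table than as a chair,
# and an implausibly proportioned "object" scores poorly under every category,
# which is how unrealistic dimensions reduce false positives.
print(most_plausible_category((1.5, 0.74, 0.9)))  # -> "table"
```

In a fully Bayesian method of the kind described above, such a dimension prior would be one factor in the overall hypothesis score, combined with the image evidence and the other scene priors.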
The third domain was temporal scene understanding in the context of people moving about on a horizontal ground plane, as captured by a stationary video camera. Here we were able to track people's 3D locations on the ground plane, estimate their relative heights, widths, and girths, and infer the camera parameters. Intuitively, people provide probes for the camera parameters as they move toward and away from the camera without changing size.

Working entirely in 3D leads to richer information extraction and has several advantages for tracking as well. First, in 3D, given a camera hypothesis, our system is not confused by one person occluding another; in fact, one person disappearing behind another is a good source of evidence. Second, we can use knowledge about the typical walking speed of people, which is much more useful than doing so with image data, where apparent speed depends on distance from the camera. And third, we can use the fact that people stay roughly the same size as they walk, which again is hard to exploit within image-based tracking methods.
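The following is a minimal sketch of the "people as probes" intuition, under simplifying assumptions the project itself did not rely on: a level pinhole camera, a known horizon line, and a known focal length in pixels. The detections and constants are hypothetical.

```python
# For a level pinhole camera at height h_cam, a person of height H standing at
# ground distance Z projects a foot y_foot = f * h_cam / Z pixels below the
# horizon and a head y_head = f * (h_cam - H) / Z, so the ratio of apparent
# person height to y_foot depends only on H / h_cam.

ASSUMED_PERSON_HEIGHT = 1.7   # meters; a population-average stand-in
FOCAL_LENGTH_PX = 800.0       # assumed known here; inferred in the project

def camera_height(y_foot, y_head, person_height=ASSUMED_PERSON_HEIGHT):
    """Camera height implied by one detection (pixels below the horizon)."""
    return person_height * y_foot / (y_foot - y_head)

def ground_distance(y_foot, h_cam, f=FOCAL_LENGTH_PX):
    """Ground-plane distance to the feet, given camera height and focal length."""
    return f * h_cam / y_foot

# Hypothetical detections of one walking person: (y_foot, y_head) per frame.
detections = [(200.0, 30.0), (160.0, 24.0), (133.0, 20.0)]

# Each frame votes for a camera height; averaging pools the evidence, which is
# why repeated sightings of the same walker yield a stable camera estimate.
h_cam = sum(camera_height(yf, yh) for yf, yh in detections) / len(detections)

# With the camera fixed, the same detections yield 3D ground positions, which
# is what lets a tracker reason about occlusion and walking speed.
for yf, yh in detections:
    print(f"h_cam={h_cam:.2f} m, Z={ground_distance(yf, h_cam):.2f} m")
```

The design point the sketch illustrates is that the person's apparent size and foot position constrain the camera jointly, so evidence about what is in the world (a roughly constant-height walker) is not confusable with evidence about the camera.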
In summary, the funded work developed the notion that representing the world in 3D is more informative, enables better exploitation of prior knowledge, and allows simpler representations than 2D, image-based methods. Please visit http://ivilab.org/CAREER for much more information about the research done as part of this project.

This project also contributed to the creation of three courses and three PhD degrees, and involved six undergraduate students in research, leading to publications for most of them. The project also supported seven instances of the Integration of Science and Computing (ISC) summer camp for middle-school students, in which students from groups under-represented in higher education spent a week visiting bioscience labs, collecting data, building models for their data, and presenting their work to parents and teachers. Please visit http://ivilab.org/ISC for more information about the camp.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0747511
Program Officer: Jie Yang
Budget Start: 2008-04-01
Budget End: 2014-03-31
Fiscal Year: 2007
Total Cost: $481,607
Name: University of Arizona
City: Tucson
State: AZ
Country: United States
Zip Code: 85721