Seamless understanding of the meaning of visual images is a key property of human cognition that is far beyond the abilities of current computer vision programs. The purpose of this project is to build a computational system that captures the dynamical and interactive aspects of human vision by integrating higher-level concepts with lower-level visual perception. If successful, this system will be able to interpret visual scenes in a way that scales well with the complexity of the scene. Current computer vision systems typically rely on relatively low-level visual information (e.g., color, texture, shape) to classify objects or determine the overall category of a scene. Such categorization is typically done in a "bottom-up" fashion, in which the vision system extracts lower-level features from all parts of the scene, and subsequently analyzes the extracted features to determine which parts of the scene contain objects of interest and how those objects should be categorized. Such systems lack the abilities to scale to large numbers of visual categories and to identify more complex visual concepts that involve spatial and abstract relationships among object categories. Visual perception by humans is known to be a temporal process with feedback, in which lower-level visual features serve to activate higher-level concepts (or knowledge). These active concepts, in turn, guide the perception of and attention given to lower-level visual features. Moreover, activated concepts can spread activation to semantically related concepts (e.g., "wheels" might activate "car" or "bicycle"; "bicycle" might activate "road" or "rider"). In this way there is a continual interaction between the lower and higher levels of vision, which allows the viewer to focus on and connect important aspects of a complex scene in order to perceive its meaning, without having to pay equal attention to every detail of the scene. The system proposed here will model these aspects of human visual perception.
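To make the concept-activation dynamic described above concrete, the sketch below illustrates one simple way such spreading activation could be modeled in code. It is a minimal illustration only: the network contents, spread factor, and decay rate are assumptions made for the example, not components of the proposed system.

    # Minimal sketch of spreading activation in a concept network.
    # The links, spread factor, and decay rate are illustrative assumptions.
    from collections import defaultdict

    class ConceptNetwork:
        def __init__(self, links, spread_factor=0.5, decay=0.9):
            self.links = links                  # concept -> semantically related concepts
            self.spread_factor = spread_factor  # fraction of activation passed to neighbors
            self.decay = decay                  # per-step decay of all activations
            self.activation = defaultdict(float)

        def activate(self, concept, amount=1.0):
            """A lower-level feature detector reports evidence for a concept."""
            self.activation[concept] += amount

        def step(self):
            """Spread activation to related concepts, then let all activations decay."""
            spread = defaultdict(float)
            for concept, act in self.activation.items():
                for neighbor in self.links.get(concept, []):
                    spread[neighbor] += self.spread_factor * act
            for concept, extra in spread.items():
                self.activation[concept] += extra
            for concept in self.activation:
                self.activation[concept] *= self.decay

    # Example from the text: "wheels" activates "car" and "bicycle";
    # "bicycle" in turn activates "road" and "rider".
    links = {"wheels": ["car", "bicycle"], "bicycle": ["road", "rider"]}
    net = ConceptNetwork(links)
    net.activate("wheels")      # evidence from a low-level feature
    net.step()
    net.step()
    print(sorted(net.activation.items(), key=lambda kv: -kv[1]))

In the full system, of course, the resulting activation levels would feed back to guide where and how lower-level feature extraction is applied next, rather than simply being reported.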

The proposed system, called Petacat, will integrate and build on two existing projects: the HMAX model of object recognition, originally developed by Riesenhuber and Poggio, and the Copycat model of high-level perception and analogy-making, developed by Hofstadter and Mitchell. HMAX models the "what" pathway of mammalian visual cortex via a feed-forward network that extracts increasingly complex textural and shape features from an image. (HMAX has been reimplemented, as the "Petascale Artificial Neural Network" or PANN, by the Synthetic Vision Group at Los Alamos to allow for high-performance computing on large numbers of neurons.) Copycat implements a process of interaction between high-level concepts and lower-level perception, and has been used to model focus of attention, conceptual slippage, and analogy-making in several non-visual domains. This project will marry the feature-extraction abilities of HMAX/PANN with the higher-level interactive perceptual abilities of Copycat to build the Petacat architecture. The image interpretation abilities of Petacat will be evaluated on families of related semantic visual recognition tasks (e.g., recognizing, in a flexible, human-like way, instances of "walking a dog"). The evaluation part of the project will involve the creation of image databases for benchmarking semantic image-understanding systems. The Petacat source code and benchmarking databases will be made publicly available via the web.
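For readers unfamiliar with HMAX, the sketch below gives a rough, self-contained illustration of the kind of feed-forward processing its first stages perform: oriented filtering ("simple"-cell layers) alternating with local max-pooling ("complex"-cell layers). The filter parameters, image size, and pooling radius here are hypothetical and are not the settings used in HMAX or PANN.

    # Illustrative S1/C1-style stage in the spirit of HMAX; all parameters
    # are placeholder values chosen for the example, not HMAX/PANN settings.
    import numpy as np
    from scipy.signal import convolve2d
    from scipy.ndimage import maximum_filter

    def gabor_kernel(size=11, wavelength=5.0, sigma=3.0, theta=0.0):
        """Oriented Gabor filter, a rough model of a V1 simple-cell receptive field."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)
        yr = -x * np.sin(theta) + y * np.cos(theta)
        return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

    def s1_c1(image, orientations=4, pool_size=8):
        """S1: filter at several orientations; C1: local max-pooling within each orientation."""
        s1 = [convolve2d(image, gabor_kernel(theta=np.pi * k / orientations), mode="same")
              for k in range(orientations)]
        c1 = [maximum_filter(np.abs(response), size=pool_size) for response in s1]
        return np.stack(c1)     # one position-tolerant feature map per orientation

    image = np.random.rand(64, 64)   # stand-in for a grayscale input image
    features = s1_c1(image)
    print(features.shape)            # (orientations, height, width)

The full HMAX/PANN hierarchy repeats this alternation of template matching and pooling at higher layers, yielding the increasingly complex, position- and scale-tolerant shape features mentioned above.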

Project Report

The long-term goal of our work is automatic image interpretation --- that is, to give computers the ability to make sense of images in terms of known "stereotypical" situations, such as "walking a dog", "a birthday party", or "a fight about to break out". Situation recognition by humans may appear effortless on the surface, but it relies on a complex dynamic interplay among human abilities to perceive objects, systems of relationships among objects, and analogies with stored knowledge and memories. No computer vision system yet comes close to capturing these human abilities. Enabling computers to flexibly recognize visual situations would open up a wide range of important applications in fields as diverse as medical diagnosis, interpretation of scientific imagery, enhanced human-computer interaction, and personal information organization. Our approach to situation interpretation is to integrate two types of artificial-intelligence architectures: brain-inspired neural networks for lower-level vision and cognitive-level models of concepts and analogy-making. The completed three-year project funded by this grant took several initial steps toward these long-term goals. The main outcomes of the project were (1) the development and extensive analysis of a particular brain-inspired neural network for object recognition; (2) new methods for training this neural network that significantly improve its performance; and (3) the development of new methods for using "context" in a dynamic way in computer vision. These outcomes, which constitute the intellectual merit of the project, have set the stage for our current project: the implementation of a complete integrated system for automatic "situation recognition". The broader impacts of our work so far have been seen in four ways: (1) the potential of the work to produce important ideas and methods for computer vision, a field with broad impact across many areas of society; (2) training provided to graduate students, undergraduates, and high-school interns who have been part of this project; (3) availability of all source code and image datasets, which will be useful to researchers working on related projects; and (4) the public dissemination of new ideas via our group's articles, lectures, and web-based demonstrations for non-expert audiences.

Budget Start: 2010-09-15
Budget End: 2014-08-31
Fiscal Year: 2010
Total Cost: $341,269
Name: Portland State University
City: Portland
State: OR
Country: United States
Zip Code: 97207