As we move through our visual environment, the pattern of light that enters our eyes is strongly shaped by the properties of objects within the environment, their motion relative to each other, and our own motion relative to the external world. This collaborative project will quantify motion within natural scenes, record activity from populations of neurons in the early visual pathway in response to that motion, and develop models of motion representation across neuronal populations. The primary goals of the work are to fully characterize the biological representation of motion in natural scenes at the early stages of visual processing, which sets the stage for the cortical computations critical for visual perception, and to unify the biological findings with computational models of motion from the computer vision community.

The perception of visual motion is critical for both biological and computer vision systems. Motion reveals the structure of the world, including the relative and absolute depths of objects, the surface boundaries between objects, and information about both ego-motion and the independent motion of other objects. The effects of visual motion on the relationship between spatially localized and global properties of the natural visual scene, and how this relationship is represented by the early visual pathway of the brain, are largely unknown.

This project addresses the computation of local and global properties of natural visual scenes by both distributed neural systems and computer vision algorithms, using a novel set of complex naturalistic stimuli in which ground truth properties of the scene are known, and in which all aspects of the scene, including its reflectance, surface properties, lighting, and motion, are under investigator control. A unified probabilistic modeling framework will be adopted that ties together the computational and biological models of natural scene properties. Neural activity will be recorded from a large, densely sampled population of single neurons in the visual thalamus. From the perspective of the computer vision community, an important challenge lies in inferring the motion of the external environment (or "optical flow") from sequences of 2D images. From the perspective of the neuroscience community, quantifying the distributed neural representation of luminance and motion in the early visual pathway will be a critical step in understanding how scene information is extracted and prepared for processing in higher visual centers. A team of investigators with experience in computer science, engineering, and neuroscience will develop a theoretical foundation and a rich set of methods for the representation and recovery of local luminance, local motion boundaries, and global motion by brains and machines.

Project Report

Motion provides powerful cues about scene structure. A fundamental challenge in computer vision is the estimation of image motion (or optical flow) from a sequence of images. This project advanced the state of the art by developing new algorithms, characterizing the performance of existing methods, and providing a new, challenging data set for the field. Ground truth data sets have spurred innovation in several areas of computer vision, since they provide objective evaluation criteria and encourage competition in the community. In the case of optical flow, however, ground truth is difficult to measure in real scenes with natural motion. As a result, existing optical flow data sets are restricted in size, complexity, and diversity, making optical flow algorithms difficult to train and test on realistic data. One aim of this project was therefore to create a new optical flow data set derived from the open source 3D animated short film Sintel. From this movie, we extracted 35 sequences displaying different environments, characters/objects, and actions. The resulting data set exhibits important features not present in previous data sets: long sequences, large motions, non-rigidly moving objects, specular reflections, motion blur, defocus blur, and atmospheric effects.

While these effects make the data set more realistic than others previously used in the field of optical flow, it is nevertheless synthetic and thus, at least perceptually, not "real". To validate the use of synthetic data, we collected real-life "lookalike" video clips from five semantic categories corresponding to our sequences: Fighting in snow, Bamboo forest, Indoor, Market chase, and Mountain. We compared the sequences in our data set to these lookalike clips and, as a further comparison, to clips from the popular Middlebury benchmark, using two sets of criteria: first-order image statistics and first-order optical flow statistics. For the image statistics, we computed brightness (intensity) histograms, power spectra, and gradient magnitude distributions. For the optical flow statistics, no ground truth flow exists for the lookalike videos, so we used an optical flow algorithm developed in this project (Classic+NL) as a proxy for the true flow and computed speed and direction distributions as well as spatial derivatives of these proxy flow fields. For both the image and the optical flow statistics, the Sintel clips fell, in all cases, between the lookalike sequences and Middlebury. Since both the lookalike clips and Middlebury consist of real, photographic images, we conclude that, at least in terms of first-order statistics, Sintel is sufficiently similar to real video to serve as an optical flow benchmark.

Using the test set, we evaluated a number of optical flow estimation algorithms and, as expected, found the Sintel data set to be much more challenging overall than the existing Middlebury data set. Specifically, we identified two conditions under which current optical flow algorithms fail. The first is high velocity: in regions with velocities above 40 pixels per frame, the endpoint error is approximately 45 times higher than in regions with velocities between 0 and 10 pixels per frame. The second is unmatched regions, i.e., regions visible in only one of two adjacent frames, where errors are on average 8 times higher than in regions visible in both frames.
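To make the comparison concrete, the following is a minimal sketch (not the project's code) of the kind of first-order image statistics described above: an intensity histogram, a radially averaged power spectrum, and a gradient magnitude distribution for a single grayscale frame. The function name, binning choices, and input format (a 2D array with values in [0, 1]) are illustrative assumptions.

import numpy as np

def image_statistics(frame, n_bins=256):
    """Sketch: first-order statistics of one grayscale frame in [0, 1]."""
    # Brightness (intensity) histogram
    intensity_hist, _ = np.histogram(frame, bins=n_bins, range=(0.0, 1.0), density=True)

    # Power spectrum, radially averaged over spatial frequency
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(frame))) ** 2
    h, w = frame.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    radial_power = np.bincount(radius.ravel(), weights=spectrum.ravel())
    radial_power /= np.maximum(np.bincount(radius.ravel()), 1)

    # Gradient magnitude distribution (finite differences)
    gy, gx = np.gradient(frame)
    grad_mag = np.hypot(gx, gy)
    gradient_hist, _ = np.histogram(grad_mag, bins=n_bins, density=True)

    return intensity_hist, radial_power, gradient_hist

# Example on a random array standing in for a video frame
stats = image_statistics(np.random.rand(240, 320))

In the study, distributions of this kind (plus the corresponding flow statistics computed from the Classic+NL proxy flow) were compared across the Sintel, lookalike, and Middlebury clips.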
From these two failure conditions, we conclude that optical flow estimation can be significantly improved by better modeling large motions and by reasoning about the structure of the scene in order to make educated guesses about motion in occluded and unmatched regions. To address the latter issue, we developed a new algorithm that represents scene motion in terms of layers. Layered models offer an elegant approach to motion segmentation and have many advantages. A typical scene contains only a few moving objects, and representing each moving object by a layer allows the motion of each layer to be described simply. Such a representation can also explicitly model the occlusion relationships between layers, making the detection of occlusion boundaries possible. Previous methods, however, have failed to capture the structure of complex scenes, provide precise object boundaries, effectively estimate the number of layers in a scene, or robustly determine the depth order of the layers. Furthermore, previous methods have focused on optical flow between pairs of frames rather than longer sequences. We introduced a new layered model of moving scenes in which the layer segmentation, depth order, number of layers, and motion of each layer are all estimated. This is the first layered flow model to achieve competitive results on benchmarks such as Sintel. The data set, together with all results and further information, can be downloaded from www.mpi-sintel.de.
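As a toy illustration of why a layered representation describes motion compactly, the sketch below composes a dense flow field from a per-pixel layer assignment and a handful of affine motion parameters per layer. This is only a schematic of the layered idea under simplifying assumptions (known segmentation, affine motion, no occlusion reasoning); the model developed in this project additionally estimates the segmentation, depth order, and number of layers from the images themselves.

import numpy as np

def affine_flow(params, h, w):
    """Flow field (u, v) of one layer from 6 affine parameters."""
    a, b, c, d, e, f = params
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    u = a + b * xx + c * yy
    v = d + e * xx + f * yy
    return u, v

def compose_layered_flow(layer_params, assignment):
    """Combine per-layer affine flows using a per-pixel layer assignment map."""
    h, w = assignment.shape
    u = np.zeros((h, w))
    v = np.zeros((h, w))
    for k, params in enumerate(layer_params):
        uk, vk = affine_flow(params, h, w)
        mask = assignment == k
        u[mask] = uk[mask]
        v[mask] = vk[mask]
    return u, v

# Two layers: a static background and a foreground patch translating 5 px/frame to the right
assignment = np.zeros((100, 100), dtype=int)
assignment[30:70, 30:70] = 1
u, v = compose_layered_flow([(0, 0, 0, 0, 0, 0), (5, 0, 0, 0, 0, 0)], assignment)

Here the entire flow field is captured by a segmentation map plus six parameters per layer, which is what makes occlusion relationships and motion boundaries explicit in a layered formulation.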

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0904875
Program Officer: Kenneth C. Whang
Budget Start: 2009-10-01
Budget End: 2012-09-30
Fiscal Year: 2009
Total Cost: $173,412
Name: Brown University
City: Providence
State: RI
Country: United States
Zip Code: 02912