The project proposes to address two difficult problems in stereo and motion analysis. These two areas of computer vision have been explored with numerous approaches, and very good results have been achieved, although only for limited types of scenes, such as those containing a single, smooth, textured surface. The first problem is occlusion, which causes many of the remaining difficulties, since most algorithms cannot operate on areas visible in only one image. Furthermore, even though the human visual system has no difficulty estimating the depth or velocity of occluded objects, computer vision systems cannot do so. They either produce arbitrary results for the occluded regions or, at best, mark them as occluded. The second problem is the need for processing at multiple scales. Many real scenes contain objects that are perceived at different scales, due both to variations in size and level of detail and to variations in texture density. A multiple-scale processing scheme is necessary for such scenes. It must be capable of detecting and preserving fine details, preventing larger or more richly textured regions from dominating smaller or less textured ones, achieving good continuation of structures, and filling in missing data.
To overcome these difficulties, it is proposed to address stereo and motion analysis from a perceptual organization point of view using the tensor voting framework. It is claimed that tokens, generated by matching corresponding pixels in the two images, form coherent perceptual structures in the appropriate space, while erroneous matches generate outlier tokens. The tensor voting framework allows us to efficiently detect perceptual structures without employing parametric models, which makes it possible to handle arbitrary structures. Furthermore, the tensor voting framework is non-iterative, requires no initial estimate, and has only one critical parameter, the scale of voting. It is proposed to handle occlusion explicitly by combining monocular information from the images with the results of binocular processing. The proposed methodology not only detects occluded regions but also computes estimates of the disparity or velocity of these regions, as shown in the preliminary results presented in the project description. Selecting a single scale of analysis for the entire scene would be sub-optimal in most regions and would compromise the quality of the results. Instead, our approach automatically infers the smallest scale that can preserve the finer details, then proceeds with larger scales to ensure good continuation in regions with sparse or missing data.
The preliminary results, including those on some standard data sets, compare favorably to those of the best algorithms reported to date. Based on these results, the proposed approach is expected to significantly advance the state of the art. The proposed research will also have an impact on the education of both graduate and undergraduate students at USC, and its results will be made available to the scientific community through publications and presentations, as well as on the World Wide Web. It is believed that addressing the difficulties described here is a critical step in the development of operational computer vision systems, since it is an attempt to bridge the gap between image acquisition and high-level vision, both of which are at a more mature state than mid-level computer vision.