This project explores new directions for computing top-down modulated visual saliency maps based on three basic principles: discriminancy, sparsity, and connectivity. The research identifies key factors for advancing the state of the art and presents a novel latent variable model, which extends the classical conditional random field with an embedded layer of latent variables to exploit the sparse nature of the features used for saliency maps. This sparse latent variable conditional random field model can be viewed as a joint optimization of group sparse coding and a conditional random field, and it can be solved with an efficient stochastic gradient descent algorithm. Unlike bottom-up saliency, this model facilitates high-level visual recognition tasks by learning sparse image structures from objects of interest. The key intellectual contributions of this project are a novel formulation that considers all three important properties of visual saliency in a unified framework, and an efficient learning algorithm for estimating the model parameters.
With the developed techniques, the search regions of high-level vision tasks can be constrained, thereby reducing computational complexity and enhancing robustness. Effective top-down modulated visual saliency algorithms have broad applications, including object detection, object recognition, visual tracking, scene analysis, image compression, surveillance, and robotics. They also provide a crucial tool for studying and analyzing eye-movement fixations in cognitive science. The research results, including code and data, are made public on the project web site.
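As a rough illustration of how a saliency map can constrain a search region, the following minimal Python sketch thresholds a small map and returns the bounding box of the salient cells. It is not the project's algorithm: the function names, the 0.5 threshold, and the toy 4x4 map are all assumptions made for this example.

```python
def salient_bounding_box(saliency, threshold=0.5):
    """Return (top, left, bottom, right), inclusive, enclosing every cell
    whose saliency is >= threshold, or None if no cell qualifies.
    The threshold value is an illustrative assumption."""
    rows = [r for r, row in enumerate(saliency)
            if any(v >= threshold for v in row)]
    if not rows:
        return None
    cols = [c for row in saliency
            for c, v in enumerate(row) if v >= threshold]
    return rows[0], min(cols), rows[-1], max(cols)


def search_fraction(saliency, box):
    """Fraction of the image area that falls inside the bounding box."""
    top, left, bottom, right = box
    box_area = (bottom - top + 1) * (right - left + 1)
    return box_area / (len(saliency) * len(saliency[0]))


# Hypothetical 4x4 saliency map with values in [0, 1].
saliency_map = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.9, 0.0],
    [0.0, 0.7, 0.6, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]

box = salient_bounding_box(saliency_map)
fraction = search_fraction(saliency_map, box)
print(box, fraction)
```

In this toy case a downstream detector would only need to scan the returned box rather than the full image, which is the sense in which a saliency map can reduce the search cost.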
One of the most important problems in computational or biological perception is visual information overload. Without filtering out extraneous signals, it would be computationally expensive to process all the incoming information. Perceptual saliency has survival relevance for animals because it lets them decide which regions merit further visual processing. In computer vision, one long-standing problem is to develop algorithms that focus on salient regions for efficient and effective image understanding. In a visual scene, is a particular object present or not? If yes, where is this object likely to appear? If it is moving, how can we predict its location in the next frame? When scenes are relatively simple and object appearance does not vary significantly, state-of-the-art computer vision systems handle these questions reasonably well. However, real-world scenes are usually highly cluttered, and object appearance changes constantly as a result of variations in pose and illumination. Not surprisingly, the human vision system answers these questions routinely. How does it handle the interference of cluttered backgrounds and accomplish the above-mentioned visual tasks effortlessly? Research in neuroscience offers some answers to these questions through models of saliency maps and visual attention mechanisms.
In this project, we have developed effective methods to analyze salient objects using top-down contextual information. With the developed methods, we are able to detect objects of interest in scenes even when they are heavily occluded. Such algorithms are useful for further analysis (e.g., object categorization) and have numerous applications (e.g., surveillance and autonomous driving). In addition, we have carried out extensive experiments to evaluate state-of-the-art saliency detection methods. The developed algorithms and source code are available on the PI's website.