The proposed project is aimed at three objectives, (1) studying a common representational framework for learning and modeling hundreds of object categories, especially to account for the large intra-category variations; (2) constructing a large ground truth database with a minimum of one million objects annotated semi-automatically for learning and testing; and (3) building a robust large scale object recognition and image parsing system. The core to this proposal is a stochastic context sensitive image grammar for effective visual knowledge representation and robust Bayesian inference. The proposed stochastic image grammar combines the reconfigurability of stochastic context free grammar (SCFG) with the contextual constraints of graphical (MRF) models. This stochastic grammar model has strong compositional power for representing large intra-class structural variations and recursive structures for scalable computing, and can be learned from a relatively small sample set. To make the large scale modeling and learning framework practical, the PI has been constructing a large scale ground truth database in collaboration with the Lotus Hill Institute (LHI) in China. The current database contains over 500,000 images manually parsed hierarchically using a semi-automatic vision system, in 240 object categories and 20 scene categories. All the data are represented uniformly in a large And-Or graph structure for learning and testing. We propose to continue collecting and annotating up to 1,000,000 images and construct a series of benchmarks during this project period.
The proposal develops core techniques for large scale object recognition which can be used as the foundation for building a wide range of applications in commercial and defense industry, such as intelligence image search, security and surveillance, autonomous vehicle, and assisting the blind and visually impaired. The ground truth database shall be the world largest in its detailed parsing and annotation. A selected portion of this large dataset will be publicized for learning and benchmark evaluation. It is expected to have a significant impact in the vision community, and it may also help researchers studying human perception in cognitive science by providing more realistic stimuli and natural image statistics.
Progress of this project will be reported through the following webpage
http://civs.stat.ucla.edu/Category_Recognition