This project develops a new strategy for scene interpretation, especially for annotating cluttered scenes containing instances from many object categories (e.g., a kitchen scene) and videos of people interacting with objects in everyday life (e.g., cooking). The research team develops a statistical model for scene interpretations and image measurements. One component of the model is a prior distribution on a very large interpretation vector. Each bit of this vector represents a high-level scene attribute, with widely varying degrees of specificity and resolution: some bits are very coarse (general hypotheses) and some are very fine (specific hypotheses). The other component is a simple conditional data model for a corresponding family of learned binary classifiers, one per bit. The scene interpretation is then computed by assessing hypotheses in a coarse-to-fine manner, using an image parsing algorithm called "entropy pursuit," based on stepwise uncertainty reduction, together with classifiers for detecting events in spatiotemporal volumes that leverage recent advances at the intersection of machine learning and dynamical systems. The computational models and scene parsing algorithms developed in this project are broadly applicable to scene interpretation problems arising in many areas of science and engineering. Specific applications include home surveillance and security, assisted home living, and infant and elderly care. The project also provides research opportunities for graduate students from underrepresented minorities as well as for high school students.
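As a rough illustration of the approach, the following minimal Python sketch implements stepwise uncertainty reduction with noisy binary classifiers. It simplifies aggressively: the bits are treated as independent with a single shared error profile (hit rate tpr, false-alarm rate fpr), whereas the project's model couples bits through the prior and attaches a separate classifier to each attribute; all names here are illustrative rather than taken from the project's code.

    import math
    import random

    def h(p):
        # Binary entropy in bits: H(p) = -p log2 p - (1 - p) log2 (1 - p).
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def info_gain(p, tpr, fpr):
        # Mutual information I(Z; X) between a bit Z with P(Z = 1) = p and a
        # noisy classifier answer X with hit rate tpr and false-alarm rate fpr.
        px1 = p * tpr + (1 - p) * fpr              # P(X = 1)
        return h(px1) - (p * h(tpr) + (1 - p) * h(fpr))

    def bayes_update(p, x, tpr, fpr):
        # Posterior P(Z = 1 | X = x) from prior p and the classifier error rates.
        like1 = tpr if x == 1 else 1 - tpr         # P(X = x | Z = 1)
        like0 = fpr if x == 1 else 1 - fpr         # P(X = x | Z = 0)
        return p * like1 / (p * like1 + (1 - p) * like0)

    def entropy_pursuit(prior, tpr, fpr, run_classifier, n_queries):
        # Greedy stepwise uncertainty reduction: at each step, run the
        # classifier whose noisy answer is expected to remove the most
        # entropy, then fold the answer into the posterior via Bayes' rule.
        posterior = dict(prior)                    # bit index -> P(Z_b = 1)
        asked = set()
        for _ in range(n_queries):
            candidates = [b for b in posterior if b not in asked]
            if not candidates:
                break
            b = max(candidates, key=lambda c: info_gain(posterior[c], tpr, fpr))
            x = run_classifier(b)                  # noisy yes/no answer for bit b
            posterior[b] = bayes_update(posterior[b], x, tpr, fpr)
            asked.add(b)
        return posterior

    # Toy usage: bit 0 is a coarse hypothesis ("any person present?"), bits
    # 1-3 are finer ones; the coarse, most uncertain bit is queried first.
    truth = {0: 1, 1: 0, 2: 1, 3: 0}
    prior = {0: 0.5, 1: 0.2, 2: 0.2, 3: 0.2}
    tpr, fpr = 0.85, 0.10                          # assumed error rates

    def run_classifier(b):
        z = truth[b]
        return 1 if random.random() < (tpr if z == 1 else fpr) else 0

    print(entropy_pursuit(prior, tpr, fpr, run_classifier, n_queries=4))

With independent bits, the greedy rule reduces to querying the bit whose classifier answer carries the most mutual information, so the most uncertain (typically coarse) hypotheses are examined first; this is one way the coarse-to-fine behavior described above can arise.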
A core challenge in artificial intelligence is automatically describing natural images in ordinary semantic terms. The dream is to build a "description machine" that produces a rich semantic description of the underlying scene, including the names, poses, and spatial layout of the objects that are present, events that may be occurring, their contextual relationships, and so on. This problem is sometimes called "computer vision." In this project, we have developed a framework for describing natural scenes that is inspired by two facets of human vision: divide-and-conquer querying (as in playing twenty questions) and selective attention. In our framework, given a scene, we ask a sequence of yes/no "questions" about the existence of people and objects, their activities and attributes, and their semantic relationships. A "description" of the scene is then the sequence of "answers" to these questions. To decide which questions should be selected and how they should be answered, we have developed statistical models for the scene and applied principles from information theory to achieve efficient search and evidence aggregation. In particular, we have developed a mathematical framework and corresponding algorithm for automatic semantic annotation called "entropy pursuit," and another framework, called a "restricted Turing test," for evaluating the performance of any computer vision system.

In entropy pursuit, the "questions" are selected automatically in a sequential, adaptive way that maximizes the information gain expected from each new query. The "answers" are provided by what are called "classifiers," which, given the current state of the art in machine learning, are highly imperfect. There are many classifiers, each associated with a specific object category and with a location and scale for its appearance in the image. The selection of the next question is based on the sequence of questions asked so far and their noisy answers. For this purpose, we also developed a statistical model that encodes general world knowledge, for instance prior expectations about how likely various objects are to be present and how they tend to appear in images from the general population under study. The final annotation balances two sources of information: the evidence acquired from the battery of classifiers for the particular image being described, and this general world knowledge about images.

The main motivation for the "restricted Turing test" is that most current methods for evaluating computer vision systems measure detection accuracy, emphasizing the classification of regions according to objects from a pre-defined library. But object detection is not the same as understanding, which is significantly more complex. As a consequence, performance metrics currently in use in the community do not scale with the richness of the semantic description. To address this issue, we have proposed a sharply different evaluation system, in which a query engine prepares a kind of written test that uses binary questions to probe a system's ability to identify attributes and relationships in addition to recognizing objects. The core contribution is an automatic "query generator" which interacts with an oracle (a human being) to produce a sequence of questions and correct answers. The query generator is learned from annotated images and produces a sequence of "unpredictable" binary questions for any given "test" image.
In loose terms, "unpredictable" means that hearing the answers to the questions already asked, without actually seeing the image, provides no information about the likely answer to the next question. The score of a system is simply the fraction of questions it answers correctly.
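To make the selection criterion concrete, here is a minimal Python sketch of a query generator under one simple reading of "unpredictable": among the remaining questions, pick the one whose yes-probability, as estimated from annotated images given only the history of answers (never the test image itself), is closest to 1/2. The names predict, oracle, and score are hypothetical stand-ins, not the project's code.

    import random

    def query_generator(questions, predict, oracle, n_questions):
        # questions: pool of candidate binary question identifiers.
        # predict(q, history): estimated P(answer to q is "yes") given only
        #     the answers collected so far; learned from annotated images,
        #     with the test image itself never consulted.
        # oracle(q): the human's ground-truth yes/no answer on the test image.
        history, remaining = [], list(questions)
        for _ in range(min(n_questions, len(remaining))):
            # The most unpredictable question is the one whose answer, given
            # the history alone, is closest to a coin flip.
            q = min(remaining, key=lambda c: abs(predict(c, history) - 0.5))
            history.append((q, oracle(q)))
            remaining.remove(q)
        return history

    def score(system, history):
        # A system's score is the fraction of questions it answers correctly.
        correct = sum(1 for q, a in history if system(q) == a)
        return correct / len(history)

    # Toy usage with stand-in components; a real predictor would be learned
    # from a corpus of annotated images and would condition on the history.
    pool = ["person present?", "person sitting?", "table present?"]
    base = {"person present?": 0.5, "person sitting?": 0.9,
            "table present?": 0.3}
    predict = lambda q, hist: base[q]
    truth = {"person present?": 1, "person sitting?": 0, "table present?": 1}
    history = query_generator(pool, predict, truth.__getitem__, n_questions=3)
    print(score(lambda q: random.randint(0, 1), history))  # chance-level system

Under this criterion, a system can score well above one half only by genuinely analyzing the image, since by construction the question stream offers little statistical shortcut.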