Understanding what people are doing in a video is one of the great unsolved problems of computer vision. Even a reasonably good solution would open tremendous application possibilities. The proposed work will use existing tools from the speech and object recognition communities, in particular finite state automata (FSAs), to obtain an understanding of activities that depend on detailed information about the body.

The particular focus is everyday activity. In this case, a fixed vocabulary either doesn't exist or isn't appropriate: for example, one often lacks words for behaviors that nonetheless appear familiar. One way to deal with this is to work with a notation (for example, Laban notation), but such notations typically work in terms that are difficult to map to visual observables (for example, the weight of a motion). The alternatives are either to develop a vocabulary or to develop expressive tools for authoring models.

This project will explore the third of these approaches: building tools for authoring models of behavior quickly and expressively using finite-state methods. The research will explore a class of models that are easy to author from existing, or easily obtained, data. The interpretation of what someone is doing is affected by the objects nearby: a person standing near a bus stop is doing something different from a person standing near an office door. The models studied make it practical to investigate this phenomenon of object context, using recent advances from the object recognition literature.
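As a concrete illustration of this authoring style, the sketch below composes small hand-authored finite-state models of primitive motions into a larger behavior model. The states, symbols, and behavior names are hypothetical and serve only to show how hierarchical composition might look in code; they are not the project's actual vocabulary or toolchain.

    # A minimal sketch (hypothetical names throughout) of hierarchically
    # authored finite-state behavior models.

    class FSA:
        """A finite-state acceptor over a small symbol alphabet."""
        def __init__(self, transitions, start, accept):
            self.transitions = transitions  # {(state, symbol): next_state}
            self.start = start
            self.accept = accept            # set of accepting states

        def accepts(self, symbols):
            state = self.start
            for s in symbols:
                if (state, s) not in self.transitions:
                    return False
                state = self.transitions[(state, s)]
            return state in self.accept

    def concatenate(a, b):
        """Compose two models in sequence: accept any split of the input whose
        prefix is accepted by `a` and whose suffix is accepted by `b`."""
        class Seq:
            def accepts(self, symbols):
                return any(a.accepts(symbols[:i]) and b.accepts(symbols[i:])
                           for i in range(len(symbols) + 1))
        return Seq()

    # Primitive motion models, authored by hand from a few example clips.
    walk  = FSA({("s", "step"): "s"}, start="s", accept={"s"})
    reach = FSA({("s", "raise_arm"): "r", ("r", "grasp"): "g"},
                start="s", accept={"g"})

    # A composite "walk up and take an object" behavior is simply the
    # concatenation of the two primitives.
    approach_and_take = concatenate(walk, reach)

    print(approach_and_take.accepts(["step", "step", "raise_arm", "grasp"]))  # True
    print(approach_and_take.accepts(["grasp", "step"]))                       # False

In practice such models would be weighted and composed with the transducer operations used in speech recognition; the unweighted acceptor above only illustrates the compositional authoring idea.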

Evaluating models for everyday behaviors is hard, because there is no prospect of obtaining a large collection of marked-up video (among other things, there isn't a vocabulary in which to mark it up). This project will use proxies (statistics that are hard to measure from video without accurate inferences of behavior, but easy to measure in other ways) to evaluate behavior representations. These will make it possible to tell whether, for example, a model of buying a beverage represents the concept accurately.
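One plausible form of such proxy evaluation (an assumption for illustration, not a protocol prescribed by the project) is to compare counts of a behavior inferred from video against the same statistic measured through an independent channel, such as till receipts for beverage purchases recorded over the same hours:

    # Hypothetical per-hour counts; the numbers and the "till receipt" proxy
    # are illustrative only.
    inferred_from_video = [3, 7, 5, 12, 9]   # detections of "buying a beverage"
    measured_by_proxy   = [4, 6, 5, 11, 10]  # receipts logged in the same hours

    def mean_absolute_error(a, b):
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    def pearson_correlation(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = sum((x - ma) ** 2 for x in a) ** 0.5
        sb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (sa * sb)

    # A behavior model whose inferred counts track the proxy closely (low error,
    # high correlation) is evidence that it represents the concept accurately.
    print("MAE:", mean_absolute_error(inferred_from_video, measured_by_proxy))
    print("r:  ", pearson_correlation(inferred_from_video, measured_by_proxy))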

Intellectual merit: This project will produce very large finite-state models of behavior using the same hierarchical authoring methods used in speech recognition. There will be a particular emphasis on behaviors that require understanding the kinematic configuration of the body, a topic that has been very difficult to study to date, with the aim of identifying basic building blocks of a vocabulary of everyday behavior. The results should include datasets of public behavior that can be disseminated without raising privacy concerns. New insights into the structure of human motion and behavior should emerge from (a) observations of people in public; (b) the process of authoring models; and (c) methods for identifying and modeling compositional structure in motion.

Broader impacts: This project should make substantial progress on one of the key open and applicable problems in computer vision. Methods that can search video for particular behaviors and compute statistics of behaviors have a wide range of applications, including human-computer interfaces built around computers that can watch the body; an improved understanding of what people do in public, which will lead to better architectural planning; and more efficient management of surveillance data, allowing searches for dangerous behaviors while preserving privacy. Education and access: This project will contribute to the graduate training of several students, and the work described will contribute to a planned text on computing with human motion.

URL: http://luthuli.cs.uiuc.edu/~daf/action.html

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0534837
Program Officer: Jie Yang
Budget Start: 2006-03-01
Budget End: 2010-02-28
Fiscal Year: 2005
Total Cost: $300,000
Name: University of Illinois Urbana-Champaign
City: Champaign
State: IL
Country: United States
Zip Code: 61820