This project tackles the development of new tools for the semantic analysis of temporal signals, in particular (but not restricted to) video sequences. While most of the emphasis in video analysis so far has been at the low level, the investigators plan to explore the use of Causal Analysis to perform inference and make decisions about video signals. The challenge in this project is to bridge the gap between basic descriptors at the signal level and Causal Calculus, which acts on semantically meaningful representations. In particular, long-range prediction, not just short-range continuous extrapolation, requires the development of new tools that allow "interventions" into the model: how would the state "X" evolve if event "Y" were to occur? To attain the goals set forth in the proposal, the investigators must tackle fundamental problems in the analysis of time series at the low level (defining a proper notion of "distance" between time series that respects their intrinsic dynamics), at the mid-level (defining clustering schemes for action segments), and at the high level (developing action semantics in an abductive framework).
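The interventional question "how would state X evolve if event Y were to occur?" can be made concrete with a toy structural causal model. The sketch below is purely illustrative and not taken from the project: the variables, mechanisms, and probabilities are all hypothetical assumptions. It shows how an intervention, written do(Y = true), replaces the mechanism that generates Y rather than merely conditioning on observing Y.

```python
# Illustrative sketch (hypothetical model, not the project's method):
# a minimal structural causal model with an intervention on Y.
import random

def sample(do_y=None):
    """Draw one sample from a toy SCM where state X depends on event Y."""
    # Mechanism for event Y: occurs with probability 0.3, unless we
    # intervene and force its value (do_y is not None).
    y = (random.random() < 0.3) if do_y is None else do_y
    # Mechanism for state X: shifts by 1.0 when Y occurs, plus noise.
    x = (1.0 if y else 0.0) + random.gauss(0, 0.1)
    return y, x

# Interventional query: the distribution of X under do(Y = true).
random.seed(0)
xs = [sample(do_y=True)[1] for _ in range(1000)]
print(sum(xs) / len(xs))  # mean of X when Y is forced to occur, near 1.0
```

The key design point is that `do_y` bypasses Y's own generating mechanism while leaving the mechanism for X untouched, which is exactly what distinguishes an intervention from ordinary observational conditioning.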
During this pilot one-year project, the investigators plan to explore the feasibility of using causal analysis to perform long-range temporal prediction of events and actions from visual data. Sample applications that would be impacted in case of success are broad, ranging from surveillance to environmental monitoring to driver assistance in transportation, with significant societal impact in reducing traffic accidents.
Temporal signals provide a wealth of information on events, actions, moods, style, and intentions, as illustrated by the pioneering experiments of the psychologist Johansson in the Seventies. Yet the way in which such "information" is encoded in the temporal signal, and how it can be inferred, is largely an open problem. This project has focused on inferring "temporal representations" from video signals that exploit the natural (causal) ordering and that can be used for analysis and decisions in a larger causal inference system. At the low level, the project has introduced bottom-up inference methods for local sparse statistics extracted from video. These, coupled with off-the-shelf classifiers, have been shown to be effective in detecting, localizing, and classifying complex actions in unstructured and complex scenes, such as snippets of movies. Sample actions or events include standing up, kissing, shaking hands, and other complex phenomena that entail interaction among multiple elements of the scene. These low-level descriptors represent the "symbolic building blocks" for further analysis.

During the course of this project, we have advanced the general theory of causal inference. We have established the foundations of Causal Modeling in Structural Equations and their Structural Representations, and furthered the theoretical infrastructure that enables reasoning with counterfactual statements. Although significant work remains to be done to emulate human reasoning on long temporal sequences, and on the inference of robust and discriminative statistics from raw sensory data, the project has advanced the state of the art on both counts. In addition to applications to the analysis of video data, the tools of causal analysis developed in the course of this project have been demonstrated on a range of problems from sociology to epidemiology, and will likely continue to play an important role in the analysis of big data.
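Reasoning with counterfactual statements in structural equation models follows a standard three-step recipe: abduction (infer the unobserved exogenous noise from what was observed), action (intervene on the variable of interest), and prediction (propagate through the unchanged mechanisms). The sketch below illustrates that recipe on a toy linear model; the mechanisms and numbers are hypothetical assumptions for illustration only, not results or models from the project.

```python
# Illustrative sketch (hypothetical toy model): evaluating a counterfactual
# via abduction, action, and prediction in a structural equation model.
# Mechanisms assumed: X := U_x  and  Y := 2*X + U_y.

def counterfactual_y(x_obs, y_obs, x_cf):
    """Given observed (x_obs, y_obs), what would Y have been if X were x_cf?"""
    # 1) Abduction: recover the exogenous noise consistent with the observation.
    u_y = y_obs - 2 * x_obs
    # 2) Action: intervene, replacing the mechanism for X with X := x_cf.
    x = x_cf
    # 3) Prediction: propagate through the unchanged mechanism for Y.
    return 2 * x + u_y

# Observed X = 1, Y = 3; counterfactually set X = 2.
print(counterfactual_y(x_obs=1, y_obs=3, x_cf=2))  # -> 5
```

The essential point is that the noise inferred in step 1 is held fixed through steps 2 and 3, which is what makes the query counterfactual ("for this particular observed instance") rather than merely interventional ("on average over all instances").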