The goal of this research project is to develop methods that can accurately forecast what an observed scene will look like a few seconds from now. For instance, given frames of a video showing a traffic intersection, how can a machine anticipate the situation at the intersection a few seconds after the last observed frame? Humans have a remarkable ability to perform this task and use it constantly, e.g., to safely navigate an intersection, to collaborate effectively in a kitchen, and even when reading. Neuroscience hypothesizes that humans accomplish this by simulating the situation using mental models, which helps to converge quickly to a set of likely outcomes while ruling out implausible ones. In contrast, present-day computer vision, machine learning, and autonomous systems that address this challenge are in their infancy. While the last decade has shown tremendous progress on tasks that analyze observations, e.g., detecting visible objects and segmenting their contours, present-day systems struggle when reasoning about something that is not directly observed, e.g., the situation a few seconds from now. Reasoning about the unobserved is challenging because the number of possible outcomes grows quickly. Yet the ability to forecast is important for any system that must interact safely with its surroundings. To close this gap and lay the foundations for systems that anticipate, this project studies three aspects: 1) representations of the data that are suitable for forecasting, 2) properties of methods that permit accurate forecasting, and 3) what data is necessary to develop accurate forecasting models.
To address these three aspects, the project develops methods that learn to anticipate via visual simulation. Specifically, the methods use the observed data to retrieve a model of the scene, either explicitly or implicitly (Thrust 1). The methods also learn from data how this model is transformed to match likely futures, i.e., the systems learn to perform visual simulation. To this end, the methods disentangle geometry, dynamics, and relations between observed entities via latent variables (Thrust 2). Disentangling is important because geometry, dynamics, and relations influence futures differently. The amount and detail of the annotated data used to develop these methods affect the outcomes, and the project studies this dependence by collecting a novel dataset (Thrust 3). The representation, algorithm, and data innovations will be incorporated into undergraduate and graduate courses, as well as into an outreach program that teaches audience-centric presentation skills to undergraduate and graduate students, giving them an opportunity to learn to anticipate audience behavior (Thrust 4).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.