This project addresses a key challenge in advancing the state of the art in cognitive assistant systems: building systems that can interact naturally with humans in order to help them perform everyday tasks more effectively. Such a system would help not only people with cognitive disabilities but also anyone performing a complex, unfamiliar task. The research focuses on structured activities of daily living that lend themselves to practical experimentation, such as meal preparation and other kitchen activities.
Specifically, the core focus of the research is activity recognition, i.e., systems that can identify the goals and individual actions a person is performing as they work on a task. The key innovations of this work are 1) that the activity models are learned from the user via intuitive, natural demonstration, and 2) that the system is able to reason over activity models in order to generalize and adapt them. In contrast, current practice requires specialized training supervised by researchers and supports no reasoning over the models. This advance is accomplished by integrating capabilities that are typically studied separately, including activity recognition, knowledge representation and reasoning, natural language understanding, and machine learning. The work represents a significant step toward the goal of building practical and flexible in-home automated assistants.
We developed methods that enable a computer system to learn to recognize complex human activities from a combination of language and visual perception. The work makes fundamental contributions to natural language processing and machine vision, and provides a foundation for creating computer systems that a person can interact with in an entirely natural manner, much as one would interact with a human assistant. Applications of the results include assistive technology that helps people with cognitive disabilities perform complex activities.

One part of our work created a robust framework for complex event recognition that is well suited to integrating information that varies widely in detail and granularity. The system takes in a variety of inputs, including objects and gestures recognized from RGB-D sensor data and descriptions of events extracted from recognized and parsed speech, and outputs a complete reconstruction of the agent's plan, explaining observed actions in terms of more complex activities and filling in unobserved but necessary events. We compared low-level events recognized using the vision subsystem alone with low-level events recognized using vision combined with the complex event structure (Table 1, Right). When recognizing highly structured activities, exploiting the structural and temporal information in the complex event structure yielded a significant improvement in recognition precision (from 64% to 87%) and recall (from 66% to 80%). A simplified sketch of this style of structured recognition appears below.

Another system we created, LegionAR, provides robust, deployable activity recognition by training online activity recognition systems with on-demand, real-time labels obtained through crowdsourcing. Identifying and labeling activities is time-consuming, and automatic approaches to recognizing activities in the real world remain brittle. We use activity labels collected from crowd workers to train an online activity recognition system to automatically recognize future occurrences of those activities.

A third part of our work developed a system that integrates activity recognition with interactive prompting to help a person complete a task. To demonstrate how the system generates prompts and asks questions, two volunteer actors (students) with no prior knowledge of how the system works were asked to walk through a series of scenarios in our lab. Each volunteer wore RFID bracelets on both hands and performed two tasks according to a simple schedule. The participants were asked to respond to all reminding prompts, i.e., to do exactly as the system instructed. The test scenarios were designed to include interruption cases in which a task is suspended before completion and then resumed later. In the first scenario, the participants followed the scheduled tasks sequentially without any interruptions: they started with the breakfast task, finished it, and then took medicine. In the second and third scenarios, the participants stopped breakfast halfway through to initiate another task, either taking medicine or watching TV, and then returned to finish breakfast (after a system prompt). The general criterion for judging the success of the system was that it behave properly, i.e., generate appropriate prompts in the right situations and ask questions when needed. In general, the system was able to successfully guide the participant through the schedule by instructing them to start, finish, or resume a task.
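To make the use of complex event structure concrete, the following minimal sketch illustrates how low-level detections from vision and speech can be matched against structured activity definitions, with temporal ordering enforced and missing (unobserved but required) steps reported. The activity templates, event names, and confidence threshold here are hypothetical placeholders, and the sequential-template matcher is a deliberate simplification of the richer plan recognition described above.

```python
# Simplified sketch: explaining low-level detections with complex-event
# templates. Event names and activity structures are illustrative only.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Observation:
    label: str         # low-level event from vision or parsed speech
    time: float        # timestamp in seconds
    confidence: float  # detector confidence

# Complex events modeled here as ordered sequences of required sub-events.
COMPLEX_EVENTS = {
    "make_tea":   ["fill_kettle", "boil_water", "pour_water", "add_teabag"],
    "take_pills": ["open_bottle", "take_pill", "close_bottle", "drink_water"],
}

def explain(observations: List[Observation], min_conf: float = 0.5):
    """Match observations against each complex-event template in temporal
    order, and report which required steps were never observed."""
    observed = [o for o in sorted(observations, key=lambda o: o.time)
                if o.confidence >= min_conf]
    results = {}
    for activity, steps in COMPLEX_EVENTS.items():
        matched, missing = [], []
        cursor = 0  # position in the observation stream (enforces ordering)
        for step in steps:
            hit: Optional[int] = next(
                (i for i in range(cursor, len(observed))
                 if observed[i].label == step), None)
            if hit is None:
                missing.append(step)   # unobserved but implied by structure
            else:
                matched.append(step)
                cursor = hit + 1
        results[activity] = {
            "matched": matched,
            "missing": missing,
            "coverage": len(matched) / len(steps),
        }
    return results

if __name__ == "__main__":
    obs = [Observation("fill_kettle", 1.0, 0.9),
           Observation("pour_water", 40.0, 0.7),
           Observation("add_teabag", 45.0, 0.8)]
    for activity, result in explain(obs).items():
        print(activity, result)
```

In this toy run, "make_tea" is the best explanation of the observations, with "boil_water" flagged as an unobserved but necessary step, which is the kind of gap-filling behavior the complex event structure provides.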
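The interactive prompting behavior exercised in the test scenarios can likewise be sketched as a small task-state tracker that notices interruptions and decides whether to prompt the user to start, resume, or finish a task. The task names, class, and method names below are illustrative placeholders, not the system's actual interface, and real sensing (e.g., from the RFID bracelets) would drive the event callbacks.

```python
# Simplified sketch of schedule-driven prompting with interruption handling.
from enum import Enum, auto

class TaskState(Enum):
    PENDING = auto()
    IN_PROGRESS = auto()
    SUSPENDED = auto()
    DONE = auto()

class PromptingAssistant:
    def __init__(self, schedule):
        # schedule: ordered list of task names, e.g. ["breakfast", "take_medicine"]
        self.schedule = schedule
        self.state = {task: TaskState.PENDING for task in schedule}

    def on_task_started(self, task):
        # Any scheduled task still in progress is marked suspended so the
        # assistant can later prompt the user to resume it. Unscheduled
        # activities (e.g., watching TV) still suspend the current task.
        for t, s in self.state.items():
            if t != task and s == TaskState.IN_PROGRESS:
                self.state[t] = TaskState.SUSPENDED
        if task in self.state:
            self.state[task] = TaskState.IN_PROGRESS

    def on_task_finished(self, task):
        if task in self.state:
            self.state[task] = TaskState.DONE

    def next_prompt(self):
        # Resuming a suspended task takes priority over starting a new one.
        for task in self.schedule:
            if self.state[task] == TaskState.SUSPENDED:
                return f"Please resume {task}."
        for task in self.schedule:
            if self.state[task] == TaskState.PENDING:
                return f"Please start {task}."
        return "All scheduled tasks are complete."

# Example mirroring the second scenario: breakfast is interrupted by taking
# medicine, then the assistant prompts the user to resume breakfast.
assistant = PromptingAssistant(["breakfast", "take_medicine"])
assistant.on_task_started("breakfast")
assistant.on_task_started("take_medicine")   # breakfast becomes SUSPENDED
assistant.on_task_finished("take_medicine")
print(assistant.next_prompt())               # -> "Please resume breakfast."
```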
Finally, we developed methods for automatically aligning videos of activities with written descriptions of those activities. Manually pairing each video segment or image frame with the corresponding sentence is tedious and does not scale to large collections of videos and associated parallel text. We developed a way to automatically align video frames with their corresponding natural language expressions without any direct supervision. The method also jointly learns the correspondences between nouns in the sentences and their referents in the physical environment.
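As a rough illustration of the alignment problem only (not the model actually used in this work), the sketch below assumes that video segments and sentences have already been embedded into a shared vector space and recovers a monotone segment-to-sentence alignment by dynamic programming. The embeddings, the cosine similarity, and all names are placeholder assumptions.

```python
# Simplified sketch: monotone alignment of video segments to sentences,
# given placeholder embeddings in a shared vector space.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def align(segment_vecs, sentence_vecs):
    """Return (segment_index, sentence_index) pairs for the highest-scoring
    alignment in which both sequences are traversed in order; segments and
    sentences may be left unmatched."""
    n, m = len(segment_vecs), len(sentence_vecs)
    score = np.full((n + 1, m + 1), -np.inf)
    score[0, 0] = 0.0
    back = {}
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:  # match segment i with sentence j
                s = score[i, j] + cosine(segment_vecs[i], sentence_vecs[j])
                if s > score[i + 1, j + 1]:
                    score[i + 1, j + 1] = s
                    back[(i + 1, j + 1)] = (i, j, True)
            if i < n and score[i, j] > score[i + 1, j]:   # skip a segment
                score[i + 1, j] = score[i, j]
                back[(i + 1, j)] = (i, j, False)
            if j < m and score[i, j] > score[i, j + 1]:   # skip a sentence
                score[i, j + 1] = score[i, j]
                back[(i, j + 1)] = (i, j, False)
    # Trace back from the end to recover the matched pairs.
    pairs, cell = [], (n, m)
    while cell in back:
        i, j, matched = back[cell]
        if matched:
            pairs.append((i, j))
        cell = (i, j)
    return list(reversed(pairs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    segments = rng.normal(size=(5, 16))   # stand-in video segment features
    sentences = rng.normal(size=(3, 16))  # stand-in sentence embeddings
    print(align(segments, sentences))
```

In the actual work the correspondence model is learned without direct supervision; this sketch only shows the alignment step one could run once some similarity between segments and sentences is available.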