Humans naturally use dialog and gestures to discuss complex phenomena and plans, especially when they refer to physical aspects of the environment while they communicate with each other. Existing robot vision systems can sense people and the environment, but are limited in their ability to detect the detailed conversational cues people often rely upon (such as head pose, eye gaze, and body gestures), and to exploit those cues in multimodal conversational dialog. Recent advances in computer vision have made it possible to track such detailed cues. Robots can use passive measures to sense the presence of people, estimate their focus of attention and body pose, and recognize human gestures and identify physical references. But they have had limited means of integrating such information into models of natural language; heretofore, they have used dialog models for specific domains and/or were limited to one-on-one interaction. Separately, recent advances in natural language processing have led to dialog models that can track relatively free-form conversation among multiple participants and extract meaningful semantics about people's intentions and actions. These multi-party dialog models have been used in meeting environments and other domains. In this project, the PI and his team will fuse these two lines of research to achieve a perceptually situated, natural conversation model that robots can use to interact multimodally with people. They will develop a reasonably generic dialog model that allows a situated agent to track the dialog around it, know when it is being addressed, and take direction from a human operator regarding where it should find or place various objects, what it should look for in the environment, and which individuals it should attend to, follow, or obey. Project outcomes will extend existing dialog management techniques to a more general theory of interaction management, and will also extend current state-of-the-art vision research to recognize the subtleties of nonverbal conversational cues, and will produce methods for integrating those cues with ongoing dialog interpretation and interaction with the world.

Broader Impacts: Many positive societal impacts will derive from this research. Ultimately, development of effective human-robot interfaces will allow greater deployment of robots to perform dangerous tasks that humans would otherwise have to perform, and will also enable greater use of robots for service tasks in domestic environments. As part of the project, the PI will conduct outreach efforts to engage secondary-school students in the hope that exposure to HRI research may increase their interest in science and engineering studies.

Project Report

Our findings are in three main areas: multimodal pronoun reference for human-robot interaction, multimodal co-training for audio-visual gesture recognition, and grounded multimodal learning for human-robot-object interaction; they are described in turn in the following sections.

Automatic scene understanding, or the ability to categorize places and objects in the immediate environment, is important for many HRI applications, including mobile robotic assistants for the elderly and the disabled. Category-level recognition allows the system to recognize a class of objects, as opposed to just single instances, and is particularly useful. One approach to automatic scene understanding is image-based recognition, which involves training a classifier for each scene or object category offline, using manually labeled images. However, to date, image-based category recognition has reached only a fraction of human performance, especially in terms of the variety of recognized categories, partly due to the lack of labeled data. Accurate and efficient off-the-shelf recognizers are available for only a handful of objects, such as faces and cars. Thus, to enable an assistant robot, or a similar system, to accurately recognize objects in the environment, the user currently would have to collect and manually annotate sample images of those objects. Alternatively, a robot can learn about its surroundings from interactions with the user.

In this work, our goal is to enable human-computer interaction systems to recognize a variety of object categories in realistic environments without requiring manual annotation of each category by the user. We propose a new approach combining speech and visual object category recognition. The approach consists of two parts: disambiguation and adaptation. Disambiguation means that, instead of relying completely on one modality, we use generic visual object classifiers to help the speech recognizer obtain the correct object label. The goal of adaptation is, given a labeled generic database and a small number of labeled adaptation examples, to build the optimal visual category classifiers for that particular environment. Speech and language models can also be adapted to the particular speaker. Since image databases are limited in the number of categories, another goal of adaptation can be to learn out-of-vocabulary objects, i.e., objects whose images and referring words are not in the generic labeled databases. This can be achieved by exploiting unlabeled images available on the web. For example, we can match the reference image of the unknown object to a subset of images returned by an image search for the top most likely words for that object.

We also address the problem of unsupervised learning of object classifiers for visually polysemous words. Visual polysemy means that a word has several dictionary senses that are visually distinct. Web images are a rich and free resource compared to traditional human-labeled object datasets. Potential training data for arbitrary objects can be easily obtained from image search engines like Yahoo or Google. The drawback is that multiple word meanings often lead to mixed results, especially for polysemous words. For example, the query "mouse" returns multiple senses on the first page of results: "computer" mouse, "animal" mouse, and "Mickey Mouse". The dataset thus obtained suffers from low precision of any particular visual sense. In this effort we have also collected a 3-D Object dataset (B3DO), an ongoing collection effort using the Kinect sensor in domestic environments.
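To make the disambiguation step concrete, the sketch below fuses a speech recognizer's n-best word hypotheses with generic visual classifier scores to rank candidate object labels. This is a minimal illustration under assumed inputs: the function name, hypothesis lists, confidence values, and the fusion weight alpha are hypothetical and do not come from the project's actual system.

```python
# Illustrative sketch (not the project's actual code): fusing an ASR n-best
# list with generic visual classifier scores to disambiguate an object label.
# All labels, scores, and the fusion weight are hypothetical.

import math

def fuse_label_scores(asr_hypotheses, visual_scores, alpha=0.6):
    """Combine modalities in log space; alpha weights speech against vision.

    asr_hypotheses: dict mapping candidate object words to ASR posteriors.
    visual_scores:  dict mapping object categories to classifier confidences.
    Returns candidate labels ranked by the fused score, best first.
    """
    fused = {}
    for label, p_speech in asr_hypotheses.items():
        p_vision = visual_scores.get(label, 1e-6)  # small floor for unseen labels
        fused[label] = alpha * math.log(p_speech) + (1 - alpha) * math.log(p_vision)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Example: the recognizer is unsure among acoustically similar words,
# but the visual classifier strongly favors one of them.
asr_nbest = {"bottle": 0.45, "model": 0.40, "metal": 0.15}
vision = {"bottle": 0.70, "model": 0.05, "metal": 0.02}

ranking = fuse_label_scores(asr_nbest, vision)
print(ranking[0][0])  # -> "bottle"
```

A log-linear combination of this kind lets either modality down-weight an acoustically or visually confusable label without requiring the speech and vision models to share training data.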
The English pronoun you is the second most frequent word in unrestricted conversation (after I and right before it). Despite this, with the exception of Gupta et al. (2007b; 2007a), its resolution has received very little attention in the literature. This is perhaps not surprising, since the vast amount of work on anaphora and reference resolution has focused on text or discourse, mediums where second-person deixis is perhaps not as prominent as it is in dialogue. For spoken dialogue pronoun resolution modules, however, resolving you is an essential task that has an important impact on the capabilities of dialogue summarization systems. Besides being important for computational implementations, resolving you is also an interesting and challenging research problem.

When an utterance contains a singular referential you, resolving the you amounts to identifying the individual to whom the utterance is addressed. This is trivial in two-person dialogue, since the current listener is always the addressee, but in conversations with multiple participants it is a complex problem where different kinds of linguistic and visual information play important roles (Jovanovic, 2007). One of the issues we investigate is how this applies to the more concrete problem of resolving the second-person pronoun you. We approach this issue as a three-step problem. Using the AMI Meeting Corpus (McCowan et al., 2005) of multi-party dialogues, we first discriminate between referential and generic uses of you. Then, within the referential uses, we distinguish between singular and plural, and finally, we resolve the singular referential instances by identifying the intended addressee. We use multimodal features: initially, we extract discourse features from manual transcriptions and use visual information derived from manual annotations, but then we move to a fully automatic approach, using 1-best transcriptions produced by an automatic speech recognizer (ASR) and visual features automatically extracted from raw video.
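The three-step cascade described above can be summarized schematically as follows. This is a minimal sketch with hypothetical feature names and hand-written placeholder rules standing in for each stage's decision; the actual system trains classifiers on AMI annotations rather than using rules like these.

```python
# Illustrative sketch of the referential -> singular -> addressee cascade
# (hypothetical features and placeholder rules, not the project's models).

from dataclasses import dataclass
from typing import Optional

@dataclass
class YouInstance:
    discourse_features: dict   # e.g. surrounding words, sentence mood
    visual_features: dict      # e.g. speaker gaze proportions per listener

def is_referential(inst: YouInstance) -> bool:
    # Stage 1: generic vs. referential "you" (placeholder rule).
    return inst.discourse_features.get("is_question", False) or \
           inst.discourse_features.get("mentions_name", False)

def is_singular(inst: YouInstance) -> bool:
    # Stage 2: singular vs. plural referential "you" (placeholder rule).
    return not inst.discourse_features.get("contains_you_guys", False)

def resolve_addressee(inst: YouInstance) -> Optional[str]:
    # Stage 3: pick the listener the speaker gazes at most (placeholder rule).
    gaze = inst.visual_features.get("gaze_proportion_per_listener", {})
    return max(gaze, key=gaze.get) if gaze else None

def resolve_you(inst: YouInstance) -> Optional[str]:
    """Run the cascade; returns an addressee ID only for singular
    referential uses, otherwise None."""
    if not is_referential(inst):
        return None
    if not is_singular(inst):
        return None
    return resolve_addressee(inst)

# Example instance with made-up features:
inst = YouInstance(
    discourse_features={"is_question": True, "contains_you_guys": False},
    visual_features={"gaze_proportion_per_listener": {"B": 0.6, "C": 0.3, "D": 0.1}},
)
print(resolve_you(inst))  # -> "B"
```

Structuring the problem as a cascade means each stage only sees the instances the previous stage passes through, mirroring the referential/singular/addressee decomposition used in the experiments.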

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
0819984
Program Officer
Ephraim P. Glinert
Project Start
Project End
Budget Start
2008-01-01
Budget End
2012-07-31
Support Year
Fiscal Year
2008
Total Cost
$846,987
Indirect Cost
Name
International Computer Science Institute
Department
Type
DUNS #
City
Berkeley
State
CA
Country
United States
Zip Code
94704