This project investigates multimodal communication in humans and agents, focusing on two linguistic modalities, prosody and dialogue structure, which reflect major communicative events, and three non-linguistic modalities: eye gaze, facial expressions, and body posture. It aims to determine (1) which of the non-linguistic modalities align with events marked by prosody and dialogue structure, and with one another; (2) whether, and if so when, these modalities are observed by the interlocutor; and (3) whether appropriate use of these channels actually aids the interlocutor's comprehension. Answers to these questions should provide a better understanding of how communicative resources are used in discourse and can subsequently aid the development of more effective animated conversational agents.
Our model will be built from observations of controlled, elicited dialogue. To ensure robust information on the interplay of modalities, we control the basic conditions, genre, topic, and goals of otherwise unscripted dialogues. An ideal task for this purpose is the Map Task, in which dialogue participants work together to reproduce on one participant's map a route preprinted on the other's. The two maps, however, differ slightly, so that each participant holds information important to the other. This scenario elicits a highly interactive, incremental, and multimodal conversation.
In the proposed project, a basic corpus of Map Task dialogues will be collected while recording spoken language, posture, facial expressions, and eye gaze. Hand gestures, though discouraged by the task, will be recorded where they occur. The findings will be used in the Behavior Expression Animation Toolkit (BEAT) to augment the existing intelligent system AutoTutor. AutoTutor has been developed for a broad range of tutoring environments in which it coaches the student through an expected set of descriptions or explanations. The coach-follower roles of the Map Task scenario therefore make it straightforward to adapt the scenario for AutoTutor. In a series of usability experiments, participants' interactions with AutoTutor will be recorded. These experiments allow us to capture not only the participants' impressions, but also their efficiency (time to complete the map, latency to find named objects, and deviation of the instruction follower's drawn route from the instruction giver's model) and their communicative behavior (discourse structure, gaze, facial expressions, etc.).
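As one illustration of how the route-deviation measure might be operationalized, the Python sketch below computes a simple deviation score as the mean distance from points sampled on the follower's drawn route to the nearest point on the giver's model route. This is a hypothetical sketch, not part of the proposed toolchain; the function names, route representation, and toy coordinates are our own assumptions.

    # Hypothetical sketch: quantify the deviation of the instruction follower's
    # drawn route from the instruction giver's model route as the mean distance
    # from each sampled point on the drawn route to the model polyline.
    import math

    def point_to_segment_distance(p, a, b):
        """Distance from 2D point p to the line segment from a to b."""
        (px, py), (ax, ay), (bx, by) = p, a, b
        dx, dy = bx - ax, by - ay
        if dx == 0 and dy == 0:                 # degenerate segment
            return math.hypot(px - ax, py - ay)
        t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
        t = max(0.0, min(1.0, t))               # clamp projection to the segment
        return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

    def route_deviation(drawn, model):
        """Mean distance from points of the drawn route to the model route."""
        segments = list(zip(model, model[1:]))
        return sum(
            min(point_to_segment_distance(p, a, b) for a, b in segments)
            for p in drawn
        ) / len(drawn)

    # Toy example with arbitrary map coordinates (illustrative only):
    model_route = [(0, 0), (4, 0), (4, 3)]
    drawn_route = [(0, 0.5), (2, 0.4), (4, 0.2), (4.3, 2.9)]
    print(route_deviation(drawn_route, model_route))

Other deviation measures (e.g., the area enclosed between the two routes) would serve equally well; the point is only that the efficiency variables named above can be reduced to simple, automatically computable scores.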
The research resulting from this project will benefit a wide variety of fields, including cognitive science, computational linguistics, artificial intelligence, and computer science. In addition, the integration of these modalities into a working model will advance the development and use of intelligent conversational systems.