Automated telephone dialog systems rely disproportionately on accurate transcription of the speech signal into readable text. When the system has low confidence in the automatic speech recognition (ASR) of a caller's utterance, a typical dialog strategy requires the system to repeat its best guess and ask for confirmation. This leads to unnatural interactions and dissatisfied callers. The current project focuses on developing better dialog strategies given current ASR capabilities by learning them automatically from contrasting corpora and comparing the results. Using a novel methodology, wizard ablation, the project collects simulated human-system dialogs that vary in controlled ways. The testbed application, an Automated Readers Advisor for New York City's Andrew Heiskell Talking Book and Braille Library, has appropriately limited complexity and potentially broad social benefit.
The motivation for wizard ablation is that research is needed into the problem-solving strategies humans would use if the human communication channel were restricted to be more like a machine's. In conventional Wizard-of-Oz studies, unsuspecting users interact with human wizards "behind the screen", thus providing data on the way humans interact with (what they believe to be) machines. Unlike a conventional wizard, an ablated wizard sees only the ASR output that the system's dialog manager receives as input. Under a further ablation condition, the wizard must choose actions from the repertoire that the system uses, but can combine them freely. The book-borrowing scenarios for the wizard interactions have been designed to be realistic, and Heiskell Library patrons participate in the studies. The collected dialogs will be made available to the community.
A spoken dialogue system is a program that conducts a dialogue with a person. Each time the machine receives an audio signal, it processes it in a variety of ways before it replies. Such a system confronts two activities that are extremely difficult for machines but easy for people. One is accurate speech recognition, the ability to identify spoken words from large vocabularies for a wide variety of speakers and subjects. The other is inference, the ability to recognize what people actually mean from what they say. This project acquires significant data resources and develops new techniques to support more flexible, more natural spoken dialogue systems.
When people speak with one another, they rely on context and their ability to generalize, even under noisy conditions. People are also often unaware of the difference between what they say and what they mean. When people speak with a machine, however, they want to make certain that it understands what they intend. This is why people tolerate repeated requests from the system to confirm its understanding and to repeat themselves.
Commercial spoken dialogue systems can achieve high reliability with relatively small vocabularies and deliberately rigid dialogue strategies. A commercial spoken dialogue system avoids the more fluid ways that people speak with one another. Such a system makes certain that it has understood the speaker’s meaning. A commercial system with moderately accurate speech recognition often asks users to repeat themselves. This approach fails, however, when overall recognition is poor. Research spoken dialogue systems are typically more ambitious: they contend with large vocabularies and learn their dialogue strategies from experience. A typical research system, while its strategies are less rigid, may be more successful with poor speech recognition. It must know in advance, however, exactly what it can say and what kinds of situations it may encounter.
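The rigid confirm-or-repeat behavior described above for commercial systems can be illustrated with a minimal sketch. It is not taken from any particular system: the `AsrResult` type, the `rigid_strategy` function, and both threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AsrResult:
    text: str          # the recognizer's best-guess transcription
    confidence: float  # assumed score in [0, 1]; real recognizers differ


def rigid_strategy(result, accept_at=0.8, confirm_at=0.4):
    """Pick one of three fixed actions from the confidence score alone.

    The thresholds are illustrative assumptions, not measured values.
    """
    if result.confidence >= accept_at:
        # High confidence: act on the transcription without asking.
        return ("accept", result.text)
    if result.confidence >= confirm_at:
        # Middling confidence: repeat the best guess and ask for confirmation.
        return ("confirm", f"Did you say '{result.text}'?")
    # Low confidence: discard the guess and ask the caller to repeat.
    return ("repeat", "Sorry, could you repeat that?")
```

A policy of this kind has nothing to fall back on when most utterances score below the lower threshold, which is the failure case noted above for poor overall recognition.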
This project collected detailed dialogues between people and several spoken dialogue systems, where the vocabulary is quite large and speech recognition is deliberately poor. The collection includes extensive information about what the system received and how it responded. This data provides a fertile environment for computer scientists and linguists to study such interactions for years to come. The new dialogue systems produced by this project learned an expanded repertoire of dialogue strategies from the collected dialogues. The new, learned dialogue strategies can address a task as well as people do despite poor speech recognition. To do so, they rely on a substantial repertoire of clarification strategies learned from people who were asked to interpret text during dialogue, text similar to the poorly recognized speech with which a dialogue system might contend. The resultant systems can confirm partial interpretations of what a person said and build on them, rather than ask the person to repeat. Successful use of these strategies requires rich ways to describe dialogue that draw on all phases of spoken language processing (e.g., recognition, parsing, comparison with a dictionary) and a supportive knowledge base. The learned dialogue strategies in our systems show that more varied actions and richer descriptions make a dialogue system more flexible, and they suggest ways to generalize these ideas to a broad variety of contexts. The results point to a clear need for mechanisms that generalize across applications and that adapt general dialogue strategies to specific contexts. This work represents ongoing research at Hunter College of The City University of New York and at Columbia University to support people and computers in dialogue about knowledge bases. Its potential uses are quite broad.
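As a rough illustration of confirming a partial interpretation instead of asking the person to repeat, the sketch below matches noisy recognized words against a small catalog of titles. It is hand-written for illustration and is not the project's learned strategy: the `choose_action` function, the word-overlap score, the `strong` threshold, and the example catalog are all assumptions.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def choose_action(asr_text, catalog, strong=0.6):
    """Choose a clarification action from word overlap with catalog titles."""
    heard = tokens(asr_text)
    scored = []
    for title in catalog:
        title_words = tokens(title)
        overlap = heard & title_words
        if overlap:
            # Fraction of the title's words that were recognized.
            scored.append((len(overlap) / len(title_words), title, overlap))
    if not scored:
        # Nothing usable was recognized: fall back to a repeat request.
        return ("ask_repeat", "Sorry, could you say that again?")
    scored.sort(key=lambda item: item[0], reverse=True)
    best_score, best_title, _ = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    if best_score >= strong and best_score > runner_up:
        # One clearly best candidate: confirm the full interpretation.
        return ("confirm_title", f"Is the title '{best_title}'?")
    # Otherwise confirm only the words that matched, and build from there.
    shared = " ".join(sorted(set.union(*(item[2] for item in scored))))
    return ("confirm_partial", f"I heard '{shared}'. Is that part of the title?")


# Illustrative catalog only; a real system would query its knowledge base.
catalog = ["War and Peace", "Peace Like a River", "Great Expectations"]
```

With this catalog, a misrecognized "war and piece" still leads to a confirmation of one title, while "um peace something" leads to a question about the single word that matched, so the caller's partial success is not thrown away.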