Annotated data sets are a necessity for data-driven speech and language processing approaches. Many speech and natural language processing tasks, such as automatic speech recognition, question answering, machine translation, part-of-speech tagging, parsing, named entity extraction, and semantic role labeling, have benefited significantly from shared tasks that benchmark algorithms and compare results on shared data sets. The goal of this project is to create a goal-oriented, mixed-initiative human-machine spoken dialog system for conference services that accepts naturally spoken input, and to release the spoken dialogs collected from this system publicly for research purposes. Users can call a phone number and learn about conference paper submission, the program, the venue, visa requirements, accommodation options and costs, etc.
We take an iterative approach: the spoken dialog system (SDS) is first deployed for the IEEE SLT workshop, to be held in December 2006, and all of its components can then be improved using the data collected from this deployment. Further data can be collected with the improved system at other conferences and workshops.
Given that data-driven approaches are becoming increasingly popular for many speech and language processing applications, we believe that such a corpus, annotated with system prompts, user utterance transcriptions, user intentions, overall task success, etc., would be a useful resource for dialog management, spoken language understanding, automatic speech recognition, and other related tasks. In the future, these annotations can also be extended with user emotion tags, disfluencies, syntactic and semantic parses, etc.
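
To make the annotation layers listed above more concrete, the following is a minimal sketch of how one annotated dialog might be represented in Python. It is an illustration under assumptions rather than the format of the corpus: the class names (Turn, Dialog), the field names, the intent label "ask_venue", and the example prompt and utterance are all hypothetical and do not come from the collected data.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    """One system/user exchange (hypothetical field names)."""
    system_prompt: str
    user_transcription: str
    user_intent: str
    # Placeholder for future layers such as emotion tags or disfluencies.
    extra_annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Dialog:
    """One call, with a dialog-level task success label."""
    dialog_id: str
    turns: List[Turn]
    task_success: bool


def intent_counts(dialogs: List[Dialog]) -> Counter:
    """Count how often each user intent label occurs across dialogs."""
    return Counter(t.user_intent for d in dialogs for t in d.turns)


def task_success_rate(dialogs: List[Dialog]) -> float:
    """Fraction of dialogs marked successful (0.0 for an empty list)."""
    if not dialogs:
        return 0.0
    return sum(d.task_success for d in dialogs) / len(dialogs)


# Invented example; not taken from the corpus.
example = Dialog(
    dialog_id="demo-0001",
    turns=[
        Turn(
            system_prompt="How may I help you?",
            user_transcription="where is the workshop being held",
            user_intent="ask_venue",
        )
    ],
    task_success=True,
)
```

With records of this shape, simple corpus statistics such as intent_counts([example]) or task_success_rate([example]) can be computed directly, and further annotation layers can be attached per turn through extra_annotations without changing the core fields.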