A set of small-scale human-computer dialogue corpora ("micro-corpora") is being transcribed, annotated, and distributed to researchers in the area of dialogue. The raw data for these corpora already exists and was generated using two existing systems (DiSCoH from AT&T Labs and ConQuest from Carnegie Mellon University). The corpora are distributed with the goal of soliciting feedback on the corpus composition and annotation that can best support research in the field of human-machine dialogue interaction.
Feedback from this exercise is being collated and disseminated back to the community. Discussion of the outcome of this exercise at a workshop collocated with the 2007 HLT/NAACL conference provides the opportunity to develop a set of guidelines for the large-scale collection of such data. The collocation makes attendance convenient for many researchers in the dialogue community. In addition, support from NSF allows the workshop to ensure broad participation by researchers from both North America and international centers. Discussions and documents generated in the feedback and solicitation process provide a basis for the preparation of a community-supported proposal to the NSF CRI:CRD program. The availability of systematically collected and annotated data supports progress in human-computer dialogue research, which in turn enables the development of more sophisticated and broadly accessible technologies for information access.