Summarization evaluation is an important and challenging problem in language processing. While text summarization research dates back more than 40 years, speech summarization has emerged only recently. A particularly challenging domain of speech data is multiparty meetings, which pose new challenges for evaluation metrics. Meetings differ from news articles in many dimensions, such as their dialog structure, speaker turns, topics, and conversational speaking style. This exploratory work focuses on the impact of these differences on the definition and evaluation of speech summarization.
First, multiple human extractive summaries are generated for the meeting data from different points of view, such as by topic, by speaker, or by discussion flow. These summaries enable an examination of how the meeting style affects the consistency of human-generated summaries. Second, the project evaluates how well automated measures, such as ROUGE scores, correlate with human judgments, and then develops metrics that take the characteristics of meetings into account.
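As a rough illustration of this kind of meta-evaluation (a sketch, not the project's actual implementation), the snippet below computes a simple ROUGE-1 recall for a few hypothetical system summaries against a single human reference and correlates the scores with made-up human ratings; the summaries, ratings, and the rouge_n_recall helper are all illustrative assumptions.

```python
from collections import Counter
from scipy.stats import spearmanr

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams recovered by the candidate."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split())
    ref = ngrams(reference.lower().split())
    if not ref:
        return 0.0
    return sum((cand & ref).values()) / sum(ref.values())

# Hypothetical data: three system summaries of one meeting, one human
# reference summary, and a human adequacy rating for each system summary.
reference = "the group decided the remote control will be blue with a rubber case"
system_summaries = [
    "the group decided the remote will be blue",
    "the remote control will have a rubber case",
    "marketing presented the budget for next quarter",
]
human_scores = [4.5, 3.5, 1.0]

# Meta-evaluation: how well does the automated metric rank summaries
# the way human judges do?
rouge = [rouge_n_recall(s, reference) for s in system_summaries]
rho, _ = spearmanr(rouge, human_scores)
print(f"ROUGE-1 recall per summary: {[round(r, 2) for r in rouge]}")
print(f"Spearman correlation with human judgments: {rho:.2f}")
```

In the actual project, such correlations would be computed over the meeting corpus using the multiple human summaries described above, rather than a single reference.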
The outcomes of this exploratory project will help the research community better understand the characteristics of the meeting domain, define the meeting summarization task more consistently, improve speech summarization evaluation metrics, and enable the broader use of speech summarization techniques in applications such as generating meeting minutes or lecture outlines. In addition, the different summaries created for the meeting corpus will be released to the research community. Experimental results will be disseminated through conference and journal publications.