Text summarization generates a concise summary of documents to help a user quickly digest information, and it has become increasingly important with the explosive growth of available information. Most existing summarization methods can only generate an unstructured summary consisting of a simple list of sentences. In many applications, however, a more structured, multi-faceted comparative summary is needed, in which sentences are grouped into multiple facets and compared across different views. For example, a summary of laptop opinions may group sentences into facets such as "battery life" and "memory" and, within each facet, separate sentences expressing positive opinions from those expressing negative opinions.
This project systematically studies this new summarization problem, called multi-faceted comparative summarization. The PI will develop general probabilistic approaches that can be applied to multiple instances of the problem in different domains. The basic idea is to use probabilistic mixture models to extract, from the set of text documents to be summarized, both the multiple facets and the multiple views of each facet. The extracted facets and views are then used to generate facet labels and to select sentences for each facet-view combination.
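The basic idea can be illustrated with a small sketch. The following is a minimal illustration under stated assumptions, not the project's method: it fits a mixture of unigram language models to sentence-level word counts with EM to recover facets, and it substitutes a tiny hand-written opinion lexicon for the view model, which the project proposes to extract with mixture models as well. All names here (fit_mixture, summarize, POSITIVE, NEGATIVE, STOPWORDS, and the sample sentences) are hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical lexicons: a stand-in for a learned view model (assumption).
POSITIVE = {"great", "long", "fast", "plenty"}
NEGATIVE = {"poor", "short", "slow", "limited"}
STOPWORDS = {"the", "is", "a", "and", "of", "for", "are", "too"}


def tokenize(sentence):
    words = (w.strip(".,!?").lower() for w in sentence.split())
    return [w for w in words if w and w not in STOPWORDS]


def fit_mixture(docs, k, iters=50, seed=0, smooth=0.01):
    """EM for a mixture of unigram models; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    n = len(docs)
    # resp[i][j] approximates p(facet j | sentence i); random soft start.
    resp = []
    for _ in range(n):
        row = [rng.random() + 0.1 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(iters):
        # M-step: re-estimate facet priors and smoothed word distributions.
        prior = [(smooth + sum(resp[i][j] for i in range(n))) / (n + k * smooth)
                 for j in range(k)]
        word_probs = []
        for j in range(k):
            weighted = {w: smooth for w in vocab}
            for i, c in enumerate(counts):
                for w, cnt in c.items():
                    weighted[w] += resp[i][j] * cnt
            total = sum(weighted.values())
            word_probs.append({w: v / total for w, v in weighted.items()})
        # E-step: recompute responsibilities in log space for stability.
        for i, c in enumerate(counts):
            logs = [math.log(prior[j])
                    + sum(cnt * math.log(word_probs[j][w]) for w, cnt in c.items())
                    for j in range(k)]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            z = sum(exps)
            resp[i] = [e / z for e in exps]
    return prior, word_probs, resp


def view_of(tokens):
    score = sum(w in POSITIVE for w in tokens) - sum(w in NEGATIVE for w in tokens)
    return "positive" if score >= 0 else "negative"


def summarize(sentences, k):
    docs = [tokenize(s) for s in sentences]
    _, word_probs, resp = fit_mixture(docs, k)
    opinion = POSITIVE | NEGATIVE
    # Facet label: the most probable non-opinion word in each facet.
    labels = [max((w for w in probs if w not in opinion), key=probs.get)
              for probs in word_probs]
    # For each (facet, view) cell, keep the sentence most strongly in that facet.
    best = {}
    for i, sentence in enumerate(sentences):
        j = max(range(k), key=lambda f: resp[i][f])
        key = (j, view_of(docs[i]))
        if key not in best or resp[i][j] > best[key][0]:
            best[key] = (resp[i][j], sentence)
    return labels, {key: s for key, (_, s) in best.items()}, resp


sentences = [
    "The battery life is great and lasts long.",
    "Battery life is poor and too short.",
    "The memory is plenty and fast.",
    "Memory is limited and slow.",
    "The battery charges fast.",
    "Memory upgrades are limited.",
]
labels, summary, resp = summarize(sentences, k=2)
for (facet, view), sentence in sorted(summary.items()):
    print(f"facet {facet} ({labels[facet]}), {view}: {sentence}")
```

With so few sentences, the facets that EM recovers depend on the random initialization, so the grouping should be read as an illustration of the pipeline (extract facets, label them, select one sentence per facet-view cell) rather than as a reliable result.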
The proposed research will open up a new research direction in text summarization. The developed methods can be directly applied to provide more useful product opinion retrieval services on the Web to the general public and to help scientists in multiple fields digest scientific literature more effectively, thus speeding up scientific discovery. All research resources and results will be disseminated to the research community through publications and downloadable software and data. The research results will also enhance the current curriculum in information retrieval, leading to improved education of the information technology workforce.