News stories or Web pages can contain a great deal of reused information. Different authors may each present different versions of a story or event based on the same sources, and the facts of an event may get recapitulated or restated each time it is presented. Sometimes such presentations have little in common with each other; at other times one may be a copy of the other with minor edits. Given a topic of interest, then, a sufficiently extensive archive could be used to identify when particular ideas or statements originated and to check their validity. The goal of this project is to develop techniques to identify alternative versions of the same information in order to reconstruct how information "flows" between documents.
The project involves investigating a range of approaches to detecting reuse at the level of sentences, passages, and documents. The research is evaluated on several types of corpora, such as news, Web crawls, and blogs, in order to explore the dimensions of reuse and information flow in different situations.
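As a concrete illustration of sentence-level reuse detection (a minimal sketch, not the project's actual method), one common baseline compares sentences by the overlap of their word n-gram "shingles" using Jaccard similarity; the function names and the 0.4 threshold below are illustrative assumptions.

```python
# Minimal sketch of sentence-level reuse detection via word 3-gram
# shingling and Jaccard similarity. Threshold and names are illustrative,
# not the project's actual technique.

def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a lowercased sentence."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two shingle sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def reused(sent1, sent2, threshold=0.4):
    """Flag a sentence pair as likely reuse when shingle overlap
    meets the (illustrative) threshold."""
    return jaccard(shingles(sent1), shingles(sent2)) >= threshold

original  = "The mayor announced the new budget plan on Tuesday morning"
edited    = "The mayor announced the new budget plan early on Tuesday"
unrelated = "Local teams prepare for the annual spring tournament"

print(reused(original, edited))     # True: heavy shingle overlap
print(reused(original, unrelated))  # False: no shared shingles
```

Shingling captures near-duplicate wording (a copy with minor edits) but not paraphrase, which is why approaches at the passage and document level are also needed.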
The research and its outcomes will have a significant impact on the design of tools that can be used to validate and assess information that comes from sources of differing reliability. Such a tool would be valuable in many applications in education, scientific research, and national security. The results of the research will be published in papers and made accessible via the project Web site (http://ciir.cs.umass.edu/research/textreuse.html), and source code will be distributed through the popular Lemur toolkit (www.lemurproject.org/).