Large-scale corpora annotated at the sentence level have played a critical role in natural language research. They have enabled large-scale integration of statistical knowledge (derived from the corpora) with linguistic knowledge, leading to both technological and scientific applications such as information extraction, question answering, summarization, and machine translation. This approach is now being extended beyond the sentence level to the discourse level. Using the Penn Discourse Treebank (PDTB), a large-scale corpus annotated with discourse structure and its associated semantics, major new experimental work on discourse processing is being carried out. This work is leading to more coherent generated summaries and texts, to the extraction of complex relations from texts, and to foundational research relevant to language technology. It is also providing a deeper understanding of the relationship between sentence-level and discourse-level structures. In pursuit of these goals, a variety of tools for making productive use of the PDTB are also being developed. The research program is coupled with a strong educational program that trains researchers in the PDTB methodology so that similar resources can be developed for other languages that diverge substantially from English. This part of the research program has international components, including collaboration with research groups in the Czech Republic, India, and Finland. The international collaboration is funded by the NSF Office of International Science and Engineering.

Project Report

To benefit people, companies and governments, computers need to be able to automatically extract information from all kinds of texts, including news reports, scientific papers, product manuals, and blogs. Currently, computers can only do this effectively when the needed information is contained in a single clause. If the information is spread across multiple clauses or multiple sentences, then computers first need to recognize how those clauses or sentences relate to each other. The ways in which they can relate are called discourse relations or coherence relations. For example, the situation or event described in one sentence may be meant to explain the situation or event described in another. Or it may be presented as similar to, or a part of, another one. These relations may be signaled explicitly with a word like "because" or "similarly", or the reader may be expected to infer the relationship unaided.
Researchers believe that computers can be helped to recognize discourse relations automatically, using techniques from Machine Learning, provided that sufficient manually annotated data is made available. Beginning in 2006, the National Science Foundation awarded funds to researchers at the University of Pennsylvania to start creating such data. The resulting Penn Discourse TreeBank is now the world’s largest resource of manually annotated discourse relations. It contains over 40,000 relations annotated over the widely used 1-million-word Penn TreeBank corpus. Annotated with other linguistic information as well, the Penn TreeBank has become the most richly annotated corpus in the world and a "gold standard" for research and development in basic language technologies such as parsing, coreference resolution, word sense disambiguation, and temporal recognition (that is, identifying when and where an event has taken place).
With the release of the Penn Discourse TreeBank (PDTB) in 2008, researchers around the world now have both a way of inducing procedures for automatically recognizing how clauses and sentences relate to each other and a gold standard for assessing such procedures. They have even started to use these procedures to improve the quality and capability of advanced language technology systems for automated text summarization, question answering, text quality assessment, and statistical machine translation. But this wide range of work has also revealed annotation that is missing from the PDTB and that would further help these researchers, and it has stimulated a desire to have discourse relations annotated in text other than news reports, since style and register affect how text is written. These important augmentations are planned for the PDTB in the near future.
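As a rough illustration of what an explicitly signaled discourse relation looks like to a program, the following minimal Python sketch checks whether the second of two adjacent text spans opens with a known connective. It is a toy under stated assumptions, not the PDTB annotation scheme or any published system: the `CONNECTIVES` lexicon, its two coarse labels, the `Relation` class, and the `relate` function are all hypothetical names introduced here for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy lexicon mapping a connective to a coarse relation label.
# These labels are illustrative only and are NOT the PDTB sense inventory.
CONNECTIVES = {
    "because": "explanation",
    "similarly": "similarity",
}


@dataclass
class Relation:
    arg1: str                  # first text span
    arg2: str                  # second text span
    connective: Optional[str]  # None when no explicit signal was found
    label: Optional[str]       # None when the relation would have to be inferred


def relate(first: str, second: str) -> Relation:
    """Look for an explicit connective as the first word of the second span."""
    match = re.match(r"\W*(\w+)", second)
    word = match.group(1).lower() if match else None
    if word in CONNECTIVES:
        return Relation(first, second, word, CONNECTIVES[word])
    # No explicit signal: a reader (or a trained model) must infer the relation.
    return Relation(first, second, None, None)


explicit = relate("The flight was cancelled", "because a storm closed the airport.")
implicit = relate("The flight was cancelled.", "A storm had closed the airport.")
print(explicit.connective, explicit.label)
print(implicit.connective, implicit.label)
```

The second call returns no label even though a human reader sees the same explanation in both pairs. That gap is the reason manually annotated data is needed: relations without an explicit connective cannot be found by word lookup alone.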

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0705671
Program Officer: Tatiana D. Korelsky
Project Start:
Project End:
Budget Start: 2007-09-15
Budget End: 2012-08-31
Support Year:
Fiscal Year: 2007
Total Cost: $987,000
Indirect Cost:
Name: University of Pennsylvania
Department:
Type:
DUNS #:
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104