Building large-scale annotated resources is crucial for basic and applied research in Natural Language Processing (NLP). The major long-term goal of this project is to substantially extend an existing unique resource, the Penn Discourse Treebank (PDTB), developed under prior NSF support, by augmenting it with a variety of new annotations and refining earlier ones. The proposed work involves carrying out some new annotations and pilot experiments to confirm the strategies for augmentation. A further goal is to bring together a cross section of potential users of this resource, both to acquaint them with its potential and to get their feedback to guide further augmentations. The PDTB has already been applied to the task of summarization. Future applications lie in information extraction, question generation, and machine translation, among other areas. On the theoretical side, the resource will prove useful for advancing the understanding of the discourse structure of language.
Document understanding and information extraction require understanding at both the sentence level and the discourse level. At the level of a single clause or sentence, verbs allow us to express relations between entities of various kinds. At the discourse level, we need to recognize the relations between the pieces of information expressed at the clause level. These discourse relations are often signaled by words such as "because", "however", "while", and many others. Even when they are not signaled explicitly, they can be inferred by the reader. For a computer to discover these relations, however, it must first learn them by observing instances of such relations in sufficiently large annotated corpora. The Penn Discourse Treebank (PDTB) is the largest extant annotated corpus of discourse relations. Following the extensive experimental work carried out with the PDTB in the NLP community, a major goal of this project is to carry out a retrospective analysis of the key features of the PDTB framework and annotations that have played an important role in the resulting research. This work will provide vital insight and understanding to researchers interested in further PDTB-style studies and annotation, including significant enhancements to the PDTB itself.