Natural Language Generation (NLG) systems aim to improve the accessibility and impact of information by automatically turning data into coherent, fluent text or speech. Developing high-quality NLG systems, however, remains a difficult and costly undertaking, in large part because bridging the gap between content planning and surface realization, a task known as sentence planning, continues to require extensive knowledge engineering.
This Early Grant for Exploratory Research investigates ways of bridging this gap by employing machine learning together with Discourse Combinatory Categorial Grammar (DCCG). Using a restaurant recommendation application as a proof-of-concept, the project explores methods of (1) adapting previous work on acquiring lexicalized grammar entries for semantic parsing to learn mappings from domain-general semantic dependency representations to application-specific representations of messages; (2) extending the approach to learn rules for combining messages; (3) employing the acquired resources to map content plans to disjunctive logical forms (DLFs), which compactly specify the range of possible realizations of the selected content; and (4) improving the efficiency of realizing DLFs with OpenCCG through grammar specialization.
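To make the idea of a disjunctive logical form concrete, the following toy sketch shows how a single compact input can specify many alternative realizations. The example strings, the `("OR", ...)` node convention, and the `expand` helper are illustrative inventions for this sketch only; OpenCCG's actual DLF encoding is based on hybrid logic dependency semantics and is considerably richer.

```python
from itertools import product

# Toy disjunctive logical form: a flat spec in which an ("OR", ...)
# node packs alternative ways of expressing the same content.
# Hypothetical example only -- not OpenCCG's actual DLF format.
dlf = [
    ("OR", "Sonia Rose has", "Sonia Rose offers"),
    ("OR", "excellent", "superb"),
    "food",
    ("OR", "and a lovely decor", ", with a lovely decor"),
]

def expand(dlf):
    """Enumerate all surface strings compactly specified by the DLF."""
    choice_sets = []
    for node in dlf:
        if isinstance(node, tuple) and node[0] == "OR":
            choice_sets.append(node[1:])
        else:
            choice_sets.append((node,))
    for choice in product(*choice_sets):
        # Naive joining; a real realizer scores candidates with a
        # statistical model rather than enumerating them all.
        yield " ".join(choice).replace(" ,", ",")

realizations = list(expand(dlf))
# 2 x 2 x 1 x 2 = 8 alternative realizations from one compact input
```

The point of the compact encoding is that the realizer can search over all the alternatives licensed by the content plan without the sentence planner having to enumerate them explicitly.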
The project will evaluate the success of these novel methods and assess the portability of the approach. By demonstrating methods for radically simplifying the construction of NLG systems, the project promises to transform the way NLG systems are built, from today's knowledge-intensive approach to one that relies primarily on assembling a parallel corpus of input-output pairs. Ultimately, it will facilitate the development of generation components in data-to-text systems as well as dialogue systems, including ones for the visually impaired.
Natural Language Generation (NLG) systems aim to improve the accessibility and impact of information by automatically turning data into coherent, fluent text or speech. Developing high-quality NLG systems, however, remains a difficult and costly undertaking, in large part because NLG systems employing traditional architectures have required extensive knowledge engineering. Moreover, while novel generation methods have recently been devised that allow end-to-end training of simple NLG systems, it remains to be seen whether the non-traditional architectures these systems employ can be scaled up to work with the richer discourse structures supported by traditional architectures. This Early Grant for Exploratory Research project investigated ways of reducing the knowledge engineering requirements for high-quality NLG by employing machine learning together with Discourse Combinatory Categorial Grammar (DCCG). Using a restaurant recommendation application as a proof-of-concept, we began by adapting the traditional architecture of the SPaRKy NLG system for use with DCCG. This enabled us to demonstrate, in our ENLG-13 paper, that the DCCG-based enhancements to the SPaRKy Restaurant Corpus proposed by Nakatsu and White (2010) can indeed improve the naturalness of texts in this domain, in particular by making better use of contrastive connectives and discourse adverbials. This study has thus shown for the first time, via a human evaluation, that non-tree-structured discourses can be generated that improve upon ones limited to discourse relations forming a tree. Consequently, we expect these results will encourage others to investigate techniques better suited to modeling the full richness of discourse relations and connectives found in naturally occurring texts.
Subsequently, in as yet unpublished research, we developed a proof-of-concept system for generating texts like those in the SPaRKy Restaurant Corpus using the OpenCCG broad-coverage English grammar together with automatically induced rules for lexicalization and clause combining. The rules are induced from target input-output pairs for the system, where the inputs represent the content to be expressed as a text plan, and the outputs are target texts that are automatically parsed into semantic dependency graphs. From just a few dozen input-output pairs, we found that all of the clause-combining rules used by the original SPaRKy system could be automatically induced, along with additional useful rules not employed in the original system. Moreover, unlike the related dependency-based text-rewriting rules employed by Siddharthan for text simplification, the induced clause-combining rules employ the same kind of shared-argument or shared-predication constraints as SPaRKy's hand-crafted rules, constraints that are important for accurately implementing operators for making texts concise. The project has thus taken an important step towards demonstrating for the first time that end-to-end NLG systems employing more traditional generation architectures can be effectively learned. Ultimately, we expect the approach to transform the way NLG systems are built, from today's knowledge-intensive approach to one that relies primarily on assembling a parallel corpus of input-output pairs, thereby facilitating the development of generation components in data-to-text systems as well as dialogue systems, including ones for the visually impaired.

The project also partially supported related ongoing work on broad-coverage surface realization with CCG, in which we investigated using dependency length minimization in discriminative realization. Comprehension and corpus studies have found that the tendency to minimize dependency length strongly influences constituent ordering choices.
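Dependency length here is simply the summed word-distance between each head and its dependents. The following minimal sketch, with a toy sentence and hand-assigned arcs (not the realizer's actual feature set or any real parser output), illustrates why heavy-NP shift reduces total dependency length and is therefore favored by such a preference:

```python
def total_dependency_length(words, deps):
    """Sum of |head - dependent| word distances over all dependency arcs.
    deps: list of (head_index, dependent_index) pairs into words."""
    return sum(abs(h - d) for h, d in deps)

# Two orderings of the same content; indices point at each phrase's head word.
# Placing the long object after the short PP shortens the arcs overall.
a = ["gave", "the", "award", "that", "everyone", "coveted", "to", "Mary"]
deps_a = [(0, 2), (0, 6)]   # gave->award, gave->to
b = ["gave", "to", "Mary", "the", "award", "that", "everyone", "coveted"]
deps_b = [(0, 4), (0, 1)]   # gave->award, gave->to

print(total_dependency_length(a, deps_a))  # 2 + 6 = 8
print(total_dependency_length(b, deps_b))  # 4 + 1 = 5 (shorter: preferred ordering)
```

In the discriminative realization setting, a score along these lines serves as one feature among many in the ranking model rather than a hard constraint.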
In our EMNLP-12 paper and in a chapter of Rajkumar's PhD thesis, we demonstrated that, with a state-of-the-art, comprehensive realization ranking model, dependency length minimization yields statistically significant improvements in corpus matches, leads to a better match with the distributional characteristics of sentence orderings, and significantly reduces the number of heavy/light ordering errors, many of which are egregious, as confirmed by a targeted human evaluation. Additionally, in our INLG-14 paper, we developed an initial method for inducing broad-coverage CCGs from dependencies, yielding competitive results on an enhanced version of the data from the first surface realization shared task. These explorations were initiated because accurate parsing is important for automatically acquiring generation resources from pairs of content plans and texts expressing the desired content, and they make it possible for the approach to work with a variety of parsers. Finally, the project also partially supported the release of a new version of the OpenCCG library, implementing the improvements to broad-coverage realization for dependency length minimization, and the release of the evaluation data for the ENLG-13 paper, including human ratings for the texts with and without the contrast-related enhancements.