This project investigates a new dynamic, adaptable approach for constructing evaluation metrics and methods for various NLP applications, with a specific focus on Machine Translation and Summarization. The main objective is to establish a general framework that can easily support constructing automatic evaluation metrics for a variety of specific NLP tasks, based on a variety of quality criteria. For a given NLP task (e.g. Machine Translation) and a given set of established quality criteria, the framework supports learning a set of parameters that results in an "instance" evaluation metric with optimal correlation with the desired quality criteria. Training a new "instance" metric for a different task, or for a different set of quality criteria, can be accomplished by a fast training procedure using available training data consisting of system-produced outputs, human-quality reference outputs for the same source data, and human quality judgments of the system outputs.
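To make the training procedure concrete, the sketch below shows one way an "instance" metric could be fit to human judgments: a toy parameterized scoring function is tuned by grid search so that its scores correlate maximally with the judgments. The metric form, the parameter names (alpha, beta), and the grid-search strategy are illustrative assumptions, not the project's actual learning algorithm.

```python
# Minimal sketch (hypothetical names): fit a parameterized "instance" metric
# to human judgments by maximizing Pearson correlation over a parameter grid.
import itertools
import numpy as np

def instance_metric(hypothesis, reference, alpha, beta):
    """Toy parameterized metric: weighted blend of unigram precision and recall."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = len(set(hyp) & set(ref))
    precision = overlap / max(len(hyp), 1)
    recall = overlap / max(len(ref), 1)
    return alpha * precision + beta * recall

def train_instance_metric(system_outputs, references, human_judgments, grid):
    """Grid search: return the parameter setting whose metric scores
    correlate best (Pearson) with the human quality judgments."""
    judgments = np.asarray(human_judgments, dtype=float)
    best_params, best_corr = None, -np.inf
    for alpha, beta in itertools.product(grid, grid):
        scores = np.array([instance_metric(h, r, alpha, beta)
                           for h, r in zip(system_outputs, references)])
        corr = np.corrcoef(scores, judgments)[0, 1]
        if np.isfinite(corr) and corr > best_corr:
            best_params, best_corr = (alpha, beta), corr
    return best_params, best_corr

# Tiny illustrative training set: system outputs, references, and judgments.
outputs = ["the cat sat on mat", "a dog ran fast", "rain falls today"]
refs    = ["the cat sat on the mat", "the dog ran quickly", "it is raining today"]
judged  = [0.8, 0.5, 0.3]
params, corr = train_instance_metric(outputs, refs, judged, grid=[0.0, 0.5, 1.0])
```

In practice the same loop would be rerun with a different task's data or a different set of quality criteria to instantiate a new metric.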
A key innovation of the framework is its ability to use the set of all overlapping sub-sequences (also known as "skip n-grams") of the two strings being compared. The skip n-gram matching process is augmented with a powerful word-to-word alignment algorithm that pre-constrains the set of skip n-gram matches while allowing matches between words that are morphological variants, synonyms, or otherwise related. Furthermore, our framework uses a well-founded parameterized model for establishing the weight, or significance, that should be assigned to each detected overlapping subsequence, and can calculate these weights as an integral part of detecting the matching skip n-grams. The result is an extremely powerful "metric-producing" framework. Under this framework, the project will produce (instantiate) specific metrics for machine translation, summarization, and other NLP tasks that are more robust and sensitive and exhibit high levels of correlation with human judgments. The project also explores methods for reducing the reliance of our resulting metrics on human judgments. The resulting framework and task-specific trained metrics will be made publicly available to the NLP research community. The impact of automatic evaluation methods extends beyond providing a flexible performance-measuring mechanism for NLP tasks. We expect our work to enable customizing evaluation metrics for specific tasks within a variety of cross-lingual applications, which should significantly boost the overall performance of these applications.
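As a concrete illustration of weighted skip n-gram matching, the sketch below scores the overlap of skip bigrams between two strings, with each bigram weighted by a factor that decays with the gap between its two words. The decay-based weighting, the exact-match requirement, and the parameter values are simplifying assumptions for illustration; in the proposed framework the weights are learned, and the matching is pre-constrained by the word-to-word alignment step (with synonym and morphological matching) that is omitted here.

```python
# Minimal sketch of weighted skip-bigram overlap between two strings.
# Assumptions: exact word matches only, weights = decay**gap (not learned).
from collections import Counter

def skip_bigrams(tokens, max_gap=4, decay=0.5):
    """Return a Counter of skip bigrams, each weighted by decay**gap,
    where gap is the number of skipped words between the pair."""
    weighted = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            gap = j - i - 1
            weighted[(tokens[i], tokens[j])] += decay ** gap
    return weighted

def skip_bigram_overlap(hypothesis, reference, max_gap=4, decay=0.5):
    """Weighted skip-bigram overlap, normalized by the reference's total weight."""
    hyp = skip_bigrams(hypothesis.split(), max_gap, decay)
    ref = skip_bigrams(reference.split(), max_gap, decay)
    matched = sum(min(hyp[k], ref[k]) for k in hyp.keys() & ref.keys())
    total = sum(ref.values())
    return matched / total if total else 0.0

print(skip_bigram_overlap("the cat sat on the mat", "the cat is on the mat"))
```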