To facilitate linguistic communication, natural language processing (NLP) technologies must be applicable across different languages and domains. A limitation of many NLP systems is that they perform poorly on inputs that diverge from their training examples. The objective of this CAREER project is to increase the robustness and coverage of a fundamental NLP component: the syntactic parser.
Specifically, this project explores adaptation methods that extend a standard English parser to process different domains (e.g., scientific literature, emails) and different languages (e.g., Chinese). Three types of correspondences are considered. First, if coarse-level correspondences are explicit in the data (e.g., bilingual documents), finer-grained correspondences at the word or phrase level may be inferred, and semi-supervised learning may be used to transfer domain knowledge across the inferred correspondences. Second, if the correspondences are inexact (e.g., multiple translations of varying quality), the mismatched portions may be identified and transformed to achieve a closer mapping. Third, if the correspondences are indirect, methods for inducing correspondences from non-parallel corpora may be appropriate.
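The first scenario above — inferring word-level correspondences from coarsely aligned bilingual documents — is classically handled by expectation-maximization over co-occurrence statistics. The sketch below is a minimal IBM Model 1 trainer on a toy sentence-aligned bitext; the data, function name, and iteration count are illustrative assumptions, not the project's actual method:

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=10):
    """Infer word-level translation weights t(f, e) from
    sentence-aligned bitext via EM (IBM Model 1)."""
    t = defaultdict(lambda: 1.0)  # uniform start; normalized in the E-step
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)
        for src, tgt in bitext:
            for f in src:
                z = sum(t[(f, e)] for e in tgt)  # normalizer for source word f
                for e in tgt:
                    c = t[(f, e)] / z  # expected alignment probability
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():  # M-step: re-estimate t
            t[(f, e)] = c / total[e]
    return t

# Toy sentence-aligned "bilingual documents" (German-English).
bitext = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
t = ibm_model1(bitext)
best = max(["the", "house", "book", "a"], key=lambda e: t.get(("das", e), 0.0))
print(best)  # "das" aligns most strongly with "the"
```

After a few EM iterations the ambiguity resolves: "das" co-occurs with several English words, but only "the" co-occurs with it consistently, so its weight dominates.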
Parser adaptation stands to increase the range of NLP applications; examples include data mining from medical documents and automatic tutoring for non-English speakers. Because the project brings together several strands of research, it offers ample research opportunities to graduate and undergraduate students. The algorithmic aspects encourage a synthesis of ideas from semi-supervised learning, relational data modeling, grammar induction, and machine translation; the empirical aspects give students an arena in which to hone sound scientific methodology.
This project tackles the challenge of using computers to automatically parse written sentences. While advances in natural language processing (NLP) have produced high-quality parsers for standard newspaper English, computer systems face substantial obstacles when processing sentences from more diverse sources: the writings of English-as-a-Second-Language (ESL) learners, informal posts on social media, specialized domains such as legal documents and scientific literature, and computer-generated sentences such as the outputs of machine translation (MT) systems. The main problem is that these sentences diverge significantly from the example sentences used to develop the system. This project addresses the problem by framing it as a machine translation task: how should we model the relationship between a wider population of English expressions and "newspaper" sentences? The investigation has focused on two domains: the writings of ESL learners, which often contain grammar mistakes and usage errors, and the outputs of MT systems, which often contain a wider variety of garbled phrases and disfluencies than ESL writing. The work has led to three main outcomes.

The Chinese Room System: A visualization interface has been developed to bridge between an imperfect MT system and a human user who cannot read the source language. Through a visual display of various linguistic resources, the system helps the user correct and improve MT outputs. (The system currently supports Chinese-English and Arabic-English, but it can be extended to arbitrary language pairs.) In addition to serving users who wish to understand a document in a foreign language they cannot read, the Chinese Room System is also an instrument for collecting and analyzing the relationship between garbled MT outputs and the intended translation expressed in well-formed English.
Computational Models of Common Writing Problems of ESL Learners: Several systems have been developed to identify common errors made by non-native learners of English. One is a predictor of preposition usage; it differs from existing systems in that its development requires fewer training examples. Another is a predictor of redundant words and phrases (e.g., in the phrase "ruby red slippers," the word "red" is redundant); leveraging translations into other languages, the system identifies words whose meanings are already conveyed by other words. A third is a model of correction detection: given an ill-formed sentence and its corresponding revision, detect the locations and causes of the mistakes. By modeling the relationship between neighboring individual changes, the system segments corrections more accurately and improves upon a previous system's published results.

Computational Models of Relationships between MT and the Writings of ESL Learners: Current machine translation systems and second-language learners share some similarities: both have an imperfect grasp of the target language, and some aspects of a learner's native language (or an MT system's source language) may carry over and appear as disfluent artifacts in their English expressions. In this project, several computational models (as well as relevant data sets) have been designed and developed to better understand the relationships between the types of mistakes made by MT systems and those made by ESL learners. First, a quasi-synchronous grammar model, a mathematical model previously used for MT, has been adapted for "translating" problematic ESL sentences into their corrections. Mathematically, the model treats "ESL English" as a foreign language like French or Chinese.
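To make the "ESL English as a foreign language" framing concrete, one classic way to "translate" an ill-formed sentence into well-formed English is a noisy-channel corrector: generate candidates by substituting words from a confusion set, then let a language model of well-formed English pick the best candidate. This toy sketch (not the project's quasi-synchronous grammar model; the corpus, confusion set, and smoothing are illustrative assumptions) corrects a preposition error:

```python
import math
from itertools import product
from collections import Counter

# Toy "well-formed English" corpus for a bigram language model.
CORPUS = [
    "she is interested in music",
    "he is good at math",
    "they arrived in the city",
    "she is interested in art",
]

def bigram_counts(corpus):
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

UNI, BI = bigram_counts(CORPUS)

def score(tokens, alpha=0.1):
    """Add-alpha smoothed bigram log-probability of a candidate."""
    toks = ["<s>"] + tokens + ["</s>"]
    v = len(UNI)
    return sum(math.log((BI[(a, b)] + alpha) / (UNI[a] + alpha * v))
               for a, b in zip(toks, toks[1:]))

# Confusion set: prepositions that ESL writers commonly swap.
CONFUSION = {"in": ["in", "at", "on"], "at": ["in", "at", "on"],
             "on": ["in", "at", "on"]}

def correct(sentence):
    """Noisy-channel style correction: substitute prepositions from the
    confusion set and keep the candidate the language model likes best."""
    toks = sentence.split()
    options = [CONFUSION.get(tok, [tok]) for tok in toks]
    return " ".join(max(product(*options), key=lambda c: score(list(c))))

print(correct("she is interested at music"))  # → she is interested in music
```

The design choice mirrors the text's claim: ESL mistakes are patterned (here, a small confusion set), so a systematic channel model plus an English language model can recover the intended sentence.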
The implication is that mistakes made by ESL learners are not haphazard; there are common, regularly occurring patterns, so the mistakes can be seen as obeying the grammar rules of a different language. Second, problematic MT outputs (in English) have been analyzed and annotated using the rubrics of ESL corrections, and a model of correction detection has been developed that identifies the locations and causes of the mistakes. This model is similar to the ESL correction model discussed above, and it can be seen as asking the earlier question in reverse: are MT systems like low-proficiency ESL learners? Empirical results suggest that this is not the case for current MT systems. Although the two share some types of mistakes (e.g., preposition, article, and punctuation choices), MT systems have a much poorer global model (i.e., at the whole-sentence level) but a stronger local model (i.e., within a short phrase) of English than ESL learners.

In summary, this project has examined two domains of non-standard writing: automatically generated MT outputs and the writings of ESL learners. Both contain errors and disfluencies that make them difficult to parse, and computational models have been developed to identify and correct these problems.
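The correction-detection task that recurs in both settings above — locating and segmenting the edits between a garbled sentence and its revision — can be sketched with a token-level diff. This is a minimal illustration, not the project's model; the gap-merging heuristic only loosely echoes the idea of treating neighboring changes jointly rather than as independent mistakes:

```python
import difflib

def detect_corrections(original, revised, max_gap=1):
    """Locate correction spans between an ill-formed sentence and its
    revision.  Edits separated by at most `max_gap` unchanged tokens
    are merged into a single correction, so that neighboring changes
    are treated as one mistake rather than independent ones."""
    src, tgt = original.split(), revised.split()
    # Keep only the opcodes that represent actual edits.
    ops = [op for op in difflib.SequenceMatcher(a=src, b=tgt).get_opcodes()
           if op[0] != "equal"]
    merged = []
    for _tag, i1, i2, j1, j2 in ops:
        if merged and i1 - merged[-1][1] <= max_gap:
            merged[-1][1], merged[-1][3] = i2, j2  # absorb into previous span
        else:
            merged.append([i1, i2, j1, j2])
    return [(" ".join(src[a:b]), " ".join(tgt[c:d]))
            for a, b, c, d in merged]

print(detect_corrections("He go to school in yesterday",
                         "He went to school yesterday"))
# → [('go', 'went'), ('in', '')]
```

Each output pair is one detected correction: a replaced span ("go" → "went") or a deletion ("in" → ""), which a downstream classifier could then label with a cause (verb tense, spurious preposition, and so on).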