Natural language processing (NLP) is a key technology for the digital age. At the core of most NLP systems is a parser, a program that identifies the grammatical structure of sentences. Parsing is an essential prerequisite for language understanding. But despite significant progress in recent decades, accurate wide-coverage parsing across arbitrary genres and languages remains an unsolved problem. The broader impact of this CAREER project will be to advance the state of the art in NLP technology through the development of more accurate statistical parsing models.
Since language is highly ambiguous, parsers require a statistical model that assigns the highest probability to the correct structure of each sentence. The accuracy of current parsers is limited by the amount of labeled data available for training their models, and by the amount of information the models take into account. This CAREER project aims to advance parsing by developing novel methods of indirect supervision to overcome the lack of labeled training data, as well as new kinds of models that incorporate information about the prior linguistic context in which sentences appear. It employs Bayesian techniques, which give robust estimates and allow rich parametrization, and applies them to lexicalized grammars, which provide a compact representation of the syntactic properties of a language. This CAREER project will also train graduate students in natural language processing and develop materials that can be used to teach middle and high school students about NLP and to inspire them to pursue an education in computer science.