Natural language processing tasks like question answering or machine translation require sophisticated parsers: systems that extract grammatical dependency relations between words. But traditional supervised methods of training parsers rely on very expensive hand-labeled datasets, and generalize poorly to new words, grammar, languages, or genres of text. This project is pursuing three directions to significantly augment current unsupervised models of grammar induction. First is a new mathematical model of dependency parsing that draws on linguistic intuitions of constituency. Second is an architecture that jointly learns grammar and parts-of-speech, eliminating the need for supervised part-of-speech tags and hand-labeled datasets, and making grammar induction possible on a vast number of languages and genres. Third are ways to exploit new sources of data for unsupervised learning, including anchor text in web data, vastly expanding the scope of the problem from the small clean annotated treebanks commonly used in current work.
Language understanding by machine is a crucial tool for our nation: machine translation makes international web sites broadly accessible, sentiment analysis helps newspapers make politics more transparent, question answering systems help people find and share knowledge, and information extraction helps corporations and individuals draw insights from vast collections of documents. By improving the fundamental parsing technology that underlies each of these tasks, and by making it possible to parse new languages and genres that could not be parsed before, this project can greatly increase both the capability and the scope of these key applications.
Text on the web provides an important source of information for our day-to-day lives, whether from news, consumer information like credit card legal statements, or health-related and other scientific materials. But the amount of text on the web is overwhelming, making it difficult to find answers to complex questions or to solve problems whose answers may be available in text. For this reason, systems that can extract information from large bodies of text, whether for answering questions, summarizing, searching, or solving real problems, are an essential technology.

Many such technologies rely on parsing, the task of recognizing and marking up the grammatical structures in text. Parse structures make it possible to extract the meaning of sentences by showing how words relate to each other. The traditional way to build a parser is to have humans hand-label the parse structures for thousands of sentences and then train statistical models to predict those parses. But labeling sentences is very expensive (it requires highly trained labelers), very slow, and tends to work well only when the labeled "training" sentences come from text very similar to the text being parsed. For this reason, it is essential to solve the difficult problem of automatically inducing grammatical structure from raw, unannotated sentences. Systems that can perform this "grammar induction" task could learn how to parse from millions of sentences on the web, and then apply those parses to help us extract information.

This project developed a suite of new methods for automatically discovering grammatical structures in text. One set of methods shows how to exploit hints in the text that correlate with syntactic structures (structures such as noun phrases like "the three requirements" or verb phrases like "confirm your participation"). For example, text between certain kinds of punctuation, or text marked up in boldface or italics, often forms a syntactic phrase. Words that occur at the beginning of sentences are also likely to occur at the beginning of phrases, and words that occur at the end of sentences are likely to occur at the end of phrases. Capitalization can be another useful cue for identifying phrases (see the first sketch below).

In addition to exploiting these cues, which improve our ability to learn syntactic structure automatically from raw web text, we developed new methods for performing the induction itself. These algorithms model the assignment of a syntactic structure to a sentence as a problem of search: we search through all possible structures, eliminating the ones that do not fit the constraints of the sentence. But search, while an important method for many artificial intelligence tasks including parsing, is known to be very difficult for certain kinds of problems (non-convex optimization problems). We developed new ways to break the search problem down. One way to improve search is to build on previous partial solutions, taking a partial answer to a problem and modifying it slightly; we showed that forgetting parts of previous solutions can often yield a good place from which to restart a search (see the second sketch below). Another way to improve search is to combine multiple methods: we showed how to combine the results of previous grammar induction methods to produce an even better one.
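To make the first idea concrete, the following Python sketch shows how surface cues such as web markup and paired punctuation could be turned into candidate phrases. It is a minimal, hypothetical illustration: the regular expressions, the cue inventory, and the function names are simplifications for exposition, not the project's actual code.

```python
import re

# Hypothetical illustration: derive candidate phrases from surface cues in
# web text. The cue set and representation here are simplified assumptions,
# not the project's actual implementation.

MARKUP_SPAN = re.compile(r"<(b|i|a)[^>]*>(.*?)</\1>", re.IGNORECASE)
PAREN_SPAN = re.compile(r"\(([^()]+)\)")
QUOTE_SPAN = re.compile(r'"([^"]+)"')

def candidate_phrases(html_sentence):
    """Return spans of text that surface cues suggest are syntactic phrases."""
    candidates = []
    # Cue 1: text inside boldface, italics, or anchor (hyperlink) markup
    for match in MARKUP_SPAN.finditer(html_sentence):
        candidates.append(match.group(2).strip())
    # Cue 2: text between paired punctuation such as parentheses or quotes
    plain = re.sub(r"<[^>]+>", "", html_sentence)  # drop tags for this pass
    for pattern in (PAREN_SPAN, QUOTE_SPAN):
        for match in pattern.finditer(plain):
            candidates.append(match.group(1).strip())
    return candidates

print(candidate_phrases(
    'Please <b>confirm your participation</b> (before "the three requirements" expire).'
))
# -> ['confirm your participation',
#     'before "the three requirements" expire',
#     'the three requirements']
```

In a full induction system, candidates like these would act only as soft hints, nudging the learner toward parses whose phrases line up with the cues rather than dictating the structure outright.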
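The second family of improvements is easier to see with a small example. The Python sketch below illustrates the "restart by forgetting" idea on a toy search problem: a greedy search is repeatedly restarted from a version of its best answer in which some decisions have been deliberately forgotten and re-randomized. Everything in the sketch (the toy objective, the parameter values, the function names) is an illustrative assumption rather than the actual grammar-induction system, where the search is over parse structures and grammar parameters rather than bits.

```python
import random

# Hypothetical sketch of "restarting by forgetting": local search over a
# non-convex objective, where each restart keeps most of the best solution
# found so far and re-randomizes ("forgets") the rest. The toy objective is
# an arbitrary stand-in chosen only because it has an obvious local optimum.

def toy_objective(bits):
    """Deceptive score: climbing toward all-ones is tempting (a local
    optimum), but the global optimum is the all-zeros solution."""
    ones = sum(bits)
    return len(bits) + 1 if ones == 0 else ones

def hill_climb(bits, score, steps=200):
    """Greedy single-flip search: accept a change only if the score improves."""
    best, best_score = list(bits), score(bits)
    for _ in range(steps):
        cand = list(best)
        i = random.randrange(len(cand))
        cand[i] = 1 - cand[i]
        if score(cand) > best_score:
            best, best_score = cand, score(cand)
    return best, best_score

def search_with_forgetting(n=12, restarts=50, forget_rate=0.5, seed=0):
    random.seed(seed)
    best, best_score = hill_climb([random.randint(0, 1) for _ in range(n)],
                                  toy_objective)
    for _ in range(restarts):
        # Keep each decision from the best solution so far with probability
        # (1 - forget_rate); otherwise re-randomize it and climb again.
        start = [b if random.random() > forget_rate else random.randint(0, 1)
                 for b in best]
        cand, cand_score = hill_climb(start, toy_objective)
        if cand_score > best_score:
            best, best_score = cand, cand_score
    return best, best_score

print(search_with_forgetting())
```

The same mechanic applies when the "decisions" are attachment choices in a parse and the score measures how well a candidate grammar explains the raw text.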
The result of our work is a new method for grammar induction that performs significantly better than all previous such methods at learning the syntactic structure of text, one that we hope will make it easier in the future to extract meaning from raw text of any kind.