The goal of this project is to extend methods of extracting general knowledge from texts, so as to obtain not only simple "factoids" such as "A door can be open" or "A person may respond to a question" (exemplifying the millions of outputs of the U. Rochester KNEXT system), but also general, conditional knowledge such as "If a car crashes into a tree, the driver may be hurt or killed". Such conditional knowledge is crucial for intelligent agents that can understand language and make commonsense inferences. The approach employed in the project involves bootstrapping of two principal sorts: (1) abstraction from simple factoids, both individually and collectively; and (2) use of already-derived factoids to boost the performance of a natural language parser/interpreter, enabling (a) extraction of more complex conditional facts from miscellaneous texts, and (b) direct interpretation of general conditional facts stated in English in sources such as Open Mind Common Sense or WordNet glosses. The evaluation methodology for the derived knowledge involves both direct human judgement and judgement of inferences automatically generated with the aid of the extracted knowledge, using the EPILOG inference engine at U. Rochester.

The general knowledge obtained in this work will be made available to the broader AI community, and will advance the state of the art both in natural language understanding and in knowledge-dependent commonsense reasoning (for example, in question answering). It will also provide evidence relevant to the hypothesis that language understanding is a process dependent not only on a few thousand syntactic rules, but also on millions of pattern-like items of general knowledge that bias the parsing and interpretation process.

Project Report

It has been recognized in AI since the 1970s that human-level language understanding, commonsense reasoning, and goal-oriented planning in the real world depend on the possession and use of very large amounts of knowledge -- perhaps tens or hundreds of millions of individual (but richly interrelated) items. This raises the question of how such knowledge can be acquired by a machine in a form suitable for reasoning, a long-standing challenge known as the "knowledge acquisition bottleneck". The present project extended previous work aimed at alleviating this bottleneck through knowledge acquisition from large text repositories, such as the 100-million-word British National Corpus, the more than one-billion-word "Gigaword" newswire corpus, Wikipedia, and personal blogs. That previous work led to the KNEXT knowledge extraction system at the University of Rochester and to a repository of many millions of general "factoids" -- simple general claims, expressed in a formal, language-like logic called Episodic Logic; examples are that a person can have a brain, a dog may bark, people may want to be rid of a dictator, and so on. Unfortunately these factoids, though indicative of what the world is like, are too weakly formulated to be usable for inference; for instance, we cannot infer that a given person does have a brain, only that this is a possibility. Thus the present project aimed at "bootstrapping" KNEXT-like knowledge -- extending it and boosting the logical strength of the factoids to enable inferences.

The research accomplished knowledge extension and strengthening in three primary ways: (1) logical "sharpening" of many millions of the original KNEXT factoids, making them suitable for reasoning; this relied on various lexical and software resources, allowing nouns and verbs to be semantically classified in multiple ways, with the help of a new pattern matching and transformation system, TTT, created in the course of this project; (2) derivation of axiomatic knowledge about relationships among nominal concepts (person, dog, creature, tree, plant, computer, artifact, ...) as characterized in the online WordNet lexicon, again relying on multiple lexical resources and TTT; and (3) direct creation of axioms expressing what is implied by verb concepts such as walking, giving (something to someone), refusing (to do something), believing (something), etc., using as a guide the online VerbNet resource, which contains a systematic classification of verbs.

These new knowledge items support "obvious" inferences such as that John may well occasionally go to a dentist, very likely has teeth as parts, and possibly some of those teeth are occasionally fixed and some may be lost (given only that John is a person); or that IBM may well have products, probably experiences sales, may have headquarters, may have websites, may grow, etc. (given only that IBM is a company). Such inferences are obtainable with the EPILOG inference engine, whose capabilities were also strengthened in the course of the project. Within the field of artificial intelligence, these results constitute significant progress towards overcoming the knowledge acquisition bottleneck.
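To give a concrete (if greatly simplified) sense of the pattern-driven processing involved, the following minimal Python sketch illustrates the general style of rule-based tree transformation that TTT supports. The sketch is ours, not TTT's: TTT itself is Lisp-based, with a considerably richer pattern language, and the "sharpening" rule shown here is hypothetical.

    # Minimal sketch (ours, not TTT's actual Lisp-based pattern language)
    # of pattern-directed tree transformation in the general style of TTT.
    # The sharpening rule below is hypothetical and purely illustrative.

    def match(pattern, tree, bindings=None):
        """Match a nested-tuple pattern against a tree; strings starting
        with '?' are pattern variables. Returns a bindings dict or None."""
        if bindings is None:
            bindings = {}
        if isinstance(pattern, str) and pattern.startswith('?'):
            if pattern in bindings:
                return bindings if bindings[pattern] == tree else None
            return {**bindings, pattern: tree}
        if (isinstance(pattern, tuple) and isinstance(tree, tuple)
                and len(pattern) == len(tree)):
            for p, t in zip(pattern, tree):
                bindings = match(p, t, bindings)
                if bindings is None:
                    return None
            return bindings
        return bindings if pattern == tree else None

    def substitute(template, bindings):
        """Instantiate a template, replacing '?' variables by their values."""
        if isinstance(template, str) and template.startswith('?'):
            return bindings[template]
        if isinstance(template, tuple):
            return tuple(substitute(t, bindings) for t in template)
        return template

    # Hypothetical rule: rewrite "a ?noun may have a ?part" as a stronger,
    # universally quantified part-of claim (real sharpening is far subtler,
    # drawing on lexical-semantic classifications of the nouns involved).
    pattern  = ('may-have', '?noun', '?part')
    template = ('all-have-part', '?noun', '?part')

    factoid = ('may-have', 'person', 'brain')   # raw KNEXT-style factoid
    b = match(pattern, factoid)
    if b is not None:
        print(substitute(template, b))   # -> ('all-have-part', 'person', 'brain')

In the project, rules of this general kind -- though much more sophisticated, and mediated by semantic classifications of the words involved -- were used to rewrite weak factoids into logically stronger, inference-ready axioms.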
The work also has implications for linguistics. For example, the effort to axiomatize relations among nominal concepts in WordNet showed the importance of the mass-count distinction in making formal sense of WordNet's "hyponym" relations. This is illustrated by the fact that the hyponym relation between "gold dust" and "gold" can be formalized by saying that all gold dust is gold, whereas the hyponym relation between "gold" and "noble metal" cannot be formalized by saying that all gold is a noble metal; rather, gold, considered as an elementary substance, is a noble metal. The difference can be traced to the fact that "gold dust" and "gold" are mass nouns, while "noble metal" is a count noun. (A schematic formalization of this contrast is given at the end of this report.)

More broadly, by providing large amounts of new, inference-enabling knowledge, as well as enhanced tools for creating knowledge (TTT) and for using it inferentially (EPILOG), the project has made progress towards the ambitious longer-range goal of creating artifacts with human-like language understanding and thinking abilities, with possible applications such as personal assistance in daily life, knowledge integration and analysis from varied sources for practical and scientific purposes, intelligent tutoring, and entertainment involving interaction with intelligent virtual agents.

The project has contributed to the development of human resources through the research involvement of 6 PhD candidates (of the three most directly involved, two have graduated and one is scheduled to graduate in 2013) and 10 undergraduates, as well as through their participation in regular study and research groups held throughout the course of the project. Most of these students have continued in academia (as faculty or graduate students) or gone on to high-tech industry positions. The results of the work have been disseminated in edited collections, conferences, and journals, through talks at international refereed conferences, and locally through informal guest lectures and the incorporation of materials into the PI's classes.
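As a schematic formalization of the mass-count contrast noted above (stated in ordinary first-order notation with predicate names of our own choosing, rather than in the project's Episodic Logic):

    % Mass-mass hyponymy: "gold dust" is a hyponym of "gold";
    % here an ordinary quantified conditional is appropriate:
    \forall x\, [\, \mathrm{GoldDust}(x) \rightarrow \mathrm{Gold}(x) \,]

    % Mass-count hyponymy: "gold" is a hyponym of "noble metal"; here the
    % predication is about the kind (the substance gold) itself:
    \mathrm{NobleMetal}(\mathit{gold})
    % and NOT:  \forall x\, [\, \mathrm{Gold}(x) \rightarrow \mathrm{NobleMetal}(x) \,]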

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0916599
Program Officer: Tatiana D. Korelsky
Budget Start: 2009-07-15
Budget End: 2013-06-30
Fiscal Year: 2009
Total Cost: $459,435
Name: University of Rochester
City: Rochester
State: NY
Country: United States
Zip Code: 14627