The text available on the Web and beyond embeds unprecedented volumes of valuable structured data, "hidden" in natural language. For example, a news article might discuss an outbreak of an infectious disease, reporting the name of the disease, the number of people affected, and the geographical regions involved. Keyword search, the prevalent query paradigm for text, is often insufficiently expressive for complex information needs that require structured data embedded in text. For such needs, users (e.g., an epidemiologist compiling statistics, as reported in the media, on recent foodborne disease outbreaks in a remote country) are forced to embark on labor-intensive cycles of keyword-based document retrieval and manual document filtering until they locate the appropriate (structured) information.

To move beyond keyword search, this project exploits information extraction technology, which identifies structured data in text, to enable structured querying. To capture diverse user information needs, and to depart from a "one-size-fits-all" querying approach that is inappropriate for this extraction-based scenario, the project explores a wealth of structured query paradigms: sometimes users (e.g., a high-school student in need of some quick examples and statistics for a report on recent salmonella outbreaks in developing countries) are after a few exploratory results, which should be returned fast; at other times, users (e.g., the above epidemiologist investigating foodborne diseases) are after comprehensive results, for which waiting a longer time is acceptable.

The project develops specialized cost-based query optimizers for each query paradigm, accounting for the efficiency and, critically, the result quality of the query execution plans. The technology produced will assist a vast range of users and information needs, by enabling efficient, diverse interactions with text databases -- for sophisticated searching and data mining -- that are cumbersome or impossible with today's technology. The research and educational components of the project will rely on -- and encourage -- a tight integration of three complementary Computer Science disciplines, namely, natural language processing, information retrieval, and databases. The project will also provide data sets and source code, for experimentation and evaluation, to the community at large over the Web (http://extraction.cs.columbia.edu/).
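To make the extraction-based querying idea concrete, the following is a minimal, self-contained sketch. A deliberately naive regular-expression "extractor" -- a stand-in for the real information extraction systems the project builds on -- turns news snippets into (disease, cases, region) tuples, over which a structured filter can then run. The document text, pattern, and threshold are all hypothetical, purely for illustration.

```python
# Toy sketch of extraction-based structured querying (illustrative only;
# real information extraction systems are far more sophisticated).
import re

# Hypothetical news snippets; the (disease, cases, region) schema
# mirrors the outbreak example in the abstract.
documents = [
    "An outbreak of cholera has affected 1200 people in the coastal region.",
    "Salmonella sickened 85 people in the northern province, officials said.",
    "A measles outbreak sickened 430 children in the capital district.",
]

# A deliberately naive pattern standing in for a real extractor:
# disease name, then case count, then geographical region.
PATTERN = re.compile(
    r"(?P<disease>cholera|salmonella|measles)\b.*?"
    r"(?P<cases>\d+)\s+(?:people|cases|children)\b.*?"
    r"in the\s+(?P<region>[\w ]+?)[.,]",
    re.IGNORECASE,
)

def extract(docs):
    """Turn unstructured text into (disease, cases, region) tuples."""
    for doc in docs:
        m = PATTERN.search(doc)
        if m:
            yield (m["disease"].lower(), int(m["cases"]), m["region"].strip())

# A structured query over the extracted relation -- awkward or impossible
# with plain keyword search: outbreaks with more than 100 cases.
table = list(extract(documents))
print([t for t in table if t[1] > 100])
# -> [('cholera', 1200, 'coastal region'), ('measles', 430, 'capital district')]
```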
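The tension between fast exploratory answers and slower comprehensive ones can likewise be made concrete. The sketch below shows one way a quality-aware, cost-based optimizer might choose among candidate execution plans given a user's recall target and time budget; the plan names and cost estimates are hypothetical, and this illustrates the general idea only, not the project's actual optimizers.

```python
# Minimal sketch of quality-aware, cost-based plan selection, loosely
# inspired by the project's goals; plans and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    est_time_s: float   # estimated execution time (seconds)
    est_recall: float   # estimated fraction of all answer tuples retrieved

CANDIDATE_PLANS = [
    # e.g., extract only from the top documents of a keyword query: fast, partial
    Plan("keyword-filtered scan", est_time_s=30.0, est_recall=0.40),
    # e.g., run extraction over the full collection: slow, comprehensive
    Plan("full scan + extraction", est_time_s=3600.0, est_recall=0.95),
]

def choose_plan(plans, min_recall, time_budget_s):
    """Pick the cheapest plan meeting the recall target; otherwise fall
    back to the highest-recall plan that still fits the time budget."""
    feasible = [p for p in plans
                if p.est_recall >= min_recall and p.est_time_s <= time_budget_s]
    if feasible:
        return min(feasible, key=lambda p: p.est_time_s)
    within_budget = [p for p in plans if p.est_time_s <= time_budget_s]
    return max(within_budget, key=lambda p: p.est_recall) if within_budget else None

# The high-school student: quick, partial answers suffice.
print(choose_plan(CANDIDATE_PLANS, min_recall=0.3, time_budget_s=60).name)
# -> keyword-filtered scan

# The epidemiologist: comprehensive results, willing to wait.
print(choose_plan(CANDIDATE_PLANS, min_recall=0.9, time_budget_s=7200).name)
# -> full scan + extraction
```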