This project investigates techniques for extracting and searching structured datasets embedded in the Web. For example, a manufacturer's site may contain technical product data, and a governmental site may contain economic statistics. Unfortunately, such data can be hard to isolate from surrounding text and difficult to find using existing search engines that focus exclusively on documents. The approach for the extraction step is to use existing incomplete datasets to induce a large "portfolio" of possible extractors, apply all of them to crawled Web content, and then test which are most successful. The approach for the search step is to examine user query logs to find common patterns that describe the relationship between topic words and words that describe the dataset's structure; e.g., "endangered species near the Mississippi River" is a prototype for a many-to-many geographic relationship. The central goal of this work is ultimately to construct a working search engine for the structured-data component of the Web.
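The extraction step described above (induce a portfolio of extractors from an incomplete dataset, apply all of them, then test which succeed) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the project's actual method: the seed rows, the toy pages, the function names, the choice of regular expressions built from the text found between a known key and its known value, and the scoring rule (the fraction of seed rows an extractor recovers).

```python
import re

# Hypothetical seed data: an incomplete dataset of (product, weight) rows.
SEED_ROWS = {("Widget", "12kg"), ("Gadget", "7kg")}

# Hypothetical crawled content standing in for Web pages.
PAGES = [
    "Widget weighs 12kg. Gadget weighs 7kg. Gizmo weighs 3kg.",
    "Widget: 12kg | Doohickey: 9kg",
    "Widget and 12kg appear here, along with noise and more.",
]


def induce_extractors(seed_rows, pages, max_gap=20):
    """Build one candidate extractor per distinct text span seen between a seed key and its value."""
    middles = set()
    for key, value in seed_rows:
        finder = re.compile(re.escape(key) + rf"(.{{1,{max_gap}}}?)" + re.escape(value))
        for page in pages:
            middles.update(match.group(1) for match in finder.finditer(page))
    # Each extractor generalizes the key and value to arbitrary word tokens.
    return {mid: re.compile(r"(\w+)" + re.escape(mid) + r"(\w+)") for mid in middles}


def apply_extractor(pattern, pages):
    """Run one extractor over every page and collect the (key, value) rows it finds."""
    return {row for page in pages for row in pattern.findall(page)}


def rank_extractors(extractors, seed_rows, pages):
    """Score each extractor by the fraction of seed rows it recovers (an assumed success test)."""
    seed = set(seed_rows)
    scored = []
    for name, pattern in extractors.items():
        rows = apply_extractor(pattern, pages)
        recall = len(rows & seed) / len(seed)
        scored.append((recall, name, rows - seed))  # rows - seed: newly extracted rows
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


ranked = rank_extractors(induce_extractors(SEED_ROWS, PAGES), SEED_ROWS, PAGES)
for recall, name, new_rows in ranked:
    print(f"{name!r}: seed recall {recall:.2f}, new rows {sorted(new_rows)}")
```

In this toy run, the extractor built from the span " weighs " recovers both seed rows and also proposes one new row, while the extractor built from " and " recovers only one seed row and proposes a spurious one. A realistic system would need a richer success test than seed recall alone.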
The success of this project is likely to increase access to structured datasets for a very broad population of users. The project will also yield a large amount of novel extracted data relevant for scientific research, plus useful tools and query logs. To accompany the research program, this project involves an educational plan that includes revised undergraduate course material, development of online educational material surrounding the datasets and tools, and a course on Web topics taught at a local rural high school. All project results will be distributed via the project's Web site (www.eecs.umich.edu/~michjc/structuredweb/index.html).