This Small Business Innovation Research (SBIR) Phase II project will enable software systems to use data that is embedded in HTML pages on the Web. The Semantic Web is intended to allow data to be shared and used by software applications. At present, however, most Web data is inaccessible to applications because it is presented in a format intended for human readers rather than computers. The goal of this project is to create a relational view of data on the Web, so that applications can access Web data in terms of entities and their relations. The approach uses unsupervised machine learning to extract data from Web sites and convert it into relational form. This project will result in a new generation of Web harvesting technology that has clear commercial value.
Web harvesting is an area of growing commercial interest in a variety of vertical markets, including Sales Intelligence, Market Intelligence, News Aggregation, and Background Search. However, current Web harvesting technology is limited because rich, detailed data must be collected on a site-by-site basis. The approach described here, if successful, will enable a new generation of intelligent Web harvesting technology that can scale to the entire Web. Ultimately, our approach will enable applications to query the entire Web as if it were a relational database. This capability has tremendous commercial value and will enable many new types of Web applications to be developed. In addition to its commercial value, the technical approach is novel and has significant merit in its own right. If successful, the proposed method should generalize to other complex domains (such as scene understanding and natural language processing) where multiple heterogeneous types of structure must be analyzed to discover underlying meaning.