This Small Business Innovation Research (SBIR) Phase I research project will enable software systems to make use of data on the Web. The semantic web is intended to allow data to be shared and used by software applications. Unfortunately, data on the Web today is generally inaccessible to most applications because it is presented in a format intended for humans rather than computers. The ultimate goal is to create a relational view of data on the Web, so that applications can access Web data based on entities and their relations. This project proposes to achieve this with an unsupervised machine learning approach that extracts data from web sites and converts it into relational form. It will develop and implement an unsupervised algorithm that takes advantage of multiple heterogeneous types of patterns found on web sites, including the link structure, formatting conventions, and content regularities. This project will result in a powerful new generation of Web harvesting technology that has clear commercial value. In addition, it will enable the vision of the semantic web to become a reality.

Web harvesting is an area of growing commercial interest for a variety of vertical markets, including Sales Intelligence, Market Intelligence, News Aggregation, and Background Search. However, Web harvesting technology is limited today, since the collection of rich, detailed data must be done on a site-by-site basis. The approach described here, if successful, will enable a new generation of intelligent Web harvesting technology that can scale to the entire Web. Ultimately, our approach will enable applications to query the entire Web as if it were a relational database. This has tremendous commercial value and will also enable many new types of web applications to be developed. Beyond its commercial value, the technical approach is novel and has significant merit on its own. If it is successful, the proposed method should generalize to other complex domains (such as scene understanding and natural language processing) where multiple heterogeneous types of structure must be analyzed to discover underlying meaning.

Project Start:
Project End:
Budget Start: 2005-01-01
Budget End: 2005-06-30
Support Year:
Fiscal Year: 2004
Total Cost: $100,000
Indirect Cost:
Name: Fetch Technologies
Department:
Type:
DUNS #:
City: El Segundo
State: CA
Country: United States
Zip Code: 90245