SBIR Phase I: Unsupervised Extraction of Relational Data from the Web

Minton, Steven

Abstract

This Small Business Innovation Research (SBIR) Phase I research project will enable software systems to make use of data on the Web. The semantic web is intended to allow data to be shared and used by software applications. Unfortunately, in the present world, data on the Web is generally inaccessible to most applications because it is presented in a format intended to be usable by humans, as opposed to computers. The ultimate goal is to create a relational view of data on the Web, so that applications can access Web data based on entities and their relations. This project proposes to achieve this with an unsupervised machine learning approach that extracts data from web sites and converts it into relational form. It will develop and implement an unsupervised algorithm that takes advantage of multiple heterogeneous types of patterns found on web sites, including the link structure, formatting conventions, and content regularities. This project will result in a powerful new generation of Web harvesting technology that has clear commercial value. Moreover, it will enable the vision of the semantic web to become a reality.

Web harvesting is an area of growing commercial interest for a variety of vertical markets, including Sales Intelligence, Market Intelligence, News Aggregation, and Background Search. However, Web harvesting technology is limited today, since the collection of rich, detailed data must be done on a site-by-site basis. The approach described here, if successful, will enable a new generation of intelligent Web harvesting technology that can scale to the entire Web. Ultimately, our approach will enable applications to query the entire Web as if it were a relational database. This has tremendous commercial value, and moreover, will enable many new types of Web applications to be developed. In addition to the commercial value, the technical approach is novel and has significant merits on its own. If it is successful, the proposed method should generalize to other complex domains (such as scene understanding and natural language processing) where multiple heterogeneous types of structure must be analyzed to discover underlying meaning.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Industrial Innovation and Partnerships (IIP)
Type: Standard Grant (Standard)
Application #: 0441563
Program Officer: Errol Arkilic

Budget Start: 2005-01-01
Budget End: 2005-06-30
Fiscal Year: 2004
Total Cost: $100,000

Institution: Fetch Technologies, El Segundo, CA, United States
