The Web is enormous and in constant flux, so much of its content is lost over time. Historical collections of web content are therefore of great value in preserving records of significant aspects of modern society. The Internet Archive offers access to hundreds of billions of historical web page snapshots. The scale of such archives, however, presents tremendous challenges to making this content fully searchable. This research effort investigates efficient and effective approaches to storing, indexing, and retrieving web content from large-scale historical archives. In addition, the temporal content and structure of the archives are mined for temporal characteristics that can improve search result ranking. Technological advances from this work are being tested on Internet Archive content, in collaboration with the Internet Archive, and integrated into its infrastructure, enabling new archival search capabilities for the public.

www.cse.lehigh.edu/~brian/nsf/archives-08.html

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0803605
Program Officer: Maria Zemankova
Project Start:
Project End:
Budget Start: 2008-09-01
Budget End: 2012-08-31
Support Year:
Fiscal Year: 2008
Total Cost: $900,000
Indirect Cost:
Name: Lehigh University
Department:
Type:
DUNS #:
City: Bethlehem
State: PA
Country: United States
Zip Code: 18015