This project will attempt to build a full-text index of the textual web pages in the historical collections of the Internet Archive. The Internet Archive has captured and stored a snapshot of the web every two months since 1996; the collection now comprises approximately 40 billion web pages and occupies multiple petabytes of storage. The resulting index may be the largest and best-organized inverted index ever created that is freely available to academic researchers. It will enable social and information scientists to explore altogether new dimensions of contemporary events and practices, and it will give information scientists a vital large-scale testing resource in areas such as advanced information retrieval on semistructured collections.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0634677
Program Officer: Stephen Griffin
Project Start:
Project End:
Budget Start: 2006-10-01
Budget End: 2007-09-30
Support Year:
Fiscal Year: 2006
Total Cost: $120,000
Indirect Cost:

Name: Cornell University
Department:
Type:
DUNS #:
City: Ithaca
State: NY
Country: United States
Zip Code: 14850