Web Science is an emerging discipline that studies the Web: how human activity is shaped by Web interactions, how the Web can benefit society, and how Web technologies can be improved. Central to Web Science is access to data that records the history of the Web, as well as data that records human activity (e.g., posed queries, tagged pages, Twitter updates). It is currently very difficult for academic researchers to obtain such Web data because it is hard to locate, fragmented across diverse sites, and recorded using inconsistent formats and strategies. This project will build a Web Archive Cooperative (WAC) that will integrate existing archives (repositories of Web data), making it feasible to access large volumes of data in a simplified fashion. The WAC will be a virtual service, providing search facilities and access mechanisms to existing resources. These resources will not just be Web pages, but all types of available Web information, such as query logs, tag annotations, blogs, profiles, and Twitter updates. Resources will also include the software tools for building and managing Web archives.

The project will explore three goals for a resource discovery service: (1) the manual or automated discovery of entire existing Web-related archives; (2) the selection, among known archives, of the ones that support a specific research question; and (3) the identification of individual resources within the selected archives. Tools for characterizing discovered archives will also be developed, especially for the case where an archive does not provide rich descriptive metadata. Characterization of an archive includes elements such as an estimate of the archive's coverage; particulars of the crawling parameters, such as dates and frequencies, crawl duration, depth, and the per-site ceiling on the number of collected pages; content statistics; and link structure. Mechanisms for integrating diverse archives will be developed and applied to site reconstruction (from various archives) and to archive views (a logical fusion of resources from multiple sources). Since integration issues are so challenging, an experimental testbed will be set up with small but diverse resources. The testbed will contain several crawls of the same target sites, each obtained with a different crawler and different parameters. The testbed will also contain related resources. Storage trading schemes will be developed, allowing members to trade local backup space for remote space. A Web archive replication tool will be developed based on existing notions of self-preserving objects. Alternatives for replica synchronization will be studied.
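As a rough illustration of what an archive view and a site reconstruction could look like, the following minimal Python sketch fuses capture records from several archives into a per-URL timeline and then picks, for each URL, the latest capture on or before a chosen date. The `Capture` record layout and the functions `archive_view` and `reconstruct_site` are hypothetical names introduced here for illustration only; the abstract does not specify the project's actual data model or algorithms.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Capture:
    """One archived copy of a URL (hypothetical record layout)."""
    archive: str      # name of the archive holding the copy
    url: str          # address of the captured page
    crawled_on: date  # day the crawler fetched the page


def archive_view(captures: Iterable[Capture]) -> Dict[str, List[Capture]]:
    """Fuse captures from several archives into one timeline per URL.

    Assumption: two captures of the same URL on the same day count as
    duplicates, and the first one encountered is kept.
    """
    view: Dict[str, List[Capture]] = {}
    seen = set()
    for c in captures:
        key = (c.url, c.crawled_on)
        if key in seen:
            continue
        seen.add(key)
        view.setdefault(c.url, []).append(c)
    for timeline in view.values():
        timeline.sort(key=lambda c: c.crawled_on)
    return view


def reconstruct_site(view: Dict[str, List[Capture]], as_of: date) -> Dict[str, Capture]:
    """For each URL, pick the latest capture made on or before `as_of`."""
    site: Dict[str, Capture] = {}
    for url, timeline in view.items():
        candidates = [c for c in timeline if c.crawled_on <= as_of]
        if candidates:
            site[url] = candidates[-1]  # timelines are sorted by date
    return site
```

A real integration mechanism would also have to reconcile differing crawl parameters and URL normalization across archives, which this sketch deliberately ignores.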

Workshops to bring together key Web Science researchers will be organized to discuss available resources and impediments to sharing. These workshops will drive research and identify needed tools and protocols. With small groups of participants, challenge problems will be established, e.g., combining a set of Web archives. Reports of these results at future workshops can incentivize others to participate in the WAC. In addition, an Advisory Board of industrial, government, and academic experts has been set up to guide the project. A Summer Institute for Web Science graduate students will be held. At this Institute, students will learn to use the latest tools and will learn from each other's experiences in dealing with Web data. A one-day workshop will also be developed and offered at Web Science conferences (WWW, SIGIR, etc.) to educate participants about WAC resources. An undergraduate Web Science track for computer science majors will be set up, taking advantage of WAC resources.

The project will have impact in two ways. First, it will provide tools and services that facilitate access to Web resources. Any researcher will benefit from the work, from a computer scientist studying efficient Web search, to a social scientist studying how human beliefs are changing today, to a historian studying how the early Web evolved, to a biologist studying how disease spreads. Second, the project will motivate students and young researchers to stay in academia. Currently, top talent is flowing to industry because only industry has comprehensive Web data, which makes it hard to do significant Web Science at universities. The WAC can provide an alternative, attracting more researchers and teachers to this important area.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1009916
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2010-08-01
Budget End: 2015-07-31
Support Year:
Fiscal Year: 2010
Total Cost: $2,350,507
Indirect Cost:
Name: Stanford University
Department:
Type:
DUNS #:
City: Stanford
State: CA
Country: United States
Zip Code: 94305