Web Science is an emerging discipline that studies the Web: how human activity is shaped by Web interactions, how the Web can benefit society, and how Web technologies can be improved. Central to Web Science is access to data that records the history of the Web, as well as data that records human activity (e.g., posed queries, tagged pages, Twitter updates). It is currently very difficult for academic researchers to obtain such Web data because it is hard to locate, fragmented across diverse sites, and recorded using inconsistent formats and strategies. This project will build a Web Archive Cooperative (WAC) that will integrate existing archives (repositories of Web data), making it feasible to access large volumes of data in a simplified fashion. The WAC will be a virtual service, providing search facilities and access mechanisms to existing resources. These resources will not just be Web pages, but all types of available Web information, such as query logs, tag annotations, blogs, profiles, and Twitter updates. Resources will also include the software tools for building and managing Web archives.

The project will explore three goals for a resource discovery service: (1) the manual or automated discovery of entire existing Web-related archives; (2) the selection among known archives of the ones that support a specific research question; and (3) the identification of individual resources from within the selected archives. Tools for characterizing discovered archives, especially for the case where the archive does not provide rich descriptive metadata, will also be developed. Characterization of an archive includes elements such as an estimate of the archive's coverage; particulars of the crawling parameters, such as dates and frequencies, crawl duration, depth, and per-site ceiling on the number of collected pages; content statistics; and link structure. Mechanisms for integrating diverse archives will be developed, and the mechanisms will be applied to site reconstruction (from various archives) and archive views (a logical fusion of resources from multiple sources). Because integration issues are so challenging, an experimental testbed will be set up with small but diverse resources. The testbed will contain several crawls of the same target sites, each obtained with different crawlers and using different parameters. The testbed will also contain related resources. Storage trading schemes will be developed, allowing members to trade local backup space for remote space. A Web archive replication tool will be developed based on existing notions for self-preserving objects. Alternatives for replica synchronization will be studied.

Workshops to bring together key Web Science researchers will be organized to discuss available resources and impediments to sharing. These workshops will drive research and identify needed tools and protocols. With small groups of participants, challenge problems will be established, e.g., combining a set of Web archives. Reports of these results at future workshops can incentivize others to participate in the WAC. In addition, an Advisory Board of industrial, government, and academic experts has been set up to guide the project. A Summer Institute for Web Science graduate students will be held. At this Institute, students will learn to use the latest tools and will learn from each other's experiences in dealing with Web data. In addition, a one-day workshop will be developed, to be offered at Web Science conferences (WWW, SIGIR, etc.) to educate participants about WAC resources. An undergraduate Web Sciences track for computer science majors will be set up, taking advantage of WAC resources.

The project will have impact in two ways. First, it will provide tools and services that facilitate access to Web resources. Any researcher will benefit from the work, whether a computer scientist studying efficient Web search, a social scientist studying how human beliefs are changing today, a historian studying how the early Web evolved, or a biologist studying how disease spreads. Second, the project motivates students and young researchers to stay in academia. Currently, top talent is flowing to industry because only industry has comprehensive Web data, which makes it hard to do significant Web Science at universities. The WAC can provide an alternative, attracting more researchers and teachers to this important area.

Project Report

The WAC project, funded by the National Science Foundation, has provided an educational boon to the field of web science, promoted the development of software that leverages new web archiving protocols, and provided research opportunities for underrepresented groups in computing. The activities and accomplishments of this three-year project are summarized here:
Creation of one of the first publicly available corpora of teaching resources for an introductory undergraduate course on web science. These resources include slides, homework assignments, and projects that can be used in a class made up entirely of computing students or in an interdisciplinary course with less focus on programming solutions.
A two-day summer workshop held on the Stanford campus for twenty undergraduate and graduate students who were exploring web science and web archiving. Speakers represented a number of organizations, including Stanford University, Los Alamos National Laboratory, Internet Archive, UC Berkeley School of Law, California Digital Library, Microsoft Research, and others.
Development of a "time-travelling" web browser for iOS and Android that uses the Memento protocol. Users can easily see what CNN’s web page looked like on 9/11, or they can browse the White House website as it appeared under George W. Bush.
Development of tools that allow web crawlers to discover and crawl mobile content in an automated fashion. As more and more people access websites with their mobile devices, mobile-friendly versions of websites are becoming more popular. We built tools to allow web archivists to more easily capture both the desktop and mobile versions of these websites.
Development of a Firefox add-on called Volitrax that allows users to automatically re-discover web content that goes missing. This plug-in helps users locate music videos that disappear from YouTube because the owner removes the video or for other reasons.
Summer research opportunities for six undergraduate computing students at Harding University. Most undergraduate computing students do not get the opportunity to work on real research projects, but all of these participants built many of the tools mentioned above, and they presented their work at two workshops and three conferences. Half of these students were members of underrepresented groups in computer science.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1008492
Program Officer: Sylvia Spengler
Budget Start: 2010-08-01
Budget End: 2014-07-31
Fiscal Year: 2010
Total Cost: $108,340
Name: Harding University Main Campus
City: Searcy
State: AR
Country: United States
Zip Code: 72149