There are very few publicly available network traces that contain application-level data, because of the enormous privacy risk that sharing such data creates. Application-level data is rich with personal and private information, such as names and social security numbers, that criminals can monetize. Yet such data is necessary for realistic testing of research products, and for understanding trends in the domain of networking and network applications.

This project develops a publicly accessible, diverse and fresh archive of content-rich network data, contributed by volunteer users, called Critter-at-home. Users join the Critter overlay whenever online, offering their data to interested researchers. Privacy of data contributors is protected by several means. First, contributors may opt to host their own data on their machines, thus retaining full control over it. Second, we process contributed data to modify all personal and private information (PPI) and we encrypt it. Third, no human apart from the contributor ever accesses the raw, PPI-sanitized data. Instead, researchers query the data via our Critter-at-home framework, and they receive aggregate statistics (counts, distributions, etc.) of the traffic features they query for. Fourth, all contact with a contributor is at her discretion and is done through an anonymous network, where contributor identities are hidden.

The archive this project creates will greatly advance security research by providing the data necessary for its validation and for data mining. The archive will also be valuable to the broader networking community, e.g., for realistic traffic generation, as ground truth in traffic classification, and for many other purposes.

Project Report

Researchers need real-world network data from real users for computer network and security research. Unfortunately, because such data contains large amounts of personal information, such as which websites a user visits, collecting it and granting researchers access is often deemed too great a privacy risk. This is especially true for content-rich data, that is, application-level data such as the content of the websites a user visits. Critter, which stands for "Content-Rich Traffic Trace Repository", aims to provide this much-needed content-rich data to researchers through a network of volunteer data contributors. Critter allows researchers to run certain queries on user data and returns aggregate responses to protect user privacy.

Unlike traditional network data sources, where researchers work with an ISP or other large organization to gain access to data, Critter connects researchers with individual end users willing to share their data for research purposes. Since Critter works at the individual level, users retain much more control over their data and how it is used than with traditional data collection methods, such as network traffic traces collected at a university. Users keep their "raw" data locally on their machines and can withdraw it at any time. When privacy-sensitive data is shared, the original data always remains under the control of its owner. The data owner releases information by responding to queries with a numerical value. These responses are aggregated on the Critter Server before being returned to a researcher. Since we release only aggregate responses, many active and passive attacks that work against data sets such as sanitized network traces or sanitized logs are ineffective in our context.

Figure 1 illustrates how queries work in Critter. First, (1) a researcher submits a query via the public portal. Data contributors' clients (2) poll for new queries and (3) retrieve this new query. The Critter client processes the query if the data contributor's policy permits it, and returns the result. The Critter Server aggregates the results and stores them for the researcher to retrieve.

Our result aggregation provides privacy protection through "hiding in a crowd". The Critter Server enforces k-anonymity criteria before any result is returned to the researcher. If a researcher asks how often users visit a particular website during a specific week, k-anonymity ensures that the returned result is a set of grouped responses such that each group has a single value representing at least k different contributors' replies. Figure 2 depicts how this works with k = 3, using an example of responses from four data contributors to a query about how often each contributor visits a particular website. Since Group #2 contains fewer than k = 3 contributors, we cannot release its information. Instead, we drop this group and return only Group #1. Once results are aggregated, an attacker cannot know for sure which contributors participated in a query, nor learn any single contributor's response to it.

The end result of this NSF-funded effort is the implemented Critter system, including an easy-to-install Critter client, and a small base of volunteer data contributors. Becoming a data contributor through Critter is easy. We have created a simple install process and a self-updating client that works under Windows and Mac. To see more or to join Critter, please see http://steel.isi.edu/Projects/critter/.
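
The k-anonymity filtering described in this report can be illustrated with a minimal sketch. It is not the Critter Server's actual code: the report does not say how responses are grouped or how each group's single value is computed, so the fixed-width binning rule, the use of the mean, and the names `aggregate_with_k_anonymity`, `bin_width`, and `weekly_visits` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_with_k_anonymity(responses, k=3, bin_width=10):
    """Group numeric responses and release only groups with at least k replies.

    Assumptions (not taken from the report): responses are grouped into
    fixed-width bins, and each group's single value is the mean of its members.
    """
    groups = defaultdict(list)
    for value in responses:
        # Hypothetical grouping rule: fixed-width bins over the response value.
        groups[value // bin_width].append(value)

    released = []
    for bin_index in sorted(groups):
        members = groups[bin_index]
        # Groups with fewer than k contributors are dropped, never released.
        if len(members) >= k:
            released.append({"value": mean(members), "contributors": len(members)})
    return released


# Four hypothetical contributors report how often they visited one website.
weekly_visits = [2, 4, 6, 35]
print(aggregate_with_k_anonymity(weekly_visits, k=3))
```

With these made-up inputs, three replies fall into one group and are released as a single value, while the fourth reply forms a group of one and is dropped, mirroring the k = 3 example with Group #1 and Group #2.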

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1224035
Program Officer: Jeremy Epstein
Budget Start: 2012-09-01
Budget End: 2014-08-31
Fiscal Year: 2012
Total Cost: $375,000
Name: University of Southern California
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089