Scientists are generating data at an unprecedented scale and rate. There is tremendous value in not only analyzing this data but also sharing it among scientists. Cloud-computing platforms are well-suited to support such sharing: They offer a single logical location for data, access to data management tools for analyzing it, and a pay-as-you-go charging mechanism. However, while cloud-computing systems offer simple pricing schemes for storage and compute resources, the economics of data sharing are poorly understood and only coarsely supported.

This research is developing models and infrastructure to establish relational data markets in the cloud. These markets enable scientists to upload their data and make it publicly available in the cloud, then recoup their costs by charging others for using the data. The data markets also enable scientists to share their data with direct collaborators and have the cloud costs distributed fairly among team members.

The project is building a prototype system on the Windows Azure cloud to implement these data markets.

This project has an impact on both cloud computing and science by introducing new pricing techniques for data sharing in the cloud. The software and technical papers resulting from this project are being disseminated through the project website (http://cloud-data-pricing.cs.washington.edu/).

Project Report

Data has value and is increasingly bought and sold on the Web. The Windows Azure Marketplace (http://datamarket.azure.com/) illustrates this trend. This project studied challenges and tools related to managing data with value and made multiple contributions to the state of the art in data management.

First, this project developed new methods for pricing data sold online. Today, users can only purchase pre-defined subsets of data. For example, a company that sells business contact information may sell one dataset for an entire state and smaller subsets per city. Alternatively, it may let users specify a ZIP code and sell information only for that ZIP code. The problem is that a user who wants a unique subset of the data, such as only the Italian restaurants that are in the same ZIP code as Thai restaurants, must typically overpay by buying a superset of the data they need. In this example, the user would need to buy the data for all ZIP codes of interest to check which ones have both Italian and Thai restaurants. Our new data pricing method enables users to purchase such personalized data products at a lower price than is possible today. Our approach achieves this goal by automatically deriving the price of the requested data from other price points defined by the seller.

Second, when data is sold, it typically comes with a license agreement that constrains what can be done with the data. For example, certain vendors that provide ratings for businesses do not allow their ratings to be averaged with other ratings but do allow multiple ratings to be displayed together. Today, license agreements are written in English and thus impose a significant burden on users, who must manually check that they do not violate any agreement. The project developed methods and a new system, called DataLawyer, that automatically checks license agreements. Users need not track the licenses themselves: if they are about to violate an agreement, the system informs them before they do anything wrong.

Finally, the project developed new methods through which users can be remunerated for their anonymized private information and can select the trade-off between revenue and degree of anonymity. Data owners set a price for their private data, while data analysts purchase aggregate queries over that data. Because a typical aggregate query may touch the private data of many data owners, the exact answer can be expensive; to reduce the cost, our approach gives buyers the option to receive a perturbed answer to their query. We have developed a principled formalism for computing the price based on the noise. The resulting prices are arbitrage-free and compensate the data owners in proportion to their privacy loss. At one extreme, the data analyst may purchase the raw, unperturbed query answer at a high price; at the other extreme, the analyst may purchase, at a very small price, a query answer with a high perturbation that guarantees that all private data remains epsilon-differentially private. Our formalism offers a continuum of options in between.
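The idea of deriving a price from seller-defined price points can be illustrated with a small sketch. It is not the project's algorithm: it assumes that every price point covers a set of ZIP codes and that the price of a request is the cheapest combination of price points covering the ZIP codes the request needs. The price points, ZIP codes, dollar amounts, and the `derive_price` helper are all hypothetical, and the sketch leaves out the harder step of working out which data a query such as the Italian-and-Thai example actually needs.

```python
from itertools import combinations

# Hypothetical price points: (label, ZIP codes covered, price in dollars).
PRICE_POINTS = [
    ("state bundle", frozenset({"98101", "98105", "98195"}), 100.0),
    ("ZIP 98101", frozenset({"98101"}), 40.0),
    ("ZIP 98105", frozenset({"98105"}), 40.0),
    ("ZIP 98195", frozenset({"98195"}), 40.0),
]


def derive_price(needed_zips, price_points=PRICE_POINTS):
    """Return (price, labels) for the cheapest set of price points that
    covers every ZIP code in needed_zips.

    Brute force over all subsets of price points, so it is only suitable
    for a handful of price points. Returns (inf, None) if no combination
    covers the request.
    """
    needed = frozenset(needed_zips)
    best_price, best_labels = float("inf"), None
    for size in range(len(price_points) + 1):
        for combo in combinations(price_points, size):
            covered = frozenset().union(*(zips for _, zips, _ in combo))
            if needed <= covered:
                total = sum(price for _, _, price in combo)
                if total < best_price:
                    best_price = total
                    best_labels = [label for label, _, _ in combo]
    return best_price, best_labels


if __name__ == "__main__":
    print(derive_price({"98101", "98105"}))
    print(derive_price({"98101", "98105", "98195"}))
```

Under these assumed prices, a request for two ZIP codes is priced as two individual purchases, while a request for all three falls back to the state bundle because it is cheaper than three separate ZIP codes. A buyer therefore never pays more than the cheapest way of assembling the same data from the seller's own price list.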
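The arbitrage-free requirement on noise-based prices can be made concrete with one specific attack: instead of buying one accurate answer, a buyer purchases several cheap, noisy answers and averages them. The sketch below assumes that the noise in each purchase is independent, so averaging k answers of variance k*v yields variance v. The price function c / variance, the constant c, and all function names are illustrative assumptions rather than the project's formalism. The sketch also does not model the unperturbed answer (variance zero) or how payments are split among data owners.

```python
def inverse_variance_price(variance, c=100.0):
    """Hypothetical price of a query answer with the given noise variance."""
    if variance <= 0:
        raise ValueError("variance must be positive in this sketch")
    return c / variance


def averaging_cost(price_fn, target_variance, k):
    """Cost of buying k independent answers, each with variance
    k * target_variance, whose average has variance target_variance."""
    return k * price_fn(k * target_variance)


def allows_averaging_arbitrage(price_fn, target_variance, max_k=10):
    """True if averaging 2..max_k noisier answers is strictly cheaper
    than buying the target variance directly."""
    direct = price_fn(target_variance)
    return any(
        averaging_cost(price_fn, target_variance, k) < direct - 1e-9
        for k in range(2, max_k + 1)
    )


if __name__ == "__main__":
    # Price proportional to 1/variance: averaging gives no discount.
    print(allows_averaging_arbitrage(inverse_variance_price, 2.0))
    # Price proportional to 1/variance**2: averaging undercuts the seller.
    print(allows_averaging_arbitrage(lambda v: 100.0 / v**2, 2.0))
```

With a price proportional to 1/variance, k noisy answers cost exactly as much as the single accurate one, so this particular attack gains nothing. A price that falls faster, such as 1/variance squared, can be undercut by averaging. Averaging is only one form of arbitrage, so passing this check does not by itself show that a price function is arbitrage-free.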

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1047815
Program Officer: Almadena Chtchelkanova
Project Start:
Project End:
Budget Start: 2011-04-01
Budget End: 2014-03-31
Support Year:
Fiscal Year: 2010
Total Cost: $370,000
Indirect Cost:
Name: University of Washington
Department:
Type:
DUNS #:
City: Seattle
State: WA
Country: United States
Zip Code: 98195