Access to many important datasets in health care, biomedical informatics, sociology, and homeland security is restricted by HIPAA regulations on the release of detailed "microdata". These restrictions impede basic research in fields that depend on such data, and the problem will only grow as more medical data is collected. A primary focus of this project is to "unlock" such critical datasets by developing techniques and tools for anonymizing very large databases. These tools will facilitate the release of datasets to researchers while ensuring that the privacy of individuals is maintained.

Many organizations, including the Census Bureau and departments of health at the local, state, and federal levels, routinely publish aggregated forms of data because such data can be used to answer statistical queries over selected subsets of the underlying records. However, existing methods for aggregation have significant weaknesses, and proposed improvements scale poorly. In addition, the problem of building accurate predictive models (e.g., decision trees) from aggregated data has not been widely studied. There are many opportunities for building accurate predictive models from data aggregated to preserve privacy; conversely, the possibility of building such models suggests another way that sensitive information can be inadvertently "leaked" even when only aggregated data is published. This project investigates the trade-off between privacy guarantees and the utility of the published data for specific analysis tasks, developing: (a) privacy-preserving algorithms for aggregating large datasets, (b) algorithms for building predictive models from aggregated data, and (c) characterizations of the conditions under which accurate predictive models can be constructed from such data; a small illustrative sketch of this trade-off follows.
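
The abstract above only names these research thrusts; the Python fragment below is a rough, purely illustrative sketch of how thrusts (a) and (b) interact, not the project's actual algorithms. All records, values, and the parameter k are invented for the example. It coarsens quasi-identifiers into groups in a k-anonymity style, publishes only group-level counts, and fits a trivial majority-class "model" to those counts as a stand-in for richer learners such as decision trees: coarser groups leak less about individuals but also support weaker models.

# Illustrative sketch only: hypothetical data, k-anonymity-style aggregation,
# and a toy model built from the published aggregates.
from collections import Counter

# Hypothetical microdata: (age, zip_code, diagnosis) -- all values invented.
records = [
    (34, "53715", "flu"), (36, "53715", "flu"), (33, "53711", "cold"),
    (52, "53703", "flu"), (55, "53703", "cold"), (58, "53704", "cold"),
]

def generalize(age, zip_code):
    """Coarsen quasi-identifiers: 10-year age bands, 3-digit ZIP prefix."""
    band_lo = (age // 10) * 10
    return (f"{band_lo}-{band_lo + 9}", zip_code[:3] + "**")

# Aggregate: publish only per-group diagnosis counts, suppressing groups
# smaller than k to limit re-identification risk.
k = 2
cell_counts = Counter()
for age, zip_code, dx in records:
    cell_counts[(generalize(age, zip_code), dx)] += 1

group_sizes = Counter()
for (group, dx), n in cell_counts.items():
    group_sizes[group] += n

published = {(group, dx): n
             for (group, dx), n in cell_counts.items()
             if group_sizes[group] >= k}

# A crude "model" built from the aggregate alone: predict the majority
# diagnosis within each published group.
model = {}
for (group, dx), n in published.items():
    if group not in model or n > published[(group, model[group])]:
        model[group] = dx

print("published aggregates:", published)
print("majority-class model:", model)

Running the sketch shows both sides of the trade-off at once: the published counts no longer identify any individual, yet they still support a (weak) predictor of diagnosis from the generalized attributes, which is precisely the kind of leakage-versus-utility tension the project studies.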

The project team includes the Chief Epidemiologist for the State of Wisconsin, who curates many health-related datasets; a cancer researcher whose research program relies in part on datasets curated by the State; and three computer scientists with expertise in data mining and data management.

This project will train graduate and undergraduate students at the University of Wisconsin in the trade-offs between privacy protection and accurate model building, and the results and tools will be widely disseminated through publications and the project's Web site (www-db.cs.wisc.edu/dbprivacy).

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0524671
Program Officer: Sylvia J. Spengler
Budget Start: 2005-09-01
Budget End: 2012-08-31
Fiscal Year: 2005
Total Cost: $1,600,000
Name: University of Wisconsin Madison
City: Madison
State: WI
Country: United States
Zip Code: 53715