Data protection laws that exempt data that is not individually identifiable have led to an explosion in anonymization research. Unfortunately, it is not well understood how well current de-identification and anonymization techniques control risks to privacy and confidentiality, nor how useful anonymized data is for real-world applications. The project addresses anonymization on three fronts:

1) Textual data can contain highly identifiable information even after explicit identifiers (names, dates, locations) are removed. For example, a sample of chief complaint fields from the Indiana Network for Patient Care (INPC) contained several instances of "phantom limb pain". Amputees can be visually identifiable, but the HIPAA Safe Harbor rules do not list such a condition as "identifying information". Any policy that explicitly lists all types of identifying data is therefore likely to fail. Through a joint effort between computer science and linguistics, the project is developing new methods that remove specific details from text while preserving meaning, eliminating such highly identifiable information without a priori knowledge of what would be identifying.

2) Current anonymization research is based on unproven measures of identifiability. Through a re-identification challenge on synthetic data (derived from real healthcare data), the project is evaluating the efficacy of these measures. Interdisciplinary teams of students are given challenge problems, consisting of anonymized data sets of hypothetical healthcare records, and asked to make (hypothetical) inferences about the health information of individuals. The results can be used to calibrate the effectiveness of different anonymization measures.

3) The utility of anonymized data has been a concern among researchers: Does anonymized data provide credible research results? By partnering with healthcare studies at the Kinsey Institute and Purdue University School of Nursing, the project is comparing analyses on original data with analyses on anonymized data, and evaluating the impact of different types of anonymization on research results. A related issue is the impact on data collection: Are individuals more candid in their responses if they know the data will be anonymized? Outcomes are broadening the scope of research that can be performed on anonymized data, while ensuring that researchers know when access to individually identifiable data (with attendant restrictions and safeguards) is needed.

Through these tasks, the project is advancing our ability to utilize the wealth of data we now collect for the benefit of society, while ensuring individual privacy is protected.

For further information, see the project web site at http://projects.cerias.purdue.edu/TextAnon

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1011984
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2010-09-01
Budget End: 2015-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $356,964
Indirect Cost:

Name: Missouri University of Science and Technology
Department:
Type:
DUNS #:
City: Rolla
State: MO
Country: United States
Zip Code: 65409