Patient data collected during health care delivery and public health surveys possess a great deal of information that could be used in biomedical and epidemiological research. Access to these data, however, is usually limited because of the private nature of most personal health records. Methods of balancing the informativeness of data for research with the information loss required to minimize disclosure risk are needed before these data can be used to improve public health. Current methods are primarily focused on protecting privacy, but focusing on protecting privacy alone is inadequate. In statistical disclosure control techniques, information truthfulness is not well preserved so that unreliable results may be released. In generalization-based anonymization approaches, there is information loss due to attribute generalization and existing techniques do not provide sufficient control for maintaining data utility. What are currently needed are methods that protect both the privacy of individuals represented in the data as well as the integrity of relationships studied by researchers. The problem is that there is an inherent tradeoff between protecting the privacy of individuals and protecting the informativeness of the data set. Protecting the privacy of individuals always results in a loss of information and it is the information contained by the data set that affects the power of a statistical test. For a given anonymization strategy, however, there are often multiple ways of masking the data that meet the disclosure risk criteria provided. This can be taken advantage of to choose the solution that best preserves statistical information while meeting the disclosure risk criteria provided. This project will develop the first integrated software system that provides solutions for problems faced in all three stages in the release of sensitive health care data: 1. anonymize a data set by intervalizing/generalizing data to satisfy currently available anonymization strategies, 2. provide sufficient controls within anonymization procedures to satisfy constraints on statistical usefulness of the data, and 3. compute statistical tests for the anonymized data intervals. There are two main challenges facing this effort. The first is that, based on existing research results, integrating our proposed new control processes into anonymization procedures is expected to be computationally difficult. We will overcome this challenge by developing efficient and practically useful greedy algorithms, approximation algorithms, or algorithms working for realistic situations (if not for general cases). The other primary challenge facing this effort is the fact that statistical calculations with interval data sets are known to be computationally difficult, and these calculations are necessary both for control processes within anonymization procedures and for subsequent statistical computation and tests. We will overcome this challenge with efficient algorithms that exploit the structure present in data sets intervalized for privacy. The software will be tested on medical data sets of various sizes and structures to demonstrate the feasibility of the approach and to characterize the scalability of the algorithms with data set size.
Patient health records possess a great deal of information that is useful in medical research, but access to these data is usually limited because of the private nature of most personal health records. Methods of balancing the informativeness of data for research with the information loss required to minimize disclosure risk are needed before these data can be used to improve public health. This project will develop the first integrated software system that provides solutions for intervalizing/generalizing data, controlling data utility, and performing analyses using interval statistics.