Computer-assisted medicine is at a crossroads: medical care requires accurate data, but making such data widely available can create unacceptable risks to the privacy of individual patients. This tension between utility and privacy is especially acute in predictive personalized medicine (PPM). PPM holds the promise of tailoring treatment decisions to the individual based on her or his particular genetics and clinical history. Making PPM a reality requires running statistical, data mining, and machine learning algorithms on combined genetic, clinical, and demographic data to construct predictive models. Access to such data directly competes with the need for healthcare providers to protect the privacy of each patient's data, creating a tradeoff between model efficacy and privacy. We therefore find ourselves in an unfortunate standoff: significant medical advances that would result from more powerful mining of the data by a wider variety of researchers are hindered by significant privacy concerns on behalf of the patients represented in the data set. In this proposed work, we seek to develop and evaluate technology to resolve this standoff, enabling health practitioners and researchers to compute on privacy-sensitive medical records in order to make treatment decisions or create accurate models while protecting patient privacy. We will evaluate our approach on an actual de-identified electronic medical record data set, with an average of 29 years of clinical history on each patient and with detailed genetic data (650K SNPs) available for a subset of 5000 of the patients. This data set is available to us now through the Wisconsin Genomics Initiative, but only on a computer at the Marshfield Clinic. If successful, our approach will make possible the sharing of this cutting-edge data set, and others like it that are now in development, and will allow us to analyze this data at UW-Madison, where we have thousands of processors available in our Condor pool.
Our privacy approach integrates secure data access environments, including those appropriate to the use of laptops and cloud computing, with novel anonymization algorithms providing differential privacy guarantees for data and/or published results of data analysis. To this end, our specific aims are as follows:
AIM 1: Develop and deploy a secure local environment that, in combination with secure network functionality, will ensure end-to-end security and privacy for electronic medical records and biomedical datasets shared between clinical institutions and researchers.
AIM 2: Develop and deploy a secure virtual environment to allow large-scale, privacy-preserving data analysis "in the cloud."
AIM 3: Develop and evaluate privacy-preserving data mining algorithms for use with original (not anonymized) data sets consisting of electronic medical records and genetic data.
AIM 4: Develop and evaluate anonymizing data publishing algorithms and privacy guarantees that are appropriate to the complex structure present in electronic medical records with genetic data.
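To make the notion of a differential privacy guarantee for published analysis results concrete, the following is a minimal sketch of the standard Laplace mechanism applied to a counting query. It is an illustration only and not the algorithms proposed in this project; the function names, the toy patient records, and the choice of epsilon are all assumptions introduced for the example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one patient
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records; real EMR and genetic data are far richer.
patients = [
    {"age": 61, "carrier": True},
    {"age": 47, "carrier": False},
    {"age": 73, "carrier": True},
    {"age": 55, "carrier": True},
]

noisy_carriers = private_count(
    patients, lambda p: p["carrier"], epsilon=0.5, rng=random.Random(0)
)
print(f"Noisy carrier count: {noisy_carriers:.2f}")
```

A smaller epsilon adds more noise and gives stronger privacy at the cost of accuracy, which is the efficacy and privacy tradeoff described above. Answering many queries consumes privacy budget cumulatively, which this sketch does not track.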

Public Health Relevance

This project will develop an integrated approach to secure sharing of clinical and genetic data that is based on algorithms for anonymization of data to achieve differential privacy guarantees, on privacy-preserving publication of data analysis results, and on secure environments for data sharing that address the increasing use of laptops and of cloud computing. The end goal of this project is to meet the competing demands of providing patients with both privacy and accurate predictive models based on clinical history and genetics. This project includes the first concrete evaluation of privacy-preserving data mining algorithms on actual combined EMR and genetic data, using the Wisconsin Genomics Initiative data set.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section: Special Emphasis Panel (ZLM1-ZH-C (J2))
Program Officer: Sim, Hua-Chuan
University of Wisconsin Madison
Biostatistics & Other Math Sci
Schools of Medicine
United States
Ye, Zhan; Mayer, John; Ivacic, Lynn et al. (2015) Phenome-wide association studies (PheWASs) for functional variants. Eur J Hum Genet 23:523-9
Liu, Jie; Zhang, Chunming; Burnside, Elizabeth et al. (2014) Multiple Testing under Dependence via Semiparametric Graphical Models. Proc Int Conf Mach Learn 2014:955-963
Liu, Jie; Zhang, Chunming; Burnside, Elizabeth et al. (2014) Learning Heterogeneous Hidden Markov Random Fields. JMLR Workshop Conf Proc 33:576-584
Xu, Yuanzhong; Dunn, Alan M; Hofmann, Owen S et al. (2014) Application-Defined Decentralized Access Control. Proc USENIX Annu Tech Conf 2014:395-408
Brubaker, Chad; Jana, Suman; Ray, Baishakhi et al. (2014) Using Frankencerts for Automated Adversarial Testing of Certificate Validation in SSL/TLS Implementations. IEEE Secur Priv 2014:114-129
Georgiev, Martin; Jana, Suman; Shmatikov, Vitaly (2014) Breaking and Fixing Origin-Based Access Control in Hybrid Web/Mobile Application Frameworks. NDSS Symp 2014:1-15