Exploiting the full potential of information rich and rapidly growing repositories of patient clinical text is hampered by the absence of scalable and robust de-identification tools. Clinical text contains protected health information (PHI), and the Health Insurance Portability and Accountability Act (HIPAA) restricts research use of patient information containing PHI to specific, limited, IRB-approved projects. As a result, vast repositories of clinical text remain under-used by internal researchers, and are even less available for external transmission to outside collaborators or for centralized processing by state-of-the-art natural language processing (NLP) technologies. De-identification, which is the removal of PHI from clinical text, is challenging. Despite their availability for over a decade, commercially available automated systems are expensive, require local tailoring, and have not gained widespread market penetration. Manual methods are costly and do not scale, yet continue to be used despite the small amount of residual PHI they leave behind. Open source de-identification tools based on state-of-the-art machine learning technologies can perform at or above the level of manual approaches but also suffer from the residual PHI problem. Current de-identification approaches, then, also severely limit the use and mobility of clinical text while exposing patients to privacy risks. These approaches redact PHI, blacking it out or replacing it with symbols (e.g., """"""""Here for cardiac eval is Mr. **PT_NAME, a **AGE<60s>yo male with his son Doug ...""""""""). Traditional approaches leave residual PHI (""""""""Doug"""""""" in this example) to be easily noticed by readers of the text, as it remains plainly visible among the prominent redactions. We developed and pilot tested an alternative approach we believe addresses the residual PHI problem. Our approach uses the strategy of concealing, rather than trying to eliminate, residual PHI. We call it the """"""""Hiding In Plain Sight"""""""" (HIPS) approach. HIPS replaces all known PHI with """"""""surrogate"""""""" PHI- fictional names, ages, etc.-that look real but do not refer to any actual patient. A HIPS version of the above text is: """"""""Here for cardiac eval is Mr. Jones, a 64 yo male with his son Doug ..."""""""" where the name """"""""Jones"""""""" and age """"""""64"""""""" are fictional surrogates, but the name """"""""Doug"""""""" is residual PHI. To a reader, the surrogates and the residual PHI are indistinguishable. This prevents the reader from detecting the latter, avoiding disclosure. Our preliminary studies suggest that HIPS can reduce the risk of disclosure of residual PHI by a factor of 10. This yields overall performance that far surpasses the performance attainable by manual methods, and is unlikely to be matched, we believe, by additional incremental improvements in PHI tagging models (i.e., efforts to reduce residual PHI). Our pilot studies indicate IRBs would welcome the HIPS approach if it were shown to be effective through rigorous evaluation. To expand usage of clinical text and enhance patient privacy, we propose to formalize rules of effective surrogate generation (Aim 1), extend related de-identification confidence scoring methods (Aim 2), and conduct rigorous efficacy testing of HIPS in diverse institutional settings (Aim 3).

Public Health Relevance

All known automated de-identification methods leave behind a small amount of residual protected health information (PHI), which presents a risk of disclosing patient privacy and creates barriers to more widespread internal use and external sharing of information-rich clinical text for broad research purpose. This project advances and evaluates the efficacy of a novel method, called the Hiding In Plain Sight (HIPS) approach, which conceals residual PHI by replacing all other instance of PHI found in a document with realistic appearing but fictitious surrogates. Rigorous efficacy testing is needed to confirm that HIPS surrogates effectively reduce risk of exposing patient privacy by concealing the small amount of residual PHI all known de-identification leave behind.

National Institute of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Project #
Application #
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Group Health Cooperative
United States
Zip Code
Li, Bo; Vorobeychik, Yevgeniy; Li, Muqun et al. (2017) Scalable Iterative Classification for Sanitizing Large-Scale Datasets. IEEE Trans Knowl Data Eng 29:698-711
Li, Muqun; Carrell, David; Aberdeen, John et al. (2016) Optimizing annotation resources for natural language de-identification via a game theoretic framework. J Biomed Inform 61:97-109
Carrell, David S; Cronkite, David J; Malin, Bradley A et al. (2016) Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification. Methods Inf Med 55:356-64
Li, Muqun; Carrell, David; Aberdeen, John et al. (2014) De-identification of clinical narratives through writing complexity measures. Int J Med Inform 83:750-67
Hanauer, David A; Mei, Qiaozhu; Malin, Bradley et al. (2013) Location bias of identifiers in clinical narratives. AMIA Annu Symp Proc 2013:560-9