Exploiting the full potential of information rich and rapidly growing repositories of patient clinical text is hampered by the absence of scalable and robust de-identification tools. Clinical text contains protected health information (PHI), and the Health Insurance Portability and Accountability Act (HIPAA) restricts research use of patient information containing PHI to specific, limited, IRB-approved projects. As a result, vast repositories of clinical text remain under-used by internal researchers, and are even less available for external transmission to outside collaborators or for centralized processing by state-of-the-art natural language processing (NLP) technologies. De-identification, which is the removal of PHI from clinical text, is challenging. Despite their availability for over a decade, commercially available automated systems are expensive, require local tailoring, and have not gained widespread market penetration. Manual methods are costly and do not scale, yet continue to be used despite the small amount of residual PHI they leave behind. Open source de-identification tools based on state-of-the-art machine learning technologies can perform at or above the level of manual approaches but also suffer from the residual PHI problem. Current de-identification approaches, then, also severely limit the use and mobility of clinical text while exposing patients to privacy risks. These approaches redact PHI, blacking it out or replacing it with symbols (e.g., """"""""Here for cardiac eval is Mr. **PT_NAME
All known automated de-identification methods leave behind a small amount of residual protected health information (PHI), which presents a risk of disclosing patient privacy and creates barriers to more widespread internal use and external sharing of information-rich clinical text for broad research purpose. This project advances and evaluates the efficacy of a novel method, called the Hiding In Plain Sight (HIPS) approach, which conceals residual PHI by replacing all other instance of PHI found in a document with realistic appearing but fictitious surrogates. Rigorous efficacy testing is needed to confirm that HIPS surrogates effectively reduce risk of exposing patient privacy by concealing the small amount of residual PHI all known de-identification leave behind.
Li, Bo; Vorobeychik, Yevgeniy; Li, Muqun et al. (2017) Scalable Iterative Classification for Sanitizing Large-Scale Datasets. IEEE Trans Knowl Data Eng 29:698-711 |
Li, Muqun; Carrell, David; Aberdeen, John et al. (2016) Optimizing annotation resources for natural language de-identification via a game theoretic framework. J Biomed Inform 61:97-109 |
Carrell, David S; Cronkite, David J; Malin, Bradley A et al. (2016) Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification. Methods Inf Med 55:356-64 |
Li, Muqun; Carrell, David; Aberdeen, John et al. (2014) De-identification of clinical narratives through writing complexity measures. Int J Med Inform 83:750-67 |
Hanauer, David A; Mei, Qiaozhu; Malin, Bradley et al. (2013) Location bias of identifiers in clinical narratives. AMIA Annu Symp Proc 2013:560-9 |