Narrative clinical reports contain a rich set of clinical knowledge that could be invaluable for clinical research. However, they usually also contain personal identifiers that are considered protected health information (PHI), which is associated with use restrictions and risks to privacy. Computational de-identification seeks to remove all of the identifiers in such narrative text in order to produce de-identified documents that can be used in research with fewer constraints and with almost no risk to privacy. It uses natural language processing (NLP) tools and techniques to recognize patient-related individually identifiable information (e.g., names, addresses, and telephone and social security numbers) in the text and to redact it. In this way, patient privacy is protected and clinical knowledge is preserved. After exploring existing de-identification tools, the U.S. National Library of Medicine (NLM) began developing a new software application that is capable of de-identifying many kinds of clinical reports with high accuracy. The software design uses a number of deterministic and probabilistic pattern recognition algorithms and various computational linguistic methods, as well as large dictionaries of personal names, addresses, and organizations. The application accepts narrative reports in plain text or in HL7 format. When the reports are formatted as HL7 messages, the application leverages the labeled patient-related information embedded in various HL7 segments to find such information in the free-text narrative. The application software includes an editor for visualization and markup, called the Visual Tagging Tool (VTT), that we use to produce gold standards against which to test the tool. Although designed specifically for tagging identifiers that contain personally identifiable, protected health information, VTT has been made publicly available to the greater NLP community for expanded lexical tagging and text annotation.
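To make the idea of combining deterministic patterns with labeled HL7 fields concrete, the following is a minimal sketch in Python. It is not the NLM application's algorithm. The regular expressions, the function names `names_from_hl7` and `redact`, the placeholder tags, and the sample message are all illustrative assumptions. The sketch only assumes the standard HL7 v2 layout, in which fields are separated by `|`, name components by `^`, and the patient name is carried in the PID-5 field.

```python
import re

# Illustrative patterns only (assumed, not taken from the NLM software).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")


def names_from_hl7(message):
    """Collect patient name parts from the PID-5 field of an HL7 v2 message."""
    names = set()
    # HL7 v2 segments end with a carriage return; tolerate newlines as well.
    for segment in message.replace("\n", "\r").split("\r"):
        fields = segment.split("|")
        if fields[0] == "PID" and len(fields) > 5:
            # PID-5 holds the name as components, e.g. family^given.
            names.update(part for part in fields[5].split("^") if part)
    return names


def redact(text, known_names=()):
    """Replace SSNs, phone numbers, and known names with placeholder tags."""
    text = SSN.sub("[SSN]", text)
    text = PHONE.sub("[PHONE]", text)
    # Longer names first, so a short name never splits a longer one.
    for name in sorted(known_names, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, "[NAME]", text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    hl7 = "MSH|^~\\&|LAB\rPID|1||12345||Doe^John||19600101|M"
    note = "Mr. John Doe (SSN 123-45-6789) can be reached at 301-555-0100."
    print(redact(note, names_from_hl7(hl7)))
```

A production system would need far more than this, for example probabilistic models and large name dictionaries as described above, because fixed patterns miss misspellings, nicknames, and identifiers that also occur as ordinary clinical terms.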
We are now studying the performance of our approach on a large corpus of tagged clinical documents. The preliminary results of this study suggest that computational de-identification methods may attain a sensitivity of at least 99.9% and a specificity of at least 99% across a large spectrum of identifiers containing personally identifiable information.
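For readers unfamiliar with these two measures, here is a minimal sketch of how sensitivity and specificity could be computed at the token level against a gold standard. The function name `sensitivity_specificity` and the choice of per-token boolean labels (True meaning "this token is an identifier") are assumptions made for illustration; the study's actual evaluation protocol is not described in this text.

```python
def sensitivity_specificity(gold, predicted):
    """Return (sensitivity, specificity) for parallel boolean label lists.

    gold[i] is True if token i is an identifier in the gold standard;
    predicted[i] is True if the system redacted token i.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # identifiers redacted
    fn = sum(1 for g, p in pairs if g and not p)      # identifiers missed
    tn = sum(1 for g, p in pairs if not g and not p)  # clinical text preserved
    fp = sum(1 for g, p in pairs if not g and p)      # clinical text lost

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one identifier and one non-identifier")

    sensitivity = tp / (tp + fn)  # share of identifiers that were removed
    specificity = tn / (tn + fp)  # share of non-identifiers that were kept
    return sensitivity, specificity


if __name__ == "__main__":
    gold = [True, True, False, False, False]
    predicted = [True, False, False, False, True]
    print(sensitivity_specificity(gold, predicted))
```

In this framing, high sensitivity corresponds to protecting patient privacy, while high specificity corresponds to preserving clinical knowledge.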

National Library of Medicine
Kayaalp, Mehmet (2018) Patient Privacy in the Era of Big Data. Balkan Med J 35:8-17
Kayaalp, Mehmet; Browne, Allen C; Sagan, Pamela et al. (2015) Challenges and Insights in Using HIPAA Privacy Rule for Clinical Text Annotation. AMIA Annu Symp Proc 2015:707-16
Browne, Allen C; Kayaalp, Mehmet; Dodd, Zeyno A et al. (2014) The Challenges of Creating a Gold Standard for De-identification Research. AMIA Annu Symp Proc 2014:353-8
Huser, Vojtech; Kayaalp, Mehmet; Dodd, Zeyno A et al. (2014) Piloting a deceased subject integrated data repository and protecting privacy of relatives. AMIA Annu Symp Proc 2014:719-28
Kayaalp, Mehmet; Browne, Allen C; Dodd, Zeyno A et al. (2014) De-identification of Address, Date, and Alphanumeric Identifiers in Narrative Clinical Reports. AMIA Annu Symp Proc 2014:767-76
Kayaalp, Mehmet; Browne, Allen C; Callaghan, Fiona M et al. (2013) The pattern of name tokens in narrative clinical text and a comparison of five systems for redacting them. J Am Med Inform Assoc :
Fung, Kin Wah; Kayaalp, Mehmet; Callaghan, Fiona et al. (2013) Comparison of electronic pharmacy prescription records with manually collected medication histories in an emergency department. Ann Emerg Med 62:205-11
Kang, Yanna Shen; Kayaalp, Mehmet (2013) Extracting laboratory test information from biomedical text. J Pathol Inform 4:23