One of the major barriers to leveraging Electronic Health Record (EHR) data for clinical and translational science is the prevalent use of unstructured or semi-structured clinical narratives for documenting clinical information. Natural Language Processing (NLP), which extracts structured information from narratives, has received great attention and has played a critical role in enabling secondary use of EHRs for clinical and translational research. As demonstrated by large-scale efforts such as ACT (Accrual of patients for Clinical Trials), eMERGE, and PCORnet, using EHR data for research rests on a robust data and informatics infrastructure that allows the structuring of clinical narratives and supports the extraction of clinical information for downstream applications. Current successful NLP use cases often require a strong informatics team (with NLP experts) that works with clinicians, who supply domain knowledge, to build customized NLP engines iteratively. Such close collaboration between NLP experts and clinicians is not feasible at institutions with limited informatics support. Additionally, the usability, portability, and generalizability of NLP systems are still limited, partially due to the lack of access to EHRs across institutions to train the systems. The limited availability of EHR data also restricts the training opportunities needed to improve workforce competence in clinical NLP.
We aim to address the above challenges by extending our existing collaboration among multiple CTSA hubs on open health natural language processing (OHNLP) to share distributional information of NLP artifacts (i.e., words, n-grams, phrases, sentences, concept mentions, concepts, and text segments) acquired from real EHRs across multiple institutions. We will leverage the advanced privacy-preserving computing infrastructure of iDASH (integrating Data for Analysis, Anonymization, and SHaring) for privacy-preserving data analysis models and will partner with diverse communities, including Observational Health Data Sciences and Informatics (OHDSI), Precision Medicine Initiative (PMI), PCORnet, and Rare Diseases Clinical Research Network (RDCRN), to demonstrate the utility of NLP for translational research. This CTSA innovation award RFA provides us with a unique opportunity to address the challenges facing clinical NLP. Through strong partnerships with multiple research communities and the leadership roles of the research team in clinical NLP, we envision that the successful delivery of this project will broaden the utilization of clinical NLP across the research community. Four aims are planned: i) obtain PHI-suppressed NLP artifacts with retained distribution information across multiple institutions and assess the privacy risk of accessing PHI-suppressed artifacts, ii) generate a synthetic text corpus for exploratory analysis of clinical narratives and assess its utility in NLP tasks drawn from various NLP challenges, iii) develop privacy-preserving computational phenotyping models empowered with NLP, and iv) partner with diverse communities to demonstrate the utility of our project for translational research.
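As an illustration of what sharing "distributional information of NLP artifacts" could look like in practice, the sketch below counts word bigrams per site, suppresses bigrams that appear in too few notes, and pools the surviving counts across sites. This is a minimal sketch under stated assumptions, not the project's method: the function names, the whitespace tokenizer, and the minimum-count suppression rule are all hypothetical, and a frequency threshold alone should not be read as a sufficient PHI safeguard.

```python
from collections import Counter
from typing import Iterable, Tuple

NGram = Tuple[str, ...]


def ngram_doc_freq(notes: Iterable[str], n: int = 2) -> Counter:
    """Count, for each word n-gram, the number of notes that contain it.

    Tokenization is a plain lowercase whitespace split (an assumption made
    for brevity; a real pipeline would use a clinical tokenizer).
    """
    counts: Counter = Counter()
    for note in notes:
        tokens = note.lower().split()
        # Use a set so each n-gram is counted at most once per note.
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        counts.update(grams)
    return counts


def suppress_rare(counts: Counter, min_notes: int) -> Counter:
    """Hypothetical suppression rule: drop n-grams seen in fewer than
    min_notes notes, since rare strings are more likely to be identifying."""
    return Counter({g: c for g, c in counts.items() if c >= min_notes})


def merge_sites(site_counts: Iterable[Counter]) -> Counter:
    """Pool the already-suppressed counts from several sites."""
    total: Counter = Counter()
    for counts in site_counts:
        total.update(counts)
    return total


# Toy notes standing in for two institutions (invented for illustration).
site_a = [
    "patient denies chest pain",
    "patient denies fever",
    "john smith denies chest pain",
]
site_b = [
    "patient denies chest pain",
    "patient reports chest pain",
]

shared = merge_sites(
    suppress_rare(ngram_doc_freq(site), min_notes=2) for site in (site_a, site_b)
)
```

Each site applies the threshold locally before anything leaves the institution, so only aggregate counts are pooled. A side effect worth noting is that common bigrams that fall below the threshold at one site are also lost from that site's contribution, which distorts the pooled distribution.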

Public Health Relevance

The proposed project aims to broaden the secondary use of electronic health records (EHRs) across the research community by combining innovative privacy-preserving computing techniques and clinical natural language processing.

National Institutes of Health (NIH)
National Center for Advancing Translational Sciences (NCATS)
Research Project--Cooperative Agreements (U01)
Study Section: Special Emphasis Panel (ZTR1)
Program Officer: Zhang, Xinzhi
Mayo Clinic, Rochester
United States
Bonomi, Luca; Jiang, Xiaoqian (2018) Patient ranking with temporally annotated data. J Biomed Inform 78:43-53
Liu, Sijia; Shen, Feichen; Komandur Elayavilli, Ravikumar et al. (2018) Extracting chemical-protein relations using attention-based neural networks. Database (Oxford) 2018:
Chen, Luyao; Aziz, Md Momin; Mohammed, Noman et al. (2018) Secure large-scale genome data storage and query. Comput Methods Programs Biomed 165:129-137
Kim, Andrey; Song, Yongsoo; Kim, Miran et al. (2018) Logistic regression model training based on the approximate homomorphic encryption. BMC Med Genomics 11:83
Wang, Yanshan; Liu, Sijia; Afzal, Naveed et al. (2018) A comparison of word embeddings for the biomedical natural language processing. J Biomed Inform 87:12-20
Lee, Junghye; Sun, Jimeng; Wang, Fei et al. (2018) Privacy-Preserving Patient Similarity Learning in a Federated Environment: Development and Analysis. JMIR Med Inform 6:e20
Rizvi, Rubina F; Adam, Terrence J; Lindemann, Elizabeth A et al. (2018) Comparing Existing Resources to Represent Dietary Supplements. AMIA Jt Summits Transl Sci Proc 2017:207-216
Son, Jung Hoon; Xie, Gangcai; Yuan, Chi et al. (2018) Deep Phenotyping on Electronic Health Records Facilitates Genetic Diagnosis by Clinical Exomes. Am J Hum Genet 103:58-73
McRoy, Susan; Rastegar-Mojarad, Majid; Wang, Yanshan et al. (2018) Assessing Unmet Information Needs of Breast Cancer Survivors: Exploratory Study of Online Health Forums Using Text Classification and Retrieval. JMIR Cancer 4:e10
Wang, Yanshan; Wang, Liwei; Rastegar-Mojarad, Majid et al. (2018) Clinical information extraction applications: A literature review. J Biomed Inform 77:34-49

Showing the most recent 10 out of 14 publications