There is a deluge of health-related texts in many genres, from clinical narratives to newswire and social media. These texts are diverse in content, format, and style, yet they represent complementary facets of biomedical and health knowledge. Natural Language Processing (NLP) holds much promise to extract, understand, and distill valuable information from these overwhelmingly large and complex streams of data, with the ultimate goal of advancing biomedicine and impacting the health and wellbeing of patients. There have been a number of success stories in various biomedical NLP applications, but the NLP methods investigated are usually tailored to one specific phenotype and one institution, which limits their portability and scalability. Moreover, while there has been much work on processing clinical texts, other genres of health texts, such as narratives and posts authored by health consumers and patients, still lack solutions to marshal and make sense of the health information they contain. Robust NLP solutions that answer the needs of biomedicine and health in general have not yet been fully investigated. A unified, data-science approach to health NLP enables the exploration of methods and solutions that have not been possible until now. Our vision is to unravel the information buried in health narratives by advancing text-processing methods in a unified way across all genres of text. The crosscutting theme is the investigation of methods for health NLP (hNLP) made possible by big data, fused with health knowledge. Our proposal moves the field toward semi-supervised and fully unsupervised methods, which succeed only when very large amounts of data are leveraged and knowledge is carefully injected into the methods. Our hNLP proposal also targets a key challenge of current hNLP research: the lack of shared software. We seek to provide a clearinghouse for software created under this proposal, and all developed tools will be disseminated.
Starting from the data characteristics of health texts and the information needs of stakeholders, we will develop and evaluate methods for information extraction and information understanding. We will translate our research into the publicly available NLP software platform cTAKES, through robust modules for extraction and understanding across all genres of health texts. We will also demonstrate the impact of our methods and tools through several use cases, ranging from the clinical point of care to public health, translational and precision medicine, and participatory medicine. Finally, we will disseminate our work through community activities, such as challenges to advance the state of the art in health natural language processing.

Public Health Relevance

There is a deluge of health texts. Natural Language Processing (NLP) holds much promise to unravel valuable information from these large data streams, with the goal of advancing medicine and the wellbeing of patients. We will advance state-of-the-art NLP by designing robust, scalable methods that leverage health big data, demonstrating their relevance on high-impact use cases, and disseminating NLP tools to the research community and the public at large.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Ravichandran, Veerasamy
Columbia University (N.Y.)
Internal Medicine/Medicine
Schools of Medicine
New York
United States
Osborne, John D; Neu, Matthew B; Danila, Maria I et al. (2018) CUILESS2016: a clinical corpus applying compositional normalization of text mentions. J Biomed Semantics 9:2
Xu, Dongfang; Yadav, Vikas; Bethard, Steven (2018) UArizona at the MADE1.0 NLP Challenge. Proc Mach Learn Res 90:57-65
Névéol, Aurélie; Dalianis, Hercules; Velupillai, Sumithra et al. (2018) Clinical Natural Language Processing in languages other than English: opportunities and challenges. J Biomed Semantics 9:12
Gonzalez-Hernandez, G; Sarker, A; O'Connor, K et al. (2017) Capturing the Patient's Perspective: a Review of Advances in Natural Language Processing of Health-Related Text. Yearb Med Inform 26:214-227
Zhang, Shaodian; O'Carroll Bantum, Erin; Owen, Jason et al. (2017) Online cancer communities as informatics intervention for social support: conceptualization, characterization, and impact. J Am Med Inform Assoc 24:451-459
Sadeque, Farig; Xu, Dongfang; Bethard, Steven (2017) UArizona at the CLEF eRisk 2017 Pilot Task: Linear and Recurrent Models for Early Depression Detection. CEUR Workshop Proc 1866:
Zhang, Shaodian; Grave, Edouard; Sklar, Elizabeth et al. (2017) Longitudinal analysis of discussion topics in an online breast cancer community using convolutional neural networks. J Biomed Inform 69:1-9
Zhang, Shaodian; Kang, Tian; Qiu, Lin et al. (2017) Cataloguing Treatments Discussed and Used in Online Autism Communities. Proc Int World Wide Web Conf 2017:123-131
Zhang, Shaodian; Qiu, Lin; Chen, Frank et al. (2017) "We make choices we think are going to save us": Debate and stance identification for online breast cancer CAM discussions. Proc Int World Wide Web Conf 2017:1073-1081
Zhang, Shaodian; Kang, Tian; Zhang, Xingting et al. (2016) Speculation detection for Chinese clinical notes: Impacts of word segmentation and embedding models. J Biomed Inform 60:334-41

Showing the most recent 10 out of 12 publications