Named Entity Recognition and Relationship Extraction in Biomedicine

Lu, Zhiyong

Abstract

Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. We have therefore focused on recognizing various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals, and the relationships among them.

Building on our previous success with PubTator, which has served annotated PubMed abstracts for 300 million requests, we expanded its scope to include automated concept annotations for full-length articles in the newly released PubTator Central (PTC) system. Specifically, PTC annotates PubMed (29 million abstracts) and the PMC Text Mining subset (3 million full-text articles). Annotations are downloadable in multiple formats (XML, JSON, and tab-delimited) via the online interface, a RESTful web service, and bulk FTP. Improved concept identification systems and a new disambiguation module based on deep learning increase annotation accuracy, and the new server-side architecture is significantly faster.

To facilitate research on pre-trained language representations in the biomedical domain, we introduced the Biomedical Language Understanding Evaluation (BLUE) benchmark. The benchmark consists of five tasks with ten datasets that cover both biomedical and clinical texts with different dataset sizes and difficulties. We also evaluated several baselines based on BERT and ELMo and found that the BERT model pre-trained on PubMed abstracts and MIMIC-III clinical notes achieves the best results. We make the datasets, pre-trained models, and code publicly available.

In 2019, we developed ML-Net, a novel end-to-end deep learning framework for multi-label classification of biomedical texts, where each document is associated with one or more labels. Because this task has broad applications in biomedicine, a number of computational methods have been proposed for it. Many of these methods, however, have only modest accuracy or efficiency and limited success in practical use. Our ML-Net method combines a label prediction network with an automated label count prediction mechanism to provide an optimal set of labels. This is accomplished by leveraging both the predicted confidence score of each label and the deep contextual information (modeled by ELMo) in the target document. Our benchmarking results show that ML-Net compares favorably to state-of-the-art methods in multi-label classification of biomedical text. ML-Net is also shown to be robust when evaluated on different text genres in biomedicine.

Supervised machine-learning methods for text mining typically require a large amount of annotated data for training. To reduce this burden of manual annotation, we proposed a novel semi-supervised learning algorithm based on variational autoencoders (VAEs) that also makes use of unlabeled data. Our model consists of three parts: a classifier, an encoder, and a decoder. The classifier is implemented using multi-layer convolutional neural networks (CNNs), and the encoder and decoder are implemented using bidirectional long short-term memory networks (Bi-LSTMs) and CNNs, respectively. The semi-supervised mechanism allows our model to learn features from both the labeled and unlabeled data. We evaluated our method on multiple public protein-protein interaction (PPI), drug-drug interaction (DDI), and chemical-protein interaction (CPI) corpora. Experimental results show that our method effectively exploits the unlabeled data to improve performance and reduce the dependence on labeled data. To the best of our knowledge, this is the first semi-supervised VAE-based method for (biomedical) relation extraction.

As mentioned earlier, one promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we organized a challenge task on text mining for precision medicine through BioCreative VI. The challenge was organized in two subtasks: (i) a document triage subtask, focused on identifying scientific literature containing experimentally verified PPIs affected by genetic mutations, and (ii) a relation extraction subtask, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage subtask, and six teams worldwide contributed 14 different text-mining systems for the relation extraction subtask.

Separately, we have applied cutting-edge deep learning techniques to the literature triage process for human kinome curation and to the identification of translational research in genomic medicine beyond bench to bedside from the biomedical literature. In both applications, the automated results were compared with and validated by the expert curators of the two databases: neXtA5 and the CDC's Public Health Genomics Knowledge Base. Both results show that our deep-learning-based method is more efficient and can replace the traditional manual triage methods of those databases. Our methods can give human curators more time to focus on more challenging tasks, such as actual curation and the discovery of novel papers and experimental techniques to consider for inclusion.

As shown above, deep learning, a class of machine learning algorithms, has shown impressive results in several of our recent studies this year. In addition to applying it to natural language processing, we have also seen its success in our medical image analysis work, such as processing chest X-rays, CT images, and various kinds of retinal images for autonomous disease diagnosis and prognosis. One such project relates to age-related macular degeneration (AMD), which is the leading cause of blindness in developed countries and, by 2040, will affect approximately 300 million people worldwide. Accurate detection of AMD severity and prediction of progression to the sight-threatening late stage of disease are thus of significant importance for personalizing monitoring and preventative interventions. As a joint effort between the National Library of Medicine and the National Eye Institute, we developed novel machine-learning approaches to automatically classify AMD severity levels and predict the risk of progression to late AMD based on retinal images taken from two US multi-center longitudinal studies. We demonstrated that the fully automated deep learning model was superior to existing clinical standards and has the potential to improve patient care.
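
As an illustration of the ML-Net idea described above (pairing per-label confidence scores with a predicted label count), the following minimal Python sketch shows one simple way such a combination could work: keep the k highest-scoring labels, where k is the predicted count. This is not the ML-Net implementation. The function name `select_labels`, the example labels, and the scores are hypothetical, and in ML-Net both quantities come from trained neural networks rather than being supplied by hand.

```python
from typing import Dict, List


def select_labels(confidences: Dict[str, float], predicted_count: int) -> List[str]:
    """Return the `predicted_count` labels with the highest confidence.

    Hypothetical helper for illustration only; not part of ML-Net.
    """
    # Each document carries at least one label, and never more than exist.
    k = max(1, min(predicted_count, len(confidences)))
    # Sort by descending confidence; break ties alphabetically for determinism.
    ranked = sorted(confidences.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:k]]


# Made-up confidence scores for a single document (assumption, not real output).
scores = {"Neoplasms": 0.91, "Humans": 0.85, "Mice": 0.12, "Apoptosis": 0.47}
top_labels = select_labels(scores, predicted_count=2)
print(top_labels)
```

Compared with a fixed confidence threshold, a per-document count lets the number of returned labels vary with the document itself, which is the motivation the abstract gives for predicting the label count.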

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM091813-07
Application #: 10007525
Support Year: 7
Fiscal Year: 2019

Institution

Name: National Library of Medicine

Related projects

NIH 2019 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine	Lu, Zhiyong / National Library of Medicine
NIH 2018 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine	Lu, Zhiyong / National Library of Medicine
NIH 2017 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine	Lu, Zhiyong / National Library of Medicine
NIH 2016 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine	Lu, Zhiyong / National Library of Medicine
NIH 2015 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine	Lu, Zhiyong / National Library of Medicine
NIH 2014 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine	Lu, Zhiyong / National Library of Medicine
NIH 2013 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine	Lu, Zhiyong / National Library of Medicine	$821,852

Publications

Allot, Alexis; Peng, Yifan; Wei, Chih-Hsuan et al. (2018) LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 46:W530-W536

Peng, Yifan; Wang, Xiaosong; Lu, Le et al. (2018) NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Jt Summits Transl Sci Proc 2017:188-196

Ching, Travers; Himmelstein, Daniel S; Beaulieu-Jones, Brett K et al. (2018) Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 15:

Lee, Kyubum; Famiglietti, Maria Livia; McMahon, Aoife et al. (2018) Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 14:e1006390

Peng, Yifan; Rios, Anthony; Kavuluru, Ramakanth et al. (2018) Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database (Oxford) 2018:

Kwon, Dongseop; Kim, Sun; Wei, Chih-Hsuan et al. (2018) ezTag: tagging biomedical concepts via interactive learning. Nucleic Acids Res 46:W523-W529

Rios, Anthony; Kavuluru, Ramakanth; Lu, Zhiyong (2018) Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics 34:2973-2981

van Asten, Freekje; Simmons, Michael; Singhal, Ayush et al. (2018) A Deep Phenotype Association Study Reveals Specific Phenotype Associations with Genetic Variants in Age-related Macular Degeneration: Age-Related Eye Disease Study 2 (AREDS2) Report No. 14. Ophthalmology 125:559-568

Mao, Yuqing; Lu, Zhiyong (2017) MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank. J Biomed Semantics 8:15

Liu, Xiaoxia; Yang, Zhihao; Lin, Hongfei et al. (2017) DIGNiFI: Discovering causative genes for orphan diseases using protein-protein interaction networks. BMC Syst Biol 11:23

Showing the most recent 10 out of 51 publications
