Named Entity Recognition and Relationship Extraction in Biomedicine

Lu, Zhiyong

Abstract

Mining useful knowledge from the biomedical literature is beginning to realize its potential for improving literature search, automating biological data curation, and enabling large-scale studies not possible otherwise. We have therefore focused on recognizing various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals, and the relationships among them. In FY17, we continued the development of our own text mining tools for named entity recognition and relation extraction. We first developed tmVar 2.0, an improved version of our previous tool, which recognizes genetic variants and normalizes them to unique identifiers (dbSNP RSIDs). We validated our method in a benchmarking evaluation, where it compared favorably to the state of the art. We then applied our approach to the entire PubMed and demonstrated that combining text mining with manual database curation can greatly assist human efforts in prioritizing variants in genomic research. Following the recognition of sequence variants, we created an extraction system for genotype-phenotype relationships. The method extracts disease-gene-variant triplets using a machine learning approach that identifies the genes and protein products associated with each mutation from both the local context (the same document) and the global context (i.e., all of PubMed). The system was evaluated both by comparison with the entries in the human-curated database UniProt and on benchmark datasets, where it demonstrated a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results.

While the feature-based machine learning described above has been the most commonly used approach to text mining for some time, deep learning methods are rapidly being adopted due to their improved performance and the reduced need for feature engineering. We created a novel deep learning model (McDepCNN) for extracting protein-protein interactions, which uses multiple channels to combine information from each word with information from the head of the corresponding word. The method is thus capable of capturing long-distance features, and was shown to achieve a 24.4% relative improvement in F1-score over existing methods in cross-corpus evaluation and a 12% improvement in F1-score over kernel-based methods on difficult instances.

In addition to purely methodological work, we also developed several systems that combine novel methodologies with applications of potential clinical importance. We developed the DIGNiFI (Disease causing GeNe FInder) method, which uses machine learning and features extracted from protein-protein interaction networks to identify disease-causing genes. We evaluated our method using 1184 known disease genes from 128 orphan diseases collected from Orphanet, demonstrating that it outperforms existing methods for identifying disease-causing genes. We also demonstrated that combining the human-curated protein-protein interaction network with interactions text-mined from the biomedical literature results in an additional performance improvement. Finally, to demonstrate the utility of text mining in creating knowledge bases for large-scale studies, we created a new database containing over 100,000 labeled chest X-rays (ChestX-ray8) by text mining the associated radiological reports for eight commonly occurring thoracic diseases. We further demonstrated that these diseases can be automatically detected and even spatially located via a unified weakly supervised multi-label image classification and disease localization framework.
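
The improvements quoted in the abstract are relative gains in F1, the harmonic mean of precision and recall. As a minimal sketch of that arithmetic (the function names `f1_score` and `relative_improvement` are illustrative and not taken from the project's software), the following reproduces the calculation for the rounded triplet-extraction scores of 0.62 and 0.79:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional gain of `improved` over `baseline` (0.10 means +10%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Rounded F1 scores quoted in the abstract for triplet extraction.
gain = relative_improvement(0.62, 0.79)
print(f"{gain:.1%}")  # 27.4%
```

The rounded scores give roughly 27.4%; the small gap from the reported 28% is presumably due to rounding of the published F1 values.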

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM091813-05
Application #: 9564628
Support Year: 5
Fiscal Year: 2017

Institution

Name: National Library of Medicine

Related projects


NIH 2019 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2018 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2017 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2016 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2015 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2014 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2013 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine	$821,852

Publications

Allot, Alexis; Peng, Yifan; Wei, Chih-Hsuan et al. (2018) LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 46:W530-W536

Peng, Yifan; Wang, Xiaosong; Lu, Le et al. (2018) NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Jt Summits Transl Sci Proc 2017:188-196

Ching, Travers; Himmelstein, Daniel S; Beaulieu-Jones, Brett K et al. (2018) Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 15:

Lee, Kyubum; Famiglietti, Maria Livia; McMahon, Aoife et al. (2018) Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 14:e1006390

Peng, Yifan; Rios, Anthony; Kavuluru, Ramakanth et al. (2018) Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database (Oxford) 2018:

Kwon, Dongseop; Kim, Sun; Wei, Chih-Hsuan et al. (2018) ezTag: tagging biomedical concepts via interactive learning. Nucleic Acids Res 46:W523-W529

Rios, Anthony; Kavuluru, Ramakanth; Lu, Zhiyong (2018) Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics 34:2973-2981

van Asten, Freekje; Simmons, Michael; Singhal, Ayush et al. (2018) A Deep Phenotype Association Study Reveals Specific Phenotype Associations with Genetic Variants in Age-related Macular Degeneration: Age-Related Eye Disease Study 2 (AREDS2) Report No. 14. Ophthalmology 125:559-568

Mao, Yuqing; Lu, Zhiyong (2017) MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank. J Biomed Semantics 8:15

Liu, Xiaoxia; Yang, Zhihao; Lin, Hongfei et al. (2017) DIGNiFI: Discovering causative genes for orphan diseases using protein-protein interaction networks. BMC Syst Biol 11:23

Showing the most recent 10 out of 51 publications
