Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and many other scientific tasks. Hence, it is important to be able to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous PubMed log analysis revealed that people search certain biomedical concepts more often than others and that there exist strong associations between different concepts. For example, a disease name often co-occurs with gene/protein and drug names.

Our recent research introduced a state-of-the-art system called DNorm for disease normalization based on pairwise learning to rank. In 2013-2014, we investigated the difference in DNorm's performance when applied to clinical narratives versus biomedical publications. We used closure properties to compare the richness of the vocabulary in clinical narrative text to that of biomedical publications. We found that while the size of the overall vocabulary is similar between clinical narrative and biomedical publications, clinical narrative uses a richer terminology to describe disorders, which we believe to be one of the primary causes of the reduced performance on clinical narrative. Accordingly, we introduced several lexical enhancements, generalizable to other clinical NLP tasks, that improved the ability of DNorm to handle this variation. The clinical version of DNorm (DNorm-C) is now openly available to the research community, along with our other open source tools.

One common challenge in biomedical named entity recognition (NER) and normalization is the identification and resolution of composite named entities, where a single span refers to more than one concept (e.g., BRCA1/2). Previous NER and normalization studies have either ignored composite mentions, used simple ad hoc rules, or handled only coordination ellipsis, so a robust approach for handling multi-type composite mentions is greatly needed. In 2014-2015, we proposed a hybrid method that integrates a machine-learning model with a pattern identification strategy to identify the individual components of each composite mention. Our method, which we have named SimConcept, is the first to systematically handle many types of composite mentions. The technique achieves high performance in identifying and resolving composite mentions for three key biological entities: genes (F-measure 90.42%), diseases (F-measure 86.47%), and chemicals (F-measure 86.05%). Furthermore, our results show that using SimConcept subsequently improves the performance of gene and disease concept recognition and normalization.

As mentioned earlier, one promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we continued to improve our previously developed curation-assisting tool PubTator and to collaborate with domain experts, in this case human database curators. With these efforts, our PubTator system is now being used daily in the production curation pipeline of two external databases:
1. HuGE Navigator: a CDC knowledgebase of human genome epidemiology
2. SwissProt: an annotated database of protein sequence and functional information.
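To make the composite-mention problem described above concrete, the following is a minimal, pattern-only sketch in Python of how a mention such as BRCA1/2 or "breast and ovarian cancer" can be expanded into its individual concepts. It is an illustration only, not the SimConcept implementation, which combines a machine-learning model with a much broader pattern inventory; the function name and the two regular expressions are assumptions made for this example.

```python
import re

# Hypothetical, pattern-only sketch of composite-mention splitting. The actual
# SimConcept system combines a machine-learning model with a larger pattern
# inventory; the patterns below cover only two common cases.

def expand_composite_mention(mention):
    # Pattern 1: shared alphabetic prefix with slash-separated numeric suffixes,
    # e.g., "BRCA1/2" -> ["BRCA1", "BRCA2"].
    m = re.fullmatch(r"([A-Za-z]+)(\d+)((?:/\d+)+)", mention)
    if m:
        prefix, first, rest = m.groups()
        suffixes = [first] + rest.lstrip("/").split("/")
        return [prefix + s for s in suffixes]

    # Pattern 2: coordination ellipsis with a shared head word,
    # e.g., "breast and ovarian cancer" -> ["breast cancer", "ovarian cancer"].
    m = re.fullmatch(r"(\w+) and (\w+) (\w+)", mention)
    if m:
        left, right, head = m.groups()
        return [f"{left} {head}", f"{right} {head}"]

    # Not recognized as a composite mention: return it unchanged.
    return [mention]

print(expand_composite_mention("BRCA1/2"))                    # ['BRCA1', 'BRCA2']
print(expand_composite_mention("breast and ovarian cancer"))  # ['breast cancer', 'ovarian cancer']
```

In practice a purely rule-based splitter of this kind misses many mention forms, which is why SimConcept pairs pattern identification with a machine-learning model.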
In 2014-2015, we also investigated the feasibility of using crowdsourcing to assist gene-mutation curation and drug-indication cataloging, given the high cost of expert annotation. In both studies, we first translated the complex expert annotation task into human intelligence tasks (HITs) suitable for non-expert workers. For instance, instead of asking people to find drug indications in free text (e.g., lengthy paragraphs), we simplified the task so that each HIT only required a worker to make a binary judgment of whether a highlighted disease, in the context of a given drug label sentence, is an indication. We then recruited annotators from a large, anonymous pool of workers through Amazon Mechanical Turk (MTurk), and judgments from the crowd were aggregated to produce the final answer. For evaluation, we assessed the ability of our proposed method to achieve high-quality annotations in a time-efficient and cost-effective manner. In comparison with the expert annotations, we find that our crowdsourcing approach not only results in significant cost and time savings, but also achieves accuracy comparable to that of domain experts. We therefore conclude that our crowdsourcing-based approach provides a readily scalable and cost-effective approach to assisting manual curation.
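The aggregation step can be illustrated with a minimal sketch, assuming each HIT receives binary judgments from several MTurk workers and the final label is taken by majority vote. The record format, label strings, and tie-breaking rule below are assumptions for illustration; the actual studies may have used a different aggregation scheme.

```python
from collections import Counter, defaultdict

# Minimal sketch of aggregating crowd judgments into one final label per HIT.
# Each record is assumed to be (hit_id, worker_id, judgment) with a binary
# judgment of "indication" or "not_indication".

def aggregate_by_majority(judgments):
    votes = defaultdict(Counter)
    for hit_id, worker_id, label in judgments:
        votes[hit_id][label] += 1

    final = {}
    for hit_id, counts in votes.items():
        top = counts.most_common(2)
        # Break ties conservatively by defaulting to "not_indication".
        if len(top) > 1 and top[0][1] == top[1][1]:
            final[hit_id] = "not_indication"
        else:
            final[hit_id] = top[0][0]
    return final

example = [
    ("hit-1", "w1", "indication"),
    ("hit-1", "w2", "indication"),
    ("hit-1", "w3", "not_indication"),
    ("hit-2", "w1", "not_indication"),
    ("hit-2", "w4", "not_indication"),
]
print(aggregate_by_majority(example))
# {'hit-1': 'indication', 'hit-2': 'not_indication'}
```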