Named Entity Recognition and Relationship Extraction in Biomedicine

Lu, Zhiyong

Abstract

Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. Hence, it is important to be able to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous analysis of PubMed logs revealed that people search for certain biomedical concepts more often than others and that there are strong associations between different concepts. For example, a disease name often co-occurs with gene/protein and drug names. Our recent research introduced DNorm, a machine learning method for disease name normalization based on pairwise learning to rank. In 2013-2014, we continued to improve DNorm by increasing its scalability through a dimension reduction technique based on low-rank matrix approximation. When assessed on our recently developed NCBI disease corpus, the new algorithm significantly reduces the number of parameters to be learned while maintaining high accuracy. Besides disease named entity recognition (NER), we also improved the state of the art in chemical NER. Through participation in the BioCreative IV CHEMDNER task, we introduced the tmChem system, a chemical named entity recognizer that combines two independent machine-learning models in an ensemble. We used the challenge task corpus to develop and evaluate tmChem, achieving a micro-averaged F-measure of 0.8739 in the mention-level evaluation, the highest performance among all participating teams. To improve the interoperability of the various biomedical text-mining tools our group has created over the years (e.g., DNorm, tmChem, tmVar), we recently adopted BioC, a newly proposed community-wide scheme for handling heterogeneity and variety in data formats.
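
As a rough illustration of the kind of shared document structure that BioC provides, the sketch below writes one sentence with two entity mentions to BioC-style XML and reads the mentions back. The element names (collection, document, passage, annotation, infon, location) follow our understanding of the BioC DTD; the sample sentence, the source/date/key placeholder values, and the helper names build_collection and read_annotations are hypothetical and are not part of any released tool.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, text, mentions):
    """Wrap one passage and its entity mentions in a BioC-style XML tree.

    mentions is a list of (start, end, entity_type) character spans.
    """
    collection = ET.Element("collection")
    # Placeholder header values, chosen only for this illustration.
    ET.SubElement(collection, "source").text = "example"
    ET.SubElement(collection, "date").text = "2014-01-01"
    ET.SubElement(collection, "key").text = "example.key"

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id

    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text

    for i, (start, end, entity_type) in enumerate(mentions):
        ann = ET.SubElement(passage, "annotation", id=str(i))
        ET.SubElement(ann, "infon", key="type").text = entity_type
        ET.SubElement(ann, "location", offset=str(start), length=str(end - start))
        ET.SubElement(ann, "text").text = text[start:end]
    return collection


def read_annotations(xml_string):
    """Return (mention text, entity type, offset, length) for each annotation."""
    root = ET.fromstring(xml_string)
    results = []
    for ann in root.iter("annotation"):
        entity_type = ann.find("infon[@key='type']").text
        loc = ann.find("location")
        results.append(
            (ann.findtext("text"), entity_type,
             int(loc.get("offset")), int(loc.get("length")))
        )
    return results


# Hypothetical example sentence and spans.
sentence = "Aspirin is used to treat fever."
mentions = [(0, 7, "Chemical"), (25, 30, "Disease")]

xml_string = ET.tostring(build_collection("doc1", sentence, mentions), encoding="unicode")
annotations = read_annotations(xml_string)
print(annotations)
```

Because every tool reads and writes the same structure, a downstream tool only needs a reader like read_annotations rather than a converter for each upstream output format.
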
Specifically, we modified our tools to read and write data in the proposed BioC format. The resulting BioC-wrapped toolkit is named tmBioC. Through empirical studies, we demonstrated that the tools in tmBioC can be integrated more efficiently with each other as well as with external tools: our experimental results show that using BioC reduces the lines of code needed for text-mining tool integration by more than 60%. As mentioned earlier, one promising application area for text-mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we focused on Gene Ontology (GO) annotation, a common task among model organism database (MOD) groups that is often considered one of the bottlenecks in literature curation. There is a growing need for semi- or fully automated GO curation techniques that help database curators rapidly and accurately identify gene function information in full-length articles. Despite multiple past attempts, few studies have proven useful in assisting real-world GO curation. The lack of sentence-level training data and of opportunities for interaction between text-mining developers and GO curators has limited advances in algorithm development and their use in practice. To address this, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two sub-tasks: a) automatically locating text passages that contain GO-relevant information (a text retrieval task) and b) automatically identifying relevant GO terms for the genes in a given article (a concept recognition task). With support from five MODs, we provided teams with nearly 4,000 unique text passages that served as the basis for each GO annotation in our task data.
Such evidence text has long been recognized as critical for text-mining algorithm development but had never been made available because of the high cost of curation. In total, seven teams participated in the challenge task. The team results show an overall improvement in GO term recognition performance compared with similar tasks in the past. Future work should focus on further improving the performance of GO concept recognition and on bringing the practical benefits of text-mining tools into real-world GO annotation.
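
The parameter reduction mentioned above for DNorm can be made concrete with a small sketch. It assumes, as an illustration only, that a mention vector m and a concept-name vector n of dimension d are scored as m^T W n with a d x d weight matrix W, and that the low-rank variant replaces W with U V^T, where U and V are d x k matrices. The exact factorization used in DNorm may differ; all function names, dimensions, and random values below are hypothetical.

```python
import random


def full_param_count(d):
    """Number of weights in a full d x d scoring matrix W."""
    return d * d


def low_rank_param_count(d, k):
    """Number of weights when W is approximated as U V^T with U, V of size d x k."""
    return 2 * d * k


def score_full(m, n, W):
    """Bilinear score m^T W n using the full matrix."""
    return sum(m[i] * W[i][j] * n[j] for i in range(len(m)) for j in range(len(n)))


def score_low_rank(m, n, U, V):
    """Same score computed as (U^T m) . (V^T n), never forming the d x d matrix."""
    k = len(U[0])
    m_proj = [sum(m[i] * U[i][r] for i in range(len(m))) for r in range(k)]
    n_proj = [sum(n[j] * V[j][r] for j in range(len(n))) for r in range(k)]
    return sum(a * b for a, b in zip(m_proj, n_proj))


# Tiny random example (illustrative values only, not trained weights).
random.seed(0)
d, k = 6, 2
U = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(d)]
V = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(d)]
# Explicit W = U V^T, built here only to check that both scores agree.
W = [[sum(U[i][r] * V[j][r] for r in range(k)) for j in range(d)] for i in range(d)]
m = [random.random() for _ in range(d)]
n = [random.random() for _ in range(d)]

full = score_full(m, n, W)
low = score_low_rank(m, n, U, V)
print(abs(full - low) < 1e-9)

# Parameter counts for a hypothetical vocabulary of 10,000 tokens and rank 50.
print(full_param_count(10000), low_rank_param_count(10000, 50))
```

Under these assumptions the number of learned weights grows linearly rather than quadratically with the vocabulary size, which is the sense in which a low-rank approximation improves scalability.
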

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM091813-02
Application #: 8943240
Support Year: 2
Fiscal Year: 2014

Institution

Name: National Library of Medicine

Related projects


NIH 2019 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2018 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2017 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2016 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2015 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2014 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine
NIH 2013 ZIA LM	Named Entity Recognition and Relationship Extraction in Biomedicine Lu, Zhiyong / National Library of Medicine	$821,852

Publications

van Asten, Freekje; Simmons, Michael; Singhal, Ayush et al. (2018) A Deep Phenotype Association Study Reveals Specific Phenotype Associations with Genetic Variants in Age-related Macular Degeneration: Age-Related Eye Disease Study 2 (AREDS2) Report No. 14. Ophthalmology 125:559-568

Allot, Alexis; Peng, Yifan; Wei, Chih-Hsuan et al. (2018) LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 46:W530-W536

Peng, Yifan; Wang, Xiaosong; Lu, Le et al. (2018) NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Jt Summits Transl Sci Proc 2017:188-196

Ching, Travers; Himmelstein, Daniel S; Beaulieu-Jones, Brett K et al. (2018) Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 15:

Lee, Kyubum; Famiglietti, Maria Livia; McMahon, Aoife et al. (2018) Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 14:e1006390

Peng, Yifan; Rios, Anthony; Kavuluru, Ramakanth et al. (2018) Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database (Oxford) 2018:

Kwon, Dongseop; Kim, Sun; Wei, Chih-Hsuan et al. (2018) ezTag: tagging biomedical concepts via interactive learning. Nucleic Acids Res 46:W523-W529

Rios, Anthony; Kavuluru, Ramakanth; Lu, Zhiyong (2018) Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics 34:2973-2981

Mao, Yuqing; Lu, Zhiyong (2017) MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank. J Biomed Semantics 8:15

Liu, Xiaoxia; Yang, Zhihao; Lin, Hongfei et al. (2017) DIGNiFI: Discovering causative genes for orphan diseases using protein-protein interaction networks. BMC Syst Biol 11:23

Showing the most recent 10 out of 51 publications
